=====================================
Performance Tips for Frontend Authors
=====================================

.. contents::
   :local:
   :depth: 2

Abstract
========

The intended audience of this document is developers of language frontends
targeting LLVM IR. This document is home to a collection of tips on how to
generate IR that optimizes well.
IR Best Practices
=================

As with any optimizer, LLVM has its strengths and weaknesses. In some cases,
surprisingly small changes in the source IR can have a large effect on the
generated code.

Beyond the specific items on the list below, it's worth noting that the most
mature frontend for LLVM is Clang. As a result, the further your IR gets from
what Clang might emit, the less likely it is to be effectively optimized. It
can often be useful to write a quick C program with the semantics you're trying
to model and see what decisions Clang's IRGen makes about what IR to emit.
Studying Clang's CodeGen directory can also be a good source of ideas. Note
that Clang and LLVM are explicitly version locked, so you'll need to make sure
you're using a Clang built from the same svn revision or release as the LLVM
library you're using. As always, it's *strongly* recommended that you track
tip of tree development, particularly during bring up of a new project.
The Basics
^^^^^^^^^^^

#. Make sure that your Modules contain both a data layout specification and
   a target triple. Without these pieces, none of the target-specific
   optimizations will be enabled. This can have a major effect on the
   generated code quality.

#. For each function or global emitted, use the most private linkage type
   possible (private, internal or linkonce_odr preferably). Doing so will
   make LLVM's inter-procedural optimizations much more effective.

#. Avoid high in-degree basic blocks (e.g. basic blocks with dozens or hundreds
   of predecessors). Among other issues, the register allocator is known to
   perform badly when confronted with such structures. The only exception to
   this guidance is that a unified return block with high in-degree is fine.
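For example, a module targeting x86-64 Linux might begin with something like
the following sketch (the exact data layout string is target-specific and is
best obtained from the target machine rather than written by hand; the helper
function is illustrative):

.. code-block:: llvm

  target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
  target triple = "x86_64-unknown-linux-gnu"

  ; A helper that is not visible outside this module: internal linkage
  ; lets LLVM's inter-procedural optimizations inline or delete it freely.
  define internal i32 @helper(i32 %x) {
    %r = add i32 %x, 1
    ret i32 %r
  }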
Use of allocas
^^^^^^^^^^^^^^

An alloca instruction can be used to represent a function scoped stack slot,
but can also represent dynamic frame expansion. When representing function
scoped variables or locations, placing alloca instructions at the beginning of
the entry block should be preferred. In particular, place them before any
call instructions. Call instructions might get inlined and replaced with
multiple basic blocks, and the end result would be that a following alloca
instruction is no longer in the entry basic block.

The SROA (Scalar Replacement Of Aggregates) and Mem2Reg passes only attempt
to eliminate alloca instructions that are in the entry basic block. Since
SSA is the canonical form expected by much of the optimizer, allocas that
cannot be eliminated by Mem2Reg or SROA leave the optimizer less effective
than it could be.
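A minimal sketch of the preferred placement (the function and variable names
are illustrative):

.. code-block:: llvm

  define i32 @f(i32 %n) {
  entry:
    ; Allocas first, before any calls, so that inlining @g cannot push
    ; them out of the entry block and SROA/Mem2Reg can eliminate them.
    %local = alloca i32
    store i32 %n, ptr %local
    call void @g()
    %v = load i32, ptr %local
    ret i32 %v
  }

  declare void @g()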
Avoid loads and stores of large aggregate type
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

LLVM currently does not optimize loads and stores of large :ref:`aggregate
types <t_aggregate>` (i.e. structs and arrays) well. As an alternative,
consider loading individual fields from memory.

Aggregates that are smaller than the largest (performant) load or store
instruction supported by the targeted hardware are well supported. These can
be an effective way to represent collections of small packed fields.
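An illustrative sketch of field-wise access: rather than loading an entire
two-field struct in one instruction, address and load each field separately.

.. code-block:: llvm

  %struct.pair = type { i64, i64 }

  define i64 @sum(ptr %p) {
    ; Load the two fields individually instead of loading the whole
    ; %struct.pair aggregate in a single instruction.
    %a.addr = getelementptr inbounds %struct.pair, ptr %p, i32 0, i32 0
    %b.addr = getelementptr inbounds %struct.pair, ptr %p, i32 0, i32 1
    %a = load i64, ptr %a.addr
    %b = load i64, ptr %b.addr
    %s = add i64 %a, %b
    ret i64 %s
  }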
Prefer zext over sext when legal
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On some architectures (X86_64 is one), sign extension can involve an extra
instruction whereas zero extension can be folded into a load. LLVM will try to
replace a sext with a zext when it can be proven safe, but if you have
information in your source language about the range of an integer value, it can
be profitable to use a zext rather than a sext.

Alternatively, you can :ref:`specify the range of the value using metadata
<range-metadata>` and LLVM can do the sext to zext conversion for you.
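A sketch of the metadata approach (the range bounds are illustrative):

.. code-block:: llvm

  define i64 @widen(ptr %p) {
    ; The !range metadata asserts the loaded value is in [0, 100), so
    ; LLVM is free to treat the sign extension below as a zero extension.
    %v = load i32, ptr %p, !range !0
    %w = sext i32 %v to i64
    ret i64 %w
  }

  !0 = !{i32 0, i32 100}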
Zext GEP indices to machine register width
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Internally, LLVM often promotes the width of GEP indices to machine register
width. When it does so, it will default to using sign extension (sext)
operations for safety. If your source language provides information about
the range of the index, you may wish to manually extend indices to machine
register width using a zext instruction.
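For instance, assuming a 64-bit target and an index the frontend knows to be
non-negative, the widening can be emitted explicitly:

.. code-block:: llvm

  define i32 @at(ptr %base, i32 %i) {
    ; The frontend knows %i is non-negative, so widening it with zext
    ; avoids the conservative sext LLVM would otherwise insert.
    %idx = zext i32 %i to i64
    %addr = getelementptr inbounds i32, ptr %base, i64 %idx
    %v = load i32, ptr %addr
    ret i32 %v
  }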
When to specify alignment
^^^^^^^^^^^^^^^^^^^^^^^^^^

LLVM will always generate correct code if you don’t specify alignment, but may
generate inefficient code. For example, if you are targeting MIPS (or older
ARM ISAs) then the hardware does not handle unaligned loads and stores, and
so you will enter a trap-and-emulate path if you do a load or store with
lower-than-natural alignment. To avoid this, LLVM will emit a slower
sequence of loads, shifts and masks (or load-right + load-left on MIPS) for
all cases where the load / store does not have a sufficiently high alignment
in the IR.

The alignment is used to guarantee the alignment on allocas and globals,
though in most cases this is unnecessary (most targets have a sufficiently
high default alignment that they’ll be fine). It is also used to provide a
contract to the back end saying ‘either this load/store has this alignment, or
it is undefined behavior’. This means that the back end is free to emit
instructions that rely on that alignment (and mid-level optimizers are free to
perform transforms that require that alignment). For x86, it doesn’t make
much difference, as almost all instructions are alignment-independent. For
MIPS, it can make a big difference.

Note that if your loads and stores are atomic, the backend will be unable to
lower an under-aligned access into a sequence of natively aligned accesses.
As a result, alignment is mandatory for atomic loads and stores.
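The contrast can be sketched as follows (both functions are illustrative):

.. code-block:: llvm

  define i32 @aligned_read(ptr %p) {
    ; Contract: %p is at least 4-byte aligned, or behavior is undefined.
    ; The backend may emit instructions that rely on this.
    %v = load i32, ptr %p, align 4
    ret i32 %v
  }

  define i32 @unaligned_read(ptr %p) {
    ; %p may be arbitrarily aligned; on targets without hardware support
    ; for unaligned access this lowers to a slower but safe sequence.
    %v = load i32, ptr %p, align 1
    ret i32 %v
  }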
Other Things to Consider
^^^^^^^^^^^^^^^^^^^^^^^^

#. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing
   analysis), prefer GEPs.

#. Prefer globals over inttoptr of a constant address - this gives you
   dereferenceability information. In MCJIT, use getSymbolAddress to
   provide the actual address.

#. Be wary of ordered and atomic memory operations. They are hard to optimize
   and may not be well optimized by the current optimizer. Depending on your
   source language, you may consider using fences instead.

#. If calling a function which is known to throw an exception (unwind), use
   an invoke with a normal destination which contains an unreachable
   instruction. This form conveys to the optimizer that the call returns
   abnormally. For an invoke which neither returns normally nor requires unwind
   code in the current function, you can use a noreturn call instruction if
   desired. This is generally not required because the optimizer will convert
   an invoke with an unreachable unwind destination to a call instruction.
#. Use profile metadata to indicate statically known cold paths, even if
   dynamic profiling information is not available. This can make a large
   difference in code placement and thus the performance of tight loops.

#. When generating code for loops, try to avoid terminating the header block of
   the loop earlier than necessary. If the terminator of the loop header
   block is a loop exiting conditional branch, the effectiveness of LICM will
   be limited for loads not in the header. (This is due to the fact that LLVM
   may not know such a load is safe to speculatively execute and thus can't
   lift an otherwise loop invariant load unless it can prove the exiting
   condition is not taken.) It can be profitable, in some cases, to emit such
   instructions into the header even if they are not used along a rarely
   executed path that exits the loop. This guidance specifically does not
   apply if the condition which terminates the loop header is itself invariant,
   or can be easily discharged by inspecting the loop index variables.

#. In hot loops, consider duplicating instructions from small basic blocks
   which end in highly predictable terminators into their successor blocks.
   If a hot successor block contains instructions which can be vectorized
   with the duplicated ones, this can provide a noticeable throughput
   improvement. Note that this is not always profitable and does involve a
   potentially large increase in code size.
#. When checking a value against a constant, emit the check using a consistent
   comparison type. The GVN pass *will* optimize redundant equalities even if
   the type of comparison is inverted, but GVN only runs late in the pipeline.
   As a result, you may miss the opportunity to run other important
   optimizations. Improvements to EarlyCSE to remove this issue are tracked in
   Bug 23333.

#. Avoid using arithmetic intrinsics unless you are *required* by your source
   language specification to emit a particular code sequence. The optimizer
   is quite good at reasoning about general control flow and arithmetic, but it
   is not anywhere near as strong at reasoning about the various intrinsics. If
   profitable for code generation purposes, the optimizer will likely form the
   intrinsics itself late in the optimization pipeline. It is *very* rarely
   profitable to emit these directly in the language frontend. This item
   explicitly includes the use of the :ref:`overflow intrinsics <int_overflow>`.

#. Avoid using the :ref:`assume intrinsic <int_assume>` until you've
   established that a) there's no other way to express the given fact and b)
   that fact is critical for optimization purposes. Assumes are a great
   prototyping mechanism, but they can have negative effects on both compile
   time and optimization effectiveness. The former is fixable with enough
   effort, but the latter is fairly fundamental to their designed purpose.
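As one concrete illustration of the items above, the statically-known cold
path hint can be expressed with branch weights; the weight values below are
purely illustrative:

.. code-block:: llvm

  define void @f(i1 %is_error) {
  entry:
    ; The first weight applies to the true (error) edge, the second to
    ; the false edge, marking the error path as rarely taken.
    br i1 %is_error, label %cold, label %hot, !prof !0
  cold:
    call void @handle_error()
    ret void
  hot:
    ret void
  }

  declare void @handle_error()

  !0 = !{!"branch_weights", i32 1, i32 2000}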
Describing Language Specific Properties
=======================================

When translating a source language to LLVM, finding ways to express concepts
and guarantees available in your source language which are not natively
provided by LLVM IR will greatly improve LLVM's ability to optimize your code.
As an example, C/C++'s ability to mark every add as "no signed wrap (nsw)" goes
a long way to assisting the optimizer in reasoning about loop induction
variables and thus generating more optimal code for loops.

The LLVM LangRef includes a number of mechanisms for annotating the IR with
additional semantic information. It is *strongly* recommended that you become
highly familiar with this document. The list below is intended to highlight a
couple of items of particular interest, but is by no means exhaustive.
Restricted Operation Semantics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Add nsw/nuw flags as appropriate. Reasoning about overflow is
   generally hard for an optimizer so providing these facts from the frontend
   can be very impactful.

#. Use fast-math flags on floating point operations if legal. If you don't
   need strict IEEE floating point semantics, there are a number of additional
   optimizations that can be performed. This can be highly impactful for
   floating point intensive computations.
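Both annotations are simple flags on the operation itself (the functions
below are illustrative):

.. code-block:: llvm

  define i32 @inc(i32 %x) {
    ; nsw: the frontend guarantees this add never wraps in a signed sense.
    %r = add nsw i32 %x, 1
    ret i32 %r
  }

  define float @scale(float %a, float %b) {
    ; fast: strict IEEE semantics are not required for this multiply.
    %r = fmul fast float %a, %b
    ret float %r
  }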
Describing Aliasing Properties
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Add noalias/align/dereferenceable/nonnull to function arguments and return
   values as appropriate.

#. Use pointer aliasing metadata, especially tbaa metadata, to communicate
   otherwise-non-deducible pointer aliasing facts.

#. Use inbounds on geps. This can help to disambiguate some aliasing queries.
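The argument attributes from the first item can be combined on a signature;
this sketch assumes the frontend can actually guarantee each property:

.. code-block:: llvm

  ; noalias: %dst and %src never overlap; nonnull: neither is null;
  ; dereferenceable(4): both always point at 4 loadable bytes.
  define void @copy(ptr noalias nonnull dereferenceable(4) %dst,
                    ptr noalias nonnull dereferenceable(4) %src) {
    %v = load i32, ptr %src
    store i32 %v, ptr %dst
    ret void
  }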
Modeling Memory Effects
^^^^^^^^^^^^^^^^^^^^^^^^

#. Mark functions as readnone/readonly/argmemonly or noreturn/nounwind when
   known. The optimizer will try to infer these flags, but may not always be
   able to. Manual annotations are particularly important for external
   functions that the optimizer can not analyze.

#. Use the lifetime.start/lifetime.end and invariant.start/invariant.end
   intrinsics where possible. Common profitable uses are for stack like data
   structures (thus allowing dead store elimination) and for describing
   lifetimes of allocas (thus allowing smaller stack sizes).

#. Mark invariant locations using !invariant.load and TBAA's constant flags.
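A sketch of the lifetime intrinsics on an alloca; note that the exact
intrinsic names and signatures have varied across LLVM versions, so check the
LangRef for the release you target:

.. code-block:: llvm

  define void @f() {
  entry:
    %buf = alloca [256 x i8]
    ; %buf is dead outside the start/end markers, so its stack slot can
    ; be shared with other allocas whose lifetimes do not overlap.
    call void @llvm.lifetime.start.p0(i64 256, ptr %buf)
    call void @use(ptr %buf)
    call void @llvm.lifetime.end.p0(i64 256, ptr %buf)
    ret void
  }

  declare void @use(ptr)
  declare void @llvm.lifetime.start.p0(i64, ptr)
  declare void @llvm.lifetime.end.p0(i64, ptr)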
Pass Ordering
^^^^^^^^^^^^^

One of the most common mistakes made by new language frontend projects is to
use the existing -O2 or -O3 pass pipelines as is. These pass pipelines make a
good starting point for an optimizing compiler for any language, but they have
been carefully tuned for C and C++, not your target language. You will almost
certainly need to use a custom pass order to achieve optimal performance. A
couple of specific suggestions:

#. For languages with numerous rarely executed guard conditions (e.g. null
   checks, type checks, range checks) consider adding an extra execution or
   two of LoopUnswitch and LICM to your pass order. The standard pass order,
   which is tuned for C and C++ applications, may not be sufficient to remove
   all dischargeable checks from loops.

#. If your language uses range checks, consider using the IRCE pass. It is
   not currently part of the standard pass order.

#. A useful sanity check is to run your optimized IR back through the
   -O2 pipeline again. If you see noticeable improvement in the resulting IR,
   you likely need to adjust your pass order.
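The sanity check from the last item can be sketched with ``opt`` (exact flag
spelling varies by LLVM version; the file names are illustrative):

.. code-block:: console

  $ opt -O2 input.ll -S -o once.ll
  $ opt -O2 once.ll -S -o twice.ll
  $ diff once.ll twice.ll

A large diff between the two outputs suggests the standard pipeline is leaving
work on the table for your IR and your pass order needs tuning.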
I Still Can't Find What I'm Looking For
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you didn't find what you were looking for above, consider proposing a piece
of metadata which provides the optimization hint you need. Such extensions are
relatively common and are generally well received by the community. You will
need to ensure that your proposal is sufficiently general so that it benefits
others if you wish to contribute it upstream.

You should also consider describing the problem you're facing on `llvm-dev
<http://lists.llvm.org/mailman/listinfo/llvm-dev>`_ and asking for advice.
It's entirely possible someone has encountered your problem before and can
give good advice. If there are multiple interested parties, that also
increases the chances that a metadata extension would be well received by the
community as a whole.
Adding to this document
=======================

If you run across a case that you feel deserves to be covered here, please send
a patch to `llvm-commits
<http://lists.llvm.org/mailman/listinfo/llvm-commits>`_ for review.

If you have questions on these items, please direct them to `llvm-dev
<http://lists.llvm.org/mailman/listinfo/llvm-dev>`_. The more relevant
context you are able to give to your question, the more likely it is to be
answered.