@@ -207,23 +207,23 @@ EXIT STATUS
 :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
 to standard error, and the tool returns 1.

-HOW MCA WORKS
--------------
+HOW LLVM-MCA WORKS
+------------------

-MCA takes assembly code as input. The assembly code is parsed into a sequence
-of MCInst with the help of the existing LLVM target assembly parsers. The
-parsed sequence of MCInst is then analyzed by a ``Pipeline`` module to generate
-a performance report.
+:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
+into a sequence of MCInst with the help of the existing LLVM target assembly
+parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
+to generate a performance report.

 The Pipeline module simulates the execution of the machine code sequence in a
 loop of iterations (default is 100). During this process, the pipeline collects
 a number of execution related statistics. At the end of this process, the
 pipeline generates and prints a report from the collected statistics.

-Here is an example of a performance report generated by MCA for a dot-product
-of two packed float vectors of four elements. The analysis is conducted for
-target x86, cpu btver2. The following result can be produced via the following
-command using the example located at
+Here is an example of a performance report generated by the tool for a
+dot-product of two packed float vectors of four elements. The analysis is
+conducted for target x86, cpu btver2. The following result can be produced via
+the following command using the example located at
 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:

 .. code-block:: bash
@@ -316,7 +316,7 @@ pressure should be uniformly distributed between multiple resources.

 Timeline View
 ^^^^^^^^^^^^^
-MCA's timeline view produces a detailed report of each instruction's state
+The timeline view produces a detailed report of each instruction's state
 transitions through an instruction pipeline. This view is enabled by the
 command line option ``-timeline``. As instructions transition through the
 various stages of the pipeline, their states are depicted in the view report.
@@ -331,7 +331,7 @@ These states are represented by the following characters:

 Below is the timeline view for a subset of the dot-product example located in
 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
-MCA using the following command:
+:program:`llvm-mca` using the following command:

 .. code-block:: bash

@@ -366,7 +366,7 @@ MCA using the following command:
 2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4

 The timeline view is interesting because it shows instruction state changes
-during execution. It also gives an idea of how MCA processes instructions
+during execution. It also gives an idea of how the tool processes instructions
 executed on the target, and how their timing information might be calculated.

 The timeline view is structured in two tables. The first table shows
@@ -415,8 +415,8 @@ and therefore consuming temporary registers).

 Table *Average Wait times* helps diagnose performance issues that are caused by
 the presence of long latency instructions and potentially long data dependencies
-which may limit the ILP. Note that MCA, by default, assumes at least 1cy
-between the dispatch event and the issue event.
+which may limit the ILP. Note that :program:`llvm-mca`, by default, assumes at
+least 1cy between the dispatch event and the issue event.

 When the performance is limited by data dependencies and/or long latency
 instructions, the number of cycles spent while in the *ready* state is expected
@@ -602,9 +602,9 @@ entries in the reorder buffer defaults to the `MicroOpBufferSize` provided by
 the target scheduling model.

 Instructions that are dispatched to the schedulers consume scheduler buffer
-entries. MCA queries the scheduling model to determine the set of
-buffered resources consumed by an instruction. Buffered resources are treated
-like scheduler resources.
+entries. :program:`llvm-mca` queries the scheduling model to determine the set
+of buffered resources consumed by an instruction. Buffered resources are
+treated like scheduler resources.

 Instruction Issue
 """""""""""""""""
@@ -612,22 +612,21 @@ Each processor scheduler implements a buffer of instructions. An instruction
 has to wait in the scheduler's buffer until input register operands become
 available. Only at that point, does the instruction becomes eligible for
 execution and may be issued (potentially out-of-order) for execution.
-Instruction latencies are computed by MCA with the help of the scheduling
-model.
-
-MCA's scheduler is designed to simulate multiple processor schedulers. The
-scheduler is responsible for tracking data dependencies, and dynamically
-selecting which processor resources are consumed by instructions.
-
-The scheduler delegates the management of processor resource units and resource
-groups to a resource manager. The resource manager is responsible for
-selecting resource units that are consumed by instructions. For example, if an
-instruction consumes 1cy of a resource group, the resource manager selects one
-of the available units from the group; by default, the resource manager uses a
+Instruction latencies are computed by :program:`llvm-mca` with the help of the
+scheduling model.
+
+:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
+schedulers. The scheduler is responsible for tracking data dependencies, and
+dynamically selecting which processor resources are consumed by instructions.
+It delegates the management of processor resource units and resource groups to a
+resource manager. The resource manager is responsible for selecting resource
+units that are consumed by instructions. For example, if an instruction
+consumes 1cy of a resource group, the resource manager selects one of the
+available units from the group; by default, the resource manager uses a
 round-robin selector to guarantee that resource usage is uniformly distributed
 between all units of a group.

-MCA's scheduler implements three instruction queues:
+:program:`llvm-mca`'s scheduler implements three instruction queues:

 * WaitQueue: a queue of instructions whose operands are not ready.
 * ReadyQueue: a queue of instructions ready to execute.
@@ -638,8 +637,8 @@ scheduler are either placed into the WaitQueue or into the ReadyQueue.

 Every cycle, the scheduler checks if instructions can be moved from the
 WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
-issued. The algorithm prioritizes older instructions over younger
-instructions.
+issued to the underlying pipelines. The algorithm prioritizes older instructions
+over younger instructions.

 Write-Back and Retire Stage
 """""""""""""""""""""""""""
@@ -656,15 +655,13 @@ for the instruction during the register renaming stage.

 Load/Store Unit and Memory Consistency Model
 """"""""""""""""""""""""""""""""""""""""""""
-To simulate an out-of-order execution of memory operations, MCA utilizes a
-simulated load/store unit (LSUnit) to simulate the speculative execution of
-loads and stores.
+To simulate an out-of-order execution of memory operations, :program:`llvm-mca`
+utilizes a simulated load/store unit (LSUnit) to simulate the speculative
+execution of loads and stores.

-Each load (or store) consumes an entry in the load (or store) queue. The
-number of slots in the load/store queues is unknown by MCA, since there is no
-mention of it in the scheduling model. In practice, users can specify flags
-``-lqueue`` and ``-squeue`` to limit the number of entries in the load and
-store queues respectively. The queues are unbounded by default.
+Each load (or store) consumes an entry in the load (or store) queue. Users can
+specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
+load and store queues respectively. The queues are unbounded by default.

 The LSUnit implements a relaxed consistency model for memory loads and stores.
 The rules are:
@@ -701,15 +698,15 @@ cache. It only knows if an instruction "MayLoad" and/or "MayStore." For
 loads, the scheduling model provides an "optimistic" load-to-use latency (which
 usually matches the load-to-use latency for when there is a hit in the L1D).

-MCA does not know about serializing operations or memory-barrier like
-instructions. The LSUnit conservatively assumes that an instruction which has
-both "MayLoad" and unmodeled side effects behaves like a "soft" load-barrier.
-That means, it serializes loads without forcing a flush of the load queue.
-Similarly, instructions that "MayStore" and have unmodeled side effects are
-treated like store barriers. A full memory barrier is a "MayLoad" and
-"MayStore" instruction with unmodeled side effects. This is inaccurate, but it
-is the best that we can do at the moment with the current information available
-in LLVM.
+:program:`llvm-mca` does not know about serializing operations or memory-barrier
+like instructions. The LSUnit conservatively assumes that an instruction which
+has both "MayLoad" and unmodeled side effects behaves like a "soft"
+load-barrier. That means, it serializes loads without forcing a flush of the
+load queue. Similarly, instructions that "MayStore" and have unmodeled side
+effects are treated like store barriers. A full memory barrier is a "MayLoad"
+and "MayStore" instruction with unmodeled side effects. This is inaccurate, but
+it is the best that we can do at the moment with the current information
+available in LLVM.

 A load/store barrier consumes one entry of the load/store queue. A load/store
 barrier enforces ordering of loads/stores. A younger load cannot pass a load