@@ -305,9 +305,9 @@ spent on average every iteration. The second table correlates the resource
cycles to the machine instruction in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
-per iteration. Note that on Jaguar, vector floating-point multiply can only be
-issued to pipeline JFPU1, while horizontal floating-point additions can only be
-issued to pipeline JFPU0.
+per iteration. Note that on AMD Jaguar, vector floating-point multiply can
+only be issued to pipeline JFPU1, while horizontal floating-point additions can
+only be issued to pipeline JFPU0.
The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
@@ -427,3 +427,125 @@ instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.
+
+Extra Statistics to Further Diagnose Performance Issues
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The ``-all-stats`` command line option enables extra statistics and performance
+counters for the dispatch logic, the reorder buffer, the retire control unit,
+and the register file.
+
+Below is an example of ``-all-stats`` output generated by MCA for the
+dot-product example discussed in the previous sections.
+
+.. code-block:: none
+
+  Dynamic Dispatch Stall Cycles:
+  RAT     - Register unavailable:                      0
+  RCU     - Retire tokens unavailable:                 0
+  SCHEDQ  - Scheduler full:                            272
+  LQ      - Load queue full:                           0
+  SQ      - Store queue full:                          0
+  GROUP   - Static restrictions on the dispatch group: 0
+
+
+  Dispatch Logic - number of cycles where we saw N instructions dispatched:
+  [# dispatched], [# cycles]
+   0,              24  (3.9%)
+   1,              272  (44.6%)
+   2,              314  (51.5%)
+
+
+  Schedulers - number of cycles where we saw N instructions issued:
+  [# issued], [# cycles]
+   0,          7  (1.1%)
+   1,          306  (50.2%)
+   2,          297  (48.7%)
+
+
+  Scheduler's queue usage:
+   JALU01,  0/20
+   JFPU01,  18/18
+   JLSAGU,  0/12
+
+
+  Retire Control Unit - number of cycles where we saw N instructions retired:
+  [# retired], [# cycles]
+   0,           109  (17.9%)
+   1,           102  (16.7%)
+   2,           399  (65.4%)
+
+
+  Register File statistics:
+  Total number of mappings created:    900
+  Max number of mappings used:         35
+
+  *  Register File #1 -- JFpuPRF:
+     Number of physical registers:     72
+     Total number of mappings created: 900
+     Max number of mappings used:      35
+
+  *  Register File #2 -- JIntegerPRF:
+     Number of physical registers:     64
+     Total number of mappings created: 0
+     Max number of mappings used:      0
+
+The *Dynamic Dispatch Stall Cycles* table shows that the counter for SCHEDQ
+reports 272 cycles. This counter is incremented every time the dispatch logic
+is unable to dispatch a full group of two instructions because the scheduler's
+queue is full.
+
+Looking at the *Dispatch Logic* table, we see that the pipeline was only able
+to dispatch two instructions 51.5% of the time. The dispatch group was limited
+to one instruction 44.6% of the cycles, which corresponds to 272 cycles. The
+dispatch statistics are displayed by using either the ``-all-stats`` or the
+``-dispatch-stats`` command option.
+
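The percentages in this histogram follow directly from the cycle counts. As a minimal sketch (the ``summarize`` helper and its names are hypothetical and not part of llvm-mca), the following Python reproduces the *Dispatch Logic* figures from the counts in the report above:

```python
# Hypothetical helper, not part of llvm-mca: summarizes one of the
# "number of cycles where we saw N instructions ..." histograms.
def summarize(histogram):
    """histogram maps N (instructions per cycle) to a number of cycles."""
    total_cycles = sum(histogram.values())
    total_instructions = sum(n * cycles for n, cycles in histogram.items())
    percentages = {
        n: round(100.0 * cycles / total_cycles, 1)
        for n, cycles in histogram.items()
    }
    return total_cycles, total_instructions, percentages


# Counts copied from the Dispatch Logic table above.
dispatch = {0: 24, 1: 272, 2: 314}
cycles, instructions, pct = summarize(dispatch)
print(cycles, instructions, pct)
```

The counts sum to the 610 simulated cycles and account for all 900 dispatched instructions. The same arithmetic applies to the other per-cycle histograms in the report.
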
+The next table, *Schedulers*, presents a histogram of the number of
+instructions issued per cycle. In this case, of the 610 simulated cycles, a
+single instruction was issued in 306 cycles (50.2%), and there were 7 cycles
+in which no instructions were issued.
+
+The *Scheduler's queue usage* table shows the maximum number of buffer entries
+(i.e., scheduler queue entries) used at runtime. Resource JFPU01 reached its
+maximum (18 of 18 queue entries). Note that AMD Jaguar implements three
+schedulers:
+
+* JALU01 - A scheduler for ALU instructions.
+* JFPU01 - A scheduler for floating point operations.
+* JLSAGU - A scheduler for address generation.
+
+The dot-product is a kernel of three floating point instructions (a vector
+multiply followed by two horizontal adds). That explains why only the floating
+point scheduler appears to be used.
+
+A full scheduler queue is caused either by data dependency chains or by
+sub-optimal usage of hardware resources. Sometimes, resource pressure can be
+mitigated by rewriting the kernel using different instructions that consume
+different scheduler resources. Schedulers with a small queue are less resilient
+to bottlenecks caused by the presence of long data dependencies. The scheduler
+statistics are displayed by using the command option ``-all-stats`` or
+``-scheduler-stats``.
+
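To spot a saturated queue without reading the table by eye, the rows can be post-processed. The sketch below is illustrative only: ``saturated_queues`` is a hypothetical helper, not part of llvm-mca, and it assumes the ``NAME, used/capacity`` row layout shown in this example, which may differ in other versions of the tool.

```python
# Hypothetical post-processing helper, not part of llvm-mca.
def saturated_queues(lines):
    """Return the names of schedulers whose queue reached its capacity."""
    full = []
    for line in lines:
        # Each row is assumed to look like "JFPU01,  18/18".
        name, usage = (field.strip() for field in line.split(","))
        used, capacity = (int(value) for value in usage.split("/"))
        if used == capacity:
            full.append(name)
    return full


# Rows copied from the "Scheduler's queue usage" table above.
rows = ["JALU01,  0/20", "JFPU01,  18/18", "JLSAGU,  0/12"]
print(saturated_queues(rows))
```

For the dot-product example, only the floating point scheduler (JFPU01) is reported as full.
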
+The next table, *Retire Control Unit*, presents a histogram of the number of
+instructions retired per cycle. In this case, of the 610 simulated cycles, two
+instructions were retired during the same cycle 399 times (65.4%), and there
+were 109 cycles in which no instructions were retired. The retire statistics
+are displayed by using the command option ``-all-stats`` or ``-retire-stats``.
+
+The last table presented is *Register File statistics*. Each physical register
+file (PRF) used by the pipeline is presented in this table. In the case of AMD
+Jaguar, there are two register files, one for floating-point registers
+(JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that of
+the 900 instructions processed, there were 900 mappings created. Since this
+dot-product example utilized only floating point registers, the JFpuPRF was
+responsible for creating the 900 mappings. However, we see that the pipeline
+only used a maximum of 35 of 72 available register slots at any given time. We
+can conclude that the floating point PRF was the only register file used for
+the example, and that it was never resource constrained. The register file
+statistics are displayed by using the command option ``-all-stats`` or
+``-register-file-stats``.
+
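The "never resource constrained" observation can be checked with simple arithmetic on the reported values. This is a minimal sketch; ``peak_utilization`` is a hypothetical helper, not part of llvm-mca.

```python
# Hypothetical helper, not part of llvm-mca.
def peak_utilization(max_mappings_used, physical_registers):
    """Fraction of a register file's physical registers in use at the peak."""
    return max_mappings_used / physical_registers


# Values copied from the Register File statistics above.
fpu = peak_utilization(35, 72)      # JFpuPRF
integer = peak_utilization(0, 64)   # JIntegerPRF
print(f"JFpuPRF: {fpu:.1%}, JIntegerPRF: {integer:.1%}")
```

At its peak, the floating point register file used just under half of its 72 physical registers, and the integer register file was not used at all.
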
+In this example, we can conclude that the IPC is mostly limited by data
+dependencies, and not by resource pressure.