perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. Data includes execution time
in cycles. For loads and stores it also includes data address, cache miss events, and data origin.

The sampling has 5 stages:

  1. Choose an operation
  2. Collect data about the operation
  3. Optionally discard the record based on a filter
  4. Write the record to memory
  5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population. For SPE, this population is an IMPLEMENTATION
DEFINED choice of all architectural instructions or all micro-ops. Sampling happens at a
programmable interval. The architecture provides a mechanism for the SPE driver to infer the minimum
interval at which it should sample. The driver uses this minimum interval if no interval is
specified. A pseudo-random perturbation is also added to the sampling interval by default.

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures that only one sampled operation is in flight at a time.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it. If the record is
discarded then the flow stops here for this sample.

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.

Opening the file
----------------

Up until this point, no decoding of the SPE data has been done by either the kernel or Perf.
Decoding happens only when the recorded file is opened with 'perf report' or 'perf script'. When
decoding the data, Perf generates "synthetic samples" as if these were generated at the time of the
recording. These samples are the same as if normal sampling was done by Perf without using SPE,
although they may have more attributes associated with them. For example, a normal sample may have
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

- Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
  hardware. Only one sampled operation is in flight at a time.

- Allows precise attribution data, including the full PC of the instruction and the data virtual
  and physical addresses.

- Allows correlation between an instruction and events, such as TLB and cache misses. (Data source
  indicates which particular cache was hit, but the meaning is implementation defined because
  different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number of
dropped samples that would have made it through the filter. It can still serve as a rough guide.

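As a rough illustration of using the collision count, the fraction of samples lost can be estimated
from the number of collisions and the number of records actually written. This is only a sketch: it
assumes no filter is configured, so that every sample which did not collide was recorded. The
function name and the counts are hypothetical.

```shell
# Hypothetical helper: estimate the percentage of samples lost to collisions.
# Assumption: no filtering is in use, so attempted samples = recorded + collided.
spe_collision_pct() {
    collisions=$1
    recorded=$2
    awk -v c="$collisions" -v r="$recorded" \
        'BEGIN { if (c + r == 0) print "0.00"; else printf "%.2f\n", 100 * c / (c + r) }'
}

# Illustrative counts only, not measured data.
spe_collision_pct 50 950
```

With these made-up counts the helper prints 5.00. When a filter is in use the result should be read
as a rough guide at best, for the reason given above.
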
The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.

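One way to read the two counters together is as a ratio of sample population to retired
instructions. The sketch below assumes the two counts have already been collected by some means; the
function name and the numbers are hypothetical. A ratio close to 1 would suggest the population
tracks retired instructions, while a larger ratio would suggest micro-op conversion, speculative
operations, or both.

```shell
# Hypothetical helper: ratio of 'sample_pop' to 'inst_retired'.
# Arguments: <sample_pop count> <inst_retired count>
spe_pop_ratio() {
    awk -v pop="$1" -v ret="$2" \
        'BEGIN { if (ret == 0) exit 1; printf "%.2f\n", pop / ret }'
}

# Illustrative counts only, not measured data.
spe_pop_ratio 1500000 1000000
```

With these made-up counts the helper prints 1.50, meaning the sample population is about one and a
half times the number of retired instructions.
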
Kernel Requirements
-------------------

The ARM_SPE_PMU config must be set to build as either a module or statically.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled but is not, the SPE driver will fail to initialize with the
console message "profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".

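Before recording, it can help to check whether the kernel has registered an SPE PMU at all. The
sketch below assumes that SPE PMUs show up under /sys/bus/event_source/devices with names beginning
with arm_spe (for example arm_spe_0). The function name is hypothetical, and the optional directory
argument exists only so the check can be tried against a scratch directory.

```shell
# Hypothetical helper: succeed if an arm_spe* PMU entry exists.
# Argument: directory to search (defaults to the sysfs event source directory).
spe_pmu_present() {
    dir=${1:-/sys/bus/event_source/devices}
    for d in "$dir"/arm_spe*; do
        # If nothing matches, the pattern stays literal and this test fails.
        [ -e "$d" ] && return 0
    done
    return 1
}

if spe_pmu_present; then
    echo "SPE PMU found"
else
    echo "no SPE PMU registered"
fi
```

If no PMU is found, the usual causes are the ones listed above: the driver is not built or loaded,
or KPTI is enabled on a CPU that requires it to be off.
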
Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set from the -c option. Because the minimum interval is used by default,
it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and comma separated. For example, '-e
arm_spe/load_filter=1,min_latency=10/'.

  branch_filter=1     - collect branches only (PMSFCR.B)
  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
  load_filter=1       - collect loads only (PMSFCR.LD)
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  store_filter=1      - collect stores only (PMSFCR.ST)
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.

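To make the "between the // and comma separated" rule concrete, the sketch below assembles an event
string from individual parameters. The function name is hypothetical and the helper does no
validation of the parameter names. It only joins its arguments with commas.

```shell
# Hypothetical helper: join key=value parameters into an arm_spe event string.
spe_event() {
    params=""
    for kv in "$@"; do
        # Add a comma separator only when something has already been appended.
        params="${params:+$params,}$kv"
    done
    printf 'arm_spe/%s/\n' "$params"
}

spe_event load_filter=1 min_latency=10
```

This prints arm_spe/load_filter=1,min_latency=10/, which can then be passed to the -e option of
'perf record'. With no arguments it prints arm_spe//, the form used in the synopsis.
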
Only some events can be filtered on; these include:

  bit 1     - instruction retired (i.e. omit speculative instructions)
  bit 3     - L1D refill
  bit 5     - TLB refill
  bit 7     - mispredict
  bit 11    - misaligned access

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench

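The mask values in the commands above follow directly from the bit numbers: bit 1 gives 2 and bit 7
gives 0x80. The sketch below computes a mask from a list of bit numbers; the function name is
hypothetical. When several bits are set, the filter is expected to keep only records that have all
of the selected events, but that behaviour is an assumption here and should be checked against the
architecture documentation.

```shell
# Hypothetical helper: build an event_filter mask from event bit numbers.
spe_event_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))
    done
    printf '0x%x\n' "$mask"
}

spe_event_mask 7      # mispredict only
spe_event_mask 1 7    # instruction retired and mispredict
```

The two calls print 0x80 and 0x82. The result can be used as the value of event_filter, for example
arm_spe/event_filter=0x82/.
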
Viewing the data
~~~~~~~~~~~~~~~~

By default perf report and perf script will assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique. For example perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch-miss
  0 remote-access
  900 memory

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

To get a full list of unique samples that are not sorted into groups, set the itrace option to
generate 'instruction' samples. The period option is also taken into account, so set it to 1
instruction unless you want to further downsample the already sampled SPE data:

  perf report --itrace=i1i

Memory access details are also stored on the samples and this can be viewed with:

  perf report --mem-mode

Common errors
~~~~~~~~~~~~~

- "Cannot find PMU `arm_spe'. Missing kernel support?"

  Module not built or loaded, KPTI not disabled (see above), or running in a VM.

- "Arm SPE CONTEXT packets not found in the traces."

  Root privilege is required to collect context packets. However, these only increase the accuracy
  of assigning PIDs to kernel samples, so for userspace sampling this can be ignored.

- Excessively large perf.data file size

  Increase the sampling interval (see above).


SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]