# Collect ETM data for AutoFDO

[TOC]

## Introduction

ETM (Embedded Trace Macrocell) is a hardware feature available on arm64 devices. It collects the
instruction stream running on each CPU. ARM uses ETM as an alternative to LBR (last branch record)
on x86. Simpleperf supports collecting ETM data and converting it to input files for AutoFDO, which
can then be used for PGO (profile-guided optimization) during compilation.

On ARMv8, ETM is considered an external debug interface (unless the ARMv8.4 Self-hosted Trace
extension is implemented). So it needs to be enabled explicitly in the bootloader, and isn't
available on user devices. For Pixel devices, it's available on EVT and DVT devices on Pixel 4,
Pixel 4a (5G) and Pixel 5. To test if it's available on other devices, you can follow the commands
in this doc and see if you can record any ETM data.

## Examples

Below are examples of collecting ETM data for AutoFDO. The process has two steps: first recording
ETM data, then converting ETM data to AutoFDO input files.

Record ETM data:

```sh
# preparation: we need to be root to record ETM data
$ adb root
$ adb shell
redfin:/ # cd data/local/tmp
redfin:/data/local/tmp #

# Do a system-wide collection. It writes output to perf.data.
# To collect ETM data only for the kernel, use `-e cs-etm:k`.
# To collect ETM data only for userspace, use `-e cs-etm:u`.
redfin:/data/local/tmp # simpleperf record -e cs-etm --duration 3 -a

# To reduce file size and the time needed to convert to AutoFDO input files, we recommend
# converting ETM data into an intermediate branch-list format.
redfin:/data/local/tmp # simpleperf inject --output branch-list -o branch_list.data
```

Converting ETM data to AutoFDO input files requires reading the traced binaries.
Userspace libraries are present on the device, so their data can be converted on device. Kernel
data needs to be converted on the host, with vmlinux and kernel modules available.

Convert ETM data for userspace libraries:

```sh
# Injecting ETM data on device. It writes output to perf_inject.data.
# perf_inject.data is a text file, containing branch counts for each library.
redfin:/data/local/tmp # simpleperf inject -i branch_list.data
```

Convert ETM data for kernel:

```sh
# pull ETM data to host.
host $ adb pull /data/local/tmp/branch_list.data
# download vmlinux and kernel modules to <binary_dir>
# host simpleperf is in <aosp-top>/system/extras/simpleperf/scripts/bin/linux/x86_64/simpleperf,
# or you can build simpleperf by `mmma system/extras/simpleperf`.
host $ simpleperf inject --symdir <binary_dir> -i branch_list.data
```

The generated perf_inject.data may contain branch info for multiple binaries, but AutoFDO accepts
only one binary at a time, so perf_inject.data needs to be split.
The format of perf_inject.data is shown below:

```perf_inject.data format

executed range with count info for binary1
branch with count info for binary1
// name for binary1

executed range with count info for binary2
branch with count info for binary2
// name for binary2

...
```

Split perf_inject.data so that each resulting file contains info for only one binary.
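
One way to do the split is with awk's paragraph mode. The following is a minimal sketch, assuming
that the sections in perf_inject.data are separated by blank lines, as in the format shown above.
The function name `split_perf_inject` and the output file naming are hypothetical and not part of
simpleperf.

```sh
# Split a perf_inject.data file into one file per binary.
# Assumption: sections are separated by blank lines, and the last line of each
# section is the "// <binary name>" comment, as in the format described above.
# Usage: split_perf_inject <input_file> <output_prefix>
split_perf_inject() {
  # RS='' puts awk in paragraph mode: each blank-line-separated block is one record.
  awk -v RS='' -v prefix="$2" '
    {
      out = sprintf("%s_%02d.data", prefix, NR)
      print $0 > out
      close(out)
      # Report which binary each output file belongs to.
      n = split($0, lines, "\n")
      printf "%s\t%s\n", out, lines[n]
    }
  ' "$1"
}
```

Running `split_perf_inject perf_inject.data perf_inject_part` would write
`perf_inject_part_01.data`, `perf_inject_part_02.data`, and so on, and print the binary name found
at the end of each section so the files can be renamed or passed to the next step.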

Then we can use [AutoFDO](https://github.com/google/autofdo) to create a profile. Follow README.md
in AutoFDO to build `create_llvm_prof`, then use `create_llvm_prof` to create profiles for clang.

```sh
# perf_inject_binary1.data is split from perf_inject.data, and only contains branch info for binary1.
host $ create_llvm_prof -profile perf_inject_binary1.data -profiler text -binary path_of_binary1 -out a.prof -format binary

# perf_inject_kernel.data is split from perf_inject.data, and only contains branch info for [kernel.kallsyms].
host $ create_llvm_prof -profile perf_inject_kernel.data -profiler text -binary vmlinux -out a.prof -format binary
```

Then we can use a.prof for PGO during compilation, via `-fprofile-sample-use=a.prof`.
[Here](https://clang.llvm.org/docs/UsersManual.html#using-sampling-profilers) are more details.

### A complete example: etm_test_loop.cpp

`etm_test_loop.cpp` is an example to show the complete process.
The source code is in [etm_test_loop.cpp](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/runtest/etm_test_loop.cpp).
The build script is in [Android.bp](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/runtest/Android.bp).
It builds an executable called `etm_test_loop`, which runs on device.

**Step 1: Build `etm_test_loop` binary**

```sh
(host) <AOSP>$ . build/envsetup.sh
(host) <AOSP>$ lunch aosp_arm64-trunk_staging-userdebug
(host) <AOSP>$ make etm_test_loop
```

**Step 2: Run `etm_test_loop` on device, and collect ETM data while it runs**

```sh
(host) <AOSP>$ adb push out/target/product/generic_arm64/system/bin/etm_test_loop /data/local/tmp
(host) <AOSP>$ adb root
(host) <AOSP>$ adb shell
(device) / $ cd /data/local/tmp
(device) /data/local/tmp $ chmod a+x etm_test_loop
(device) /data/local/tmp $ simpleperf record -e cs-etm:u ./etm_test_loop
simpleperf I cmd_record.cpp:809] Recorded for 0.033556 seconds. Start post processing.
simpleperf I cmd_record.cpp:879] Aux data traced: 1,134,720
(device) /data/local/tmp $ simpleperf inject -i perf.data --output branch-list -o branch_list.data
(device) /data/local/tmp $ exit
(host) <AOSP>$ adb pull /data/local/tmp/branch_list.data
```

**Step 3: Convert ETM data to AutoFDO profile**

```sh
# Build simpleperf tool on host.
(host) <AOSP>$ make simpleperf_ndk
(host) <AOSP>$ simpleperf inject -i branch_list.data -o perf_inject_etm_test_loop.data --symdir out/target/product/generic_arm64/symbols/system/bin
(host) <AOSP>$ cat perf_inject_etm_test_loop.data
14
4000-4010:1
4014-4048:1
...
418c->0:1
// build_id: 0xa6fc5b506adf9884cdb680b4893c505a00000000
// /data/local/tmp/etm_test_loop

(host) <AOSP>$ create_llvm_prof -profile perf_inject_etm_test_loop.data -profiler text -binary out/target/product/generic_arm64/symbols/system/bin/etm_test_loop -out etm_test_loop.afdo -format binary
(host) <AOSP>$ ls -lh etm_test_loop.afdo
-rw-r--r-- 1 user group 241 Apr 30 09:52 etm_test_loop.afdo
```

**Step 4: Use AutoFDO profile to build optimized binary**

```sh
(host) <AOSP>$ cp etm_test_loop.afdo toolchain/pgo-profiles/sampling/
(host) <AOSP>$ vi toolchain/pgo-profiles/sampling/Android.bp
# Edit Android.bp to add a fdo_profile module:
#
# fdo_profile {
#    name: "etm_test_loop",
#    profile: "etm_test_loop.afdo"
# }
(host) <AOSP>$ vi toolchain/pgo-profiles/sampling/afdo_profiles.mk
# Edit afdo_profiles.mk to add etm_test_loop profile mapping:
#
# AFDO_PROFILES += keystore2://toolchain/pgo-profiles/sampling:keystore2 \
#  ...
#  server_configurable_flags://toolchain/pgo-profiles/sampling:server_configurable_flags \
#  etm_test_loop://toolchain/pgo-profiles/sampling:etm_test_loop
#
(host) <AOSP>$ vi system/extras/simpleperf/runtest/Android.bp
# Edit Android.bp to enable afdo for etm_test_loop:
#
# cc_binary {
#    name: "etm_test_loop",
#    srcs: ["etm_test_loop.cpp"],
#    afdo: true,
# }
(host) <AOSP>$ make etm_test_loop
```

We can check if `etm_test_loop.afdo` is used when building etm_test_loop.

```sh
(host) <AOSP>$ gzip -d out/verbose.log.gz
(host) <AOSP>$ cat out/verbose.log | grep etm_test_loop.afdo
   ... -fprofile-sample-use=toolchain/pgo-profiles/sampling/etm_test_loop.afdo ...
```

Comparing the disassembly of `out/target/product/generic_arm64/symbols/system/bin/etm_test_loop`
before and after optimizing with AutoFDO data shows different preferences when branching.

### A complete example: kernel

This example demonstrates how to collect ETM data for the Android kernel on a device, convert it to
an AutoFDO profile on the host machine, and then use that profile to build an optimized kernel.


**Step 1 (Optional): Build a Kernel with `-fdebug-info-for-profiling`**

While not strictly required, we recommend building the vmlinux file with the
`-fdebug-info-for-profiling` compiler flag. This option adds extra debug information that helps map
instructions accurately to source code, improving profile quality. For more details, see
[this LLVM review](https://reviews.llvm.org/D25435).

An example of how to add this flag to a kernel build can be found in
[this Android kernel commit](https://android-review.googlesource.com/c/kernel/common/+/3101987).


**Step 2: Collect ETM data for the kernel on device**

```sh
(host) $ adb root && adb shell
(device) / $ cd /data/local/tmp
# Record ETM data while running a representative workload (e.g., launching applications or
# running benchmarks):
(device) /data/local/tmp $ simpleperf record -e cs-etm:k -a --duration 60 -z -o perf.data
simpleperf I cmd_record.cpp:826] Recorded for 60.0796 seconds. Start post processing.
simpleperf I cmd_record.cpp:902] Aux data traced: 91,780,432
simpleperf I cmd_record.cpp:894] Record compressed: 27.76 MB (original 110.13 MB, ratio 4)
# Convert the raw ETM data to a branch list to reduce file size:
(device) /data/local/tmp $ mkdir branch_data
(device) /data/local/tmp $ simpleperf inject -i perf.data -o branch_data/branch01.data --output branch-list \
             --binary kernel.kallsyms
(device) /data/local/tmp $ ls -lh branch_data/branch01.data
-rw-rw-rw- 1 root  root  437K 2024-10-17 23:03 branch_data/branch01.data
# Run the record command and the inject command multiple times to capture a wider range of kernel
# code execution. ETM data traces the instruction stream, and under heavy load, much of this data
# can be lost due to overflow and rate limiting within simpleperf. Recording multiple profiles and
# merging them improves coverage.
```

Alternative: Instead of manual recording, you can use `profcollectd` to continuously collect ETM
data in the background. See the [Collect ETM Data with a Daemon](#collect-etm-data-with-a-daemon)
section for more information.


**Step 3: Convert ETM data to AutoFDO Profile on Host**

```sh
(host) $ adb pull /data/local/tmp/branch_data
(host) $ cd branch_data
# Download the corresponding vmlinux file and place it in the current directory.
# Merge the branch data files and generate an AutoFDO profile:
(host) $ simpleperf inject -i branch01.data,branch02.data,... --binary kernel.kallsyms --symdir . \
         --allow-mismatched-build-id -o kernel.autofdo -j 20
(host) $ ls -lh kernel.autofdo
-rw-r--r-- 1 yabinc primarygroup 1.3M Oct 17 16:39 kernel.autofdo
# Convert the AutoFDO profile to the LLVM profile format:
(host) $ create_llvm_prof --profiler text --binary=vmlinux --profile=kernel.autofdo \
         --out=kernel.llvm_profdata --format extbinary
(host) $ ls -lh kernel.llvm_profdata
-rw-r--r-- 1 yabinc primarygroup 1.4M Oct 17 19:00 kernel.llvm_profdata
```
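
The comma-separated `-i` argument can be tedious to type when there are many branch files. Below
is a small sketch for building it, assuming all files match `branch*.data` in one directory and
have no commas in their names. The helper name `join_branch_files` is hypothetical.

```sh
# Print the branch*.data files in a directory as one comma-separated list,
# suitable for the -i option of `simpleperf inject`.
# Usage: join_branch_files <dir>
join_branch_files() {
  list=""
  for f in "$1"/branch*.data; do
    # Skip the unexpanded pattern when no file matches.
    [ -e "$f" ] || continue
    list="${list:+$list,}$f"
  done
  printf '%s\n' "$list"
}

# Example (not run here):
#   simpleperf inject -i "$(join_branch_files .)" --binary kernel.kallsyms --symdir . \
#       --allow-mismatched-build-id -o kernel.autofdo -j 20
```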

**Step 4: Use the AutoFDO Profile when Building a New Kernel**

Integrate the generated kernel.llvm_profdata file into your kernel build process. An example of
how to use this profile data with vmlinux can be found in
[this Android kernel commit](https://android-review.googlesource.com/c/kernel/common/+/3293642).


## Convert ETM data for llvm-bolt (experiment)

We can also convert ETM data to profiles for [llvm-bolt](https://github.com/llvm/llvm-project/tree/main/bolt).
The binaries should have an unstripped symbol table and be linked with relocations (--emit-relocs
or -q linker flag).

```sh
# symdir is the directory containing etm_test_loop with unstripped symbol table and relocations.
(host) $ simpleperf inject -i perf.data --output bolt -o perf_inject_bolt.data --symdir symdir
# Remove the comment line.
(host) $ sed -i '/^\/\//d' perf_inject_bolt.data
(host) $ <LLVM_BIN>/perf2bolt --pa -p=perf_inject_bolt.data -o perf.fdata symdir/etm_test_loop
# --no-huge-pages and --align-text=0x4000 are used to avoid generating big binaries due to
# alignment. See https://github.com/facebookarchive/BOLT/issues/138.
# However, if the original binary is built with huge page alignments (-z max-page-size=0x200000),
# then don't use these flags.
(host) $ <LLVM_BIN>/llvm-bolt symdir/etm_test_loop -o etm_test_loop.bolt -data=perf.fdata \
         -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold \
         -split-eh -dyno-stats --no-huge-pages --align-text=0x4000
```

## Collect ETM data with a daemon

Android also has a daemon collecting ETM data periodically. It only runs on userdebug and eng
devices. The source code is in https://android.googlesource.com/platform/system/extras/+/main/profcollectd/.

## Options for collecting ETM data

Simpleperf provides several options for ETM data collection, which are listed in the
"ETM recording options" section of the `simpleperf record -h` output. Here's an introduction to some
of them:

ETM traces the instruction stream and can generate a large amount of data in a short time. The
kernel uses a buffer to store this data. The default buffer size is 4MB, which can be controlled
with the `--aux-buffer-size` option. Simpleperf periodically reads data from this buffer, by default
every 100ms. This interval can be adjusted using the `--etm-flush-interval` option. If the buffer
overflows, excess ETM data is lost. The defaults allow a data generation rate of 40MB/s (a 4MB
buffer drained every 100ms). This holds when using ETR; TRBE might copy data more frequently.
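
The relationship between buffer size, flush interval and data rate is simple arithmetic. As a rough
sketch (the helper name is hypothetical, and it ignores any overhead in the kernel or in
simpleperf), the highest data rate that can be drained without overflow is the buffer size divided
by the flush interval:

```sh
# Print the highest ETM data rate (in MB/s) that can be drained without
# overflowing the aux buffer, given the buffer size in MB and the flush
# interval in ms. Integer arithmetic; a rough estimate only.
# Usage: max_etm_rate_mb_per_s <buffer_size_mb> <flush_interval_ms>
max_etm_rate_mb_per_s() {
  echo $(( $1 * 1000 / $2 ))
}

max_etm_rate_mb_per_s 4 100    # default settings: prints 40
max_etm_rate_mb_per_s 16 100   # a 4x larger buffer: prints 160
```

If a workload generates data faster than this estimate, either a larger buffer or a shorter flush
interval is needed to avoid losing data.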

To reduce storage size, ETM data can be compressed before being written to disk using the `-z`
option. In practice, this reduces storage size by 75%.

Another way to reduce storage size is to decode ETM data before storing it, using the `--decode-etm`
option. This can achieve around a 98% reduction in storage size. However, it doubles the CPU cycles
and power used for recording, and can lead to data loss if processing doesn't keep up with the data
generation rate. For this reason, profcollectd currently uses `-z` for compression instead of
`--decode-etm`.

## Support ETM in the kernel

To let simpleperf use ETM, we need to enable the Coresight driver in the kernel, which lives in
`<linux_kernel>/drivers/hwtracing/coresight`.

The Coresight driver can be enabled with the following kernel configs:

```config
	CONFIG_CORESIGHT=y
	CONFIG_CORESIGHT_LINK_AND_SINK_TMC=y
	CONFIG_CORESIGHT_SOURCE_ETM4X=y
```

On kernel 5.10+, we recommend building the Coresight driver as kernel modules, because this works
with the GKI kernel.

```config
	CONFIG_CORESIGHT=m
	CONFIG_CORESIGHT_LINK_AND_SINK_TMC=m
	CONFIG_CORESIGHT_SOURCE_ETM4X=m
```
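
To verify that a kernel config enables these options, we can grep for them. Below is a minimal
sketch, assuming the config is available as a plain-text file (for example the `.config` of a
kernel build, or a decompressed copy of `/proc/config.gz` on kernels that expose it). The function
name is hypothetical.

```sh
# Report whether each Coresight option from this section is enabled
# (=y or =m) in a plain-text kernel config file.
# Returns non-zero if any option is missing.
# Usage: check_coresight_config <config_file>
check_coresight_config() {
  status=0
  for opt in CONFIG_CORESIGHT CONFIG_CORESIGHT_LINK_AND_SINK_TMC \
             CONFIG_CORESIGHT_SOURCE_ETM4X; do
    if line=$(grep -E "^${opt}=[ym]\$" "$1"); then
      echo "ok:      $line"
    else
      echo "missing: $opt"
      status=1
    fi
  done
  return $status
}
```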

Android common kernel 5.10+ should have all the Coresight patches needed to collect ETM data.
Android common kernel 5.4 is missing two patches. But by adding the patches in
https://android-review.googlesource.com/q/topic:test_etm_on_hikey960_5.4, we can collect ETM data
on hikey960 with the 5.4 kernel.
For Android common kernel 4.14 and 4.19, we have backported all necessary Coresight patches.

Besides the Coresight driver, we also need to add Coresight devices to the device tree. An example
is in https://github.com/torvalds/linux/blob/master/arch/arm64/boot/dts/arm/juno-base.dtsi. There
should be a path carrying ETM data from the ETM device through funnels, ETF and replicators, all
the way to ETR, which writes ETM data to system memory.

One optional flag in ETM device tree is "arm,coresight-loses-context-with-cpu". It saves ETM
registers when a CPU enters low power state. It may be needed to avoid
"coresight_disclaim_device_unlocked" warning when doing system wide collection.

One optional flag in ETR device tree is "arm,scatter-gather". Simpleperf requests 4MB of system
memory for ETR to store ETM data. Without an IOMMU, the memory needs to be contiguous. If the
kernel can't fulfill the request, simpleperf will report an out of memory error. Fortunately, we
can use the "arm,scatter-gather" flag to let ETR run in scatter-gather mode, which uses
non-contiguous memory.


### A possible problem: trace_id mismatch

Each CPU has an ETM device, which has a unique trace_id assigned by the kernel.
The formula is: `trace_id = 0x10 + cpu * 2`, as in https://github.com/torvalds/linux/blob/master/include/linux/coresight-pmu.h#L37.
If the formula is modified by local patches, then the simpleperf inject command can't parse ETM
data properly and is likely to give empty output.
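
As a quick reference, the expected trace_id for each CPU under the formula above can be computed as
follows. This sketch only evaluates the formula; it does not read anything from the kernel, and the
function name is hypothetical.

```sh
# Print the expected ETM trace_id for a CPU, using trace_id = 0x10 + cpu * 2.
# Usage: etm_trace_id <cpu_index>
etm_trace_id() {
  printf '0x%x\n' $(( 0x10 + $1 * 2 ))
}

# Expected trace_ids for an 8-core device (cpu 0 gives 0x10, cpu 7 gives 0x1e).
for cpu in 0 1 2 3 4 5 6 7; do
  echo "cpu $cpu -> trace_id $(etm_trace_id "$cpu")"
done
```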


## Enable ETM in the bootloader

Unless the ARMv8.4 Self-hosted Trace extension is implemented, ETM is considered an external debug
interface. It may be disabled by fuse (like JTAG). So we need to check if ETM is disabled, and
if the bootloader provides a way to reenable it.

We can tell if ETM is disabled by checking its TRCAUTHSTATUS register, which is exposed in sysfs,
like /sys/bus/coresight/devices/coresight-etm0/mgmt/trcauthstatus. To reenable ETM, we need to
enable non-Secure non-invasive debug on the ARM CPU. The method depends on the chip vendor (SoC).
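
Below is a minimal sketch for inspecting the register on a rooted device. It assumes the sysfs
layout mentioned above (device directory names vary between kernel versions, so the glob is an
assumption) and that each file holds a single hex value. The decoding follows our reading of the
ETM architecture specification, in which bits [3:2] of TRCAUTHSTATUS (the NSNID field) are `0b11`
when non-secure non-invasive debug is enabled and `0b10` when it is implemented but disabled.
The function name is hypothetical.

```sh
# Decode the NSNID field (bits [3:2]) of a TRCAUTHSTATUS value.
# Assumption from the ETM specification: 0b11 = enabled, 0b10 = disabled,
# 0b00 = not implemented.
# Usage: nsnid_state <hex_value>   e.g. nsnid_state 0xcc
nsnid_state() {
  case $(( ($1 >> 2) & 3 )) in
    3) echo "enabled" ;;
    2) echo "disabled" ;;
    0) echo "not implemented" ;;
    *) echo "reserved" ;;
  esac
}

# Print the state for every ETM device found in sysfs (run on device as root).
# Assumption: each file holds a single hex value such as 0xcc.
for f in /sys/bus/coresight/devices/*etm*/mgmt/trcauthstatus; do
  [ -r "$f" ] || continue
  echo "$f: $(nsnid_state "$(cat "$f")")"
done
```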


## Related docs

* [Arm Architecture Reference Manual Armv8, D3 AArch64 Self-hosted Trace](https://developer.arm.com/documentation/ddi0487/latest)
* [ARM ETM Architecture Specification](https://developer.arm.com/documentation/ihi0064/latest/)
* [ARM CoreSight Architecture Specification](https://developer.arm.com/documentation/ihi0029/latest)
* [CoreSight Components Technical Reference Manual](https://developer.arm.com/documentation/ddi0314/h/)
* [CoreSight Trace Memory Controller Technical Reference Manual](https://developer.arm.com/documentation/ddi0461/b/)
* [OpenCSD library for decoding ETM data](https://github.com/Linaro/OpenCSD)
* [AutoFDO tool for converting profile data](https://github.com/google/autofdo)
