---
title: ByteWise - Use `perf` to Gauge the Performance
tags: ByteWise, perf, linux, case study
---

# Use `perf` to Gauge the Performance

In this article, I will do a case study on the Stack Overflow question [Why is processing a sorted array faster than processing an unsorted array?](https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array).

:::info
**Note:** Besides `perf`, there are in fact plenty of other performance-measurement tools we can use on Linux.
![](https://i.imgur.com/wIMAp3B.png)
:::

## `perf`: Linux Profiler

### Introduction

The **`perf` Linux profiler** is also known as **Performance Counters for Linux (PCL)**, **Linux perf events (LPE)**, or **perf_events**. `perf` is a profiling tool for Linux 2.6+ systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple command-line interface [^3].

`perf` supports many profiling/tracing features:

* **Hardware events**: events counted by the CPU's Performance Monitoring Unit (PMU), also called Performance Monitoring Counters (PMCs), e.g. branch-misses, cache-misses, cpu-cycles, instructions.

:::info
The design and functionality of a PMU is CPU-specific and documented by the CPU vendor. For a listing of PMU hardware events for Intel and AMD processors, see
* Intel PMU event tables: Appendix A of the manual [here](http://www.intel.com/Assets/PDF/manual/253669.pdf)
* AMD PMU event table: section 3.14 of the manual [here](http://support.amd.com/us/Processor_TechDocs/31116.pdf)
:::

* **Software events**: low-level events based on kernel counters, e.g. CPU migrations, minor faults, major faults, etc.
* **[Kernel Tracepoint](#Kernel-Tracepoints) events**: static kernel-level instrumentation points that are hardcoded in interesting and logical places in the kernel.
* **User Statically-Defined Tracing (USDT)**: static tracepoints for user-level programs and applications.
* **Dynamic Tracing**: software can be dynamically instrumented, creating events in any location. For kernel software, this uses the kprobes framework; for user-level software, uprobes.
* **Timed Profiling**: snapshots can be collected at an arbitrary frequency, using `perf record --freq=${HZ}`. This is commonly used for CPU usage profiling, and works by creating custom timed interrupt events.

`perf` can instrument in three ways (a short command-line sketch of the first two modes follows at the end of this introduction):

* **counting** events in kernel context, where a summary of counts is printed by `perf`. This mode does not generate a perf.data file.
* **sampling** events, which writes event data to a kernel buffer that is read at a gentle asynchronous rate by the `perf` command and written to the perf.data file. This file is later read by `perf report` and related commands.
* **bpf** programs on events, a feature of Linux 4.4+ kernels that can execute custom user-defined programs in kernel space to perform efficient filtering and summarization of the data, e.g. efficiently measured latency histograms.

`perf` can help with advanced performance analysis and troubleshooting. Questions that can be answered include [^1] [^2]:

* Why is the kernel on-CPU so much? What code paths?
* Which code paths are causing CPU level 2 cache misses?
* Are the CPUs stalled on memory I/O?
* Which code paths are allocating memory, and how much?
* What is triggering TCP retransmits?
* Is a certain kernel function being called, and how often?
* For what reasons are threads leaving the CPU?
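
To make the first two modes concrete, here is a minimal command-line sketch. The binary name `./unsort` refers to the case-study benchmark measured later in this article; the event list is just an illustrative choice, not something prescribed by `perf`.

```shell
# Counting mode: events are counted in-kernel and only the totals are
# printed when the program exits; no perf.data file is produced.
perf stat -e branches,branch-misses ./unsort

# Sampling mode: individual samples are written to ./perf.data for
# later inspection with `perf report` or `perf annotate`.
perf record -e branch-misses ./unsort
```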
### Subcommands

`perf` is used with several subcommands:

* `list`: list available events

```shell
List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  cgroup-switches                                    [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]
  [...]
```

* `stat`: measure total event counts for a single program or system-wide. `stat` only reports aggregate statistics: it counts every time, say, a cache miss occurs, but keeps no information about which part of the program triggered the event. That is where `perf record` comes to the rescue.

```shell
 Performance counter stats for './unsort':

            311.83 msec task-clock                #    0.997 CPUs utilized
                27      context-switches          #   86.585 /sec
                 0      cpu-migrations            #    0.000 /sec
                86      page-faults               #  275.790 /sec
     1,344,067,559      cycles                    #    4.310 GHz                      (83.33%)
           238,075      stalled-cycles-frontend   #    0.02% frontend cycles idle     (83.34%)
           232,294      stalled-cycles-backend    #    0.02% backend cycles idle      (83.33%)
     1,282,482,203      instructions              #    0.95  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.33%)
       196,671,980      branches                  #  630.700 M/sec                    (83.33%)
        37,960,576      branch-misses             #   19.30% of all branches          (83.34%)

       0.312754940 seconds time elapsed

       0.312368000 seconds user
       0.000000000 seconds sys
```

* `record`: sample events and write them to a perf.data file for later analysis (see the workflow sketch after the installation steps below).

:::warning
:exclamation: **Warning**: When using the **sampling mode** with `perf record`, you'll need to be a little careful about the overhead, as the capture files can quickly grow to hundreds of megabytes. It depends on the rate of the event you are tracing: the more frequent the event, the higher the overhead and the larger the perf.data file.
:::

* `annotate`: annotate source code or assembly with the recorded samples

### Installation

1. Install `perf`

```shell
sudo apt install linux-tools-$(uname -r) linux-tools-generic
```

2. Relax the perf access control: `sudo sysctl -w kernel.perf_event_paranoid=-1`

:::danger
:octagonal_sign: **Error**: If `perf` pops up the error message below, make sure the access control setting `kernel.perf_event_paranoid` is configured correctly.
![](https://i.imgur.com/3cx9vHM.png)
:::

:::info
#### :bookmark: [The Linux kernel user's and administrator's guide / Perf events and tool security / Unprivileged users](https://docs.kernel.org/admin-guide/perf-security.html?highlight=perf_event_paranoid#unprivileged-users)

> perf_events scope and access control for unprivileged processes is governed by the perf_event_paranoid setting:

**-1**: Impose no scope and access restrictions on using perf_events performance monitoring. Per-user per-cpu perf_event_mlock_kb locking limit is ignored when allocating memory buffers for storing performance data. This is the least secure mode since allowed monitored scope is maximized and no perf_events specific limits are imposed on resources allocated for performance monitoring.

**>=0**: scope includes per-process and system wide performance monitoring but excludes raw tracepoints and ftrace function tracepoints monitoring. CPU and system events happened when executing either in user or in kernel space can be monitored and captured for later analysis. Per-user per-cpu perf_event_mlock_kb locking limit is imposed but ignored for unprivileged processes with CAP_IPC_LOCK capability.

**>=1**: scope includes per-process performance monitoring only and excludes system wide performance monitoring. CPU and system events happened when executing either in user or in kernel space can be monitored and captured for later analysis. Per-user per-cpu perf_event_mlock_kb locking limit is imposed but ignored for unprivileged processes with CAP_IPC_LOCK capability.

**>=2**: scope includes per-process performance monitoring only. CPU and system events happened when executing in user space only can be monitored and captured for later analysis. Per-user per-cpu perf_event_mlock_kb locking limit is imposed but ignored for unprivileged processes with CAP_IPC_LOCK capability.
:::

3. Disable the kernel pointer restriction: `echo 0 | sudo tee /proc/sys/kernel/kptr_restrict`

:::info
**Note**: If you are measuring *cache-miss events*, you have to disable the ***kernel pointer restriction*** as shown above.
:::
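
With `perf` installed and the two sysctls above configured, a typical sampling round trip for the case study looks roughly like the sketch below. The binary name `./unsort` is the benchmark from the `perf stat` output earlier; the exact event names and report output depend on your CPU and kernel.

```shell
# Sample the branch-miss PMU event while the benchmark runs.
# This writes the samples to ./perf.data.
perf record -e branch-misses ./unsort

# Summarize which symbols the samples landed in (plain-text report).
perf report --stdio

# Drill down to annotated assembly of the hot code; compiling the
# benchmark with -g lets perf interleave the source as well.
perf annotate --stdio
```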
### Kernel Tracepoints

These tracepoints are hardcoded in interesting and logical locations of the kernel, so that higher-level behavior can be traced easily, for example system calls, TCP events, file system I/O, disk I/O, etc. They are grouped into libraries of tracepoints, e.g. "sock:" for socket events and "sched:" for CPU scheduler events. A key value of tracepoints is that they have a stable API: if you write tools that use them on one kernel version, they should also work on later versions. Tracepoints are usually added to kernel code by placing a macro from include/trace/events/*.

## Branch Predictions

## [Clear the Linux Cache Manually](https://medium.com/hungys-blog/clear-linux-memory-cache-manually-90bec95ea003)

[^1]: Brendan Gregg, "perf Examples", *Brendan Gregg's Blog*, https://www.brendangregg.com/perf.html
[^2]: Taiyou Kuo, "Performance Evaluation: perf, eBPF", https://hackmd.io/@st9540808/HkBl5kCSU
[^3]: "Linux kernel profiling with perf", *perf wiki*, https://perf.wiki.kernel.org/index.php/Tutorial
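
Regarding the cache-clearing section above: the usual mechanism (and presumably what the linked post describes) is writing to `/proc/sys/vm/drop_caches`. This can be handy before re-running cache-miss measurements so that a warm page cache does not skew the results.

```shell
# Flush dirty pages to disk first, then drop the caches.
# Values for drop_caches: 1 = page cache, 2 = dentries and inodes, 3 = both.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
```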