# Systems Performance Analysis

###### tags: `Performance Analysis` `Debugging` `Profiling` `Tracing` `Linux` `Kernel` `eBPF` `perf` `Operating System` `Kubernetes` `System Administration` `System Design` `Optimization` `Embedded System`

Systems latency/throughput analysis via profiling & tracing methods.

> Who (which instances), why (at what cost), how (the delta), and what (which instructions) the CPU is working on right NOW:
> ![image](https://hackmd.io/_uploads/H1lJIYs4xe.png =500x)
> (Source: [[Slides] Linux Profiling at Netflix - Brendan Gregg](https://www.slideshare.net/slideshow/scale2015-linux-perfprofiling/44966387))

:::warning
The tools and functions discussed in this article are primarily based on:
- **Debian GNU/Linux:** v12 (bookworm)
- **Linux kernel:** v6.1
- **Glibc:** v2.36
:::

## Overviews

- [Linux tracing systems & how they fit together - Julia Evans](https://jvns.ca/blog/2017/07/05/linux-tracing-systems/)
- [Linux Crisis Tools - Brendan Gregg's Blog](https://www.brendangregg.com/blog/2024-03-24/linux-crisis-tools.html)
- [改進功能和效能 + Ftrace/eBPF - Linux 核心設計/實作課程作業 - kecho + khttpd](https://hackmd.io/@sysprog/linux2023-ktcp/%2F%40sysprog%2Flinux2023-ktcp-c?utm_source=preview-mode&utm_medium=rec)
- [[Slides] 以 eBPF 構建一個更為堅韌的 Kubernetes 叢集 - HungWei Chiu](https://www.slideshare.net/hongweiqiu/ebpf-kubernetes)

| Package | Provides | Notes |
| -------- | -------- | -------- |
| procps | ps(1), vmstat(8), uptime(1), top(1) | basic stats |
| util-linux | dmesg(1), lsblk(1), lscpu(1) | system log, device info |
| sysstat | iostat(1), mpstat(1), pidstat(1), sar(1) | device stats |
| iproute2 | ip(8), ss(8), nstat(8), tc(8) | preferred net tools |
| numactl | numastat(8) | NUMA stats |
| tcpdump | tcpdump(8) | network sniffer |
| linux-tools-common linux-tools-$(uname -r) | perf(1), turbostat(8) | profiler and PMU stats |
| bpfcc-tools (bcc) | opensnoop(8), execsnoop(8), runqlat(8), softirqs(8), hardirqs(8), ext4slower(8), ext4dist(8), biotop(8), biosnoop(8), biolatency(8), tcptop(8), tcplife(8), trace(8), argdist(8), funccount(8), profile(8), ... | canned eBPF tools |
| bpftrace | bpftrace, basic versions of opensnoop(8), execsnoop(8), runqlat(8), biosnoop(8), etc. | eBPF scripting |
| trace-cmd | trace-cmd(1) | Ftrace CLI |
| nicstat | nicstat(1) | net device stats |
| ethtool | ethtool(8) | net device info |
| tiptop | tiptop(1) | PMU/PMC top |
| cpuid | cpuid(1) | CPU details |
| msr-tools | rdmsr(8), wrmsr(8) | CPU digging |

## Analysis Methodologies

Observation → Questioning/Motivation → Research → Design/Hypothesis → Implementation/Experiment → Analysis/Conclusion → Proofs → Feedback/Improvement

> More: DMAIC (*Define, Measure, Analyze, Improve & Control*), OODA Loop (*Observe, Orient, Decide, Act*)

Resources:
- :+1: [Performance Analysis Methodology - Brendan Gregg](https://www.brendangregg.com/methodology.html)

### [Review] Performance Analysis

#### Metrics

- Throughput: Task/transaction completion count per time period.
- Response Time: Time the task has taken.
- Elapsed Time: Total response time (including time sharing, I/O time, ...).
- Waiting Time: Total time the task spends waiting to be serviced.

#### Measurements

- CPI: Average clock cycles per instruction.
\begin{flalign}
\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} &&
\end{flalign}
- CPU Time: Time cost for the task completion.
\begin{flalign}
\text{CPU Time} &= \text{Clock Cycles} \times \text{Cycle Time} \\
&= (\text{Instruction Count} \times \text{CPI}_{avg}) \times (\frac{1}{\text{Clock Rate}}) &&
\end{flalign}
\begin{flalign}
\text{Performance} &= \frac{1}{\text{CPU Time}} &&
\end{flalign}
- SPEC Power Benchmark: Power consumption of a server at different workload levels.
\begin{flalign}
\text{Performance (ssj_ops per sec)} &= \frac{\text{ssj_ops}}{\text{time}} &&
\end{flalign}
\begin{flalign}
\text{Power Efficiency (workload per Watt)} &= \frac{\sum_{i=1}^{10} \text{ssj_ops (server side Java operations)}_i}{\sum_{i=1}^{10} \text{Power (Joule per sec)}_i} &&
\end{flalign}
![image](https://hackmd.io/_uploads/B1JwUADQA.png =500x)
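As a quick worked example of the CPU-Time relations above (made-up numbers): a program that executes $10^9$ instructions at an average CPI of 1.5 on a 2 GHz core takes
\begin{flalign}
\text{CPU Time} = \frac{10^9 \times 1.5}{2 \times 10^9 \ \text{Hz}} = 0.75 \ \text{s} &&
\end{flalign}
so halving either the instruction count or the average CPI halves the CPU time; this is the lens used throughout the rest of the analysis.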
### Review: Amdahl's Law

*Limitation of Parallel Computing*

\begin{equation}
Speedup = \frac{1}{(1 - P) + \frac{P}{N}}
\end{equation}

> For example: [Amdahl's Law - 計算機組織 - chi_gitbook](https://chi_gitbook.gitbooks.io/personal-note/content/amdahls_law.html)

### Review: Gustafson's Law

Instead of focusing, as Amdahl's law does, on the speedup of a fixed-size problem, Gustafson's law assumes the problem grows with the available processors, so we take full advantage of parallel computing by solving the problem at scale.

\begin{align}
Speedup &= \frac{Serial + Parallel \times N}{Serial + Parallel} & (Serial + Parallel = 1) &\\
&= N + (1 - N) \times Serial & (Serial \nearrow, Speedup \searrow) &\\
&= 1 + (N - 1) \times Parallel & (Parallel \nearrow, Speedup \nearrow) &\\
\end{align}

### Bottleneck Analysis

Find the critical weakness in the system that limits overall performance (the bucket effect: the shortest stave sets the water level).

- **Find the Minimal Throughput:** Calculate the maximum capacity of each service in the system & pick out the culprit.
  - Example: CPU, RAM, datapath, I/O bottleneck analysis
    ![image](https://hackmd.io/_uploads/H1OgZZu7C.png =500x)
- **Use Model Simulation:** If the system is fairly complex, utilize *queuing theory models*, *state machines*, *Markov chains*, etc., to simulate, analyze, & tune the system performance.

#### Simple Use Case

[Throughput vs. Latency: How To Debug A Latency Problem - Studying With Alex](https://www.youtube.com/watch?v=f7VsHLk_Z8c)

![image](https://hackmd.io/_uploads/rJkANEnjR.png =500x)

### Utilization Saturation and Errors (USE) Method

*Directs the construction of a checklist, which for server analysis can be used for quickly identifying resource bottlenecks or errors.*

Instead of beginning with the available metrics (partial answers) and trying to work backwards, the USE Method poses questions first and then seeks the metrics to answer them: for every resource, check **utilization**, **saturation**, and **errors**.

:+1: [The USE Method - Brendan Gregg](https://www.brendangregg.com/usemethod.html)

### Method R

*A latency-based performance-tracing methodology, especially suited to database administration & application development.*

It is a performance-optimization approach, used primarily with Oracle database systems, that identifies and resolves the most significant contributors to response time in a systematic and efficient manner.

Method R Workbench is an Oracle trace file management system for analyzing, managing, and manipulating thousands of files at a time.

### Parallel Programming

- SIMD: AVX/SSE
- MIMD: CUDA

### Cache-Friendly Programming

[每位程式開發者都該有的記憶體知識 - jserv - GitHub](https://github.com/sysprog21/cpumemory-zhtw/tree/master)

#### Instruction (I-Cache)

- Instruction Bundling
- Optimization of Basic Blocks
- Branchless Programming

#### Data (D-Cache)

- Data Bundling
- Array Access Pattern Optimization (sized and strided to match the cache line; see the snippet below)
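Both D-cache techniques revolve around the cache line size. A quick way to check it on a Linux box (64 bytes is typical on x86-64; see also the StackOverflow link below for doing this in code):

```sh
# L1 data-cache line size, via glibc's getconf
getconf LEVEL1_DCACHE_LINESIZE

# The same value straight from sysfs (cpu0; L1d is usually index0 or index1)
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
```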
More: [Programmatically get the cache line size? - StackOverflow](https://stackoverflow.com/questions/794632/programmatically-get-the-cache-line-size)

## Tracing/Profiling Frontend Tools

### slabtop

Displays real-time kernel slab cache information, useful for spotting kernel-side memory growth.

```sh
slabtop
```

![image](https://hackmd.io/_uploads/SJ3DQhFWel.png =500x)

### iostat

*Monitors system input/output device loading by observing the time the devices are active in relation to their average transfer rates.*

> This can be used to change the system configuration to better balance the input/output load between physical disks.

```sh
iostat -xz 1
```

```sh
# First report: averages since boot
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.01    0.05    0.00   99.92

Device   r/s  rkB/s  rrqm/s  %rrqm  r_await  rareq-sz   w/s  wkB/s  wrqm/s  %wrqm  w_await  wareq-sz   d/s  dkB/s  drqm/s  %drqm  d_await  dareq-sz   f/s  f_await  aqu-sz  %util
loop0    0.00  0.00    0.00   0.00     0.00      1.21  0.00   0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.00   0.00
loop1    0.00  0.00    0.00   0.00     0.15      8.37  0.00   0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.00   0.00
loop2    0.00  0.00    0.00   0.00     0.15      7.48  0.00   0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.00   0.00
loop3    0.00  0.00    0.00   0.00     0.11      5.47  0.00   0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.00   0.00
...
nvme0n1  0.12  4.13    0.05  30.56     0.19     34.60  0.41   7.16    0.21  34.47     0.31     17.53  0.00   0.00    0.00   0.00     0.00      0.00  0.10     0.09    0.00   0.02
nvme1n1  0.07  0.85    0.02  24.01     3.93     12.90  0.89  11.77    1.35  60.31    29.63     13.22  0.00   0.00    0.00   0.00     0.00      0.00  0.24    35.25    0.04   1.53
sda      0.00  0.02    0.00   0.00     1.91     41.35  0.00   0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.00   0.00

# Subsequent reports: real-time (per 1-second interval)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.08    0.00   99.92

Device   r/s  rkB/s  rrqm/s  %rrqm  r_await  rareq-sz   w/s  wkB/s  wrqm/s  %wrqm  w_await  wareq-sz   d/s  dkB/s  drqm/s  %drqm  d_await  dareq-sz   f/s  f_await  aqu-sz  %util
nvme1n1  0.00  0.00    0.00   0.00     0.00      0.00  2.00  40.00    8.00  80.00     9.00     20.00  0.00   0.00    0.00   0.00     0.00      0.00  1.00    13.00    0.03   2.00
...
```

### sar

Reports page-fault and swap in/out rates, e.g., `sar -B` for paging statistics and `sar -W` for swapping statistics.

### VM Operations

The `/proc/sys/vm/` sysctl interface tunes the operation of the virtual memory (VM) subsystem of the Linux kernel and the writeout of dirty data to disk (see the `drop_caches` example in the strace section below).

Resources:
- [Documentation for /proc/sys/vm - Kernel Docs](https://docs.kernel.org/admin-guide/sysctl/vm.html)

### strace

*User-Space Syscall/Signal Tracing Tool*

> strace is a utility that traces *system calls* and *signals* using ***ptrace*** under the hood (**GDB** adopts the same mechanism). Whenever the traced child process enters a system call, it is stopped until the supervising parent process allows it to continue.

Example code:

```sh
# When running an I/O-intensive benchmark, you want to be sure that
# the various settings you try are all actually doing disk I/O, so
# Linux allows you to drop caches rather than do a full reboot.
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"

sudo strace -c python3 ~/user_program.py   # -c: count time, calls, errors, ...
```

Output:

```sh
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 14.83    0.000215           2        89        18 newfstatat
 14.28    0.000207           3        66           rt_sigaction
 13.45    0.000195          13        14           getdents64
 13.38    0.000194           7        27           mmap
 11.38    0.000165           6        27         5 openat
  8.07    0.000117          39         3           munmap
  6.07    0.000088           4        22           close
  3.31    0.000048           8         6           mprotect
  3.03    0.000044           2        20           read
  2.21    0.000032           2        13         6 ioctl
  2.07    0.000030           1        23         3 lseek
  1.72    0.000025          12         2           write
  1.45    0.000021           3         7           brk
  1.03    0.000015           3         4           fcntl
...
------ ----------- ----------- --------- --------- ----------------
100.00    0.001450           4       347        36 total
```
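Counting with `-c` is a good first pass; for live debugging it is often more useful to attach to an already-running process and filter the output. A sketch (PID 1234 is a placeholder):

```sh
# Attach to a running process, follow forked children (-f),
# show the time spent inside each syscall (-T),
# and only report file-related syscalls
sudo strace -f -T -e trace=file -p 1234
```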
:::info
**Practice: Tracing `brk` Syscall & `malloc` Wrapper Function Call**

- Source code:

```c=
// program for tracing brk syscalls
#include <unistd.h>
#include <stdlib.h>

int main()
{
    // returns the current end address of the data segment
    char *ptr1 = sbrk(0);
    sleep(1);

    // move the end of the data segment directly
    // (equivalent to sbrk(0x200))
    brk(ptr1 + 512);
    sleep(1);

    // use the standard C library's malloc call
    // (note: glibc may defer or overcommit the actual allocation;
    //  an unused malloc() can even be optimized out at compile time)
    char *ptr2 = (char*) malloc(512);
    sleep(1);

    // first touch faults the page in (no extra syscall)
    ptr2[400] = 'a';
    sleep(1);

    free(ptr2);
    sleep(1);

    return 0;
}
```

- Compiled code:
![image](https://hackmd.io/_uploads/H1Yui7oWkx.png =400x)

- `strace` output:

```sh
...
# [line 08] sbrk(0);
brk(NULL) = 0x564cf6417000              # ptr1 = 0x564cf6417000
clock_nanosleep(...) = 0

# [line 13] brk(ptr1 + 512 /*0x200*/);
brk(0x564cf6417200) = 0x564cf6417200    # end address = ptr1 + 0x200
clock_nanosleep(...) = 0

# [line 19] char *ptr2 = (char*) malloc(512);
# getrandom:
#   get a random number for tcache_key_initialize(),
#   & don't block if unavailable
getrandom("\xd0\x5c\x19\xe1\xa3\x36\xb5\x3f", 8, GRND_NONBLOCK) = 8
# grow the heap by 0x21000 (~132 KiB): the 512 B request plus glibc's default top padding
brk(0x564cf6438200) = 0x564cf6438200
# grow a bit more so the program break ends on a 4 KiB page boundary (0x...9000)
brk(0x564cf6439000) = 0x564cf6439000
clock_nanosleep(...) = 0

# [line 23] ptr2[400] = 'a'; (no syscall)
clock_nanosleep(...) = 0

# [line 26] free(ptr2); (no syscall)
clock_nanosleep(...) = 0

exit_group(0) = ?
+++ exited with 0 +++
```

- Delve into `malloc`'s source code:
  - [/malloc/malloc.c - Glibc - bootlin elixir](https://elixir.bootlin.com/glibc/glibc-2.36/source/malloc/malloc.c#L3280)
  - [/malloc/arena.c - Glibc - bootlin elixir](https://elixir.bootlin.com/glibc/glibc-2.36/source/malloc/arena.c#L313)

:::warning
**Why use this `malloc`?**

This is not the fastest, most space-conserving, most portable, or most tunable malloc ever written. However it is among the fastest while also being among the most space-conserving, portable and tunable. Consistent balance across these factors results in a good general-purpose allocator for malloc-intensive programs. The main properties of the algorithms are:
* For large (>= 512 bytes) requests, it is a pure best-fit allocator, with ties normally decided via FIFO (i.e. least recently used).
* For small (<= 64 bytes by default) requests, it is a caching allocator, that maintains pools of quickly recycled chunks.
* In between, and for combinations of large and small requests, it does the best it can trying to meet both goals at once.
* For very large requests (>= 128KB by default), it relies on system memory mapping facilities, if supported.
:::

```c=
// malloc entry point
void *
__libc_malloc (size_t bytes)
{
  mstate ar_ptr;
  void *victim;

  _Static_assert (PTRDIFF_MAX <= SIZE_MAX / 2,
                  "PTRDIFF_MAX is not more than half of SIZE_MAX");

  if (!__malloc_initialized)
    ptmalloc_init ();

  // init tcache which performs double-free check
#if USE_TCACHE
  /* int_free also calls request2size, be careful to not pad twice.  */
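  /* (Editorial note, not part of glibc: the block below is the tcache fast
     path.  Each thread keeps a small per-size-class cache of recently freed
     chunks; if one matches this request it is returned directly, with no
     arena lock and no brk/mmap syscall.) */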
  size_t tbytes = checked_request2size (bytes);
  if (tbytes == 0)
    {
      __set_errno (ENOMEM);
      return NULL;
    }
  size_t tc_idx = csize2tidx (tbytes);

  MAYBE_INIT_TCACHE ();

  DIAG_PUSH_NEEDS_COMMENT;
  if (tc_idx < mp_.tcache_bins
      && tcache
      && tcache->counts[tc_idx] > 0)
    {
      victim = tcache_get (tc_idx);
      return tag_new_usable (victim);
    }
  DIAG_POP_NEEDS_COMMENT;
#endif

  if (SINGLE_THREAD_P)
    {
      victim = tag_new_usable (_int_malloc (&main_arena, bytes));
      assert (!victim || chunk_is_mmapped (mem2chunk (victim)) ||
              &main_arena == arena_for_chunk (mem2chunk (victim)));
      return victim;
    }

  arena_get (ar_ptr, bytes);

  victim = _int_malloc (ar_ptr, bytes);
  /* Retry with another arena only if we were able to find a usable arena
     before.  */
  if (!victim && ar_ptr != NULL)
    {
      LIBC_PROBE (memory_malloc_retry, 1, bytes);
      ar_ptr = arena_get_retry (ar_ptr, bytes);
      victim = _int_malloc (ar_ptr, bytes);
    }

  if (ar_ptr != NULL)
    __libc_lock_unlock (ar_ptr->mutex);

  victim = tag_new_usable (victim);

  assert (!victim || chunk_is_mmapped (mem2chunk (victim)) ||
          ar_ptr == arena_for_chunk (mem2chunk (victim)));
  return victim;
}
```
:::

Refs:
- [How many kernel system calls do runtimes make? - Hussein Nasser](https://www.youtube.com/watch?v=ERaGORGfLF4)
- [Why drop caches in Linux? - ServerFault](https://serverfault.com/questions/597115/why-drop-caches-in-linux)

### ltrace

*An strace-like tool that records calls into shared libraries rather than raw system calls*

ltrace intercepts and records the dynamic library calls made by an executed process and the signals received by that process.

### dtrace

> With *dtrace*, the programmer writes probes in a language with a *C-like* syntax called *D*. These probes define what *dtrace* should do when the program invokes a system call, exits a function, or whatever else you'd like. These probes are stored in a script file that looks something like this.

![image](https://hackmd.io/_uploads/SyWRVEPA0.png =500x)

#### Writing a D Script

- probe.d

```sh
#!/usr/sbin/dtrace -qs

#pragma D option flowindent
#pragma D option dynvarsize=64m

/* Whenever the read system call is invoked, print the string "read x1". */
syscall::read:entry
{
    printf("read x1");
}
```

#### Running dtrace

```sh
dtrace -s probe.d
```

Refs:
- [Hooked on DTrace, part 1 - Big Nerd Ranch](https://bignerdranch.com/blog/hooked-on-dtrace-part-1/)
- [Dynamic Tracing with Dtrace and SystemTap - myaut](https://myaut.github.io/dtrace-stap-book/tools/dtrace.html)
- [New Features for Dtrace - Linux Kernel](https://lore.kernel.org/all/ZhBRSM2j0v7cOLn%2F@oracle.com/T/#u)

### BPF-Based Applications

![image](https://hackmd.io/_uploads/H10YL6wXR.png =500x)

- **libbpf C/C++ Library:** The *libbpf library* is a generic *C/C++* ***eBPF*** library. BPF programs are written in C, compiled to ***eBPF bytecode*** with *clang/LLVM*, and loaded into the kernel and attached to events through *libbpf*.
- **BCC (BPF Compiler Collection):** *BCC* is a framework built on top of the ***libbpf C/C++ library*** (see the intro above) that lets users write *Python* programs with ***eBPF*** programs embedded inside them. Running the Python program compiles the embedded code to ***eBPF bytecode***, loads it into the kernel, runs it, and finally collects the data and displays it in user space.
- **bpftrace:** *bpftrace* is a *high-level tracing language* built on top of **BCC**. It uses *LLVM* as a backend to compile scripts to ***eBPF bytecode*** and makes use of **BCC** for interacting with the kernel. The *bpftrace language* is inspired by *awk*, *C*, and the predecessors **DTrace** and **SystemTap** (see the one-liner below).
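For a taste of the bpftrace syntax, a classic one-liner (run as root, assuming the `bpftrace` package listed earlier is installed) that counts system calls per process until interrupted:

```sh
# Count syscalls by process name; Ctrl-C prints the aggregated map
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
```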
## Tracing/Profiling Frameworks

### ftrace

*ftrace* is an internal kernel tracer for debugging and analysis: it reveals kernel activity that is invisible from user space. It can do *latency tracing*, which examines latencies involving interrupts, preemption, and task scheduling; or *event tracing*, which monitors hundreds of static event points throughout the kernel via the ==tracefs file system==.

[ftrace - Linux Kernel](https://www.kernel.org/doc/Documentation/trace/ftrace.txt)

### ptrace

ptrace is a system call which a program can use to:
- trace system calls
- read and write memory and registers
- manipulate signal delivery to the traced process

:::warning
Both ++gdb++ & ++strace++ adopt *ptrace* to trace a program's syscalls.
:::

![image](https://hackmd.io/_uploads/Hyj5DNv00.png =500x)

Resources:
- :+1: [Strace是如何工作的 ? - One Man's Yammer](https://laoar.github.io/blog/2017/06/12/strace/)
- :+1: [Intercepting and modifying Linux system calls with ptrace - Phil Eaton](https://notes.eatonphil.com/2023-10-01-intercepting-and-modifying-linux-system-calls-with-ptrace.html)
- [How does strace work? - Joe Damato - Packagecloud Blog](https://blog.packagecloud.io/how-does-strace-work/)
- [基於 ptrace 在 Linux 上打造具體而微的 debugger - Yiwei Lin](https://hackmd.io/@RinHizakura/BJH7zsU99): debugger implementation & debug info

### eBPF

*An In-Kernel, Event-Driven Virtual-Machine Execution Entity (like a JavaScript Engine)*

The Linux eBPF subsystem has several tracing capabilities: **kernel dynamic tracing (kprobes)**, **user-level dynamic tracing (uprobes)**, & **tracepoints**.

> "eBPF does to Linux what JavaScript does to HTML." — Brendan Gregg

- View from a higher-level perspective
![image](https://hackmd.io/_uploads/S1gsU6PmC.png =500x)
- How it works in a nutshell
![image](https://hackmd.io/_uploads/SJ_iLTvQR.png =500x)
- Detailed mechanism implementation
![image](https://hackmd.io/_uploads/BkgAUaP7R.jpg =500x)

| | classic BPF (cBPF) | extended BPF (eBPF, 2013+) |
|:---------:|:------------------:|:------------------------------:|
| Word Size | 32-bit | 64-bit |
| Registers | 2 | 10+1 |
| Storage | 16 Slots | Stack (512B) & Map Storage (∞) |
| Events | Packets | Multi-Event Sources |

More:
- [fentry – 记 Linux 内核观测的冰山一角 - King's Way](https://stdio.io/1646)
- [Static Keys - Linux Kernel](https://docs.kernel.org/staging/static-keys.html)
- [Jump Label - A Geek's Page](https://wangcong.org/2010/09/24/jump-label/)
- [/arch/x86/kernel/jump_label.c](https://elixir.bootlin.com/linux/v6.1/source/arch/x86/kernel/jump_label.c)

```c=
struct jump_entry {
    s32 code;    // relative offset to the branch instruction (patched to a NOP or a jump)
    s32 target;  // relative offset to the jump target
    long key;    // offset to the static key; the key may be far away from the core kernel under KASLR
};

static struct jump_label_patch
__jump_label_patch(struct jump_entry *entry, enum jump_label_type type)
{
    const void *expect, *code, *nop;
    const void *addr, *dest;
    int size;

    addr = (void *)jump_entry_code(entry);
    dest = (void *)jump_entry_target(entry);
    size = arch_jump_entry_size(entry);

    switch (size) {
    case JMP8_INSN_SIZE:
        code = text_gen_insn(JMP8_INSN_OPCODE, addr, dest);
        nop = x86_nops[size];
        break;

    case JMP32_INSN_SIZE:
        code = text_gen_insn(JMP32_INSN_OPCODE, addr, dest);
        nop = x86_nops[size];
        break;

    default:
        BUG();
    }

    if (type == JUMP_LABEL_JMP)
        expect = nop;
    else
        expect = code;

    if (memcmp(addr, expect, size)) {
        /*
         * The location is not an op that we were expecting.
         * Something went wrong. Crash the box, as something could be
         * corrupting the kernel.
         */
        pr_crit("jump_label: Fatal kernel bug, unexpected op at %pS [%p] (%5ph != %5ph) size:%d type:%d\n",
                addr, addr, addr, expect, size, type);
        BUG();
    }

    if (type == JUMP_LABEL_NOP)
        code = nop;

    return (struct jump_label_patch){.code = code, .size = size};
}
```
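Once eBPF programs are attached (by perf, bpftrace, the BCC tools, etc.), they can be inspected from user space with `bpftool` (requires root; on Debian it ships in the bpftool/linux-tools packages):

```sh
# List the eBPF programs currently loaded in the kernel,
# and the maps they use to share data with user space
sudo bpftool prog list
sudo bpftool map list
```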
Refs:
- [What is eBPF? - ebpf.io](https://ebpf.io/tw-cn/what-is-ebpf/)
- [Linux 核心設計: 透過 eBPF 觀察作業系統行為 - jserv](https://hackmd.io/@sysprog/linux-ebpf)

:::success
**Boost Tracing/Monitoring Functions in Distributed Systems**

For system monitoring in distributed systems, we usually deploy sidecar proxies alongside containers to implement distributed tracing across the service mesh. However, this approach can add significant performance overhead at the application level. Given the downsides of sidecars, we can now achieve the same observability with eBPF instead.
:::

### perf

perf is a user-space application built on *perf_events*, the performance monitoring subsystem in the Linux kernel, which provides low-overhead access to:

> ![image](https://hackmd.io/_uploads/BkXovtsExl.png)
> ![](https://hackmd.io/_uploads/r1QG7Yi4gl.jpg =500x)
> (Source: [[Slides] Linux Profiling at Netflix - Brendan Gregg](https://www.slideshare.net/slideshow/scale2015-linux-perfprofiling/44966387))

![image](https://hackmd.io/_uploads/S1zijNP0C.png =500x)

- **[Hardware] Counting or Sampling via Performance Monitoring Counters (PMC):** CPU cycles, cache misses, instructions, branch misses, ...
- **[Software] Events Emitted by the Operating System:** CPU clock, CPU migrations, page faults, context switches, ...
- **[Tracepoint] Events Emitted by Embedded Trigger Points:**
  - **Static Tracing:**
    ```javascript=
    function kernel_func() {
        ...
        // if a tracepoint handler is attached, execute it
        if (tracepoint_func)
            tracepoint_func();
        ...
    }
    ```
    - **Kernel Tracepoints:** `sched:sched_switch`, `syscalls:sys_enter`, `block:block_rq_insert`, `net:net_dev_start_xmit`, ...
    - **USDT (User-level Statically-Defined Tracing):** Allows users to create low-overhead probes that can be enabled at runtime with external tracing tools such as BPF on Linux.
  - **Dynamic Tracing:**
    ```javascript=
    // Concept
    function kernel_func() {
        ...
        // replace the original subroutine
        - normal_kernel_routine();
        // by injecting an intrusive custom handler
        + custom_tmp_handler();
        ...
    }

    function custom_tmp_handler() {
        // custom pre-processing handler
        ...
        // execute the original subroutine
        specific_operation();
        ...
        // custom post-processing handler
        ...
    }
    ```

:::warning
Dynamically rewriting the default control flow looks like a code-injection attack from the hardware's point of view, even though here it is a legitimate, user-delegated proxying handler. Hardware control-flow protections (e.g., Control Flow Enforcement Technology (CET)) may therefore need to be turned off for the hooked system-call paths.

More:
- [System Calls - The Linux Kernel Module Programming Guide](https://sysprog21.github.io/lkmpg/#system-calls)
:::

:::info
:arrow_right: For more on security topics: [Hardware/Software-Based Security - shibarashinu](https://hackmd.io/@shibarashinu/H1UUR_p-0).
:::
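A minimal round trip through the three event sources above, using standard perf subcommands (the target command is a placeholder):

```sh
# Hardware & software counters for one command
sudo perf stat -e cycles,instructions,cache-misses,context-switches ./my_benchmark

# Sample on-CPU stacks system-wide at 99 Hz for 10 s, then inspect them
sudo perf record -F 99 -a -g -- sleep 10
sudo perf report

# Count a static tracepoint instead of a hardware counter
sudo perf stat -e sched:sched_switch -a -- sleep 5
```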
Refs:
- [深入探索 perf CPU Profiling 实现原理 - mazhen.tech](https://mazhen.tech/p/%E6%B7%B1%E5%85%A5%E6%8E%A2%E7%B4%A2-perf-cpu-profiling-%E5%AE%9E%E7%8E%B0%E5%8E%9F%E7%90%86/)
- [perf Examples - Brendan Gregg](https://www.brendangregg.com/perf.html)
- [Using the Linux Kernel Tracepoints - Linux Kernel](https://docs.kernel.org/trace/tracepoints.html)
- [Linux 效能分析工具: Perf - NCKU CSIE Wiki](https://wiki.csie.ncku.edu.tw/embedded/perf-tutorial)

## Tracing/Profiling Methods

Basically, all of the fundamental tracing methods below are defined in and exposed through the performance monitoring subsystem *perf_events*.

![image](https://hackmd.io/_uploads/SyjgPaw7A.png =500x)

### kprobe

*Kernel Dynamic Tracing*

kprobe dynamically replaces the probed kernel instruction (typically a function entry) with a breakpoint and wraps the normal function with additional handlers: when the kernel hits the trap, it is forced to divert into our own routines before (and optionally after) the original code runs.

![image](https://hackmd.io/_uploads/ryPfDaDmR.png =500x)

Refs:
- [Kprobe 筆記 - ztex, Tony, Liu](https://ztex.medium.com/kprobe-%E7%AD%86%E8%A8%98-59d4bdb1e1fe)
- [kprobes Intro - Linux Kernel](https://docs.kernel.org/trace/kprobes.html)

### uprobe

*User-Space Dynamic Tracing*

uprobe allows tracing of user-space processes by inserting breakpoints into the application's code. It can trace user-level events such as application function calls and library calls.

### Tracepoints

*Kernel Static Tracing*

Tracepoints are statically defined *hooks* placed at specific locations within the kernel code; when enabled, each one emits an event (e.g., `sched:sched_switch`) that tracers such as perf, ftrace, and eBPF can consume.
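As a sketch of how these static hooks are consumed, the same `sched:sched_switch` tracepoint can be enabled directly through tracefs (root required; paths assume a mounted tracefs):

```sh
cd /sys/kernel/tracing                       # /sys/kernel/debug/tracing on older setups
echo 1 > events/sched/sched_switch/enable    # turn the tracepoint on
cat trace_pipe                               # stream events; Ctrl-C to stop
echo 0 > events/sched/sched_switch/enable    # turn it off again
```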