# BPF Tutorial Report Instrumenting the OS for Understanding and Improving Performance - Part 1 Report Yidong FANG ## Setup I am using Ubuntu 20.04 with 5.4.0-52-generic kernel installed on my personal workstation. Thus, I simply follow the instruction at the [github](https://github.com/iovisor/bcc/blob/master/INSTALL.md#ubuntu---binary) to install `bcc` tool using `apt`. Note that the probloem of outdate metioned there seems to have been resolved as the package was updated. ``` sudo apt-get install bpfcc-tools linux-headers-$(uname -r) ``` ## Beginner bcc tools ### ext4slower trace the ext4 file system operations that slower than 10ms (default) only for PID 3202 (Chrome parent process) ``` ➜ ~ sudo ext4slower-bpfcc -p 3202 [sudo] password for edydfang: Tracing ext4 operations slower than 10 ms TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME 23:01:48 ThreadPoolFore 3202 S 0 0 11.06 Top Sites-journal 23:01:57 ThreadPoolFore 3202 S 0 0 11.61 .com.google.Chrome.cxvR6h 23:01:58 ThreadPoolFore 3202 S 0 0 13.18 .com.google.Chrome.NMilMl 23:02:07 ThreadPoolFore 3202 S 0 0 11.78 History-journal 23:02:07 ThreadPoolFore 3202 S 0 0 17.19 .com.google.Chrome.3cYsMl 23:02:08 ThreadPoolFore 3202 S 0 0 13.33 .com.google.Chrome.2LrkTj 23:02:14 ThreadPoolFore 3202 S 0 0 14.41 Network Action Predictor-journal 23:02:14 ThreadPoolFore 3202 S 0 0 10.15 Network Action Predictor-journal 23:02:18 ThreadPoolFore 3202 S 0 0 10.08 .com.google.Chrome.XY9sEi 23:02:23 ThreadPoolFore 3202 S 0 0 45.49 001949.ldb 23:02:59 ThreadPoolFore 3202 S 0 0 19.59 000059.ldb 23:02:59 ThreadPoolFore 3202 S 0 0 10.21 Extension State ``` ### biolatency print a histogram for I/O latency per disk. ``` ➜ ~ sudo biolatency-bpfcc -D Tracing block device I/O... Hit Ctrl-C to end. ^C disk = b'nvme0n1' usecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 19 |******** | 32 -> 63 : 93 |****************************************| 64 -> 127 : 82 |*********************************** | 128 -> 255 : 24 |********** | 256 -> 511 : 1 | | 512 -> 1023 : 4 |* | 1024 -> 2047 : 41 |***************** | 2048 -> 4095 : 50 |********************* | 4096 -> 8191 : 17 |******* | 8192 -> 16383 : 1 | | disk = b'sda' usecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 15 |******* | 256 -> 511 : 14 |****** | 512 -> 1023 : 52 |************************* | 1024 -> 2047 : 82 |****************************************| 2048 -> 4095 : 24 |*********** | 4096 -> 8191 : 73 |*********************************** | 8192 -> 16383 : 45 |********************* | 16384 -> 32767 : 5 |** | 32768 -> 65535 : 8 |*** | 65536 -> 131071 : 6 |** | 131072 -> 262143 : 4 |* | disk = b'' usecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 0 | | 512 -> 1023 : 0 | | 1024 -> 2047 : 13 |****************************************| 2048 -> 4095 : 2 |****** | 4096 -> 8191 : 3 |********* | ``` ### biosnoop ``` ➜ ~ sudo biosnoop-bpfcc -Q TIME(s) COMM PID DISK T SECTOR BYTES QUE(ms) LAT(ms) 0.000000 ? 0 R 0 8 0.00 1.27 0.948849 ? 0 R 0 512 0.00 0.45 0.964692 ? 0 R 0 512 0.00 15.77 1.022533 ? 0 R 0 512 0.00 57.76 2.016416 ? 0 R 0 8 0.00 1.69 3.007700 kworker/u16:0 44380 nvme0n1 W 204800128 4096 0.08 0.96 3.007722 kworker/u16:0 44380 nvme0n1 W 347681152 4096 0.07 0.97 3.007731 kworker/u16:0 44380 nvme0n1 W 456458352 4096 0.07 0.98 3.007764 kworker/u16:0 44380 nvme0n1 W 456458376 4096 0.07 1.01 3.007772 kworker/u16:0 44380 nvme0n1 W 456458696 4096 0.06 1.02 3.007778 kworker/u16:0 44380 nvme0n1 W 456465416 4096 0.06 1.02 3.007783 kworker/u16:0 44380 nvme0n1 W 456534296 4096 0.06 1.03 3.007788 kworker/u16:0 44380 nvme0n1 W 225879872 4096 0.04 1.03 3.008205 kworker/u16:0 44380 nvme0n1 W 225885360 4096 0.12 0.05 3.008224 kworker/u16:0 44380 nvme0n1 W 225885440 4096 0.10 0.06 3.008231 kworker/u16:0 44380 nvme0n1 W 225885504 4096 0.09 0.07 3.008237 kworker/u16:0 44380 nvme0n1 W 225885592 4096 0.08 0.07 3.008249 kworker/u16:0 44380 nvme0n1 W 225886120 4096 0.08 0.08 ``` ### profile stack count of Xorg process in folded format ``` ➜ ~ sudo profile-bpfcc -p 1848 -f ^CInputThread;[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];epoll_wait;entry_SYSCALL_64_after_hwframe;do_syscall_64;__x64_sys_epoll_wait;do_epoll_wait;do_epoll_wait 1 Xorg;__xstat;entry_SYSCALL_64_after_hwframe;do_syscall_64;__x64_sys_newstat;__do_sys_newstat;vfs_statx;user_path_at_empty;filename_lookup;path_lookupat.isra.0;link_path_walk.part.0;walk_component;lookup_fast;__d_lookup_rcu;__d_lookup_rcu 1 Xorg;[unknown];__xstat;entry_SYSCALL_64_after_hwframe;do_syscall_64;__x64_sys_newstat;__do_sys_newstat;vfs_statx;user_path_at_empty;getname_flags;memset_erms;memset_erms 1 Xorg;writev;entry_SYSCALL_64_after_hwframe;do_syscall_64;__x64_sys_writev;do_writev;vfs_writev;do_iter_write;do_iter_readv_writev;sock_write_iter;sock_sendmsg;unix_stream_sendmsg;sock_alloc_send_pskb;skb_set_owner_w;skb_set_owner_w 1 Xorg;[unknown];[unknown] 1 Xorg;__xstat;entry_SYSCALL_64_after_hwframe;do_syscall_64;do_syscall_64 1 Xorg;writev;entry_SYSCALL_64_after_hwframe;do_syscall_64;__x64_sys_writev;do_writev;vfs_writev;do_iter_write;do_iter_readv_writev;sock_write_iter;sock_sendmsg;unix_stream_sendmsg;sock_def_readable;__wake_up_sync_key;__wake_up_common_lock;__lock_text_start;__lock_text_start 1 Xorg;[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];epoll_wait 1 Xorg;[unknown];[unknown];[unknown] 1 Xorg;[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];setitimer;entry_SYSCALL_64_after_hwframe;do_syscall_64;__x64_sys_setitimer;do_setitimer;do_setitimer 1 Xorg;[unknown];[unknown] 1 Xorg;[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown] 1 Xorg;__xstat;entry_SYSCALL_64_after_hwframe;do_syscall_64;__x64_sys_newstat;__do_sys_newstat;cp_new_stat;from_kuid_munged;map_id_up;map_id_up 1 Xorg;[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];epoll_wait;entry_SYSCALL_64_after_hwframe;do_syscall_64;__x64_sys_epoll_wait;do_epoll_wait;ep_poll;ep_scan_ready_list.constprop.0;mutex_lock;rcu_all_qs;rcu_all_qs 1 Xorg;dixLookupResourceByType 1 InputThread;[unknown];[unknown];__libc_read;entry_SYSCALL_64_after_hwframe;__x64_sys_read;__x64_sys_read 1 Xorg;writev;entry_SYSCALL_64_after_hwframe;do_syscall_64;__x64_sys_writev;do_writev;__fdget_pos;__fget_light;__fget;__fget 1 Xorg;__libc_recvmsg 1 Xorg;__libc_recvmsg;entry_SYSCALL_64_after_hwframe;do_syscall_64;__x64_sys_recvmsg;__sys_recvmsg;___sys_recvmsg;____sys_recvmsg;sock_recvmsg;unix_stream_recvmsg;unix_stream_read_generic;unix_stream_read_generic 1 InputThread;ioctl;_nv000531kms;_nv000531kms 1 Xorg;[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];[unknown];epoll_wait;entry_SYSCALL_64_after_hwframe;do_syscall_64;__x64_sys_epoll_wait;do_epoll_wait;ep_poll;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule;__sched_text_start;finish_task_switch;finish_task_switch 2 Xorg;__sched_yield 2 ``` ## Beginner generic bcc tools ### argdist ``` ~ sudo argdist-bpfcc -c -H 'p:c:write(int fd, void *buf, size_t len):size_t:len' [19:49:24] len : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 456 |****************************************| ... | [19:49:34] len : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 520 |****************************************| 16 -> 31 : 1 | | 32 -> 63 : 11 | | 64 -> 127 : 67 |***** | ``` ### funccount ``` ➜ ~ sudo funccount-bpfcc -i 1 'vfs_*' Tracing 66 functions for "b'vfs_*'"... Hit Ctrl-C to end. FUNC COUNT b'vfs_statx' 1 b'vfs_write' 15 b'vfs_writev' 75 b'vfs_read' 83 b'vfs_open' 127 FUNC COUNT b'vfs_write' 28 b'vfs_read' 45 b'vfs_writev' 106 FUNC COUNT b'vfs_open' 3 b'vfs_write' 169 b'vfs_read' 233 b'vfs_writev' 375 FUNC COUNT b'vfs_write' 326 b'vfs_read' 445 b'vfs_writev' 616 ``` ## Intermediate bpftrace one-liners ### Syscall Counts By Process Count the sum of system calls for different processes called during a period. This it done by inject the eBPF program for all system call entry point. ``` ➜ ~ sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' Attaching 1 probe... ^C @[wpa_supplicant]: 1 @[gvfs-afc-volume]: 2 @[ThreadPoolForeg]: 2 @[gsd-media-keys]: 4 @[ibus-x11]: 4 @[ibus-extension-]: 4 @[update-notifier]: 4 @[evolution-alarm]: 4 @[gsd-color]: 4 @[gsd-keyboard]: 4 @[gsd-wacom]: 4 @[gsd-xsettings]: 4 @[systemd-journal]: 5 @[gsd-power]: 6 @[sudo]: 7 @[gmain]: 13 @[InputThread]: 15 @[at-spi2-registr]: 16 @[bpftrace]: 17 @[dbus-daemon]: 27 @[containerd]: 30 @[fcitx]: 32 @[acpid]: 45 @[gdbus]: 54 @[Chrome_IOThread]: 57 @[code]: 67 @[Chrome_ChildIOT]: 76 @[gnome-terminal-]: 99 @[zerotier-one]: 153 @[Xorg]: 284 @[chrome]: 445 @[gnome-shell]: 483 ``` ### Distribution of read() Bytes Histogram of Return Value of `Read` for Chrome Main Process. The eBPF program inject at the end of `read` system calls and gather the return values. ``` ➜ ~ sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /pid == 6449/ { @bytes = hist(args->ret); }' Attaching 1 probe... ^C @bytes: [0] 72 |@@@@@@@@@@@@@ | [1] 284 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [2, 4) 0 | | [4, 8) 0 | | [8, 16) 6 |@ | [16, 32) 77 |@@@@@@@@@@@@@@ | [32, 64) 24 |@@@@ | [64, 128) 0 | | [128, 256) 0 | | [256, 512) 24 |@@@@ | [512, 1K) 0 | | [1K, 2K) 48 |@@@@@@@@ | ``` ### Kernel Struct Tracing This uses kernel dynamic tracing of the vfs_open() function and print one variable in its parameter struct to get the dir name in the path. It shows the sequence of dirname being accesssed by open system calls. ``` ➜ ~ cat path.bt #include <linux/path.h> #include <linux/dcache.h> kprobe:vfs_open { printf("open path: %s\n", str(((struct path *)arg0)->dentry->d_name.name)); } ➜ ~ sudo bpftrace path.bt Attaching 1 probe... open path: dev_mcast open path: energy_uj open path: energy_uj open path: temp open path: .com.google.Chrome.qB0svQ open path: .com.google.Chrome.8Jao0O .... open path: .com.google.Chrome.noAQpO open path: .com.google.Chrome.orUFMR open path: .com.google.Chrome.uIjcWP open path: tab-new-symbolic.svg open path: open-menu-symbolic.svg open path: edit-find-symbolic.svg open path: window-minimize-symbolic.svg open path: window-maximize-symbolic.svg open path: .com.google.Chrome.wK8haO open path: .com.google.Chrome.6AXh4N open path: .com.google.Chrome.zQT4XR open path: .com.google.Chrome.SL7MOR open path: .com.google.Chrome.CCQ67R open path: tab-new-symbolic.svg open path: open-menu-symbolic.svg open path: edit-find-symbolic.svg open path: tab-new-symbolic.svg open path: open-menu-symbolic.svg open path: edit-find-symbolic.svg ``` ## Intermediate bpftrace tools I wrote the following dummy muti-thread program to saturate the CPU of the machine by runing CPU intensive loop with the thread number more than the number of processors. ```cpp= #include <iostream> #include <thread> #include <vector> using namespace std; // A dummy function void foo() { int num = 1; for (int i = 0; i < 999999999; i++) { num = (num*(i+1)) % 100000; } cout << num << endl; return; } int main() { vector<thread> pool; for(int i = 0; i < 100; i++ ) { pool.push_back(thread(foo)); } for(std::thread &every_thread : pool) { every_thread.join();} return 0; } ``` Then I used the following tools to compare the results between the Idle time and the program running time. ### runqlen ``` # Idle ➜ ~ sudo bpftrace runqlen.bt Attaching 2 probes... Sampling run queue length at 99 Hertz... Hit Ctrl-C to end. ^C @runqlen: [0, 1) 8079 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [1, 2) 1 | | # Busy ➜ ~ sudo bpftrace runqlen.bt Attaching 2 probes... Sampling run queue length at 99 Hertz... Hit Ctrl-C to end. ^C @runqlen: [10, 11) 18 | | [11, 12) 3789 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [12, 13) 1560 |@@@@@@@@@@@@@@@@@@@@@ | [13, 14) 769 |@@@@@@@@@@ | ``` We can see that the lengh of the running queue increased about 10 jobs per processor, which matches our workload. ### runqlat ``` # Idle ➜ ~ sudo bpftrace runqlat.bt Attaching 5 probes... Tracing CPU scheduler... Hit Ctrl-C to end. ^C @usecs: [0] 34 | | [1] 517 |@@ | [2, 4) 1532 |@@@@@@@@ | [4, 8) 2993 |@@@@@@@@@@@@@@@@@ | [8, 16) 5741 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [16, 32) 6334 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32, 64) 9103 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64, 128) 2881 |@@@@@@@@@@@@@@@@ | [128, 256) 700 |@@@ | [256, 512) 219 |@ | [512, 1K) 61 | | [1K, 2K) 19 | | [2K, 4K) 11 | | # Busy ➜ ~ sudo bpftrace runqlat.bt Attaching 5 probes... Tracing CPU scheduler... Hit Ctrl-C to end. ^C @usecs: [0] 767 |@ | [1] 3855 |@@@@@@@@@ | [2, 4) 1436 |@@@ | [4, 8) 1591 |@@@@ | [8, 16) 1140 |@@ | [16, 32) 483 |@ | [32, 64) 712 |@ | [64, 128) 425 |@ | [128, 256) 493 |@ | [256, 512) 179 | | [512, 1K) 50 | | [1K, 2K) 62 | | [2K, 4K) 137 | | [4K, 8K) 267 | | [8K, 16K) 233 | | [16K, 32K) 710 |@ | [32K, 64K) 20286 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64K, 128K) 905 |@@ | ``` We can see that the queue scheduling latency also increased significantly when there are a lot of threads runnning. This is because the kernel needs to do do a lot of round robin fashin schedling between different threads and let others to wait. And there are also some context switch overhead. We may be able to calculate the context switching overhead according to those data.