
Performance engineering

tags: fio profiler htop iostat perf strace


fio (flexible IO)

Fio spawns a number of threads or processes doing a particular type of I/O action as 
specified by the user
  • I/O type: Defines the I/O pattern issued to the file(s). We may only be reading sequentially from the file(s), or we may be writing randomly, or even mixing reads and writes, sequentially or randomly. Should we be doing buffered I/O, or direct/raw I/O?
  • Block size: In how large chunks are we issuing I/O? This may be a single value, or it may describe a range of block sizes.
  • I/O size: How much data are we going to be reading/writing?
  • I/O engine: How do we issue I/O? We could be memory mapping the file, using regular read/write, using splice, async I/O, or even SG (SCSI generic sg).
  • I/O depth: If the I/O engine is async, how large a queuing depth do we want to maintain?
  • Target file/device: How many files are we spreading the workload over?
  • Threads, processes and job synchronization: How many threads or processes should we spread this workload over?
  • example
; -- start job file --
[random-writers]
ioengine=libaio
iodepth=4
rw=randwrite
bs=32k
direct=0
size=64m
numjobs=4
; -- end job file --

The job file above means: use libaio (asynchronous) with 4 I/O units, do random writes with a 32k block size and buffered I/O, stop once the full size has been written, and fork() 4 jobs in total.

  • iodepth: the number of I/O units kept in flight. This is meaningless for synchronous I/O, because a synchronous engine can only submit one I/O request at a time and must wait for it to complete. An equivalent command-line form of the job above is sketched below.
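
A minimal sketch of the same job expressed on the fio command line (the job name random-writers is arbitrary; the flags mirror the job-file options above):

# roughly equivalent to the [random-writers] job file
fio --name=random-writers --ioengine=libaio --iodepth=4 \
    --rw=randwrite --bs=32k --direct=0 --size=64m --numjobs=4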

ioengine: libaio vs psync

             libaio   psync
CPU% (htop)  106      37.9
MEM% (htop)  10.8     10.8
IOPS         34.4     18.6
wMiB/s       134      72.7
  • Change iodepth and observe the gap between the two ioengines (iodepth = 1 vs. 2)
         libaio        psync
IOPS     34.4 → 45.0   18.6 → 18.6
wMiB/s   134 → 180     72.7 → 72.7
  • Use perf to observe the number of block:block_rq_complete events (see the sketch after this list)
ioengine   count
libaio     1424
psync      372
  • Changing the parameter ioengine from libaio to psync means switching from asynchronous to synchronous I/O.
  • In terms of IOPS, performance drops noticeably.
  • The iodepth experiment shows that adding I/O units does not help psync.
  • The block_rq_complete counts show that libaio completes more I/O requests.
  • This is because libaio returns right after submitting a batch of I/O requests, so it can benefit from the I/O scheduler's elevator algorithm, which is reflected in the performance numbers.
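
A minimal sketch of how such a count can be collected with perf, assuming the job file above is saved as random-writers.fio (an assumed name) and that kernel tracepoints are readable (usually requires root):

# count completed block-layer requests system-wide while the fio job runs
perf stat -e block:block_rq_complete -a -- fio random-writers.fio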

buffered IO vs non-buffered IO

         non-buffered   buffered
IOPS     34.4           173
wMiB/s   134            676
  • Use perf to observe the number of block:block_rq_complete events
mode           count
non-buffered   785
buffered       11,517
  • Changing direct from 1 to 0 means switching from non-buffered I/O to buffered I/O (the two runs are sketched after this list).
  • The IOPS numbers show a clear performance improvement.
  • CPU% shows that buffered I/O uses more CPU, but oddly MEM% stays the same.
  • The flame graph shows posix_fadvise taking a larger share, suggesting the page cache is being used more effectively.
  • perf shows that block_rq_complete fires far more often.
  • Buffered I/O gathers more I/O requests before handing them to the I/O scheduler, which explains the performance gain.
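
A minimal sketch of the two runs, differing only in the direct flag (the job name writers and the other parameters are assumptions, matching the earlier example):

# non-buffered (direct) I/O
fio --name=writers --ioengine=libaio --iodepth=4 --rw=randwrite --bs=32k --size=64m --numjobs=4 --direct=1
# buffered I/O: identical except for direct=0
fio --name=writers --ioengine=libaio --iodepth=4 --rw=randwrite --bs=32k --size=64m --numjobs=4 --direct=0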

block size: 1k vs 4k vs 4M

         1k     4k     4M
IOPS     37.3   37.2   0.116
wMiB/s   36.4   145    464
  • Use perf to observe the number of block:block_rq_complete events
block size   count
1k           2337
4k           818
4M           44
  • strace summary (see the sketch after this list)
            1k          4k          4M
nanosleep   7738/2967   7060/2965   291/2874
shmdt       5000/1      5000/1      52000/1

The larger the block size, the less time is spent sleeping and the more time is spent handling shared memory mapping (shmdt).
Each cell is (usec/call) / (calls):
e.g. 7738/2967 means the system call was invoked 2967 times,
at an average of 7738 usec per call.

  • As block size increases, IOPS drops while bytes written per second increases.
  • With a larger block size and an unchanged total I/O size, fewer I/O requests are needed, so block:block_rq_complete fires fewer times, which explains the lower IOPS.
  • The amount written per request increases instead, which can be seen in fio's statistics.
  • The larger the block size, the more time is spent handling shared memory mapping.
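
A minimal sketch of collecting the per-syscall summary above, assuming the job file is saved as random-writers.fio; -c prints the summary table and -f follows the worker processes fio forks:

# per-syscall call counts and average time for fio and its workers
strace -c -f fio random-writers.fio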

regular file vs sparse file vs fallocate file

  • To save space, a sparse file only records the file size in its metadata (header); no blocks are actually allocated (a sketch of creating the three file types follows this list)

  • A fallocated file asks the disk for space but does not zero it; the extents are only marked uninitialized, so allocation is fast and the work is deferred until the data is actually accessed

  • Performance-wise, regular > fallocate > sparse, because fallocate has to deal with the uninitialized marks and sparse has to deal with the metadata.

  • For sparse and fallocated files, performance also changes as the number of writes grows

  • Although the sparse file issues more I/O requests, its IOPS is lower, because many of those requests go to writing metadata rather than actual data

  • Changing rw from write to read:

    • The most obvious change is that the sparse file performs very well on reads.
    • This presumably comes from the sparse file being able to answer a lot from its metadata alone; the read total even reaches 47.9 GB.
    • The sparse file triggers block_bio_queue less often, meaning it puts fewer bio requests into the queue, yet ends up reading more data.
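
A minimal sketch of creating the three file types for such an experiment (file names and the 1G size are arbitrary assumptions):

dd if=/dev/zero of=regular.img bs=1M count=1024   # regular file: blocks actually written
truncate -s 1G sparse.img                         # sparse file: only the size is recorded, no blocks allocated
fallocate -l 1G falloc.img                        # fallocated file: extents reserved but marked uninitialized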

Performance profiler

htop

  • interactive process viewer

  • Shows in real time the load each process puts on the system, including CPU utilization, memory usage, and so on.

  • I find it a good first step for spotting which process is consuming the most resources.

  • manual: https://man7.org/linux/man-pages/man1/htop.1.html

  • Q: What do the colors in the CPU usage bars mean?

    • A:
      • (Blue) low priority processes (nice > 0)
      • (Green) normal (user) processes
      • (Red) kernel processes
      • (Yellow) IRQ time
      • (Magenta) Soft IRQ time
      • (Grey) IO Wait time
  • Q: What does the load average mean?

    • A: The three numbers are the 1-, 5-, and 15-minute averages (three periods).
      The load average represents the process load over that period.
      e.g. on a single core, 0.0-1.0 means nothing has to wait, and at 1.0 one more process would start having to wait; smaller is better.
      1.0 does not necessarily mean full load either; you also have to consider how many cores there are.
  • Q: What can be read from the memory bar?

    • A: Memory usage, excluding buffers and cached memory.
      (Green): Used memory pages
      (Blue): Buffer pages
      (Yellow): Cache pages
  • Q: What information do the per-process columns provide?
    PID: process ID number.
    USER: The process’s owner.
    PR: The process's priority; the lower the number, the higher the priority.
    NI: The nice value of the process, which affects its priority (see the sketch after this list).
    VIRT: How much virtual memory the process is using.
    RES: How much physical RAM the process is using (KB).
    SHR: How much shared memory the process is using.
    S: The current status of the process (zombie, sleeping, running, uninterruptible sleep, or traced).
    %CPU: The percentage of the processor time used by the process.
    %MEM: The percentage of physical RAM used by the process.
    TIME+: How much processor time the process has used.
    COMMAND: The name of the command that started the process.
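
A minimal sketch for observing some of these fields (the busy loop is just a throwaway workload): nice -n 10 starts it as a low-priority process, which shows up with NI=10 and as blue in the CPU bar, and htop -p filters the view to that PID:

nice -n 10 sh -c 'while :; do :; done' &   # low-priority CPU-bound busy loop
htop -p $!                                 # show only that process
cat /proc/loadavg                          # raw numbers behind the load average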

iostat

  • monitoring system input/output device loading by observing the time the devices are active in relation to their average transfer rates.

  • iostat -x 1: shows extended statistics, refreshing every second (see the sketch after this list)

  • Used to analyze per-device I/O performance

  • manual: https://linux.die.net/man/1/iostat

  • Interpretations:

    • @user: CPU utilization for the user
    • @nice: the CPU utilization for apps with nice priority
    • @system: the CPU being utilized by the system
    • @iowait: the time percentage during which CPU was idle but there was an outstanding i/o request
    • @steal: percentage of time CPU was waiting as the hypervisor was working on another CPU
    • @idle: the percentage of time the system was idle with no outstanding request
    • Device: which device (a disk, an md RAID array, etc.)
    • tps: transfers per second (a transfer is an I/O request to the device)
    • Blk_read/s: how many blocks are read from the device per second; Blk_wrtn/s analogously
    • Blk_read: total number of blocks read; Blk_wrtn analogously
    • kB_read/s: how many kilobytes are read from the device per second; kB_wrtn/s analogously
    • rrqm/s: how many read requests are merged into the queue per second; wrqm/s analogously
    • r/s: how many read requests are issued per second; w/s analogously
    • see: https://www.linuxtechi.com/monitor-linux-systems-performance-iostat-command/
  • TODO: explain @nice and @steal
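
A minimal sketch of watching a single device while a workload runs (sda is an assumed device name; 1 5 means one-second intervals, five reports):

iostat -x -d sda 1 5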

perf

perf covers three kinds of events: hardware events raised by the CPU's PMU, software events raised by the kernel, and tracepoint events triggered by hooks left in the Linux kernel source code.
(note) The block-related events can be found in include/trace/events/block.h; see the perf list sketch below.

perf record -g -a $1 $2 # Run a command and record its profile into perf.data
perf script > /tmp/perf.raw.events # Read perf.data (created by perf record) and display trace output
stackcollapse-perf.pl /tmp/perf.raw.events > /tmp/perf.events.folded
flamegraph.pl /tmp/perf.events.folded > perf.events.svg
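
To find the tracepoint names used in the experiments above (e.g. block:block_rq_complete), perf list accepts a glob:

perf list 'block:*'   # list block-layer tracepoints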

A flame graph visualizes call stacks based on the output of a profiler (perf_events, DTrace, ...).
The X-axis is not ordered by time: left to right is alphabetical order, and a wider box means that call appears more often in the sampled stacks.
The Y-axis is stack depth, with the root at the bottom and the leaves at the top, so if a sits below b, a is b's parent.
Looking at stackcollapse-perf.pl, you can see that it parses the raw events and folds each call stack into a single line; this is why time disappears and frequency takes its place (an example folded line is shown below).
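
As an illustration, a folded line produced by stackcollapse-perf.pl is a semicolon-separated stack followed by its sample count (the frames below are hypothetical):

fio;__libc_write;entry_SYSCALL_64;ksys_write;vfs_write 137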

strace

  • trace system calls and signals
  • strace intercepts and records the system calls which are called by a process and the signals which are received by a process.
  • see:
  • -c --summary-only
    • Can be used to decide which system call is worth tracing further (see the sketch after this list).
    • For example, fio spends a very large share of its time in nanosleep.
  • -i Print the instruction pointer at the time of the system call
  • -T Show the time spent in system calls.
  • -k Print the execution stack trace of the traced processes after each system call.
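
A minimal sketch of that workflow, assuming the job file name from earlier (random-writers.fio):

strace -c -f fio random-writers.fio                                      # step 1: find the dominant syscall
strace -f -T -e trace=nanosleep -o nanosleep.log fio random-writers.fio  # step 2: time each nanosleep call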

Summary:
htop: shows in real time the load each process puts on the system, including CPU utilization, memory load, etc.
iostat: shows in real time the I/O performance of each device, including current read/write throughput, I/O queue size, time spent on I/O requests, and so on (and what utilization means).
perf: analyzes CPU performance, including how often events occur, e.g. the rate of cache misses. Combined with a flame graph it visualizes the call stacks, including the share of samples taken by a given call.
strace: traces signals and system calls. For example, first use -c to see which system call takes a lot of time and is worth tracing, then use -T to measure the time spent in each system call and locate the bottleneck.

performance engineering

see:

fuse-xfs (https://github.com/wchaoyi/fuse-xfs)

Try reading more blocks at a time

  • Presumably the performance improves because more blocks are read at once.
  • With non-buffered I/O, read requests are not effectively batched before being sent to the layer below, so there is no effect.
  • Presumably the split is triggered somewhere between 32 and 64 blocks.
  • So for request sizes up to 32 blocks, request size = blocks * block size holds (the relevant queue limits can be checked as sketched below).
  • reference: Why is the size of my IO requests being limited, to about 512K?
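
A minimal sketch for checking the block-layer limits behind such request-size splits (sda is an assumed device name; 512 KB is a common default cap):

cat /sys/block/sda/queue/max_sectors_kb      # current per-request size cap (KB)
cat /sys/block/sda/queue/max_hw_sectors_kb   # hardware upper bound (KB)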