
Performance engineering

tags: fio profiler htop iostat perf strace


fio (flexible IO)

Fio spawns a number of threads or processes doing a particular type of I/O action as 
specified by the user
  • I/O type: Defines the I/O pattern issued to the file(s). We may only be reading sequentially from the file(s), or we may be writing randomly, or even mixing reads and writes, sequentially or randomly. Should we be doing buffered I/O, or direct/raw I/O?
  • Block size: In how large chunks are we issuing I/O? This may be a single value, or it may describe a range of block sizes.
  • I/O size: How much data are we going to be reading/writing?
  • I/O engine: How do we issue I/O? We could be memory mapping the file, using regular read/write, using splice, async I/O, or even SG (SCSI generic sg).
  • I/O depth: If the I/O engine is async, how large a queuing depth do we want to maintain?
  • Target file/device: How many files are we spreading the workload over?
  • Threads, processes and job synchronization: How many threads or processes should we spread this workload over?
  • example
; -- start job file --
[random-writers]
ioengine=libaio
iodepth=4
rw=randwrite
bs=32k
direct=0
size=64m
numjobs=4
; -- end job file --

The job file above means: use libaio (asynchronous) with 4 I/O units, do random writes with a 32k block size and buffered I/O, stop once the full size has been written, and fork() 4 jobs in total.

  • iodepth: the number of I/O units kept in flight. This is meaningless for synchronous I/O, because a synchronous engine can only submit one I/O request at a time and must wait for it to complete. An equivalent command-line form of the job above is sketched below.
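
A minimal sketch of the same job expressed on the fio command line (the job name random-writers is arbitrary; the flags mirror the job-file options above):

# roughly equivalent to the [random-writers] job file
fio --name=random-writers --ioengine=libaio --iodepth=4 \
    --rw=randwrite --bs=32k --direct=0 --size=64m --numjobs=4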

ioengine: libaio vs psync

             libaio   psync
CPU% (htop)  106      37.9
MEM% (htop)  10.8     10.8
IOPS         34.4     18.6
wMiB/s       134      72.7
  • Change iodepth and observe the gap between the two ioengines (iodepth = 1 vs. 2)
         libaio        psync
IOPS     34.4 → 45.0   18.6 → 18.6
wMiB/s   134 → 180     72.7 → 72.7
  • Use perf to observe the number of block:block_rq_complete events (see the sketch after this list)
ioengine   count
libaio     1424
psync      372
  • Changing the parameter ioengine from libaio to psync means switching from asynchronous to synchronous I/O.
  • In terms of IOPS, performance drops noticeably.
  • The iodepth experiment shows that adding I/O units does not help psync.
  • The block_rq_complete counts show that libaio completes more I/O requests.
  • This is because libaio returns right after submitting a batch of I/O requests, so it can benefit from the I/O scheduler's elevator algorithm, which is reflected in the performance numbers.
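
A minimal sketch of how such a count can be collected with perf, assuming the job file above is saved as random-writers.fio (an assumed name) and that kernel tracepoints are readable (usually requires root):

# count completed block-layer requests system-wide while the fio job runs
perf stat -e block:block_rq_complete -a -- fio random-writers.fio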

buffered IO vs non-buffered IO

         non-buffered   buffered
IOPS     34.4           173
wMiB/s   134            676
  • Use perf to observe the number of block:block_rq_complete events
mode           count
non-buffered   785
buffered       11,517
  • Changing direct from 1 to 0 means switching from non-buffered I/O to buffered I/O (the two runs are sketched after this list).
  • The IOPS numbers show a clear performance improvement.
  • CPU% shows that buffered I/O uses more CPU, but oddly MEM% stays the same.
  • The flame graph shows posix_fadvise taking a larger share, suggesting the page cache is being used more effectively.
  • perf shows that block_rq_complete fires far more often.
  • Buffered I/O gathers more I/O requests before handing them to the I/O scheduler, which explains the performance gain.
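
A minimal sketch of the two runs, differing only in the direct flag (the job name writers and the other parameters are assumptions, matching the earlier example):

# non-buffered (direct) I/O
fio --name=writers --ioengine=libaio --iodepth=4 --rw=randwrite --bs=32k --size=64m --numjobs=4 --direct=1
# buffered I/O: identical except for direct=0
fio --name=writers --ioengine=libaio --iodepth=4 --rw=randwrite --bs=32k --size=64m --numjobs=4 --direct=0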

block size: 1k vs 4k vs 4M

         1k     4k     4M
IOPS     37.3   37.2   0.116
wMiB/s   36.4   145    464
  • Use perf to observe the number of block:block_rq_complete events
block size   count
1k           2337
4k           818
4M           44
  • strace summary (see the sketch after this list)
            1k          4k          4M
nanosleep   7738/2967   7060/2965   291/2874
shmdt       5000/1      5000/1      52000/1

The larger the block size, the less time is spent sleeping and the more time is spent handling shared memory mapping (shmdt).
Each cell is (usec/call) / (calls):
e.g. 7738/2967 means the system call was invoked 2967 times,
at an average of 7738 usec per call.

  • As block size increases, IOPS drops while bytes written per second increases.
  • With a larger block size and an unchanged total I/O size, fewer I/O requests are needed, so block:block_rq_complete fires fewer times, which explains the lower IOPS.
  • The amount written per request increases instead, which can be seen in fio's statistics.
  • The larger the block size, the more time is spent handling shared memory mapping.
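
A minimal sketch of collecting the per-syscall summary above, assuming the job file is saved as random-writers.fio; -c prints the summary table and -f follows the worker processes fio forks:

# per-syscall call counts and average time for fio and its workers
strace -c -f fio random-writers.fio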

regular file vs sparse file vs fallocate file

  • To save space, a sparse file only records the file size in its metadata (header); no blocks are actually allocated (a sketch of creating the three file types follows this list)

  • A fallocated file asks the disk for space but does not zero it; the extents are only marked uninitialized, so allocation is fast and the work is deferred until the data is actually accessed

  • Performance-wise, regular > fallocate > sparse, because fallocate has to deal with the uninitialized marks and sparse has to deal with the metadata.

  • For sparse and fallocated files, performance also changes as the number of writes grows

  • Although the sparse file issues more I/O requests, its IOPS is lower, because many of those requests go to writing metadata rather than actual data

  • Changing rw from write to read:

    • The most obvious change is that the sparse file performs very well on reads.
    • This presumably comes from the sparse file being able to answer a lot from its metadata alone; the read total even reaches 47.9 GB.
    • The sparse file triggers block_bio_queue less often, meaning it puts fewer bio requests into the queue, yet ends up reading more data.
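
A minimal sketch of creating the three file types for such an experiment (file names and the 1G size are arbitrary assumptions):

dd if=/dev/zero of=regular.img bs=1M count=1024   # regular file: blocks actually written
truncate -s 1G sparse.img                         # sparse file: only the size is recorded, no blocks allocated
fallocate -l 1G falloc.img                        # fallocated file: extents reserved but marked uninitialized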

Performance profiler

htop

  • interactive process viewer

  • Shows in real time the load each process puts on the system, including CPU utilization, memory usage, and so on.

  • I find it a good first step for spotting which process is consuming the most resources.

  • manual: https://man7.org/linux/man-pages/man1/htop.1.html

  • Q: What do the colors in the CPU usage bars mean?

    • A:
      • (Blue) low priority processes (nice > 0)
      • (Green) normal (user) processes
      • (Red) kernel processes
      • (Yellow) IRQ time
      • (Magenta) Soft IRQ time
      • (Grey) IO Wait time
  • Q: What does the load average mean?

    • A: The three numbers are the 1-, 5-, and 15-minute averages (three periods).
      The load average represents the process load over that period.
      e.g. on a single core, 0.0-1.0 means nothing has to wait, and at 1.0 one more process would start having to wait; smaller is better.
      1.0 does not necessarily mean full load either; you also have to consider how many cores there are.
  • Q: What can be read from the memory bar?

    • A: Memory usage, excluding buffers and cached memory.
      (Green): Used memory pages
      (Blue): Buffer pages
      (Yellow): Cache pages
  • Q: What information do the per-process columns provide?
    PID: process ID number.
    USER: The process’s owner.
    PR: The process's priority; the lower the number, the higher the priority.
    NI: The nice value of the process, which affects its priority (see the sketch after this list).
    VIRT: How much virtual memory the process is using.
    RES: How much physical RAM the process is using (KB).
    SHR: How much shared memory the process is using.
    S: The current status of the process (zombie, sleeping, running, uninterruptible sleep, or traced).
    %CPU: The percentage of the processor time used by the process.
    %MEM: The percentage of physical RAM used by the process.
    TIME+: How much processor time the process has used.
    COMMAND: The name of the command that started the process.
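
A minimal sketch for observing some of these fields (the busy loop is just a throwaway workload): nice -n 10 starts it as a low-priority process, which shows up with NI=10 and as blue in the CPU bar, and htop -p filters the view to that PID:

nice -n 10 sh -c 'while :; do :; done' &   # low-priority CPU-bound busy loop
htop -p $!                                 # show only that process
cat /proc/loadavg                          # raw numbers behind the load average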

iostat

  • monitoring system input/output device loading by observing the time the devices are active in relation to their average transfer rates.

  • iostat -x 1: shows extended statistics, refreshing every second (see the sketch after this list)

  • Used to analyze per-device I/O performance

  • manual: https://linux.die.net/man/1/iostat

  • Interpretations:

    • @user: CPU utilization for the user
    • @nice: the CPU utilization for apps with nice priority
    • @system: the CPU being utilized by the system
    • @iowait: the time percentage during which CPU was idle but there was an outstanding i/o request
    • @steal: percentage of time CPU was waiting as the hypervisor was working on another CPU
    • @idle: the percentage of time the system was idle with no outstanding request
    • Device: which device (a disk, an md RAID array, etc.)
    • tps: transfers per second (a transfer is an I/O request to the device)
    • Blk_read/s: how many blocks are read from the device per second; Blk_wrtn/s analogously
    • Blk_read: total number of blocks read; Blk_wrtn analogously
    • kB_read/s: how many kilobytes are read from the device per second; kB_wrtn/s analogously
    • rrqm/s: how many read requests are merged into the queue per second; wrqm/s analogously
    • r/s: how many read requests are issued per second; w/s analogously
    • see: https://www.linuxtechi.com/monitor-linux-systems-performance-iostat-command/
  • TODO: explain @nice and @steal
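
A minimal sketch of watching a single device while a workload runs (sda is an assumed device name; 1 5 means one-second intervals, five reports):

iostat -x -d sda 1 5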

perf

perf covers three kinds of events: hardware events raised by the CPU's PMU, software events raised by the kernel, and tracepoint events triggered by hooks left in the Linux kernel source code.
(note) The block-related events can be found in include/trace/events/block.h; see the perf list sketch below.

perf record -g -a $1 $2 # Run a command and record its profile into perf.data
perf script > /tmp/perf.raw.events # Read perf.data (created by perf record) and display trace output
stackcollapse-perf.pl /tmp/perf.raw.events > /tmp/perf.events.folded
flamegraph.pl /tmp/perf.events.folded > perf.events.svg
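
To find the tracepoint names used in the experiments above (e.g. block:block_rq_complete), perf list accepts a glob:

perf list 'block:*'   # list block-layer tracepoints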

A flame graph visualizes call stacks based on the output of a profiler (perf_events, DTrace, ...).
The X-axis is not ordered by time: left to right is alphabetical order, and a wider box means that call appears more often in the sampled stacks.
The Y-axis is stack depth, with the root at the bottom and the leaves at the top, so if a sits below b, a is b's parent.
Looking at stackcollapse-perf.pl, you can see that it parses the raw events and folds each call stack into a single line; this is why time disappears and frequency takes its place (an example folded line is shown below).
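
As an illustration, a folded line produced by stackcollapse-perf.pl is a semicolon-separated stack followed by its sample count (the frames below are hypothetical):

fio;__libc_write;entry_SYSCALL_64;ksys_write;vfs_write 137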

strace

  • trace system calls and signals
  • strace intercepts and records the system calls which are called by a process and the signals which are received by a process.
  • see:
  • -c --summary-only
    • Can be used to decide which system call is worth tracing further (see the sketch after this list).
    • For example, fio spends a very large share of its time in nanosleep.
  • -i Print the instruction pointer at the time of the system call
  • -T Show the time spent in system calls.
  • -k Print the execution stack trace of the traced processes after each system call.
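
A minimal sketch of that workflow, assuming the job file name from earlier (random-writers.fio):

strace -c -f fio random-writers.fio                                      # step 1: find the dominant syscall
strace -f -T -e trace=nanosleep -o nanosleep.log fio random-writers.fio  # step 2: time each nanosleep call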

Summary:
htop: shows in real time the load each process puts on the system, including CPU utilization, memory load, etc.
iostat: shows in real time the I/O performance of each device, including current read/write throughput, I/O queue size, time spent on I/O requests, and so on (and what utilization means).
perf: analyzes CPU performance, including how often events occur, e.g. the rate of cache misses. Combined with a flame graph it visualizes the call stacks, including the share of samples taken by a given call.
strace: traces signals and system calls. For example, first use -c to see which system call takes a lot of time and is worth tracing, then use -T to measure the time spent in each system call and locate the bottleneck.

performance engineering

see:

fuse-xfs (https://github.com/wchaoyi/fuse-xfs)

Try reading more blocks at a time

  • Presumably the performance improves because more blocks are read at once.
  • With non-buffered I/O, read requests are not effectively batched before being sent to the layer below, so there is no effect.
  • Presumably the split is triggered somewhere between 32 and 64 blocks.
  • So for request sizes up to 32 blocks, request size = blocks * block size holds (the relevant queue limits can be checked as sketched below).
  • reference: Why is the size of my IO requests being limited, to about 512K?
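
A minimal sketch for checking the block-layer limits behind such request-size splits (sda is an assumed device name; 512 KB is a common default cap):

cat /sys/block/sda/queue/max_sectors_kb      # current per-request size cap (KB)
cat /sys/block/sda/queue/max_hw_sectors_kb   # hardware upper bound (KB)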