# 2022q1 Homework3 (fibdrv) contributed by <`arthurchang09`> ## 排除干擾效能分析的因素抑制 * 限定 CPU 給特定的程式使用 進入 /etc/default/grub ```bash $ sudo vim /etc/default/grub ``` 看到以下這行 ``` GRUB_CMDLINE_LINUX_DEFAULT="quiet splash" ``` 預計要將 `cpu 0` 獨立出來,將上方做更改 ``` GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=0" ``` 接著輸入 ``` $ sudo update-grub ``` 重新開機即可,系統管理員顯示 `CPU 0` 佔用為 $0 \%$ 若須程式放在 `cpu 0` 執行,使用 `taskset` ``` taskset 0x01 ./client ``` * address space layout randomization (ASLR) ``` $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` * 設定 scaling_governor 為 performance。準備以下 shell script,檔名為 `performance.sh`: ``` for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor do echo performance > ${i} done ``` 之後再用 `$ sudo sh performance.sh` 執行 * 針對 Intel 處理器,關閉 turbo mode ``` $ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" ``` ## 使用 Fast Doubling 參考 [Fast doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling)的作法以及以下[虛擬碼](https://hackmd.io/@sysprog/linux2022-fibdrv#-費氏數列) ``` Fast_Fib(n) a = 0; b = 1; // m = 0 for i = (number of binary digit in n) to 1 t1 = a*(2*b - a); t2 = b^2 + a^2; a = t1; b = t2; // m *= 2 if (current binary digit == 1) t1 = a + b; // m++ a = b; b = t1; return a; ``` 為了得到 `number of binary digit in n` ,採用一個 `for` 迴圈計算 `n` 的最高 bit 為 1 的位置,並使用 `mask` 記下,隨著迴圈進行逐步右移,如此一來可以判斷某 bit 是否為 1。 ```c uint64_t my_fib(unsigned int n){ unsigned int h = 0; for (unsigned int i = n; i; i>>=1 ){ h++; } uint64_t a = 0; uint64_t b = 1; for (unsigned int mask = 1 << (h - 1); mask; mask >>= 1){ uint64_t t1 = a*(2*b - a); uint64_t t2 = a*a + b*b; a = t1; b = t2; if (mask & n) { t1 = a + b; a = b; b = t1; } } return a; } ``` 在 [考慮到硬體加速 F(n) 的手法](https://hackmd.io/@sysprog/linux2022-fibdrv#考慮到硬體加速-Fn-的手法) 中有提到 `clz` 這一類硬體加速指令。而在 GCC 中有提供 [__builtin_clz](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) ,考慮到 `__builtin_clz(0)` 為 undefined behavior ,增加一個條件判斷 `n` 是否為 0 ,由於此情況非常少見,參考 lab0 中 [list_sort](https://github.com/torvalds/linux/blob/master/lib/list_sort.c#L89) 的 `unlikely` 提供編譯器優化。 ```c #define unlikely(x) __builtin_expect(!!(x), 0) uint64_t my_fib(unsigned int n) { if (unlikely(!n)) { return 0; } uint64_t a = 0; uint64_t b = 1; for (unsigned int mask = 1 << (__builtin_clz(n)); mask; mask >>= 1) { uint64_t t1 = a * (2 * b - a); uint64_t t2 = a * a + b * b; a = t1; b = t2; if (mask & n) { t1 = a + b; a = b; b = t1; } } return a; } ``` 用以下程式碼測量執行時間 ```c static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { //return (ssize_t) fib_sequence(*offset); ktime_t kt = ktime_get(); ssize_t add = fib_sequence(*offset); kt = ktime_sub(ktime_get(), kt); ktime_t kt2 = ktime_get(); unsigned long long ans = fib_fast_doubling(*offset); kt2 = ktime_sub(ktime_get(), kt2); ktime_t kt3 = ktime_get(); unsigned long long ans1 = fib_fast_doubling_clz(*offset); kt3 = ktime_sub(ktime_get(), kt3); printk(KERN_INFO "%lld %lld %lld %lld", *offset, ktime_to_ns(kt), ktime_to_ns(kt2), ktime_to_ns(kt3)); return add; } ``` 比較兩這的執行時間得下圖 ![](https://i.imgur.com/jsicQ6m.png) 可看出 fast doubling 的兩個版本均比 iteration 還要快,且計算時間相對平穩。然而,有 `clz` 的版本似乎較沒有 `clz` 者慢。根據 [KYG-yaya573142](https://hackmd.io/@KYWeng/rkGdultSU#2020q1-Homework2-fibdrv),可能是被優化掉,因此查看了組合語言 ``` ktime_t kt3 = ktime_get(); 1ad: 49 89 c7 mov %rax,%r15 if (unlikely(!n)) { 1b0: 48 85 c9 test %rcx,%rcx 1b3: 74 15 je 1ca <fib_read+0x10a> for (unsigned int mask = 1 << (__builtin_clz(n)); mask; mask >>= 1) { 1b5: 0f bd c9 bsr %ecx,%ecx 1b8: b8 01 00 00 00 mov $0x1,%eax 1bd: 83 f1 1f xor $0x1f,%ecx 1c0: d3 e0 shl %cl,%eax 1c2: 85 c0 test %eax,%eax 1c4: 74 04 je 1ca <fib_read+0x10a> 1c6: d1 e8 shr %eax 1c8: 75 fc jne 1c6 <fib_read+0x106> ``` 並沒有被優化掉, 尚欠效能比較.... ## 計算 F93 (包含) 之後的 Fibonacci 數 採用以下之 `struct` 紀錄大數,目前先採用 $32\times 8 = 256$ bits ```c #define LENGTH 8 typedef struct bn { uint32_t num[LENGTH]; } bn; ``` 接著加法、減法、左移需要作出相應的調整。 ```c void bn_add(bn a, bn b, bn *sum) { uint32_t N[LENGTH]; uint64_t c = 0; for (int i = 0; i < LENGTH; ++i) { c += (uint64_t) a.num[i] + b.num[i]; N[i] = c; c >>=32; } for (int i = LENGTH - 1; i >= 0; --i) { sum->num[i] = N[i]; } } ``` 加法先從較低位元開始相加,並將結果存入 `uint64_t` 以保留進位,接著將除了進位以外的總和存起來,並將 `c` 右移 32 位,即會成為下一組 32 bits 的進位。 以 8 bits 為例,如下面的直式,做完加法後產生 carry ,位在第 8 bit 。 $\begin{array}{r|r|rrrrrrrr} a[i] & & 1 & 0 & 0 & 1 & 0 & 1 & 1 & 0\\ b[i] & & 1 & 1 & 0 & 0 & 1 & 1 & 0 & 0\\ \hline c & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0\\ \end{array}$ 存入對應的 `N` 陣列後,將 `c` 右移 8 bit ,即是下一組較高位 8 bits 要再 +1 的位置。 $\begin{array}{r|r|rrrrrrrr} c && 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\\ \hline a[i+1] && 1 & 0 & 0 & 1 & 0 & 1 & 1 & 0\\ b[i+1] && 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0\\ \hline c &0& 0 & 1 & 1 & 0 & 0 & 0 & 1 & 1\\ \end{array}$ ```c void bn_dec(bn a, bn b, bn *diff) // a - b { uint32_t N[LENGTH]; uint32_t br = 0; for (int i = 0; i < LENGTH; ++i) { if (a.num[i] < (b.num[i] + br)) { N[i] = UINT32_MAX - b.num[i] + a.num[i] - br + 1; br = 1; } else { N[i] = a.num[i] - b.num[i] - br; br = 0; } } for (int i = LENGTH - 1; i >= 0; --i) { diff->num[i] = N[i]; } } ``` 減法要考慮 borrow ,當 `a` 比 `b + br` 小的時候就需要借位。為了避免 overflow , `a` 和 `b` 不能先相減,這裡使用 `UINT32_MAX` 先減掉 `b` ,再依序加上 `a` 減去 `br` 。由於 `UINT_MAX` 為 $2^{32}-1$ ,最後還要加上 1 。 ```c void bn_lshift(bn a, int shift, bn *res) { int shift_32b_amount = shift >> 5; memcpy(res, &a, sizeof(bn)); uint32_t tmp[shift_32b_amount + 1]; uint32_t tmp2[shift_32b_amount + 1]; uint32_t i = 0; int shift_bit = shift & 0x0000001f; memset(tmp, 0, sizeof(int) * shift_32b_amount); if (shift_32b_amount){ for (int i = 0; i < shift_32b_amount; ++i) { tmp[i] = res->num[i]; res->num[i] = 0; } for (i = shift_32b_amount; i <= LENGTH - shift_32b_amount; i += shift_32b_amount) { memcpy(tmp2, &res->num[i], sizeof(uint32_t)* shift_32b_amount); memcpy(&res->num[i], tmp, sizeof(uint32_t)* shift_32b_amount); memcpy(tmp, tmp2, sizeof(uint32_t)*shift_32b_amount); } if (shift_32b_amount >= (LENGTH >> 1) + 1) { memcpy(&res->num[shift_32b_amount], tmp, sizeof(uint32_t)*(LENGTH - shift_32b_amount)); } } if (shift_bit) { tmp[0] = (res->num[0] >> (32 - shift_bit)); res->num[0] <<= shift_bit; for (i = 0; i < LENGTH - 1; ++i){ tmp2[0] = (res->num[i + 1] >> (32 - shift_bit)) ; res->num[i + 1] = res->num[i + 1] << shift_bit | tmp[0]; tmp[0] = tmp2[0]; } } } ``` 上方的程式碼是左移的實作。先計算總共位移幾組 32 bits 和剩下不到 32 bits 之位移量。接著先處理 32 bits 位移,用 `memcpy` 複製數值到較高位元。剩下不到 32 bits 位移先取最高位的 bit 再和 左移後較高位的 32 bits 做 `|` 。 ```c= void bn_mul_1(bn a, bn b, bn *res) { //printf("%d\n", clz(0)); uint64_t tmp[LENGTH]; memset(tmp, 0,sizeof(uint64_t) * LENGTH); for (int i = 0; i < LENGTH; ++i) { for (int j = 0; j < LENGTH; ++j) { if ((i + j) < LENGTH ) { tmp[i + j] += (uint64_t) a.num[i] * b.num[j]; } } } for (int i = 1; i < LENGTH; ++i) { tmp[i] += tmp[i - 1] >> 32; tmp[i - 1] &= 0xffffffff; } for (int i = LENGTH - 1; i >= 0; --i) { res->num[i] = (uint32_t) tmp[i]; } } ``` 上方的程式碼為初始版本之乘法實作,為直式乘法的實作。然而,當計算到第 190 項時會出現錯誤,原因是第 9 行的加法式可能出現 overflow。 參考 [KYG-yaya573142](https://github.com/KYG-yaya573142/fibdrv/blob/master/bn_kernel.c) 作法將乘法後的結果依序加到較高位的 bit 。 ```c void bn_mul_2(bn a, bn b, bn *res) { uint64_t tmp[LENGTH]; uint32_t t[LENGTH]; uint64_t c1 = 0,c2 = 0; memset(tmp, 0,sizeof(uint64_t) * LENGTH); memset(t, 0,sizeof(uint32_t) * LENGTH); for (int i = 0; i < LENGTH; ++i) { for (int j = 0; j < LENGTH; ++j) { if ((i + j) < LENGTH ) { c1 = (uint64_t) a.num[i] * b.num[j]; c2 = 0; for (int k = i + j; k < LENGTH; ++k) { c2 += (uint64_t) t[k] + (c1 & 0xffffffff); t[k] = c2 & 0xffffffff; c2 >>= 32; c1 >>= 32; if (!c1 && !c2) { break; } } } } } for (int i = LENGTH - 1; i >= 0; --i) { res->num[i] = t[i]; } } ``` 為了將這些大數印出,參考 [OscarShiang](https://hackmd.io/@oscarshiang/linux_fibdrv#計算第-92-項之後的-Fibonacci-數) 的方式將大數轉成十進位印出, ```c char * bn_to_string(bn a){ char s[8 * sizeof(uint32_t) * LENGTH / 3 + 2]; uint32_t n[LENGTH]; int i; memset(s, '0', sizeof(s) - 1); memcpy(n, a.num, sizeof(uint32_t) * LENGTH); s[sizeof(s) - 1] = '\0'; //printf("%ld\n", sizeof(a)); for (i = 0 ; i < 8 * sizeof(uint32_t) * LENGTH; ++i) { int j, carry; carry = (n[LENGTH - 1] >= 0x80000000); for (int j = LENGTH - 1; j >= 0; --j) { n[j] = ((n[j] << 1) & 0xffffffff) + ((j - 1) >= 0 ? (n[j - 1] >= 0x80000000) : 0); } for (int j = sizeof(s) - 2; j >= 0; j--) { s[j] += s[j] - '0' + carry; carry = (s[j] > '9'); if (carry) s[j] -= 10; } } i = 0; while (i < sizeof(s) - 2 && s[i] == '0') i++; char *dec = (char *)malloc(sizeof(s) - i); memcpy(dec, &s[i], sizeof(s) - i); return dec; } ``` 以下為 iteration 的作法 ```c void bn_norm_fib(unsigned int n, bn *res) { if (unlikely(!n)) { memset(res, 0, sizeof(bn)); bn_to_string(*res); return; } bn a, b; memset(&a, 0, sizeof(bn)); memset(&b, 0, sizeof(bn)); memset(res, 0, sizeof(bn)); a.num[0] = 0; b.num[0] = 1; res->num[0] = 1; for (int i = 1; i < n; ++i){ bn_add(a, b, res); memcpy(&a, &b, sizeof(bn)); memcpy(&b, res, sizeof(bn)); } } ``` ```c void bn_fib(unsigned int n, bn *res) { if (unlikely(!n)) { memset(res, 0, sizeof(bn)); bn_to_string(*res); return ; } unsigned int h = 0; bn a,b; memset(&a, 0, sizeof(bn)); memset(&b, 0, sizeof(bn)); a.num[0] = 0; b.num[0] = 1; for (unsigned int mask = 1 << (__builtin_clz(n)); mask; mask >>= 1) { bn t1; bn t2; memset(&t1, 0, sizeof(bn)); memset(&t2, 0, sizeof(bn)); bn_lshift(b, 1, &t1); bn_dec(t1, a, &t1); bn_mul_2(a, t1, &t1); bn_mul_2(a, a, &t2); bn_mul_2(b, b, &b); bn_add(t2, b, &t2); memcpy(&a, &t1, sizeof(bn)); memcpy(&b, &t2, sizeof(bn)); if (mask & n) { bn_add(a,b, &t1); memcpy(&a, &b, sizeof(bn)); memcpy(&b, &t1, sizeof(bn)); } } memcpy(res, &a, sizeof(bn)); } ``` 接著修改 `fibdrv.c` ,直接使用 `copy_to_user` 將字串傳到 `client.c`。 ```c static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { bn res; bn_fib(*offset, &res); char *str = bn_to_string(res); size_t len = strlen(str) + 1; copy_to_user(buf, str, len); kfree(str); return fib_sequence(*offset); } ``` 修改 `client.c` 使其能接收到字串 ```c #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <unistd.h> #define FIB_DEV "/dev/fibonacci" int main() { long long sz; char buf[1]; char write_buf[] = "testing writing"; int offset = 100; /* TODO: try test something bigger than the limit */ int fd = open(FIB_DEV, O_RDWR); if (fd < 0) { perror("Failed to open character device"); exit(1); } for (int i = 0; i <= offset; i++) { sz = write(fd, write_buf, strlen(write_buf)); printf("Writing to " FIB_DEV ", returned the sequence %lld\n", sz); } for (int i = 0; i <= offset; i++) { lseek(fd, i, SEEK_SET); sz = read(fd, buf, 1); printf("Reading from " FIB_DEV " at offset %d, returned the sequence " "%s.\n", i, buf); } for (int i = offset; i >= 0; i--) { lseek(fd, i, SEEK_SET); sz = read(fd, buf, 1); printf("Reading from " FIB_DEV " at offset %d, returned the sequence " "%s.\n", i, buf); } close(fd); return 0; } ``` 運行 `client.c` 發生以下錯誤 ``` *** stack smashing detected ***: terminated Aborted ``` 參考 [blueskyson](https://hackmd.io/@blueskyson/linux2022-fibdrv) 修改 ```c int bn_to_string(bn a, char str[]) { char s[8 * sizeof(uint32_t) * LENGTH / 3 + 2]; unsigned int n[LENGTH]; int i; memset(s, '0', sizeof(s) - 1); memcpy(n, a.num, sizeof(uint32_t) * LENGTH); s[sizeof(s) - 1] = '\0'; // printf("%ld\n", sizeof(a)); for (i = 0; i < 8 * sizeof(uint32_t) * LENGTH; ++i) { int carry; carry = (n[LENGTH - 1] >= 0x80000000); for (int j = LENGTH - 1; j >= 0; --j) { n[j] = ((n[j] << 1) & 0xffffffff) + ((j - 1) >= 0 ? (n[j - 1] >= 0x80000000) : 0); } for (int j = sizeof(s) - 2; j >= 0; j--) { s[j] += s[j] - '0' + carry; carry = (s[j] > '9'); if (carry) s[j] -= 10; } } i = 0; while (i < sizeof(s) - 2 && s[i] == '0') i++; //char *dec = (char *) kmalloc(sizeof(s) - i, GFP_KERNEL); //memcpy(dec, &s[i], sizeof(s) - i); memcpy(str, s, sizeof(s)); return i; } ``` ```c static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { bn res; bn_fib(*offset, &res); char str[8 * sizeof(uint32_t) * LENGTH / 3 + 2]; int len = bn_to_string(res, str); copy_to_user(buf, str + len, 8 * sizeof(uint32_t) * LENGTH / 3 + 2 - len); return fib_sequence(*offset); } ``` ```c #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <unistd.h> #define FIB_DEV "/dev/fibonacci" #define LENGTH 22 #define BUFFSIZE 8 * sizeof(int) * LENGTH / 3 + 2 int main() { long long sz; char buf[BUFFSIZE]; char write_buf[] = "testing writing"; int offset = 1000; /* TODO: try test something bigger than the limit */ int fd = open(FIB_DEV, O_RDWR); if (fd < 0) { perror("Failed to open character device"); exit(1); } for (int i = 0; i <= offset; i++) { lseek(fd, i, SEEK_SET); sz = read(fd, buf, BUFFSIZE); printf(" offset %d, returned the sequence " "%s.\n", i, buf); } close(fd); return 0; } ``` ### 效能分析 首先在 userspace 測試 #### Perf ``` sudo perf stat -r 15 -e cycles,instructions,cache-references,cache-misses,branch,branch-instructions,branch-misses ./a.out ``` 採用 `perf` 查看執行費波納契數列第 0 項到第 500 項之 cache miss 和 branch miss。 * Fast doubling ``` Performance counter stats for './a.out' (15 runs): 7995,7081 cycles ( +- 0.30% ) (53.67%) 1,8737,9429 instructions # 2.34 insn per cycle ( +- 0.06% ) (69.11%) 7,6359 cache-references ( +- 1.39% ) (69.11%) 1,8337 cache-misses # 24.015 % of all cache refs ( +- 6.72% ) (69.11%) 2689,1258 branch ( +- 0.05% ) (71.15%) 2727,4701 branch-instructions ( +- 0.04% ) (77.22%) 4,4548 branch-misses # 0.16% of all branches ( +- 10.93% ) (59.74%) 0.0261203 +- 0.0000789 seconds time elapsed ( +- 0.30% ) ``` * Iteration ``` Performance counter stats for './a.out' (15 runs): 2951,5320 cycles ( +- 0.47% ) (24.03%) 5480,8003 instructions # 1.86 insn per cycle ( +- 0.42% ) (65.18%) 4,8197 cache-references ( +- 2.72% ) (98.36%) 1,1488 cache-misses # 23.836 % of all cache refs ( +- 12.41% ) 489,4013 branch ( +- 0.01% ) 491,1414 branch-instructions ( +- 0.09% ) (75.97%) 3,3682 branch-misses # 0.69% of all branches ( +- 37.92% ) (1.64%) 0.009944 +- 0.000117 seconds time elapsed ( +- 1.17% ) ``` Fast Doubling 在 Instruction per cycle(IPC) 和 Branch miss 上表現均優於 Iteration ,然而總執行的指令多非常多,因此執行時間也較多。下方的圖也顯示這一點。 #### 數列各項執行時間比較 下圖是執行 0 到 1000 項之結果 ![](https://i.imgur.com/ccrcNvQ.png) 參考 [Risheng1128](https://hackmd.io/@Risheng/linux2022-fibdrv#研讀-Linux-效能分析的提示) 將 CPU 0 獨立出來執行 ![](https://i.imgur.com/vqIM7fd.png) 接著按照 [Linux 效能分析的提示](https://hackmd.io/@sysprog/linux2022-fibdrv#-Linux-效能分析的提示) 抑制 address space layout randomization (ASLR),設定 scaling_governor 為 performance ,關閉 turbo mode。得到以下圖表 ![](https://i.imgur.com/dHIvNYk.png) 非常明顯 `bn-fast-doubling` 較 `bn_iteration` 慢。 #### cachegrind * fast doubling ``` $ valgrind --tool=cachegrind --branch-sim=yes ./a.out ==23370== Cachegrind, a cache and branch-prediction profiler ==23370== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al. ==23370== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info ==23370== Command: ./a.out ==23370== --23370-- warning: L3 cache found, using its data for the LL simulation. ==23370== ==23370== I refs: 187,585,847 ==23370== I1 misses: 1,075 ==23370== LLi misses: 1,053 ==23370== I1 miss rate: 0.00% ==23370== LLi miss rate: 0.00% ==23370== ==23370== D refs: 102,874,241 (86,658,760 rd + 16,215,481 wr) ==23370== D1 misses: 3,201 ( 2,542 rd + 659 wr) ==23370== LLd misses: 2,663 ( 2,069 rd + 594 wr) ==23370== D1 miss rate: 0.0% ( 0.0% + 0.0% ) ==23370== LLd miss rate: 0.0% ( 0.0% + 0.0% ) ==23370== ==23370== LL refs: 4,276 ( 3,617 rd + 659 wr) ==23370== LL misses: 3,716 ( 3,122 rd + 594 wr) ==23370== LL miss rate: 0.0% ( 0.0% + 0.0% ) ==23370== ==23370== Branches: 22,906,501 (22,779,656 cond + 126,845 ind) ==23370== Mispredicts: 1,147,376 ( 1,147,235 cond + 141 ind) ==23370== Mispred rate: 5.0% ( 5.0% + 0.1% ) ``` ``` -------------------------------------------------------------------------------- I1 cache: 32768 B, 64 B, 8-way associative D1 cache: 32768 B, 64 B, 8-way associative LL cache: 3145728 B, 64 B, 12-way associative Command: ./a.out Data file: cachegrind.out.26188 Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim Thresholds: 0.1 100 100 100 100 100 100 100 100 100 100 100 100 Include dirs: User annotated: Auto-annotation: on -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim -------------------------------------------------------------------------------- 186,025,668 971 951 86,059,765 1,336 1,146 16,109,504 566 545 22,649,428 1,144,320 126,819 128 PROGRAM TOTALS -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim file:function -------------------------------------------------------------------------------- 163,840,484 8 8 75,946,074 0 0 12,966,362 0 0 21,096,210 1,070,152 0 0 /home/arthurchang09/quiz2/test.c:bn_mul_2 6,913,606 13 13 2,850,456 0 0 725,116 0 0 237,538 12,525 0 0 /home/arthurchang09/quiz2/test.c:bn_lshift 5,839,347 4 4 2,937,970 0 0 525,084 0 0 487,578 26,360 0 0 /home/arthurchang09/quiz2/test.c:bn_dec 5,462,604 2 2 2,930,076 0 0 441,720 0 0 397,548 29,486 0 0 /home/arthurchang09/quiz2/test.c:bn_add 1,763,169 18 18 1,082,999 0 0 1,136,227 2 1 26,506 3,128 0 0 /home/arthurchang09/quiz2/test.c:bn_fib 1,662,276 5 5 113,519 0 0 277,046 3 2 352,058 9 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms 253,223 17 16 126,584 1 0 11 1 1 10 5 126,562 11 ???:??? ``` * iteration ``` ==25942== Cachegrind, a cache and branch-prediction profiler ==25942== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al. ==25942== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info ==25942== Command: ./a.out ==25942== --25942-- warning: L3 cache found, using its data for the LL simulation. ==25942== ==25942== I refs: 52,946,905 ==25942== I1 misses: 927 ==25942== LLi misses: 912 ==25942== I1 miss rate: 0.00% ==25942== LLi miss rate: 0.00% ==25942== ==25942== D refs: 34,742,105 (28,234,653 rd + 6,507,452 wr) ==25942== D1 misses: 1,902 ( 1,336 rd + 566 wr) ==25942== LLd misses: 1,691 ( 1,146 rd + 545 wr) ==25942== D1 miss rate: 0.0% ( 0.0% + 0.0% ) ==25942== LLd miss rate: 0.0% ( 0.0% + 0.0% ) ==25942== ==25942== LL refs: 2,829 ( 2,263 rd + 566 wr) ==25942== LL misses: 2,603 ( 2,058 rd + 545 wr) ==25942== LL miss rate: 0.0% ( 0.0% + 0.0% ) ==25942== ==25942== Branches: 3,899,539 ( 3,772,990 cond + 126,549 ind) ==25942== Mispredicts: 252,834 ( 252,706 cond + 128 ind) ==25942== Mispred rate: 6.5% ( 6.7% + 0.1% ) ``` ``` -------------------------------------------------------------------------------- I1 cache: 32768 B, 64 B, 8-way associative D1 cache: 32768 B, 64 B, 8-way associative LL cache: 3145728 B, 64 B, 12-way associative Command: ./a.out Data file: bn_norm_cachegrind.out.25942 Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim Thresholds: 0.1 100 100 100 100 100 100 100 100 100 100 100 100 Include dirs: User annotated: Auto-annotation: on -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim -------------------------------------------------------------------------------- 52,946,905 927 912 28,234,653 1,336 1,146 6,507,452 566 545 3,772,990 252,706 126,549 128 PROGRAM TOTALS -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim file:function -------------------------------------------------------------------------------- 46,282,250 3 3 24,825,250 0 0 3,742,500 1 0 3,368,250 249,538 0 0 /home/arthurchang09/quiz2/test.c:bn_add 4,762,025 6 6 2,874,257 0 0 2,500,505 3 3 126,252 510 0 0 /home/arthurchang09/quiz2/test.c:bn_norm_fib 1,497,000 2 2 374,250 0 0 249,500 0 0 249,500 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms 252,683 17 16 126,314 1 0 11 1 1 10 5 126,292 11 ???:??? ``` 將以上重要資訊整理如下 | | Bn Fast Doubling | Bn Iteration | |:---------------------------:|:----------------:|:------------:| | perf instructions | 1,8737,9429 | 5480,8003 | | perf cache-references | 7,6359 | 4,8197 | | perf cache miss | 1,8337 | 1,1488 | | perf branch | 2689,1258 | 489,4013 | | perf branch miss | 4,4548 | 3,3682 | | perf branch miss rate | 0.16% | 0.69% | | cachegrind LL cache miss | 3,716 | 2,603 | | cachegrind Branch | 22,906,501 | 3,899,539 | | cachegrind Branch miss | 1,147,376 | 252,834 | | cachegrind Branch miss rate | 5.0% | 6.5% | 可以發現 Bn Fast doubling 執行的指令數約為 iteration 版的 3 倍多。雖然有較低 cache miss rate 和 Branch miss rate ,但仍無助於減少執行時間。 先從 cachegrind 的數據觀察,可以發現 `bn_mul2` 之 `Ir` `Dr` `Dw` 比其他高出非常多,即此部分程式執行最多次,大部分的 Branch miss 發生在此,因此這個部份極有可能是拖慢執行速度的原因。 ``` -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim file:function -------------------------------------------------------------------------------- 163,840,484 8 8 75,946,074 0 0 12,966,362 0 0 21,096,210 1,070,152 0 0 /home/arthurchang09/quiz2/test.c:bn_mul_2 ``` 接著看函式執行的狀況,三層的巢狀迴圈是造成大量 branch miss ,也是最多 `Ir` 之處,大部分 cycle 都是執行此處指令。 ```= . . . . . . . . . . . . . void bn_mul_2(bn a, bn b, bn *res) 300,048 2 2 37,506 0 0 112,518 0 0 0 0 0 0 { . . . . . . . . . . . . . uint64_t tmp[LENGTH]; . . . . . . . . . . . . . uint32_t t[LENGTH]; 75,012 0 0 0 0 0 75,012 0 0 0 0 0 0 uint64_t c1 = 0,c2 = 0; 187,530 0 0 0 0 0 37,506 0 0 0 0 0 0 memset(tmp, 0,sizeof(uint64_t) * LENGTH); 187,530 1 1 0 0 0 37,506 0 0 0 0 0 0 memset(t, 0,sizeof(uint32_t) * LENGTH); 1,500,240 1 1 937,650 0 0 37,506 0 0 487,578 37,516 0 0 for (int i = 0; i < LENGTH; ++i) { 18,002,880 0 0 11,251,800 0 0 450,072 0 0 5,850,936 525,100 0 0 for (int j = 0; j < LENGTH; ++j) { 27,004,320 1 1 10,801,728 0 0 0 0 0 5,400,864 450,217 0 0 if ((i + j) < LENGTH ) { 29,254,680 0 0 11,701,872 0 0 2,925,468 0 0 0 0 0 0 c1 = (uint64_t) a.num[i] * b.num[j]; 2,925,468 0 0 0 0 0 2,925,468 0 0 0 0 0 0 c2 = 0; 20,558,652 2 2 8,829,988 0 0 2,925,468 0 0 2,952,260 16 0 0 for (int k = i + j; k < LENGTH; ++k) { 23,618,080 0 0 11,809,040 0 0 0 0 0 0 0 0 0 c2 += (uint64_t) t[k] + (c1 & 0xffffffff); 14,761,300 0 0 5,904,520 0 0 2,952,260 0 0 0 0 0 0 t[k] = c2 & 0xffffffff; 2,952,260 0 0 2,952,260 0 0 0 0 0 0 0 0 0 c2 >>= 32; 2,952,260 0 0 2,952,260 0 0 0 0 0 0 0 0 0 c1 >>= 32; 11,758,976 0 0 5,879,488 0 0 0 0 0 5,879,488 19,779 0 0 if (!c1 && !c2) { 2,925,468 0 0 0 0 0 0 0 0 0 0 0 0 break; . . . . . . . . . . . . . } . . . . . . . . . . . . . } . . . . . . . . . . . . . } . . . . . . . . . . . . . } . . . . . . . . . . . . . } 1,500,240 1 1 937,650 0 0 37,506 0 0 487,578 37,522 0 0 for (int i = LENGTH - 1; i >= 0; --i) { 3,150,504 0 0 1,800,288 0 0 450,072 0 0 0 0 0 0 res->num[i] = t[i]; . . . . . . . . . . . . . } 225,036 0 0 150,024 0 0 0 0 0 37,506 2 0 0 } ``` 接著使用 perf record 觀察, ``` sudo perf record -g --call-graph dwarf ./a.out sudo perf report --stdio -g graph,0.5,caller ``` 以下節錄其中的片段 ``` # To display the perf.data header info, please use --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 659 of event 'cycles' # Event count (approx.): 502472023 # # Children Self Command Shared Object Symbol # ........ ........ ....... ................. .................................. # 99.41% 0.00% a.out a.out [.] _start | ---_start __libc_start_main main bn_fib | |--89.69%--bn_mul_2 | | | --1.53%--__memset_avx2_unaligned_erms | |--3.06%--bn_lshift | |--2.51%--bn_dec | |--2.17%--bn_add | --0.76%--__memset_avx2_unaligned_erms ``` 可發現 `bn_mul_2` 佔據 89% ,剩下的 `bn_lshift` 、 `bn_dec` 和 `bn_add` 占比就比較少,因此必須設法提高乘法的效率。 因此修改 `bn_mul_2` ,修改長乘法的方式, $\begin{array}{rrrrrrr} & & & & a & b & c \\ &\times& & & d & e & f \\ \hline & & & & af & bf & cf \\ & & & ae & be &ce\\ & & ad & bd & cd \\ \hline \end{array}$ 之前的作法是依照長乘法的習慣先由右而左,由上而下做乘法,即先算 $cf$ 、 $bf$ 、 $af$ ,再算 $ce$ 、 $be$ 、 $ae$ 並計算 $bf+ce$ ,以此類推。為處理計算 $bf+ce$ 產生的 overflow ,原始的版本每做完一次加法,需如 `bn_add` 一樣將可能的 carry 向高位數加,因此需要三層巢狀迴圈。因此我將計算順序稍作修改,先由上而下,由右而左,也就是計算 $bf + ce$ ,得出 carry 後再計算 $af+be+cd+carry$ 以此類推。因此只需要兩層 `for` 迴圈。 ```c void bn_mul_3(bn a, bn b, bn *res) { uint64_t c1 = 0,c2 = 0,carry; memset(res, 0, sizeof(bn)); for (int i = 0; i < LENGTH; ++i) { c1 = carry; carry = 0; for (int j = 0; j <= i; ++j) { c2 = (uint64_t) a.num[j] * b.num[i - j]; c1 += c2; carry += c1 < c2; } res->num[i] = c1 & 0xffffffff; carry = (carry << 32) + (c1 >> 32); } } ``` 接著在 userspace 用`timespec` 測量時間,如下: ![](https://i.imgur.com/p8cg5Cv.png) 很明顯減少了約 50% 的執行時間 使用 `perf` 測量 ``` sudo perf stat -r 15 -e cycles,instructions,cache-references,cache-misses, branch,branch-instructions,branch-misses taskset 0x01 ./a.out ``` * iteration ``` Performance counter stats for 'taskset 0x01 ./a.out' (15 runs): 2,0095,7615 cycles ( +- 0.07% ) (51.96%) 3,7231,5532 instructions # 1.85 insn per cycle ( +- 0.03% ) (70.37%) 15,6095 cache-references ( +- 1.02% ) (75.08%) 3,6236 cache-misses # 23.214 % of all cache refs ( +- 11.92% ) (75.43%) 3037,2291 branch ( +- 0.04% ) (75.43%) 3035,1483 branch-instructions ( +- 0.02% ) (72.61%) 5668 branch-misses # 0.02% of all branches ( +- 23.54% ) (49.49%) 0.0654597 +- 0.0000478 seconds time elapsed ( +- 0.07% ) ``` * fast doubling old ``` Performance counter stats for 'taskset 0x01 ./a.out' (15 runs): 5,1662,8092 cycles ( +- 0.23% ) (56.97%) 10,7807,0383 instructions # 2.09 insn per cycle ( +- 0.02% ) (71.31%) 17,5168 cache-references ( +- 1.06% ) (71.31%) 4,4469 cache-misses # 25.386 % of all cache refs ( +- 9.10% ) (71.31%) 1,5608,3228 branch ( +- 0.03% ) (71.41%) 1,5656,4076 branch-instructions ( +- 0.03% ) (71.72%) 123,2533 branch-misses # 0.79% of all branches ( +- 2.46% ) (57.28%) 0.167680 +- 0.000369 seconds time elapsed ( +- 0.22% ) ``` * fast doubling new ``` Performance counter stats for 'taskset 0x01 ./a.out' (15 runs): 2,5550,7791 cycles ( +- 0.24% ) (56.53%) 5,0979,0980 instructions # 2.00 insn per cycle ( +- 0.06% ) (71.02%) 17,1360 cache-references ( +- 1.23% ) (71.02%) 3,7991 cache-misses # 22.170 % of all cache refs ( +- 7.09% ) (71.02%) 2850,6292 branch ( +- 0.11% ) (71.36%) 2833,2169 branch-instructions ( +- 0.06% ) (72.45%) 18,4534 branch-misses # 0.65% of all branches ( +- 3.07% ) (57.62%) 0.083148 +- 0.000213 seconds time elapsed ( +- 0.26% ) ``` 簡單整理如下 | | itseration | fast doubling old | fast doubling new | |:-------------------:|:-----------:|:-----------------:|:-----------------:| | cycle | 2,0095,7615 | 5,1662,8092 | 2,5550,7791 | | instructions | 3,7231,5532 | 10,7807,0383 | 5,0979,0980 | | cache misses | 3,6236 | 4,4469 | 3,7991 | | branch instructions | 3035,1483 | 1,5656,4076 | 2833,2169 | | branch misses | 5668 | 123,2533 | 18,4534 | 以上的 `perf` 測試跑第 0 項到第 1000 項的費式數列的總時間。在 cycle 數、 instruction 數、 執行時間方面約是舊版的一半,進步幅度很大。可看出由於迴圈從三層變成兩層, `branch instructions` 的數量約為原本的 $\frac{1}{5}$ , `branch misses` 約是舊版的 $\frac{1}{6}$ 減少幅度相當可觀。 #### copy_to_user 之成本 接著將以上程式碼放入 `fibdrv` ,使用 `ktime_t` 測量執行時間,並使用 `return` 回傳執行時間。 ```c static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { bn res; ktime_t kt = ktime_get(); bn_fib(*offset, &res); // s64 time = ktime_to_ns(ktime_sub(ktime_get(), kt)); char str[8 * sizeof(uint32_t) * LENGTH / 3 + 2]; int len = bn_to_string(res, str); copy_to_user(buf, str + len, 8 * sizeof(uint32_t) * LENGTH / 3 + 2 - len); s64 time = ktime_to_ns(ktime_sub(ktime_get(), kt)); return time; } ``` 我測量了只有`bn_fib` 的執行時間以及包含 `copy_to_usr` 的時間,發現這之間有非常大的差距。原本減少 50% 的執行時間因為要傳送龐大的字串造成執行時間的差距減少。由下圖可見時間從約 3萬 ns 增加到約 30多萬 ns ,非常明顯 `copy_to_user` 為了傳送數十個到數百個的字元而耗費大量時間。 ![](https://i.imgur.com/dDhrMkw.png) 於是將傳送的方式改成只傳送 `bn` 而非龐大的字串。直接將計算的結果 `res` 放入 `copy_to_user` ```c static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { bn res; ktime_t kt = ktime_get(); bn_fib(*offset, &res); copy_to_user(buf, &res, sizeof(res)); s64 time = ktime_to_ns(ktime_sub(ktime_get(), kt)); return time; } ``` 接著再由 `client.c` 接收並使用 `memcpy` 將值放入 `bn res` ,最後呼叫 `bn_to_string` 將數字轉成字串。 ```c int main() { char buf[BUFFSIZE]; int offset = 1000; /* TODO: try test something bigger than the limit */ int fd = open(FIB_DEV, O_RDWR); if (fd < 0) { perror("Failed to open character device"); exit(1); } bn res; for (int i = 0; i <= offset; i++) { lseek(fd, i, SEEK_SET); long long sz = read(fd, buf, BUFFSIZE); memcpy(&res,buf,sizeof(res)); bn_to_string_new(res); } close(fd); return 0; } ``` 最後再計算一次 `fibdrv` 之執行時間,如下圖所示。有無 `copy_to_usr` 對執行時間影響非常小, ![](https://i.imgur.com/FCMNndB.png) 然而,如此一來,將 `bn` 轉為十進位必需要在 `userspace` 進行,因此 `userspace` 也需要花時間進行計算,得到以下的圖。 ![](https://i.imgur.com/7kCtfR8.png) 由上圖可以發現,隨著費式數列的增長,將 `bn` 轉成字串所花的時間會越來越長,總花費時間會比單純在 `fibdrv` 轉換成字串還慢,因此,須採用別的方式減少 userpace 的轉換時間。