linux2021: yian02

# linux2021: yian02 ###### tags: `linux2021` ## 測驗 $\alpha - 1$ 在 linux 核心程式碼 (arm) 中，實作[進階加密標準(AES)](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard)時，有使用到 bit rotation。此加密演算法的四個動作：SubBytes、ShiftRows、MixColumns、AddRoundKey 中的三個操作 (SubBytes、ShiftRows、MixColumns) 可以透過 lookup table 同時完成，這樣的操作會需要 16 個 lookup tables，但這16個 lookup table 中，只有四個需要被儲存起來，其餘的 12 個 lookup tables 都可以透過儲存起來的四個 tables 做 bit rotation 得到，這樣可以節省 75% 的 D-Cache。程式碼可以見 [linux 核心 AES 程式碼](https://github.com/torvalds/linux/blob/e73f0f0ee7541171d89f2e2491130c7771ba58d3/arch/arm64/crypto/aes-cipher-core.S)，其中有使用到在 arm 架構中用來 right rotate 的指令 `ror`。 Reference: [Accelerated AES for the Arm64 Linux kernel](https://www.linaro.org/blog/accelerated-aes-for-the-arm64-linux-kernel/) ## 測驗 $\alpha - 2$ 將程式碼輸入 [Compiler Explorer](https://godbolt.org) 並使用 x86-64 gcc 11.1 來編譯程式碼。 #### 觀察不使用最佳化時編譯器的輸出若完全沒有做編譯器最佳化時，編譯器會輸出每個 function 對應的組合語言，並使用函式呼叫的方式執行相對應的程式碼。以下程式碼為在不使用編譯器最佳化時輸出的 rotl8 函式。 ```c= rotl8: push rbp mov rbp, rsp mov eax, edi mov DWORD PTR [rbp-24], esi mov BYTE PTR [rbp-20], al mov DWORD PTR [rbp-4], 7 mov eax, DWORD PTR [rbp-4] and DWORD PTR [rbp-24], eax movzx edx, BYTE PTR [rbp-20] mov eax, DWORD PTR [rbp-24] mov ecx, eax sal edx, cl mov eax, edx mov esi, eax movzx edx, BYTE PTR [rbp-20] mov eax, DWORD PTR [rbp-4] sub eax, DWORD PTR [rbp-24] add eax, 1 mov ecx, eax sar edx, cl mov eax, edx or eax, esi pop rbp ret ``` 在上面的程式碼中，編譯器一步一步地做在原始程式碼中寫出來的程式，也沒有發現到我們是在做 bit rotation 而使用 `rotl` 或 `rotr` 指令。 #### 觀察使用最佳化時編譯器的輸出有使用編譯器最佳化時（只需要使用到最基本的編譯器最佳化 `-O` ），會發現若是在編譯時期就已經可以推論出值的變數，會直接在編譯時期就計算出函數的回傳值，因此不會用到 function call，也不會用到 `rotr` 或 `rotl` 指令。執行以下的程式碼： ```c // 8 bits uint8_t a8 = 0xf0; uint8_t b8 = rotl8(a8, 2); printf("%x\n", b8); ``` 觀察對應的組合語言輸出： ```c mov esi, 195 mov edi, OFFSET FLAT:.LC0 mov eax, 0 call printf ``` 可以發現編譯器在編譯時期就已經將函式的回傳值計算出來，上面的組合語言基本上只是在將函式回傳值 print 出來而已，因此也不會使用到 `rotl` 或 `rotr`。不過，若將變數變成編譯器無法在編譯時期就推論出來的值（例如：需要使用者輸入的值），就會發現編譯器能夠將原始的程式碼最佳化成使用 `rotl` 或 `rotr` 指令。執行以下程式碼： ```c uint8_t test; fscanf(stdin,"%d",&test); test = rotl8(test, 2); printf("%x\n", test); ``` 觀察編譯器最佳化後的輸出（對應到呼叫 `rotl8` 時的輸出）： ```c movzx esi, BYTE PTR [rsp+15] rol sil, 2 mov BYTE PTR [rsp+15], sil ``` 可以發現編譯器使用了 `rol` 指令。因此，經過編譯器最佳化後的程式碼，如果其變數的數值在編譯時期還無法被編譯器推論出來的話，編譯器就會最佳化成使用 `rol` 或 `ror` ，而不需要使用函式呼叫。 ## 測驗 $\beta - 1$ :::warning power of 2 的翻譯為「2 的冪」 :notes: jserv ::: 該程式碼首先先判斷 alignment 是否為 2 的冪，若不是的話，就用算術運算算出 align_up 的值，若是的話，則可以透過 bitwise operation 算出 align_up 的值。 #### 不是 2 的冪先將 sz 加上 mask，處理 align_up 的值要大於等於 sz 的問題（先使要被 align 的位數進位，才不會在後面的除法運算中被丟掉），再透過除法和乘法運算計算出其 align_up 的值。 #### 2 的冪 ```c if ((alignment & mask) == 0){ /* power of two? */ return (~mask) & (sz + mask); } ``` 若為 2 的次方，則代表對應到其次方的 bit 會是 1，後面都是 0 ，則減一之後，會使得原本是 1 的位置變為 0，後面的 0 都變為 1。例如： 8 $\to$ 1000~2~, 8 - 1 = 7 $\to$ 0111~2~ 因此對 mask 取 not 後，只要對 sz 做 and ，就可以將 sz 變為 alignment 的倍數。不過還要先處理 align_up 的值要大於等於 sz 的問題，因此要對 ~mask 做 and 的應該要是 sz + mask，先將該進位的進位完成後，再將不需要的 bits 透過和 ~mask 做 and 運算去除掉，就可以透過 bitwise operation 計算出 2 的次方的 align_up 的值。 ## 測驗 $\beta - 2$ 在 [linux/include/linux/align.h](https://github.com/torvalds/linux/blob/e73f0f0ee7541171d89f2e2491130c7771ba58d3/include/linux/align.h) 中可以看到以下 macro： ```c #define ALIGN(x, a) __ALIGN_KERNEL((x), (a)) ``` 而`__ALIGN_KERNEL((x), (a))` 可以在 [linux/include/uapi/linux/const.h](https://github.com/torvalds/linux/blob/e73f0f0ee7541171d89f2e2491130c7771ba58d3/include/uapi/linux/const.h) 中找到： ```c #define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1) #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask)) ``` 這個 align 的 macro 其中一個用途是用在 page align 上，在 [linux/include/linux/mm.h](https://github.com/torvalds/linux/blob/e73f0f0ee7541171d89f2e2491130c7771ba58d3/include/linux/mm.h) 中： ```c #define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE) ``` 這個 macro 的其中一個用途是在 `make_alloc_exact` 函式中有使用： ```c static void *make_alloc_exact(unsigned long addr, unsigned int order, size_t size) { if (addr) { unsigned long alloc_end = addr + (PAGE_SIZE << order); unsigned long used = addr + PAGE_ALIGN(size); split_page(virt_to_page((void *)addr), order); while (used < alloc_end) { free_page(used); used += PAGE_SIZE; } } return (void *)addr; } ``` 這個函式會透過對傳入的 `size` 先去算 alignment 後，加上傳入的 `addr` 去計算結束的位置，依照起始和結束的位置去 allocate 最小需要的 physically-contiguous pages 。而根據該程式碼中的註解提到，在 linux kernel 中有另一個函式 `alloc_pages()` 可以達到差不多的功能，差別在 `alloc_pages()` 只能 allocate memory in power-of-two pages，而 `make_alloc_exact()` 可以只 allocate 最小需要的 page 即可。 ## 測驗 $\gamma - 1$ 根據 [fork(2)](https://man7.org/linux/man-pages/man2/fork.2.html) 中提到： > The entire virtual address space of the parent is replicated in the child 而複製過去的 virtual address space 中，也包含了 I/O Buffer，因此已經被 parent process 存進 buffer 但還沒被 flush 出來的值，也會被 child process 接收，在 buffer 滿了或是執行 fflush 時，就會一併被印出來。假設 `NNN = 12`，則對於每一個被 fork 複製出來的 process 來說，都會印出 12 次的 `-`，只是對於最原始的 parent process 或是在 `i = 0`時即被 fork 出的 child process 來說，這 12 個 `-` 都是自己印出來的。但對於後面才被 fork 出的 process 來說，以 `i = 11` 時被 fork 出的 child process 為例，這個 process 也會印出 12 次的 `-`，不過前面的 11 次是來自於 parent process 儲存在 buffer 中還未被輸出的，只有最後一次是該 process 自己印出的。而在 `for` 迴圈中的 `fork` 會使得 process 數量程二的次方增長，迴圈執行過 12 次時總共會有 $2^{12}$ 個 process ，每個 process 又會印出 12 次 `-`，因此印出的 `-` 總數為 $12\times 2^{12}=49152$，符合題目所述。另外，也可以測試在沒有 I/O Buffer 時會發生的狀況，先計算在沒有 I/O Buffer 時會輸出的 `-` 字元數：在沒有 I/O Buffer 時，輸出的 `-` 字元數會成等比級數成長，也就是若 `NNN = 12` 時，輸出的字元數會是 $\frac{2\times (2^{12}-1)}{2-1} = 8190$ 個。將原本程式中的 `printf("-");` 改成 `fprintf(stderr,"-");` ，也就是從會使用 buffer 的 stdout ，改變成輸出至預設沒有 I/O Buffer 的 stderr。 ```bash > ./fork 2> result.txt > wc -c result.txt ``` 將 stderr 輸出的內容導至 result.txt 檔案中，並執行 wc ，可以看到程式顯示 8190 ，符合上面計算的結果。 ## 測驗 $\delta - 1$ #### 原理程式碼在做的事情是同時會有多個 thread 對同一個 concurrent queue 做 push 和 pop 的操作，為了避免 thread 互相衝突（在操作到一半時被別的 thread 更動），因此此 concurrent queue 有兩把鎖： `first_mutex` 和 `last_mutex` ，需要 push 操作的 thread 在操作之前必須要先得到 `last_mutex` 這把鎖才能夠進行操作，並在操作後釋放這把鎖；而 pop 操作的 thread 在操作前需要獲得 `first_mutex` 這把鎖才能夠進行操作，操作後同樣的也要把獲得的鎖釋放出來。 #### 可能的實作缺失若按照老師在題目程式碼中的寫法，在 push 時會 push 到 queue 的最後一個，因此對於第一個被 push 進 queue 的人來說，他會被 push 到 dummy head 的後面。但在 pop 時卻是從 queue 的 head 開始 pop，也就是第一個被 pop 出來的會是 dummy head，這樣會造成在程式碼認為 queue 已經空了的時候，裡面存在的不是 dummy head，而是最後一個被 push 進去的 node，這樣可能會造成本來應該要被 free 的 node 還可以被後續程式碼使用，若裡面存放的是重要的資料，則可能會有安全上的疑慮。 ## 測驗 $\epsilon - 1$ 這個程式碼使用 `mmap` 來實作 memory pool，一開始先透過 `mpool_init` 來初始化 memory pool ，但這時還沒有真正分配出 memory pool 所需要的所有空間，只有把對應到每個 size 的 head 所需要的空間分配出來。初始化後，可以透過 `mpool_alloc` 來獲得 memory pool 中的空間，在此函式中，會先透過傳入的引數 `sz` 來找到對應大小的 memory pool，找到後就會檢查該 memory pool 是否已經有分配空間，若沒有的話在此時才會真正分配空間給對應 size 的 memory pool，而分配空間是透過呼叫 `mpool_new_pool`，透過傳入此函式的引數可以看出來，一次分配的 memory pool 總共的大小就是一個 page。在 `mpool_new_pool` 函式中，會呼叫 `mmap` 來獲得記憶體空間，再將獲得的空間依照傳入的 `sz` 引數初始化成 linked list 的形式，假設傳入的 `sz` 是 64 ，則代表此 memory pool 中一個區塊就是 64 bytes，因此將獲得空間的 offset = 0 的位置所儲存的資訊變為 offset = 64 的位置，依此類推，就可以將獲得的連續空間變成類似 linked list 的結構，來方便之後得 allocate 和 repool 操作。而若是呼叫 `mpool_alloc` 時發現對應大小的 memory pool 沒有可用空間時，就會呼叫 `mpool_new_pool` 獲得一個新的 memory pool ，再呼叫 `add_pool` ，將新的 memory pool 串到原本的 memory pool 後方，就可以使用其配置的空間。而從 memory pool 分配空間出去之前，會先將 `hs` 中儲存的相對應大小的 head 換成要分配出去的下一個，也就是儲存在要分配出去的空間中的「下一個人的記憶體位置」。透過 memory pool 分配出來的空間中所儲存的資訊，就會從原本儲存「下一個人的記憶體位置」變為儲存「真正的資料」。而 `mpool_repool` 函數在做的事情，就是回收分配出去的 memory pool 空間，透過將回收的空間中所儲存的資訊換成 `hs` 中的第一個人所在的「記憶體位置」，並將 `hs` 中對應大小的 head 換成要回收的那個空間所在的位置，就完成了將要回收的空間串回 memory pool 的 linked list 中的操作。 ## 測驗 $\epsilon - 1$ 在老師題目的程式碼中， `mpool_repool` 函數，不管要回收的 memory pool 的大小是多少，都會將其回收到 size 最小的那個 memory pool 中，這樣會造成空間的浪費，因為大的 memory pool 回收後只能當作最小的來使用。可以透過在 `mpool_repool` 函式中，先尋找對應大小的 memory pool，再將 memory pool 歸還至對應大小的 pool 中，來解決這個問題。改寫後的 `mpool_repool` 如下： ```c void mpool_repool(mpool *mp, void *p, int sz) { int i = 0, curr_sz; if (sz > mp->max_pool) { if (munmap(p, sz) == -1) fprintf(stderr, "Failed to unmap %d bytes at %p\n", sz, p); return; } // looking for the appropriate hs entry long szceil = 0; for (i = 0, curr_sz = mp->min_pool;; i++, curr_sz *= 2) { if (curr_sz > sz) { szceil = curr_sz; break; } } assert(szceil > 0); void **ip = (void **)p; // YYY; *ip = mp->hs[i]; assert(ip); mp->hs[i] = ip; } ```