linux2021: ChinYikMing

# linux2021: ChinYikMing ###### tags: `Linux 作業系統` ## 測驗 $\alpha$ α - 1 **舉出 Linux 核心原始程式碼裡頭 bit rotation 的案例並說明** BLAKE2是第一代BLAKE的改良版，是在2012年由Jean-Philippe Aumasson, Samuel Neves, Zooko Wilcox-O'Hearn, and Christian Winnerlein設計。主要是為了解決其他加密hash function例如MD5, SHA-1的運算速度過慢問題。BLAKE2有3種，分別是BLAKE2b, BLAKE2s, BLAKE2x。Linux核心原始程式碼則實作了BLAKE2b和BLAKE2s。這3種加密hash function的差別在於最後output的bits長度不同，BLAKE2b的output是512bits，BLAKE2s的output是256bits，BLAKE2x的output則是任何長度bits。這裡我們只討論BLAKE2s。在Linux核心原始程式碼的crypto裡面有一個加密hash function, BLAKE2。這個加密hash function的運算過程中有個Mixing G function，這個Mixing G function就使用到了ror32。接下來，我們來了解一下BLAKE2的大致運算流程和ror32被BLAKE2使用了多少次。可參考 [lib/crypto/blake2s-generic.c](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/lib/crypto/blake2s-generic.c) 詳細實作可參考RFC7693, https://datatracker.ietf.org/doc/html/rfc7693 一開始BLAKE2s的input是8個word size的資料，然後用這8個word size的資料搭配一些constant(f)和counter(t)組合出16個word size的vector。接著，再把這個vector丟給Compression F function, Compression F function總共作10 round, 每一round會作8次Mixing G function運算（前4次是column step，後4次是diagonal step）, 所以總共使用了80次Mixing G function, 而每次Mixing G function會使用4次ror32，因此ror32在BLAKE2s總共被使用了320次。由此可知，bit rotation是大量被使用的。Compression F function結束後，會再進行Finalization運算，才會得出256bits的hash value。在2020年1月9號，Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and Zooko Wilcox-O'Hearn提出基於BLAKE2的BLAKE3，但Linux核心程式碼尚未實作。可參考 https://github.com/BLAKE3-team/BLAKE3 延伸閱讀1：https://en.wikipedia.org/wiki/BLAKE_(hash_function) 延伸閱讀2：https://www.blake2.net/ BLAKE2解說影片：https://www.youtube.com/watch?v=bc-lu8uyivk α - 2 **x86_64 指令集具備 rotr 和 rotl 指令，上述 C 程式碼經過編譯器最佳化 (例如使用 gcc) 後，能否運用到這二個指令呢？** 可以，我們嘗試在編譯後來反組譯看看。我使用的反組譯工具是這個網站 https://gcc.godbolt.org/ ，選擇的gcc版本是9.2。不管使用了-O2還是-O3最佳化編譯，都可以使用到x86_64提供的指令rol(rotation left)和ror(rotation right)。 ![](https://i.imgur.com/qkadfyN.png) ![](https://i.imgur.com/oVt2ufe.png) ![](https://i.imgur.com/Zdztbt1.png) ![](https://i.imgur.com/vOQddXN.png) ![](https://i.imgur.com/8dr8Hv7.png) ![](https://i.imgur.com/6aqzO58.png) --- ## 測驗 $\beta$ ```c= #include <stdint.h> #include <stddef.h> #include <stdio.h> #include <stdlib.h> static inline uintptr_t align_up(uintptr_t sz, size_t alignment) { uintptr_t mask = alignment - 1; if ((alignment & mask) == 0) { /* power of two? */ return ((sz + mask) & ~mask); } return (((sz + mask) / alignment) * alignment); } ``` :::warning power of 2 的翻譯是「2 的冪」，不要寫「冪次方」，後者的意義不同。 :notes: jserv 好，收到！ ::: β - 1 **說明上述程式碼的運作原理** 很明顯上述程式碼是個if else寫法，但使用了early return技巧，所以省去了else語法。在if statement裡的`(alignment & mask)` 是對2的冪取餘數的快速位元計算方法，舉個例子我們要用`alignment = 8(1000)` 對2取餘數，這時候mask應該是 `alignment - 1 = 8 - 1 = 7(0111)`，我們可以發現alignment和mask作AND位元運算，會得到`0（0000）`，因此我們可以判斷alignment是2的冪次方。如果alignment是2的冪，align_up就會`return (sz + mask) & ~mask`，這個運算過程 `sz + mask` 而不是 `sz + alignment`是為了避免在接下來和~mask的AND位元運算後得到0，又由於alignment是2的冪，所以和~mask進行AND位元運算後只會保留下一個alignment的倍數`，否則會`return (((sz + mask) / alignment) * alignment), 這個運算過程的 sz + mask 而不是 sz + alignment的情況是為了避免已經是alignment倍數的sz進位到下一個alignment的倍數，接下來整除alignment會把小數都去掉(floor)，最後再乘以alignment就會得到下一個alignment的倍數`。 β - 2 **在 Linux 核心原始程式碼找出類似 align_up 的程式碼，並舉例說明其用法** https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/arch/powerpc/boot/page.h https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/tools/testing/selftests/net/tcp_mmap.c 會再補上詳細例子和說明 ## 測驗 γ ```c= #include <stdio.h> #include <unistd.h> int main(void) { for (int i = 0; i < NNN; i++) { fork(); printf("-"); } fflush(stdout); return 0; } ``` γ - 1 **解釋上述程式碼輸出 - 字元數量的原理** 首先，要得知 `-` 字元數量之前，我們先要來了解fork對child stdio buffer和parent stdio buffer的行為關係根據man page描述: https://man7.org/linux/man-pages/man2/fork.2.html :::warning 文字訊息不要用圖片展現，考量點: 1. 不易檢索和編輯; 2. 對視覺障礙的朋友不友善。Linux 核心開發者其中幾位是高度近視和全盲者，請考慮他們的存在。 :notes: jserv 好，收到！ ::: 我們得知fork出來的child的memory和parent的memory內容幾乎一致（原文：At the time of fork() both memory spaces have the same content），除了打 `*` 的項目是例外，但 `*` 的項目裡沒提到stdio buffer的例外，所以child的stdio buffer應該是會繼承parent的stdio buffer。意思是如果parent的stdio buffer有n個 `-` 字元的話，在fork產生的child的stdio buffer也會有n個 `-` 字元。再來，了解fork的行為還不夠，我們還需要考慮到buffer模式的問題。buffer模式有分為3種，分別是no buffer模式，fully buffer模式，line buffer模式。這支程式使用的printf到stdout預設是使用line buffer，而line buffer的行為是在遇到 `\n` 字元或buffer滿了之前都不會flush，這也是為什麼最後要使用fflush函式強制flush。有了以上背景知識，我們就能來開始理解這支程式的行為和 `-` 字元的數量了。由於，fork是在迴圈內使用，所以每經過一次迭代，fork就會產生 n 的child，n 是上一次迭代時process的總數量。舉個例子，例如NNN是3，第1次迭代之前process的總數是1（parent process），所以n = 1，buffer內容為空。經過第一次迭代後做了第1次fork，產生了1個process，這2個process（一開始的parent process和1個child process）的stdio buffer內容皆為 `-` 。第2次迭代之前process的總數是2，所以n = 2，經過第2次迭代後做了第1次fork，產生了2個process(上一次迭代的2個process使用了fork)，這4個process(上一次迭代的2個process加上新的2個child process)的stdio buffer內容皆為 `--`。第3次迭代之前process的總數是4，所以n = 4，經過第3次迭代後做了第3次fork，產生了 4個process(上一次迭代的4個process使用了fork), 這8個process(上一次迭代的4個process加上新的4個child process)的stdio buffer內容皆為 `---`。最後所有process的stdio buffer flush到stdout，數量應是24。從此例中，我們可以找到一些規律，第`i`次迴圈迭代`-` 的數量是 2n * (i + 1), i = {1, 2, ...}， n是第`i - 1`次迴圈迭代結束時process的數量。我們把此規律再應用到剛剛的例子中做驗證，第3次迴圈(i = 2)迭代`-`的數量應是 2(n) * (i + 1)，n = 4，i = 2，2(4) * (2 + 1) = 24。找到規律後，我們就可以寫一個小程式來幫助我們找到求某個 `-` 數量需要的迴圈上限。 ```c= #include <stdio.h> #include <unistd.h> #include <assert.h> #define BYTE 49152 void get_iteration_lower_bound(int process_cnt, int byte, int i){ assert(byte != 0); if(byte >= BYTE){ printf("iteration_lower_bound: %d\n", i); return; } get_iteration_lower_bound((process_cnt << 1), (process_cnt << 1) * (i + 1), i + 1); } int main(void) { int initial_byte = 1; int initial_process_cnt = 1; int i = 0; get_iteration_lower_bound(initial_process_cnt, initial_byte, i); return 0; } ``` 最後得出若要得到49152數量的`-`字元數量，我們需要至少12次迴圈。因此，NNN = 12。 ## 測驗 ζ ```c= #define _GNU_SOURCE #include <arpa/inet.h> #include <errno.h> #include <fcntl.h> #include <netinet/in.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/poll.h> #include <sys/socket.h> #include <unistd.h> static int connect_to(char *ip, int port) { int sockfd = socket(AF_INET, SOCK_STREAM, 0); if (sockfd < 0) { printf("Failed to create socket.\n"); return -1; } struct sockaddr_in addr; memset(&addr, 0, sizeof(addr)); addr.sin_family = AF_INET; addr.sin_port = htons(port); if (inet_pton(AF_INET, ip, &addr.sin_addr) <= 0) { perror("inet_pton"); return -1; } if (connect(sockfd, (struct sockaddr *) &addr, sizeof(addr)) < 0) { printf("Failed to connect.\n"); return -1; } return sockfd; } static void move(int in_fd, int out_fd, int pip[2]) { int ret = splice(in_fd, NULL, pip[1], NULL, 8, SPLICE_F_MOVE); if (ret == -1) { perror("splice(1)"); return; } ret = splice(pip[0], NULL, out_fd, NULL, 8, SPLICE_F_MOVE); if (ret == -1) { perror("splice(1)"); return; } } static void proxy(int cl_fd, int target_fd) { if (target_fd == -1) return; int fds[2]; if (pipe(fds) == -1) { perror("pipe"); return; } struct pollfd polls[2] = { [1] = {target_fd, POLLIN}, [0] = {cl_fd, POLLIN}, }; int ret; while ((ret = poll(polls, 2, 1000)) != -1) { if (ret == 0) continue; int from_client = polls[0].revents & POLLIN; if (from_client) move(cl_fd, target_fd, fds); else move(target_fd, cl_fd, fds); } perror("poll"); } #define PORT 1922 int main(int argc, char *argv[]) { if (argc < 3) { fprintf(stderr, "Usage: %s <target IP address> <target port>\n", argv[0]); return -1; } int listenfd = socket(AF_INET, SOCK_STREAM, 0); struct sockaddr_in addr; memset(&addr, 0, sizeof(addr)); addr.sin_family = AF_INET; addr.sin_addr.s_addr = htonl(INADDR_ANY); addr.sin_port = htons(PORT); int optval = 1; setsockopt(listenfd, SOL_SOCKET, SO_REUSEPORT, &optval, sizeof(optval)); bind(listenfd, (struct sockaddr *) &addr, sizeof(addr)); listen(listenfd, 1); while (1) { int connfd = accept(listenfd, (struct sockaddr *) NULL, NULL); int target_fd = connect_to(argv[1], atoi(argv[2])); if (target_fd >= 0) proxy(connfd, target_fd); close(connfd); } return 0; /* should not reach here */ } ``` ζ - 1 **解釋上述程式碼運作原理** 顧程式之名(proxy)思義，為一代理者。這個代理者程式會代理localhost去連線參數指定的ip和port number。例如 ./proxy 216.58.200.36 80，表示proxy代理 localhost 去建立TCP連線到 `www.google.com port 80`。proxy預設上是在port 1922 listen。從建立的socket參數2（SOCK_STREAM）得知此連線為TCP連線。由於每台電腦的host byte不一樣，有些是big endian，有些則是little endian，所以我們還需要搭配hton系列macro把host byte轉成network byte，這樣網路上所有的電腦都能統一理解對方。舉個例子，如果使用telnet來連線localhost port 1922的話，實際上proxy會幫telnet代理連線到參數所指定的server。在測試的時候是使用`www.google.com`，所以telnet這時候實際上是連線到`www.google.com`。proxy使用了multiplexing I/O, poll system call來監視註冊的fd是否ready。如果ready的話，才用slpice去實作zero-copy（使用splice的好處是不需要像sendfile那樣需要硬體支援），直接從kernel buffer搬動資料到socket buffer，而不用經過user buffer。那問題來了，註冊被監視的fd什麼時候會ready？當poll system call會 > 0的值時（ready fd的總數）則表示有fd ready了。從from_client，我們又得知當polls[0]的revents和POLLIN event bitmask的作&運算為1的時候，則表示client的socket buffer有資料了。例如telnet發送 GET /index.html 請求，會使poll回傳1，使from_client為1，然後會把GET /index.html請求會從cl_fd轉移到target_fd，讓target_fd去發送這個GET請求。之後當`www.google.com` 發回response的時候，會使poll回傳1，使from_client為0，然後response的資料會從target_fd轉移到cl_fd。這時候，我們就會看到telnet的terminal出現`www.google.com`的response。 ζ - 2 **以 epoll 系統呼叫改寫程式碼，並設計實驗來驗證 proxy 程式碼的效率** ```c= #define _GNU_SOURCE #include <arpa/inet.h> #include <errno.h> #include <fcntl.h> #include <netinet/in.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/epoll.h> #include <sys/socket.h> #include <unistd.h> static int connect_to(char *ip, int port) { int sockfd = socket(AF_INET, SOCK_STREAM, 0); if (sockfd < 0) { printf("Failed to create socket.\n"); return -1; } struct sockaddr_in addr; memset(&addr, 0, sizeof(addr)); addr.sin_family = AF_INET; addr.sin_port = htons(port); if (inet_pton(AF_INET, ip, &addr.sin_addr) <= 0) { perror("inet_pton"); return -1; } if (connect(sockfd, (struct sockaddr *) &addr, sizeof(addr)) < 0) { printf("Failed to connect.\n"); return -1; } return sockfd; } static void move(int in_fd, int out_fd, int pip[2]) { int ret = splice(in_fd, NULL, pip[1], NULL, 8, SPLICE_F_MOVE); if (ret == -1) { perror("splice(1)"); return; } ret = splice(pip[0], NULL, out_fd, NULL, 8, SPLICE_F_MOVE); if (ret == -1) { perror("splice(1)"); return; } } #define MAX_EVENTS 2 static void proxy(int cl_fd, int target_fd) { if (target_fd == -1) return; int fds[2]; if (pipe(fds) == -1) { perror("pipe"); return; } /* one for target_fd, one for cl_fd but on Linux, the parameter is ignored and should be greater or equal to 0 */ int epfd = epoll_create(MAX_EVENTS); if(epfd == -1){ perror("epoll_create"); exit(1); } if(epoll_ctl(epfd, EPOLL_CTL_ADD, target_fd, &(struct epoll_event){EPOLLIN, {.fd = target_fd}}) == -1){ perror("epoll_ctl 1"); exit(1); } if(epoll_ctl(epfd, EPOLL_CTL_ADD, cl_fd, &(struct epoll_event){EPOLLIN, {.fd = cl_fd}}) == -1){ perror("epoll_ctl 2"); exit(1); } struct epoll_event evlist[MAX_EVENTS]; int ret; while ((ret = epoll_wait(epfd, evlist, MAX_EVENTS, 1000)) != -1) { if (ret == 0) continue; int from_client = (evlist[i].events & EPOLLIN) && (evlist[i].data.fd == cl_fd); if (from_client) move(cl_fd, target_fd, fds); else move(target_fd, cl_fd, fds); } close(epfd); } #define PORT 1922 int main(int argc, char *argv[]) { if (argc < 3) { fprintf(stderr, "Usage: %s <target IP address> <target port>\n", argv[0]); return -1; } int listenfd = socket(AF_INET, SOCK_STREAM, 0); struct sockaddr_in addr; memset(&addr, 0, sizeof(addr)); addr.sin_family = AF_INET; addr.sin_addr.s_addr = htonl(INADDR_ANY); addr.sin_port = htons(PORT); int optval = 1; setsockopt(listenfd, SOL_SOCKET, SO_REUSEPORT, &optval, sizeof(optval)); bind(listenfd, (struct sockaddr *) &addr, sizeof(addr)); listen(listenfd, 1); while (1) { int connfd = accept(listenfd, (struct sockaddr *) NULL, NULL); int target_fd = connect_to(argv[1], atoi(argv[2])); if (target_fd >= 0) proxy(connfd, target_fd); close(connfd); } return 0; /* should not reach here */ } ``` **驗證proxy程式碼效率實驗** 原本的程式碼只能處理一個連線，所以我打算用一個process負責一個連線，這樣就能接受多個連線，因為poll和epoll的效率差別是體現在處理多個fd（這裡是socket fd）的時候。因此，實驗大致如下，main process負責接收新的連線, 同時維護所有的fd。當某個fd觸發POLLIN或EPOLLIN的時候讓對應的thread去做資料的搬動。效率的好壞會根據資料搬動前後的時間差來判斷。由於，fork的成本太高了（雖然是copy-on-write，但幾乎都會write），我打算用thread來代替process。以上2種方式在處理多個連線的成本都過高（每次有的連線，都要和kernel要求資源），所以我打算引入server pool。由於server pool使用到fork，成本又過高，我打算引入thread pool。以下為修改後使用poll的程式碼：以下為修改後使用epoll的程式碼：會再補上程式碼和說明