Concurrent Optimizing Brainfuck JIT compiler

# Concurrent Optimizing Brainfuck JIT compiler contributed by <`kobe0308`>, <`youchihwang`>, <`vjux`>, <`mingnus`> ###### tags: `kobeyu` [github](https://kobeyu@bitbucket.org/kobeyu/bukkake.git) ## 作業目標 - [ ]研究concurrent Brainfuck JIT Compiler ## 工作項目 - [ ]分析各種不同的實現方式 - [Bukkake](https://bitbucket.org/wjmelements/bukkake): A parallel brainfuck JIT in C - [Brainfuck Process Extensions](http://www.kjkoster.org/BFPX/Brainfuck_Process_Extensions.html) - [Parallel Brainfuck](https://github.com/cmdli/parallel-brainfuck) - [Concurrent Brainfuck](http://www.schabi.de/cbf/) - [ ]擴充concurrent語法 - [ ]語言規格書 - [ ]功能驗證 - [ ]效能分析 - 參考 [rdebath/Brainfuck](https://github.com/rdebath/Brainfuck) 的成果 --- >> 1. 中英關鍵字間請以空白隔開 >> 2. 請補上 github 連結 >>[name=課程助教][color=red] >已修正 [name=Kobe Yu] ### Bukkake實作分析此實作是參考 cmdli 的版本,在 Repository 中有一份[投影片簡報](https://docs.google.com/presentation/d/1Y8uwsh_5bmUASmyRE4i3S4fL5OvA2Q92edivIXpsEaM/edit),摘要如下: **1. 觀察 helloworld.b 的程式發現** * 迴圈內都是加法運算 * 都是依序執行 * Withholds treats from puppies ?? * 可以被平行化! 以 HelloWorld.b 為例,在迴圈內是針對不同的 cell 做運算,對於資料的存取是互相獨立的,所以迴圈內的操作可以完全的平行化. >> "withholds treats from puppies" 意思是不給小狗獎勵(食物之類), 引申意思可能是沒有給電腦足以發揮其性能的程式?? [name=mingnus] **2. concurrent 的設計** 每一行都是一個 program,就像 thread_create 所要帶入的 function. ``` +++++ +++++ initialize counter (cell #0) to 10 [ > +++++ ++ add 7 to cell #1 > +++++ +++++ add 10 to cell #2 > +++ add 3 to cell #3 > + add 1 to cell #4 <<<< - decrement counter (cell #0) ] > ++ . print 'H' > + . print 'e' +++++ ++ . print 'l' . print 'l' +++ . print 'o' > ++ . print ' ' << +++++ +++++ +++++ . print 'W' > . print 'o' +++ . print 'r' ----- - . print 'l' ----- --- . print 'd' > + . print '!' > . print '\n' ``` 平行化後的 helloworld.b 1. thread#1 ~ thread#4 初始化 cell[1] ~ cell[4] ({70, 100, 30, 10}) 2. thread#0 印出 "Hello " 3. thread#5 印出 "W" 4. thread#0 印出 "orld!\n" ``` +++++*-[**********-] |>++H.|>+e.+++++ ++l.l.+++o.>++_.<|W|o.+++r.----- -l. ----- ---d.>+!.>\n. > +++++ ++ 70| >> +++++ +++++ 100| >>> +++ 30| >>>> + 10| |>+++++ +++++ +++++ |.| ``` * 回到 bukkake 的版本 * 在程式中每一行代表一個 thread ,要注意的是這邊所有的記憶體都是共享的(shared memory),所以無論哪個 thread 在執行時都可以存取 tape 的內容. * 呼叫 thread 的方式是透過 `*` 這個操作字元,當執行到 `*` 時,啟動一個 thread 執行第(*ptr)行的指令. * 舉例: 印出四個字元 `!"#!`,其中兩個!是在第一個 thread (第一行)印出,第二個 thread (第二行)印出 `"`,而第三個 thread (第三行)印出 `#`,程式碼如下: ``` +++++ [>++++++<-]> +++.<+* +* |>--. \n >+. | >+. | ``` #### Concurrent 擴充語法 - __\n__ : _return_; exit thread and denote beginning of another function - __\*__ : _relative spawn_; spawn a thread running the line number that is the sum of your line number and the current byte - __|__ : _barrier_; block until all threads with a barrier in this column are at said **column** 舉例: 正確的同步(barrier) +++.<+* +* | \n +++. | 可能產生非預期結果的語法 +++.<+* +* | \n +++.| ### 解析程式碼 Bukkake JIT 是跟 OS 申請一塊可執行的記憶體,然後根據每一個 bf 的指令轉為 x86 組合語言,並寫入記憶體中,等到所有的 bf 程式碼都轉換完成後就從記憶體的第一個位址開始執行,所以沒有 IR 的設計. 關於 concurrent 的原理目前看到有一個 thread pool ,至於怎麼派發工作與其他細節尚待解析. #### bukkake.c 應用程式的入口 (main) 的所在,一開始會透過 mmap 將檔案到映射虛擬記憶體空間,使用回傳的位址對檔案進行讀寫操作. 在開始解析程式之前會先計算以下參數:程式的總行數,最長的列寬度以及在第一行 (main thread) 是否有 concurrent ('*')的指令,接著會宣告四塊記憶體空間： * barriers = calloc(longest_row, sizeof(struct barrier*)); * line_barriers = (unsigned long**) calloc(num_rows, sizeof(unsigned long*)); * line_barrier_indices = (unsigned long*) calloc(num_rows, sizeof(unsigned long)); * lines = calloc(num_rows + 1, sizeof(func)); >> ` src/barrier.c` 的 Line #53 是 `sched_yield();`，要注意到後者在 Linux 2.6.23 所引入的 [CFS](https://en.wikipedia.org/wiki/Completely_Fair_Scheduler) 已經棄置，請改用其他方式 [name=jserv] >> todo >> 1.理解 sched_yield（） >> 2.CFS 原理 >> 3.引進 CFS 對於 sched_yield() 的影響 >> 4.更好的實作方式與評估效益. >> [ref](http://blog.csdn.net/magod/article/details/7265555) >> [name=kobeyu] #### runtime.c 會使用到 jit.c 跟 static.c ,往下trace code. ##### Machine code 根據 Intel 64 and IA-32 Architectures Software Developer’s Manual --- ``` // push rcx jit(0x51); ``` ![](https://i.imgur.com/3GeDYIt.png) ![](https://i.imgur.com/AXR8PCR.png) ![](https://i.imgur.com/e9hNE3H.png) >source code 中有一段程式碼 push rcx 需要將其編為機械碼，從 intel spec 中得知 rcx 為 64 位元的 register，而push 的 Opcode 是 50+rd， +rd 是用來編碼 register operand，那 +rd 要填什麼值呢？第二個 table 查知RCX 的 Reg Field 要填為 1，機械碼編為 51。 >而 Op/En 的 M、 O、 I, 64-Bit Mode 的 Valid、N.E、 Compat/Leg Mode， REX.B 這些欄位還不知道其作用。[name=YouChih Wang] --- ``` // push r8 jit2(0x41, 0x50); ``` ![](https://i.imgur.com/RAnAWCq.png) ![](https://i.imgur.com/AXR8PCR.png) ![](https://i.imgur.com/SV2RmOq.png) ![](https://i.imgur.com/pQMT9Gy.png) ![](https://i.imgur.com/INpdR9t.png) >source code 中有一段程式碼 push r8 需要將其編為機械碼，從 intel spec 中得知 r8 為 64 位元的 register，而push 的 Opcode 是 50+rd， +rd 是用來編碼 register operand，那 +rd 要填什麼值呢？第二個 table 查知RCX 的 Reg Field 要填為 0，機械碼編為 50。但是為什麼要多編 0x41 呢？從 table3-1 得知 R8 的 REX.B 是 yes，所以需要加入 RegPrefix 的指令，再看 table 2-4，REX.B 填 1，所以 REG Prefix 編碼為 0x41，最後整個 push r8 編碼為 0x41 0x50。[name=YouChih Wang] --- ``` // mov bu_putchar, %rax jit2(0x48, 0xb8); uint64_t loc = (uint64_t) bu_putchar; jit64(loc); ``` ![](https://i.imgur.com/dcWz9Cf.png) >看到 maker 在 .c 加了組合語言的註解 ```mov bu_putchar, %rax```，看不懂 % 這個符號是代表什麼，查了才知道，assembly language syntax 主要有分二個 branches，一個是 AT&T syntax，另一個是 intel syntax，可以從存取 register 的符號區分兩者。 >AT&T syntax 存取 register 需加 %，mov 指令的 destination 在後方，使用常數需加```$```； >而 intel syntax 存取 register 不需加 %，mov 指令的 destination 在前方，使用常數不需加 ```$```。 > 而 GNU as 可使用.intel_syntax selects Intel mode, .att_syntax兩種設定來選擇 AT&T syntax 或 intel syntax >[name=YouChih Wang] ![](https://i.imgur.com/gN1S6XF.png) >rd io 的 io 是什麼？目前尚未找到定義。[name=YouChih Wang] ![](https://i.imgur.com/ZV4TpCu.png) >找到 io 的定義了，原來是在 opcode 後接 8 byte immediate，也就是 64 bits 立即值，在這裡指的是 funtion address。[name=YouChih Wang] ![](https://i.imgur.com/ajK9Kkd.png) >藉由 table 3-1 可得知 rd 需填 0，所以 Opcode 部分就是 0xB8。（在未查到 rd io 定義前。） ![](https://i.imgur.com/RYJktpX.png) >REX.W 需設為 1。所以需加 prefix instruction 0x48。 >整個編碼就是 0x48 0xb8。[name=YouChih Wang] ``` uint64_t loc = (uint64_t) bu_putchar; jit64(loc); ``` >這段是將 bu_putchar 的 address 取出來。 ![](https://i.imgur.com/ybF1jji.png) >使用 ```objdump -d bukkake```命令，可以 disassemble objfile，查看 bu_putchar 的 address 為 0...04017c0， > ![](https://i.imgur.com/MWdjQrI.png) ``` // mov bu_putchar, %rax jit2(0x48, 0xb8); uint64_t loc = (uint64_t) bu_putchar; jit64(loc); ``` >以上這段程式碼是在 bu_put() 裡，使用 ```objdump -d bukkake```命令 disassemble objfile，找尋 bu_put（），看到對應的位置顯示```$0x4017c0```，這就是 bu_putchar 的 address，並且使用 jit64() 將這 64 bit 的 address 放入執行檔。 --- #### System V Application Binary Interface x86-64TM Architecture Processor Supplement Draft Version 0.21 [3.2.2 The Stack Frame](https://refspecs.linuxbase.org/elf/x86_64-abi-0.21.pdf) In addition to registers, each function has a frame on the run-time stack. >每個 function 除了可以使用 register 外，在 run-time 時還可以使用 stack。[name=YouChih Wang] This stack grows downwards from high addresses. >從以下的 test.c line 2 , pr_2 傳遞參數到 pr_3，line 35，line 36，pr_3() 接收來自 pr_2 ()的參數，再對應到 test.s pr_3() line 13，14 即可驗證 stack 是從高位址往低位址成長。 > ``` test.c 1 #include <stdio.h> 2 void pr_3(int a_3, int b_3) 3 { 4 } 5 6 void pr_2(int a_2, int b_2) 7 { 8 pr_3(a_2, b_2); 9 } 10 11 void pr_1(int a_1, int b_1) 12 { 13 pr_2(a_1, b_1); 14 } 15 16 int main () 17 { 18 int a=0; 19 int b=1; 20 21 pr_1(a, b); 22 23 return 0; 24 } ``` >執行 gcc -O0 -S test.c，生成 test.s > ``` test.s 1 .file "test.c" 2 .text 3 .globl pr_3 4 .type pr_3, @function 5 pr_3: 6 .LFB0: 7 .cfi_startproc 8 pushq %rbp 9 .cfi_def_cfa_offset 16 10 .cfi_offset 6, -16 11 movq %rsp, %rbp 12 .cfi_def_cfa_register 6 13 movl %edi, -4(%rbp) 14 movl %esi, -8(%rbp) 15 nop 16 popq %rbp 17 .cfi_def_cfa 7, 8 18 ret 19 .cfi_endproc 20 .LFE0: 21 .size pr_3, .-pr_3 22 .globl pr_2 23 .type pr_2, @function 24 pr_2: 25 .LFB1: 26 .cfi_startproc 27 pushq %rbp 28 .cfi_def_cfa_offset 16 29 .cfi_offset 6, -16 30 movq %rsp, %rbp 31 .cfi_def_cfa_register 6 32 subq $8, %rsp 33 movl %edi, -4(%rbp) 34 movl %esi, -8(%rbp) 35 movl -8(%rbp), %edx 36 movl -4(%rbp), %eax 37 movl %edx, %esi 38 movl %eax, %edi 39 call pr_3 40 nop 41 leave 42 .cfi_def_cfa 7, 8 43 ret 44 .cfi_endproc 45 .LFE1: 46 .size pr_2, .-pr_2 47 .globl pr_1 48 .type pr_1, @function 49 pr_1: 50 .LFB2: 51 .cfi_startproc 52 pushq %rbp 53 .cfi_def_cfa_offset 16 54 .cfi_offset 6, -16 55 movq %rsp, %rbp 56 .cfi_def_cfa_register 6 57 subq $8, %rsp 58 movl %edi, -4(%rbp) 59 movl %esi, -8(%rbp) 60 movl -8(%rbp), %edx 61 movl -4(%rbp), %eax 62 movl %edx, %esi 63 movl %eax, %edi 64 call pr_2 65 nop 66 leave 67 .cfi_def_cfa 7, 8 68 ret 69 .cfi_endproc 70 .LFE2: 71 .size pr_1, .-pr_1 72 .globl main 73 .type main, @function 74 main: 75 .LFB3: 76 .cfi_startproc 77 pushq %rbp 78 .cfi_def_cfa_offset 16 79 .cfi_offset 6, -16 80 movq %rsp, %rbp 81 .cfi_def_cfa_register 6 82 subq $16, %rsp 83 movl $0, -8(%rbp) 84 movl $1, -4(%rbp) 85 movl -4(%rbp), %edx 86 movl -8(%rbp), %eax 87 movl %edx, %esi 88 movl %eax, %edi 89 call pr_1 90 movl $0, %eax 91 leave 92 .cfi_def_cfa 7, 8 93 ret 94 .cfi_endproc 95 .LFE3: 96 .size main, .-main 97 .ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609" 98 .section .note.GNU-stack,"",@progbits ``` Figure 3.3 shows the stack organization. The end of the input argument area shall be aligned on a 16 byte boundary. >解釋 parameter 和 argument 差別 >1：parameter 可以想成是投手準備要投球過來，投 1、2、3、...顆，但還沒投的那些球，argument 可以想成是投手真正投出去的球。 >2：也可參考 stackoverflow [What's the difference between an argument and a parameter](http://stackoverflow.com/questions/156767/whats-the-difference-between-an-argument-and-a-parameter) >3：也可參考 c11 spec 關於 argument 的部分，3.3 有一段描述 expression in the comma-separated list bounded by the parentheses in a function call，argument 是在 function call 中由（）所包括起來的 expression，有一點感覺了嗎？若沒有，6.5.17 還有一段，In the function call f(a, (t=3, t+2), c) the function has three arguments, the second of which has the value 5.，這個 function 有 3 個 arguments，第 2 個 argument 是 5，這就是真正投出去的球。 >>3.3 1 argument actual argument actual parameter (deprecated) expression in the comma-separated list bounded by the parentheses in a function call expression, or a sequence of preprocessing tokens in the comma-separated list bounded by the parentheses in a function-like macro invocation >>6.5.17 Comma operator 3 EXAMPLE As indicated by the syntax, the comma operator (as described in this subclause) cannot appear in contexts where a comma is used to separate items in a list (such as arguments to functions or lists of initializers). On the other hand, it can be used within a parenthesized expression or within the second expression of a conditional operator in such contexts. In the function call f(a, (t=3, t+2), c) the function has three arguments, the second of which has the value 5. >4：也可參考 c11 spec 關於 parameter 的部分，3.16 有一段描述，object declared as part of a function declaration，parameter是屬於 function 宣告的一部分，這就是投手準備要投球過來，投 1、2、3、...顆，但還沒投的那些球。 >>3.16 parameter formal parameter formal argument (deprecated) object declared as part of a function declaration or definition that acquires a value on entry to the function, or an identifier from the comma-separated list bounded by the parentheses immediately following the macro name in a function-like macro definition >[name=YouChih Wang] In other words, the value (%rsp − 8) is always a multiple of 16 when control is transferred to the function entry point. The stack pointer, %rsp, always points to the end of the latest allocated stack frame. 7 The 128-byte area beyond the location pointed to by %rsp is considered to be reserved and shall not be modified by signal or interrupt handlers.8 Therefore, functions may use this area for temporary data that is not needed across function calls. In particular, leaf functions may use this area for their entire stack frame, rather than adjusting the stack pointer in the prologue and epilogue. >以上部分，還不懂，需再回來看。[name=YouChih Wang] System V Application Binary Interface x86-64TM Architecture Processor Supplement Draft Version 0.21 --- [3.2.3 Parameter Passing](https://refspecs.linuxbase.org/elf/x86_64-abi-0.21.pdf) The conventional use of %rbp as a frame pointer for the stack frame may be avoided by using %rsp (the stack pointer) to index into the stack frame. This technique saves two instructions in the prologue and epilogue and makes one additional general-purpose register (%rbp) available. >%rbp：frame (base) pointer (%rbp) >%rsp：stack pointer that point top of stack > caller function 會有 local stack frame，由 %rbp 指向 stack frame 起點，由 %rsp 指向最 top 的 stack，當 caller 呼叫 callee 時，caller 的 %rsp 需將值指定給 callee 的 %rbp，當成是 callee 的 stack frame，從上方的 test.s 的 pr_3() line 8 line 11 可得知，以上是描述 call flow， > 那 return flow 呢？可藉由 pop %rbp 將 caller 的 return address pop 出，下方使用 gdb 利用 %rbp 來印出 caller >> ``` (gdb) b pr_3 Breakpoint 1 at 0x4004e0: file test.c, line 4. (gdb) r Starting program: /home/xxx/Jserv/test/test Breakpoint 1, pr_3 (a_3=0, b_3=1) at test.c:4 4 } (gdb) backtrace #0 pr_3 (a_3=0, b_3=1) at test.c:4 #1 0x0000000000400500 in pr_2 (a_2=0, b_2=1) at test.c:8 #2 0x0000000000400520 in pr_1 (a_1=0, b_1=1) at test.c:13 #3 0x0000000000400548 in main () at test.c:21 ``` --- futex：類似 mutex，但不同的是 futex 是在 userspace 執行，少了進出 kernel space，所以 overhead 會比較小。 --- #### jit.c 卡關中... 卡在組合語言的部分: ``` // get rcx from param // mov %rdi, %rcx jit3(0x48, 0x89, 0xf9); ``` * 問題定義: 根據[AMD64 ABI規格書](https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf)P24的描述,在函式呼叫時參數傳遞是透過暫存器來完成的,第一個 argument 使用 %rdi ,第二個是 %rsi ,第三個是 %rdx ,第四個是 %rcx .... ![](https://i.imgur.com/z6HB9pP.png) >為什麼需要 calling conventions 機制呢？舉例：Android 手機中有一個 java virtual machine 名為 Dalvik，上面跑的都是 java code，若需要執行 C++ 程式那怎麼辦呢？ Java 會透過名為 java native method 的 interface 去呼叫 C++ 程式，而參數要如何傳遞呢？這就要兩方都遵守 ABI interface 傳參數的規範，才能正確取得參數，請看下面的 caller 和 callee 參數放置的順序。[name=youchihwang] >> 下方的 Java 程式碼沒有列出的必要，說明用的程式碼應該儘量精簡且具體。Dalvik 的例子真要探討的話，已經超出 calling convention，請見: [JNI Tips](https://developer.android.com/training/articles/perf-jni.html) [name=jserv] <s> 下方的檔案 [Android 7.0.0_r1 - Nougat/frameworks/base/services/core/java/com/android/server/lights/LightsService.javaLightsService.java](http://androidxref.com/7.0.0_r1/xref/frameworks/base/services/core/java/com/android/server/lights/LightsService.java) line 136 使用 JNI setLight_native（）interface。 [Android 7.0.0_r1 - Nougat /frameworks/base/services/core/jni/com_android_server_lights_LightsService.cpp](http://androidxref.com/7.0.0_r1/xref/frameworks/base/services/core/jni/com_android_server_lights_LightsService.cpp) Line 106 就是 JNI setLight_native 的實作 [Android 7.0.0_r1 - Nougat/hardware/qcom/display/msm8084/liblight/lights.c](http://androidxref.com/7.0.0_r1/xref/hardware/qcom/display/msm8084/liblight/lights.c) Line 260 ~ 269 是指定一個低層的function 給上層使用。 </s> 仍有疑惑的地方是： 1.呼叫這個 asm 函式(caller)的地方在哪裡 2.傳入幾個參數以及這些參數的用途. 根據 [Wikipedia](https://en.wikipedia.org/wiki/X86_calling_conventions) 的資料,Windows 的參數傳遞依序是透過 RCX/RDX/R8/R9...但這不會是在 Windows 上開發的(non-POSIX). >> 把第一參數 rdi 存到 rcx , 不過光看這行還是不知道上下文做了什麼XD [name=mingnus] >> 你應該寫個小程式，然後用 `gcc -S` 和對照 `objdump -d` 的輸出，確認 function call 由哪些機械碼指令完成 [name=jserv] >> 這時候非常需要 GDB，請趕快學習，包含 GDB Python integration [name=jserv] >> todo: >> 寫一個小程式,除了 main 之外還有一個四個參數 (int) 的 function ,比對 asm/objdump/gdb 參數傳遞的過程,並驗證是否符合abi規格書中所描述方式進行. [name=kobeyu] 驗證參數傳遞是依序透過 di/si/dx/cx 四個暫存器傳遞, ```C=1 void func(int a, int b, int c, int d) { a++; b++; c++; d++; printf("abcdK:%d,%d,%d,%d\n",a,b,c,d); } int main(void) { func(1,2,3,4); return 0; } ``` $ gcc -S callee.c 以下是 .s 的內容可以看到第 17~20 行是暫存器 di/si/dx/cx 依序把傳進來的參數搬到 %rbp 對應的記憶體位置 ``` = 1 .file "callee.c" .section .rodata .LC0: .string "abcdK:%d,%d,%d,%d\n" .text .globl func .type func, @function func: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 subq $16, %rsp movl %edi, -4(%rbp) movl %esi, -8(%rbp) movl %edx, -12(%rbp) movl %ecx, -16(%rbp) movl -16(%rbp), %esi movl -12(%rbp), %ecx movl -8(%rbp), %edx movl -4(%rbp), %eax movl %esi, %r8d movl %eax, %esi movl $.LC0, %edi movl $0, %eax call printf leave .cfi_def_cfa 7, 8 ret .cfi_endproc .LFE0: .size func, .-func .globl main .type main, @function main: .LFB1: #include <stdio.h> ``` --- $objdump -d a.out 以下是 objdump 的內容,可以看到第 6~9 行依序是暫存器 di/si/dx/cx 把傳進來的參數搬到 %rbp 對應的記憶體位置 ```=1 ... 000000000040052d <func>: 40052d: 55 push %rbp 40052e: 48 89 e5 mov %rsp,%rbp 400531: 48 83 ec 10 sub $0x10,%rsp 400535: 89 7d fc mov %edi,-0x4(%rbp) 400538: 89 75 f8 mov %esi,-0x8(%rbp) 40053b: 89 55 f4 mov %edx,-0xc(%rbp) 40053e: 89 4d f0 mov %ecx,-0x10(%rbp) 400541: 83 45 fc 01 addl $0x1,-0x4(%rbp) 400545: 83 45 f8 01 addl $0x1,-0x8(%rbp) 400549: 83 45 f4 01 addl $0x1,-0xc(%rbp) 40054d: 83 45 f0 01 addl $0x1,-0x10(%rbp) 400551: 8b 75 f0 mov -0x10(%rbp),%esi 400554: 8b 4d f4 mov -0xc(%rbp),%ecx 400557: 8b 55 f8 mov -0x8(%rbp),%edx 40055a: 8b 45 fc mov -0x4(%rbp),%eax 40055d: 41 89 f0 mov %esi,%r8d 400560: 89 c6 mov %eax,%esi 400562: bf 24 06 40 00 mov $0x400624,%edi 400567: b8 00 00 00 00 mov $0x0,%eax 40056c: e8 9f fe ff ff callq 400410 <printf@plt> 400571: c9 leaveq 400572: c3 retq 0000000000400573 <main>: 400573: 55 push %rbp 400574: 48 89 e5 mov %rsp,%rbp 400577: b9 04 00 00 00 mov $0x4,%ecx 40057c: ba 03 00 00 00 mov $0x3,%edx 400581: be 02 00 00 00 mov $0x2,%esi 400586: bf 01 00 00 00 mov $0x1,%edi 40058b: e8 9d ff ff ff callq 40052d <func> 400590: b8 00 00 00 00 mov $0x0,%eax 400595: 5d pop %rbp 400596: c3 retq 400597: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 40059e: 00 00 ... ``` --- * GDB 驗證也是一樣依序透過 di/si/dx/cx 傳遞參數. ``` Breakpoint 1, func (a=1, b=2, c=3, d=4) at callee.c:6 6 a++; b++; c++; d++; (gdb) info register rax 0x400573 4195699 rbx 0x0 0 rcx 0x4 4 rdx 0x3 3 rsi 0x2 2 rdi 0x1 1 rbp 0x7fffffffe150 0x7fffffffe150 rsp 0x7fffffffe140 0x7fffffffe140 r8 0x7ffff7dd4e80 140737351863936 r9 0x7ffff7dea530 140737351951664 r10 0x7fffffffdff0 140737488347120 r11 0x7ffff7a36e50 140737348070992 r12 0x400440 4195392 r13 0x7fffffffe240 140737488347712 r14 0x0 0 r15 0x0 0 rip 0x400541 0x400541 <func+20> eflags 0x202 [ IF ] cs 0x33 51 ss 0x2b 43 ds 0x0 0 es 0x0 0 fs 0x0 0 gs 0x0 0 (gdb) ``` --- ## X86 MMU fault handling is turing complete [論文](https://www.usenix.org/system/files/conference/woot13/woot13-bangert.pdf) [簡報錄影](https://www.usenix.org/conference/woot13/workshop-program/presentation/bangert) **摘要**: 目前就我的理解是: 刻意地對記憶體操作,故意產生 page fault 的情況,然後讓某些 flag 發生規性性的變化,讓這些連續的間接操作與變化滿足圖靈完備,由於不是由指令直接控制這些 cpu 上的 flag ,所以可以聲稱不用透過任何的 cpu 指令就可以達到滿足圖靈完備. 這樣透過指令間接地操作機器在術語上稱為 "Weird Machine" - 參考資料 - [WiKi - weired machine](https://en.wikipedia.org/wiki/Weird_machine) - [The Weird Machine](https://tc.gtisc.gatech.edu/cs3210/2016/spring/l/tut05.pdf) - asm: movdbz - [Exploit Programming](http://langsec.org/papers/Bratus.pdf) - [Instruction-less computation](https://github.com/kristerw/instless_comp) --- 文件 Brainfuck Process Extensions ## Brainfuck Process Extensions 心得 --- ## Tritium Optimization Techniques 根據 esolangs.org 指出, Tritium 是目前最快的 bf interpreter, 特色包括各式深度最佳化手法 (-O0 ~ -Orun) 及多樣化的 backends (包括 profiler, 會追蹤每條 IR 的執行次數)。其最佳化方式大致如下: 輸入的 bf code 會被轉成 IR code, 稱之為 tree (實際上是 list of tree cells), interpreter 再對 IR 進行多次最佳化, 最後交由 backend 輸出。 ### Tritium IR code 一道 IR code 由`struct bfi`表示。我們可用 option `-vvv` 印出 IR code Sample: ``` T_ADD[0]:10, $0, prev 0, prof 1, @(1,1) T_WHL[0],id=1, $1, jmp $12, prof 1, @(1,11) ``` 以第一行為例, 其對應的 C-code 為`p[0] += 10` `T_ADD`: IR code `[0]`: IR 操作相對位址 (相對於目前 cell)。對應`bfi::offset` `:10`: IR operand, 相當於 RLE 處理後的累計次數。對應`bfi::count` `$0`: IR index label (用於`jmp`指令) `prof 1`: IR 執行次數 (profiler 開啟時才會輸出, option `-T`) `@(1,1)`: line 1, column 1 ### Tritium Optimisation Levels * `-m`: No optimisation * `-O0`: RLE, 並移除程式開頭不可能執行的 loop * function: `load_file()` * `-O1`: Only enable pointer motion optimisation. * Clear loop: 將`[-]`或`[+]`轉換為 value set, 對應的 IR code 為`T_SET` * function: `quick_scan()` * Merging pointer movements: 消除不必要的 pointer movement, 改用常數 offset 存取 cells。概念類似於 operation offset, 但是功能更強, 可以跨 loop 整併 pointer movement。整併後除了可減少指標運算, 也有助於 backend register allocation。 * example1: `>>>++<<<-` 轉換為 `p[3] += 2; p[0]--;` * example2: `>>>>[>>>++<<<-]` 轉換為 `while (p[4]) {p[7] += 2; p[4]--;}` * example3: `>>>>[>>>++<<<-]<<[>++<-]` 轉換為 `while (p[4]) {p[7] += 2; p[4]--;} while (p[2]) {p[3] += 2; p[2]--;}` * function: `pointer_scan()`。實作上是透過合併與 propagate `T_MOV` IR code 達成, 例如`>>>++<<<-`當中的`p+=3`及`p-=3`就會併入`T_ADD`, 如下所示 * pointer_regen **Pointer movement optimization example1** ```tritium-ir # example1: >>>++<<<- # initial IR code (RLE only) T_MOV 3 $0, prev 0, @(1,1) T_ADD[0]:2, $1, @(1,4) T_MOV -3 $2, @(1,6) T_ADD[0]:-1, $3, next 0, @(1,9) # temporary IR code T_ADD[3]:2, $1, prev 0, @(1,4) T_MOV 3 $0, @(1,1) T_MOV -3 $2, @(1,6) T_ADD[0]:-1, $3, next 0, @(1,9) # final IR code T_ADD[3]:2, $1, prev 0, @(1,4) T_ADD[0]:-1, $3, next 0, @(1,9) ``` **Pointer movement optimization example2** ```tritium-ir # example2: >>>>[>>>++<<<-] # initial IR code (RLE only) T_MOV 4 $0, prev 0, @(1,1) T_WHL[0],id=1, $1, jmp $6, @(1,5) T_MOV 3 $2, @(1,6) T_ADD[0]:2, $3, @(1,9) T_MOV -3 $4, @(1,11) T_ADD[0]:-1, $5, @(1,14) T_END[0] id=1, $6, next 0, jmp $1, @(1,15) # temporary IR code1 # 將 loop 之前的 pointer movement 移到 loop 尾端 T_WHL[4],id=1, $1, prev 0, jmp $6, @(1,5) T_MOV 4 $7, @(1,5) T_MOV 3 $2, @(1,6) T_ADD[0]:2, $3, @(1,9) T_MOV -3 $4, @(1,11) T_ADD[0]:-1, $5, @(1,14) T_MOV -4 $8, @(1,14) T_END[4] id=1, $6, jmp $1, @(1,15) T_MOV 4 $0, next 0, @(1,1) # temporary IR code2 # 中間最佳化步驟類似於 example1, 在此省略 T_WHL[4],id=1, $1, prev 0, jmp $6, @(1,5) T_ADD[7]:2, $3, @(1,9) T_ADD[4]:-1, $5, @(1,14) T_END[4] id=1, $6, jmp $1, @(1,15) T_MOV 4 $0, next 0, @(1,1) # final IR code # 最後一行`T_MOV`位於程式尾端, 故被移除 T_WHL[4],id=1, $1, prev 0, jmp $6, @(1,5) T_ADD[7]:2, $3, @(1,9) T_ADD[4]:-1, $5, @(1,14) T_END[4] id=1, $6, jmp $1, @(1,15) ``` * `-O2`: Allow a few simple optimisations. * `invariants_scan()`: 迴圈相關的最佳化。根據迴圈特性, 將其轉成更有效率的 IR * `classify_loop()`: 標記 loop 類型。只執行一次的迴圈標記為 `T_IF`, 執行次數未知的迴圈則標記為`T_CALC`, `T_MULT`, 或`T_CMULT` * `flatten_multiplier()`: 概念類似於 multiplication loop, 將`T_MULT`或`T_CMULT`迴圈內容轉為乘法 * `build_string_in_tree()`: 為了克服 brainfuck 沒有常數字串/字元的最佳化手法。做法是根據 cell[i] 當前數值, 將brainfuck output op `.` (`T_PRT`) 轉換為直接列印常數字元 (`T_CHR`)。例如`hello.b`可以被轉換為一連串`T_CHR`指令, 使其不再依賴 cell 內容列印字元 * `find_known_calc_state()`: 簡化`T_CALC`不變量相關操作, 例如 C-code `a = b; b = 0; b = a; a = 1;` 可簡化為`a = 1;`, 因為變數`b`的數值最後還是沒有改變。這類操作常見於 brainfuck, 因為 brainfuck 做 copy assignment 必然會使 rvalue (即 copy source) 歸零, 如果想保持該數值只能把它暫存到其他位址，因此造成了上述程式碼。(參見[上次作業](https://hackmd.io/s/r1pcJ0Dkx#optimization)) * `scan_one_node()`: 掃瞄 cell[i] 的所有操作並進行最佳化, 其中一個功能是簡化執行次數已知的迴圈 (`flatten_loop()`), 例如 C-code `for (i = 0; i < 10; i++) x++;`可簡化為`x += 10;` * 刪除程式尾端的`T_SET`, `T_ADD`, `T_CALC`指令 * 這些指令不影響程式輸出入, 形同 dead code, 故可刪除 * function: `trim_trailing_sets()` * 再次合併 pointer movement (if `-fpointer-rescan` is provided) >> 某些手法應該是既有的編譯器最佳化手法, 應該有對應的專有名詞 (回顧 compiler 課本後半段) [name=mingnus] **Example of trim_trailing_sets()** ```tritium-ir # hello.b ++++++++++[>+++++++>++++++++++>+++>+<<<<-] >++.>+.+++++++..+++.>++.<<+++++++++++++++. >.+++.------.--------.>+.>. # hello.b after invariants_scan() # trim_trailing_sets() 將刪除末端四個 T_SET 指令 T_CHR 'H', $15, prev 0, @(1,46) T_CHR 'e', $18, skip $15, @(1,49) T_CHR 'l', $20, skip $15, @(1,57) T_CHR 'l', $21, skip $15, @(1,58) T_CHR 'o', $23, skip $15, @(1,62) T_CHR ' ', $26, skip $15, @(1,66) T_CHR 'W', $29, skip $15, @(1,84) T_CHR 'o', $31, skip $15, @(1,86) T_CHR 'r', $33, skip $15, @(1,90) T_CHR 'l', $35, skip $15, @(1,97) T_CHR 'd', $37, skip $15, @(1,106) T_CHR '!', $40, skip $15, @(1,109) T_CHR 0x0a, $42, skip $15, @(1,111) T_SET[4]:10, $9, @(1,36) T_SET[1]:87, $28, @(1,69) T_SET[2]:100, $36, @(1,98) T_SET[3]:33, $39, next 0, @(1,108) ``` **Example of find_known_calc_state()** ```tritium-ir # awib 開頭程式碼, 重點是中間這段 "[->+>+<<]>>[-<<+>>]+" >>>>>>>>>>>>>>>>>>>>++>>>>+[>[-],+[->+>+<<]>>[-<<+>>]+<[-[>+++++[-<---- --->]+<-[-[-[-[--------------[--[-->>+<<[>>-<++[-<--------->]+<[--[>+++ ++++++[-<++++++++++>]<[-]]]]]]]]]]]<->]>[-<<<->>>]<<<] # invariants_scan() loop #44, processing $12 (T_SET[25]:0, $12, @(1,36)) T_SET[20]:2, $1, prev 0, @(1,21) T_SET[24]:1, $3, @(1,27) T_WHL[24],id=1, $4, jmp $107, @(1,28) T_SET[25]:0, $7, @(1,31) T_INP[25], $9, @(1,33) T_ADD[25]:1, $10, @(1,34) T_CALC[26] = 0 + [25]*1, $14, @(1,38) T_CALC[27] = 0 + [25]*1, $16, @(1,40) T_SET[25]:0, $12, @(1,36) T_CALC[25] = 0 + [27]*1, $23, @(1,50) T_CALC[27] += [27]*-1 + 0, $21, @(1,47) T_NOP[27]:0, $25, jmp $20, @(1,53) T_ADD[27]:1, $26, @(1,54) ... # invariants_scan() loop #49, processing $14 (T_CALC[26] = 0 + [25]*1, $14, @(1,38)) # cell[25] 數值不變, 故簡化為 T_NOP T_SET[20]:2, $1, prev 0, @(1,21) T_SET[24]:1, $3, @(1,27) T_WHL[24],id=1, $4, jmp $107, @(1,28) T_SET[25]:0, $7, @(1,31) T_INP[25], $9, @(1,33) T_ADD[25]:1, $10, @(1,34) T_CALC[26] = 0 + [25]*1, $14, @(1,38) T_CALC[27] = 0 + [25]*1, $16, @(1,40) T_NOP[25]:0, x2[25]:1, x3[25]:0, $23, @(1,50) T_CALC[27] += [27]*-1 + 0, $21, @(1,47) T_NOP[27]:0, $25, jmp $20, @(1,53) T_ADD[27]:1, $26, @(1,54) ... # invariants_scan() loop #62, processing $14 (T_CALC[26] = 0 + [25]*1, $14, @(1,38)) # 簡化 cell[27] 相關操作後的結果 T_SET[20]:2, $1, prev 0, @(1,21) T_SET[24]:1, $3, @(1,27) T_WHL[24],id=1, $4, jmp $107, @(1,28) T_SET[25]:0, $7, @(1,31) T_INP[25], $9, @(1,33) T_ADD[25]:1, $10, @(1,34) T_CALC[26] = 0 + [25]*1, $14, @(1,38) T_ADD[27]:1, $26, @(1,54) ... ``` * `-O3` Maximum normal level, default. * 從 code 來看, `-O2` 和 `-O3` 的效果應該是一樣的 * `-Orun` When generating code run the interpreter over it first. * function: `try_opt_runner()` (`opt_runner = 1`) ### 其他實驗 **巢狀迴圈** brainfuck 必須以巢狀迴圈做兩個變數的乘法運算, 例如以下 C-code ``` p[0] = getchar(); p[1] = getchar(); p[2] = p[1] * p[0]; p[1] = 0; p[0] = 0; putchar(p[2]); ``` 對應的 brainfuck 為 ``` ,>,<[->[->+<]<]>>. ``` 但 Tritium 無法簡化巢狀迴圈 ``` T_INP[0], $0, prev 0, @(1,1) T_INP[1], $2, @(1,3) T_WHL[0],id=1, $4, jmp $14, @(1,5) T_ADD[0]:-1, $5, @(1,6) T_CALC[2] += [1]*2 + 0, $10, @(1,11) T_SET[1]:0, $8, @(1,9) T_END[0] id=1, $14, jmp $4, @(1,16) T_PRT[2], $16, next 0, @(1,19) ``` **Register allocation** 測試 ASM backend 如何實現 IR, 並比較不同 optimization level 的 register allocation 測試1: 乘法 `p[2] = (p[1] * 2) * p[0]`; ``` ,>,<[->[->++<]<]>>. ``` `-O1` 產生的 IR code ``` T_INP[0], $0, prev 0, @(1,1) T_INP[1], $2, @(1,3) T_WHL[0],id=1, $4, jmp $14, @(1,5) T_ADD[0]:-1, $5, @(1,6) T_WHL[1],id=2, $7, jmp $12, @(1,8) T_ADD[1]:-1, $8, @(1,9) T_ADD[2]:2, $10, @(1,11) T_END[1] id=2, $12, jmp $7, @(1,14) T_END[0] id=1, $14, jmp $4, @(1,16) T_PRT[2], $16, next 0, @(1,19) ``` `-O1 -s` 輸出的 nasm 片段, 對應中間那段 loop IR 都是 unary operation, 因此未使用額外 register ``` loop_1: dec byte ptr [ecx] cmp dh,byte ptr [ecx+1] jz end_2 loop_2: dec byte ptr [ecx+1] add byte ptr [ecx+2],2 cmp dh,byte ptr [ecx+1] jnz loop_2 end_2: cmp dh,byte ptr [ecx] jnz loop_1 end_1: ``` `-O2` 產生的 IR code ``` T_INP[0], $0, prev 0, @(1,1) T_INP[1], $2, @(1,3) T_WHL[0],id=1, $4, jmp $14, @(1,5) T_ADD[0]:-1, $5, @(1,6) T_CALC[2] += [1]*2 + 0, $10, @(1,11) T_SET[1]:0, $8, @(1,9) T_END[0] id=1, $14, jmp $4, @(1,16) T_PRT[2], $16, next 0, @(1,19) ``` `-O2 -s` 輸出的 nasm 片段, 對應中間那段 loop 乘法運算暫存於 eax ``` loop_1: dec byte ptr [ecx] movzx eax,byte ptr [ecx+1] add eax,eax add byte ptr [ecx+2],al mov byte ptr [ecx+1],0 cmp dh,byte ptr [ecx] jnz loop_1 end_1: ``` 測試2: 連加 `p[2] = p[0] + p[1]` ``` ,>,<[->>+<<]>[->+<]>. ``` `-O2` 產生的 IR ``` T_INP[0], $0, prev 0, @(1,1) T_INP[1], $2, @(1,3) T_CALC[2] = 0 + [0]*1, $7, @(1,9) T_CALC[2] += [1]*1 + 0, $14, @(1,17) T_PRT[2], $18, next 0, @(1,21) ``` `-O2 -s` 輸出的 nasm 片段 ``` movzx eax,byte ptr [ecx] mov byte ptr [ecx+2],al movzx eax,byte ptr [ecx+1] add byte ptr [ecx+2],al ``` 需存取四次記憶體, 但如果是一般 compiled C-code, 只需存取三次 (p[0] 及 p[1] 分別載入不同暫存器, 相加後存入 p[2]) ### Debugging Tritium * 善用 option `-v` (有些 debug message 需要超過五個`-v`) * `printtree()` 印出全部 IR * 注意 profiler 無法處理含有 input 指令的程式 --- ## 附錄 ### GDB Python Integration [本文](https://hackmd.io/s/HJXBgX2ge) 介紹 gdb Python API 使用案例, 並改善 [supertrace.py](http://rickey-nctu.blogspot.tw/2013/11/gdb.html), 使其更容易使用 --- ## 參考資料 * [Hello, JIT World: The Joy of Simple JITs](http://blog.reverberate.org/2012/12/hello-jit-world-joy-of-simple-jits.html) * [A Basic Just-In-Time Compiler](http://nullprogram.com/blog/2015/03/19/) * Calling convention * [x86 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions) * [x86 Assembly Guide - calling convention](http://www.cs.virginia.edu/~evans/cs216/guides/x86.html#calling) * [Calling convention for system calls](http://man7.org/linux/man-pages/man2/syscall.2.html) * [System V ABI](http://wiki.osdev.org/System_V_ABI) * [Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2A: Instruction Set Reference, A-L](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) * [Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B: Instruction Set Reference, M-U](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) * [x86 assembly language two main syntax branches](https://en.wikipedia.org/wiki/X86_assembly_language#Syntax) * [9.15.3.1 AT&T Syntax versus Intel Syntax](https://sourceware.org/binutils/docs/as/i386_002dVariations.html#i386_002dVariations) * [System V Application Binary Interface x86-64TM Architecture Processor Supplement Draft Version 0.21](https://refspecs.linuxbase.org/elf/x86_64-abi-0.21.pdf)