--- tags: Research --- # wasm3 Strategy ## clang compile 這邊要用 `clang-14` 才支援 musttail。 ``` mkdir build cd build cmake -GNinja -DCLANG_SUFFIX="-14" .. ninja ``` ## Tail Call 透過 gef 追蹤得知,程式主要由 `main` -> `repl_call` -> `m3_CallArgv`->`op_Call`->`op_Compile`->`Compile_funciton`->...,function dispatch 主要透過 `op_Call` 這個 function 完成,`op_Call` 這個 function 是由 `d_m3Op (Call)` 所展開,而 `d_m3Op (Call)` 最原始是由 `RunCode` 所展開, 由程式碼可以發現 `RunCode` 的寫法符合 tail-call 的定義。然而,參考這篇 [Parsing Protobuf at 2+GB/s: How I Learned To Love Tail Calls in C](https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html?fbclid=IwAR15f5RWb2YfioH-3HrgBwZLbzsoIQ2Yj5uFZeNelyxPMaz2i3STts_H9Qw) 發現 `__attribute__((musttail))` 這個 attribute 目前移植性並不高,當我直接定義 `M3_MUSTTAIL` 為`__attribute__((musttail))` 時,我的電腦並不支援。 ```c #define d_m3BaseOpSig pc_t _pc, m3stack_t _sp, M3MemoryHeader * _mem, m3reg_t _r0 #if M3_COMPILER_HAS_ATTRIBUTE(musttail) #define M3_MUSTTAIL __attribute__((musttail)) #define nextOpDirect() M3_MUSTTAIL return nextOpImpl() d_m3RetSig RunCode (d_m3OpSig) { nextOpDirect(); } d_m3Op (Call) { pc_t callPC = immediate (pc_t); i32 stackOffset = immediate (i32); IM3Memory memory = m3MemInfo (_mem); m3stack_t sp = _sp + stackOffset; m3ret_t r = Call (callPC, sp, _mem, d_m3OpDefaultArgs); _mem = memory->mallocated; if (M3_LIKELY(not r)) nextOp (); else { pushBacktraceFrame (); forwardTrap (r); } } ``` ```c d_m3RetSig Call (d_m3OpSig) { m3ret_t possible_trap = m3_Yield (); if (M3_UNLIKELY(possible_trap)) return possible_trap; nextOpDirect(); } ``` * `Yield` ```c M3Result m3_Yield () { return m3Err_none; // NULL } ``` 透過 disassemble 展開也可發現,編譯器並沒有使用 tail-call optimization 來最佳化`op_Call` 這個 function, function 還是頻繁的操作 stack。 另外,猜測是沒有利用 tail-call optimization 的關係,參數並沒有直接 mapping 到這些暫存器上,而是先透過 r12~15 來存取。 |M3 Register |x86 Register| |-----------------------------|------------| |program counter (pc) |rdi | |stack pointer (sp) |rsi | |linear memory (mem) |rdx | |integer register (r0) |rcx | |floating-point register (fp0)|xmm0 | ``` 000000000000104d0 <op_Call>: 104d0: 55 push %rbp 104d1: 41 57 push %r15 104d3: 41 56 push %r14 104d5: 41 55 push %r13 104d7: 41 54 push %r12 104d9: 53 push %rbx 104da: 48 83 ec 18 sub $0x18,%rsp 104de: c5 fb 11 44 24 10 vmovsd %xmm0,0x10(%rsp) 104e4: 48 89 4c 24 08 mov %rcx,0x8(%rsp) 104e9: 49 89 d4 mov %rdx,%r12 104ec: 49 89 f7 mov %rsi,%r15 104ef: 48 89 fb mov %rdi,%rbx 104f2: 48 8b 2f mov (%rdi),%rbp 104f5: 4c 63 77 08 movslq 0x8(%rdi),%r14 104f9: 4c 8b 2a mov (%rdx),%r13 104fc: e8 1f b0 00 00 callq 1b520 <m3_Yield> 10501: 48 85 c0 test %rax,%rax 10504: 75 49 jne 1054f <op_Call+0x7f> 10506: 4b 8d 34 b7 lea (%r15,%r14,4),%rsi 1050a: 48 8d 7d 08 lea 0x8(%rbp),%rdi 1050e: c5 f8 57 c0 vxorps %xmm0,%xmm0,%xmm0 10512: 4c 89 e2 mov %r12,%rdx 10515: 31 c9 xor %ecx,%ecx 10517: ff 55 00 callq *0x0(%rbp) 1051a: 48 85 c0 test %rax,%rax 1051d: 75 30 jne 1054f <op_Call+0x7f> 1051f: 49 8b 95 b8 29 00 00 mov 0x29b8(%r13),%rdx 10526: 48 8b 43 10 mov 0x10(%rbx),%rax 1052a: 48 83 c3 18 add $0x18,%rbx 1052e: 48 89 df mov %rbx,%rdi 10531: 4c 89 fe mov %r15,%rsi 10534: 48 8b 4c 24 08 mov 0x8(%rsp),%rcx 10539: c5 fb 10 44 24 10 vmovsd 0x10(%rsp),%xmm0 1053f: 48 83 c4 18 add $0x18,%rsp 10543: 5b pop %rbx 10544: 41 5c pop %r12 10546: 41 5d pop %r13 10548: 41 5e pop %r14 1054a: 41 5f pop %r15 1054c: 5d pop %rbp 1054d: ff e0 jmpq *%rax 1054f: 48 83 c4 18 add $0x18,%rsp 10553: 5b pop %rbx 10554: 41 5c pop %r12 10556: 41 5d pop %r13 10558: 41 5e pop %r14 1055a: 41 5f pop %r15 1055c: 5d pop %rbp 1055d: c3 retq 1055e: 66 90 xchg %ax,%ax ``` ## issue Report https://github.com/wasm3/wasm3/issues/411 ## Coremark 測試 ``` wasm3-0.5.0 (clang 16) 2901.478358 wasm3-0.5.0 (clang 14) 2953.427087 native clang 33154.666519 wavm (2022-05-14) 27219.358985 WebAssembly.sh v0.1.0 23652.217670 emscripten 23466.833542 ``` ## with and without `__attribute__((musttail))` in O0 ```c int factorial(int n, int sum) { if (n == 0) return sum; sum *= n; // without `__attribute__((musttail))` return factorial(n - 1, sum); // with `__attribute__((musttail))` __attribute__((musttail)) return factorial(n - 1, sum); } ``` 比較兩份指令可發現利用 `__attribute__((musttail))`,它能幫助作到的事情只有不去對 `%rsp` 去做增加或減少,然後將 `callq` 改成 `jmpq`,還是要利用到 function stack,但是用到的都是同一塊。 * without `__attribute__((musttail))` ``` 0000000000001130 <factorial>: 1130: 55 push %rbp 1131: 48 89 e5 mov %rsp,%rbp 1134: 48 83 ec 10 sub $0x10,%rsp 1138: 89 7d f8 mov %edi,-0x8(%rbp) 113b: 89 75 f4 mov %esi,-0xc(%rbp) 113e: 83 7d f8 00 cmpl $0x0,-0x8(%rbp) 1142: 0f 85 0b 00 00 00 jne 1153 <factorial+0x23> 1148: 8b 45 f4 mov -0xc(%rbp),%eax 114b: 89 45 fc mov %eax,-0x4(%rbp) 114e: e9 1b 00 00 00 jmpq 116e <factorial+0x3e> 1153: 8b 45 f8 mov -0x8(%rbp),%eax 1156: 0f af 45 f4 imul -0xc(%rbp),%eax 115a: 89 45 f4 mov %eax,-0xc(%rbp) 115d: 8b 7d f8 mov -0x8(%rbp),%edi 1160: 83 ef 01 sub $0x1,%edi 1163: 8b 75 f4 mov -0xc(%rbp),%esi 1166: e8 c5 ff ff ff callq 1130 <factorial> 116b: 89 45 fc mov %eax,-0x4(%rbp) 116e: 8b 45 fc mov -0x4(%rbp),%eax 1171: 48 83 c4 10 add $0x10,%rsp 1175: 5d pop %rbp 1176: c3 retq 1177: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 117e: 00 00 ``` * with `__attribute__((musttail))` ``` 0000000000001130 <factorial>: 1130: 55 push %rbp 1131: 48 89 e5 mov %rsp,%rbp 1134: 89 7d f8 mov %edi,-0x8(%rbp) 1137: 89 75 f4 mov %esi,-0xc(%rbp) 113a: 83 7d f8 00 cmpl $0x0,-0x8(%rbp) 113e: 0f 85 0b 00 00 00 jne 114f <factorial+0x1f> 1144: 8b 45 f4 mov -0xc(%rbp),%eax 1147: 89 45 fc mov %eax,-0x4(%rbp) 114a: e9 1a 00 00 00 jmpq 1169 <factorial+0x39> 114f: 8b 4d f8 mov -0x8(%rbp),%ecx 1152: 8b 45 f4 mov -0xc(%rbp),%eax 1155: 0f af c1 imul %ecx,%eax 1158: 89 45 f4 mov %eax,-0xc(%rbp) 115b: 8b 7d f8 mov -0x8(%rbp),%edi 115e: ff cf dec %edi 1160: 8b 75 f4 mov -0xc(%rbp),%esi 1163: 5d pop %rbp 1164: e9 c7 ff ff ff jmpq 1130 <factorial> 1169: 8b 45 fc mov -0x4(%rbp),%eax 116c: 5d pop %rbp 116d: c3 retq 116e: 66 90 xchg %ax,%ax ``` 原本懷疑可能是 [這篇](https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html) 提到的要將 slow path 也包成一個 function,因此做改寫 ```c int fall_back(int n, int sum) { return sum; } int factorial(int n, int sum) { if (n == 0) __attribute__((musttail)) return fall_back(n, sum); sum *= n; __attribute__((musttail)) return factorial(n - 1, sum); } ``` 然而,只是多了要跳過去 fall_back funciton 而已,還是要利用到 function stack ``` 0000000000001130 <fall_back>: 1130: 55 push %rbp 1131: 48 89 e5 mov %rsp,%rbp 1134: 89 7d fc mov %edi,-0x4(%rbp) 1137: 89 75 f8 mov %esi,-0x8(%rbp) 113a: 8b 45 f8 mov -0x8(%rbp),%eax 113d: 5d pop %rbp 113e: c3 retq 113f: 90 nop 0000000000001140 <factorial>: 1140: 55 push %rbp 1141: 48 89 e5 mov %rsp,%rbp 1144: 89 7d f8 mov %edi,-0x8(%rbp) 1147: 89 75 f4 mov %esi,-0xc(%rbp) 114a: 83 7d f8 00 cmpl $0x0,-0x8(%rbp) 114e: 0f 85 0c 00 00 00 jne 1160 <factorial+0x20> 1154: 8b 7d f8 mov -0x8(%rbp),%edi 1157: 8b 75 f4 mov -0xc(%rbp),%esi 115a: 5d pop %rbp 115b: e9 d0 ff ff ff jmpq 1130 <fall_back> 1160: 8b 4d f8 mov -0x8(%rbp),%ecx 1163: 8b 45 f4 mov -0xc(%rbp),%eax 1166: 0f af c1 imul %ecx,%eax 1169: 89 45 f4 mov %eax,-0xc(%rbp) 116c: 8b 7d f8 mov -0x8(%rbp),%edi 116f: ff cf dec %edi 1171: 8b 75 f4 mov -0xc(%rbp),%esi 1174: 5d pop %rbp 1175: e9 c6 ff ff ff jmpq 1140 <factorial> 117a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) ``` 但是用 O3 編譯的話,就可以避免掉建立 function stack,但用 O3 來編譯的話,有沒有加 `__attribute__((musttail))` 結果是一樣的。 參考 [這個 commit](https://github.com/wasm3/wasm3/commit/bd152c79aa1c022f085ee98f68149bf1ff75f80a) 發現其他 op_XXX 的 function 也會被 tail-call optimization 優化,但不知道為何只有之前分析的 `op_Call` 沒有。 找其中一個確定有優化的 function 來測試,先用 O0 來編譯,可以發現它需要用到 function stack,有沒有加 `__attribute__((musttail))` 的差異也跟上面的實驗結果一樣。 ``` 00000000000216c0 <op_u64_Or_rs>: 216c0: 48 83 ec 38 sub $0x38,%rsp 216c4: 48 89 7c 24 30 mov %rdi,0x30(%rsp) 216c9: 48 89 74 24 28 mov %rsi,0x28(%rsp) 216ce: 48 89 54 24 20 mov %rdx,0x20(%rsp) 216d3: 48 89 4c 24 18 mov %rcx,0x18(%rsp) 216d8: c5 fb 11 44 24 10 vmovsd %xmm0,0x10(%rsp) 216de: 48 8b 44 24 28 mov 0x28(%rsp),%rax 216e3: 48 8b 4c 24 30 mov 0x30(%rsp),%rcx 216e8: 48 89 ca mov %rcx,%rdx 216eb: 48 83 c2 08 add $0x8,%rdx 216ef: 48 89 54 24 30 mov %rdx,0x30(%rsp) 216f4: 48 63 09 movslq (%rcx),%rcx 216f7: 48 8b 04 88 mov (%rax,%rcx,4),%rax 216fb: 48 89 44 24 08 mov %rax,0x8(%rsp) 21700: 48 8b 44 24 08 mov 0x8(%rsp),%rax 21705: 48 0b 44 24 18 or 0x18(%rsp),%rax 2170a: 48 89 44 24 18 mov %rax,0x18(%rsp) 2170f: 48 8b 44 24 30 mov 0x30(%rsp),%rax 21714: 48 8b 00 mov (%rax),%rax 21717: 48 8b 7c 24 30 mov 0x30(%rsp),%rdi 2171c: 48 83 c7 08 add $0x8,%rdi 21720: 48 8b 74 24 28 mov 0x28(%rsp),%rsi 21725: 48 8b 54 24 20 mov 0x20(%rsp),%rdx 2172a: 48 8b 4c 24 18 mov 0x18(%rsp),%rcx 2172f: c5 fb 10 44 24 10 vmovsd 0x10(%rsp),%xmm0 21735: ff d0 callq *%rax 21737: 48 83 c4 38 add $0x38,%rsp 2173b: c3 retq 2173c: 0f 1f 40 00 nopl 0x0(%rax) ``` ``` 0000000000020770 <op_u64_Or_rs>: 20770: 48 89 7c 24 f8 mov %rdi,-0x8(%rsp) 20775: 48 89 74 24 f0 mov %rsi,-0x10(%rsp) 2077a: 48 89 54 24 e8 mov %rdx,-0x18(%rsp) 2077f: 48 89 4c 24 e0 mov %rcx,-0x20(%rsp) 20784: c5 fb 11 44 24 d8 vmovsd %xmm0,-0x28(%rsp) 2078a: 48 8b 44 24 f0 mov -0x10(%rsp),%rax 2078f: 48 8b 4c 24 f8 mov -0x8(%rsp),%rcx 20794: 48 89 ca mov %rcx,%rdx 20797: 48 83 c2 08 add $0x8,%rdx 2079b: 48 89 54 24 f8 mov %rdx,-0x8(%rsp) 207a0: 48 63 09 movslq (%rcx),%rcx 207a3: 48 8b 04 88 mov (%rax,%rcx,4),%rax 207a7: 48 89 44 24 d0 mov %rax,-0x30(%rsp) 207ac: 48 8b 44 24 d0 mov -0x30(%rsp),%rax 207b1: 48 8b 4c 24 e0 mov -0x20(%rsp),%rcx 207b6: 48 09 c8 or %rcx,%rax 207b9: 48 89 44 24 e0 mov %rax,-0x20(%rsp) 207be: 48 8b 7c 24 f8 mov -0x8(%rsp),%rdi 207c3: 48 8b 07 mov (%rdi),%rax 207c6: 48 83 c7 08 add $0x8,%rdi 207ca: 48 8b 74 24 f0 mov -0x10(%rsp),%rsi 207cf: 48 8b 54 24 e8 mov -0x18(%rsp),%rdx 207d4: 48 8b 4c 24 e0 mov -0x20(%rsp),%rcx 207d9: c5 fb 10 44 24 d8 vmovsd -0x28(%rsp),%xmm0 207df: ff e0 jmpq *%rax 207e1: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1) 207e8: 0f 1f 84 00 00 00 00 207ef: 00 ``` 接下用 O3 去編譯,可發現不需要到 function stack, ``` 0000000000014630 <op_u64_Or_rs>: 14630: 48 63 07 movslq (%rdi),%rax 14633: 48 0b 0c 86 or (%rsi,%rax,4),%rcx 14637: 48 8b 47 08 mov 0x8(%rdi),%rax 1463b: 48 83 c7 10 add $0x10,%rdi 1463f: ff e0 jmpq *%rax ``` 用 gcc 編出來的 ``` 0000000000011810 <op_u64_Or_rs>: 11810: f3 0f 1e fa endbr64 11814: 48 63 07 movslq (%rdi),%rax 11817: 4c 8d 47 10 lea 0x10(%rdi),%r8 1181b: 48 0b 0c 86 or (%rsi,%rax,4),%rcx 1181f: 48 8b 47 08 mov 0x8(%rdi),%rax 11823: 4c 89 c7 mov %r8,%rdi 11826: ff e0 jmpq *%rax 11828: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1) 1182f: 00 ``` ## sqrt 有 callq 的 funciton 就會用到 function stack ## `__attribute__((musttail))` 限制 1. function 參數宣告要一樣 2. slow path 要包成一個 function (似乎沒影響) 3. 不能放在函式宣告處 ``` error: 'musttail' attribute cannot be applied to a declaration __attribute__((musttail)) int factorial(int n, int sum) { ^ ~~~~~~~~~ 1 error generated. ``` ## wasm3 porfiler 將 m3_config.h 中的 d_m3EnableOpProfiling 設成 1,跑 coremark 時就會印出以下資訊 ``` 1324428702 op_SetSlot_i32 540959457 op_i32_Add_ss 418617774 op_CopySlot_32 318460691 op_i32_Load_i32_s 228893365 op_u32_And_rs 179737778 op_i32_Add_rs 155645868 op_BranchIf_r 129274454 op_u32_Xor_rs 126925034 op_SetRegister_i32 113785674 op_i32_Multiply_rs 100791878 op_i32_Store_i32_ss 94615534 op_u32_ShiftRight_sr 89468905 op_u32_ShiftRight_ss 89339933 op_u32_And_ss 87556818 op_i32_Load_i16_s 81719750 op_i32_Load_u16_s 80991816 op_i32_Equal_ss 75518225 op_Select_i32_rss 64609494 op_If_r 61167788 op_i32_Load_u8_r 60755923 op_i32_Equal_rs 57889024 op_i32_Load_u8_s 53672190 op_i32_NotEqual_ss 49348860 op_i32_EqualToZero_r 47495786 op_i32_Store_i32_rs 39187964 op_PreserveSetSlot_i32 37928724 op_i32_GreaterThan_ss 35663950 op_i32_Load_u16_r 28192083 op_u32_LessThanOrEqual_sr 21272720 op_i32_EqualToZero_s 21145878 op_Call 18940351 op_ContinueLoopIf 18782683 op_u32_ShiftLeft_ss 17971255 op_u32_Or_ss 16252015 op_i32_NotEqual_rs 15694560 op_Select_i32_srs 14192986 op_u32_Xor_ss 12013411 op_u32_LessThan_sr 8283274 op_i32_Store_i32_sr 8186576 op_u32_ShiftLeft_sr 8186565 op_i32_Store_i16_rs 7266221 op_u32_GreaterThan_sr 6154402 op_i32_Store_u8_rs 5207319 op_i32_ShiftRight_sr 3947862 op_i32_Subtract_sr 2519636 op_i32_LessThanOrEqual_rs 2494769 op_i32_LessThan_sr 2470467 op_i32_GreaterThan_sr 2386041 op_i32_Load_i32_r 1979450 op_i32_LessThan_ss 1380540 op_i32_Load_i16_r 1344938 op_u32_LessThanOrEqual_rs 1211117 op_u32_LessThan_ss 1017538 op_u32_Or_rs 986275 op_BranchIf_s 872075 op_i32_Multiply_ss 363370 op_i32_Store_i16_ss 218039 op_i32_GreaterThanOrEqual_ss 193813 op_i64_Store_i64_ss 193805 op_i64_Store_i64_sr 97373 op_i32_Subtract_ss 97036 op_SetGlobal_i32 48536 op_If_s 48518 op_GetGlobal_s32 48499 op_u32_GreaterThan_ss 24238 op_i32_LessThanOrEqual_sr 218 op_i32_GreaterThan_rs 212 op_Const32 204 op_i32_Subtract_rs 100 op_i32_Store_u8_sr 90 op_i32_Load_i8_s 81 op_i32_Remainder_sr 74 op_u32_GreaterThanOrEqual_ss 68 op_u32_GreaterThanOrEqual_sr 61 op_SetSlot_i64 58 op_u32_Divide_ss 49 op_Select_i32_ssr 44 op_i64_Load_i64_s 42 op_Const64 33 op_BranchIfPrologue_r 32 op_CopySlot_64 32 op_SetSlot_f64 27 op_i32_Load_i8_r 27 op_i32_Wrap_i64_s 23 op_Select_i32_sss 20 op_u64_ShiftRight_ss 20 op_i64_NotEqual_rs 16 op_i64_Store_i64_rs 15 op_i32_Store_u8_ss 15 op_f64_NotEqual_rs 13 op_f64_LessThan_ss 13 op_u32_Trunc_f64_r_s 13 op_f64_GreaterThanOrEqual_ss 13 op_u32_LessThanOrEqual_ss 12 op_f64_Subtract_rs 12 op_f64_Convert_u32_r_s 12 op_i32_LessThanOrEqual_ss 12 op_i64_EqualToZero_s 12 op_f64_Multiply_rs 10 op_u32_GreaterThanOrEqual_rs 8 op_i64_Load_u32_s 7 op_u64_LessThan_ss 7 op_f64_LessThan_sr 7 op_i64_LessThanOrEqual_sr 6 op_i64_Reinterpret_f64_r_s 6 op_f64_Divide_sr 6 op_f64_Add_ss 5 op_u32_GreaterThan_rs 5 op_i64_EqualToZero_r 4 op_i64_Load_i32_s 4 op_i32_Wrap_i64_r 4 op_i64_Subtract_rs 3 op_f64_Multiply_ss 3 op_u64_And_ss 3 op_f64_Store_f64_rs 3 op_Select_f64_rsr 3 op_f64_Load_f64_s 3 op_u64_Or_rs 3 op_f64_Convert_i64_r_s 3 op_Select_f64_rss 3 op_u32_Divide_sr 3 op_i32_Divide_sr 3 op_f64_Abs_s 3 op_i32_LessThan_rs 3 op_f64_Equal_rs 3 op_u64_ShiftRight_sr 3 op_f64_Reinterpret_i64_r_r 3 op_f64_Convert_i64_r_r 3 op_Select_f64_rrs 2 op_u32_ShiftRight_rs 2 op_f64_Convert_u32_r_r 2 op_f64_Convert_i64_s_s 2 op_u32_Divide_rs 2 op_f64_Divide_ss 2 op_PreserveCopySlot_32 2 op_f64_Divide_rs 1 op_MemSize 1 op_i64_Multiply_rs 1 op_f64_GreaterThan_sr 1 op_i64_Store_i32_ss 1 op_i64_Extend_u32_s ```