Extend basic block based on Control Flow Graph

# Extend basic block based on Control Flow Graph 按照 EBB 的定義，single entry and multiple exit，我們必須記錄 block 的進入點，如果該 block 的進入點有多個，就表示這個 block 不能 merge 到其他 block 之下，因為 merge 後這個 EBB 會變成有多個進入點而違反 EBB 的定義。 * 記錄方式，先紀錄第一個 previous block 的 program counter，如果之後紀錄的 previous block 的 program counter 不同，那就代表有多個進入點，就把 can_merge 設定為 false. ```c if (prev && next->can_merge) { if (!next->prev_pc) { next->prev_pc = prev->pc_start; if (prev->left) prev->right = next; else prev->left = next; } else if (prev->pc_start != next->prev_pc) next->can_merge = false; } ``` 因為一個 branch instruction 只會走兩條路徑，一條是 branch taken，而另一條是 brnach not taken，所以選擇以 Binary Tree 來記錄，一個子節點代表 brnach taken path，另一個代表 brnach not taken path，即可建立 CFG. Extend block 前會先判斷 block 是否被 extend 過，如果有就檢查 control flow 是否有變化，判斷方式是 trace 新的 control flow 產生的 EBB 的總指令數量，如果和目前的 EBB 總指令數量不同，那就從新 translate block 並 extend ```c /* Check EBB need to retranslate or not */ static void trace_cfg(block_t *block, uint32_t *total_insn) { *total_insn += block->n_insn; block_t *left = block->left, *right = block->right; if (left && left->can_merge) trace_cfg(left, total_insn); if (right && right->can_merge) trace_cfg(right, total_insn); } /* Trace CFG by DFS and extend basic block */ static void extend_block(block_t *block, block_t *next) { rv_insn_t *last_ir = block->ir + block->n_insn - 1; if (insn_is_unconditional_jump(last_ir->opcode)) return; block_t *left = next->left, *right = next->right; uint16_t tmp = block->n_insn; if (!left && !right) { return; } if (left) { if (left->can_merge && left->pc_start != next->pc_start && append_block(block, left)) { last_ir->tailcall = false; last_ir->merge_pc[0] = left->pc_start; last_ir->offset[0] = 1; extend_block(block, left); } else block->left = NULL; } if (right) { if (right->can_merge && right->pc_start != next->pc_start) { tmp = block->n_insn - tmp + 1; if (append_block(block, right)) { last_ir->merge_pc[1] = right->pc_start; if (left) last_ir->offset[1] = tmp; else last_ir->offset[1] = 1; extend_block(block, right); } } else block->right = NULL; } } ``` #### CoreMark 由於頻繁的對於 BB 做 Extend，或是重新 translate block 並 extend 的成本很高，所以要設定 threshold * Model: Core i7-8700 |Test| Compiler | Iterations / Sec |Speedup| |--------| -------- | -------- |-------- | | BB | clang-15 | 971.951 | | | BB | gcc-12 | 963.336 | | | EBB threshold 1000 | clang-15 | 1051.275 |+8.2% | | EBB | gcc-12 | 1068.915 | +11%| | EBB threshold 100 | clang-15 | 1013.070|+4.2% | | EBB | gcc-12 | 1020.391 | +6%| | EBB threshold 10 | clang-15 | 852.5410005 |-14% | | EBB | gcc-12 | 857.271259167 | -12.3%| ### EBB 含有指令數統計圖 #### threshold 10 ![](https://i.imgur.com/vo102Bm.png) #### threshold 100 ![](https://i.imgur.com/X4nnUjf.png) #### threshold 1000 ![](https://i.imgur.com/2gZHCFf.png) ## Extended basic block (EBB) 策略 ### 策略 1: 當 block 的使用次數超過 threshold 時，按照 EBB 定義展開 * [commit](https://github.com/qwe661234/rv32emu/commit/3cfd439f04ea9fa4a23174f15673d667c7cc3811) 按照 EBB 的定義，single entry and multiple exit，我們必須紀錄 block 的進入點，如果該 block 的進入點有多個，就表示這個 block 不能 merge 到其他 block 之下，因為 merge 後這個 EBB 會變成有多個進入點而違反 EBB 的定義。在展開部份，當 block 的使用次數超過 threshold(目前定 50) 時，我們認定其為值得被 extend 的 block，另一方面，在使用次數較多的情況下展開來避免 control flow 不夠完整，又要清掉原本做好的 EBB 重新 extend。因為我們不展開以 unconditional jump 指令結尾的 block，所以在展開以 conditional jump 指令結尾的 block 時，我們只需要找他的 branch taken path 的 BB(或 EBB) 和 branch not taken path 的 BB(或 EBB)，並將其 merge 到當前的 basic block 下即可。在 merge 時，我們就會檢查這個 block 是否有多個進入點，如果有就不 merge，因為 merge 後這個 EBB 會變成有多個進入點而違反 EBB 的定義。這個做法雖然 extend 了 block，但多了幾個條件限制，合併的 block 數量不多，因此效能提升不顯著。 #### CoreMark * Model: Core i7-8700 |Test| Compiler | Iterations / Sec |Speedup| |--------| -------- | -------- |-------- | | BB | clang-15 | 971.951 | | | BB | gcc-12 | 963.336 | | | EBB | clang-15 | 990.099 |+1.9% | | EBB | gcc-12 | 1009.551 | +4.7%| #### EBB 含有指令數統計圖 * 原始 Basic block ![](https://user-images.githubusercontent.com/48278026/226972958-22239a51-13cd-4d90-a39a-c791cad80ca2.png) * 策略 1 EBB ![](https://i.imgur.com/XeOJIFl.png) ### 策略 2: 超過 threshold 直接展開在展開部份，和上個策略一樣，當 block 的使用次數超過 threshold(目前定 50) 時，我們認定其為值得被 extend 的 block，另一方面，在使用次數較多的情況下展開來避免 control flow 不夠完整，又要清掉原本做好的 EBB 重新 extend。因為我們一樣不展開以 unconditional jump 指令結尾的 block，所以在展開以 conditional jump 指令結尾的 block 時，我們只需要找他的 branch taken path 的 BB(或 EBB) 和 branch not taken path 的 BB(或 EBB)，並將其直接 merge 到當前的 basic block 下，沒有其他檢查機制。這個策略少了較多的檢查並讓較多的 block 合併，所以效能提升較顯著。 #### CoreMark * Model: Core i7-8700 |Test| Compiler | Iterations / Sec |Speedup| |--------| -------- | -------- |-------- | | BB | clang-15 | 971.951 | | | BB | gcc-12 | 963.336 | | | EBB | clang-15 | 1087.284 |+11.8% | | EBB | gcc-12 | 1110.705 | +15.3%| #### EBB 含有指令數統計圖(含有指令數小於 5 的 block 皆以 unconditional jump 指令結尾) * 原始 Basic block ![](https://user-images.githubusercontent.com/48278026/226972958-22239a51-13cd-4d90-a39a-c791cad80ca2.png) * 策略 2 EBB ![](https://user-images.githubusercontent.com/48278026/226975281-9931b654-1704-418c-89e1-ec37ed2c69dd.png) ![](https://i.imgur.com/us8Yz4V.png) Based on our results, we can conclude that the more instructions in a block, the better the performance. However, we do not want the number of instructions in a block to become too large, as this would make it unsuitable for translating to machine code. Moreover, only blocks that are frequently used need to be extended. * We only extend a block once to keep it from becoming too large. * We set a threshold for only translating blocks that are frequently used. We can only extend it if the usage times exceed the threshold. We extend basic block by appending the basic block of branch not taken path and branch taken path to it, allowing us to emulate a basic block and its branch target with an internal jump rather than an external jump. * External jumps have a high overhead because they involve finding or decoding a basic block. A internal jump, on the other hand, only includes adding an offset to the target IR. * This strategy can efficently extend a loop body. We can see the following control flow graph of BB and its EBB after extending, we save two external jump in these two cases. 1. ![](https://i.imgur.com/tcTUD2K.png =400x) 2. ![](https://i.imgur.com/BQAz61c.png =400x) ## 改 IR data structire 原本的 ir 是 array，考慮改為 linked-list 或 tree ``` 0000000000407aa0 <do_addi>: 407aa0: c7 47 58 00 00 00 00 movl $0x0,0x58(%rdi) 407aa7: 0f b6 4e 05 movzbl 0x5(%rsi),%ecx 407aab: 0f b6 56 04 movzbl 0x4(%rsi),%edx 407aaf: 8b 06 mov (%rsi),%eax 407ab1: 03 44 8f 58 add 0x58(%rdi,%rcx,4),%eax 407ab5: 89 44 97 58 mov %eax,0x58(%rdi,%rdx,4) 407ab9: 48 83 87 a0 01 00 00 addq $0x1,0x1a0(%rdi) 407ac0: 01 407ac1: 0f b6 46 09 movzbl 0x9(%rsi),%eax 407ac5: 01 87 d8 00 00 00 add %eax,0xd8(%rdi) 407acb: 0f b6 46 0a movzbl 0xa(%rsi),%eax 407acf: 84 c0 test %al,%al 407ad1: 75 0d jne 407ae0 <do_addi+0x40> 407ad3: 48 8d 56 28 lea 0x28(%rsi),%rdx 407ad7: 48 8b 46 38 mov 0x38(%rsi),%rax 407adb: 48 89 d6 mov %rdx,%rsi 407ade: ff e0 jmpq *%rax 407ae0: c3 retq 407ae1: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1) 407ae8: 00 00 00 00 407aec: 0f 1f 40 00 nopl 0x0(%rax) ```

Read more

實作 Thread package

Cache Coherency & I/O ordering

CO Final

New EBB Strategy