Understanding Compiler Optimization

--- tags: Optimization, 編譯器 --- # Understanding Compiler Optimization :::info 以下段落摘要自影片 * [Understanding Compiler Optimization - Chandler Carruth - Opening Keynote Meeting C++ 2015](https://www.youtube.com/watch?v=FnGCDLhaxKU&feature=youtu.be) * [slide](https://meetingcpp.com/files/mcpp/2015/talks/meetingcxx_2015-understanding_compiler_optimization_themed_copy.pdf) * [How to optimize C and C++ code in 2018](https://medium.com/@aka.rider/how-to-optimize-c-and-c-code-in-2018-bd4f90a72c2b) 加上我一些拙劣的理解與補充 ::: 編譯器可以分成三個部分: * frontend: 包含 [lexical analysis](https://en.wikipedia.org/wiki/Lexical_analysis)(拆解 token), [syntax analysis](https://en.wikipedia.org/wiki/Parsing)(檢查語法), 與 [semantic analysis](https://en.wikipedia.org/wiki/Semantic_analysis_(compilers))(解析語意、[Type checking](https://en.wikipedia.org/wiki/Type_system#Type_checking)、變數宣告的檢查等)。並將程式轉換成中間表示式([IR、Intermediate representation](https://en.wikipedia.org/wiki/Intermediate_representation)) * optimizer: IR 的存在使得最佳化獨立於 CPU 架構之外，optimizer 會利用 IR 來做最佳化 * code generator: 從最佳化過的 IR，進一步檢查、根據 CPU 的架構做額外的最佳化，產生真正 CPU 對應的 assembly code ![](https://i.imgur.com/dlAgKI9.png =200x) 本段落會聚焦在 optimizer，並以 LLVM IR 做為舉例。 ## First of all, what 's IR ? ### Hello world 讓我們從一個簡單的 LLVM IR 開始 ```c= declare i32 @g(i32 %x) define i32 @f(i32 %a, i32 %b) { entry: %c = add i32 %a, %b %d = call i32 @g(i32 %c) %e = add i32 %c, %d ret i32 %e } ``` 很容易的可以理解其內容: * `declare`、`define` function `g`、 `f` * 型別如 `i32` * 變數如 `a`、`b` * 指令如 `add` `call` `ret` ### Control flow ```c= declare i32 @g(i32 %x) define i32 @f(i32 %a, i32 %b, i1 %flag) { entry: %c = add i32 %a, %b br i1 %flag, label %then, label %else then: %d = call i32 @g(i32 %c) ret i32 %d else: ret i32 %c } ``` IR 可以被視作許多的 basic block(`entry`、`then`...)，IR 透過如 `br` 等指令，去操作程式運作的 control flow ### Data flow 首先要從 [SSA(Static Single Assignment)](https://gcc.gnu.org/onlinedocs/gccint/SSA.html)說起。在 SSA representation 中，每個變數只會被 Assign 一次，並且在使用前會被定義。藉此，SSA 得以把原本複雜的條件和程式邏輯轉化為 Graph，將程式邏輯轉換成明確的路徑 ```c= declare i32 @g(i32 %x) define i32 @f(i32 %a, i32 %b, i1 %flag) { entry: %c = add i32 %a, %b br i1 %flag, label %then, label %end then: %d = call i32 @g(i32 %c) br label %end end: %result = phi i32 [ %entry, %c ], [ %then, %d ] ret i32 %result } ``` 有個問題是，如果需要得到一個只在某條件分支下的值(舉例來說，上面 IR 中的 `%d`)，在 SSA 中該如何使用它呢? SSA representation 可以透過 `phi` (Φ)，根據現在的 basic block 是從哪裡進入的來合併分支。以上面的 IR 來說，如果是從 entry 進入 end，則 assgin `%c`; 如果是從 then 進入，則 assign`%d`。 ## What are optimizer doing？ ### Step 1. Cleanup 以這個程式為例 ```c= int atoi(const char *); int main(int argc, char **argv) { if (argc != 3) return -1; return atoi(argv[1]) + atoi(argv[2]); } ``` 如果我們用 ``` $ clang -cc1 -emit-llvm -O0 tmp.c -o tmp.ll ``` 在沒有最佳化的狀況下去產生 IR，會得到 ```c= define i32 @main(i32 %argc, i8** %argv) #0 { %retval = alloca i32, align 4 %argc.addr = alloca i32, align 4 %argv.addr = alloca i8**, align 8 store i32 0, i32* %retval, align 4 store i32 %argc, i32* %argc.addr, align 4 store i8** %argv, i8*** %argv.addr, align 8 %1 = load i32, i32* %argc.addr, align 4 %cmp = icmp ne i32 %1, 3 br i1 %cmp, label %2, label %3 ; <label>:2: ; preds = %0 store i32 -1, i32* %retval, align 4 br label %8 ; <label>:3: ; preds = %0 %4 = load i8**, i8*** %argv.addr, align 8 %arrayidx = getelementptr inbounds i8*, i8** %4, i64 1 %5 = load i8*, i8** %arrayidx, align 8 %call = call i32 @atoi(i8* %5) %6 = load i8**, i8*** %argv.addr, align 8 %arrayidx1 = getelementptr inbounds i8*, i8** %6, i64 2 %7 = load i8*, i8** %arrayidx1, align 8 %call2 = call i32 @atoi(i8* %7) %add = add nsw i32 %call, %call2 store i32 %add, i32* %retval, align 4 br label %8 ; <label>:8: ; preds = %3, %2 %9 = load i32, i32* %retval, align 4 ret i32 %9 } ``` 用 SSA 方式來 modeling dataflow 並不容易，至少這並不是 frontend 該負責的工作。因此，frontend 會假設程式擁有無限多的 memory 把所有東西都放在 memory，因此可以看到一堆的 alloc、 store、 load。因此，在 cleanup 階段，optimizer 會嘗試用 SSA form 替換這些 memory-based 的指令、替換需要使用記憶體或者暫存器、並且去除某些多餘的變數。當我們將參數替換成 `-O1` 來編譯，可以得到相對簡潔的 IR: ```c= define i32 @main(i32 %argc, i8** nocapture readonly %argv) local_unnamed_addr #0 { %cmp = icmp eq i32 %argc, 3 br i1 %cmp, label %1, label %4 ; <label>:1: ; preds = %0 %arrayidx = getelementptr inbounds i8*, i8** %argv, i64 1 %2 = load i8*, i8** %arrayidx, align 8, !tbaa !2 %call = tail call i32 @atoi(i8* %2) %arrayidx1 = getelementptr inbounds i8*, i8** %argv, i64 2 %3 = load i8*, i8** %arrayidx1, align 8, !tbaa !2 %call2 = tail call i32 @atoi(i8* %3) %add = add nsw i32 %call2, %call br label %4 ; <label>:4: ; preds = %0, %1 %retval.0 = phi i32 [ %add, %1 ], [ -1, %0 ] ret i32 %retval.0 } ``` ### Step 2: Canonicalization(規範化) 一個相同行為的 C/C++ 程式，可以被表達成多種形式: ![](https://i.imgur.com/cEzouXu.png) optimizer 會將原本的 control flow 重寫，例如: 把 not-equal 都用同等 equal 取代(即使 IR 中有 neq)、拿掉在 control flow 中的的 empty basic block 等，確保相同行為的 control flow 有一致的 pattern。因此，即使我們將上面的程式改寫為： ```c= int main(int argc, char **argv) { return argc == 3? atoi(argv[1]) + atoi(argv[2]) : -1; } ``` 使用 `-O1` 後得到的 IR 仍然是相同的形式！透過 Canonicalization，相同邏輯的程式對於 optimizer 就可以維持一致的樣貌，更有利於 optimizer 設計最佳化的方法。 ### Step 3: Collapse Abstractions 3 個主要的 abstraction 1. Functions, calls, and the call graph 2. Memory, loads, and stores 3. Loops #### `Function call` funtion call 是有巨大成本的(如 call stack)，所以對於 optimizer 來說，具備判別 function 是否足夠 **"複雜"** 以至於需要對 function 做 [inline](https://en.wikipedia.org/wiki/Inline_function) 的能力就極其重要。而 optimizer 對 function call 做最佳化的基礎就建立在 call-graph 上 ![](https://i.imgur.com/0mqHfxA.png) 首先找出 call-graph 中的 [strongly connected components](https://en.wikipedia.org/wiki/Strongly_connected_component) ![](https://i.imgur.com/nFHboPf.png) 有了這些 strongly connected components 後，對於 optimizer 而言，圖中的 leaves components(G-H-I) 是做 **inline** 最佳化的第一候選，因為 leaves 不會再往下做 call function。 ![](https://i.imgur.com/10XzL7j.png) 一旦對 (G-H-I) component 做了充分的最佳化後，接著 bottom up 的去對 (D-E-F)。這個步驟是很重要的，由於 E call 了 G，如果沒有事先對 (G-H-I) 進行最佳化，E 和 G 之間的關係資訊會產生錯誤(舉例來說，如果 G 是不重要的 function，E 大可跳過 G 直接 call H，但如果沒有事先對(G-H-I) 做最佳化則不會曉得)。 ![](https://i.imgur.com/SwJ0U8L.png) 下一個問題是，optimizer 怎麼知道一個 function 是否足夠不 **"複雜"** 以至於可以被 inline? ```c= int g(double x, double y, double z); int f(struct S* s, double y, double x) { return g(x, y, s->z); } ``` 許多的 function 可能都如上面的程式: 例如為了使一個 function 接受要求的 parameters 型別，而建立 g 的 wrapper f。以這個例子而言，g 是可以被 inline 在 f 中的。但是 inline 與否的問題並不總是如此單純，argument 的差異、branch 狀況的差異對 function 執行行為的影響等，成為了 optimizer 決定是否要做 inline 的難點。例如你想設計一個 function，依據 vector 大小做不同的排序演算法。optimizer 需 bottom up 搜尋 SSA tree，得到 vector 實際大小的資訊，才可以決定 inline 與否。也就是說，對於跨 function 的 optimization，需要大量資源才能正確達成，並且可能有無數的優化選項。 :::info :triangular_flag_on_post: 你可能會想問: 既然 function call 會有成本，為甚麼不 inline 所有 function 呢？顯然這是不可行的，原因是: * 並不是所有 function 都可以容易的、甚至根本不可能被 inline * 動態連結的函式庫 * 非 tail recursion 的 recursive function * inline 並不完美，它也有某些缺點 * inline 使得 executable 更大，代表需要更多的儲存空間、更多的 loading time * 考慮到 cache，過度的 inline 可能反而導致 instruction cache 上的指令互相排擠，locality 不佳 ::: #### `Memory, loads, and stores` 如前所述，frontend 以 memory 是無限的前提來產生 IR，這使得它的工作變得容易。而 optimizer 需透過 SSA form，觀察實際的值為何，並擅用 register 來安排這些變數。 ![](https://i.imgur.com/LZteHbu.png) 以這個例子而言，左邊相當於 frontend 直接產出的結果，可以看到 IR 會把計算出的 return 值 store 到 `%mem` 中，再用 load 把 `%mem` 讀進 `%result` 作 return; 以右邊來說，optimizer 則透過 phi，迴避了對記憶體額外的使用。 #### `Loops` 大部份狀況下 loop 總是影響程式執行效率的地方，也因此是 optimizer 做最佳化的主要目標。用以下程式為例： ```c= #include <vector> int sum(const std::vector<int> & array) { int result = 0; for (auto i: array) { result += i; } return result; } ``` 可以透過下面的 cmd 得到未經 optimization 的 IR ``` $ clang++ -S -emit-llvm -O0 tmp.cpp -o tmp.ll ``` 然後，你會看到小小的 loop 產生出數量龐大的 IR。現在，我們假設 optimizer 已經對這一堆 IR 做了除了 loop 以外的，前面所說的最佳化(function/memory abstraction...)，此時的 IR 大概可以概括為 ```c= define i32 @sum(%vector* nocapture readonly dereferenceable(24) %v) { entry: %begin_ptr = getelementptr inbounds %vector, %vector* %v, i64 0, i32 0, i32 0 %begin = load i32*, i32** %begin_ptr, align 8 %end_ptr = getelementptr inbounds %vector, %vector* %v, i64 0, i32 0, i32 1 %end = load i32*, i32** %end_ptr, align 8 br label %loop.head loop.head: %ptr = phi i32* [ %begin, %entry ], [ %ptr.next, %loop.latch ] %x = phi i32 [ 0, %entry ], [ %x.next, %loop.latch ] %cond = icmp eq i32* %ptr, %end br i1 %cond, label %exit, label %loop.latch loop.latch: %i = load i32, i32* %ptr, align 4 %x.next = add nsw i32 %x, %i %ptr.next = getelementptr inbounds i32, i32* %ptr, i64 1 br label %loop.head exit: ret i32 %x } ``` 如前所述，optimizer 要對 IR 有做 cleanup，並使之有統一的格式(Canonicalization)，當然 loop 也不例外，因此會把 IR 調整成: ```c= define i32 @sum(%vector* nocapture readonly dereferenceable(24)) #0 { entry: %begin_ptr = getelementptr inbounds %vector, %vector* %0, i64 0, i32 0, i32 0 %begin = load i32*, i32** %begin_ptr, align 8, !tbaa !2 %end_ptr = getelementptr inbounds %vector, %vector* %v, i64 , i32 0, i32 1 %end = load i32*, i32** %end_ptr, align 8 %empty = icmp eq i32* %begin, %end br i1 %empty, label %exit, label %loop.ph loop.ph: ; preds = %entry br label %loop loop: ; preds = %loop.ph, %loop %x = phi i32 [ 0, %loop.ph ], [ %x.next, %loop ] %ptr = phi i32* [ %begin, %loop.ph ], [ %ptr.next, %loop ] %i = load i32, i32* %ptr, align 4 %x.next = add nsw i32 %x, %i %ptr.next = getelementptr inbounds i32, i32* %ptr, i64 1 %cond = icmp eq i32* %ptr.next, %end br i1 %cond, label %loop.exit, label %loop loop.exit: ; preds = %loop %x.lcssa = phi i32 [ %x.next, %loop ] br label %exit exit: ; preds = %loop.exit, %entry %x.result = phi i32 [ %x.lcssa, %loop.exit ], [ 0, %entry ] ret i32 %x.result ``` 你可以注意到幾個重要的Canonicalization: * loop perheder `loop.ph` 的唯一工作，就是擔任 `loop 的進入點 * `loop.exit` 則擔任 loop 的退出點，而沒有其他的 predecessor * `loop.exit` 的 phi 有些特別，裡面只有單個 input。藉此，和前面的 phi 是用來合併(merge) branch 不同，SSA form 可以表達選擇 loop iteration 後的最終 `%x.next`。抽離出的 `loop.exit` 唯一的 predecessor 是 `loop`，並且確切是 block 離開 `loop` 的點，這使得 loop 的表達更為單純，optimizer 可以藉此更好的做 optimization。在此把 IR 調整成統一格式後，接著 optimizer 便可以開始嘗試對其做各種最佳化了： * optimizer 嘗試判斷程式作為 loop 的必要性，嘗試將其展開([loop unrolling](https://en.wikipedia.org/wiki/Loop_unrolling)) * 類似以下的程式，假設 a 跟 b 都是固定的數值 ```c= // version 1 for (int i = 0; i < 10; i++) { out += (a + b); } ``` 那麼寫成下面的形式，對於程式的執行似乎來比較有效率？(減少了 9 次的加法成本) ```c= // version 2 int tmp = a + b; for (int i = 0; i < 10; i++) { out += tmp; } ``` 事實上，對於聰明的編譯器來說，optimizer 會追蹤每個 phi 節點，因此它知道在 loop 內外有什麼被更動了，而甚麼沒有。因此，不勞程式撰寫者費心，需特別宣告 variable 來 cache 某個 constant，如果寫成 version 1 的可讀性更好，你當然可以那樣寫！ * 對於 loop 內部的 branch condition，optimizer 除會嘗試把條件分支移動到 loop 之外，還會針對每種情況創建獨立的 loop block * 現代處理器的 [SIMD](https://en.wikipedia.org/wiki/SIMD) 架構提供更佳的運算效率。因此 optimizer 可以擅用 [loop vectorization](https://en.wikipedia.org/wiki/Automatic_vectorization#Loop-level_automatic_vectorization) 使用 SIMD 指令提升程式運行的效率 ## When abstraction combined 不難想像，當 optimizer 需要最佳化所有上述的 abstraction 時，optimization 面臨巨大的挑戰。以下面的程式來說: ```c= int f(int a, int b) { int c; g(a, b, c); return a + b + c; } void g(int a, int b, int &c) { c = a * b; } ``` 在這個程式中，因為 g 接受的參數 `c` 是一個 output parameter，這意味著是否 inline `g()` ，除了考慮 function 本身，也需要結合 memory 的資訊。 ```c= struct S { float x, y, z; double delta; double compute(); }; double f() { S s; s.x = /* expensive compute */; s.y = /* expensive compute */; s.z = /* expensive compute */; s.delta = s.x - s.y - s.z; return s.compute(); } ``` 上述的程式中也存在相同的問題，function `f` 設置 `struct S` 的 members，然後再呼叫 s.compute() 利用這些數值進行計算。由於 `s.compute()` 相當於是一個 class method，因此呼叫 `s.compute()` 涉及了對 `'this'` pointer 的操作，因此，function 本身隱含了 memory 的 abstraction。或許你可以考慮改寫成下面的形式。 ```c= struct S { float x, y, z; double delta; }; double compute(S s); double f() { S s; s.x = /* expensive compute */; s.y = /* expensive compute */; s.z = /* expensive compute */; s.delta = s.x - s.y - s.z; return compute(s); } ``` 這使得 optimizer 可以比較容易的進行最佳化。 ## Conclusion 如講者 Chandler Carruth 所說，演講的題目之所以是 Understanding Compiler Optimization 而不是 How to write your C++ faster，是因為並不存在一個黃金準則，告訴你怎麼撰寫程式可以得到最佳的 performance。然而，藉由瞭解 optimizer 的運作，我們可以更深入的思考程式與 optimization 的關係，預期自己的程式將被優化而成的樣子。 :::danger 因為我的英文不好加上實力也不好:cry:，其實感覺自己有很多地方都理解的很含糊，建議大家要自己看過演講的影片，如果有發現我有誤解的地方務必要糾正我~ :::