2017q3 Team5 (MathEX)

# 2017q3 Team5 (MathEX) ###### tags: `sysprogram` contributed by <`st9007a`, `ian910297`> ## [GitHub](https://github.com/st9007a/MathEX) ## 題目說明 ### 目標擴充 [MathEX](https://github.com/jserv/MathEX)，設計高效能的數學表示式運算器，參考 [muparserSSE - A Math Expression Compiler](http://beltoforion.de/article.php?a=muparsersse&p=implementation) ### 先前組別進度 - 研讀 [muparserSSE - A Math Expression Compiler](http://beltoforion.de/article.php?a=muparsersse&p=implementation) 並且做筆記 - 引入 sse 做改善：失敗 - 引入 inline assembly：僅提出，未實作 ## MathEX 介紹 ### 簡介 MathEX 能夠輸入一段數學表達式，輸出這段數學表達式所對應的答案，透過 [github](https://github.com/jserv/MathEX/blob/master/README.md) 上的說明，可以知道基本的使用方式： ```clike #include "expression.h" /* Custom function that returns the sum of its two arguments */ static float add(struct expr_func *f, vec_expr_t args, void *c) { float a = expr_eval(&vec_nth(&args, 0)); float b = expr_eval(&vec_nth(&args, 1)); return a + b; } static struct expr_func user_funcs[] = { {"add", add, 0}, {NULL, NULL, 0}, }; int main() { const char *s = "x = 40, add(2, x)"; struct expr_var_list vars = {0}; struct expr *e = expr_create(s, strlen(s), &vars, user_funcs); if (!e) { printf("Syntax error"); return 1; } printf("result: %f\n", expr_eval(e)); expr_destroy(e, &vars); return 0; } ``` 上面的程式碼為範例，利用 `expr_create()` 將字串 `x = 40, add(2, x)` 輸入，再使用 `expr_eval()` 來產生輸出結果 ### API 這個工具提供了 4 個 API： - `struct expr *expr_create(const char *s, size_t len, struct expr_var_list *vars, struct expr_func *funcs)`：將輸入的字串數學表達式編譯成 `struct expr` 型態，並且回傳 - `float expr_eval(struct expr *e)`：計算 `struct expr` 的結果並回傳 - `void expr_destroy(struct expr *e, struct expr_var_list *vars)`：清除記憶體，`vars` 可以為 NULL (不清除 `vars` 的記憶體) - `struct expr_var *expr_var(struct expr_var_list *vars, const char *s, size_t len)`：在給定的變數清單 `vars` 中回傳對應名稱 `s` 的變數，如果清單中不存在該變數則會產生一個並回傳 ### 應用情境這類的數學表達式分析工具可以被應用於工程計算機類的軟體，在 github 上也可以找到相關案例，如：[insect](https://github.com/sharkdp/insect) 是一個支援物理單位的工程計算機，其中可以發現 [insect/src/Insect/Parser.purs](https://github.com/sharkdp/insect/blob/master/src/Insect/Parser.purs) 這支程式就是這個計算機的數學表達式的 parser ## 解析程式碼透過 `expression.h` 的 macro 以及 struct 可以找到用來表示 expression 的 struct： ```clike #define vec(T) \ struct { \ T *buf; \ int len; \ int cap; \ } typedef vec(struct expr) vec_expr_t; struct expr { int type; union { struct { float value; } num; struct { float *value; } var; struct { vec_expr_t args; } op; struct { struct expr_func *f; vec_expr_t args; void *context; } func; } param; }; ``` 從上面的定義可以看到 `struct expr` 中，`param.op.args` 裡面其實包著 `struct expr` 的 pointer，在後續的程式中可以發現這個 pointer 會指向 1 或 2 個 `struct expr`，因此得知 `struct expr` 是一個 tree 的結構接著可以透過 `expression.c` 中的 `expr_eval()` 中得知如何計算 expression 結果： ```clike float expr_eval(struct expr *e) { float n; switch (e->type) { case OP_UNARY_MINUS: return -(expr_eval(&e->param.op.args.buf[0])); case OP_UNARY_LOGICAL_NOT: return !(expr_eval(&e->param.op.args.buf[0])); case OP_UNARY_BITWISE_NOT: return ~(to_int(expr_eval(&e->param.op.args.buf[0]))); case OP_POWER: return powf(expr_eval(&e->param.op.args.buf[0]), expr_eval(&e->param.op.args.buf[1])); case OP_MULTIPLY: return expr_eval(&e->param.op.args.buf[0]) * expr_eval(&e->param.op.args.buf[1]); case OP_DIVIDE: return expr_eval(&e->param.op.args.buf[0]) / expr_eval(&e->param.op.args.buf[1]); case OP_REMAINDER: return fmodf(expr_eval(&e->param.op.args.buf[0]), expr_eval(&e->param.op.args.buf[1])); } ... } ``` 從上面可以知道，它透過遞迴呼叫，計算目前節點的左子節點跟右子節點的結果，最後根據自己的 operator 將左子節點跟右子節點的結果作運算 ## 嘗試優化 - 移除遞迴呼叫首先第一個想到的是將遞迴呼叫移除，換另一種演算法來算這棵樹的答案，這邊採用的透過深度優先搜尋的方式計算，程式大致如下： ```clike float expr_eval_with_dfs(struct expr *e) { #define STACK_SIZE 1000 int visited[STACK_SIZE] = {0}; struct expr *op_stack[STACK_SIZE]; float tmp_val[STACK_SIZE]; int op_sp = -1; int tmp_sp = -1; struct expr *current; op_stack[++op_sp] = e; while (op_sp >= 0) { current = op_stack[op_sp]; if (visited[op_sp]) { float val; switch (current->type) { case OP_UNARY_MINUS: val = -tmp_val[tmp_sp--]; break; case OP_UNARY_LOGICAL_NOT: val = !tmp_val[tmp_sp--]; break; case OP_UNARY_BITWISE_NOT: val = ~to_int(tmp_val[tmp_sp--]); break; case OP_POWER: val = powf(tmp_val[tmp_sp--], tmp_val[tmp_sp--]); break; case OP_MULTIPLY: val = tmp_val[tmp_sp--] * tmp_val[tmp_sp--]; break; ... } tmp_val[++tmp_sp] = val; op_sp--; } else if (current->type == OP_VAR || current->type == OP_CONST) { op_sp--; tmp_val[++tmp_sp] = current->type == OP_VAR ? *current->param.var.value : current->param.num.value; } else { visited[op_sp] = 1; for (int i = 0; i < current->param.op.args.len; i++) { op_stack[++op_sp] = &current->param.op.args.buf[i]; } } } return tmp_val[0]; } ``` 主要的做法是先將 root node 放進 op_stack，接著在迴圈裡面，每次檢查 op_stack 最上面的節點： - 如果已經訪問過就將其 pop 出來，同時將 tmp_val_stack 的值 pop 出來根據 pop 出來的 op 做計算，算出來的結果 push 回 tmp_val_stack - 如果節點代表的是一個常數或變數就將節點 pop 出來，同時將其存放的值 push 進 tmp_val_stack - 如果節點還沒訪問就將它的子節點 push 進 op_stack，同時將該節點標示為已訪問接著利用 repo 中的 `test-bench.c` 來查看結果，由於修改後的程式沒有實作 function 跟 assignment 所以只拿 `test-bench.c` 中前半部的 test case 來比較：修改前： ``` BENCH 5: 5.650043 ns/op (176M op/sec) BENCH 5+5+5+5+5+5+5+5+5+5: 39.499998 ns/op (25M op/sec) BENCH 5*5*5*5*5*5*5*5*5*5: 38.609982 ns/op (25M op/sec) BENCH 5,5,5,5,5,5,5,5,5,5: 42.056084 ns/op (23M op/sec) BENCH ((5+5)+(5+5))+((5+5)+(5+5))+(5+5): 34.670115 ns/op (28M op/sec) BENCH x=5: 3.753185 ns/op (266M op/sec) BENCH x=5,x+x+x+x+x+x+x+x+x+x: 44.126987 ns/op (22M op/sec) BENCH x=5,((x+x)+(x+x))+((x+x)+(x+x))+(x+x): 43.509007 ns/op (22M op/sec) BENCH a=1,b=2,c=3,d=4,e=5,f=6,g=7,h=8,i=9,j=10: 59.928894 ns/op (16M op/sec) BENCH a=1,a=2,a=3,a=4,a=5,a=6,a=7,a=8,a=9,a=10: 58.668852 ns/op (17M op/sec) ``` 修改後： ``` BENCH 5: 71.628094 ns/op (13M op/sec) BENCH 5+5+5+5+5+5+5+5+5+5: 125.663996 ns/op (7M op/sec) BENCH 5*5*5*5*5*5*5*5*5*5: 122.828007 ns/op (8M op/sec) BENCH 5,5,5,5,5,5,5,5,5,5: 116.539001 ns/op (8M op/sec) BENCH ((5+5)+(5+5))+((5+5)+(5+5))+(5+5): 82.113028 ns/op (12M op/sec) BENCH x=5: 66.423893 ns/op (15M op/sec) BENCH x=5,x+x+x+x+x+x+x+x+x+x: 126.016855 ns/op (7M op/sec) BENCH x=5,((x+x)+(x+x))+((x+x)+(x+x))+(x+x): 88.551044 ns/op (11M op/sec) BENCH a=1,b=2,c=3,d=4,e=5,f=6,g=7,h=8,i=9,j=10: 91.861010 ns/op (10M op/sec) BENCH a=1,a=2,a=3,a=4,a=5,a=6,a=7,a=8,a=9,a=10: 95.366001 ns/op (10M op/sec) ``` 結果修改後執行速度反而慢了 2 - 10 倍，初步的推測是修改後增加的條件判斷導致 branch miss 變高了，因此用 perf 來查看：修改前： ``` Performance counter stats for './build/test-bench' (20 runs): 3475 cache-misses # 4.681 % of all cache refs ( +- 31.43% ) 7,4234 cache-references ( +- 2.14% ) 8,6033,9900 branches ( +- 0.00% ) 301,4676 branch-misses # 0.35% of all branches ( +- 0.00% ) 13,0621,3178 cycles ( +- 0.14% ) 28,6157,2152 instructions # 2.19 insns per cycle ( +- 0.00% ) 0.367924836 seconds time elapsed ( +- 0.23% ) ``` 修改後： ``` Performance counter stats for './build/test-bench' (20 runs): 4203 cache-misses # 4.893 % of all cache refs ( +- 24.88% ) 8,5896 cache-references ( +- 2.33% ) 10,6343,4755 branches ( +- 0.00% ) 458,6096 branch-misses # 0.43% of all branches ( +- 3.83% ) 32,5854,0164 cycles ( +- 2.16% ) 46,6710,6417 instructions # 1.43 insns per cycle ( +- 0.00% ) 0.911646339 seconds time elapsed ( +- 2.13% ) ``` 用 perf 觀察後發現，branch miss 雖然有比較多，但是佔總體的比例其實並不高，真正的原因在於修改後的 cycle 以及 instruction 數量變多了 ## 嘗試優化 - 使用 inline assembly code 主要修改的 function 為 expr_eval() ，修改的結果為 expr_eval_with_asm() 先是針對 add 的部分進行改寫 ```clike case OP_PLUS: a = expr_eval(&e->param.op.args.buf[0]); b = expr_eval(&e->param.op.args.buf[1]); asm("fld %1;" "fld %2;" "faddp;" "fstp %0" : "=m" ( n ) : "m" ( a ), "m" ( b ) ); return n; ``` 修改前： ```clike BENCH 5+5: 5.864859 ns/op (170M op/sec) BENCH 5+10+15+20: 13.442039 ns/op (74M op/sec) BENCH 5+5+5+5+5+5+5+5+5+5: 35.995960 ns/op (27M op/sec) ``` 修改後： ```clike BENCH 5+5: 6.362200 ns/op (157M op/sec) BENCH 5+10+15+20: 14.869928 ns/op (67M op/sec) BENCH 5+5+5+5+5+5+5+5+5+5: 37.159920 ns/op (26M op/sec) ``` 單純從數字上觀察不出為什麼比較慢，所以使用 perf 進行查看修改前： ```clike 435 cache-misses # 2.285 % of all cache refs ( +- 50.65% ) (79.45%) 19,032 cache-references ( +- 3.47% ) (84.70%) 147,572,473 branches ( +- 0.21% ) (85.68%) 10,609 branch-misses # 0.01% of all branches ( +- 0.86% ) (71.44%) 583,884,950 instructions # 2.63 insn per cycle ( +- 0.04% ) (85.72%) 222,241,510 cycles ( +- 0.51% ) (81.45%) 0.056145017 seconds time elapsed ( +- 0.49% ) ``` 修改後： ```clike 265 cache-misses # 1.284 % of all cache refs ( +- 31.74% ) (79.73%) 20,609 cache-references ( +- 2.49% ) (81.43%) 149,186,562 branches ( +- 0.36% ) (85.61%) 10,589 branch-misses # 0.01% of all branches ( +- 1.13% ) (72.91%) 617,528,324 instructions # 2.63 insn per cycle ( +- 0.02% ) (86.48%) 234,633,148 cycles ( +- 0.32% ) (83.07%) 0.059308032 seconds time elapsed ( +- 0.33% ) ``` 比較發現，cache-misses 雖然有下降，但修改後的 instructions, cycles 比原本的多。推測是因為修改後的程式比原本的多了兩個 assignment，導致需要執行更多的 instruction, cycles。使用 GDB 檢查產生的組語，驗證推測修改前： ``` B+>│0x55555555ad50 <expr_eval+880> mov 0x8(%rbx),%rdi │ │0x55555555ad54 <expr_eval+884> callq 0x55555555a9e0 <expr_eval> │ │0x55555555ad59 <expr_eval+889> mov 0x8(%rbx),%rdi │ │0x55555555ad5d <expr_eval+893> movss %xmm0,0xc(%rsp) │ │0x55555555ad63 <expr_eval+899> add $0x28,%rdi │ │0x55555555ad67 <expr_eval+903> callq 0x55555555a9e0 <expr_eval> │ │0x55555555ad6c <expr_eval+908> addss 0xc(%rsp),%xmm0 │ │0x55555555ad72 <expr_eval+914> jmpq 0x55555555aa30 <expr_eval+80> ``` 修改後 ``` B+>│0x55555555b5d0 <expr_eval_with_asm+944> mov 0x8(%rdi),%rdi │ │0x55555555b5d4 <expr_eval_with_asm+948> callq 0x55555555a9e0 <expr_eval> │ │0x55555555b5d9 <expr_eval_with_asm+953> mov 0x8(%rbx),%rdi │ │0x55555555b5dd <expr_eval_with_asm+957> movss %xmm0,0x20(%rsp) │ │0x55555555b5e3 <expr_eval_with_asm+963> add $0x28,%rdi │ │0x55555555b5e7 <expr_eval_with_asm+967> callq 0x55555555a9e0 <expr_eval> │ │0x55555555b5ec <expr_eval_with_asm+972> movss %xmm0,0x24(%rsp) │ │0x55555555b5f2 <expr_eval_with_asm+978> flds 0x20(%rsp) │ │0x55555555b5f6 <expr_eval_with_asm+982> flds 0x24(%rsp) │ │0x55555555b5fa <expr_eval_with_asm+986> faddp %st,%st(1) │ │0x55555555b5fc <expr_eval_with_asm+988> fstps 0x1c(%rsp) │ │0x55555555b600 <expr_eval_with_asm+992> movss 0x1c(%rsp),%xmm0 ``` 修改後的程式確實比原本的程式更麻煩，多了兩個 flds ，原始版本可以直接做相加。除了 add 之外，也對 sub, mul, div 都進行了改寫，不過也都是比較慢的結果 assembly 撰寫的參考資料 * [x86 Assembly Guide](http://flint.cs.yale.edu/cs421/papers/x86-asm/asm.html) * [x86 Assembly Language Reference Manual](https://docs.oracle.com/cd/E19253-01/817-5477/817-5477.pdf) * [GCC-Inline-Assembly-HOWTO](https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html) * [Using Inline Assembly in C/C++](https://www.codeproject.com/Articles/15971/Using-Inline-Assembly-in-C-C) * [Stackoverflow: Some questions about inline assembly](https://stackoverflow.com/a/34522750) * [Unknown error](https://www.daniweb.com/programming/software-development/threads/100710/gcc-warning-translating-to-fmulp-etc) ## MathEX 可以使用在什麼地方上述優化失敗之後，當然還是要繼續優化，探討為什麼會失敗？不過我們之前並沒有思考過，MathEX可要如何應用在工程計算機以外的領域，或許先往這方面研究可以有一些新的想法。老師在上課有提到，這樣小型的程式，是不是可以應用在 [Ethereum](https://www.ethereum.org/) 的智能合約上，並提供了一篇相關文章作為範例 [利用工具加速Dapp建置和測試](https://medium.com/taipei-ethereum-meetup/%E5%88%A9%E7%94%A8%E5%B7%A5%E5%85%B7%E5%8A%A0%E9%80%9Fdapp%E5%BB%BA%E7%BD%AE%E5%92%8C%E6%B8%AC%E8%A9%A6-fb08e77f208e)