2017q3 Team5 (MathEX)

tags: `sysprogram`

contributed by <st9007a, ian910297>

GitHub

題目說明

目標

擴充 MathEX，設計高效能的數學表示式運算器，參考 muparserSSE - A Math Expression Compiler

先前組別進度

研讀 muparserSSE - A Math Expression Compiler 並且做筆記
引入 sse 做改善：失敗
引入 inline assembly：僅提出，未實作

MathEX 介紹

簡介

MathEX 能夠輸入一段數學表達式，輸出這段數學表達式所對應的答案，透過 github 上的說明，可以知道基本的使用方式：

#include "expression.h"

/* Custom function that returns the sum of its two arguments */
static float add(struct expr_func *f, vec_expr_t args, void *c) {
    float a = expr_eval(&vec_nth(&args, 0));
    float b = expr_eval(&vec_nth(&args, 1));
    return a + b;
}

static struct expr_func user_funcs[] = {
    {"add", add, 0}, {NULL, NULL, 0},
};

int main() {
    const char *s = "x = 40, add(2, x)";
    struct expr_var_list vars = {0};
    struct expr *e = expr_create(s, strlen(s), &vars, user_funcs);
    if (!e) { printf("Syntax error"); return 1; }

    printf("result: %f\n", expr_eval(e));

    expr_destroy(e, &vars);
    return 0;
}

上面的程式碼為範例，利用 expr_create() 將字串 x = 40, add(2, x) 輸入，再使用 expr_eval() 來產生輸出結果

API

這個工具提供了 4 個 API：

struct expr *expr_create(const char *s, size_t len, struct expr_var_list *vars, struct expr_func *funcs)：將輸入的字串數學表達式編譯成 struct expr 型態，並且回傳
float expr_eval(struct expr *e)：計算 struct expr 的結果並回傳
void expr_destroy(struct expr *e, struct expr_var_list *vars)：清除記憶體，vars 可以為 NULL (不清除 vars 的記憶體)
struct expr_var *expr_var(struct expr_var_list *vars, const char *s, size_t len)：在給定的變數清單 vars 中回傳對應名稱 s 的變數，如果清單中不存在該變數則會產生一個並回傳

應用情境

這類的數學表達式分析工具可以被應用於工程計算機類的軟體，在 github 上也可以找到相關案例，如：insect 是一個支援物理單位的工程計算機，其中可以發現 insect/src/Insect/Parser.purs 這支程式就是這個計算機的數學表達式的 parser

解析程式碼

透過 expression.h 的 macro 以及 struct 可以找到用來表示 expression 的 struct：

#define vec(T)                                                                \
    struct {                                                                  \
        T *buf;                                                               \
        int len;                                                              \
        int cap;                                                              \
    }

typedef vec(struct expr) vec_expr_t;
struct expr {
    int type;
    union {
        struct { float value; } num;
        struct { float *value; } var;
        struct { vec_expr_t args; } op;
        struct {
            struct expr_func *f;
            vec_expr_t args;
            void *context;
        } func;
    } param;
};

從上面的定義可以看到 struct expr 中，param.op.args 裡面其實包著 struct expr 的 pointer，在後續的程式中可以發現這個 pointer 會指向 1 或 2 個 struct expr，因此得知 struct expr 是一個 tree 的結構

接著可以透過 expression.c 中的 expr_eval() 中得知如何計算 expression 結果：

float expr_eval(struct expr *e)
{
    float n;
    switch (e->type) {
    case OP_UNARY_MINUS:
        return -(expr_eval(&e->param.op.args.buf[0]));
    case OP_UNARY_LOGICAL_NOT:
        return !(expr_eval(&e->param.op.args.buf[0]));
    case OP_UNARY_BITWISE_NOT:
        return ~(to_int(expr_eval(&e->param.op.args.buf[0])));
    case OP_POWER:
        return powf(expr_eval(&e->param.op.args.buf[0]),
            expr_eval(&e->param.op.args.buf[1]));
    case OP_MULTIPLY:
        return expr_eval(&e->param.op.args.buf[0])
            * expr_eval(&e->param.op.args.buf[1]);
    case OP_DIVIDE:
        return expr_eval(&e->param.op.args.buf[0])
            / expr_eval(&e->param.op.args.buf[1]);
    case OP_REMAINDER:
        return fmodf(expr_eval(&e->param.op.args.buf[0]),
            expr_eval(&e->param.op.args.buf[1]));
    }

    ...
}

從上面可以知道，它透過遞迴呼叫，計算目前節點的左子節點跟右子節點的結果，最後根據自己的 operator 將左子節點跟右子節點的結果作運算

嘗試優化 - 移除遞迴呼叫

首先第一個想到的是將遞迴呼叫移除，換另一種演算法來算這棵樹的答案，這邊採用的透過深度優先搜尋的方式計算，程式大致如下：

float expr_eval_with_dfs(struct expr *e)
{
#define STACK_SIZE 1000
    int visited[STACK_SIZE] = {0};
    struct expr *op_stack[STACK_SIZE];
    float tmp_val[STACK_SIZE];

    int op_sp = -1;
    int tmp_sp = -1;

    struct expr *current;

    op_stack[++op_sp] = e;

    while (op_sp >= 0) {
        current = op_stack[op_sp];

        if (visited[op_sp]) {
            float val;

            switch (current->type) {
                case OP_UNARY_MINUS:
                    val = -tmp_val[tmp_sp--];
                    break;
                case OP_UNARY_LOGICAL_NOT:
                    val = !tmp_val[tmp_sp--];
                    break;
                case OP_UNARY_BITWISE_NOT:
                    val = ~to_int(tmp_val[tmp_sp--]);
                    break;
                case OP_POWER:
                    val = powf(tmp_val[tmp_sp--], tmp_val[tmp_sp--]);
                    break;
                case OP_MULTIPLY:
                    val = tmp_val[tmp_sp--] * tmp_val[tmp_sp--];
                    break;
                                    
                    ...
    
            }
            tmp_val[++tmp_sp] = val;
            op_sp--;
        }
        else if (current->type == OP_VAR || current->type == OP_CONST) {
            op_sp--;
            tmp_val[++tmp_sp] = current->type == OP_VAR ? *current->param.var.value : current->param.num.value;
        }
        else {
            visited[op_sp] = 1;
            for (int i = 0; i < current->param.op.args.len; i++) {
                op_stack[++op_sp] = &current->param.op.args.buf[i];
            }
        }
    }

    return tmp_val[0];
}

主要的做法是先將 root node 放進 op_stack，接著在迴圈裡面，每次檢查 op_stack 最上面的節點：

如果已經訪問過就將其 pop 出來，同時將 tmp_val_stack 的值 pop 出來根據 pop 出來的 op 做計算，算出來的結果 push 回 tmp_val_stack
如果節點代表的是一個常數或變數就將節點 pop 出來，同時將其存放的值 push 進 tmp_val_stack
如果節點還沒訪問就將它的子節點 push 進 op_stack，同時將該節點標示為已訪問

接著利用 repo 中的 test-bench.c 來查看結果，由於修改後的程式沒有實作 function 跟 assignment 所以只拿 test-bench.c 中前半部的 test case 來比較：

修改前：

BENCH                                        5: 5.650043 ns/op (176M op/sec)
BENCH                      5+5+5+5+5+5+5+5+5+5: 39.499998 ns/op (25M op/sec)
BENCH                      5*5*5*5*5*5*5*5*5*5: 38.609982 ns/op (25M op/sec)
BENCH                      5,5,5,5,5,5,5,5,5,5: 42.056084 ns/op (23M op/sec)
BENCH        ((5+5)+(5+5))+((5+5)+(5+5))+(5+5): 34.670115 ns/op (28M op/sec)
BENCH                                      x=5: 3.753185 ns/op (266M op/sec)
BENCH                  x=5,x+x+x+x+x+x+x+x+x+x: 44.126987 ns/op (22M op/sec)
BENCH    x=5,((x+x)+(x+x))+((x+x)+(x+x))+(x+x): 43.509007 ns/op (22M op/sec)
BENCH a=1,b=2,c=3,d=4,e=5,f=6,g=7,h=8,i=9,j=10: 59.928894 ns/op (16M op/sec)
BENCH a=1,a=2,a=3,a=4,a=5,a=6,a=7,a=8,a=9,a=10: 58.668852 ns/op (17M op/sec)

修改後：

BENCH                                        5: 71.628094 ns/op (13M op/sec)
BENCH                      5+5+5+5+5+5+5+5+5+5: 125.663996 ns/op (7M op/sec)
BENCH                      5*5*5*5*5*5*5*5*5*5: 122.828007 ns/op (8M op/sec)
BENCH                      5,5,5,5,5,5,5,5,5,5: 116.539001 ns/op (8M op/sec)
BENCH        ((5+5)+(5+5))+((5+5)+(5+5))+(5+5): 82.113028 ns/op (12M op/sec)
BENCH                                      x=5: 66.423893 ns/op (15M op/sec)
BENCH                  x=5,x+x+x+x+x+x+x+x+x+x: 126.016855 ns/op (7M op/sec)
BENCH    x=5,((x+x)+(x+x))+((x+x)+(x+x))+(x+x): 88.551044 ns/op (11M op/sec)
BENCH a=1,b=2,c=3,d=4,e=5,f=6,g=7,h=8,i=9,j=10: 91.861010 ns/op (10M op/sec)
BENCH a=1,a=2,a=3,a=4,a=5,a=6,a=7,a=8,a=9,a=10: 95.366001 ns/op (10M op/sec)

結果修改後執行速度反而慢了 2 - 10 倍，初步的推測是修改後增加的條件判斷導致 branch miss 變高了，因此用 perf 來查看：

修改前：

Performance counter stats for './build/test-bench' (20 runs):

              3475  cache-misses          #  4.681 % of all cache refs  ( +- 31.43% )
            7,4234  cache-references                                    ( +-  2.14% )
       8,6033,9900  branches                                            ( +-  0.00% )
          301,4676  branch-misses         #  0.35% of all branches      ( +-  0.00% )
      13,0621,3178  cycles                                              ( +-  0.14% )
      28,6157,2152  instructions          #  2.19  insns per cycle      ( +-  0.00% )

       0.367924836 seconds time elapsed                                 ( +-  0.23% )

修改後：

Performance counter stats for './build/test-bench' (20 runs):

              4203  cache-misses          #  4.893 % of all cache refs  ( +- 24.88% )
            8,5896  cache-references                                    ( +-  2.33% )
      10,6343,4755  branches                                            ( +-  0.00% )
          458,6096  branch-misses         #  0.43% of all branches      ( +-  3.83% )
      32,5854,0164  cycles                                              ( +-  2.16% )
      46,6710,6417  instructions          #  1.43  insns per cycle      ( +-  0.00% )

       0.911646339 seconds time elapsed                                 ( +-  2.13% )

用 perf 觀察後發現，branch miss 雖然有比較多，但是佔總體的比例其實並不高，真正的原因在於修改後的 cycle 以及 instruction 數量變多了

嘗試優化 - 使用 inline assembly code

主要修改的 function 為 expr_eval() ，修改的結果為 expr_eval_with_asm()
先是針對 add 的部分進行改寫

    case OP_PLUS:
        a = expr_eval(&e->param.op.args.buf[0]);
        b = expr_eval(&e->param.op.args.buf[1]);
        asm("fld %1;"
            "fld %2;"
            "faddp;"
            "fstp %0"
            : "=m" ( n )
            : "m" ( a ), "m" ( b )
        );
        return n;

修改前：

BENCH                                      5+5: 5.864859 ns/op (170M op/sec) 
BENCH                               5+10+15+20: 13.442039 ns/op (74M op/sec)
BENCH                      5+5+5+5+5+5+5+5+5+5: 35.995960 ns/op (27M op/sec)

修改後：

BENCH                                      5+5: 6.362200 ns/op (157M op/sec)
BENCH                               5+10+15+20: 14.869928 ns/op (67M op/sec)
BENCH                      5+5+5+5+5+5+5+5+5+5: 37.159920 ns/op (26M op/sec)

單純從數字上觀察不出為什麼比較慢，所以使用 perf 進行查看
修改前：

          435      cache-misses          #    2.285 % of all cache refs      ( +- 50.65% )  (79.45%)
       19,032      cache-references                                          ( +-  3.47% )  (84.70%)
  147,572,473      branches                                                  ( +-  0.21% )  (85.68%)
       10,609      branch-misses         #    0.01% of all branches          ( +-  0.86% )  (71.44%)
  583,884,950      instructions          #    2.63  insn per cycle           ( +-  0.04% )  (85.72%)
  222,241,510      cycles                                                    ( +-  0.51% )  (81.45%)

  0.056145017 seconds time elapsed                                      ( +-  0.49% )

修改後：

           265      cache-misses          #    1.284 % of all cache refs      ( +- 31.74% )  (79.73%)
        20,609      cache-references                                          ( +-  2.49% )  (81.43%)
   149,186,562      branches                                                  ( +-  0.36% )  (85.61%)
        10,589      branch-misses         #    0.01% of all branches          ( +-  1.13% )  (72.91%)
   617,528,324      instructions          #    2.63  insn per cycle           ( +-  0.02% )  (86.48%)
   234,633,148      cycles                                                    ( +-  0.32% )  (83.07%)

   0.059308032 seconds time elapsed                                      ( +-  0.33% )

比較發現，cache-misses 雖然有下降，但修改後的 instructions, cycles 比原本的多。推測是因為修改後的程式比原本的多了兩個 assignment，導致需要執行更多的 instruction, cycles。使用 GDB 檢查產生的組語，驗證推測
修改前：

B+>│0x55555555ad50 <expr_eval+880>  mov    0x8(%rbx),%rdi                                                                                                                   │
   │0x55555555ad54 <expr_eval+884>  callq  0x55555555a9e0 <expr_eval>                                                                                                       │
   │0x55555555ad59 <expr_eval+889>  mov    0x8(%rbx),%rdi                                                                                                                   │
   │0x55555555ad5d <expr_eval+893>  movss  %xmm0,0xc(%rsp)                                                                                                                  │
   │0x55555555ad63 <expr_eval+899>  add    $0x28,%rdi                                                                                                                       │
   │0x55555555ad67 <expr_eval+903>  callq  0x55555555a9e0 <expr_eval>                                                                                                       │
   │0x55555555ad6c <expr_eval+908>  addss  0xc(%rsp),%xmm0                                                                                                                  │
   │0x55555555ad72 <expr_eval+914>  jmpq   0x55555555aa30 <expr_eval+80>

修改後

B+>│0x55555555b5d0 <expr_eval_with_asm+944> mov    0x8(%rdi),%rdi                                                                                                           │
   │0x55555555b5d4 <expr_eval_with_asm+948> callq  0x55555555a9e0 <expr_eval>                                                                                               │
   │0x55555555b5d9 <expr_eval_with_asm+953> mov    0x8(%rbx),%rdi                                                                                                           │
   │0x55555555b5dd <expr_eval_with_asm+957> movss  %xmm0,0x20(%rsp)                                                                                                         │
   │0x55555555b5e3 <expr_eval_with_asm+963> add    $0x28,%rdi                                                                                                               │
   │0x55555555b5e7 <expr_eval_with_asm+967> callq  0x55555555a9e0 <expr_eval>                                                                                               │
   │0x55555555b5ec <expr_eval_with_asm+972> movss  %xmm0,0x24(%rsp)                                                                                                         │
   │0x55555555b5f2 <expr_eval_with_asm+978> flds   0x20(%rsp)                                                                                                               │
   │0x55555555b5f6 <expr_eval_with_asm+982> flds   0x24(%rsp)                                                                                                               │
   │0x55555555b5fa <expr_eval_with_asm+986> faddp  %st,%st(1)                                                                                                               │
   │0x55555555b5fc <expr_eval_with_asm+988> fstps  0x1c(%rsp)                                                                                                               │
   │0x55555555b600 <expr_eval_with_asm+992> movss  0x1c(%rsp),%xmm0

修改後的程式確實比原本的程式更麻煩，多了兩個 flds ，原始版本可以直接做相加。除了 add 之外，也對 sub, mul, div 都進行了改寫，不過也都是比較慢的結果

assembly 撰寫的參考資料

MathEX 可以使用在什麼地方

上述優化失敗之後，當然還是要繼續優化，探討為什麼會失敗？不過我們之前並沒有思考過，MathEX可要如何應用在工程計算機以外的領域，或許先往這方面研究可以有一些新的想法。老師在上課有提到，這樣小型的程式，是不是可以應用在 Ethereum 的智能合約上，並提供了一篇相關文章作為範例利用工具加速Dapp建置和測試

2017q3 Team5 (MathEX)

tags: sysprogram

GitHub

題目說明

目標

先前組別進度

MathEX 介紹

簡介

API

應用情境

解析程式碼

嘗試優化 - 移除遞迴呼叫

嘗試優化 - 使用 inline assembly code

MathEX 可以使用在什麼地方

tags: `sysprogram`