2017q3 Homework4 (sandbox)

tags: `sysprogram`

contributed by <st9007a, ian910297>

Github

解析 moxiebox

Registers

首先透過 moxiebox architecture 可以得知 moxiebox 有 16 個 32-bit 的 registers：

$sp (stack pointer)
$fp (frame pointer)
$r0 ~ $r13

以及只能用 gsr 跟 ssr 這兩個指令來存取的 special registers，在 moxie.h 可以得知它們的數量為 256 個：

enum {
    NUM_MOXIE_REGS = 17,   /* Including PC */
    NUM_MOXIE_SREGS = 256, /* The special registers */
    PC_REGNO = 16,
};

除了上述以外，透過 gdb 發現另外兩個 registers：

$pc (program counter)
$cc (condition code register)

Memory

根據 sandbox execution environment 的敘述，虛擬記憶體從低到高分別為：

elf：1 個或以上，存放 moxiebox program 的 machine code
data：0 個或以上，從外部載入的資料
stack：1 個，大小為 64k
memory descriptor：查了一下，memory descriptor 的功用是將 virtual address 映射到 physical address

接著透過 sandbox 執行 tests/basic，可以看到 virtual memory 的分配狀況：

$ src/sandbox -e tests/basic

ep 00001000
ro 00000f8c-0000146c elf0
rw 0000156c-000019d8 elf1
rw 000029d8-000129d8 stack
ro 000139d8-00013a78 mapdesc

最前面的 ro 代表 read-only，rw 代表 read and write，elf1 的權限為 rw，用 moxie-none-moxiebox-objdump 查看了一下，0x156c 對映到 data section：

$  moxie-none-moxiebox-objdump tests/basic -h
Idx Name          Size      VMA       LMA     File off  Algn

...
4 .data        00000430  0000156c  0000156c  000004e0  2**2
...

最後 heap 的資訊可以在 src/sandbox.h 找到：

machine()
{
    startAddr = 0;
    tracing = false;
    profiling = false;
    heapAvail = 0xfffffffU;
}

heapAvail 代表可以使用的 heap 的長度，分配 heap 的程式碼可以在 moxie.cc 的 sim_mmap() 找到

比較值得注意的是，一塊記憶體被劃分後，透過 mapInsert() 這個 function 加入機器中，mapInsert() 內部實作會先空出一個 page size 的空間，再把劃好的記憶體放到這個空間之後：

bool machine::mapInsert(addressRange *rdr)
{
    addressRange *ar = memmap.back();
    rdr->start = ar->end + MACH_PAGE_SIZE;
    rdr->end = rdr->start + rdr->length;
    memmap.push_back(rdr);

    return true;
}

這裡不太懂為什麼要空出一個 page size 的空間
"st9007a"

Library

由於 moxiebox 為隔離環境，能使用的現成的 function 有限，能夠使用的工具在 runtime 裡面找到，可以參照 sandbox execution environment 的 ABI Summary：

Runtime environment:
- moxie_memmap - Global variable, pointer to list of struct moxie_memory_map_ent, which describes the execution environment's input data.
- setreturn(3) - Pointer to environment's output data buffer. This is the data returned from the sandbox to the user.
- stdlib.h: abort(3), exit(3)
- string.h: memchr(3), memcmp(3), memcpy(3), memset(3)
- string.h: strchr(3), strcmp(3), strcpy(3), strlen(3), strncpy(3), strcpy(3), strstr(3)
Runtime environment - crypto:
- sha256: sha256_init(), sha256_update(), sha256_final()
System calls:
- mmap(2) - MAP_PRIVATE|MAP_ANONYMOUS to allocate heap memory.
- _exit(2) - End process

這裡要注意的是 mmap 劃分的空間有條件限制，可以在 moxie.cc 的 sim_mmap() 看到，這個 function 是用來處理 mmap 的 system call：

static void sim_mmap(machine &mach)
{
    ...
    if ((addr != 0) || (length < MACH_PAGE_SIZE) || (length & MACH_PAGE_MASK) ||
        ((uint32_t) length > mach.heapAvail) || (!(prot & MOXIE_PROT_READ)) ||
        (!(prot & MOXIE_PROT_WRITE)) || (!(prot & MOXIE_PROT_EXEC)) ||
        (!(flags & MOXIE_MAP_PRIVATE)) || (!(flags & MOXIE_MAP_ANONYMOUS))) {
        cpu.asregs.regs[2] = -EINVAL;
        return;
    }
    ...
}

mmap 劃分的記憶體大小必須是 page size 的整數倍才行，否則會回傳錯誤碼：EINVAL (Invalid Argument)

修正 sandbox

在研究 sandbox 的期間，在 src/sandbox.cc 的 gdb_main_loop() 中發現了兩個 FIXME：

case 'c':
    wrc = write(newsockfd, "+", 1);
    sim_resume(mach);
    // FIXME.. assuming BREAK for now
    sendGdbReply(newsockfd, "S05");
    mach.cpu.asregs.regs[16] -= 2;
    i += 4;
     
...
    
if (!p) { /*FIXME: implement error handling */
    sprintf(reply, "OK");
    goto cmd_m_out;
}

第 1 個 FIXME 要求是當執行中斷或結束時回傳正確的 signal，原本的程式碼不管執行是中斷或結束都一律回傳 "S05"，對應的 signal 是 SIGTRAP，代表 gdb 遇到了一的 breakpoint

我的作法是在 sim_resume() 結束時將 cpu exception 回傳出來，並且根據回傳的 exception 來回應正確的 signal，程式碼可以參照 github

接著第 2 個 FIXME 要求實作 Error Handling，首先修正前完整程式碼如下：

case 'm': {
    uint32_t addr = readDelimitedHexValue(buffer, &++i);
    uint32_t length = readDelimitedHexValue(buffer, &i);
    char *p = (char *) mach.physaddr(addr, length);
    if (!p) {
        sprintf(reply, "OK");
        goto cmd_m_out;
    }
    reply[0] = 0;
    while (length-- > 0) {
        int c = *p++;
        char buf[3];
        strcat(reply, lowByteToHex(buf, c));
    }
cmd_m_out:
    sendGdbReply(newsockfd, reply);
    i += 2;
} break;

根據 The GDB remote serial protocol - Communication protocol 知道 m 用於 read memory，如果執行成功會回應 "OK"，失敗會回應 "ENN"，其中 NN 代表錯誤碼

再來根據 Howto: GDB Remote Serial Protocol 的 4.7.7 提到 read memory error 會回應 "E01"，所以將原本的 "OK" 改成 "E01"，相關程式碼可以參照 github

Fibonacci 實作

TODO：補上 Q-Matrix 實作

這裡用 C 語言實作出 iterative, recursive, tail recursion, fast doubling 版本的 Fibonacci

由於 sandbox 只能透過 -d 的 flag 將外部資料載入記憶體，所以在實作 Fibonacci 之前先寫一支輔助程式用來產生二進位資料的檔案：

$ tool/write -h
Description:

  Generate the input data for sandbox
  It write 4 byte to output one times

Usage:

-i [input integer]: input data
-o [output file]: output file
-a : append input data to output file

$ tool/write -i 40 -o tmp/fib.in
$ hexdump -C tmp/fib.in
00000000  28 00 00 00                                       |(...|
00000004

相關程式碼放在 github

接著實作 Fibonacci，這裡統一將所有版本的方法寫在同一支程式裡，在編譯時利用 -D 帶入不同 macro 產生對應版本的 Fibonacci 實作

由於程式碼放在共筆上相對稍長，所以不附上程式碼，相關程式碼可以參照 github

嘗試使用 gprof

在分析之前，先嘗試操作 gprof，首先使用 sandbox 的 -p option 來產生 profile 檔案

$ src/sandbox -e tests/fib -d tmp/fib.in -o tmp/fib.out -p tmp/fib.prof

這裡用 iterative 的 Fibonacci 來測試，輸入的資料為 40

接著用 moxie-none-moxiebox-gprof 來查看 profile 檔

$ moxie-none-moxiebox-gprof tests/fib tmp/fib.prof -l

<unknown>:0: (register_tm_clones:0x10a6) 1 executions
<unknown>:0: (frame_dummy:0x114c) 1 executions
~/project/sysprog/build-toolchain/.build/moxie-none-moxiebox/src/newlib/newlib/libc/stdlib/__atexit.c:121: (__register_exitproc:0x1230) 1 executions
~/project/sysprog/build-toolchain/.build/moxie-none-moxiebox/src/newlib/newlib/libc/stdlib/__atexit.c:164: (__register_exitproc:0x1270) 1 executions
~/project/sysprog/moxiebox/runtime/memset.c:92: (memset:0x14b6) 40 executions
~/project/sysprog/moxiebox/runtime/strstr.c:53: (strstr:0x14fe) 2 executions
~/project/sysprog/moxiebox/runtime/strstr.c:57: (strstr:0x14ee) 3 executions
~/project/sysprog/moxiebox/runtime/strstr.c:62: (strstr:0x14da) 6 executions
~/project/sysprog/moxiebox/runtime/strstr.c:113: (strstr:0x14d2) 1 executions
~/project/sysprog/moxiebox/tests/fib.c:15: (find_input:0x141c) 3 executions
~/project/sysprog/moxiebox/tests/fib.c:39: (fib_iterative:0x1498) 39 executions
~/project/sysprog/moxiebox/tests/fib.c:107: (main:0x1436) 1 executions

這裡顯示了 execution 的數量，查了一下，是指 basic block 的 execution 數量，拿倒數第二行的 fib.c:39 (fib_iterative:0x1498) 39 executions 來看，第 39 行附近程式碼如下：

38     for (int i = 1; i < n; i++) {
39         sum = f0 + f1;
40         f0 = f1;
41         f1 = sum;
42     }

所以就是這個 for loop 被編譯成一個 basic block，然後在這次的執行中被執行了 39 次，跟迴圈停止條件吻合 i < 40

試了其他幾個 gprof 的 options 後，都沒辦法讓 profile 檔案 dump 出其他的資訊，不確定 moxie-none-moxiebox-gprof 是不是還能顯示 execution count 以外的資訊
"st9007a"

分析各版本 Fibonacci

Iterative

透過上一節我們可以知道 iterative 版本的程式碼中有一個 basic block，在輸入數值為 40 的狀況下會被執行 39 次

接著利用 perf 來重複執行 100 次查看 instructions, cycles, 以及執行時間

$ perf stat --repeat 100 -e instructions,cycles src/sandbox -e tests/fib -o tmp/fib.out -d tmp/fib.in

Performance counter stats for 'src/sandbox -e tests/fib -o tmp/fib.out -d tmp/fib.in' (100 runs):

          330,2641  instructions          #  1.30  insns per cycle  ( +-  0.07% )
          253,3844  cycles                                          ( +-  0.11% )

       0.064130528  seconds time elapsed                            ( +-  2.00% )

Recursive

一樣輸入數值 40 來查看 basic block 的 execution count

<unknown>:0: (register_tm_clones:0x10a6) 1 executions
<unknown>:0: (frame_dummy:0x114c) 1 executions
~/project/sysprog/build-toolchain/.build/moxie-none-moxiebox/src/newlib/newlib/libc/stdlib/__atexit.c:121: (__register_exitproc:0x1230) 1 executions
~/project/sysprog/build-toolchain/.build/moxie-none-moxiebox/src/newlib/newlib/libc/stdlib/__atexit.c:164: (__register_exitproc:0x1270) 1 executions
~/project/sysprog/moxiebox/runtime/memset.c:92: (memset:0x14e0) 40 executions
~/project/sysprog/moxiebox/runtime/strstr.c:53: (strstr:0x1528) 3 executions
~/project/sysprog/moxiebox/runtime/strstr.c:57: (strstr:0x1518) 4 executions
~/project/sysprog/moxiebox/runtime/strstr.c:62: (strstr:0x1504) 6 executions
~/project/sysprog/moxiebox/runtime/strstr.c:113: (strstr:0x14fc) 1 executions
~/project/sysprog/moxiebox/tests/fib.c:15: (find_input:0x1466) 4 executions
~/project/sysprog/moxiebox/tests/fib.c:56: (fib_recursive:0x1416) 165580140 executions
~/project/sysprog/moxiebox/tests/fib.c:107: (main:0x1480) 1 executions

可以看到 basic block 數量超過

10^{8}

，接著用 perf 來看看

$ perf stat --repeat 5 -e instructions,cycles src/sandbox -e tests/fib -o tmp/fib.out -d tmp/fib.in

Performance counter stats for 'src/sandbox -e tests/fib -o tmp/fib.out -d tmp/fib.in' (5 runs):

    6800,4980,1761  instructions          #  2.35  insns per cycle  ( +-  0.00% )
    2887,8397,5877  cycles                                          ( +-  0.17% )

      80.571199833  seconds time elapsed                            ( +-  0.17% )

在輸入為 40 的狀況下，執行時間為 80 秒，是原本 iterative 版本的 1000 倍以上

Tail Recursion

profile 的結果如下：

<unknown>:0: (register_tm_clones:0x10a6) 1 executions
<unknown>:0: (frame_dummy:0x114c) 1 executions
~/project/sysprog/build-toolchain/.build/moxie-none-moxiebox/src/newlib/newlib/libc/stdlib/__atexit.c:121: (__register_exitproc:0x1230) 1 executions
~/project/sysprog/build-toolchain/.build/moxie-none-moxiebox/src/newlib/newlib/libc/stdlib/__atexit.c:164: (__register_exitproc:0x1270) 1 executions
~/project/sysprog/moxiebox/runtime/memset.c:92: (memset:0x14aa) 40 executions
~/project/sysprog/moxiebox/runtime/strstr.c:53: (strstr:0x14f2) 2 executions
~/project/sysprog/moxiebox/runtime/strstr.c:57: (strstr:0x14e2) 3 executions
~/project/sysprog/moxiebox/runtime/strstr.c:62: (strstr:0x14ce) 6 executions
~/project/sysprog/moxiebox/runtime/strstr.c:113: (strstr:0x14c6) 1 executions
~/project/sysprog/moxiebox/tests/fib.c:15: (find_input:0x141c) 3 executions
~/project/sysprog/moxiebox/tests/fib.c:69: (fib_tail_recursion:0x148c) 40 executions
~/project/sysprog/moxiebox/tests/fib.c:107: (main:0x1436) 1 executions

在 fib_tail_recursion() 裡，basic block 執行了 40 次，查看程式碼 69 行附近，正好是 tail call

69     fib_tail_recursion(n - 1, result, b, a + b);

原本想透過 gdb 查看 stack 狀況，不過編譯時，fib_tail_recrsion() 似乎沒有被編譯成 subrountine，利用 gdb 的 info functions 也沒有看到：

File fib.c:
int main();

File memset.c:
void *memset(void *, int, size_t);

File strstr.c:
char *strstr(const char *, const char *);

接著使用 perf 來查看

$ perf stat --repeat 100 -e instructions,cycles src/sandbox -e tests/fib -d tmp/fib.in -o tmp/fib.out

Performance counter stats for 'src/sandbox -e tests/fib -d tmp/fib.in -o tmp/fib.out' (100 runs):

          330,0576  instructions          #  1.30  insns per cycle  ( +-  0.06% )
          253,3873  cycles                                          ( +-  0.11% )

       0.062247430  seconds time elapsed                            ( +-  1.84% )

其執行速度幾乎與 iterative 版本差不多

Fast Doubling

profile 結果如下：

<unknown>:0: (register_tm_clones:0x10a6) 1 executions
<unknown>:0: (frame_dummy:0x114c) 1 executions
~/project/sysprog/build-toolchain/.build/moxie-none-moxiebox/src/gcc/libgcc/libgcc2.c:709: (__clzsi2:0x11a2) 1 executions
~/project/sysprog/build-toolchain/.build/moxie-none-moxiebox/src/newlib/newlib/libc/stdlib/__atexit.c:121: (__register_exitpr$c:0x128e) 1 executions
~/project/sysprog/build-toolchain/.build/moxie-none-moxiebox/src/newlib/newlib/libc/stdlib/__atexit.c:164: (__register_exitpr$c:0x12ce) 1 executions
~/project/sysprog/moxiebox/runtime/memset.c:92: (memset:0x154c) 40 executions
~/project/sysprog/moxiebox/runtime/strstr.c:53: (strstr:0x1594) 3 executions
~/project/sysprog/moxiebox/runtime/strstr.c:57: (strstr:0x1584) 4 executions
~/project/sysprog/moxiebox/runtime/strstr.c:62: (strstr:0x1570) 6 executions
~/project/sysprog/moxiebox/runtime/strstr.c:113: (strstr:0x1568) 1 executions
~/project/sysprog/moxiebox/tests/fib.c:15: (find_input:0x147a) 4 executions
~/project/sysprog/moxiebox/tests/fib.c:81: (fib_fast_doubling:0x14ea) 1 executions
~/project/sysprog/moxiebox/tests/fib.c:87: (fib_fast_doubling:0x1508) 4 executions
~/project/sysprog/moxiebox/tests/fib.c:90: (fib_fast_doubling:0x1536) 4 executions
~/project/sysprog/moxiebox/tests/fib.c:107: (main:0x1494) 1 executions

跟前面比較不一樣的是 fib_fast_doubling() 有三個 basis block 分別被執行了 1, 4, 4 次，查看其對應的程式碼：

...
81    int cbs = 30 - __builtin_clz(n);

...

86    while (cbs >= 0) {
87         sum = (f2 * 2 - f1) * f1;
88         sum_1 = f1 * f1 + f2 * f2;
89         f2 = sum_1;
90         f1 = sum;
91
92         if ((n >> cbs) & 1) {
93             f1 = f2;
94             f2 = sum_1 + sum;
95         }
96         cbs--;
97     }

第 1 個 basic block 呼叫 __builtin_clz()，第 2 個是迴圈，第 3 個將 90 行與下面的 if 條件判斷視為 1 個 basic block，查看它的組合語言如下：

0x1536 <main+248>:   mov     $r0, $r3
0x1538 <main+250>:   jmpa    0x1528 <main+234>
0x153e <memset>:     mov     $r3, $r0
0x1540 <memset+2>:   mov     $r4, $r0
0x1542 <memset+4>:   add     $r4, $r2
0x1544 <memset+6>:   mov     $r2, $r4
0x1546 <memset+8>:   cmp     $r3, $r2
0x1548 <memset+10>:  bne     0x154c <memset+14>
0x154a <memset+12>:  ret

可以看到這裡執行了兩道指令後就進入了 memset() 中，所以第 3 個 basic block 只有兩個指令

接著使用 perf：

$ perf stat --repeat 100 -e instructions,cycles src/sandbox -e tests/fib -d tmp/fib.in -o tmp/fib.out

Performance counter stats for 'src/sandbox -e tests/fib -d tmp/fib.in -o tmp/fib.out' (100 runs):

          331,1244  instructions          #  1.30  insns per cycle  ( +-  0.07% )
          254,2927  cycles                                          ( +-  0.12% )

       0.058379399  seconds time elapsed                            ( +-  1.76% )

執行時間比 iterative, tail recursion 版本快了一點點，但是差距不明顯，其原因在於 fast doubling 在第 2 個 basic block 上執行指令較多
分別查看 iterative, tail recursion, fast doubling 的 basic block 如下，這裡 basic block 的判斷方式為從 gprof 給的虛擬位址到遇到一個 j-type 指令為止將其視為 1 個 basic block：

iterative

0x1498 <main+184>:   mov     $r1, $r5
0x149a <main+186>:   add     $r1, $r3
0x149c <main+188>:   inc     $r2, 0x1
0x149e <main+190>:   mov     $r5, $r3
0x14a0 <main+192>:   mov     $r3, $r1
0x14a2 <main+194>:   jmpa    0x147e <main+158>

tail recursion

0x148c <main+172>:   dec     $r6, 0x1
0x148e <main+174>:   mov     $r3, $r1
0x1490 <main+176>:   add     $r3, $r2
0x1492 <main+178>:   mov     $r1, $r2
0x1494 <main+180>:   mov     $r2, $r3
0x1496 <main+182>:   jmpa    0x1472 <main+146>

fast doubling

0x1508 <main+202>:   mov     $r3, $r1
0x150a <main+204>:   add     $r3, $r1
0x150c <main+206>:   sub     $r3, $r0
0x150e <main+208>:   mul     $r3, $r0
0x1510 <main+210>:   mul     $r0, $r0
0x1512 <main+212>:   mul     $r1, $r1
0x1514 <main+214>:   add     $r0, $r1
0x1516 <main+216>:   mov     $r1, $r0
0x1518 <main+218>:   mov     $r0, $r8
0x151a <main+220>:   ashr    $r0, $r2
0x151c <main+222>:   and     $r0, $r5
0x151e <main+224>:   cmp     $r0, $r6
0x1520 <main+226>:   beq     0x1536 <main+248>
0x1522 <main+228>:   mov     $r0, $r1
0x1524 <main+230>:   add     $r3, $r1
0x1526 <main+232>:   mov     $r1, $r3
0x1528 <main+234>:   dec     $r2, 0x1
0x152a <main+236>:   cmp     $r2, $r4
0x152c <main+238>:   bne     0x1508 <main+202>
0x152e <main+240>:   st.l    ($r7), $r0
0x1530 <main+242>:   jmpa    0x14d4 <main+150>

可以注意到 iterative 跟 tail recursion 的 basic block 只用了 6 道指令，而 fast doubling 有 21 道指令，中間還有 beq, bnq 的 branch 相關指令，其行為較 iterative, tail recursion 複雜

再來還需要考慮到執行時間也包含了啟動 sandbox 的時間，所以在輸入數值為 40 的狀況下差距較小

移植演算法並驗證

constant time memcmp() function

原理

參考自 github - cst_time_memcmp

原本的非常數時間 memcmp function 如下：

int memcmp(const unsigned char *m1, const unsigned char *m1, size_t n) {
    size_t i;
    for (i = 0; i < n; ++i ) {
        int diff = m1[i] - m2[i];
        if (diff != 0)
            return (diff < 0) ? -1 : +1;
    }
    return 0;
}

原本的演算法用一個迴圈將兩個記憶體位址逐個 byte 比較內容，如果內容不相同就回傳 1 或 -1，整個迴圈跑完後表示這兩段記憶體的內容相同，回傳 0

常數時間的 memcmp function 如下：

int cst_time_memcmp(const unsigned char *m1, const unsigned char *m1, size_t n) {
    int res = 0, diff;
    if (n > 0) {
        do {
            --n;
            diff = m1[n] - m2[n];
            if (diff != 0)
                res = diff;
        } while (n != 0);
    }
    return (res > 0) - (res < 0);
}

概念上就是原本的演算法是從頭比較每個 byte，常數時間的演算法則是從記憶體的最後一個 byte 往前比較，直到比完最前面的 byte 為止，即使中間就發現 byte 的內容不一樣也不會馬上回傳，整個迴圈都跑完才回傳

接著必須避免 conditional branch，所以將 if 替換成 bit-wise operator

if (diff != 0)
    res = diff;

改成

res = (res & -!diff) | diff;

當 diff = 0 時，-!diff = 0xffffffff，所以程式碼就會變成 res = (res & 0xffffffff) | 0x0，res 的值會保持，若 diff 不為 0，res 的值就會改變

接著在某些硬體架構下，編譯　!diff 時可能會產生 branching instruction，所以可以再修改：

res = (res & -!diff) | diff;

改成

res = (res & (((diff - 1) & ~diff) >> 8)) | diff;

修改後的程式碼在執行上會比原本慢，但是可以保證編譯後不會產生 branching instruction

另一個解法是利用一個 static table 記下所有取代 -!diff：

static signed char tbl[256] = {-1, 0, ... 0 };
res = (res & tbl[(unsigned char)diff]) | diff;

但是這個做法的執行速度會被 table 的 cache 影響，導致間接透漏記憶體比較的狀況，因此不推薦

最後程式碼最後的回傳 return (res > 0) - (res < 0); 也可能編譯出 branching instruction，所以可以改成如下：

return ((res - 1) >> 8) + (res >> 8) + 1;

所以完整的演算法如下：

int cst_time_memcmp_safest1(const void *m1, const void *m2, size_t n) 
{
    const unsigned char *pm1 = (unsigned char)m1; 
    const unsigned char *pm2 = (unsigned char)m2; 
    int res = 0, diff;
    if (n > 0) {
        do {
            --n;
            diff = pm1[n] - pm2[n];
            res = (res & (((diff - 1) & ~diff) >> 8)) | diff;
        } while (n != 0);
    }
    return ((res - 1) >> 8) + (res >> 8) + 1;
}

移植

在 runtime 資料夾中創立一個 cst_memcmp.c，寫下程式碼：

#include <stddef.h>

int cst_memcmp(const void *m1, const void *m2, size_t n)
{
    const unsigned char *pm1 = (const unsigned char *) m1;
    const unsigned char *pm2 = (const unsigned char *) m2;

    int diff, res = 0;

    if (n > 0) {
        do {
            --n;
            diff = pm1[n] - pm2[n];
            res = (res & (((diff - 1) & ~diff) >> 8)) | diff;
        } while (n != 0);
    }
    return ((res - 1) >> 8) + (res >> 8) + 1;
}

接著在 sandboxrt.h 加入：

extern int cst_memcmp(const void *m1, const void *m2, size_t n);

Makefile 加入對應的 obj 檔名，就可以將 cst_memcmp 加入 runtime library 了

這邊要 commit 的時候發現 cppcheck 會跳錯誤 Shifting a negative value is unbefined behavior，所以將程式最後一行改成：

return (((res - 1) & 0xff000000) | ((unsigned int)(res - 1)) >> 8) +
        ((res & 0xff000000) | (unsigned int)(res >> 8)) + 1;

驗證

首先要確定沒有編譯出 branching instruction，所以用 gdb 來查看：

   ...
   0x00001574 <+10>:    beq     0x15d4 <cst_memcmp+106>
   
   ...
   0x000015a0 <+54>:    bne     0x157e <cst_memcmp+20>
   
   ...
   0x000015d6 <+108>:   jmpa    0x15a2 <cst_memcmp+56>

被編譯出來的組合語言有三個 branching instruction，其中 beq 跟 jmpa 是 if 產生出來的，bne 則是迴圈，其餘部分沒有任何 branch 出現

接著驗證結果正確性，這邊用 3 個 test case 來測試 cst_memcmp() 的回傳結果：

#define TEST_START int pass = 0;

#define GET_TEST_RESULT(total, success) \
    total = 3;                          \
    success = pass;

#define TEST_RULE(m1, m2, expected)                        \
    do {                                                   \
        pass += cst_memcmp(m1, m2, MAP_SIZE); == expected; \
    } while (0);

#define TEST_CASE_1(m1, m2)      \
    do {                         \
        memset(m1, 0, MAP_SIZE); \
        memset(m2, 0, MAP_SIZE); \
        TEST_RULE(m1, m2, 0);    \
    } while (0);

#define TEST_CASE_2(m1, m2)           \
    do {                              \
        memset(m1, 1, MAP_SIZE >> 1); \
        memset(m2, 0, MAP_SIZE);      \
        TEST_RULE(m1, m2, 1);         \
    } while (0);

#define TEST_CASE_3(m1, m2)           \
    do {                              \
        memset(m1, 0, MAP_SIZE);      \
        memset(m2, 1, MAP_SIZE >> 1); \
        TEST_RULE(m1, m2, -1);        \
    } while (0);

int main ()
{
    ...
    
    TEST_START;
    TEST_CASE_1(a, b);
    TEST_CASE_2(a, b);
    TEST_CASE_3(a, b);
    GET_TEST_RESULT(p[0], p[1]);

    setreturn(res, sizeof(int) * 2);

    _exit(0);

    return 0;
}

用 hexdump 查看結果：

00000000  03 00 00 00 03 00 00 00                           |........|
00000008

前 4 個 bytes 代表測試的 case 數量，後 4 bytes 代表通過測試的 case 數量，可以看到回傳結果都正確，程式實作上沒有問題

再來要驗證是否為常數時間：

由於sandbox 中無法取得系統時間，所以使用 perf 對整個 sandbox 執行 100 次，觀察其 cycle 數量以及執行的秒數
準備三組測試資料：
- 測試資料 1：1Mb，每個 bit 皆為 0
- 測試資料 2：1Mb，前 512kb 每個 bit 為 1，後 512kb 每個 bit 為 0
- 測試資料 3：1Mb，每個 bit 皆為 1
每組測試資料皆用於與 1Mb 每個 bit 皆為 0 的靜態資料比較
確保準備測試資料的時間為常數時間，所以測試資料預先準備好，於 sandbox 執行時載入

完整程式碼如下：

#include <stddef.h>
#include "sandboxrt.h"

#define PAGE_SIZE 0x1000
#define MAP_SIZE 0x100000
#define MMAP_FAILED ((void *) -22)

static int prot = MOXIE_PROT_READ | MOXIE_PROT_WRITE | MOXIE_PROT_EXEC;
static int flags = MOXIE_MAP_PRIVATE | MOXIE_MAP_ANONYMOUS;

static struct moxie_memory_map_ent *data;

static void find_input(void)
{
    data = moxie_memmap;
    while (data->addr) {
        if (strstr(data->tags, "data0,"))
            return;

        data++;
    }

    _exit(1);
}

int main()
{
    find_input();

    void *mem = mmap(NULL, MAP_SIZE, prot, flags, 0, 0);

    if (mem == MMAP_FAILED) {
        _exit(1);
    }

    memset(mem, 0, MAP_SIZE);
    cst_memcmp(mem, data->addr, MAP_SIZE);
     _exit(0);

    return 0;
}

在這個測試程式中除了 cst_memcmp() 待驗證以外，排除所有非常數時間的程式碼

perf 執行結果：

測試資料 1

Performance counter stats for 'src/sandbox -e tests/cst_memcmp_time_test -d tmp/0k.in' (100 runs):

       8,0428,2638  cycles                ( +-  0.01% )

       0.225285735  seconds time elapsed  ( +-  0.04% )

測試資料 2

Performance counter stats for 'src/sandbox -e tests/cst_memcmp_time_test -d tmp/512k.in' (100 runs):

       8,0465,0895  cycles                ( +-  0.01% )

       0.225444945  seconds time elapsed  ( +-  0.05% )

測試資料 3

Performance counter stats for 'src/sandbox -e tests/cst_memcmp_time_test -d tmp/1M.in' (100 runs):

       8,0434,6874  cycles                ( +-  0.01% )

       0.225341900  seconds time elapsed  ( +-  0.04% )

可以看到 3 組測試資料的執行時間非常接近，其彼此的差距約略在

0.02 % \sim 0.03 %

，自己執行 100 次的差距也在

10^{- 4}

的層級，故可以確認 cst_memcmp() 執行時間為常數時間

應用場合

constant time memcmp 可以應用於機密資料的比較，比如雜湊或加密資料比較，在 linux man pages - memcmp(3) 中提到 Do not use memcmp() to compare security critical data，原因在於 memcmp 從頭逐個 byte 比較，一找到不一樣就會馬上回傳

Timing Attack 就是利用這個概念根據 memcmp 的執行時間把被保護的資料猜出來，比如一個系統的資料庫存著 hash 過的密碼，每次都使用 memcmp 來比較送過來的 hash 以及密碼的 hash，攻擊者就能根據執行的時間猜出密碼的 hash

原本參考的 github repo 中有提供 NetBSD 所使用的 constant time memcmp

cn_string

原理

cn_string 的主要目的是想要在 c 語言裡模擬 c++ 的 string 物件，使用 typedef 定義一個新的資料結構 CN_STRING ，並實作 string 物件的功能

typedef unsigned int cn_uint;
typedef unsigned char cn_byte;

typedef struct cn_string {
    cn_byte *data;
    cn_uint len;
} * CN_STRING;

移植

移植最關鍵的問題在於 moxiebox 並沒有實作 malloc, calloc 這些 function ，必須重新實作控制記憶體的部分

以 memory pool 的概念實作，先使用 moxiebox 提供的 mmap 取得一段可以使用的記憶體， malloc, calloc 的記憶體都將從這段記憶體裡取出

typedef struct {
    void *ptr;
    size_t used;
    size_t size;
} MEMPool;
MEMPool *pool;

int main() {
    ...
    
    // initial memory pool
    pool = mmap(NULL, PAGE_SIZE, prot, flags, 0, 0);
    pool->ptr = mmap(NULL, MAP_SIZE, prot, flags, 0, 0);
    pool->used = 0;
    pool->size = MAP_SIZE;
    
    ...
}

void *my_malloc(size_t size) {
    if (pool->used + size <= pool->size) {
        void *p = pool->ptr + pool->used;
        pool->used += size;
        return p;
    }

    return NULL;
}

void *my_calloc(size_t num, size_t size) {
    if (pool->used + (num * size) <= pool->size) {
        void *p = pool->ptr + pool->used;
        memset(p, 0, num * size);
        pool->used += num * size;
        return p;
    }

    return NULL;
}

驗證

將 CN_STRING 裡的資料使用 setreturn 匯出，並用 hexdump 驗證結果是否正確

int main() {
    void *res = mmap(NULL, PAGE_SIZE, prot, flags, 0, 0);
    
    ...
    
    // Starting with initial string
    CN_STRING str_i = cn_string_from_cstr("This is also a test.");

    // Output to check result
    int i;
    char *p = res;
    for (i = 0; i < str_i->len; i++) {
        p[i] = str_i->data[i];
    }
    setreturn(res, sizeof(char) * str_i->len);
    
    ...
}

$ ../src/sandbox -e cn_string -o ../tmp/cn_string.out
$ hexdump -C ../tmp/cn_string.out
00000000  54 68 69 73 20 69 73 20  61 6c 73 6f 20 61 20 74  |This is also a t|
00000010  65 73 74 2e                                       |est.|
00000014

應用場合

若須撰寫大量使用到字串處理的程式，有以下優點

查詢字串長度時間恆定，只需呼叫 str->len 即可取得字串長度，是一種使用空間換取時間的設計理念
較安全的指標使用，如果遇到 NULL 資料，會回傳 1 個 byte 的空字串，並將字串長度設為 0 ，不會出現 NULL pointer

參考資料

moxiebox architecture
sandbox execution environment
Understanding the Stack
Microsoft - using MDLs
The GDB remote serial protocol - Communication protocol
Howto: GDB Remote Serial Protocol
Program Error Signals
Interpreting gprof's Output
GCC - Optimize Option
stack exchange - Why should memcmp not be used to compare security critial data ?
linux man pages - memcmp(3)

2017q3 Homework4 (sandbox)

tags: sysprogram

Github

解析 moxiebox

Registers

Memory

Library

修正 sandbox

Fibonacci 實作

嘗試使用 gprof

分析各版本 Fibonacci

Iterative

Recursive

Tail Recursion

Fast Doubling

移植演算法並驗證

constant time memcmp() function

原理

移植

驗證

應用場合

cn_string

原理

移植

驗證

應用場合

參考資料

tags: `sysprogram`