虛擬機器設計與實作

本講座專注在 process virtual machine，像是 JVM 和 Ethereum VM，而非 system virtul machine，後者的案例是 QEMU 和 VirtualBox。

背景知識

投影片

Brainfuck 程式語言

Brainfuck 程式語言由 Urban Müller 在1993年建立，比C語言大約問世於1972年還來晚的多，此語言中雖然簡易，只有八個運算字元，但不論基本數學運算，或是迴圈等等他都能勝任，是個麻雀雖小五臟俱全的語言，以下是此語言的運算符號及意義。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

接著我們就來探討要如何改善此Complier，在此語言上因為極其簡易，因此許多的運算是累贅的，例如乘法、初始值等，又或是迴圈的使用，因為 Brainfuck 的迴圈終止都是仰賴 Array[0] 來判斷，在 Index 的運算上其實會有很多的不必要。在 brainfuck optimization strategies 提到以下最佳化策略:

Contraction
Clear loops
Copy loops
Multiplication loops
Operation offsets
Scan loops

用語

Just In Time (及時)：存在延遲，是可接受的延遲 (在執行時才會做)。
- 如「一場及時雨」，是在人或農作物尚可存活時，所接觸到的適時幫助
Real Time (即時)：超過時限會有 penalty，會造成不可挽回的錯誤。

Optimization

在開始進行最佳化前，先記錄未最佳化之數據。

progs/awib.b GOOD 70.1ms
progs/mandelbrot.b GOOD 2924.2ms
progs/hanoi.b GOOD 7308.1ms

Contraction

此方法是將多餘運算整併，例如 +++++ ，其實就是+=5，此改善方法雖然直覺簡單，但實質效益也不差，而實作方法可在遇到+ - > < 時即檢查是否為連續出現。而在此我們保留 inc 在只有出現一次的時候，可使效能最佳化。

int continuous_count(char *p)
{
    char *ptr = p;
    int count = 0;
    while (*ptr == *p) {
        count++;
        ptr++;
    }
    return count;
}

執行結果

progs/awib.b GOOD 41.5ms
progs/mandelbrot.b GOOD 1018.0ms
progs/hanoi.b GOOD 3799.4ms

Clear loops & Copy loops & Multiplication loops

此部分是將值 Copy 到其他位置上，而 Multiplication loops 是多了一個乘數，Copy loops 則是直接複製，我們現在這段function可涵蓋眾多pattern，只要開頭是 [- 然後是單一迴圈無巢狀，迴圈內也無其他的減法，就能被涵蓋在內，因此標題上的 **Clear loops & Copy loops & Multiplication loops **皆可被涵蓋其中。

int check_loops(char *p,int *index,int *mult)
{
    int res,offset = 0,_index = 0;
    if (*(p+1) != '-') return -1;
    p += 2;
    while (*p != ']') {
        if (*p == '[' || *p == '-' ||
            *p == '.' || *p == ',')
            return -1;
        res = continuous_count(p);
        if (*p == '>') offset += res;
        else if (*p == '<') offset -= res;
        else if (*p == '+') {
            index[_index] = offset;
            mult[_index] = res;
            _index++;
        }
        p += res;
   }
   if (offset != 0) return -1;
   return _index;
}

執行結果

progs/awib.b GOOD 37.0ms
progs/mandelbrot.b GOOD 2526.5ms
progs/hanoi.b GOOD 2929.1ms

~~Operation offsets~~

Scan loops

此部分在 Hint 中視提到用 memchr() 來最佳化，可 parallel 比對8 bits 之字串，而處理的 pattern 如 [<<<] 等。在下圖中，我們可看到 awib 以及 mandelbrot 有著較多的 Scan loops ，但進一步分析，可發現 awib 的 Scans loops 都以 [<], [>] 居多，而 mandelbrot 則以較多的offset 如 [<<<<<<<<<] ，在此部分上的差距可能對效能上也會有所影響。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

移除換行符號

為了能更有效率的判斷，我們加入移除換行符號的 function，如此我們可以降低換行所帶來的影響，也方便之後的演算法方便寫作，不須多考慮換行的差異，如原本是連續的+，有可能會因為換行而截斷，變成兩次add。在這裡補上測試數據，若以完成上述最佳化方法的情況下來測試。

原本

progs/awib.b GOOD 18.3ms
progs/mandelbrot.b GOOD 721.9ms
progs/hanoi.b GOOD 61.4ms

調整後

progs/awib.b GOOD 15.3ms
progs/mandelbrot.b GOOD 684.1ms
progs/hanoi.b GOOD 45.3ms

可發現在時間越大的 test bench 上其效果會更加顯著。

綜合分析

在針對 Brainfuck Compiler 的效能改善上，其實可分為兩個方向來作增進

Contraction
Reduce Loops

Contraction

在此我們可分析 **Contraction **所改善之數量，意即透過 **Contraction **減少多少指令，此採計對象是以未被其他優化方式優化之部分，即是真實有使用到 **Contraction **部分之程式碼，其壓縮比例都相當不錯，尤其在 hanoi 上有極明顯之效率，此數據與執行時˙間也有對應之，但若要能更精確的來探討執行時˙間，可能要多加入上述兩方向的實際所占執行時間比例，能更清楚分析其所占比例，因迴圈必須在實際執行上來估計，若透過此數據能完整呈現此作業的面貌，也能更加清楚可改進之部分。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Reduce Loops

以下我們就以 Loops 來作分析，我將 test bench 個別作分析，可看到我們所作的改善其實包含範圍只有50%左右，其他可能是巢狀迴圈或試其他特殊迴圈，以致於不被包含在我們所訂定的制式迴圈中，而針對上面在對我們所定義的迴圈格式改善中，可看到執行時間上在 mandelbrot 並不太顯著，此問題在這裡可被凸顯出來，在 Scan loops 他所佔據之比例大約有五分之一，而 Scan loops 又是極耗費時間之動作，因此改善效果就沒其他兩者 bench 來的好。而 hanoi 的改善效能極好，因其 Loops 被優化範圍高達 80%，而 awib 的改善範圍則只有 45%，但在 other loops 中其運算並不算太複雜，且 multiplication loops 比例也快有 20%，所以改善效能狀況也還不太差。

所以若我們能改善 Scan loops 那其實對 mandelbrot 應該能有顯著之改善，而若能針對巢狀迴圈或其他迴圈上改進，那對於 awib 應該也能有明顯的加速。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

問題與討論

在這裡我們可發現 hanoi.b 的改善效能最好，有155倍的差異，而其他兩者則改善4倍，mandelbrot.b 有跑出漂亮的684ms，我們可探討一下為何 hanoi 的改善效果能如此漂亮。

其實打開檔案來觀察，不難發現hanoi 的> <極多，看起來非常整齊，這些都是可以被降低的，再者也有許多的Loops可以被簡略，這些都是我們改善的重點，也因此原本長約的700行test bench，168134行的組合語言，改善到42742行；在 mandelbrot.b 中則有許多的 [<<<<<<<]，且也無 hanoi 來的多累贅運算。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

上圖是對於改善的程式碼長度，下面則放上執行時間的統計圖，方便大家做一比較。可發現mandelbrot 對於 **Contraction **較有反應，此部分感覺是因為一些特殊pattern目前無法用Remove Loops 去除，因此產生此對比性。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

除錯提示

Brainfuck 以及指令集都不是我們所熟悉的，而且不容易 print 值debug，在這裡我是用在 Makefile 內的 run-jit-x64 ，此動作可 dump assembly code，對作業的debug上有相當之益處，可發現問題的癥結點；另外我也有印出在 file 中也會方便使用者來檢查，若直接 print 出有時會影響到結果值，或是可能直接不 print 出 ( 在有bug的狀況下 )。

DynASM

根據教學可以輸出碎形，但是速度非常慢

$ git clone https://github.com/corsix/dynasm-doc.git
$ cd dynasm-doc
$ git submodule update --init
$ cp bf_c.c tutorial.c
$ gcc -o tutorial tutorial.c
$ ./tutorial mandelbrot.bf

現在要用 JIT 版本加速

首先要先在 header 中加上

#include "luajit-2.0/dynasm/dasm_proto.h"
#include "luajit-2.0/dynasm/dasm_x86.h"

接著將 interpreter 改為 compiler

- static void bf_interpret(const char* program, bf_state_t* state)
+ static void(* bf_compile(const char* program) )(bf_state_t*)

呼叫的部份也要改

- bf_interpret(program, &state);
+ bf_compile(program)(&state);

變數部份也要做一些新增和修改

- int nskip = 0;
+ dasm_State* d;
+ unsigned npc = 8;
+ unsigned nextpc = 0

接下來要根據不同平台去定義不同的 arch (硬體架構):

|.if X64
|.arch x64
|.else
|.arch x86
|.endif

在上面有宣告 dasm_State* d; 可以呼叫 dasm_init 來初始化

|.section code
dasm_init(&d, DASM_MAXSECTION);

第二個參數為在開始 define 的 DASM_MAXSECTION 1

結束以上動作後，還沒完全的初始化必須在 call dasm_setupglobal(&d, labels, lbl__MAX);

|.globals lbl_
void *labels[lbl__MAX];
dasm_setupglobal(&d, labels, lbl__MAX);

在 call dasm_setup(&d, bf_actions);

|.actionlist bf_actions
dasm_setup(&d, bf_actions);

將 actionlist 當作參數傳入

最後 call dasm_growpc(&d, npc); 來完成初始化

dasm_growpc(&d, npc);

npc 為目前被分配的 dynamic label

接下來要做一連串的定義，和巨集

接下來一樣做 Emitting 各 expression ，如 Arithmetic, I/O, Loops, 完成 compile。

用 JIT compiler 來執行之前的程式：

$ cp bf_dynasm.c tutorial.c
$ gcc -o minilua luajit-2.0/src/host/minilua.c -lm
$ ./minilua luajit-2.0/dynasm/dynasm.lua -o tutorial.posix64.c -D X64 tutorial.c
$ gcc -o tutorial tutorial.posix64.c
$ ./tutorial mandelbrot.bf

rubi

rubi 是成大資訊系師生合作開發、類似 Ruby 語法的程式語言實作，提供 JIT compiler，主體程式碼不到一千行，具體而微。Rubi 使用前述 DynASM 打造，請留意看 Makefile 得知編譯方式。