2016q3 A09 (JIT-compiler)

contributed by <hugikun999>, <ponsheng>
http://www.voy.com/245963/
github page : https://github.com/ponsheng/jit-construct

預期目標

學習設計 JIT 編譯器
運用計算機組織的背景知識，進行深入效能評估和提出改善方案
延展程式語言，提供 concurrency 能力

背景知識

Just-in-time compilation

靜態編譯(compiler)：程式都編譯成 machine language，再執行
動態直譯(interpreter)：在程式執行時，一行一行編譯

JIT compile (動態編譯)在程式執行階段才把程式編譯成 machine language ，同時進行程式優化

通常程式會出現兩種現象，因此可以對應做優化

重複執行相同的程式碼 -> 將熱點程式儲存起來，下次遇到時，可以直接執行不用編譯。
程式不夠精簡，造成多餘成本 -> 先讀取部份程式，優化後，再轉成 machine code

這次作業要優化第2種現象(BF 沒有 subroutine 的概念)

JIT 的優化範例可見第二個參考網址

(1)~~Understanding the differences~~
(2)JIT-compiler
(3)What is the difference between a compiler and an interpreter?

直譯器(interpreter) v.s 編譯器(compiler)
直譯器會在程式執行時才將原始碼一行一行轉譯並在轉譯後立即執行，相對於直接執行編譯器所產生的機械碼花費更多的時間。但是編譯器必須先將原始碼編譯成可執行碼之後才會執行，相比之下直譯的動作會更省時間。
sed語法

$ sed 's/要被取代的字串/新的字串/g'

other
(1) magic code: an informal term of abstraction
(2) final: if a thing is to be declear with the label final,the thing can't to be inherit by other.
(3) virtual: if a thing is to be declear with the label virtual,it will be late binding(dynamic binding).
(4) late binding: the method being called upon an object or the function being called with arguments is looked up by name at runtime.

brainfuck optimization strategies

contraction
將連續性的+、-、>、<統計總次數後做一次性的處理。
形式：+++++
code: *p += 5;
clear loop
將迴圈歸零的形式[-]，直接改成將值設為0。
形式：[-]
code: *p = 0;
copy loops
直接將當前指標的值加給迴圈內牽涉到的其他指標的值。
形式：[->+<]
code: *(p + 1) += *p;
multiplication loops
類似 copy loops 的方法，如果迴圈內是對同一個指標做多次的+、-，可以利用當前指標的值乘上加總次數加給目標指標。
形式：[->+++<]
code:*(p + 1) = *p * 3 ;
scan loops
在較多的[>]、[<]情況下才有好的效能改善，在C語言中可以利用 memchr()、memrchr() 來取代原本的寫法。
operation offsets
事先計算對哪些指標做了什麼事，直接將 p 加上offset做運算，而不必依照 program 的順序一直對 p 做左移或右移。

程式理解：

honai:

完成一次9層河內塔
有大量連續的 +++、---、>>>、<<<,也有滿多的[-]可以用 clear loops。
另外有在[]內做許多連續運算可以用 multiplication loops。

awib：

BF編譯器
3個程式中唯一有出現[<]、[>]，可以用 scan loops 優化。

這個程式要開啟來看一下，有詳細的 Usge，還有把 bf code 排成有趣圖案

mandelbrot：

碎形程式

編譯：





lua dynasm/dynasm.lua -o jit-x86.h jit-x86.dasc
# 用lua interpreter 執行 Lua程式，生成jit-x86.h

cc -Wall -Werror -std=gnu99 -I. -o jit-x86 -DJIT=\"jit-x86.h\" dynasm-driver.c

The LuaJIT Project

Lua 是一種相當普遍使用的內嵌式語言，很適合配合使用其它程式語言（例如 C 或 C++）所寫的程式。Lua 的主要目的之一是要讓對程式設計不熟悉的人也能使用，因此它的設計相當單純，適合一般人學習。

Dynasm

Dynasm home page

The Unofficial DynASM Documentation

A preprocessor and tiny runtime library for creating assemblers and JIT compilers in C or C++
writing a just-in-time compiler or need to generate code on the fly
把組語轉換成 opcode 型式，方便撰寫
other
(1) 複雜的宣告

dynasm.lua

這個檔案是 DynASM 的 preprocessor 的核心檔案，需要一個 DynASM 檔(可以混合 C 和 Assembly)做 input，輸出 C 檔，therein performing the bulk of the conversion from assembly to machine code.

DynASM 檔規則：
1.無 | : 開頭的行：原封不動
2.有一個 | ：包含 assembly instruction 或 assembler directives，會被轉成 C 的型式：dasm_put(Dst, ...)
3.有兩個 \ :被當做 C code，undergo prepreprocessor substitutions created by .define, and can be part of .macro definitions.

DynASM Directives

.arch
指定 DynASM 所使用之組語使用什麼架構
.actionlist
在 C file 中產生一個 array，內容的號碼映射到程式中使用的 assembly 的 opcode。
在.if、.else、.endif前面加上|會被視為C裏面的#if、#else、#endif。
| mov aPtr, state->tape
means [aState + offsetof(bf_state_t,tape)]

DynASM Functions

dasm_init
dasm_State 所宣告的位址需要利用 dasm_init 來初始化。DASM_MAXSECTION 可以利用 |.section code 來決定section的數值。
dasm_setupglobal
dasm_init 並未完全初始化 dasm_State 還需要利用到 dasm_setupglobal。|.globals 後面所接的變數會被視為 enum，且包含一個 lbl_MAX。
dasm_put(Dst, index)
執行從action list 中某個index開始的 assembly，執行方法是 "從 actionlist[index]開始取出 opcode 或是參數，再取下一位，直到收到255(the DynASM bytecode instruction DASM_STOP)停止"

ex:

//|.arch x86

static const unsigned char actions[66] = {
  85,137,229,131,252,236,8,137,157,233,139,133,233,137,195,255,

// Function prologue.
//|  push ebp             //opcode  (55)16 = (85)10
//|  mov  ebp, esp		//opcode  (89)16 = (137)10
//|  sub  esp, 8		//opcode  (83)16 = (131)10
//|  mov  [ebp - 4], PTR
//|  mov  eax, [ebp + 8]
//|  mov  PTR, eax
dasm_put(Dst, 0, - 4, 8);   //從第一個element開始取

問題：程式行為是先跑過 JIT 後，將 assembly存起來(存進DynASM state)，最後才轉machine code(dasm_link / dasm_encode), 是否有違JIT概念？如果code很大怎麼半 ponsheng Chen

Benchmark

執行 make bench-bench-jit-x86

會以某這個環境變數去執行 python script : tests/bench.py

env PATH='.:${PATH}' BF_RUN='jit-x86' tests/bench.py

裏面執行程式的片段






def get_output(program, stdin):
    p = subprocess.Popen([os.getenv('BF_RUN','./jit-x86'), program] + sys.argv[1:], stdout=subprocess.PIPE, stdin=subprocess.PIPE)

    start = time.time()
    output = p.communicate(input=stdin + '\x00')[0]
    return output, time.time() - start

第2行 -> Execute a child program in a new process.
- stdout=subprocess.PIPE, stdin=subprocess.PIPE:
  將 stdout 跟 stdin 轉向 python 內部
- os.getenv('BF_RUN','./jit-x86'), program] + sys.argv[1:]
  BF_RUN 環境變數 (jit 檔) 作為執行程式，其餘作為子程式之 arg:

Note that if you want to send data to the process’s stdin, you need to create the Popen object with stdin=PIPE. Similarly, to get anything other than None in the result tuple, you need to give stdout=PIPE and/or stderr=PIPE too.

第5行 -> Interact with process 給予程式 stdin
- p.communicate(input=stdin + '\x00')[0]

intel x86 ISA

摸索好久

http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
https://en.wikibooks.org/wiki/X86_Assembly/Data_Transfer
http://sparksandflames.com/files/x86InstructionChart.html

優化方式

紀錄原版時間

progs/awib.b            GOOD	117.1ms
progs/mandelbrot.b      GOOD	3329.0ms
progs/hanoi.b           GOOD	8630.7ms

中英關鍵字記得以空白區隔
課程助教

OK 謝謝助教 :+1:

Contraction

此方法是將多餘運算整併，例如 +++++ ，其實就是+=5，此改善方法雖然直覺簡單，但實質效益也不差，而實作方法可在遇到+ - > < 時即檢查是否為連續出現。而在此我們保留 inc 在只有出現一次的時候，可使效能最佳化。

















int continuous_count(char *p)
{
    char *ptr = p;
    int count = 0;
    while (*ptr == *p) {
        count++;
        ptr++;
    }
    return count;
}

//int main()
case '>':
    count = continuous_count(p);
    |  add  PTR, count
    p += (count - 1);
	break;

1.+

progs/awib.b             GOOD	120.5ms
progs/mandelbrot.b       GOOD	3567.5ms
progs/hanoi.b            GOOD	12071.5ms  -> slow down

progs/awib.b            GOOD	61.3ms
progs/mandelbrot.b      GOOD	3443.9ms
progs/hanoi.b           GOOD	4382.7ms

2.++-

progs/awib.b            GOOD	48.5ms
progs/mandelbrot.b      GOOD	3351.4ms
progs/hanoi.b           GOOD	4331.0ms

3.++-+<

progs/awib.b            GOOD	47.9ms
progs/mandelbrot.b      GOOD	2148.3ms
progs/hanoi.b           GOOD	4238.3ms

4.++-+<+>

progs/awib.b            GOOD	47.3ms
progs/mandelbrot.b      GOOD	1098.8ms
progs/hanoi.b           GOOD	4302.7ms

Reduce Loops

Clear loops & Copy loops & Multiplication loops

Clear loops： [-]
Copy loops: [->>>+<<<]
Multipy loops: [->>+++>+<<<]

需要注意的case

1.處理完loop後，要將指標做正確的平移
2.連續的+之間存在空白，換行或其他無關指令
























































int res_count;

int check_loops(char *p,int *index,int *mult)
{
    int res=0,offset = 0,_index = 0;
    char *temp_p = p;
    if (*(p+1) != '-') return -1;
    p += 2;
    while (*p != ']') {
        if (*p == '[' || *p == '-' ||
            *p == '.' || *p == ',')
            return -1;
        res = continuous_count(p);
        if (*p == '>') offset += res;
        else if (*p == '<') offset -= res;
        else if (*p == '+') {
            for(int i=0;i <_index;i++)
            {
                if(index[i] == offset)
                {
                    mult[i] += res;
                    goto L1;
                }
            }
            index[_index] = offset;
            mult[_index] = res;
            _index++;
        }
L1:        p += res;
   }
   if (offset != 0) return -1;
    res_count = p - temp_p;
   return _index;
}

//int main()
if (return_value == 0){   //clear loops      
    |  mov   byte [PTR], 0
    p += res_count;
    res_count = 0;
}
else if(return_value > 0){ //copy 
    for(int t=0; t < return_value; t++){

        | mov al, [PTR]
    	| mov dl, mult[t]
        | mul dl
        | mov edx, index[t]
        | add edx, PTR
        | add byte [edx], al

	}
    | mov  byte [PTR] ,0
    p += res_count;
    res_count = 0;
}

progs/awib.b            GOOD	39.4ms
progs/mandelbrot.b      GOOD	2896.8ms
progs/hanoi.b           GOOD	3461.6ms

Result

Exec time for every Program

Concurrency

對 Brainfuck 程式語言進行擴充，使其能夠支援 concurrency / parallelism。需要有程式語言的規格描述並且驗證。
* Bukkake: A parallel brainfuck JIT in C
* Brainfuck Process Extensions
* Parallel Brainfuck
* Concurrent Brainfuck

+++++*-[**********-] |>++H.|>+e.+++++ ++l.l.+++o.>++_.<|W|o.+++r.----- -l. ----- ---d.>+!.>\n.        
> +++++ ++         70|
>> +++++ +++++    100|
>>> +++            30|
>>>> +             10|
                           |>+++++ +++++ +++++         |.|

研究中

Ref

Brainfuck Optimization Strategies

Jserv's blog

Virtual Machine Constructions for Dummies

Just in Time Compiler ( understanding JVM working in detail )

https://github.com/rampant1018/jit-construct

作業要求： https://hackmd.io/s/SJ7AZJv1e

#line 136 xxx.c
GCC Line Control

Hello, JIT World: The Joy of Simple JITs