DynASM閱讀筆記

重點整理

DynASM簡介

先說明DynASM的目的

DynASM is a preprocessor and tiny runtime library for creating assemblers and JIT compilers in C or C++.

簡單說名就是他幫我們將interpreter的code很常執行的部分直接替換成machine code

Tutorial

這邊是用brainfsck為例，所以可能還是要先了解一下第一份作業在做啥，不然後面在做最佳化的時候可能會看不懂

Includes

首先需要include DynASM的header

+ #include "luajit-2.0/dynasm/dasm_proto.h"
+ #include "luajit-2.0/dynasm/dasm_x86.h"

需要更多資訊參考 reference page, dasm_proto.h 定義 DynASM API, dasm_x86.h 包含了前面使用的API (for x86 / x64).

Types

再來需要將 bf_interpret 改名成 bf_compile還有做一些表示上的轉換:

- static void bf_interpret(const char* program, bf_state_t* state)
+ static void(* bf_compile(const char* program) )(bf_state_t*)

之前的bf_interpret能夠傳入一個const char* 與一個 bf_state_t*, bf_compile現在改完後只接受const char* , 然後會傳回一個 function pointer to the JIT-compiled code.

在宣告那邊也要改一下

- bf_interpret(program, &state);
+ bf_compile(program)(&state);

Initialisation

前面設置完成後,下一件就是建立與初始化DynASM state.

Variables

我們接下來會需要一個變數其型態為 dasm_State* 來存放 DynASM的狀態,另外兩個額外的變數之後會討論到. 改到這邊可以把之前用到的東西刪掉了:

- int nskip = 0;
+ dasm_State* d;
+ unsigned npc = 8;
+ unsigned nextpc = 0;

.arch

接下來開始碰到第一個DynASM指令, 這個是 DymASM preprocessor的指令. 在這個例子中, 在這邊說明了我們要產生machine code給哪中平台, 不是x86就是x64:

+ |.if X64
+ |.arch x64
+ |.else
+ |.arch x86
+ |.endif

｜開頭的那一行會被DynASM preprocessor處理. DynASM preprocessor會處理 .if, .else, and .endif 指令,這邊的語法有點像C的preprocessor 指令#if, #else, and #endif. 結果就是會產生一組對應平台的machine code.

dasm_init

在前面已經先宣告過了dasm_State*用來紀錄DynASM的狀態,然而我們需要實際上給予dasm_State值,這時候需要呼叫dasm_init:

+ |.section code
+ dasm_init(&d, DASM_MAXSECTION);

Note：就像是dasm_State**一樣, dasm_init也需要一個整數的參數,用來說明machine code會產生多少的section.我們只需要一個code section,所以在這邊使用有加上一個參數的.section指令, 這個參數就是DASM_MAXSECTION，DynASM preprocessor會依據#define DASM_MAXSECTION 1 將其值改寫為一。而d就是用來儲存dynASM的狀態

.section

| .section name1 [, name2 [, ...]]

.section 允許DynASM source file能夠寫多個section。Every argument name results in the introduction of a new directive .name, and using one of these introduced directives causes all subsequent assembly and data directives to append to the named section. When dasm_encode is called, the sections are concatenated to form a single contiguous block of machine code.

dasm_setupglobal

dasm_init 會配置一個dasm_State,但是還沒有完整的被初始化. 還需要幾個步驟才能玩整的初始化state, 第一個就是 dasm_setupglobal:

+ |.globals lbl_
+ void* labels[lbl__MAX];
+ dasm_setupglobal(&d, labels, lbl__MAX);

The .globals 指令再加上lbl_ 會被DynASM preprocessor重新改寫成enum型態並且包含許多東西,其中之一就是lbl__MAX.這個值必須要被傳到dasm_setupglobal, 裡頭也要包含著同樣數量的void*的array. 在之後這個array也會很常使用到。

dasm_setup

再來的初始化過程是dasm_setup:

+ |.actionlist bf_actions
+ dasm_setup(&d, bf_actions);

The .actionlist指令後面加上bf_actions這個參數後會被DynASM preprocessor轉換成變數bf_actions, 然後這個變數會傳給dasm_setup.

dasm_growpc

通常在這邊dasm_State就會完全被初始話.但是我們在這邊使用dynamic labels, 還有最後要call dasm_growpc完成初始話:

+ dasm_growpc(&d, npc);

這這邊使用我們在前面定義的npc作為參數. 這個參數代表著我們已經配置的dynamic label, while the related variable nextpc 代表著我們已經使用過的dynamic labels。這些dynamic labels在compiling [ and ]會發揮作用。

Abstractions

在產生machine code之前, 定義一些抽象化的資料型態是很有用的。前面幾個抽象的資料型態是用來將使用的暫存器給予更有意義的名稱，後面是相關的function call:

所有這些抽象化透過.define（簡單替換）或.macro（對於更複雜的架構）來定義，並且對x86，x64，POSIX和x64中的每一個具有不同的定義：

+ |.if X64
+   |.define aPtr, rbx
+   |.define aState, r12
+   |.if WIN
+     |.define aTapeBegin, rsi
+     |.define aTapeEnd, rdi
+     |.define rArg1, rcx
+     |.define rArg2, rdx
+   |.else
+     |.define aTapeBegin, r13
+     |.define aTapeEnd, r14
+     |.define rArg1, rdi
+     |.define rArg2, rsi
+   |.endif
+   |.macro prepcall1, arg1
+     | mov rArg1, arg1
+   |.endmacro
+   |.macro prepcall2, arg1, arg2
+     | mov rArg1, arg1
+     | mov rArg2, arg2
+   |.endmacro
+   |.define postcall, .nop
+   |.macro prologue
+     | push aPtr
+     | push aState
+     | push aTapeBegin
+     | push aTapeEnd
+     | push rax
+     | mov aState, rArg1
+   |.endmacro
+   |.macro epilogue
+     | pop rax
+     | pop aTapeEnd
+     | pop aTapeBegin
+     | pop aState
+     | pop aPtr
+     | ret
+   |.endmacro
+ |.else
+   |.define aPtr, ebx
+   |.define aState, ebp
+   |.define aTapeBegin, esi
+   |.define aTapeEnd, edi
+   |.macro prepcall1, arg1
+     | push arg1
+   |.endmacro
+   |.macro prepcall2, arg1, arg2
+     | push arg2
+     | push arg1
+   |.endmacro
+   |.macro postcall, n
+     | add esp, 4*n
+   |.endmacro
+   |.macro prologue
+     | push aPtr
+     | push aState
+     | push aTapeBegin
+     | push aTapeEnd
+     | mov aState, [esp+20]
+   |.endmacro
+   |.macro epilogue
+     | pop aTapeEnd
+     | pop aTapeBegin
+     | pop aState
+     | pop aPtr
+     | ret 4
+   |.endmacro
+ |.endif

在為DynASM preprocessor完成所有這些架構和操作系統相關定義後，檢查指定給DynASM preprocessor的體系結構和操作系統是否與C預處理器已知的體系結構和操作系統相匹配是有用的，就靠下面的code將這些檢查完成:

+ ||#if ((defined(_M_X64) || defined(__amd64__)) != X64) || (defined(_WIN32) != WIN)
+ #error "Wrong DynASM flags used: pass `-D X64` and/or `-D WIN` to dynasm.lua as appropriate"
+ #endif

注意看看第一行:
==>

這行定義會取代 DynASM 的 .define，且會被加入在 .macro 的定義之中
在 DynASM preprocessing time 確定後就不會再被改變，如果 x64 或 WIN 在 DynASM preprocessing time 值被改為 1，他們就被取代為 1；若他們沒被改，就會被之後的 C processor 改為 0

Emitting Code

隨著所有的準備都完成後我們終於可以產生一些 machine code。

Prologue

第一件事情我們需要產生一個prologue，用來取代一部分由interpreter來初始化的程式:

- unsigned char* tape_begin = state->tape - 1;
- unsigned char* ptr = state->tape;
- unsigned char* tape_end = state->tape + TAPE_SIZE - 1;
+ |.type state, bf_state_t, aState

+ dasm_State** Dst = &d;
+ |.code
+ |->bf_main:
+ | prologue
+ | mov aPtr, state->tape
+ | lea aTapeBegin, [aPtr-1]
+ | lea aTapeEnd, [aPtr+TAPE_SIZE-1]

在這邊首先我們感興趣的是.type指令，後面允許我們將state-> tape作為[aState + offsetof（bf_state_t，tape）]的縮寫。

在下一行定義了一個新的變數Dst，並且將他初始化成&d。這是因為DynASM preprocessor將重寫之後的行到dasm_put（Dst，…）形式的調用，並且像之前對dasm_函數的呼叫一樣，第一個參數是&d。

This is done because the DynASM preprocessor will rewrite the subsequent lines to calls of the form dasm_put(Dst, …), and like the previous calls we've made to dasm_ functions, the first argument wants to be &d.

下一行包含了 .code 指令。這個指令藉由.section指令來引入，這個用在將產生出的machine code應放置在code section（正好是我們現在正在使用的段落）

再來定義一個global label ->bf_main。在產生完machine code後, machine code 中會包含著這一段global label並且轉換成function pointer。

再來調用了前面定義得 prologue macro，在這邊就會產生出幾行程式碼(是什麼他就沒有說)。

最後，有一個 mov instruction與兩個 lea instructions，這三行直接對應到刪掉的幾行interpreter code。如上所述，指定為mov的操作數的state-> tape被識別為[aState + offsetof（bf_state_t，tape）]的縮寫

As mentioned, the state->tape specified as an operand to mov is recognised as shorthand for [aState + offsetof(bf_state_t,tape)].

注意，offsetof（bf_state_t，tape）和TAPE_SIZE-1（lea操作數的一部分）都是所謂的encoding-time常數：DynASM不理解它們的含義，因此將它們的計算推遲到C編譯器。這兩個值都是C中的compile-time常數，但encoding-time常數不一定是compile-time常量（我們稍後會看到這個例子）。

Note that both offsetof(bf_state_t,tape) and TAPE_SIZE-1 (part of the lea operand) are so-called encoding-time constants: DynASM doesn't understand what they mean, so it defers their computation to the C compiler. Both of these values happen to be compile-time constants in C, but encoding-time constants don't have to be compile-time constants (we'll see examples of this in just a minute).

Tape Movement

We've reached the guts of the interpreter now, and the first job is to replace the interpreter's handling of < with the compiler's interpretation:

- if(!nskip) {
-   ptr -= n;
-   while(ptr <= tape_begin)
-     ptr += TAPE_SIZE;
- }
+ | sub aPtr, n%TAPE_SIZE
+ | cmp aPtr, aTapeBegin
+ | ja >1
+ | add aPtr, TAPE_SIZE
+ |1:

需要注意的是compiler沒有像interpreter那樣跳過代碼的概念，因此外部if會完全被丟棄。之後，ptr - = n; 並且後續循環的一些迭代已經成為| sub Ptr，n％TAPE_SIZE。

After that, ptr -= n; and some iterations of the subsequent loop have become | sub aPtr, n%TAPE_SIZE.

注意，n％TAPE_SIZE是一個編碼時間常數，它不是C中的編譯時常量：DynASM仍然不理解操作數的含義，但在這種情況下，當bf_compile運行時，計算操作數的最終值。
Note that n%TAPE_SIZE is an encoding-time constant which isn't a compile-time constant in C: DynASM still doesn't understand what the operand means, but in this case the final value of the operand is computed when bf_compile is running.

在編譯時透過%TAPE_SIZE執行循環的一些迭代之後，在運行時可能仍有一次迭代，這對應於cmp，ja和add指令。請注意，語法 >1 跳轉到本地標籤1的下一行，剛好在add指令之後。

After performing some iterations of the loop at compile time by means of %TAPE_SIZE, there might still be one iteration to perform at runtime, which correspond to the cmp, ja, and add instructions. Note that the syntax >1 jumps forward to the next definition of the local label 1, which is just after the add instruction.

A similar transformation happens for >, but with add and sub transposed:

if(!nskip) {
ptr += n;
while(ptr > tape_end)
ptr -= TAPE_SIZE;
}
| add aPtr, n%TAPE_SIZE
| cmp aPtr, aTapeEnd
| jbe >1
| sub aPtr, TAPE_SIZE
|1: