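# ria-jit: Summary of Key Points
###### tags: `sysprog2020` `ria-jit`

> [ria-jit source code](https://github.com/ria-jit/ria-jit)
> [ria-jit paper](https://github.com/ria-jit/ria-jit/blob/master/documentation/paper/paper.pdf)

## Background
[ria-jit](https://github.com/ria-jit/ria-jit) is a user-space [dynamic binary translation](https://zh.wikipedia.org/zh-tw/%E5%8D%B3%E6%99%82%E7%B7%A8%E8%AD%AF) emulator. It emulates a 64-bit RISC-V Linux execution environment on x86_64 Linux.

### What makes a dynamic binary translator hard?
An emulator has two key requirements, **correctness** and **speed**, and the two have to be weighed together. This emulator has to deal with the following points:
- The [addressing modes](https://en.wikipedia.org/wiki/Addressing_mode) of RISC-V and x86_64 differ greatly
    - For example, the x86_64 `movq` instruction accepts the following kinds of operands
        - ![](https://i.imgur.com/t7bEH7D.png)
        - ![](https://i.imgur.com/VwoRnC0.png)
        - Note: for a more detailed introduction to x86_64 instructions, see [CS:APP Chapter 3](https://hackmd.io/@sysprog/CSAPP/https%3A%2F%2Fhackmd.io%2Fs%2FHJDRfVCFG)
    - By contrast, RISC-V has only three addressing modes
        - PC-relative (e.g. `auipc`, `jal`, and the branch instructions)
        - Register-offset (e.g. `jalr`, `addi`, `lw`, `sw`)
        - Absolute (e.g. `lui`)
        - Note: for details, see [RISC-V Code Models](https://www.sifive.com/blog/all-aboard-part-4-risc-v-code-models)
- RISC-V and x86_64 have different numbers of registers. Registers include **general-purpose registers (GPR)** and **floating-point registers (FPR)**. RISC-V has 32 GPRs while x86_64 has 16, so a one-to-one mapping is impossible (moreover, not all 16 x86_64 GPRs are actually available for mapping)
- RISC-V is a **[load-store](https://en.wikipedia.org/wiki/Load%E2%80%93store_architecture)** architecture, while x86_64 is a **[register-memory](https://en.wikipedia.org/wiki/Register_memory_architecture)** architecture. RISC-V arithmetic instructions only access registers, and memory is read and written only through load/store instructions such as `lw` and `sw`. x86_64 arithmetic instructions can access memory directly.
    - The following example illustrates the problem:
        - Suppose we want to translate the RISC-V instruction `sub x3, x2, x1` into x86_64 instructions
        - On a three-operand architecture like RISC-V, it is perfectly normal for the source operands (`x1`, `x2`) and the destination operand (`x3`) to differ, as in `sub x3, x2, x1`
        - On x86_64, one source operand is usually also the destination operand. For example, `addq %rbx, %rcx` is equivalent to `%rcx = %rcx + %rbx`
        - To emulate `sub x3, x2, x1` on x86_64, we need two x86_64 instructions, `movq %rbx, %rcx` and `subq %rax, %rcx` (assuming `%rax` maps to `x1`, `%rbx` to `x2`, and `%rcx` to `x3`)
    - So even though x86_64 is a CISC instruction set and RISC-V is a RISC instruction set, a single RISC-V instruction can require several x86_64 instructions to achieve the same result.
- **Efficiently merging several RISC-V instructions into a single x86_64 instruction is the hardest part of this emulator**

## Memory management in the emulator
An emulator has to manage both the memory space of the guest binary (RISC-V) and the memory space of the host machine (x86_64).
- ria-jit maps memory as follows: ![](https://i.imgur.com/JPGbxu1.png)
- The header of an **ELF file** (Executable and Linkable Format) specifies which segments the loader must load into memory and at which addresses they must be placed
- ria-jit source code
    - The ELF loading routine is in `src/elf/loadElf.h`

## The basic unit of translation
- A dynamic translator has to translate many RISC-V instructions into x86_64 instructions while the program is running, so it must interleave translation and execution while still keeping performance in mind.
- A translator could of course scan all RISC-V instructions before translating, but that belongs to the domain of "static translation". A dynamic translator splits the instructions into small chunks, reads one chunk, translates it, and then executes the translated x86_64 code
- A dynamic translator naturally uses the [basic block](https://en.wikipedia.org/wiki/Basic_block) as its chunk (much like a compiler does), so the basic block is the basic unit of translation
    - Basic block: a code segment with a single point of entry and a single point of exit
    - A basic block is terminated by any control-flow altering instruction, such as a **jump**, **call**, or **return statement**, or by a **system call**.

## JIT details
### Translation procedure
- Step 1: Convert the RISC-V instructions in a basic block into ria-jit's internal representation
- Step 2: Check whether this basic block has already been translated. Translated basic blocks are kept in memory (ria-jit calls this a cache, but it lives in main memory and is not a hardware cache)
    - If not, the block has to be translated
        - The translator switches to "translation mode" and translates the RISC-V instructions into x86_64 instructions.
        - The translator compares the RISC-V instructions in the basic block against a set of predefined patterns. The code is in [patterns.c](https://github.com/ria-jit/ria-jit/blob/master/src/gen/instr/patterns.c)
        - When translation finishes, the result is stored in memory.
        - The translator switches to "execution mode"
    - If so, the translation time is saved.
        - The previously translated x86_64 instructions are looked up in memory and reused directly
        - The translator switches to "execution mode"
- Step 3: Before switching to "execution mode", the translator has to perform a state switch that restores the "execution mode" state from "translation mode".
    - The state (context) consists of the mode's PC and the values of all registers

#### Summary Control flow diagram
![](https://i.imgur.com/UJflD3J.png)

### Instruction translation patterns
- Translation patterns are a set of hand-written translation rules used to turn a targeted sequence of RISC-V instructions into x86_64 instructions
- The patterns used by ria-jit can be found in [patterns.c](https://github.com/ria-jit/ria-jit/blob/master/src/gen/instr/patterns.c)
- The paper lists a small subset of the patterns ![](https://i.imgur.com/LJkOEqA.png)

### Code cache
- Problem: how can a translated basic block be found quickly in memory (the code cache)?
- ria-jit's solution: a two-level software-defined cache (code in `src/cache/cache.c`)
    - Structure
        - The large cache has 8192 entries (the size is adjustable)
            - Each entry (`t_cache_entry`) holds the following information:
                - `t_risc_addr`: the PC address in the RISC-V environment (unsigned long)
                - `t_cache_loc`: a `void *` pointer to the cached location of the translated code segment
            - Code
                ```c
                //cache entries for translated code segments
                typedef struct {
                    //the full RISC-V pc address
                    t_risc_addr risc_addr;

                    //cache location of that code segment
                    t_cache_loc cache_loc;
                } t_cache_entry;
                ```
            - Hash function
                ```c
                //drop the low 2 bits, then mask to the table size
                //(the mask only works if table_size is a power of two)
                inline size_t hash(t_risc_addr risc_addr) {
                    return (risc_addr >> 2u) & (table_size - 1);
                }
                ```
        - The small cache has 32 entries (fixed size)
            - Entries in the large and small caches have the same layout
            - Hash function
                ```c
                //drop the low 3 bits, then mask to the small table size
                inline size_t smallhash(t_risc_addr risc_addr) {
                    return (risc_addr >> 3u) & (SMALLTLB - 1);
                }
                ```
    - How it works
        - Given a 64-bit RISC-V address, map it to a value from 0 to 31
        - Use the mapped value to look up the small cache
            - Hit: return the address stored in the entry
            - Miss: look up the large cache
                - Map the 64-bit RISC-V address to a value from 0 to 8191
                - Use the mapped value to look up the large cache
                    - Hit: copy the entry into the small cache and return the address stored in the entry
                    - Miss: translate the block
- Follow-up problem: what should happen when the cache is full? (Cache replacement policy)
    - Invalidate and purge some or all of the blocks currently residing in the cache
        - Cons: adds performance overhead if purged blocks are needed again in the future.
    - Dynamically resize the cache according to the needs of the guest program
        - Cons: higher memory usage

### How is the register mapping problem handled?
- Problem statement: x86_64 has fewer than 16 usable GPRs and RISC-V has 32. How can the 32 RISC-V GPRs be mapped onto the limited set of x86_64 GPRs?
- Idea: emulate all 32 RISC-V GPRs in memory and turn every GPR read and write into a memory read and write? (The register file implementation of `rv32emu-next` can be found in [riscv.h](https://github.com/sysprog21/rv32emu-next/blob/master/riscv.h))
    - The ria-jit paper explains why it does not take this approach:
        > Keeping a **guest register file exclusively in memory**, and loading them into native registers when needed within the translations of single instructions is technically possible, especially in light of the ability to extensively use memory operands in the instructions on x86–64. However, this necessitates a **large number of memory accesses** for both memory operands in the instructions as well as local register allocation within the translated blocks. Due to the very large performance gain connected to using register operands instead of memory operands, this is also not feasible at scale.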
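
To make the cost of the memory-only idea concrete, here is a minimal sketch of a guest register file kept entirely in host memory. It is not taken from ria-jit or `rv32emu-next`; the names `guest_regs`, `reg_read`, `reg_write`, and `emulate_sub` are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical guest state: all 32 RISC-V GPRs live in host memory. */
typedef struct {
    uint64_t x[32];
} guest_regs;

/* Read a guest register. x0 is hard-wired to zero in RISC-V. */
static uint64_t reg_read(const guest_regs *g, unsigned r) {
    return r == 0 ? 0 : g->x[r];
}

/* Write a guest register. Writes to x0 are discarded. */
static void reg_write(guest_regs *g, unsigned r, uint64_t v) {
    if (r != 0) {
        g->x[r] = v;
    }
}

/* Emulates `sub rd, rs1, rs2`. Conceptually this costs two reads and
 * one write of the in-memory register file for a single guest
 * instruction, which is the overhead the paper quote refers to. */
static void emulate_sub(guest_regs *g, unsigned rd, unsigned rs1, unsigned rs2) {
    reg_write(g, rd, reg_read(g, rs1) - reg_read(g, rs2));
}
```

For example, after writing 5 to `x1` and 12 to `x2`, calling `emulate_sub(&g, 3, 2, 1)` leaves 7 in `x3`. Every guest instruction pays for this memory traffic, which is why mapping hot guest registers to host registers is attractive.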
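- ria-jit's solution:
    - ria-jit implements a register-access profiling tool that finds the **most frequently used registers** in RISC-V programs, and the emulator maps these directly to fixed x86_64 registers
        - Profiling results: ![](https://i.imgur.com/SXMVRYy.png)
        - Register mapping chosen based on the profiling results ![](https://i.imgur.com/dhFFaIU.png)
    - The remaining, less frequently used RISC-V registers are **dynamically mapped to x86_64 registers**
        - If the translator encounters a RISC-V register that is not currently mapped, the value of the RISC-V register that was used last is written back to memory, and the unmapped RISC-V register is mapped to the freed x86_64 register.

### Register handling when the translator switches state
- Problem statement: a dynamic translator has two modes, "translation mode" (host context) and "execution mode" (guest context). Translation mode translates instructions, and execution mode runs the translated code on x86_64. The two modes have different memory spaces and different register contents (contexts). The concept is close to processes and virtual memory in an operating system.
- ria-jit details: a context switch (not an operating system context switch, but a similar idea) is used to move between the modes
    - The context switch in ria-jit takes care of the register mapping between RISC-V and x86_64
    - Before switching to "execution mode" (guest context), the values of the statically mapped registers are read from memory and stored into their corresponding x86_64 registers. The translator then switches to the guest address space and starts executing the translated code.
    - Before switching to the host context to translate, all register values of "execution mode" (guest context) have to be saved, so that the state can be restored and execution can continue once translation finishes.

### System call handling
- RISC-V uses **ecall** to issue system calls. The **system call number** is passed in register **a7**, and the arguments are passed in **a0 – a5**
- RISC-V Linux and x86_64 Linux use different system call interfaces, so some RISC-V system calls need special handling

### Floating-point handling
- The main difficulties (and their resolutions) that arise from using the **x86_64 SSE** extensions to translate the RISC-V **F- and D-extensions**
    - Register handling
        - We utilise the tools we designed to discover the **most-used registers** in the guest programs, and statically map these to **x86_64 SSE registers XMM2–XMM15**.
    - Missing equivalent SSE instructions
        - The instructions that need to be emulated are the **unsigned conversion** instructions
    - Rounding modes
        - These are handled differently in the RISC-V architecture, where the rounding mode can be set **individually for every instruction**. The rounding mode of the **SSE extension**, however, is controlled by the state of the **MXCSR** control and status register.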
    - Exception handling
        - In RISC-V, floating-point exception handling is realized by reading the **fcsr** floating-point control and status register; **traps are not supported.**
        - The **CSR** instructions used to read this register are thus emulated to instead **read and translate the MXCSR exception flags**.
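
The flag translation mentioned in the last point can be sketched as a simple bit remapping. This is an illustrative sketch, not ria-jit's actual code: the function name `mxcsr_to_fflags` and the macro names are hypothetical, while the bit positions follow the MXCSR layout in the Intel manual and the `fflags` field of `fcsr` in the RISC-V specification.

```c
#include <stdint.h>

/* MXCSR exception flag bits (x86_64 SSE). */
#define MXCSR_IE (1u << 0) /* invalid operation */
#define MXCSR_DE (1u << 1) /* denormal operand, no RISC-V counterpart */
#define MXCSR_ZE (1u << 2) /* divide by zero */
#define MXCSR_OE (1u << 3) /* overflow */
#define MXCSR_UE (1u << 4) /* underflow */
#define MXCSR_PE (1u << 5) /* precision (inexact) */

/* RISC-V fcsr.fflags bits. */
#define RV_NX (1u << 0) /* inexact */
#define RV_UF (1u << 1) /* underflow */
#define RV_OF (1u << 2) /* overflow */
#define RV_DZ (1u << 3) /* divide by zero */
#define RV_NV (1u << 4) /* invalid operation */

/* Hypothetical helper: translate the MXCSR exception flags into the
 * layout a guest expects when it reads fflags. The denormal flag and
 * the mask/rounding bits of MXCSR are ignored. */
static uint32_t mxcsr_to_fflags(uint32_t mxcsr) {
    uint32_t fflags = 0;
    if (mxcsr & MXCSR_IE) fflags |= RV_NV;
    if (mxcsr & MXCSR_ZE) fflags |= RV_DZ;
    if (mxcsr & MXCSR_OE) fflags |= RV_OF;
    if (mxcsr & MXCSR_UE) fflags |= RV_UF;
    if (mxcsr & MXCSR_PE) fflags |= RV_NX;
    return fflags;
}
```

The bits need individual remapping because the two architectures order the flags differently, so a plain mask or shift of the MXCSR value would not give the guest the right result.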