# rv32emu Study Notes

## Target

1. Try to solve the problems that [rv32emu](https://github.com/sysprog21/rv32emu) currently faces
2. Record what I see and learn
3. Record the problems I observe

### [Notes on the rv32emu paper](https://hackmd.io/@bTrULUl4Q3OpYASGDynLwg/BySJdH_Na)

## References

[Accelerating rv32emu with JIT compilation](https://hackmd.io/@lambert-wu/rv32emu)

### ELF

[wikipedia](https://zh.wikipedia.org/zh-tw/%E5%8F%AF%E5%9F%B7%E8%A1%8C%E8%88%87%E5%8F%AF%E9%8F%88%E6%8E%A5%E6%A0%BC%E5%BC%8F)

ELF stands for Executable and Linkable Format, a standard file format for executables, object code, shared libraries, and core dumps. According to the [specification](https://refspecs.linuxfoundation.org/elf/elf.pdf), there are three main types of object file:

1. relocatable file
   > A relocatable file holds code and data suitable for linking with other object files to create an executable or a shared object file.
2. executable file
   > An executable file holds a program suitable for execution.
3. shared object file
   > A shared object file holds code and data suitable for linking in two contexts.
   >
   > First, the link editor may process it with other relocatable and shared object files to create another object file.
   >
   > Second, the dynamic linker combines it with an executable file and other shared objects to create a process image.
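The three object-file types above are recorded in the `e_type` field of the ELF header. The following is a minimal sketch, not rv32emu code: `elf_type_name` is a hypothetical helper written for this note. It assumes the standard layout in which the 16-byte `e_ident` array is followed immediately by the 16-bit `e_type`, and it uses the `EI_DATA` byte (index 5) to pick the byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of rv32emu): return a label for the
 * object-file type stored in an ELF header, or NULL when the buffer is too
 * short or does not start with the ELF magic bytes. */
const char *elf_type_name(const uint8_t *buf, size_t len)
{
    static const uint8_t magic[4] = {0x7f, 'E', 'L', 'F'};
    assert(buf != NULL);

    /* e_type is a 16-bit field at offset 16, right after e_ident */
    if (len < 18 || memcmp(buf, magic, sizeof(magic)) != 0)
        return NULL;

    /* EI_DATA (byte 5): 1 = little endian, 2 = big endian */
    uint16_t type = (buf[5] == 2) ? (uint16_t) ((buf[16] << 8) | buf[17])
                                  : (uint16_t) (buf[16] | (buf[17] << 8));

    switch (type) {
    case 1:
        return "relocatable";
    case 2:
        return "executable";
    case 3:
        return "shared object";
    default:
        return "other";
    }
}
```

Passing the first 18 bytes of a file read with `fread` is enough to classify it; values other than 1, 2 and 3 (for example core files) fall into the `"other"` bucket in this sketch.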
ELF files are produced by the assembler and the linker. They are the binary representation of a program, intended to be executed directly on a processor.

The linking view and the execution view describe the file format in two different situations:

* The linking view is the format of the compiled program as a file. It can sit on secondary storage, such as a hard disk, and has not yet been loaded into memory.
* The execution view is the format of the program once it has been loaded into memory and is running.

![](https://hackmd.io/_uploads/SkeCpTPZa.png)

Suppose you have written a program called million.c. The following command uses gcc to compile it into an executable called million.

```shell
$ gcc -g million.c -o million
```

To inspect the ELF header of the resulting file, run:

```shell
$ readelf -h million
```

The ELF header is shown below. Note that:

* In this ELF file, each program header is 56 bytes
* There are 13 program headers

```
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x11e0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          18584 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         13
  Size of section headers:           64 (bytes)
  Number of section headers:         36
  Section header string table index: 35
```

To inspect the program header table, run:

```shell
$ readelf -l million
```

* `Entry point 0x11e0` is the memory address of the executable's entry point. When the executable runs, execution starts at this address.
* `There are 13 program headers, starting at offset 64` means the ELF file contains 13 program headers and that the program header table starts at file offset 64. Each program header describes one segment.
* Among the program headers, entries of type `LOAD` are the parts the operating system loads into memory, and the entry of type `INTERP` stores the path of the dynamic linker.

```
Elf file type is DYN (Shared object file)
Entry point 0x11e0
There are 13 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000002d8 0x00000000000002d8  R      0x8
  INTERP         0x0000000000000318 0x0000000000000318 0x0000000000000318
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x00000000000008c8 0x00000000000008c8  R      0x1000
  LOAD           0x0000000000001000 0x0000000000001000 0x0000000000001000
                 0x0000000000000675 0x0000000000000675  R E    0x1000
  LOAD           0x0000000000002000 0x0000000000002000 0x0000000000002000
                 0x00000000000001f0 0x00000000000001f0  R      0x1000
  LOAD           0x0000000000002d58 0x0000000000003d58 0x0000000000003d58
                 0x00000000000002b8 0x00000000000002c0  RW     0x1000
  DYNAMIC        0x0000000000002d68 0x0000000000003d68 0x0000000000003d68
                 0x00000000000001f0 0x00000000000001f0  RW     0x8
  NOTE           0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000020 0x0000000000000020  R      0x8
  NOTE           0x0000000000000358 0x0000000000000358 0x0000000000000358
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_PROPERTY   0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000020 0x0000000000000020  R      0x8
  GNU_EH_FRAME   0x000000000000207c 0x000000000000207c 0x000000000000207c
                 0x000000000000004c 0x000000000000004c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000002d58 0x0000000000003d58 0x0000000000003d58
                 0x00000000000002a8 0x00000000000002a8  R      0x1

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
   03     .init .plt .plt.got .plt.sec .text .fini
   04     .rodata .eh_frame_hdr .eh_frame
   05     .init_array .fini_array .dynamic .got .data .bss
   06     .dynamic
   07     .note.gnu.property
   08     .note.gnu.build-id .note.ABI-tag
   09     .note.gnu.property
   10     .eh_frame_hdr
   11
   12     .init_array .fini_array .dynamic .got
```

[rv32emu/src/elf.h](https://github.com/sysprog21/rv32emu/blob/master/src/elf.h) defines the data structures for the entries of the Elf32 program header table and the Elf32 section header table.

An entry of the Elf32 program header table:

```c
struct Elf32_Phdr {
    Elf32_Word p_type;   /* Type, a combination of ELF_PROGRAM_TYPE_ */
    Elf32_Off p_offset;  /* Offset in the file of the program image */
    Elf32_Addr p_vaddr;  /* Virtual address in memory */
    Elf32_Addr p_paddr;  /* Optional physical address in memory */
    Elf32_Word p_filesz; /* Size of the image in the file */
    Elf32_Word p_memsz;  /* Size of the image in memory */
    Elf32_Word p_flags;  /* Type-specific flags */
    Elf32_Word p_align;  /*
                            Memory alignment in bytes */
};
```

An entry of the Elf32 section header table:

```c
struct Elf32_Shdr {
    Elf32_Word sh_name;      /* Section name */
    Elf32_Word sh_type;      /* Section type */
    Elf32_Word sh_flags;     /* Section flags */
    Elf32_Addr sh_addr;      /* Section virtual addr at execution */
    Elf32_Off sh_offset;     /* Section file offset */
    Elf32_Word sh_size;      /* Section size in bytes */
    Elf32_Word sh_link;      /* Link to another section */
    Elf32_Word sh_info;      /* Additional section information */
    Elf32_Word sh_addralign; /* Section alignment */
    Elf32_Word sh_entsize;   /* Entry size if section holds table */
};
```

## RISC-V instruction encoding

RISC-V instruction encodings are divided into the following types:

* R-type
* I-type
* S-type
* B-type
* U-type
* J-type

![](https://hackmd.io/_uploads/BkWdAQbZ6.png)

The fields in the figure above are:

* imm: an immediate value whose role depends on the instruction
* rs1: index of the first source register
* rs2: index of the second source register
* rd: index of the register that stores the result
* opcode: identifies which operation the instruction performs
* funct3: additional opcode bits
* funct7: additional opcode bits

Each field can be extracted from an instruction with bitmask operations.

### How the emulator works

[rv32emu](https://github.com/sysprog21/rv32emu) emulates a CPU that implements the RISC-V instruction set. In other words, it emulates the CPU's [instruction cycle](https://en.wikipedia.org/wiki/Instruction_cycle).

> The instruction cycle (also known as the fetch–decode–execute cycle, or simply the fetch-execute cycle) is the cycle that the central processing unit (CPU) follows from boot-up until the computer has shut down in order to process instructions.
>
> It is composed of three main stages: the fetch stage, the decode stage, and the execute stage.
>
> Each computer's CPU can have different cycles based on different instruction sets

The RISC-V specification defines the number and roles of the registers, as well as the operation and fields of every instruction. Following it, the emulator implements the CPU's decoder and the operation behind each instruction (e.g. addition, subtraction, multiplication, division) and stores the results in registers. This makes it possible to emulate the RISC-V instruction set in software on a machine with a different instruction set architecture.

### Memory I/O

RISC-V is a load-store architecture: before an instruction can operate on data, the data has to be loaded from memory into a register. Memory is byte-addressed, so the smallest addressable unit is one byte. The memory in rv32emu is organized as a $2^{16} \times 2^{16}$ grid:

* The X axis indexes the chunks, each of which holds a data array
* The Y axis indexes the bytes inside a chunk's data array

```c
typedef struct {
    uint8_t data[0x10000];
} chunk_t; // Y axis

typedef struct {
    chunk_t *chunks[0x10000];
} memory_t; // X axis
```

To map a 32-bit address onto the X and Y axes, the 32 bits are split into two 16-bit halves that represent X and Y, so the byte can be read as memory[x][y]:

* Shift addr right by 16 bits to get X
* AND addr with `0xFFFF` to get Y

![](https://hackmd.io/_uploads/Hk0qqcbbp.png)

### Writing memory

Starting at addr, size bytes of data are written to memory. If a chunk has not been initialized yet, a new chunk is allocated and stored in memory. Each write follows the rule above: (addr + i) is converted to X and Y, and the byte from src is written to memory[x][y].

```c
void memory_write(memory_t *m,
                  uint32_t addr,
                  const uint8_t *src,
                  uint32_t size)
{
    for (uint32_t i = 0; i < size; ++i) {
        uint32_t p = addr + i;
        uint32_t x = p >> 16;
        chunk_t *c = m->chunks[x];
        if (!c) {
            c = malloc(sizeof(chunk_t));
            memset(c->data, 0, sizeof(c->data));
            m->chunks[x] = c;
        }
        c->data[p & 0xffff] = src[i];
    }
}
```

### Reading memory

Comparing addr with addr + size tells us whether the read stays inside one chunk, which means both addresses share the same X. If so, the code checks whether the chunk holds data: if it does, the data is copied from the chunk to dst, otherwise dst is filled with 0. If addr and addr + size are not in the same chunk, the data is copied to dst one byte at a time, looking up memory[x][y] for every address.

```c
static const uint32_t mask_hi = ~(0xffff);

void memory_read(memory_t *m, uint8_t *dst, uint32_t addr, uint32_t size)
{
    /* test if this read is entirely within one chunk */
    if ((addr & mask_hi) == ((addr + size) & mask_hi)) {
        chunk_t *c;
        if ((c = m->chunks[addr >> 16])) {
            /* get the subchunk pointer */
            const uint32_t p = (addr & mask_lo);

            /* copy over the data */
            memcpy(dst, c->data + p, size);
        } else {
            memset(dst, 0, size);
        }
    } else {
        /* naive copy */
        for (uint32_t i = 0; i < size; ++i) {
            uint32_t p = addr + i;
            chunk_t *c = m->chunks[p >> 16];
            dst[i] = c ? c->data[p & 0xffff] : 0;
        }
    }
}
```

### ELF loader

An ELF file can be used either as an executable or as an object file for linking.

* When it is used as an executable, the emulator runs the RISC-V instructions read from the `.text` section. The contents of `.text` can be printed as RISC-V assembly with `riscv32-unknown-linux-gnu-objdump` from riscv-gnu-toolchain.

To read the instructions from a program, the header can be defined from the fields given in the [specification](https://refspecs.linuxfoundation.org/elf/elf.pdf), as follows:

![](https://hackmd.io/_uploads/B1sLkrz-T.png)

If elf_open finds that `e->raw_data` has already been allocated, it releases it first. elf_open then opens the ELF file, determines the file size, allocates that much memory for `e->raw_data`, and reads the file into raw_data through the file pointer f. Because an ELF file always begins with the ELF header, which both linking and execution rely on, `e->hdr` is pointed at `e->raw_data`.

```c
bool elf_open(elf_t *e, const char *path)
{
    /* free previous memory */
    if (e->raw_data)
        release(e);

    FILE *f = fopen(path, "rb");
    if (!f)
        return false;

    /* get file size */
    fseek(f, 0, SEEK_END);
    e->raw_size = ftell(f);
    fseek(f, 0, SEEK_SET);
    if (e->raw_size == 0) {
        fclose(f);
        return false;
    }

    /* allocate memory */
    free(e->raw_data);
    e->raw_data = malloc(e->raw_size);

    /* read data into memory */
    const size_t r = fread(e->raw_data, 1, e->raw_size, f);
    fclose(f);
    if (r != e->raw_size) {
        release(e);
        return false;
    }

    /* point to the header */
    e->hdr = (const struct Elf32_Ehdr *) e->raw_data;

    /* check it is a valid ELF file */
    if (!is_valid(e)) {
        release(e);
        return false;
    }
    return true;
}
```

We can inspect the ELF header with gdb:

```c
(gdb) set print pretty
(gdb) p *elf->hdr
$4 = {
  e_ident = "\177ELF\001\001\001\000\000\000\000\000\000\000\000",
  e_type = 2,
  e_machine = 243,
  e_version = 1,
  e_entry = 65680,
  e_phoff = 52,
  e_shoff = 92888,
  e_flags = 0,
  e_ehsize = 52,
  e_phentsize = 32,
  e_phnum = 2,
  e_shentsize = 40,
  e_shnum = 13,
  e_shstrndx = 12
}
```

To see what each header field means, look at puzzle.elf. This requires [riscv-gnu-toolchain](https://github.com/riscv-collab/riscv-gnu-toolchain) to be installed first.

```shell
$ riscv32-unknown-linux-gnu-readelf -h puzzle.elf
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           RISC-V
  Version:                           0x1
  Entry point address:               0x10090
  Start of program headers:          52 (bytes into file)
  Start of section headers:          92888 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         2
  Size of section headers:           40 (bytes)
  Number of section headers:         13
  Section header string table index: 12
```

[In-depth: ELF - The Extensible & Linkable Format](https://youtu.be/nC1U1LJQL8o)

From the ELF header we learn the program's entry point, the offset of the program header table, and the number of program headers. All program headers have the same size, so they are evenly spaced, as shown below.

![](https://hackmd.io/_uploads/SkXZvSzbp.png)

Once the offset of each program header has been computed, the loader reads the program headers one by one and checks whether the type is `PT_LOAD`. If it is, the segment is handled as the specification requires: the smaller of `p_memsz` and `p_filesz` is the number of bytes to copy from the file into memory, and the remainder of the segment is filled with 0.

```c
bool elf_load(elf_t *e, struct riscv_t *rv, memory_t *mem)
{
    rv_set_pc(rv, e->hdr->e_entry); /* set the entry point */

    /* loop over all of the program headers */
    for (int p = 0; p < e->hdr->e_phnum; ++p) {
        /* find next program header */
        uint32_t offset = e->hdr->e_phoff + (p * e->hdr->e_phentsize);
        const struct Elf32_Phdr *phdr =
            (const struct Elf32_Phdr *) (e->raw_data + offset);

        /* check this section should be loaded */
        if (phdr->p_type != PT_LOAD)
            continue;

        /* memcpy required range */
        const int to_copy = min(phdr->p_memsz, phdr->p_filesz);
        if (to_copy)
            memory_write(mem, phdr->p_vaddr, e->raw_data + phdr->p_offset,
                         to_copy);

        /* zero fill required range */
        const int to_zero = max(phdr->p_memsz, phdr->p_filesz) - to_copy;
        if (to_zero)
            memory_fill(mem, phdr->p_vaddr + to_copy, to_zero, 0);
    }
    return true;
}
```

> What the ELF specification says about PT_LOAD:
> The array element specifies a loadable segment, described by p_filesz and p_memsz.
> The bytes from the file are mapped to the beginning of the memory segment.
> If the segment's memory size (p_memsz) is larger than the file size (p_filesz), the "extra" bytes are defined to hold the value 0 and to follow the segment's initialized area.
> The file size may not be larger than the memory size.
> Loadable segment entries in the program header table appear in ascending order, sorted on the p_vaddr member.
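The PT_LOAD rule quoted above (copy `p_filesz` bytes, then zero-fill up to `p_memsz`) can be tried out in isolation. The sketch below is an illustration only and is not taken from rv32emu: it loads into a flat byte buffer instead of the chunked `memory_t`, and `segment_t` and `load_segment` are hypothetical names that mirror the relevant `Elf32_Phdr` fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of the Elf32_Phdr fields used for loading */
typedef struct {
    uint32_t offset; /* p_offset: where the segment starts in the file */
    uint32_t vaddr;  /* p_vaddr: where it goes in memory */
    uint32_t filesz; /* p_filesz: bytes present in the file */
    uint32_t memsz;  /* p_memsz: bytes occupied in memory */
} segment_t;

/* Load one segment into a flat memory buffer.
 * Returns 0 on success, -1 if the segment is malformed or does not fit. */
int load_segment(uint8_t *mem,
                 size_t mem_size,
                 const uint8_t *file,
                 size_t file_size,
                 const segment_t *seg)
{
    assert(mem && file && seg);

    /* the specification forbids a file size larger than the memory size */
    if (seg->filesz > seg->memsz)
        return -1;
    /* the bytes to copy must lie inside the file image */
    if (seg->offset > file_size || seg->filesz > file_size - seg->offset)
        return -1;
    /* the whole segment must fit into the destination memory */
    if (seg->vaddr > mem_size || seg->memsz > mem_size - seg->vaddr)
        return -1;

    /* initialized part: copied from the file */
    memcpy(mem + seg->vaddr, file + seg->offset, seg->filesz);
    /* "extra" bytes: defined to hold the value 0 */
    memset(mem + seg->vaddr + seg->filesz, 0, seg->memsz - seg->filesz);
    return 0;
}
```

Unlike `elf_load`, this sketch rejects a segment whose `p_filesz` exceeds `p_memsz` instead of taking the minimum of the two. That is a design choice made for the example, not a statement about how rv32emu behaves.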
### CPU instruction cycle

[wikipedia](https://zh.wikipedia.org/zh-tw/%E6%8C%87%E4%BB%A4%E5%91%A8%E6%9C%9F)

When fetching from memory, the 32-bit address is first split into the x and y of memory[x][y]. x locates the chunk inside the memory and is computed with `addr >> 16`. y is the starting offset of the instruction inside that chunk and is computed with `addr & mask_lo`.

| addr >> 16 | addr & mask_lo |
|------------|----------------|

```c
static const uint32_t mask_lo = 0xffff;

uint32_t memory_read_ifetch(memory_t *m, uint32_t addr)
{
    const uint32_t addr_lo = addr & mask_lo;
    assert((addr_lo & 1) == 0);

    chunk_t *c = m->chunks[addr >> 16];
    assert(c);
    return *(const uint32_t *) (c->data + addr_lo);
}
```

Once the starting address of the instruction has been found, the pointer is cast so that 32 bits, that is four 8-bit bytes, are read from that address onward as one instruction. This also explains why the PC is advanced by 4 after a 32-bit instruction and by 2 after a compressed instruction: it guarantees that the address used to fetch the next instruction is correct.

![](https://hackmd.io/_uploads/B1d4nuMb6.png)

## Decoding

The $rd$, $rs1$, $rs2$, $funct3$, $funct7$ and $imm$ fields of an RV32I instruction can be read with bitmasks, as follows:

```c
enum {
    INST_6_2     = 0b00000000000000000000000001111100,
    FR_C_1_0     = 0b00000000000000000000000000000011, // c-type
    FR_C_15_13   = 0b00000000000000001110000000000000,
    FR_RD        = 0b00000000000000000000111110000000,
    FR_RS1       = 0b00000000000011111000000000000000,
    FR_RS2       = 0b00000001111100000000000000000000,
    ...
    FS_IMM_4_0   = 0b00000000000000000000111110000000, // s-type
    FS_IMM_11_5  = 0b11111110000000000000000000000000,
    //               ....xxxx....xxxx....xxxx....xxxx
    FB_IMM_11    = 0b00000000000000000000000010000000, // b-type
    FB_IMM_4_1   = 0b00000000000000000000111100000000,
    FB_IMM_10_5  = 0b01111110000000000000000000000000,
    FB_IMM_12    = 0b10000000000000000000000000000000,
    //               ....xxxx....xxxx....xxxx....xxxx
    FU_IMM_31_12 = 0b11111111111111111111000000000000, // u-type
    //               ....xxxx....xxxx....xxxx....xxxx
    FJ_IMM_19_12 = 0b00000000000011111111000000000000, // j-type
    FJ_IMM_11    = 0b00000000000100000000000000000000,
    FJ_IMM_10_1  = 0b01111111111000000000000000000000,
    FJ_IMM_20    = 0b10000000000000000000000000000000,
    ...
};
```

Why is a bitmask enough to obtain a specific field such as $rd$, $rs1$, $rs2$, $funct3$, $funct7$ or $imm$? The bitmask has 1 in every bit of the field to decode and 0 everywhere else. ANDing the bitmask with the instruction therefore keeps only that field, which is then shifted right until it starts at the LSB.

```c
// decode rd field
static inline uint32_t dec_rd(uint32_t inst)
{
    return (inst & FR_RD) >> 7;
}

// decode rs1 field
static inline uint32_t dec_rs1(uint32_t inst)
{
    return (inst & FR_RS1) >> 15;
}

// decode rs2 field
static inline uint32_t dec_rs2(uint32_t inst)
{
    return (inst & FR_RS2) >> 20;
}
```

Taking `dec_rd` as an example, let inst and FR_RD have the following values:

```
inst  = 0b00111001100101011010111010100010
FR_RD = 0b00000000000000000000111110000000
```

The AND result and its right shift to the LSB are:

```
inst & FR_RD        = 0b00000000000000000000111010000000
(inst & FR_RD) >> 7 = 0b00000000000000000000000000011101
```

$rd$ is the index of the register that receives the result. Five bits of the instruction are enough to address registers $0$ through $2^5 - 1$.

Decoding $imm$ is slightly more involved. Take a $J$-type instruction as an example:

![](https://hackmd.io/_uploads/SkjZysQZ6.png)

The bitmasks we need are:

```c
FJ_IMM_20    = 0b10000000000000000000000000000000
FJ_IMM_19_12 = 0b00000000000011111111000000000000
FJ_IMM_11    = 0b00000000000100000000000000000000
FJ_IMM_10_1  = 0b01111111111000000000000000000000
```

Suppose inst = 0b00111001100101011010111010100010. Executing the statements below gives the results shown in the comments:

```c
...
dst |= (inst & FJ_IMM_20);
// dst = 0b00000000000000000000000000000000
dst |= (inst & FJ_IMM_19_12) << 11;
// dst = 0b00101101000000000000000000000000
dst |= (inst & FJ_IMM_11) << 2;
// dst = 0b00101101010000000000000000000000
dst |= (inst & FJ_IMM_10_1) >> 9;
// dst = 0b00101101010111001100000000000000
...
```

After these steps the $J$-type immediate sits in the upper 21 bits of $dst$ (bits 31 down to 11, where bit 11 is the implicit zero LSB of the immediate), so a final arithmetic right shift by 11 bits yields the sign-extended $imm$.

```c
// decode jtype instruction immediate
static inline int32_t dec_jtype_imm(uint32_t inst)
{
    uint32_t dst = 0;
    dst |= (inst & FJ_IMM_20);
    dst |= (inst & FJ_IMM_19_12) << 11;
    dst |= (inst & FJ_IMM_11) << 2;
    dst |= (inst & FJ_IMM_10_1) >> 9;
    // note: shifted to 2nd least significant bit
    return ((int32_t) dst) >> 11;
}

// decode btype instruction immediate
static inline int32_t dec_btype_imm(uint32_t inst)
{
    uint32_t dst = 0;
    dst |= (inst & FB_IMM_12);
    dst |= (inst & FB_IMM_11) << 23;
    dst |= (inst & FB_IMM_10_5) >> 1;
    dst |= (inst & FB_IMM_4_1) << 12;
    // note: shifted to 2nd least significant bit
    return ((int32_t) dst) >> 19;
}

// decode stype instruction immediate
static inline int32_t dec_stype_imm(uint32_t inst)
{
    uint32_t dst = 0;
    dst |= (inst & FS_IMM_11_5);
    dst |= (inst & FS_IMM_4_0) << 13;
    return ((int32_t) dst) >> 20;
}
```

### Execute and write back

The rv32emu Makefile defines the following macros:

```shell
CFLAGS += -D ENABLE_RV32M
CFLAGS += -D ENABLE_Zicsr
CFLAGS += -D ENABLE_Zifencei
CFLAGS += -D ENABLE_RV32A
CFLAGS += -D ENABLE_RV32C

ENABLE_COMPUTED_GOTO ?= 1
ifeq ("$(ENABLE_COMPUTED_GOTO)", "1")
ifneq ($(filter $(CC), gcc clang),)
riscv.o: CFLAGS += -D ENABLE_COMPUTED_GOTO
ifeq ("$(CC)", "gcc")
riscv.o: CFLAGS += -fno-gcse -fno-crossjumping
endif
endif
endif
```

`ENABLE_COMPUTED_GOTO` selects the computed-goto version of the instruction dispatch code at compile time. The code below covers both the enabled and the disabled case. The identifiers of the instruction handlers all start with `op_`, and the computed-goto syntax requires `&&` in front of `op_`. Conditional compilation decides which of the two forms is used.

```c
#ifdef ENABLE_COMPUTED_GOTO
#define OP(instr) &&op_##instr
#define TABLE_TYPE const void *
#define TABLE_TYPE_RVC const void *
#else // ENABLE_COMPUTED_GOTO = false
#define OP(instr) op_##instr
#define TABLE_TYPE const opcode_t
#define TABLE_TYPE_RVC const c_opcode_t
#endif
```

Next comes jump_table. It is laid out in the order of the opcode map in the specification, using the handler names, as follows:

```c
// clang-format off
TABLE_TYPE jump_table[] = {
//  000         001           010        011           100         101        110        111
    OP(load),   OP(load_fp),  OP(unimp), OP(misc_mem), OP(op_imm), OP(auipc), OP(unimp), OP(unimp), // 00
    OP(store),  OP(store_fp), OP(unimp), OP(amo),      OP(op),     OP(lui),   OP(unimp), OP(unimp), // 01
    OP(madd),   OP(msub),     OP(nmsub), OP(nmadd),    OP(fp),     OP(unimp), OP(unimp), OP(unimp), // 10
    OP(branch), OP(jalr),     OP(unimp), OP(jal),      OP(system), OP(unimp), OP(unimp), OP(unimp), // 11
};
```

This is the opcode map from the specification:

![](https://hackmd.io/_uploads/rk-4pMEZ6.png)

* The column is selected by bits 2 to 4 of the opcode ($inst[4:2]$)
* The row is selected by bits 5 to 6 of the opcode ($inst[6:5]$)
* The index into the opcode map is bits 2 to 6 of the opcode ($inst[6:2]$)

Take AMO as an example. Its column is 011 and its row is 01, that is $\underbrace{01}_{inst[6:5]}\underbrace{011}_{inst[4:2]}$. With ${inst[6:2]}$ we can look up in the specification which instructions AMO covers.

![](https://hackmd.io/_uploads/BJUFCGN-p.png)

In the table below, first find the opcode column and the entries whose highlighted bits match ours, then look at the rows that match. We find LR.W, SC.W and AMOSWAP.W.

| funct5 | aq | rl | rs2 | rs1 | funct3 | rd | opcode | inst |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 00010 | aq | rl | 00000 | rs1 | 010 | rd | ==01011==11 | LR.W |
| 00011 | aq | rl | rs2 | rs1 | 010 | rd | ==01011==11 | SC.W |
| 00001 | aq | rl | rs2 | rs1 | 010 | rd | ==01011==11 | AMOSWAP.W |

* jump_table_rvc is also laid out according to the specification, and its rows and columns are accessed in the same way.
* jump_table_rvc differs from jump_table in that it is the table for the compressed instruction set, whose instructions are only half as long as regular instructions.
* To keep both the array indexing order and readability, the entries are ordered by the low-order bits of the instruction, so jump_table_rvc moves to the next row every 4 entries.

```c
#ifdef ENABLE_RV32C
TABLE_TYPE_RVC jump_table_rvc[] = {
//  00             01             10          11
    OP(caddi4spn), OP(caddi),     OP(cslli),  OP(unimp), // 000
    OP(cfld),      OP(cjal),      OP(cfldsp), OP(unimp), // 001
    OP(clw),       OP(cli),       OP(clwsp),  OP(unimp), // 010
    OP(cflw),      OP(clui),      OP(cflwsp), OP(unimp), // 011
    OP(unimp),     OP(cmisc_alu), OP(ccr),    OP(unimp), // 100
    OP(cfsd),      OP(cj),        OP(cfsdsp), OP(unimp), // 101
    OP(csw),
                   OP(cbeqz),     OP(cswsp),  OP(unimp), // 110
    OP(cfsw),      OP(cbnez),     OP(cfswsp), OP(unimp), // 111
};
#endif
```

![](https://hackmd.io/_uploads/Bkscfm4-a.png)

### JIT compiler

[MIR](https://github.com/vnmakarov/mir) stands for Medium Internal Representation. The goal of the MIR project is to provide a basis for fast, lightweight interpreters and JITs. MIR offers a set of basic APIs: input written in different languages can be translated to MIR, optimized, and then compiled to machine code for a specific platform. The result can be called directly as a function, which lets developers implement an interpreter or a JIT compiler directly. The MIR project currently supports the following:

* MIR can be created through the API
* MIR can also be created from a MIR binary or text file
* The best way to understand MIR is through its textual representation

![](https://hackmd.io/_uploads/r1bBvDYWT.png)

There is an [example](https://github.com/vnmakarov/mir) showing how a C function is represented as a MIR text file.

[Accelerating rv32emu with JIT compilation](https://hackmd.io/@lambert-wu/rv32emu) also points out that [implementing the functionality with MIR API calls](https://github.com/vnmakarov/mir/blob/master/MIR.md#mir-api-example) makes the code long and hard to read, so feeding C source code as input and compiling it is much simpler. There is an [introduction to the MIR API](https://hackmd.io/@lambert-wu/mir-api); my understanding of that article follows.

### Initialization

MIR is initialized and finalized with MIR_init() and MIR_finish (MIR_context_t ctx). Once the MIR context exists, the functionality to implement can be packaged into modules. MIR keeps the modules in a built-in doubly linked list, and a module contains the following items:

* Function
* Import
* Export
* Forward declaration
* Prototype
* Data
* Reference data
* Expression Data
* Memory segment

A module is started and finished with MIR_new_module and MIR_finish_module.

### c2mir

Using C source code as input goes through the c2mir API. The get_func callback passed to c2mir_compile is used to read the source code one character at a time, jit_ptr is handed to get_func as its argument, and get_func returns EOF once everything has been read.

```c
typedef struct jit_item {
    char *code;
    size_t code_size;
    size_t curr;
} jit_item_t;

int get_func(void *data)
{
    jit_item_t *item = data;
    return item->curr >= item->code_size ? EOF : item->code[item->curr++];
}

/* allocate the struct itself, not just a pointer to it */
jit_item_t *jit_ptr = malloc(sizeof(jit_item_t));
jit_ptr->curr = 0;
/* "\\n" keeps the escape sequence inside the generated C source */
jit_ptr->code = "int add(int a,int b) { \
                    int c = a + b; \
                    printf(\"%d + %d = %d\\n\", a, b, c); \
                    return a + b; \
                }\n ";
jit_ptr->code_size = strlen(jit_ptr->code);
if (!c2mir_compile(ctx, options, get_func, jit_ptr, name, NULL)) {
    perror("Compile failure");
    exit(EXIT_FAILURE);
}
```

When c2mir_compile succeeds, the result is attached to ctx as a module. For an interpreter that wants to call the compiled function, the steps are: take the compiled module from the tail of the module list in ctx, find the compiled function inside that module, and generate machine code with the MIR generator. The result can then be called like an ordinary function.

```c
MIR_module_t module = DLIST_TAIL(MIR_module_t, *MIR_get_module_list(ctx));
MIR_item_t func = DLIST_HEAD(MIR_item_t, module->items);
size_t func_len = DLIST_LENGTH(MIR_item_t, module->items);
int a = 10, b = 50;
for (size_t i = 0; i < func_len; i++, func = DLIST_NEXT(MIR_item_t, func)) {
    if (func->item_type == MIR_func_item) {
        int (*arich)(int, int) = MIR_gen(ctx, 0, func);
        /* keep the return value so it can be printed */
        int c = arich(a, b);
        printf("%d + %d = %d\n", a, b, c);
        break;
    }
}
```

### Linking

Before running MIR code, the module has to be loaded and linked, as follows:

```c
MIR_load_module (MIR_context_t ctx, MIR_module_t m)
MIR_link (MIR_context_t ctx, void (*set_interface) (MIR_item_t item), void * (*import_resolver) (const char *))
```

### Code Generation

:::info
[Accelerating rv32emu with JIT compilation](https://hackmd.io/@lambert-wu/rv32emu) is written from its author's own observations and does not cite the source of the code, which made it hard to follow. I later searched for the complete code using some of the keywords and worked through it step by step.
:::

The op functions in the jump table already group the instructions to execute. Below is the part of [rv_step](https://github.com/sysprog21/rv32emu/blob/master/src/emulate.c) that executes an op. The op is looked up by index, and index together with inst tells us which decoding scheme and which instruction apply. That is enough to generate code for the instruction and compile it with c2mir.

```c
while (rv->csr_cycle < cycles_target && !rv->halt) {
    // fetch the next instruction
    inst = rv->io.mem_ifetch(rv, rv->PC);

    // standard uncompressed instruction
    if ((inst & 3) == 3) {
        uint32_t index = (inst & INST_6_2) >> 2;

        // dispatch this opcode
        TABLE_TYPE op = jump_table[index];
        assert(op);
        if (!op(rv, inst))
            break;
        rv->inst_len = INST_32;
    }
}
```

The complete [rv_step](https://github.com/sysprog21/rv32emu/blob/master/src/emulate.c) is as follows:

```c
void rv_step(riscv_t
             *rv, int32_t cycles)
{
    assert(rv);

    /* find or translate a block for starting PC */
    const uint64_t cycles_target = rv->csr_cycle + cycles;

    /* loop until hitting the cycle target */
    while (rv->csr_cycle < cycles_target && !rv->halt) {
        block_t *block;
        /* try to predict the next block */
        if (prev && prev->predict && prev->predict->pc_start == rv->PC) {
            block = prev->predict;
        } else {
            /* lookup the next block in block map or translate a new block,
             * and move onto the next block.
             */
            block = block_find_or_translate(rv);
        }

        /* by now, a block should be available */
        assert(block);

        /* After emulating the previous block, it is determined whether the
         * branch is taken or not. The IR array of the current block is then
         * assigned to either the branch_taken or branch_untaken pointer of
         * the previous block.
         */
        if (prev) {
            /* update previous block */
            if (prev->pc_start != last_pc)
                prev = block_find(&rv->block_map, last_pc);

            rv_insn_t *last_ir = prev->ir_tail;
            /* chain block */
            if (!insn_is_unconditional_branch(last_ir->opcode)) {
                if (is_branch_taken && !last_ir->branch_taken)
                    last_ir->branch_taken = block->ir_head;
                else if (!last_ir->branch_untaken)
                    last_ir->branch_untaken = block->ir_head;
            } else if (IF_insn(last_ir, jal)
#if RV32_HAS(EXT_C)
                       || IF_insn(last_ir, cj) || IF_insn(last_ir, cjal)
#endif
            ) {
                if (!last_ir->branch_taken)
                    last_ir->branch_taken = block->ir_head;
            }
        }
        last_pc = rv->PC;

        /* execute the block */
        const rv_insn_t *ir = block->ir_head;
        if (unlikely(!ir->impl(rv, ir, rv->csr_cycle, rv->PC)))
            break;

        prev = block;
    }
}

void ebreak_handler(riscv_t *rv)
{
    assert(rv);
    rv_except_breakpoint(rv, rv->PC);
}

void ecall_handler(riscv_t *rv)
{
    assert(rv);
    rv_except_ecall_M(rv, 0);
    syscall_handler(rv);
}
```

After a close look at the op functions, the frequently used parts can be wrapped in macros that act as templates for emitting code. This way the whole op can be generated as a string without giving up readability. Below is how [LuaJIT-5.3.6](https://github.com/Yu2erer/LuaJIT-5.3.6) writes [Y_fstr2buffer](https://github.com/Yu2erer/LuaJIT-5.3.6/blob/master/src/YJIT.c):

```c
static void Y_fstr2buffer (Y_jitbuffer *buff, const char
                           *fmt, ...)
{
    va_list args;
    char local_buffer[1024];
    va_start(args, fmt);
    int n = vsnprintf(local_buffer, sizeof(local_buffer), fmt, args);
    if (n < 0)
        abort();
    va_end(args);
    Y_str2buffer(buff, local_buffer);
}
```

Following the approach of [LuaJIT-5.3.6](https://github.com/Yu2erer/LuaJIT-5.3.6), a `CODE` macro is defined to write code into a buffer, which simplifies the codegen implementation. As for what the term codegen means, [Code generation (compiler)](https://en.wikipedia.org/wiki/Code_generation_(compiler)) says:

> In computing, code generation is part of the process chain of a compiler and converts intermediate representation of source code into a form (e.g., machine code) that can be readily executed by the target system.

```c
void str2buffer(rv_buffer *buffer, const char *fmt, ...)
{
    va_list args;
    char code[1024];
    va_start(args, fmt);
    int n = vsnprintf(code, sizeof(code), fmt, args);
    va_end(args);
    if (n < 0)
        abort();
    size_t len = strlen(code);
    size_t new_size = buffer->size + len;
    if (new_size > buffer->capacity) {
        buffer->src = realloc(buffer->src, new_size);
        buffer->capacity = new_size;
    }
    strncpy(&buffer->src[buffer->size], code, len);
    buffer->size = new_size;
}

#define CODE(fmt, ...) str2buffer(buff, fmt, ##__VA_ARGS__)
```

Below is a first design of the templates:

```c
#define DECLEAR_FUNC(name) \
    CODE("bool %s(struct riscv_t *rv, uint32_t inst) {\n", name)
#define END_FUNC CODE("return true;}\n")
#define DEC_RD CODE("const uint32_t rd = dec_rd(inst);\n")
#define DEC_RS1 CODE("const uint32_t rs1 = dec_rs1(inst);\n")
#define DEC_RS2 CODE("const uint32_t rs2 = dec_rs2(inst);\n")
#define DEC_FUNCT3 CODE("const uint32_t funct3 = dec_funct3(inst);\n")
#define DEC_FUNCT7 CODE("const uint32_t funct7 = dec_funct7(inst);\n")
#define DEC_U_IMM CODE("const int32_t imm = dec_utype_imm(inst);\n")
#define DEC_J_IMM CODE("const int32_t imm = dec_jtype_imm(inst);\n")
#define DEC_I_IMM CODE("const int32_t imm = dec_itype_imm(inst);\n")
#define DEC_B_IMM CODE("const int32_t imm = dec_btype_imm(inst);\n")
#define DEC_S_IMM CODE("const int32_t imm = dec_stype_imm(inst);\n")
#define SWITCH(val) CODE("switch(%s) {\n", val)
#define CASE(val) CODE("case %s: \n", val)
#define END CODE("}\n")
#define UPDATE_PC(val) CODE("rv->PC += %s;\n", val)
#define ENFORCE_ZERO CODE("if (rd == rv_reg_zero)\n rv->X[rv_reg_zero] = 0;\n")
#define LOAD_ADDR(reg) CODE("const uint32_t addr = rv->X[%s] + imm;", reg)
#define RV_DATA CODE("const uint32_t data = rv->X[rs2];\n")
#define inst_misaligned CODE("rv_except_inst_misaligned(rv, pc);\n");
#define load_misaligned(num)        \
    CODE("if(addr & %d) {\n", num); \
    CODE("rv_except_load_misaligned(rv, addr);return false; }\n");
#define store_misaligned(num)       \
    CODE("if(addr & %s) {\n", num); \
    CODE(" rv_except_store_misaligned(rv, addr);\n return false; }\n");
#define illegal_inst CODE("rv_except_illegal_inst(rv, inst); return false;\n")
#define _OP_UNIMP             \
    DECLEAR_FUNC("op_unimp"); \
    illegal_inst;             \
    END;
```

Next, code is generated per op.

1. The code generator uses an enum whose values follow the indices of jump_table
2. A switch(index) then tells which op function to generate
3. In the enum below, the name `system` would clash with the library function system, so `op_system` is used in its place for now.

```c
enum {
    load = 0b00000,
    load_fp = 0b00001,
    misc_mem = 0b00011,
    op_imm = 0b00100,
    auipc = 0b00101,
    store = 0b01000,
    store_fp = 0b01001,
    amo = 0b01011,
    op = 0b01100,
    lui = 0b01101,
    madd = 0b10000,
    msub = 0b10001,
    nmsub = 0b10010,
    nmadd = 0b10011,
    fp = 0b10100,
    branch = 0b11000,
    jalr = 0b11001,
    jal = 0b11011,
    op_system = 0b11100,
};
```

### EN 帶你寫個作業系統 (EN walks you through writing an operating system)

These are some takeaways from reading the book. The RISC-V emulator works as follows:

1. I/O, memory, and the virtual machine are initialized
2. run() or run_and_trace() decides, depending on the situation, how many instructions to execute per cycle and then calls rv_step()
3. rv_step() fetches an instruction and determines its type (load, jump, store, branch), then calls op() to dispatch the instruction by type and handle it
    * Until the cycle target is reached, rv_step() repeats the following:
        1. Fetch the instruction from the memory location that pc points to
        2. Hand the fetched instruction to its op handler op()
        3. Execute op()

Line 787 of [riscv.c](https://github.com/sysprog21/rv32emu/commit/3ecac53308de774d52535ae6e134b66cc7e2f9b2#diff-76a818eab913b98b885f954273a467dea476f7ddbfd5886f97b1e56cb0e2b58d) predefines the opcodes (upper five bits) of the RV32I instruction classes:

```c
static const opcode_t opcodes[] = {
//  000        001          010       011          100        101       110   111
    op_load,   op_load_fp,  NULL,     op_misc_mem, op_op_imm, op_auipc, NULL, NULL, // 00
    op_store,  op_store_fp, NULL,     op_amo,      op_op,     op_lui,   NULL, NULL, // 01
    op_madd,   op_msub,     op_nmsub, op_nmadd,    op_fp,     NULL,     NULL, NULL, // 10
    op_branch, op_jalr,     NULL,     op_jal,      op_system, NULL,     NULL, NULL, // 11
};
```

The book explains that only the upper five bits are defined because the lowest two bits of an RV32I opcode are fixed (xxxxx11). The corresponding preprocessing of the pending instruction inst can be seen in rv_step():

```c
while (rv->csr_cycle < cycles_target && !rv->halt) {
    // fetch the next instruction
    const uint32_t inst = rv->io.mem_ifetch(rv, rv->PC);

    // standard uncompressed instruction
    if ((inst & 3) == 3) {
        const uint32_t index = (inst & INST_6_2) >> 2;
        /* ... */
    }
}
```

`(inst & 3) == 3` checks whether the lowest two bits of inst mark it as an RV32I instruction. If so, `(inst & INST_6_2) >> 2` extracts opcode bits 6 to 2 and drops the lowest two bits. Why they can be dropped is visible from `INST_6_2`: the mask covers only bits 6 to 2, and the two bits below it are fixed.

```
INST_6_2 = 0b00000000000000000000000001111100
```

Note that, apart from RVC instructions, the lowest two opcode bits of every valid RISC-V instruction are 11.

op is in fact a function pointer: rv_step() first selects the handler that matches the instruction,
and that handler then carries out the actual operation. The handler for integer register-immediate operations serves as an example:

```c
static bool op_op_imm(struct riscv_t *rv, uint32_t inst)
{
    // i-type decode
    const int32_t imm = dec_itype_imm(inst);
    const uint32_t rd = dec_rd(inst);
    const uint32_t rs1 = dec_rs1(inst);
    const uint32_t funct3 = dec_funct3(inst);

    // dispatch operation type
    switch (funct3) {
    case 0:  // ADDI
        rv->X[rd] = (int32_t)(rv->X[rs1]) + imm;
        break;
    case 1:  // SLLI
        rv->X[rd] = rv->X[rs1] << (imm & 0x1f);
        break;
    case 2:  // SLTI
        rv->X[rd] = ((int32_t)(rv->X[rs1]) < imm) ? 1 : 0;
        break;
    case 3:  // SLTIU
        rv->X[rd] = (rv->X[rs1] < (uint32_t) imm) ? 1 : 0;
        break;
    case 4:  // XORI
        rv->X[rd] = rv->X[rs1] ^ imm;
        break;
    case 5:
        if (imm & ~0x1f) {
            // SRAI
            rv->X[rd] = ((int32_t) rv->X[rs1]) >> (imm & 0x1f);
        } else {
            // SRLI
            rv->X[rd] = rv->X[rs1] >> (imm & 0x1f);
        }
        break;
    case 6:  // ORI
        rv->X[rd] = rv->X[rs1] | imm;
        break;
    case 7:  // ANDI
        rv->X[rd] = rv->X[rs1] & imm;
        break;
    default:
        rv_except_illegal_inst(rv);
        return false;
    }

    // step over instruction
    rv->PC += 4;

    // enforce zero register
    if (rd == rv_reg_zero)
        rv->X[rv_reg_zero] = 0;
    return true;
}
```

![](https://hackmd.io/_uploads/BklzqoLf6.png)

The handler decodes the incoming instruction as I-type to obtain imm, rs1, funct3 and rd, and funct3 then decides which operation to perform, for example ADDI or SLLI. For ADDI, the handler executes what the RISC-V specification defines for ADDI, writes the result back to the destination register, advances the program counter to the next instruction, and returns the result.

The book also raises an interesting question: is a longer pipeline always better? These are the problems a long pipeline may run into:

* Cycle synchronization: the clock cycle has to accommodate the slowest stage, so a longer pipeline can backfire because of its slowest stage.
* Larger circuits: every stage needs many registers to hold the output of the previous stage. Registers are built from flip-flops, so a longer pipeline means more flip-flops, which can increase area and cause heat dissipation problems.
* Instruction latency: the many registers that hold the results of the previous stage add latency to each instruction.
* Branch prediction: both the unconditional jump instruction and the conditional branch instruction cause trouble. Suppose instruction 1 reaches the execute stage and turns out to jump. Instructions 2 and 3 are already in the decode and fetch stages, so the processor has to flush those pipeline stages and bring in the correct instructions.

### About rv32emu's GitHub Actions

[Understanding GitHub Actions](https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions#the-components-of-github-actions) describes GitHub Actions as a continuous integration and continuous delivery (CI/CD) platform that lets you automate your build, test, and deployment pipeline. You can create workflows that build and test every
pull request to your repository, or deploy merged pull requests to production.

**Workflow**

You can configure a GitHub Actions workflow to be triggered when an event occurs, such as a pull request being opened or an issue being created. A workflow contains one or more jobs, which run sequentially or in parallel. Each job runs inside its own virtual machine runner or inside a container, and has one or more steps that run a script you define or an action.

![image](https://hackmd.io/_uploads/SJbCgMyL0.png)

A workflow is defined in a YAML file and runs when an event in the repository triggers it. It can also be triggered manually or on a [schedule](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule). Workflow files live in `.github/workflows`.

**Job**

By default, jobs have no dependencies and run in parallel with each other. When a job depends on another job, it waits for that job to complete before it starts. For example, you can have multiple build jobs for different architectures. They do not depend on each other, so they run in parallel. A packaging job that depends on all the build jobs then runs only after every build job has completed successfully.

**Action**

An action is a custom application that handles a complex but frequently repeated task. Once written, an action only needs to be referenced from your workflow files. An action can, for example, pull your git repository, set up the correct toolchain for your build environment, or set up authentication to your cloud provider.

**Runner**

A runner is a server that runs your workflows. Each runner runs a single job at a time. GitHub provides Ubuntu Linux, Microsoft Windows, and macOS runners to run your workflows. Each workflow run executes in a fresh, newly provisioned virtual machine.

Next, the rv32emu workflows are described in order. In **.github/workflows/main.yml** you can see:

```yaml
name: CI

on: [push, pull_request]

jobs:
  detect-code-related-file-changes:
    runs-on: ubuntu-22.04
    outputs:
      has_code_related_changes: ${{ steps.set_has_code_related_changes.outputs.has_code_related_changes }}
    steps:
      - name: Check out the repo
        uses: actions/checkout@v4
      - name: Test changed files
        id: changed-files
        uses: tj-actions/changed-files@v44
        with:
          files: |
              .ci/**
              build/**
              mk/**
              src/**
              tests/**
              tools/**
              .clang-format
              Dockerfile
              Makefile
      - name: Set has_code_related_changes
        id: set_has_code_related_changes
        run: |
          if [[ ${{ steps.changed-files.outputs.any_changed }} == true ]]; then
            echo "has_code_related_changes=true" >> $GITHUB_OUTPUT
          else
            echo "has_code_related_changes=false" >> $GITHUB_OUTPUT
          fi
```

The host-x64 job depends on the job named detect-code-related-file-changes. Only when the `has_code_related_changes` output of detect-code-related-file-changes is `true` does GitHub Actions run the
host-x64 job. That job runs on `ubuntu-22.04`.

```yaml
  host-x64:
    needs: [detect-code-related-file-changes]
    if: needs.detect-code-related-file-changes.outputs.has_code_related_changes == 'true'
    runs-on: ubuntu-22.04
```

The steps use `actions/checkout@v4` to check out the repository code. The `install-dependencies` step first updates the package index and installs the required libraries, runs `.ci/riscv-toolchain-install.sh` to install the RISC-V toolchain, and installs LLVM 17 through `llvm.sh`. The `default build` step then runs `make`.

```yaml
    steps:
    - uses: actions/checkout@v4
    - name: install-dependencies
      run: |
            sudo apt-get update -q -y
            sudo apt-get install -q -y libsdl2-dev libsdl2-mixer-dev
            .ci/riscv-toolchain-install.sh
            wget https://apt.llvm.org/llvm.sh
            sudo chmod +x ./llvm.sh
            sudo ./llvm.sh 17
      shell: bash
    - name: default build
      run: make
```

```yaml
    - name: check + tests
      run: |
            make check -j$(nproc)
            make tests -j$(nproc)
            make misalign -j$(nproc)
            make tool -j$(nproc)
    - name: diverse configurations
      run: |
            make distclean && make ENABLE_EXT_M=0 check -j$(nproc)
            make distclean && make ENABLE_EXT_A=0 check -j$(nproc)
            make distclean && make ENABLE_EXT_F=0 check -j$(nproc)
            make distclean && make ENABLE_EXT_C=0 check -j$(nproc)
            make distclean && make ENABLE_SDL=0 check -j$(nproc)
    - name: gdbstub test
      run: |
            make distclean ENABLE_GDBSTUB=1 gdbstub-test
    - name: JIT test
      run: |
            make ENABLE_JIT=1 clean && make ENABLE_JIT=1 check -j$(nproc)
            make ENABLE_JIT=1 clean && make ENABLE_EXT_A=0 ENABLE_JIT=1 check -j$(nproc)
            make ENABLE_JIT=1 clean && make ENABLE_EXT_F=0 ENABLE_JIT=1 check -j$(nproc)
            make ENABLE_JIT=1 clean && make ENABLE_EXT_C=0 ENABLE_JIT=1 check -j$(nproc)
    - name: undefined behavior test
      run: |
            make clean && make ENABLE_UBSAN=1 check -j$(nproc)
            make ENABLE_JIT=1 clean && make ENABLE_JIT=1 ENABLE_UBSAN=1 check -j$(nproc)
```

### Notes on rv32emu issues

#### [jit: Properly adjust THRESHOLD](https://github.com/sysprog21/rv32emu/issues/159)

The issue states:

> On macOS/x86-64, I discovered the need to increase the THRESHOLD (in src/cache.c) from 32768 to 65536 in order to achieve the desired performance of SciMark2, aligning it with GNU/Linux (commit
> [cb0a153](https://github.com/sysprog21/rv32emu/commit/cb0a1537b7cd6cf77c7373594785d4e9e0de08f2)). This observation emphasizes the importance of establishing robust guidelines for adjusting the threshold to ensure consistent performance.

So I looked at [cb0a153](https://github.com/sysprog21/rv32emu/commit/cb0a1537b7cd6cf77c7373594785d4e9e0de08f2). In that commit the original definition was

```c
/* THRESHOLD is set to identify hot spots. Once the frequency of use for a block
 * exceeds the THRESHOLD, the JIT compiler flow is triggered. */
#define THRESHOLD 1000
```

and after the change it became

```c
#define THRESHOLD 32768
```

I understand that THRESHOLD exists to identify hot spots: once a block's use frequency exceeds THRESHOLD, the JIT compiler flow is triggered so that performance does not drop. However, I could not find qwe661234's observations, only the committed code, so I am curious how it was discovered that adjusting THRESHOLD improves SciMark2 performance. My first guess is that they tried different THRESHOLD values and compared the resulting performance, and then experimented around 32768 to find the boundary. I think this could be developed into automatic parameter tuning, since nobody wants to keep running experiments by hand to decide what the value should be.

### Resolved: [Assertion failed: (total_read == rv_get_reg(rv, rv_reg_a2)), function syscall_read #214](https://github.com/sysprog21/rv32emu/issues/214?fbclid=IwAR1bT_33eW93vnsi9wUX1pviBJXX2NBg9ZZ5UjVMUjFlmAjpAmiuB0pLju8)

After commit 56766c0, Doom fails to launch. The error message is:

> V_Init: allocate screens. M_LoadDefaults: Load system defaults. Assertion failed: (total_read == rv_get_reg(rv, rv_reg_a2)), function syscall_read, file syscall.c, line 290.
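> /bin/sh: line 1: 12197 Abort trap: 6 ../build/rv32emu doom.elf make: *** [doom] Error 134

So we traced [src/syscall](https://github.com/sysprog21/rv32emu/commit/56766c0653526c546f00784bbc0cefb800a6f9a2) to look for the problem.

The goal of [Modify syscall implementation with fixed memory (#205)](https://github.com/sysprog21/rv32emu/commit/56766c0653526c546f00784bbc0cefb800a6f9a2) is to implement syscalls with a fixed block of memory. This is a good idea: it saves the time spent allocating new memory and also reduces memory usage. What I do not know is whether syscall read/write can ever need more than the fixed memory. If it can, that case needs extra care. When running nyancat.elf and syscall read/write, they set up 200 KiB as the memory to use, and every later syscall read/write is forced to use this preallocated memory. The following is what qwe661234 committed:

```c
    while (count > PREALLOC_SIZE) {
        size_t r = fread(tmp, 1, PREALLOC_SIZE, handle);
        memory_write(s->mem, buf + total_read, tmp, r);
        count -= PREALLOC_SIZE;
        total_read += PREALLOC_SIZE;
    }
    size_t r = fread(tmp, 1, count, handle);
    memory_write(s->mem, buf + total_read, tmp, r);
    total_read += r;
    assert(total_read == rv_get_reg(rv, rv_reg_a2));
```

[assert.h](https://en.wikipedia.org/wiki/Assert.h)

assert.h is a header file in the C standard library. It defines the assert() macro, which is used for debugging. assert() is a diagnostic macro that detects logic errors in a program at run time. Its prototype is: `void assert(int expression)`

* If the macro's argument evaluates to a nonzero value, it does nothing (no action).
* If it evaluates to zero, it prints a diagnostic message to standard error and then calls abort(). The diagnostic message includes
    * the source file name
    * the line number in the source file
    * the name of the enclosing function
    * the expression that evaluated to 0

To disable every assert() without modifying the source code, there are only two ways:

1. Pass a command-line option that defines the `NDEBUG` macro when invoking the C compiler.
2. Define the macro with `#define NDEBUG` before including `<assert.h>`.

A disabled assert() does not evaluate the expression passed to it.

From the error output that qwe661234 provided, we know that `assert(total_read == rv_get_reg(rv, rv_reg_a2));` evaluated to zero, after which abort() was called to terminate the program. I summarized the possible causes and then ran experiments to verify them: 1.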
`total_read` and `rv_get_reg(rv, rv_reg_a2)` differ in length: either there is not enough data to satisfy the requested length, or the read position is wrong.

Steps to check:

* Whether the file contains the required content
* Whether the file read position and `count` are correct

First, I tried to reproduce the problem that qwe661234 described, based on the commit hash [56766c0](https://github.com/sysprog21/rv32emu/tree/56766c0653526c546f00784bbc0cefb800a6f9a2), by running the following commands:

```
git clone https://github.com/sysprog21/rv32emu.git
cd rv32emu/
git checkout 56766c0653
```

This prints the following:

```
HEAD 目前位於 56766c0 Modify syscall implementation with fixed memory (#205)
```

Next, I ran the command below to see whether the error message from the commit appears:

```
make doom
```

This gives the following output:

```
(base) huaxin@huaxin-System-Product-Name:/media/huaxin/D磁碟區/NUKU/rv32emu/rv32 emu$ make doom
mk/toolchain.mk:28: GNU Toolchain for RISC-V is required to build architecture tests. Please check package installation
(cd build; ../build/rv32emu doom.elf)
----------------------------
RISC-V DOOM Startup v1.9
----------------------------
V_Init: allocate screens.
M_LoadDefaults: Load system defaults.
Z_Init: Init zone memory allocation daemon.
W_Init: Init WADfiles.
adding DOOM1.WAD
M_Init: Init miscellaneous info.
R_Init: Init DOOM refresh daemon - [.. ]
InitTextures
InitFlats........
InitSprites
InitColormaps
R_InitData
R_InitPointToAngle
R_InitTables
R_InitPlanes
R_InitLightTables
R_InitSkyMap
R_InitTranslationsTables
P_Init: Init Playloop state.
I_Init: Setting up machine state.
D_CheckNetGame: Checking network game status.
startskill 2 deathmatch: 0 startmap: 1 startepisode: 1
player 1 of 1 (1 nodes)
S_Init: Setting up sound.
HU_Init: Setting up heads up display.
ST_Init: Init status bar.
```

For comparison, here is qwe661234's error output: ``` V_Init: allocate screens. M_LoadDefaults: Load system defaults. Assertion failed: (total_read == rv_get_reg(rv, rv_reg_a2)), function syscall_read, file syscall.c, line 290. 
/bin/sh: line 1: 12197 Abort trap: 6 ../build/rv32emu doom.elf make: *** [doom] Error 134 ```

![](https://hackmd.io/_uploads/r1ifw29Ma.png)

On my machine there is no error and the build succeeds. However, after I selected a few options in the game, the whole system froze and I had to reboot. To be safe, here is my host information:

```
(base) huaxin@huaxin-System-Product-Name:/media/huaxin/D磁碟區/NUKU/rv32emu/rv32 emu$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

I added the `-g` option to the compile flags on line 8 of the Makefile. This produces an executable that contains debug information, so it can be debugged in GDB.

```
CFLAGS += -Wno-unused-label -g
```

I then ran the following command, expecting it to run rv32emu with doom.elf from the build directory. In fact, `gdb make doom` loads the `make` executable as the program to debug and treats `doom` as a core file, which is why the output further down reports that `make` has no debugging symbols and that `doom` does not exist.

```shell
$ gdb make doom
```

This first shows the following screen: ```shell (base) huaxin@huaxin-System-Product-Name:/media/huaxin/D磁碟區/NUKU/rv32emu$ gdb make doom GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2 Copyright (C) 2020 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from make... (No debugging symbols found in make) /media/huaxin/D磁碟區/NUKU/rv32emu/doom: 沒有此一檔案或目錄. 
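``` We can first set a breakpoint at `main` and then start running:

```shell
gdb-peda$ break main
gdb-peda$ run
```

I wanted to see which file and line were currently executing, so I used the following command:

```
gdb-peda$ layout next
```

![](https://hackmd.io/_uploads/BkOztMpza.png)

I did not understand why the source file was not displayed. Is it because the source files are all in rv32emu/src rather than rv32emu/build? A more likely reason is that GDB had been started on `make`, which has no debugging symbols.

Then I remembered that the Makefile defines the following files, including the object file of syscall.c, namely syscall.o, so I expected the `syscall_write` function in syscall.c to be found when I set a breakpoint on it.

```
OBJS := \
	map.o \
	utils.o \
	decode.o \
	io.o \
	syscall.o \
	emulate.o \
	riscv.o \
	elf.o \
	cache.o \
	mpool.o \
	$(OBJS_EXT) \
	main.o
```

So I ran the following command, and GDB reported that the function is not defined:

```shell
gdb-peda$ b syscall_write
Function "syscall_write" not defined.
```

`syscall_write` is a static function in syscall.c. It can only be used inside syscall.c: it is visible within its own compilation unit and invisible to every other compilation unit, so code in another compilation unit that tries to call `syscall_write` will not compile or link against it. Internal linkage by itself does not normally stop GDB from setting a breakpoint on a function once debug information is loaded, though. The message here is more likely because no symbol table for rv32emu had been loaded yet, which the next error also suggests.

:::info
Additional background: when an unused global variable is declared static, the compiler can remove it to save space. The compiler will not remove an unused non-static global variable, because it cannot tell whether another compilation unit uses that variable.
:::

At first I tried to specify the file and line number and ran into the following problem:

```
gdb-peda$ b syscall.c:290
No symbol table is loaded. Use the "file" command.
```

After loading the symbol table first, specifying the file and line number worked `^0^` ```shell gdb-peda$ file /media/huaxin/D磁碟區/NUKU/rv32emu/build/rv32emu Reading symbols from /media/huaxin/D磁碟區/NUKU/rv32emu/build/rv32emu... gdb-peda$ b /media/huaxin/D磁碟區/NUKU/rv32emu/src/syscall.c:290 Note: breakpoints 1, 2, 3 and 4 also set at pc 0x68a5. Breakpoint 5 at 0x68a5: file src/syscall.c, line 290. 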
``` 於是我便得到以下數據 ``` [----------------------------------registers-----------------------------------] RAX: 0x7ffff6275f48 --> 0x4f044415749 RBX: 0x0 RCX: 0x3fb7b4000004f0 RDX: 0xc ('\x0c') RSI: 0x4f044415749 RDI: 0x7ffff6275f48 --> 0x4f044415749 RBP: 0x55555558c1c0 --> 0x0 RSP: 0x7fffffffdab0 --> 0xffffdf480000000c RIP: 0x55555555a8a5 (<syscall_handler+1141>: mov esi,0xc) R8 : 0xc ('\x0c') R9 : 0x7c ('|') R10: 0x2 R11: 0x246 R12: 0xc ('\x0c') R13: 0x55555558c000 --> 0x55555558c020 --> 0x7ffef6278000 --> 0x0 R14: 0x55555643bd60 --> 0xfbad2488 R15: 0x5555566a4d08 --> 0x130000110000003f EFLAGS: 0x202 (carry parity adjust zero sign trap INTERRUPT direction overflow) [-------------------------------------code-------------------------------------] 0x55555555a89a <syscall_handler+1130>: add r12d,ebx 0x55555555a89d <syscall_handler+1133>: add rdi,QWORD PTR [rax] 0x55555555a8a0 <syscall_handler+1136>: call 0x555555557530 <memcpy@plt> => 0x55555555a8a5 <syscall_handler+1141>: mov esi,0xc 0x55555555a8aa <syscall_handler+1146>: mov rdi,rbp 0x55555555a8ad <syscall_handler+1149>: call 0x55555555ffd0 <rv_get_reg> 0x55555555a8b2 <syscall_handler+1154>: cmp r12d,eax 0x55555555a8b5 <syscall_handler+1157>: jne 0x55555555aabc <syscall_handler+1676> [------------------------------------stack-------------------------------------] 0000| 0x7fffffffdab0 --> 0xffffdf480000000c 0008| 0x7fffffffdab8 --> 0x0 0016| 0x7fffffffdac0 --> 0x0 0024| 0x7fffffffdac8 --> 0x30000007c 0032| 0x7fffffffdad0 --> 0x5b0000006e ('n') 0040| 0x7fffffffdad8 --> 0x555556570620 --> 0x555556570650 --> 0x3 0048| 0x7fffffffdae0 --> 0x7ffff7cbbb80 --> 0x0 0056| 0x7fffffffdae8 --> 0x8be0489c63c7ca00 [------------------------------------------------------------------------------] Legend: code, data, rodata, value Breakpoint 1, syscall_read (rv=0x55555558c1c0) at src/syscall.c:290 warning: Source file is more recent than executable. 
290 assert(total_read == rv_get_reg(rv, rv_reg_a2)); ```

Pressing Ctrl + x and then a switches GDB to the TUI view, which shows the current source:

![](https://hackmd.io/_uploads/HJJ3x4TGa.png)

Great news! Next I printed the values of `total_read` and `rv_get_reg(rv, rv_reg_a2)`, and they are the same.

```
gdb-peda$ print total_read
$1 = 0xc
gdb-peda$ print rv_get_reg(rv, rv_reg_a2)
$2 = 0xc
```

I also ran the donut.elf test program just now. So cool!! It made me hungry `^0^`

![](https://hackmd.io/_uploads/HkWa8Eaz6.png)

Next I ran `gdb build/rv32emu` and, once the GDB prompt appeared, entered `run build/doom.elf`: ```shell (base) huaxin@huaxin-System-Product-Name:/media/huaxin/D磁碟區/NUKU/rv32emu$ gdb build/rv32emu GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2 Copyright (C) 2020 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from build/rv32emu... gdb-peda$ run build/doom.elf Starting program: /media/huaxin/D磁碟區/NUKU/rv32emu/build/rv32emu build/doom.elf [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". ---------------------------- RISC-V DOOM Startup v1.9 ---------------------------- V_Init: allocate screens. M_LoadDefaults: Load system defaults. Z_Init: Init zone memory allocation daemon. W_Init: Init WADfiles. 
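couldn't open DOOM1.WAD Error: W_InitFiles: no files found inferior exit code -1 [Inferior 1 (process 1609878) exited normally] Warning: not running ```

The working directory turned out to be wrong, so we moved to the directory where `DOOM1.WAD` can be found:

```shell
gdb-peda$ cd build/
```

and then ran the following command:

```
gdb-peda$ run doom.elf
```

The result is shown below. It runs normally.

![](https://hackmd.io/_uploads/Hy2KvU6fa.png)

## Pull request

"The teacher suggested that I describe my reasoning and confirm that my submitted changes work in real-world situations." I understand that I have not yet described my reasoning, and that I have not yet shown that what I did (inspecting the code with GDB) has any effect in a real situation.

Thinking it over, my earlier work only inspected the code with GDB and did not modify any code. I still want to show that I can improve this problem, so I considered it from several angles.

* Why can I not reproduce qwe661234's problem? My machine and qwe661234's may differ in the following ways.
    1. The code is different: the code I used came from sysprog21/rv32emu, not from qwe661234.
    2. Our environment settings or dependencies are not the same.

For the first point, I took the code from [qwe661234's rv32emu](https://github.com/qwe661234/rv32emu) and tried to reproduce the problem on my machine. This counts as one real-world test: with identical code, if the code really has a problem, I would expect it to show up on a different machine too. After testing on my machine I found no problem, so I infer that the code is not the cause and that the problem may lie in differences between my environment settings or dependencies and qwe661234's. My environment is Ubuntu 20.04.2 (12th Gen Intel(R) Core(TM) i9-12900F), while [qwe661234](https://github.com/sysprog21/rv32emu/issues/214) used Ubuntu Linux 20.04 (Intel Xeon E5-2650) and macOS 14.1 (Apple M1). In terms of operating system, we match on Ubuntu 20.04.

My reasoning is as follows. We can confirm that, in a real situation on Ubuntu 20.04.2 (12th Gen Intel(R) Core(TM) i9-12900F), make doom runs normally. For the problem that qwe661234 encountered, I need qwe661234 to provide more information about the environment before I can investigate the cause of the issue further.

I then ran make doom on each commit in turn. The commit hashes are listed below, from 11/7 at the top to 10/1 at the bottom: ``` ok commit 6685e938c1bdd8627ae9390b2d4954a2b20dd19b ok commit c1740075cef723207741c798236ea745953f16de ok commit 0c40953fe743580418267639ecd175984a88d32f ok commit c1b8db0edec677cf654d19065b09db666bbd5b4f ok commit 0c40953fe743580418267639ecd175984a88d32f ok commit c1b8db0edec677cf654d19065b09db666bbd5b4f ok commit 289d6aa75d2a7d8070889e118a10d0441a9fdce1 ok commit 03ec122b95d67426e26a69287544ab2932ac31b3 error commit 6294955d3708924a194402fa72f5978a63a75313 error commit e90b5142bb6ad1f9ed301001d269cc8ad8febd12 error commit 912e62eac91bc78457d678494f4b90bbe8e996cf commit 3e4c14c78268712b84bf60403deda397fc63ccf6 commit 30ac674cba62e5406157a6842aed719624cf554b commit 6e88bb61ddb24c935c02e82d92d81c864f196e08 commit 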
ebabef5dd00a51ae6e87d6735f8e8140fb9492f0 commit 3f1d8acde4ecf596fd9a134c24c1af2d0604f0a8 commit d1bf9a9b9713610312f5c18daaf82be3b68ee732 commit 666b7eb697187a270d69e46aeacf788a33470e15 commit 937feee1b973f759061c6ac84b473caa276720ca commit 0337caae7de10cec3110b3f094649f90008182e7 commit a36f35ec2446b0711c28e7f0495f6016bb51334a commit 35b9dcf01fe377b42d20cb559128e4295b63af19 commit 196908a1aa5346780311c874ac2c43ca131f2e3f commit f65db14270f925a2bd684a44bf5cd2166f9aaf98 commit afcde9585378f6ad0ab3d0527a38715e1b72066e commit c6e7e66f637bac5247b7cbcf223356577c19f59e commit f3c7b8990ef4a61b7ebeaeb51be524469a3bca14 commit cd1d7121c6e8eb9f7550459e0b88a199c0c0482d commit 3736827afd86cc6f55c8104787410e38812fb607 commit 669da4c5e430c4b1be7f297fd89343a1966f0c22 commit a210f747c8892b90b237dd8b4cdd3f285761a738 commit 4be9cceaa54db8808020e1becd62bed3e426a571 commit fdf30b69dfe49a69cc2210ed70f111c4df149747 commit 925b587e726f1f96bad0f58dd4b65efc1ac55747 commit 13e9b61c45a9292b20e3c5d45371c7c3a7b00fdd commit 7aa4ea2f050acaa4bdf7c016023ee867bc11cb69 commit e91506914501a58497c4045db29bf0228b41bd78 commit d9b55abb9b0a860bf6b90be929b24551c2d30ee1 commit 4efedfda177aa63c55a3620eb555c784211c117a commit b3ecce384ff6f8d79a683d9d75f2b1c1892b6e04 commit 5ba1a8f6f23d4debd66048866eadcaf03c2dbbe5 commit 15ead60267e783ab51df9b6fb3ec4ec3ce95c534 commit d08dd050a0dac01f24d0c8229efa121460f859a5 commit 78836cf7b6ee082307072630d47977116480e82d commit 15ead60267e783ab51df9b6fb3ec4ec3ce95c534 commit 78836cf7b6ee082307072630d47977116480e82d commit 16df4dad328c3ae5a11a6dc6291f87e1a594e579 commit ce27e8020e129cdc5f7fec8bb90fff52d9bafde7 commit 4bc845675c213b10d3d42c78ceb490c98c391a4c commit 85a31fcc3013475f6f77988c98adc0b6c56b0754 commit 1d8694b092827028c5c72c79c01bf6abba6e7586 commit f471aec2d694d65f6d2d3ade18f9392ca61478ca commit e7d873da637668eaa77dacb886ff4bbb074f63fb commit 006794dd120ec17ccd8923319fa2875d9f0b9e4c commit fa1753c3be40b28c670157e9578caab338337821 commit 
81013d02fe75975ecb45cd15ec115b6b17146854 commit 3cb4275e623cfa938b771affb33d9fa12589e266 commit 398234b7251b4fcd4d86c6658ccdf98712ef99ee commit bbfe6dc303c111c77038a9c54d2b976b1bcfea95 commit f13c5412d824841afbc3d1e1d85fe8294aef7bf8 commit e2a60b8626416bb1c5d78f51f4f47b4ebb021d19 commit 24e857a0c620f2174dbd19b3bbc2423594cf1d7f commit b528364029b1d74cc2875fce18f1fe051775044d commit 73ea55c376d5785f740dd7ff27118304efe06d12 commit b23d10ad65ef27a24ec46c6c1b0ba9947c177d4d commit 392c4044d4535801cdf504c734786eb1ad1860ca commit 1c2be0a9e3017753a6dbaa711c8f646d360feb6a commit 3a113f79052396e2b77d4e4cc87c67442acfd79b commit 26729696c77122a8f6bb0ee7d85dc1a7ce8c0e56 commit 47f85366a3b1a87ed890d38a6e51f453bb9ddb9f commit cd3e181b740e050ead869d2d4094cb5ded3d75cd commit 8b59e9d4e68a9eb39dc400ce21c1a42d334495c6 commit ac050039e10637fe702300cd8790106fdde8051a commit 79ce1925f7e5d776f30cfff11099a802df799c93 commit 00312241db570add3b9948a023e7ebbe2041919f commit a2075748788deb8804b14c96f4b185fecc787c48 commit e5db6a7ff71ca9aade2cf1896e6dd824f3b7bb5c Author: Jim Huang <jserv.tw@gmail.com> Date: Sun Oct 1 23:00:30 2023 +0800 Explain implementation aspects for certain RISC-V instructions ``` 報錯內容如下 ``` (base) huaxin@huaxin-System-Product-Name:/media/huaxin/D磁碟區/NUKU/rv32emu$ git checkout 6294955d3708924a194402fa72f5978a63a75313 之前的 HEAD 位置是 a207574 Inline performance-critical functions in debug builds HEAD 目前位於 6294955 tests: donut: Sync with upstream (base) huaxin@huaxin-System-Product-Name:/media/huaxin/D磁碟區/NUKU/rv32emu$ make doom CC build/map.o CC build/utils.o CC build/decode.o CC build/io.o CC build/syscall.o CC build/emulate.o CC build/riscv.o CC build/elf.o CC build/cache.o CC build/mpool.o CC build/syscall_sdl.o CC build/main.o LD build/rv32emu (cd build; ../build/rv32emu doom.elf) Segmentation fault (core dumped) make: *** [Makefile:199：doom] 錯誤 139 ``` 接著嘗試進行以下命令 ```shell gdb-peda$ file rv32emu Reading symbols from rv32emu... 
(No debugging symbols found in rv32emu) gdb-peda$ run doom.elf Starting program: /media/huaxin/D磁碟區/NUKU/rv32emu/build/rv32emu doom.elf [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Program received signal SIGSEGV, Segmentation fault. [----------------------------------registers-----------------------------------] RAX: 0x0 RBX: 0x5555555893e0 --> 0x38a RCX: 0x2 RDX: 0x7ffff7fbf038 --> 0x466a800000002 RSI: 0x7ffff7faf1b8 --> 0x81000a0600000000 RDI: 0x1 RBP: 0x38a RSP: 0x7fffffffd9c8 --> 0x7ffff7fbf038 --> 0x466a800000002 RIP: 0x555555559023 (<remove_next_nth_ir+35>: mov rax,QWORD PTR [rax+0x20]) R8 : 0x0 R9 : 0x7ffff7faf168 --> 0x7ffff7faf120 --> 0x7ffff7faf0d8 --> 0x7ffff7faf090 --> 0x7ffff7faf048 --> 0x7ffff7faf240 (--> ...) R10: 0x7ffff7faf120 --> 0x7ffff7faf0d8 --> 0x7ffff7faf090 --> 0x7ffff7faf048 --> 0x7ffff7faf240 --> 0x7ffff7faf288 (--> ...) R11: 0x2 R12: 0x7ffff7faf1b8 --> 0x81000a0600000000 R13: 0x7ffff7faf170 --> 0x5000c0000000014 R14: 0xaaae3388 R15: 0x55555556333c --> 0xf0031300000000 EFLAGS: 0x10202 (carry parity adjust zero sign trap INTERRUPT direction overflow) [-------------------------------------code-------------------------------------] 0x55555555901a <remove_next_nth_ir+26>: mov rbp,QWORD PTR [rbx] 0x55555555901d <remove_next_nth_ir+29>: nop DWORD PTR [rax] 0x555555559020 <remove_next_nth_ir+32>: mov r8,rax => 0x555555559023 <remove_next_nth_ir+35>: mov rax,QWORD PTR [rax+0x20] 0x555555559027 <remove_next_nth_ir+39>: mov r10,r9 0x55555555902a <remove_next_nth_ir+42>: lea r9,[r8-0x8] 0x55555555902e <remove_next_nth_ir+46>: mov QWORD PTR [rsi+0x20],rax 0x555555559032 <remove_next_nth_ir+50>: mov QWORD PTR [r8-0x8],r10 [------------------------------------stack-------------------------------------] 0000| 0x7fffffffd9c8 --> 0x7ffff7fbf038 --> 0x466a800000002 0008| 0x7fffffffd9d0 --> 0x5555555891c0 --> 0x0 0016| 0x7fffffffd9d8 --> 0x55555556146a (<rv_step+1786>: jmp 
0x555555561171 <rv_step+1025>) 0024| 0x7fffffffd9e0 --> 0x64 ('d') 0032| 0x7fffffffd9e8 --> 0x7ffff7faf1b8 --> 0x81000a0600000000 0040| 0x7fffffffd9f0 --> 0x1000101 0048| 0x7fffffffd9f8 --> 0x101010000 0056| 0x7fffffffda00 --> 0x0 [------------------------------------------------------------------------------] Legend: code, data, rodata, value Stopped reason: SIGSEGV 0x0000555555559023 in remove_next_nth_ir () ```

The program triggered a Segmentation fault in remove_next_nth_ir at `0x555555559023 <remove_next_nth_ir+35>: mov rax,QWORD PTR [rax+0x20]`. RAX is 0x0 in the register dump, so this load dereferences a NULL pointer.

Following jserv's advice, I first tried to understand the problem that `Fix the inaccuracy in the number of IR elements to be removed #256` is meant to solve.

## [Fix the inaccuracy in the number of IR elements to be removed #256](https://github.com/sysprog21/rv32emu/pull/256)

qwe661234 observed that only one IR needs to be removed in the libc_substitute function, so the call was changed from removing two IRs to removing one, which fixed the NULL pointer dereference when running doom. ```diff .... if (IF_imm(next_ir, 20) && detect_memset(rv, 1)) { ir->opcode = rv_insn_fuse5; ir->impl = dispatch_table[ir->opcode]; - remove_next_nth_ir(rv, ir, block, 2); + remove_next_nth_ir(rv, ir, block, 1); return true; } if (IF_imm(next_ir, 28) && detect_memcpy(rv, 1)) { ir->opcode = rv_insn_fuse6; ir->impl = dispatch_table[ir->opcode]; - remove_next_nth_ir(rv, ir, block, 2); + remove_next_nth_ir(rv, ir, block, 1); return true; }; } .... 
``` I applied the same change in my `src/emulate.c`:

```diff
     } else if (IF_imm(ir, 0) && IF_rd(ir, t1) && IF_rs1(ir, a0)) {
         next_ir = ir->next;
         if (IF_insn(next_ir, beq) && IF_rs1(next_ir, a2) &&
             IF_rs2(next_ir, zero)) {
             if (IF_imm(next_ir, 20) && detect_memset(rv, 1)) {
                 ir->opcode = rv_insn_fuse5;
                 ir->impl = dispatch_table[ir->opcode];
+                remove_next_nth_ir(rv, ir, block, 1);
                 return true;
             }
             if (IF_imm(next_ir, 28) && detect_memcpy(rv, 1)) {
                 ir->opcode = rv_insn_fuse6;
                 ir->impl = dispatch_table[ir->opcode];
+                remove_next_nth_ir(rv, ir, block, 1);
                 return true;
             };
         }
     }
```

After running make doom again there was no error: the Segmentation fault triggered in remove_next_nth_ir at `0x555555559023 <remove_next_nth_ir+35>: mov rax,QWORD PTR [rax+0x20]` was resolved. However, I am not yet able to tell which line of the C code the following instructions correspond to:

```
0x555555559023 <remove_next_nth_ir+35>: mov rax,QWORD PTR [rax+0x20]
0x555555559027 <remove_next_nth_ir+39>: mov r10,r9
```

To analyze the bug, commands such as backtrace or where show the chain of function calls:

```shell
gdb-peda$ bt
#0 0x0000555555559023 in remove_next_nth_ir ()
#1 0x000055555556146a in rv_step ()
#2 0x0000555555558b4d in main ()
#3 0x00007ffff7ddf083 in __libc_start_main (main=
    0x5555555583e0 <main>, argc=0x2, argv=0x7fffffffdda8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdd98) at ../csu/libc-start.c:308
#4 0x0000555555558e8e in _start ()
```

From this we can see that remove_next_nth_ir was called by rv_step, and rv_step was ultimately called by main. The address 0x5555555583e0 is the address of `main` that was passed to `__libc_start_main`. The Segmentation fault itself occurred at 0x0000555555559023, in frame #0.

I tried the info source command to find the current source file, but GDB could not find it. ```shell ... Legend: code, data, rodata, value Stopped reason: SIGSEGV 0x0000555555559023 in remove_next_nth_ir () gdb-peda$ list 1 <built-in>: 沒有此一檔案或目錄. gdb-peda$ info source Current source file is <built-in> Compilation directory is /build/glibc-SzIz7B/glibc-2.31/elf Source language is c. 
Producer is GNU C11 9.4.0 -mno-mmx -mtune=generic -march=x86-64 -g -O2 -O3 -std=gnu11 -fgnu89-inline -fmerge-all-constants -frounding-math -fstack-protector-strong -fmath-errno -fPIC -fno-stack-protector -fcf-protection=full -fno-tree-loop-distribute-patterns -ftls-model=initial-exec -fasynchronous-unwind-tables -fstack-clash-protection. Compiled with DWARF 2 debugging format. Does not include preprocessor macro info. ... ```

To see which part of the code fails, I set ENABLE_GDBSTUB to 1 in the Makefile. After rebuilding, GDB reported source files and line numbers, which suggests that this build carries debug information. The source itself is not compiled into the executable: GDB uses the debug information to locate the files under src/.

```Makefile
ENABLE_GDBSTUB ?= 1
```

After running make again and using backtrace, we get the following result:

```shell
...
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
Stopped reason: SIGSEGV
0x0000555555559023 in remove_next_nth_ir (rv=<optimized out>, ir=0x7ffff7faf1b8, block=0x7ffff7fbf038, n=0x2) at src/emulate.c:732
732 ir->next = ir->next->next;
gdb-peda$ bt
#0 0x0000555555559023 in remove_next_nth_ir (rv=<optimized out>, ir=0x7ffff7faf1b8, block=0x7ffff7fbf038, n=0x2) at src/emulate.c:732
#1 0x000055555556146a in libc_substitute (block=0x7ffff7fbf038, rv=0x5555555891c0) at src/emulate.c:773
#2 block_find_or_translate (rv=0x5555555891c0) at src/emulate.c:962
#3 rv_step (rv=0x5555555891c0, cycles=<optimized out>) at src/emulate.c:1002
#4 0x0000555555558b4d in run (rv=<optimized out>) at src/main.c:90
#5 main (argc=argc@entry=0x2, args=args@entry=0x7fffffffdd68) at src/main.c:253
#6 0x00007ffff7ddf083 in __libc_start_main (main=
    0x5555555583e0 <main>, argc=0x2, argv=0x7fffffffdd68, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdd58) at ../csu/libc-start.c:308
#7 0x0000555555558e8e in _start () at src/emulate.c:1062
```

We can see that line 773 calls remove_next_nth_ir, and the `Segmentation fault` then occurs while executing line 732: ```shell ... 
for (uint8_t i = 0; i < n; i++) { rv_insn_t *next = ir->next; ir->next = ir->next->next; // line 732 mpool_free(rv->block_ir_mp, next); } if (!ir->next) block->ir_tail = ir; block->n_insn -= n; ... ir->opcode = rv_insn_fuse5; ir->impl = dispatch_table[ir->opcode]; remove_next_nth_ir(rv, ir, block, 2); // line 773 return true; ... ```

I then changed the 2 in `remove_next_nth_ir(rv, ir, block, 2);` to 3 and to 4, and the result was the same as before. When I changed it to 1, the program ran normally. This shows that removing the wrong number of IRs can lead to an access violation: at line 732, `ir->next->next` is read even when `ir->next` is already NULL.

## [Quake renders incorrectly following the implementation of constant optimization and fused instructions #258](https://github.com/sysprog21/rv32emu/issues/258)

After commit [0d1f7f3](https://github.com/sysprog21/rv32emu/commit/0d1f7f34f46db47d1ad38d5274155e3a3468d0eb), Quake renders incorrectly: game objects do not appear in the expected color palette. Commit 0d1f7f3 adds fused instructions and removes unnecessary instructions. Through repeated propagation and folding, auipc + addi and lui + addi each become 2 lui instructions. The optimized basic block then contains a series of lui instructions, so a fusion function is added to handle these lui instructions.

:::info
Note on fusion: the full name of this technique is macro-operation fusion, also known as MOP fusion. It merges several adjacent instructions into a single fused instruction, which reduces the number of instructions and so improves the processor's efficiency and overall performance. Fusion lets us easily identify instruction sequences inside a basic block and perform the fusion operation when a pattern matches.
:::

The following uses ./build/dhrystone.elf as an example:

```
Before
0x100f0: auipc gp,0x18
0x100f4: add gp,gp,864
0x100f8: add a0,gp,-1996
0x100fc: auipc a2,0x1a
0x10100: add a2,a2,1384
0x10104: sub a2,a2,a0

After
0x100f0: lui gp, 0x10108
0x100f4: lui gp, 0x10468
0x100f8: lui a0, 0xFC9C
0x100fc: lui a2,0x10116
0x10100: lui a2, 0x1067E
0x10104: lui a2, 0x9e2
```

Next, I look at the code changed in commit 0d1f7f3 and explain it.

### The fused "AUIPC + ADDI" instruction becomes the fused "multiple lui" instruction

The diff is: ```diff -/* AUIPC + ADDI */ +/* multiple lui */ static bool do_fuse1(riscv_t *rv, const rv_insn_t *ir) { - rv->csr_cycle += 2; - rv->X[ir->rd] = rv->PC + ir->imm; - rv->X[ir->rs1] = rv->X[ir->rd] + ir->imm2; - rv->PC += 2 * ir->insn_len; + rv->csr_cycle += ir->imm2; + for (int i = 0; i < ir->imm2; i++) { + const rv_insn_t *cur_ir = ir + i; + rv->X[cur_ir->rd] = cur_ir->imm; + } + rv->PC += ir->imm2 * 
ir->insn_len; if (unlikely(RVOP_NO_NEXT(ir))) return true; - const rv_insn_t *next = ir + 2; + const rv_insn_t *next = ir + ir->imm2; MUST_TAIL return next->impl(rv, next); } ```

"AUIPC + ADDI"

* `rv->csr_cycle += 2;` adds 2 because there are two instructions, AUIPC and ADDI.
* `rv->X[ir->rd]` is set to `rv->PC + ir->imm`, the AUIPC result written to register rd.
* `rv->X[ir->rs1]` is set to `rv->X[ir->rd] + ir->imm2`, the ADDI result. Judging from the code, the `rs1` field of this fused IR names the register that ADDI writes.
* `rv->PC` holds the address of the next instruction to execute. The fused "AUIPC + ADDI" instruction contains two instructions, so after it executes, `rv->PC` has to advance by 2 times `ir->insn_len`.

"multiple lui"

* `ir->imm2` is the number of lui instructions.
* A for loop then sets `rv->X[cur_ir->rd]` to `cur_ir->imm` for each lui instruction.

`const rv_insn_t *next` points to the next instruction to execute.

### The fused "AUIPC + ADD" instruction becomes the fused "LUI + ADD" instruction

```diff
-/* AUIPC + ADD */
+/* LUI + ADD */
 static bool do_fuse2(riscv_t *rv, const rv_insn_t *ir)
 {
     rv->csr_cycle += 2;
-    rv->X[ir->rd] = rv->PC + ir->imm;
+    rv->X[ir->rd] = ir->imm;
     rv->X[ir->rs2] = rv->X[ir->rd] + rv->X[ir->rs1];
     rv->PC += 2 * ir->insn_len;
     if (unlikely(RVOP_NO_NEXT(ir)))
@@ -468,21 +469,8 @@ static bool do_fuse4(riscv_t *rv, const rv_insn_t *ir)
     MUST_TAIL return next->impl(rv, next);
 }
```

Before the change: `rv->X[ir->rd] = rv->PC + ir->imm;` (per the AUIPC instruction, rd receives PC + imm). After the change: `rv->X[ir->rd] = ir->imm;` (per the lui instruction, rd receives imm). ```c static bool libc_substitute(riscv_t *rv, block_t *block) { rv_insn_t *ir = block->ir, *next_ir = NULL; switch (ir->opcode) { case rv_insn_addi: /* Compare the target block with the first basic block of * memset/memcpy, if two block is match, we would extract the * instruction sequence starting from the pc_start of the basic * block and then compare it with the pre-recorded memset/memcpy * instruction sequence. 
*/ if (ir->imm == 15 && ir->rd == rv_reg_t1 && ir->rs1 == rv_reg_zero) { next_ir = ir + 1; if (next_ir->opcode == rv_insn_addi && next_ir->rd == rv_reg_a4 && next_ir->rs1 == rv_reg_a0 && next_ir->rs2 == rv_reg_zero) { next_ir = next_ir + 1; if (next_ir->opcode == rv_insn_bgeu && next_ir->imm == 60 && next_ir->rs1 == rv_reg_t1 && next_ir->rs2 == rv_reg_a2) { if (detect_memset(rv, 1)) { ir->opcode = rv_insn_fuse5; ir->impl = dispatch_table[ir->opcode]; ir->tailcall = true; return true; }; } } } else if (ir->imm == 0 && ir->rd == rv_reg_t1 && ir->rs1 == rv_reg_a0) { next_ir = ir + 1; if (next_ir->opcode == rv_insn_beq && next_ir->rs1 == rv_reg_a2 && next_ir->rs2 == rv_reg_zero) { if (next_ir->imm == 20 && detect_memset(rv, 2)) { ir->opcode = rv_insn_fuse5; ir->impl = dispatch_table[ir->opcode]; ir->tailcall = true; return true; } else if (next_ir->imm == 28 && detect_memcpy(rv, 2)) { ir->opcode = rv_insn_fuse6; ir->impl = dispatch_table[ir->opcode]; ir->tailcall = true; return true; }; } } break; case rv_insn_xor: /* Compare the target block with the first basic block of memcpy, if * two block is match, we would extract the instruction sequence * starting from the pc_start of the basic block and then compare * it with the pre-recorded memcpy instruction sequence. 
         */
        if (ir->rd == rv_reg_a5 && ir->rs1 == rv_reg_a0 &&
            ir->rs2 == rv_reg_a1) {
            next_ir = ir + 1;
            if (next_ir->opcode == rv_insn_andi && next_ir->imm == 3 &&
                next_ir->rd == rv_reg_a5 && next_ir->rs1 == rv_reg_a5) {
                next_ir = next_ir + 1;
                if (next_ir->opcode == rv_insn_add &&
                    next_ir->rd == rv_reg_a7 && next_ir->rs1 == rv_reg_a0 &&
                    next_ir->rs2 == rv_reg_a2) {
                    next_ir = next_ir + 1;
                    if (next_ir->opcode == rv_insn_bne &&
                        next_ir->imm == 104 && next_ir->rs1 == rv_reg_a5 &&
                        next_ir->rs2 == rv_reg_zero) {
                        if (detect_memcpy(rv, 1)) {
                            ir->opcode = rv_insn_fuse6;
                            ir->impl = dispatch_table[ir->opcode];
                            ir->tailcall = true;
                            return true;
                        };
                    }
                }
            }
        }
        break;
        /* TODO: Inject other frequently used function calls from the C standard
         * library */
    }
    return false;
}
```

### Implementation investigation
After switching to commit 0d1f7f3, run `make quake` and keep pressing Enter until the game screen appears. The following error then shows up:
```shell
(base) huaxin@huaxin-System-Product-Name:/media/huaxin/D磁碟區/NUKU/rv32emu$ make quake
mk/toolchain.mk:28: GNU Toolchain for RISC-V is required to build architecture tests. Please check package installation
(cd build; ../build/rv32emu quake.elf)
Added packfile .//id1/pak0.pak (339 files)
Couldn't open .//id1/pak1.pak
Playing registered version.
Console initialized.
Exe: 20:12:56 Sep 1 2023
8.0 megabyte heap
792k surface cache
Sound init
========Quake Initialized=========
QUAKEMBD [INFO]: QuakEMBD - Based on WinQuake 1.090
execing quake.rc
execing default.cfg
FindFile: can't find config.cfg
couldn't exec config.cfg
FindFile: can't find autoexec.cfg
couldn't exec autoexec.cfg
3 demo(s) in loop
Playing demo from demo1.dem.
[1d][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1e][1f]
[02]the Necropolis
You got the shells
You got the Grenade Launcher
rv32emu: src/syscall.c:290: syscall_read: Assertion `total_read == rv_get_reg(rv, rv_reg_a2)' failed.
Aborted (core dumped)
make: *** [Makefile:167：quake] 錯誤 134
```
The next step is to debug this with GDB.

### [Refactor mpool_calloc function #433](https://github.com/sysprog21/rv32emu/pull/433)
Here I take the code proposed by visitorckw and by qwe661234 and use gdb to compare the x86 instructions generated for the `mpool_calloc` function.

**qwe661234's version**
```c
FORCE_INLINE void *mpool_alloc_helper(mpool_t *mp)
{
    char *ptr = (char *) mp->free_chunk_head + sizeof(memchunk_t);
    mp->free_chunk_head = mp->free_chunk_head->next;
    mp->chunk_count--;
    return ptr;
}

void *mpool_alloc(mpool_t *mp)
{
    if (!mp->chunk_count && !(mpool_extend(mp)))
        return NULL;
    return mpool_alloc_helper(mp);
}

void *mpool_calloc(mpool_t *mp)
{
    if (!mp->chunk_count && !(mpool_extend(mp)))
        return NULL;
    char *ptr = mpool_alloc_helper(mp);
    memset(ptr, 0, mp->chunk_size);
    return ptr;
}
```
**visitorckw's version**
```c
FORCE_INLINE void *mpool_alloc_helper(mpool_t *mp)
{
    if (!mp->chunk_count && !(mpool_extend(mp)))
        return NULL;
    char *ptr = (char *) mp->free_chunk_head + sizeof(memchunk_t);
    mp->free_chunk_head = mp->free_chunk_head->next;
    mp->chunk_count--;
    return ptr;
}

void *mpool_alloc(mpool_t *mp)
{
    return mpool_alloc_helper(mp);
}

void *mpool_calloc(mpool_t *mp)
{
    char *ptr = mpool_alloc_helper(mp);
    if (ptr)
        memset(ptr, 0, mp->chunk_size);
    return ptr;
}
```
First set the following in the Makefile so that debugging is easier
```
CFLAGS = -std=gnu99 -O2 -Wall -Wextra -g
```
Then, in src/emulate.c, change the calls to mpool_alloc into calls to mpool_calloc, start gdb, set a breakpoint, and run
```bash=
$ gdb ./build/rv32emu
gdb-peda$ b mpool_calloc
gdb-peda$ r build/donut.elf
```
This leads to the following screen
![image](https://hackmd.io/_uploads/S1mYTlazC.png)

Next, run
```bash=
disassemble
```
which produces the following instruction output (**qwe661234's version**)
```terminal
gdb-peda$ disassemble
Dump of assembler code for function mpool_calloc:
=> 0x0000555555560990 <+0>: push r14
   0x0000555555560992 <+2>: push r13
   0x0000555555560994 <+4>: push r12
   0x0000555555560996 <+6>: push rbp
   0x0000555555560997 <+7>: push rbx
   0x0000555555560998 <+8>: mov rbp,QWORD PTR [rdi]
   0x000055555556099b <+11>: mov rbx,rdi
   0x000055555556099e <+14>: test rbp,rbp
   0x00005555555609a1 <+17>: je
0x5555555609e0 <mpool_calloc+80> 0x00005555555609a3 <+19>: mov r9,QWORD PTR [rdi+0x10] 0x00005555555609a7 <+23>: mov r12,QWORD PTR [rdi+0x18] 0x00005555555609ab <+27>: mov rax,QWORD PTR [r12] 0x00005555555609af <+31>: sub rbp,0x1 0x00005555555609b3 <+35>: lea r8,[r12+0x8] 0x00005555555609b8 <+40>: mov rdx,r9 0x00005555555609bb <+43>: mov QWORD PTR [rbx],rbp 0x00005555555609be <+46>: mov rdi,r8 0x00005555555609c1 <+49>: xor esi,esi 0x00005555555609c3 <+51>: mov QWORD PTR [rbx+0x18],rax 0x00005555555609c7 <+55>: call 0x555555559100 <memset@plt> 0x00005555555609cc <+60>: pop rbx 0x00005555555609cd <+61>: pop rbp 0x00005555555609ce <+62>: mov r8,rax 0x00005555555609d1 <+65>: pop r12 0x00005555555609d3 <+67>: pop r13 0x00005555555609d5 <+69>: mov rax,r8 0x00005555555609d8 <+72>: pop r14 0x00005555555609da <+74>: ret 0x00005555555609db <+75>: nop DWORD PTR [rax+rax*1+0x0] 0x00005555555609e0 <+80>: call 0x5555555593e0 <getpagesize@plt> 0x00005555555609e5 <+85>: xor r9d,r9d 0x00005555555609e8 <+88>: mov r8d,0xffffffff 0x00005555555609ee <+94>: xor edi,edi 0x00005555555609f0 <+96>: movsxd r13,eax 0x00005555555609f3 <+99>: imul r13,QWORD PTR [rbx+0x8] 0x00005555555609f8 <+104>: mov ecx,0x22 0x00005555555609fd <+109>: mov edx,0x3 0x0000555555560a02 <+114>: mov rsi,r13 0x0000555555560a05 <+117>: call 0x555555559310 <mmap@plt> 0x0000555555560a0a <+122>: mov r12,rax 0x0000555555560a0d <+125>: mov r14,rax 0x0000555555560a10 <+128>: lea rax,[rax-0x1] 0x0000555555560a14 <+132>: cmp rax,0xfffffffffffffffd 0x0000555555560a18 <+136>: ja 0x555555560a90 <mpool_calloc+256> 0x0000555555560a1a <+138>: mov edi,0x10 0x0000555555560a1f <+143>: call 0x5555555592f0 <malloc@plt> 0x0000555555560a24 <+148>: mov rcx,rax 0x0000555555560a27 <+151>: test rax,rax 0x0000555555560a2a <+154>: je 0x555555560a90 <mpool_calloc+256> 0x0000555555560a2c <+156>: mov r9,QWORD PTR [rbx+0x10] 0x0000555555560a30 <+160>: mov QWORD PTR [rax],r12 0x0000555555560a33 <+163>: xor edx,edx 0x0000555555560a35 <+165>: mov 
QWORD PTR [rax+0x8],0x0 0x0000555555560a3d <+173>: mov rax,r13 0x0000555555560a40 <+176>: lea rsi,[r9+0x8] 0x0000555555560a44 <+180>: mov QWORD PTR [rbx+0x18],r12 0x0000555555560a48 <+184>: div rsi 0x0000555555560a4b <+187>: mov rdi,rax 0x0000555555560a4e <+190>: sub rdi,0x1 0x0000555555560a52 <+194>: je 0x555555560a6a <mpool_calloc+218> 0x0000555555560a54 <+196>: nop DWORD PTR [rax+0x0] 0x0000555555560a58 <+200>: mov rdx,r14 0x0000555555560a5b <+203>: add rbp,0x1 0x0000555555560a5f <+207>: add r14,rsi 0x0000555555560a62 <+210>: mov QWORD PTR [rdx],r14 0x0000555555560a65 <+213>: cmp rbp,rdi 0x0000555555560a68 <+216>: jne 0x555555560a58 <mpool_calloc+200> 0x0000555555560a6a <+218>: add rax,QWORD PTR [rbx] 0x0000555555560a6d <+221>: mov rbp,rax 0x0000555555560a70 <+224>: lea rax,[rbx+0x20] 0x0000555555560a74 <+228>: nop DWORD PTR [rax+0x0] 0x0000555555560a78 <+232>: mov rdx,rax 0x0000555555560a7b <+235>: mov rax,QWORD PTR [rax+0x8] 0x0000555555560a7f <+239>: test rax,rax 0x0000555555560a82 <+242>: jne 0x555555560a78 <mpool_calloc+232> 0x0000555555560a84 <+244>: mov QWORD PTR [rdx+0x8],rcx 0x0000555555560a88 <+248>: jmp 0x5555555609ab <mpool_calloc+27> 0x0000555555560a8d <+253>: nop DWORD PTR [rax] 0x0000555555560a90 <+256>: xor r8d,r8d 0x0000555555560a93 <+259>: pop rbx 0x0000555555560a94 <+260>: pop rbp 0x0000555555560a95 <+261>: mov rax,r8 0x0000555555560a98 <+264>: pop r12 0x0000555555560a9a <+266>: pop r13 0x0000555555560a9c <+268>: pop r14 0x0000555555560a9e <+270>: ret End of assembler dump. 
```
The same steps produce the following instruction output (**visitorckw's version**)
```terminal
gdb-peda$ disassemble
Dump of assembler code for function mpool_calloc:
=> 0x0000555555560990 <+0>: push r14
   0x0000555555560992 <+2>: push r13
   0x0000555555560994 <+4>: push r12
   0x0000555555560996 <+6>: push rbp
   0x0000555555560997 <+7>: push rbx
   0x0000555555560998 <+8>: mov rbp,QWORD PTR [rdi]
   0x000055555556099b <+11>: mov rbx,rdi
   0x000055555556099e <+14>: test rbp,rbp
   0x00005555555609a1 <+17>: je 0x5555555609e0 <mpool_calloc+80>
   0x00005555555609a3 <+19>: mov r9,QWORD PTR [rdi+0x10]
   0x00005555555609a7 <+23>: mov r12,QWORD PTR [rdi+0x18]
   0x00005555555609ab <+27>: mov rax,QWORD PTR [r12]
   0x00005555555609af <+31>: sub rbp,0x1
   0x00005555555609b3 <+35>: lea r8,[r12+0x8]
   0x00005555555609b8 <+40>: mov rdx,r9
   0x00005555555609bb <+43>: mov QWORD PTR [rbx],rbp
   0x00005555555609be <+46>: mov rdi,r8
   0x00005555555609c1 <+49>: xor esi,esi
   0x00005555555609c3 <+51>: mov QWORD PTR [rbx+0x18],rax
   0x00005555555609c7 <+55>: call 0x555555559100 <memset@plt>
   0x00005555555609cc <+60>: mov r8,rax
   0x00005555555609cf <+63>: pop rbx
   0x00005555555609d0 <+64>: mov rax,r8
   0x00005555555609d3 <+67>: pop rbp
   0x00005555555609d4 <+68>: pop r12
   0x00005555555609d6 <+70>: pop r13
   0x00005555555609d8 <+72>: pop r14
   0x00005555555609da <+74>: ret
   0x00005555555609db <+75>: nop DWORD PTR [rax+rax*1+0x0]
   0x00005555555609e0 <+80>: call 0x5555555593e0 <getpagesize@plt>
   0x00005555555609e5 <+85>: xor r9d,r9d
   0x00005555555609e8 <+88>: mov r8d,0xffffffff
   0x00005555555609ee <+94>: xor edi,edi
   0x00005555555609f0 <+96>: movsxd r14,eax
   0x00005555555609f3 <+99>: imul r14,QWORD PTR [rbx+0x8]
   0x00005555555609f8 <+104>: mov ecx,0x22
   0x00005555555609fd <+109>: mov edx,0x3
   0x0000555555560a02 <+114>: mov rsi,r14
   0x0000555555560a05 <+117>: call 0x555555559310 <mmap@plt>
   0x0000555555560a0a <+122>: mov r12,rax
   0x0000555555560a0d <+125>: mov r13,rax
   0x0000555555560a10 <+128>: lea rax,[rax-0x1]
   0x0000555555560a14 <+132>: cmp rax,0xfffffffffffffffd
   0x0000555555560a18 <+136>: ja
0x555555560a90 <mpool_calloc+256> 0x0000555555560a1a <+138>: mov edi,0x10 0x0000555555560a1f <+143>: call 0x5555555592f0 <malloc@plt> 0x0000555555560a24 <+148>: mov rcx,rax 0x0000555555560a27 <+151>: test rax,rax 0x0000555555560a2a <+154>: je 0x555555560a90 <mpool_calloc+256> 0x0000555555560a2c <+156>: mov r9,QWORD PTR [rbx+0x10] 0x0000555555560a30 <+160>: mov QWORD PTR [rax],r12 0x0000555555560a33 <+163>: xor edx,edx 0x0000555555560a35 <+165>: mov QWORD PTR [rax+0x8],0x0 0x0000555555560a3d <+173>: mov rax,r14 0x0000555555560a40 <+176>: lea rsi,[r9+0x8] 0x0000555555560a44 <+180>: mov QWORD PTR [rbx+0x18],r12 0x0000555555560a48 <+184>: div rsi 0x0000555555560a4b <+187>: mov rdi,rax 0x0000555555560a4e <+190>: sub rdi,0x1 0x0000555555560a52 <+194>: je 0x555555560a6a <mpool_calloc+218> 0x0000555555560a54 <+196>: nop DWORD PTR [rax+0x0] 0x0000555555560a58 <+200>: mov rdx,r13 0x0000555555560a5b <+203>: add rbp,0x1 0x0000555555560a5f <+207>: add r13,rsi 0x0000555555560a62 <+210>: mov QWORD PTR [rdx],r13 0x0000555555560a65 <+213>: cmp rbp,rdi 0x0000555555560a68 <+216>: jne 0x555555560a58 <mpool_calloc+200> 0x0000555555560a6a <+218>: add rax,QWORD PTR [rbx] 0x0000555555560a6d <+221>: mov rbp,rax 0x0000555555560a70 <+224>: lea rax,[rbx+0x20] 0x0000555555560a74 <+228>: nop DWORD PTR [rax+0x0] 0x0000555555560a78 <+232>: mov rdx,rax 0x0000555555560a7b <+235>: mov rax,QWORD PTR [rax+0x8] 0x0000555555560a7f <+239>: test rax,rax 0x0000555555560a82 <+242>: jne 0x555555560a78 <mpool_calloc+232> 0x0000555555560a84 <+244>: mov QWORD PTR [rdx+0x8],rcx 0x0000555555560a88 <+248>: jmp 0x5555555609ab <mpool_calloc+27> 0x0000555555560a8d <+253>: nop DWORD PTR [rax] 0x0000555555560a90 <+256>: xor r8d,r8d 0x0000555555560a93 <+259>: jmp 0x5555555609cf <mpool_calloc+63> End of assembler dump. 
```
Comparing the two versions, **@visitorckw**'s version contains the following instruction, which the other version lacks. Its failure path jumps back to the shared epilogue at `<+63>`, whereas qwe661234's version repeats the `pop`/`ret` sequence at `<+259>` through `<+270>`.
```terminal=
0x0000555555560a93 <+259>: jmp 0x5555555609cf <mpool_calloc+63>
```
A second way to obtain the x86 instructions is to make the following change to `./src/mpool.c`
```diff
#include <stdlib.h>
#include <string.h>
- #if HAVE_MMAP
#include <sys/mman.h>
#include <unistd.h>
+ #define FORCE_INLINE static inline __attribute__((always_inline))
- #endif
```
and run the following command
```bash
$ gcc -S -o mpool.s ./src/mpool.c
```
Opening `mpool.s` in vim then shows the following results.

**qwe661234's version**
```c=
mpool_calloc:
.LFB11:
    .cfi_startproc
    endbr64
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register 6
    subq $48, %rsp
    movq %rdi, -40(%rbp)
    movq -40(%rbp), %rax
    movq (%rax), %rax
    testq %rax, %rax
    jne .L24
    movq -40(%rbp), %rax
    movq %rax, %rdi
    call mpool_extend
    testq %rax, %rax
    jne .L24
    movl $0, %eax
    jmp .L25
.L24:
    movq -40(%rbp), %rax
    movq %rax, -16(%rbp)
    movq -16(%rbp), %rax
    movq 24(%rax), %rax
    addq $8, %rax
    movq %rax, -8(%rbp)
    movq -16(%rbp), %rax
    movq 24(%rax), %rax
    movq (%rax), %rdx
    movq -16(%rbp), %rax
    movq %rdx, 24(%rax)
    movq -16(%rbp), %rax
    movq (%rax), %rax
    leaq -1(%rax), %rdx
    movq -16(%rbp), %rax
    movq %rdx, (%rax)
    movq -8(%rbp), %rax
    movq %rax, -24(%rbp)
    movq -40(%rbp), %rax
    movq 16(%rax), %rdx
    movq -24(%rbp), %rax
    movl $0, %esi
    movq %rax, %rdi
    call memset@PLT
    movq -24(%rbp), %rax
.L25:
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE11:
    .size mpool_calloc, .-mpool_calloc
    .globl mpool_free
    .type mpool_free, @function
```
**visitorckw's version**
```c=
mpool_calloc:
.LFB11:
    .cfi_startproc
    endbr64
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register 6
    subq $48, %rsp
    movq %rdi, -40(%rbp)
    movq -40(%rbp), %rax
    movq %rax, -16(%rbp)
    movq -16(%rbp), %rax
    movq (%rax), %rax
    testq %rax, %rax
    jne .L24
    movq -16(%rbp), %rax
    movq %rax, %rdi
    call mpool_extend
    testq %rax, %rax
    jne .L24
    movl $0, %eax
    jmp .L25
.L24:
    movq -16(%rbp), %rax
    movq 24(%rax), %rax
    addq $8, %rax
    movq %rax, -8(%rbp)
    movq -16(%rbp), %rax
    movq 24(%rax), %rax
    movq (%rax), %rdx
    movq
-16(%rbp), %rax
    movq %rdx, 24(%rax)
    movq -16(%rbp), %rax
    movq (%rax), %rax
    leaq -1(%rax), %rdx
    movq -16(%rbp), %rax
    movq %rdx, (%rax)
    movq -8(%rbp), %rax
.L25:
    movq %rax, -24(%rbp)
    cmpq $0, -24(%rbp)
    je .L26
    movq -40(%rbp), %rax
    movq 16(%rax), %rdx
    movq -24(%rbp), %rax
    movl $0, %esi
    movq %rax, %rdi
    call memset@PLT
.L26:
    movq -24(%rbp), %rax
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE11:
    .size mpool_calloc, .-mpool_calloc
    .globl mpool_free
    .type mpool_free, @function
```

### [Optimize cmp function #441](https://github.com/sysprog21/rv32emu/pull/441)
An attempt to improve the performance of `cmp`.

**Original**
```c
static inline int cmp(const void *arg0, const void *arg1)
{
    riscv_word_t *a = (riscv_word_t *) arg0, *b = (riscv_word_t *) arg1;
    return (*a < *b) ? _CMP_LESS : (*a > *b) ? _CMP_GREATER : _CMP_EQUAL;
}
```
**Improved**
```c
static inline int cmp(const void *arg0, const void *arg1)
{
    riscv_word_t a = *(riscv_word_t *) arg0, b = *(riscv_word_t *) arg1;
    return (a < b) ? _CMP_LESS : (a > b) ? _CMP_GREATER : _CMP_EQUAL;
}
```
To compile `src/breakpoint.c` on its own, comment out its gdbstub guard as follows
```c
#include "common.h"
// #if !RV32_HAS(GDBSTUB)
// #error "Do not manage to build this file unless you enable gdbstub support."
// #endif
#include "breakpoint.h"
```
Then run the following command to obtain the x86 instructions
```bash
$ gcc -S -m64 src/breakpoint.c
```
**Original**
```c
cmp:
.LFB4:
    .cfi_startproc
    endbr64
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register 6
    movq %rdi, -24(%rbp)
    movq %rsi, -32(%rbp)
    movq -24(%rbp), %rax
    movq %rax, -16(%rbp)
    movq -32(%rbp), %rax
    movq %rax, -8(%rbp)
    movq -16(%rbp), %rax
    movl (%rax), %edx
    movq -8(%rbp), %rax
    movl (%rax), %eax
    cmpl %eax, %edx
    jb .L2
    movq -16(%rbp), %rax
    movl (%rax), %edx
    movq -8(%rbp), %rax
    movl (%rax), %eax
    cmpl %eax, %edx
    seta %al
    movzbl %al, %eax
    jmp .L4
.L2:
    movl $-1, %eax
.L4:
    popq %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE4:
    .size cmp, .-cmp
    .globl breakpoint_map_new
    .type breakpoint_map_new, @function
```
**Improved**
```c
cmp:
.LFB4:
    .cfi_startproc
    endbr64
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register 6
    movq %rdi, -24(%rbp)
    movq %rsi, -32(%rbp)
    movq -24(%rbp), %rax
    movl (%rax), %eax
    movl %eax, -8(%rbp)
    movq -32(%rbp), %rax
    movl (%rax), %eax
    movl %eax, -4(%rbp)
    movl -8(%rbp), %eax
    cmpl -4(%rbp), %eax
    jb .L2
    movl -8(%rbp), %eax
    cmpl -4(%rbp), %eax
    seta %al
    movzbl %al, %eax
    jmp .L4
.L2:
    movl $-1, %eax
.L4:
    popq %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE4:
    .size cmp, .-cmp
    .globl breakpoint_map_new
    .type breakpoint_map_new, @function
```
In this unoptimized output the improved version needs fewer instructions (26 down to 22 in the listings above) and fewer memory accesses, because the two pointers are dereferenced once instead of once per comparison. visitorckw later pointed out that with the O2 flag both versions compile to identical code. The lesson from this commit is that source-level optimization has to take the compiler's own optimization into account.
```
$ gcc -S -m64 -O2 src/breakpoint.c
```
So on to the next commit.

### Replace sprintf with safer snprintf
The following warning was found on macOS and does not appear on Ubuntu. This is my macOS hardware profile
```terminal
happy@HappydeMacBook-Air rv32emu % system_profiler SPHardwareDataType
Hardware:

    Hardware Overview:

      Model Name: MacBook Air
      Model Identifier: MacBookAir10,1
      Model Number: MGN93TA/A
      Chip: Apple M1
      Total Number of Cores: 8 (4 performance and 4 efficiency)
      Memory: 8 GB
      System Firmware Version: 10151.61.4
      OS Loader Version: 10151.61.4
      Serial Number (system):
FVFHX4U5Q6L7
      Hardware UUID: E07835E4-3164-5C51-AB26-8791EA68D4EC
      Provisioning UDID: 00008103-000429843AC0801E
      Activation Lock Status: Disabled
```
Adjust the Makefile as follows, switching from **detecting undefined behavior** to **detecting memory errors**
```diff
- ENABLE_UBSAN ?= 0
+ ENABLE_UBSAN ?= 1
ifeq ("$(ENABLE_UBSAN)", "1")
- CFLAGS += -fsanitize=undefined -fno-sanitize=alignment -fno-sanitize-recover=all
- LDFLAGS += -fsanitize=undefined -fno-sanitize=alignment -fno-sanitize-recover=all
+ CFLAGS += -fsanitize=address -fno-sanitize=alignment -fno-sanitize-recover=all
+ LDFLAGS += -fsanitize=address -fno-sanitize=alignment -fno-sanitize-recover=all
endif
```
After running make, the compiler reports a security-related deprecation warning at line 135 of src/main.c. `sprintf` performs no [bounds checking](https://zh.wikipedia.org/zh-tw/%E8%BE%B9%E7%95%8C%E6%A3%80%E6%9F%A5) and can easily cause a buffer overflow, whereas `snprintf` never writes more characters than the buffer size it is given, which avoids the overflow. The wiki explains [Bounds checking](https://) as follows:
> bounds checking is any method of detecting whether a variable is within some bounds before it is used.
> As performing bounds checking during each use can be time-consuming, it is not always done.

```terminal
src/main.c:135:9: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only. Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead. [-Wdeprecated-declarations]
        sprintf(prof_out_file, "%s/%s%s.prof", cwd_path, rel_path,
        ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/stdio.h:180:1: note: 'sprintf' has been explicitly marked deprecated here
__deprecated_msg("This function is provided for compatibility reasons only. Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead.")
^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/cdefs.h:218:48: note: expanded from macro '__deprecated_msg'
        #define __deprecated_msg(_msg) __attribute__((__deprecated__(_msg)))
                                                      ^
1 warning generated.
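```
The [Common vulnerabilities guide for C programmers](https://security.web.cern.ch/recommendations/en/codetools/c.shtml) from CERN (European Organization for Nuclear Research) Computer Security explicitly lists `sprintf` as a vulnerability and recommends `snprintf` for better safety. Below is the change I made in main.c, replacing `sprintf` with `snprintf` to address the potential buffer overflow.

**main.c**
```diff
-    prof_out_file = malloc(strlen(cwd_path) + 1 + strlen(rel_path) +
-                           strlen(prog_basename) + 5 + 1);
+    size_t prof_out_file_size = strlen(cwd_path) + 1 + strlen(rel_path) +
+                                strlen(prog_basename) + 5 + 1;
+    prof_out_file = malloc(prof_out_file_size);
     assert(prof_out_file);
-    sprintf(prof_out_file, "%s/%s%s.prof", cwd_path, rel_path,
-            prog_basename);
+    snprintf(prof_out_file, prof_out_file_size, "%s/%s%s.prof", cwd_path,
+             rel_path, prog_basename);
```
After discussing this with the reviewers, it became clear that they prefer to fix security problems that actually exist in rv32emu, that is, whether a buffer overflow really occurs. They did not accept giving up some performance to `snprintf` in order to guard against a merely potential problem. I then checked the memory and confirmed that the buffer currently allocated is large enough and does not trigger a buffer overflow, so this commit was eventually Closed. The experience taught me what the reviewers weigh when it comes to the project's security, which will feed into my next commit. Thanks to the reviewers.

### Issues record
Someone recently asked some interesting questions.
* QEMU is also a RISC-V emulator, and rv32emu claims to be highly efficient. Which one is faster, QEMU or rv32emu? If rv32emu is faster, which design decisions are responsible?

See the book EN 帶你寫個作業系統
![image](https://hackmd.io/_uploads/Bk1nSV9rp.png)
It discusses the performance bottleneck of KVM/QEMU. QEMU and KVM form a full-virtualization solution; without hardware acceleration, all the work has to be emulated in software, which makes the emulator slow, especially for device I/O. The emulator's I/O request flow is as follows:
1. The processing result is passed to the I/O sharing page.
2. The QEMU process obtains the I/O information and hands it to the QEMU I/O Emulation Code, which emulates the I/O request.
3. When emulation finishes, the result is put back into the I/O sharing page.
4. The I/O trap in the KVM module is notified to fetch the result and return it to the virtual machine.

Besides these complex steps hurting performance, there is also the cost of VMEntry, VMExit, and context switches.

From the [2023 年系統軟體系列課程討論區](https://www.facebook.com/groups/system.software2023/permalink/747209127231244/) (2023 system software course discussion group):
> A performance comparison of rv32emu and QEMU: by default rv32emu executes RISC-V instructions by pure interpretation, while QEMU uses its TCG (Tiny Code Generator) for dynamic binary translation. For most test items the gap is within 7x, and for some code without a significant hot spot rv32emu can even outperform QEMU. The memory overhead of the two differs enormously, however: QEMU needs tens of MB as soon as it starts and grows with execution demand, while rv32emu can stay within tens of KB.

![image](https://hackmd.io/_uploads/SyvZBVcBp.png)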
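
To make "pure interpretation" in the quote above more concrete, here is a minimal sketch of a fetch, decode, and execute loop covering only two RV32I instructions, ADDI and ADD. The names `toy_cpu_t`, `toy_step`, and `toy_run` are hypothetical and exist only for this illustration; they are not rv32emu's API, and rv32emu's real interpreter appears to work on pre-decoded `rv_insn_t` entries dispatched through `ir->impl`, as in the fused-instruction code earlier in these notes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy state for illustration; this is not rv32emu's riscv_t. */
typedef struct {
    uint32_t x[32]; /* integer registers; x[0] always reads as zero */
    uint32_t pc;    /* byte address of the next instruction */
} toy_cpu_t;

/* Fetch, decode, and execute one instruction from mem.
 * Returns 1 on success, 0 if the instruction is outside this sketch. */
int toy_step(toy_cpu_t *cpu, const uint32_t *mem)
{
    assert(cpu->pc % 4 == 0); /* only aligned 32-bit instructions here */
    uint32_t insn = mem[cpu->pc / 4];

    /* Field positions follow the RV32I base instruction formats. */
    uint32_t opcode = insn & 0x7f;
    uint32_t rd = (insn >> 7) & 0x1f;
    uint32_t funct3 = (insn >> 12) & 0x7;
    uint32_t rs1 = (insn >> 15) & 0x1f;
    uint32_t rs2 = (insn >> 20) & 0x1f;
    uint32_t funct7 = insn >> 25;

    /* I-type immediate: bits 31..20, sign-extended from 12 bits. */
    uint32_t imm = insn >> 20;
    if (imm & 0x800)
        imm |= 0xfffff000u;

    uint32_t result;
    if (opcode == 0x13 && funct3 == 0) {
        result = cpu->x[rs1] + imm; /* ADDI */
    } else if (opcode == 0x33 && funct3 == 0 && funct7 == 0) {
        result = cpu->x[rs1] + cpu->x[rs2]; /* ADD */
    } else {
        return 0; /* unsupported in this sketch */
    }

    if (rd != 0) /* writes to x0 are discarded */
        cpu->x[rd] = result;
    cpu->pc += 4;
    return 1;
}

/* Interpret up to n instructions; every step pays the decode cost again.
 * Returns the number of instructions actually executed. */
size_t toy_run(toy_cpu_t *cpu, const uint32_t *mem, size_t n)
{
    size_t done = 0;
    while (done < n && toy_step(cpu, mem))
        done++;
    return done;
}
```

Running `toy_run` on the three words `0x00500093` (`addi x1, x0, 5`), `0x00700113` (`addi x2, x0, 7`), and `0x002081b3` (`add x3, x1, x2`) leaves 12 in `x[3]`. The sketch keeps almost no state beyond the register file, which hints at why an interpreter can stay small in memory, while a dynamic binary translator such as TCG must additionally hold the host code it generates.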