# Lab2: RISC-V RV32I[MA] emulator with ELF support <contributed by:`xl86305955`> ###### tags:`Computer Architecture`, `RISC-V` ## Requirements 1. Following the instructions of [Lab2: RISC-V RV32I[MA] emulator with ELF support](https://hackmd.io/@sysprog/rJAufgHYS), pick up two assembly programs from [Assignment1: RISC-V Assembly and Instruction Pipeline](https://hackmd.io/@jserv/By5OE6fOr) and rewrite into C implementations which can execute with [rv32emu](https://github.com/sysprog21/rv32emu). * You shall not pick up the program(s) submitted by youself already. * You may pick up the same topic as other students did, but you shall figure out something by your own. 2. Disassemble the ELF files generated by C compiler and compare the assembly listing between handwritten and compiler optimized one. * You can modify the arguments specified by `RV32I_CFLAGS` to experiment. e.g. Change `-O3` (optimized for speed) to `-Os` (optimized for size). * Describe your obserations and explain. * Function calls and `ecall` shall be taken into considerations. 3. Check the results of `emu-rv32i` for the statistics of execution flow and explain the internal counters such as `true_counter`, `true_counter` (crucial for [branch prediction](https://en.wikipedia.org/wiki/Branch_predictor)), `jump_counter`, etc. Then, try to optimize the generated assembly. You shall read [RISC-V Assembly Programmer's Manual](https://github.com/riscv/riscv-asm-manual/blob/master/riscv-asm.md) carefully and modify Makefile in order to append new assembly targets. * Would you encounter data hazards? Can you do [instruction scheduling](https://en.wikipedia.org/wiki/Instruction_scheduling) manually? 4. Write down your thoughts and progress in [HackMD notes](https://hackmd.io/s/features). * [Example page](https://hackmd.io/@kksweet8845/2019q3homworkquiz2): Do not modify this note. * Insert your HackMD notes and programs in the following table. * 2 entries as expected. ## Environment Setup * Operating System * Ubuntu 18.04.3 LTS Follow the steps in [Lab2: RISC-V RV32I[MA] emulator with ELF support](https://hackmd.io/@sysprog/rJAufgHYS) to set up the GNU Toolchain for RISC-V. One thing I have encountered is my directories is a little different from the guide ``` $ tree -a 2 /home/fred/riscv-none-embed-gcc/8.2.1-3.1.1/ 2 [error opening dir] /home/fred/riscv-none-embed-gcc/8.2.1-3.1.1/ ├── CHANGELOG.md ├── .content │ ├── bin ``` Since that, I should be more aware of the location where to configure `$PATH` ``` $ cd riscv-none-embed-gcc/ $ echo "export PATH=`pwd`/8.2.1-3.1.1/.content/bin:$PATH" > setenv ``` Update `$PATH` environment: ``` $ cd $HOME $ source riscv-none-embed-gcc/setenv ``` Check `$PATH`, the result showned as expected ``` $ riscv-none-embed-gcc -v gcc version 8.2.0 (xPack GNU RISC-V Embedded GCC, 64-bit) ``` Now we can start to test for [rv32emu](https://github.com/sysprog21/rv32emu) ``` $ cd rv32emu/ $ make riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -O3 -nostdlib -o test1 test1.c $ make check ./emu-rv32i test1 Hello RISC-V! >>> Execution time: 56911 ns >>> Instruction count: 62 (IPS=1089420) >>> Jumps: 14 (22.58%) - 0 forwards, 14 backwards >>> Branching T=13 (92.86%) F=1 (7.14%) ``` ## What is ELF The definition from [wikipedia](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format): > THe Executable and Linking Format (ELF) file, is a common standard file format for executable files, object code, shared libraries, and core dumps The benefits of ELF file is that it defines couple variables and information, so that it makes dynamic linking more flexible ``` $ riscv-none-embed-readelf -h test1 ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100 Start of program headers: 52 (bytes into file) Start of section headers: 1172 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 1 Size of section headers: 40 (bytes) Number of section headers: 10 Section header string table index: 9 ``` :::warning What is magic number ? ::: :::info Ans: A magic number is a numeric or string constant that indicates the file type. NOTE: Take ELF file for example, ELF file's Magic number will begin with 7f 45 4c 46. The 45 4c 46 is "ELF" in ASCII code ::: ## Problem Encountered ### Readings * [Linker Script初探 - GNU Linker Ld手冊略讀](http://wen00072.github.io/blog/2014/03/14/study-on-the-linker-script/#cmd) * [你所不知道的 C 語言:連結器和執行檔資訊](https://hackmd.io/@sysprog/c-linker-loader?type=view) ### Stack Pointer Problem Testing program: ``` $ ./emu-rv32i frac >>> Execution time: 273 ns >>> Instruction count: 2 (IPS=7326007) >>> Jumps: 1 (50.00%) - 0 forwards, 1 backwards >>> Branching T=0 (-nan%) F=0 (-nan%) ``` Check for objdump: >compile without -o3 option ``` riscv-none-embed-objdump -d frac frac: file format elf32-littleriscv Disassembly of section .text: 00010054 <_start>: 10054: fd010113 addi sp,sp,-48 10058: 02812623 sw s0,44(sp) 1005c: 03010413 addi s0,sp,48 10060: 400027b7 lui a5,0x40002 10064: fef42623 sw a5,-20(s0) 10068: 00100793 li a5,1 1006c: fcf42c23 sw a5,-40(s0) 10070: 00100793 li a5,1 10074: fcf42e23 sw a5,-36(s0) 10078: 00000013 nop 1007c: 02c12403 lw s0,44(sp) 10080: 03010113 addi sp,sp,48 10084: 00008067 ret ``` The instruction counts and branching don't work as expected. I find out that if anything which is corresponding to stack pointer, the program will go wrong The emulator ran as a bare machine without software support, so we will need to define our own code section and data section by ourself. Hence, we need to do the program linking From [RISC-V Assembler Reference](https://rv8.io/asm.html): >Program linking is the process of reading multiple relocatable object files, merging the sections from each of the source files, calculating the new addresses for symbols and applying relocation fixups to text or data that is pointed to in relocation entries I also checked it on Stackoverflow: [How can I compile C code to get a bare-metal skeleton of a minimal RISC-V assembly program?](https://stackoverflow.com/questions/31390127/how-can-i-compile-c-code-to-get-a-bare-metal-skeleton-of-a-minimal-risc-v-assemb) The lab senpei teach us about writing linking script, shout out to my __GODLIKE__ senpei YT Shen ``` $ cat link.ld OUTPUT_ARCH( "riscv" ) _STACK_SIZE = DEFINED(_STACK_SIZE) ? _STACK_SIZE : 0x1000; /* Specify the default entry point to the program */ MEMORY { com : ORIGIN = 0x00000000, LENGTH = 0x00000100 mem : ORIGIN = 0x0000100, LENGTH = 0x00010000 } ENTRY(_start) /***************************************************************************** * Define the sections, and where they are mapped in memory ****************************************************************************/ SECTIONS { /* text: test code section */ . = 0x00000; .comment : { *(.comment) } > com .text : { setup.o(.text) *(.text) *(.text.*) *(.gnu.linkonce.t.*) } > mem .rodata : { __rodata_start = .; *(.rdata) *(.rodata) *(.rodata.*) *(.gnu.linkonce.r.*) __rodata_end = .; } > mem .data : { . = ALIGN(4); __data_paddr_start = LOADADDR(.data); __data_start = .; *(.data) *(.data.*) *(.srodata*) *(.gnu.linkonce.d.*) __data_end = .; } > mem .bss : { . = ALIGN(4); __bss_start = .; *(.bss) *(.bss.*) *(.gnu.linkonce.b.*) *(COMMON) . = ALIGN(4); __bss_end = .; } > mem .stack : { . = ALIGN(4); _stack_end = .; . += _STACK_SIZE; _stack = .; __stack = _stack; } > mem _end = .; } ``` * `.` is represented the memory address counter. The result is calculated by linker by the size of all input and output section * Section is the command for the memory layout while execution * `.text` is a read-only section containing executable code * `.data` is a read-write section containing global or static variables * `.rodata` is a read-only section containing const variables * `.bss` is a read-write section containing uninitialised data The memory map will be something like the image below ![](https://images.slideplayer.com/25/7981189/slides/slide_3.jpg) After finished the `link.ld` file, the stack pointer still didn't initialize the stack pointer. We need another object file to initialize the stack pointer. This is the linker job that make sure every needed object file is linked to create a executable file ``` $ cat setup.s #Define constants .section .text .align 4 .global _start _start: li x1, 0 li x2, 0 li x3, 0 li x4, 0 li x5, 0 li x6, 0 li x7, 0 li x8, 0 li x9, 0 li x10, 0 li x11, 0 li x12, 0 li x13, 0 li x14, 0 li x15, 0 li x16, 0 li x17, 0 li x18, 0 li x19, 0 li x20, 0 li x21, 0 li x22, 0 li x23, 0 li x24, 0 li x25, 0 li x26, 0 li x27, 0 li x28, 0 li x29, 0 li x30, 0 li x31, 0 init_stack: la sp, _stack SystemInit: jal main j SystemExit SystemExit: ``` The link.ld file will recognize the label `_start` and initialize the registers, and then initialize the stack pointer. After these steps, we can finally link to out C program entry `main`. If the program executes sucessfully will return `0`, else return `1` and the OS takes place. ### Used of div/rem instruction comment the define in rv32emu.c ``` /* #define STRICT_RV32I */ ``` :::info There is no suppport in rv32i such as MUL, DIV, REM ::: ## Rewrite Programs from [Assignment1: RISC-V Assembly and Instruction Pipeline](https://hackmd.io/@jserv/By5OE6fOr) ### Lucas Number ![](https://i.imgur.com/ytPW7U8.png) ```clike= #define NUM 7 void main() { volatile char* tx = (volatile char*) 0x40002000; const char* output = "L(7) = "; int L[NUM+1]; int o_int; char o_char[50]; while(*output) { *tx = *output; output++; } /* calculate the lucas number */ for (int i = 0;i <= NUM;i++) { if (i == 0 || i ==1) if(i == 0) L[i] = 2; // L[0] = 2 else L[i] = 1; // L[1] = 1 else L[i] = L[i-1] + L[i-2]; } o_int = L[NUM]; int tmp; int count = 0; int flag = 1; /* int to char */ do { if (o_int < 10) { flag = 0; } tmp = o_int%10; if (tmp == 0) o_char[count] = '0'; if (tmp == 1) o_char[count] = '1'; if (tmp == 2) o_char[count] = '2'; if (tmp == 3) o_char[count] = '3'; if (tmp == 4) o_char[count] = '4'; if (tmp == 5) o_char[count] = '5'; if (tmp == 6) o_char[count] = '6'; if (tmp == 7) o_char[count] = '7'; if (tmp == 8) o_char[count] = '8'; if (tmp == 9) o_char[count] = '9'; count++; o_int = o_int/10; }while ( o_int > 10 || flag == 1); /* print out the value of result*/ for (;count>=0;count--) { *tx = o_char[count]; } } ``` The int to char used a lot of `mod` and `div`, which is very bad for the execution time Execute without optimization: ``` $ riscv-none-embed-gcc lucas.c setup.o -march=rv32im -mabi=ilp32 -nostdlib -T link.ld -o lucas $ ./emu-rv32i lucas L(7) = 29 >>> Execution time: 136514 ns >>> Instruction count: 539 (IPS=3948312) >>> Jumps: 54 (10.02%) - 34 forwards, 20 backwards >>> Branching T=46 (71.88%) F=18 (28.12%) ``` Execute with O2 optimization: ``` $ riscv-none-embed-gcc lucas.c setup.o -march=rv32im -mabi=ilp32 -O2 -nostdlib -T link.ld -o lucas $ ./emu-rv32i lucas L(7) = 29 >>> Execution time: 97466 ns >>> Instruction count: 245 (IPS=2513697) >>> Jumps: 39 (15.92%) - 19 forwards, 20 backwards >>> Branching T=31 (65.96%) F=16 (34.04%) ``` Execute with O3 optimization: ``` $ riscv-none-embed-gcc lucas.c setup.o -march=rv32im -mabi=ilp32 -O3 -nostdlib -T link.ld -o lucas $ ./emu-rv32i lucas L(7) = 29 >>> Execution time: 77749 ns >>> Instruction count: 190 (IPS=2443761) >>> Jumps: 28 (14.74%) - 16 forwards, 12 backwards >>> Branching T=23 (74.19%) F=8 (25.81%) ``` As the table show, with or without optimization brings huge impact to the program | | Without Optimization | O2 | O3 | | ----------------- | --------------------------- |:--------------------------- | -------------------------- | | Execution Time | 136514 ns | 97466 ns | 77749 ns | | Instruction Count | 539 | 245 | 190 | | Jumps | 54 | 39 | 28 | | Branching | T=46 (71.88%) F=18 (28.12%) | T=31 (65.96%) F=16 (34.04%) | T=23 (74.19%) F=8 (25.81%) | Compare without optimization to O3, the instuction count reduced 70% instantly!!! And the Jump and Branch almost reduced by 50% !!! The for loop witout optimization will do it step by step with checking `i` values (The `i's` value is stored in " a5, -20(s0) ") ``` 1b8: fec42783 lw a5,-20(s0) 1bc: 00078863 beqz a5,1cc <main+0x38> 1c0: fec42703 lw a4,-20(s0) 1c4: 00100793 li a5,1 1c8: 04f71263 bne a4,a5,20c <main+0x78> 1cc: fec42783 lw a5,-20(s0) 1d0: 02079063 bnez a5,1f0 <main+0x5c> 1d4: fec42783 lw a5,-20(s0) 1d8: 00279793 slli a5,a5,0x2 1dc: ff040713 addi a4,s0,-16 1e0: 00f707b3 add a5,a4,a5 1e4: 00200713 li a4,2 1e8: fce7a423 sw a4,-56(a5) # 40001fc8 <__stack+0x40000d30> 1ec: 0680006f j 254 <main+0xc0> 1f0: fec42783 lw a5,-20(s0) 1f4: 00279793 slli a5,a5,0x2 1f8: ff040713 addi a4,s0,-16 1fc: 00f707b3 add a5,a4,a5 200: 00100713 li a4,1 204: fce7a423 sw a4,-56(a5) 208: 04c0006f j 254 <main+0xc0> 20c: fec42783 lw a5,-20(s0) 210: fff78793 addi a5,a5,-1 214: 00279793 slli a5,a5,0x2 218: ff040713 addi a4,s0,-16 21c: 00f707b3 add a5,a4,a5 220: fc87a703 lw a4,-56(a5) 224: fec42783 lw a5,-20(s0) 228: ffe78793 addi a5,a5,-2 22c: 00279793 slli a5,a5,0x2 230: ff040693 addi a3,s0,-16 234: 00f687b3 add a5,a3,a5 238: fc87a783 lw a5,-56(a5) 23c: 00f70733 add a4,a4,a5 240: fec42783 lw a5,-20(s0) 244: 00279793 slli a5,a5,0x2 248: ff040693 addi a3,s0,-16 24c: 00f687b3 add a5,a3,a5 250: fce7a423 sw a4,-56(a5) 254: fec42783 lw a5,-20(s0) 258: 00178793 addi a5,a5,1 25c: fef42623 sw a5,-20(s0) 260: fec42703 lw a4,-20(s0) 264: 00700793 li a5,7 268: f4e7d8e3 bge a5,a4,1b8 <main+0x24> ``` The O3 one just load and store the value from L[0] to L[7] without addition operation ``` 1e8: 00200793 li a5,2 1ec: 00f12623 sw a5,12(sp) 1f0: 00100793 li a5,1 1f4: 00f12823 sw a5,16(sp) 1f8: 00300793 li a5,3 1fc: 00f12a23 sw a5,20(sp) 200: 00400793 li a5,4 204: 00f12c23 sw a5,24(sp) 208: 00700793 li a5,7 20c: 00f12e23 sw a5,28(sp) 210: 00b00793 li a5,11 214: 02f12023 sw a5,32(sp) 218: 01200793 li a5,18 21c: 01d00693 li a3,29 220: 02f12223 sw a5,36(sp) 224: 00900813 li a6,9 228: 01d00793 li a5,29 22c: 02f12423 sw a5,40(sp) 230: 00d827b3 slt a5,a6,a3 234: 02c10613 addi a2,sp,44 ``` As you can see, the different optimization method uses different method to deal with calculation in for loop