Lab2: RISC-V RV32I[MA] emulator with ELF support : Count Leading Zero(CLZ)

# Lab2: RISC-V RV32I[MA] emulator with ELF support : Count Leading Zero(CLZ)
###### tags: `Computer Architecture`

[TOC]

### RISC-V GNU Compiler toolchain
#### Introduction
This is the RISC-V C and C++ cross-compiler. It supports two build modes: a generic ELF/Newlib toolchain and a more sophisticated Linux-ELF/glibc toolchain.

### Environment Setup
#### Getting the [source](https://github.com/riscv/riscv-gnu-toolchain)
```
$ git clone --recursive https://github.com/riscv/riscv-gnu-toolchain
```
#### Prerequisites
Several standard packages are needed to build the toolchain.
```
$ sudo apt-get install autoconf automake autotools-dev curl python3 libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev
```
#### Installation
To build the 32-bit RV32I toolchain (matching the `--with-arch=rv32i` setting below), use:
```
$ cd riscv-gnu-toolchain
$ export PATH=$PATH:/opt/riscv/bin
$ source ~/.bashrc
$ ./configure --prefix=/opt/riscv --with-arch=rv32i --with-abi=ilp32
$ sudo make
```

### Rewritten code in C
#### clz.c
```c=
unsigned int clz(unsigned int num) {
    int res = 0;
    unsigned mask = 0x80000000;
    for (int i = 0; i < 32; i++) {
        if (num & (mask >> i))
            break;
        res++;
    }
    return res;
}

int _start(){
    unsigned int countNum = 0x0000000f;
    unsigned int ans = clz(countNum);

    volatile char *tx = (volatile char*)0x40002000;
    *tx = ans + '0';
    return 0;
}
```
Note that the result-printing part is not correct yet and has not been fixed: `*tx = ans + '0'` writes a single character, so a value greater than 9 cannot be printed correctly, and the conversion (cast) still has to be fixed.
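
The single-character write in `clz.c` only covers the values 0 to 9. Below is a minimal sketch of one way to print multi-digit values, assuming the same kind of byte-wide transmit register that `clz.c` writes to. The helper names `u32_to_dec` and `print_u32` are hypothetical and not part of the original lab code, and the sketch assumes a 32-bit `unsigned int`.

```c
/* Hypothetical helper: convert v to decimal ASCII in buf.
 * buf needs at least 11 bytes (10 digits plus the terminating NUL).
 * Returns the number of digits written. */
int u32_to_dec(unsigned int v, char *buf)
{
    char tmp[10];
    int n = 0;

    /* Peel off digits, least significant first. */
    do {
        tmp[n++] = (char)('0' + v % 10);
        v /= 10;
    } while (v != 0);

    /* Reverse them into the caller's buffer. */
    for (int i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}

/* Hypothetical helper: write the decimal form of v, one byte at a
 * time, to a transmit register such as (volatile char *)0x40002000. */
void print_u32(volatile char *tx, unsigned int v)
{
    char buf[11];
    int n = u32_to_dec(v, buf);

    for (int i = 0; i < n; i++)
        *tx = buf[i];
}
```

In `_start`, the store `*tx = ans + '0';` could then become `print_u32(tx, ans);`. One caveat: on an RV32I target without the M extension, `/` and `%` are compiled into libgcc helper calls, so a `-nostdlib` build would likely also need `-lgcc` on the link line.
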
Currently, only values less than 10 are printed correctly.

Result (compiled without optimization):
```
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -O0 -nostdlib -o clz clz.c
./emu-rv32i clz
```
```
>>> Execution time: 80585 ns
>>> Instruction count: 466 (IPS=5782713)
>>> Jumps: 34 (7.30%) - 3 forwards, 31 backwards
>>> Branching T=30 (51.72%) F=28 (48.28%)
```
#### Disassemble the ELF file
```
riscv-none-embed-objdump -h clz
```
```
clz:     file format elf32-littleriscv

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         000000d0  00010054  00010054  00000054  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .comment      00000033  00000000  00000000  00000124  2**0
                  CONTENTS, READONLY
```
#### Header of ELF file
```
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           RISC-V
  Version:                           0x1
  Entry point address:               0x10088
  Start of program headers:          52 (bytes into file)
  Start of section headers:          544 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         1
  Size of section headers:           40 (bytes)
  Number of section headers:         6
  Section header string table index: 5
```
#### RISC-V assembly result
```
riscv-none-embed-objdump -d clz
```
```
clz:     file format elf32-littleriscv

Disassembly of section .text:

00010054 <clz>:
   10054:   fd010113    addi    sp,sp,-48
   10058:   02812623    sw      s0,44(sp)
   1005c:   03010413    addi    s0,sp,48
   10060:   fca42e23    sw      a0,-36(s0)
   10064:   fe042623    sw      zero,-20(s0)
   10068:   800007b7    lui     a5,0x80000
   1006c:   fef42223    sw      a5,-28(s0)
   10070:   fe042423    sw      zero,-24(s0)
   10074:   0340006f    j       100a8 <clz+0x54>
   10078:   fe842783    lw      a5,-24(s0)
   1007c:   fe442703    lw      a4,-28(s0)
   10080:   00f75733    srl     a4,a4,a5
   10084:   fdc42783    lw      a5,-36(s0)
   10088:   00f777b3    and     a5,a4,a5
   1008c:   02079663    bnez    a5,100b8 <clz+0x64>
   10090:   fec42783    lw      a5,-20(s0)
   10094:   00178793    addi    a5,a5,1 # 80000001 <__global_pointer$+0x7ffee6dd>
   10098:   fef42623    sw      a5,-20(s0)
   1009c:   fe842783    lw      a5,-24(s0)
   100a0:   00178793    addi    a5,a5,1
   100a4:   fef42423    sw      a5,-24(s0)
   100a8:   fe842703    lw      a4,-24(s0)
   100ac:   01f00793    li      a5,31
   100b0:   fce7d4e3    bge     a5,a4,10078 <clz+0x24>
   100b4:   0080006f    j       100bc <clz+0x68>
   100b8:   00000013    nop
   100bc:   fec42783    lw      a5,-20(s0)
   100c0:   00078513    mv      a0,a5
   100c4:   02c12403    lw      s0,44(sp)
   100c8:   03010113    addi    sp,sp,48
   100cc:   00008067    ret

000100d0 <_start>:
   100d0:   fe010113    addi    sp,sp,-32
   100d4:   00112e23    sw      ra,28(sp)
   100d8:   00812c23    sw      s0,24(sp)
   100dc:   02010413    addi    s0,sp,32
   100e0:   00f00793    li      a5,15
   100e4:   fef42623    sw      a5,-20(s0)
   100e8:   fec42503    lw      a0,-20(s0)
   100ec:   f69ff0ef    jal     ra,10054 <clz>
   100f0:   fea42423    sw      a0,-24(s0)
   100f4:   400027b7    lui     a5,0x40002
   100f8:   fef42223    sw      a5,-28(s0)
   100fc:   fe842783    lw      a5,-24(s0)
   10100:   0ff7f713    andi    a4,a5,255
   10104:   fe442783    lw      a5,-28(s0)
   10108:   00e78023    sb      a4,0(a5) # 40002000 <__global_pointer$+0x3fff06dc>
   1010c:   00000793    li      a5,0
   10110:   00078513    mv      a0,a5
   10114:   01c12083    lw      ra,28(sp)
   10118:   01812403    lw      s0,24(sp)
   1011c:   02010113    addi    sp,sp,32
   10120:   00008067    ret
```
#### Compare to assembly code
Credit to [鄭惟](https://hackmd.io/@WeiCheng14159/rkUifs2Hw)
```
.data
input: .word 0x0000000f
one:   .word 0x80000000
str1:  .string "clz value of "
str2:  .string " is "

.text
main:
        lw a0, input            # Load input from static data
        jal ra, clz             # Jump-and-link to the 'clz' label

        # Print the result to console
        mv a1, a0
        lw a0, input
        jal ra, printResult

        # Exit program
        li a7, 10
        ecall

clz:
        # t0 = one
        # t1 = cnt = 32
        # t2 = res
        # a0 = i
        lw t0, one
        li t1, 32
        li t2, 0
_beg:
        bne t1, zero, cnt
_ret:
        mv a0, t2
        ret
cnt:
        addi t1,t1,-1
        and t3, a0, t0          # i & one
        bne t3, zero, _ret
        addi t2, t2, 1
        srli t0, t0, 1
        j _beg

# --- printResult ---
# a0: input
# a1: result
printResult:
        mv t0, a0
        mv t1, a1
        la a0, str1
        li a7, 4
        ecall
        mv a0, t0
        li a7, 1
        ecall
        la a0, str2
        li a7, 4
        ecall
        mv a0, t1
        li a7, 1
        ecall
        ret
```

### Compare result with optimization
```
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -O# -nostdlib -o clz clz.c
# replace the '#' in -O# with 1, 2, or 3
```
#### Optimize O1
```
>>> Execution time: 40708 ns
>>> Instruction count: 156 (IPS=3832170)
>>> Jumps: 31 (19.87%) - 2 forwards, 29 backwards
>>> Branching T=28 (50.00%) F=28 (50.00%)
```
#### Optimize O2
```
>>> Execution time: 6854 ns
>>> Instruction count: 149 (IPS=21739130)
>>> Jumps: 28 (18.79%) - 0 forwards, 28 backwards
>>> Branching T=27 (48.21%) F=29 (51.79%)
```
#### Optimize O3
```
>>> Execution time: 50465 ns
>>> Instruction count: 149 (IPS=2952541)
>>> Jumps: 28 (18.79%) - 0 forwards, 28 backwards
>>> Branching T=27 (48.21%) F=29 (51.79%)
```
#### Remove the printing part (lines 16 and 17 of the C code), compile with O3
```
>>> Execution time: 1764 ns
>>> Instruction count: 3 (IPS=1700680)
>>> Jumps: 1 (33.33%) - 0 forwards, 1 backwards
>>> Branching T=0 (-nan%) F=0 (-nan%)
```
* We can observe that compiling with O2 gives the best result: the instruction count, jumps, and branching are almost identical to O3, but the execution time is clearly better than with O3. Because the O2 and O3 instruction counts are the same (149), part of this gap may come from timing variation on the host rather than from the generated code. The [GCC document](http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) describes each optimization level in detail.
* [GCC optimization](https://wiki.gentoo.org/wiki/GCC_optimization#-O) mentions that
>Compiling with -O3 is not a guaranteed way to improve performance, and in fact, in many cases, can slow down a system due to larger binaries and increased memory usage.
* Another point is that without printing the result, every metric, including execution time, instruction count, jumps, and branching, drops drastically. With only 3 instructions executed, the most likely reason is that the result of `clz` is no longer used, so the compiler can remove the computation entirely.
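
A minimal sketch of one way to keep the `clz` computation in the measured program without the printing code, assuming a `volatile` variable is an acceptable stand-in for the output register. The names `sink` and `run_clz_once` are hypothetical and not part of the original lab code.

```c
/* Same loop-based clz as in clz.c. */
unsigned int clz(unsigned int num)
{
    unsigned int res = 0;
    unsigned int mask = 0x80000000u;

    for (int i = 0; i < 32; i++) {
        if (num & (mask >> i))
            break;
        res++;
    }
    return res;
}

/* Hypothetical sink: a store to a volatile object is observable
 * behaviour, so the compiler has to keep the value being stored. */
volatile unsigned int sink;

/* Hypothetical replacement for the body of _start when measuring. */
int run_clz_once(void)
{
    /* Reading the input through a volatile also keeps the compiler
     * from folding clz(0xf) into a constant at compile time. */
    volatile unsigned int input = 0x0000000fu;

    sink = clz(input);
    return 0;
}
```

With this variant the instruction count should again reflect the loop itself, which makes the O1 to O3 numbers comparable without the cost of the output code.
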