--- tags: jserv, 2023-arch, RISC-V --- # Assignment2: GNU Toolchain contributed by freshLiver ## TODO - Memory Efficiency - Loop unrolling ## Why In assignment 1, I choose a floating-point related topic as my homework 1, so this time I want to choose another topic. And I remember this person (`fewletter`) modify my simplefs notes several months ago, so I choose this topic. ## Setup ### Compile C Code ```bash cd /tmp git clone https://github.com/fewletter/CAHW && cd CAHW/HW1 riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o VariableByte.elf VariableByte.c ``` Then, check the report of `riscv-none-elf-readelf -h VariableByte.elf`: ```text ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100d8 Start of program headers: 52 (bytes into file) Start of section headers: 98872 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` And try to run the binary with the `rv32emu VariableByte.elf`: ```text 2 Encoded Bytes: 8100 Encoded Bytes: ffffff7f Encoded Bytes: ffffff7f ffffffff inferior exit code 0 ``` To check the execution cycles of this program, we need to add the following c code to the original c code: ```c #include <stdint.h> typedef uint64_t ticks; static inline ticks getticks(void) { uint64_t result; uint32_t l, h, h2; asm volatile("rdcycleh %0\n" "rdcycle %1\n" "rdcycleh %2\n" "sub %0, %0, %2\n" "seqz %0, %0\n" "sub %0, zero, %0\n" "and %1, %1, %0\n" : "=r"(h), "=r"(l), "=r"(h2)); result = (((uint64_t)h) << 32) | ((uint64_t)l); return result; } ``` And sightly modify the main function: ```c int main() { ticks start = getticks(), end; ... // not modified end = getticks(); printf("cycles: %llu\n", end - start); return 0; } ``` Then, recompile and execute: ```bash riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o VariableByte.elf VariableByte.c rv32emu VariableByte.elf ``` Now we can see the execution cycles of this program: ```text 2 Encoded Bytes: 8100 Encoded Bytes: ffffff7f Encoded Bytes: ffffff7f ffffffff cycles: 9895 inferior exit code 0 ``` ### Execute Handwritten Assembly To execute the handwritten assembly code in the rv32emu, we need to solve some problems: #### Environment Calls :::warning There are several `div` instructions used in the handwritten assembly code, so we can't compile the assembly code with only RV32I instructions. This problem will be handled in [the later section](#Replace-div-Instructions), and this section simply remove the RV32I options (`-march=rv32i -mabi=ilp32`) to compile the handwritten assembly code. ::: In additional to the use of `div` instructions, the handwritten code use some ripes specific ecalls, such as PrintString (4) and PrintHex (34), to output some information: ```asm \n: .string "\n" EncodedBytes: .string "Encoded Bytes: " main: ... mv t6, a3 jal ra, printHEX ... mv t6, a3 jal ra, printHEX ... mv t6, a3 jal ra, printHEX mv t6, a4 jal ra, printHEX li a7, 10 ecall printHEX: la a0, EncodedBytes li a7, 4 ecall mv a0, t6 # return in register a0 li a7, 34 ecall la a0, \n li a7, 4 ecall ret ``` However, the library function `printf` is used in the C code, so that the assembly code generated by the compiler will jump to the library function: ```bash riscv-none-elf-objdump -d VariableByte.elf 000108e0 <main>: ... 10978: 4f078513 add a0,a5,1264 # 224f0 <__clzsi2+0x88> 1097c: 085000ef jal 11200 <printf> ``` So here, instead of using the ecalls PrintString (4) and PrintHex (34), call the library function `printf` to ensure the handwritten assembly code is compatible with rv32emu, and comparable with that generated by the compiler. To do this, we should: 1. Save `ra` and set arguments for `printf` Since the ecalls are replaced with `printf`, we should save the `ra` in stack to avoid the program stuck in the function `printHEX`. And the string address should be loaded to `a0`, and the value to be printed should be placed in the `a1` register: ```asm printHEX: addi sp, sp, -4 sw ra, 0(sp) mv a1, t6 lui a0, %hi(EncodedBytes) addi a0, a0, %lo(EncodedBytes) call printf lw ra, 0(sp) addi sp, sp, 4 ret ``` Also, change the string to be printed from `Encoded Bytes:` to `Encoded Bytes: 0x%X\n`. 2. Modify the Strings However, if we compile this code, we will get the following error message: ```bash riscv-none-elf-gcc 2023-arch-hw2-handwritten.s /home/hsi/Softwares/xpack-riscv-none-elf-gcc-13.2.0-2/bin/../lib/gcc/riscv-none-elf/13.2.0/../../../../riscv-none-elf/bin/ld: /home/hsi/Softwares/xpack-riscv-none-elf-gcc-13.2.0-2/bin/../lib/gcc/riscv-none-elf/13.2.0/../../../../riscv-none-elf/lib/crt0.o: in function `.L0 ': (.text+0x2c): undefined reference to`main' collect2: error: ld returned 1 exit status ``` To fix this problem, modify the string section to: ```asm EncodedBytes: .string "Encoded Bytes: 0x%X\n" .section .text.startup, "ax", @progbits .align 2 .global main .type main, @function ``` Also, though it's not essential, the section label `printHEX` is renamed to `printANS`, and add `li a0, 0` before program exit: Finally, the modified handwritten code will be like: ```asm EncodedBytes: .string "Encoded Bytes: 0x%X\n" .section .text.startup, "ax", @progbits .align 2 .global main .type main, @function ... main: ... jal ra, printANS ... jal ra, printANS ... jal ra, printANS ... jal ra, printANS li a0, 0 li a7, 93 ecall ... printANS: addi sp, sp, -4 sw ra, 0(sp) mv a1, t6 lui a0, %hi(EncodedBytes) addi a0, a0, %lo(EncodedBytes) call printf lw ra, 0(sp) addi sp, sp, 4 ret ``` #### Calling Convention Now, we have got rid of the used ripes specific ecalls by using `printf`. However, if we execute the executable with rv32emu, we will get the wrong results: ```bash rv32emu 2023-arch-hw2-handwritten.elf Encoded Bytes: 0x8100 Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0x0 inferior exit code 0 ``` It's because the handwritten assembly code not obey the calling convention. The original handwritten code reuse the `tX` and `aX` registers after calling `printHEX`: ```asm main: ... mv t6, a3 jal ra, printANS ... mv t6, a3 jal ra, printANS ... mv t6, a3 jal ra, printANS mv t6, a4 # problem! reuse aX register jal ra, printANS ``` It may be safe to reuse these registers as the original implementation `printHEX` only use ecalls. However, we have replace the ecalls with `printf` in last section, so now we should avoid doing this. To solve this problem, we need to save the value of `a4`: ```asm main: addi sp, sp, -4 # allocate stack ... mv t6, a3 jal ra, printANS ... mv t6, a3 jal ra, printANS ... mv t6, a3 sw a4, 0(sp) # save in stack jal ra, printANS lw t6, 0(sp) # load from stack jal ra, printANS addi sp, sp, 4 # free stack ... ``` Then, recompile and execute: ```bash riscv-none-elf-gcc 2023-arch-hw2-handwritten.s -o 2023-arch-hw2-handwritten.elf rv32emu 2023-arch-hw2-handwritten.elf Encoded Bytes: 0x8100 Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0xFFFFFFFF ``` And now the result is as expect. ::: warning Not all the calling convention related problems are solved, here just minimize the need of modification on the source handwritten code. ::: #### Count Execution Cycle Now we need to measure the performance using the instruction `rdcycle`: ```asm ExecutionTime: .string "Cycles: %u\n" .section .text.startup, "ax", @progbits .align 2 .global main .type main, @function ... main: rdcycle t0 # get start time addi sp, sp, -8 # increase stack size sw t0, 0(sp) # save start time in stack ... rdcycle t0 # get finish time lw t1, 0(sp) # load start time sub a0, t0, t1 jal ra, printTIME # print execution time # exit addi sp, sp, 8 li a7, 93 ecall printTIME: addi sp, sp, -4 sw ra, 0(sp) mv a1, a0 lui a0, %hi(ExecutionTime) addi a0, a0, %lo(ExecutionTime) call printf lw ra, 0(sp) addi sp, sp, 4 ret ``` Then, recompile and execute: ```bash riscv-none-elf-gcc 2023-arch-hw2-handwritten.s -o 2023-arch-hw2-handwritten.elf rv32emu 2023-arch-hw2-handwritten.elf Encoded Bytes: 0x8100 Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0xFFFFFFFF Cycles: 7731 inferior exit code 0 ``` #### Replace `div` Instructions There are 4 `div` instructions used in the handwritten code: - 1 for divide a value by 4: The `div` instruction used here could be simply replaced with a right shift operation. That is, we can change the following code: ```asm encodeVariableByte: li t0, 0x7f # bitmask1 li t1, 0x80 # bitmask2 li t2, 0 # i = 0 li t3, 4 ... first_register: div a5, t2, t3 ... ``` To: ```asm encodeVariableByte: li t0, 0x7f # bitmask1 li t1, 0x80 # bitmask2 li t2, 0 # i = 0 first_register: srli a5, t2, 2 ... ``` - 3 for divide a value by 7 To prevent dividing a small unsigned integer which is less less than 64, by 7 operation from using the `div` instruction, we can simply declare a table for fast lookup: ```asm div7_table: .word 0, 0, 0, 0, 0, 0 .word 1, 1, 1, 1, 1, 1 .word 2, 2, 2, 2, 2, 2 .word 3, 3, 3, 3, 3, 3 .word 4, 4, 4, 4, 4, 4 .word 5, 5, 5, 5, 5, 5 .word 6, 6, 6, 6, 6, 6 .word 7, 7, 7, 7, 7, 7 .word 8, 8, 8, 8, 8, 8 .word 9, 9, 9, 9, 9, 9 ``` With this table, we can replace the `div` instruction by using the `la` and `lw` instructions: ```asm # li t1, 7 # div a0, a0, t1 la t1, div7_table slli a0, a0, 2 # offset = CLZ*4, each element 4B add a0, a0, t1 # base addr + offset lw a0, 0(a0) ``` Now, recompile with the RV32I specific options (`-march=rv32i -mabi=ilp32`): ```bash riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o 2023-arch-hw2-handwritten.elf 2023-arch-hw2-handwritten.s rv32emu 2023-arch-hw2-handwritten.elf Encoded Bytes: 0x8100 Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0xFFFFFFFF Cycles: 7075 inferior exit code 0 ``` ## Disassemble Use `riscv-none-elf-objdump -d <binary>` to get the disassembled code. The execution cycles of the original hanwritten assembly code is 7075. ### Case 1 - `-O3` ```bash riscv-none-elf-gcc -O3 -march=rv32i -mabi=ilp32 -o VariableByte.elf VariableByte.c rv32emu VariableByte.elf ``` ```text 2 Encoded Bytes: 8100 Encoded Bytes: ffffff7f Encoded Bytes: ffffff7f ffffffff cycles: 7876 inferior exit code 0 ``` #### Observation 1 - Check Branch Condition First ```c void encodeVariableByte(uint64_t value, uint32_t *encodebytes, int len) { uint32_t bitmask1 = 0x7f; uint32_t bitmask2 = 0x80; for (int i = 0; i < len; i++) { ... } } ``` In the corresponding assembly code generated by the compiler, because the operations before the for loop are just simple value assignments, the branch condition `i < len` with be checked before these operations: ```asm encodeVariableByte: ble a3,zero,.L3 ``` So that the two assignments could be skipped when the input `len` is zero. ### Case 2 - `-Os` ```bash riscv-none-elf-gcc -Os -march=rv32i -mabi=ilp32 -o VariableByte.elf VariableByte.c rv32emu VariableByte.elf ``` ```text 2 Encoded Bytes: 8100 Encoded Bytes: ffffff7f Encoded Bytes: ffffff7f ffffffff cycles: 8053 inferior exit code 0 ``` ## Optimization The execution cycles of the original hanwritten assembly code is 7075. ### Handwritten Assembly #### Redundant Jump 1 ```asm encodeVariableByte: ... j first_register first_register: ... ``` In the above code, the instruction `j` is a pseudoinstruction, and by refering the [official document](https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md#pseudoinstructions), the `j first_register` should be translated into `jal x0, offset`, and this operation is redundant here. By removing the instruction `j first_register`, the execution cycles could be reduced from 7075 to 7072: ```bash riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o 2023-arch-hw2-handwritten.elf 2023-arch-hw2-handwritten.s rv32emu 2023-arch-hw2-handwritten.elf Encoded Bytes: 0x8100 Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0xFFFFFFFF Cycles: 7072 inferior exit code 0 ``` #### Redundant Jump 2 ```asm exit: jr ra second_register: ... beq t2, a0, exit j second_register reset_bitmask: ... j second_register ``` Similarly, by moving the section `reset_bitmask` before `second_register` section: ```asm exit: jr ra reset_bitmask: ... or a2, a2, s1 second_register: ... beq t2, a0, exit j second_register ``` the last pseudoinstruction `j` could be removed, and the execution cycles could be further reduced to 7070. ```bash riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o 2023-arch-hw2-handwritten.elf 2023-arch-hw2-handwritten.s rv32emu 2023-arch-hw2-handwritten.elf Encoded Bytes: 0x8100 Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0xFFFFFFFF Cycles: 7070 inferior exit code 0 ``` #### Redundant Register In the original c code, we need to handle the case `i == 0`: ```c void encodeVariableByte(uint64_t value, uint32_t *encodebytes, int len) { ... encodebytes[i/4] |= i/4 == 0 ? ... : ...; encodebytes[i/4] = i == 0 ? ... : ...; ... } } ``` And the handwritten code use a temporary register `t5` to keep the value 0 for only checking the special case mentioned above: ```asm encodeVariableByte: ... li t5, 0 first_register: ... bne a5, t5, reset_bitmask ... bne t2, t5, not_firstbyte ... ``` However, there is already one dedicate register `x0` that keeps the value 0. So we can replace the register `t5` in the handwritten code with `x0`: ```asm ```asm encodeVariableByte: ... # li t5, 0 ... first_register: ... bne a5, x0, reset_bitmask ... bne t2, x0, not_firstbyte ... ``` This can reduce 3 more cycles, and now the execution cycle is 7067: ```bash riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o 2023-arch-hw2-handwritten.elf 2023-arch-hw2-handwritten.s rv32emu 2023-arch-hw2-handwritten.elf Encoded Bytes: 0x8100 Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0xFFFFFF7F Encoded Bytes: 0xFFFFFFFF Cycles: 7067 inferior exit code 0 ``` ### Optimize Case 2 - `-Os` ```bash riscv-none-elf-gcc -Os -march=rv32i -mabi=ilp32 -o VariableByte.elf VariableByte.c rv32emu VariableByte.elf ``` ```text 2 Encoded Bytes: 8100 Encoded Bytes: ffffff7f Encoded Bytes: ffffff7f ffffffff cycles: 8053 inferior exit code 0 ``` In the handwritten code, serveral constants used in the CLZ function are loaded from the `.data` section: ```asm .data testdata: .word 0, 128 .word 0, 0xfffffff .word 0xffffff, 0xffffffff and1: .word 0x33333333, 0x33333333 and2: .word 0x0f0f0f0f, 0x0f0f0f0f and3: .word 0x55555555, 0x55555555 ... CLZ: ... # load 0x5555555555555555 la t0, and3 lw t1, 0(t0) lw t2, 4(t0) ... ``` However, the generated assembly code set the constants using 2 instructions: ```asm count_leading_zeros: ... li a2,1431654400 # set 0x55555000 addi a2,a2,1365 # add 0x00000555 ... li a2,858992640 # set 0x33333000 addi a2,a2,819 # add 0x00000333 ... li a3,252645376 # set 0x0F0F1000 addi a3,a3,-241 # sub 0x000000F1 ... ``` There is no need to use 2 instruction to assign a constant, it can be optimized by using single `li`: ```asm count_leading_zeros: ... li a2,0x55555555 # set 0x55555555 ... li a2,0x33333333 # set 0x33333333 ... li a3,0x0F0F0F0F # set 0x0F0F0F0F ... ``` But this can neither reduce the elf size: ```bash riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o VariableByte.0.elf VariableByte.Os.0.S # original riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o VariableByte.1.elf VariableByte.Os.1.S # modified ls -l VariableByte.*.elf ``` ```text -rwxr-xr-x 1 hsi hsi 95484 Oct 29 22:27 VariableByte.0.elf -rwxr-xr-x 1 hsi hsi 95484 Oct 29 22:27 VariableByte.1.elf ``` nor make the program use less cycles: ```bash rv32emu VariableByte.0.elf rv32emu VariableByte.1.elf ``` ```text 2 Encoded Bytes: 8100 Encoded Bytes: ffffff7f Encoded Bytes: ffffff7f ffffffff cycles: 8053 inferior exit code 0 2 Encoded Bytes: 8100 Encoded Bytes: ffffff7f Encoded Bytes: ffffff7f ffffffff cycles: 8053 inferior exit code 0 ``` :::warning The above assembly code is generated by the compiler with the `-S` option, which is different from the disassembled codes read from the objdump tool: ```asm 10358: 55555637 lui a2,0x55555 1035c: 55560613 add a2,a2,1365 # 55555555 <__BSS_END__+0x55531645> ``` ::: ## References - <https://hackmd.io/@fewletter/CAHW1> - <https://github.com/fewletter/CAHW/tree/main/HW1>