---
tags: jserv, 2023-arch, RISC-V
---
# Assignment2: GNU Toolchain
contributed by freshLiver
## TODO
- Memory Efficiency
- Loop unrolling
## Why
In assignment 1, I choose a floating-point related topic as my homework 1, so this time I want to choose another topic. And I remember this person (`fewletter`) modify my simplefs notes several months ago, so I choose this topic.
## Setup
### Compile C Code
```bash
cd /tmp
git clone https://github.com/fewletter/CAHW && cd CAHW/HW1
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o VariableByte.elf VariableByte.c
```
Then, check the report of `riscv-none-elf-readelf -h VariableByte.elf`:
```text
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x100d8
Start of program headers: 52 (bytes into file)
Start of section headers: 98872 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
```
And try to run the binary with the `rv32emu VariableByte.elf`:
```text
2
Encoded Bytes: 8100
Encoded Bytes: ffffff7f
Encoded Bytes: ffffff7f ffffffff
inferior exit code 0
```
To check the execution cycles of this program, we need to add the following c code to the original c code:
```c
#include <stdint.h>
typedef uint64_t ticks;
static inline ticks getticks(void)
{
uint64_t result;
uint32_t l, h, h2;
asm volatile("rdcycleh %0\n"
"rdcycle %1\n"
"rdcycleh %2\n"
"sub %0, %0, %2\n"
"seqz %0, %0\n"
"sub %0, zero, %0\n"
"and %1, %1, %0\n"
: "=r"(h), "=r"(l), "=r"(h2));
result = (((uint64_t)h) << 32) | ((uint64_t)l);
return result;
}
```
And sightly modify the main function:
```c
int main() {
ticks start = getticks(), end;
... // not modified
end = getticks();
printf("cycles: %llu\n", end - start);
return 0;
}
```
Then, recompile and execute:
```bash
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o VariableByte.elf VariableByte.c
rv32emu VariableByte.elf
```
Now we can see the execution cycles of this program:
```text
2
Encoded Bytes: 8100
Encoded Bytes: ffffff7f
Encoded Bytes: ffffff7f ffffffff
cycles: 9895
inferior exit code 0
```
### Execute Handwritten Assembly
To execute the handwritten assembly code in the rv32emu, we need to solve some problems:
#### Environment Calls
:::warning
There are several `div` instructions used in the handwritten assembly code, so we can't compile the assembly code with only RV32I instructions.
This problem will be handled in [the later section](#Replace-div-Instructions), and this section simply remove the RV32I options (`-march=rv32i -mabi=ilp32`) to compile the handwritten assembly code.
:::
In additional to the use of `div` instructions, the handwritten code use some ripes specific ecalls, such as PrintString (4) and PrintHex (34), to output some information:
```asm
\n: .string "\n"
EncodedBytes: .string "Encoded Bytes: "
main:
...
mv t6, a3
jal ra, printHEX
...
mv t6, a3
jal ra, printHEX
...
mv t6, a3
jal ra, printHEX
mv t6, a4
jal ra, printHEX
li a7, 10
ecall
printHEX:
la a0, EncodedBytes
li a7, 4
ecall
mv a0, t6 # return in register a0
li a7, 34
ecall
la a0, \n
li a7, 4
ecall
ret
```
However, the library function `printf` is used in the C code, so that the assembly code generated by the compiler will jump to the library function:
```bash
riscv-none-elf-objdump -d VariableByte.elf
000108e0 <main>:
...
10978: 4f078513 add a0,a5,1264 # 224f0 <__clzsi2+0x88>
1097c: 085000ef jal 11200 <printf>
```
So here, instead of using the ecalls PrintString (4) and PrintHex (34), call the library function `printf` to ensure the handwritten assembly code is compatible with rv32emu, and comparable with that generated by the compiler.
To do this, we should:
1. Save `ra` and set arguments for `printf`
Since the ecalls are replaced with `printf`, we should save the `ra` in stack to avoid the program stuck in the function `printHEX`.
And the string address should be loaded to `a0`, and the value to be printed should be placed in the `a1` register:
```asm
printHEX:
addi sp, sp, -4
sw ra, 0(sp)
mv a1, t6
lui a0, %hi(EncodedBytes)
addi a0, a0, %lo(EncodedBytes)
call printf
lw ra, 0(sp)
addi sp, sp, 4
ret
```
Also, change the string to be printed from `Encoded Bytes:` to `Encoded Bytes: 0x%X\n`.
2. Modify the Strings
However, if we compile this code, we will get the following error message:
```bash
riscv-none-elf-gcc 2023-arch-hw2-handwritten.s
/home/hsi/Softwares/xpack-riscv-none-elf-gcc-13.2.0-2/bin/../lib/gcc/riscv-none-elf/13.2.0/../../../../riscv-none-elf/bin/ld: /home/hsi/Softwares/xpack-riscv-none-elf-gcc-13.2.0-2/bin/../lib/gcc/riscv-none-elf/13.2.0/../../../../riscv-none-elf/lib/crt0.o: in function `.L0 ':
(.text+0x2c): undefined reference to`main'
collect2: error: ld returned 1 exit status
```
To fix this problem, modify the string section to:
```asm
EncodedBytes:
.string "Encoded Bytes: 0x%X\n"
.section .text.startup, "ax", @progbits
.align 2
.global main
.type main, @function
```
Also, though it's not essential, the section label `printHEX` is renamed to `printANS`, and add `li a0, 0` before program exit:
Finally, the modified handwritten code will be like:
```asm
EncodedBytes:
.string "Encoded Bytes: 0x%X\n"
.section .text.startup, "ax", @progbits
.align 2
.global main
.type main, @function
...
main:
...
jal ra, printANS
...
jal ra, printANS
...
jal ra, printANS
...
jal ra, printANS
li a0, 0
li a7, 93
ecall
...
printANS:
addi sp, sp, -4
sw ra, 0(sp)
mv a1, t6
lui a0, %hi(EncodedBytes)
addi a0, a0, %lo(EncodedBytes)
call printf
lw ra, 0(sp)
addi sp, sp, 4
ret
```
#### Calling Convention
Now, we have got rid of the used ripes specific ecalls by using `printf`. However, if we execute the executable with rv32emu, we will get the wrong results:
```bash
rv32emu 2023-arch-hw2-handwritten.elf
Encoded Bytes: 0x8100
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0x0
inferior exit code 0
```
It's because the handwritten assembly code not obey the calling convention. The original handwritten code reuse the `tX` and `aX` registers after calling `printHEX`:
```asm
main:
...
mv t6, a3
jal ra, printANS
...
mv t6, a3
jal ra, printANS
...
mv t6, a3
jal ra, printANS
mv t6, a4 # problem! reuse aX register
jal ra, printANS
```
It may be safe to reuse these registers as the original implementation `printHEX` only use ecalls. However, we have replace the ecalls with `printf` in last section, so now we should avoid doing this.
To solve this problem, we need to save the value of `a4`:
```asm
main:
addi sp, sp, -4 # allocate stack
...
mv t6, a3
jal ra, printANS
...
mv t6, a3
jal ra, printANS
...
mv t6, a3
sw a4, 0(sp) # save in stack
jal ra, printANS
lw t6, 0(sp) # load from stack
jal ra, printANS
addi sp, sp, 4 # free stack
...
```
Then, recompile and execute:
```bash
riscv-none-elf-gcc 2023-arch-hw2-handwritten.s -o 2023-arch-hw2-handwritten.elf
rv32emu 2023-arch-hw2-handwritten.elf
Encoded Bytes: 0x8100
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0xFFFFFFFF
```
And now the result is as expect.
::: warning
Not all the calling convention related problems are solved, here just minimize the need of modification on the source handwritten code.
:::
#### Count Execution Cycle
Now we need to measure the performance using the instruction `rdcycle`:
```asm
ExecutionTime:
.string "Cycles: %u\n"
.section .text.startup, "ax", @progbits
.align 2
.global main
.type main, @function
...
main:
rdcycle t0 # get start time
addi sp, sp, -8 # increase stack size
sw t0, 0(sp) # save start time in stack
...
rdcycle t0 # get finish time
lw t1, 0(sp) # load start time
sub a0, t0, t1
jal ra, printTIME # print execution time
# exit
addi sp, sp, 8
li a7, 93
ecall
printTIME:
addi sp, sp, -4
sw ra, 0(sp)
mv a1, a0
lui a0, %hi(ExecutionTime)
addi a0, a0, %lo(ExecutionTime)
call printf
lw ra, 0(sp)
addi sp, sp, 4
ret
```
Then, recompile and execute:
```bash
riscv-none-elf-gcc 2023-arch-hw2-handwritten.s -o 2023-arch-hw2-handwritten.elf
rv32emu 2023-arch-hw2-handwritten.elf
Encoded Bytes: 0x8100
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0xFFFFFFFF
Cycles: 7731
inferior exit code 0
```
#### Replace `div` Instructions
There are 4 `div` instructions used in the handwritten code:
- 1 for divide a value by 4:
The `div` instruction used here could be simply replaced with a right shift operation.
That is, we can change the following code:
```asm
encodeVariableByte:
li t0, 0x7f # bitmask1
li t1, 0x80 # bitmask2
li t2, 0 # i = 0
li t3, 4
...
first_register:
div a5, t2, t3
...
```
To:
```asm
encodeVariableByte:
li t0, 0x7f # bitmask1
li t1, 0x80 # bitmask2
li t2, 0 # i = 0
first_register:
srli a5, t2, 2
...
```
- 3 for divide a value by 7
To prevent dividing a small unsigned integer which is less less than 64, by 7 operation from using the `div` instruction, we can simply declare a table for fast lookup:
```asm
div7_table:
.word 0, 0, 0, 0, 0, 0
.word 1, 1, 1, 1, 1, 1
.word 2, 2, 2, 2, 2, 2
.word 3, 3, 3, 3, 3, 3
.word 4, 4, 4, 4, 4, 4
.word 5, 5, 5, 5, 5, 5
.word 6, 6, 6, 6, 6, 6
.word 7, 7, 7, 7, 7, 7
.word 8, 8, 8, 8, 8, 8
.word 9, 9, 9, 9, 9, 9
```
With this table, we can replace the `div` instruction by using the `la` and `lw` instructions:
```asm
# li t1, 7
# div a0, a0, t1
la t1, div7_table
slli a0, a0, 2 # offset = CLZ*4, each element 4B
add a0, a0, t1 # base addr + offset
lw a0, 0(a0)
```
Now, recompile with the RV32I specific options (`-march=rv32i -mabi=ilp32`):
```bash
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o 2023-arch-hw2-handwritten.elf 2023-arch-hw2-handwritten.s
rv32emu 2023-arch-hw2-handwritten.elf
Encoded Bytes: 0x8100
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0xFFFFFFFF
Cycles: 7075
inferior exit code 0
```
## Disassemble
Use `riscv-none-elf-objdump -d <binary>` to get the disassembled code.
The execution cycles of the original hanwritten assembly code is 7075.
### Case 1 - `-O3`
```bash
riscv-none-elf-gcc -O3 -march=rv32i -mabi=ilp32 -o VariableByte.elf VariableByte.c
rv32emu VariableByte.elf
```
```text
2
Encoded Bytes: 8100
Encoded Bytes: ffffff7f
Encoded Bytes: ffffff7f ffffffff
cycles: 7876
inferior exit code 0
```
#### Observation 1 - Check Branch Condition First
```c
void encodeVariableByte(uint64_t value, uint32_t *encodebytes, int len) {
uint32_t bitmask1 = 0x7f;
uint32_t bitmask2 = 0x80;
for (int i = 0; i < len; i++) {
...
}
}
```
In the corresponding assembly code generated by the compiler, because the operations before the for loop are just simple value assignments, the branch condition `i < len` with be checked before these operations:
```asm
encodeVariableByte:
ble a3,zero,.L3
```
So that the two assignments could be skipped when the input `len` is zero.
### Case 2 - `-Os`
```bash
riscv-none-elf-gcc -Os -march=rv32i -mabi=ilp32 -o VariableByte.elf VariableByte.c
rv32emu VariableByte.elf
```
```text
2
Encoded Bytes: 8100
Encoded Bytes: ffffff7f
Encoded Bytes: ffffff7f ffffffff
cycles: 8053
inferior exit code 0
```
## Optimization
The execution cycles of the original hanwritten assembly code is 7075.
### Handwritten Assembly
#### Redundant Jump 1
```asm
encodeVariableByte:
...
j first_register
first_register:
...
```
In the above code, the instruction `j` is a pseudoinstruction, and by refering the [official document](https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md#pseudoinstructions), the `j first_register` should be translated into `jal x0, offset`, and this operation is redundant here.
By removing the instruction `j first_register`, the execution cycles could be reduced from 7075 to 7072:
```bash
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o 2023-arch-hw2-handwritten.elf 2023-arch-hw2-handwritten.s
rv32emu 2023-arch-hw2-handwritten.elf
Encoded Bytes: 0x8100
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0xFFFFFFFF
Cycles: 7072
inferior exit code 0
```
#### Redundant Jump 2
```asm
exit:
jr ra
second_register:
...
beq t2, a0, exit
j second_register
reset_bitmask:
...
j second_register
```
Similarly, by moving the section `reset_bitmask` before `second_register` section:
```asm
exit:
jr ra
reset_bitmask:
...
or a2, a2, s1
second_register:
...
beq t2, a0, exit
j second_register
```
the last pseudoinstruction `j` could be removed, and the execution cycles could be further reduced to 7070.
```bash
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o 2023-arch-hw2-handwritten.elf 2023-arch-hw2-handwritten.s
rv32emu 2023-arch-hw2-handwritten.elf
Encoded Bytes: 0x8100
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0xFFFFFFFF
Cycles: 7070
inferior exit code 0
```
#### Redundant Register
In the original c code, we need to handle the case `i == 0`:
```c
void encodeVariableByte(uint64_t value, uint32_t *encodebytes, int len) {
...
encodebytes[i/4] |= i/4 == 0 ? ... : ...;
encodebytes[i/4] = i == 0 ? ... : ...;
...
}
}
```
And the handwritten code use a temporary register `t5` to keep the value 0 for only checking the special case mentioned above:
```asm
encodeVariableByte:
...
li t5, 0
first_register:
...
bne a5, t5, reset_bitmask
...
bne t2, t5, not_firstbyte
...
```
However, there is already one dedicate register `x0` that keeps the value 0. So we can replace the register `t5` in the handwritten code with `x0`:
```asm
```asm
encodeVariableByte:
...
# li t5, 0
...
first_register:
...
bne a5, x0, reset_bitmask
...
bne t2, x0, not_firstbyte
...
```
This can reduce 3 more cycles, and now the execution cycle is 7067:
```bash
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o 2023-arch-hw2-handwritten.elf 2023-arch-hw2-handwritten.s
rv32emu 2023-arch-hw2-handwritten.elf
Encoded Bytes: 0x8100
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0xFFFFFF7F
Encoded Bytes: 0xFFFFFFFF
Cycles: 7067
inferior exit code 0
```
### Optimize Case 2 - `-Os`
```bash
riscv-none-elf-gcc -Os -march=rv32i -mabi=ilp32 -o VariableByte.elf VariableByte.c
rv32emu VariableByte.elf
```
```text
2
Encoded Bytes: 8100
Encoded Bytes: ffffff7f
Encoded Bytes: ffffff7f ffffffff
cycles: 8053
inferior exit code 0
```
In the handwritten code, serveral constants used in the CLZ function are loaded from the `.data` section:
```asm
.data
testdata:
.word 0, 128
.word 0, 0xfffffff
.word 0xffffff, 0xffffffff
and1: .word 0x33333333, 0x33333333
and2: .word 0x0f0f0f0f, 0x0f0f0f0f
and3: .word 0x55555555, 0x55555555
...
CLZ:
...
# load 0x5555555555555555
la t0, and3
lw t1, 0(t0)
lw t2, 4(t0)
...
```
However, the generated assembly code set the constants using 2 instructions:
```asm
count_leading_zeros:
...
li a2,1431654400 # set 0x55555000
addi a2,a2,1365 # add 0x00000555
...
li a2,858992640 # set 0x33333000
addi a2,a2,819 # add 0x00000333
...
li a3,252645376 # set 0x0F0F1000
addi a3,a3,-241 # sub 0x000000F1
...
```
There is no need to use 2 instruction to assign a constant, it can be optimized by using single `li`:
```asm
count_leading_zeros:
...
li a2,0x55555555 # set 0x55555555
...
li a2,0x33333333 # set 0x33333333
...
li a3,0x0F0F0F0F # set 0x0F0F0F0F
...
```
But this can neither reduce the elf size:
```bash
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o VariableByte.0.elf VariableByte.Os.0.S # original
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o VariableByte.1.elf VariableByte.Os.1.S # modified
ls -l VariableByte.*.elf
```
```text
-rwxr-xr-x 1 hsi hsi 95484 Oct 29 22:27 VariableByte.0.elf
-rwxr-xr-x 1 hsi hsi 95484 Oct 29 22:27 VariableByte.1.elf
```
nor make the program use less cycles:
```bash
rv32emu VariableByte.0.elf
rv32emu VariableByte.1.elf
```
```text
2
Encoded Bytes: 8100
Encoded Bytes: ffffff7f
Encoded Bytes: ffffff7f ffffffff
cycles: 8053
inferior exit code 0
2
Encoded Bytes: 8100
Encoded Bytes: ffffff7f
Encoded Bytes: ffffff7f ffffffff
cycles: 8053
inferior exit code 0
```
:::warning
The above assembly code is generated by the compiler with the `-S` option, which is different from the disassembled codes read from the objdump tool:
```asm
10358: 55555637 lui a2,0x55555
1035c: 55560613 add a2,a2,1365 # 55555555 <__BSS_END__+0x55531645>
```
:::
## References
- <https://hackmd.io/@fewletter/CAHW1>
- <https://github.com/fewletter/CAHW/tree/main/HW1>