# Assignment2: GNU Toolchain
Contributed by [yptang5488](https://github.com/yptang5488)
## Install riscv rv32emu toolchain
Follow the instruction in [Lab2](https://hackmd.io/@sysprog/SJAR5XMmi) document.
## Pick a Question
The following question is picked from the Assignment 1.
> from *GliAmanti*'s Assignment1 : [Implementation of Bit-Plane Slicing with CLZ](https://hackmd.io/@GliAmanti/BJlxAv651a)
>
> In Bit-plane slicing, we divide image into 8 planes. Plane 1 contains all LSBs in the bytes comprising the pixels in the image and Plane 8 contains all MSBs. It’s a useful way for image enhancement, since human eye is sensitive to digital with high frequency. And we only highlight the contribution made to total image appearance by specific high level bits.

In this problem, we will use the CLZ function to get the highest bit, and then maximize the other values with the same number of bits to `255`, and set the other values to `0`.
The motivation for me to choose this topic is that bit-plane slicing is a concept that is used in image processing, and I think it's a very relevant topic for practical purposes.
The original implementation of the question is as follows :
```c
void bit_plane_slicing(int caseNum, int size, uint32_t img[]){
int32_t max = -1;
for (int i = 0; i < size; i++){
int32_t LZ = (int32_t)count_leading_zeros(img[i]);
int32_t MSB = 32 - LZ - 1;
img[i] = MSB;
if(MSB > max) max = MSB;
}
printf("Test %d: ", caseNum);
for (int i = 0; i < size; i++){
if(img[i] == max){
img[i] = 255;
}else{
img[i] = 0;
}
printf("%u ", img[i]);
}
printf("\n");
}
```
The original RISC-V assembly implementation is as follows :
:::spoiler Assembly code
```c
.data
test0: .word 255, 0, 128, 1
test1: .word 167, 133, 111, 144, 140, 135, 159, 154, 148
test2: .word 50, 100, 150, 200, 250, 255
testStr: .string "Test "
colonStr: .string ": "
newlineStr: .string "\n"
spaceStr: .string " "
.text
main:
la a0, test0 # a0: the base address of test0
addi a1, x0, 4 # a1: the size of test0
addi a2, x0, 0 # a2: caseNum = 0
jal ra, bit_plane_slicing
la a0, test1 # a0: the base address of test1
addi a1, x0, 9 # a1: the size of test1
addi a2, x0, 1 # a2: caseNum = 1
jal ra, bit_plane_slicing
la a0, test2 # a0: the base address of test2
addi a1, x0, 6 # a1: the size of test2
addi a2, x0, 2 # a2: caseNum = 2
jal ra, bit_plane_slicing
li a7, 10 # Exit program
ecall
bit_plane_slicing:
addi s0, x0, -1 # s0: max = -1
add s1, x0, x0 # s1: i = 0
clzLoop:
slli t2, s1, 2 # t2: array offset = i*4
add t2, t2, a0 # t2: base addr + offset
lw s4, 0(t2) # s4: arr[i]
addi sp, sp, -8
sw ra, 0(sp)
sw a0, 4(sp)
add a0, x0, s4 # load arr[i] to parameter register a0
jal ra, count_leading_zeros
lw ra, 0(sp)
lw a0, 4(sp)
addi sp, sp, 8
add s2, x0, a3 # s2: LZ
addi t1, x0, 31 # t1: 32 - 1
sub s3, t1, s2 # s3: MSB = 32 - 1 - LZ
sw s3, 0(t2) # arr[i] = MSB
bge s3, s0, isGreater # if(MSB > max)
goOn:
addi s1, s1, 1 # i = i + 1
blt s1, a1, clzLoop
addi sp, sp, -8
sw ra, 0(sp)
sw a0, 4(sp)
jal ra, printLoopOutside # printf("Test %d: ", caseNum) || printf("\n")
lw ra, 0(sp)
lw a0, 4(sp)
addi sp, sp, 8
add s1, x0, x0 # s1: i = 0
reconLoop:
slli t2, s1, 2 # t2: array offset = i*4
add t2, t2, a0 # t2: base addr + offset
lw s4, 0(t2) # s4: arr[i]
beq s4, s0, isEqual # if(arr[i] == max)
add s4, x0, x0 # arr[i] = 0
goOn2:
addi sp, sp, -8
sw ra, 0(sp)
sw a0, 4(sp)
jal ra, printLoopInside # printf("%u ", arr[i])
lw ra, 0(sp)
lw a0, 4(sp)
addi sp, sp, 8
addi s1, s1, 1 # i = i + 1
blt s1, a1, reconLoop
ret
isGreater:
add s0, x0, s3 # max = MSB
j goOn
isEqual:
addi s4, x0, 255 # arr[i] = 255
j goOn2
count_leading_zeros:
srli t0, a0, 1 # t0: x >> 1
or a0, a0, t0 # a0: x |= (x >> 1)
srli t0, a0, 2 # t0: x >> 2
or a0, a0, t0
srli t0, a0, 4 # t0: x >> 4
or a0, a0, t0
srli t0, a0, 8 # t0: x >> 8
or a0, a0, t0
srli t0, a0, 16 # t0: x >> 16
or a0, a0, t0
count_ones:
srli t0, a0, 1 # t0: x >> 1
lui t1, 0x55555 # t1: 0x55555555
ori t1, t1, 0x555
and t0, t0, t1 # t0: (x >> 1) & 0x55555555
sub a0, a0, t0
srli t0, a0, 2 # t0: x >> 2
lui t1, 0x33333 # t1: 0x33333333
ori t1, t1, 0x333
and t0, t0, t1 # t0: (x >> 2) & 0x33333333
and t4, a0, t1 # t4: x & 0x33333333
add a0, t0, t4
srli t0, a0, 4 # t0: x >> 4
add t0, t0, a0 # t0: (x >> 4) + x
lui t1, 0x0f0f0 # t1: 0x0f0f0f0f
ori t1, t1, 0x787
addi t1, t1, 0x788
and a0, t0, t1 # a0: ((x >> 4) + x) & 0x0f0f0f0f
srli t0, a0, 8 # t0: x >> 8
add a0, a0, t0
srli t0, a0, 16 # t0: x >> 16
add a0, a0, t0
andi t0, a0, 0x7f # t0: x & 0x7f
addi t1, x0, 32 # t1: 32
sub a3, t1, t0 # a3: return (32 - (x & 0x7f))
ret
printLoopOutside:
la a0, newlineStr
li a7, 4
ecall
la a0, testStr
li a7, 4
ecall
add a0, x0, a2
li a7, 1
ecall
la a0, colonStr
li a7, 4
ecall
ret
printLoopInside:
add a0, x0, s4
li a7, 1
ecall
la a0, spaceStr
li a7, 4
ecall
ret
```
:::
## adapt the assembly to rv32emu
I have modified the following partition in assembly code to eliminate the incompatibility between the `Ripes` and `rv32emu`.
```shell
$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -nostartfiles -nostdlib -o slicing_s slicing_s.s
$ rv32emu/build/rv32emu slicing_s
```
### 1. Program Start
I rename the `main` function to `_start` so that the linker can recognize the starting address.
```c
.org 0
.global _start
_start:
la a0, test1 # a0: the base address of test0
addi a1, x0, 4 # a1: the size of test0
addi a2, x0, 0 # a2: caseNum = 0
jal ra, bit_plane_slicing
...
```
### 2. Data Definition
I found that the string definition `.string` have to be changed to `.ascii`. In addition, the definition is usually followed by the size of the data for later use.
So, I change the original code :
```c
testStr: .string "Test"
```
to the following code :
```c
.section .rodata
testStr: .ascii "Test"
.set testStr_size, .-testStr
```
### 3. System Call
The following changes were made with reference to [docs/syscall.md](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md).
I define commonly used system call numbers for future use in advance.
```c
/* newlib system calls */
.set SYSEXIT, 93
.set SYSWRITE, 64
.set STDOUT, 1
```
In the system output operation, defining `a1` as the system return value instead of using `a0`.
- `a7` : system call number
- `a0` : stdout
- `a1` : output value
- `a2` : size of the output value
So, I change the original code :
```c
print:
la a0, testStr
li a7, 4
ecall
```
to the following code :
```c
print:
li a7, SYSWRITE
li a0, STDOUT
la a1, testStr
li a2, testStr_size
ecall
```
It's worth noting that we can't print out integer types directly, but have to convert each digit to ASCII and then output it as a string. In this program, the output values will only be `0` and `255`, so instead of processing the numbers I just print string to speed things up.
### Output
Finally, I can print the answer like this :
```python
Test0: 255 0 255 0
Test1: 255 255 255 255 255 255 255 255 255
Test2: 255 255 255 255 255 255
```
> Answer
> Test0: 255 0 255 0
> Test1: 255 255 0 255 255 255 255 255 255
> Test2: 0 0 255 255 255 255
I encountered the problem that only the first data in my definition can get the correct answer. There is the array definition in my code :
```c
.data
test0: .word 255, 0, 128, 1
.set arr_size0, .test0
test1: .word 167, 133, 111, 144, 140, 135, 159, 154, 148
.set arr_size1, .test1
test2: .word 50, 100, 150, 200, 250, 255
.set arr_size2, .test2
```
I have tried switching the execution order, but still only the first defined data can be executed correctly. I guess that according to my code, there is something missing under the rv32emu compiler, but I haven't found a solution.
## ELF files produced by C compiler
### complie the program
```shell
$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Wall -O0 -c -o slicing_c.o slicing_c.c
$ riscv-none-elf-gcc -o slicing.elf slicing_c.o
```
:::spoiler makefile
```shell
.PHONY: clean
include toolchain.mk
ASFLAGS = -march=rv32i -mabi=ilp32
LDFLAGS = --oformat=elf32-littleriscv
CFLAGS = -Wall -O0
OBJS = slicing_c.o
%.o: %.s
riscv-none-elf-gcc $(ASFLAGS) -c -o $@ $<
%.o: %.c
riscv-none-elf-gcc $(ASFLAGS) $(CFLAGS) -c -o $@ $<
all: slicing.elf
slicing.elf: $(OBJS)
riscv-none-elf-gcc -o $@ $^
clean:
$(RM) $(OBJS)
```
:::
### deal with warning
There is a warning when the original c program is complied by `riscv-none-elf-gcc`.
```shell
slicing_c.c:52:18: warning: format '%u' expects argument of type 'unsigned int', but argument 2 has type 'uint32_t' {aka 'long unsigned int'} [-Wformat=]
52 | printf("%u ", img[i]);
| ~^ ~~~~~~
| | |
| | uint32_t {aka long unsigned int}
| unsigned int
| %lu
```
So, I change the original code :
```c
printf("%u ", img[i]);
```
to the following code :
```c
printf("%lu ", img[i]);
```
### gcc optimization
I write a shell script to run `riscv-none-elf-objdump`, `riscv-none-elf-readelf`, `riscv-none-elf-size` to get the information of the program and `build/rv32emu` to execute the c program. To get the cycle of the program, I use the [ticks.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c) function to get the cycle of different test case respectively.
:::spoiler shell script
```shell=
#!/bin/bash
FILE_NAME="slicing.elf"
OPTIM=0 # replace with different optimization choice
riscv-none-elf-objdump -d ${FILE_NAME} > record/objdump_O${OPTIM}.txt
riscv-none-elf-readelf -h ${FILE_NAME}
riscv-none-elf-size ${FILE_NAME}
rv32emu/build/rv32emu ${FILE_NAME}
```
:::
And then, I change the optimization flag from `-O0` (no optimization) to `-O1`, `-O2`, `-O3`, `-Os`, `-Ofast` and check their elapsed cycle respectively.
```shell
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Wall -O${OPTIM} -c -o slicing_c.o slicing_c.c
riscv-none-elf-gcc -o slicing.elf slicing_c.o
```
| optimization | text | data | bss | dec | hex | elapsed cycle |
|:------------:|:-----:|:----:|:----:|:-----:|:----:|:-------------:|
| -O0 | 52714 | 1876 | 1528 | 56118 | db36 | 21581 |
| -O1 | 52198 | 1876 | 1528 | 55602 | d932 | 20054 |
| -O2 | 52184 | 1876 | 1528 | 55588 | d924 | 20001 |
| -O3 | 52396 | 1876 | 1528 | 55800 | d9f8 | 19855 |
| -Os | 52042 | 1876 | 1528 | 55446 | d896 | 20059 |
| -Ofast | 52396 | 1876 | 1528 | 55800 | d9f8 | 19855 |
From the table, we can see that the optimization of `-O3` is mainly aimed at reducing the execution time and less concerned about the size of the program. `-Os` is also based on `-O2`, but the gola is to minimize the size of the target code.
## Optimization the program
I optimized the hand-written assembly code and I run it directly on Ripes for quick access the information about cycle and clock, etc. There is the information of the original hand-written assembly code :

### 1. Reduce the number of branches (version 1)
I found that there are two branches `isGreater`, `isEqual` in the assembly that can be modified and fine-tuned to reduce the number of branches.
- `isGreater` is the branch that can be simplified.
before modification :
```c
clzLoop:
...
bge s3, s0, isGreater # if(MSB > max)
goOn:
...
...
isGreater: # redundant branch
add s0, x0, s3 # max = MSB
j goOn
```
modified :
```c
clzLoop:
...
blt s3, s0, goOn
add s0, x0, s3 # if(max < MSB) max = MSB
goOn:
...
```
- move `isEqual` to reduce jump
before modification :
```c
reconLoop:
...
beq s4, s0, isEqual # if(arr[i] == max)
add s4, x0, x0 # arr[i] = 0
goOn2:
...
...
isEqual: # the branch that have to jump to the above command far away
addi s4, x0, 255 # arr[i] = 255
j goOn2
```
modified :
```c
reconLoop:
...
beq s4, s0, isEqual # if(arr[i] == max)
add s4, x0, x0 # arr[i] = 0
j goOn2
isEqual:
addi s4, x0, 255 # arr[i] = 255
goOn2:
...
```
After the modification, a slight decrease in cycle counts can be noticed.

### 2. Reduce the times of load/store : a0 (version 2)
The register `a0` is usually used to access the return value of the callee function. I found that in this program, `a0` is not only used to access the return value, but also for many parameter accesses. Therefore, each time a function is called, the caller function have to load and store `a0`, which takes a lot of time to move data in and out of memory.
```c
# when calling "printLoopInside" function
addi sp, sp, -8
sw ra, 0(sp)
sw a0, 4(sp)
jal ra, printLoopInside # printf("%u ", arr[i])
lw ra, 0(sp)
lw a0, 4(sp)
addi sp, sp, 8
```
I changed the non-returned parameter values to be stored in other registers, so that it is not necessary to read/write and access `a0` every time. The modification is as follows :
```c
# when calling "printLoopInside" function
addi sp, sp, -4
sw ra, 0(sp)
jal ra, printLoopInside # printf("%u ", arr[i])
lw ra, 0(sp)
addi sp, sp, 4
```
The result does succeed in reducing the number of the cycles.

### 3. Reduce the times of load/store : ra (version 3)
In the original program, there are several levels of calls that use `jal` to record the return address `ra`. So, there are many `ra` read and write operations when the callee function is called.
I found that the output functions `printLoopInside`, `printLoopOutside` were all called in the same place, so I moved them directly into the loop to minimize the numbers of branching and read-write of `ra` register.
before modification :
```c
goOn2:
addi sp, sp, -4
sw ra, 0(sp)
jal ra, printLoopInside # printf("%u ", arr[i])
lw ra, 0(sp)
addi sp, sp, 4
addi s1, s1, 1 # i = i + 1
...
...
printLoopInside:
add a0, x0, s4
li a7, 1
ecall
la a0, spaceStr
li a7, 4
ecall
ret
```
modified :
```c
goOn2:
add a0, x0, s4
li a7, 1
ecall
la a0, spaceStr
li a7, 4
ecall
addi s1, s1, 1 # i = i + 1
...
...
```

Optimization results table :
| optimization | cycles | instrs. retired | CPI | IPC |
|:------------:|:------:|:---------------:|:----:|:-----:|
| original | 2054 | 1569 | 1.31 | 0.764 |
| version 1 | 1953 | 1544 | 1.26 | 0.791 |
| version 2 | 1871 | 1462 | 1.28 | 0.781 |
| version 3 | 1651 | 1330 | 1.24 | 0.806 |
I reduced cycle counts by 19.62% compared to the initial program.
:::warning
You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution.
:notes: jserv
:::