# Assignment2: GNU Toolchain
contributed by < [`st10740`](https://github.com/st10740) >
## Question Selection
### Question
I choose the question from [陳浩文 (Implement palindrome detection and using CLZ)](https://hackmd.io/@DpuBXDwLSDCIE6cNJSjUlg/HyQ2i4nep).
### Motivation
The reason why I choose this question is that I want to try another algorthm to implement palindrome detection with less number of lines and compare the performances between them.
## Modified Code
### C code
Below is my modified version of `palindrome_detected`.
```c
int palindrome_detected(uint64_t x, int clz){
uint64_t nob = (64 - clz);
uint64_t checkEven = nob % 2;
uint64_t left = x >> (nob/2+checkEven);
uint64_t revRight = 0x0;
/* Reverse the right half of the input x*/
for(int i=0; i<nob/2; i++) {
uint64_t rightestBit = x & 0x1;
revRight = revRight << 1;
revRight |= rightestBit;
x = x >> 1;
}
return (left==revRight) ? 1 : 0;
}
```
#### Compare Implementation Difference
Let's take input `0b1100011` for example to compare the implementation difference between original and modified version.
The original version of palindrome detection is implemented by the following steps:
1. Use a variable `tempX` to store the left half of input. So the value stored in `tempX` is `0b110`. (If the number of bits excluding the leading zeros is odd, we ignore the middle bit.)
2. Use another variable `tempY` to store the right half of input. So the value stored in `tempY` is `0b011`.
3. Reverse the total **64 bits** of `tempX` and store the result in `revTempX`. After doing it, the value stored in `revTempX` become `0b0110000000000000000000000000000000000000000000000000000000000000`.
4. Shift right `revTempX` to make left half of input in the rightest place. Finally, `revTempX` become `0b011`.
5. Compare `revTempX` and `tempY`. If the value stored in both variables are the same meaning that input is palindrome, return `1`, otherwise, return `0`.
My modified version of implementation does similar things as original version. However, I only reverse half number of bits exclusive leading zeros to detect palindrome compared to orignal 64 bits. Below is my modified implementation steps:
1. Use a variable `left` to store the left half of valid input excluding middle one. So the value stored in `left` is `0b110`.
2. Use a variable `revRight` initialized to `0b0` to store the reversed right half of input.
3. Reverse the right half of input and simultaneously store the result in `revRight`. The for loop only iterate **half number of valid excluding middle one** times. So, in this case, for loop only iterate **3 times**, and `revRight` stores `0b110`.
4. Compare `left` and `revRight`. If the value stored in both variables are the same meaning that input is palindrome, return `1`, otherwise, return `0`.
#### Test Case Discovery
I notice that the orginal test case `testB` (0x0000000000000001) is not a palindrome case checking by his implementation, but it's palindrome by my implementation.
### Assembly Code
I fix some bug in the original assembly code to make it run properly. You can check the fixed code [here](https://github.com/st10740/Computer-Architecture-HW/blob/main/HW2/asm/palindrome_detection_using_CLZ_origin.s).
The bug I fix:
- The input of `count_leading_zeros` and `palindrome_detected` should be the same.
- The counter of `reverse_loop` should be 32, not 31.
Then, I turn the code into my modified implementation of palindrome detection just mentioned above based on fixed one. Below is the `palindrome_detected` part, you can check the complete code [here](https://github.com/st10740/Computer-Architecture-HW/blob/main/HW2/asm/palindrome_detection_using_CLZ_modify1.s).
```c
palindrome_detected:
addi sp, sp, -4
sw ra, 0(sp)
mv t1, a0 # low 32 bits
mv t0, a1 # high 32 bits
andi t5, a2, 0x1 # t5 = (input number is odd) ? 1 : 0
srai t4, a2, 1 # t4 = input bit number / 2
add t4, t4, t5 # t4 = input bit number / 2 + checkEven
# x >> (input bit number / 2 + checkEven)
li t5, 32
sub t4, t5, t4 # t4 = 32 - t4
sll t2, t0, t4
sub t4, t5, t4 # (32 - t4) back to t4
srl t3, t1, t4
or t3, t2, t3 # t3 = left half of input number
li t2, 0 # t2 = reversed right half of input number
li t5, 0 # t5 = i
srai t4, a2, 1 # t4 = input bit number / 2
reverse_loop:
srl t6, t1, t5
andi t6, t6, 1
slli t2, t2, 1
or t2, t2, t6
addi t5, t5, 1 # i++
blt t5, t4, reverse_loop
# compare different
beq t2, t3, isPalindrome
li a0, 0
jr ra
isPalindrome:
li a0, 1
jr ra
```
### Performance Comparison
First, I set up the environment mentioned in [Lab2: RISC-V RV32I[MACF] emulator with ELF support](https://hackmd.io/@sysprog/SJAR5XMmi).
Because everytime I restart terminal, I have to enter command line `$ source riscv-none-elf-gcc/setenv` to update `PATH` environment variable, I put this command into `~/.bashrc` to automatically execute it when I start new terminal.
Then, I add the following code to original and modified c code to count CSR cycle of both program. The test case I use is `0x00000C0000000003`.
```c
typedef uint64_t ticks;
static inline ticks getticks(void)
{
uint64_t result;
uint32_t l, h, h2;
asm volatile(
"rdcycleh %0\n"
"rdcycle %1\n"
"rdcycleh %2\n"
"sub %0, %0, %2\n"
"seqz %0, %0\n"
"sub %0, zero, %0\n"
"and %1, %1, %0\n"
: "=r"(h), "=r"(l), "=r"(h2));
result = (((uint64_t) h) << 32) | ((uint64_t) l);
return result;
}
int main(){
uint64_t test = 0x00000C0000000003; //test is palindrome
ticks t0 = getticks();
printf("%d\n", palindrome_detected(test, count_leading_zeros(test)));
ticks t1 = getticks();
printf("elapsed cycle: %" PRIu64 "\n", t1 - t0);
}
```
#### Origin
- Compile
```shell
$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 palindrome_detection_using_CLZ_origin.c -o palindrome_detection_using_CLZ_origin.elf
```
- Run `palindrome_detection_using_CLZ_origin.elf `
```shell
$ ~/rv32emu/build/rv32emu palindrome_detection_using_CLZ_origin.elf
```
- Output
```
1
elapsed cycle: 4587
inferor exit code 0
```
#### Modified
- Compile
```shell
$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 palindrome_detection_using_CLZ_modified.c -o palindrome_detection_using_CLZ_modified.elf
```
- Run `palindrome_detection_using_CLZ_modified.elf `
```shell
$ ~/rv32emu/build/rv32emu palindrome_detection_using_CLZ_modified.elf
```
- Output
```
1
elapsed cycle: 2896
inferor exit code 0
```
We can see that the CSR cycle is `2896` for modified code comparing to `4587` for origin.
## Assembly Optimization by riscv-none-elf-gcc
There are multiple optimization levels can be used when compiling. Below is the introduction of each optimization level I used in this homework:
- `-O0` : This level turns off optimization. The result is the same as no `-O` level.
- `-O1` : The basic optimization level. The compiler will try to speedup the program and reduce code size without taking too much compilation time.
- `-O2` : A step up for `-O1`. The compiler will try to increase performance without compromising code size and without taking too much compilation time.
- `-O3` : Enables `-O2` as well as optimizations that are expensive in terms of compile time and memory usage.
- `-Os` : Optimizes code for size.
Since I want to compare original and modified code with five optimization methods and it will take too much time for me to type command line one by one, I use shell script to execute all the command at once.
```shell
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 palindrome_detection_using_CLZ_origin.c -o origin_O0.elf
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O1 palindrome_detection_using_CLZ_origin.c -o origin_O1.elf
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O2 palindrome_detection_using_CLZ_origin.c -o origin_O2.elf
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O3 palindrome_detection_using_CLZ_origin.c -o origin_O3.elf
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Os palindrome_detection_using_CLZ_origin.c -o origin_Os.elf
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 palindrome_detection_using_CLZ_modified.c -o modified_O0.elf
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O1 palindrome_detection_using_CLZ_modified.c -o modified_O1.elf
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O2 palindrome_detection_using_CLZ_modified.c -o modified_O2.elf
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O3 palindrome_detection_using_CLZ_modified.c -o modified_O3.elf
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Os palindrome_detection_using_CLZ_modified.c -o modified_Os.elf
riscv-none-elf-objdump -d origin_O0.elf > disassembly/origin_O0.txt
riscv-none-elf-objdump -d origin_O1.elf > disassembly/origin_O1.txt
riscv-none-elf-objdump -d origin_O2.elf > disassembly/origin_O2.txt
riscv-none-elf-objdump -d origin_O3.elf > disassembly/origin_O3.txt
riscv-none-elf-objdump -d origin_Os.elf > disassembly/origin_Os.txt
riscv-none-elf-objdump -d modified_O0.elf > disassembly/modified_O0.txt
riscv-none-elf-objdump -d modified_O1.elf > disassembly/modified_O1.txt
riscv-none-elf-objdump -d modified_O2.elf > disassembly/modified_O2.txt
riscv-none-elf-objdump -d modified_O3.elf > disassembly/modified_O3.txt
riscv-none-elf-objdump -d modified_Os.elf > disassembly/modified_Os.txt
```
You can check the complete disassembly code from [here](https://github.com/st10740/Computer-Architecture-HW/tree/main/HW2/c/disassembly).
Because there are over **20000** lines of code. I will only analyze `palindrome_detected` part with different optimization level.
### -O0
- Observation
- Lines of code
| | original | modified |
| -------- | -------- | -------- |
| Number of lines | 242 | 138 |
- Register used
| Register | original | modified |
| -------- | -------- | -------- |
| s-type | s0, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11 | s0, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11 |
| a-type | a0, a1, a2, a3, a4, a5, a6, a7 | a0, a1, a2, a3, a4, a5, a6, a7 |
| t-type | t1, t2, t3, t4, t5, t6 | t1, t2, t3, t4, t5, t6 |
| other | sp, zero | sp, zero |
| Total | 27 | 27 |
- Stack used
| | original | modified |
| -------- | -------- | -------- |
| Stack used| 144 | 112 |
- lw/sw count
| | original | modified |
| -------- | -------- | -------- |
| lw/sw count | 114 | 73 |
### -O1
- Observation
- Lines of code
| | original | modified |
| -------- | -------- | -------- |
| Number of lines | 90 | 54 |
- Register used
| Register | original | modified |
| -------- | -------- | -------- |
| s-type | None | None |
| a-type | a0, a1, a2, a3, a4, a5, a6, a7 | a0, a1, a2, a3, a4, a5, a6, a7 |
| t-type | t1, t3 | t1, t3, t4, t5 |
| other | None | None |
| Total | 10 | 12 |
- Stack used
| | original | modified |
| -------- | -------- | -------- |
| Stack used| 0 | 0 |
- lw/sw count
| | original | modified |
| -------- | -------- | -------- |
| lw/sw count | 0 | 0 |
### -O2
- Observation
- Lines of code
| | original | modified |
| -------- | -------- | -------- |
| Number of lines | 97 | 44 |
- Register used
| Register | original | modified |
| -------- | -------- | -------- |
| s-type | None | None |
| a-type | a0, a1, a2, a3, a4, a5, a6, a7 | a0, a1, a2, a3, a4, a5, a6, a7 |
| t-type | t1, t3, t4, t5 | t1, t3, t4 |
| other | None | None |
| Total | 12 | 11 |
- Stack used
| | original | modified |
| -------- | -------- | -------- |
| Stack used| 0 | 0 |
- lw/sw count
| | original | modified |
| -------- | -------- | -------- |
| lw/sw count | 0 | 0 |
### -O3
- Observation
- Lines of code
| | original | modified |
| -------- | -------- | -------- |
| Number of lines | 97 | 44 |
- Register used
| Register | original | modified |
| -------- | -------- | -------- |
| s-type | None | None |
| a-type | a0, a1, a2, a3, a4, a5, a6, a7 | a0, a1, a2, a3, a4, a5, a6, a7 |
| t-type | t1, t3, t4, t5 | t1, t3, t4 |
| other | None | None |
| Total | 12 | 11 |
- Stack used
| | original | modified |
| -------- | -------- | -------- |
| Stack used| 0 | 0 |
- lw/sw count
| | original | modified |
| -------- | -------- | -------- |
| lw/sw count | 0 | 0 |
### -Os
- Observation
- Lines of code
| | original | modified |
| -------- | -------- | -------- |
| Number of lines | 66 | 42 |
- Register used
| Register | original | modified |
| -------- | -------- | -------- |
| s-type | s0, s1, s2, s3, s4, s5, s6 | s0, s1, s2 |
| a-type | a0, a1, a2, a3, a4, a5 | a0, a1, a2, a3, a4, a5, a6 |
| t-type | None | None |
| other | sp, ra | sp, ra |
| Total | 15 | 12 |
- Stack used
| | original | modified |
| -------- | -------- | -------- |
| Stack used| 48 | 16 |
- lw/sw count
| | original | modified |
| -------- | -------- | -------- |
| lw/sw count | 19 | 8 |
### Summary of Optimization
Below is the comparison of **`palindrome_detected`** part between original and modified code.
- Original
| | -O0 | -O1 | -O2 | -O3 | -Os |
| -------- | -------- | -------- | --- | --- | --- |
| Number of lines | 242 | 90 | 97 | 97 | 66 |
| Register number | 27 | 10 | 12 | 12 | 15 |
| Stack used | 144 | 0 | 0 | 0 | 48 |
| lw/sw count | 114 | 0 | 0 | 0 | 19 |
- Modified
| | -O0 | -O1 | -O2 | -O3 | -Os |
| -------- | -------- | -------- | --- | --- | --- |
| Number of lines | 138 | 54 | 44 | 44 | 42 |
| Register number | 27 | 12 | 11 | 11 | 12 |
| Stack used | 112 | 0 | 0 | 0 | 16 |
| lw/sw count | 73 | 0 | 0 | 0 | 8 |
- Analysis
- **`-O0` to `-O1, -O2, -O3` :** `-O0` uses a lot of lw/sw instructions and stores the value in the stack. Because executing lw/sw takes a lot of time, to improve the performance, `-O1`, `-O2` and `-O3` get rid of using lw/sw, only store the value in the registers.
- **`-O2` to `-O3` :** To my surprise, the results of `-O2` and `-O3` are the same. So, I further check the disassembly code of `palindrome_detected` part of `-O2` and `-O3` and find out that they are entirely the same.
- **`-Os` :** `-Os` optimizes the size of code but sacrifices performance.
In addition, I also compare CSR cycle of the **whole program** between original and modified c code by using the same method as [Performance Comparison part of this homework](###Performance-Comparison).
| CSR cycle | -O0 | -O1 | -O2 | -O3 | -Os |
| -------- | -------- | -------- | --- | --- | --- |
| Original | 4587 | 2109 | 2105 | 2105 | 2148 |
| Modified | 2896 | 1698 | 1585 | 1559 | 1616 |
Considering the results, my discoveries are
- The CSR cycle decreases when the optimization level increases.
- The difference of CSR cycle between original and modified code dramatically decreases when the optimization is turned on.
- `-Os` can reduce code size but the effect of performance improvement may not be as great as other optimization levels.
## Handwritten Assembly Optimization
In this section, I will try to optimize the modified version assembly code.
### Loop Unrolling
Since loop for reverting the right half of input number in `palindrome_detected` iterates through several iterations when the MSB of input is at high place, I unroll the loop 2 times.
```c
palindrome_detected:
addi sp, sp, -4
sw ra, 0(sp)
mv t1, a0 # low 32 bits
mv t0, a1 # high 32 bits
andi t5, a2, 0x1 # t5 = (input number is odd) ? 1 : 0
srai t4, a2, 1 # t4 = input bit number / 2
add t4, t4, t5 # t4 = input bit number / 2 + checkEven
# x >> (input bit number / 2 + checkEven)
li t5, 32
sub t4, t5, t4 # t4 = 32 - t4
sll t2, t0, t4
sub t4, t5, t4 # (32 - t4) back to t4
srl t3, t1, t4
or t3, t2, t3 # t3 = left half of input number
li t2, 0 # t2 = reversed right half of input number
li t5, 0 # t5 = i
srai t4, a2, 1 # t4 = input bit number / 2
# Loop unrolling: Unroll the loop 2 times
reverse_2_loop:
srl t6, t1, t5
andi t6, t6, 1
slli t2, t2, 1
or t2, t2, t6
addi t5, t5, 1 # i++
srl t6, t1, t5
andi t6, t6, 1
slli t2, t2, 1
or t2, t2, t6
addi t5, t5, 1 # i++
blt t5, t4, reverse_2_loop
# compare different
beq t2, t3, isPalindrome
li a0, 0
jr ra
isPalindrome:
li a0, 1
jr ra
```
### Performance Comparision
I call my three versions assembly code from c and also call `get_cycles.s` from [rv32emu/tests/perfcount](https://github.com/sysprog21/rv32emu/blob/master/tests/perfcounter/getcycles.S) to get cycle time in order to calculate CSR cycle of each program.
- `get_cycles.s`
```c
.text
.globl get_cycles
.align 2
get_cycles:
csrr a1, cycleh
csrr a0, cycle
csrr a2, cycleh
bne a1, a2, get_cycles
ret
.size get_cycles,.-get_cycles
```
- Calculate each handwritten assembly code CSR cycle via below c code.
```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
extern uint64_t get_cycles();
extern void palindrome_detection_origin();
extern void palindrome_detection_modify1();
extern void palindrome_detection_modify2();
int main(void)
{
/* measure cycles */
/* origin */
uint64_t oldcount = get_cycles();
palindrome_detection_origin();
uint64_t cyclecount = get_cycles() - oldcount;
printf("origin cycle count: %u\n", (unsigned int) cyclecount);
/* modify1 */
uint64_t oldcount = get_cycles();
palindrome_detection_modify1();
uint64_t cyclecount = get_cycles() - oldcount;
printf("modify1 cycle count: %u\n", (unsigned int) cyclecount);
/* modify2 */
uint64_t oldcount = get_cycles();
palindrome_detection_modify2();
uint64_t cyclecount = get_cycles() - oldcount;
printf("modify2 cycle count: %u\n", (unsigned int) cyclecount);
return 0;
}
```
- In order to make my assembly code can be called from c, I have to change some places. Take `palindrome_detection_using_CLZ_origin.s` for example :
```diff
.text
- .global main
+ .global palindrome_detection_origin
+ .type palindrome_detection_origin, %function
- main:
+ palindrome_detection_origin:
+ addi sp, sp, -4
+ sw ra, 0(sp)
li a0, 0x00000003
li a1, 0x00000C00
jal ra, count_leading_zeros
# check is palindrome or not
li a2, 64
sub a2, a2, a0
li a0, 0x00000003
li a1, 0x00000C00
jal ra, palindrome_detected
# Exit program
- li a7, 10
- li a0, 0
- ecall
+ lw ra, 0(sp)
+ addi sp, sp, 4
+ ret
```
:::info
I stuck in not successfully returning from assembly function to c code multiple days, and finally find out that I have to use `ret` to return to c and also have to store the value stored in `ra` at the beginning of entering assembly function because the value stored in `ra` changes when program excuting.
:::
#### Result
| CSR cycle | |
| -------- | -------- |
| Original | 349 |
| Modified | 278 |
| Modified + Loop unrolling | 267 |
By the way, the reason why CSR cycle count of assembly code is substantially less than c code is that I don't output the result of palindrome detection to console in assembly code, but do it in c code.
## Reference
- https://hackmd.io/@sysprog/2023-arch-homework1
- https://hackmd.io/@sysprog/2023-arch-homework2
- https://hackmd.io/@sysprog/SJAR5XMmi
- https://wiki.gentoo.org/wiki/GCC_optimization