# Assignment2: GNU Toolchain contributed by < [`st10740`](https://github.com/st10740) > ## Question Selection ### Question I choose the question from [陳浩文 (Implement palindrome detection and using CLZ)](https://hackmd.io/@DpuBXDwLSDCIE6cNJSjUlg/HyQ2i4nep). ### Motivation The reason why I choose this question is that I want to try another algorthm to implement palindrome detection with less number of lines and compare the performances between them. ## Modified Code ### C code Below is my modified version of `palindrome_detected`. ```c int palindrome_detected(uint64_t x, int clz){ uint64_t nob = (64 - clz); uint64_t checkEven = nob % 2; uint64_t left = x >> (nob/2+checkEven); uint64_t revRight = 0x0; /* Reverse the right half of the input x*/ for(int i=0; i<nob/2; i++) { uint64_t rightestBit = x & 0x1; revRight = revRight << 1; revRight |= rightestBit; x = x >> 1; } return (left==revRight) ? 1 : 0; } ``` #### Compare Implementation Difference Let's take input `0b1100011` for example to compare the implementation difference between original and modified version. The original version of palindrome detection is implemented by the following steps: 1. Use a variable `tempX` to store the left half of input. So the value stored in `tempX` is `0b110`. (If the number of bits excluding the leading zeros is odd, we ignore the middle bit.) 2. Use another variable `tempY` to store the right half of input. So the value stored in `tempY` is `0b011`. 3. Reverse the total **64 bits** of `tempX` and store the result in `revTempX`. After doing it, the value stored in `revTempX` become `0b0110000000000000000000000000000000000000000000000000000000000000`. 4. Shift right `revTempX` to make left half of input in the rightest place. Finally, `revTempX` become `0b011`. 5. Compare `revTempX` and `tempY`. If the value stored in both variables are the same meaning that input is palindrome, return `1`, otherwise, return `0`. My modified version of implementation does similar things as original version. However, I only reverse half number of bits exclusive leading zeros to detect palindrome compared to orignal 64 bits. Below is my modified implementation steps: 1. Use a variable `left` to store the left half of valid input excluding middle one. So the value stored in `left` is `0b110`. 2. Use a variable `revRight` initialized to `0b0` to store the reversed right half of input. 3. Reverse the right half of input and simultaneously store the result in `revRight`. The for loop only iterate **half number of valid excluding middle one** times. So, in this case, for loop only iterate **3 times**, and `revRight` stores `0b110`. 4. Compare `left` and `revRight`. If the value stored in both variables are the same meaning that input is palindrome, return `1`, otherwise, return `0`. #### Test Case Discovery I notice that the orginal test case `testB` (0x0000000000000001) is not a palindrome case checking by his implementation, but it's palindrome by my implementation. ### Assembly Code I fix some bug in the original assembly code to make it run properly. You can check the fixed code [here](https://github.com/st10740/Computer-Architecture-HW/blob/main/HW2/asm/palindrome_detection_using_CLZ_origin.s). The bug I fix: - The input of `count_leading_zeros` and `palindrome_detected` should be the same. - The counter of `reverse_loop` should be 32, not 31. Then, I turn the code into my modified implementation of palindrome detection just mentioned above based on fixed one. Below is the `palindrome_detected` part, you can check the complete code [here](https://github.com/st10740/Computer-Architecture-HW/blob/main/HW2/asm/palindrome_detection_using_CLZ_modify1.s). ```c palindrome_detected: addi sp, sp, -4 sw ra, 0(sp) mv t1, a0 # low 32 bits mv t0, a1 # high 32 bits andi t5, a2, 0x1 # t5 = (input number is odd) ? 1 : 0 srai t4, a2, 1 # t4 = input bit number / 2 add t4, t4, t5 # t4 = input bit number / 2 + checkEven # x >> (input bit number / 2 + checkEven) li t5, 32 sub t4, t5, t4 # t4 = 32 - t4 sll t2, t0, t4 sub t4, t5, t4 # (32 - t4) back to t4 srl t3, t1, t4 or t3, t2, t3 # t3 = left half of input number li t2, 0 # t2 = reversed right half of input number li t5, 0 # t5 = i srai t4, a2, 1 # t4 = input bit number / 2 reverse_loop: srl t6, t1, t5 andi t6, t6, 1 slli t2, t2, 1 or t2, t2, t6 addi t5, t5, 1 # i++ blt t5, t4, reverse_loop # compare different beq t2, t3, isPalindrome li a0, 0 jr ra isPalindrome: li a0, 1 jr ra ``` ### Performance Comparison First, I set up the environment mentioned in [Lab2: RISC-V RV32I[MACF] emulator with ELF support](https://hackmd.io/@sysprog/SJAR5XMmi). Because everytime I restart terminal, I have to enter command line `$ source riscv-none-elf-gcc/setenv` to update `PATH` environment variable, I put this command into `~/.bashrc` to automatically execute it when I start new terminal. Then, I add the following code to original and modified c code to count CSR cycle of both program. The test case I use is `0x00000C0000000003`. ```c typedef uint64_t ticks; static inline ticks getticks(void) { uint64_t result; uint32_t l, h, h2; asm volatile( "rdcycleh %0\n" "rdcycle %1\n" "rdcycleh %2\n" "sub %0, %0, %2\n" "seqz %0, %0\n" "sub %0, zero, %0\n" "and %1, %1, %0\n" : "=r"(h), "=r"(l), "=r"(h2)); result = (((uint64_t) h) << 32) | ((uint64_t) l); return result; } int main(){ uint64_t test = 0x00000C0000000003; //test is palindrome ticks t0 = getticks(); printf("%d\n", palindrome_detected(test, count_leading_zeros(test))); ticks t1 = getticks(); printf("elapsed cycle: %" PRIu64 "\n", t1 - t0); } ``` #### Origin - Compile ```shell $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 palindrome_detection_using_CLZ_origin.c -o palindrome_detection_using_CLZ_origin.elf ``` - Run `palindrome_detection_using_CLZ_origin.elf ` ```shell $ ~/rv32emu/build/rv32emu palindrome_detection_using_CLZ_origin.elf ``` - Output ``` 1 elapsed cycle: 4587 inferor exit code 0 ``` #### Modified - Compile ```shell $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 palindrome_detection_using_CLZ_modified.c -o palindrome_detection_using_CLZ_modified.elf ``` - Run `palindrome_detection_using_CLZ_modified.elf ` ```shell $ ~/rv32emu/build/rv32emu palindrome_detection_using_CLZ_modified.elf ``` - Output ``` 1 elapsed cycle: 2896 inferor exit code 0 ``` We can see that the CSR cycle is `2896` for modified code comparing to `4587` for origin. ## Assembly Optimization by riscv-none-elf-gcc There are multiple optimization levels can be used when compiling. Below is the introduction of each optimization level I used in this homework: - `-O0` : This level turns off optimization. The result is the same as no `-O` level. - `-O1` : The basic optimization level. The compiler will try to speedup the program and reduce code size without taking too much compilation time. - `-O2` : A step up for `-O1`. The compiler will try to increase performance without compromising code size and without taking too much compilation time. - `-O3` : Enables `-O2` as well as optimizations that are expensive in terms of compile time and memory usage. - `-Os` : Optimizes code for size. Since I want to compare original and modified code with five optimization methods and it will take too much time for me to type command line one by one, I use shell script to execute all the command at once. ```shell riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 palindrome_detection_using_CLZ_origin.c -o origin_O0.elf riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O1 palindrome_detection_using_CLZ_origin.c -o origin_O1.elf riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O2 palindrome_detection_using_CLZ_origin.c -o origin_O2.elf riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O3 palindrome_detection_using_CLZ_origin.c -o origin_O3.elf riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Os palindrome_detection_using_CLZ_origin.c -o origin_Os.elf riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 palindrome_detection_using_CLZ_modified.c -o modified_O0.elf riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O1 palindrome_detection_using_CLZ_modified.c -o modified_O1.elf riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O2 palindrome_detection_using_CLZ_modified.c -o modified_O2.elf riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O3 palindrome_detection_using_CLZ_modified.c -o modified_O3.elf riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Os palindrome_detection_using_CLZ_modified.c -o modified_Os.elf riscv-none-elf-objdump -d origin_O0.elf > disassembly/origin_O0.txt riscv-none-elf-objdump -d origin_O1.elf > disassembly/origin_O1.txt riscv-none-elf-objdump -d origin_O2.elf > disassembly/origin_O2.txt riscv-none-elf-objdump -d origin_O3.elf > disassembly/origin_O3.txt riscv-none-elf-objdump -d origin_Os.elf > disassembly/origin_Os.txt riscv-none-elf-objdump -d modified_O0.elf > disassembly/modified_O0.txt riscv-none-elf-objdump -d modified_O1.elf > disassembly/modified_O1.txt riscv-none-elf-objdump -d modified_O2.elf > disassembly/modified_O2.txt riscv-none-elf-objdump -d modified_O3.elf > disassembly/modified_O3.txt riscv-none-elf-objdump -d modified_Os.elf > disassembly/modified_Os.txt ``` You can check the complete disassembly code from [here](https://github.com/st10740/Computer-Architecture-HW/tree/main/HW2/c/disassembly). Because there are over **20000** lines of code. I will only analyze `palindrome_detected` part with different optimization level. ### -O0 - Observation - Lines of code | | original | modified | | -------- | -------- | -------- | | Number of lines | 242 | 138 | - Register used | Register | original | modified | | -------- | -------- | -------- | | s-type | s0, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11 | s0, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11 | | a-type | a0, a1, a2, a3, a4, a5, a6, a7 | a0, a1, a2, a3, a4, a5, a6, a7 | | t-type | t1, t2, t3, t4, t5, t6 | t1, t2, t3, t4, t5, t6 | | other | sp, zero | sp, zero | | Total | 27 | 27 | - Stack used | | original | modified | | -------- | -------- | -------- | | Stack used| 144 | 112 | - lw/sw count | | original | modified | | -------- | -------- | -------- | | lw/sw count | 114 | 73 | ### -O1 - Observation - Lines of code | | original | modified | | -------- | -------- | -------- | | Number of lines | 90 | 54 | - Register used | Register | original | modified | | -------- | -------- | -------- | | s-type | None | None | | a-type | a0, a1, a2, a3, a4, a5, a6, a7 | a0, a1, a2, a3, a4, a5, a6, a7 | | t-type | t1, t3 | t1, t3, t4, t5 | | other | None | None | | Total | 10 | 12 | - Stack used | | original | modified | | -------- | -------- | -------- | | Stack used| 0 | 0 | - lw/sw count | | original | modified | | -------- | -------- | -------- | | lw/sw count | 0 | 0 | ### -O2 - Observation - Lines of code | | original | modified | | -------- | -------- | -------- | | Number of lines | 97 | 44 | - Register used | Register | original | modified | | -------- | -------- | -------- | | s-type | None | None | | a-type | a0, a1, a2, a3, a4, a5, a6, a7 | a0, a1, a2, a3, a4, a5, a6, a7 | | t-type | t1, t3, t4, t5 | t1, t3, t4 | | other | None | None | | Total | 12 | 11 | - Stack used | | original | modified | | -------- | -------- | -------- | | Stack used| 0 | 0 | - lw/sw count | | original | modified | | -------- | -------- | -------- | | lw/sw count | 0 | 0 | ### -O3 - Observation - Lines of code | | original | modified | | -------- | -------- | -------- | | Number of lines | 97 | 44 | - Register used | Register | original | modified | | -------- | -------- | -------- | | s-type | None | None | | a-type | a0, a1, a2, a3, a4, a5, a6, a7 | a0, a1, a2, a3, a4, a5, a6, a7 | | t-type | t1, t3, t4, t5 | t1, t3, t4 | | other | None | None | | Total | 12 | 11 | - Stack used | | original | modified | | -------- | -------- | -------- | | Stack used| 0 | 0 | - lw/sw count | | original | modified | | -------- | -------- | -------- | | lw/sw count | 0 | 0 | ### -Os - Observation - Lines of code | | original | modified | | -------- | -------- | -------- | | Number of lines | 66 | 42 | - Register used | Register | original | modified | | -------- | -------- | -------- | | s-type | s0, s1, s2, s3, s4, s5, s6 | s0, s1, s2 | | a-type | a0, a1, a2, a3, a4, a5 | a0, a1, a2, a3, a4, a5, a6 | | t-type | None | None | | other | sp, ra | sp, ra | | Total | 15 | 12 | - Stack used | | original | modified | | -------- | -------- | -------- | | Stack used| 48 | 16 | - lw/sw count | | original | modified | | -------- | -------- | -------- | | lw/sw count | 19 | 8 | ### Summary of Optimization Below is the comparison of **`palindrome_detected`** part between original and modified code. - Original | | -O0 | -O1 | -O2 | -O3 | -Os | | -------- | -------- | -------- | --- | --- | --- | | Number of lines | 242 | 90 | 97 | 97 | 66 | | Register number | 27 | 10 | 12 | 12 | 15 | | Stack used | 144 | 0 | 0 | 0 | 48 | | lw/sw count | 114 | 0 | 0 | 0 | 19 | - Modified | | -O0 | -O1 | -O2 | -O3 | -Os | | -------- | -------- | -------- | --- | --- | --- | | Number of lines | 138 | 54 | 44 | 44 | 42 | | Register number | 27 | 12 | 11 | 11 | 12 | | Stack used | 112 | 0 | 0 | 0 | 16 | | lw/sw count | 73 | 0 | 0 | 0 | 8 | - Analysis - **`-O0` to `-O1, -O2, -O3` :** `-O0` uses a lot of lw/sw instructions and stores the value in the stack. Because executing lw/sw takes a lot of time, to improve the performance, `-O1`, `-O2` and `-O3` get rid of using lw/sw, only store the value in the registers. - **`-O2` to `-O3` :** To my surprise, the results of `-O2` and `-O3` are the same. So, I further check the disassembly code of `palindrome_detected` part of `-O2` and `-O3` and find out that they are entirely the same. - **`-Os` :** `-Os` optimizes the size of code but sacrifices performance. In addition, I also compare CSR cycle of the **whole program** between original and modified c code by using the same method as [Performance Comparison part of this homework](###Performance-Comparison). | CSR cycle | -O0 | -O1 | -O2 | -O3 | -Os | | -------- | -------- | -------- | --- | --- | --- | | Original | 4587 | 2109 | 2105 | 2105 | 2148 | | Modified | 2896 | 1698 | 1585 | 1559 | 1616 | Considering the results, my discoveries are - The CSR cycle decreases when the optimization level increases. - The difference of CSR cycle between original and modified code dramatically decreases when the optimization is turned on. - `-Os` can reduce code size but the effect of performance improvement may not be as great as other optimization levels. ## Handwritten Assembly Optimization In this section, I will try to optimize the modified version assembly code. ### Loop Unrolling Since loop for reverting the right half of input number in `palindrome_detected` iterates through several iterations when the MSB of input is at high place, I unroll the loop 2 times. ```c palindrome_detected: addi sp, sp, -4 sw ra, 0(sp) mv t1, a0 # low 32 bits mv t0, a1 # high 32 bits andi t5, a2, 0x1 # t5 = (input number is odd) ? 1 : 0 srai t4, a2, 1 # t4 = input bit number / 2 add t4, t4, t5 # t4 = input bit number / 2 + checkEven # x >> (input bit number / 2 + checkEven) li t5, 32 sub t4, t5, t4 # t4 = 32 - t4 sll t2, t0, t4 sub t4, t5, t4 # (32 - t4) back to t4 srl t3, t1, t4 or t3, t2, t3 # t3 = left half of input number li t2, 0 # t2 = reversed right half of input number li t5, 0 # t5 = i srai t4, a2, 1 # t4 = input bit number / 2 # Loop unrolling: Unroll the loop 2 times reverse_2_loop: srl t6, t1, t5 andi t6, t6, 1 slli t2, t2, 1 or t2, t2, t6 addi t5, t5, 1 # i++ srl t6, t1, t5 andi t6, t6, 1 slli t2, t2, 1 or t2, t2, t6 addi t5, t5, 1 # i++ blt t5, t4, reverse_2_loop # compare different beq t2, t3, isPalindrome li a0, 0 jr ra isPalindrome: li a0, 1 jr ra ``` ### Performance Comparision I call my three versions assembly code from c and also call `get_cycles.s` from [rv32emu/tests/perfcount](https://github.com/sysprog21/rv32emu/blob/master/tests/perfcounter/getcycles.S) to get cycle time in order to calculate CSR cycle of each program. - `get_cycles.s` ```c .text .globl get_cycles .align 2 get_cycles: csrr a1, cycleh csrr a0, cycle csrr a2, cycleh bne a1, a2, get_cycles ret .size get_cycles,.-get_cycles ``` - Calculate each handwritten assembly code CSR cycle via below c code. ```c #include <stdint.h> #include <stdio.h> #include <string.h> extern uint64_t get_cycles(); extern void palindrome_detection_origin(); extern void palindrome_detection_modify1(); extern void palindrome_detection_modify2(); int main(void) { /* measure cycles */ /* origin */ uint64_t oldcount = get_cycles(); palindrome_detection_origin(); uint64_t cyclecount = get_cycles() - oldcount; printf("origin cycle count: %u\n", (unsigned int) cyclecount); /* modify1 */ uint64_t oldcount = get_cycles(); palindrome_detection_modify1(); uint64_t cyclecount = get_cycles() - oldcount; printf("modify1 cycle count: %u\n", (unsigned int) cyclecount); /* modify2 */ uint64_t oldcount = get_cycles(); palindrome_detection_modify2(); uint64_t cyclecount = get_cycles() - oldcount; printf("modify2 cycle count: %u\n", (unsigned int) cyclecount); return 0; } ``` - In order to make my assembly code can be called from c, I have to change some places. Take `palindrome_detection_using_CLZ_origin.s` for example : ```diff .text - .global main + .global palindrome_detection_origin + .type palindrome_detection_origin, %function - main: + palindrome_detection_origin: + addi sp, sp, -4 + sw ra, 0(sp) li a0, 0x00000003 li a1, 0x00000C00 jal ra, count_leading_zeros # check is palindrome or not li a2, 64 sub a2, a2, a0 li a0, 0x00000003 li a1, 0x00000C00 jal ra, palindrome_detected # Exit program - li a7, 10 - li a0, 0 - ecall + lw ra, 0(sp) + addi sp, sp, 4 + ret ``` :::info I stuck in not successfully returning from assembly function to c code multiple days, and finally find out that I have to use `ret` to return to c and also have to store the value stored in `ra` at the beginning of entering assembly function because the value stored in `ra` changes when program excuting. ::: #### Result | CSR cycle | | | -------- | -------- | | Original | 349 | | Modified | 278 | | Modified + Loop unrolling | 267 | By the way, the reason why CSR cycle count of assembly code is substantially less than c code is that I don't output the result of palindrome detection to console in assembly code, but do it in c code. ## Reference - https://hackmd.io/@sysprog/2023-arch-homework1 - https://hackmd.io/@sysprog/2023-arch-homework2 - https://hackmd.io/@sysprog/SJAR5XMmi - https://wiki.gentoo.org/wiki/GCC_optimization