# Assignment2: GNU Toolchain Contributed by [yptang5488](https://github.com/yptang5488) ## Install riscv rv32emu toolchain Follow the instruction in [Lab2](https://hackmd.io/@sysprog/SJAR5XMmi) document. ## Pick a Question The following question is picked from the Assignment 1. > from *GliAmanti*'s Assignment1 : [Implementation of Bit-Plane Slicing with CLZ](https://hackmd.io/@GliAmanti/BJlxAv651a) > > In Bit-plane slicing, we divide image into 8 planes. Plane 1 contains all LSBs in the bytes comprising the pixels in the image and Plane 8 contains all MSBs. It’s a useful way for image enhancement, since human eye is sensitive to digital with high frequency. And we only highlight the contribution made to total image appearance by specific high level bits. ![](https://hackmd.io/_uploads/H16vOq0Ma.png =500x) In this problem, we will use the CLZ function to get the highest bit, and then maximize the other values with the same number of bits to `255`, and set the other values to `0`. The motivation for me to choose this topic is that bit-plane slicing is a concept that is used in image processing, and I think it's a very relevant topic for practical purposes. The original implementation of the question is as follows : ```c void bit_plane_slicing(int caseNum, int size, uint32_t img[]){ int32_t max = -1; for (int i = 0; i < size; i++){ int32_t LZ = (int32_t)count_leading_zeros(img[i]); int32_t MSB = 32 - LZ - 1; img[i] = MSB; if(MSB > max) max = MSB; } printf("Test %d: ", caseNum); for (int i = 0; i < size; i++){ if(img[i] == max){ img[i] = 255; }else{ img[i] = 0; } printf("%u ", img[i]); } printf("\n"); } ``` The original RISC-V assembly implementation is as follows : :::spoiler Assembly code ```c .data test0: .word 255, 0, 128, 1 test1: .word 167, 133, 111, 144, 140, 135, 159, 154, 148 test2: .word 50, 100, 150, 200, 250, 255 testStr: .string "Test " colonStr: .string ": " newlineStr: .string "\n" spaceStr: .string " " .text main: la a0, test0 # a0: the base address of test0 addi a1, x0, 4 # a1: the size of test0 addi a2, x0, 0 # a2: caseNum = 0 jal ra, bit_plane_slicing la a0, test1 # a0: the base address of test1 addi a1, x0, 9 # a1: the size of test1 addi a2, x0, 1 # a2: caseNum = 1 jal ra, bit_plane_slicing la a0, test2 # a0: the base address of test2 addi a1, x0, 6 # a1: the size of test2 addi a2, x0, 2 # a2: caseNum = 2 jal ra, bit_plane_slicing li a7, 10 # Exit program ecall bit_plane_slicing: addi s0, x0, -1 # s0: max = -1 add s1, x0, x0 # s1: i = 0 clzLoop: slli t2, s1, 2 # t2: array offset = i*4 add t2, t2, a0 # t2: base addr + offset lw s4, 0(t2) # s4: arr[i] addi sp, sp, -8 sw ra, 0(sp) sw a0, 4(sp) add a0, x0, s4 # load arr[i] to parameter register a0 jal ra, count_leading_zeros lw ra, 0(sp) lw a0, 4(sp) addi sp, sp, 8 add s2, x0, a3 # s2: LZ addi t1, x0, 31 # t1: 32 - 1 sub s3, t1, s2 # s3: MSB = 32 - 1 - LZ sw s3, 0(t2) # arr[i] = MSB bge s3, s0, isGreater # if(MSB > max) goOn: addi s1, s1, 1 # i = i + 1 blt s1, a1, clzLoop addi sp, sp, -8 sw ra, 0(sp) sw a0, 4(sp) jal ra, printLoopOutside # printf("Test %d: ", caseNum) || printf("\n") lw ra, 0(sp) lw a0, 4(sp) addi sp, sp, 8 add s1, x0, x0 # s1: i = 0 reconLoop: slli t2, s1, 2 # t2: array offset = i*4 add t2, t2, a0 # t2: base addr + offset lw s4, 0(t2) # s4: arr[i] beq s4, s0, isEqual # if(arr[i] == max) add s4, x0, x0 # arr[i] = 0 goOn2: addi sp, sp, -8 sw ra, 0(sp) sw a0, 4(sp) jal ra, printLoopInside # printf("%u ", arr[i]) lw ra, 0(sp) lw a0, 4(sp) addi sp, sp, 8 addi s1, s1, 1 # i = i + 1 blt s1, a1, reconLoop ret isGreater: add s0, x0, s3 # max = MSB j goOn isEqual: addi s4, x0, 255 # arr[i] = 255 j goOn2 count_leading_zeros: srli t0, a0, 1 # t0: x >> 1 or a0, a0, t0 # a0: x |= (x >> 1) srli t0, a0, 2 # t0: x >> 2 or a0, a0, t0 srli t0, a0, 4 # t0: x >> 4 or a0, a0, t0 srli t0, a0, 8 # t0: x >> 8 or a0, a0, t0 srli t0, a0, 16 # t0: x >> 16 or a0, a0, t0 count_ones: srli t0, a0, 1 # t0: x >> 1 lui t1, 0x55555 # t1: 0x55555555 ori t1, t1, 0x555 and t0, t0, t1 # t0: (x >> 1) & 0x55555555 sub a0, a0, t0 srli t0, a0, 2 # t0: x >> 2 lui t1, 0x33333 # t1: 0x33333333 ori t1, t1, 0x333 and t0, t0, t1 # t0: (x >> 2) & 0x33333333 and t4, a0, t1 # t4: x & 0x33333333 add a0, t0, t4 srli t0, a0, 4 # t0: x >> 4 add t0, t0, a0 # t0: (x >> 4) + x lui t1, 0x0f0f0 # t1: 0x0f0f0f0f ori t1, t1, 0x787 addi t1, t1, 0x788 and a0, t0, t1 # a0: ((x >> 4) + x) & 0x0f0f0f0f srli t0, a0, 8 # t0: x >> 8 add a0, a0, t0 srli t0, a0, 16 # t0: x >> 16 add a0, a0, t0 andi t0, a0, 0x7f # t0: x & 0x7f addi t1, x0, 32 # t1: 32 sub a3, t1, t0 # a3: return (32 - (x & 0x7f)) ret printLoopOutside: la a0, newlineStr li a7, 4 ecall la a0, testStr li a7, 4 ecall add a0, x0, a2 li a7, 1 ecall la a0, colonStr li a7, 4 ecall ret printLoopInside: add a0, x0, s4 li a7, 1 ecall la a0, spaceStr li a7, 4 ecall ret ``` ::: ## adapt the assembly to rv32emu I have modified the following partition in assembly code to eliminate the incompatibility between the `Ripes` and `rv32emu`. ```shell $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -nostartfiles -nostdlib -o slicing_s slicing_s.s $ rv32emu/build/rv32emu slicing_s ``` ### 1. Program Start I rename the `main` function to `_start` so that the linker can recognize the starting address. ```c .org 0 .global _start _start: la a0, test1 # a0: the base address of test0 addi a1, x0, 4 # a1: the size of test0 addi a2, x0, 0 # a2: caseNum = 0 jal ra, bit_plane_slicing ... ``` ### 2. Data Definition I found that the string definition `.string` have to be changed to `.ascii`. In addition, the definition is usually followed by the size of the data for later use. So, I change the original code : ```c testStr: .string "Test" ``` to the following code : ```c .section .rodata testStr: .ascii "Test" .set testStr_size, .-testStr ``` ### 3. System Call The following changes were made with reference to [docs/syscall.md](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md). I define commonly used system call numbers for future use in advance. ```c /* newlib system calls */ .set SYSEXIT, 93 .set SYSWRITE, 64 .set STDOUT, 1 ``` In the system output operation, defining `a1` as the system return value instead of using `a0`. - `a7` : system call number - `a0` : stdout - `a1` : output value - `a2` : size of the output value So, I change the original code : ```c print: la a0, testStr li a7, 4 ecall ``` to the following code : ```c print: li a7, SYSWRITE li a0, STDOUT la a1, testStr li a2, testStr_size ecall ``` It's worth noting that we can't print out integer types directly, but have to convert each digit to ASCII and then output it as a string. In this program, the output values will only be `0` and `255`, so instead of processing the numbers I just print string to speed things up. ### Output Finally, I can print the answer like this : ```python Test0: 255 0 255 0 Test1: 255 255 255 255 255 255 255 255 255 Test2: 255 255 255 255 255 255 ``` > Answer > Test0: 255 0 255 0 > Test1: 255 255 0 255 255 255 255 255 255 > Test2: 0 0 255 255 255 255 I encountered the problem that only the first data in my definition can get the correct answer. There is the array definition in my code : ```c .data test0: .word 255, 0, 128, 1 .set arr_size0, .test0 test1: .word 167, 133, 111, 144, 140, 135, 159, 154, 148 .set arr_size1, .test1 test2: .word 50, 100, 150, 200, 250, 255 .set arr_size2, .test2 ``` I have tried switching the execution order, but still only the first defined data can be executed correctly. I guess that according to my code, there is something missing under the rv32emu compiler, but I haven't found a solution. ## ELF files produced by C compiler ### complie the program ```shell $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Wall -O0 -c -o slicing_c.o slicing_c.c $ riscv-none-elf-gcc -o slicing.elf slicing_c.o ``` :::spoiler makefile ```shell .PHONY: clean include toolchain.mk ASFLAGS = -march=rv32i -mabi=ilp32 LDFLAGS = --oformat=elf32-littleriscv CFLAGS = -Wall -O0 OBJS = slicing_c.o %.o: %.s riscv-none-elf-gcc $(ASFLAGS) -c -o $@ $< %.o: %.c riscv-none-elf-gcc $(ASFLAGS) $(CFLAGS) -c -o $@ $< all: slicing.elf slicing.elf: $(OBJS) riscv-none-elf-gcc -o $@ $^ clean: $(RM) $(OBJS) ``` ::: ### deal with warning There is a warning when the original c program is complied by `riscv-none-elf-gcc`. ```shell slicing_c.c:52:18: warning: format '%u' expects argument of type 'unsigned int', but argument 2 has type 'uint32_t' {aka 'long unsigned int'} [-Wformat=] 52 | printf("%u ", img[i]); | ~^ ~~~~~~ | | | | | uint32_t {aka long unsigned int} | unsigned int | %lu ``` So, I change the original code : ```c printf("%u ", img[i]); ``` to the following code : ```c printf("%lu ", img[i]); ``` ### gcc optimization I write a shell script to run `riscv-none-elf-objdump`, `riscv-none-elf-readelf`, `riscv-none-elf-size` to get the information of the program and `build/rv32emu` to execute the c program. To get the cycle of the program, I use the [ticks.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c) function to get the cycle of different test case respectively. :::spoiler shell script ```shell= #!/bin/bash FILE_NAME="slicing.elf" OPTIM=0 # replace with different optimization choice riscv-none-elf-objdump -d ${FILE_NAME} > record/objdump_O${OPTIM}.txt riscv-none-elf-readelf -h ${FILE_NAME} riscv-none-elf-size ${FILE_NAME} rv32emu/build/rv32emu ${FILE_NAME} ``` ::: And then, I change the optimization flag from `-O0` (no optimization) to `-O1`, `-O2`, `-O3`, `-Os`, `-Ofast` and check their elapsed cycle respectively. ```shell riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Wall -O${OPTIM} -c -o slicing_c.o slicing_c.c riscv-none-elf-gcc -o slicing.elf slicing_c.o ``` | optimization | text | data | bss | dec | hex | elapsed cycle | |:------------:|:-----:|:----:|:----:|:-----:|:----:|:-------------:| | -O0 | 52714 | 1876 | 1528 | 56118 | db36 | 21581 | | -O1 | 52198 | 1876 | 1528 | 55602 | d932 | 20054 | | -O2 | 52184 | 1876 | 1528 | 55588 | d924 | 20001 | | -O3 | 52396 | 1876 | 1528 | 55800 | d9f8 | 19855 | | -Os | 52042 | 1876 | 1528 | 55446 | d896 | 20059 | | -Ofast | 52396 | 1876 | 1528 | 55800 | d9f8 | 19855 | From the table, we can see that the optimization of `-O3` is mainly aimed at reducing the execution time and less concerned about the size of the program. `-Os` is also based on `-O2`, but the gola is to minimize the size of the target code. ## Optimization the program I optimized the hand-written assembly code and I run it directly on Ripes for quick access the information about cycle and clock, etc. There is the information of the original hand-written assembly code : ![](https://hackmd.io/_uploads/B1gLVERM6.png) ### 1. Reduce the number of branches (version 1) I found that there are two branches `isGreater`, `isEqual` in the assembly that can be modified and fine-tuned to reduce the number of branches. - `isGreater` is the branch that can be simplified. before modification : ```c clzLoop: ... bge s3, s0, isGreater # if(MSB > max) goOn: ... ... isGreater: # redundant branch add s0, x0, s3 # max = MSB j goOn ``` modified : ```c clzLoop: ... blt s3, s0, goOn add s0, x0, s3 # if(max < MSB) max = MSB goOn: ... ``` - move `isEqual` to reduce jump before modification : ```c reconLoop: ... beq s4, s0, isEqual # if(arr[i] == max) add s4, x0, x0 # arr[i] = 0 goOn2: ... ... isEqual: # the branch that have to jump to the above command far away addi s4, x0, 255 # arr[i] = 255 j goOn2 ``` modified : ```c reconLoop: ... beq s4, s0, isEqual # if(arr[i] == max) add s4, x0, x0 # arr[i] = 0 j goOn2 isEqual: addi s4, x0, 255 # arr[i] = 255 goOn2: ... ``` After the modification, a slight decrease in cycle counts can be noticed. ![](https://hackmd.io/_uploads/SkGXbLRza.png) ### 2. Reduce the times of load/store : a0 (version 2) The register `a0` is usually used to access the return value of the callee function. I found that in this program, `a0` is not only used to access the return value, but also for many parameter accesses. Therefore, each time a function is called, the caller function have to load and store `a0`, which takes a lot of time to move data in and out of memory. ```c # when calling "printLoopInside" function addi sp, sp, -8 sw ra, 0(sp) sw a0, 4(sp) jal ra, printLoopInside # printf("%u ", arr[i]) lw ra, 0(sp) lw a0, 4(sp) addi sp, sp, 8 ``` I changed the non-returned parameter values to be stored in other registers, so that it is not necessary to read/write and access `a0` every time. The modification is as follows : ```c # when calling "printLoopInside" function addi sp, sp, -4 sw ra, 0(sp) jal ra, printLoopInside # printf("%u ", arr[i]) lw ra, 0(sp) addi sp, sp, 4 ``` The result does succeed in reducing the number of the cycles. ![](https://hackmd.io/_uploads/r1ELwO0G6.png) ### 3. Reduce the times of load/store : ra (version 3) In the original program, there are several levels of calls that use `jal` to record the return address `ra`. So, there are many `ra` read and write operations when the callee function is called. I found that the output functions `printLoopInside`, `printLoopOutside` were all called in the same place, so I moved them directly into the loop to minimize the numbers of branching and read-write of `ra` register. before modification : ```c goOn2: addi sp, sp, -4 sw ra, 0(sp) jal ra, printLoopInside # printf("%u ", arr[i]) lw ra, 0(sp) addi sp, sp, 4 addi s1, s1, 1 # i = i + 1 ... ... printLoopInside: add a0, x0, s4 li a7, 1 ecall la a0, spaceStr li a7, 4 ecall ret ``` modified : ```c goOn2: add a0, x0, s4 li a7, 1 ecall la a0, spaceStr li a7, 4 ecall addi s1, s1, 1 # i = i + 1 ... ... ``` ![](https://hackmd.io/_uploads/H1i6cuRMp.png) Optimization results table : | optimization | cycles | instrs. retired | CPI | IPC | |:------------:|:------:|:---------------:|:----:|:-----:| | original | 2054 | 1569 | 1.31 | 0.764 | | version 1 | 1953 | 1544 | 1.26 | 0.791 | | version 2 | 1871 | 1462 | 1.28 | 0.781 | | version 3 | 1651 | 1330 | 1.24 | 0.806 | I reduced cycle counts by 19.62% compared to the initial program. :::warning You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution. :notes: jserv :::