Assignment2: GNU Toolchain

# Assignment2: GNU Toolchain contributed by < [`CSIE523`](https://github.com/CSIE523/2023_Computer_Architecture_FALL/tree/main/hw2) > I chose [Implement Binarization by count leading zero](https://hackmd.io/@edenlin/CompArchi_HW1) made from [林勁羽](https://github.com/JinYu1225/2023_Computer_Architecture/tree/main/Lab1). ## Motivation Binarization is a common method in image processing. In order to make binarization more efficiently, I come up with some optimization ideas about this topic. ## Environment I installed Ubuntu Linux 20.04-LTS on [Oracle Virtualbox](https://www.virtualbox.org/) and followed the [Lab2: RISC-V RV32I[MACF] emulator with ELF support](https://hackmd.io/@sysprog/SJAR5XMmi) step by step until I saw `hello.elf` successfully build. ``` ~/rv32emu/tests$ cd asm-hello/ ~/rv32emu/tests/asm-hello$ make riscv-none-elf-as -R -march=rv32i -mabi=ilp32 -o hello.o hello.S riscv-none-elf-ld -o hello.elf -T hello.ld --oformat=elf32-littleriscv hello.o ~/rv32emu/tests/asm-hello$ ls hello.elf hello.ld hello.o hello.S Makefile ``` ### elf Commands list We can simply change `Ox` to `-O0`, `-O1`, `-O2`, `-O3` or `-Ofast`. #### elf compile (We can simply change the number after '-O' to view different levels of compilation, such as '-O0' or '-Ofast'.) ``` riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Ox hw2_binarization.c -o hw2_binarization_Ox.elf ``` #### Check elf instructions ``` riscv-none-elf-objdump -d hw2_binarization_Ox.elf ``` #### Check elf file header ``` riscv-none-elf-readelf -h hw2_binarization_Ox.elf ``` #### Check elf size ``` riscv-none-elf-size hw2_binarization_Ox.elf ``` #### elf execution ``` build/rv32emu hw2_binarization_Ox.elf ``` ## Analysis ### C code Origin ```clike #include <stdint.h> #include <stdio.h> uint32_t count_leading_zeros_32(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x3f)); // change 0x7f to 0x3f } int main(){ // pixel test // 8-bit color depth for black and white photo uint32_t picture[5] = {20,80,128,150,231}; uint32_t threshold = 128; uint32_t *pixel = &picture; for(int i = 0; i < 5; i++){ uint32_t sub = threshold - *(pixel+i); printf("%d, ",i); printf("before = %d, ",*(pixel+i)); sub = count_leading_zeros_32(sub); if(sub) *(pixel+i) = 0; else *(pixel+i) = 255; printf("after = %d\n",*(pixel+i)); } return 0; } ``` The pixel values of an image are limited between 0 and 255, so we don't need to declare **uint32_t** to store variables. Instead, we can just declare **uint16_t** to store variables. ### C code Optimization ```clike #include <stdint.h> #include <stdio.h> uint16_t count_leading_zeros_16(uint16_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x -= ((x >> 1) & 0x5555); x = ((x >> 2) & 0x3333) + (x & 0x3333); x = ((x >> 4) + x) & 0x0f0f; x += (x >> 8); return (16 - (x & 0x1f)); // change 0x3f to 0x1f } int main(){ // pixel test // 8-bit color depth for black and white photo uint16_t picture[5] = {20,80,128,150,231}; uint16_t threshold = 127; uint16_t *pixel = picture; for(int i = 0; i < 5; i++){ uint16_t sub = threshold - *(pixel+i); printf("%d, ",i); printf("before = %d\n, ",*(pixel+i)); sub = count_leading_zeros_16(sub); *(pixel+i) = (sub) : 0 ? 255; printf("after = %d\n",*(pixel+i)); printf("--------------------------------\n"); } ``` In addition to changing **uint32_t** to **uint16_t**, we can also eliminate the codes that are always zero since we declare uint16_t but are only using 8 bits of data. ```clike uint32_t count_leading_zeros_32(uint32_t x) { ... x |= (x >> 16); ... x += (x >> 16); ... } ``` ### -O0 ``` count_leading_zeros_16: addi sp,sp,-32 sw s0,28(sp) addi s0,sp,32 mv a5,a0 sh a5,-18(s0) lhu a5,-18(s0) srli a5,a5,1 slli a5,a5,16 srli a5,a5,16 lhu a4,-18(s0) or a5,a5,a4 sh a5,-18(s0) lhu a5,-18(s0) srli a5,a5,2 slli a5,a5,16 srli a5,a5,16 lhu a4,-18(s0) or a5,a5,a4 sh a5,-18(s0) lhu a5,-18(s0) srli a5,a5,4 slli a5,a5,16 srli a5,a5,16 lhu a4,-18(s0) or a5,a5,a4 sh a5,-18(s0) lhu a5,-18(s0) srli a5,a5,8 slli a5,a5,16 srli a5,a5,16 lhu a4,-18(s0) or a5,a5,a4 sh a5,-18(s0) lhu a5,-18(s0) srli a5,a5,1 slli a4,a5,16 srli a4,a4,16 li a5,20480 addi a5,a5,1365 and a5,a4,a5 slli a5,a5,16 srli a5,a5,16 lhu a4,-18(s0) sub a5,a4,a5 sh a5,-18(s0) lhu a5,-18(s0) srli a5,a5,2 slli a4,a5,16 srli a4,a4,16 li a5,12288 addi a5,a5,819 and a5,a4,a5 slli a4,a5,16 srli a4,a4,16 lhu a5,-18(s0) mv a3,a5 li a5,12288 addi a5,a5,819 and a5,a3,a5 slli a5,a5,16 srli a5,a5,16 add a5,a4,a5 sh a5,-18(s0) lhu a5,-18(s0) srli a5,a5,4 slli a5,a5,16 srli a5,a5,16 lhu a4,-18(s0) add a5,a5,a4 slli a4,a5,16 srli a4,a4,16 li a5,4096 addi a5,a5,-241 and a5,a4,a5 sh a5,-18(s0) lhu a5,-18(s0) srli a5,a5,8 slli a5,a5,16 srli a5,a5,16 lhu a4,-18(s0) add a5,a5,a4 sh a5,-18(s0) lhu a5,-18(s0) andi a5,a5,31 slli a5,a5,16 srli a5,a5,16 li a4,16 sub a5,a4,a5 slli a5,a5,16 srli a5,a5,16 mv a0,a5 lw s0,28(sp) addi sp,sp,32 jr ra ``` No optimization. The default value when no option is input. `lw` : 1 `sw` : 1 `sh` : 9 `lhu`: 17 ### -O1 ``` count_leading_zeros_16: srli a5,a0,1 or a0,a5,a0 srli a5,a0,2 or a5,a5,a0 srli a0,a5,4 or a0,a0,a5 srli a4,a0,8 or a4,a4,a0 srli a5,a4,1 li a3,20480 addi a3,a3,1365 and a5,a5,a3 sub a4,a4,a5 slli a4,a4,16 srli a4,a4,16 srli a5,a4,2 li a3,12288 addi a3,a3,819 and a5,a5,a3 and a4,a4,a3 add a5,a5,a4 srli a4,a5,4 add a5,a5,a4 li a4,4096 addi a4,a4,-241 and a5,a5,a4 srli a4,a5,8 add a5,a4,a5 andi a5,a5,31 li a0,16 sub a0,a0,a5 slli a0,a0,16 srli a0,a0,16 ret ``` Compared with option `-O0`, `-O1` eliminated many `Load and Store` instructions and any operations related to the `stack pointer`. Moreover, it reduces ``` srli a5,a5,1 slli a5,a5,16 srli a5,a5,16 lhu a4,-18(s0) or a5,a5,a4 sh a5,-18(s0) lhu a5,-18(s0) ``` to ``` srli a5,a0,1 or a0,a5,a0 ``` . ### `-O2` (for count_leading_zeros_16, `-O2`, `-O3` and `-Ofast` are the same) ```c count_leading_zeros_16: srli a5,a0,1 or a0,a5,a0 srli a5,a0,2 or a5,a5,a0 srli a0,a5,4 or a0,a0,a5 srli a4,a0,8 or a4,a4,a0 li a3,20480 srli a5,a4,1 addi a3,a3,1365 and a5,a5,a3 sub a4,a4,a5 slli a4,a4,16 srli a4,a4,16 li a3,12288 addi a3,a3,819 srli a5,a4,2 and a5,a5,a3 and a4,a4,a3 add a5,a5,a4 srli a3,a5,4 li a4,4096 add a5,a5,a3 addi a4,a4,-241 and a5,a5,a4 srli a4,a5,8 add a5,a4,a5 andi a5,a5,31 li a0,16 sub a0,a0,a5 slli a0,a0,16 srli a0,a0,16 ret ``` The only difference between options `-O1` and `-O2` is the swapping of some instruction positions. ### Statistics I think there are some mistakes in my program when counting cycle count, because it's impossible that cycle count is less than instret. Instret is total number of retired instructions, so cycle count must be equal to instret in single-cycle processor. In 5-stage processor, cycle count may be greater than or eqaul to instret due to hazards. Therefore, I'm still finding the solution to the problem. | | O0_origin | O0_opt | O1_opt | O2_opt | O3_opt | Ofast_opt | | ----------- | --------- | ------ | ------ | ------ | ------ | --------- | | cycle count | 508 | 620 | 234 | 239 | 13 | 13 | | instret | 210d | 20ce | 209a | 209b | 2099 | 2099 | | text | 53692 | 52544 | 52096 | 52104 | 52056 | 52056 | | data | 1924 | 1876 | 1876 | 1876 | 1876 | 1876 | | bss | 1528 | 1528 | 1528 | 1528 | 1528 | 1528 | | dec | 57144 | 55948 | 55500 | 55508 | 55460 | 55460 | ## Handwritten Assembly ### Origin ```c .set STDOUT, 1 .set WRITE, 64 .set EXIT, 93 .data test: .byte 20, 80, 128, 150, 231 result: .byte 0, 0, 0, 0, 0 .text .global binarization_origin .type binarization_origin, %function binarization_origin: addi sp, sp, -4 sw ra, 0(sp) # a1 = address of pixel # a2 = address of result # a3 = threshold la a1, test # load address of pixel la a2, result # load address of result li a3, 0x80 # load address of threshold lbu a3, 0(a3) # load value of threshold BINARIZATION: li s1, 5 # count of haven't done pixel B_LOOP: lbu a4, 0(a1) # load value of pixel sub a0, a3, a4 # threshold - pixel jal ra, CLZ # jump to CLZ bne a0, x0, FLOOR # branch to FLOOR if CLZ = 0 CEIL: li a4, 255 j STORE_MOVE # jump to STORE_MOVE FLOOR: li a4, 0 STORE_MOVE: sb a4, 0(a2) addi s1, s1, -1 # total count -1 addi a1, a1, 1 # go to next address of pixel addi a2, a2, 1 # go to next address of result bne s1, x0, B_LOOP li s1, 5 # make count for RESULT_CHECK sub a2, a2, s1 # reset address of result RESULT_CHECK: lbu a0, 0(a2) # load result to a0 #jal ra, PRINT_INT addi s1, s1, -1 # count print result addi a2, a2, 1 # move to next address of result bne s1, x0, RESULT_CHECK # j EXIT # Exit program lw ra, 0(sp) addi sp, sp, 4 ret CLZ: # a0: x = the number to be counted leading zero # t0: shifted x # t1: x shift & mask # t2: mask PAD1: srli t0, a0, 1 # t0 = x >> 1 or a0, a0, t0 # x |= x >> 1 srli t0, a0, 2 # t0 = x >> 2 or a0, a0, t0 # x |= x >> 2 srli t0, a0, 4 # t0 = x >> 4 or a0, a0, t0 # x |= x >> 4 srli t0, a0, 8 # t0 = x >> 8 or a0, a0, t0 # x |= x >> 8 srli t0, a0, 16 # t0 = x >> 16 or a0, a0, t0 # x |= x >> 16 POPCNT: li t2, 0x55555555 # load mask1 to t2 srli t0, a0, 1 # t0 = x >> 1 and t1, t0, t2 # t1 = (x >> 1) & mask1 sub a0, a0, t1 # x -= ((x >> 1) & mask1) li t2, 0x33333333 # load mask2 to t2 srli t0, a0, 2 # t0 = x >> 2 and t1, t0, t2 # (x >> 2) & mask2 and a0, a0, t2 # x & mask2 add a0, t1, a0 # ((x >> 2) & mask2) + (x & mask2) srli t0, a0, 4 # t0 = x >> 4 add a0, a0, t0 # x + (x >> 4) li t2, 0x0f0f0f0f # load mask4 to t2 and a0, a0, t2 # ((x >> 4) + x) & mask4 srli t0, a0, 8 # t0 = x >> 8 add a0, a0, t0 # x += (x >> 8) srli t0, a0, 16 # t0 = x >> 16 add a0, a0, t0 # x += (x >> 16) andi t0, a0, 0x3f # t0 = x & 0x3f li a0, 32 # a0 = 64 sub a0, a0, t0 # 64 - (x & 0x3f) addi sp, sp, -4 # make space for ra in stack sw ra, 0(sp) # push ra into stack # jal ra, PRINT_INT lw ra, 0(sp) # pull ra out of stack addi sp, sp, 4 # free the space create in stack for ra ret PRINT_INT: # li a7, 1 # print the value of input a0 in int # ecall addi sp, sp, -4 sw a0, 0(sp) lw a0, 0(sp) addi sp, sp, 4 ret ``` :::success I fix the handwritten assembly code with eliminating some codes which may cause mistakes while compiling. The experiment result is listed at the bottom of this section. ::: **Experiment Result** <s> ![](https://hackmd.io/_uploads/S10yLmaMp.png) </s> :::warning You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution. :notes: jserv ::: ### Handwritten Assembly Optimization With print function loop unrolling and eliminatiing some useless instructions. ```c .set STDOUT, 1 .set WRITE, 64 .set EXIT, 93 .data test: .byte 20, 80, 128, 150, 231 result: .byte 0, 0, 0, 0, 0 .text .global binarization_opt_v1 .type binarization_opt_v1, %function binarization_opt_v1: addi sp, sp, -4 sw ra, 0(sp) # a1 = address of pixel # a2 = address of result # a3 = threshold la a1, test # load address of pixel la a2, result # load address of result li a3, 0x7F # load address of threshold BINARIZATION: li s1, 5 # count of haven't done pixel B_LOOP: lbu a4, 0(a1) # load value of pixel sub a0, a3, a4 # threshold - pixel jal ra, CLZ # jump to CLZ bne a0, x0, FLOOR # branch to FLOOR if CLZ = 0 CEIL: li a4, 255 j STORE_MOVE FLOOR: li a4, 0 STORE_MOVE: sb a4, 0(a2) addi s1, s1, -1 # total count -1 addi a1, a1, 1 # go to next address of pixel addi a2, a2, 1 # go to next address of result bne s1, x0, B_LOOP addi a2, a2, -5 # la a0, str2 # start to print the result # li a7, 4 # ecall RESULT_CHECK: # lbu a0, 0(a2) # load result to a0 # li a7, 1 # print the value of input a0 in int # ecall # la a0, str1 # li a7, 4 # ecall # lbu a0, 1(a2) # load result to a0 # li a7, 1 # print the value of input a0 in int # ecall # la a0, str1 # li a7, 4 # ecall # lbu a0, 2(a2) # load result to a0 # li a7, 1 # print the value of input a0 in int # ecall # la a0, str1 # li a7, 4 # ecall # lbu a0, 3(a2) # load result to a0 # li a7, 1 # print the value of input a0 in int # ecall # la a0, str1 # li a7, 4 # ecall # lbu a0, 4(a2) # load result to a0 # li a7, 1 # print the value of input a0 in int # ecall # j EXIT # Exit program lw ra, 0(sp) addi sp, sp, 4 ret CLZ: # a0: x = the number to be counted leading zero # t0: shifted x # t1: x shift & mask # t2: mask PAD1: srli t0, a0, 1 # t0 = x >> 1 or a0, a0, t0 # x |= x >> 1 srli t0, a0, 2 # t0 = x >> 2 or a0, a0, t0 # x |= x >> 2 srli t0, a0, 4 # t0 = x >> 4 or a0, a0, t0 # x |= x >> 4 srli t0, a0, 8 # t0 = x >> 8 or a0, a0, t0 # x |= x >> 8 POPCNT: li t2, 0x55555555 # load mask1 to t2 srli t0, a0, 1 # t0 = x >> 1 and t1, t0, t2 # t1 = (x >> 1) & mask1 sub a0, a0, t1 # x -= ((x >> 1) & mask1) li t2, 0x33333333 # load mask2 to t2 srli t0, a0, 2 # t0 = x >> 2 and t1, t0, t2 # (x >> 2) & mask2 and a0, a0, t2 # x & mask2 add a0, t1, a0 # ((x >> 2) & mask2) + (x & mask2) srli t0, a0, 4 # t0 = x >> 4 add a0, a0, t0 # x + (x >> 4) li t2, 0x0f0f0f0f # load mask4 to t2 and a0, a0, t2 # ((x >> 4) + x) & mask4 srli t0, a0, 8 # t0 = x >> 8 add a0, a0, t0 # x += (x >> 8) andi t0, a0, 0x1f # t0 = x & 0x1f li a0, 16 # a0 = 16 sub a0, a0, t0 # 16 - (x & 0x1f) ret ``` ### Without CLZ Just for comparison, I did another version without CLZ. ```c .data test: .byte 20, 80, 128, 150, 255 result: .byte 0, 0, 0, 0, 0 .text .global binarization_opt_v2 .type binarization_opt_v2, %function binarization_opt_v2: addi sp, sp, -4 sw ra, 0(sp) # a1 = address of pixel # a2 = address of result # a3 = threshold la a1, test # load address of pixel la a2, result # load address of result li a3, 0x7F # load address of threshold BINARIZATION: li s1, 5 # count of haven't done pixel addi a3, a3, 1 B_LOOP: lbu a4, 0(a1) # load value of pixel sltu t0, a4, a3 addi a0, t0, 255 sb a0, 0(a2) lbu a4, 1(a1) # load value of pixel sltu t0, a4, a3 addi a0, t0, 255 sb a0, 1(a2) lbu a4, 2(a1) # load value of pixel sltu t0, a4, a3 addi a0, t0, 255 sb a0, 2(a2) lbu a4, 3(a1) # load value of pixel sltu t0, a4, a3 addi a0, t0, 255 sb a0, 3(a2) lbu a4, 4(a1) # load value of pixel sltu t0, a4, a3 addi a0, t0, 255 sb a0, 4(a2) RESULT_CHECK: # lbu a0, 0(a2) # load result to a0 # jal ra, PRINT_INT # addi s1, s1, -1 # count print result # addi a2, a2, 1 # move to next address of result # bne s1, x0, RESULT_CHECK # Exit program lw ra, 0(sp) addi sp, sp, 4 ret PRINT_INT: li a7, 1 # print the value of input a0 in int ecall addi sp, sp, -4 sw a0, 0(sp) lw a0, 0(sp) addi sp, sp, 4 ret ``` **Experiment Result** I am still continuing to understand how to adapt corresponding `write` system call to my handwritten assembly code in order to check the result. Thus, the assembly code above doesn't print anything which results in less cycle counts than ripes. Here is my makefile to build related .S and .o files. ``` .PHONY: clean include ../../../mk/toolchain.mk CFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall OBJS = \ getcycles.o \ percount.o \ binarization_origin.o \ binarization_opt_v1.o \ binarization_opt_v2.o BIN = perfcount.elf %.o: %.S $(CROSS_COMPILE)gcc $(CFLAGS) -c -o $@ $< %.o: %.c $(CROSS_COMPILE)gcc $(CFLAGS) -c -o $@ $< all: $(BIN) $(BIN): $(OBJS) $(CROSS_COMPILE)gcc -o $@ $^ clean: $(RM) $(BIN) $(OBJS) ``` ``` $ make riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall -c -o getcycles.o getcycles.S riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall -c -o percount.o percount.c riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall -c -o binarization_origin.o binarization_origin.S riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall -c -o binarization_opt_v1.o binarization_opt_v1.S riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall -c -o binarization_opt_v2.o binarization_opt_v2.S riscv-none-elf-gcc -o perfcount.elf getcycles.o percount.o binarization_origin.o binarization_opt_v1.o binarization_opt_v2.o $ ~/rv32emu/build/rv32emu perfcount.elf origin cycle count: 288 origin cycle count: 222 origin cycle count: 39 inferior exit code 0 ``` | cycle counts | rv32emu | Ripes | | --- | -------- | -------- | | binarization_origin.S | 288 | 545 | | binarization_opt_v1.S | 222 | 324 | | binarization_opt_v2.S | 39 | 177 |