# Assignment2: GNU Toolchain

contributed by <[`shhung`](https://github.com/shhung/Image-scaling-with-Bilinear-interpolation-by-float32-multiplication)>

## Motivation

The work I selected is [Image scaling with Bilinear interpolation](https://github.com/linyu425/ComputerArchitecture/tree/main/hw1) by [linyu0425](https://hackmd.io/@linyu0425/SJHkb8lWT). I chose it for Assignment 2 because the work itself is interesting, and because I want to review the floating-point processing it implements, such as addition and multiplication.

## Make the C program executable with rv32emu

Compiling a C program with GCC for RISC-V is straightforward:

```shell
$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o ImgScale ImgScale.c
$ rv32emu ImgScale
```

## Make the assembly program executable with rv32emu

By referencing [asm-hello](https://github.com/sysprog21/rv32emu/tree/master/tests/asm-hello), the rv32emu demo for [System calls](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md), we can write a `Makefile` and linker script for our project.

### Modification

All that needs to be done is to modify the syscall part of the original program, since Ripes has its own syscall design. To focus on the performance of the processing logic, I removed the part that prints to the console.

### Optimization method

Since this work implements multiple sub-functions for image scaling, I split it into several parts. This lets us analyze each unit of image scaling and identify the parts worth optimizing. The function structure is shown below, and the performance of each part is discussed later.

```
─── ImgScale
    ├── count_leading_zeros
    ├── imul32
    ├── fmul32
    └── fadd32
```

### Bug

While testing the individual parts of the program, I found that `imul32` and `fmul32` produce slightly wrong results, and the error occurs in both the C and assembly versions. Additionally, `imul32` returns a 32-bit signed integer, which can overflow; `fmul32` has the same problem. Even so, the overall logic is sound, so the optimization below is still based on the buggy program.

## Execute the program

The working code is available at [src](https://github.com/shhung/Image-scaling-with-Bilinear-interpolation-by-float32-multiplication/tree/main/src), and a Makefile is provided to generate ELF files runnable on rv32emu.

1. Build the ELF files

   Run the following command; you will get `main.elf` and `ImgScaleFromC.elf`:

   ```shell
   $ make
   ```

2. Run each of them

   Make sure you have already installed `rv32emu` and added it to your path. If not, check [Lab2: RISC-V RV32I[MACF] emulator with ELF support](https://hackmd.io/@sysprog/SJAR5XMmi) to set up your environment.

   ```shell
   $ rv32emu main.elf
   ```

   ```shell
   $ rv32emu ImgScaleFromC.elf
   ```

Expected output: for `main.elf` and `ImgScaleFromC.elf`, the output differs only in the elapsed cycle count.

```shell
0.954780 0.877887 0.800995 0.724102 0.647210
0.921899 0.826679 0.731460 0.636240 0.541020
0.889018 0.775471 0.661924 0.548377 0.434830
0.856138 0.724263 0.592389 0.460514 0.328640
0.823257 0.673055 0.522853 0.372652 0.222450
elapsed cycle: 39580
inferior exit code 0
```

## Contrast the handwritten and compiler-optimized assembly code

Following the example of [ticks.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c), we can evaluate our work by accumulating cycle counts; a minimal sketch of the counter-reading routine is shown below.
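The sketch below shows one way to read the cycle counter on RV32, in the spirit of `ticks.c`. The function name `get_cycles` and the rollover-retry loop are my own additions rather than code taken from the original work:

```c
#include <stdint.h>

/* Read the 64-bit cycle counter on RV32 via the cycle/cycleh CSRs.
 * Because the two halves are read by separate instructions, re-read
 * the high word and retry if it changed mid-read. */
static inline uint64_t get_cycles(void)
{
    uint32_t lo, hi, hi2;
    do {
        asm volatile("rdcycleh %0" : "=r"(hi));
        asm volatile("rdcycle  %0" : "=r"(lo));
        asm volatile("rdcycleh %0" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t) hi << 32) | lo;
}
```

Calling such a routine before and after the code under test and subtracting the two readings yields the elapsed-cycle numbers reported below.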
| | Handwritten | O0 | O1 | O2 | O3 | Os | Ofast |
| ------------ | ----------- | ----- | ----- | ----- | ----- | ----- | ----- |
| Size (bytes) | 56998 | 83940 | 82384 | 82364 | 83244 | 82364 | 83244 |
| Cycles       | 39610 | 39065 | 16162 | 16459 | 12183 | 17012 | 12183 |

Compiler optimization improves the cycle count dramatically (from 39065 cycles at `-O0` down to 12183 at `-O3`), while the binary size is barely affected; interestingly, the handwritten version is the smallest, yet its cycle count is close to unoptimized `-O0`. Below, we describe the differences between the compiler-generated and the handwritten assembly code. The complete code is accessible at [shhung](https://github.com/shhung/Image-scaling-with-Bilinear-interpolation-by-float32-multiplication/tree/main/src/disasm).

### count_leading_zeros

Since `count_leading_zeros` is already minimal, there is almost nothing left to optimize. The only improvement is to omit store/load operations by replacing callee-saved registers with caller-saved ones.

### imul32

By replacing callee-saved registers with caller-saved ones, no store/load operations are required in `imul32`.

### fadd32

Since `count_leading_zeros` is called only once in `fadd32`, at `-O2` and above the compiler replaces the call with the helper's logic written directly inside `fadd32`.

### fmul32

As with `imul32`, the code generated at `-O1` uses fewer callee-saved registers, which results in fewer load/store instructions. At `-O2` and above, the function becomes branchless, and store/load operations disappear entirely because callee-saved registers are replaced with caller-saved ones.

### Difference

From the observations above, the purpose and method of the optimizations are very similar across functions, so we only list the handwritten vs. compiler-optimized comparison for `fadd32`. It is immediately apparent that the compiler-optimized code has no store/load operations in its prologue/epilogue. Also, look at line 30 (address `108a0`) of the compiler-optimized listing: comparing it with the handwritten code, you will find that `count_leading_zeros` is executed inline there. Embedding it directly into `fadd32` removes the overhead of the function call.
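For reference, the logic being inlined corresponds to the following C sketch. It is reconstructed from the handwritten assembly listed below, not copied from the original source:

```c
#include <stdint.h>

/* Count leading zeros without a loop: smear the highest set bit
 * into every lower bit, then count the set bits of the mask. */
static inline uint32_t count_leading_zeros(uint32_t x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    /* popcount of the smeared mask */
    x -= (x >> 1) & 0x55555555;
    x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f;
    x += x >> 8;
    x += x >> 16;
    return 32 - (x & 0x7f);
}
```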
Handwritten:

```asm
fadd32:
    # prologue: spill all callee-saved registers (s11 is saved but never used)
    addi sp, sp, -52
    sw   s0, 0(sp)
    sw   s1, 4(sp)
    sw   s2, 8(sp)
    sw   s3, 12(sp)
    sw   s4, 16(sp)
    sw   s5, 20(sp)
    sw   s6, 24(sp)
    sw   s7, 28(sp)
    sw   s8, 32(sp)
    sw   s9, 36(sp)
    sw   s10, 40(sp)
    sw   s11, 44(sp)
    sw   ra, 48(sp)
    mv   s0, a0
    mv   s1, a1
    # ensure s0 holds the operand with the larger magnitude
    li   t0, 0x7fffffff
    and  t1, s0, t0
    and  t2, s1, t0
    blt  t2, t1, noswap
    mv   t0, s0
    mv   s0, s1
    mv   s1, t0
noswap:
    # extract the sign bits
    li   t0, 31
    srl  s2, s0, t0
    srl  s3, s1, t0
    # extract the mantissas and set the implicit leading 1 (bit 23)
    li   t0, 0x7fffff
    li   t1, 0x800000
    and  t2, s0, t0
    or   s4, t2, t1
    and  t2, s1, t0
    or   s5, t2, t1
    # extract the biased exponents
    li   t0, 23
    li   t1, 0xff
    srl  t2, s0, t0
    and  s6, t2, t1
    srl  t2, s1, t0
    and  s7, t2, t1
    # alignment shift = exponent difference, clamped to 24
    sub  t0, s6, s7
    li   t1, 24
    blt  t1, t0, setalign_1
    mv   s8, t0
    j    setalign_exit
setalign_1:
    mv   s8, t1
setalign_exit:
    # align the smaller mantissa, then add when both signs are 0, else subtract
    srl  s5, s5, s8
    or   t0, s2, s3
    bne  t0, x0, setma_1
    add  s4, s4, s5
    j    setma_exit
setma_1:
    sub  s4, s4, s5
setma_exit:
    # renormalize: place the leading 1 of the mantissa at bit 23 (clz == 8)
    mv   a0, s4
    call count_leading_zeros
    mv   s9, a0
    mv   s10, x0
    li   t0, 8
    blt  t0, s9, shift_false
    li   t0, 8
    sub  s10, t0, s9
    srl  s4, s4, s10
    add  s6, s6, s10
    j    shift_exit
shift_false:
    li   t0, 8
    sub  s10, s9, t0
    sll  s4, s4, s10
    sub  s6, s6, s10
shift_exit:
    # repack sign | exponent | mantissa
    li   t0, 0x80000000
    and  t0, s0, t0
    li   t1, 23
    sll  t1, s6, t1
    li   t2, 0x7fffff
    and  t2, s4, t2
    or   t0, t0, t1
    or   a0, t0, t2
    # epilogue: restore the spilled registers
    lw   s0, 0(sp)
    lw   s1, 4(sp)
    lw   s2, 8(sp)
    lw   s3, 12(sp)
    lw   s4, 16(sp)
    lw   s5, 20(sp)
    lw   s6, 24(sp)
    lw   s7, 28(sp)
    lw   s8, 32(sp)
    lw   s9, 36(sp)
    lw   s10, 40(sp)
    lw   s11, 44(sp)
    lw   ra, 48(sp)
    addi sp, sp, 52
    ret

count_leading_zeros:
    addi sp, sp, -8
    sw   s0, 0(sp)
    sw   ra, 4(sp)
    mv   s0, a0             # s0 = x
    # smear the highest set bit into every lower bit
    srli t0, s0, 1
    or   s0, s0, t0
    srli t0, s0, 2
    or   s0, s0, t0
    srli t0, s0, 4
    or   s0, s0, t0
    srli t0, s0, 8
    or   s0, s0, t0
    srli t0, s0, 16
    or   s0, s0, t0
    # popcount of the smeared mask
    srli t0, s0, 1
    li   t1, 0x55555555
    and  t0, t0, t1
    sub  s0, s0, t0
    li   t1, 0x33333333
    and  t2, s0, t1
    srli t0, s0, 2
    li   t1, 0x33333333
    and  t0, t0, t1
    add  s0, t0, t2
    srli t0, s0, 4
    add  t0, t0, s0
    li   t1, 0x0f0f0f0f
    and  s0, t0, t1
    srli t0, s0, 8
    add  s0, s0, t0
    srli t0, s0, 16
    add  s0, s0, t0
    # clz = 32 - popcount
    li   t0, 32
    andi t1, s0, 0x7f
    sub  a0, t0, t1
    lw   s0, 0(sp)
    lw   ra, 4(sp)
    addi sp, sp, 8
    ret
```

Compiler-optimized:

```asm
00010830 <fadd32>:
   10830: 800007b7  lui     a5,0x80000
   10834: fff78793  add     a5,a5,-1 # 7fffffff <__BSS_END__+0x7ffdb0db>
   10838: 00a7f733  and     a4,a5,a0
   1083c: 00b7f7b3  and     a5,a5,a1
   10840: 00050813  mv      a6,a0
   10844: 00058613  mv      a2,a1
   10848: 00f74663  blt     a4,a5,10854 <fadd32+0x24>
   1084c: 00050613  mv      a2,a0
   10850: 00058813  mv      a6,a1
   10854: 00800537  lui     a0,0x800
   10858: 41765693  sra     a3,a2,0x17
   1085c: 41785793  sra     a5,a6,0x17
   10860: fff50713  add     a4,a0,-1 # 7fffff <__BSS_END__+0x7db0db>
   10864: 0ff6f693  zext.b  a3,a3
   10868: 0ff7f793  zext.b  a5,a5
   1086c: 00e675b3  and     a1,a2,a4
   10870: 40f687b3  sub     a5,a3,a5
   10874: 00e87733  and     a4,a6,a4
   10878: 01800893  li      a7,24
   1087c: 00a5e5b3  or      a1,a1,a0
   10880: 00a76733  or      a4,a4,a0
   10884: 00f8d463  bge     a7,a5,1088c <fadd32+0x5c>
   10888: 01800793  li      a5,24
   1088c: 40f757b3  sra     a5,a4,a5
   10890: 01066833  or      a6,a2,a6
   10894: 00f58733  add     a4,a1,a5
   10898: 00085463  bgez    a6,108a0 <fadd32+0x70>
   1089c: 40f58733  sub     a4,a1,a5
   108a0: 00175793  srl     a5,a4,0x1
   108a4: 00e7e7b3  or      a5,a5,a4
   108a8: 0027d593  srl     a1,a5,0x2
   108ac: 00b7e7b3  or      a5,a5,a1
   108b0: 0047d593  srl     a1,a5,0x4
   108b4: 00b7e7b3  or      a5,a5,a1
   108b8: 0087d593  srl     a1,a5,0x8
   108bc: 00b7e7b3  or      a5,a5,a1
   108c0: 0107d593  srl     a1,a5,0x10
   108c4: 00b7e7b3  or      a5,a5,a1
   108c8: 55555537  lui     a0,0x55555
   108cc: 0017d593  srl     a1,a5,0x1
   108d0: 55550513  add     a0,a0,1365 # 55555555 <__BSS_END__+0x55530631>
   108d4: 00a5f5b3  and     a1,a1,a0
   108d8: 40b787b3  sub     a5,a5,a1
   108dc: 33333537  lui     a0,0x33333
   108e0: 33350513  add     a0,a0,819 # 33333333 <__BSS_END__+0x3330e40f>
   108e4: 0027d593  srl     a1,a5,0x2
   108e8: 00a5f5b3  and     a1,a1,a0
   108ec: 00a7f7b3  and     a5,a5,a0
   108f0: 00f585b3  add     a1,a1,a5
   108f4: 0045d793  srl     a5,a1,0x4
   108f8: 0f0f1537  lui     a0,0xf0f1
   108fc: 00b787b3  add     a5,a5,a1
   10900: f0f50513  add     a0,a0,-241 # f0f0f0f <__BSS_END__+0xf0cbfeb>
   10904: 00a7f7b3  and     a5,a5,a0
   10908: 0087d593  srl     a1,a5,0x8
   1090c: 00b787b3  add     a5,a5,a1
   10910: 0107d593  srl     a1,a5,0x10
   10914: 00b787b3  add     a5,a5,a1
   10918: 07f7f793  and     a5,a5,127
   1091c: 02000593  li      a1,32
   10920: 40f585b3  sub     a1,a1,a5
   10924: 00800513  li      a0,8
   10928: 02b54863  blt     a0,a1,10958 <fadd32+0x128>
   1092c: fe878793  add     a5,a5,-24
   10930: 40f75733  sra     a4,a4,a5
   10934: 00f686b3  add     a3,a3,a5
   10938: 00971713  sll     a4,a4,0x9
   1093c: 01769693  sll     a3,a3,0x17
   10940: 800007b7  lui     a5,0x80000
   10944: 00975713  srl     a4,a4,0x9
   10948: 00d76733  or      a4,a4,a3
   1094c: 00f67533  and     a0,a2,a5
   10950: 00a76533  or      a0,a4,a0
   10954: 00008067  ret
   10958: 01800593  li      a1,24
   1095c: 40f587b3  sub     a5,a1,a5
   10960: 00f71733  sll     a4,a4,a5
   10964: 40f686b3  sub     a3,a3,a5
   10968: 00971713  sll     a4,a4,0x9
   1096c: 01769693  sll     a3,a3,0x17
   10970: 800007b7  lui     a5,0x80000
   10974: 00975713  srl     a4,a4,0x9
   10978: 00d76733  or      a4,a4,a3
   1097c: 00f67533  and     a0,a2,a5
   10980: 00a76533  or      a0,a4,a0
   10984: 00008067  ret
```

### Cycle count for each part

| | count_leading_zeros | imul32 | fadd32 | fmul32 |
| ----- | ------------------- | ------ | ------ | ------ |
| O0    | 81 | 105 | 191 | 284 |
| O1    | 11 | 67  | 106 | 197 |
| O2    | 11 | 11  | 107 | 195 |
| O3    | 0  | 0   | 103 | 191 |
| Os    | 11 | 68  | 107 | 209 |
| Ofast | 0  | 0   | 103 | 191 |

The entries drop to 0 at `-O3` and `-Ofast`, which suggests the standalone calls to `count_leading_zeros` and `imul32` are optimized away entirely at those levels, so no cycles are attributed to them.

### Summary

Reviewing the compiler-optimized assembly revealed some rules for writing efficient assembly code:

1. Prefer caller-saved registers in callable functions, so that fewer store/load operations are needed.
2. Rearrange instructions to eliminate unnecessary conditional branches.

Following these rules when writing assembly yields a relatively clear and efficient program, and provides a solid foundation for subsequent optimization. Rule 1 is illustrated by the sketch below.
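To make rule 1 concrete, here is a hedged sketch of my own (the functions `triple_s` and `triple_t` are hypothetical, not taken from the project): the same leaf function written once with a callee-saved register, which forces stack spills, and once with a caller-saved register, which needs none.

```asm
# Hypothetical leaf function: return 3 * x, computed as (x << 1) + x.

# Version A: using callee-saved s0 forces a prologue/epilogue spill,
# because the caller expects s0 to survive the call.
triple_s:
    addi sp, sp, -4
    sw   s0, 0(sp)      # spill s0 before clobbering it
    slli s0, a0, 1      # s0 = x << 1
    add  a0, s0, a0     # a0 = (x << 1) + x
    lw   s0, 0(sp)      # restore s0
    addi sp, sp, 4
    ret

# Version B: using caller-saved t0 needs no stack traffic at all.
triple_t:
    slli t0, a0, 1      # t0 = x << 1
    add  a0, t0, a0     # a0 = (x << 1) + x
    ret
```

Four of the seven instructions in version A exist only to preserve `s0`; version B does the same work in two instructions plus `ret`, which is exactly the saving the compiler obtains in `imul32` and `fmul32`.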