contributed by < `shhung` >
The work I selected for Assignment 2 is Image scaling with bilinear interpolation by linyu0425. I chose it because the work itself is interesting, and because I wanted to review the floating-point routines it implements, such as addition and multiplication.
Compiling the program with GCC for RISC-V is quite easy:
$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o ImgScale ImgScale.S
$ rv32emu ImgScale
By referencing asm-hello, the system-call demo in rv32emu, we can write a Makefile and linker script for our project. All that needs to change is the syscall handling in the original program, since Ripes has its own syscall design.
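As a minimal sketch of that change (the service numbers are my assumption: Ripes follows a RARS-style ecall table, while rv32emu uses Linux-style syscall numbers), a program exit looks like this on each platform:

```
# Hypothetical exit sequences; verify the numbers against each
# emulator's documentation before relying on them.

# Ripes (RARS-style ecall table):
    li   a7, 10        # service 10 = Exit
    ecall

# rv32emu (Linux-style syscall numbers):
    li   a0, 0         # exit code
    li   a7, 93        # SYS_exit
    ecall
```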
In order to focus on the performance of the processing logic, I removed the part that prints to the console.
Since this work implements multiple sub-functions for image scaling, I split it into several parts. This way, we can analyze each unit of image scaling and identify the parts worth optimizing.
The function structure is shown below, and the performance of each part is discussed later.
─── ImgScale
├── count_leading_zeros
├── imul32
├── fmul32
└── fadd32
While testing the individual parts of the program, I found that the functions `imul32` and `fmul32` produce slightly incorrect results, and the error occurs in both the C and assembly versions. Additionally, `imul32` returns a 32-bit signed integer, which can overflow: a 32-by-32-bit product can require up to 64 bits, so the upper half is silently lost. The same problem occurs in `fmul32`, where the product of two 24-bit mantissas can need up to 48 bits. Even so, the overall logic is sound, so optimization will still be carried out on the buggy program.
The workable code is available at src, and a Makefile is provided to generate ELF files runnable on rv32emu. Running the following command produces `main.elf` and `ImgScaleFromC.elf`:
$ make
Make sure you have already installed `rv32emu` and added it to your path. If not, check Lab2: RISC-V RV32I[MACF] emulator with ELF support to set up your environment.
$ rv32emu main.elf
$ rv32emu ImgScaleFromC.elf
Expected output: `main.elf` and `ImgScaleFromC.elf` produce identical output except for the elapsed cycle count.
0.954780 0.877887 0.800995 0.724102 0.647210
0.921899 0.826679 0.731460 0.636240 0.541020
0.889018 0.775471 0.661924 0.548377 0.434830
0.856138 0.724263 0.592389 0.460514 0.328640
0.823257 0.673055 0.522853 0.372652 0.222450
elapsed cycle: 39580
inferior exit code 0
By following the example of ticks.c, we can evaluate our work by accumulating the cycle count, as sketched below.
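A minimal sketch of such a counter, assuming the emulated hart exposes the standard `cycle`/`cycleh` CSRs the way ticks.c reads them (the helper name is hypothetical):

```
# Hypothetical helper modeled on ticks.c: read the 64-bit cycle
# counter, retrying if the high word changes between the reads.
get_cycles:
    rdcycleh a1                  # high 32 bits
    rdcycle  a0                  # low 32 bits
    rdcycleh t0                  # re-read the high word
    bne      a1, t0, get_cycles  # a carry slipped in; retry
    ret
```

Calling it before and after a function and subtracting the two results gives that function's cycle count.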
| | Handwritten | O0 | O1 | O2 | O3 | Os | Ofast |
|---|---|---|---|---|---|---|---|
| Size | 56998 | 83940 | 82384 | 82364 | 83244 | 82364 | 83244 |
| Cycles | 39610 | 39065 | 16162 | 16459 | 12183 | 17012 | 12183 |
It is clear that the cycle count improves significantly once compiler optimization is applied, whereas the code size changes little across optimization levels.
Below, we describe the differences between the compiler-generated assembly and the handwritten assembly. The complete code is accessible at shhung.
Since `count_leading_zeros` is already quite compact, there is almost nothing left to optimize. The only remaining improvement is to omit store/load operations by replacing callee-saved registers with caller-saved ones, as sketched below.
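A minimal sketch of the idea (function names and bodies are illustrative only): in a leaf function, keeping the working value in a caller-saved register eliminates the stack frame entirely:

```
# With callee-saved s0, the ABI forces a save/restore pair:
leaf_callee:
    addi sp, sp, -4
    sw   s0, 0(sp)
    mv   s0, a0        # ... body works on s0 ...
    mv   a0, s0
    lw   s0, 0(sp)
    addi sp, sp, 4
    ret

# With caller-saved t0, the prologue/epilogue disappears:
leaf_caller:
    mv   t0, a0        # ... body works on t0 ...
    mv   a0, t0
    ret
```

This is exactly why the compiler-generated listings below have no sw/lw pairs in their prologues.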
By replacing callee-saved registers with caller-saved ones, store/load operations are likewise unnecessary in `imul32`.
Since `count_leading_zeros` is called only once in `fadd32`, -O2 and higher optimization levels replace the call with the operation logic written directly inside `fadd32`.
As with `imul32`, the code generated at -O1 uses fewer callee-saved registers, which results in fewer load/store instructions.
At -O2 and above, the function is branchless, and store/load operations are not required because callee-saved registers are replaced with caller-saved ones. A sketch of a branchless pattern follows.
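To illustrate what branchless means here (this clamp is a hypothetical example, not taken from the compiler output), a bound such as min(x, 24), which `fadd32` needs when aligning exponents, can be computed with `slt` and a mask instead of a branch:

```
# Branchless clamp: a5 = min(a5, 24), with no conditional jump.
    li   t0, 24
    slt  t1, t0, a5    # t1 = 1 if a5 > 24, else 0
    sub  t2, t0, a5    # t2 = 24 - a5 (negative when clamping)
    sub  t1, zero, t1  # t1 = all-ones mask if clamping, else 0
    and  t2, t2, t1    # keep the correction only when needed
    add  a5, a5, t2    # a5 unchanged, or pulled down to 24
```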
Based on the observations above, the purpose and method of the optimizations are very similar across functions, so we only list the differences between the handwritten and compiler-optimized code for `fadd32`.
It is immediately apparent that the compiler-optimized code has no store/load operations in its prologue/epilogue. Also, look at line 30 of the compiler-optimized listing (address 108a0): comparing it with the handwritten code shows that `count_leading_zeros` executes inline at that point. Embedding it directly into `fadd32` removes the overhead of a function call.
Handwritten:
fadd32:
addi sp , sp , -52
sw s0 , 0(sp)
sw s1 , 4(sp)
sw s2 , 8(sp)
sw s3 , 12(sp)
sw s4 , 16(sp)
sw s5 , 20(sp)
sw s6 , 24(sp)
sw s7 , 28(sp)
sw s8 , 32(sp)
sw s9 , 36(sp)
sw s10 , 40(sp)
sw s11 , 44(sp)
sw ra , 48(sp)
mv s0 , a0
mv s1 , a1
li t0 , 0x7fffffff
and t1 , s0 , t0    # t1 = |a|
and t2 , s1 , t0    # t2 = |b|
blt t2 , t1 , noswap    # ensure s0 holds the larger-magnitude operand
mv t0 , s0
mv s0 , s1
mv s1 , t0
noswap:
li t0 , 31
srl s2 , s0 , t0    # s2, s3 = sign bits
srl s3 , s1 , t0
li t0 , 0x7fffff
li t1 , 0x800000
and t2 , s0 , t0    # s4, s5 = mantissas with the implicit leading 1
or s4 , t2 , t1
and t2 , s1 , t0
or s5 , t2 , t1
li t0 , 23
li t1 , 0xff
srl t2 , s0 , t0    # s6, s7 = biased exponents
and s6 , t2 , t1
srl t2 , s1 , t0
and s7 , t2 , t1
sub t0 , s6 , s7    # s8 = exponent difference, clamped to 24
li t1 , 24
blt t1 , t0 , setalign_1
mv s8 , t0
j setalign_exit
setalign_1:
mv s8 , t1
setalign_exit:
srl s5 , s5 , s8    # align the smaller mantissa
or t0 , s2 , s3    # if any sign bit is set, subtract; otherwise add
bne t0 , x0 , setma_1
add s4 , s4 , s5
j setma_exit
setma_1:
sub s4 , s4 , s5
setma_exit:
mv a0 , s4
call count_leading_zeros    # locate the leading 1 for normalization
mv s9 , a0
mv s10 , x0
li t0 , 8
blt t0 , s9 , shift_false    # more than 8 leading zeros: shift left
li t0 , 8
sub s10 , t0 , s9    # shift right and raise the exponent
srl s4 , s4 , s10
add s6 , s6 , s10
j shift_exit
shift_false:
li t0 , 8
sub s10 , s9 , t0    # shift left and lower the exponent
sll s4 , s4 , s10
sub s6 , s6 , s10
shift_exit:
li t0 , 0x80000000
and t0 , s0 , t0    # repack: sign | exponent | mantissa
li t1 , 23
sll t1 , s6 , t1
li t2 , 0x7fffff
and t2 , s4 , t2
or t0 , t0 , t1
or a0 , t0 , t2
lw s0 , 0(sp)
lw s1 , 4(sp)
lw s2 , 8(sp)
lw s3 , 12(sp)
lw s4 , 16(sp)
lw s5 , 20(sp)
lw s6 , 24(sp)
lw s7 , 28(sp)
lw s8 , 32(sp)
lw s9 , 36(sp)
lw s10 , 40(sp)
lw s11 , 44(sp)
lw ra , 48(sp)
addi sp , sp , 52
ret
count_leading_zeros:
addi sp , sp , -8
sw s0 , 0(sp)
sw ra , 4(sp)
mv s0 , a0 # s0 = x
srli t0 , s0 , 1    # smear the highest set bit into all lower bits
or s0 , s0 , t0
srli t0 , s0 , 2
or s0 , s0 , t0
srli t0 , s0 , 4
or s0 , s0 , t0
srli t0 , s0 , 8
or s0 , s0 , t0
srli t0 , s0 , 16
or s0 , s0 , t0
srli t0 , s0 , 1    # popcount of the smeared value
li t1 , 0x55555555
and t0 , t0 , t1
sub s0 , s0 , t0
li t1 , 0x33333333
and t2 , s0 , t1
srli t0 , s0 , 2
li t1 , 0x33333333
and t0 , t0 , t1
add s0 , t0 , t2
srli t0 , s0 , 4
add t0 , t0 , s0
li t1 , 0x0f0f0f0f
and s0 , t0 , t1
srli t0 , s0 , 8
add s0 , s0 , t0
srli t0 , s0 , 16
add s0 , s0 , t0
li t0 , 32
andi t1 , s0 , 0x7f
sub a0 , t0 , t1    # clz = 32 - popcount
lw s0 , 0(sp)
lw ra , 4(sp)
addi sp , sp , 8
ret
Compiler-optimized:
00010830 <fadd32>:
10830: 800007b7 lui a5,0x80000
10834: fff78793 add a5,a5,-1 # 7fffffff <__BSS_END__+0x7ffdb0db>
10838: 00a7f733 and a4,a5,a0
1083c: 00b7f7b3 and a5,a5,a1
10840: 00050813 mv a6,a0
10844: 00058613 mv a2,a1
10848: 00f74663 blt a4,a5,10854 <fadd32+0x24>
1084c: 00050613 mv a2,a0
10850: 00058813 mv a6,a1
10854: 00800537 lui a0,0x800
10858: 41765693 sra a3,a2,0x17
1085c: 41785793 sra a5,a6,0x17
10860: fff50713 add a4,a0,-1 # 7fffff <__BSS_END__+0x7db0db>
10864: 0ff6f693 zext.b a3,a3
10868: 0ff7f793 zext.b a5,a5
1086c: 00e675b3 and a1,a2,a4
10870: 40f687b3 sub a5,a3,a5
10874: 00e87733 and a4,a6,a4
10878: 01800893 li a7,24
1087c: 00a5e5b3 or a1,a1,a0
10880: 00a76733 or a4,a4,a0
10884: 00f8d463 bge a7,a5,1088c <fadd32+0x5c>
10888: 01800793 li a5,24
1088c: 40f757b3 sra a5,a4,a5
10890: 01066833 or a6,a2,a6
10894: 00f58733 add a4,a1,a5
10898: 00085463 bgez a6,108a0 <fadd32+0x70>
1089c: 40f58733 sub a4,a1,a5
108a0: 00175793 srl a5,a4,0x1
108a4: 00e7e7b3 or a5,a5,a4
108a8: 0027d593 srl a1,a5,0x2
108ac: 00b7e7b3 or a5,a5,a1
108b0: 0047d593 srl a1,a5,0x4
108b4: 00b7e7b3 or a5,a5,a1
108b8: 0087d593 srl a1,a5,0x8
108bc: 00b7e7b3 or a5,a5,a1
108c0: 0107d593 srl a1,a5,0x10
108c4: 00b7e7b3 or a5,a5,a1
108c8: 55555537 lui a0,0x55555
108cc: 0017d593 srl a1,a5,0x1
108d0: 55550513 add a0,a0,1365 # 55555555 <__BSS_END__+0x55530631>
108d4: 00a5f5b3 and a1,a1,a0
108d8: 40b787b3 sub a5,a5,a1
108dc: 33333537 lui a0,0x33333
108e0: 33350513 add a0,a0,819 # 33333333 <__BSS_END__+0x3330e40f>
108e4: 0027d593 srl a1,a5,0x2
108e8: 00a5f5b3 and a1,a1,a0
108ec: 00a7f7b3 and a5,a5,a0
108f0: 00f585b3 add a1,a1,a5
108f4: 0045d793 srl a5,a1,0x4
108f8: 0f0f1537 lui a0,0xf0f1
108fc: 00b787b3 add a5,a5,a1
10900: f0f50513 add a0,a0,-241 # f0f0f0f <__BSS_END__+0xf0cbfeb>
10904: 00a7f7b3 and a5,a5,a0
10908: 0087d593 srl a1,a5,0x8
1090c: 00b787b3 add a5,a5,a1
10910: 0107d593 srl a1,a5,0x10
10914: 00b787b3 add a5,a5,a1
10918: 07f7f793 and a5,a5,127
1091c: 02000593 li a1,32
10920: 40f585b3 sub a1,a1,a5
10924: 00800513 li a0,8
10928: 02b54863 blt a0,a1,10958 <fadd32+0x128>
1092c: fe878793 add a5,a5,-24
10930: 40f75733 sra a4,a4,a5
10934: 00f686b3 add a3,a3,a5
10938: 00971713 sll a4,a4,0x9
1093c: 01769693 sll a3,a3,0x17
10940: 800007b7 lui a5,0x80000
10944: 00975713 srl a4,a4,0x9
10948: 00d76733 or a4,a4,a3
1094c: 00f67533 and a0,a2,a5
10950: 00a76533 or a0,a4,a0
10954: 00008067 ret
10958: 01800593 li a1,24
1095c: 40f587b3 sub a5,a1,a5
10960: 00f71733 sll a4,a4,a5
10964: 40f686b3 sub a3,a3,a5
10968: 00971713 sll a4,a4,0x9
1096c: 01769693 sll a3,a3,0x17
10970: 800007b7 lui a5,0x80000
10974: 00975713 srl a4,a4,0x9
10978: 00d76733 or a4,a4,a3
1097c: 00f67533 and a0,a2,a5
10980: 00a76533 or a0,a4,a0
10984: 00008067 ret
| | count_leading_zeros | imul32 | fadd32 | fmul32 |
|---|---|---|---|---|
| O0 | 81 | 105 | 191 | 284 |
| O1 | 11 | 67 | 106 | 197 |
| O2 | 11 | 11 | 107 | 195 |
| O3 | 0 | 0 | 103 | 191 |
| Os | 11 | 68 | 107 | 209 |
| Ofast | 0 | 0 | 103 | 191 |
Through the process of reviewing the optimized assembly code, I found some rules for writing efficient assembly code:
- Prefer caller-saved registers in leaf functions, so no store/load pairs are needed in the prologue/epilogue.
- Inline small functions that are called only once instead of paying the cost of a function call.
- Write branchless code where possible.

By following these rules when writing assembly code, you can produce a relatively clear and efficient program, which makes subsequent optimization on this foundation much more convenient.