
Assignment2: GNU Toolchain

contributed by <shhung>

Motivation

The work I selected is Image scaling with Bilinear interpolation by linyu0425.
I chose it because the work itself is interesting, and because I wanted to review the floating-point processing it implements, such as addition and multiplication.

Make the C program run with rv32emu

Compiling the C program with GCC for RISC-V is straightforward:

$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o ImgScale ImgScale.c
$ rv32emu ImgScale

Make the assembly program run with rv32emu

By referencing asm-hello, the system-call demo in rv32emu, we can put together a Makefile and linker script for our project.

Modification

All that needs to be done is to modify the syscall-related code in the original program, since Ripes uses its own syscall design.
To keep the focus on the performance of the processing logic, I also removed the part that prints to the console.
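
For reference, here is a minimal sketch of how a system call is issued for rv32emu, assuming the Linux-style RISC-V convention (syscall number in a7, arguments starting at a0; 93 is exit in that ABI). The helper name is mine, not taken from the original program.

/* Hypothetical sketch: exit via ecall, Linux-style RISC-V ABI assumed. */
static void emu_exit(int code)
{
    register long a0 asm("a0") = code;  /* first argument */
    register long a7 asm("a7") = 93;    /* syscall number: exit */
    asm volatile("ecall" : : "r"(a0), "r"(a7));
    __builtin_unreachable();            /* exit does not return */
}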

Optimization method

Since this work implements several sub-functions used for image scaling, I split it into parts. Through this decomposition, we can analyze each unit of image scaling and identify the parts worth optimizing.
The function structure is shown below, and the performance of each part is discussed later.

─── ImgScale
    ├── count_leading_zeros
    ├── imul32
    ├── fmul32
    └── fadd32

Bug

While testing the individual parts of the program, I found that imul32 and fmul32 produce slightly incorrect results, and the error occurs in both the C and assembly versions.
Additionally, imul32 returns a 32-bit signed integer, which can overflow when the full product does not fit; the same problem occurs in fmul32. Even so, the overall logic is sound, so the optimization below is still based on the buggy program.
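
A small sketch of the overflow issue (illustrative C with assumed signatures, not the original code):

#include <stdint.h>

/* With a 32-bit return type, the upper half of the 64-bit product is
 * silently discarded, so large operands overflow. */
int32_t mul_narrow(int32_t a, int32_t b)
{
    return (int32_t)((int64_t)a * b);   /* high 32 bits are lost */
}

/* Widening the return type preserves the full product. */
int64_t mul_wide(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}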

Execute the program

The working code is available at src, and a Makefile is provided to generate ELF files that are runnable on rv32emu.

  1. Build the ELF file

After running the command below, you will get main.elf and ImgScaleFromC.elf.

$ make
  2. Run each of them

Make sure you have already installed rv32emu and added it to your PATH. If not, check Lab2: RISC-V RV32I[MACF] emulator with ELF support to set up your environment.

$ rv32emu main.elf
$ rv32emu ImgScaleFromC.elf

Expected output:

For main.elf and ImgScaleFromC.elf, the outputs differ only in the elapsed cycle count.

0.954780 0.877887 0.800995 0.724102 0.647210
0.921899 0.826679 0.731460 0.636240 0.541020
0.889018 0.775471 0.661924 0.548377 0.434830
0.856138 0.724263 0.592389 0.460514 0.328640
0.823257 0.673055 0.522853 0.372652 0.222450
elapsed cycle: 39580
inferior exit code 0

Contrast the handwritten and compiler-optimized assembly code

By following the example of ticks.c, we can evaluate the work by accumulating the cycle count.
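
A sketch of how the 64-bit cycle counter can be read on RV32, in the spirit of ticks.c (the helper name is mine; see ticks.c for the exact version):

#include <stdint.h>

/* Read the cycle counter from the rdcycle/rdcycleh CSRs. The high
 * half is read twice to guard against a rollover between reads. */
static uint64_t get_cycles(void)
{
    uint32_t lo, hi, hi2;
    do {
        asm volatile("rdcycleh %0" : "=r"(hi));
        asm volatile("rdcycle  %0" : "=r"(lo));
        asm volatile("rdcycleh %0" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}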

        Handwritten   O0      O1      O2      O3      Os      Ofast
Size    56998         83940   82384   82364   83244   82364   83244
Cycle   39610         39065   16162   16459   12183   17012   12183

It is clear that the cycle count improves significantly once compiler optimization is applied, while the code size changes relatively little across optimization levels.
Below, we will describe the difference between the assembly code generated by the compiler and the handwritten assembly code. The complete code is accessible at shhung.

count_leading_zeros

Since count_leading_zeros is already minimal, there is almost nothing left to optimize. The only remaining improvement is to avoid store/load operations by using caller-saved registers instead of callee-saved ones.
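
For reference, here is the C equivalent of the routine, reconstructed from the handwritten assembly shown later: smear the highest set bit into every lower position, then return 32 minus the population count.

#include <stdint.h>

static int count_leading_zeros(uint32_t x)
{
    /* Smear the most significant set bit into all lower positions. */
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    /* Population count via the SWAR method. */
    x -= (x >> 1) & 0x55555555;
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0f0f0f0f;
    x += x >> 8;
    x += x >> 16;
    /* clz = 32 - popcount of the smeared value */
    return 32 - (x & 0x7f);
}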

imul32

By replacing callee-saved registers with caller-saved registers, store/load operations are not required in imul32.

fadd32

Since count_leading_zeros is called only once in fadd32, -O2 and higher optimization levels replace the call by inlining its logic directly into fadd32.
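
A hypothetical illustration of why the call disappears: a static function with a single call site is a prime inlining candidate, so at -O2 GCC folds it into its caller and no call/return or prologue/epilogue remains.

/* Illustrative only, not the original code. */
static unsigned twice(unsigned x)
{
    return x << 1;
}

unsigned caller(unsigned x)
{
    return twice(x) + 1;    /* at -O2 this compiles to a shift and an add */
}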

fmul32

As with imul32, the code generated at -O1 uses fewer callee-saved registers, which results in fewer load/store instructions.
At -O2 and higher levels the function becomes branchless, and store/load operations are eliminated by the same caller-saved register substitution.
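
A generic sketch of the branchless pattern (illustrative C, not the original fmul32): the condition is turned into an all-ones or all-zeros mask, so the selection is done with plain ALU instructions instead of a conditional branch.

#include <stdint.h>

/* May compile to a conditional branch. */
uint32_t select_branchy(uint32_t cond, uint32_t a, uint32_t b)
{
    return cond ? a : b;
}

/* Branchless equivalent: blend the two values with a mask. */
uint32_t select_branchless(uint32_t cond, uint32_t a, uint32_t b)
{
    uint32_t mask = -(uint32_t)(cond != 0);   /* 0xffffffff or 0 */
    return (a & mask) | (b & ~mask);
}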

Difference

Based on the observations above, the purpose and method of the optimizations are very similar across functions, so we only list the differences between the handwritten and compiler-optimized code for fadd32.

It is immediately visible that there are no store/load operations in the prologue/epilogue of the compiler-optimized code. Additionally, look at line 30 of the compiler-optimized listing (address 108a0): comparing it with the handwritten code, you will find that count_leading_zeros is executed there. By embedding it directly into fadd32, the possible delays caused by function calls are avoided.

Handwritten:

fadd32:
    addi    sp , sp , -52
    sw    s0 , 0(sp)
    sw    s1 , 4(sp)
    sw    s2 , 8(sp)
    sw    s3 , 12(sp)
    sw    s4 , 16(sp)
    sw    s5 , 20(sp)
    sw    s6 , 24(sp)
    sw    s7 , 28(sp)
    sw    s8 , 32(sp)
    sw    s9 , 36(sp)
    sw    s10 , 40(sp)
    sw    s11 , 44(sp)
    sw    ra , 48(sp)
    mv    s0 , a0
    mv    s1 , a1
    
    # order the operands so that |s0| >= |s1|
    li    t0 , 0x7fffffff
    and    t1 , s0 , t0
    and    t2 , s1 , t0
    blt    t2 , t1 , noswap
    mv    t0 , s0
    mv    s0 , s1
    mv    s1 , t0
    noswap:
    # extract signs (s2, s3), mantissas with hidden 1 (s4, s5), exponents (s6, s7)
    li    t0 , 31
    srl    s2 , s0 , t0
    srl    s3 , s1 , t0
    li    t0 , 0x7fffff
    li    t1 , 0x800000
    and    t2 , s0 , t0
    or    s4 , t2 , t1
    and    t2 , s1 , t0
    or    s5 , t2 , t1
    li    t0 , 23
    li    t1 , 0xff
    srl    t2 , s0 , t0
    and    s6 , t2 , t1
    srl    t2 , s1 , t0
    and    s7 , t2 , t1
    
    # alignment shift = exponent difference, capped at 24
    sub    t0 , s6 , s7
    li    t1 , 24
    blt    t1 , t0 , setalign_1
    mv    s8 , t0
    j    setalign_exit
    setalign_1:
    mv    s8 , t1
    setalign_exit:
    
    srl    s5 , s5 , s8
    # add or subtract the aligned mantissas depending on the sign bits
    or    t0 , s2 , s3
    bne    t0 , x0 , setma_1
    add    s4 , s4 , s5
    j    setma_exit
    setma_1:
    sub    s4 , s4 , s5
    setma_exit:
        
    # normalize: count_leading_zeros picks the shift direction and amount
    mv    a0 , s4
    call    count_leading_zeros
    mv    s9 , a0
    mv    s10 , x0
    li    t0 , 8
    blt    t0 , s9 , shift_false
    li    t0 , 8
    sub    s10 , t0 , s9
    srl    s4 , s4 , s10
    add    s6 , s6 , s10
    j    shift_exit
    shift_false:
    li    t0 , 8
    sub    s10 , s9 , t0
    sll    s4 , s4 , s10
    sub    s6 , s6 , s10
    shift_exit:
    # repack sign, exponent, and mantissa into the result
    li    t0 , 0x80000000
    and    t0 , s0 , t0
    li    t1 , 23
    sll    t1 , s6 , t1
    li    t2 , 0x7fffff
    and    t2 , s4 , t2
    or    t0 , t0 , t1
    or    a0 , t0 , t2
    
    lw    s0 , 0(sp)
    lw    s1 , 4(sp)
    lw    s2 , 8(sp)
    lw    s3 , 12(sp)
    lw    s4 , 16(sp)
    lw    s5 , 20(sp)
    lw    s6 , 24(sp)
    lw    s7 , 28(sp)
    lw    s8 , 32(sp)
    lw    s9 , 36(sp)
    lw    s10 , 40(sp)
    lw    s11 , 44(sp)
    lw    ra , 48(sp)
    addi    sp , sp , 52
    ret

count_leading_zeros:
    addi    sp , sp , -8
    sw    s0 , 0(sp)
    sw    ra , 4(sp)
    mv    s0 , a0 # s0 = x

    # smear the most significant set bit into all lower positions
    srli    t0 , s0 , 1
    or    s0 , s0 ,t0
    srli    t0 , s0 , 2
    or    s0 , s0 ,t0
    srli    t0 , s0 , 4
    or    s0 , s0 ,t0
    srli    t0 , s0 , 8
    or    s0 , s0 ,t0
    srli    t0 , s0 , 16
    or    s0 , s0 ,t0
    # population count of the smeared value (SWAR)
    srli    t0 , s0 , 1
    li    t1 , 0x55555555
    and    t0 , t0 , t1
    sub    s0 , s0 , t0
    li    t1 , 0x33333333
    and    t2 , s0 , t1
    srli    t0 , s0 , 2
    li    t1 , 0x33333333
    and    t0 , t0 , t1
    add    s0 , t0 , t2
    srli    t0 , s0 , 4
    add    t0 , t0 , s0
    li    t1 , 0x0f0f0f0f
    and    s0 , t0 , t1
    srli    t0 , s0 , 8
    add    s0 , s0 , t0
    srli    t0 , s0 , 16
    add    s0 , s0 , t0
    # clz = 32 - popcount
    li    t0 , 32
    andi    t1 , s0 , 0x7f
    sub    a0 , t0 , t1
    
    lw    s0 , 0(sp)
    lw    ra , 4(sp)
    addi    sp , sp , 8
    ret

Compiler-optimized:

00010830 <fadd32>:
   10830:	800007b7          	lui	a5,0x80000
   10834:	fff78793          	add	a5,a5,-1 # 7fffffff <__BSS_END__+0x7ffdb0db>
   10838:	00a7f733          	and	a4,a5,a0
   1083c:	00b7f7b3          	and	a5,a5,a1
   10840:	00050813          	mv	a6,a0
   10844:	00058613          	mv	a2,a1
   10848:	00f74663          	blt	a4,a5,10854 <fadd32+0x24>
   1084c:	00050613          	mv	a2,a0
   10850:	00058813          	mv	a6,a1
   10854:	00800537          	lui	a0,0x800
   10858:	41765693          	sra	a3,a2,0x17
   1085c:	41785793          	sra	a5,a6,0x17
   10860:	fff50713          	add	a4,a0,-1 # 7fffff <__BSS_END__+0x7db0db>
   10864:	0ff6f693          	zext.b	a3,a3
   10868:	0ff7f793          	zext.b	a5,a5
   1086c:	00e675b3          	and	a1,a2,a4
   10870:	40f687b3          	sub	a5,a3,a5
   10874:	00e87733          	and	a4,a6,a4
   10878:	01800893          	li	a7,24
   1087c:	00a5e5b3          	or	a1,a1,a0
   10880:	00a76733          	or	a4,a4,a0
   10884:	00f8d463          	bge	a7,a5,1088c <fadd32+0x5c>
   10888:	01800793          	li	a5,24
   1088c:	40f757b3          	sra	a5,a4,a5
   10890:	01066833          	or	a6,a2,a6
   10894:	00f58733          	add	a4,a1,a5
   10898:	00085463          	bgez	a6,108a0 <fadd32+0x70>
   1089c:	40f58733          	sub	a4,a1,a5
   108a0:	00175793          	srl	a5,a4,0x1
   108a4:	00e7e7b3          	or	a5,a5,a4
   108a8:	0027d593          	srl	a1,a5,0x2
   108ac:	00b7e7b3          	or	a5,a5,a1
   108b0:	0047d593          	srl	a1,a5,0x4
   108b4:	00b7e7b3          	or	a5,a5,a1
   108b8:	0087d593          	srl	a1,a5,0x8
   108bc:	00b7e7b3          	or	a5,a5,a1
   108c0:	0107d593          	srl	a1,a5,0x10
   108c4:	00b7e7b3          	or	a5,a5,a1
   108c8:	55555537          	lui	a0,0x55555
   108cc:	0017d593          	srl	a1,a5,0x1
   108d0:	55550513          	add	a0,a0,1365 # 55555555 <__BSS_END__+0x55530631>
   108d4:	00a5f5b3          	and	a1,a1,a0
   108d8:	40b787b3          	sub	a5,a5,a1
   108dc:	33333537          	lui	a0,0x33333
   108e0:	33350513          	add	a0,a0,819 # 33333333 <__BSS_END__+0x3330e40f>
   108e4:	0027d593          	srl	a1,a5,0x2
   108e8:	00a5f5b3          	and	a1,a1,a0
   108ec:	00a7f7b3          	and	a5,a5,a0
   108f0:	00f585b3          	add	a1,a1,a5
   108f4:	0045d793          	srl	a5,a1,0x4
   108f8:	0f0f1537          	lui	a0,0xf0f1
   108fc:	00b787b3          	add	a5,a5,a1
   10900:	f0f50513          	add	a0,a0,-241 # f0f0f0f <__BSS_END__+0xf0cbfeb>
   10904:	00a7f7b3          	and	a5,a5,a0
   10908:	0087d593          	srl	a1,a5,0x8
   1090c:	00b787b3          	add	a5,a5,a1
   10910:	0107d593          	srl	a1,a5,0x10
   10914:	00b787b3          	add	a5,a5,a1
   10918:	07f7f793          	and	a5,a5,127
   1091c:	02000593          	li	a1,32
   10920:	40f585b3          	sub	a1,a1,a5
   10924:	00800513          	li	a0,8
   10928:	02b54863          	blt	a0,a1,10958 <fadd32+0x128>
   1092c:	fe878793          	add	a5,a5,-24
   10930:	40f75733          	sra	a4,a4,a5
   10934:	00f686b3          	add	a3,a3,a5
   10938:	00971713          	sll	a4,a4,0x9
   1093c:	01769693          	sll	a3,a3,0x17
   10940:	800007b7          	lui	a5,0x80000
   10944:	00975713          	srl	a4,a4,0x9
   10948:	00d76733          	or	a4,a4,a3
   1094c:	00f67533          	and	a0,a2,a5
   10950:	00a76533          	or	a0,a4,a0
   10954:	00008067          	ret
   10958:	01800593          	li	a1,24
   1095c:	40f587b3          	sub	a5,a1,a5
   10960:	00f71733          	sll	a4,a4,a5
   10964:	40f686b3          	sub	a3,a3,a5
   10968:	00971713          	sll	a4,a4,0x9
   1096c:	01769693          	sll	a3,a3,0x17
   10970:	800007b7          	lui	a5,0x80000
   10974:	00975713          	srl	a4,a4,0x9
   10978:	00d76733          	or	a4,a4,a3
   1097c:	00f67533          	and	a0,a2,a5
   10980:	00a76533          	or	a0,a4,a0
   10984:	00008067          	ret

Cycle count for each part

        count_leading_zeros   imul32   fadd32   fmul32
O0      81                    105      191      284
O1      11                    67       106      197
O2      11                    11       107      195
O3      0                     0        103      191
Os      11                    68       107      209
Ofast   0                     0        103      191

At -O3 and -Ofast, count_leading_zeros and imul32 report zero cycles, presumably because the compiler inlines them completely into their callers at these levels.

Summary

Through reviewing the compiler-optimized assembly, I found some rules for writing efficient assembly code:

  1. Keep store/load operations to a minimum in functions that are called, for example by preferring caller-saved registers.
  2. Rearrange instructions to reduce unnecessary conditional branches.

By following these rules when writing assembly code, you can produce a relatively clear and efficient program, which also makes subsequent optimization on that foundation more convenient.