# Assignment2: GNU Toolchain
contributed by < [`CSIE523`](https://github.com/CSIE523/2023_Computer_Architecture_FALL/tree/main/hw2) >
I chose [Implement Binarization by count leading zero](https://hackmd.io/@edenlin/CompArchi_HW1) made from [林勁羽](https://github.com/JinYu1225/2023_Computer_Architecture/tree/main/Lab1).
## Motivation
Binarization is a common method in image processing. In order to make binarization more efficiently, I come up with some optimization ideas about this topic.
## Environment
I installed Ubuntu Linux 20.04-LTS on [Oracle Virtualbox](https://www.virtualbox.org/) and followed the [Lab2: RISC-V RV32I[MACF] emulator with ELF support](https://hackmd.io/@sysprog/SJAR5XMmi) step by step until I saw `hello.elf` successfully build.
```
~/rv32emu/tests$ cd asm-hello/
~/rv32emu/tests/asm-hello$ make
riscv-none-elf-as -R -march=rv32i -mabi=ilp32 -o hello.o hello.S
riscv-none-elf-ld -o hello.elf -T hello.ld --oformat=elf32-littleriscv hello.o
~/rv32emu/tests/asm-hello$ ls
hello.elf hello.ld hello.o hello.S Makefile
```
### elf Commands list
We can simply change `Ox` to `-O0`, `-O1`, `-O2`, `-O3` or `-Ofast`.
#### elf compile (We can simply change the number after '-O' to view different levels of compilation, such as '-O0' or '-Ofast'.)
```
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Ox hw2_binarization.c -o hw2_binarization_Ox.elf
```
#### Check elf instructions
```
riscv-none-elf-objdump -d hw2_binarization_Ox.elf
```
#### Check elf file header
```
riscv-none-elf-readelf -h hw2_binarization_Ox.elf
```
#### Check elf size
```
riscv-none-elf-size hw2_binarization_Ox.elf
```
#### elf execution
```
build/rv32emu hw2_binarization_Ox.elf
```
## Analysis
### C code Origin
```clike
#include <stdint.h>
#include <stdio.h>
uint32_t count_leading_zeros_32(uint32_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x3f)); // change 0x7f to 0x3f
}
int main(){
// pixel test
// 8-bit color depth for black and white photo
uint32_t picture[5] = {20,80,128,150,231};
uint32_t threshold = 128;
uint32_t *pixel = &picture;
for(int i = 0; i < 5; i++){
uint32_t sub = threshold - *(pixel+i);
printf("%d, ",i);
printf("before = %d, ",*(pixel+i));
sub = count_leading_zeros_32(sub);
if(sub)
*(pixel+i) = 0;
else
*(pixel+i) = 255;
printf("after = %d\n",*(pixel+i));
}
return 0;
}
```
The pixel values of an image are limited between 0 and 255, so we don't need to declare **uint32_t** to store variables. Instead, we can just declare **uint16_t** to store variables.
### C code Optimization
```clike
#include <stdint.h>
#include <stdio.h>
uint16_t count_leading_zeros_16(uint16_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x -= ((x >> 1) & 0x5555);
x = ((x >> 2) & 0x3333) + (x & 0x3333);
x = ((x >> 4) + x) & 0x0f0f;
x += (x >> 8);
return (16 - (x & 0x1f)); // change 0x3f to 0x1f
}
int main(){
// pixel test
// 8-bit color depth for black and white photo
uint16_t picture[5] = {20,80,128,150,231};
uint16_t threshold = 127;
uint16_t *pixel = picture;
for(int i = 0; i < 5; i++){
uint16_t sub = threshold - *(pixel+i);
printf("%d, ",i);
printf("before = %d\n, ",*(pixel+i));
sub = count_leading_zeros_16(sub);
*(pixel+i) = (sub) : 0 ? 255;
printf("after = %d\n",*(pixel+i));
printf("--------------------------------\n");
}
```
In addition to changing **uint32_t** to **uint16_t**, we can also eliminate the codes that are always zero since we declare uint16_t but are only using 8 bits of data.
```clike
uint32_t count_leading_zeros_32(uint32_t x)
{
...
x |= (x >> 16);
...
x += (x >> 16);
...
}
```
### -O0
```
count_leading_zeros_16:
addi sp,sp,-32
sw s0,28(sp)
addi s0,sp,32
mv a5,a0
sh a5,-18(s0)
lhu a5,-18(s0)
srli a5,a5,1
slli a5,a5,16
srli a5,a5,16
lhu a4,-18(s0)
or a5,a5,a4
sh a5,-18(s0)
lhu a5,-18(s0)
srli a5,a5,2
slli a5,a5,16
srli a5,a5,16
lhu a4,-18(s0)
or a5,a5,a4
sh a5,-18(s0)
lhu a5,-18(s0)
srli a5,a5,4
slli a5,a5,16
srli a5,a5,16
lhu a4,-18(s0)
or a5,a5,a4
sh a5,-18(s0)
lhu a5,-18(s0)
srli a5,a5,8
slli a5,a5,16
srli a5,a5,16
lhu a4,-18(s0)
or a5,a5,a4
sh a5,-18(s0)
lhu a5,-18(s0)
srli a5,a5,1
slli a4,a5,16
srli a4,a4,16
li a5,20480
addi a5,a5,1365
and a5,a4,a5
slli a5,a5,16
srli a5,a5,16
lhu a4,-18(s0)
sub a5,a4,a5
sh a5,-18(s0)
lhu a5,-18(s0)
srli a5,a5,2
slli a4,a5,16
srli a4,a4,16
li a5,12288
addi a5,a5,819
and a5,a4,a5
slli a4,a5,16
srli a4,a4,16
lhu a5,-18(s0)
mv a3,a5
li a5,12288
addi a5,a5,819
and a5,a3,a5
slli a5,a5,16
srli a5,a5,16
add a5,a4,a5
sh a5,-18(s0)
lhu a5,-18(s0)
srli a5,a5,4
slli a5,a5,16
srli a5,a5,16
lhu a4,-18(s0)
add a5,a5,a4
slli a4,a5,16
srli a4,a4,16
li a5,4096
addi a5,a5,-241
and a5,a4,a5
sh a5,-18(s0)
lhu a5,-18(s0)
srli a5,a5,8
slli a5,a5,16
srli a5,a5,16
lhu a4,-18(s0)
add a5,a5,a4
sh a5,-18(s0)
lhu a5,-18(s0)
andi a5,a5,31
slli a5,a5,16
srli a5,a5,16
li a4,16
sub a5,a4,a5
slli a5,a5,16
srli a5,a5,16
mv a0,a5
lw s0,28(sp)
addi sp,sp,32
jr ra
```
No optimization. The default value when no option is input.
`lw` : 1
`sw` : 1
`sh` : 9
`lhu`: 17
### -O1
```
count_leading_zeros_16:
srli a5,a0,1
or a0,a5,a0
srli a5,a0,2
or a5,a5,a0
srli a0,a5,4
or a0,a0,a5
srli a4,a0,8
or a4,a4,a0
srli a5,a4,1
li a3,20480
addi a3,a3,1365
and a5,a5,a3
sub a4,a4,a5
slli a4,a4,16
srli a4,a4,16
srli a5,a4,2
li a3,12288
addi a3,a3,819
and a5,a5,a3
and a4,a4,a3
add a5,a5,a4
srli a4,a5,4
add a5,a5,a4
li a4,4096
addi a4,a4,-241
and a5,a5,a4
srli a4,a5,8
add a5,a4,a5
andi a5,a5,31
li a0,16
sub a0,a0,a5
slli a0,a0,16
srli a0,a0,16
ret
```
Compared with option `-O0`, `-O1` eliminated many `Load and Store` instructions and any operations related to the `stack pointer`. Moreover, it reduces
```
srli a5,a5,1
slli a5,a5,16
srli a5,a5,16
lhu a4,-18(s0)
or a5,a5,a4
sh a5,-18(s0)
lhu a5,-18(s0)
```
to
```
srli a5,a0,1
or a0,a5,a0
```
.
### `-O2` (for count_leading_zeros_16, `-O2`, `-O3` and `-Ofast` are the same)
```c
count_leading_zeros_16:
srli a5,a0,1
or a0,a5,a0
srli a5,a0,2
or a5,a5,a0
srli a0,a5,4
or a0,a0,a5
srli a4,a0,8
or a4,a4,a0
li a3,20480
srli a5,a4,1
addi a3,a3,1365
and a5,a5,a3
sub a4,a4,a5
slli a4,a4,16
srli a4,a4,16
li a3,12288
addi a3,a3,819
srli a5,a4,2
and a5,a5,a3
and a4,a4,a3
add a5,a5,a4
srli a3,a5,4
li a4,4096
add a5,a5,a3
addi a4,a4,-241
and a5,a5,a4
srli a4,a5,8
add a5,a4,a5
andi a5,a5,31
li a0,16
sub a0,a0,a5
slli a0,a0,16
srli a0,a0,16
ret
```
The only difference between options `-O1` and `-O2` is the swapping of some instruction positions.
### Statistics
I think there are some mistakes in my program when counting cycle count, because it's impossible that cycle count is less than instret. Instret is total number of retired instructions, so cycle count must be equal to instret in single-cycle processor. In 5-stage processor, cycle count may be greater than or eqaul to instret due to hazards. Therefore, I'm still finding the solution to the problem.
| | O0_origin | O0_opt | O1_opt | O2_opt | O3_opt | Ofast_opt |
| ----------- | --------- | ------ | ------ | ------ | ------ | --------- |
| cycle count | 508 | 620 | 234 | 239 | 13 | 13 |
| instret | 210d | 20ce | 209a | 209b | 2099 | 2099 |
| text | 53692 | 52544 | 52096 | 52104 | 52056 | 52056 |
| data | 1924 | 1876 | 1876 | 1876 | 1876 | 1876 |
| bss | 1528 | 1528 | 1528 | 1528 | 1528 | 1528 |
| dec | 57144 | 55948 | 55500 | 55508 | 55460 | 55460 |
## Handwritten Assembly
### Origin
```c
.set STDOUT, 1
.set WRITE, 64
.set EXIT, 93
.data
test: .byte 20, 80, 128, 150, 231
result: .byte 0, 0, 0, 0, 0
.text
.global binarization_origin
.type binarization_origin, %function
binarization_origin:
addi sp, sp, -4
sw ra, 0(sp)
# a1 = address of pixel
# a2 = address of result
# a3 = threshold
la a1, test # load address of pixel
la a2, result # load address of result
li a3, 0x80 # load address of threshold
lbu a3, 0(a3) # load value of threshold
BINARIZATION:
li s1, 5 # count of haven't done pixel
B_LOOP:
lbu a4, 0(a1) # load value of pixel
sub a0, a3, a4 # threshold - pixel
jal ra, CLZ # jump to CLZ
bne a0, x0, FLOOR # branch to FLOOR if CLZ = 0
CEIL:
li a4, 255
j STORE_MOVE # jump to STORE_MOVE
FLOOR:
li a4, 0
STORE_MOVE:
sb a4, 0(a2)
addi s1, s1, -1 # total count -1
addi a1, a1, 1 # go to next address of pixel
addi a2, a2, 1 # go to next address of result
bne s1, x0, B_LOOP
li s1, 5 # make count for RESULT_CHECK
sub a2, a2, s1 # reset address of result
RESULT_CHECK:
lbu a0, 0(a2) # load result to a0
#jal ra, PRINT_INT
addi s1, s1, -1 # count print result
addi a2, a2, 1 # move to next address of result
bne s1, x0, RESULT_CHECK
# j EXIT
# Exit program
lw ra, 0(sp)
addi sp, sp, 4
ret
CLZ:
# a0: x = the number to be counted leading zero
# t0: shifted x
# t1: x shift & mask
# t2: mask
PAD1:
srli t0, a0, 1 # t0 = x >> 1
or a0, a0, t0 # x |= x >> 1
srli t0, a0, 2 # t0 = x >> 2
or a0, a0, t0 # x |= x >> 2
srli t0, a0, 4 # t0 = x >> 4
or a0, a0, t0 # x |= x >> 4
srli t0, a0, 8 # t0 = x >> 8
or a0, a0, t0 # x |= x >> 8
srli t0, a0, 16 # t0 = x >> 16
or a0, a0, t0 # x |= x >> 16
POPCNT:
li t2, 0x55555555 # load mask1 to t2
srli t0, a0, 1 # t0 = x >> 1
and t1, t0, t2 # t1 = (x >> 1) & mask1
sub a0, a0, t1 # x -= ((x >> 1) & mask1)
li t2, 0x33333333 # load mask2 to t2
srli t0, a0, 2 # t0 = x >> 2
and t1, t0, t2 # (x >> 2) & mask2
and a0, a0, t2 # x & mask2
add a0, t1, a0 # ((x >> 2) & mask2) + (x & mask2)
srli t0, a0, 4 # t0 = x >> 4
add a0, a0, t0 # x + (x >> 4)
li t2, 0x0f0f0f0f # load mask4 to t2
and a0, a0, t2 # ((x >> 4) + x) & mask4
srli t0, a0, 8 # t0 = x >> 8
add a0, a0, t0 # x += (x >> 8)
srli t0, a0, 16 # t0 = x >> 16
add a0, a0, t0 # x += (x >> 16)
andi t0, a0, 0x3f # t0 = x & 0x3f
li a0, 32 # a0 = 64
sub a0, a0, t0 # 64 - (x & 0x3f)
addi sp, sp, -4 # make space for ra in stack
sw ra, 0(sp) # push ra into stack
# jal ra, PRINT_INT
lw ra, 0(sp) # pull ra out of stack
addi sp, sp, 4 # free the space create in stack for ra
ret
PRINT_INT:
# li a7, 1 # print the value of input a0 in int
# ecall
addi sp, sp, -4
sw a0, 0(sp)
lw a0, 0(sp)
addi sp, sp, 4
ret
```
:::success
I fix the handwritten assembly code with eliminating some codes which may cause mistakes while compiling. The experiment result is listed at the bottom of this section.
:::
**Experiment Result**
<s>

</s>
:::warning
You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution.
:notes: jserv
:::
### Handwritten Assembly Optimization
With print function loop unrolling and eliminatiing some useless instructions.
```c
.set STDOUT, 1
.set WRITE, 64
.set EXIT, 93
.data
test: .byte 20, 80, 128, 150, 231
result: .byte 0, 0, 0, 0, 0
.text
.global binarization_opt_v1
.type binarization_opt_v1, %function
binarization_opt_v1:
addi sp, sp, -4
sw ra, 0(sp)
# a1 = address of pixel
# a2 = address of result
# a3 = threshold
la a1, test # load address of pixel
la a2, result # load address of result
li a3, 0x7F # load address of threshold
BINARIZATION:
li s1, 5 # count of haven't done pixel
B_LOOP:
lbu a4, 0(a1) # load value of pixel
sub a0, a3, a4 # threshold - pixel
jal ra, CLZ # jump to CLZ
bne a0, x0, FLOOR # branch to FLOOR if CLZ = 0
CEIL:
li a4, 255
j STORE_MOVE
FLOOR:
li a4, 0
STORE_MOVE:
sb a4, 0(a2)
addi s1, s1, -1 # total count -1
addi a1, a1, 1 # go to next address of pixel
addi a2, a2, 1 # go to next address of result
bne s1, x0, B_LOOP
addi a2, a2, -5
# la a0, str2 # start to print the result
# li a7, 4
# ecall
RESULT_CHECK:
# lbu a0, 0(a2) # load result to a0
# li a7, 1 # print the value of input a0 in int
# ecall
# la a0, str1
# li a7, 4
# ecall
# lbu a0, 1(a2) # load result to a0
# li a7, 1 # print the value of input a0 in int
# ecall
# la a0, str1
# li a7, 4
# ecall
# lbu a0, 2(a2) # load result to a0
# li a7, 1 # print the value of input a0 in int
# ecall
# la a0, str1
# li a7, 4
# ecall
# lbu a0, 3(a2) # load result to a0
# li a7, 1 # print the value of input a0 in int
# ecall
# la a0, str1
# li a7, 4
# ecall
# lbu a0, 4(a2) # load result to a0
# li a7, 1 # print the value of input a0 in int
# ecall
# j EXIT
# Exit program
lw ra, 0(sp)
addi sp, sp, 4
ret
CLZ:
# a0: x = the number to be counted leading zero
# t0: shifted x
# t1: x shift & mask
# t2: mask
PAD1:
srli t0, a0, 1 # t0 = x >> 1
or a0, a0, t0 # x |= x >> 1
srli t0, a0, 2 # t0 = x >> 2
or a0, a0, t0 # x |= x >> 2
srli t0, a0, 4 # t0 = x >> 4
or a0, a0, t0 # x |= x >> 4
srli t0, a0, 8 # t0 = x >> 8
or a0, a0, t0 # x |= x >> 8
POPCNT:
li t2, 0x55555555 # load mask1 to t2
srli t0, a0, 1 # t0 = x >> 1
and t1, t0, t2 # t1 = (x >> 1) & mask1
sub a0, a0, t1 # x -= ((x >> 1) & mask1)
li t2, 0x33333333 # load mask2 to t2
srli t0, a0, 2 # t0 = x >> 2
and t1, t0, t2 # (x >> 2) & mask2
and a0, a0, t2 # x & mask2
add a0, t1, a0 # ((x >> 2) & mask2) + (x & mask2)
srli t0, a0, 4 # t0 = x >> 4
add a0, a0, t0 # x + (x >> 4)
li t2, 0x0f0f0f0f # load mask4 to t2
and a0, a0, t2 # ((x >> 4) + x) & mask4
srli t0, a0, 8 # t0 = x >> 8
add a0, a0, t0 # x += (x >> 8)
andi t0, a0, 0x1f # t0 = x & 0x1f
li a0, 16 # a0 = 16
sub a0, a0, t0 # 16 - (x & 0x1f)
ret
```
### Without CLZ
Just for comparison, I did another version without CLZ.
```c
.data
test: .byte 20, 80, 128, 150, 255
result: .byte 0, 0, 0, 0, 0
.text
.global binarization_opt_v2
.type binarization_opt_v2, %function
binarization_opt_v2:
addi sp, sp, -4
sw ra, 0(sp)
# a1 = address of pixel
# a2 = address of result
# a3 = threshold
la a1, test # load address of pixel
la a2, result # load address of result
li a3, 0x7F # load address of threshold
BINARIZATION:
li s1, 5 # count of haven't done pixel
addi a3, a3, 1
B_LOOP:
lbu a4, 0(a1) # load value of pixel
sltu t0, a4, a3
addi a0, t0, 255
sb a0, 0(a2)
lbu a4, 1(a1) # load value of pixel
sltu t0, a4, a3
addi a0, t0, 255
sb a0, 1(a2)
lbu a4, 2(a1) # load value of pixel
sltu t0, a4, a3
addi a0, t0, 255
sb a0, 2(a2)
lbu a4, 3(a1) # load value of pixel
sltu t0, a4, a3
addi a0, t0, 255
sb a0, 3(a2)
lbu a4, 4(a1) # load value of pixel
sltu t0, a4, a3
addi a0, t0, 255
sb a0, 4(a2)
RESULT_CHECK:
# lbu a0, 0(a2) # load result to a0
# jal ra, PRINT_INT
# addi s1, s1, -1 # count print result
# addi a2, a2, 1 # move to next address of result
# bne s1, x0, RESULT_CHECK
# Exit program
lw ra, 0(sp)
addi sp, sp, 4
ret
PRINT_INT:
li a7, 1 # print the value of input a0 in int
ecall
addi sp, sp, -4
sw a0, 0(sp)
lw a0, 0(sp)
addi sp, sp, 4
ret
```
**Experiment Result**
I am still continuing to understand how to adapt corresponding `write` system call to my handwritten assembly code in order to check the result. Thus, the assembly code above doesn't print anything which results in less cycle counts than ripes.
Here is my makefile to build related .S and .o files.
```
.PHONY: clean
include ../../../mk/toolchain.mk
CFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall
OBJS = \
getcycles.o \
percount.o \
binarization_origin.o \
binarization_opt_v1.o \
binarization_opt_v2.o
BIN = perfcount.elf
%.o: %.S
$(CROSS_COMPILE)gcc $(CFLAGS) -c -o $@ $<
%.o: %.c
$(CROSS_COMPILE)gcc $(CFLAGS) -c -o $@ $<
all: $(BIN)
$(BIN): $(OBJS)
$(CROSS_COMPILE)gcc -o $@ $^
clean:
$(RM) $(BIN) $(OBJS)
```
```
$ make
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall -c -o getcycles.o getcycles.S
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall -c -o percount.o percount.c
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall -c -o binarization_origin.o binarization_origin.S
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall -c -o binarization_opt_v1.o binarization_opt_v1.S
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall -c -o binarization_opt_v2.o binarization_opt_v2.S
riscv-none-elf-gcc -o perfcount.elf getcycles.o percount.o binarization_origin.o binarization_opt_v1.o binarization_opt_v2.o
$ ~/rv32emu/build/rv32emu perfcount.elf
origin cycle count: 288
origin cycle count: 222
origin cycle count: 39
inferior exit code 0
```
| cycle counts | rv32emu | Ripes |
| --- | -------- | -------- |
| binarization_origin.S | 288 | 545 |
| binarization_opt_v1.S | 222 | 324 |
| binarization_opt_v2.S | 39 | 177 |