# Assignment2: GNU Toolchain
contributed by < [`yutingshih`](https://github.com/yutingshih) >
Topic: [Bfloat16 Logarithm](https://hackmd.io/@coding-ray/2023-ca-hw-1) by [`coding-ray`](https://github.com/coding-ray) (黃柏叡)
GitHub Repository: [`yutingshih/ca2023-hw2`](https://github.com/yutingshih/ca2023-hw2)
Motivation:
<!-- ## Environment Setup -->
## Porting from Ripes to rv32emu
### Modify Makefile for C Code
```shell
# Usage:
# make [all] compile all the targets
# make TARGET compile a specific target
# make TARGET [TARGET [...]] compile specific targets
# make test run tests for all the targets
# make test_TARGET run test for a specific target
# make test_TARGET [test_TARGET [...]] run tests for specific targets
# make clean delete all the executables
#
# Example:
# make (compile all the targets)
# make clean test (delete all the executables, compile and run all the tests)
# make all test_mul_bf16 (compile all the targets but only run test for mul_bf16)
BIN ?= i32_bf16 fp32_bf16 add_sub_bf16 mul_bf16 ln_bf16
CROSS ?= riscv-none-elf-
CC := $(CROSS)gcc
CFLAGS := -Wall -Wextra
LDLIBS := -lm
ifdef CROSS
CFLAGS += -march=rv32i -mabi=ilp32
RUNTIME ?= rv32emu
endif
all: $(BIN)
%: %.c
-$(CC) -D$(shell echo $@ | tr a-z A-Z)_TEST $(CFLAGS) -o $@ $< $(LDLIBS)
test: $(addprefix test_, $(BIN))
test_%: %
-@$(RUNTIME) $<
clean:
-@$(RM) -v $(BIN)
```
### Write Makefile for Assembly Code
Since the assembly code running on Ripes does not need to be compiled, the original reposity does not provide Makefile for assembly code. For the sake of convience, we will first write a Makefile ourselves.
```shell
TARGET ?= add_sub_bf16 i32_bf16 ln_bf16 mul_bf16 mul_shift_u32 mul_sum_u32
BIN := $(addsuffix .elf, $(TARGET))
CROSS := riscv-none-elf-
CC := $(CROSS)gcc
AS := $(CROSS)as
LD := $(CROSS)ld
CFLAGS := -march=rv32i -mabi=ilp32 -ffreestanding
ASFLAGS := -march=rv32i -mabi=ilp32 -R
LDFLAGS := --oformat=elf32-littleriscv -T link.ld
all: $(BIN)
%.elf: %.o syscall.o
$(LD) $(LDFLAGS) -o $@ $^
test: $(BIN)
@for i in $^; do rv32emu $$i; done
test_%: %.elf
@rv32emu $<
clean:
-@$(RM) -v $(BIN)
```
Note that here we use the [implicit rules](https://www.gnu.org/software/make/manual/html_node/Catalogue-of-Rules.html) to assemble the object file from the corresponding assembly code in the makefile.
```shell
%.o: %.s
$(AS) $(ASFLAGS) -o $@ $<
```
### Fix Syntax Errors
- `strnset` is not in the C standard library. Replace it with `memset`, which is defined in `string.h`.
- fix missing `,` in some of the assembly instructions.
### Add Global Directive
- add `.globl main` for the main function of each assembly program to suppress the linking errors.
### Implement System Call
```c=3
/* write system call implementation using RISC-V ecall */
int write(int fd, const void* buf, unsigned nbyte) {
asm(
".equ SYS_WRITE, 64" "\n\t"
"li a7, SYS_WRITE" "\n\t"
"ecall" "\n"
);
return 0;
}
/* exit system call implementation using RISC-V ecall */
int exit(int status) {
asm(
".equ SYS_EXIT, 93" "\n\t"
"li a7, SYS_EXIT" "\n\t"
"ecall" "\n"
);
return 0;
}
```
### Implement Print Library
Printing a character or a string is quite straightforward.
```c=23
/* print a character */
void print_char(char ch) {
write(STDOUT_FILENO, &ch, 1);
}
/* print a null-terminated string */
void print_string(const char* str) {
unsigned len = 0;
while (str[len]) len++;
write(STDOUT_FILENO, str, len);
}
```
Printing an integer is more complicated.
1. count the length of the string
2. convert each digit into a character
3. store the characters from the tail of array, so that the MSD (most significant digit) will be printed first, and the LSD (least significant digit) will be printed last
4. finally pass to the write system call
```c=51
// The max length of a 32-bit signed integer is 11.
// (10 characters for digits, and 1 for the sign character)
#define STACK_SIZE 12
/* print a signed integer */
void print_int(int num) {
int neg = num < 0;
if (neg) num = -num;
char stack[STACK_SIZE], *ptr = stack + STACK_SIZE;
do {
*--ptr = (char) ('0' + _remu(num, 10));
num = _divu(num, 10);
} while (num);
if (neg) *--ptr = '-';
int len = STACK_SIZE - (ptr - stack);
write(STDOUT_FILENO, ptr, len);
}
```
RV32I ISA does not support division and modulus, so we have to implement them by ourselves.
```c=35
/* calculate the quotient of (a0 / a1) */
unsigned _divu(unsigned a0, unsigned a1) {
unsigned q = 0;
while (a0 >= a1) {
a0 -= a1;
q++;
}
return q;
}
/* calculate the remainder of (a0 % a1) */
unsigned _remu(unsigned a0, unsigned a1) {
while (a0 >= a1) a0 -= a1;
return a0;
}
```
Performing division and modulus operations is quite expensive on RV32I platforms when the input number is large due to the loops and repetitive arithmetics. The performance can be enhanced by eliminating the loops, and the detailed will be later discussed [here](#Optimization-by-Simplifying-Division-and-Modulus).
### Testing Results
Run the following tests to ensure the correctness of the programs.
```shell
$ make test_add_sub_bf16 # 0: success, other: failed
0
inferior exit code 0
$ make test_i32_bf16 # 0: success, other: failed
0
inferior exit code 0
$ make test_ln_bf16 # 0: success, other: failed
0
inferior exit code 0
$ make test_mul_bf16 # 0: success, other: failed
0
inferior exit code 0
$ make test_mul_shift_u32 # 1000 * 1000 = 1000000
1000000
inferior exit code 0
$ make test_mul_sum_u32 # 6 + 0 = 6
6
inferior exit code 0
```
## Observation of Compiler Optimization (Handwritten vs Compiler-Generated Assembly)
### Handwritten Assembly
```riscv437=
# --- ln_bf16 ---
# return ln(abs(x))
# input:
# a0: x (bf16): number to transform
# output:
# a0: t (bf16): result of ln(abs(x))
# notes:
# s0: x, t (around last mul_bf16)
# s1: exp
# reference: ln_bf16.c
ln_bf16:
lb_prologue:
addi sp, sp, -12
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
lb_body:
# remove extra bits (otherwise, offset-by-one bug occurs)
li t0, 0xFFFF0000
and a0, a0, t0
# catch zero
bnez a0, lb_nonzero_input
li a0, 0xFF800000
j lb_epilogue
lb_nonzero_input:
mv s0, a0 # s0 = x, for x will be used later
li t0, 0x7F800000
and a0, s0, t0 # a0 = *px & 0x7F800000
srli a0, a0, 23
addi a0, a0, -127 # a0 = (a0 >> 23) - 127
jal i32_to_bf16 # a0 = (bf16) a0
mv s1, a0 # exp = a0
# set x's exponent to 0
li t1, 0x7F0000
li t2, 0x3F800000
and s0, s0, t1
or s0, s0, t2 # x = 0x3F800000 | (*px & 0x7F0000)
# calculate result (t)
mv a0, s0 # a0 = x
li a1, 0x3DE10000 # lnc3 = 0.109
jal mul_bf16 # a0 = lnc3 * x
li a1, 0xBF3B0000 # lnc2 = -0.73
jal add_bf16 # a0 = a0 + lnc2
mv a1, s0 # a1 = x
jal mul_bf16 # a0 = a0 * x
li a1, 0x40070000 # lnc1 = 2.11
jal add_bf16 # a0 = a0 + lnc1
mv a1, s0 # a1 = x
jal mul_bf16 # a0 = a0 * x
li a1, 0xBFBF0000 # lnc0 = -1.49
jal add_bf16 # a0 = a0 + lnc0
mv s0, a0 # s0 = t
li a0, 0x3F310000 # ln2 = 0.69
mv a1, s1 # a1 = exp
jal mul_bf16 # a0 = ln2 * exp
mv a1, s0 # a1 = t
jal add_bf16 # a0 = a0 + t (result)
lb_epilogue:
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
addi sp, sp, 12
ret
```
### -O0 Optimization
```shell
make OPT=0
```
```shell
$ riscv-none-elf-size ln_bf16_0
text data bss dec hex filename
108492 2384 1528 112404 1b714 ln_bf16
```
```shell
$ rv_histogram ln_bf16_0
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
1. addi 31.65% [8191 ] █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
2. lw 13.55% [3508 ] ███████████████████████████████████████████████████████████████████████████████████████████████████████▏
3. sw 11.25% [2912 ] █████████████████████████████████████████████████████████████████████████████████████▋
4. jal 9.11% [2357 ] █████████████████████████████████████████████████████████████████████▎
5. beq 4.34% [1124 ] █████████████████████████████████
6. bne 2.99% [773 ] ██████████████████████▋
7. add 2.57% [665 ] ███████████████████▌
8. andi 2.47% [640 ] ██████████████████▊
9. slli 2.12% [548 ] ████████████████
10. sub 1.80% [467 ] █████████████▋
11. srli 1.72% [446 ] █████████████
12. or 1.72% [445 ] █████████████
13. jalr 1.67% [433 ] ████████████▋
14. bge 1.44% [372 ] ██████████▉
15. lui 1.30% [337 ] █████████▉
16. blt 1.23% [319 ] █████████▍
17. lbu 1.17% [304 ] ████████▉
18. auipc 1.17% [302 ] ████████▉
```
### -Ofast Optimization
```shell
make OPT=fast
```
```shell
$ riscv-none-elf-size ln_bf16_fast
text data bss dec hex filename
106072 2476 1528 110076 1adfc ln_bf16_fast
```
```shell
$ rv_histogram ln_bf16_fast
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
1. addi 31.62% [8000 ] █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
2. lw 12.52% [3168 ] ███████████████████████████████████████████████████████████████████████████████████████████████▍
3. sw 10.65% [2695 ] █████████████████████████████████████████████████████████████████████████████████▏
4. jal 9.06% [2292 ] █████████████████████████████████████████████████████████████████████
5. beq 4.43% [1121 ] █████████████████████████████████▊
6. bne 3.09% [783 ] ███████████████████████▌
7. add 2.66% [673 ] ████████████████████▎
8. andi 2.61% [660 ] ███████████████████▉
9. slli 2.35% [594 ] █████████████████▉
10. sub 1.92% [486 ] ██████████████▋
11. or 1.88% [475 ] ██████████████▎
12. srli 1.83% [464 ] █████████████▉
13. jalr 1.82% [461 ] █████████████▉
14. bge 1.53% [386 ] ███████████▋
15. blt 1.35% [342 ] ██████████▎
16. lui 1.21% [306 ] █████████▏
17. lbu 1.20% [304 ] █████████▏
18. auipc 1.19% [302 ] █████████
19. and 1.03% [260 ] ███████▊
```
### -Os Optimization
```shell
make OPT=s
```
```shell
$ riscv-none-elf-size ln_bf16_s
text data bss dec hex filename
104912 2520 1528 108960 1a9a0 ln_bf16_s
```
```shell
$ rv_histogram ln_bf16_s
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
1. addi 31.87% [7971 ] █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
2. lw 12.69% [3175 ] ███████████████████████████████████████████████████████████████████████████████████████████████▉
3. sw 10.74% [2687 ] █████████████████████████████████████████████████████████████████████████████████▏
4. jal 9.22% [2305 ] █████████████████████████████████████████████████████████████████████▋
5. beq 4.44% [1110 ] █████████████████████████████████▌
6. bne 3.15% [789 ] ███████████████████████▊
7. add 2.66% [666 ] ████████████████████▏
8. andi 2.55% [639 ] ███████████████████▎
9. slli 2.20% [549 ] ████████████████▌
10. sub 1.87% [467 ] ██████████████
11. srli 1.80% [449 ] █████████████▌
12. or 1.78% [445 ] █████████████▍
13. jalr 1.77% [442 ] █████████████▎
14. bge 1.50% [375 ] ███████████▎
15. blt 1.28% [321 ] █████████▋
16. lbu 1.22% [304 ] █████████▏
17. auipc 1.21% [302 ] █████████▏
18. lui 1.11% [277 ] ████████▎
```
### Observation and Analysis
1. The dump files are more than 20000 line of code
2. The instruction counts are more than 20000 instructions
3. The program with `-s` optimization has the smallest code size
## Benchmarking and Optimization
> reference: kdnvt's [Assignment1: RISC-V Assembly and Instruction Pipeline](https://hackmd.io/@kdnvt/2022-arch-hw1)
<!-- > :warning: We care about CSR cycles at the moment.
> Can you improve the assembly code generated by gcc with optimizations? Or, can you write even faster/smaller programs in RISC-V assembly?
> You may drop some function calls and apply techniques such as loop unrolling and peephole optimization. -->
### Benchmarking
Based on the RISC-V speficication: [chapter 10 of Volume 1, Unprivileged Specification version 20191213](https://riscv.org/technical/specifications/), RV32I ISA provides several 64-bit read-only user-level performance counters, including `CYCLE` (cycle count), `TIME` (real-time clock), and `INSTRET` (instructions retired). These performance counters can be read by using `RDCYCLE[H]`, `RDTIME[H]`, and `RDINSTRET[H]` pseudoinstructions respectively. We will use them to benchmark one of our programs [`ln_bf16.c`](https://github.com/yutingshih/ca2023-hw2/blob/main/src/ln_bf16.c).
#### Instructions Retired
The following assembly provided from [`perfcounter/getinstret.S`](https://github.com/sysprog21/rv32emu/blob/master/tests/perfcounter/getinstret.S) of rv32emu reads the `INSTRET` register to get the number of instructions retired when this function is invoked.
```c
.text
.globl get_instret
.align 2
get_instret:
csrr a1, instreth
csrr a0, instret
csrr a2, instreth
bne a1, a2, get_instret
ret
.size get_instret,.-get_instret
```
:::warning
You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution.
:notes: jserv
:::
#### Cycle Count
To get the cycle count of the program's execution, we utilize `get_ticks` from the [`tests/ticks.c`](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c) of rv32emu, which utilizes the `rdcycle` and `rdcycleh` instructions to read the lower and higher 32 bits of `CYCLE` CSR register respectively.
```c=3
uint64_t get_ticks(void) {
uint64_t result;
uint32_t l, h, h2;
asm volatile(
"rdcycleh %0\n"
"rdcycle %1\n"
"rdcycleh %2\n"
"sub %0, %0, %2\n"
"seqz %0, %0\n"
"sub %0, zero, %0\n"
"and %1, %1, %0\n"
: "=r"(h), "=r"(l), "=r"(h2));
result = (((uint64_t) h) << 32) | ((uint64_t) l);
return result;
}
```
#### Experimental Approach
For the sake of convinience, we compile the functions above as a single static library `perfcount.a` with the following Makefile. Note that we use the `Zicsr` extension for accessing the CSR registers.
```shell
LIB = perfcount.a
OBJ = getticks.o getcycles.o getinstret.o
CROSS ?= riscv-none-elf-
CC = $(CROSS)gcc
AS = $(CROSS)as
AR = $(CROSS)ar
CFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall
ASFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32
ARFLAGS = rcs
all: $(LIB)
$(LIB): $(OBJ)
$(AR) $(ARFLAGS) -o $@ $^
clean:
-@$(RM) -v $(LIB) $(OBJ)
```
By inserting the function calls `get_ticks` or `get_instret` into the major part of the calculation we can measure the statistcs we care about.
```diff=183
int main() {
+ uint64_t tick1 = get_ticks();
float average_error = 0, maximal_error = 0;
int error_code = test_ln_bf16(40, &average_error, &maximal_error);
+ uint64_t tick2 = get_ticks();
if (error_code == 0) {
puts("Test for ln_bf16.c passed.");
printf("Average error: %.3f\n", average_error);
printf("Maximal error: %.3f\n", maximal_error);
+ printf("Elapsed cycles: %llu\n", tick2 - tick1);
return 0;
} else {
printf("Test %.2f for ln_bf16.c failed.\n", error_code / 100.0);
return 1;
}
}
```
#### Benchmark Results of Compiler Optimization
| Optimization | Cycle count | Instructions retired |
| ------------------------- | ----------- | -------------------- |
| -O0 (no optimization) | 361846 | 361819 |
| -O1 (optimized for speed) | 329353 | 329326 |
| -O2 (optimized for speed) | 328384 | 328357 |
| -O3 (optimized for speed) | 322345 | 322318 |
| -Os (optimized for size) | 328951 | 328924 |
> TODO:
> - add some description
> - visualize the data
### Optimization by Loop Unrolling
#### Description
The initial algorithm implementation heavily relied on multiplication operations. Due to the lack of multiplication instructions supported by RV32I, the original author, [`coding-ray`](https://github.com/coding-ray) (黃柏叡), replaced them with a loop that executes consecutive addition operations. However, this approach introduced additional jump branch overhead. Therefore, the most straightforward solution to enhance performance is to apply loop unrolling techniques.
#### Result
> TODO
### Optimization by Simplifying Division and Modulus
#### Description
The `print_int` function converts an integer stored in `a0` register into a character array stored in the stack, and then pass it to the `write` system call to print it out. The original implementation of this conversion involves division and modulus operations, which would be miserably expensive when passing a large number due to the loop (If the input integer has N digits, then the time complexity is $O(N)$).
referece: [Modulus without Division, a tutorial](http://homepage.cs.uiowa.edu/~dwjones/bcd/mod.shtml)
#### Result
> TODO
### Optimization Strategy 3
#### Description
#### Result
## To Do
- [ ] optimize `div` and `mod` in `print_int` with `write` system call
- calculate `div` and `mod` at the same time
- implement `div10` and `mod10` in which divisors are both 10
- [ ] optimize by loop unrolling
- [ ] visualization the benchmarking result
- [ ] How to conduct profiling?