Assignment2: GNU Toolchain

# Assignment2: GNU Toolchain contributed by < [`yutingshih`](https://github.com/yutingshih) > Topic: [Bfloat16 Logarithm](https://hackmd.io/@coding-ray/2023-ca-hw-1) by [`coding-ray`](https://github.com/coding-ray) (黃柏叡) GitHub Repository: [`yutingshih/ca2023-hw2`](https://github.com/yutingshih/ca2023-hw2) Motivation:  ## Porting from Ripes to rv32emu ### Modify Makefile for C Code ```shell # Usage: # make [all] compile all the targets # make TARGET compile a specific target # make TARGET [TARGET [...]] compile specific targets # make test run tests for all the targets # make test_TARGET run test for a specific target # make test_TARGET [test_TARGET [...]] run tests for specific targets # make clean delete all the executables # # Example: # make (compile all the targets) # make clean test (delete all the executables, compile and run all the tests) # make all test_mul_bf16 (compile all the targets but only run test for mul_bf16) BIN ?= i32_bf16 fp32_bf16 add_sub_bf16 mul_bf16 ln_bf16 CROSS ?= riscv-none-elf- CC := $(CROSS)gcc CFLAGS := -Wall -Wextra LDLIBS := -lm ifdef CROSS CFLAGS += -march=rv32i -mabi=ilp32 RUNTIME ?= rv32emu endif all: $(BIN) %: %.c -$(CC) -D$(shell echo $@ | tr a-z A-Z)_TEST $(CFLAGS) -o $@ $< $(LDLIBS) test: $(addprefix test_, $(BIN)) test_%: % -@$(RUNTIME) $< clean: -@$(RM) -v $(BIN) ``` ### Write Makefile for Assembly Code Since the assembly code running on Ripes does not need to be compiled, the original reposity does not provide Makefile for assembly code. For the sake of convience, we will first write a Makefile ourselves. ```shell TARGET ?= add_sub_bf16 i32_bf16 ln_bf16 mul_bf16 mul_shift_u32 mul_sum_u32 BIN := $(addsuffix .elf, $(TARGET)) CROSS := riscv-none-elf- CC := $(CROSS)gcc AS := $(CROSS)as LD := $(CROSS)ld CFLAGS := -march=rv32i -mabi=ilp32 -ffreestanding ASFLAGS := -march=rv32i -mabi=ilp32 -R LDFLAGS := --oformat=elf32-littleriscv -T link.ld all: $(BIN) %.elf: %.o syscall.o $(LD) $(LDFLAGS) -o $@ $^ test: $(BIN) @for i in $^; do rv32emu $$i; done test_%: %.elf @rv32emu $< clean: -@$(RM) -v $(BIN) ``` Note that here we use the [implicit rules](https://www.gnu.org/software/make/manual/html_node/Catalogue-of-Rules.html) to assemble the object file from the corresponding assembly code in the makefile. ```shell %.o: %.s $(AS) $(ASFLAGS) -o $@ $< ``` ### Fix Syntax Errors - `strnset` is not in the C standard library. Replace it with `memset`, which is defined in `string.h`. - fix missing `,` in some of the assembly instructions. ### Add Global Directive - add `.globl main` for the main function of each assembly program to suppress the linking errors. ### Implement System Call ```c=3 /* write system call implementation using RISC-V ecall */ int write(int fd, const void* buf, unsigned nbyte) { asm( ".equ SYS_WRITE, 64" "\n\t" "li a7, SYS_WRITE" "\n\t" "ecall" "\n" ); return 0; } /* exit system call implementation using RISC-V ecall */ int exit(int status) { asm( ".equ SYS_EXIT, 93" "\n\t" "li a7, SYS_EXIT" "\n\t" "ecall" "\n" ); return 0; } ``` ### Implement Print Library Printing a character or a string is quite straightforward. ```c=23 /* print a character */ void print_char(char ch) { write(STDOUT_FILENO, &ch, 1); } /* print a null-terminated string */ void print_string(const char* str) { unsigned len = 0; while (str[len]) len++; write(STDOUT_FILENO, str, len); } ``` Printing an integer is more complicated. 1. count the length of the string 2. convert each digit into a character 3. store the characters from the tail of array, so that the MSD (most significant digit) will be printed first, and the LSD (least significant digit) will be printed last 4. finally pass to the write system call ```c=51 // The max length of a 32-bit signed integer is 11. // (10 characters for digits, and 1 for the sign character) #define STACK_SIZE 12 /* print a signed integer */ void print_int(int num) { int neg = num < 0; if (neg) num = -num; char stack[STACK_SIZE], *ptr = stack + STACK_SIZE; do { *--ptr = (char) ('0' + _remu(num, 10)); num = _divu(num, 10); } while (num); if (neg) *--ptr = '-'; int len = STACK_SIZE - (ptr - stack); write(STDOUT_FILENO, ptr, len); } ``` RV32I ISA does not support division and modulus, so we have to implement them by ourselves. ```c=35 /* calculate the quotient of (a0 / a1) */ unsigned _divu(unsigned a0, unsigned a1) { unsigned q = 0; while (a0 >= a1) { a0 -= a1; q++; } return q; } /* calculate the remainder of (a0 % a1) */ unsigned _remu(unsigned a0, unsigned a1) { while (a0 >= a1) a0 -= a1; return a0; } ``` Performing division and modulus operations is quite expensive on RV32I platforms when the input number is large due to the loops and repetitive arithmetics. The performance can be enhanced by eliminating the loops, and the detailed will be later discussed [here](#Optimization-by-Simplifying-Division-and-Modulus). ### Testing Results Run the following tests to ensure the correctness of the programs. ```shell $ make test_add_sub_bf16 # 0: success, other: failed 0 inferior exit code 0 $ make test_i32_bf16 # 0: success, other: failed 0 inferior exit code 0 $ make test_ln_bf16 # 0: success, other: failed 0 inferior exit code 0 $ make test_mul_bf16 # 0: success, other: failed 0 inferior exit code 0 $ make test_mul_shift_u32 # 1000 * 1000 = 1000000 1000000 inferior exit code 0 $ make test_mul_sum_u32 # 6 + 0 = 6 6 inferior exit code 0 ``` ## Observation of Compiler Optimization (Handwritten vs Compiler-Generated Assembly) ### Handwritten Assembly ```riscv437= # --- ln_bf16 --- # return ln(abs(x)) # input: # a0: x (bf16): number to transform # output: # a0: t (bf16): result of ln(abs(x)) # notes: # s0: x, t (around last mul_bf16) # s1: exp # reference: ln_bf16.c ln_bf16: lb_prologue: addi sp, sp, -12 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) lb_body: # remove extra bits (otherwise, offset-by-one bug occurs) li t0, 0xFFFF0000 and a0, a0, t0 # catch zero bnez a0, lb_nonzero_input li a0, 0xFF800000 j lb_epilogue lb_nonzero_input: mv s0, a0 # s0 = x, for x will be used later li t0, 0x7F800000 and a0, s0, t0 # a0 = *px & 0x7F800000 srli a0, a0, 23 addi a0, a0, -127 # a0 = (a0 >> 23) - 127 jal i32_to_bf16 # a0 = (bf16) a0 mv s1, a0 # exp = a0 # set x's exponent to 0 li t1, 0x7F0000 li t2, 0x3F800000 and s0, s0, t1 or s0, s0, t2 # x = 0x3F800000 | (*px & 0x7F0000) # calculate result (t) mv a0, s0 # a0 = x li a1, 0x3DE10000 # lnc3 = 0.109 jal mul_bf16 # a0 = lnc3 * x li a1, 0xBF3B0000 # lnc2 = -0.73 jal add_bf16 # a0 = a0 + lnc2 mv a1, s0 # a1 = x jal mul_bf16 # a0 = a0 * x li a1, 0x40070000 # lnc1 = 2.11 jal add_bf16 # a0 = a0 + lnc1 mv a1, s0 # a1 = x jal mul_bf16 # a0 = a0 * x li a1, 0xBFBF0000 # lnc0 = -1.49 jal add_bf16 # a0 = a0 + lnc0 mv s0, a0 # s0 = t li a0, 0x3F310000 # ln2 = 0.69 mv a1, s1 # a1 = exp jal mul_bf16 # a0 = ln2 * exp mv a1, s0 # a1 = t jal add_bf16 # a0 = a0 + t (result) lb_epilogue: lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) addi sp, sp, 12 ret ``` ### -O0 Optimization ```shell make OPT=0 ``` ```shell $ riscv-none-elf-size ln_bf16_0 text data bss dec hex filename 108492 2384 1528 112404 1b714 ln_bf16 ``` ```shell $ rv_histogram ln_bf16_0 +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. addi 31.65% [8191 ] █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 2. lw 13.55% [3508 ] ███████████████████████████████████████████████████████████████████████████████████████████████████████▏ 3. sw 11.25% [2912 ] █████████████████████████████████████████████████████████████████████████████████████▋ 4. jal 9.11% [2357 ] █████████████████████████████████████████████████████████████████████▎ 5. beq 4.34% [1124 ] █████████████████████████████████ 6. bne 2.99% [773 ] ██████████████████████▋ 7. add 2.57% [665 ] ███████████████████▌ 8. andi 2.47% [640 ] ██████████████████▊ 9. slli 2.12% [548 ] ████████████████ 10. sub 1.80% [467 ] █████████████▋ 11. srli 1.72% [446 ] █████████████ 12. or 1.72% [445 ] █████████████ 13. jalr 1.67% [433 ] ████████████▋ 14. bge 1.44% [372 ] ██████████▉ 15. lui 1.30% [337 ] █████████▉ 16. blt 1.23% [319 ] █████████▍ 17. lbu 1.17% [304 ] ████████▉ 18. auipc 1.17% [302 ] ████████▉ ``` ### -Ofast Optimization ```shell make OPT=fast ``` ```shell $ riscv-none-elf-size ln_bf16_fast text data bss dec hex filename 106072 2476 1528 110076 1adfc ln_bf16_fast ``` ```shell $ rv_histogram ln_bf16_fast +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. addi 31.62% [8000 ] █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 2. lw 12.52% [3168 ] ███████████████████████████████████████████████████████████████████████████████████████████████▍ 3. sw 10.65% [2695 ] █████████████████████████████████████████████████████████████████████████████████▏ 4. jal 9.06% [2292 ] █████████████████████████████████████████████████████████████████████ 5. beq 4.43% [1121 ] █████████████████████████████████▊ 6. bne 3.09% [783 ] ███████████████████████▌ 7. add 2.66% [673 ] ████████████████████▎ 8. andi 2.61% [660 ] ███████████████████▉ 9. slli 2.35% [594 ] █████████████████▉ 10. sub 1.92% [486 ] ██████████████▋ 11. or 1.88% [475 ] ██████████████▎ 12. srli 1.83% [464 ] █████████████▉ 13. jalr 1.82% [461 ] █████████████▉ 14. bge 1.53% [386 ] ███████████▋ 15. blt 1.35% [342 ] ██████████▎ 16. lui 1.21% [306 ] █████████▏ 17. lbu 1.20% [304 ] █████████▏ 18. auipc 1.19% [302 ] █████████ 19. and 1.03% [260 ] ███████▊ ``` ### -Os Optimization ```shell make OPT=s ``` ```shell $ riscv-none-elf-size ln_bf16_s text data bss dec hex filename 104912 2520 1528 108960 1a9a0 ln_bf16_s ``` ```shell $ rv_histogram ln_bf16_s +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. addi 31.87% [7971 ] █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 2. lw 12.69% [3175 ] ███████████████████████████████████████████████████████████████████████████████████████████████▉ 3. sw 10.74% [2687 ] █████████████████████████████████████████████████████████████████████████████████▏ 4. jal 9.22% [2305 ] █████████████████████████████████████████████████████████████████████▋ 5. beq 4.44% [1110 ] █████████████████████████████████▌ 6. bne 3.15% [789 ] ███████████████████████▊ 7. add 2.66% [666 ] ████████████████████▏ 8. andi 2.55% [639 ] ███████████████████▎ 9. slli 2.20% [549 ] ████████████████▌ 10. sub 1.87% [467 ] ██████████████ 11. srli 1.80% [449 ] █████████████▌ 12. or 1.78% [445 ] █████████████▍ 13. jalr 1.77% [442 ] █████████████▎ 14. bge 1.50% [375 ] ███████████▎ 15. blt 1.28% [321 ] █████████▋ 16. lbu 1.22% [304 ] █████████▏ 17. auipc 1.21% [302 ] █████████▏ 18. lui 1.11% [277 ] ████████▎ ``` ### Observation and Analysis 1. The dump files are more than 20000 line of code 2. The instruction counts are more than 20000 instructions 3. The program with `-s` optimization has the smallest code size ## Benchmarking and Optimization > reference: kdnvt's [Assignment1: RISC-V Assembly and Instruction Pipeline](https://hackmd.io/@kdnvt/2022-arch-hw1)  ### Benchmarking Based on the RISC-V speficication: [chapter 10 of Volume 1, Unprivileged Specification version 20191213](https://riscv.org/technical/specifications/), RV32I ISA provides several 64-bit read-only user-level performance counters, including `CYCLE` (cycle count), `TIME` (real-time clock), and `INSTRET` (instructions retired). These performance counters can be read by using `RDCYCLE[H]`, `RDTIME[H]`, and `RDINSTRET[H]` pseudoinstructions respectively. We will use them to benchmark one of our programs [`ln_bf16.c`](https://github.com/yutingshih/ca2023-hw2/blob/main/src/ln_bf16.c). #### Instructions Retired The following assembly provided from [`perfcounter/getinstret.S`](https://github.com/sysprog21/rv32emu/blob/master/tests/perfcounter/getinstret.S) of rv32emu reads the `INSTRET` register to get the number of instructions retired when this function is invoked. ```c .text .globl get_instret .align 2 get_instret: csrr a1, instreth csrr a0, instret csrr a2, instreth bne a1, a2, get_instret ret .size get_instret,.-get_instret ``` :::warning You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution. :notes: jserv ::: #### Cycle Count To get the cycle count of the program's execution, we utilize `get_ticks` from the [`tests/ticks.c`](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c) of rv32emu, which utilizes the `rdcycle` and `rdcycleh` instructions to read the lower and higher 32 bits of `CYCLE` CSR register respectively. ```c=3 uint64_t get_ticks(void) { uint64_t result; uint32_t l, h, h2; asm volatile( "rdcycleh %0\n" "rdcycle %1\n" "rdcycleh %2\n" "sub %0, %0, %2\n" "seqz %0, %0\n" "sub %0, zero, %0\n" "and %1, %1, %0\n" : "=r"(h), "=r"(l), "=r"(h2)); result = (((uint64_t) h) << 32) | ((uint64_t) l); return result; } ``` #### Experimental Approach For the sake of convinience, we compile the functions above as a single static library `perfcount.a` with the following Makefile. Note that we use the `Zicsr` extension for accessing the CSR registers. ```shell LIB = perfcount.a OBJ = getticks.o getcycles.o getinstret.o CROSS ?= riscv-none-elf- CC = $(CROSS)gcc AS = $(CROSS)as AR = $(CROSS)ar CFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32 -Wall ASFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32 ARFLAGS = rcs all: $(LIB) $(LIB): $(OBJ) $(AR) $(ARFLAGS) -o $@ $^ clean: -@$(RM) -v $(LIB) $(OBJ) ``` By inserting the function calls `get_ticks` or `get_instret` into the major part of the calculation we can measure the statistcs we care about. ```diff=183 int main() { + uint64_t tick1 = get_ticks(); float average_error = 0, maximal_error = 0; int error_code = test_ln_bf16(40, &average_error, &maximal_error); + uint64_t tick2 = get_ticks(); if (error_code == 0) { puts("Test for ln_bf16.c passed."); printf("Average error: %.3f\n", average_error); printf("Maximal error: %.3f\n", maximal_error); + printf("Elapsed cycles: %llu\n", tick2 - tick1); return 0; } else { printf("Test %.2f for ln_bf16.c failed.\n", error_code / 100.0); return 1; } } ``` #### Benchmark Results of Compiler Optimization | Optimization | Cycle count | Instructions retired | | ------------------------- | ----------- | -------------------- | | -O0 (no optimization) | 361846 | 361819 | | -O1 (optimized for speed) | 329353 | 329326 | | -O2 (optimized for speed) | 328384 | 328357 | | -O3 (optimized for speed) | 322345 | 322318 | | -Os (optimized for size) | 328951 | 328924 | > TODO: > - add some description > - visualize the data ### Optimization by Loop Unrolling #### Description The initial algorithm implementation heavily relied on multiplication operations. Due to the lack of multiplication instructions supported by RV32I, the original author, [`coding-ray`](https://github.com/coding-ray) (黃柏叡), replaced them with a loop that executes consecutive addition operations. However, this approach introduced additional jump branch overhead. Therefore, the most straightforward solution to enhance performance is to apply loop unrolling techniques. #### Result > TODO ### Optimization by Simplifying Division and Modulus #### Description The `print_int` function converts an integer stored in `a0` register into a character array stored in the stack, and then pass it to the `write` system call to print it out. The original implementation of this conversion involves division and modulus operations, which would be miserably expensive when passing a large number due to the loop (If the input integer has N digits, then the time complexity is $O(N)$). referece: [Modulus without Division, a tutorial](http://homepage.cs.uiowa.edu/~dwjones/bcd/mod.shtml) #### Result > TODO ### Optimization Strategy 3 #### Description #### Result ## To Do - [ ] optimize `div` and `mod` in `print_int` with `write` system call - calculate `div` and `mod` at the same time - implement `div10` and `mod10` in which divisors are both 10 - [ ] optimize by loop unrolling - [ ] visualization the benchmarking result - [ ] How to conduct profiling?