# Assignment2: RISC-V Toolchain > contributed by [kk908676](https://github.com/kk908676) ## Question Selection I refer to the **Find First String of 1-bits of a Given Length by CLZ** from 林柏全 The following is the source code: ```c #include <stdint.h> #include <stdio.h> uint16_t count_leading_zeros(uint64_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); x -= ((x >> 1) & 0x5555555555555555); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x += (x >> 8); x += (x >> 16); x += (x >> 32); return (64 - (x & 0x7f)); } int find_string(uint64_t x, int n){ int clz; // leading zeros of x int pos = 0; // position of fist '1' bit from significant bit while(x != 0){ clz = count_leading_zeros(x); x = x << clz; pos = pos + clz; clz = count_leading_zeros(~x); if (clz >= n) return pos; x = x << clz; pos = pos + clz; } return -1; } int main() { uint64_t test_data[] = {0x0f00000000000000, 0x0000000000000000, 0x0123456789abcdef}; for (int i = 0; i < 3; i++) { uint64_t x = test_data[i]; int n = 4; int result = find_string(x, n); printf("Test Case %d: Input: 0x%016lx, Result: %d\n", i+1, x, result); } return 0; } ``` ## Motivation The importance of "Find First String of 1-bits of a Given Length by CLZ" is primarily manifested in its efficiency and relevance in bitwise operations, particularly in programming contexts where performance is emphasized. This includes optimizing algorithms, performance improvement, and applications in bit-level security and cryptography. ## Compile and excecute the code First, we need to write a Makefile that suits our needs. ```c .PHONY: clean include /home/an/rv32emu/mk/toolchain.mk ASFLAGS = -march=rv32i_zicsr -mabi=ilp32 SOURCE = main.o initial.o getcycles.o VAR = -O0 CC = riscv-none-elf-gcc %.s: %.c $(CC) $(ASFLAGS) $(VAR) -s -o $@ $< %.o: %.s $(CC) $(ASFLAGS) -c -o $@ $< all: main.elf main.elf: $(SOURCE) $(CC) -o $@ $(SOURCE) clean: $(RM) main.elf main.o initial.o getcycles.o ``` Next, we execute the following instructions ```shell $ ~/rv32emu/build/rv32emu main.elf ``` ### Here, we obtain the O0 version <s> ![](https://hackmd.io/_uploads/HkxAy7LzT.png) </s> :::warning :warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text. :notes: jserv ::: Display the main.elf header ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100c2 Start of program headers: 52 (bytes into file) Start of section headers: 69400 (bytes into file) Flags: 0x1, RVC, soft-float ABI Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` Program size ```c text data bss dec hex filename 52590 1876 1528 55994 daba main.elf ``` ## O1 optimization cycle count <s> ![](https://hackmd.io/_uploads/SkjF_XUGT.png) </s> :::warning :warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text. :notes: jserv ::: Display the main.elf header ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100c2 Start of program headers: 52 (bytes into file) Start of section headers: 69408 (bytes into file) Flags: 0x1, RVC, soft-float ABI Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` Program size ```c text data bss dec hex filename 51812 1884 1528 55224 d7b8 main.elf ``` * It can be observed that both the cycle count and text size have decreased. ## O2 optimization cycle count ![](https://hackmd.io/_uploads/HyJVc7Izp.png) main.elf header ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x1016e Start of program headers: 52 (bytes into file) Start of section headers: 69408 (bytes into file) Flags: 0x1, RVC, soft-float ABI Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` Program size ```c text data bss dec hex filename 51834 1884 1528 55246 d7ce main.elf ``` * O2 didn't perform as well as expected; overall, it didn't outperform O1. ## O3 optimization cycle count ![](https://hackmd.io/_uploads/r1_-sX8MT.png) main.elf header ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x1016e Start of program headers: 52 (bytes into file) Start of section headers: 69408 (bytes into file) Flags: 0x1, RVC, soft-float ABI Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` Program size ```c text data bss dec hex filename 52380 1884 1528 55792 d9f0 main.elf ``` * O3 is similar to O2. ## Modified the original Assembly on the rv32emu First,I need to rewrite the code to fit the format of rv32emu.Because integer output is not supported, I attempted to modify the system call to output the cycle count. In this code, I have added... `.set SYSPRINTFCYCLE, 32` And added it to syscall.c ```shell static void syscall_printfcycle(riscv_t *rv) { uint32_t buffer = rv_get_reg(rv, rv_reg_a3); fprintf(stdout, "Cycle count: %d\n", buffer); rv_set_reg(rv, rv_reg_a3, 0); } ``` ```shell an@an-Standard-PC-i440FX-PIIX-1996:~/Desktop/Hw2_asm$ ~/rv32emu/build/rv32emu main.elf Cycle count: 348 inferior exit code 0 ``` ## Assembly optimization Later, I attempted to make changes to the overflow handling to reduce the cycle count. Eventually, I found that the overflow check here was unnecessary for this program and proceeded to remove it. I made reductions to these aspects: ```c blt t1, t3, 16 # if underflow then jump beq zero, zero, 12 addi t0, t0, -1 # underflow at lower bits sub t0, t0, t2 #[t0, t1] => x -= ((x>>1) & 0x5555555555555555) or t4, t1, zero xor t4, s4, t4 # nor t5, t1, zero (t4 = ~(s4 | zero)) bgeu t4, t3, 8 # if no overflow then jump addi t0, t0, 1 # if overflow upper bits plus 1 or t4, t1, zero xor t4, s4, t4 # nor t5, t1, zero (t4 = ~(s4 | zero)) bgeu t4, t3, 8 # if no overflow then jump addi t0, t0, 1 # if overflow upper bits plus 1 or t4, t1, zero xor t4, s4, t4 bgeu t4, t3, 8 addi t0, t0, 1 or t4, t1, zero xor t4, s4, t4 bgeu t4, t3, 8 addi t0, t0, 1 or t4, t1, zero xor t4, s4, t4 bgeu t4, t3, 8 addi t0, t0, 1 ``` This is the cycle count obtained after the reductions ```shell $ ~/rv32emu/build/rv32emu main.elf Cycle count: 314 inferior exit code 0 ``` Next, I want to try reducing the text size. * Before ```c # x |= (x>>1); srli t3, t1, 1 # shift lower bits of test data right with 1 bit slli t2, t0, 31 # shift upper bits of test data left with 31 bits or t3, t2, t3 # combine to get new lower bits of test data srli t2, t0, 1 # shift upper bound of test data right with 1 bit or t0, t0, t2 # [0~31]x | [0~31](x >> 1) or t1, t1, t3 # [32~63]x | [32~63](x >> 1) # x |= (x>>2); srli t3, t1, 2 slli t2, t0, 30 or t3, t2, t3 srli t2, t0, 2 or t0, t0, t2 or t1, t1, t3 # x |= (x>>4); srli t3, t1, 4 slli t2, t0, 28 or t3, t2, t3 srli t2, t0, 4 or t0, t0, t2 or t1, t1, t3 # x |= (x>>8); srli t3, t1, 8 slli t2, t0, 24 or t3, t2, t3 srli t2, t0, 8 or t0, t0, t2 or t1, t1, t3 # x |= (x>>16); srli t3, t1, 16 slli t2, t0, 16 or t3, t2, t3 srli t2, t0, 16 or t0, t0, t2 or t1, t1, t3 ``` * After ```c li a5, 1 li a6, 32 loop: sub a7, a6, a5 srl t3, t1, a5 # shift lower bits of test data right with n bit sll t2, t0, a7 # shift upper bits of test data left with 31-n bits or t3, t2, t3 # combine to get new lower bits of test data srl t2, t0, a5 # shift upper bound of test data right with n bit or t0, t0, t2 # [0~31]x | [0~31](x >> n) or t1, t1, t3 # [32~63]x | [32~63](x >> n) slli a5, a5, 1 beq a5, a6, CE j loop ``` Similarly, we execute it in the terminal to obtain the result. ```c text data bss dec hex filename 640 0 0 640 280 main.elf ``` ## Compare to the gcc compile ### analyze | target | cycle| size | | ------ | ---- | ---- | | O0 | 8918 | 52590 | | O1 | 6796 | 51812 | | O2 | 6788 | 51834 | | O3 | 6883 | 52380 | | assembly | 314 | 640 | :::warning You shall explain why C implementation is significantly slower. :notes: jserv :::