# Assignment2: GNU Toolchain contributed by < [HenryChaing](https://github.com/HenryChaing/Computer_Arch_112) > ## Selected Topic This time, the chosen topic is [`吳堉銨`](https://github.com/kk908676/ComputerArchitecture) string encryption and decryption using "count leading zero." The reason for selecting this topic is due to my interest in cybersecurity encryption and decryption. I found that it utilizes a relatively simple symmetric encryption and decryption method. During the process of studying the code, I also identified areas for improvement, which is why I chose it as the subject for this assignment. ## Environment Build The computer's operating system used is Windows 10. Therefore, Virtual Box is employed to set up a virtual machine, and the image file selected is Ubuntu 22.04.5 LTS. The first step is to install the "riscv-none-elf-gcc" package provided by the teacher on GitHub, along with "rv32emu," which is placed in the environment for the RV32 project. ## You should know before modify code In the main program, in order to evaluate the compiled performance, I referenced the files "ticks.c" and "perfcounter." In "ticks.c," it is possible to calculate the number of elapsed cycles using the methods provided internally. "perfcounter" includes two ".s" files that offer the "getcycles()" and "get_instret()" methods. Therefore, the generated ELF file after compilation can execute successfully and accurately print the number of elapsed cycles. #### main.c (includes ticks 、 getcycles) ```c #include <inttypes.h> #include <stdio.h> #include <stdint.h> #include <stdlib.h> // #include "clz.c" // #include "encrypt.c" // #include "decrypt.c" extern uint64_t get_cycles(); extern uint64_t get_instret(); uint16_t count_leading_zeros(uint64_t x); void decrypt(uint64_t *data, uint64_t key); void encrypt(uint64_t *data, uint64_t key); typedef uint64_t ticks; static inline ticks getticks(void) { uint64_t result; uint32_t l, h, h2; asm volatile( "rdcycleh %0\n" "rdcycle %1\n" "rdcycleh %2\n" "sub %0, %0, %2\n" "seqz %0, %0\n" "sub %0, zero, %0\n" "and %1, %1, %0\n" : "=r"(h), "=r"(l), "=r"(h2)); result = (((uint64_t) h) << 32) | ((uint64_t) l); return result; } #define WORDS 12 #define ROUNDS 7 int main() { ticks t0 = getticks(); unsigned int state[WORDS] = {0}; // measure cycles uint64_t instret = get_instret(); uint64_t oldcount = get_cycles(); uint64_t key = 0x0123456789ABCDEF; // Encryption key uint64_t test_data = 0x0000000010101010; // Test data in binary printf("Original Data:\n"); printf("Data: 0x%016lx\n", test_data); /* Encrypt and print encrypted data */ printf("\nEncrypted Data:\n"); encrypt(&test_data, key); //test_data ^= (key << count_leading_zeros(key)); printf("Data: 0x%016lx\n", test_data); /* Decrypt and print decrypted data */ printf("\nDecrypted Data:\n"); decrypt(&test_data, key); //test_data ^= (key << count_leading_zeros(key)); printf("Data: 0x%016lx\n", test_data); uint64_t cyclecount = get_cycles() - oldcount; printf("cycle count: %u\n", (unsigned int) cyclecount); printf("instret: %x\n", (unsigned) (instret & 0xffffffff)); //memset(state, 0, WORDS * sizeof(uint32_t)); int i=0; while(i<1000){ i++; }i=0; int* arr = (int*)malloc(sizeof(int)*200); ticks t1 = getticks(); printf("elapsed cycle: %" PRIu64 "\n", t1 - t0); return 0; } uint16_t count_leading_zeros(uint64_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); x -= ((x >> 1) & 0x5555555555555555); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x += (x >> 8); x += (x >> 16); x += (x >> 32); return (64 - (x & 0x7f)); } void decrypt(uint64_t *data, uint64_t key) { uint16_t leading_zeros = count_leading_zeros(key); *data ^= (key << leading_zeros); //*data ^= (key << count_leading_zeros(key)); } void encrypt(uint64_t *data, uint64_t key) { uint16_t leading_zeros = count_leading_zeros(key); *data ^= (key << leading_zeros); //*data ^= (key << count_leading_zeros(key)); } ``` The Makefile needed to be rewritten. First, we introduced the GNU Toolchain and set aggregate variables (ASFLAGS, CROSS_COMPILE). Finally, we configured the target files and commands for various files to be compiled. There is a dependency relationship between the target and the target file. If the target file is modified, the target also needs to be partially recompiled. Following the dependency relationships are the rules, where each time a target is generated, the associated rule is executed, commonly involving the "gcc" command. #### Makefile ```shell .PHONY: clean include ../../mk/toolchain.mk #-Wall -Wextra ASFLAGS = -march=rv32i_zicsr -mabi=ilp32 -Os CROSS_COMPILE = riscv-none-elf-gcc all: main.elf main.elf: getcycles.o getinstret.o main.o $(CROSS_COMPILE) -o main.elf getcycles.o getinstret.o main.o getcycles.o: getcycles.s $(CROSS_COMPILE) $(ASFLAGS) -c -o getcycles.o getcycles.s getinstret.o: getinstret.s $(CROSS_COMPILE) $(ASFLAGS) -c -o getinstret.o getinstret.s main.o: main.s $(CROSS_COMPILE) $(ASFLAGS) -c -o main.o main.s main.s: main.c $(CROSS_COMPILE) $(ASFLAGS) -S -o main.s main.c clean: $(RM) main.elf main.o ``` ### Linux command #### ls * After running the "ls" command, you can clearly see the files in the current directory, which can be used to observe whether the ELF file has been generated. ![](https://hackmd.io/_uploads/Hy_YHKif6.jpg) #### make clean * The "make clean" command can remove (rm) both "main.elf" and "main.o" files. ![](https://hackmd.io/_uploads/r1OtrKjf6.jpg) #### make * The "make" command can generate the target files we need, including "main.elf" and "main.s." ![](https://hackmd.io/_uploads/rkOtHtjGT.jpg) ## The original C program before improvement. #### main.c (before improvment) ```c #include <inttypes.h> #include <stdio.h> #include <stdint.h> #include <stdlib.h> extern uint64_t get_cycles(); extern uint64_t get_instret(); uint16_t count_leading_zeros(uint64_t x); void decrypt(uint64_t *data, uint64_t key); void encrypt(uint64_t *data, uint64_t key); #define WORDS 12 #define ROUNDS 7 int main() { unsigned int state[WORDS] = {0}; uint64_t key = 0x0123456789ABCDEF; // Encryption key uint64_t test_data = 0x0000000010101010; // Test data in binary printf("Original Data:\n"); printf("Data: 0x%016lx\n", test_data); /* Encrypt and print encrypted data */ printf("\nEncrypted Data:\n"); encrypt(&test_data, key); printf("Data: 0x%016lx\n", test_data); /* Decrypt and print decrypted data */ printf("\nDecrypted Data:\n"); decrypt(&test_data, key); printf("Data: 0x%016lx\n", test_data); return 0; } uint16_t count_leading_zeros(uint64_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); x -= ((x >> 1) & 0x5555555555555555); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x += (x >> 8); x += (x >> 16); x += (x >> 32); return (64 - (x & 0x7f)); } void decrypt(uint64_t *data, uint64_t key) { uint16_t leading_zeros = count_leading_zeros(key); *data ^= (key << leading_zeros); } void encrypt(uint64_t *data, uint64_t key) { uint16_t leading_zeros = count_leading_zeros(key); *data ^= (key << leading_zeros); } ``` To improve program performance and facilitate comparison with the original program, I enhanced the "decrypt" and "encrypt" functions. Since these functions were only called once, I removed them as functions and implemented them directly in the main program. Below is the program result I obtained after this update. #### main.c (after improvment) ```c #include <inttypes.h> #include <stdio.h> #include <stdint.h> #include <stdlib.h> uint16_t count_leading_zeros(uint64_t x); #define WORDS 12 #define ROUNDS 7 int main() { unsigned int state[WORDS] = {0}; uint64_t key = 0x0123456789ABCDEF; // Encryption key uint64_t test_data = 0x0000000010101010; // Test data in binary printf("Original Data:\n"); printf("Data: 0x%016lx\n", test_data); /* Encrypt and print encrypted data */ printf("\nEncrypted Data:\n"); //encrypt test_data ^= (key << count_leading_zeros(key)); printf("Data: 0x%016lx\n", test_data); /* Decrypt and print decrypted data */ printf("\nDecrypted Data:\n"); //decrypt test_data ^= (key << count_leading_zeros(key)); printf("Data: 0x%016lx\n", test_data); return 0; } uint16_t count_leading_zeros(uint64_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); x -= ((x >> 1) & 0x5555555555555555); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x += (x >> 8); x += (x >> 16); x += (x >> 32); return (64 - (x & 0x7f)); } ``` This effectively reduces two function calls, and it eliminates one instance of assigning a value (local variable leading zero) within the function. The performance evaluation below also demonstrates the effectiveness of this approach. ## **Compare Assembly Code** ``` I tried three different compilation options: O0, Ofast, and Osize, to compare the differences in terms of cycles and elf-size. ``` ## -O0 Optimized Assembly Code ### Assembly code (partial of code) ```c .file "main.c" .option nopic .attribute arch, "rv32i2p1_zicsr2p0" .attribute unaligned_access, 0 .attribute stack_align, 16 .text .section .rodata .align 2 .LC1: .string "Original Data:" .align 2 .LC2: .string "Data: 0x%016lx\n" .align 2 .LC3: .string "\nEncrypted Data:" .align 2 .LC4: .string "\nDecrypted Data:" .text .align 2 .globl main .type main, @function main: addi sp,sp,-80 sw ra,76(sp) sw s0,72(sp) addi s0,sp,80 sw zero,-72(s0) sw zero,-68(s0) sw zero,-64(s0) sw zero,-60(s0) sw zero,-56(s0) sw zero,-52(s0) sw zero,-48(s0) sw zero,-44(s0) sw zero,-40(s0) sw zero,-36(s0) sw zero,-32(s0) sw zero,-28(s0) lui a5,%hi(.LC0) lw a4,%lo(.LC0)(a5) lw a5,%lo(.LC0+4)(a5) sw a4,-24(s0) sw a5,-20(s0) li a4,269488128 addi a4,a4,16 li a5,0 sw a4,-80(s0) sw a5,-76(s0) lui a5,%hi(.LC1) addi a0,a5,%lo(.LC1) call puts lw a4,-80(s0) lw a5,-76(s0) mv a2,a4 mv a3,a5 lui a5,%hi(.LC2) addi a0,a5,%lo(.LC2) call printf lui a5,%hi(.LC3) addi a0,a5,%lo(.LC3) call puts addi a5,s0,-80 lw a1,-24(s0) lw a2,-20(s0) mv a0,a5 call encrypt lw a4,-80(s0) lw a5,-76(s0) mv a2,a4 mv a3,a5 lui a5,%hi(.LC2) addi a0,a5,%lo(.LC2) call printf lui a5,%hi(.LC4) addi a0,a5,%lo(.LC4) call puts addi a5,s0,-80 lw a1,-24(s0) lw a2,-20(s0) mv a0,a5 call decrypt lw a4,-80(s0) lw a5,-76(s0) mv a2,a4 mv a3,a5 lui a5,%hi(.LC2) addi a0,a5,%lo(.LC2) call printf li a5,0 mv a0,a5 lw ra,76(sp) lw s0,72(sp) addi sp,sp,80 jr ra .size main, .-main .align 2 .globl count_leading_zeros .type count_leading_zeros, @function ``` ### execute result ``` n96121210@n96121210-VirtualBox:~/rv32emu/tests/Hw2$. /build/rv32emu main.elf Original Data: Data: Ox000000000001ee30 Encrypted Data: Data: Ox0000000000000000 Decrypted Data: Data: 0x0000000000000000 cycle count: 7825 instret: 2ed elapsed cycle: 10625 inferior exit code: 0 n96121210@n96121210-VirtualBox:~/rv32emu/tests/HW2$ riscv-none-elf-size main.elf text data bss dec hex filename 53504 1876 1528 56908 de4c main.elf ``` * Observation: * Line of code : `512` * Allocate `15` bytes on stack * execute output: * cycle count : `7825` * instret : `2ed` * elapsed cycle : `10625` * elf-size output: * program size : `56908` ### After C program optimized ``` n96121210@n96121210-VirtualBox:~/rv32emu/tests/Hw2$. /build/rv32emu main.elf Original Data: Data: Ox000000000001ee30 Encrypted Data: Data: Ox0000000000000000 Decrypted Data: Data: 0x0000000000000000 cycle count: 7595 instret: 265 elapsed cycle: 10395 inferior exit code: 0 n96121210@n96121210-VirtualBox:~/rv32emu/tests/HW2$ riscv-none-elf-size main.elf text data bss dec hex filename 53312 1876 1528 56716 dd8c main.elf ``` * Performance improvement: * cycle count down to : `7595` * intrution retired down to : `2f5` * elapsed cycle down to : `10395` * program size : `56716` ## -Ofast Optimized Assembly Code ### Assembly code (partial of code) ```c .file "main.c" .option nopic .attribute arch, "rv32i2p1_zicsr2p0" .attribute unaligned_access, 0 .attribute stack_align, 16 .text .section .rodata.str1.4,"aMS",@progbits,1 .align 2 .LC0: .string "Original Data:" .align 2 .LC1: .string "Data: 0x%016lx\n" .align 2 .LC2: .string "\nEncrypted Data:" .align 2 .LC4: .string "\nDecrypted Data:" .section .text.startup,"ax",@progbits .align 2 .globl main .type main, @function main: lui a0,%hi(.LC0) addi sp,sp,-16 addi a0,a0,%lo(.LC0) sw ra,12(sp) sw s0,8(sp) call puts lui s0,%hi(.LC1) li a2,269488128 addi a2,a2,16 li a3,0 addi a0,s0,%lo(.LC1) call printf lui a0,%hi(.LC2) addi a0,a0,%lo(.LC2) call puts lui a5,%hi(.LC3) lw a2,%lo(.LC3)(a5) lw a3,%lo(.LC3+4)(a5) addi a0,s0,%lo(.LC1) call printf lui a0,%hi(.LC4) addi a0,a0,%lo(.LC4) call puts li a2,269488128 addi a0,s0,%lo(.LC1) addi a2,a2,16 li a3,0 call printf lw ra,12(sp) lw s0,8(sp) li a0,0 addi sp,sp,16 jr ra .size main, .-main .text .align 2 .globl count_leading_zeros .type count_leading_zeros, @function ``` ### execute result ``` n96121210@n96121210-VirtualBox:~/rv32emu/tests/Hw2$. /build/rv32emu main.elf Original Data: Data: Ox000000000001ee30 Encrypted Data: Data: Ox0000000000000000 Decrypted Data: Data: 0x0000000000000000 cycle count: 7070 instret: 2c8 elapsed cycle: 9822 inferior exit code: 0 n96121210@n96121210-VirtualBox:~/rv32emu/tests/HW2$ riscv-none-elf-size main.elf text data bss dec hex filename 52892 1884 1528 56304 dbf0 main.elf ``` * Observation: * Line of code : `404` * Allocate `2` bytes on stack * execute output: * cycle count : `7070` * instret : `2c8` * elapsed cycle : `9822` * elf-size output: * program size : `56304` ### After C program optimized ``` n96121210@n96121210-VirtualBox:~/rv32emu/tests/Hw2$. /build/rv32emu main.elf Original Data: Data: Ox000000000001ee30 Encrypted Data: Data: Ox0000000000000000 Decrypted Data: Data: 0x0000000000000000 cycle count: 7070 instret: 2cc elapsed cycle: 9819 inferior exit code: 0 n96121210@n96121210-VirtualBox:~/rv32emu/tests/HW2$ riscv-none-elf-size main.elf text data bss dec hex filename 51964 1884 1528 55376 d850 main.elf ``` * Performance improvement: * cycle count down to : `7070` * intrution retired down to : `2cc` * elapsed cycle down to : `9819` * program size : `55376` ## -Os Optimized Assembly Code ### Assembly code (partial of code) ```c .file "main.c" .option nopic .attribute arch, "rv32i2p1_zicsr2p0" .attribute unaligned_access, 0 .attribute stack_align, 16 .text .align 2 .globl count_leading_zeros .type count_leading_zeros, @function count_leading_zeros: slli a4,a1,31 srli a5,a0,1 or a5,a4,a5 srli a4,a1,1 or a1,a4,a1 or a0,a5,a0 slli a4,a1,30 srli a5,a0,2 or a5,a4,a5 srli a2,a1,2 or a2,a2,a1 or a0,a5,a0 slli a4,a2,28 srli a5,a0,4 or a5,a4,a5 srli a3,a2,4 or a4,a5,a0 or a3,a3,a2 slli a2,a3,24 srli a5,a4,8 or a5,a2,a5 srli a2,a3,8 or a2,a2,a3 or a5,a5,a4 srli a3,a5,16 slli a4,a2,16 or a3,a4,a3 srli a4,a2,16 or a4,a4,a2 or a3,a3,a5 or a3,a4,a3 slli a2,a4,31 srli a5,a3,1 or a5,a2,a5 li a2,1431654400 addi a2,a2,1365 srli a1,a4,1 and a5,a5,a2 sub a5,a3,a5 and a2,a1,a2 sgtu a3,a5,a3 sub a4,a4,a2 sub a4,a4,a3 slli a2,a4,30 srli a3,a5,2 or a3,a2,a3 li a2,858992640 addi a2,a2,819 and a3,a3,a2 srli a1,a4,2 and a5,a5,a2 and a1,a1,a2 add a5,a3,a5 and a4,a4,a2 add a4,a1,a4 sltu a3,a5,a3 add a3,a3,a4 slli a2,a3,28 srli a4,a5,4 or a4,a2,a4 add a5,a4,a5 srli a2,a3,4 add a3,a2,a3 sltu a4,a5,a4 add a4,a4,a3 li a3,252645376 addi a3,a3,-241 and a4,a4,a3 and a5,a5,a3 slli a2,a4,24 srli a3,a5,8 or a3,a2,a3 add a5,a3,a5 srli a2,a4,8 add a4,a2,a4 sltu a3,a5,a3 add a3,a3,a4 slli a2,a3,16 srli a4,a5,16 or a4,a2,a4 add a5,a4,a5 srli a2,a3,16 sltu a4,a5,a4 add a3,a2,a3 add a4,a4,a3 add a4,a4,a5 andi a4,a4,127 li a0,64 sub a0,a0,a4 slli a0,a0,16 srli a0,a0,16 ret .size count_leading_zeros, .-count_leading_zeros .globl __ashldi3 .align 2 .globl decrypt .type decrypt, @function ``` ### execute result ``` n96121210@n96121210-VirtualBox:~/rv32emu/tests/Hw2$. /build/rv32emu main.elf Original Data: Data: Ox000000000001ee30 Encrypted Data: Data: Ox0000000000000000 Decrypted Data: Data: 0x0000000000000000 cycle count: 7378 instret: 2ce elapsed cycle: 10127 inferior exit code: 0 n96121210@n96121210-VirtualBox:~/rv32emu/tests/HW2$ riscv-none-elf-size main.elf text data bss dec hex filename 52178 1884 1528 55590 d926 main.elf ``` * Observation: * Line of code : `217` * Allocate `4` bytes on stack * execute output: * cycle count : `7378` * instret : `2ce` * elapsed cycle : `10127` * elf-size output: * program size : `55590` ### After C program optimized ``` n96121210@n96121210-VirtualBox:~/rv32emu/tests/Hw2$. /build/rv32emu main.elf Original Data: Data: Ox000000000001ee30 Encrypted Data: Data: Ox0000000000000000 Decrypted Data: Data: 0x0000000000000000 cycle count: 7070 instret: 2cc elapsed cycle: 9819 inferior exit code: 0 n96121210@n96121210-VirtualBox:~/rv32emu/tests/HW2$ riscv-none-elf-size main.elf text data bss dec hex filename 51964 1884 1528 55376 d850 main.elf ``` * Performance improvement: * cycle count down to : `7070` * intrution retired down to : `2cc` * elapsed cycle down to : `9819` * program size : `55376` ## Summary | Optimization level | original cycles number | modified cycles number | | ------------------ | ---------------- | ---------------- | | -O0 | 7825 | 7595 | | -Ofast | 7070 :o: | 7070 | | -Os | 7378 | 7070 | | Optimization level | original program size | modified program size | | ------------------ | ---------------- | ---------------- | | -O0 | 56908 | 56716 | | -Ofast | 56304 | 55376 | | -Os | 55590 :o: | 55376 | ### Conclusion We can observe that O0 is the most straightforward compilation option, resulting in assembly code that is larger in both space and execution cycles compared to the subsequent Os and Ofast compilation options. In the comparison between Os and Ofast, it is evident that their optimization directions differ. In the optimization of the C program, due to the reduction of function calls, there are indications of optimization in both space and execution time. ## Optimized the Handwriten Code I observed the assembly code generated by riscv-none-elf-gcc using the fastest execution optimization compilation option (Ofast). I noticed that the generated code minimized the number of branches as much as possible. In my hand-written assembly code, I also tried to simplify unnecessary jumps, which led to a successful performance improvement. #### Advanced Assembly code ```c .data str1: .string "Original Data:" str2: .string "\nEncrypted Data:" str3: .string "\nDecrypted Data:" .text .globl main main: li a1, 0x01234567 #key li a2, 0x89ABCDEF #key li a3, 0x00000000 #test li a4, 0x000000aa #test la a0, str1 addi a7, x0, 4 # print "Original Data:\n" ecall mv a0, a3 addi a7, x0, 34 # print "Data " ecall mv a0, a4 addi a7, x0, 34 # print "Data " ecall encrypt: li t0, 0 #leading_zeros mv a5, a1 #copy of high key mv a6, a2 #copy of low key jal count_leading_zeros add t0, a6, a5 li t3, 32 bne a5, t3, else1 j ct1 else1: sub t0, t0, a6 ct1: #data ^= (key << leading_zeros) mv a5, a1 #copy of high key mv a6, a2 #copy of low key sll a5, a5, t0 #left shift high key sll a6, a6, t0 #left shift low key li t1, 32 sub t0, t1, t0 srl t2, a6, t0 or a5, a5, t2 xor a3, a3, a5 xor a4, a4, a6 la a0, str2 addi a7, x0, 4 # print "\nEncrypted Data:" ecall mv a0, a3 addi a7, x0, 34 # print "Data " ecall mv a0, a4 addi a7, x0, 34 # print "Data " ecall decrypt: li t0, 0 #leading_zeros mv a5, a1 #copy of high key mv a6, a2 #copy of low key jal count_leading_zeros add t0, a6, a5 li t3, 32 bne a5, t3, else2 j ct2 else2: sub t0, t0, a6 ct2: #data ^= (key << leading_zeros) mv a5, a1 #copy of high key mv a6, a2 #copy of low key sll a5, a5, t0 #left shift high key sll a6, a6, t0 #left shift low key li t1, 32 sub t0, t1, t0 srl t2, a6, t0 or a5, a5, t2 xor a3, a3, a5 xor a4, a4, a6 la a0, str3 addi a7, x0, 4 # print "\nEncrypted Data:" ecall mv a0, a3 addi a7, x0, 34 # print "Data " ecall mv a0, a4 addi a7, x0, 34 # print "Data " ecall li a7, 10 # return 0 ecall count_leading_zeros: addi sp, sp, -20 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp)#temporary register sw s3, 16(sp)#temporary register mv s0, a5 #high key mv s1, a6 #low key srli s2, s0, 1 # Right shift by 1 or s0, s0, s2 # Bitwise OR srli s2, s1, 1 # Right shift by 1 or s1, s1, s2 # Bitwise OR srli s2, s0, 2 # Right shift by 2 or s0, s0, s2 # Bitwise OR srli s2, s1, 2 # Right shift by 2 or s1, s1, s2 # Bitwise OR srli s2, s0, 4 # Right shift by 4 or s0, s0, s2 # Bitwise OR srli s2, s1, 4 # Right shift by 4 or s1, s1, s2 # Bitwise OR srli s2, s0, 8 # Right shift by 8 or s0, s0, s2 # Bitwise OR srli s2, s1, 8 # Right shift by 8 or s1, s1, s2 # Bitwise OR srli s2, s0, 16 # Right shift by 16 or s0, s0, s2 # Bitwise OR srli s2, s1, 16 # Right shift by 16 or s1, s1, s2 # Bitwise OR #x -= ((x >> 1) & 0x55555555); srli s2, s0, 1 # Right shift by 1 li s3, 0x55555555 and s2, s2, s3 # Apply bitmask 0x55555555 sub s0, s0, s2 # Subtract from x #x -= ((x >> 1) & 0x55555555); srli s2, s1, 1 # Right shift by 1 li s3, 0x55555555 and s2, s2, s3 # Apply bitmask 0x55555555 sub s1, s1, s2 # Subtract from x #x = ((x >> 2) & 0x33333333) + (x & 0x33333333); srli s2, s0, 2 # Right shift by 2 li s3, 0x33333333 and s2, s2, s3 # Apply bitmask 0x33333333 and s3, s3, s0 # Apply bitmask 0x33333333 add s0, s2, s3 # Add the results #x = ((x >> 2) & 0x33333333) + (x & 0x33333333); srli s2, s1, 2 # Right shift by 2 li s3, 0x33333333 and s2, s2, s3 # Apply bitmask 0x33333333 and s3, s3, s1 # Apply bitmask 0x33333333 add s1, s2, s3 # Add the results #x = ((x >> 4) + x) & 0x0f0f0f0f; srli s2, s0, 4 # Right shift by 4 add s0, s0, s2 # Add the result li s3, 0x0f0f0f0f and s0, s0, s3 #x = ((x >> 4) + x) & 0x0f0f0f0f; srli s2, s1, 4 # Right shift by 4 add s1, s1, s2 # Add the result li s3, 0x0f0f0f0f and s1, s1, s3 #x += (x >> 8); srli s2, s0, 8 # Right shift by 8 add s0, s0, s2 # Add the result #x += (x >> 8); srli s2, s1, 8 # Right shift by 8 add s1, s1, s2 # Add the result #x += (x >> 16); srli s2, s0, 16 # Right shift by 16 add s0, s0, s2 # Add the result #x += (x >> 16); srli s2, s1, 16 # Right shift by 16 add s1, s1, s2 # Add the result #return (32 - (x & 0x7f)) andi s2, s0, 0x7f # Apply bitmask 0x7f li s3, 32 # Load immediate value 64 sub s0, s3, s2 # Subtract from 64 #return (32 - (x & 0x7f)) andi s2, s1, 0x7f # Apply bitmask 0x7f li s3, 32 # Load immediate value 64 sub s1, s3, s2 # Subtract from 64 mv a5, s0 #x mv a6, s1 #x lw ra, 0(sp) # restore registers lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) addi sp, sp, 20 jr ra ``` In addition to the necessary jumps to "counting leading zero," I reduced `5` branches in the original code, resulting in the following performance improvement. * Before Performance Improvement ![](https://hackmd.io/_uploads/HkmLNnoza.jpg) :::info **Execution info** **Cycles**: &ensp; 278 **Instrs. retired**: &ensp; 240 **CPI**: &ensp; 1.16 **IPC**: &ensp; 0.863 **Clock rate**: &ensp; 0 hz ::: * After Performance Improvement ![](https://hackmd.io/_uploads/S1QLV2oGT.jpg) :::info **Execution info** **Cycles**: &ensp; 287 **Instrs. retired**: &ensp; 243 **CPI**: &ensp; 1.18 **IPC**: &ensp; 0.847 **Clock rate**: &ensp; 0hz ::: :::warning You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution. :notes: jserv ::: ## Conclusion * <s>The GNU Toolchain and rv32emu allow us to quickly compile C and RISC-V assembly language in a Linux environment. In the future, I will try to use the -Ofast optimization level for compilation because, based on this experiment, I believe it offers the best performance. Additionally, I can optimize my code further using the packages available in riscv-none-elf.</s> :::warning Show me the insights. In particular, the techniques you have learnt. :notes: jserv :::