# Assignment 2: Complete Applications

## My bfloat16 implementations

Github repo link: https://github.com/kyl6092/ca2025_assignment2_rv32emu

### - Introduction

In my last assignment, I implemented several bfloat16 operations in assembly and evaluated them through the Ripes simulator. A key feature is that the assembly code can be configured for either IEEE 754 single-precision floating point (f32) or bfloat16 (bf16): by setting specific arguments (a3~a6), we retrieve the corresponding results. For example, a3 is the shifting offset of the mantissa mask and a4 is the shifting offset of the exponent mask, while a5 holds the shifting offset of the sign mask and a6 is a parameter for multiplication.

In the last repository, I accidentally used the "mul" instruction, which belongs to the M extension. I have now removed it and written a new procedure called "my_mul" for simulating applications on the rv32emu bare-metal system. The optimization of "my_mul" is also shown in the following sections.

### - Initial Deployment

To deploy my bfloat16 assembly on the rv32emu system, I changed the layout of the assembly code. Specifically, I removed the main label and the test suites from the code and wrote the corresponding test functions in a C program, namely the "main.c" file. After that, I built the application with a modified Makefile and the proper setup (from [Lab2: RISC-V](https://hackmd.io/@sysprog/Sko2Ja5pel)). My main.c is modified from the instructor's chacha test suite, which is located in the playground folder; it already contains some bfloat16 functions implemented in C, so I can compare the performance of the compiled code against my assembly code. To clearly express my work, the modifications are explicitly listed below:

- First, I modified the "Makefile" to add my bfloat16 implementation to the rule list.
- Second, I added the function interfaces to main.c so the linker knows where to find the implementations. For example (from commit [b40d254](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/b40d254d772da9f30db392c51f9e5964529812cc)):

```c
extern uint16_t f32_to_bf16(const uint32_t in);
extern uint32_t bf16_to_f32(const uint16_t in);
extern uint32_t my_add(
    const uint32_t in1,
    const uint32_t in2,
    const uint32_t reserv,
    const uint32_t mant_offset,
    const uint32_t exp_offset,
    const uint32_t sign_offset
);
...
```

- Last, I utilized the get_cycles() and get_instret() helpers described in ```system/perfcounter.S``` to analyze the improvement. (commit [68a8d5f](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/68a8d5f0c944e6d9fb9efc414c18ed3649f714de))

Here is the CSR cycle performance of some operations implemented in C:

```bash
bf16_add PASSED
Cycles: 432 Instructions: 432
bf16_sub PASSED
Cycles: 373 Instructions: 373
bf16_mul PASSED
Cycles: 464 Instructions: 464
bf16_div PASSED
Cycles: 624 Instructions: 624
bf16_sqrt PASSED
Cycles: 1586 Instructions: 1586
```

Next, the performance of my initial bfloat16 implementations is listed below. (Note that I accidentally used M-extension instructions in the last assignment, so I first implemented a simple version of 32-bit multiplication.)

```bash
bf16_add PASSED
Cycles: 166 Instructions: 166
bf16_sub PASSED
Cycles: 190 Instructions: 190
bf16_mul PASSED
Cycles: 736 Instructions: 736
bf16_div PASSED
Cycles: 271 Instructions: 271
bf16_sqrt PASSED
Cycles: 3635 Instructions: 3635
```

It is noteworthy that the cycle counts of addition, subtraction, and division improve by -61.5%, -49%, and -56.5% relative to the original C implementations.
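For reference, these numbers were collected by bracketing each call with the counters, roughly like this minimal sketch (the helper signatures are my assumption about ```system/perfcounter.S```; the actual harness lives in main.c):

```c
#include <stdint.h>

/* Assumed signatures for the helpers in system/perfcounter.S. */
extern uint64_t get_cycles(void);
extern uint64_t get_instret(void);
extern uint16_t f32_to_bf16(const uint32_t in);

static void bench_f32_to_bf16(void)
{
    uint64_t c0 = get_cycles();
    uint64_t i0 = get_instret();
    volatile uint16_t out = f32_to_bf16(0x40490FDBu); /* bit pattern of (float) pi */
    uint64_t cycles = get_cycles() - c0;
    uint64_t insns  = get_instret() - i0;
    (void) out; (void) cycles; (void) insns; /* reported by the test harness */
}
```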
However, performance degrades for multiplication and for the square root operation, which involves multiplications. My multiplication assembly followed the naive rule of accumulating the product by repeatedly adding the multiplicand to a register, with the iteration count given by the multiplier. This costs a large number of cycles as the multiplier increases. The code was written as:

```riscv=
# === my_mul ===
.globl my_mul
.type my_mul,%function
my_mul:
    # a0 out (in1)
    # a1 in2
    beq a0, zero, my_mul_zero
    beq a1, zero, my_mul_zero
    add x29, x0, a0
    addi a0, x0, 0
    addi x28, x0, 0
my_mul_loop:
    add a0, a0, x29
    addi x28, x28, 1
    bne x28, a1, my_mul_loop
my_mul_ret:
    ret
my_mul_zero:
    addi a0, x0, 0
    ret
.size my_mul,.-my_mul
```

Therefore, we need a better multiplication algorithm to optimize cycles.

### - The First Modification

We can use the shift-and-add method to implement an efficient multiplication. The point is that RV32I instructions alone are enough to achieve this, without any extensions. The algorithm is:

1. Let the return register be zero.
2. Check the LSB of the multiplier, just like manually performing a polynomial multiplication. (Checking the LSB can be modeled by ```& 0x1```.)
3. If the LSB = 1, add the multiplicand to the return register.
4. Shift the multiplicand left by one position.
5. Shift the multiplier right by one position. (This moves the next bit to be checked into the LSB position.)
6. If the multiplier $\neq$ 0, go to step 2; otherwise return the result register.

The assembly code is written as (from commit [34f7495](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/34f749572e2086657c0d07c92fe28f00b32b97df)):

```riscv=
# === my_mul ===
.globl my_mul
.type my_mul,%function
my_mul:
    # a0 out (in1)
    # a1 in2
    add x29, x0, a0
    beq a0, zero, my_mul_ret
    addi a0, x0, 0
    beq a1, zero, my_mul_ret
my_mul_loop:
    andi x28, a1, 1
    beq x28, zero, my_mul_loop_1
    add a0, a0, x29
my_mul_loop_1:
    slli x29, x29, 1
    srli a1, a1, 1
    bne a1, zero, my_mul_loop
my_mul_ret:
    ret
.size my_mul,.-my_mul
```

```bash
bf16_mul PASSED
Cycles: 226 Instructions: 226
bf16_sqrt PASSED
Cycles: 539 Instructions: 539
```

This gives much better cycle counts: compared to the C implementations, bf16_mul and bf16_sqrt in this version reduce cycles by 51.2% and 66%, respectively.

### - The Second Modification

Afterwards, I ran into the problem that we still cannot perform a full 32-bit by 32-bit multiplication in RV32I: the upper half of the product is lost. In particular, when I wanted to optimize other routines such as the reciprocal square root, the current my_mul assembly could not compute the right answer. (We also want to limit code size, so we will not rely on the mul32 generated by the C compiler.)

In this case, I borrowed the concept of the "carry-save adder" from VLSI design. Since we have spare 32-bit registers in our RISC-V architecture, each shift and addition can involve additional registers. That is, the additional registers record the overflow-bit information, which is accumulated into a second output word returned through "a2". Two 32-bit registers then compose one 64-bit result, which tackles the problem described at the beginning.
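Before showing the assembly, here is a C model of what the extended my_mul computes, shift-and-add with explicit carry tracking (an illustrative sketch; the helper name mul32x32_ref is mine, not from the repo):

```c
#include <stdint.h>

/* Shift-and-add 32x32 -> 64-bit multiply kept in two 32-bit words,
 * mirroring the assembly's register pair (low word in a0, high word via a2). */
static uint32_t mul32x32_ref(uint32_t a, uint32_t b, uint32_t *hi)
{
    uint32_t lo = 0, h = 0;
    uint32_t a_lo = a, a_hi = 0;           /* 64-bit shifted multiplicand */
    while (b) {
        if (b & 1) {
            uint32_t sum = lo + a_lo;
            h += a_hi + (sum < lo);        /* (sum < lo) is the carry-out */
            lo = sum;
        }
        a_hi = (a_hi << 1) | (a_lo >> 31); /* shift the multiplicand left */
        a_lo <<= 1;
        b >>= 1;
    }
    *hi = h;
    return lo;
}
```

The pair (lo, h) corresponds to the (a0, a2) outputs of the assembly below.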
The modified version is written as (from commit [b0cda41](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/b0cda417742dbded938a5d9e948ccfba3c0fb2f8)):

```riscv
# === my_mul ===
.globl my_mul
.type my_mul,%function
my_mul:
    # a0 out1 (in1)
    # a1 in2
    # a2 (out2): pointer for the high 32 bits of the product
    add x29, x0, a0        # x29 = low word of the shifted multiplicand
    beq a0, zero, my_mul_ret
    addi a0, x0, 0         # a0 = low word of the product
    addi t0, x0, 0         # t0 = high word of the product
    addi x19, x0, 0        # x19 = high word of the shifted multiplicand
    addi x18, x0, 0        # x18 = carry saved from the previous shift/add
    beq a1, zero, my_mul_ret
my_mul_loop:
    slli x19, x19, 1
    add x19, x19, x18      # absorb the bit shifted out of the low word
    andi x28, a1, 1
    beq x28, zero, my_mul_loop_1
    add x28, a0, x29
    sltu x18, x28, a0      # carry out of the low-word addition
    add t0, t0, x18
    add t0, t0, x19
    add a0, x0, x28
my_mul_loop_1:
    slli x28, x29, 1
    sltu x18, x28, x29     # save the bit shifted out of the low word
    add x29, x0, x28
    srli a1, a1, 1
    bne a1, zero, my_mul_loop
my_mul_ret:
    sw t0, 0(a2)           # store the high word through a2
    ret
.size my_mul,.-my_mul
```

```bash
bf16_mul PASSED
Cycles: 270 Instructions: 270
bf16_sqrt PASSED
Cycles: 899 Instructions: 899
```

Although the cycle counts for bfloat16 multiplication and square root increase slightly, reusing my_mul in the reciprocal square root yields the dominant improvement there; the detailed outcomes are shown in a later section. In summary, the functionality of my_mul is extended at a slight computational overhead.

## Tower of Hanoi

### - Introduction

The Tower of Hanoi in the instructor's solution maintains a Gray code to trace the Hamiltonian path. The original assembly targets the Ripes simulator; I made the code run on rv32emu and it successfully produces the correct answer. Besides, based on the memory layout I learned in class, I performed some modifications to reduce cycles and instructions. **Last, I also found a bug (or a phenomenon) that should probably be reported to the rv32emu GitHub repo, but at this moment I am not sure how to present a proper solution. The related description is in the Second Modification.**

### - Initial Deployment

To make rv32emu simulate the assembly code successfully, I need to tackle the system calls. Since system calls are defined differently in the Ripes simulator and in rv32emu, the printing assembly has to be modified first. The original printing code is:

```riscv
la x10, str1
addi x17, x0, 4
ecall
addi x10, x9, 1
addi x17, x0, 1
ecall
la x10, str2
addi x17, x0, 4
ecall
addi x10, x11, 0
addi x17, x0, 11
ecall
la x10, str3
addi x17, x0, 4
ecall
addi x10, x12, 0
addi x17, x0, 11
ecall
addi x10, x0, 10
addi x17, x0, 11
ecall
```

In rv32emu, a system call is described by the a0, a1, a2, and a7 registers: I set a0 to 0x1 (the stdout file descriptor) and a7 to 0x40 (the "write" syscall number), while a1 and a2 hold the buffer pointer and the byte count. Therefore, the code is modified as follows (referenced from commit [b9f77f2](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/b9f77f2bb7fcb81e41a7b152d19d66b0c325a163)):

```riscv
.data
data_peg:  .byte 0x41, 0x42, 0x43
data_disk: .word 0x31, 0x32, 0x33, 0x34, 0x0a

.text

...
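# Data layout per the .data section above: data_peg holds the ASCII bytes
# 'A','B','C'; data_disk holds the ASCII words '1'..'4', with the newline
# (0x0a) at byte offset 16.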
# handle peg name
la s4, data_peg
la s7, data_disk
add t0, s4, s2
add t2, s4, s3

# Print "Move Disk "
la a1, str1
li a2, 10
li a7, 0x40
li a0, 0x1
ecall

# handle & Print disk name
addi s6, s1, -32
add a1, s7, s6
li a2, 1
li a7, 0x40
li a0, 0x1
ecall

# Print " from "
la a1, str2
li a2, 6
li a7, 0x40
li a0, 0x1
ecall

# Print peg name
addi a1, t0, 0
li a2, 1
li a7, 0x40
li a0, 0x1
ecall

# Print " to "
la a1, str3
li a2, 4
li a7, 0x40
li a0, 0x1
ecall

# Print peg name
addi a1, t2, 0
li a2, 1
li a7, 0x40
li a0, 0x1
ecall

# Print newline
addi a1, s7, 16
li a2, 1
li a7, 0x40
li a0, 0x1
ecall
```

At this point, s4 and s7 hold the addresses of the ASCII data required to print the proper messages.

```bash
Test: My hanoi
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 637 Instructions: 645
```

Now we can see that, given disk number = 3, we get the correct results.

### - The First Modification

The positions of the disks are maintained in several word-aligned memory slots. The original code recomputes shift and base-offset arithmetic on every access; now these values are pre-computed and kept directly in registers, eliminating some instructions. For example, code like this (commit [aa8338f](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/aa8338fca159626f943b5f35b4f196966764b4c3)):

```riscv
addi s1, x0, 0
andi t1, t0, 1
bne t1, x0, disk_found
addi s1, x0, 1
andi t1, t0, 2
bne t1, x0, disk_found
addi s1, x0, 2

... some code ...

slli t0, s1, 2
addi t0, t0, 20
add t0, sp, t0
lw s2, 0(t0)
bne s1, x0, handle_large
```

is transformed into:

```riscv
addi s1, x0, 32
andi t1, t0, 1
bne t1, x0, disk_found
addi s1, s1, 4
andi t1, t0, 2
bne t1, x0, disk_found
addi s1, s1, 4

... some code ...

add t0, sp, s1
lw s2, 0(t0)
addi s6, s1, -32
bne s6, x0, handle_large
```

With this, the disk=3 hanoi run takes 571 (-66) cycles and 579 (-66) instructions.

Additionally, I found that the original implementation only handled hanoi problems with an odd number of disks, so I decided to extend its functionality. After searching some references, it turns out that the key to supporting an even disk count is the step direction: in the original code the step is +2, but the step must be -2 for an even number of disks. Working modulo 3 (three pegs), I implemented the code as:

```riscv
continue_move:
    add t0, sp, s1
    lw s2, 0(t0)
    addi s6, s1, -32
    bne s6, x0, handle_large
    bne s8, zero, odd
even:
    addi s3, s2, -2
    bge s3, zero, display_move
    addi s3, s3, 3
    jal x0, display_move
odd:
    addi s3, s2, 2
    addi t1, x0, 3
    blt s3, t1, display_move
    sub s3, s3, t1
    jal x0, display_move
handle_large:
    lw t1, 32(sp)
    addi s3, x0, 3
    sub s3, s3, s2
    sub s3, s3, t1
```

I verified the results on paper, and they look correct.
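To make the rule explicit: with the pegs indexed $0, 1, 2$, the smallest disk's peg $p$ advances by a fixed step each time it moves,

$$
p \leftarrow (p + 2) \bmod 3 \;\; \text{for odd } n, \qquad
p \leftarrow (p - 2) \bmod 3 = (p + 1) \bmod 3 \;\; \text{for even } n,
$$

which is exactly the conditional $\pm 2$ step selected between the `odd` and `even` branches above, while `handle_large` sends a larger disk to the only legal peg, $3 - p_{\text{disk}} - p_{\text{small}}$.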
```bash
Test: My hanoi (disk=3)
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 571 Instructions: 579

Test: My hanoi (disk=4)
Move Disk 1 from A to B
Move Disk 2 from A to C
Move Disk 1 from B to C
Move Disk 3 from A to B
Move Disk 1 from C to A
Move Disk 2 from C to B
Move Disk 1 from A to B
Move Disk 4 from A to C
Move Disk 1 from B to C
Move Disk 2 from B to A
Move Disk 1 from C to A
Move Disk 3 from B to C
Move Disk 1 from A to B
Move Disk 2 from A to C
Move Disk 1 from B to C
Cycles: 1110 Instructions: 1118
```

### - The Second Modification

As mentioned above, I found a bug, or at least a phenomenon: if I executed other functions that involve system calls, the get_cycles() results of my hanoi assembly looked weird.

```bash
----Before handling saved registers (s1~s11)----
Test: My hanoi
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 78441 Instructions: 78449
---
```

After reviewing my modified hanoi code, I found that I used some saved registers (s1~s11) without preserving them. The compiled C caller assumes the callee-saved registers survive the call, so clobbering them corrupts its state, and the bookkeeping around the csrr reads in get_cycles() ends up reporting a bogus end-cycle count. Therefore, after using the stack pointer x2 (i.e., sp) to save s1~s11 and restoring them before returning, I got reasonable results:

```bash
----After handling saved registers (s1~s11)----
Test: My hanoi
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 520 Instructions: 520

----Execution with other testing...----
Test: My hanoi
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 508 Instructions: 508
---
```

## Fast Reciprocal Square Root

### - Introduction

The fast reciprocal square root involves 32-bit multiplications and a look-up table for implementing Newton's method. I think the key to optimizing cycles lies in how we tackle the 32-bit multiplication. Therefore, my work substitutes my_mul for the mul32 used in fast_rsqrt().

### - Initial Deployment

At this point, I can easily deploy fast_rsqrt() in C:

```c
static const uint32_t rsqrt_table[32] = {
    65536, 46341, 32768, 23170, 16384, /* 2^0 to 2^4 */
    11585, 8192, 5793, 4096, 2896,     /* 2^5 to 2^9 */
    2048, 1448, 1024, 724, 512,        /* 2^10 to 2^14 */
    362, 256, 181, 128, 90,            /* 2^15 to 2^19 */
    64, 45, 32, 23, 16,                /* 2^20 to 2^24 */
    11, 8, 6, 4, 3,                    /* 2^25 to 2^29 */
    2, 1                               /* 2^30, 2^31 */
};

static inline unsigned clz(uint32_t x)
{
    int n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    } while (c);
    return n - x;
}

uint32_t fast_rsqrt(uint32_t x) /* scaling 2^16 */
{
    /* Handle edge cases */
    if (x == 0) return 0xFFFFFFFF;
    if (x == 1) return 65536;

    int exp = 31 - clz(x);
    uint32_t y = rsqrt_table[exp];
    if (x > (1u << exp)) {
        uint32_t y_next = (exp < 31) ?
            rsqrt_table[exp + 1] : 0;
        uint32_t delta = y - y_next;
        uint32_t frac = (uint32_t) ((((uint64_t) x - (1UL << exp)) << 16) >> exp);
        y -= (uint32_t) ((delta * frac) >> 16);
    }
    for (int i = 0; i < 2; i++) {
        uint32_t y2 = (uint32_t) mul32(y, y);
        uint32_t xy2 = (uint32_t) (mul32(x, y2) >> 16);
        y = (uint32_t) (mul32(y, (3u << 16) - xy2) >> 17);
    }
    return y;
}
```

After running it on rv32emu, I got this result:

```bash
Reciprocal Square Root PASSED
Cycles: 4435 Instructions: 4435
```

### - The First Modification

As mentioned above, I decided to reuse the multiplication assembly from my bfloat16 code to limit code size. Having adopted the "carry-save adder" concept, we obtain the 64-bit multiplication result via the a0 and a2 registers. To retrieve it, I added a union to my main.c. The modified code is written as (commit [b0cda41](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/b0cda417742dbded938a5d9e948ccfba3c0fb2f8)):

```c
typedef union {
    uint64_t whole;
    uint32_t part[2];
} my_uint64_t;

uint32_t fast_rsqrt(uint32_t x) /* scaling 2^16 */
{
    /* Handle edge cases */
    if (x == 0) return 0xFFFFFFFF;
    if (x == 1) return 65536;

    int exp = 31 - my_clz(x);
    uint32_t y = rsqrt_table[exp];
    if (x > (1u << exp)) {
        my_uint64_t tmp;
        uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0;
        uint32_t delta = y - y_next;
        uint32_t frac = (uint32_t) ((((uint64_t) x - (1UL << exp)) << 16) >> exp);
        tmp.part[0] = my_mul(delta, frac, &(tmp.part[1]));
        y -= (uint32_t) (tmp.whole >> 16);
    }
    for (int i = 0; i < 2; i++) {
        my_uint64_t tmp;
        tmp.part[0] = my_mul(y, y, &(tmp.part[1]));
        tmp.part[0] = my_mul(x, tmp.part[0], &(tmp.part[1]));
        uint32_t xy2 = (uint32_t) (tmp.whole >> 16);
        tmp.part[0] = my_mul(y, (3u << 16) - xy2, &(tmp.part[1]));
        y = (uint32_t) (tmp.whole >> 17);
    }
    return y;
}
```

Surprisingly, the improvement is about -56.1% just from changing the multiplication implementation:

```bash
Reciprocal Square Root PASSED
Cycles: 1945 Instructions: 1945
```

### - The Second Modification

Next, I noted that a clz based on binary search needs only $\log_2{32}=5$ comparison steps, so I decided to implement it in assembly with loop unrolling. The idea is that the program always inspects the MSB-side half of the bits under consideration: if that half is zero, the count increases by the half-width and the LSB-side half is shifted up into its place; if it is non-zero, the program narrows the search to that half and repeats the process.
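As a C reference for the same five steps (a sketch; the name clz_unrolled is mine, not from the repo):

```c
#include <stdint.h>

static unsigned clz_unrolled(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 0; /* mirrors the assembly's early exit; fast_rsqrt never passes 0 */
    if (!(x >> 16)) { n += 16; x <<= 16; } /* top half empty: count 16, shift up */
    if (!(x >> 24)) { n += 8;  x <<= 8;  }
    if (!(x >> 28)) { n += 4;  x <<= 4;  }
    if (!(x >> 30)) { n += 2;  x <<= 2;  }
    if (!(x >> 31)) { n += 1; }
    return n;
}
```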
The corresponding assembly is (commit [bafadaa](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/bafadaad1b5beb6b20672259c6834691d64fabd3)):

```riscv
# === my_clz ===
.global my_clz
.type my_clz,%function
my_clz:
    # a0 out (in)
    addi t0, x0, 0
    beq a0, zero, clz_rt
clz_step1:
    srli x28, a0, 16
    bne x28, x0, clz_step2
    addi t0, t0, 16
    slli a0, a0, 16
clz_step2:
    srli x28, a0, 24
    bne x28, x0, clz_step3
    addi t0, t0, 8
    slli a0, a0, 8
clz_step3:
    srli x28, a0, 28
    bne x28, x0, clz_step4
    addi t0, t0, 4
    slli a0, a0, 4
clz_step4:
    srli x28, a0, 30
    bne x28, x0, clz_step5
    addi t0, t0, 2
    slli a0, a0, 2
clz_step5:
    srli x28, a0, 31
    bne x28, x0, clz_rt
    addi t0, t0, 1
clz_rt:
    add a0, x0, t0
    ret
.size my_clz,.-my_clz
```

After applying ```int exp = 31 - my_clz(x);```, the result shows a 57-cycle improvement:

```bash
Reciprocal Square Root PASSED
Cycles: 1888 Instructions: 1888
```

## Highlights of Disassembly Results

### - 32-bit multiplication

Code generated by the compiler:

```riscv
00010294 <mul32>:
   10294: fd010113 addi sp,sp,-48
   10298: 02112623 sw ra,44(sp)
   1029c: 02812423 sw s0,40(sp)
   102a0: 03010413 addi s0,sp,48
   102a4: fca42e23 sw a0,-36(s0)
   102a8: fcb42c23 sw a1,-40(s0)
   102ac: 00000513 li a0,0
   102b0: 00000593 li a1,0
   102b4: fea42423 sw a0,-24(s0)
   102b8: feb42623 sw a1,-20(s0)
   102bc: fe042223 sw zero,-28(s0)
   102c0: 09c0006f j 1035c <mul32+0xc8>
   102c4: fe442583 lw a1,-28(s0)
   102c8: 00100513 li a0,1
   102cc: 00b51533 sll a0,a0,a1
   102d0: fd842583 lw a1,-40(s0)
   102d4: 00b575b3 and a1,a0,a1
   102d8: 06058c63 beqz a1,10350 <mul32+0xbc>
   102dc: fdc42583 lw a1,-36(s0)
   102e0: 00058613 mv a2,a1
   102e4: 00000693 li a3,0
   102e8: fe442583 lw a1,-28(s0)
   102ec: fe058593 addi a1,a1,-32
   102f0: 0005c863 bltz a1,10300 <mul32+0x6c>
   102f4: 00b617b3 sll a5,a2,a1
   102f8: 00000713 li a4,0
   102fc: 02c0006f j 10328 <mul32+0x94>
   10300: 00165513 srli a0,a2,0x1
   10304: 01f00813 li a6,31
   10308: fe442583 lw a1,-28(s0)
   1030c: 40b805b3 sub a1,a6,a1
   10310: 00b555b3 srl a1,a0,a1
   10314: fe442503 lw a0,-28(s0)
   10318: 00a697b3 sll a5,a3,a0
   1031c: 00f5e7b3 or a5,a1,a5
   10320: fe442583 lw a1,-28(s0)
   10324: 00b61733 sll a4,a2,a1
   10328: fe842803 lw a6,-24(s0)
   1032c: fec42883 lw a7,-20(s0)
   10330: 00e80533 add a0,a6,a4
   10334: 00050313 mv t1,a0
   10338: 01033333 sltu t1,t1,a6
   1033c: 00f885b3 add a1,a7,a5
   10340: 00b30833 add a6,t1,a1
   10344: 00080593 mv a1,a6
   10348: fea42423 sw a0,-24(s0)
   1034c: feb42623 sw a1,-20(s0)
   10350: fe442583 lw a1,-28(s0)
   10354: 00158593 addi a1,a1,1
   10358: feb42223 sw a1,-28(s0)
   1035c: fe442503 lw a0,-28(s0)
   10360: 01f00593 li a1,31
   10364: f6a5d0e3 bge a1,a0,102c4 <mul32+0x30>
   10368: fe842703 lw a4,-24(s0)
   1036c: fec42783 lw a5,-20(s0)
   10370: 00070513 mv a0,a4
   10374: 00078593 mv a1,a5
   10378: 02c12083 lw ra,44(sp)
   1037c: 02812403 lw s0,40(sp)
   10380: 03010113 addi sp,sp,48
   10384: 00008067 ret
```

Code written by me:

```riscv
00013680 <my_mul>:
   13680: 00a00eb3 add t4,zero,a0
   13684: 04050863 beqz a0,136d4 <my_mul_ret>
   13688: 00000513 li a0,0
   1368c: 00000293 li t0,0
   13690: 00000993 li s3,0
   13694: 00000913 li s2,0
   13698: 02058e63 beqz a1,136d4 <my_mul_ret>

0001369c <my_mul_loop>:
   1369c: 00199993 slli s3,s3,0x1
   136a0: 012989b3 add s3,s3,s2
   136a4: 0015fe13 andi t3,a1,1
   136a8: 000e0c63 beqz t3,136c0 <my_mul_loop_1>
   136ac: 01d50e33 add t3,a0,t4
   136b0: 00ae3933 sltu s2,t3,a0
   136b4: 012282b3 add t0,t0,s2
   136b8: 013282b3 add t0,t0,s3
   136bc: 01c00533 add a0,zero,t3

000136c0 <my_mul_loop_1>:
   136c0: 001e9e13 slli t3,t4,0x1
   136c4: 01de3933 sltu s2,t3,t4
   136c8: 01c00eb3 add t4,zero,t3
   136cc: 0015d593 srli a1,a1,0x1
   136d0: fc0596e3 bnez a1,1369c <my_mul_loop>

000136d4 <my_mul_ret>:
   136d4: 00562023 sw t0,0(a2)
   136d8: 00008067 ret
```

### - Count Leading Zeros

Code generated by the compiler:

```riscv
000105b8 <clz>:
   105b8: fd010113 addi sp,sp,-48
   105bc: 02112623 sw ra,44(sp)
   105c0: 02812423 sw s0,40(sp)
   105c4: 03010413 addi s0,sp,48
   105c8: fca42e23 sw a0,-36(s0)
   105cc: 02000793 li a5,32
   105d0: fef42623 sw a5,-20(s0)
   105d4: 01000793 li a5,16
   105d8: fef42423 sw a5,-24(s0)
   105dc: fe842783 lw a5,-24(s0)
   105e0: fdc42703 lw a4,-36(s0)
   105e4: 00f757b3 srl a5,a4,a5
   105e8: fef42223 sw a5,-28(s0)
   105ec: fe442783 lw a5,-28(s0)
   105f0: 00078e63 beqz a5,1060c <clz+0x54>
   105f4: fec42703 lw a4,-20(s0)
   105f8: fe842783 lw a5,-24(s0)
   105fc: 40f707b3 sub a5,a4,a5
   10600: fef42623 sw a5,-20(s0)
   10604: fe442783 lw a5,-28(s0)
   10608: fcf42e23 sw a5,-36(s0)
   1060c: fe842783 lw a5,-24(s0)
   10610: 4017d793 srai a5,a5,0x1
   10614: fef42423 sw a5,-24(s0)
   10618: fe842783 lw a5,-24(s0)
   1061c: fc0790e3 bnez a5,105dc <clz+0x24>
   10620: fec42703 lw a4,-20(s0)
   10624: fdc42783 lw a5,-36(s0)
   10628: 40f707b3 sub a5,a4,a5
   1062c: 00078513 mv a0,a5
   10630: 02c12083 lw ra,44(sp)
   10634: 02812403 lw s0,40(sp)
   10638: 03010113 addi sp,sp,48
   1063c: 00008067 ret
```

Code written by me:

```riscv
0001377c <my_clz>:
   1377c: 00000293 li t0,0
   13780: 04050863 beqz a0,137d0 <clz_rt>

00013784 <clz_step1>:
   13784: 01055e13 srli t3,a0,0x10
   13788: 000e1663 bnez t3,13794 <clz_step2>
   1378c: 01028293 addi t0,t0,16
   13790: 01051513 slli a0,a0,0x10

00013794 <clz_step2>:
   13794: 01855e13 srli t3,a0,0x18
   13798: 000e1663 bnez t3,137a4 <clz_step3>
   1379c: 00828293 addi t0,t0,8
   137a0: 00851513 slli a0,a0,0x8

000137a4 <clz_step3>:
   137a4: 01c55e13 srli t3,a0,0x1c
   137a8: 000e1663 bnez t3,137b4 <clz_step4>
   137ac: 00428293 addi t0,t0,4
   137b0: 00451513 slli a0,a0,0x4

000137b4 <clz_step4>:
   137b4: 01e55e13 srli t3,a0,0x1e
   137b8: 000e1663 bnez t3,137c4 <clz_step5>
   137bc: 00228293 addi t0,t0,2
   137c0: 00251513 slli a0,a0,0x2

000137c4 <clz_step5>:
   137c4: 01f55e13 srli t3,a0,0x1f
   137c8: 000e1463 bnez t3,137d0 <clz_rt>
   137cc: 00128293 addi t0,t0,1

000137d0 <clz_rt>:
   137d0: 00500533 add a0,zero,t0
   137d4: 00008067 ret
```