# Assignment2: Complete Applications
## My bfloat16 implementations
Github repo link: https://github.com/kyl6092/ca2025_assignment2_rv32emu
### - Introduction
In my last assignment, I implemented several bfloat16 operations in assembly and evaluated them in the Ripes simulator. The key feature is that the assembly code can be configured for either IEEE 754 single-precision floating point (f32) or bfloat16 (bf16). By setting specific arguments (a3~a6), we can retrieve the corresponding results. For example, a3 is the shift offset of the mantissa mask and a4 is the shift offset of the exponent mask, while a5 carries the shift offset of the sign mask and a6 is a parameter for multiplication.
In the last repository, I accidentally used the "mul" instruction, which belongs to the M extension. I have now removed it and written a new procedure called "my_mul" so the applications can run on the rv32emu bare-metal system. The optimization of "my_mul" is also covered in the following sections.
### - Initial Deployment
To deploy my bfloat16 assembly on the rv32emu system, I changed the layout of the assembly code. Specifically, I removed the main label and test suites from the code and wrote the corresponding test functions in a C program, the "main.c" file. After that, I built the application with a modified Makefile and the proper setup (from [Lab2: RISC-V](https://hackmd.io/@sysprog/Sko2Ja5pel)). My main.c is adapted from the instructor's chacha test suite, which is located in the playground folder. It already contains some bfloat16 functions implemented in C, so I can compare the performance of the compiled code against my assembly code.
To clearly express my work, the modifications are explicitly listed below:
- First, I modified the "Makefile" to add my bfloat16 implementation in the rule list.
- Second, I added the function declarations to main.c so the linker knows where to find the implementations (a hypothetical usage sketch follows this list). For example (from commit [b40d254](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/b40d254d772da9f30db392c51f9e5964529812cc))
```c
extern uint16_t f32_to_bf16(const uint32_t in);
extern uint32_t bf16_to_f32(const uint16_t in);
extern uint32_t my_add(
    const uint32_t in1,
    const uint32_t in2,
    const uint32_t reserv,
    const uint32_t mant_offset,
    const uint32_t exp_offset,
    const uint32_t sign_offset
);
...
```
- Last, I utilized the get_cycles() and get_instret() described in ```system/perfcounter.S``` to analyze the improvement. (commit [68a8d5f](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/68a8d5f0c944e6d9fb9efc414c18ed3649f714de))
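For illustration, a bf16-configured call through these interfaces might look like the sketch below; the offset constants are hypothetical placeholders, and the actual values depend on the mask conventions in my assembly.
```c
#include <stdint.h>

extern uint32_t my_add(const uint32_t in1, const uint32_t in2,
                       const uint32_t reserv, const uint32_t mant_offset,
                       const uint32_t exp_offset, const uint32_t sign_offset);

/* Hypothetical bf16 configuration: 7-bit mantissa, 8-bit exponent,
 * sign at bit 15. The offsets 7/7/15 are placeholders, not the real values. */
static uint32_t add_bf16(uint32_t a, uint32_t b)
{
    return my_add(a, b, 0, 7, 7, 15);
}
```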
Here is the CSR cycle performance of the operations implemented in C:
```bash
bf16_add PASSED
Cycles: 432
Instructions: 432
bf16_sub PASSED
Cycles: 373
Instructions: 373
bf16_mul PASSED
Cycles: 464
Instructions: 464
bf16_div PASSED
Cycles: 624
Instructions: 624
bf16_sqrt PASSED
Cycles: 1586
Instructions: 1586
```
Next, the performance of my initial bfloat16 implementations is listed below. (Note that I accidentally used M-extension instructions in the last assignment, so I first implemented a simple version of 32-bit multiplication.)
```bash
bf16_add PASSED
Cycles: 166
Instructions: 166
bf16_sub PASSED
Cycles: 190
Instructions: 190
bf16_mul PASSED
Cycles: 736
Instructions: 736
bf16_div PASSED
Cycles: 271
Instructions: 271
bf16_sqrt PASSED
Cycles: 3635
Instructions: 3635
```
It is noteworthy that the cycle counts of addition, subtraction, and division improve by 61.5%, 49%, and 56.5% compared to the original C implementations. However, we get degraded performance in multiplication and in the square root operation, which involves multiplications. At this point, my multiplication assembly simply accumulates the result by repeatedly adding the multiplicand to a register, iterating as many times as the value of the multiplier. This costs a large number of cycles as the multiplier grows.
The code is written as:
```riscv=
# === my_mul ===
.globl my_mul
.type my_mul,%function
my_mul:
# a0 out (in1)
# a1 in2
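# naive multiply: accumulate the multiplicand in1 into a0, in2 times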
beq a0, zero, my_mul_zero
beq a1, zero, my_mul_zero
add x29, x0, a0
addi a0, x0, 0
addi x28, x0, 0
my_mul_loop:
add a0, a0, x29
addi x28, x28, 1
bne x28, a1, my_mul_loop
my_mul_ret:
ret
my_mul_zero:
addi a0, x0, 0
ret
.size my_mul,.-my_mul
```
Therefore, we need a better multiplication algorithm to reduce the cycle count.
### - The First Modification
In this case, we can use the shift-and-add method to implement an efficient multiplication. The idea is that we can achieve this with RV32I instructions alone, without any extensions. The algorithm is described as follows:
1. Let the return register be zero.
2. Check the LSB of the multiplier. (Just like performing a polynomial multiplication by hand.)
3. If the LSB is 1, add the multiplicand to the return register.
4. Shift the multiplicand left by one position.
5. Shift the multiplier right by one position. (This moves the next bit to be checked into the LSB position; in step 2, checking the LSB can be modeled by ```& 0x1```.)
6. If the multiplier $\neq$ 0, go to step 2; otherwise return the result register.
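Before moving to the assembly, the same algorithm in C might look like this (a minimal sketch for reference; the project calls the assembly version directly):
```c
#include <stdint.h>

/* Shift-and-add multiply using only RV32I-style operations. */
static uint32_t shift_add_mul(uint32_t multiplicand, uint32_t multiplier)
{
    uint32_t result = 0;                /* step 1 */
    while (multiplier != 0) {
        if (multiplier & 0x1)           /* steps 2-3: LSB set -> add */
            result += multiplicand;
        multiplicand <<= 1;             /* step 4 */
        multiplier >>= 1;               /* step 5 */
    }                                   /* step 6: loop while multiplier != 0 */
    return result;
}
```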
The assembly code is written as (from commit [34f7495](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/34f749572e2086657c0d07c92fe28f00b32b97df))
```riscv=
# === my_mul ===
.globl my_mul
.type my_mul,%function
my_mul:
# a0 out (in1)
# a1 in2
add x29, x0, a0
beq a0, zero, my_mul_ret
addi a0, x0, 0
beq a1, zero, my_mul_ret
my_mul_loop:
andi x28, a1, 1
beq x28, zero, my_mul_loop_1
add a0, a0, x29
my_mul_loop_1:
slli x29, x29, 1
srli a1, a1, 1
bne a1, zero, my_mul_loop
my_mul_ret:
ret
.size my_mul,.-my_mul
```
```bash
bf16_mul PASSED
Cycles: 226
Instructions: 226
bf16_sqrt PASSED
Cycles: 539
Instructions: 539
```
Then we get better cycle performance. Compared to the C implementations, bf16_mul and bf16_sqrt in this version reduce cycles by 51.2% and 66%, respectively.
### - The Second Modification
Afterwards, I encountered the problem that RV32I gives us no full 32-bit multiplication: when I wanted to optimize other routines such as the reciprocal square root, the current my_mul assembly could not compute the right answer because the upper 32 bits of the product are lost. (We also want to limit code size, so we do not rely on the mul32 generated by the C compiler.) In this case, I borrowed the concept of the "carry-save adder" design from VLSI.
Since we have spare 32-bit registers in the RISC-V architecture, we can track each shift and addition with additional registers. That is, the additional registers record the overflow (carry) bits, which are accumulated into a second output word returned through the pointer in "a2". The two 32-bit words together compose a 64-bit result, which solves the problem I encountered at the beginning.
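To make the bookkeeping concrete, here is a C sketch of the same carry-tracking idea (illustrative only; my_mul does this directly in registers, returning the low word in a0 and storing the high word through the pointer passed in a2):
```c
#include <stdint.h>

/* 32x32 -> 64-bit shift-and-add multiply without native 64-bit arithmetic:
 * (hi, lo) accumulates the product; (m_hi, m_lo) is the shifted multiplicand. */
static uint32_t mul_32x32(uint32_t m_lo, uint32_t multiplier, uint32_t *hi_out)
{
    uint32_t lo = 0, hi = 0, m_hi = 0;
    while (multiplier != 0) {
        if (multiplier & 1) {
            uint32_t sum = lo + m_lo;
            hi += (sum < lo);  /* carry out of the low-word add (the sltu trick) */
            hi += m_hi;        /* add the high word of the shifted multiplicand */
            lo = sum;
        }
        uint32_t shifted = m_lo << 1;
        m_hi = (m_hi << 1) | (shifted < m_lo);  /* old MSB moves into the high word */
        m_lo = shifted;
        multiplier >>= 1;
    }
    *hi_out = hi;
    return lo;
}
```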
The modified version is written as (from commit [b0cda41](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/b0cda417742dbded938a5d9e948ccfba3c0fb2f8))
```riscv
# === my_mul ===
.globl my_mul
.type my_mul,%function
my_mul:
# a0 out1 (in1)
# a1 in2
# a2 (out2)
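# t0 accumulates the high word of the product (stored to 0(a2) on return);
# x19 tracks the high word of the shifted multiplicand; x18 holds carry bits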
add x29, x0, a0
beq a0, zero, my_mul_ret
addi a0, x0, 0
addi t0, x0, 0
addi x19, x0, 0
addi x18, x0, 0
beq a1, zero, my_mul_ret
my_mul_loop:
slli x19, x19, 1
add x19, x19, x18
andi x28, a1, 1
beq x28, zero, my_mul_loop_1
add x28, a0, x29
sltu x18, x28, a0
add t0, t0, x18
add t0, t0, x19
add a0, x0, x28
my_mul_loop_1:
slli x28, x29, 1
sltu x18, x28, x29
add x29, x0, x28
srli a1, a1, 1
bne a1, zero, my_mul_loop
my_mul_ret:
sw t0, 0(a2)
ret
.size my_mul,.-my_mul
```
```bash
bf16_mul PASSED
Cycles: 270
Instructions: 270
bf16_sqrt PASSED
Cycles: 899
Instructions: 899
```
Although the cycle counts of the bfloat16 multiplication and square root increase slightly, reusing my_mul for the reciprocal square root yields the dominant improvement there. The detailed outcomes will be shown in the following section. In summary, the functionality of my_mul is extended with only a slight computational overhead.
## Tower of Hanoi
### - Introduction
The Tower of Hanoi in the instructor's solution focuses on maintaining a Gray code to trace the Hamiltonian path. The original assembly code targets the Ripes simulator. Now I have made the code run on rv32emu and successfully produced the correct answer. Besides, based on the memory layout I learned in class, I performed some modifications to reduce cycles and instructions.
**Last, I also found a bug (or a phenomenon) that may be worth reporting to the rv32emu GitHub repo, but at this moment I'm not sure how to address it. The related description is in the Second Modification.**
### - Initial Deployment
To make rv32emu simulate the assembly code successfully, I needed to tackle the system call issues. Since system calls are defined differently in the Ripes simulator and in rv32emu, the printing assembly code had to be modified first.
The original printing code is
```riscv
la x10, str1
addi x17, x0, 4
ecall
addi x10, x9, 1
addi x17, x0, 1
ecall
la x10, str2
addi x17, x0, 4
ecall
addi x10, x11, 0
addi x17, x0, 11
ecall
la x10, str3
addi x17, x0, 4
ecall
addi x10, x12, 0
addi x17, x0, 11
ecall
addi x10, x0, 10
addi x17, x0, 11
ecall
```
According to rv32emu, the system call interface uses the a0, a1, a2, and a7 registers: I set a0 to 0x1 (stdout) and a7 to 0x40 (the "write" syscall number), while a1 and a2 hold the buffer pointer and the character count. Therefore, the code is modified as follows (referenced from commit [b9f77f2](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/b9f77f2bb7fcb81e41a7b152d19d66b0c325a163))
```riscv
.data
data_peg: .byte 0x41, 0x42, 0x43
data_disk: .word 0x31, 0x32, 0x33, 0x34, 0x0a
.text
...
# handle peg name
la s4, data_peg
la s7, data_disk
add t0, s4, s2
add t2, s4, s3
# Print "Move Disk "
la a1, str1
li a2, 10
li a7, 0x40
li a0, 0x1
ecall
# handle & Print disk name
addi s6, s1, -32
add a1, s7, s6
li a2, 1
li a7, 0x40
li a0, 0x1
ecall
# Print " from "
la a1, str2
li a2, 6
li a7, 0x40
li a0, 0x1
ecall
# Print peg name
addi a1, t0, 0
li a2, 1
li a7, 0x40
li a0, 0x1
ecall
# Print " to "
la a1, str3
li a2, 4
li a7, 0x40
li a0, 0x1
ecall
# Print peg name
addi a1, t2, 0
li a2, 1
li a7, 0x40
li a0, 0x1
ecall
# Print newline
addi a1, s7, 16
li a2, 1
li a7, 0x40
li a0, 0x1
ecall
```
At this point, s4 and s7 contain the required ASCII codes for printing the proper messages.
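Conceptually, each ecall above maps onto a POSIX-style write call, with a0 as the file descriptor, a1 as the buffer pointer, a2 as the byte count, and a7 selecting the syscall. A hosted-C sketch of the same operation:
```c
#include <unistd.h>

/* Conceptual equivalent of one print ecall:
 * a0 = 1 (stdout), a1 = buffer pointer, a2 = byte count, a7 = 0x40 (write). */
static void print_buf(const char *buf, int len)
{
    (void) write(1, buf, len);
}
```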
```bash
Test: My hanoi
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 637
Instructions: 645
```
Now we can see that, with 3 disks, we get the correct results.
### - The First Modification
The positions of the disks are maintained in several word-aligned memory slots. The original code performs some shift operations and base-offset handling at every lookup. Now these are pre-computed and assigned directly to the registers, eliminating some instructions. For example, code like this: (commit [aa8338f](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/aa8338fca159626f943b5f35b4f196966764b4c3))
```riscv
addi s1, x0, 0
andi t1, t0, 1
bne t1, x0, disk_found
addi s1, x0, 1
andi t1, t0, 2
bne t1, x0, disk_found
addi s1, x0, 2
... some code ...
slli t0, s1, 2
addi t0, t0, 20
add t0, sp, t0
lw s2, 0(t0)
bne s1, x0, handle_large
```
will be transformed into
```riscv
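# s1 now holds the pre-computed stack offset (32 + 4*index) directly;
# s6 = s1 - 32 recovers the scaled disk index where it is still needed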
addi s1, x0, 32
andi t1, t0, 1
bne t1, x0, disk_found
addi s1, s1, 4
andi t1, t0, 2
bne t1, x0, disk_found
addi s1, s1, 4
... some code ...
add t0, sp, s1
lw s2, 0(t0)
addi s6, s1, -32
bne s6, x0, handle_large
```
With this, the 3-disk Hanoi problem takes 571 (-66) cycles and 579 (-66) instructions.
Additionally, I found that the original implementation only handled Hanoi problems with an odd number of disks, so I decided to extend its functionality. After searching through some references, it turns out that the key to solving the even-disk case is the step direction: in the original code the step is +2, but it must be -2 for an even number of disks. Working modulo 3 (three pegs), I implemented the code below; a C sketch of the peg update follows the assembly:
```riscv
continue_move:
add t0, sp, s1
lw s2, 0(t0)
addi s6, s1, -32
bne s6, x0, handle_large
bne s8, zero, odd
even:
addi s3, s2, -2
bge s3, zero, display_move
addi s3, s3, 3
jal x0, display_move
odd:
addi s3, s2, 2
addi t1, x0, 3
blt s3, t1, display_move
sub s3, s3, t1
jal x0, display_move
handle_large:
lw t1, 32(sp)
addi s3, x0, 3
sub s3, s3, s2
sub s3, s3, t1
```
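In C terms, the direction handling above reduces to a step on the peg index modulo 3 (a sketch, with the three pegs numbered 0 to 2):
```c
/* Smallest-disk move: step +2 (mod 3) when the total disk count is odd,
 * and -2 (which equals +1 mod 3) when it is even. */
static int next_peg(int src, int odd_disk_count)
{
    int step = odd_disk_count ? 2 : 1; /* -2 mod 3 == 1 */
    return (src + step) % 3;
}
```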
I verified the results against references, and they look correct.
```bash
Test: My hanoi (disk=3)
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 571
Instructions: 579
Test: My hanoi (disk=4)
Move Disk 1 from A to B
Move Disk 2 from A to C
Move Disk 1 from B to C
Move Disk 3 from A to B
Move Disk 1 from C to A
Move Disk 2 from C to B
Move Disk 1 from A to B
Move Disk 4 from A to C
Move Disk 1 from B to C
Move Disk 2 from B to A
Move Disk 1 from C to A
Move Disk 3 from B to C
Move Disk 1 from A to B
Move Disk 2 from A to C
Move Disk 1 from B to C
Cycles: 1110
Instructions: 1118
```
### - The Second Modification
As mentioned above, I found a bug (or a phenomenon): if I run other functions that involve system calls first, the get_cycles results of my Hanoi assembly look wrong.
```bash
----Before handling save registers (s1~s11)----
Test: My hanoi
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 78441
Instructions: 78449
---
```
After reviewing my modified Hanoi code, I found that I had used some saved registers (s1~s11) without preserving them. This can corrupt the caller's state, which may cause the csrr instruction in get_cycles to read the wrong values and return a strange end-cycle count. Therefore, after using the stack pointer x2 (or sp) to save and restore s1~s11 before returning, I got reasonable results:
```bash
----After handling save registers (s1~s11)----
Test: My hanoi
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 520
Instructions: 520
----Execution with other testing...----
Test: My hanoi
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 508
Instructions: 508
---
```
## Fast Reciprocal Square Root
### - Introduction
The fast reciprocal square root involves 32-bit multiplications and a look-up table for implementing Newton's method. I think the key to optimizing cycles lies in how we handle the 32-bit multiplications. Therefore, my work substitutes my_mul for the mul32 used in fast_rsqrt().
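For reference, Newton's method for $y = 1/\sqrt{x}$ iterates $y_{n+1} = \frac{1}{2}\,y_n\,(3 - x\,y_n^2)$. With values scaled by $2^{16}$ as in the code below, one iteration becomes `y = (y * ((3 << 16) - xy2)) >> 17`, where the shift by 17 combines the division by 2 with removing the extra scale factor.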
### - Initial Deployment
At this point, I can easily deploy fast_rsqrt() in C:
```c
static const uint32_t rsqrt_table[32] = {
    65536, 46341, 32768, 23170, 16384, /* 2^0 to 2^4 */
    11585, 8192, 5793, 4096, 2896,     /* 2^5 to 2^9 */
    2048, 1448, 1024, 724, 512,        /* 2^10 to 2^14 */
    362, 256, 181, 128, 90,            /* 2^15 to 2^19 */
    64, 45, 32, 23, 16,                /* 2^20 to 2^24 */
    11, 8, 6, 4, 3,                    /* 2^25 to 2^29 */
    2, 1                               /* 2^30, 2^31 */
};
static inline unsigned clz(uint32_t x)
{
    int n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    } while (c);
    return n - x;
}
uint32_t fast_rsqrt(uint32_t x)
{
    // scaling 2^16
    /* Handle edge cases */
    if (x == 0) return 0xFFFFFFFF;
    if (x == 1) return 65536;
    int exp = 31 - clz(x);
    uint32_t y = rsqrt_table[exp];
    if (x > (1u << exp)) {
        uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0;
        uint32_t delta = y - y_next;
        uint32_t frac = (uint32_t) ((((uint64_t)x - (1UL << exp)) << 16) >> exp);
        y -= (uint32_t) ((delta * frac) >> 16);
    }
    for (int i = 0; i < 2; i++) {
        uint32_t y2 = (uint32_t) mul32(y, y);
        uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16);
        y = (uint32_t)(mul32(y, (3u << 16) - xy2) >> 17);
    }
    return y;
}
```
After running it with rv32emu, I got the result:
```bash
Reciprocal Square Root PASSED
Cycles: 4435
Instructions: 4435
```
### - The First Modification
As mentioned above, I decided to reuse my multiplication assembly from the bfloat16 work to limit code size. After adopting the carry-save-adder concept, we can get a 64-bit multiplication result via the a0 and a2 registers. To retrieve it, I added a union to my main.c.
The modified code is written as (commit [b0cda41](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/b0cda417742dbded938a5d9e948ccfba3c0fb2f8))
```c
typedef union {
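    /* RV32 is little-endian: part[0] is the low word, part[1] the high word */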
    uint64_t whole;
    uint32_t part[2];
} my_uint64_t;
uint32_t fast_rsqrt(uint32_t x)
{
    // scaling 2^16
    /* Handle edge cases */
    if (x == 0) return 0xFFFFFFFF;
    if (x == 1) return 65536;
    int exp = 31 - my_clz(x);
    uint32_t y = rsqrt_table[exp];
    if (x > (1u << exp)) {
        my_uint64_t tmp;
        uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0;
        uint32_t delta = y - y_next;
        uint32_t frac = (uint32_t) ((((uint64_t)x - (1UL << exp)) << 16) >> exp);
        tmp.part[0] = my_mul(delta, frac, &(tmp.part[1]));
        y -= (uint32_t) (tmp.whole >> 16);
    }
    for (int i = 0; i < 2; i++) {
        my_uint64_t tmp;
        tmp.part[0] = my_mul(y, y, &(tmp.part[1]));
        tmp.part[0] = my_mul(x, tmp.part[0], &(tmp.part[1]));
        uint32_t xy2 = (uint32_t)(tmp.whole >> 16);
        tmp.part[0] = my_mul(y, (3u << 16) - xy2, &(tmp.part[1]));
        y = (uint32_t)(tmp.whole >> 17);
    }
    return y;
}
```
Surprisingly, the improvement is about 56.1% just from changing the multiplication implementation.
```bash
Reciprocal Square Root PASSED
Cycles: 1945
Instructions: 1945
```
### - The Second Modification
Next, I noted that a clz based on binary search needs only $\log_2{32}=5$ iterations, so I decided to fully unroll the loop in the assembly implementation.
The concept is that the program always examines the upper half of the remaining bits. If the upper half is zero, the count is increased by the half-width and the lower half is shifted up to the MSB side; if it is non-zero, the search narrows to that upper half and the process repeats. (commit [bafadaa](https://github.com/kyl6092/ca2025_assignment2_rv32emu/commit/bafadaad1b5beb6b20672259c6834691d64fabd3))
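For reference, the unrolled binary search in C (a minimal sketch mirroring the assembly below):
```c
#include <stdint.h>

/* Unrolled binary-search clz; input 0 returns 0 here, mirroring the assembly
 * (fast_rsqrt already handles x == 0 before calling). */
static unsigned clz_unrolled(uint32_t x)
{
    unsigned n = 0;
    if (x == 0) return 0;
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8;  x <<= 8;  }
    if ((x >> 28) == 0) { n += 4;  x <<= 4;  }
    if ((x >> 30) == 0) { n += 2;  x <<= 2;  }
    if ((x >> 31) == 0) { n += 1; }
    return n;
}
```
The assembly implements exactly these steps: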
```riscv
# === my_clz ===
.global my_clz
.type my_clz,%function
my_clz:
# a0 out (in)
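# unrolled binary search: if the upper half of the remaining bits is zero,
# add its width to the count and shift the lower half up to the MSB side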
addi t0, x0, 0
beq a0, zero, clz_rt
clz_step1:
srli x28, a0, 16
bne x28, x0, clz_step2
addi t0, t0, 16
slli a0, a0, 16
clz_step2:
srli x28, a0, 24
bne x28, x0, clz_step3
addi t0, t0, 8
slli a0, a0, 8
clz_step3:
srli x28, a0, 28
bne x28, x0, clz_step4
addi t0, t0, 4
slli a0, a0, 4
clz_step4:
srli x28, a0, 30
bne x28, x0, clz_step5
addi t0, t0, 2
slli a0, a0, 2
clz_step5:
srli x28, a0, 31
bne x28, x0, clz_rt
addi t0, t0, 1
clz_rt:
add a0, x0, t0
ret
.size my_clz,.-my_clz
```
After applying ```int exp = 31-my_clz(x);```, the result shows a 57-cycle improvement:
```bash
Reciprocal Square Root PASSED
Cycles: 1888
Instructions: 1888
```
## Highlight of Disassembly Results
### - 32-bit multiplication
Code generated by the compiler, which spills intermediate values to the stack between almost every operation:
```riscv
00010294 <mul32>:
10294: fd010113 addi sp,sp,-48
10298: 02112623 sw ra,44(sp)
1029c: 02812423 sw s0,40(sp)
102a0: 03010413 addi s0,sp,48
102a4: fca42e23 sw a0,-36(s0)
102a8: fcb42c23 sw a1,-40(s0)
102ac: 00000513 li a0,0
102b0: 00000593 li a1,0
102b4: fea42423 sw a0,-24(s0)
102b8: feb42623 sw a1,-20(s0)
102bc: fe042223 sw zero,-28(s0)
102c0: 09c0006f j 1035c <mul32+0xc8>
102c4: fe442583 lw a1,-28(s0)
102c8: 00100513 li a0,1
102cc: 00b51533 sll a0,a0,a1
102d0: fd842583 lw a1,-40(s0)
102d4: 00b575b3 and a1,a0,a1
102d8: 06058c63 beqz a1,10350 <mul32+0xbc>
102dc: fdc42583 lw a1,-36(s0)
102e0: 00058613 mv a2,a1
102e4: 00000693 li a3,0
102e8: fe442583 lw a1,-28(s0)
102ec: fe058593 addi a1,a1,-32
102f0: 0005c863 bltz a1,10300 <mul32+0x6c>
102f4: 00b617b3 sll a5,a2,a1
102f8: 00000713 li a4,0
102fc: 02c0006f j 10328 <mul32+0x94>
10300: 00165513 srli a0,a2,0x1
10304: 01f00813 li a6,31
10308: fe442583 lw a1,-28(s0)
1030c: 40b805b3 sub a1,a6,a1
10310: 00b555b3 srl a1,a0,a1
10314: fe442503 lw a0,-28(s0)
10318: 00a697b3 sll a5,a3,a0
1031c: 00f5e7b3 or a5,a1,a5
10320: fe442583 lw a1,-28(s0)
10324: 00b61733 sll a4,a2,a1
10328: fe842803 lw a6,-24(s0)
1032c: fec42883 lw a7,-20(s0)
10330: 00e80533 add a0,a6,a4
10334: 00050313 mv t1,a0
10338: 01033333 sltu t1,t1,a6
1033c: 00f885b3 add a1,a7,a5
10340: 00b30833 add a6,t1,a1
10344: 00080593 mv a1,a6
10348: fea42423 sw a0,-24(s0)
1034c: feb42623 sw a1,-20(s0)
10350: fe442583 lw a1,-28(s0)
10354: 00158593 addi a1,a1,1
10358: feb42223 sw a1,-28(s0)
1035c: fe442503 lw a0,-28(s0)
10360: 01f00593 li a1,31
10364: f6a5d0e3 bge a1,a0,102c4 <mul32+0x30>
10368: fe842703 lw a4,-24(s0)
1036c: fec42783 lw a5,-20(s0)
10370: 00070513 mv a0,a4
10374: 00078593 mv a1,a5
10378: 02c12083 lw ra,44(sp)
1037c: 02812403 lw s0,40(sp)
10380: 03010113 addi sp,sp,48
10384: 00008067 ret
```
Code written by me, which keeps all intermediates in registers:
```riscv
00013680 <my_mul>:
13680: 00a00eb3 add t4,zero,a0
13684: 04050863 beqz a0,136d4 <my_mul_ret>
13688: 00000513 li a0,0
1368c: 00000293 li t0,0
13690: 00000993 li s3,0
13694: 00000913 li s2,0
13698: 02058e63 beqz a1,136d4 <my_mul_ret>
0001369c <my_mul_loop>:
1369c: 00199993 slli s3,s3,0x1
136a0: 012989b3 add s3,s3,s2
136a4: 0015fe13 andi t3,a1,1
136a8: 000e0c63 beqz t3,136c0 <my_mul_loop_1>
136ac: 01d50e33 add t3,a0,t4
136b0: 00ae3933 sltu s2,t3,a0
136b4: 012282b3 add t0,t0,s2
136b8: 013282b3 add t0,t0,s3
136bc: 01c00533 add a0,zero,t3
000136c0 <my_mul_loop_1>:
136c0: 001e9e13 slli t3,t4,0x1
136c4: 01de3933 sltu s2,t3,t4
136c8: 01c00eb3 add t4,zero,t3
136cc: 0015d593 srli a1,a1,0x1
136d0: fc0596e3 bnez a1,1369c <my_mul_loop>
000136d4 <my_mul_ret>:
136d4: 00562023 sw t0,0(a2)
136d8: 00008067 ret
```
### - Count Leading Zeros
Code generated by the compiler:
```riscv
000105b8 <clz>:
105b8: fd010113 addi sp,sp,-48
105bc: 02112623 sw ra,44(sp)
105c0: 02812423 sw s0,40(sp)
105c4: 03010413 addi s0,sp,48
105c8: fca42e23 sw a0,-36(s0)
105cc: 02000793 li a5,32
105d0: fef42623 sw a5,-20(s0)
105d4: 01000793 li a5,16
105d8: fef42423 sw a5,-24(s0)
105dc: fe842783 lw a5,-24(s0)
105e0: fdc42703 lw a4,-36(s0)
105e4: 00f757b3 srl a5,a4,a5
105e8: fef42223 sw a5,-28(s0)
105ec: fe442783 lw a5,-28(s0)
105f0: 00078e63 beqz a5,1060c <clz+0x54>
105f4: fec42703 lw a4,-20(s0)
105f8: fe842783 lw a5,-24(s0)
105fc: 40f707b3 sub a5,a4,a5
10600: fef42623 sw a5,-20(s0)
10604: fe442783 lw a5,-28(s0)
10608: fcf42e23 sw a5,-36(s0)
1060c: fe842783 lw a5,-24(s0)
10610: 4017d793 srai a5,a5,0x1
10614: fef42423 sw a5,-24(s0)
10618: fe842783 lw a5,-24(s0)
1061c: fc0790e3 bnez a5,105dc <clz+0x24>
10620: fec42703 lw a4,-20(s0)
10624: fdc42783 lw a5,-36(s0)
10628: 40f707b3 sub a5,a4,a5
1062c: 00078513 mv a0,a5
10630: 02c12083 lw ra,44(sp)
10634: 02812403 lw s0,40(sp)
10638: 03010113 addi sp,sp,48
1063c: 00008067 ret
```
Code written by me:
```riscv
0001377c <my_clz>:
1377c: 00000293 li t0,0
13780: 04050863 beqz a0,137d0 <clz_rt>
00013784 <clz_step1>:
13784: 01055e13 srli t3,a0,0x10
13788: 000e1663 bnez t3,13794 <clz_step2>
1378c: 01028293 addi t0,t0,16
13790: 01051513 slli a0,a0,0x10
00013794 <clz_step2>:
13794: 01855e13 srli t3,a0,0x18
13798: 000e1663 bnez t3,137a4 <clz_step3>
1379c: 00828293 addi t0,t0,8
137a0: 00851513 slli a0,a0,0x8
000137a4 <clz_step3>:
137a4: 01c55e13 srli t3,a0,0x1c
137a8: 000e1663 bnez t3,137b4 <clz_step4>
137ac: 00428293 addi t0,t0,4
137b0: 00451513 slli a0,a0,0x4
000137b4 <clz_step4>:
137b4: 01e55e13 srli t3,a0,0x1e
137b8: 000e1663 bnez t3,137c4 <clz_step5>
137bc: 00228293 addi t0,t0,2
137c0: 00251513 slli a0,a0,0x2
000137c4 <clz_step5>:
137c4: 01f55e13 srli t3,a0,0x1f
137c8: 000e1463 bnez t3,137d0 <clz_rt>
137cc: 00128293 addi t0,t0,1
000137d0 <clz_rt>:
137d0: 00500533 add a0,zero,t0
137d4: 00008067 ret
```