
# Assignment 1: RISC-V Assembly and Instruction Pipeline
contributed by < [AnnTaiwan](https://github.com/AnnTaiwan/ca2025-quizzes/tree/main) >

>[!Note] AI tools usage
>I use ChatGPT to assist with Quiz 1 by providing code explanations, grammar revisions, pre-work research, code summaries, and explanations of standard RISC-V instruction usage.

### Command to calculate code size: (don't count `.data` area, comment lines, and empty lines)
```shell
# Count from the `main:` label onward, skipping comment lines and blank lines
awk '/^main:/{flag=1} flag && !/^[[:space:]]*#/ && NF' q1-bfloat16_rv32_mine_v1.s | wc -l
# Alternative using sed and grep
sed -n '/^main:/,$p' q1-bfloat16_rv32_mine_v1.s | grep -v '^[[:space:]]*#' | grep -v '^$' | wc -l
```

## Problem B: uf8_t decode and encode
* [Full RISC-V assembly program for Problem B.](https://github.com/AnnTaiwan/ca2025-quizzes/blob/main/q1-uf8_rv32_mine_v1.s)

### Overview
Problem B uses an 8-bit number to represent a 32-bit number, so it extends the range that an 8-bit number can represent. It provides `uf8_decode` and `uf8_encode` to implement this mapping.
* 8-bit number format:
```
┌──────────────┬────────────────┐
│ Exponent (4) │ Mantissa (4)   │
└──────────────┴────────────────┘
 7            4 3              0
```

#### Idea
Divide the number range into 16 buckets (selected by `exponent`); within each bucket, the 4-bit `mantissa` selects one of 16 values. The step size in each bucket is $2^{exponent}$.

#### uf8_decode
```c=
/* Decode uf8 to uint32_t */
uint32_t uf8_decode(uf8 fl)
{
    uint32_t mantissa = fl & 0x0f;
    uint8_t exponent = fl >> 4;
    uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
    return (mantissa << exponent) + offset;
}
```
1. Extract the exponent part and the mantissa part.
2. The start value of each bucket is $(2^{exponent} - 1) \times 16$ (the variable `offset` in the code).
3. Calculate the position within this bucket, `mantissa << exponent`.
4. The decoded value is **the start of this bucket + the offset in this bucket.**

#### uf8_encode
```c=
/* Encode uint32_t to uf8 */
uf8 uf8_encode(uint32_t value)
{
    /* Use CLZ for fast exponent calculation */
    if (value < 16)
        return value;

    /* Find appropriate exponent using CLZ hint */
    int lz = clz(value);
    int msb = 31 - lz;

    /* Start from a good initial guess */
    uint8_t exponent = 0;
    uint32_t overflow = 0;

    if (msb >= 5) {
        /* Estimate exponent - the formula is empirical */
        exponent = msb - 4;
        if (exponent > 15)
            exponent = 15;

        /* Calculate overflow for estimated exponent */
        for (uint8_t e = 0; e < exponent; e++)
            overflow = (overflow << 1) + 16;

        /* Adjust if estimate was off */
        while (exponent > 0 && value < overflow) {
            overflow = (overflow - 16) >> 1;
            exponent--;
        }
    }

    /* Find exact exponent */
    while (exponent < 15) {
        uint32_t next_overflow = (overflow << 1) + 16;
        if (value < next_overflow)
            break;
        overflow = next_overflow;
        exponent++;
    }

    uint8_t mantissa = (value - overflow) >> exponent;
    return (exponent << 4) | mantissa;
}
```
1. Use `CLZ` to locate the most significant set bit of the 32-bit `value`.
2. Find the bucket that contains `value`. Starting from the estimated exponent (capped at 15), step down while `value` is below the bucket's lower bound, then step up while `value` reaches the next bucket's lower bound, so that `value` lies between the lower and upper bounds of the chosen bucket.
3. Calculate the offset within this bucket and store it in `mantissa`.
4. Combine `exponent` and `mantissa` to get the encoded result.

### Assignment 1 discussion TODO
#### 1. Code design structure & Implementation
> Professor's question:
> Before writing code, I should write down the design idea and the structure of the code into a document.

I mostly translate the `q1-uf8.c` C code into RV32 code line by line. Therefore, I list the function name, parameters, return register, function purpose, and code flow in the notes below. The usage of arguments and return values follows the **calling convention**.
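Since the assembly is translated line by line from the C routines, a few hand-checkable values are handy as a reference while debugging the translation. The sketch below is only an illustrative model, not the code from `q1-uf8.c`: the helper names `bucket_start`, `decode_model`, and `encode_model` are hypothetical, and `encode_model` deliberately uses a plain linear bucket search instead of the `CLZ`-based estimate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: first value covered by bucket e, i.e. (2^e - 1) * 16. */
static uint32_t bucket_start(uint32_t e)
{
    return ((1u << e) - 1u) << 4;
}

/* Hypothetical model of the decode rule: bucket start + mantissa * step. */
static uint32_t decode_model(uint8_t code)
{
    uint32_t e = code >> 4;      /* upper 4 bits: bucket index */
    uint32_t m = code & 0x0Fu;   /* lower 4 bits: position in the bucket */
    return bucket_start(e) + (m << e);
}

/* Hypothetical model of the encode rule using a linear bucket search
 * (an assumption for clarity; the original uses a CLZ-based estimate). */
static uint8_t encode_model(uint32_t value)
{
    /* 1015792 is the largest decodable value (code 0xFF). */
    assert(value <= 1015792u);
    uint32_t e = 0;
    while (e < 15 && value >= bucket_start(e + 1))
        e++;
    /* Truncating shift, matching (value - overflow) >> exponent. */
    return (uint8_t)((e << 4) | ((value - bucket_start(e)) >> e));
}
```

For instance, `decode_model(0x1F)` gives `16 + 15 * 2 = 46`, and `decode_model(0xFF)` gives `524272 + 15 * 32768 = 1015792`, which still fits in 20 bits. Values that fall between two steps, such as `47`, encode to the lower step (`0x1F`).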
##### Calling convention
[reference](https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf)

A calling convention defines how arguments and return values are passed when calling a function. According to this calling convention, I place the function arguments sequentially in registers `a0`-`a7` when calling a function, and retrieve the return values from registers `a0`-`a1`.

| Purpose                | Register(s)          | Notes                                                            |
| ---------------------- | -------------------- | ---------------------------------------------------------------- |
| Function arguments     | `a0`–`a7`            | Up to 8 arguments                                                |
| Return values          | `a0`–`a1`            | 1 or 2 return values                                             |
| Return address         | `ra`                 | Caller-saved; a non-leaf function saves it before calling others |
| Callee-saved registers | `s0`–`s11`           | Must be preserved by callee                                      |
| Caller-saved registers | `t0`–`t6`, `a0`–`a7` | Can be overwritten by callee                                     |
| Stack pointer          | `sp`                 | Points to current stack frame                                    |

* For example:
```c=
int f(int a, int b){
    return a + b;
}
```
```
f:
    add a0, a0, a1   # result = a + b
    jr ra            # return with the return value in a0
```
>`a` is passed in `a0`
>`b` is passed in `a1`
>`a + b` is placed in `a0` and returned.

* In the functions below, I mostly use registers `t0`-`t6`. This reduces stack usage, because a callee that uses `s0`-`s11` must save and restore them.

---
##### Function explanation
##### ==Function: `main`==

**Purpose:**
Starts the test process by calling `test`. If the `test` function returns nonzero (true), it prints `"All tests passed."` and exits with code `0`. Otherwise, it prints `"Some tests failed."` and exits with code `1`.

**Parameters:**
* None (program entry point)

**Return Value:**
* `a0`: Exit code
    * Exit code `0` if all tests pass
    * Exit code `1` if any test fails

---
##### ==Function: `test`==

**Purpose:**
Tests the correctness of the `uf8_encode` and `uf8_decode` functions. It loops **through all 256 possible 8-bit uf8 codes (`0–255`)**, decodes each one to a 20-bit value, then re-encodes it back to 8 bits.
It then checks whether the code stays the same after decoding and re-encoding, and whether the decoded values are strictly increasing. If any check fails, it **prints detailed debug information.**

**Parameters:**
* None

**Return Value:**
* `a0 = 1` → all tests passed
* `a0 = 0` → at least one test failed

**Notes:**
* Use `jr ra` to return to `main`.
* `test` needs to store `ra` on the stack first, because it calls `uf8_encode` and `uf8_decode`. Each `jal` overwrites `ra` with the address those functions use to return to `test`, so `test` must save its own `ra` in order to return to `main` afterwards.

---
##### ==Function: `CLZ`==

**Purpose Summary:**
Implements a software version of the **Count Leading Zeros (CLZ)** operation. It determines how many zero bits appear before the most significant `1` bit.

**Parameters:**
* `a0`: Input integer value (32-bit)

**Return Value:**
* `a0`: Number of **leading zeros** in the 32-bit binary representation of the input value

**Notes:**
* Use `jr ra` to return to `uf8_encode`.
* It helps `uf8_encode` decrease the number of loop iterations by skipping the unused upper bits.

---
##### ==Function: `uf8_decode`==

**Purpose:**
Decodes an **8-bit uf8 number** into a normal **20-bit integer**. The formula reconstructs the integer value from the exponent and mantissa bits.

**Parameters:**
* `a0`: 8-bit uf8 code, which will be decoded into a normal 20-bit integer.
    * `uf8 d = [4-bit exponent | 4-bit mantissa]`
    * Upper 4 bits: exponent
    * Lower 4 bits: mantissa

**Return Value:**
* `a0`: 20-bit integer value (decoded result)

**Notes:**
* Use `jr ra` to return to `test`.

---
##### ==Function: `uf8_encode`==

**Purpose:**
Encodes a **20-bit integer** into an **8-bit uf8 representation**. If the value is smaller than 16, it is directly returned.
(In bucket 0, which covers 0~15, the step size is 1, so the values 0~15 encode to themselves.) Otherwise, it calls `CLZ` to find the position of the most significant bit (MSB), computes the exponent and mantissa accordingly, and combines them into one byte as the encoded result. Overall, this function finds which bucket the argument `a0` falls in and its offset within that bucket.

**Parameters:**
* `a0`: 20-bit integer value to encode

**Return Value:**
* `a0`: 8-bit uf8 code (4-bit exponent + 4-bit mantissa)

**Notes:**
* Use `jr ra` to return to `test`.
* `uf8_encode` needs to store `ra` on the stack because it calls `CLZ`. The `jal` to `CLZ` overwrites `ra`, and `CLZ` returns to `uf8_encode` with its result in `a0`. Hence, `uf8_encode` must save its own `ra` first.
* The `CLZ` routine optimizes the `exponent` search by directly locating the **most significant bit of value** (`msb`), allowing the algorithm to start near the correct exponent instead of incrementally searching through all possible values. This reduces the number of search iterations from up to 15 to typically 1 or 2.

##### Code flow
`main -> test -> uf8_decode -> uf8_encode`

| Function     | Called by    | Purpose                                    |
| ------------ | ------------ | ------------------------------------------ |
| `main`       | -            | Run the whole test and print result        |
| `test`       | `main`       | Loop through 0–255 to verify encode/decode |
| `uf8_encode` | `test`       | Encode integer to uf8 byte                 |
| `uf8_decode` | `test`       | Decode uf8 byte to integer                 |
| `CLZ`        | `uf8_encode` | Helper to find leading zeros               |

##### Full RISC-V32 code
[Full RISC-V assembly program for Problem B.](https://github.com/AnnTaiwan/ca2025-quizzes/blob/main/q1-uf8_rv32_mine_v1.s)

#### 2. An example that can utilize the uf8 decode and encode
> Professor's question:
> Where can this method, the uf8 representation, be used? Give an example.

**Thermometers** that measure body temperature, such as around **36.5 °C**, can benefit from using the **uf8 decode and encode method**. Human body temperature typically ranges between **30 °C and 40 °C**, with normal resting temperature around [**36.5 °C to 37.5 °C**](https://en.wikipedia.org/wiki/Human_body_temperature). As this range is relatively small, it is often unnecessary to represent temperature values using a full 32-bit floating-point number. Instead, an **8-bit integer** can be sufficient to encode the relevant range of body temperatures, which offers several advantages:

1. **Reduced memory usage**: storing each measurement in 8 bits instead of 32 bits saves memory, which is especially important for embedded systems or IoT devices.
2. **Lower power consumption**: smaller data representations reduce computational overhead and power usage during storage, transmission, or processing.

By using a compact format like uf8, we can efficiently encode and decode temperature readings while maintaining acceptable accuracy for human body temperature monitoring.

### Result
#### Pass all the tests
* The simulator output shows that my code passes all the tests and the program exits successfully:
![image](https://hackmd.io/_uploads/HklAeu56gl.png)

### Performance (on 5-stage processor)
>When calculating the code size, I use [this command](https://hackmd.io/1pjTLmqSTK270DCZwf-pCg?both#Command-to-calculate-code-size-dont-count-data-area-comment-lines-and-empty-lines) to calculate.

* My assembly code uses the `CLZ` below:
```c=
CLZ:
    li t0, 32                   # n: running count of leading zeros
    li t1, 16                   # c: current shift amount (16, 8, 4, 2, 1)
do_while:
    srl t2, a0, t1              # y = x >> c
    beq t2, x0, SHIFT_TOO_MUCH  # y == 0: the upper c bits are all zero
    sub t0, t0, t1              # n -= c
    add a0, t2, x0              # x = y
SHIFT_TOO_MUCH:
    srli t1, t1, 1              # c >>= 1
    bne t1, x0, do_while
    sub a0, t0, a0              # return n - x
    jr ra                       # jump to ra
```
* code size: **159 lines**
* cycle count: **40347**
![image](https://hackmd.io/_uploads/Sk7C2ubTex.png)
* When using the compiler's `CLZ` with my program, the cycle count becomes **59727**.
![image](https://hackmd.io/_uploads/ryiUhDZpee.png)
* It spends many instructions on `sw` to store registers `a0`~`a5`.
* Assembly code from compiler
    * code size: **287 lines**

### Improvement on code size
My hand-written code (159 lines) is smaller than the compiler's output (287 lines).

---
## Problem C: bf16_t operations
* [Full RISC-V assembly program for Problem C.](https://github.com/AnnTaiwan/ca2025-quizzes/blob/main/q1-bfloat16_rv32_mine_v1.s)
* In this code, I specify the function structure (`Function name`, `Purpose`, `Arguments`, and `Returns`) in the comments before each function, in order to design the function and follow the calling convention during pre-development. For example:
```c=
# Function: bf16_div
# Purpose : Perform a/b
# Arguments:
#   a0 - input value (bf16_t a)
#   a1 - input value (bf16_t b)
# Returns:
#   a0 - division result (bf16_t)
bf16_div:
    # start of the function content
```

### Overview
**Problem C** implements various operations for the **bfloat16 (`bf16_t`) format**, including conversions between `float32` and `bf16`, as well as arithmetic operations such as **addition, subtraction, multiplication, division, and square root**. In addition, it provides comparison functions for `bf16_t` values. The file also includes a comprehensive set of **test routines** designed to verify the correctness of these implementations, with particular emphasis on handling **edge cases** and **special values** like NaN, infinity, and zero.
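To make the conversion step concrete, here is a minimal C sketch of the two `float32`/`bf16` conversions, based on the fact that `bf16` keeps the upper 16 bits of a `float32`. It assumes round-to-nearest-even when dropping the low 16 bits and passes NaN/infinity through by truncation. The names `f32_to_bf16_bits`, `bf16_bits_to_f32`, and `bf16_self_check` are hypothetical, and the actual `q1-bfloat16` source may differ in details.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: float32 -> bf16 bit pattern. */
static uint16_t f32_to_bf16_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);            /* reinterpret the float's bits */
    if (((u >> 23) & 0xFFu) == 0xFFu)    /* exponent all ones: Inf or NaN */
        return (uint16_t)(u >> 16);      /* keep the upper 16 bits as-is */
    u += 0x7FFFu + ((u >> 16) & 1u);     /* round to nearest, ties to even */
    return (uint16_t)(u >> 16);
}

/* Hypothetical sketch: bf16 bit pattern -> float32 (exact, no rounding). */
static float bf16_bits_to_f32(uint16_t b)
{
    uint32_t u = (uint32_t)b << 16;      /* low 16 mantissa bits become zero */
    float x;
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Small self-check with hand-computed bit patterns. */
static void bf16_self_check(void)
{
    assert(f32_to_bf16_bits(1.0f) == 0x3F80);
    assert(f32_to_bf16_bits(-2.5f) == 0xC020);
    assert(bf16_bits_to_f32(0x4040) == 3.0f);
}
```

For example, `1.0f` (`0x3f800000`) converts to `0x3F80`, and `3.14159f` (`0x40490fd0`) converts to `0x4049` because the dropped low half `0x0fd0` is below the halfway point.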
* bf16_t format
```
┌─────────┬──────────────┬──────────────┐
│Sign (1) │ Exponent (8) │ Mantissa (7) │
└─────────┴──────────────┴──────────────┘
 15        14          7  6            0
S: Sign bit (0 = positive, 1 = negative)
E: Exponent bits (8 bits, bias = 127)
M: Mantissa/fraction bits (7 bits)
```
* f32_t format
```
┌─────────┬──────────────┬───────────────┐
│Sign (1) │ Exponent (8) │ Mantissa (23) │
└─────────┴──────────────┴───────────────┘
 31        30         23  22             0
S: Sign bit (0 = positive, 1 = negative)
E: Exponent bits (8 bits, bias = 127)
M: Mantissa/fraction bits (23 bits)
```

### Implementation
I mostly translate the `q1-bfloat16_t.c` C code into RV32 code line by line, but I split the `test_arithmetic` function into three separate cases: `add & sub`, `mul & div`, and `sqrt`. In addition, I add more than three test cases to each of them to test the correctness. I put almost all test cases into the `.data` area, as in the text below:
```c=
# --- Test case for test_arithmetic_add_sub ---
D2_add_sub:
    # --- Test case 1 ---
    .word 0x3f800000   # a = 1.0
    .word 0x40000000   # b = 2.0
    .word 0x40400000   # ans_add = 3.0 (1.0 + 2.0)
    .word 0xBF800000   # ans_sub = -1.0 (1.0 - 2.0)
    # --- Test case 2 ---
    .word 0xC0000000   # a = -2.0
    .word 0x3F000000   # b = 0.5
    .word 0xBFC00000   # ans_add = -1.5 (-2.0 + 0.5)
    .word 0xC0200000   # ans_sub = -2.5 (-2.0 - 0.5)
    # --- Test case 3 ---
    .word 0xBF000000   # a = -0.5
    .word 0x40490fd0   # b = 3.14159
    .word 0x40290fd0   # ans_add ≈ 2.64159 (-0.5 + 3.14159 ≈ 2.6416)
    .word 0xc0690fd0   # ans_sub ≈ -3.64159 (-0.5 - 3.14159 ≈ -3.6416)
len_D2_add_sub: .word 12   # number of words
msg_add_sub_start: .asciz "\nTesting arithmetic operations (add & sub)...\n"
msg_add_sub_done: .asciz " Arithmetic (add & sub): PASS\n"
msg_add_err_too_large: .asciz "Addition failed"
msg_sub_err_too_large: .asciz "Subtraction failed"
```
All printed messages and test data are placed in the `.data` section. In the example above, the code tests addition and subtraction operations.
Each test case consists of four `f32_t` values: the first two (`a` and `b`) are the operands for `a + b` and `a - b`, while the third and fourth values store the expected results of the addition and subtraction, respectively. These reference values are then used to evaluate the error of the computed results. With the label `D2_add_sub` and the count `len_D2_add_sub`, I can treat the test cases as an array and iterate over all the data in my test function.

#### Improvement on determining errorness
The original implementation judges the correctness of a result by comparing the absolute difference between the computed result and the expected value against a predefined threshold (`0.01f`).
* original c code
```c=
static int test_arithmetic(void)
{
    printf("Testing arithmetic operations...\n");

    bf16_t a = f32_to_bf16(1.0f);
    bf16_t b = f32_to_bf16(2.0f);
    bf16_t c = bf16_add(a, b);
    float result = bf16_to_f32(c);
    float diff = result - 3.0f;
    TEST_ASSERT((diff < 0 ? -diff : diff) < 0.01f, "Addition failed");
```
However, I don't want to use a constant to hold `0.01f`; instead, I use `xor` and `blt` with the value $2^{-7} = 0.0078125$, which is close to $0.01$.
* **My method with fewer instructions**:
```c=
# s5: addition answer
# s6: addition result from bf16_add
# compare s5 s6
xor t0, s5, s6                  # see the bit difference
li t1, 0x10001                  # threshold + 1, so that t0 == 0x10000 still passes blt
blt t0, t1, do_sub              # OK, continue
j print_rel_err_too_large_add   # Fail, print error message and return 1
```
I use `xor` to get the **bit difference** between the answer and the addition result. **The reason I use `0x10001`** is that `s5` and `s6` are in `f32_t` format, so the mantissa part is 23 bits. I want to use $2^{-7}$ as the threshold because $2^{-7} = 0.0078125 < 0.01$, and $2^{-7}$ in the mantissa corresponds to `0x10000`, which leaves $23-7=16$ bits below that `1`.
Hence, the threshold is `0x10000`, and I add one so that a bit difference `t0` equal to `0x10000` still passes the `blt` comparison. In this method, I only need `xor, li, blt` to check whether the error is too large. Note that this is a bit-pattern heuristic: it passes when the two values agree in all bits above bit 16. Because `blt` is a signed comparison, a sign-bit mismatch makes the XOR negative and would also pass; `bltu` would avoid that case.

**This way**, **I don't need** a constant like `CONST_0_01` to represent `0.01f`, nor the following instructions to check whether the error is too large:
1. `sub`: get the difference
2. `neg`: make the difference positive
3. `la`, `lw`: load the constant 0.01 in f32 format
4. `bltu`: compare with the threshold

* **Method with more instructions**:
```c=
CONST_0_01: .word 0x3c23d70a   # 0.01f
# t0: addition result
# CONST_3_0: answer
# diff = result - 3.0f
la t1, CONST_3_0
lw t1, 0(t1)
sub t2, t0, t1                 # t2 = result - 3.0f
# if (diff < 0) diff = -diff
bltz t2, add_diff_abs
j add_diff_chk
add_diff_abs:
neg t2, t2
add_diff_chk:
# load threshold (0.01f)
la t3, CONST_0_01
lw t3, 0(t3)
# check |diff| < 0.01f
bltu t2, t3, add_pass
# fail, if the error is larger than 0.01
```

#### Improvement on decreasing the use of while loop
For example, in `bf16_mul`, there are some **while loops** like the C code below. The loop left-shifts `mant_a` until its leading 1 reaches the **8th bit position** (the implicit 1) and adjusts the exponent accordingly. **However, the whole `while loop` can be replaced with `CLZ`.**
* original c code (while loop in `bf16_mul`)
```c=
while (!(mant_a & 0x80)) {
    mant_a <<= 1;
    exp_adjust--;
}
```
**By using `CLZ`, I can directly calculate how many shifts are needed** from the return value of `CLZ`. This saves the cycles spent iterating the loop.
* Details of the code below:
    1. Call `CLZ`.
    2. `a0` is the number of leading zeros.
    3. $t0 = 32 - a0$: the number of bits from the leading one down to the rightmost bit.
    4. The leading one must end up at the $8^{th}$ bit (the implicit $1$), so $t1 = 8 - t0$ gives the required shift amount.
    5. Then do `mant_a <<= t1;` and `exp_adjust -= t1;`
* Replace `while loop` with `CLZ`
```c=
# Prepare to call CLZ
addi sp, sp, -8
sw ra, 0(sp)
sw a0, 4(sp)
mv a0, a6          # a0 is mant_a
jal ra, CLZ        # call CLZ to calculate how many steps the loop would take
li t0, 32
sub t0, t0, a0     # a0 is the number of leading zeros
li t1, 8
sub t1, t1, t0     # number of shifts the while loop would apply to mant_a
sll a6, a6, t1     # mant_a <<= t1
sub t5, t5, t1     # exp_adjust -= t1
lw a0, 4(sp)
lw ra, 0(sp)
addi sp, sp, 8
```

#### Code flow starting from `main`:
* Try to test all the cases. If all of them pass, `main` exits with code 0. Otherwise, it exits with code 1.

`main -> test_basic_conversions -> test_special_values -> test_arithmetic -> test_comparisons -> test_edge_cases -> test_rounding`

| Function                 | Called by | Purpose                                                               |
| ------------------------ | --------- | --------------------------------------------------------------------- |
| `main`                   | -         | Run the whole test and print results                                  |
| `test_basic_conversions` | `main`    | Test conversions between different numerical formats or data types    |
| `test_special_values`    | `main`    | Test handling of special numerical values like NaN, Infinity, or zero |
| `test_arithmetic`        | `main`    | Test arithmetic operations (add, subtract, multiply, divide)          |
| `test_comparisons`       | `main`    | Test comparison operations (equal, greater, less)                     |
| `test_edge_cases`        | `main`    | Test boundary or corner cases for numbers and operations              |
| `test_rounding`          | `main`    | Test rounding behavior and precision handling                         |

### Analysis
* All RISC-V32I code runs on [ripes](https://ripes.me/).
#### Actual instruction (replace pseudo instruction)
After loading the code into the Ripes simulator, the pseudo instructions are expanded into actual instructions; for example:
```
00000000 <main>:
    0:   10000517   auipc x10 0x10000
    4:   00250513   addi x10 x10 2
    8:   00400893   addi x17 x0 4
    c:   00000073   ecall
    10:  00000413   addi x8 x0 0
    14:  10000517   auipc x10 0x10000
    18:  05b50513   addi x10 x10 91
    1c:  10000297   auipc x5 0x10000
    20:  07f28293   addi x5 x5 127
    24:  0002a583   lw x11 0 x5
    28:  06c000ef   jal x1 108 <test_basic_conversions>
    2c:  00a46433   or x8 x8 x10
    30:  6f4000ef   jal x1 1780 <test_special_values>
    34:  00a46433   or x8 x8 x10
    38:  150000ef   jal x1 336 <test_arithmetic>
    3c:  00a46433   or x8 x8 x10
    40:  514000ef   jal x1 1300 <test_comparisons>
    44:  00a46433   or x8 x8 x10
    48:  011000ef   jal x1 2064 <test_edge_cases>
    4c:  00a46433   or x8 x8 x10
    50:  1a5000ef   jal x1 2468 <test_rounding>
    54:  00a46433   or x8 x8 x10
    58:  02040063   beq x8 x0 32 <print_pass>
```
>The code assembles successfully, so the program can be executed.

#### 5-stage pipelined processor
* The five stages are:
    1. Instruction fetch (IF)
    2. Instruction decode and register fetch (ID)
    3. Execute (EX)
    4. Memory access (MEM)
    5. Register write back (WB)

![image](https://hackmd.io/_uploads/S19syNcale.png)

##### Trace `sw s0, 4(sp)` in 5-stage pipelined processor
* actual instruction: ` 9c: 00812223 sw x8 4 x2`

##### 1. Instruction fetch (IF)
![image](https://hackmd.io/_uploads/r1YyNNqple.png)
* The machine code of this instruction is `0x00812223`, which is the value read from the Instr. memory.
* Current `PC` is `0x0000009c`. Next `PC` is `PC + 4` because `sw` is not a branch.
* The adder above increments the PC by 4, so the PC becomes `0x000000a0` for the next instruction once this instruction has been fetched.
![image](https://hackmd.io/_uploads/rJW-SVcTel.png =30%x)

##### 2. Instruction decode and register fetch (ID)
![image](https://hackmd.io/_uploads/rk97SSqpex.png)
* The instruction is decoded as `sw` from its opcode (`0100011`, the store opcode).
* In `sw s0, 4(sp)`, the immediate is `4`, so `imm` outputs `0x00000004`.
* `R2 idx` is `0x08` because `s0` is `x8`.
* `REG1`: `0x7FFFFFF0` is the value of `x2` (`sp`) currently in the register file. However, when this instruction actually executes, `sp` is `0x7fffffd4`, because the previous instruction `addi sp, sp, -28` has not written back yet and its result is forwarded to the EX stage. `0x7FFFFFF0-28=0x7fffffd4`
* The ALU does not use `Reg2` for `sw` (it adds the immediate instead); `Reg2` carries the value of `x8` that will be stored.

##### 3. Execute (EX)
![image](https://hackmd.io/_uploads/BJx3RNcTlg.png)
* The ALU outputs Res `0x7fffffd8`, which is the address where `s0` will be stored.
    * The ALU performs `0x7fffffd4 + 0x4 = 0x7fffffd8`
    * `0x7fffffd4` is `sp`'s value.
    * `0x4` is the immediate.
* No branch is taken.

##### 4. Memory access (MEM)
![image](https://hackmd.io/_uploads/SkqxgScpge.png)
* Data memory receives the address `0x7fffffd8` and performs a `write` operation because of `sw` (write into memory).
* `0x00000000` is the data to be written into memory.
* Read out is ignored.
* Memory result
![image](https://hackmd.io/_uploads/HyGGKH9axx.png)
    * `x8(s0)`'s value is 0.
![image](https://hackmd.io/_uploads/SJwOKS9age.png =30%x)
    * `s0`'s value has been written to address `0x7fffffd8`.

##### 5. Register write back (WB)
![image](https://hackmd.io/_uploads/Bypxcrq6gg.png)
* Nothing is written back to a register, because `sw` doesn’t produce a register result.

### Result
#### Pass all the tests
* The simulator output shows that my code passes all the tests and the program exits successfully:
![image](https://hackmd.io/_uploads/rymLFLFage.png)
>I separate the arithmetic part into three parts: `add & sub`, `mul & div`, and `sqrt`.

### Performance (on 5-stage processor)
>When calculating the code size, I use [this command](https://hackmd.io/1pjTLmqSTK270DCZwf-pCg?both#Command-to-calculate-code-size-dont-count-data-area-comment-lines-and-empty-lines) to calculate.

* My assembly code
    * code size: **1464 lines**
    * cycle count: **8140**
    ![image](https://hackmd.io/_uploads/rylT4DKpge.png =50%x)
* Assembly code from compiler
    * code size: **2854 lines**

### Improvement on code size
My hand-written code (1464 lines) is smaller than the compiler's output (2854 lines).

## Reference
* [Calling Convention](https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf)
* [Human_body_temperature](https://en.wikipedia.org/wiki/Human_body_temperature)
* https://zhuanlan.zhihu.com/p/658051034

## Full RV32I code
### Problem B
```c=
.data
msg1: .asciz ": produces value "
msg2: .asciz " but encodes back to "
msg3: .asciz ": value "
msg4: .asciz " <= previous_value "
msg5: .asciz "All tests passed.\n"
msg6: .asciz "Some tests failed.\n"
newline: .asciz "\n"
.align 2

.text
.globl main
main:
    jal ra, test                # start to test
    beq a0, x0, Not_pass        # fail
    la a0, msg5                 # print msg5 when passing
    li a7, 4
    ecall
    li a7, 93                   # ecall: exit
    li a0, 0                    # exit code is 0, successful
    ecall
Not_pass:
    la a0, msg6                 # print msg6 when not passing
    li a7, 4
    ecall
    li a7, 93                   # ecall: exit
    li a0, 1                    # exit code is 1, not successful
    ecall

test:
    addi sp, sp, -4
    sw ra, 0(sp)                # because test needs to call other functions
    addi s0, x0, -1             # previous_value
    li s1, 1                    # passed, 1 means true, 0 means false
    li s2, 0                    # f1, counter from 0 to 255
    li s3, 256                  # counter's end
For_2:
    add a0, s2, x0              # prepare a0 for uf8_decode
    jal ra, uf8_decode
    add s4, a0, x0              # value (return value from uf8_decode)
    add a0, s4, x0              # prepare a0 for uf8_encode
    jal ra, uf8_encode
    add s5, a0, x0              # fl2 (return value from uf8_encode)
test_if_1:
    beq s2, s5, test_if_2
    mv a0, s2                   # print s2(f1)
    li a7, 34                   # (RARS) print integer in hex
    ecall
    la a0, msg1                 # print msg1
    li a7, 4
    ecall
    mv a0, s4                   # print value
    li a7, 1
    ecall
    la a0, msg2                 # print msg2
    li a7, 4
    ecall
    mv a0, s5                   # prepare to print fl2(s5) in hexadecimal
    li a7, 34                   # (RARS) print integer in hex
    ecall
    la a0, newline              # print newline
    li a7, 4
    ecall
    li s1, 0                    # passed = false
test_if_2:
    blt s0, s4, after_if
    mv a0, s2                   # print s2(f1)
    li a7, 34                   # (RARS) print integer in hex
    ecall
    la a0, msg3                 # print msg3
    li a7, 4
    ecall
    mv a0, s4                   # print value
    li a7, 1
    ecall
    la a0, msg4                 # print msg4
    li a7, 4
    ecall
    mv a0, s0                   # prepare to print s0(previous_value) in hexadecimal
    li a7, 34                   # (RARS) print integer in hex
    ecall
    la a0, newline              # print newline
    li a7, 4
    ecall
    li s1, 0                    # passed = false
after_if:
    mv s0, s4
    addi s2, s2, 1
    blt s2, s3, For_2
    mv a0, s1                   # return passed
    lw ra, 0(sp)
    addi sp, sp, 4
    jr ra                       # jump to ra

CLZ:
    li t0, 32                   # n
    li t1, 16                   # c
do_while:
    srl t2, a0, t1              # t2 is y
    beq t2, x0, SHIFT_TOO_MUCH
    sub t0, t0, t1
    add a0, t2, x0
SHIFT_TOO_MUCH:
    srli t1, t1, 1
    bne t1, x0, do_while
    sub a0, t0, a0
    jr ra                       # jump to ra

uf8_decode:
    andi t0, a0, 0x0F           # mantissa
    srli t1, a0, 4              # exponent
    li t2, 15
    sub t2, t2, t1              # 15 - exponent
    li t3, 0x7FFF
    srl t3, t3, t2
    slli t3, t3, 4              # offset
    sll t2, t0, t1
    add a0, t2, t3
    jr ra                       # jump to ra

uf8_encode:
    addi sp, sp, -4
    sw ra, 0(sp)                # because it will call CLZ in this function
    add t6, a0, x0              # value
    li t0, 16
    blt t6, t0, RETURN          # if value < 16
    jal ra, CLZ                 # call clz
    li t0, 31
    sub t0, t0, a0              # msb, a0 is lz (return value from CLZ)
    add t1, a0, x0              # lz
    li t2, 0                    # exponent
    # Start from a good initial guess
    li t3, 0                    # overflow
    li t4, 5
    blt t0, t4, Find_exact_exponent   # go to Find_exact_exponent
    addi t2, t0, -4
    li t4, 15
    bge t4, t2, Cal_overflow    # if 15 >= exponent
    li t2, 15                   # exponent is 15
Cal_overflow:
    li t4, 0                    # counter
For_1:
    slli t5, t3, 1
    addi t3, t5, 16             # overflow = (overflow << 1) + 16;
    addi t4, t4, 1
    blt t4, t2, For_1
while_1:
    blez t2, Find_exact_exponent
    bge t6, t3, Find_exact_exponent
    addi t5, t3, -16
    srli t3, t5, 1              # overflow = (overflow - 16) >> 1;
    addi t2, t2, -1
    j while_1
Find_exact_exponent:
    li t5, 15
while_2:
    bge t2, t5, PRE_RETURN
    slli t4, t3, 1
    addi t4, t4, 16             # next_overflow = (overflow << 1) + 16;
    blt t6, t4, PRE_RETURN
    add t3, t4, x0
    addi t2, t2, 1
    j while_2
PRE_RETURN:
    sub t1, t6, t3
    srl t1, t1, t2              # mantissa
    slli t0, t2, 4
    or a0, t0, t1               # prepare return value(a0)
RETURN:
    lw ra, 0(sp)
    addi sp, sp, 4
    jr ra                       # jump to ra
```
### Problem C
* The program is too long to include here, so see the link: [Full RISC-V assembly program for Problem C](https://github.com/AnnTaiwan/ca2025-quizzes/blob/main/q1-bfloat16_rv32_mine_v1.s)