# Assignment 1-B : Counting Bits Using Counting Leading Zeros contributed by [`sweets777`](https://github.com/sweets777/) ## Counting Bits `Description`: Given an integer n, return an array ans of length n+1 such that for each i<small> ($0 \le i \le n$)</small>, ans[i] is the number of 1's in the binary representation of i. `Constraints`: <small>$0 \le n \le 10^5$</small> `Example`: Input: n=2 ----> Output: [0,1,1] &nbsp;&nbsp;&nbsp;&nbsp; ## Solution I used `Dynamic Programming(DP)` to construct the result array. The key recurrence relation is that the bit count for any number i equals 1 plus the bit count of (i - hp2), where hp2 is the highest power of 2 less than or equal to i. The `Count Leading Zeros(CLZ)` function is employed as a highly efficient way to find this hp2 by quickly identifying the position of the most significant bit. ## Implementation ### C Code The source code can be found [here](https://github.com/sweets777/ca2025-quizzes/blob/main/hw1/HW1_B.c). ```=c #include <stdio.h> #include <stdint.h> static inline int count_leading_zeros(uint32_t x) { if (x == 0) return 32; int n = 0; if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; } if ((x & 0xFF000000u) == 0) { n += 8; x <<= 8; } if ((x & 0xF0000000u) == 0) { n += 4; x <<= 4; } if ((x & 0xC0000000u) == 0) { n += 2; x <<= 2; } if ((x & 0x80000000u) == 0) { n += 1; } return n; } int countBits_fill(int n, int *ans) { ans[0] = 0; for (int i = 1; i <= n; ++i) { int lz = count_leading_zeros((uint32_t)i); int hp2 = 1 << (31 - lz); ans[i] = ans[i - hp2] + 1; } return n + 1; } static void print_array(const char* title, const int* arr, int size) { printf("%s [", title); for (int i = 0; i < size; i++) { printf("%d", arr[i]); if (i < size - 1) printf(", "); } printf("]\n"); } int main(void) { int returnSize; /* test 1 */ { enum { N1 = 5 }; int result1[N1 + 1]; returnSize = countBits_fill(N1, result1); print_array("Input n=5, Output:", result1, returnSize); } /* test 2 */ { enum { N2 = 8 }; int result2[N2 + 1]; returnSize = countBits_fill(N2, result2); print_array("Input n=8, Output:", result2, returnSize); } /* test 3 */ { enum { N3 = 0 }; int result3[N3 + 1]; returnSize = countBits_fill(N3, result3); print_array("Input n=0, Output:", result3, returnSize); } return 0; } ``` ### Assembly Code The source code can be found [here](https://github.com/sweets777/ca2025-quizzes/blob/main/hw1/HW1_B.s). ```=Risc-V .data title1: .string "Input n=5, Output:" title2: .string "Input n=8, Output:" title3: .string "Input n=0, Output:" str_open_bracket: .string " [" str_close_bracket_newline: .string "]\n" str_comma_space: .string ", " .text .globl main main: addi sp, sp, -48 sw ra, 44(sp) # Test 1 : n = 5 li a0, 5 addi a1, sp, 20 jal countBits_fill mv s0, a0 la a0, title1 addi a1, sp, 20 mv a2, s0 jal print_array # Test 2 : n = 8 li a0, 8 mv a1, sp jal countBits_fill mv s0, a0 la a0, title2 mv a1, sp mv a2, s0 jal print_array # Test 3 : n = 0 li a0, 0 mv a1, sp jal countBits_fill mv s0, a0 la a0, title3 mv a1, sp mv a2, s0 jal print_array lw ra, 44(sp) addi sp, sp, 48 li a7, 10 ecall count_leading_zeros: bnez a0, .L_clz_start li a0, 32 ret .L_clz_start: li t0, 0 mv t1, a0 lui t2, 0xFFFF0 and t3, t1, t2 bnez t3, .L_clz_check8 addi t0, t0, 16 slli t1, t1, 16 .L_clz_check8: lui t2, 0xFF000 and t3, t1, t2 bnez t3, .L_clz_check4 addi t0, t0, 8 slli t1, t1, 8 .L_clz_check4: lui t2, 0xF0000 and t3, t1, t2 bnez t3, .L_clz_check2 addi t0, t0, 4 slli t1, t1, 4 .L_clz_check2: lui t2, 0xC0000 and t3, t1, t2 bnez t3, .L_clz_check1 addi t0, t0, 2 slli t1, t1, 2 .L_clz_check1: lui t2, 0x80000 and t3, t1, t2 bnez t3, .L_clz_done addi t0, t0, 1 .L_clz_done: mv a0, t0 ret countBits_fill: addi sp, sp, -20 sw ra, 16(sp) sw s0, 12(sp) sw s1, 8(sp) sw s2, 4(sp) sw s3, 0(sp) mv s0, a0 mv s1, a1 sw zero, 0(s1) li s2, 1 .L_for_loop_start: bgt s2, s0, .L_for_loop_end mv a0, s2 jal ra, count_leading_zeros li t0, 31 sub t1, t0, a0 li t2, 1 sll s3, t2, t1 sub t0, s2, s3 slli t1, t0, 2 add t2, s1, t1 lw t3, 0(t2) addi t3, t3, 1 slli t1, s2, 2 add t2, s1, t1 sw t3, 0(t2) addi s2, s2, 1 j .L_for_loop_start .L_for_loop_end: addi a0, s0, 1 lw s3, 0(sp) lw s2, 4(sp) lw s1, 8(sp) lw s0, 12(sp) lw ra, 16(sp) addi sp, sp, 20 ret print_array: addi sp, sp, -16 sw ra, 12(sp) sw s0, 8(sp) sw s1, 4(sp) sw s2, 0(sp) mv s0, a0 mv s1, a1 mv s2, a2 mv a0, s0 li a7, 4 ecall la a0, str_open_bracket li a7, 4 ecall li t0, 0 .L_print_loop: bge t0, s2, .L_print_loop_end slli t1, t0, 2 add t2, s1, t1 lw a0, 0(t2) li a7, 1 ecall addi t3, s2, -1 bge t0, t3, .L_skip_comma la a0, str_comma_space li a7, 4 ecall .L_skip_comma: addi t0, t0, 1 j .L_print_loop .L_print_loop_end: la a0, str_close_bracket_newline li a7, 4 ecall lw s2, 0(sp) lw s1, 4(sp) lw s0, 8(sp) lw ra, 12(sp) addi sp, sp, 16 ret ``` ## Results #### Test Data 1 * Input:n=5 * Output:[0,1,1,2,1,2] #### Test Data 2 * Input:n=8 * Output:[0,1,1,2,1,2,2,3,1] #### Test Data 3 * Input:n=0 * Output:[0] #### C Program Output <img src="https://hackmd.io/_uploads/BJDb41KTxl.png" width="500" alt="說明文字"> <p></p> #### RISC-V Assembly Output <img src="https://hackmd.io/_uploads/rJhArJt6ge.png" width="450" alt="說明文字"> ## Analysis I tested the code using [Ripes](https://ripes.me/) simulator. ### Disassembled Executable Code The source code can be found [here](https://github.com/sweets777/ca2025-quizzes/blob/main/hw1/HW1_B.txt). ``` 00000000 <main>: 0: fd010113 addi x2 x2 -48 4: 02112623 sw x1 44 x2 8: 00500513 addi x10 x0 5 c: 01410593 addi x11 x2 20 10: 0f0000ef jal x1 240 <countBits_fill> 14: 00050413 addi x8 x10 0 18: 10000517 auipc x10 0x10000 1c: fe850513 addi x10 x10 -24 20: 01410593 addi x11 x2 20 24: 00040613 addi x12 x8 0 28: 164000ef jal x1 356 <print_array> 2c: 00800513 addi x10 x0 8 30: 00010593 addi x11 x2 0 34: 0cc000ef jal x1 204 <countBits_fill> 38: 00050413 addi x8 x10 0 3c: 10000517 auipc x10 0x10000 40: fd750513 addi x10 x10 -41 44: 00010593 addi x11 x2 0 48: 00040613 addi x12 x8 0 4c: 140000ef jal x1 320 <print_array> 50: 00000513 addi x10 x0 0 54: 00010593 addi x11 x2 0 58: 0a8000ef jal x1 168 <countBits_fill> 5c: 00050413 addi x8 x10 0 60: 10000517 auipc x10 0x10000 64: fc650513 addi x10 x10 -58 68: 00010593 addi x11 x2 0 6c: 00040613 addi x12 x8 0 70: 11c000ef jal x1 284 <print_array> 74: 02c12083 lw x1 44 x2 78: 03010113 addi x2 x2 48 7c: 00a00893 addi x17 x0 10 80: 00000073 ecall 00000084 <count_leading_zeros>: 84: 00051663 bne x10 x0 12 <.L_clz_start> 88: 02000513 addi x10 x0 32 8c: 00008067 jalr x0 x1 0 00000090 <.L_clz_start>: 90: 00000293 addi x5 x0 0 94: 00050313 addi x6 x10 0 98: ffff03b7 lui x7 0xffff0 9c: 00737e33 and x28 x6 x7 a0: 000e1663 bne x28 x0 12 <.L_clz_check8> a4: 01028293 addi x5 x5 16 a8: 01031313 slli x6 x6 16 000000ac <.L_clz_check8>: ac: ff0003b7 lui x7 0xff000 b0: 00737e33 and x28 x6 x7 b4: 000e1663 bne x28 x0 12 <.L_clz_check4> b8: 00828293 addi x5 x5 8 bc: 00831313 slli x6 x6 8 000000c0 <.L_clz_check4>: c0: f00003b7 lui x7 0xf0000 c4: 00737e33 and x28 x6 x7 c8: 000e1663 bne x28 x0 12 <.L_clz_check2> cc: 00428293 addi x5 x5 4 d0: 00431313 slli x6 x6 4 000000d4 <.L_clz_check2>: d4: c00003b7 lui x7 0xc0000 d8: 00737e33 and x28 x6 x7 dc: 000e1663 bne x28 x0 12 <.L_clz_check1> e0: 00228293 addi x5 x5 2 e4: 00231313 slli x6 x6 2 000000e8 <.L_clz_check1>: e8: 800003b7 lui x7 0x80000 ec: 00737e33 and x28 x6 x7 f0: 000e1463 bne x28 x0 8 <.L_clz_done> f4: 00128293 addi x5 x5 1 000000f8 <.L_clz_done>: f8: 00028513 addi x10 x5 0 fc: 00008067 jalr x0 x1 0 00000100 <countBits_fill>: 100: fec10113 addi x2 x2 -20 104: 00112823 sw x1 16 x2 108: 00812623 sw x8 12 x2 10c: 00912423 sw x9 8 x2 110: 01212223 sw x18 4 x2 114: 01312023 sw x19 0 x2 118: 00050413 addi x8 x10 0 11c: 00058493 addi x9 x11 0 120: 0004a023 sw x0 0 x9 124: 00100913 addi x18 x0 1 00000128 <.L_for_loop_start>: 128: 05244263 blt x8 x18 68 <.L_for_loop_end> 12c: 00090513 addi x10 x18 0 130: f55ff0ef jal x1 -172 <count_leading_zeros> 134: 01f00293 addi x5 x0 31 138: 40a28333 sub x6 x5 x10 13c: 00100393 addi x7 x0 1 140: 006399b3 sll x19 x7 x6 144: 413902b3 sub x5 x18 x19 148: 00229313 slli x6 x5 2 14c: 006483b3 add x7 x9 x6 150: 0003ae03 lw x28 0 x7 154: 001e0e13 addi x28 x28 1 158: 00291313 slli x6 x18 2 15c: 006483b3 add x7 x9 x6 160: 01c3a023 sw x28 0 x7 164: 00190913 addi x18 x18 1 168: fc1ff06f jal x0 -64 <.L_for_loop_start> 0000016c <.L_for_loop_end>: 16c: 00140513 addi x10 x8 1 170: 00012983 lw x19 0 x2 174: 00412903 lw x18 4 x2 178: 00812483 lw x9 8 x2 17c: 00c12403 lw x8 12 x2 180: 01012083 lw x1 16 x2 184: 01410113 addi x2 x2 20 188: 00008067 jalr x0 x1 0 0000018c <print_array>: 18c: ff010113 addi x2 x2 -16 190: 00112623 sw x1 12 x2 194: 00812423 sw x8 8 x2 198: 00912223 sw x9 4 x2 19c: 01212023 sw x18 0 x2 1a0: 00050413 addi x8 x10 0 1a4: 00058493 addi x9 x11 0 1a8: 00060913 addi x18 x12 0 1ac: 00040513 addi x10 x8 0 1b0: 00400893 addi x17 x0 4 1b4: 00000073 ecall 1b8: 10000517 auipc x10 0x10000 1bc: e8150513 addi x10 x10 -383 1c0: 00400893 addi x17 x0 4 1c4: 00000073 ecall 1c8: 00000293 addi x5 x0 0 000001cc <.L_print_loop>: 1cc: 0322dc63 bge x5 x18 56 <.L_print_loop_end> 1d0: 00229313 slli x6 x5 2 1d4: 006483b3 add x7 x9 x6 1d8: 0003a503 lw x10 0 x7 1dc: 00100893 addi x17 x0 1 1e0: 00000073 ecall 1e4: fff90e13 addi x28 x18 -1 1e8: 01c2da63 bge x5 x28 20 <.L_skip_comma> 1ec: 10000517 auipc x10 0x10000 1f0: e5350513 addi x10 x10 -429 1f4: 00400893 addi x17 x0 4 1f8: 00000073 ecall 000001fc <.L_skip_comma>: 1fc: 00128293 addi x5 x5 1 200: fcdff06f jal x0 -52 <.L_print_loop> 00000204 <.L_print_loop_end>: 204: 10000517 auipc x10 0x10000 208: e3850513 addi x10 x10 -456 20c: 00400893 addi x17 x0 4 210: 00000073 ecall 214: 00012903 lw x18 0 x2 218: 00412483 lw x9 4 x2 21c: 00812403 lw x8 8 x2 220: 00c12083 lw x1 12 x2 224: 01010113 addi x2 x2 16 228: 00008067 jalr x0 x1 0 ``` ### 5-Stage Pipelined Processor A 5-Stage Pipelined Processor breaks down instruction execution into five distinct steps, much like an assembly line. This allows the processor to work on multiple instructions simultaneously, with each one in a different stage of completion. The goal is to improve throughput, meaning more instructions can be completed in a given amount of time. <img src="https://hackmd.io/_uploads/SyWkDxtTll.png" width="800" alt="說明文字"> <p></p> The five classic RISC-V stages are: <ul style="margin:0; padding-left:1.2em;"> <li style="margin:0 0 2px 0;">1. IF (Instruction Fetch): The first step is to get the instruction from memory.</li> <li style="margin:0 0 2px 0;">2. ID (Instruction Decode): Now the processor needs to figure out what the instruction means.</li> <li style="margin:0;">3. EX (Execute): This is where the actual calculation or computation occurs.</li> <li style="margin:0 0 2px 0;">4. MEM (Memory Access): This stage is for reading from or writing to data memory.</li> <li style="margin:0 0 2px 0;">5. WB (Write Back): The final step is to save the result.</li> </ul> #### I-type format I take the instruction `slli x6 x18 2` in the pseudo instruction for example and analyze how the processor operates the instruction in different stages. According to [RISC-V Instruction Set Manual (p.14)](https://riscv.org/specifications/ratified/): > The constant part is an unsigned 5-bit value ranging from 0 to 31, representing the shift amount. It performs a shift operation on the rs1 register, and the result is written to the rd register. SLLI is a logical left shift. > ![截圖 2025-10-13 凌晨12.55.43](https://hackmd.io/_uploads/ByOsaUKaxl.png) #### 1. IF <img src="https://hackmd.io/_uploads/ryFaNDFagg.png" width="300" alt="說明文字"> * We start with the instruction located at `0x00000158`, so the `addr` sent to the Instruction Memory is `0x00000158`. * The machine code for the instruction `slli x6 x18 2` is `0x00291313`, so the `instr` is `0x00291313`. * The PC will increment by 4 automatically using the adder. The current PC `0x00000158` is added with 4 to get the next sequential address, `0x0000015c`. * Since `slli` is neither a branch nor a jump instruction, the multiplexer before the PC selects the input from the adder (PC + 4). This means the next instruction will be fetched from address `0x0000015C`. #### 2. ID <img src="https://hackmd.io/_uploads/BygpDvtTge.png" width="350" alt="說明文字"> * The instruction `0x00291313` is decoded into several parts: * `opcode` = `SLLI` * `Wr idx` = `0x06` (the index for the destination register `x6`) * `R1 idx` = `0x12` (the index for the source register `x18`) * `imm.` = `0x00000002` (the immediate value for the shift amount) * This I-type instruction reads the value from one source register, `rs1`. The value from the register specified by `R1 idx` (`0x12`) is read. The diagram shows that the value read from this register is `0x00000005`. * Although the hardware might read a value from a second register (`R2 idx`), this value (`0x7ffffac`) will be ignored in the next stage since the `slli` operation only uses one register operand. * The current PC value (`0x00000158`) and the next PC value (`0x0000015c`) are simply passed through the pipeline register (IFID) to be used in later stages. #### 3. EX <img src="https://hackmd.io/_uploads/HklNKjwt6gx.png" width="350" alt="說明文字"> * The first multiplexer (for the upper ALU input) selects the value from `Reg 1` (`0x00000005`) as the first operand. * The values from `Reg 1` and `Reg 2` are also sent to the branch block, but no branch is taken since this is not a branch instruction. * The second multiplexer (for the lower ALU input) selects the immediate value (`0x00000002`) as the second operand. * The ALU performs a shift left logical operation on the two operands, so the `Res` is equal to `0x00000014` (which is the result of `0x00000005 << 2`). * The next PC value (`0x0000015c`) and `Wr idx` (`0x06`) are simply passed through this stage. #### 4. MEM <img src="https://hackmd.io/_uploads/rJ2Jy_tagx.png" width="250" alt="說明文字"> * The `Res` from ALU (`0x00000014`) is handled in several ways: * It passes through this stage and will go to the WB stage to be written back to the register. * It can be sent back (forwarded) to the EX stage for subsequent instructions to use. * It is used as the data memory address. Although `slli` is not a load instruction, the hardware still performs a read. The Read out value is `0x00050413`, but this will be ignored since the instruction does not need it. * Memory read data at address `0x00000014`, so `Read out` is equal to `0x00050413`. The table below denotes the data section of memory. <img src="https://hackmd.io/_uploads/S1L7MdFTge.png" width="800" alt="說明文字"> * The value from `Reg 2` (`0x7ffffac`) is sent to Data in, but the `Wr en` (Write enable) signal is disabled because `slli` is not a store instruction. * The `Wr idx` (`0x06`) and the next PC value (`0x0000015c`) are simply passed through this stage. #### 5. WB <img src="https://hackmd.io/_uploads/H1a8BuK6xx.png" width="180" alt="說明文字"> * The multiplexer selects the `Res` from the ALU as the final output, not the value read from data memory. Therefore, the output value is `0x00000014`. * The output value and `Wr idx` are sent back to the register block. Finally, the value `0x00000014` will be written into the `x6` register, whose ABI name is `t1`. ### Memory Update The data will be written back to the register only after the instruction `slli x6 x18 2` has finished execution and has been covered by other instructions in the processor. <img src="https://hackmd.io/_uploads/SyreROFpee.png" width="300" alt="說明文字"> <img src="https://hackmd.io/_uploads/rkP1sB5Teg.png" width="300" alt="說明文字"> <img src="https://hackmd.io/_uploads/rJ245dYpxg.png" width="300" alt="說明文字"> # Assignment 1-C : Decoding BFloat16: 16-bit Binary to Decimal contributed by [`sweets777`](https://github.com/sweets777/) ## BFloat16 to Decimal The `bfloat16` (brain floating point) floating-point format is a computer number format occupying 16 bits in computer memory, it represents a wide dynamic range of numeric values by using a floating radix point. `Description`: Given a 16-bit binary number in bfloat16 format, calculate the decimal value it represents. `Example`: Input: 0011111110000000 ----> Output: 1 ## Solution **Structure:** * Sign: 1 bit * Exponent: 8 bits * Mantissa (Fraction): 7 bits **Program execution flow:** **1. Parse the binary string** (`parse_b16`)**:** * First, it checks whether the input string is `NULL` and whether its length is exactly 16. * Then it iterates over the string and converts the `'0'` and `'1'` characters into a 16-bit unsigned integer (`uint16_t`). **2. Convert to** `double` (`bf16bits_to_double`)**:** * This is the core of the conversion. It decomposes the 16-bit integer into sign, exponent, and fraction according to the bfloat16 format. * Special cases: * If the exponent is `0xFF` (all ones), the value is **Infinity** or **NaN** (Not a Number). * If the exponent is `0x00` (all zeros), the value is **Zero** or a **subnormal number**. * Normal case: * For normal numbers, it reconstructs the `double` value using `Value = (-1)^sign × (1.fraction) × 2^(exponent − 127)`. **3. Main program** (`main`)**:** * The `main` function defines three 16-bit binary test strings. * It calls the conversion function for each string in sequence and uses `print_decimal` to print the resulting `double` to the screen, with formatting for special values such as `-0` and `inf`. ## Implementation ### C Code The source code can be found [here](https://github.com/sweets777/ca2025-quizzes/blob/main/hw1/HW1_C.c). ```=c #include <stdint.h> #include <stdbool.h> #include <math.h> #include <stdio.h> #include <string.h> static bool parse_b16(const char *s, uint16_t *out){ if (!s || strlen(s) != 16) return false; uint16_t v = 0; for (int i = 0; i < 16; i++){ char c = s[i]; if (c != '0' && c != '1') return false; v <<= 1; if (c == '1') v |= 1; *out = v; } return true; } static double bf16bits_to_double(uint16_t bits){ int sign = (bits >> 15) & 1; int exp = (bits >> 7) & 0xFF; int mant = bits & 0x7F; if (exp == 0xFF){ // Inf / NaN if (mant == 0) return sign ? -INFINITY : INFINITY; return NAN; } if (exp == 0){ // Zero / Subnormal if (mant == 0) return sign ? -0.0 : 0.0; double frac = (double)mant / 128.0; double val = ldexp(frac, 1 - 127); return sign ? -val : val; } double frac = 1.0 + (double)mant / 128.0; double val = ldexp(frac, exp - 127); return sign ? -val : val; } static double bf16_binstr_to_double(const char *b16){ uint16_t bits; if (!parse_b16(b16, &bits)) return NAN; return bf16bits_to_double(bits); } static void print_decimal(double x){ if (isnan(x)) { printf("NaN\n"); return; } if (isinf(x)) { printf("%sinf\n", signbit(x) ? "-" : "+"); return; } if (x == 0.0 && signbit(x)) { printf("-0\n"); return; } printf("%.12g\n", x); } int main(void){ const char *tests[3] ={ "0011111110000000", "1011111110000000", "0000000000001111" }; for(int i = 0; i < 3; i++){ double val = bf16_binstr_to_double(tests[i]); printf("Case %d: %s -> ", i + 1, tests[i]); print_decimal(val); } return 0; } ``` ### Assembly Code The source code can be found [here](https://github.com/sweets777/ca2025-quizzes/blob/main/hw1/HW1_C.s). ```=Risc-V .data test_cases: .string "0011111110000000" .string "1011111110000000" .string "0000000000001111" case_count: .word 3 # String literals for output str_case: .string "Case " str_colon: .string ": " str_arrow: .string " -> " str_newline:.string "\n" str_nan: .string "NaN" .text .globl main main: la s0, test_cases la s1, case_count lw s1, 0(s1) addi s2, x0, 1 loop_cases: beq s1, zero, end_program addi a7, x0, 4 la a0, str_case ecall addi a7, x0, 1 addi a0, s2, 0 ecall addi a7, x0, 4 la a0, str_colon ecall addi a7, x0, 4 addi a0, s0, 0 ecall addi a7, x0, 4 la a0, str_arrow ecall addi a1, s0, 0 add t0, x0, x0 add t1, x0, x0 addi t2, x0, 16 parse_loop: beq t1, t2, parse_done lb t3, 0(a1) slli t0, t0, 1 addi a1, a1, 1 addi t1, t1, 1 addi t4, x0, 49 bne t3, t4, parse_loop ori t0, t0, 1 j parse_loop parse_done: srli t1, t0, 15 slli t1, t1, 31 slli t2, t0, 17 srli t2, t2, 24 slli t3, t0, 25 srli t3, t3, 25 addi t4, x0, 255 beq t2, t4, check_inf_nan beq t2, zero, check_subnormal slli t2, t2, 23 slli t3, t3, 16 or a0, t1, t2 or a0, a0, t3 j print_float check_inf_nan: bne t3, zero, print_nan addi t2, x0, 255 slli t2, t2, 23 or a0, t1, t2 j print_float print_nan: addi a7, x0, 4 la a0, str_nan ecall j next_case check_subnormal: beq t3, zero, handle_zero slli t3, t3, 16 or a0, t1, t3 j print_float handle_zero: addi a0, t1, 0 j print_float print_float: addi a7, x0, 2 ecall j next_case next_case: addi a7, x0, 4 la a0, str_newline ecall addi s0, s0, 17 addi s2, s2, 1 addi s1, s1, -1 j loop_cases end_program: addi a7, x0, 10 ecall ``` ## Results #### Test Data 1 * Input:0011111110000000 * Output:1 #### Test Data 2 * Input:1011111110000000 * Output:-1 #### Test Data 3 * Input:0000000000001111 * Output:1.37753244237e-39 #### C Program Output <img src="https://hackmd.io/_uploads/HJqAZQ5aeg.png" width="500" alt="說明文字"> <p></p> #### RISC-V Assembly Output <img src="https://hackmd.io/_uploads/HyzgfXcpxl.png" width="400" alt="說明文字"> ## Analysis I tested the code using [Ripes](https://ripes.me/) simulator. ### Disassembled Executable Code The source code can be found [here](https://github.com/sweets777/ca2025-quizzes/blob/main/hw1/HW1_C.txt). ``` 00000000 <main>: 0: 10000417 auipc x8 0x10000 4: 00040413 addi x8 x8 0 8: 10000497 auipc x9 0x10000 c: 02b48493 addi x9 x9 43 10: 0004a483 lw x9 0 x9 14: 00100913 addi x18 x0 1 00000018 <loop_cases>: 18: 12048263 beq x9 x0 292 <end_program> 1c: 00400893 addi x17 x0 4 20: 10000517 auipc x10 0x10000 24: 01750513 addi x10 x10 23 28: 00000073 ecall 2c: 00100893 addi x17 x0 1 30: 00090513 addi x10 x18 0 34: 00000073 ecall 38: 00400893 addi x17 x0 4 3c: 10000517 auipc x10 0x10000 40: 00150513 addi x10 x10 1 44: 00000073 ecall 48: 00400893 addi x17 x0 4 4c: 00040513 addi x10 x8 0 50: 00000073 ecall 54: 00400893 addi x17 x0 4 58: 10000517 auipc x10 0x10000 5c: fe850513 addi x10 x10 -24 60: 00000073 ecall 64: 00040593 addi x11 x8 0 68: 000002b3 add x5 x0 x0 6c: 00000333 add x6 x0 x0 70: 01000393 addi x7 x0 16 00000074 <parse_loop>: 74: 02730263 beq x6 x7 36 <parse_done> 78: 00058e03 lb x28 0 x11 7c: 00129293 slli x5 x5 1 80: 00158593 addi x11 x11 1 84: 00130313 addi x6 x6 1 88: 03100e93 addi x29 x0 49 8c: ffde14e3 bne x28 x29 -24 <parse_loop> 90: 0012e293 ori x5 x5 1 94: fe1ff06f jal x0 -32 <parse_loop> 00000098 <parse_done>: 98: 00f2d313 srli x6 x5 15 9c: 01f31313 slli x6 x6 31 a0: 01129393 slli x7 x5 17 a4: 0183d393 srli x7 x7 24 a8: 01929e13 slli x28 x5 25 ac: 019e5e13 srli x28 x28 25 b0: 0ff00e93 addi x29 x0 255 b4: 01d38e63 beq x7 x29 28 <check_inf_nan> b8: 04038063 beq x7 x0 64 <check_subnormal> bc: 01739393 slli x7 x7 23 c0: 010e1e13 slli x28 x28 16 c4: 00736533 or x10 x6 x7 c8: 01c56533 or x10 x10 x28 cc: 0440006f jal x0 68 <print_float> 000000d0 <check_inf_nan>: d0: 000e1a63 bne x28 x0 20 <print_nan> d4: 0ff00393 addi x7 x0 255 d8: 01739393 slli x7 x7 23 dc: 00736533 or x10 x6 x7 e0: 0300006f jal x0 48 <print_float> 000000e4 <print_nan>: e4: 00400893 addi x17 x0 4 e8: 10000517 auipc x10 0x10000 ec: f5f50513 addi x10 x10 -161 f0: 00000073 ecall f4: 0280006f jal x0 40 <next_case> 000000f8 <check_subnormal>: f8: 000e0863 beq x28 x0 16 <handle_zero> fc: 010e1e13 slli x28 x28 16 100: 01c36533 or x10 x6 x28 104: 00c0006f jal x0 12 <print_float> 00000108 <handle_zero>: 108: 00030513 addi x10 x6 0 10c: 0040006f jal x0 4 <print_float> 00000110 <print_float>: 110: 00200893 addi x17 x0 2 114: 00000073 ecall 118: 0040006f jal x0 4 <next_case> 0000011c <next_case>: 11c: 00400893 addi x17 x0 4 120: 10000517 auipc x10 0x10000 124: f2550513 addi x10 x10 -219 128: 00000073 ecall 12c: 01140413 addi x8 x8 17 130: 00190913 addi x18 x18 1 134: fff48493 addi x9 x9 -1 138: ee1ff06f jal x0 -288 <loop_cases> 0000013c <end_program>: 13c: 00a00893 addi x17 x0 10 140: 00000073 ecall ``` ### 5-Stage Pipelined Processor A 5-Stage Pipelined Processor breaks down instruction execution into five distinct steps, much like an assembly line. This allows the processor to work on multiple instructions simultaneously, with each one in a different stage of completion. The goal is to improve throughput, meaning more instructions can be completed in a given amount of time. <img src="https://hackmd.io/_uploads/SyWkDxtTll.png" width="800" alt="說明文字"> <p></p> The five classic RISC-V stages are: <ul style="margin:0; padding-left:1.2em;"> <li style="margin:0 0 2px 0;">1. IF (Instruction Fetch): The first step is to get the instruction from memory.</li> <li style="margin:0 0 2px 0;">2. ID (Instruction Decode): Now the processor needs to figure out what the instruction means.</li> <li style="margin:0;">3. EX (Execute): This is where the actual calculation or computation occurs.</li> <li style="margin:0 0 2px 0;">4. MEM (Memory Access): This stage is for reading from or writing to data memory.</li> <li style="margin:0 0 2px 0;">5. WB (Write Back): The final step is to save the result.</li> </ul> #### I-type format I take the instruction `addi x8 x8 0` in the pseudo instruction for example and analyze how the processor operates the instruction in different stages. According to [RISC-V Instruction Set Manual (p.14)](https://riscv.org/specifications/ratified/): > The 12-bit immediate is sign-extended to 32 bits and then added to the value in the rs1 register. The result is written to the rd register. > ![截圖 2025-10-13 下午3.11.30](https://hackmd.io/_uploads/rJEFP7c6lx.png) #### 1. IF <img src="https://hackmd.io/_uploads/r1nbqm5agl.png" width="300" alt="說明文字"> * We start with the instruction located at `0x00000004`, so the addr sent to the Instruction Memory is `0x00000004`. * The machine code for the instruction `addi x8 x8 0` is `0x00040413`, so the instr is `0x00040413` * The PC will increment by 4 automatically using the adder. The current PC `0x00000004` is added with 4 to get the next sequential address, `0x00000008`. * Since `addi` is neither a branch nor a jump instruction, the multiplexer before the PC selects the input from the adder (PC + 4). This means the next instruction will be fetched from address `0x00000008`. #### 2. ID <img src="https://hackmd.io/_uploads/Hk3Xnmcpll.png" width="350" alt="說明文字"> * The instruction `0x00040413` is decoded into several parts: * `opcode` = `ADDI` * `Wr idx` = `0x08` (the index for the destination register `x8`) * `R1 idx` = `0x08` (the index for the source register `x8`) * `imm.` = `0x00000000` (12-bit immediate, sign-extended) * This I-type instruction reads a value from one source register, `rs1`. The value from the register specified by `R1 idx` (`0x08`) is read. The diagram shows that the value read from this register is `0x00000000`. * Although the hardware might read a value from a second register (`R2 idx`), this value will be ignored in the next stage since the `addi` operation only uses one register operand. * The current PC value (`0x00000004`) and the next PC value (`0x00000008`) are simply passed through the pipeline register (IFID) to be used in later stages. #### 3. EX <img src="https://hackmd.io/_uploads/B1Be44qpxx.png" width="350" alt="說明文字"> * The first multiplexer (for the upper ALU input) selects the value from `Reg 1` (`0x10000000`) as the first operand. * The values from `Reg 1` and `Reg 2` are also sent to the branch block, but no branch is taken since this is not a branch instruction. * The second multiplexer (for the lower ALU input) selects the immediate value (`0x00000000`) as the second operand. * The ALU performs an addition operation on the two operands, so the `Res` is equal to `0x10000000` (which is the result of `0x10000000` + `0x00000000`). * The next PC value (`0x00000008`) and `Wr idx` (`0x08`) are simply passed through this stage. #### 4. MEM <img src="https://hackmd.io/_uploads/BknoeBqall.png" width="250" alt="說明文字"> * The `Res` from the ALU (`0x10000000`) is handled in several ways: * It passes through this stage and will go to the WB stage to be written back to the register. * It can be sent back (forwarded) to the EX stage for subsequent instructions to use. * It is used as the data memory address. Although `addi` is not a load instruction, the hardware still performs a read. The Read out value is `0x31313030`, but this will be ignored since the instruction does not need it. * Memory read data at address `0x10000000`, so `Read out` is equal to `0x31313030`. The table below denotes the data section of memory. <img src="https://hackmd.io/_uploads/rkBgHrcael.png" width="800" alt="說明文字"> * The value from `Reg 2` is sent to Data in, but the `Wr en` (Write enable) signal is disabled because `addi` is not a store instruction. * The `Wr idx` (`0x08`) and the next PC value (`0x00000008`) are simply passed through this stage. #### 5. WB <img src="https://hackmd.io/_uploads/S1ZtSS9age.png" width="180" alt="說明文字"> * The multiplexer selects the `Res` from the ALU (`0x10000000`) as the final output, not the value read from data memory (`0x31313030`). Therefore, the output value is `0x10000000`. * The output value (`0x10000000`) and `Wr idx` (`0x08`) are sent back to the register block. Finally, the value will be written into the `x8` register. ### Memory Update The data will be written back to the register only after the instruction `addi x8 x8 0` has finished execution and has been covered by other instructions in the processor. **NOTE** The function of the instruction `addi x8 x8 0` is to add zero, which by itself does not change the value of the `x8` register. However, due to the processor's pipelined execution, as the `addi` instruction advances through the pipeline, the preceding instruction, `auipc x8 0x10000`, completes its Write Back stage first. Therefore, the result of `auipc`, `0x10000000`, is updated to register `x8`. The change in the value of `x8` that we observe is the result of the **previous instruction**, not caused by the `addi` instruction. <img src="https://hackmd.io/_uploads/rJ9z6S9peg.png" width="600" alt="說明文字"> <img src="https://hackmd.io/_uploads/Sy_76S9Tll.png" width="600" alt="說明文字"> <img src="https://hackmd.io/_uploads/Sy6J_r5Tgl.png" width="300" alt="說明文字"> <img src="https://hackmd.io/_uploads/rk6TsrqTle.png" width="300" alt="說明文字"> <img src="https://hackmd.io/_uploads/rkWq8B9Txl.png" width="300" alt="說明文字"> ## Reference * [Five EmbedDev](https://five-embeddev.com/riscv-user-isa-manual/Priv-v1.12/rv32.html) * [RISC-V Instruction Set](https://riscv.org/specifications/ratified/) * [LeetCode 338_Counting Bits](https://leetcode.com/problems/counting-bits/description/)