Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [Potassium-chromate](https://github.com/Potassium-chromate/ca2025-quizzes.git) > ## Quiz 1 - Problem B Assume uf8 implements a logarithmic 8-bit codec that maps 20-bit unsigned integers $([0,1,015,792])$ to 8-bit symbols via logarithmic quantization, delivering 2.5:1 compression and ≤6.25% relative error. The uf8 encoding scheme is ideal for sensor data applications such as temperature and distance measurements where range is more important than precision. In graphics applications, it efficiently represents level-of-detail (LOD) distances and fog density values. The scheme is also well-suited for timer implementations that use exponential backoff strategies. However, the uf8 encoding should not be used for financial calculations where exact precision is required and rounding errors are unacceptable. It is inappropriate for cryptographic applications that require uniform distribution of values for security purposes. - Decoding: $D(b) = m \cdot 2^e + (2^e - 1) \cdot 16$ where $e = \lfloor b/16 \rfloor$ and $m=b$ mod $16$ - Encoding: $E(v) = \begin{cases} v, & \text{if } v < 16 \\ 16e + \lfloor(v - \text{offset}(e))/2^e\rfloor, & \text{otherwise} \end{cases}$ where $\text{offset}(e) = (2^e - 1) \cdot 16$ ### C code(uf8_decode) ```c /* Decode uf8 to uint32_t */ uint32_t uf8_decode(uf8 fl) { uint32_t mantissa = fl & 0x0f; uint8_t exponent = fl >> 4; uint32_t offset = (0x7FFF >> (15 - exponent)) << 4; return (mantissa << exponent) + offset; } ``` ### Assembly Code(uf8_decode) ```asm uf8_decode: andi t2, a0, 0x0f # mantissa = fl & 0x0f srli t3, a0, 4 # exponent = fl >> 4 li t6, 0x7FFF # load constant 0x7FFF li t5, 15 # prepare 15 sub t4, t5, t3 # t4 = 15 - exponent srl t4, t6, t4 # t4 = 0x7FFF >> (15 - exponent) slli t4, t4, 4 # offset = t4 << 4 sll t2, t2, t3 # mantissa << exponent add a0, t2, t4 # return = mantissa<<exponent + offset ret ``` ### C code(uf8_encode) ```c uf8 uf8_encode(uint32_t value) { /* Use CLZ for fast exponent calculation */ if (value < 16) return value; /* Find appropriate exponent using CLZ hint */ int lz = clz(value); int msb = 31 - lz; /* Start from a good initial guess */ uint8_t exponent = 0; uint32_t overflow = 0; if (msb >= 5) { /* Estimate exponent - the formula is empirical */ exponent = msb - 4; if (exponent > 15) exponent = 15; /* Calculate overflow for estimated exponent */ for (uint8_t e = 0; e < exponent; e++) overflow = (overflow << 1) + 16; /* Adjust if estimate was off */ while (exponent > 0 && value < overflow) { overflow = (overflow - 16) >> 1; exponent--; } } /* Find exact exponent */ while (exponent < 15) { uint32_t next_overflow = (overflow << 1) + 16; if (value < next_overflow) break; overflow = next_overflow; exponent++; } uint8_t mantissa = (value - overflow) >> exponent; return (exponent << 4) | mantissa; } ``` ### Assembly Code(uf8_encode) ```asm uf8_encode: li t2, 16 bgeu a0, t2, geu_16 # if (value < 16) return ret geu_16: mv t4, a0 # set x as input li t2, 32 # n = 32 li t3, 16 # c =16 clz: srl t5, t4, t3 # y = x >> c beq t5, x0, temp_label # if(y) sub t2, t2, t3 # n -= c mv t4, t5 # x = y temp_label: srli t3, t3, 1 bne t3, x0, clz sub t2, t2, t4 # lz = clz(value) li t3, 31 sub t2, t3, t2 # msb = 31 - lz li t3, 0 # exponent = 0 li t4, 0 # overflow = 0 li t5, 5 blt t2, t5, temp_label2 # if (msb >= 5) addi t3, t2, -4 # exponent = msb - 4; li t5, 15 bltu t3, t5, temp_label3 # if (exponent > 15) li t3, 15 # exponent = 15 temp_label3: li t5, 0 # e=0 start_loop: bge t5, t3, end_loop # if e >= exponent, exit slli t6, t4, 1 # t6 = (overflow << 1) addi t4, t6, 16 # overflow = t6 + 16 addi t5, t5, 1 # e++ j start_loop end_loop: bltu t3, x0, temp_label2 # if exponent <= 0, break while loop bgeu a0, t4, temp_label2 # if value >= overflow, break while loop addi t5, t4, -16 # t5 = (overflow - 16) srli t4, t4, 1 # overflow = t5 >> 1 addi t3, t3, -1 # exponent-- j end_loop # back to the start of the while loop temp_label2: li t5, 15 bgeu t3, t5, end_while # if exponent >= 15, then break the while loop slli t5, t4, 1 # t5 = (overflow << 1) addi t5, t5, 16 # next_overflow t5 + 16 blt a0, t5, end_while # if (value < next_overflow) break mv t4, t5 # overflow = next_overflow; addi t3, t3, 1 # exponent++; j temp_label2 end_while: sub t5, a0, t4 # t5 = value - overflow srl t5 ,t5, t3 # mantissa = t5 >> exponent slli a0, t3, 4 # a0 = exponent << 4 or a0, a0, t5 ret ``` Full C code: [q1-uf8.c](https://github.com/Potassium-chromate/ca2025-quizzes/blob/main/q1-uf8.c) Handwritten assembly: [q1-uf8_hand.s](https://github.com/Potassium-chromate/ca2025-quizzes/blob/main/q1-uf8_hand.s) ### Testing The test covers all integers from 0 to 255. Each value is passed to the encoder, then decoded back. The final result is expected to match the original input. If any inconsistency is detected, the program will raise an error message and terminate the test loop. ## Quiz 1 - Problem C > Full C code: [q1-bfloat16.c](https://github.com/Potassium-chromate/ca2025-quizzes/blob/main/q1-bfloat16.c) ### bf16_isnan The function identifies NaN values by examining the **exponent** and **mantissa** of the BF16 number: 1. The exponent is **all 1s** (`0xFF` in the 8-bit exponent field). 2. The mantissa is **non-zero** (at least one bit set). If both conditions are satisfied, the number is classified as NaN. **C code:** ```c static inline bool bf16_isnan(bf16_t a) { return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) && (a.bits & BF16_MANT_MASK); } ``` **assembly code:** ```asm bf16_isnan: li a1, 0x7F80 and a2, a0, a1 beq a2, a1, beq1 li a0, 0 ret beq1: li a1, 0x007F and a0, a0, a1 ret ``` ### bf16_isinf The function determines if a BF16 number is infinity by checking: 1. The **exponent** is all 1s (`0xFF` in the 8-bit exponent field). 2. The **mantissa** is zero. If both conditions are satisfied, the number is classified as infinity. **C code:** ```c static inline bool bf16_isinf(bf16_t a) { return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) && !(a.bits & BF16_MANT_MASK); } ``` **assembly code:** ```asm bf16_isinf: li a1, 0x7F80 and a2, a0, a1 beq a2, a1, beq2 li a0, 0 ret beq2: li a1, 0x007F and a1, a0, a1 li a0, 1 beq a1, x0, beq3 li a0, 0 beq3: ret ``` ### bf16_iszero The function determines if the BF16 value is zero by checking if all the bits of the **mantissa** and **exponent** are zero. 1. The **mantissa** (least significant 15 bits) must be zero. 2. The **exponent** (the most significant 1 bit of the 16-bit value) is irrelevant in this case because both positive and negative zero have an exponent of zero. **C code:** ```c static inline bool bf16_iszero(bf16_t a) { return !(a.bits & 0x7FFF); } ``` **assembly code:** ```asm bf16_iszero: li a1, 0x7FFF and a1, a0, a1 li a0, 1 beq a1, x0, beq4 li a0, 0 beq4: ret ``` ### f32_to_bf16 The function works by performing the following steps: 1. **Extracting the 32-bit representation:** The input `float` value is first converted to its raw 32-bit representation using `memcpy` to avoid any type casting issues. 2. **Handling special cases:** The function checks for `NaN`, `Infinity`, and other edge cases in the 32-bit float format by examining the exponent (the top 8 bits) and mantissa. 3. **Conversion:** - If the exponent is all ones (i.e., `0xFF`), the function truncates the value to fit into the 16-bit format. - For regular values, the mantissa is adjusted by extracting the most significant bit, and then the result is packed into the 16-bit BF16 format by shifting and masking the bits accordingly. **C code:** ```c static inline bf16_t f32_to_bf16(float val) { uint32_t f32bits; memcpy(&f32bits, &val, sizeof(float)); if (((f32bits >> 23) & 0xFF) == 0xFF) return (bf16_t) {.bits = (f32bits >> 16) & 0xFFFF}; f32bits += ((f32bits >> 16) & 1) + 0x7FFF; return (bf16_t) {.bits = f32bits >> 16}; } ``` **assembly code:** ```asm f32_to_bf16: srli a1, a0, 23 andi a1, a1, 0xFF li a2, 0xFF bne a1, a2, beq5 srli a1, a0, 16 andi a1, a1, 1 li a2, 0x7FFF add a1, a1, a2 add, a0, a0, a1 ret beq5: srli a1, a0, 16 li a0, 0xFFFF and a0, a1, a0 ret ``` ### bf16_to_f32 The function performs the following steps to convert a BF16 value to a 32-bit float: 1. **Shift the BF16 value:** The input BF16 value is left-shifted by 16 bits, effectively padding the lower 16 bits with zeros. This shifts the exponent and mantissa into their correct positions for a 32-bit float format. 2. **Interpret as a 32-bit float:** The resulting 32-bit value is then interpreted as a 32-bit floating-point number by copying the bits into a `float` variable using `memcpy`. 3. **Return the result:** The function then returns the 32-bit floating-point value. **C code:** ```c static inline float bf16_to_f32(bf16_t val) { uint32_t f32bits = ((uint32_t) val.bits) << 16; float result; memcpy(&result, &f32bits, sizeof(float)); return result; } ``` **assembly code:** ```asm bf16_to_f32: slli a0, a0, 16 ret ``` ### Testing 1. **test_basic_conversions** Purpose: Validates the conversion between 32-bit floats (float) and 16-bit BF16 values. Description: - Iterates over a set of representative float values: `0.0f, 1.0f, -1.0f, 2.0f, -2.0f, 0.5f, -0.5f, 3.14159f, -3.14159f, 1e10f, -1e10f` - Converts each float to BF16 and back to float. - Checks: - Sign consistency between original and converted value. - Relative error is within 1% for non-zero, finite numbers. Outcome: Prints ``"Basic conversions: PASS"`` if all conversions pass the checks. 2. **test_special_values** Purpose: Tests the handling of BF16 special values, including infinity, NaN, and zero. Description: - Tests the following BF16 values: - Positive infinity (`+Inf`) - Negative infinity (`-Inf`) - NaN (`Not a Number`) - Zero (`0.0f`) - Negative zero (`-0.0f`) - Uses helper functions: - `bf16_isinf` to detect infinity. - `bf16_isnan` to detect NaN. - `bf16_iszero` to detect zero values. - Verifies that each special value is correctly identified. Outcome: Prints `"Special values: PASS"` if all checks succeed. ## 868. Binary Gap ### Probelm description Given a positive integer `n`, find and return the longest distance between any two adjacent `1`'s in the binary representation of `n`. If there are no two adjacent `1`'s, return `0`. Two `1`'s are adjacent if there are only `0`'s separating them (possibly no `0`'s). The distance between two `1`'s is the absolute difference between their bit positions. For example, the two `1`'s in `"1001"` have a distance of 3. **Example 1:** >Input: n = 22 Output: 2 Explanation: 22 in binary is "10110". The first adjacent pair of 1's is "10110" with a distance of 2. The second adjacent pair of 1's is "10110" with a distance of 1. The answer is the largest of these two distances, which is 2. Note that "10110" is not a valid pair since there is a 1 separating the two 1's underlined. **Example 2:** >Input: n = 8 Output: 0 Explanation: 8 in binary is "1000". There are not any adjacent pairs of 1's in the binary representation of 8, so we return 0. **Example 3:** >Input: n = 5 Output: 2 Explanation: 5 in binary is "101". ### C code ```c int binaryGap(int n) { int gap = -1; int longest = 0; while (n > 0){ if(n&1){ if(gap != -1){ longest = (longest>gap)? longest: gap+1; } gap = 0; }else{ if(gap != -1) gap++; } n >>= 1; } return longest; } ``` ### assembly code ```asm binery_gap: li a1, -1 #int gap = -1; li a2, 0 #int longest = 0; li a3, -1 while_start: bge x0, a0, while_end andi a4, a0, 1 # a4 = n & 1 beq a4, x0, else beq a1, a3, if_end # if (gap != -1) blt a1, a2, if_end # if (gap >= longest) addi a2, a1, 1 if_end: li a1, 0 #gap = 0 j if_else_end else: beq a1, a3, if_else_end addi a1, a1, 1 #gap++ if_else_end: srli a0, a0, 1 j while_start while_end: mv a0, a2 ret ``` ### Testing This project provides both **assembly** and **C** implementations of the `binaryGap` function, along with corresponding testing functions to verify correctness. - **Assembly Version**: `binary_gap_hand.s` Contains the hand-written RISC-V assembly implementation of `binaryGap` and a testing function `test_binery_gap`. The test function runs multiple cases, compares the output to expected results, and prints `[pass]` or `[fail]` along with input, output, and correct answer. - **C Version**: `binary_gap.c` Contains a portable C implementation of `binaryGap` and a corresponding test function. It mirrors the assembly logic and validates the same set of test cases: - Input: 6 → Expected output: 1 - Input: 262 → Expected output: 6 - Input: 1024 → Expected output: 0 - Input: 22 → Expected output: 2 - Input: 4102 → Expected output: 10 Both implementations are tested automatically and produce consistent results. These test functions help ensure correctness of the binary gap computation and serve as a reference for verifying optimizations or alternate implementations. **Links to code**: - Assembly with testing: [binary_gap_hand.s](https://github.com/Potassium-chromate/ca2025-quizzes/blob/main/binary_gap_hand.s) - C with testing: [binary_gap.c](https://github.com/Potassium-chromate/ca2025-quizzes/blob/main/binary_gap.c) ## Result and Analysis ### 5-stage pipelined processor Ripes provides various pipeline forms, and this time we are using a Five-stage RISC-V Processor. ![image](https://hackmd.io/_uploads/B1ng8xtall.png) The five stage are: #### 1. Instruction Fetch (IF) Fetches the instruction from memory based on the current **Program Counter (PC)**. In the absence of branches or jumps, the PC increments by 4 each cycle. ![IF Stage](https://hackmd.io/_uploads/S1MPcgF6gx.png) - The **PC** holds the address of the current instruction (e.g., `0x10000517`). - The PC value is sent to the **Instruction Memory**, which returns the 32-bit instruction at that address. - After fetching, the PC updates to the next instruction (`PC + 4`). --- #### 2. Instruction Decode (ID) Decodes the fetched instruction, reads register operands, and prepares the immediate value for ALU operations. ![ID Stage](https://hackmd.io/_uploads/rk_-pgYaxl.png) - Example instruction: `0x10000517` - Decoded fields: - **Opcode:** `AUIPC` - **Destination Register (Wr idx):** `x9` - **Immediate:** `0x10000000` (upper 20 bits = `0x10000`, lower 12 bits = 0) - Since this is a **U-type** instruction, no registers are read: - `R1 idx = 0x00`, `R2 idx = 0x00` - `Reg1 = 0x00000000`, `Reg2 = 0x00000000` - The current PC (`0x00000000`) and next PC (`0x00000004`) are passed forward. --- ### 3. Execute (EX) Performs arithmetic or logical operations, or calculates memory addresses. Also determines if a branch is taken. ![EX Stage](https://hackmd.io/_uploads/rkrqaeYTll.png) - **First-level multiplexers** select between register values (Reg1, Reg2). Since this is a U-type instruction, they are not used. - **Second-level multiplexers** choose between the **current PC** and **immediate value** as ALU operands (`Op1`, `Op2`). - The **ALU** adds these operands: - The **Branch Unit** checks for branch conditions, but no branch is taken here. - The **Next PC (0x00000004)** and **Wr idx (0x09)** are passed forward. --- ### 4. Memory Access (MEM) Handles memory operations for load/store instructions. The ALU result provides the memory address. ![MEM Stage](https://hackmd.io/_uploads/r1ll0xKpgx.png) - **Reg2** is connected to **Data In**, but since this instruction does not write to memory, write enable remains disabled. - The **Next PC (0x00000004)** and **Wr idx (0x09)** continue to the next stage. - This stage effectively passes data forward without modification. --- ### 5. Write Back (WB) Writes computation results back into the register file. ![WB Stage](https://hackmd.io/_uploads/B1qOCgKpgg.png) - The multiplexer selects **ALU Result (Res)** as the final output. - Output value: - This value is written into **register x9 (ABI name: s1)**. - At the end of this cycle, the instruction has completed its execution, updating the register file. --- ### Quiz 1 - Problem B | Version | Cycles | Instrs. retired | CPI | IPC | Clock rate | | -------- | -------- | --------------- | --- | --- | ---------- | | GCC `-O0` | 106191 | 72864 | 1.46|0.686| 30.20KHz | | GCC `-O1` | 45252 | 33067 | 1.37|0.731| 6.28KHz | | GCC `-O2` | 41442 | 30877 | 1.34|0.745| 38.07KHz | | Handwritten ASM | 40888 | 28402 | 1.44|0.695| 61.01KHz | The table compares execution statistics across different implementations of the same program: GCC-generated code with various optimization levels (`-O0`, `-O1`, `-O2`) and a handcrafted assembly version. - **Cycle count** and **retired instructions** drop significantly as optimization increases. - **CPI (Cycles Per Instruction)** and **IPC (Instructions Per Cycle)** show that higher optimization improves efficiency, with `-O2` achieving the best instruction throughput. - **Handwritten assembly** executes fewer instructions and achieves the highest simulated clock rate, highlighting the potential of manual tuning even compared with optimized compiler output. ### Quiz 1 - Problem C **Test program: `test_basic_conversions`** | Version | Cycles | Instrs. retired | CPI | IPC | Clock rate | | -------- | -------- | --------------- | --- | --- | ---------- | | GCC `-O0` | 9211 | 6947 | 1.33|0.754| 32.59KHz | | GCC `-O1` | 7031 | 5389 | 1.3|0.766| 28.10 KHz | | GCC `-O2` | 6965 | 5336 | 1.31|0.766| 13.47 KHz | | Handwritten ASM | 508 | 329 | 1.54|0.648| 459.73 Hz | **Test program: `test_basic_conversions`** | Version | Cycles | Instrs. retired | CPI | IPC | Clock rate | | -------- | -------- | --------------- | --- | --- | ---------- | | GCC `-O0` | 3570 | 2697 | 1.32|0.755| 5.08 KHz | | GCC `-O1` | 3142 | 2379 | 1.32|0.757| 669.22 Hz | | GCC `-O2` | 3142 | 2379 | 1.32|0.757| 3.15 KHz | | Handwritten ASM | 183 | 101 | 1.81|0.552| 179.24 Hz | ### 868. Binary Gap | Version | Cycles | Instrs. retired | CPI | IPC | Clock rate | | -------- | -------- | --------------- | --- | --- | ---------- | | GCC `-O0` | 18763 | 13970 | 1.34|0.745| 30.88 KHz | | GCC `-O1` | 18179 | 13625 | 1.33|0.749| 27.63 Hz | | GCC `-O2` | 17989 | 13493 | 1.33|0.75| 28.07 KHz | | Handwritten ASM | 1146 | 638 | 1.8|0.557| 711.8 Hz |