# arch2025-homework1
contributed by <`TonyLinX`>
## Ripes Setting


## Problem B : uf8 Logarithmic 8-bit Codec
### Goal
Compress a 20-bit unsigned integer (range 0 … 1,015,792) into an 8-bit symbol using logarithmic quantization, achieving roughly a 2.5:1 compression ratio with ≤ 6.25% relative error.
### Core Idea
The 8-bit representation is structured as follows:
- **Exponent**: high 4 bits $e = ⌊b/16⌋$
- **Mantissa**: low 4 bits $m = b \bmod 16$
### Decoding (`uf8_decode`)
Given `b`:
$$
D(b) = m \cdot 2^{e} + (2^{e} - 1)\cdot 16
$$
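
The decode step can be written directly from this formula. Below is a minimal C sketch (the name `uf8_decode_sketch` is mine for illustration; the reference implementation in the quiz repository may differ in types and special-case handling):
```c
#include <stdint.h>

/* Sketch of uf8 decoding, directly following D(b) = m * 2^e + (2^e - 1) * 16.
 * Illustrative only. */
static inline uint32_t uf8_decode_sketch(uint8_t b)
{
    uint32_t e = b >> 4;                     /* exponent: high 4 bits */
    uint32_t m = b & 0x0F;                   /* mantissa: low 4 bits */
    uint32_t offset = ((1u << e) - 1) << 4;  /* (2^e - 1) * 16 */
    return (m << e) + offset;
}
```
For example, `uf8_decode_sketch(0xFF)` yields `(15 << 15) + 524272 = 1,015,792`, matching the test case below.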
### Encoding (`uf8_encode`)
Given a non-negative integer `v`:
$$
E(v) = \begin{cases}
v, & \text{if } v < 16 \\
16e + \lfloor(v - \text{offset}(e))/2^e\rfloor, & \text{otherwise}
\end{cases}
$$
where $\text{offset}(e) = (2^e - 1) \cdot 16$ and $e$ is chosen as the largest exponent with $\text{offset}(e) \le v$, which guarantees the quantized mantissa fits in 4 bits.
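
A minimal C sketch of this encoding rule follows (illustrative only; the reference encoder may select $e$ differently, e.g. with a branchless search, and may handle out-of-range inputs another way — the clamp below is my assumption):
```c
#include <stdint.h>

/* Sketch of uf8 encoding: pick the largest e with offset(e) <= v,
 * then quantize the remainder into the 4-bit mantissa. Illustrative only. */
static inline uint8_t uf8_encode_sketch(uint32_t v)
{
    if (v < 16)
        return (uint8_t) v;                  /* e = 0, value stored directly */

    uint32_t e = 0;
    /* advance e while the next bucket still starts at or below v */
    while (e < 15 && ((((1u << (e + 1)) - 1) << 4) <= v))
        e++;
    uint32_t offset = ((1u << e) - 1) << 4;  /* (2^e - 1) * 16 */
    uint32_t m = (v - offset) >> e;          /* floor division by 2^e */
    if (m > 15)                              /* assumed clamp for out-of-range inputs */
        m = 15;
    return (uint8_t) ((e << 4) | m);
}
```
For instance, `v = 16` selects `e = 1`, giving code `0x10`, which decodes back to exactly 16.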
### `uf8_decode` Implementation and Optimization
The function `uf8_decode()` is designed to decompress an 8-bit uf8 symbol back into its approximate 20-bit original integer value.
### Test Cases (Set of 5)
* **Input: `0x00`** → e=0, m=0 → Expected Output: 0
* **Input: `0x0F`** → e=0, m=15 → Expected Output: 15
* **Input: `0x1F`** → e=1, m=15 → Expected Output: 46
* **Input: `0xF0`** → e=15, m=0 → Expected Output: 524,272
* **Input: `0xFF`** → e=15, m=15 → Expected Output: 1,015,792
### Baseline (Commit [3198652](https://github.com/sysprog21/ca2025-quizzes/commit/3198652c4e672a5c1ea69e33c46632e9dbfba5f5))
This section establishes a baseline by writing the test cases into the `.data` section and examining the assembly code for `uf8_decode` generated by the C compiler.
**Compiler-Generated `uf8_decode` Assembly:**
```
uf8_decode:
srli x16, a0, 4 # x16 = e = fl >> 4
addi x17, x0, 15 # x17 = 15
sub x18, x17, x16 # x18 = 15 - e
lui x19, 0x8 # x19 = 0x8000
addi x19, x19, -1 # x19 = 0x7FFF
sra x20, x19, x18 # x20 = 0x7FFF >> (15 - e)
andi x21, a0, 15 # mantissa = fl & 0x0F
slli x22, x20, 4 # offset = (0x7FFF >> (15 - e)) << 4
sll x10, x21, x16 # mantissa <<= e
add a0, x22, x10 # result = offset + (mantissa << e)
jalr x0, ra, 0 # return
```
### Optimization Idea (Commit [5e129a9](https://github.com/sysprog21/ca2025-quizzes/commit/5e129a9e3b7cd3deb99be23da68d08f34bb16e55))
The assembly code generated by the C compiler for `uf8_decode` consists of 10 instructions (not counting the final `jalr` return, and counting the `li` expansion of `lui` + `addi` as two instructions).
The process for calculating the `offset` is particularly circuitous, especially the steps involving loading `0x7FFF` and the subsequent subtraction and shift operations.
> **Side Note:**
> Question. Why is `0x7FFF` loaded in two steps (`lui x19, 0x8` and `addi x19, x19, -1`)?
> Answer. The goal is to load the value `0x00007FFF`. `lui` places its 20-bit immediate in bits 31–12 of the register and forces the low 12 bits to zero, so a single `lui x19, 0x7FFF` would produce `0x07FFF000`, and no single `lui` can produce `0x00007FFF`, whose non-zero bits lie in the low 15 bits. The compiler therefore loads `0x00008000` with `lui x19, 0x8` and subtracts 1 with `addi x19, x19, -1`.
The main optimization strategy is to perform a mathematical simplification of the original formula.
**Original Formula:**
$$
D(b) = m \cdot 2^e + (2^e - 1) \cdot 16
$$
**Optimized Formula:**
By rearranging the terms, we get a more efficient equivalent expression:
$$
D(b) = (m + 16)\cdot 2^e - 16
$$
This optimized formula leads to much simpler and more direct assembly code.
**Manually Optimized `uf8_decode` Assembly:**
```
uf8_decode:
srli t0, a0, 4 # t0 = a0 >> 4
andi a0, a0, 0x0f # a0 = a0 & 0x0f (fl & 0x0f)
addi a0, a0, 16 # a0 = a0 + 16
sll a0, a0, t0 # a0 = a0 << t0
addi a0, a0, -16 # a0 = a0 - 16
jalr x0, ra, 0
```
### Performance Evaluation and Analysis
To precisely quantify the results of the manual optimization, I executed both the "Compiler Baseline Version" and the "Manually Optimized Version" in the Ripes simulator using the same dataset (a loop containing the 5 test cases). The core metrics for evaluation were Total Instructions and Total Cycles.
| Metric | C compiler | Manual Opt V1 | Improvement |
| ----------- | -------- | ------------- | -------- |
|Total Instructions | 113 | 88 | -22.12% |
|Total Cycles | 153 | 128 | -16.33% |
C compiler `uf8_decode`:

Manual Opt V1 `uf8_decode`:

Based on the measurements from the Ripes simulator, the manually optimized version shows clear improvements in both key metrics:
**Significant Improvement in Code Efficiency:**
The total number of executed instructions dropped from 113 to 88, a decrease of 22.1%. This comes directly from streamlining `uf8_decode` itself, which was condensed from 10 instructions to 5 (excluding the return). Since the test program calls the function 5 times in a loop, the saving of 5 instructions per call compounds into the 25-instruction reduction seen in the overall count.
**Substantial Performance (Speed) Enhancement:**
The total number of execution cycles dropped from 153 to 128, a 16.3% reduction. The 25-cycle saving matches the 25 fewer executed instructions, confirming that the optimized program runs measurably faster than the version generated by the compiler.
## Problem C - bfloat16
### Goal
The core objective of Problem C is to build a complete software arithmetic library for the `bfloat16` (Brain Floating Point) 16-bit floating-point format. This library must be capable of performing operations such as **addition (`bf16_add`), subtraction (`bf16_sub`), multiplication (`bf16_mul`), division (`bf16_div`), and square root (`bf16_sqrt`)**.
However, the most critical constraint and challenge is that all code must run on the base **RV32I** instruction set. The `I` stands for `Integer`, which means the processor we are using **lacks a hardware Floating-Point Unit (FPU)**. At the hardware level, the CPU has no understanding of floating-point numbers and cannot execute any floating-point instructions.
Therefore, the task in this problem can be summarized as: using integer-only arithmetic instructions to simulate a floating-point processor and build a high-performance `bfloat16` Soft Float (software floating-point) library.
### Core Idea
Since the hardware cannot process floating-point numbers, how do we achieve this goal? The core concept is to **treat the 16-bit `bfloat16` value not as a single integer, but as a packed data structure** containing three independent parts: the sign, exponent, and mantissa.
All operations follow a clear three-step process: Unpack -> Pure Integer Computation -> Repack.
#### 1. Unpack
This is the first step for any operation. Using **bitwise instructions** (like `srli` for logical right shift and `andi` for bitwise AND), the incoming 16-bit `bfloat16` value is broken down into three separate **pure integers**:
* **Sign:** 1 bit, representing positive or negative.
* **Exponent:** 8 bits, representing the magnitude of the number.
* **Mantissa:** 7 bits, representing the precision of the number.
After this step is complete, we are left with only integers that the CPU natively supports.
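
A minimal C sketch of this unpacking step is shown below (field layout as described above; the function name and out-parameter style are illustrative, not the library's actual API):
```c
#include <stdint.h>

/* Split a 16-bit bfloat16 pattern into three plain integers.
 * Layout: 1 sign bit | 8 exponent bits | 7 mantissa bits. */
static inline void bf16_unpack_sketch(uint16_t bits,
                                      uint32_t *sign, uint32_t *exp, uint32_t *mant)
{
    *sign = (bits >> 15) & 0x1;   /* bit 15 */
    *exp  = (bits >> 7) & 0xFF;   /* bits 14..7 */
    *mant = bits & 0x7F;          /* bits 6..0 */
}
```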
#### 2. Pure Integer Computation
This is the core of the algorithm. We operate on these separated integer parts according to the mathematical rules of floating-point arithmetic.
Taking `bf16_add` (addition) as an example:
* **Compare exponents:** Use integer subtraction (`sub`) to compare the exponents of the two numbers to determine which has a larger magnitude.
* **Align mantissas:** The mantissa of the number with the smaller exponent is shifted right (`srli`) to align it with the magnitude of the larger number.
* **Compute mantissas:** The aligned mantissas are then computed using integer addition (`add`) or subtraction (`sub`).
* **Normalization:** Check if the resulting mantissa has overflowed or requires adjustment, and update the exponent accordingly using integer addition/subtraction (`addi`).
This entire process uses no floating-point operations whatsoever. Instead, the logic of floating-point arithmetic is cleverly transformed into a series of integer operations that the CPU can understand.
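
For illustration, here is a minimal C sketch of this same-sign addition path, operating only on already-unpacked integer fields (raw exponents plus mantissas that already include the implicit leading 1). It assumes both inputs are normal, finite, and equal-signed, and it performs no rounding; the full `bf16_add` also has to handle subtraction, zeros, infinities, and NaN. The function name is mine:
```c
#include <stdint.h>

/* Simplified same-sign addition on unpacked bfloat16 fields.
 * exp_a/exp_b are the raw 8-bit exponents; mant_a/mant_b include the
 * implicit leading 1 (8-bit values of the form 1xxxxxxx). */
static void bf16_add_core_sketch(uint32_t exp_a, uint32_t mant_a,
                                 uint32_t exp_b, uint32_t mant_b,
                                 uint32_t *exp_out, uint32_t *mant_out)
{
    /* 1. Align: make operand a the one with the larger exponent,
     *    then shift the smaller operand's mantissa right. */
    if (exp_a < exp_b) {
        uint32_t t;
        t = exp_a; exp_a = exp_b; exp_b = t;
        t = mant_a; mant_a = mant_b; mant_b = t;
    }
    uint32_t d = exp_a - exp_b;
    mant_b = (d < 8) ? (mant_b >> d) : 0;  /* shifting an 8-bit mantissa by 8+ clears it */

    /* 2. Add the aligned mantissas with plain integer addition. */
    uint32_t mant = mant_a + mant_b;
    uint32_t exp = exp_a;

    /* 3. Normalize: a carry into bit 8 means the sum reached 2.0 or more,
     *    so shift right once and increment the exponent. */
    if (mant & 0x100) {
        mant >>= 1;
        exp += 1;
    }

    *exp_out = exp;    /* still biased by 127 */
    *mant_out = mant;  /* still contains the implicit leading 1 */
}
```
Running this on the `1.0 + 0.5` example below (exponents 127 and 126, mantissas both 128) yields exponent 127 and mantissa 192, matching the hand calculation.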
#### 3. Repack
Once all computations are finished, we will have three new integer results: `result_sign`, `result_exp`, and `result_mant`. The final step is to use **bitwise instructions** (like `slli` for logical left shift and `ori` for bitwise OR) to recombine and pack these three parts back into a single 16-bit `bfloat16` format, which is returned as the function's result.
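
Correspondingly, a minimal repacking sketch (again illustrative; a real implementation would round and handle special values before this point):
```c
#include <stdint.h>

/* Reassemble sign/exponent/mantissa integers into a 16-bit bfloat16 pattern.
 * mant is expected to be the 7 stored bits, with the implicit 1 already dropped. */
static inline uint16_t bf16_pack_sketch(uint32_t sign, uint32_t exp, uint32_t mant)
{
    return (uint16_t) ((sign << 15) | ((exp & 0xFF) << 7) | (mant & 0x7F));
}
```
With sign 0, exponent 127, and mantissa 64, this returns `0x3FC0`, the value derived in the worked example below.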
#### 4. Example (`1.0 + 0.5`)
Using `f32_to_bf16`, `1.0` and `0.5` are converted to `0x3F80` and `0x3F00`, respectively.

**1. Unpack:**
Using bitwise operations, we break them down into three **pure integer** parts.
| Value | bfloat16 | Binary Representation `(S EEEEEEEE MMMMMMM)` | Sign (S) | Exponent (E)| Mantissa (M) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **1.0f** | `0x3F80` | `0 01111111 0000000` | `0` | `127` | `0` |
| **0.5f** | `0x3F00` | `0 01111110 0000000` | `0` | `126` | `0` |
Next, we add the **implicit leading 1** (the `1` in `1.Mantissa`) to the mantissa. This `1` is the 8th bit, to the left of the 7 stored mantissa bits:
* The full mantissa of `a`, `M_full_a` = `0b10000000` (decimal integer `128`)
* The full mantissa of `b`, `M_full_b` = `0b10000000` (decimal integer `128`)
**2. Pure Integer Computation:**
* **Align Exponents:**
* Compare exponents using integer subtraction: `127 - 126 = 1`. The exponent of `a` is larger.
* To align the exponents, the mantissa of `b` (the number with the smaller exponent) must be **shifted right** by `1` bit.
* `M_full_b_aligned = M_full_b >> 1 = 128 >> 1 = 64`.
* Now, the base exponent for both numbers is unified to the larger value, `127`.
* **Compute Mantissa:**
* Since the signs of both numbers are the same (`0`), we perform an addition.
* This step is a **pure integer addition**: `result_mant = M_full_a + M_full_b_aligned`
* `128 + 64 = 192`.
* The binary representation of `192` is `11000000`.
* **Normalize:**
* Check the result `192` (`0b11000000`). Its most significant bit is still the 8th bit, meaning the result `1.xxxxxx` is still within the normalized range and requires no adjustment.
* Therefore, the final exponent is the aligned exponent, `127`.
**3. Repack:**
The three resulting **integers** are recombined into a single 16-bit integer using bitwise operations (`slli`, `ori`).
* **Final Parts (Integers):**
* `result_sign`: `0`
* `result_exp`: `127`
* `result_mant_full`: `192` (`0b11000000`)
* **Assembly Process:**
* From `result_mant_full` (`0b11000000`), **remove** the implicit `1`, keeping only the lower 7 bits: `0b1000000` (decimal integer `64`).
* Shift the sign bit left by 15 positions: `0 << 15` -> `0x0000`
* Shift the exponent `127` left by 7 positions: `127 << 7` -> `0x3F80`
* Place the new mantissa `64` in the lowest bits: `64` -> `0x0040`
* Combine all three parts using a bitwise `OR`: `0x0000 | 0x3F80 | 0x0040 = 0x3FC0`
Finally, the integer `0x3FC0` is output, which is the standard `bfloat16` representation of `1.5f`.
### `bf16_mul`
TODO
## Signal Visualization in a 5-Stage Pipeline
> Provide explanations for both program functionality and instruction-level operations using the Ripes simulator. Use visualization of signals such as register write/enable, multiplexer selection, etc. Demonstrate each stage: IF, ID, IE, MEM, WB. Explain memory update steps and their correctness.

Reference: [Link](https://hackmd.io/@sysprog/SkkbXLJRR#Dot-product)
### Analysis
The following analysis is based on the assembly code from 'Problem B,' loaded into the Ripes simulator.

The simulator displays the program memory as follows. The leftmost column (0, 4, 8, etc.) represents the byte offset from the start of the program. The middle column shows the 32-bit machine code for each instruction, not a memory address.
```
00000000 <main>:
0: 10000597 auipc x11 0x10000
4: 00058593 addi x11 x11 0
8: 10000617 auipc x12 0x10000
c: ffd60613 addi x12 x12 -3
10: 00000693 addi x13 x0 0
14: 00500713 addi x14 x0 5
```
### The 5-Stage Pipeline Walkthrough
#### Instruction Fetch (IF) Stage
The simulator highlights the first instruction, `auipc x11, 0x10000`, entering the Instruction Fetch (IF) stage.

* Instruction Fetching: The Program Counter (PC) starts at 0x00000000 and sends this address to the instruction memory, which fetches the corresponding machine code, 0x10000597.
* PC Update Logic:
* The current PC value (0x0) is fed into an adder. Since auipc is a standard 32-bit instruction, a constant 4 is selected as the other input to the adder, not 2.
>How does the processor determine the PC increment (+2 vs. +4)?
>The processor determines the instruction length by examining the two least significant bits of the machine code.
>If the lowest 2 bits are 11, it is a standard 32-bit instruction, and the PC must be updated to PC + 4.
>If the lowest 2 bits are 00, 01, or 10, it is a compressed 16-bit instruction, and the PC is updated to PC + 2.
>For the current `auipc` instruction (machine code `0x10000597`), the binary representation of `0x597` ends in `...0101 1001 0111`. The LSBs are `11`, so the processor identifies it as a 4-byte instruction and calculates the next PC as PC + 4. A minimal code sketch of this length check appears after this list.
* The adder computes the next sequential address: 0x00000000 + 4 = 0x00000004.
* This result (0x4) is sent to a multiplexer (MUX) that determines the next PC value.
* Since auipc is not a branch or jump instruction, the control unit directs the MUX to select the PC + 4 path.
* The MUX output (0x4) is fed back to the PC register. On the next rising clock edge, the PC's write enable signal is asserted, and its value is updated from 0x0 to 0x4.
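
The length rule from the side note can be expressed as a tiny, hypothetical C helper (the name `rv_insn_bytes` is mine, not part of Ripes or the assignment):
```c
#include <stdint.h>

/* Standard 32-bit RISC-V encodings end in binary 11; any other value in the
 * low 2 bits indicates a compressed 16-bit instruction. */
static inline int rv_insn_bytes(uint32_t insn)
{
    return ((insn & 0x3) == 0x3) ? 4 : 2;
}

/* Example: 0x10000597 (auipc x11, 0x10000) ends in ...11, so the next PC is PC + 4. */
```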

---
#### Instruction Decode (ID) Stage
In the next cycle, the first instruction `auipc` moves to the Instruction Decode (ID) stage, while the second instruction `addi` enters the IF stage. The ID stage is responsible for decoding the fetched instruction and preparing the necessary operands.

* Instruction Source: The `auipc` instruction and the current PC value (`0x0`) are passed from the IF/ID pipeline register into this stage.
* Decode Unit:
* This unit parses the machine code `0x10000597`.
* By analyzing the opcode, it identifies the instruction as AUIPC. This information is crucial for generating control signals for subsequent stages (e.g., selecting the immediate value path instead of a register value path for the ALU).
* It identifies the destination register `rd` as `x11` (index `0x0b`). This index is forwarded to the `Wr idx` (Write index) input of the ID/EX register, instructing later stages that the result should be written back to `x11`.
* Since `auipc` does not use source registers (`rs1`, `rs2`), their corresponding indices are `0x00` (referencing the `x0` zero register).
* Immediate Generator (Imm.):
* This unit extracts the immediate value `0x10000` from the machine code.
* As defined by the `auipc` instruction format, this immediate is then shifted left by 12 bits to produce the final immediate value `0x10000000` used in the calculation.
* Register File: The source register indices (`R1 idx` and `R2 idx`) are both `0x00`, so the values read from the `Reg1` and `Reg2` ports are both `0`.
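
As a cross-check of the decode step just described, the following short C sketch (purely illustrative) pulls the same fields out of the raw machine word `0x10000597` using the U-type layout; it prints opcode `0x17` (AUIPC), `rd = x11`, and immediate `0x10000000`, matching the values above:
```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t insn = 0x10000597u;        /* auipc x11, 0x10000 */
    uint32_t opcode = insn & 0x7F;      /* low 7 bits: 0x17 = AUIPC */
    uint32_t rd = (insn >> 7) & 0x1F;   /* bits 11..7: 0x0B -> x11 */
    uint32_t imm = insn & 0xFFFFF000u;  /* U-type immediate, already in bits 31..12 */
    printf("opcode=0x%02x rd=x%u imm=0x%08x\n",
           (unsigned) opcode, (unsigned) rd, (unsigned) imm);
    return 0;
}
```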

---
#### Execute (EX) Stage
The `auipc` instruction now proceeds to the Execute (EX) stage, which performs the actual calculation.

* ALU Operation:
* The control signals, derived from the `auipc` opcode, configure the ALU's input multiplexers.
* One multiplexer selects the PC value (`0x0`) from the ID/EX register as the first operand (`op1`).
* Another multiplexer selects the generated immediate value (`0x10000000`) as the second operand (`op2`).
* The ALU control unit, also guided by the opcode, instructs the ALU to perform an addition.
* The ALU calculates the result: `0x0 + 0x10000000 = 0x10000000`.
* Branch Unit: The branch comparison logic is inactive for this instruction type. It would only be used for conditional branch instructions (like `beq`), where it compares the values of `Reg1` and `Reg2` to determine if a branch should be taken.

---
#### Memory Access (MEM) Stage
The `auipc` instruction moves to the Memory Access (MEM) stage. The primary role of this stage is to interact with data memory, a process exclusive to load and store instructions.

* Read from Memory (Load Instructions): For instructions like `lw` or `lh`, the ALU result (the memory address) would be sent to the data memory, `MemRead` would be asserted, and the fetched data would be passed to the next stage.
* Write to Memory (Store Instructions): For instructions like `sw` or `sb`, the ALU result (address) and the data from `Reg2` would be sent to data memory, and `MemWrite` would be asserted.
For non-memory instructions like `auipc` (and all R-type instructions), this stage is effectively idle. The control unit sets `MemRead=0` and `MemWrite=0`, ensuring the data memory is not accessed. The value seen on the data memory's "Read out" port (`0xf01f0f00` in the simulation) is therefore irrelevant garbage data.
The ALU result (`0x10000000`) bypasses the data memory and is passed directly to the MEM/WB pipeline register, along with the destination register index `rd` (`0x0b`).

---
#### Write-Back (WB) Stage
Finally, the `auipc` instruction reaches the Write-Back (WB) stage, where the result is written into the register file.

* Data Selection: A multiplexer selects the ALU result (`0x10000000`) from the MEM/WB register as the data to be written.
* Register File Write:
* This data is sent to the `Wr data` port of the Register File.
* The destination register index (`0x0b` for `x11`) is sent to the `Wr idx` port.
* The control signal `RegWrite` (`Wr En` in the diagram) is asserted (set to 1), enabling the write operation.
* The screenshot shows the state just before the write completes on the clock edge; `x11` has not yet been updated.

#### Final Result
After the WB stage completes its execution, we can observe in the Register File that the `x11` register has been successfully updated with the final value 0x10000000.
