hw1 - HackMD

# **Assignment 1– UF8 (Problem B), RV32I (Ripes)** **1. Problem Statement** In this assignment, we are tasked with implementing a variable-length encoding and decoding scheme (UF8) using RISC-V RV32I assembly. The program will handle a set of integers (test inputs), encode them, decode them, and check the accuracy of the decoding using predefined error bounds. **Encoding/Decoding Details:** uf8_encode(a0=v) → a0=byte: Encodes the integer v into a variable-length encoding byte. uf8_decode(a0=b) → a0=value: Decodes the byte b into the corresponding integer value. --- **2. Algorithm** **UF8 Encoding:** The encoding process involves converting a value v into a byte. The byte is split into two parts: - The exponent e determines the number of bits used for the mantissa m. - The mantissa is shifted and combined with an offset derived from the exponent. The encoding algorithm can be described as: e = floor(log2((v >> 4) + 1)) m = v - offset The result is encoded_byte = (e << 4) | m **UF8 Decoding:** The decoding process reverses the encoding operation : Extract the exponent e and mantissa m from the byte. Compute the value as : decoded_value = (m << e) + offset **Error Bound:** The error is checked by comparing the absolute difference between the decoded value and the original value, scaled by a factor of 16, to the original value: |decoded_value - original_value| * 16 <= original_value If the error condition is met, the result is valid. --- **3. Implementation (RV32I)** Encoding ``` uf8_encode: li t0, 16 blt a0, t0, Lret_v # v<16 -> v addi t1, x0, 0 # e addi t2, x0, 0 # offset li t3, 16 # const 16 Lfind_e: slli t4, t2, 1 add t4, t4, t3 # next_offset = (offset<<1)+16 blt a0, t4, Lfound_e addi t2, t4, 0 # offset = next_offset addi t1, t1, 1 # e++ li t5, 15 bge t1, t5, Lfound_e # guard e<=15 j Lfind_e Lfound_e: sub t6, a0, t2 srl t6, t6, t1 # mantissa slli t4, t1, 4 or a0, t4, t6 # (e<<4)|mantissa ret Lret_v: ret ``` Decoding ``` uf8_decode: srli t0, a0, 4 # e andi t1, a0, 0x0F # m li t2, 1 sll t2, t2, t0 # 1<<e addi t2, t2, -1 # (1<<e)-1 slli t2, t2, 4 # offset = ((1<<e)-1)<<4 sll t3, t1, t0 # m<<e add a0, t3, t2 # value ret ``` **Design Decisions** I chose to implement the while loop for determining e in the initial version (baseline). However, I later optimized the algorithm by using bit manipulation (log2-based method) to avoid the iterative process and reduce cycle time. The program uses li and addi instructions for constants and offsets, which are important for both encoding and decoding steps. --- **4. Pipeline Analysis** **Observations** Branch Resolution: In uf8_encode, the branch instructions are resolved in the EX stage, and any misprediction results in a flush. This is visible in the pipeline when the blt instruction is executed. Load-Use Hazard: After the lw instruction in uf8_encode, there is a stall because the sw instruction is consuming the same value, which is resolved by inserting a 1-cycle bubble. **Pipeline Stages** IF: Fetches the next instruction. ID: Decodes the instruction and reads registers. EX: Performs arithmetic, logic, or branch decision. MEM: Reads from or writes to memory. WB: Writes the result back to the register file. The primary pipeline hazard that we are dealing with is branch misprediction. **Exeample** jal ra, uf8_decode ![image](https://hackmd.io/_uploads/Hy4nRmrnlg.png) ![image](https://hackmd.io/_uploads/r198Amr3el.png) 1.When the jal ra, uf8_decode instruction enters the EX (Execute) stage, the target address is calculated. However, the subsequent two instructions, sw a0, 0(s1) and addi s1, s1, 4, have already been sent into the pipeline in the IF (Instruction Fetch) and ID (Instruction Decode) stages. ![image](https://hackmd.io/_uploads/HyG6AXSnle.png) ![image](https://hackmd.io/_uploads/SySO0Qrnxl.png) 2.These instructions are now out of sync with the new program counter and need to be flushed (i.e., removed) from the pipeline to avoid executing instructions based on the wrong PC.The processor realizes that the control flow has changed and will discard the already-fetched instructions that are no longer relevant, ensuring that the correct instruction at the new target address (uf8_decode) is executed. --- **5. Tests & Verification** test_inputs (Expanded with Boundary Values and Random Large Values) ``` test_inputs: .word 0, 15, 16, 47, 48, 108, 1000000, 9999999 ``` Boundary Values: 15/16, 47/48 to test the transitions in the encoding process. Random Large Values: 1000000, 9999999 to test the program's handling of large numbers. After running the program, we verified the results using check_out: ![image](https://hackmd.io/_uploads/HkVIXEBhlg.png) All values in check_out should be 1, indicating that all tests passed within the error bounds. It means that the algorithm works correctly for all test cases, including boundary values and large inputs. The implementation passed all checks, and the error bound condition was met for all test inputs. --- **6.Optimization & Results** The baseline version used a while loop to find e, which required multiple comparisons for each value. In the optimized version, I used bit manipulation to calculate e directly, reducing the number of comparisons and improving the overall performance. | Version | `uf8_encode` Inst. | `uf8_decode` Inst. | Total Cycles (8 items) | Notes | | ------------- | -----------------: | -----------------: | ---------------------: | --------------------------------------- | | **Baseline** | 24 | 24 | 994 cycles | While loop search for e, more cycles | | **Optimized** | 14 | 12 | 732 cycles | Direct bit manipulation for e, faster | The optimized version significantly reduced the number of cycles by eliminating the while loop and using bit manipulation, improving the performance of the encoding and decoding functions. --- **7.Reproduce** 1.Clone the repository. 2.Open the uf8_baseline.s or uf8_opt.s file in Ripes. 3.Assemble and run the program. 4.Check the memory section .data to verify the results in enc_out, dec_out, and check_out. github: https://github.com/larry920623/hw1/tree/main # **Assignment 1– Problem C: BFloat16 (bf16) Conversion** code ``` ############################################################## # BF16 Arithmetic Emulator (RV32IMC, no FPU) # 完整版：含 stack、巢狀呼叫安全、四則運算與 debug 輸出 ############################################################## .data .align 2 f32_data: .word 0x41200000, 0x41A00000, 0x41F00000, 0x42200000 # 10, 20, 30, 40 out_add: .word 0,0,0,0 out_sub: .word 0,0,0,0 out_mul: .word 0,0,0,0 out_div: .word 0,0,0,0 msg_i: .asciz "[DEBUG] i=" msg_add: .asciz " ADD=" msg_sub: .asciz " SUB=" msg_mul: .asciz " MUL=" msg_div: .asciz " DIV=" msg_nl: .asciz "\n" msg_done: .asciz "DONE\n" .bss .data .align 4 stack_area: .word 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 # 16×4 = 64 bytes .word 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 # 共 32 個 word ≈ 128 bytes stack_top: .text .global _start _start: # 初始化 stack la sp, stack_top la s0, f32_data li s3, 4 li t6, 0 loop: bge t6, s3, done slli t0, t6, 2 add t1, s0, t0 lw a0, 0(t1) # f32 → bf16 jal ra, f32_to_bf16 mv s1, a0 # bf16_A addi t6, t6, 1 bge t6, s3, done slli t0, t6, 2 add t1, s0, t0 lw a0, 0(t1) jal ra, f32_to_bf16 mv s2, a0 # bf16_B # === ADD === mv a0, s1 mv a1, s2 jal ra, bf16_add la t1, out_add slli t2, t6, 2 add t1, t1, t2 sw a0, -4(t1) # === SUB === mv a0, s1 mv a1, s2 jal ra, bf16_sub la t1, out_sub slli t2, t6, 2 add t1, t1, t2 sw a0, -4(t1) # === MUL === mv a0, s1 mv a1, s2 jal ra, bf16_mul la t1, out_mul slli t2, t6, 2 add t1, t1, t2 sw a0, -4(t1) # === DIV === mv a0, s1 mv a1, s2 jal ra, bf16_div la t1, out_div slli t2, t6, 2 add t1, t1, t2 sw a0, -4(t1) # === Debug Print === la a0, msg_i li a7, 4 ecall addi a0, t6, -1 li a7, 1 ecall # print ADD/SUB/MUL/DIV la a0, msg_add li a7, 4 ecall la t1, out_add slli t2, t6, 2 add t1, t1, t2 lw a0, -4(t1) li a7, 34 ecall la a0, msg_sub li a7, 4 ecall la t1, out_sub add t1, t1, t2 lw a0, -4(t1) li a7, 34 ecall la a0, msg_mul li a7, 4 ecall la t1, out_mul add t1, t1, t2 lw a0, -4(t1) li a7, 34 ecall la a0, msg_div li a7, 4 ecall la t1, out_div add t1, t1, t2 lw a0, -4(t1) li a7, 34 ecall la a0, msg_nl li a7, 4 ecall j loop done: la a0, msg_done li a7, 4 ecall li a7, 10 ecall ############################################################## # f32_to_bf16: round-to-even ############################################################## f32_to_bf16: srli t0, a0, 16 # high16 li t1, 0x00008000 # rounding bit mask and t2, a0, t1 beq t2, x0, no_round addi t0, t0, 1 no_round: li t3, 0xFFFF and a0, t0, t3 ret .align 2 ############################################################## # bf16_to_f32 ############################################################## bf16_to_f32: slli a0, a0, 16 ret .align 2 ############################################################## # bf16_to_int ############################################################## bf16_to_int: srli t0, a0, 7 andi t0, t0, 0xFF beq t0, x0, bf16_int_zero andi t1, a0, 0x7F ori t1, t1, 0x80 li t2, 134 sub t3, t0, t2 bge t3, x0, bf16_int_shl neg t4, t3 srl a0, t1, t4 ret bf16_int_shl: sll a0, t1, t3 ret bf16_int_zero: mv a0, x0 ret .align 2 ############################################################## # int_to_bf16 ############################################################## int_to_bf16: beq a0, x0, int_bf16_zero mv t1, a0 li t0, 0 1: srli t1, t1, 1 beq t1, x0, 2f addi t0, t0, 1 j 1b 2: li t2, 127 add t2, t2, t0 li t5, 7 sub t3, t0, t5 blt t3, x0, m8_left srl t4, a0, t3 andi t4, t4, 0xFF j m8_done m8_left: neg t5, t3 sll t4, a0, t5 andi t4, t4, 0xFF m8_done: andi t4, t4, 0x7F slli t2, t2, 7 or a0, t2, t4 ret int_bf16_zero: mv a0, x0 ret .align 2 ############################################################## # bf16_add / sub / mul / div (with stack frames) ############################################################## bf16_add: addi sp, sp, -8 sw ra, 4(sp) sw t0, 0(sp) jal ra, bf16_to_int mv t0, a0 mv a0, a1 jal ra, bf16_to_int add a0, t0, a0 jal ra, int_to_bf16 lw ra, 4(sp) lw t0, 0(sp) addi sp, sp, 8 ret .align 2 bf16_sub: addi sp, sp, -8 sw ra, 4(sp) sw t0, 0(sp) jal ra, bf16_to_int mv t0, a0 mv a0, a1 jal ra, bf16_to_int sub a0, t0, a0 jal ra, int_to_bf16 lw ra, 4(sp) lw t0, 0(sp) addi sp, sp, 8 ret .align 2 bf16_mul: addi sp, sp, -8 sw ra, 4(sp) sw t0, 0(sp) jal ra, bf16_to_int mv t0, a0 mv a0, a1 jal ra, bf16_to_int mul a0, t0, a0 jal ra, int_to_bf16 lw ra, 4(sp) lw t0, 0(sp) addi sp, sp, 8 ret .align 2 bf16_div: addi sp, sp, -8 sw ra, 4(sp) sw t0, 0(sp) jal ra, bf16_to_int mv t0, a0 mv a0, a1 jal ra, bf16_to_int beq a0, x0, div_zero div a0, t0, a0 jal ra, int_to_bf16 j end_div div_zero: mv a0, x0 end_div: lw ra, 4(sp) lw t0, 0(sp) addi sp, sp, 8 ret ``` --- **Purpose** This project implements BFloat16 (bf16) conversion and basic arithmetic (add/sub/mul/div) entirely using integer operations. It is designed for the Ripes RV32I simulator and does not use any floating-point instructions. The goal is to understand how 32-bit IEEE 754 floating-point numbers can be approximated in 16-bit BFloat16 format and how simple arithmetic can be simulated through integer manipulation. **Features** | Function | Description | | ---------------------- | ----------------------------------------------------------------------------------------------- | | `f32_to_bf16` | Converts 32-bit float bit pattern to 16-bit bf16, using **round-to-nearest-even**. | | `bf16_to_f32` | Expands 16-bit bf16 bits back into a 32-bit float bit pattern. | | `bf16_to_int` | Approximates a bf16 value as an integer by extracting exponent/mantissa. | | `int_to_bf16` | Encodes an integer back into bf16 format by locating the MSB and packing exponent/mantissa. | | `bf16_add/sub/mul/div` | Performs arithmetic using **integer conversion**, safe for nested calls (stack protected). | | `main` | Loads test data, performs all four arithmetic operations, and prints results with debug labels. | --- **Algorithm** 1.f32_to_bf16: Extract high 16 bits of f32 input. If low 16 bits ≥ 0x8000, round up. Output only the 16-bit bf16 value. ``` bf16 = (f32 >> 16) if (f32 & 0x00008000) != 0 → bf16 += 1 ``` 2.bf16_to_f32: Reconstructs an f32 by left-shifting bf16 bits 16 positions. ``` f32_bits = bf16 << 16 ``` 3.bf16_to_int: Extract exponent and mantissa fields. Remove bias (127). Rebuilds approximate integer via shifting. ``` e_raw = (bits >> 7) & 0xFF mant8 = ((bits & 0x7F) | 0x80) shift = e_raw - 134 int = mant8 << shift (if shift >= 0) int = mant8 >> -shift (if shift < 0) ``` 4.int_to_bf16: Finds the most significant bit (MSB) position of the integer. Encodes exponent = 127 + MSB index. Scales mantissa to 7 bits and packs bf16 bits. ``` n = floor(log2(int)) e_raw = 127 + n mant7 = bits from 1.xxx pattern bf16 = (e_raw << 7) | (mant7 & 0x7F) ``` 5.arithmetic functions (add/sub/mul/div) All four operations share the same structure: 1.Convert both bf16 operands → integer. 2.Perform integer arithmetic. 3.Convert result → bf16. Example (bf16_add) ``` jal ra, bf16_to_int # a0 = int(A) mv t0, a0 mv a0, a1 jal ra, bf16_to_int # a0 = int(B) add a0, t0, a0 jal ra, int_to_bf16 # a0 = result_bf16 ``` Each arithmetic routine uses stack frames: ``` addi sp, sp, -8 sw ra, 4(sp) sw t0, 0(sp) ... lw ra, 4(sp) lw t0, 0(sp) addi sp, sp, 8 ret ``` --- **Test Data** ``` .data f32_data: .word 0x41200000, 0x41A00000, 0x41F00000, 0x42200000 # 10, 20, 30, 40 out_add: .word 0,0,0,0 out_sub: .word 0,0,0,0 out_mul: .word 0,0,0,0 out_div: .word 0,0,0,0 ``` The program automatically converts each adjacent pair (A[i], A[i+1]) into bf16 and performs:ADD, SUB, MUL, DIV | Operation | Expression | Expected bf16 | Approx. Decimal | | --------- | ------------------------- | ---------------------- | ----------------- | | **ADD** | (10+20), (20+30), (30+40) | 0x4317, 0x4321, 0x432C | ≈ 30.5, 50, 70 | | **SUB** | (10−20), (20−30), (30−40) | 0x42DE, 0x42CA, 0x42B8 | ≈ −10, −10, −10 | | **MUL** | (10×20), (20×30), (30×40) | 0x4523, 0x4575, 0x45A5 | ≈ 200, 600, 1200 | | **DIV** | (10/20), (20/30), (30/40) | 0x40C0, 0x4080, 0x4040 | ≈ 0.75, 0.66, 0.5 | **Run in Ripes** ![image](https://hackmd.io/_uploads/Sy02AEYpxl.png) ![image](https://hackmd.io/_uploads/HyFzJHKale.png) | Address Range | Description | | ------------------------- | --------------------- | | `0x10000000 ~ 0x1000000C` | Input data (f32_data) | | `0x10000010 ~ 0x1000001C` | out_add results | | `0x10000020 ~ 0x1000002C` | out_sub results | | `0x10000030 ~ 0x1000003C` | out_mul results | | `0x10000040 ~ 0x1000004C` | out_div results | # **LeetCode 1342 — Number of Steps to Reduce a Number to Zero** Example: Input: num = 14 Output: 6 Explanation: Step 1) 14 is even; divide by 2 and obtain 7. Step 2) 7 is odd; subtract 1 and obtain 6. Step 3) 6 is even; divide by 2 and obtain 3. Step 4) 3 is odd; subtract 1 and obtain 2. Step 5) 2 is even; divide by 2 and obtain 1. Step 6) 1 is odd; subtract 1 and obtain 0. C ``` #include <stdio.h> int numberOfSteps(int num) { int steps = 0; while (num > 0) { if (num % 2 == 0) num /= 2; else num -= 1; steps++; } return steps; } int main() { int test_inputs[] = {14, 8, 123}; int n = 3; int outputs[3]; for (int i = 0; i < n; i++) { outputs[i] = numberOfSteps(test_inputs[i]); printf("num=%d, steps=%d\n", test_inputs[i], outputs[i]); } return 0; } ``` Assembly ``` .text j main # --------------------------------------------------- # numberOfSteps using integer operations # optimized: steps = bit_length(num)-1 + popcount(num) # --------------------------------------------------- numberOfSteps: addi t0, x0, 0 # steps add t1, a0, x0 # copy of num beq t1, x0, Ldone_steps # count number of 1 bits (popcount) addi t2, x0, 0 # ones_count Lpop_loop: beq t1, x0, Lpop_done andi t3, t1, 1 add t2, t2, t3 srli t1, t1, 1 j Lpop_loop Lpop_done: add t0, t0, t2 # add ones_count # compute bit_length-1 add t1, a0, x0 # reset t1 = num li t3, 0 # bitlen Lbitlen_loop: beq t1, x0, Lbitlen_done srli t1, t1, 1 addi t3, t3, 1 j Lbitlen_loop Lbitlen_done: addi t3, t3, -1 add t0, t0, t3 add a0, t0, x0 ret Ldone_steps: li a0, 0 ret # --------------------------------------------------- # main # --------------------------------------------------- main: la s0, test_inputs la s1, outputs li s2, 3 # n=3 addi s3, x0, 0 Lloop: beq s3, s2, Ldone slli t0, s3, 2 add t1, s0, t0 lw a0, 0(t1) jal ra, numberOfSteps la t2, outputs add t2, t2, t0 sw a0, 0(t2) addi s3, s3, 1 j Lloop Ldone: j Ldone # --------------------------------------------------- # Data # --------------------------------------------------- .data test_inputs: .word 14,8,123 outputs: .word 0,0,0 ``` **Optimization Explanation** using bitwise operations inspired by the Problem Bstrategy: 1.Observation: Every 1 in the binary representation requires a subtract 1 operation.Every bit shift (divide by 2) corresponds to moving through the binary digits until the number becomes zero. 2.Formula: steps = (number of 1 bits in num) + (bit_length(num) - 1) number of 1 bits → counts how many subtract-1 operations are needed. bit_length(num) - 1 → counts how many divide-by-2 operations are needed. 3.Assembly Implementation: Compute popcount of num using bitwise AND and shift loop. Compute bit length by shifting num right until it becomes zero. Sum both to get total steps. 4.Advantages: Avoids repeated conditional checks in the loop. Each iteration of the loops is predictable, making assembly translation simpler. Efficient in RV32I, since only integer operations and bit shifts are used.