arch2025-homework1

# Assignment 1: RISC-V Assembly and Instruction Pipeline contributed by < [xwt](@xwt) > > **AI tools usage** ## Problem b in [Quiz1](https://hackmd.io/@sysprog/arch2025-quiz1-sol) ### Overview Assume uf8 implements a logarithmic 8-bit codec that maps 20-bit unsigned integers ( [0,1,015,792]) to 8-bit symbols via logarithmic quantization, delivering 2.5:1 compression and ≤6.25% relative error. #### Decoding $$　D(b) = m \cdot 2^e + (2^e - 1) \cdot 16　$$ #### Encoding $$　E(v) = \begin{cases} v, & \text{if } v < 16 \\ 16e + \lfloor(v - \text{offset}(e))/2^e\rfloor, & \text{otherwise} \end{cases}　$$ Where $\text{offset}(e) = (2^e - 1) \cdot 16　$ #### Error Analysis Absolute Error: $\Delta_{\max} = 2^e - 1$ Relative Error: $\varepsilon_{\max} = 1/16 = 6.25\%$ Expected Error: $\mathbb{E}[\varepsilon] \approx 3\%$ #### Information Theory Input Entropy: 20 bits Output Entropy: 8 bits Theoretical Minimum: 7.6 bits (for 6.25% error bound) Efficiency: 8/7.6 = 95% optimal | Exponent | Range | Step Size | | -------- |:-------------------- | --------- | | 0 | [0, 15] | 1 | | 1 | [16, 46] | 2 | | 2 | [48, 108] | 4 | | 3 | [112, 232] | 8 | | ... | ... | $2^e$ | | 15 | [524,272, 1,015,792] | 32,768 | #### clz A key part of the uf8_encode process is the use of the CLZ (Count Leading Zeros) operation. CLZ quickly determines the position of the most significant bit in a 32-bit integer. This allows the encoder to efficiently estimate the exponent that represents the magnitude of the input value. Instead of using slow logarithmic or floating-point operations, clz provides a fast integer-based way to identify how large a number is. 1. clz stackless Uses conditional branches (beq /bne) to progressively shift and test bits. It shifts the number right by 16, 8, 4, 2, 1 bits iteratively, halving c each time, to find where the first 1 appears. 2. clz branchless The code counts leading zeros (CLZ) of a 32-bit unsigned value without any branch that depends on the data. It performs a fixed sequence of “probe & shift” on chunk sizes 16, 8, 4, 2, 1 bits: | Feature | clz_stackless | clz_branchless | |:------------------------ | ------------------ |:----------------------------------------- | | **Uses branches** | Yes | No | | **Timing behavior** | Variable | Constant | | **Code size** | Smaller | Larger | | **Impact** | Pipeline may flush | No flush | | **Applicable Scenarios** | low-end MCUs | CPUs with pipelines such as RISC-V or ARM | * Pipeline flush A pipeline flush occurs when the instructions currently in the CPU pipeline must be discarded because the processor’s control flow has changed unexpectedly, usually due to a branch misprediction, exception, or interrupt. When this happens, all the instructions that were fetched or partially executed after the mispredicted branch (or before the correct target is known) become invalid, so the processor flushes them out of the pipeline and starts fetching the correct instructions again. This process causes the CPU to discard partially executed instructions, refill the pipeline from the correct instruction address, and results in a performance penalty due to several lost cycles. Pipeline flushes are commonly caused by branch mispredictions, exceptions, interrupts, self-modifying code, or pipeline hazards that invalidate prior assumptions. They lead to reduced instruction throughput and decreased efficiency, especially in deep pipelines such as modern RISC-V or ARM processors. To avoid a pipeline flush, I implemented the clz_branchless version. ### clz #### stackless The code provided by the professor is a stackless version. I first implemented this version, and then compared it with the branchless version. * c code ```c= static inline unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return n - x; } ``` * assebly code ```risc-v= clz: li t0, 32 # n = 32 li t1, 16 # c = 16 c: srl t2, a0, t1 # y = x >> c beqz t2, y # if (y == 0) skip sub t0, t0, t1 # n -= c mv a0, t2 # x = y y: srli t1, t1, 1 # c >>= 1 bnez t1, c # while (c) sub a0, t0, a0 # return n - x ret ``` ![螢幕擷取畫面 2025-10-19 183333](https://hackmd.io/_uploads/rkxj04zCgg.png) #### branchless ```risc-v= clz_branchless: li t0, 32 # n = 32 mv t1, a0 # u = x # step 16 srli t2, t1, 16 # t2 = u >> 16 snez t2, t2 # t2 = (u>>16)!=0 ? 1:0 slli t3, t2, 4 # t3 = t2<<4 = 0 or 16 srl t1, t1, t3 # u >>= t3 sub t0, t0, t3 # n -= t3 # step 8 srli t2, t1, 8 snez t2, t2 # 0/1 slli t3, t2, 3 # 0 or 8 srl t1, t1, t3 sub t0, t0, t3 # step 4 srli t2, t1, 4 snez t2, t2 # 0/1 slli t3, t2, 2 # 0 or 4 srl t1, t1, t3 sub t0, t0, t3 # step 2 srli t2, t1, 2 snez t2, t2 # 0/1 slli t3, t2, 1 # 0 or 2 srl t1, t1, t3 sub t0, t0, t3 # step 1 srli t2, t1, 1 snez t2, t2 # 0/1 mv t3, t2 # 0 or 1 srl t1, t1, t3 sub t0, t0, t3 # return n - u sub a0, t0, t1 ret ``` ![螢幕擷取畫面 2025-10-19 183949](https://hackmd.io/_uploads/S1grMeBf0ll.png) In the Ripes simulator, the execution results show that the stackless version required 56 cycles to complete 28 instructions, giving a CPI of 2.0 and IPC of 0.5. In contrast, the branchless version finished in only 43 cycles while executing 33 instructions, achieving a CPI of 1.3 and IPC of 0.767. This means the branchless implementation, though slightly longer in instruction count, executed more efficiently because it avoided conditional branches that often cause pipeline flushes. Without these branches, the instruction flow was smoother, minimizing stalls and allowing better pipeline utilization. As a result, the branchless version improved performance by reducing wasted cycles, increasing instruction throughput, and achieving a higher execution efficiency — an expected outcome on pipelined architectures like RISC-V. ### uf8_decode ``` risc-v= uf8_decode: andi t0, a0, 0x0F # t0 = mantissa srli t1, a0, 4 # t1 = exponent (0..15) li t2, 15 sub t2, t2, t1 # t2 = 15 - exponent li t3, 0x7FFF srl t3, t3, t2 # t3 = 0x7FFF >> (15 - exponent) slli t3, t3, 4 # t3 <<= 4 = offset sll t4, t0, t1 # t4 = mantissa << exponent add a0, t4, t3 # a0 = t4 + offset ret ``` ### uf8_encode ``` risc-v= uf8_encode: addi sp, sp, -4 # Allocate stack space# Allocate stack space sw ra, 0(sp) # Save return address of test mv s3, a0 li t0, 16 blt a0, t0, encode_done jal ra, clz # a0 = lz li t1, 31 sub t1, t1, a0 # t1 = msb li t2, 0 # exponent li t3, 0 # overflow li t4, 5 blt t1, t4, exact # t4=(msb<5) addi t2, t1, -4 # exponent = msb-4 li t4, 16 blt t2, t4 , overflow #exponent < 16 li t2, 15 #exponent = 15 overflow: li t5, 0 # e=0 overflow_loop: bge t5, t2, estimate #e >= exponent slli t3, t3, 1 #overflow << 1 addi t3, t3, 16 #(overflow << 1) + 16 addi t5, t5, 1 #e++ j overflow_loop estimate: beqz t2, exact #exponent = 0 bge s3, t3, exact #value < overflow addi t3, t3, -16 #overflow - 16 srli t3, t3, 1 #(overflow - 16) >> 1 addi t2, t2, -1 # exponent-- j estimate exact: li t5, 15 bgeu t2, t5, exact_done # if (exponent >= 15) break slli t4, t3, 1 #(overflow << 1) addi t4, t4, 16 #next_overflow=(overflow << 1) + 16 bltu s3, t4, exact_done ## value < next_overflow break mv t3, t4 #overflow = next_overflo addi t2, t2, 1 #exponent++ j exact exact_done: sub t4, s3, t3 #value - overflow srl t4, t4, t2 #(value - overflow) >> exponent slli t2, t2, 4 #exponent << 4 or a0, t2, t4 #(exponent << 4) | mantissa encode_done: lw ra, 0(sp) addi sp, sp, 4 ret ``` ### Result ![螢幕擷取畫面 2025-11-06 214444](https://hackmd.io/_uploads/ByIRIQcJbx.png) ![螢幕擷取畫面 2025-11-06 214425](https://hackmd.io/_uploads/HJggPX9J-x.png) ![螢幕擷取畫面 2025-11-06 225204](https://hackmd.io/_uploads/ryyKL451-x.png) ![螢幕擷取畫面 2025-11-06 225224](https://hackmd.io/_uploads/BJ4tU45kbg.png) ### Problem b full code :::spoiler full c code ```c= #include <stdbool.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> typedef uint8_t uf8; static inline unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return n - x; } /* Decode uf8 to uint32_t */ uint32_t uf8_decode(uf8 fl) { uint32_t mantissa = fl & 0x0f; uint8_t exponent = fl >> 4; uint32_t offset = (0x7FFF >> (15 - exponent)) << 4; return (mantissa << exponent) + offset; } /* Encode uint32_t to uf8 */ uf8 uf8_encode(uint32_t value) { /* Use CLZ for fast exponent calculation */ if (value < 16) return value; /* Find appropriate exponent using CLZ hint */ int lz = clz(value); int msb = 31 - lz; /* Start from a good initial guess */ uint8_t exponent = 0; uint32_t overflow = 0; if (msb >= 5) { /* Estimate exponent - the formula is empirical */ exponent = msb - 4; if (exponent > 15) exponent = 15; /* Calculate overflow for estimated exponent */ for (uint8_t e = 0; e < exponent; e++) overflow = (overflow << 1) + 16; /* Adjust if estimate was off */ while (exponent > 0 && value < overflow) { overflow = (overflow - 16) >> 1; exponent--; } } /* Find exact exponent */ while (exponent < 15) { uint32_t next_overflow = (overflow << 1) + 16; if (value < next_overflow) break; overflow = next_overflow; exponent++; } uint8_t mantissa = (value - overflow) >> exponent; return (exponent << 4) | mantissa; } /* Test encode/decode round-trip */ static bool test(void) { int32_t previous_value = -1; bool passed = true; for (int i = 0; i < 256; i++) { uint8_t fl = i; int32_t value = uf8_decode(fl); uint8_t fl2 = uf8_encode(value); if (fl != fl2) { printf("%02x: produces value %d but encodes back to %02x\n", fl, value, fl2); passed = false; } if (value <= previous_value) { printf("%02x: value %d <= previous_value %d\n", fl, value, previous_value); passed = false; } previous_value = value; } return passed; } int main(void) { if (test()) { printf("All tests passed.\n"); return 0; } return 1; } ``` ::: :::spoiler full assembly code ```risc-v= .data str0: .string ":produces value " str1: .string " but encodes back to " str2: .string ":value " str3: .string " <= previous_value " str4: .string "\n All tests passed.\n" str5: .string "\n somthing wrong.\n" nl: .string "\n" .text .globl main # main: addi sp, sp, -4 # adjust stack for item sw ra, 0(sp) # save return address of main jal ra, test # go to test beqz a0, fail # tests failed la a0, str4 # load string4 li a7, 4 # Call system call (printString) ecall j exit fail: la a0, str5 # load string4 li a7, 4 # Call system call (printString) ecall exit: lw ra, 0(sp) # restore register for caller addi sp, sp, 4 # adjust stack to delete item li a7, 10 # exit the program ecall test: addi sp, sp, -20 # allocate stack space sw ra, 16(sp) # save return address of test sw s0, 12(sp) # save s0 sw s1, 8(sp) # save s1 sw s1, 4(sp) # save s2 sw s1, 0(sp) # save s3 li s0, -1 # previous_value = -1 li s1, 1 # bool passed = true li s2, 0 # i = 0 li s3, 0 # value = 0 loop1: li t0, 256 # t0 = 256 bge s2, t0, true # i = 256 go to true mv a0, s2 # a0 = i jal ra, uf8_decode # call uf8_decode mv s3, a0 # s3 = value = uf8_decode(fl) jal ra, uf8_encode # call uf8_encode mv t2, a0 # t2 = fl2 = uf8_encode(value) beq s2, t2, next # fl = fl2 go to next li s1, 0 # bool passed = 0 (False) mv a0, s2 # fl jal ra, print_int # print int (fl) la a0, str0 # ":produces value " jal ra, print_str # print string (str0) mv a0, s3 # value jal ra, print_int # print int (value) la a0, str1 # "but encodes back to " jal ra, print_str # print string (str1) mv a0, t2 # fl2 jal ra, print_int # print int (fl2) jal ra, print_nl # print string (nl) next: blt s0, s3, update # previous < value go to update li s1, 0 # bool passed = 0 (False) mv a0, s2 # fl jal ra, print_int la a0, str2 # ":value " jal ra, print_str mv a0, s3 # value jal ra, print_int la a0, str3 # "<= previous_value" jal ra, print_str mv a0, s0 # previous_value jal ra, print_int la a0, str3 jal ra, print_nl update: mv s0, s3 # previous_value = value addi s2, s2, 1 # i++ j loop1 #go to loop1 true: mv a0, s1 # move bool value into a0 lw s3, 0(sp) # restore s3-0 from stack lw s2, 4(sp) lw s1, 8(sp) lw s0, 12(sp) lw ra, 16(sp) # restore ra from stack addi sp, sp, 20 # release this function's stack frame ret # return to caller uf8_decode: andi t0, a0, 0x0F # mantissa = fl & 0x0f srli t1, a0, 4 # exponent = fl >> 4 li t2, 15 sub t2, t2, t1 # t2 = 15 - exponent li t3, 0x7FFF srl t3, t3, t2 # t3 = 0x7FFF >> (15 - exponent) slli t3, t3, 4 # offset= t3 <<= 4 sll t4, t0, t1 # t4 = mantissa << exponent add a0, t4, t3 # a0 = t4 + offset ret # return to caller uf8_encode: addi sp, sp, -4 sw ra, 0(sp) li t0, 16 blt a0, t0, encode_done #value < 16 go to encode_done jal ra, clz # call clz,return a0(lz) li t1, 31 sub t1, t1, a0 # msb = 31 - lz li t2, 0 # exponent = 0 li t3, 0 # overflow = 0 li t4, 5 blt t1, t4, exact # msb < 5 go to exact addi t2, t1, -4 # exponent = msb-4 li t4, 16 blt t2, t4, overflow # exponent < 16 go to overflow li t2, 15 # exponent = 15 overflow: li t5, 0 # e = 0 overflow_loop: bge t5, t2, estimate # e >= exponent go to estimate slli t3, t3, 1 # overflow << 1 addi t3, t3, 16 # (overflow << 1) + 16 addi t5, t5, 1 # e++ j overflow_loop #go to overflow_loop estimate: beqz t2, exact #exponent = 0 bge s3, t3, exact #value < overflow go to exact addi t3, t3, -16 #overflow - 16 srli t3, t3, 1 #(overflow - 16) >> 1 addi t2, t2, -1 # exponent-- j estimate #go to estimate exact: li t5, 15 bgeu t2, t5, exact_done # if (exponent >= 15) go to exact_done slli t4, t3, 1 # (overflow << 1) addi t4, t4, 16 # next_overflow = (overflow << 1) + 16 bltu s3, t4, exact_done # value < next_overflow go to exact_done mv t3, t4 # overflow = next_overflo addi t2, t2, 1 # exponent++ j exact exact_done: sub t4, s3, t3 # value - overflow srl t4, t4, t2 # (value - overflow) >> exponent slli t2, t2, 4 # exponent << 4 or a0, t2, t4 # (exponent << 4) | mantissa encode_done: lw ra, 0(sp) addi sp, sp, 4 ret clz: li t0, 32 # n = 32 li t1, 16 # c = 16 c: srl t2, a0, t1 # y = x >> c beqz t2, y # if (y == 0) skip sub t0, t0, t1 # n -= c mv a0, t2 # x = y y: srli t1, t1, 1 # c >>= 1 bnez t1, c # while (c) sub a0, t0, a0 # return n - x ret print_int: # a0 = int li a7, 1 ecall ret print_str: # a0 = &str li a7, 4 ecall ret print_nl: # a0 = &str la a0, nl li a7, 4 ecall ret ``` ::: ## Problem c in [Quiz1](https://hackmd.io/@sysprog/arch2025-quiz1-sol) ### Overview The bfloat16 format (16-bit, from Google Brain) preserves float32’s dynamic range by keeping the same 8-bit exponent, but reduces precision to a 7-bit significand (vs. 23). Bit Layout ``` ┌─────────┬──────────────┬──────────────┐ │Sign (1) │ Exponent (8) │ Mantissa (7) │ └─────────┴──────────────┴──────────────┘ 15 14 7 6 0 S: Sign bit (0 = positive, 1 = negative) E: Exponent bits (8 bits, bias = 127) M: Mantissa/fraction bits (7 bits) ``` The value 𝑣 of a BFloat16 number is calculated as: $$v = (-1)^S \times 2^{E-127} \times \left(1 + \frac{M}{128}\right)$$ where: * $S \in \{0, 1\}$ is the sign bit * $E \in [1, 254]$ is the biased exponent * $M \in [0, 127]$ is the mantissa value Special Cases * Zero:$E = 0, M = 0 \Rightarrow v = (-1)^S \times 0$ * Infinity: $E = 255, M = 0 \Rightarrow v = (-1)^S \times \infty$ * NaN: $E = 255, M \neq 0 \Rightarrow v = \text{NaN}$ * Denormals: Not supported (flush to zero) ### Special Value Check Function :::spoiler c code ``` c= static inline bool bf16_isnan(bf16_t a) { return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) && (a.bits & BF16_MANT_MASK); } static inline bool bf16_isinf(bf16_t a) { return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) && !(a.bits & BF16_MANT_MASK); } static inline bool bf16_iszero(bf16_t a) { return !(a.bits & 0x7FFF); } ``` ::: :::spoiler assebly code ``` risc-v= # ---------------------------------------- # bf16_isnan(a0 = bf16 bits) -> a0 = 1/0 # ---------------------------------------- bf16_isnan: li t0, BF16_EXP_MASK # t0 = 0x7F80 and t1, a0, t0 # t1 = a & BF16_EXP_MASK bne t1, t0, return_F # if ((a & EXP_MASK) != EXP_MASK) -> 0 li t0, BF16_MANT_MASK # t0 = 0x007F and t1, a0, t0 # t1 = a & BF16_MANT_MASK snez a0, t1 # a0 = (t1 != 0) ? 1 : 0 ret # ---------------------------------------- # bf16_isinf(a0 = bf16 bits) -> a0 = 1/0 # ---------------------------------------- bf16_isinf: li t0, BF16_EXP_MASK # t0 = 0x7F80 and t1, a0, t0 # t1 = a & BF16_EXP_MASK bne t1, t0, return_F # if ((a & EXP_MASK) != EXP_MASK) -> 0 li t0, BF16_MANT_MASK # t0 = 0x007F and t1, a0, t0 # t1 = a & BF16_MANT_MASK seqz a0, t1 # a0 = (t1 == 0) ? 1 : 0 ret # ---------------------------------------- # bf16_iszero(a0 = bf16 bits) -> a0 = 1/0 # ---------------------------------------- bf16_iszero: li t0, 0x7FFF # t0 = 0x7FFF and t1, a0, t0 # t1 = a & 0x7FFF seqz a0, t1 # a0 = (t1 == 0) ? 1 : 0 ret # ---------------------------------------- # return 0 # ---------------------------------------- return_F: li a0, 0 ret ``` ::: ### Conversion Function :::spoiler c code ``` c= ``` ::: :::spoiler assebly code ``` risc-v= # ---------------------------------------- # f32_to_bf16 / bf16_to_f32 # ---------------------------------------- f32_to_bf16: srli t0, a0, 23 # f32bits >> 23 andi t0, t0, 0xFF # (f32bits >> 23) & 0xFF li t1, 0xFF beq t0, t1, bit16 # exponent == 0xFF srli t2, a0, 16 # t2 = f32bits >> 16 andi t2, t2, 1 # t2 = ((f32bits >> 16) & 1) li t3, 0x7FFF add t2, t2, t3 add a0, a0, t2 srli a0, a0, 16 ret bit16: # exponent == 0xFF srli a0, a0, 16 # f32bits >> 16 ret bf16_to_f32: slli a0, a0, 16 ret ``` ::: ### Addition and Substraction :::spoiler c code ``` c= ``` ::: :::spoiler assebly code ``` risc-v= ``` ::: ### Multiplication :::spoiler c code ``` c= ``` ::: :::spoiler assebly code ``` risc-v= ``` ::: ### Division :::spoiler c code ``` c= ``` ::: :::spoiler assebly code ``` risc-v= ``` ::: ### Square :::spoiler c code ``` c= ``` ::: :::spoiler assebly code ``` risc-v= ``` ::: ## 5-Stage Pipeline Processor | Stage | full name | Description | | ------- |:---------------------------------- |:-------------------------------------------------------- | | **IF** | Instruction Fetch | Fetch the instruction from memory | | **ID** | Instruction Decode & Register Read | Decode the instruction and read register operands | | **EX** | Execution or Address Calculation | Execute the operation or calculate the effective address | | **MEM** | Data Memory Access | Access data memory for load or store instructions | | **WB** | Write Back | Write the result back into the destination register | 1. IF 2. ID 3. EX 4. MEM 5. WB | Metric | Full Name | Description | |:------------------ |:---------------------- |:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | **Cycles** | Clock Cycles | The total number of clock cycles elapsed since program execution began. | | **Instrs Retired** | Instructions Retired | The total number of instructions that have successfully passed through all pipeline stages (IF → ID → EX → MEM → WB) and completed execution. Flushed or canceled instructions are not counted. | | **CPI** | Cycles Per Instruction | The average number of clock cycles required to complete one instruction. Calculated as **CPI = Cycles ÷ Instrs Retired**. A lower CPI indicates better performance and fewer stalls. | | **IPC** | Instructions Per Cycle | The average number of instructions completed per clock cycle. Calculated as **IPC = Instrs Retired ÷ Cycles**. A higher IPC indicates better parallelism and pipeline efficiency. | ## Reference [Quiz1 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz1-sol) [Pipeline](https://hackmd.io/@ExcitedMail/rynNyOKJY)