arch2025-homework1

# Assignment 1: RISC-V Assembly and Instruction Pipeline Contributed by [yuyan7498](https://github.com/yuyan7498) Code: [Github repository](https://github.com/yuyan7498/ca2025-quizzes) This repository contains three folders: * `problem_B_and_c` — RISC-V assembly implementations for Quiz 1, Problem B and Problem C, including both the program sources and their tests. * `next_power_of_two` — clz implementation: Compute the smallest power of two greater than or equal to a 32-bit unsigned integer * `dist_2_points` — bfloat16 implementation: Calculate two-Point minimum distance * * `leetcode231_power_of_two` — extra practice ## Problem B UF8 ⇄ uint32 This write-up explains the overall test harness (main) and the three core routines—`decode`, `encode`, and `clz32`—with design ideas, key code snippets, and quick correctness intuition. ### Target & Data Format UF8: an 8-bit custom integer encoding, `eeee mmmm` - High 4 bits `e` = exponent - Low 4 bits `m` = mantissa Definition: ``` value = ((0x7FFF >> (15 - e)) << 4) + (m << e) = 16 * (2^e - 1) + m * 2^e ``` with `e ∈ [0,15]`, `m ∈ [0,15]`. ### clz32 - Goal: return the number of leading zeros of a 32-bit unsigned integer. - Idea: binary search over 16/8/4/2/1 bit chunks; if the upper chunk is non-zero, shift down and accumulate the bit width. If `a0=0`, return `32`. #### C code ```c static inline unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return n - x; } ``` #### Assembly code ```asm # ------------------------------------------------------------ # clz32 # a0=x; returns clz(x). Uses t0=n, t1=c, t2=y # ------------------------------------------------------------ clz32: li t0, 32 # n = 32 li t1, 16 # c = 16 1: srl t2, a0, t1 # y = x >> c beq t2, x0, 2f # if (!y) skip update sub t0, t0, t1 # n -= c mv a0, t2 # x = y 2: srli t1, t1, 1 # c >>= 1 bnez t1, 1b # while (c) sub a0, t0, a0 # return n - x ret ``` ### decode Recall: ``` value = 16 * (2^e - 1) + m * 2^e ``` #### C Code ```c /* Decode uf8 to uint32_t */ uint32_t uf8_decode(uf8 fl) { uint32_t mantissa = fl & 0x0f; uint8_t exponent = fl >> 4; uint32_t offset = (0x7FFF >> (15 - exponent)) << 4; return (mantissa << exponent) + offset; } ``` #### Assembly Code ```asm decode: andi t0, a0, 0x0f # m srli t1, a0, 4 # e li t2, 15 sub t2, t2, t1 # 15 - e li t3, 0x7FFF srl t3, t3, t2 # 0x7FFF >> (15 - e) = (2^e - 1) slli t3, t3, 4 # *16 => offset sll t4, t0, t1 # m << e add a0, t4, t3 # value ret ``` #### Example (0x52: e=5, m=2) - `offset = 16*(2^5 − 1) = 496` - `m<<e = 2*32 = 64` - `value = 560` --- ### encode Goal: for a given `value`, find `e, m` s.t. ``` value = 16*(2^e − 1) + m*2^e, with e,m ∈ [0,15]. ``` #### Algorithm steps 1) Small fast path: if `value < 16` ⇒ `e=0`, `m=value` (return value). 2) Initial `e` guess: - `msb = 31 - clz32(value)` → estimate `e ≈ msb - 4` (since `m` is 4-bit). Clamp to `[0,15]`. 3) Build `overflow = 16*(2^e − 1)`: - Start from 0, loop `e` times with `overflow = (overflow<<1) + 16`. 4) Adjust `e` downward if needed: - While `e>0 && value < overflow`: rollback one level - `overflow = (overflow - 16) >> 1`, `e--`. 5) Adjust `e` upward if possible: - If `value ≥ next_overflow = (overflow<<1)+16` and `e<15`, then promote: - `overflow = next_overflow`, `e++`. 6) Solve `m`: `m = (value - overflow) >> e`. 7) Pack: `(e<<4) | m`. #### C code ```c /* Encode uint32_t to uf8 */ uf8 uf8_encode(uint32_t value) { /* Use CLZ for fast exponent calculation */ if (value < 16) return value; /* Find appropriate exponent using CLZ hint */ int lz = clz(value); int msb = 31 - lz; /* Start from a good initial guess */ uint8_t exponent = 0; uint32_t overflow = 0; if (msb >= 5) { /* Estimate exponent - the formula is empirical */ exponent = msb - 4; if (exponent > 15) exponent = 15; /* Calculate overflow for estimated exponent */ for (uint8_t e = 0; e < exponent; e++) overflow = (overflow << 1) + 16; /* Adjust if estimate was off */ while (exponent > 0 && value < overflow) { overflow = (overflow - 16) >> 1; exponent--; } } /* Find exact exponent */ while (exponent < 15) { uint32_t next_overflow = (overflow << 1) + 16; if (value < next_overflow) break; overflow = next_overflow; exponent++; } uint8_t mantissa = (value - overflow) >> exponent; return (exponent << 4) | mantissa; } ``` #### Assembly code ```asm encode: # save ra/s0 addi sp, sp, -16 sw ra, 12(sp) sw s0, 8(sp) mv s0, a0 # value # 1) small fast path li t0, 16 bltu s0, t0, EncodeRetShort # 2) e ≈ msb - 4 (msb = 31 - clz) mv a0, s0 jal ra, clz32 li t5, 31 sub t5, t5, a0 # msb addi t1, t5, -4 # e li t6, 15 bgtu t1, t6, ClipTo15 NoClip: # 3) overflow = 16*(2^e - 1) li t2, 0 li t3, 0 ForELoop: beq t3, t1, ForEDone slli t2, t2, 1 addi t2, t2, 16 addi t3, t3, 1 j ForELoop ForEDone: # 4) adjust down if value < overflow AdjLoop: beqz t1, AdjDone bltu s0, t2, AdjStep j AdjDone AdjStep: addi t2, t2, -16 srli t2, t2, 1 addi t1, t1, -1 j AdjLoop AdjDone: # 5) adjust up if value ≥ next_overflow WhileLoop: li t6, 15 beq t1, t6, WhileDone slli t3, t2, 1 addi t3, t3, 16 # next_overflow bltu s0, t3, WhileDone mv t2, t3 addi t1, t1, 1 j WhileLoop WhileDone: # 6) m = (value - overflow) >> e sub t3, s0, t2 srl t3, t3, t1 # 7) (e<<4) | m slli t4, t1, 4 or a0, t4, t3 j EncodeRet EncodeRetShort: mv a0, s0 # e=0, m=value EncodeRet: lw ra, 12(sp) lw s0, 8(sp) addi sp, sp, 16 ret ClipTo15: li t1, 15 j NoClip ``` #### Why m ∈ [0,15] holds After adjustment, `overflow ≤ value < next_overflow`, where `next_overflow = overflow + 16*2^e`. Let `Δ = value − overflow`, then `0 ≤ Δ < 16*2^e`. Thus `m = (Δ >> e)` ⇒ `0 ≤ m < 16`. #### Example (`value=560`) - `clz32(560)=22` → `msb=9` → estimate `e=5` - `overflow = 16*(2^5−1)=496`, `next_overflow=496+512=1008` - `overflow ≤ 560 < next_overflow` (no further adjust) - `m = (560−496)>>5 = 64>>5 = 2` - result `(e<<4)|m = 0x52` #### result ![image](https://hackmd.io/_uploads/BJr85Dqaex.png) ![image](https://hackmd.io/_uploads/r1ND9w9pxg.png) ![image](https://hackmd.io/_uploads/r1yOqD5ale.png) ![image](https://hackmd.io/_uploads/SJ6_5v9agx.png) ## Problem C — BF16 Arithmetic ### Bit Layout & Constants (bf16) - **sign**: bit 15 - **exponent**: bits 14..7 (8 bits), bias = 127 - **mantissa**: bits 6..0 (7 bits) #### Key masks & special values: - `BF16_SIGN_MASK = 0x8000` - `BF16_EXP_MASK = 0x7F80` - `BF16_MANT_MASK = 0x007F` - `+Inf = 0x7F80`, `-Inf = 0xFF80`, `NaN (canonical) = 0x7FC0` - `+0 = 0x0000`, `-0 = 0x8000` --- ### bf16_isnan **Idea**: In IEEE formats, exp=all-ones means NaN or Inf; mantissa≠0 ⇒ NaN. #### C code ```c static inline bool bf16_isnan(bf16_t a) { return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) && (a.bits & BF16_MANT_MASK); } ``` #### Assembly code ```asm # return ((bits & 0x7F80)==0x7F80) && (bits & 0x007F) bf16_isnan: li t2, 0x7F80 and t0, a0, t2 bne t0, t2, Lnan_false li t3, 0x007F and t1, a0, t3 beq t1, x0, Lnan_false li a0, 1 ret Lnan_false: mv a0, x0 ret ``` --- ### bf16_isinf **Idea**: exp=all-ones and mantissa=0 ⇒ Inf. #### C code ```c static inline bool bf16_isinf(bf16_t a) { return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) && !(a.bits & BF16_MANT_MASK); } ``` #### Assembly code ```asm # return ((bits & 0x7F80)==0x7F80) && !(bits & 0x007F) bf16_isinf: li t2, 0x7F80 and t0, a0, t2 bne t0, t2, Linf_false li t3, 0x007F and t1, a0, t3 bne t1, x0, Linf_false li a0, 1 ret Linf_false: mv a0, x0 ret ``` --- ### bf16_iszero **Idea**: ±0 are zeros; ignore sign bit. #### C code ```c static inline bool bf16_iszero(bf16_t a) { return !(a.bits & 0x7FFF); } ``` #### Assembly code ```asm # return !(bits & 0x7FFF) bf16_iszero: li t0, 0x7FFF and t0, a0, t0 bne t0, x0, Lzero_false li a0, 1 ret Lzero_false: mv a0, x0 ret ``` --- ### f32_to_bf16 (RNE, ties-to-even) #### Idea: - If exp==0xFF (Inf/NaN), pass-through high16 (preserves NaN payload). - Else round-to-nearest-even (RNE) by +0x7FFF and +(high16&1) before chopping. #### C code ```c static inline bf16_t f32_to_bf16(float val) { uint32_t f32bits; memcpy(&f32bits, &val, sizeof(float)); if (((f32bits >> 23) & 0xFF) == 0xFF) return (bf16_t){ .bits = (f32bits >> 16) & 0xFFFF }; f32bits += ((f32bits >> 16) & 1) + 0x7FFF; // RNE ties-to-even return (bf16_t){ .bits = f32bits >> 16 }; } ``` #### Assembly code ```asm # if exp==0xFF: return high16(f32bits) # else: RNE by adding ((high16&1)+0x7FFF), then return high16 f32_to_bf16: srli t0, a0, 23 andi t0, t0, 0xFF li t1, 0xFF beq t0, t1, Lfte_special srli t2, a0, 16 # high16 andi t2, t2, 1 # (high16 & 1) li t3, 0x7FFF add t2, t2, t3 # increment for RNE add a0, a0, t2 srli a0, a0, 16 ret Lfte_special: srli a0, a0, 16 andi a0, a0, 0xFFFF ret ``` --- ### bf16_to_f32 **Idea**: Widen: place bf16 in the high 16 bits of a float's 32-bit layout. #### C code ```c static inline float bf16_to_f32(bf16_t val) { uint32_t f32bits = ((uint32_t)val.bits) << 16; float out; memcpy(&out, &f32bits, sizeof(out)); return out; } ``` #### Assembly code ```asm # return ((uint32_t)bits)<<16 bf16_to_f32: andi a0, a0, 0xFFFF slli a0, a0, 16 ret ``` --- ### bf16_add / bf16_sub - Special cases first: NaN propagation; Inf rules; zeros shortcuts. - Align exponents: shift the smaller mant right; large gaps short-circuit. - Same sign: add mant, handle carry (>>1; exp++), overflow ⇒ ±Inf. - Different signs: subtract big–small; if zero ⇒ signed zero; else left-normalize until hidden-1. - Pack: (sign<<15) | ((exp&0xFF)<<7) | (mant&0x7F). #### C code ```c static inline bf16_t bf16_add(bf16_t a, bf16_t b) { uint16_t sa=(a.bits>>15)&1, sb=(b.bits>>15)&1; int16_t ea=(a.bits>>7)&0xFF, eb=(b.bits>>7)&0xFF; uint16_t ma=a.bits&0x7F, mb=b.bits&0x7F; if (ea==0xFF) { if (ma) return a; if (eb==0xFF) return (mb||sa==sb)? b: BF16_NAN(); return a; } if (eb==0xFF) return b; if (!ea && !ma) return b; if (!eb && !mb) return a; if (ea) ma|=0x80; if (eb) mb|=0x80; int16_t ed=ea-eb, re; uint16_t rs; uint32_t rm; if (ed>0){ re=ea; if (ed>8) return a; mb >>= ed; } else if (ed<0){ re=eb; if (ed<-8) return b; ma >>= -ed; } else re=ea; if (sa==sb){ rs=sa; rm=(uint32_t)ma+mb; if (rm&0x100){ rm>>=1; if (++re>=0xFF) return (bf16_t){.bits=(rs<<15)|0x7F80}; } } else { if (ma>=mb){ rs=sa; rm=ma-mb; } else { rs=sb; rm=mb-ma; } if (!rm) return BF16_ZERO(); while (!(rm&0x80)){ rm<<=1; if (--re<=0) return BF16_ZERO(); } } return (bf16_t){ .bits=(rs<<15)|((re&0xFF)<<7)|(rm&0x7F) }; } static inline bf16_t bf16_sub(bf16_t a, bf16_t b) { b.bits ^= BF16_SIGN_MASK; // flip sign return bf16_add(a, b); } ``` #### Assembly code ```asm # Extract fields srli t0, a0, 15; andi t0, t0, 1 # sign_a srli t1, a1, 15; andi t1, t1, 1 # sign_b srli t2, a0, 7; andi t2, t2, 0xFF # exp_a srli t3, a1, 7; andi t3, t3, 0xFF # exp_b andi t4, a0, 0x7F # mant_a andi t5, a1, 0x7F # mant_b # Specials (NaN/Inf/0) — see full routine # ... # Align exponents sub a2, t2, t3 beq a2, x0, 1f blt x0, a2, gt_path # exp_a > exp_b # exp_a < exp_b neg a3, a2 li t6, 8 blt t6, a3, ret_b srl t4, t4, a3 # mant_a >>= -diff mv t6, t3 # result_exp=exp_b j aligned gt_path: li t6, 8 blt t6, a2, ret_a srl t5, t5, a2 # mant_b >>= diff mv t6, t2 aligned: # Same sign → add → normalize on carry # Diff sign → subtract → left-normalize until hidden 1 # ... # Pack slli a3, a3, 15 andi a5, t6, 0xFF; slli a5, a5, 7 andi a4, a4, 0x7F or a3, a3, a5 or a0, a3, a4 ret ``` **Rounding note**: the final mantissa is truncated to 7 bits (no GRS), i.e. fraction rounding is effectively toward-zero. To switch to RNE, add guard/round/sticky before packing. --- ### bf16_mul - Handle NaN/Inf/0 first (Inf*0 ⇒ NaN etc.). - Normalize mantissas: add hidden-1 for normals; left-normalize subnormals and track an exponent adjust. - Do integer 8×8 multiply (shift-add). - Exponent: (ea + eb + subnorm_adjust) − bias. - Normalize product: choose >>8 or >>7 to land 7-bit mantissa; bump exp on >>8. - Overflow ⇒ ±Inf; underflow ⇒ ±0 (or denormal if near). #### C code ```c static inline bf16_t bf16_mul(bf16_t a, bf16_t b) { uint16_t sa=(a.bits>>15)&1, sb=(b.bits>>15)&1; int16_t ea=(a.bits>>7)&0xFF, eb=(b.bits>>7)&0xFF; uint16_t ma=a.bits&0x7F, mb=b.bits&0x7F; uint16_t rs = sa ^ sb; if (ea==0xFF){ if (ma) return a; if (!eb && !mb) return BF16_NAN(); return (bf16_t){.bits=(rs<<15)|0x7F80}; } if (eb==0xFF){ if (mb) return b; if (!ea && !ma) return BF16_NAN(); return (bf16_t){.bits=(rs<<15)|0x7F80}; } if ((!ea && !ma) || (!eb && !mb)) return (bf16_t){.bits=rs<<15}; int16_t adj=0; if (!ea){ while(!(ma&0x80)){ ma<<=1; adj--; } ea=1; } else ma|=0x80; if (!eb){ while(!(mb&0x80)){ mb<<=1; adj--; } eb=1; } else mb|=0x80; uint32_t prod=(uint32_t)ma*mb; int32_t re=(int32_t)ea+eb-BF16_EXP_BIAS+adj; uint32_t mm; if (prod & 0x8000){ mm=(prod>>8)&0x7F; re++; } else mm=(prod>>7)&0x7F; if (re>=0xFF) return (bf16_t){.bits=(rs<<15)|0x7F80}; if (re<=0) { if (re<-6) return (bf16_t){.bits=rs<<15}; mm >>= (1-re); re=0; } return (bf16_t){.bits=(rs<<15)|((re&0xFF)<<7)|(mm&0x7F)}; } ``` #### Assembly code ```asm # prod = mant_a * mant_b via shift-add (8 steps) mv a3, zero # prod mv a5, t6 # multiplier (mant_b) mv a2, t5 # multiplicand (mant_a) li t0, 8 Lmul_loop: andi t1, a5, 1 beqz t1, 1f add a3, a3, a2 1: slli a2, a2, 1 srli a5, a5, 1 addi t0, t0, -1 bnez t0, Lmul_loop ``` --- ### bf16_div - Normalize mantissas (add hidden-1 for normals). - Restoring division over 16 steps to form a 16-bit quotient accumulator. - Exponent: (ea − eb) + bias with subnormal adjustments. - Normalize quotient to have MSB in bit15; then take mantissa from >>8. - Overflow ⇒ ±Inf; exp<=0 ⇒ signed zero; else pack. #### C code ```c static inline bf16_t bf16_div(bf16_t a, bf16_t b) { uint16_t sa=(a.bits>>15)&1, sb=(b.bits>>15)&1; int16_t ea=(a.bits>>7)&0xFF, eb=(b.bits>>7)&0xFF; uint16_t ma=a.bits&0x7F, mb=b.bits&0x7F; uint16_t rs = sa ^ sb; if (eb==0xFF){ if (mb) return b; if (ea==0xFF && !ma) return BF16_NAN(); return (bf16_t){.bits=rs<<15}; } if (!eb && !mb){ if (!ea && !ma) return BF16_NAN(); return (bf16_t){.bits=(rs<<15)|0x7F80}; } if (ea==0xFF){ if (ma) return a; return (bf16_t){.bits=(rs<<15)|0x7F80}; } if (!ea && !ma) return (bf16_t){.bits=rs<<15}; if (ea) ma|=0x80; if (eb) mb|=0x80; uint32_t dividend=((uint32_t)ma)<<15, divisor=mb, q=0; for (int i=0;i<16;i++){ q <<= 1; if (dividend >= (divisor << (15-i))){ dividend -= (divisor << (15-i)); q |= 1; } } int32_t re = (int32_t)ea - eb + BF16_EXP_BIAS; if (!ea) re--; if (!eb) re++; if (q & 0x8000) q >>= 8; else { while (!(q & 0x8000) && re > 1){ q <<= 1; re--; } q >>= 8; } q &= 0x7F; if (re>=0xFF) return (bf16_t){.bits=(rs<<15)|0x7F80}; if (re<=0) return (bf16_t){.bits=rs<<15}; return (bf16_t){.bits=(rs<<15)|((re&0xFF)<<7)|(q&0x7F)}; } ``` #### Assembly code ```asm # quotient = (mant_a << 15) / mant_b slli a2, t4, 15 # remainder (dividend) mv a3, t5 # divisor li a4, 0 # quotient li a5, 0 # i=0 Ldiv_loop: li a7, 16 bge a5, a7, Ldiv_done slli a4, a4, 1 li a7, 15 sub a7, a7, a5 sll a7, a3, a7 bltu a2, a7, 1f sub a2, a2, a7 ori a4, a4, 1 1: addi a5, a5, 1 j Ldiv_loop Ldiv_done: ``` --- ### bf16_sqrt - Specials: sqrt(NaN)=NaN, sqrt(+Inf)=+Inf, sqrt(-Inf)=NaN, sqrt(±0)=0, negative finite ⇒ NaN. - Build integer mantissa m = 0x80 | mant. - Half the unbiased exponent: if odd, m <<= 1, new_exp=((e-1)/2)+bias; else new_exp=(e/2)+bias. - Binary search integer root result s.t. ((result*result)>>7) <= m (use shift-add multiply). - Normalize result into [128, 255]; adjust new_exp. - Overflow/underflow checks; pack. #### C code ```c static inline bf16_t bf16_sqrt(bf16_t a) { uint16_t s=(a.bits>>15)&1; int16_t e=(a.bits>>7)&0xFF; uint16_t m=a.bits&0x7F; if (e==0xFF){ if (m) return a; if (s) return BF16_NAN(); return a; } if (!e && !m) return BF16_ZERO(); if (s) return BF16_NAN(); if (!e) return BF16_ZERO(); int32_t ue = e - BF16_EXP_BIAS, new_exp; uint32_t mm = 0x80 | m; // [128,256) if (ue & 1){ mm <<= 1; new_exp=((ue-1)>>1)+BF16_EXP_BIAS; } else new_exp=(ue>>1)+BF16_EXP_BIAS; uint32_t low=90, high=256, result=128; while (low <= high){ uint32_t mid=(low+high)>>1; uint32_t sq=(mid*mid)/128; // >>7 if (sq <= mm){ result=mid; low=mid+1; } else high=mid-1; } if (result >= 256){ result >>= 1; new_exp++; } else if (result < 128){ while (result < 128 && new_exp > 1){ result <<= 1; new_exp--; } } uint16_t new_mant = result & 0x7F; if (new_exp >= 0xFF) return (bf16_t){.bits=0x7F80}; if (new_exp <= 0) return BF16_ZERO(); return (bf16_t){.bits=((new_exp&0xFF)<<7)|new_mant}; } ``` #### Assembly code ```asm # m in s2, new_exp in s3; result in t5 li s4, 90 # low li t6, 256 # high li t5, 128 # result sqrt_bsearch_loop: bgt s4, t6, sqrt_bsearch_done add t0, s4, t6 srai t0, t0, 1 # mid # sq = (mid*mid)>>7 via shift-add multiply mv a2, t0; mv a3, t0 li t2, 0; li s5, 16 sqrt_imul_loop: andi a4, a3, 1 beqz a4, 1f add t2, t2, a2 1: slli a2, a2, 1 srli a3, a3, 1 addi s5, s5, -1 bnez s5, sqrt_imul_loop mv t1, t2 srli t1, t1, 7 # sq bgt t1, s2, sq_gt_m mv t5, t0 # result = mid addi s4, t0, 1 # low = mid+1 j sqrt_bsearch_loop sq_gt_m: addi t6, t0, -1 # high = mid-1 j sqrt_bsearch_loop sqrt_bsearch_done: ``` --- ### Correctness & Trade-offs #### Rounding: - f32_to_bf16 implements RNE (ties-to-even). - Arithmetic ops truncate to 7-bit mantissa (no G/R/S), effectively toward-zero. ### Result ![image](https://hackmd.io/_uploads/SyviWF5Tel.png) ![image](https://hackmd.io/_uploads/Hyth-Y5age.png) ![image](https://hackmd.io/_uploads/Hk06bt5pxg.png) ![image](https://hackmd.io/_uploads/BywbzF9pgl.png) ![image](https://hackmd.io/_uploads/BkPmzYc6ee.png) ![image](https://hackmd.io/_uploads/BkwPGY96ex.png) ![image](https://hackmd.io/_uploads/HJuFMt5Txl.png) ![image](https://hackmd.io/_uploads/r1t9zK56xe.png) ![image](https://hackmd.io/_uploads/BJ_sfK5all.png) --- ## Problem Description: Next Power of Two Compute the smallest power of two greater than or equal to a 32-bit unsigned integer `x`. Return that power of two as a `uint32_t`. This operation shows how `clz` (count-leading-zeros) turns a bit-level insight into a constant-time primitive used in memory allocators, GPU buffers, ring buffers, and alignment routines. We’ll reuse the `clz32` routine from Quiz 1 and show a clean RV32I implementation calling it. All implementations in this article were carried out using the Ripes simulator with a 5-stage pipeline design. ### Concept To find the smallest power of two ≥ `x`, observe: - `floor_log2(x) = 31 - clz(x)` (for nonzero `x`) - `next_pow2(x) = 1 << (32 - clz(x - 1))` Example mapping: | x | Binary (32-bit) | clz(x-1) | Next Power of Two | |----|------------------|----------|-------------------| | 1 | 000…0001 | 31 | 1 | | 2 | 000…0010 | 30 | 2 | | 3 | 000…0011 | 30 | 4 | | 5 | 000…0101 | 29 | 8 | | 9 | 000…1001 | 28 | 16 | | 17 | 0001…0001 | 27 | 32 | Intuition: - If `x` is already a power of two, then `x-1` fills all lower bits with `1`. - `clz(x-1)` gives the position distance to the top `1`. - Shifting `1u << (32 - clz(x-1))` lands exactly on the next power-of-two bit. ### C Implementations #### clz32 (binary search style) + next_power_of_two ```c #include <stdio.h> #include <stdint.h> static inline unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return n - x; } // ------------------------------------------------------ // Compute Next Power of Two using CLZ (32-bit) // ------------------------------------------------------ static inline uint32_t next_power_of_two(uint32_t x) { if (x == 0u) return 1u; unsigned lz = clz(x - 1u); // clz(x-1) return 1u << (32u - lz); // 1 << (32 - clz(x-1)) } int main(void) { const uint32_t in[] = {0,1,2,3,5,9,17,33}; const uint32_t exp[] = {1,1,2,4,8,16,32,64}; const int N = (int)(sizeof(in)/sizeof(in[0])); int pass = 0; puts("=== Next Power of Two ==="); for (int i = 0; i < N; ++i) { uint32_t got = next_power_of_two(in[i]); if (got == exp[i]) { printf("PASS %u -> %u\n", in[i], got); ++pass; } else { printf("FAIL %u exp=%u got=%u\n", in[i], exp[i], got); } } printf("PASS %d/%d\n", pass, N); return 0; } ``` result: ![image](https://hackmd.io/_uploads/SyE-gE9axl.png) ![image](https://hackmd.io/_uploads/r1TWxEqagx.png) Explanation (why it works): - If `x` is power of two → `(x-1)` sets all less-significant bits to 1. - `clz(x-1)` pinpoints how far the highest `1` is from the MSB. - Shift a single `1` to that next position to get the next power of two. Examples: - `next_power_of_two(9) → 16` - `next_power_of_two(17) → 32` - `next_power_of_two(33) → 64` ### RISC-V Assembly Implementation #### clz32 (binary search style) + next_power_of_two ```asm .text .globl main # ===================================================== # main — Next Power of Two test # ===================================================== main: li s1, 0 # pass_count = 0 la a0, STR_HDR # print header li a7, 4 ecall la s2, npt_cases_in # inputs ptr la s3, npt_cases_exp # expected ptr lw s4, 0(s2) # N addi s2, s2, 4 addi s3, s3, 4 li s0, 0 # i = 0 Loop: beq s0, s4, Summary lw a0, 0(s2) # a0 = x mv s5, a0 # keep original x in s5 (callee won't clobber) jal ra, next_power_of_two mv t1, a0 # got lw t2, 0(s3) # exp bne t1, t2, PrintFail # PASS la a0, STR_PASS li a7, 4 ecall li a7, 1 # print x (from s5) mv a0, s5 ecall la a0, STR_ARROW li a7, 4 ecall li a7, 1 # print got mv a0, t1 ecall la a0, STR_NL li a7, 4 ecall addi s1, s1, 1 # pass_count++ j Cont PrintFail: la a0, STR_FAIL li a7, 4 ecall li a7, 1 mv a0, s5 # print x (from s5) ecall la a0, STR_ARROW li a7, 4 ecall li a7, 1 mv a0, t1 # print got ecall la a0, STR_NL li a7, 4 ecall Cont: addi s0, s0, 1 addi s2, s2, 4 addi s3, s3, 4 j Loop Summary: la a0, STR_PASS_WORD li a7, 4 ecall li a7, 1 mv a0, s1 ecall la a0, STR_SLASH li a7, 4 ecall li a7, 1 # print N_TOTAL la a0, N_TOTAL lw a0, 0(a0) ecall la a0, STR_NL2 li a7, 4 ecall li a7, 10 ecall # ===================================================== # next_power_of_two(x) # a0: x -> a0: next pow2 >= x # ===================================================== .globl next_power_of_two next_power_of_two: beq a0, x0, NPT_ret1 # if x==0 -> 1 mv t3, ra # save caller's ra addi a0, a0, -1 # a0 = x - 1 jal ra, clz32 # a0 = clz(x-1) mv ra, t3 # restore ra li t0, 32 sub t0, t0, a0 # t0 = 32 - clz(x-1) li a0, 1 sll a0, a0, t0 # a0 = 1 << (32 - clz(x-1)) ret NPT_ret1: li a0, 1 ret # ===================================================== # clz32(x) # a0: x -> a0: clz(x) # ===================================================== .globl clz32 clz32: li t0, 32 # n = 32 li t1, 16 # c = 16 CLZ_loop: srl t2, a0, t1 # y = x >> c beq t2, x0, CLZ_skip sub t0, t0, t1 # n -= c mv a0, t2 # x = y CLZ_skip: srli t1, t1, 1 # c >>= 1 bnez t1, CLZ_loop sub a0, t0, a0 # return n - x ret # ===================================================== # Data # ===================================================== .data N_TOTAL: .word 8 npt_cases_in: .word 8 .word 0,1,2,3,5,9,17,33 npt_cases_exp: .word 8 .word 1,1,2,4,8,16,32,64 STR_HDR: .string "=== Next Power of Two ===\n" STR_PASS: .string "PASS " STR_FAIL: .string "FAIL " STR_PASS_WORD: .string "PASS " STR_SLASH: .string "/" STR_ARROW: .string " -> " STR_NL: .string "\n" STR_NL2: .string "\n" ``` result: ![image](https://hackmd.io/_uploads/rkbCy45agg.png) ![image](https://hackmd.io/_uploads/HyT01Eq6xg.png) --- #### Test Data Example ```asm .data npt_cases_in: .word 8 .word 0,1,2,3,5,9,17,33 npt_cases_exp: .word 8 .word 1,1,2,4,8,16,32,64 ``` --- #### Running in Ripes Simulator (RV32I, 5-stage) ![Screenshot 2025-10-12 153249](https://hackmd.io/_uploads/B1PnKAuTxx.png) Expected Console Output: ```text === Next Power of Two === PASS 0 -> 1 PASS 1 -> 1 PASS 2 -> 2 PASS 3 -> 4 PASS 5 -> 8 PASS 9 -> 16 PASS 17 -> 32 PASS 33 -> 64 PASS 8/8 ``` --- #### Pipeline Visualization Consider the final shift that materializes the result: ```asm sll a0, a0, t0 # a0 = 1 << (32 - clz(x-1)) ``` - IF: PC shows the next sequential address; instruction memory outputs the `SLL` word. ![image](https://hackmd.io/_uploads/rk-WZ49Tel.png) 1. The address from which this cycle’s instruction is being read. 1. PC register value = PC_next = PC_fetch + 4 = 0x00000174 1. Fetched instruction word = 0x00551533 As derived above for sll a0, a0, t0. - ID: `rs1=a0` (initialized to 1), `rs2=t0` (shift amount), `funct3=001` (SLL), `opcode=OP`. ![image](https://hackmd.io/_uploads/S18zWVcagg.png) 1. funct7=0000000 (0)、rs2=x5 (t0=5)、rs1=x10 (a0=10)、funct3=001 (SLL)、rd=x10 (10)、opcode=0110011 (0x33) → 0x00551533 1. The Reg1 and Reg2 value is both 0x00000020, the actual shift amount for this SLL is the forwarded result of sub t0, t0, a0 (computed in the prior ALU stage) - EX: Barrel shifter applies variable shift from `t0`; result becomes the next power of two. ![image](https://hackmd.io/_uploads/S12N-E5axe.png) 1. Operand A = a0 = 0x00000001 1. The base value to be shifted. 1. Operand B = effective shift amount = (t0 & 0x1F) 1. Important: the EX forwarding mux picks up the new t0 (the output of sub t0, t0, a0 from the previous cycle). Thus, in EX you should see the correct shift distance. 1. ALU result = a0 << (t0 & 0x1F) - MEM: No data memory access for SLL. ![image](https://hackmd.io/_uploads/SJei-49pxg.png) - WB: Result written back to `a0`; no hazards if `t0` was produced in the prior ALU stage with proper forwarding. ![image](https://hackmd.io/_uploads/SkHhZ4qaee.png) 1. Destination register = rd = x10 (a0). 1. Selected value = ALU result (not PC+4). 1. Final write: a0 ← next_power_of_two(x) per the table above. --- ✅ Results Summary | Input | Expected | Computed | |-------|----------|----------| | 0 | 1 | 1 | | 1 | 1 | 1 | | 2 | 2 | 2 | | 3 | 4 | 4 | | 5 | 8 | 8 | | 9 | 16 | 16 | | 17 | 32 | 32 | | 33 | 64 | 64 | All cases PASS ✅ ### LeetCode Context Although LeetCode doesn’t have a dedicated “Next Power of Two” problem, it appears frequently inside system design and optimization tasks: | Problem ID | Title | Relation | |------------|-----------------------|-------------------------------------------------------| | 231 | Power of Two | Core idea (single-bit property) | | 707 / 622 | Design Linked List / Circular Queue | Choose capacity = next power of two to use `& (cap-1)` indexing | | 191 | Number of 1 Bits | Same bitwise family | | 50 | Pow(x, n) | Power computation & bit manipulation | Additionally, this GitHub repo contains my implementation and a few optimizations for LeetCode 231 (Power of Two). I had planned to center this write-up on that problem, but because it doesn’t make direct use of clz, I pivoted to Next Power of Two instead. Example (circular queue wrap-around optimization): ```c index = (index + 1) & (capacity - 1); // works only if capacity is a power of two ``` ### Discussion and Conclusion `next_power_of_two` is a compact, high-performance primitive built on `clz`. It converts an arbitrary size `x` to the smallest power of two ≥ `x`, which is exactly what alignment and capacity sizing need. The core idea: compute `x−1`, use `clz(x−1)` to locate the top bit, then shift 1 to that position to materialize the aligned capacity. - With hardware CLZ (e.g., Zbb), this becomes just a handful of cycles. - Even in pure RV32I, a software `clz32` plus a few ALU ops provides a fast, branch-light primitive. Common uses include: - Memory allocation (aligning heap blocks) - Graphics/compute pipelines (buffer sizing) - FFT / audio processing (rounding sample counts) ### References - [LeetCode 231: Power of Two](https://leetcode.com/problems/power-of-two/) - RISC-V ISA Manual (RV32I) - Bit Twiddling Hacks (next power of two via clz/bit-spreading) ## Two-Point Minimum Distance (BF16) ### Objective Given two point sets A and B (point coordinates stored in bf16 encoding), calculate the minimum cross-set distance: $$\min_{p \in A, q \in B} \sqrt{(q_x - p_x)^2 + (q_y - p_y)^2}$$ Output results for each test case, compare with "expected answers", and print PASS/FAIL. ### Main Flow 1. Read n_a, n_b and the starting addresses of point arrays A and B. 2. Call `find_min_distance(n_a, n_b, ptrA, ptrB)` to get the minimum distance (bf16 bits). 3. Navigate to `expected_min_dist` at the end of the case and perform bitwise equality comparison with the result: - Equal → Print PASS - Otherwise → Print FAIL (exp: ..., got: ...) 4. Move to the beginning of the next test case until all are completed. ### Code Note: The full implementation is on GitHub ```assembly .data # ========== Message Strings ========== start_msg: .string "===== Two-Point Minimum Distance Calculation (BF16) =====" end_msg: .string "===== Calculation Complete =====" input_msg: .string "Test Case " result_msg: .string "Minimum Distance: " newline: .string "\n" pass_msg: .string "PASS\n" fail_msg_prefix: .string "FAIL (exp: " fail_msg_mid: .string ", got: " fail_msg_tail: .string ")\n" arrow_msg: .string " -> " # ========== Test Data (Examples) ========== test_data: .word 3 # 3 test cases # Test Case 1: A={(0,0), (1,0)}, B={(0,1), (1,1)}, exp = 1.0 (0x3F80) .word 2 # n_a .word 2 # n_b # A .word 0x0000, 0x0000 # (0, 0) .word 0x3F80, 0x0000 # (1, 0) # B .word 0x0000, 0x3F80 # (0, 1) .word 0x3F80, 0x3F80 # (1, 1) # exp .word 0x3F80 # Test Case 2: A={(0,0)}, B={(3,4)}, exp = 5.0 (0x40A0) .word 1 # n_a .word 1 # n_b # A .word 0x0000, 0x0000 # (0, 0) # B .word 0x4040, 0x4080 # (3, 4) # exp .word 0x40A0 # Test Case 3: A={(0,0), (2,0)}, B={(1,1), (3,1)}, exp = sqrt(2) ≈ 1.414 (0x3FB5) .word 2 # n_a .word 2 # n_b # A .word 0x0000, 0x0000 # (0, 0) .word 0x4000, 0x0000 # (2, 0) # B .word 0x3F80, 0x3F80 # (1, 1) .word 0x4040, 0x3F80 # (3, 1) # exp .word 0x3FB5 .text .globl main # ========================================== # main # ========================================== main: # Print header la a0, start_msg li a7, 4 ecall li a0, 10 # '\n' li a7, 11 ecall la s0, test_data lw s1, 0(s0) # s1 = number of test cases addi s0, s0, 4 li s2, 0 # s2 = current case index main_test_loop: bge s2, s1, main_done # Print "Test Case i" la a0, input_msg # "Test Case " li a7, 4 ecall addi a0, s2, 1 li a7, 1 # print int ecall li a0, 10 li a7, 11 # print char '\n' ecall # Read case header lw a0, 0(s0) # n_a lw a1, 4(s0) # n_b addi a2, s0, 8 # A array starting address # B array starting address = A start + n_a * 8 bytes slli t0, a0, 3 # n_a * 8 add a3, a2, t0 # Preserve outer pointers addi sp, sp, -16 sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw ra, 12(sp) # Call minimum distance function jal ra, find_min_distance mv s3, a0 # s3 = calculation result (bf16 bits) # Restore lw s0, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw ra, 12(sp) addi sp, sp, 16 # Calculate expected answer address: # total_points_bytes = (n_a + n_b) * 8 # header 2 words = 8 bytes # t3 = s0 + 8 + (n_a+n_b)*8 lw t0, 0(s0) # n_a lw t1, 4(s0) # n_b add t2, t0, t1 slli t2, t2, 3 # *8 addi t2, t2, 8 # + header(2 words) add t3, s0, t2 lw s4, 0(t3) # s4 = expected answer (bf16 bits) # Print current result and PASS/FAIL la a0, result_msg li a7, 4 ecall mv a0, s3 # got li a7, 1 ecall la a0, arrow_msg # " -> " li a7, 4 ecall beq s3, s4, __print_pass # FAIL: Print "FAIL (exp: <exp>, got: <got>)\n" la a0, fail_msg_prefix li a7, 4 ecall mv a0, s4 # exp li a7, 1 ecall la a0, fail_msg_mid li a7, 4 ecall mv a0, s3 # got li a7, 1 ecall la a0, fail_msg_tail li a7, 4 ecall j __after_passfail __print_pass: la a0, pass_msg li a7, 4 ecall __after_passfail: addi s0, t3, 4 addi s2, s2, 1 j main_test_loop main_done: la a0, end_msg li a7, 4 ecall li a0, 10 li a7, 11 ecall li a0, 0 li a7, 10 # exit ecall # ========================================== # find_min_distance # Parameters: a0=n_a, a1=n_b, a2=ptrA, a3=ptrB # Return: a0 = minimum distance (bf16 bits) # ========================================== find_min_distance: addi sp, sp, -32 sw ra, 28(sp) sw s0, 24(sp) sw s1, 20(sp) sw s2, 16(sp) sw s3, 12(sp) sw s4, 8(sp) sw s5, 4(sp) mv s0, a0 # n_a mv s1, a1 # n_b mv s2, a2 # ptrA mv s3, a3 # ptrB li s4, 0x7F80 # min_dist = +Inf (bf16 bits) li s5, 0 # i find_outer_loop: bge s5, s0, find_done # A[i] address = ptrA + i*8 slli t0, s5, 3 add t0, s2, t0 lw a0, 0(t0) # ax lw a1, 4(t0) # ay li t1, 0 # j find_inner_loop: bge t1, s1, find_next_i # B[j] address = ptrB + j*8 slli t2, t1, 3 add t2, s3, t2 lw a2, 0(t2) # bx lw a3, 4(t2) # by # Save needed registers addi sp, sp, -12 sw a0, 0(sp) sw a1, 4(sp) sw t1, 8(sp) jal ra, calc_distance # a0 = dist (bf16 bits) mv t3, a0 # Restore lw a0, 0(sp) lw a1, 4(sp) lw t1, 8(sp) addi sp, sp, 12 # If dist < min_dist → update bltu t3, s4, find_update_min j find_continue find_update_min: mv s4, t3 find_continue: addi t1, t1, 1 j find_inner_loop find_next_i: addi s5, s5, 1 j find_outer_loop find_done: mv a0, s4 # return min lw ra, 28(sp) lw s0, 24(sp) lw s1, 20(sp) lw s2, 16(sp) lw s3, 12(sp) lw s4, 8(sp) lw s5, 4(sp) addi sp, sp, 32 ret # ========================================== # calc_distance(ax, ay, bx, by) → bf16 distance # ========================================== calc_distance: addi sp, sp, -24 sw ra, 20(sp) sw s0, 16(sp) sw s1, 12(sp) sw s2, 8(sp) sw s3, 4(sp) mv s3, a0 # ax mv s1, a1 # ay mv s2, a3 # by # dx = bx - ax mv a0, a2 # bx mv a1, s3 # ax jal ra, bf16_sub # a0 = dx mv a1, a0 jal ra, bf16_mul # a0 = dx^2 mv s0, a0 # s0 = dx^2 # dy = by - ay mv a0, s2 # by mv a1, s1 # ay jal ra, bf16_sub # a0 = dy mv a1, a0 jal ra, bf16_mul # a0 = dy^2 # dx^2 + dy^2 mv a1, a0 # dy^2 mv a0, s0 # dx^2 jal ra, bf16_add # a0 = sum # sqrt(sum) jal ra, bf16_sqrt # a0 = distance lw ra, 20(sp) lw s0, 16(sp) lw s1, 12(sp) lw s2, 8(sp) lw s3, 4(sp) addi sp, sp, 24 ret ``` ### Result ![image](https://hackmd.io/_uploads/rJ1kioiTge.png) Where 16256=0x3F80(=1.0), 16544=0x40A0(=5.0), 16309≈0x3FB5(≈1.414). ![image](https://hackmd.io/_uploads/HJOyisj6xe.png)