# Assignment 1: RISC-V Assembly and Instruction Pipeline
Contributed by [yuyan7498](https://github.com/yuyan7498)
Code: [Github repository](https://github.com/yuyan7498/ca2025-quizzes)
This repository contains three folders:
* `problem_B_and_c` — RISC-V assembly implementations for Quiz 1, Problem B and Problem C, including both the program sources and their tests.
* `next_power_of_two` — clz implementation: Compute the smallest power of two greater than or equal to a 32-bit unsigned integer
* `dist_2_points` — bfloat16 implementation: Calculate two-Point minimum distance
* * `leetcode231_power_of_two` — extra practice
## Problem B UF8 ⇄ uint32
This write-up explains the overall test harness (main) and the three core routines—`decode`, `encode`, and `clz32`—with design ideas, key code snippets, and quick correctness intuition.
### Target & Data Format
UF8: an 8-bit custom integer encoding, `eeee mmmm`
- High 4 bits `e` = exponent
- Low 4 bits `m` = mantissa
Definition:
```
value = ((0x7FFF >> (15 - e)) << 4) + (m << e)
= 16 * (2^e - 1) + m * 2^e
```
with `e ∈ [0,15]`, `m ∈ [0,15]`.
### clz32
- Goal: return the number of leading zeros of a 32-bit unsigned integer.
- Idea: binary search over 16/8/4/2/1 bit chunks; if the upper chunk is non-zero, shift down and accumulate the bit width. If `a0=0`, return `32`.
#### C code
```c
static inline unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
return n - x;
}
```
#### Assembly code
```asm
# ------------------------------------------------------------
# clz32
# a0=x; returns clz(x). Uses t0=n, t1=c, t2=y
# ------------------------------------------------------------
clz32:
li t0, 32 # n = 32
li t1, 16 # c = 16
1:
srl t2, a0, t1 # y = x >> c
beq t2, x0, 2f # if (!y) skip update
sub t0, t0, t1 # n -= c
mv a0, t2 # x = y
2:
srli t1, t1, 1 # c >>= 1
bnez t1, 1b # while (c)
sub a0, t0, a0 # return n - x
ret
```
### decode
Recall:
```
value = 16 * (2^e - 1) + m * 2^e
```
#### C Code
```c
/* Decode uf8 to uint32_t */
uint32_t uf8_decode(uf8 fl)
{
uint32_t mantissa = fl & 0x0f;
uint8_t exponent = fl >> 4;
uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
return (mantissa << exponent) + offset;
}
```
#### Assembly Code
```asm
decode:
andi t0, a0, 0x0f # m
srli t1, a0, 4 # e
li t2, 15
sub t2, t2, t1 # 15 - e
li t3, 0x7FFF
srl t3, t3, t2 # 0x7FFF >> (15 - e) = (2^e - 1)
slli t3, t3, 4 # *16 => offset
sll t4, t0, t1 # m << e
add a0, t4, t3 # value
ret
```
#### Example (0x52: e=5, m=2)
- `offset = 16*(2^5 − 1) = 496`
- `m<<e = 2*32 = 64`
- `value = 560`
---
### encode
Goal: for a given `value`, find `e, m` s.t.
```
value = 16*(2^e − 1) + m*2^e, with e,m ∈ [0,15].
```
#### Algorithm steps
1) Small fast path: if `value < 16` ⇒ `e=0`, `m=value` (return value).
2) Initial `e` guess:
- `msb = 31 - clz32(value)` → estimate `e ≈ msb - 4` (since `m` is 4-bit). Clamp to `[0,15]`.
3) Build `overflow = 16*(2^e − 1)`:
- Start from 0, loop `e` times with `overflow = (overflow<<1) + 16`.
4) Adjust `e` downward if needed:
- While `e>0 && value < overflow`: rollback one level
- `overflow = (overflow - 16) >> 1`, `e--`.
5) Adjust `e` upward if possible:
- If `value ≥ next_overflow = (overflow<<1)+16` and `e<15`, then promote:
- `overflow = next_overflow`, `e++`.
6) Solve `m`: `m = (value - overflow) >> e`.
7) Pack: `(e<<4) | m`.
#### C code
```c
/* Encode uint32_t to uf8 */
uf8 uf8_encode(uint32_t value)
{
/* Use CLZ for fast exponent calculation */
if (value < 16)
return value;
/* Find appropriate exponent using CLZ hint */
int lz = clz(value);
int msb = 31 - lz;
/* Start from a good initial guess */
uint8_t exponent = 0;
uint32_t overflow = 0;
if (msb >= 5) {
/* Estimate exponent - the formula is empirical */
exponent = msb - 4;
if (exponent > 15)
exponent = 15;
/* Calculate overflow for estimated exponent */
for (uint8_t e = 0; e < exponent; e++)
overflow = (overflow << 1) + 16;
/* Adjust if estimate was off */
while (exponent > 0 && value < overflow) {
overflow = (overflow - 16) >> 1;
exponent--;
}
}
/* Find exact exponent */
while (exponent < 15) {
uint32_t next_overflow = (overflow << 1) + 16;
if (value < next_overflow)
break;
overflow = next_overflow;
exponent++;
}
uint8_t mantissa = (value - overflow) >> exponent;
return (exponent << 4) | mantissa;
}
```
#### Assembly code
```asm
encode:
# save ra/s0
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
mv s0, a0 # value
# 1) small fast path
li t0, 16
bltu s0, t0, EncodeRetShort
# 2) e ≈ msb - 4 (msb = 31 - clz)
mv a0, s0
jal ra, clz32
li t5, 31
sub t5, t5, a0 # msb
addi t1, t5, -4 # e
li t6, 15
bgtu t1, t6, ClipTo15
NoClip:
# 3) overflow = 16*(2^e - 1)
li t2, 0
li t3, 0
ForELoop:
beq t3, t1, ForEDone
slli t2, t2, 1
addi t2, t2, 16
addi t3, t3, 1
j ForELoop
ForEDone:
# 4) adjust down if value < overflow
AdjLoop:
beqz t1, AdjDone
bltu s0, t2, AdjStep
j AdjDone
AdjStep:
addi t2, t2, -16
srli t2, t2, 1
addi t1, t1, -1
j AdjLoop
AdjDone:
# 5) adjust up if value ≥ next_overflow
WhileLoop:
li t6, 15
beq t1, t6, WhileDone
slli t3, t2, 1
addi t3, t3, 16 # next_overflow
bltu s0, t3, WhileDone
mv t2, t3
addi t1, t1, 1
j WhileLoop
WhileDone:
# 6) m = (value - overflow) >> e
sub t3, s0, t2
srl t3, t3, t1
# 7) (e<<4) | m
slli t4, t1, 4
or a0, t4, t3
j EncodeRet
EncodeRetShort:
mv a0, s0 # e=0, m=value
EncodeRet:
lw ra, 12(sp)
lw s0, 8(sp)
addi sp, sp, 16
ret
ClipTo15:
li t1, 15
j NoClip
```
#### Why m ∈ [0,15] holds
After adjustment, `overflow ≤ value < next_overflow`, where
`next_overflow = overflow + 16*2^e`.
Let `Δ = value − overflow`, then `0 ≤ Δ < 16*2^e`.
Thus `m = (Δ >> e)` ⇒ `0 ≤ m < 16`.
#### Example (`value=560`)
- `clz32(560)=22` → `msb=9` → estimate `e=5`
- `overflow = 16*(2^5−1)=496`, `next_overflow=496+512=1008`
- `overflow ≤ 560 < next_overflow` (no further adjust)
- `m = (560−496)>>5 = 64>>5 = 2`
- result `(e<<4)|m = 0x52`
#### result




## Problem C — BF16 Arithmetic
### Bit Layout & Constants (bf16)
- **sign**: bit 15
- **exponent**: bits 14..7 (8 bits), bias = 127
- **mantissa**: bits 6..0 (7 bits)
#### Key masks & special values:
- `BF16_SIGN_MASK = 0x8000`
- `BF16_EXP_MASK = 0x7F80`
- `BF16_MANT_MASK = 0x007F`
- `+Inf = 0x7F80`, `-Inf = 0xFF80`, `NaN (canonical) = 0x7FC0`
- `+0 = 0x0000`, `-0 = 0x8000`
---
### bf16_isnan
**Idea**: In IEEE formats, exp=all-ones means NaN or Inf; mantissa≠0 ⇒ NaN.
#### C code
```c
static inline bool bf16_isnan(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
(a.bits & BF16_MANT_MASK);
}
```
#### Assembly code
```asm
# return ((bits & 0x7F80)==0x7F80) && (bits & 0x007F)
bf16_isnan:
li t2, 0x7F80
and t0, a0, t2
bne t0, t2, Lnan_false
li t3, 0x007F
and t1, a0, t3
beq t1, x0, Lnan_false
li a0, 1
ret
Lnan_false:
mv a0, x0
ret
```
---
### bf16_isinf
**Idea**: exp=all-ones and mantissa=0 ⇒ Inf.
#### C code
```c
static inline bool bf16_isinf(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
!(a.bits & BF16_MANT_MASK);
}
```
#### Assembly code
```asm
# return ((bits & 0x7F80)==0x7F80) && !(bits & 0x007F)
bf16_isinf:
li t2, 0x7F80
and t0, a0, t2
bne t0, t2, Linf_false
li t3, 0x007F
and t1, a0, t3
bne t1, x0, Linf_false
li a0, 1
ret
Linf_false:
mv a0, x0
ret
```
---
### bf16_iszero
**Idea**: ±0 are zeros; ignore sign bit.
#### C code
```c
static inline bool bf16_iszero(bf16_t a)
{
return !(a.bits & 0x7FFF);
}
```
#### Assembly code
```asm
# return !(bits & 0x7FFF)
bf16_iszero:
li t0, 0x7FFF
and t0, a0, t0
bne t0, x0, Lzero_false
li a0, 1
ret
Lzero_false:
mv a0, x0
ret
```
---
### f32_to_bf16 (RNE, ties-to-even)
#### Idea:
- If exp==0xFF (Inf/NaN), pass-through high16 (preserves NaN payload).
- Else round-to-nearest-even (RNE) by +0x7FFF and +(high16&1) before chopping.
#### C code
```c
static inline bf16_t f32_to_bf16(float val)
{
uint32_t f32bits; memcpy(&f32bits, &val, sizeof(float));
if (((f32bits >> 23) & 0xFF) == 0xFF)
return (bf16_t){ .bits = (f32bits >> 16) & 0xFFFF };
f32bits += ((f32bits >> 16) & 1) + 0x7FFF; // RNE ties-to-even
return (bf16_t){ .bits = f32bits >> 16 };
}
```
#### Assembly code
```asm
# if exp==0xFF: return high16(f32bits)
# else: RNE by adding ((high16&1)+0x7FFF), then return high16
f32_to_bf16:
srli t0, a0, 23
andi t0, t0, 0xFF
li t1, 0xFF
beq t0, t1, Lfte_special
srli t2, a0, 16 # high16
andi t2, t2, 1 # (high16 & 1)
li t3, 0x7FFF
add t2, t2, t3 # increment for RNE
add a0, a0, t2
srli a0, a0, 16
ret
Lfte_special:
srli a0, a0, 16
andi a0, a0, 0xFFFF
ret
```
---
### bf16_to_f32
**Idea**: Widen: place bf16 in the high 16 bits of a float's 32-bit layout.
#### C code
```c
static inline float bf16_to_f32(bf16_t val)
{
uint32_t f32bits = ((uint32_t)val.bits) << 16;
float out; memcpy(&out, &f32bits, sizeof(out));
return out;
}
```
#### Assembly code
```asm
# return ((uint32_t)bits)<<16
bf16_to_f32:
andi a0, a0, 0xFFFF
slli a0, a0, 16
ret
```
---
### bf16_add / bf16_sub
- Special cases first: NaN propagation; Inf rules; zeros shortcuts.
- Align exponents: shift the smaller mant right; large gaps short-circuit.
- Same sign: add mant, handle carry (>>1; exp++), overflow ⇒ ±Inf.
- Different signs: subtract big–small; if zero ⇒ signed zero; else left-normalize until hidden-1.
- Pack: (sign<<15) | ((exp&0xFF)<<7) | (mant&0x7F).
#### C code
```c
static inline bf16_t bf16_add(bf16_t a, bf16_t b) {
uint16_t sa=(a.bits>>15)&1, sb=(b.bits>>15)&1;
int16_t ea=(a.bits>>7)&0xFF, eb=(b.bits>>7)&0xFF;
uint16_t ma=a.bits&0x7F, mb=b.bits&0x7F;
if (ea==0xFF) { if (ma) return a; if (eb==0xFF) return (mb||sa==sb)? b: BF16_NAN(); return a; }
if (eb==0xFF) return b;
if (!ea && !ma) return b;
if (!eb && !mb) return a;
if (ea) ma|=0x80; if (eb) mb|=0x80;
int16_t ed=ea-eb, re; uint16_t rs; uint32_t rm;
if (ed>0){ re=ea; if (ed>8) return a; mb >>= ed; }
else if (ed<0){ re=eb; if (ed<-8) return b; ma >>= -ed; }
else re=ea;
if (sa==sb){
rs=sa; rm=(uint32_t)ma+mb;
if (rm&0x100){ rm>>=1; if (++re>=0xFF) return (bf16_t){.bits=(rs<<15)|0x7F80}; }
} else {
if (ma>=mb){ rs=sa; rm=ma-mb; } else { rs=sb; rm=mb-ma; }
if (!rm) return BF16_ZERO();
while (!(rm&0x80)){ rm<<=1; if (--re<=0) return BF16_ZERO(); }
}
return (bf16_t){ .bits=(rs<<15)|((re&0xFF)<<7)|(rm&0x7F) };
}
static inline bf16_t bf16_sub(bf16_t a, bf16_t b) {
b.bits ^= BF16_SIGN_MASK; // flip sign
return bf16_add(a, b);
}
```
#### Assembly code
```asm
# Extract fields
srli t0, a0, 15; andi t0, t0, 1 # sign_a
srli t1, a1, 15; andi t1, t1, 1 # sign_b
srli t2, a0, 7; andi t2, t2, 0xFF # exp_a
srli t3, a1, 7; andi t3, t3, 0xFF # exp_b
andi t4, a0, 0x7F # mant_a
andi t5, a1, 0x7F # mant_b
# Specials (NaN/Inf/0) — see full routine
# ...
# Align exponents
sub a2, t2, t3
beq a2, x0, 1f
blt x0, a2, gt_path # exp_a > exp_b
# exp_a < exp_b
neg a3, a2
li t6, 8
blt t6, a3, ret_b
srl t4, t4, a3 # mant_a >>= -diff
mv t6, t3 # result_exp=exp_b
j aligned
gt_path:
li t6, 8
blt t6, a2, ret_a
srl t5, t5, a2 # mant_b >>= diff
mv t6, t2
aligned:
# Same sign → add → normalize on carry
# Diff sign → subtract → left-normalize until hidden 1
# ...
# Pack
slli a3, a3, 15
andi a5, t6, 0xFF; slli a5, a5, 7
andi a4, a4, 0x7F
or a3, a3, a5
or a0, a3, a4
ret
```
**Rounding note**: the final mantissa is truncated to 7 bits (no GRS), i.e. fraction rounding is effectively toward-zero. To switch to RNE, add guard/round/sticky before packing.
---
### bf16_mul
- Handle NaN/Inf/0 first (Inf*0 ⇒ NaN etc.).
- Normalize mantissas: add hidden-1 for normals; left-normalize subnormals and track an exponent adjust.
- Do integer 8×8 multiply (shift-add).
- Exponent: (ea + eb + subnorm_adjust) − bias.
- Normalize product: choose >>8 or >>7 to land 7-bit mantissa; bump exp on >>8.
- Overflow ⇒ ±Inf; underflow ⇒ ±0 (or denormal if near).
#### C code
```c
static inline bf16_t bf16_mul(bf16_t a, bf16_t b)
{
uint16_t sa=(a.bits>>15)&1, sb=(b.bits>>15)&1;
int16_t ea=(a.bits>>7)&0xFF, eb=(b.bits>>7)&0xFF;
uint16_t ma=a.bits&0x7F, mb=b.bits&0x7F;
uint16_t rs = sa ^ sb;
if (ea==0xFF){ if (ma) return a; if (!eb && !mb) return BF16_NAN();
return (bf16_t){.bits=(rs<<15)|0x7F80}; }
if (eb==0xFF){ if (mb) return b; if (!ea && !ma) return BF16_NAN();
return (bf16_t){.bits=(rs<<15)|0x7F80}; }
if ((!ea && !ma) || (!eb && !mb)) return (bf16_t){.bits=rs<<15};
int16_t adj=0;
if (!ea){ while(!(ma&0x80)){ ma<<=1; adj--; } ea=1; } else ma|=0x80;
if (!eb){ while(!(mb&0x80)){ mb<<=1; adj--; } eb=1; } else mb|=0x80;
uint32_t prod=(uint32_t)ma*mb;
int32_t re=(int32_t)ea+eb-BF16_EXP_BIAS+adj;
uint32_t mm;
if (prod & 0x8000){ mm=(prod>>8)&0x7F; re++; } else mm=(prod>>7)&0x7F;
if (re>=0xFF) return (bf16_t){.bits=(rs<<15)|0x7F80};
if (re<=0) { if (re<-6) return (bf16_t){.bits=rs<<15};
mm >>= (1-re); re=0; }
return (bf16_t){.bits=(rs<<15)|((re&0xFF)<<7)|(mm&0x7F)};
}
```
#### Assembly code
```asm
# prod = mant_a * mant_b via shift-add (8 steps)
mv a3, zero # prod
mv a5, t6 # multiplier (mant_b)
mv a2, t5 # multiplicand (mant_a)
li t0, 8
Lmul_loop:
andi t1, a5, 1
beqz t1, 1f
add a3, a3, a2
1: slli a2, a2, 1
srli a5, a5, 1
addi t0, t0, -1
bnez t0, Lmul_loop
```
---
### bf16_div
- Normalize mantissas (add hidden-1 for normals).
- Restoring division over 16 steps to form a 16-bit quotient accumulator.
- Exponent: (ea − eb) + bias with subnormal adjustments.
- Normalize quotient to have MSB in bit15; then take mantissa from >>8.
- Overflow ⇒ ±Inf; exp<=0 ⇒ signed zero; else pack.
#### C code
```c
static inline bf16_t bf16_div(bf16_t a, bf16_t b)
{
uint16_t sa=(a.bits>>15)&1, sb=(b.bits>>15)&1;
int16_t ea=(a.bits>>7)&0xFF, eb=(b.bits>>7)&0xFF;
uint16_t ma=a.bits&0x7F, mb=b.bits&0x7F;
uint16_t rs = sa ^ sb;
if (eb==0xFF){ if (mb) return b; if (ea==0xFF && !ma) return BF16_NAN();
return (bf16_t){.bits=rs<<15}; }
if (!eb && !mb){ if (!ea && !ma) return BF16_NAN();
return (bf16_t){.bits=(rs<<15)|0x7F80}; }
if (ea==0xFF){ if (ma) return a;
return (bf16_t){.bits=(rs<<15)|0x7F80}; }
if (!ea && !ma) return (bf16_t){.bits=rs<<15};
if (ea) ma|=0x80; if (eb) mb|=0x80;
uint32_t dividend=((uint32_t)ma)<<15, divisor=mb, q=0;
for (int i=0;i<16;i++){
q <<= 1;
if (dividend >= (divisor << (15-i))){
dividend -= (divisor << (15-i));
q |= 1;
}
}
int32_t re = (int32_t)ea - eb + BF16_EXP_BIAS;
if (!ea) re--; if (!eb) re++;
if (q & 0x8000) q >>= 8;
else {
while (!(q & 0x8000) && re > 1){ q <<= 1; re--; }
q >>= 8;
}
q &= 0x7F;
if (re>=0xFF) return (bf16_t){.bits=(rs<<15)|0x7F80};
if (re<=0) return (bf16_t){.bits=rs<<15};
return (bf16_t){.bits=(rs<<15)|((re&0xFF)<<7)|(q&0x7F)};
}
```
#### Assembly code
```asm
# quotient = (mant_a << 15) / mant_b
slli a2, t4, 15 # remainder (dividend)
mv a3, t5 # divisor
li a4, 0 # quotient
li a5, 0 # i=0
Ldiv_loop:
li a7, 16
bge a5, a7, Ldiv_done
slli a4, a4, 1
li a7, 15
sub a7, a7, a5
sll a7, a3, a7
bltu a2, a7, 1f
sub a2, a2, a7
ori a4, a4, 1
1: addi a5, a5, 1
j Ldiv_loop
Ldiv_done:
```
---
### bf16_sqrt
- Specials: sqrt(NaN)=NaN, sqrt(+Inf)=+Inf, sqrt(-Inf)=NaN, sqrt(±0)=0, negative finite ⇒ NaN.
- Build integer mantissa m = 0x80 | mant.
- Half the unbiased exponent: if odd, m <<= 1, new_exp=((e-1)/2)+bias; else new_exp=(e/2)+bias.
- Binary search integer root result s.t. ((result*result)>>7) <= m (use shift-add multiply).
- Normalize result into [128, 255]; adjust new_exp.
- Overflow/underflow checks; pack.
#### C code
```c
static inline bf16_t bf16_sqrt(bf16_t a)
{
uint16_t s=(a.bits>>15)&1; int16_t e=(a.bits>>7)&0xFF; uint16_t m=a.bits&0x7F;
if (e==0xFF){ if (m) return a; if (s) return BF16_NAN(); return a; }
if (!e && !m) return BF16_ZERO();
if (s) return BF16_NAN();
if (!e) return BF16_ZERO();
int32_t ue = e - BF16_EXP_BIAS, new_exp;
uint32_t mm = 0x80 | m; // [128,256)
if (ue & 1){ mm <<= 1; new_exp=((ue-1)>>1)+BF16_EXP_BIAS; }
else new_exp=(ue>>1)+BF16_EXP_BIAS;
uint32_t low=90, high=256, result=128;
while (low <= high){
uint32_t mid=(low+high)>>1;
uint32_t sq=(mid*mid)/128; // >>7
if (sq <= mm){ result=mid; low=mid+1; }
else high=mid-1;
}
if (result >= 256){ result >>= 1; new_exp++; }
else if (result < 128){
while (result < 128 && new_exp > 1){ result <<= 1; new_exp--; }
}
uint16_t new_mant = result & 0x7F;
if (new_exp >= 0xFF) return (bf16_t){.bits=0x7F80};
if (new_exp <= 0) return BF16_ZERO();
return (bf16_t){.bits=((new_exp&0xFF)<<7)|new_mant};
}
```
#### Assembly code
```asm
# m in s2, new_exp in s3; result in t5
li s4, 90 # low
li t6, 256 # high
li t5, 128 # result
sqrt_bsearch_loop:
bgt s4, t6, sqrt_bsearch_done
add t0, s4, t6
srai t0, t0, 1 # mid
# sq = (mid*mid)>>7 via shift-add multiply
mv a2, t0; mv a3, t0
li t2, 0; li s5, 16
sqrt_imul_loop:
andi a4, a3, 1
beqz a4, 1f
add t2, t2, a2
1: slli a2, a2, 1
srli a3, a3, 1
addi s5, s5, -1
bnez s5, sqrt_imul_loop
mv t1, t2
srli t1, t1, 7 # sq
bgt t1, s2, sq_gt_m
mv t5, t0 # result = mid
addi s4, t0, 1 # low = mid+1
j sqrt_bsearch_loop
sq_gt_m:
addi t6, t0, -1 # high = mid-1
j sqrt_bsearch_loop
sqrt_bsearch_done:
```
---
### Correctness & Trade-offs
#### Rounding:
- f32_to_bf16 implements RNE (ties-to-even).
- Arithmetic ops truncate to 7-bit mantissa (no G/R/S), effectively toward-zero.
### Result









---
## Problem Description: Next Power of Two
Compute the smallest power of two greater than or equal to a 32-bit unsigned integer `x`.
Return that power of two as a `uint32_t`. This operation shows how `clz` (count-leading-zeros) turns a bit-level insight into a constant-time primitive used in memory allocators, GPU buffers, ring buffers, and alignment routines. We’ll reuse the `clz32` routine from Quiz 1 and show a clean RV32I implementation calling it.
All implementations in this article were carried out using the Ripes simulator with a 5-stage pipeline design.
### Concept
To find the smallest power of two ≥ `x`, observe:
- `floor_log2(x) = 31 - clz(x)` (for nonzero `x`)
- `next_pow2(x) = 1 << (32 - clz(x - 1))`
Example mapping:
| x | Binary (32-bit) | clz(x-1) | Next Power of Two |
|----|------------------|----------|-------------------|
| 1 | 000…0001 | 31 | 1 |
| 2 | 000…0010 | 30 | 2 |
| 3 | 000…0011 | 30 | 4 |
| 5 | 000…0101 | 29 | 8 |
| 9 | 000…1001 | 28 | 16 |
| 17 | 0001…0001 | 27 | 32 |
Intuition:
- If `x` is already a power of two, then `x-1` fills all lower bits with `1`.
- `clz(x-1)` gives the position distance to the top `1`.
- Shifting `1u << (32 - clz(x-1))` lands exactly on the next power-of-two bit.
### C Implementations
#### clz32 (binary search style) + next_power_of_two
```c
#include <stdio.h>
#include <stdint.h>
static inline unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
return n - x;
}
// ------------------------------------------------------
// Compute Next Power of Two using CLZ (32-bit)
// ------------------------------------------------------
static inline uint32_t next_power_of_two(uint32_t x)
{
if (x == 0u) return 1u;
unsigned lz = clz(x - 1u); // clz(x-1)
return 1u << (32u - lz); // 1 << (32 - clz(x-1))
}
int main(void)
{
const uint32_t in[] = {0,1,2,3,5,9,17,33};
const uint32_t exp[] = {1,1,2,4,8,16,32,64};
const int N = (int)(sizeof(in)/sizeof(in[0]));
int pass = 0;
puts("=== Next Power of Two ===");
for (int i = 0; i < N; ++i) {
uint32_t got = next_power_of_two(in[i]);
if (got == exp[i]) {
printf("PASS %u -> %u\n", in[i], got);
++pass;
} else {
printf("FAIL %u exp=%u got=%u\n", in[i], exp[i], got);
}
}
printf("PASS %d/%d\n", pass, N);
return 0;
}
```
result:


Explanation (why it works):
- If `x` is power of two → `(x-1)` sets all less-significant bits to 1.
- `clz(x-1)` pinpoints how far the highest `1` is from the MSB.
- Shift a single `1` to that next position to get the next power of two.
Examples:
- `next_power_of_two(9) → 16`
- `next_power_of_two(17) → 32`
- `next_power_of_two(33) → 64`
### RISC-V Assembly Implementation
#### clz32 (binary search style) + next_power_of_two
```asm
.text
.globl main
# =====================================================
# main — Next Power of Two test
# =====================================================
main:
li s1, 0 # pass_count = 0
la a0, STR_HDR # print header
li a7, 4
ecall
la s2, npt_cases_in # inputs ptr
la s3, npt_cases_exp # expected ptr
lw s4, 0(s2) # N
addi s2, s2, 4
addi s3, s3, 4
li s0, 0 # i = 0
Loop:
beq s0, s4, Summary
lw a0, 0(s2) # a0 = x
mv s5, a0 # keep original x in s5 (callee won't clobber)
jal ra, next_power_of_two
mv t1, a0 # got
lw t2, 0(s3) # exp
bne t1, t2, PrintFail
# PASS
la a0, STR_PASS
li a7, 4
ecall
li a7, 1 # print x (from s5)
mv a0, s5
ecall
la a0, STR_ARROW
li a7, 4
ecall
li a7, 1 # print got
mv a0, t1
ecall
la a0, STR_NL
li a7, 4
ecall
addi s1, s1, 1 # pass_count++
j Cont
PrintFail:
la a0, STR_FAIL
li a7, 4
ecall
li a7, 1
mv a0, s5 # print x (from s5)
ecall
la a0, STR_ARROW
li a7, 4
ecall
li a7, 1
mv a0, t1 # print got
ecall
la a0, STR_NL
li a7, 4
ecall
Cont:
addi s0, s0, 1
addi s2, s2, 4
addi s3, s3, 4
j Loop
Summary:
la a0, STR_PASS_WORD
li a7, 4
ecall
li a7, 1
mv a0, s1
ecall
la a0, STR_SLASH
li a7, 4
ecall
li a7, 1 # print N_TOTAL
la a0, N_TOTAL
lw a0, 0(a0)
ecall
la a0, STR_NL2
li a7, 4
ecall
li a7, 10
ecall
# =====================================================
# next_power_of_two(x)
# a0: x -> a0: next pow2 >= x
# =====================================================
.globl next_power_of_two
next_power_of_two:
beq a0, x0, NPT_ret1 # if x==0 -> 1
mv t3, ra # save caller's ra
addi a0, a0, -1 # a0 = x - 1
jal ra, clz32 # a0 = clz(x-1)
mv ra, t3 # restore ra
li t0, 32
sub t0, t0, a0 # t0 = 32 - clz(x-1)
li a0, 1
sll a0, a0, t0 # a0 = 1 << (32 - clz(x-1))
ret
NPT_ret1:
li a0, 1
ret
# =====================================================
# clz32(x)
# a0: x -> a0: clz(x)
# =====================================================
.globl clz32
clz32:
li t0, 32 # n = 32
li t1, 16 # c = 16
CLZ_loop:
srl t2, a0, t1 # y = x >> c
beq t2, x0, CLZ_skip
sub t0, t0, t1 # n -= c
mv a0, t2 # x = y
CLZ_skip:
srli t1, t1, 1 # c >>= 1
bnez t1, CLZ_loop
sub a0, t0, a0 # return n - x
ret
# =====================================================
# Data
# =====================================================
.data
N_TOTAL: .word 8
npt_cases_in:
.word 8
.word 0,1,2,3,5,9,17,33
npt_cases_exp:
.word 8
.word 1,1,2,4,8,16,32,64
STR_HDR: .string "=== Next Power of Two ===\n"
STR_PASS: .string "PASS "
STR_FAIL: .string "FAIL "
STR_PASS_WORD: .string "PASS "
STR_SLASH: .string "/"
STR_ARROW: .string " -> "
STR_NL: .string "\n"
STR_NL2: .string "\n"
```
result:


---
#### Test Data Example
```asm
.data
npt_cases_in:
.word 8
.word 0,1,2,3,5,9,17,33
npt_cases_exp:
.word 8
.word 1,1,2,4,8,16,32,64
```
---
#### Running in Ripes Simulator (RV32I, 5-stage)

Expected Console Output:
```text
=== Next Power of Two ===
PASS 0 -> 1
PASS 1 -> 1
PASS 2 -> 2
PASS 3 -> 4
PASS 5 -> 8
PASS 9 -> 16
PASS 17 -> 32
PASS 33 -> 64
PASS 8/8
```
---
#### Pipeline Visualization
Consider the final shift that materializes the result:
```asm
sll a0, a0, t0 # a0 = 1 << (32 - clz(x-1))
```
- IF: PC shows the next sequential address; instruction memory outputs the `SLL` word.

1. The address from which this cycle’s instruction is being read.
1. PC register value = PC_next = PC_fetch + 4 = 0x00000174
1. Fetched instruction word = 0x00551533
As derived above for sll a0, a0, t0.
- ID: `rs1=a0` (initialized to 1), `rs2=t0` (shift amount), `funct3=001` (SLL), `opcode=OP`.

1. funct7=0000000 (0)、rs2=x5 (t0=5)、rs1=x10 (a0=10)、funct3=001 (SLL)、rd=x10 (10)、opcode=0110011 (0x33) → 0x00551533
1. The Reg1 and Reg2 value is both 0x00000020, the actual shift amount for this SLL is the forwarded result of sub t0, t0, a0 (computed in the prior ALU stage)
- EX: Barrel shifter applies variable shift from `t0`; result becomes the next power of two.

1. Operand A = a0 = 0x00000001
1. The base value to be shifted.
1. Operand B = effective shift amount = (t0 & 0x1F)
1. Important: the EX forwarding mux picks up the new t0 (the output of sub t0, t0, a0 from the previous cycle). Thus, in EX you should see the correct shift distance.
1. ALU result = a0 << (t0 & 0x1F)
- MEM: No data memory access for SLL.

- WB: Result written back to `a0`; no hazards if `t0` was produced in the prior ALU stage with proper forwarding.

1. Destination register = rd = x10 (a0).
1. Selected value = ALU result (not PC+4).
1. Final write: a0 ← next_power_of_two(x) per the table above.
---
✅ Results Summary
| Input | Expected | Computed |
|-------|----------|----------|
| 0 | 1 | 1 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 4 | 4 |
| 5 | 8 | 8 |
| 9 | 16 | 16 |
| 17 | 32 | 32 |
| 33 | 64 | 64 |
All cases PASS ✅
### LeetCode Context
Although LeetCode doesn’t have a dedicated “Next Power of Two” problem, it appears frequently inside system design and optimization tasks:
| Problem ID | Title | Relation |
|------------|-----------------------|-------------------------------------------------------|
| 231 | Power of Two | Core idea (single-bit property) |
| 707 / 622 | Design Linked List / Circular Queue | Choose capacity = next power of two to use `& (cap-1)` indexing |
| 191 | Number of 1 Bits | Same bitwise family |
| 50 | Pow(x, n) | Power computation & bit manipulation |
Additionally, this GitHub repo contains my implementation and a few optimizations for LeetCode 231 (Power of Two). I had planned to center this write-up on that problem, but because it doesn’t make direct use of clz, I pivoted to Next Power of Two instead.
Example (circular queue wrap-around optimization):
```c
index = (index + 1) & (capacity - 1); // works only if capacity is a power of two
```
### Discussion and Conclusion
`next_power_of_two` is a compact, high-performance primitive built on `clz`. It converts an arbitrary size `x` to the smallest power of two ≥ `x`, which is exactly what alignment and capacity sizing need. The core idea: compute `x−1`, use `clz(x−1)` to locate the top bit, then shift 1 to that position to materialize the aligned capacity.
- With hardware CLZ (e.g., Zbb), this becomes just a handful of cycles.
- Even in pure RV32I, a software `clz32` plus a few ALU ops provides a fast, branch-light primitive.
Common uses include:
- Memory allocation (aligning heap blocks)
- Graphics/compute pipelines (buffer sizing)
- FFT / audio processing (rounding sample counts)
### References
- [LeetCode 231: Power of Two](https://leetcode.com/problems/power-of-two/)
- RISC-V ISA Manual (RV32I)
- Bit Twiddling Hacks (next power of two via clz/bit-spreading)
## Two-Point Minimum Distance (BF16)
### Objective
Given two point sets A and B (point coordinates stored in bf16 encoding), calculate the minimum cross-set distance:
$$\min_{p \in A, q \in B} \sqrt{(q_x - p_x)^2 + (q_y - p_y)^2}$$
Output results for each test case, compare with "expected answers", and print PASS/FAIL.
### Main Flow
1. Read n_a, n_b and the starting addresses of point arrays A and B.
2. Call `find_min_distance(n_a, n_b, ptrA, ptrB)` to get the minimum distance (bf16 bits).
3. Navigate to `expected_min_dist` at the end of the case and perform bitwise equality comparison with the result:
- Equal → Print PASS
- Otherwise → Print FAIL (exp: ..., got: ...)
4. Move to the beginning of the next test case until all are completed.
### Code
Note: The full implementation is on GitHub
```assembly
.data
# ========== Message Strings ==========
start_msg: .string "===== Two-Point Minimum Distance Calculation (BF16) ====="
end_msg: .string "===== Calculation Complete ====="
input_msg: .string "Test Case "
result_msg: .string "Minimum Distance: "
newline: .string "\n"
pass_msg: .string "PASS\n"
fail_msg_prefix: .string "FAIL (exp: "
fail_msg_mid: .string ", got: "
fail_msg_tail: .string ")\n"
arrow_msg: .string " -> "
# ========== Test Data (Examples) ==========
test_data:
.word 3 # 3 test cases
# Test Case 1: A={(0,0), (1,0)}, B={(0,1), (1,1)}, exp = 1.0 (0x3F80)
.word 2 # n_a
.word 2 # n_b
# A
.word 0x0000, 0x0000 # (0, 0)
.word 0x3F80, 0x0000 # (1, 0)
# B
.word 0x0000, 0x3F80 # (0, 1)
.word 0x3F80, 0x3F80 # (1, 1)
# exp
.word 0x3F80
# Test Case 2: A={(0,0)}, B={(3,4)}, exp = 5.0 (0x40A0)
.word 1 # n_a
.word 1 # n_b
# A
.word 0x0000, 0x0000 # (0, 0)
# B
.word 0x4040, 0x4080 # (3, 4)
# exp
.word 0x40A0
# Test Case 3: A={(0,0), (2,0)}, B={(1,1), (3,1)}, exp = sqrt(2) ≈ 1.414 (0x3FB5)
.word 2 # n_a
.word 2 # n_b
# A
.word 0x0000, 0x0000 # (0, 0)
.word 0x4000, 0x0000 # (2, 0)
# B
.word 0x3F80, 0x3F80 # (1, 1)
.word 0x4040, 0x3F80 # (3, 1)
# exp
.word 0x3FB5
.text
.globl main
# ==========================================
# main
# ==========================================
main:
# Print header
la a0, start_msg
li a7, 4
ecall
li a0, 10 # '\n'
li a7, 11
ecall
la s0, test_data
lw s1, 0(s0) # s1 = number of test cases
addi s0, s0, 4
li s2, 0 # s2 = current case index
main_test_loop:
bge s2, s1, main_done
# Print "Test Case i"
la a0, input_msg # "Test Case "
li a7, 4
ecall
addi a0, s2, 1
li a7, 1 # print int
ecall
li a0, 10
li a7, 11 # print char '\n'
ecall
# Read case header
lw a0, 0(s0) # n_a
lw a1, 4(s0) # n_b
addi a2, s0, 8 # A array starting address
# B array starting address = A start + n_a * 8 bytes
slli t0, a0, 3 # n_a * 8
add a3, a2, t0
# Preserve outer pointers
addi sp, sp, -16
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw ra, 12(sp)
# Call minimum distance function
jal ra, find_min_distance
mv s3, a0 # s3 = calculation result (bf16 bits)
# Restore
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw ra, 12(sp)
addi sp, sp, 16
# Calculate expected answer address:
# total_points_bytes = (n_a + n_b) * 8
# header 2 words = 8 bytes
# t3 = s0 + 8 + (n_a+n_b)*8
lw t0, 0(s0) # n_a
lw t1, 4(s0) # n_b
add t2, t0, t1
slli t2, t2, 3 # *8
addi t2, t2, 8 # + header(2 words)
add t3, s0, t2
lw s4, 0(t3) # s4 = expected answer (bf16 bits)
# Print current result and PASS/FAIL
la a0, result_msg
li a7, 4
ecall
mv a0, s3 # got
li a7, 1
ecall
la a0, arrow_msg # " -> "
li a7, 4
ecall
beq s3, s4, __print_pass
# FAIL: Print "FAIL (exp: <exp>, got: <got>)\n"
la a0, fail_msg_prefix
li a7, 4
ecall
mv a0, s4 # exp
li a7, 1
ecall
la a0, fail_msg_mid
li a7, 4
ecall
mv a0, s3 # got
li a7, 1
ecall
la a0, fail_msg_tail
li a7, 4
ecall
j __after_passfail
__print_pass:
la a0, pass_msg
li a7, 4
ecall
__after_passfail:
addi s0, t3, 4
addi s2, s2, 1
j main_test_loop
main_done:
la a0, end_msg
li a7, 4
ecall
li a0, 10
li a7, 11
ecall
li a0, 0
li a7, 10 # exit
ecall
# ==========================================
# find_min_distance
# Parameters: a0=n_a, a1=n_b, a2=ptrA, a3=ptrB
# Return: a0 = minimum distance (bf16 bits)
# ==========================================
find_min_distance:
addi sp, sp, -32
sw ra, 28(sp)
sw s0, 24(sp)
sw s1, 20(sp)
sw s2, 16(sp)
sw s3, 12(sp)
sw s4, 8(sp)
sw s5, 4(sp)
mv s0, a0 # n_a
mv s1, a1 # n_b
mv s2, a2 # ptrA
mv s3, a3 # ptrB
li s4, 0x7F80 # min_dist = +Inf (bf16 bits)
li s5, 0 # i
find_outer_loop:
bge s5, s0, find_done
# A[i] address = ptrA + i*8
slli t0, s5, 3
add t0, s2, t0
lw a0, 0(t0) # ax
lw a1, 4(t0) # ay
li t1, 0 # j
find_inner_loop:
bge t1, s1, find_next_i
# B[j] address = ptrB + j*8
slli t2, t1, 3
add t2, s3, t2
lw a2, 0(t2) # bx
lw a3, 4(t2) # by
# Save needed registers
addi sp, sp, -12
sw a0, 0(sp)
sw a1, 4(sp)
sw t1, 8(sp)
jal ra, calc_distance # a0 = dist (bf16 bits)
mv t3, a0
# Restore
lw a0, 0(sp)
lw a1, 4(sp)
lw t1, 8(sp)
addi sp, sp, 12
# If dist < min_dist → update
bltu t3, s4, find_update_min
j find_continue
find_update_min:
mv s4, t3
find_continue:
addi t1, t1, 1
j find_inner_loop
find_next_i:
addi s5, s5, 1
j find_outer_loop
find_done:
mv a0, s4 # return min
lw ra, 28(sp)
lw s0, 24(sp)
lw s1, 20(sp)
lw s2, 16(sp)
lw s3, 12(sp)
lw s4, 8(sp)
lw s5, 4(sp)
addi sp, sp, 32
ret
# ==========================================
# calc_distance(ax, ay, bx, by) → bf16 distance
# ==========================================
calc_distance:
addi sp, sp, -24
sw ra, 20(sp)
sw s0, 16(sp)
sw s1, 12(sp)
sw s2, 8(sp)
sw s3, 4(sp)
mv s3, a0 # ax
mv s1, a1 # ay
mv s2, a3 # by
# dx = bx - ax
mv a0, a2 # bx
mv a1, s3 # ax
jal ra, bf16_sub # a0 = dx
mv a1, a0
jal ra, bf16_mul # a0 = dx^2
mv s0, a0 # s0 = dx^2
# dy = by - ay
mv a0, s2 # by
mv a1, s1 # ay
jal ra, bf16_sub # a0 = dy
mv a1, a0
jal ra, bf16_mul # a0 = dy^2
# dx^2 + dy^2
mv a1, a0 # dy^2
mv a0, s0 # dx^2
jal ra, bf16_add # a0 = sum
# sqrt(sum)
jal ra, bf16_sqrt # a0 = distance
lw ra, 20(sp)
lw s0, 16(sp)
lw s1, 12(sp)
lw s2, 8(sp)
lw s3, 4(sp)
addi sp, sp, 24
ret
```
### Result

Where 16256=0x3F80(=1.0), 16544=0x40A0(=5.0), 16309≈0x3FB5(≈1.414).
