# arch2025-homework1
You can see my source code [here](https://github.com/ray3628/ca2025-quizzes).
# Problem B
For Problem B, I think the encode and decode functions are both already efficient; the only optimization left might be rearranging a few instructions, and the effect would probably not be significant.
However, there is one function we can actually work on.
The clz function uses a while loop with right shifts to count the number of leading zeros, which may cost more cycles than necessary.
**clz function:**
```c=1
static inline unsigned clz(uint32_t x)
{
    int n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    } while (c);
    return n - x;
}
```
**Corresponding RISC-V code:**
```asm=1
clz:
li t0, 32 # t0, n = 32
li t1, 16 # t1, c = 16
clz_loop:
srl t2, a1, t1 # y = x >> c
beqz t2, clz_nochange
sub t0, t0, t1 # n -= c
mv a1, t2 # x = y
clz_nochange:
srli t1, t1, 1
bnez t1, clz_loop
sub a1, t0, a1 # return n - x
ret
```

(The console prints P, which means the test passes.)

To verify my guess, let's look at my new version of clz in C first.
**New version of the C code:**
```c=1
int clz(uint32_t a) {
    int number_of_leading_zeros = 0;
    if (a == 0) {
        return 32;
    }
    if (a <= 0x0000FFFF) {
        number_of_leading_zeros += 16;
        a <<= 16;
    }
    if (a <= 0x00FFFFFF) {
        number_of_leading_zeros += 8;
        a <<= 8;
    }
    if (a <= 0x0FFFFFFF) {
        number_of_leading_zeros += 4;
        a <<= 4;
    }
    if (a <= 0x3FFFFFFF) {
        number_of_leading_zeros += 2;
        a <<= 2;
    }
    if (a <= 0x7FFFFFFF) {
        number_of_leading_zeros += 1;
    }
    return number_of_leading_zeros;
}
```
The advantage of this newly designed code is that, compared to the original clz implementation, it completely eliminates the use of loops. Instead, it uses only a small number of if statements and shift operations to calculate the number of leading zeros.
My idea is that for a 32-bit input, the algorithm checks the number of leading zeros using thresholds of 16, 8, 4, 2, and 1. Each time the condition indicates that there are more leading zeros, the count is increased accordingly, and the value is left-shifted. Since the conditions are checked from larger to smaller ranges, the algorithm can detect up to 31 leading zeros.
Additionally, there is one special case for 32 leading zeros: when the input is 0, the function simply returns 32 directly.
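As a quick sanity check, here is a hand trace of the new clz for one input I picked myself, 0x00001234, which has 19 leading zeros:
```c
/* Hand trace of clz(0x00001234) with the code above:
 *   a = 0x00001234 <= 0x0000FFFF -> count = 16, a = 0x12340000
 *   a = 0x12340000 >  0x00FFFFFF -> no change
 *   a = 0x12340000 >  0x0FFFFFFF -> no change
 *   a = 0x12340000 <= 0x3FFFFFFF -> count = 18, a = 0x48D00000
 *   a = 0x48D00000 <= 0x7FFFFFFF -> count = 19
 * Return 19; the highest set bit of 0x1234 is bit 12, so 32 - 13 = 19. */
```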
**New version of RISC-V code:**
```asm=1
clz:
li t0, 0
beq a1, x0, input_zero
li t1, 0x0000FFFF
bgtu a1, t1, clz_1
addi t0, t0, 16
slli a1, a1, 16
clz_1:
li t1, 0x00FFFFFF
bgtu a1, t1, clz_2
addi t0, t0, 8
slli a1, a1, 8
clz_2:
li t1, 0x0FFFFFFF
bgtu a1, t1, clz_3
addi t0, t0, 4
slli a1, a1, 4
clz_3:
li t1, 0x3FFFFFFF
bgtu a1, t1, clz_4
addi t0, t0, 2
slli a1, a1, 2
clz_4:
li t1, 0x7FFFFFFF
bgtu a1, t1, end_clz
addi t0, t0, 1
slli a1, a1, 1
end_clz:
add a1, t0, x0
ret
input_zero:
li a1, 32
ret
```

(The console prints P, which means the test passes.)

I think it's a huge improvement!
# Problem C
**Before we get started**:
Basically, the problem involves implementing a set of functions that perform arithmetic operations for bf16 — such as addition, multiplication, and so on.
My main task is to translate each of these functions from C into RISC-V assembly code independently, and then feed them with various test inputs to verify the correctness of my translation.
To ensure good verification coverage, I will include both typical and edge case inputs in my tests. All expected results will be generated by the original C code compiled with the compiler, rather than calculated manually.
For performance comparison, the same set of test data will be used for all implementations. I will present two versions of the RISC-V code: the first one is a straightforward line-by-line translation, while the second one is a more optimized version.
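To make that concrete, here is a rough sketch of the kind of C harness I use to generate expected results (the header name bf16.h and the input list are my own placeholders; the types and function names follow the quiz code):
```c
#include <stdint.h>
#include <stdio.h>

#include "bf16.h" /* the original C implementation from the quiz (placeholder name) */

int main(void)
{
    /* A mix of typical and edge-case inputs; the hex values printed here become
     * the expected outputs that the RISC-V translation must reproduce. */
    float inputs[] = {1.0f, -2.5f, 0.0f, 1e-40f, 3.4e38f};
    int n = sizeof inputs / sizeof inputs[0];

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            bf16_t a = f32_to_bf16(inputs[i]);
            bf16_t b = f32_to_bf16(inputs[j]);
            printf("add(%04X, %04X) = %04X\n", (unsigned) a.bits,
                   (unsigned) b.bits, (unsigned) bf16_add(a, b).bits);
        }
    }
    return 0;
}
```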
# 1. bf16_isnan function
**C code:**
```c=1
static inline bool bf16_isnan(bf16_t a)
{
    return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
           (a.bits & BF16_MANT_MASK);
}
```
So, it's a function that checks whether an input is NaN.
We know the default NaN is 0x7FC0 (in general, a value is NaN when its exponent bits are all 1 and its mantissa is not 0), so all this function does is check those two bit conditions.
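For reference, here is the bf16 bit layout and the mask values that these checks rely on (the two macro values are taken from the constants that appear in the quiz code; the sign mask is implied by the layout):
```c
/* bf16 layout: [15] sign | [14:7] exponent (8 bits) | [6:0] mantissa (7 bits) */
#define BF16_SIGN_MASK 0x8000u /* 1000 0000 0000 0000 */
#define BF16_EXP_MASK  0x7F80u /* 0111 1111 1000 0000 */
#define BF16_MANT_MASK 0x007Fu /* 0000 0000 0111 1111 */
/* Example: 0x7FC0 has all exponent bits set and mantissa 0x40 != 0, so it is NaN. */
```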
**Line-by-line version RISC-V code:**
```asm=1
bf16_isnan:
li t0, BF16_EXP_MASK
and t1, a0, t0
bne t1, t0, bf16_isnan_false
li t0, BF16_MANT_MASK
and t1, a0, t0
beqz t1, bf16_isnan_false
bf16_isnan_true:
li a0, 1
ret
bf16_isnan_false:
li a0, 0
ret
```

The design of the line-by-line version is intuitive: just branch to the true or false case as each condition is checked.
But is there a better way to do the translation?
Here, I measure the performance of the line-by-line version and compare it with a branchless version.
**Branchless version:**
```asm=1
bf16_isnan:
li t0, BF16_EXP_MASK
and t1, a0, t0
xor t1, t1, t0
seqz t1, t1
andi t2, a0, BF16_MANT_MASK
snez t2, t2
and a0, t1, t2
ret
```

As you can see, the performance of the two versions is similar, but the branchless version has less code.
I also think the branchless one is a little better on a pipelined processor, since it never has to flush speculatively fetched instructions.
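In C terms, the branchless version amounts to evaluating both conditions as 0/1 values and combining them with a bitwise AND instead of the short-circuiting `&&`. A minimal sketch (the type and masks are redeclared here just to keep it self-contained):
```c
#include <stdbool.h>
#include <stdint.h>

#define BF16_EXP_MASK  0x7F80u
#define BF16_MANT_MASK 0x007Fu

typedef struct { uint16_t bits; } bf16_t;

static inline bool bf16_isnan_branchless(bf16_t a)
{
    /* Both comparisons produce 0 or 1; the single & avoids the short-circuit branch. */
    return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &
           ((a.bits & BF16_MANT_MASK) != 0);
}
```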
# 2. bf16_isinf function
**C code:**
```c=1
static inline bool bf16_isinf(bf16_t a)
{
    return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
           !(a.bits & BF16_MANT_MASK);
}
```
**Line-by-line version RISC-V code:**
```asm=1
bf16_isinf:
li t0, BF16_EXP_MASK
and t1, a0, t0
bne t1, t0, bf16_isinf_false
li t0, BF16_MANT_MASK
and t1, a0, t0
bnez t1, bf16_isinf_false
bf16_isinf_true:
li a0, 1
ret
bf16_isinf_false:
li a0, 0
ret
```

Just like the isnan function, this one checks whether the exponent and mantissa bits match infinity (exponent all 1s, mantissa 0).
Because the bit operations are so similar, we can easily turn it into a branchless version as well and see which one is better.
**Branchless version:**
```asm=1
bf16_isinf:
li t0, BF16_EXP_MASK
and t1, a0, t0
xor t1, t1, t0 # zero only if all exponent bits are set
seqz t1, t1
andi t2, a0, BF16_MANT_MASK
seqz t2, t2
and a0, t1, t2
ret
```

The cycle count here is a little better than the line-by-line version, and the code is also slightly shorter, so I think the branchless version is clearly the better choice.
# 3. bf16_iszero function
**C code:**
```c=1
static inline bool bf16_iszero(bf16_t a)
{
    return !(a.bits & 0x7FFF);
}
```
Just check whether all the bits except the sign bit are zero; the `!` in C is a logical NOT, which maps to seqz.
```asm=1
bf16_iszero:
li t0, 0x7FFF
and a0, a0, t0
seqz a0, a0 # logical NOT, not bitwise: return 1 only when no bits below the sign are set
ret
```
# 4. f32_to_bf16 function
**C code:**
```c=1
static inline bf16_t f32_to_bf16(float val)
{
    uint32_t f32bits;
    memcpy(&f32bits, &val, sizeof(float));
    if (((f32bits >> 23) & 0xFF) == 0xFF)
        return (bf16_t) {.bits = (f32bits >> 16) & 0xFFFF};
    f32bits += ((f32bits >> 16) & 1) + 0x7FFF;
    return (bf16_t) {.bits = f32bits >> 16};
}
```
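Before translating it, note what the line `f32bits += ((f32bits >> 16) & 1) + 0x7FFF;` does: together with the final truncation, it rounds the upper 16 bits to nearest, ties to even. A quick check in C (the two inputs are just values I picked to hit the tie case):
```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t tie_even = 0x3F808000u; /* lower half exactly 0x8000, upper half ends in 0 */
    uint32_t tie_odd  = 0x3F818000u; /* lower half exactly 0x8000, upper half ends in 1 */

    uint32_t r1 = (tie_even + ((tie_even >> 16) & 1) + 0x7FFF) >> 16; /* 0x3F80: tie stays on the even value */
    uint32_t r2 = (tie_odd  + ((tie_odd  >> 16) & 1) + 0x7FFF) >> 16; /* 0x3F82: tie rounds up to the even value */
    printf("0x%04X 0x%04X\n", (unsigned) r1, (unsigned) r2);
    return 0;
}
```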
**Line-by-line version RISC-V code:**
```asm=1
f32_to_bf16:
mv t0, a0
li t1, 0xFF
srli t2, t0, 23
and t2, t2, t1
beq t2, t1, f32_to_bf16_A
srli t0, t0, 16
andi t0, t0, 1
li t2, 0x7FFF
add t0, t0, t2
add t0, a0, t0 # f32bits += ((f32bits >> 16) & 1) + 0x7FFF;
srli a0, t0, 16 # return (bf16_t) {.bits = f32bits >> 16};
ret
f32_to_bf16_A:
srli a0, a0, 16
li t0, 0xFFFF
and a0, a0, t0
ret
```

The line-by-line code is pretty intuitive: I just translated the C code statement by statement.
So let's see how I optimized it.
**A better version:**
```asm=1
f32_to_bf16:
li t0, 0xFF
srli t1, a0, 23 # (f32bits >> 23)
and t1, t1, t0 # ((f32bits >> 23) & 0xFF)
beq t1, t0, f32_to_bf16_A # if (((f32bits >> 23) & 0xFF) == 0xFF)
srli t0, a0, 16
andi t0, t0, 1
li t1, 0x7FFF
add t0, t0, t1
add a0, a0, t0 # f32bits += ((f32bits >> 16) & 1) + 0x7FFF
f32_to_bf16_A:
srli a0, a0, 16
ret
```

Here, I deleted lines 2, 12, 13, 16, and 17 from the line-by-line version and tried to use fewer registers, and the result looks pretty good.
But why is it fine to do so? In particular, why can I drop the extra `& 0xFFFF` (lines 16 and 17) and keep only one return path?
The answer is that the variable f32bits is an unsigned integer!
When we shift an unsigned value to the right, we don't need to care about the sign bit: the srli instruction is always the right choice.
Now look at the two return cases in the C code. The first case additionally masks with 0xFFFF, but since f32bits is unsigned and we use srli rather than srai, that `& 0xFFFF` is redundant: the upper 16 bits after the shift are already guaranteed to be 0.
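A tiny C illustration of that point (the bit pattern is arbitrary, chosen only so the sign bit is set):
```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t f32bits = 0xFF812345u;         /* top bit set on purpose */
    uint32_t hi = f32bits >> 16;            /* logical shift: hi == 0x0000FF81 */
    printf("%08" PRIX32 " %08" PRIX32 "\n", /* both print 0000FF81 */
           hi, hi & 0xFFFFu);
    return 0;
}
```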
That sounds good, but can we optimize it further?
Yes. The if statement in the C code only checks whether the exponent bits are all 1, so instead of shifting and masking, we can load a mask that selects the exponent bits in place (0x7F800000) and compare against it directly.
**Best version:**
```asm=1
f32_to_bf16:
li t0, 0x7F800000
and t1, a0, t0 # extract the exponent field in place
beq t1, t0, f32_to_bf16_A # taken when all exponent bits are 1
srli t0, a0, 16
andi t0, t0, 1
li t1, 0x7FFF
add t0, t0, t1
add a0, a0, t0 # f32bits += ((f32bits >> 16) & 1) + 0x7FFF
f32_to_bf16_A:
srli a0, a0, 16
ret
```

# 5. bf16_to_f32 function
```c=1
static inline float bf16_to_f32(bf16_t val)
{
    uint32_t f32bits = ((uint32_t) val.bits) << 16;
    float result;
    memcpy(&result, &f32bits, sizeof(float));
    return result;
}
```
Converting a bf16 to f32 is easy: just shift left by 16. No big deal here.
```asm=1
bf16_to_f32:
slli a0, a0, 16
ret
```
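A tiny round-trip check (1.5 is exactly representable in bf16, so nothing is lost; the value is just an example):
```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint16_t bf16bits = 0x3FC0;                   /* bf16 encoding of 1.5 */
    uint32_t f32bits = (uint32_t) bf16bits << 16; /* 0x3FC00000 */
    float f;
    memcpy(&f, &f32bits, sizeof f);
    printf("%f\n", f);                            /* prints 1.500000 */
    return 0;
}
```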
# 6. bf16_add function
It's a large function, but I don't think there's much room for major improvements since the original algorithm is already quite efficient. However, I will still point out a few areas where minor optimizations can be made.
**Line-by-line version RISC-V code:**
```asm=1
bf16_add:
srli s1, a0, 15
andi s1, s1, 1
srli s2, a1, 15
andi s2, s2, 1
srli s3, a0, 7
andi s3, s3, 0xFF
srli s4, a1, 7
andi s4, s4, 0xFF
andi s5, a0, 0x7F
andi s6, a1, 0x7F
li t0, 0xFF
bne t0, s3, bf16_add_1 # if (exp_a == 0xFF)
beqz s5, bf16_add_2 # if (mant_a)
ret # return a
bf16_add_2:
beq t0, s4, bf16_add_20 # if (exp_b == 0xFF)
ret # return a
bf16_add_20:
xor t1, s1, s2 # sign_a == sign_b
seqz t1, t1 # sign_a == sign_b
or t1, s6, t1
beqz t1, bf16_add_3 # (mant_b || sign_a == sign_b) ? b : BF16_NAN()
mv a0, a1 # return b
ret
bf16_add_3:
li a0, BF16_NAN_BITS # return BF16_NAN()
ret
bf16_add_1:
bne t0, s4, bf16_add_4
mv a0, a1
ret
bf16_add_4:
seqz t1, s3
seqz t2, s5
and t1, t1, t2
beqz t1, bf16_add_5 # if (!exp_a && !mant_a)
mv a0, a1 # return b
ret
bf16_add_5:
seqz t1, s4
seqz t2, s6
and t1, t1, t2
beqz t1, bf16_add_6 # if (!exp_b && !mant_b)
ret # return a
bf16_add_6:
beqz s3, bf16_add_7 # if (exp_a)
ori s5, s5, 0x80 # mant_a |= 0x80
bf16_add_7:
beqz s4, bf16_add_8 # if (exp_b)
ori s6, s6, 0x80 # mant_b |= 0x80
bf16_add_8:
sub s7, s3, s4 # exp_diff = exp_a - exp_b
bltz s7, bf16_add_9 # else if (exp_diff < 0)
beqz s7, bf16_add_11 # else
mv s8, s3 # result_exp = exp_a
li t1, 8
ble s7, t1, bf16_add_10 # if (exp_diff > 8)
ret # return a
bf16_add_10:
srl s6, s6, s7 # mant_b >>= exp_diff
j bf16_add_13
bf16_add_9:
mv s8, s4 # result_exp = exp_b
li t1, -8
bge s7, t1, bf16_add_12 # if (exp_diff < -8)
mv a0, a1 # return b
ret # return b
bf16_add_12:
neg t1, s7
srl s5, s5, t1 # mant_a >>= -exp_diff
j bf16_add_13
bf16_add_11:
mv s8, s3 # result_exp = exp_a
bf16_add_13:
bne s1, s2, bf16_add_14 # if (sign_a == sign_b)
mv s9, s1 # result_sign = sign_a
add s10, s5, s6 # result_mant = (uint32_t) mant_a + mant_b
andi t1, s10, 0x100
beqz t1, bf16_add_15 # if (result_mant & 0x100)
srli s10, s10, 1 # result_mant >>= 1
addi s8, s8, 1 # ++result_exp >= 0xFF (result_exp will not return to original value)
blt s8, t0, bf16_add_15 # if (++result_exp >= 0xFF)
slli t1, s9, 15 # result_sign << 15
li t2, 0x7F80
or a0, t1, t2 # return (bf16_t) {.bits = (result_sign << 15) | 0x7F80}
ret
bf16_add_14:
blt s5, s6, bf16_add_16 # if (mant_a >= mant_b)
mv s9, s1 # result_sign = sign_a
sub s10, s5, s6 # result_mant = mant_a - mant_b
j bf16_add_15
bf16_add_16:
mv s9, s2 # result_sign = sign_b
sub s10, s6, s5 # result_mant = mant_b - mant_a
bf16_add_15:
seqz t1, s10 # if (!result_mant)
beqz t1, bf16_add_17 # if (!result_mant)
li a0, BF16_ZERO_BITS # return BF16_ZERO()
ret
bf16_add_17:
loop1:
andi t1, s10, 0x80
seqz t1, t1
beqz t1, bf16_add_18 # while (!(result_mant & 0x80))
slli s10, s10, 1 # result_mant <<= 1
addi s8, s8, -1 # if (--result_exp <= 0)
blez s8, bf16_add_19 # if (--result_exp <= 0)
j loop1
bf16_add_19:
li a0, BF16_ZERO_BITS # return BF16_ZERO()
ret
bf16_add_18:
slli s9, s9, 15 # (result_sign << 15)
andi s8, s8, 0xFF # (result_exp & 0xFF)
slli s8, s8, 7 # ((result_exp & 0xFF) << 7)
andi s10, s10, 0x7F # (result_mant & 0x7F)
or s9, s9, s8 # or them together and return
or a0, s9, s10
ret
```

**First**, the code returns b or some constant value in many places, and each time I wrote the same instruction sequence.
To improve this, I can replace each duplicated return sequence with a jump to a shared label. For example,
```asm=1
mv a0, a1 # return b
ret
```
Becomes
```asm=1
j return_b
```
Where return_b is a shared label whose code moves b into a0 and returns.
(The extra jump may add a cycle on some paths, but it definitely shrinks the code.)
**Second**, consider this part of the C code:
```c=1
if (exp_a == 0xFF) {
    if (mant_a)
        return a;
    if (exp_b == 0xFF)
        return (mant_b || sign_a == sign_b) ? b : BF16_NAN();
    return a;
}
```
Below is my line-by-line version. Notice that whenever exp_b != 0xFF, both the `if (mant_a)` path and the final `return a` do exactly the same thing, so we can test exp_b first and merge them into a single jump to return_a; the mant_a check is then only needed when exp_b is also 0xFF.
```asm=1
li t0, 0xFF
bne t0, s3, bf16_add_1 # if (exp_a == 0xFF)
beqz s5, bf16_add_2 # if (mant_a)
ret # return a
bf16_add_2:
beq t0, s4, bf16_add_20 # if (exp_b == 0xFF)
ret # return a
bf16_add_20:
xor t1, s1, s2 # sign_a == sign_b
seqz t1, t1 # sign_a == sign_b
or t1, s6, t1
beqz t1, bf16_add_3 # (mant_b || sign_a == sign_b) ? b : BF16_NAN()
mv a0, a1 # return b
ret
bf16_add_3:
li a0, BF16_NAN_BITS # return BF16_NAN()
ret
```
So, my new version is:
```asm=1
li t0, 0xFF
bne t0, s3, bf16_add_1 # if (exp_a == 0xFF)
bne t0, s4, return_a # exp_b != 0xFF: every path here returns a, so mant_a does not matter
bnez s5, return_a # a is NaN: return a
xor t1, s1, s2 # sign_a == sign_b
seqz t1, t1 # sign_a == sign_b
or t1, s6, t1
beqz t1, return_BF16_NAN_BITS # (mant_b || sign_a == sign_b) ? b : BF16_NAN()
j return_b
bf16_add_1:
bne t0, s4, bf16_add_4
j return_b
```
Fewer branches, and less code.
**Third**, by De Morgan's law, `!exp_b && !mant_b` is equivalent to `!(exp_b | mant_b)`, so the check below can be done with two instructions instead of three, as shown after the snippet.
```c=1
if (!exp_b && !mant_b)
    return a;
```
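In C, the rewritten check looks like this (my own sketch, shown in the same fragment style as above). The single `|` maps to one `or` instruction, and the branch can then test that result directly, replacing the `seqz`/`seqz`/`and` sequence of the line-by-line version:
```c
/* Equivalent by De Morgan's law: (!exp_b && !mant_b) == !(exp_b | mant_b) */
if (!(exp_b | mant_b))
    return a;
```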

# 7. Other optimization
In my opinion, the addition, multiplication, division, and square root functions are already implemented with efficient algorithms.
There isn’t much room for further optimization, except for similar minor improvements to those mentioned in the addition function.
**Supplement**: Regarding the binary search part of the sqrt function, although I could use a lookup table to reduce the search time complexity to O(1), the required storage cost would be substantial. I do not consider this trade-off worthwhile, and it is not a general solution, so I will not explore it further.
```c=1
/* Binary search for square root of m */
while (low <= high) {
    uint32_t mid = (low + high) >> 1;
    uint32_t sq = (mid * mid) / 128; /* Square and scale */
    if (sq <= m) {
        result = mid; /* This could be our answer */
        low = mid + 1;
    } else {
        high = mid - 1;
    }
}
```
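As a quick sanity check of this search, here is a standalone re-creation; the bounds low = 0 and high = 511 are my own assumption, since the snippet does not show how they are initialized:
```c
#include <stdint.h>
#include <stdio.h>

/* Scaled integer square root: returns the largest r in range with r*r/128 <= m. */
static uint32_t isqrt_scaled(uint32_t m)
{
    uint32_t low = 0, high = 511, result = 0; /* assumed bounds */
    while (low <= high) {
        uint32_t mid = (low + high) >> 1;
        uint32_t sq = (mid * mid) / 128;      /* square and scale back */
        if (sq <= m) {
            result = mid;                     /* mid is still a candidate */
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }
    return result;
}

int main(void)
{
    /* m = 256 represents 2.0 scaled by 128; expect about sqrt(2) * 128 = 181. */
    printf("%u\n", (unsigned) isqrt_scaled(256)); /* prints 181 */
    return 0;
}
```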
# RISC-V 5-stage pipeline processor
Next, I’ll walk through an example to illustrate how a RISC-V 5-stage pipeline processor executes an instruction.
Here, let me show you the optimized version of clz code again.
```asm=1
clz:
li t0, 0
beq a1, x0, input_zero
li t1, 0x0000FFFF
bgtu a1, t1, clz_1
addi t0, t0, 16
slli a1, a1, 16
clz_1:
li t1, 0x00FFFFFF
bgtu a1, t1, clz_2
addi t0, t0, 8
slli a1, a1, 8
clz_2:
li t1, 0x0FFFFFFF
bgtu a1, t1, clz_3
addi t0, t0, 4
slli a1, a1, 4
clz_3:
li t1, 0x3FFFFFFF
bgtu a1, t1, clz_4
addi t0, t0, 2
slli a1, a1, 2
clz_4:
li t1, 0x7FFFFFFF
bgtu a1, t1, end_clz
addi t0, t0, 1
slli a1, a1, 1
end_clz:
add a1, t0, x0
ret
input_zero:
li a1, 32
ret
```
I would like to take the branch instruction as an example.
```asm
beq a1, x0, input_zero
```
**1. The Instruction Fetch stage:**
In this stage, the processor fetches the beq instruction from memory at the current program counter (PC).
Simultaneously, it speculatively increments PC by 4 to fetch the next sequential instruction.
At this point, the CPU does not know whether the branch will be taken or not.
The key here is that fetching always happens in parallel with other instructions. Branch decisions are not yet made.
**2. The Instruction Decode stage:**
In this stage, the instruction is decoded, and the processor recognizes it as a conditional branch (beq).
The register file is now read:
-> Source register a1 is read.
-> Source register x0 is read (it is hard-wired to zero).
Also, the branch target address is computed in parallel:
Branch Target = PC + sign-extended immediate offset
The control unit prepares signals for a possible branch.
At this point, the CPU knows where it would jump, but not whether it will actually jump.
**3. The Execute stage:**
The ALU performs the comparison between a1 and x0.
If the condition is true, the branch is taken and the branch target PC is selected.
If the condition is false, the PC continues with the sequential path (PC + 4).
*This is the critical stage for branch instructions:*
Next PC is determined (either target or sequential).
A control signal tells the fetch unit whether to **flush** the incorrectly fetched instruction(s).
**4. The Memory Access stage:**
Since beq doesn’t load or store data, the MEM stage doesn’t perform any memory operation.
However, the instruction still passes through this stage as part of the pipeline.
**5. The Write back stage:**
beq doesn’t produce a value to write back to the register file.
The WB stage is idle for this instruction.
# Conclusion
The reason I chose a branch instruction as the example is that it is one of the most important cases for understanding what causes a branch penalty.
Because the branch decision is only made during the EX stage, the instructions fetched during IF and ID stages (after beq) may be invalid if the branch is taken.
If the branch is taken → the speculatively fetched instructions must be flushed.
Typically, a 5-stage pipeline has a branch penalty of **2 cycles**, because by the time beq resolves in EX:
-> the next instruction has already been fetched and is being decoded (IF, ID),
-> and the instruction after that is being fetched (IF).
The pipeline must discard both and redirect the PC to the branch target, as the timing sketch below shows.
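A rough sketch, assuming the branch resolves in EX with no branch prediction:
```
cycle            1    2    3    4    5    6    7
beq              IF   ID   EX   MEM  WB
beq + 4               IF   ID   (flushed)
beq + 8                    IF   (flushed)
branch target                   IF   ID   EX   ...
```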
So, it is important to design a program with fewer branches if possible, or to replace branches with alternatives that cost **less than two cycles**.
I guess this also answers my own question — why sometimes using a branchless implementation doesn’t actually reduce the cycle count.