# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [`ranvd`](https://github.com/ranvd) >
## Get sine value without floating point multiplication support
The sine function holds significant importance in both mathematics and computer graphics. Within the context of the [project](https://github.com/miloyip/line), the sine and cosine functions are employed to calculate the phase of where lines should be drawn.
Since the [sine](https://en.wikipedia.org/wiki/Sine_and_cosine) function is defined as the ratio of the side opposite an angle to the corresponding [hypotenuse](https://en.wikipedia.org/wiki/Hypotenuse), its value always sits in $[-1,1]$, and we can treat it as a periodic function of the angle.

According to the [Taylor expansion](https://en.wikipedia.org/wiki/Taylor_series), an analytic function such as sine can be approximated by its Taylor series. In practice, we can use this technique to compute the sine value while restricting the domain to $[0, 2\pi]$. From the Taylor expansion of sine, we know that $$\sin(x)\approx x-{x^3\over3!}+{x^5\over 5!}-\dots=\sum^n_{k=0}{(-1)^kx^{2k+1}\over (2k+1)!}$$
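As a quick check of how fast the series converges, truncating after the $x^5$ term at $x=\pi/6\approx0.5236$ gives

$$\sin\left({\pi\over 6}\right)\approx 0.5236-{0.5236^3\over 3!}+{0.5236^5\over 5!}\approx 0.5236-0.0239+0.0003=0.5000$$

which already matches the exact value $1/2$ to four decimal places.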
## C code
### `fmul32`
Even though NaN, Inf, and denormalized numbers are not supported as input, we can still provide a usable floating-point multiplication. We return zero early if either `a` or `b` equals zero.
Furthermore, basic overflow detection can be carried out by checking whether the sign bit of the result matches the XOR of the input sign bits; this is what the expression `r ^ ia ^ ib` computes. If an overflow occurs, the variable `ovfl` becomes -1 after the arithmetic right shift; otherwise, it is 0.
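Viewed in isolation, the check works because the product's sign bit should equal the XOR of the operands' sign bits. A minimal standalone sketch (not part of the original code; it assumes arithmetic right shift on signed values, as the original code also does):

```c
#include <assert.h>
#include <stdint.h>

/* The product's sign should be sign(a) XOR sign(b). If the computed
 * result r ends up with the wrong sign bit (e.g. the exponent
 * overflowed into bit 31), the arithmetic shift spreads the mismatch
 * into an all-ones word, i.e. -1; otherwise the result is 0. */
static int32_t sign_overflow(int32_t ia, int32_t ib, int32_t r)
{
    return (r ^ ia ^ ib) >> 31;
}
```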
```c
float fmul32(float a, float b)
{
    /* TODO: Special values like NaN and INF */
    int32_t ia = *(int32_t *) &a, ib = *(int32_t *) &b;
    if (ia == 0 || ib == 0)
        return 0;
    /* mantissa */
    int32_t ma = (ia & 0x7FFFFF) | 0x800000;
    int32_t mb = (ib & 0x7FFFFF) | 0x800000;
    /* sign and exponent */
    int32_t sea = ia & 0xFF800000;
    int32_t seb = ib & 0xFF800000;
    /* result of mantissa */
    int32_t m = imul24(ma, mb);
    int32_t mshift = getbit(m, 24);
    m >>= mshift;
    int32_t r = ((sea - 0x3f800000 + seb) & 0xFF800000) + m -
                (0x800000 & -!mshift);
    int32_t ovfl = (r ^ ia ^ ib) >> 31;
    r = r ^ ((r ^ 0x7f800000) & ovfl);
    return *(float *) &r;
}
```
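Note that `fmul32` (and `imul24` below) calls a `getbit` helper that is not listed in this write-up. A minimal version consistent with how it is used, returning bit `n` of `x` as 0 or 1, might look like:

```c
#include <stdint.h>

/* Assumed helper (not shown in the original listing):
 * return bit `n` of `x`, i.e. 0 or 1. */
static inline int32_t getbit(int32_t x, int n)
{
    return (x >> n) & 1;
}
```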
### `imul32`
In floating-point multiplication, only the upper 25 bits of the mantissa product are necessary. To reflect this, I renamed the function from `imul32` to `imul24` in the following code. Instead of shifting the addend left each iteration, the partial result is shifted right, so the upper bits of the product are consistently preserved.
```c
static int32_t imul24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (; b; b >>= 1)
        r = (r >> 1) + (a & -getbit(b, 0));
    return r;
}
```
Since the input argument `b` is always 24 bits long, we can easily apply the loop unrolling technique as below to reduce the execution cycles.
```c
static int32_t imul24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (; b; b >>= 4) {
        r = (r >> 1) + (a & -getbit(b, 0));
        r = (r >> 1) + (a & -getbit(b, 1));
        r = (r >> 1) + (a & -getbit(b, 2));
        r = (r >> 1) + (a & -getbit(b, 3));
    }
    return r;
}
```
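For mantissa operands with bit 23 set, the loop runs exactly 24 times, so either version should equal the full 64-bit product shifted right by 23 bits. A standalone sanity check of that claim (the `getbit` helper is an assumed minimal version):

```c
#include <assert.h>
#include <stdint.h>

static int32_t getbit(int32_t x, int n) { return (x >> n) & 1; }

static int32_t imul24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (; b; b >>= 1)
        r = (r >> 1) + (a & -getbit(b, 0));
    return r;
}

/* For 24-bit mantissas (bit 23 set) the shift-and-add loop runs 24
 * times, so the result equals the full product shifted right by 23. */
static int32_t ref_mul(int32_t a, int32_t b)
{
    return (int32_t) (((int64_t) a * b) >> 23);
}
```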
### `fdiv32`
As in `fmul32`, NaN, Inf, and denormalized numbers are not allowed in `fdiv32`. Basic floating-point division is still provided using the same techniques mentioned above.
```c
float fdiv32(float a, float b)
{
    int32_t ia = *(int32_t *) &a, ib = *(int32_t *) &b;
    if (a == 0)
        return a;
    if (b == 0)
        return *(float *) &(int){0x7f800000};
    /* mantissa */
    int32_t ma = (ia & 0x7FFFFF) | 0x800000;
    int32_t mb = (ib & 0x7FFFFF) | 0x800000;
    /* sign and exponent */
    int32_t sea = ia & 0xFF800000;
    int32_t seb = ib & 0xFF800000;
    /* result of mantissa */
    int32_t m = idiv24(ma, mb);
    int32_t mshift = !getbit(m, 31);
    m <<= mshift;
    int32_t r = ((sea - seb + 0x3f800000) - (0x800000 & -mshift)) |
                ((m & 0x7fffff00) >> 8);
    int32_t ovfl = (ia ^ ib ^ r) >> 31;
    r = r ^ ((r ^ 0x7f800000) & ovfl);
    return *(float *) &r;
}
```
### `idiv24`
`idiv24` handles 32-bit integer division. The dividend and quotient are updated by shifting 1 bit left each iteration; if the dividend is greater than or equal to the divisor, the LSB of the quotient is set to 1. This is basically the same as [short division](https://en.wikipedia.org/wiki/Short_division#:~:text=In%20arithmetic%2C%20short%20division%20is,remainders%20are%20notated%20as%20superscripts.).
```c
static int32_t idiv24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r <<= 1;
        if (a - b < 0) {
            a <<= 1;
            continue;
        }
        r |= 1;
        a -= b;
        a <<= 1;
    }
    return r;
}
```
In the code above, `a` always needs to be shifted left; the only difference between the two paths is whether `b` is subtracted from `a` or not. Looking at it the other way around, we can always subtract first and then use `-(a < 0)` to decide whether `b` should be added back to `a`, which is called [restoring division](https://en.wikipedia.org/wiki/Division_algorithm).
```c
static int32_t idiv24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        a -= b;
        r = (r << 1) | (a >= 0);
        a = (a + (b & -(a < 0))) << 1;
    }
    return r;
}
```
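For mantissa operands in $[2^{23}, 2^{24})$ the quotient $a/b$ is below 2, so the 32 iterations produce a fixed-point quotient with 31 fraction bits; in other words, the result should equal `(a << 31) / b` computed in 64-bit arithmetic. A standalone check of that claim:

```c
#include <assert.h>
#include <stdint.h>

static int32_t idiv24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        a -= b;
        r = (r << 1) | (a >= 0);
        a = (a + (b & -(a < 0))) << 1;
    }
    return r;
}

/* For mantissas a, b in [2^23, 2^24), a/b < 2, so the quotient with
 * 31 fraction bits fits in 32 unsigned bits. */
static uint32_t ref_div(uint32_t a, uint32_t b)
{
    return (uint32_t) (((uint64_t) a << 31) / b);
}
```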
Also, since the loop count is fixed, the loop unrolling technique fits very well here.
> Ignore the loop unrolling example (Can be done by simply copy paste).
### `fadd32`
In floating-point addition, the operand with the smaller magnitude must be aligned to the one with the greater magnitude. The straightforward approach compares the exponents first and then the mantissas. However, by leveraging the IEEE 754 format, both comparisons can be performed at once by comparing the bit patterns directly as integers, as `cmp_a` and `cmp_b` show.
After aligning the values, the sign bits must be compared: if they differ, the aligned mantissa is subtracted instead of added.
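The direct comparison is valid because IEEE 754 stores the exponent field above the mantissa: for finite values, ordering the sign-masked bit patterns as integers matches ordering the magnitudes. A standalone check (using `memcpy` for the type pun):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns nonzero iff |x| < |y|, compared purely as masked integers.
 * Valid for finite IEEE 754 floats: the exponent sits above the
 * mantissa, so magnitude order equals integer order of the masked
 * bit patterns. */
static int mag_less(float x, float y)
{
    int32_t ix, iy;
    memcpy(&ix, &x, sizeof ix);
    memcpy(&iy, &y, sizeof iy);
    return (ix & 0x7fffffff) < (iy & 0x7fffffff);
}
```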
```c
float fadd32(float a, float b)
{
    int32_t ia = *(int32_t *) &a, ib = *(int32_t *) &b;
    int32_t cmp_a = ia & 0x7fffffff;
    int32_t cmp_b = ib & 0x7fffffff;
    if (cmp_a < cmp_b)
        iswap(ia, ib);
    /* exponent */
    int32_t ea = (ia >> 23) & 0xff;
    int32_t eb = (ib >> 23) & 0xff;
    /* mantissa */
    int32_t ma = (ia & 0x7fffff) | 0x800000;
    int32_t mb = (ib & 0x7fffff) | 0x800000;
    int32_t align = (ea - eb > 24) ? 24 : (ea - eb);
    mb >>= align;
    if ((ia ^ ib) >> 31)
        ma -= mb;
    else
        ma += mb;
    int32_t clz = count_leading_zeros(ma);
    int32_t shift = 0;
    if (clz <= 8) {
        shift = 8 - clz;
        ma >>= shift;
        ea += shift;
    } else {
        shift = clz - 8;
        ma <<= shift;
        ea -= shift;
    }
    int32_t r = (ia & 0x80000000) | (ea << 23) | (ma & 0x7fffff);
    return *(float *) &r;
}
```
### `f2i32` & `i2f32`
The functions `i2f32` and `f2i32` convert values between floating-point and integer representations; the "2" stands for "to", so `i2f` means "int to float".
```c
int f2i32(int x)
{
    int32_t a = *(int *) &x;
    int32_t ma = (a & 0x7FFFFF) | 0x800000;
    int32_t ea = ((a >> 23) & 0xFF) - 127;
    if (ea < 0)
        return 0;
    else if (ea <= 23)
        ma >>= (23 - ea);
    else
        ma <<= (ea - 23);
    return ma;
}

int i2f32(int x)
{
    if (x == 0)
        return 0;
    int32_t s = x & 0x80000000;
    if (s)
        x = -x;
    int32_t clz = count_leading_zeros(x);
    int32_t e = 31 - clz + 127;
    if (clz <= 8)
        x >>= 8 - clz;
    else
        x <<= clz - 8;
    int r = s | e << 23 | (x & 0x7fffff);
    return r;
}
```
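A useful sanity check is that `i2f32` must reproduce, bit for bit, the constant the compiler would encode for the same value. The sketch below repeats the algorithm with GCC/Clang's `__builtin_clz` standing in for `count_leading_zeros` (an assumption; the original helper is not shown here):

```c
#include <assert.h>
#include <stdint.h>

/* Same algorithm as i2f32 above, with the compiler builtin
 * __builtin_clz standing in for count_leading_zeros (an assumption,
 * not the original helper). Handles only magnitudes below 2^31. */
static int32_t i2f32_sketch(int32_t x)
{
    if (x == 0)
        return 0;
    int32_t s = x & 0x80000000;
    if (s)
        x = -x;
    int32_t clz = __builtin_clz((uint32_t) x);
    int32_t e = 31 - clz + 127;
    if (clz <= 8)
        x >>= 8 - clz;
    else
        x <<= clz - 8;
    return s | e << 23 | (x & 0x7fffff);
}
```

For example, `i2f32_sketch(5)` should yield `0x40A00000`, the IEEE 754 encoding of 5.0f.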
### `Sin`
The sine value can be obtained using the Taylor series. The following code is the corresponding implementation of the Taylor series for sine.
```c
float myPow(float x, int n)
{
    float r = 1.0;
    while (n) {
        if (n & 0x1) {
            r = fmul32(r, x);
            n -= 1;
        } else {
            x = fmul32(x, x);
            n >>= 1;
        }
    }
    return r;
}

// n!
float factorial(int n)
{
    float r = 1.0;
    for (int i = 1; i <= n; i++)
        r = fmul32(r, i2f32(i));
    return r;
}

// Sine by Taylor series
float mySin(float x)
{
    float r = 0.0;
    for (int n = 0; n < 5; n++) {
        int k = f2i32(fadd32(fmul32(i2f32(2), i2f32(n)), i2f32(1)));
        int s = 1 ^ ((-2) & -(n & 0x1)); // s = (-1)^n
        r = fadd32(r, fdiv32(fmul32(i2f32(s), myPow(x, k)), factorial(k)));
    }
    return r;
}
```
### Bare C code performance
with `-O0` option

with `-O2` option

with `-O3` option

## ABI code
### `fmul32` & `imul24`
By directly incorporating the logic of `imul24` within `fmul32`, we eliminate the need for an additional function call, which can improve performance and reduce the store/load operation overhead. This technique is often employed in optimization to enhance the efficiency of code execution.
```clike
fmul32:
# prologue
addi sp, sp, -4
sw ra, 0(sp)
seqz t0, a0
seqz t1, a1
or t0, t0, t1
beqz t0, 2f
li a0, 0
j 3f
2: li t2, 0x7FFFFF
and t0, a0, t2
and t1, a1, t2
addi t2, t2, 1
or t0, t0, t2 # t0 = ma
or t1, t1, t2 # t1 = mb
imul24:
li t3, 0 # t3 = m, r(in imul24)
1:
andi t4, t1, 0x1
neg t4, t4
and t4, t0, t4
srli t3, t3, 1
add t3, t3, t4
srli t1, t1, 1
bnez t1, 1b
mv t0, a0 # t0 = a
mv t1, a1 # t1 = b
mv a0, t3
li a1, 24
jal getbit # a0 = mshift
# m(t3) value computed
srl t3, t3, a0
li t2, 0xFF800000
and t0, t0, t2 # t0 = sea
and t1, t1, t2 # t1 = seb
li t2, 0x3f800000
sub t4, t0, t2
add t4, t4, t1
li t2, 0xFF800000
and t4, t4, t2 # t4 = ((sea - 0x3f800000 + seb) & 0xFF800000)
li t2, 0x7fffff
slli a0, a0, 23
or a0, a0, t2
and a0, a0, t3
add a0, a0, t4 # a0 = r(in fmul32)
# check overflow
xor t3, t0, t1
xor t3, t3, a0
srli t3, t3, 31 # t3 = ovfl
li t2, 0x7f800000
xor t4, t2, a0
and t4, t4, t3
xor a0, a0, t4
3: # epilogue
lw ra, 0(sp)
addi sp, sp, 4
ret
```
### `fdiv32` & `idiv24`
The same idea applies to `fdiv32` and `idiv24`. We embed `idiv24` into `fdiv32` to avoid the additional load/store overhead of a function call.
```clike
fdiv32:
# prologue
addi sp, sp, -4
sw ra, 0(sp)
beqz a0, 3f
bnez a1, 1f
li a0, 0x7f800000
j 3f
1:
li t2, 0x7FFFFF
and t0, t2, a0
and t1, t2, a1
addi t2, t2, 1
or t0, t0, t2 # t0 = ma
or t1, t1, t2 # t1 = mb
idiv24:
li t3, 0 # t3 = m, r(in idiv24)
li t4, 32 # t4 = end condition
2:
sub t0, t0, t1
sltz t2, t0
seqz t2, t2
slli t3, t3, 1
or t3, t3, t2
seqz t2, t2
neg t2, t2
and t5, t2, t1 # t5 = b & -(a < 0)
add t0, t0, t5
slli t0, t0, 1
addi t4, t4, -1
bnez t4, 2b
li t2, 0xFF800000
and t0, a0, t2 # t0 = sea
and t1, a1, t2 # t1 = seb
mv a0, t3
li a1, 31
jal getbit
seqz a0, a0 # a0 = mshift
sll t3, t3, a0 # t3 = m
li t2, 0x3f800000
sub t4, t0, t1
add t4, t4, t2 # t4 = sea - seb + 0x3f800000
neg a0, a0
li t2, 0x800000
and a0, a0, t2 # a0 = 0x800000 & -mshift
srli t3, t3, 8
addi t2, t2, -1
and t3, t3, t2 # t3 = m
sub a0, t4, a0
or a0, a0, t3
# check overflow
xor t3, t1, t0
xor t3, t3, a0
srli t3, t3, 31
li t2, 0x7f800000
xor t4, t2, a0
and t4, t4, t3
xor a0, a0, t4
3:
# epilogue
lw ra, 0(sp)
addi sp, sp, 4
ret
```
### `fadd32`
```clike
fadd32:
# prologue
addi sp, sp, -4
sw ra, 0(sp)
li t2, 0x7fffffff
and t0, a0, t2 # t0 = cmp_a
and t1, a1, t2 # t1 = cmp_b
bge t0, t1, 1f
mv t2, a0 # swap a0 = ia, a1 = ib
mv a0, a1
mv a1, t2
1:
srli t0, a0, 23
srli t1, a1, 23
andi t0, t0, 0xff # t0 = ea
andi t1, t1, 0xff # t1 = eb
li t2, 0x7fffff
and t3, a0, t2
and t4, a1, t2
addi t2, t2, 1
or t3, t3, t2 # t3 = ma
or t4, t4, t2 # t4 = mb
sub t2, t0, t1 # t2 = align
li t5, 24
bge t5, t2, 2f
li t2, 24
2:
srl t4, t4, t2 # mb >>= align
xor t2, a0, a1
srli t2, t2, 31
beqz t2, 3f
neg t4, t4
3:
add t3, t3, t4 # t3 = result of ma
# t1(eb) and t4(mb) are free to use
mv t4, a0 # t4 = ia
mv a0, t3
jal count_leading_zeros # a0 = clz
li t2, 8
blt t2, a0, 4f
sub t5, t2, a0
srl t3, t3, t5
add t0, t0, t5
j 5f
4:
sub t5, a0, t2 # t5 = shift
sll t3, t3, t5
sub t0, t0, t5
5:
li t2, 0x80000000
and a0, t2, t4
slli t0, t0, 23
or a0, a0, t0
li t2, 0x7fffff
and t3, t3, t2
or a0, a0, t3
# epilogue
lw ra, 0(sp)
addi sp, sp, 4
ret
```
### Performance
Upon analysis, it is clear that the CPI and the number of executed instructions (Instrs. Retired) fall somewhere between the values achieved with the `-O2` and `-O0` options in GCC. This indicates there is ample room for improvement.

## Assembly optimization
### Loop unrolling

In the selected Ripes CPU architecture, there is no branch prediction mechanism, and branch target computation takes place in the 3rd stage. Consequently, a branch miss incurs a penalty of 2 cycles, which is notably high in this case.
Based on the given scenario, loop unrolling appears to be an ideal optimization technique in this case. I have applied the unrolling method to the `imul24` and `idiv24` functions, both of which are extensively used in this homework assignment.
```diff
imul24:
li t3, 0 # t3 = m, r(in imul24)
1:
andi t4, t1, 0x1
neg t4, t4
and t4, t0, t4
srli t3, t3, 1
add t3, t3, t4
srli t1, t1, 1
+ andi t4, t1, 0x1 # unroll 2
+ neg t4, t4
+ and t4, t0, t4
+ srli t3, t3, 1
+ add t3, t3, t4
+ srli t1, t1, 1
+ andi t4, t1, 0x1 # unroll 3
+ neg t4, t4
+ and t4, t0, t4
+ srli t3, t3, 1
+ add t3, t3, t4
+ srli t1, t1, 1
+ andi t4, t1, 0x1 # unroll 4
+ neg t4, t4
+ and t4, t0, t4
+ srli t3, t3, 1
+ add t3, t3, t4
+ srli t1, t1, 1
+ bnez t1, 1b
```
```diff
idiv24:
li t3, 0 # t3 = m, r(in idiv24)
li t4, 32 # t4 = end condition
2:
sub t0, t0, t1
sltz t2, t0
seqz t2, t2
slli t3, t3, 1
or t3, t3, t2
seqz t2, t2
neg t2, t2
and t5, t2, t1 # t5 = b & -(a < 0)
add t0, t0, t5
slli t0, t0, 1
+ sub t0, t0, t1
+ sltz t2, t0
+ seqz t2, t2
+ slli t3, t3, 1
+ or t3, t3, t2
+ seqz t2, t2
+ neg t2, t2
+ and t5, t2, t1 # t5 = b & -(a < 0)
+ add t0, t0, t5
+ slli t0, t0, 1
+ sub t0, t0, t1
+ sltz t2, t0
+ seqz t2, t2
+ slli t3, t3, 1
+ or t3, t3, t2
+ seqz t2, t2
+ neg t2, t2
+ and t5, t2, t1 # t5 = b & -(a < 0)
+ add t0, t0, t5
+ slli t0, t0, 1
+ sub t0, t0, t1
+ sltz t2, t0
+ seqz t2, t2
+ slli t3, t3, 1
+ or t3, t3, t2
+ seqz t2, t2
+ neg t2, t2
+ and t5, t2, t1 # t5 = b & -(a < 0)
+ add t0, t0, t5
+ slli t0, t0, 1
- addi t4, t4, -1
+ addi t4, t4, -4
+ bnez t4, 2b
```
### Performance of Loop unrolling
Upon analysis, it is apparent that the CPI is **lower than** at the `-O3` optimization level, yet the number of executed instructions (Instrs. Retired) is higher: we still execute about 2000 more instructions in this implementation. This implies there is still room for optimizing our assembly code; some redundant code can likely be eliminated for better efficiency.

### Still thinking...
## Test Case
The values exhibit slight differences when compared to `sinf` in the C math library. The primary deviation arises from the rounding strategy: our implementation always truncates (rounds toward zero), whereas [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754#Rounding_rules) defaults to "round to nearest, ties to even".
However, these discrepancies are acceptable, since single-precision floating point only guarantees accuracy up to 6 to 7 significant decimal digits.
| degree | mySin | sinf (math.h) |
| ------ | ------------ | ------------- |
| 45.0 | 0.7071068287 | 0.7071067691 |
| 20.0 | 0.3420201540 | 0.3420201242 |
| 90.0 | 1.0000035763 | 1.0000000000 |