# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [`ranvd`](https://github.com/ranvd) >

## Get sine value without floating point multiplication support

The sine function holds significant importance in both mathematics and computer graphics. Within the context of the [project](https://github.com/miloyip/line), the sine and cosine functions are employed to calculate the phase of where lines should be drawn.

Since the [sine](https://en.wikipedia.org/wiki/Sine_and_cosine) function is defined as the ratio of the side opposite an angle to the corresponding [hypotenuse](https://en.wikipedia.org/wiki/Hypotenuse), its value always sits in $[-1,1]$. We can visualize the sine values as the function shown in the image below.

![](https://hackmd.io/_uploads/ry5XSnGga.png)

According to the [Taylor expansion](https://en.wikipedia.org/wiki/Taylor_series), a smooth function such as sine can be approximated by its Taylor series. So, in computer science, we can use this technique to compute the sine value, restricting the domain to $[0, 2\pi]$. From the Taylor expansion we know that

$$\sin(x)\approx x-{x^3\over3!}+{x^5\over 5!}-\dots=\sum^n_{k=0}{(-1)^kx^{2k+1}\over (2k+1)!}$$

## C code

### `fmul32`

NaN, Inf, and denormalized numbers are not supported as inputs, but we can still provide a usable floating-point multiplication. We return zero when either `a` or `b` equals zero. Furthermore, basic overflow detection can be carried out by checking whether the sign bit of the result matches the expected sign, i.e., the XOR of the two operands' sign bits; hence the expression `r ^ ia ^ ib`. If an overflow occurs, the variable `ovfl` will be set to -1; otherwise, it will be set to 0.

```c
float fmul32(float a, float b)
{
    /* TODO: Special values like NaN and INF */
    int32_t ia = *(int32_t *)&a, ib = *(int32_t *)&b;
    if (ia == 0 || ib == 0)
        return 0;

    /* mantissa */
    int32_t ma = (ia & 0x7FFFFF) | 0x800000;
    int32_t mb = (ib & 0x7FFFFF) | 0x800000;

    /* sign and exponent */
    int32_t sea = ia & 0xFF800000;
    int32_t seb = ib & 0xFF800000;

    /* result of mantissa */
    int32_t m = imul24(ma, mb);
    int32_t mshift = getbit(m, 24);
    m >>= mshift;

    int32_t r = ((sea - 0x3f800000 + seb) & 0xFF800000) + m - (0x800000 & -!mshift);
    int32_t ovfl = (r ^ ia ^ ib) >> 31;
    r = r ^ ((r ^ 0x7f800000) & ovfl);
    return *(float *)&r;
}
```

### `imul32`

In floating-point multiplication, only the upper 25 bits of the mantissa product are actually needed. To accommodate this, I have renamed the function from `imul32` to `imul24` in the following code. Instead of shifting the addend `a` to the left on every step, this approach shifts the partial result to the right, so the upper bits of the product are consistently preserved.

```c
static int32_t imul24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (; b; b >>= 1)
        r = (r >> 1) + (a & -getbit(b, 0));
    return r;
}
```

Since the input argument `b` always has a 24-bit length, I can easily apply the loop unrolling technique as below to reduce the execution cycles.

```c
static int32_t imul24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (; b; b >>= 4) {
        r = (r >> 1) + (a & -getbit(b, 0));
        r = (r >> 1) + (a & -getbit(b, 1));
        r = (r >> 1) + (a & -getbit(b, 2));
        r = (r >> 1) + (a & -getbit(b, 3));
    }
    return r;
}
```
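Both `fmul32` and `imul24` rely on a `getbit` helper that is not listed here (it also appears in `fdiv32` below). A minimal sketch, assuming it simply extracts bit `n` of its first argument, could look like the following; this exact implementation is an assumption, not the original code.

```c
#include <stdint.h>

/* Assumed helper: return bit `n` of `x` (0 = LSB), i.e. 0 or 1. */
static inline int32_t getbit(uint32_t x, int n)
{
    return (x >> n) & 1;
}
```

With such a definition, `-getbit(b, 0)` evaluates to 0 or -1, which is exactly the all-zeros/all-ones mask that `imul24` uses to conditionally add `a`.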
### `fdiv32`

As in `fmul32`, NaN, Inf, and denormalized numbers are not allowed in `fdiv32`. I still provide a basic floating-point division using the same technique mentioned above.

```c
float fdiv32(float a, float b)
{
    int32_t ia = *(int32_t *)&a, ib = *(int32_t *)&b;
    if (a == 0)
        return a;
    if (b == 0)
        return *(float *)&(int){0x7f800000};

    /* mantissa */
    int32_t ma = (ia & 0x7FFFFF) | 0x800000;
    int32_t mb = (ib & 0x7FFFFF) | 0x800000;

    /* sign and exponent */
    int32_t sea = ia & 0xFF800000;
    int32_t seb = ib & 0xFF800000;

    /* result of mantissa */
    int32_t m = idiv24(ma, mb);
    int32_t mshift = !getbit(m, 31);
    m <<= mshift;

    int32_t r = ((sea - seb + 0x3f800000) - (0x800000 & -mshift)) | ((m & 0x7fffff00) >> 8);
    int32_t ovfl = (ia ^ ib ^ r) >> 31;
    r = r ^ ((r ^ 0x7f800000) & ovfl);
    return *(float *)&r;
    // return a / b;
}
```

### `idiv24`

`idiv24` performs a 32-step shift-and-subtract division. On each iteration the quotient is shifted left by one bit; if the current remainder is not smaller than the divisor, the divisor is subtracted and the new quotient bit is set to 1. The remainder is then shifted left for the next iteration. This is basically the same as [short division](https://en.wikipedia.org/wiki/Short_division#:~:text=In%20arithmetic%2C%20short%20division%20is,remainders%20are%20notated%20as%20superscripts.).

```c
static int32_t idiv24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r <<= 1;
        if (a - b < 0) {
            a <<= 1;
            continue;
        }
        r |= 1;
        a -= b;
        a <<= 1;
    }
    return r;
}
```

In the code above, `a` always needs to be shifted left; the only difference between the two branches is whether `b` is subtracted from `a`. Going the other way, we can always subtract `b` first and use `-(a < 0)` to decide whether `b` should be added back to `a`, which is known as [restoring division](https://en.wikipedia.org/wiki/Division_algorithm).

```c
static int32_t idiv24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        a -= b;
        r = (r << 1) | (a >= 0);
        a = (a + (b & -(a < 0))) << 1;
    }
    return r;
}
```

Also, since the loop count is fixed, the loop unrolling technique fits very well.

> The loop-unrolled example is omitted here (it can be produced by simple copy and paste).

### `fadd32`

In floating-point addition, it is essential to align the smaller operand with the greater one. In the straightforward approach, the comparison checks the exponent first and then the mantissa. However, by leveraging the IEEE 754 format, the two magnitudes can be compared directly as integers, as `cmp_a` and `cmp_b` show. After aligning the values, it is necessary to check whether the sign bits are equal; if they differ, a subtraction is applied instead of an addition.

```c
float fadd32(float a, float b)
{
    int32_t ia = *(int32_t *)&a, ib = *(int32_t *)&b;
    int32_t cmp_a = ia & 0x7fffffff;
    int32_t cmp_b = ib & 0x7fffffff;
    if (cmp_a < cmp_b)
        iswap(ia, ib);

    /* exponent */
    int32_t ea = (ia >> 23) & 0xff;
    int32_t eb = (ib >> 23) & 0xff;

    /* mantissa */
    int32_t ma = (ia & 0x7fffff) | 0x800000;
    int32_t mb = (ib & 0x7fffff) | 0x800000;

    int32_t align = (ea - eb > 24) ? 24 : (ea - eb);
    mb >>= align;

    if ((ia ^ ib) >> 31)
        ma -= mb;
    else
        ma += mb;

    int32_t clz = count_leading_zeros(ma);
    int32_t shift = 0;
    if (clz <= 8) {
        shift = 8 - clz;
        ma >>= shift;
        ea += shift;
    } else {
        shift = clz - 8;
        ma <<= shift;
        ea -= shift;
    }

    int32_t r = (ia & 0x80000000) | (ea << 23) | (ma & 0x7fffff);
    return *(float *)&r;
}
```
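`fadd32` depends on two more helpers that are not listed: `iswap` and `count_leading_zeros` (the latter is also used by `i2f32` below). The following is only a sketch of one possible implementation; both definitions are assumptions rather than the original code.

```c
#include <stdint.h>

/* Assumed helper: swap two integer lvalues in place. */
#define iswap(x, y)        \
    do {                   \
        int32_t tmp = (x); \
        (x) = (y);         \
        (y) = tmp;         \
    } while (0)

/* Assumed helper: count leading zero bits of a 32-bit value (returns 32 for 0). */
static int32_t count_leading_zeros(uint32_t x)
{
    if (x == 0)
        return 32;
    int32_t n = 0;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
}
```

On a real build one could also use a compiler builtin such as `__builtin_clz` (undefined for a zero input), but the explicit loop keeps the sketch close to the assembly version shown later.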
### `f2i32` & `i2f32`

The functions `f2i32` and `i2f32` convert values between floating-point and integer representations; the "2" stands for "to", so `i2f` means "int to float".

```c
int f2i32(float x)
{
    int32_t a = *(int32_t *)&x;
    int32_t ma = (a & 0x7FFFFF) | 0x800000;
    int32_t ea = ((a >> 23) & 0xFF) - 127;
    if (ea < 0)
        return 0;
    else if (ea <= 23)
        ma >>= (23 - ea);
    else
        ma <<= (ea - 23);
    return ma;
}

float i2f32(int x)
{
    if (x == 0)
        return 0;
    int32_t s = x & 0x80000000;
    if (s)
        x = -x;
    int32_t clz = count_leading_zeros(x);
    int32_t e = 31 - clz + 127;
    if (clz <= 8) {
        x >>= 8 - clz;
    } else {
        x <<= clz - 8;
    }
    int32_t r = s | (e << 23) | (x & 0x7fffff);
    return *(float *)&r;
}
```

### `Sin`

The sine value can be obtained using the Taylor series. The following code is the corresponding implementation of the Taylor series for the sine calculation.

```c
float myPow(float x, int n)
{
    float r = 1.0;
    while (n) {
        if (n & 0x1) {
            r = fmul32(r, x);
            n -= 1;
        } else {
            x = fmul32(x, x);
            n >>= 1;
        }
    }
    return r;
}

// n!
float factorial(int n)
{
    float r = 1.0;
    for (int i = 1; i <= n; i++)
        r = fmul32(r, i2f32(i));
    return r;
}

// Sine by Taylor series
float mySin(float x)
{
    float r = 0.0;
    for (int n = 0; n < 5; n++) {
        int k = f2i32(fadd32(fmul32(i2f32(2), i2f32(n)), i2f32(1)));  // k = 2n + 1
        int s = 1 ^ ((-2) & -(n & 0x1));                              // s = (-1)^n
        r = fadd32(r, fdiv32(fmul32(i2f32(s), myPow(x, k)), factorial(k)));
    }
    return r;
}
```

### Bare C code performance

With the `-O0` option:

![](https://hackmd.io/_uploads/SyM4TLdlp.png)

With the `-O2` option:

![](https://hackmd.io/_uploads/B1Bjgv_ga.png)

With the `-O3` option:

![](https://hackmd.io/_uploads/ryxKgw_ea.png)

## ABI code

### `fmul32` & `imul24`

By directly incorporating the logic of `imul24` within `fmul32`, we eliminate the need for an additional function call, which improves performance and reduces the load/store overhead around the call. This inlining technique is commonly employed to make code execution more efficient.

```clike
fmul32:
    # prologue
    addi sp, sp, -4
    sw ra, 0(sp)

    seqz t0, a0              # return 0 if a == 0 or b == 0
    seqz t1, a1
    or t0, t0, t1
    beqz t0, 2f
    li a0, 0
    j 3f
2:
    li t2, 0x7FFFFF
    and t0, a0, t2
    and t1, a1, t2
    addi t2, t2, 1
    or t0, t0, t2            # t0 = ma
    or t1, t1, t2            # t1 = mb
imul24:
    li t3, 0                 # t3 = m, r(in imul24)
1:
    andi t4, t1, 0x1
    neg t4, t4
    and t4, t0, t4
    srli t3, t3, 1
    add t3, t3, t4
    srli t1, t1, 1
    bnez t1, 1b

    mv t0, a0                # t0 = a
    mv t1, a1                # t1 = b
    mv a0, t3
    li a1, 24
    jal getbit               # a0 = mshift

    # m(t3) value computed
    srl t3, t3, a0
    li t2, 0xFF800000
    and t0, t0, t2           # t0 = sea
    and t1, t1, t2           # t1 = seb
    li t2, 0x3f800000
    sub t4, t0, t2
    add t4, t4, t1
    li t2, 0xFF800000
    and t4, t4, t2           # t4 = ((sea - 0x3f800000 + seb) & 0xFF800000)
    li t2, 0x7fffff
    slli a0, a0, 23
    or a0, a0, t2
    and a0, a0, t3
    add a0, a0, t4           # a0 = r(in fmul32)

    # check overflow
    xor t3, t0, t1
    xor t3, t3, a0
    srli t3, t3, 31          # t3 = ovfl
    li t2, 0x7f800000
    xor t4, t2, a0
    and t4, t4, t3
    xor a0, a0, t4
3:
    # epilogue
    lw ra, 0(sp)
    addi sp, sp, 4
    ret
```
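The assembly above ends the inlined multiply with a `jal getbit` call, but the `getbit` routine itself is not listed. A sketch that matches the calling convention used here (`a0` = value, `a1` = bit index, result returned in `a0`) could be as simple as the following; it is an assumption, not the original routine.

```clike
getbit:                      # assumed helper, not part of the original listing
    srl a0, a0, a1           # move the requested bit down to bit 0
    andi a0, a0, 1           # keep only that bit
    ret
```

Since it only writes `a0`, it does not disturb the temporaries (`t0`-`t5`) that `fmul32` and `fdiv32` keep live across the call.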
### `fdiv32` & `idiv24`

The same idea applies to `fdiv32` and `idiv24`: `idiv24` is embedded directly in `fdiv32` to avoid the additional load/store overhead of a function call.

```clike
fdiv32:
    # prologue
    addi sp, sp, -4
    sw ra, 0(sp)

    beqz a0, 3f              # a == 0: return a
    bnez a1, 1f              # b == 0: return +Inf
    li a0, 0x7f800000
    j 3f
1:
    li t2, 0x7FFFFF
    and t0, t2, a0
    and t1, t2, a1
    addi t2, t2, 1
    or t0, t0, t2            # t0 = ma
    or t1, t1, t2            # t1 = mb
idiv24:
    li t3, 0                 # t3 = m, r(in idiv24)
    li t4, 32                # t4 = end condition
2:
    sub t0, t0, t1
    sltz t2, t0
    seqz t2, t2
    slli t3, t3, 1
    or t3, t3, t2
    seqz t2, t2
    neg t2, t2
    and t5, t2, t1           # t5 = b & -(a < 0)
    add t0, t0, t5
    slli t0, t0, 1
    addi t4, t4, -1
    bnez t4, 2b

    li t2, 0xFF800000
    and t0, a0, t2           # t0 = sea
    and t1, a1, t2           # t1 = seb
    mv a0, t3
    li a1, 31
    jal getbit
    seqz a0, a0              # a0 = mshift
    sll t3, t3, a0           # t3 = m
    li t2, 0x3f800000
    sub t4, t0, t1
    add t4, t4, t2           # t4 = sea - seb + 0x3f800000
    neg a0, a0
    li t2, 0x800000
    and a0, a0, t2           # a0 = 0x800000 & -mshift
    srli t3, t3, 8
    addi t2, t2, -1
    and t3, t3, t2           # t3 = m
    sub a0, t4, a0
    or a0, a0, t3

    # check overflow
    xor t3, t1, t0
    xor t3, t3, a0
    srli t3, t3, 31
    li t2, 0x7f800000
    xor t4, t2, a0
    and t4, t4, t3
    xor a0, a0, t4
3:
    # epilogue
    lw ra, 0(sp)
    addi sp, sp, 4
    ret
```

### `fadd32`

```clike
fadd32:
    # prologue
    addi sp, sp, -4
    sw ra, 0(sp)

    li t2, 0x7fffffff
    and t0, a0, t2           # t0 = cmp_a
    and t1, a1, t2           # t1 = cmp_b
    bge t0, t1, 1f
    mv t2, a0                # swap a0 = ia, a1 = ib
    mv a0, a1
    mv a1, t2
1:
    srli t0, a0, 23
    srli t1, a1, 23
    andi t0, t0, 0xff        # t0 = ea
    andi t1, t1, 0xff        # t1 = eb
    li t2, 0x7fffff
    and t3, a0, t2
    and t4, a1, t2
    addi t2, t2, 1
    or t3, t3, t2            # t3 = ma
    or t4, t4, t2            # t4 = mb

    sub t2, t0, t1           # t2 = align
    li t5, 24
    bge t5, t2, 2f
    li t2, 24
2:
    srl t4, t4, t2           # mb >>= align
    xor t2, a0, a1
    srli t2, t2, 31
    beqz t2, 3f
    neg t4, t4
3:
    add t3, t3, t4           # t3 = result of ma

    # t1(eb) and t4(mb) are free to use
    mv t4, a0                # t4 = ia
    mv a0, t3
    jal count_leading_zeros  # a0 = clz

    li t2, 8
    blt t2, a0, 4f
    sub t5, t2, a0
    srl t3, t3, t5
    add t0, t0, t5
    j 5f
4:
    sub t5, a0, t2           # t5 = shift
    sll t3, t3, t5
    sub t0, t0, t5
5:
    li t2, 0x80000000
    and a0, t2, t4
    slli t0, t0, 23
    or a0, a0, t0
    li t2, 0x7fffff
    and t3, t3, t2
    or a0, a0, t3

    # epilogue
    lw ra, 0(sp)
    addi sp, sp, 4
    ret
```
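`fadd32` likewise calls a `count_leading_zeros` routine that is not shown. A minimal sketch, assumed rather than taken from the original code, scans from the MSB and clobbers only `t1`, `t2`, and `t6`, all of which `fadd32` either redefines or never uses after the call.

```clike
count_leading_zeros:         # assumed helper, not part of the original listing
    li t6, 0                 # t6 = number of leading zeros counted so far
    li t1, 0x80000000        # t1 = probe mask, starting at the MSB
1:
    beqz t1, 2f              # all 32 bits scanned (input was zero)
    and t2, a0, t1
    bnez t2, 2f              # found the highest set bit
    addi t6, t6, 1
    srli t1, t1, 1
    j 1b
2:
    mv a0, t6                # return the count in a0
    ret
```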
### Performance

Upon analysis, the CPI and the number of executed instructions (Instrs. Retired) fall somewhere between the values achieved with GCC's `-O0` and `-O2` options. This indicates there is ample opportunity for improvement.

![](https://hackmd.io/_uploads/S14WZDOeT.png)

## Assembly optimization

### Loop unrolling

![](https://hackmd.io/_uploads/BJMKiOOe6.png)

In the selected Ripes CPU configuration, there is no branch prediction mechanism, and branch target addresses are computed in the 3rd stage. Consequently, a taken branch incurs a penalty of 2 cycles, which is notably high in this case. Given this, loop unrolling appears to be an ideal optimization technique. I have applied the unrolling method to the `imul24` and `idiv24` loops, both of which are used extensively in this homework assignment.

```diff
 imul24:
     li t3, 0                 # t3 = m, r(in imul24)
 1:
     andi t4, t1, 0x1
     neg t4, t4
     and t4, t0, t4
     srli t3, t3, 1
     add t3, t3, t4
     srli t1, t1, 1
+    andi t4, t1, 0x1         # unroll 2
+    neg t4, t4
+    and t4, t0, t4
+    srli t3, t3, 1
+    add t3, t3, t4
+    srli t1, t1, 1
+    andi t4, t1, 0x1         # unroll 3
+    neg t4, t4
+    and t4, t0, t4
+    srli t3, t3, 1
+    add t3, t3, t4
+    srli t1, t1, 1
+    andi t4, t1, 0x1         # unroll 4
+    neg t4, t4
+    and t4, t0, t4
+    srli t3, t3, 1
+    add t3, t3, t4
+    srli t1, t1, 1
     bnez t1, 1b
```

```diff
 idiv24:
     li t3, 0                 # t3 = m, r(in idiv24)
     li t4, 32                # t4 = end condition
 2:
     sub t0, t0, t1
     sltz t2, t0
     seqz t2, t2
     slli t3, t3, 1
     or t3, t3, t2
     seqz t2, t2
     neg t2, t2
     and t5, t2, t1           # t5 = b & -(a < 0)
     add t0, t0, t5
     slli t0, t0, 1
+    sub t0, t0, t1           # unroll 2
+    sltz t2, t0
+    seqz t2, t2
+    slli t3, t3, 1
+    or t3, t3, t2
+    seqz t2, t2
+    neg t2, t2
+    and t5, t2, t1           # t5 = b & -(a < 0)
+    add t0, t0, t5
+    slli t0, t0, 1
+    sub t0, t0, t1           # unroll 3
+    sltz t2, t0
+    seqz t2, t2
+    slli t3, t3, 1
+    or t3, t3, t2
+    seqz t2, t2
+    neg t2, t2
+    and t5, t2, t1           # t5 = b & -(a < 0)
+    add t0, t0, t5
+    slli t0, t0, 1
+    sub t0, t0, t1           # unroll 4
+    sltz t2, t0
+    seqz t2, t2
+    slli t3, t3, 1
+    or t3, t3, t2
+    seqz t2, t2
+    neg t2, t2
+    and t5, t2, t1           # t5 = b & -(a < 0)
+    add t0, t0, t5
+    slli t0, t0, 1
-    addi t4, t4, -1
+    addi t4, t4, -4
     bnez t4, 2b
```

### Performance of Loop unrolling

Upon analysis, the CPI is **lower than** at the `-O3` optimization level, yet the number of executed instructions (Instrs. Retired) is higher: we still execute about 2,000 more instructions in this implementation. This implies there is still room for optimizing the assembly code; there is likely some redundant code that can be eliminated for better efficiency.

![](https://hackmd.io/_uploads/HJmPlKOgp.png)

### Still thinking...

## Test Case

The values exhibit slight differences when compared to `sinf` in the C math library. The primary deviation arises from the rounding strategy: our implementation always rounds down (truncates), whereas [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754#Rounding_rules) defaults to "round to nearest, ties to even". These discrepancies are acceptable, since single-precision floating point only guarantees about 6 to 7 significant decimal digits.

| degree | mySin        | sinf (math.h) |
| ------ | ------------ | ------------- |
| 45.0   | 0.7071068287 | 0.7071067691  |
| 20.0   | 0.3420201540 | 0.3420201242  |
| 90.0   | 1.0000035763 | 1.0000000000  |
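For reference, a comparison like the one above can be reproduced with a small host-side harness along the following lines. This is only a sketch: the `PI` constant, the degree list, and the use of the host's native floating point for the degree-to-radian conversion are assumptions made for illustration.

```c
#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979f

float mySin(float x); /* the Taylor-series implementation above */

int main(void)
{
    const float degrees[] = {45.0f, 20.0f, 90.0f};
    for (int i = 0; i < 3; i++) {
        float rad = degrees[i] * PI / 180.0f; /* degree -> radian */
        printf("%6.1f  %.10f  %.10f\n", degrees[i], mySin(rad), sinf(rad));
    }
    return 0;
}
```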