Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by < ranvd >

Get the sine value without floating-point multiplication support

The sine function holds significant importance in both mathematics and computer graphics. Within the context of this project, the sine and cosine functions are employed to calculate the phase that determines where lines should be drawn.

Since the sine function is defined as the ratio of the side opposite an angle to the hypotenuse, its value always lies in $[-1, 1]$. We can visualize the sine value as a function, as in the figure below.
(figure: plot of the sine function)

According to Taylor's theorem, any sufficiently smooth function can be approximated by its Taylor series. So, in our program, we can use this technique to compute the sine value after fixing the domain to $[0, 2\pi]$. From the Taylor expansion, we know that

$$\sin(x) \approx x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots = \sum_{k=0}^{n} \frac{(-1)^k\, x^{2k+1}}{(2k+1)!}$$
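The range reduction itself is not shown in this note. Below is a minimal sketch of one possible way to do it; the helper name reduce_range is my own, and plain C float arithmetic is used for brevity (in the project these operations would go through the fadd32 described later).

/* Hypothetical helper (not part of the original code): bring x into
 * [0, 2*pi) by repeated addition/subtraction before applying the
 * Taylor series. Plain float arithmetic is used here for brevity. */
static float reduce_range(float x)
{
    const float two_pi = 6.28318531f;
    while (x >= two_pi)
        x -= two_pi;
    while (x < 0.0f)
        x += two_pi;
    return x;
}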

C code

fmul32

Since we don't support NaN, Inf, or denormalized numbers as input, we can still provide a "ready to use" floating-point multiplication. We return zero early by checking whether either a or b equals zero.

Furthermore, basic overflow detection can be carried out by examining the sign bit of the multiplication result and checking whether it matches the expected sign (the XOR of the two input sign bits). This gives the expression (r ^ ia ^ ib) >> 31: if an overflow occurs, the variable ovfl is set to -1 (all ones); otherwise, it is 0.

float fmul32(float a, float b)
{
    /* TODO: Special values like NaN and INF */
    int32_t ia = *(int32_t *)&a, ib = *(int32_t *)&b;
    if (ia == 0 || ib == 0) return 0;
    
    /* mantissa */
    int32_t ma = (ia & 0x7FFFFF) | 0x800000;
    int32_t mb = (ib & 0x7FFFFF) | 0x800000;

    int32_t sea = ia & 0xFF800000;
    int32_t seb = ib & 0xFF800000;

    /* result of mantissa */
    int32_t m = imul24(ma, mb);
    int32_t mshift = getbit(m, 24);
    m >>= mshift;

    int32_t r = ((sea - 0x3f800000 + seb) & 0xFF800000) + m - (0x800000 & -!mshift);
    int32_t ovfl = (r ^ ia ^ ib) >> 31;
    r = r ^ ( (r ^ 0x7f800000) & ovfl);
    return *(float *)&r;
}
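The helper getbit used above (and throughout the rest of the code) is not shown in this note. A minimal definition consistent with how it is called here, i.e. returning bit n of x as 0 or 1, would look like the following (my assumption):

/* Assumed definition of getbit: extract bit n of x as 0 or 1. */
static inline int32_t getbit(int32_t x, int n)
{
    return ((uint32_t)x >> n) & 1;
}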

imul32

In floating-point multiplication, only the upper 25 bits of the mantissa product are necessary. To accommodate this, I have renamed the function from imul32 to imul24 in the following code. Instead of shifting the multiplicand to the left as in the usual shift-and-add multiplication, this version shifts the accumulated result to the right, so the upper bits of the product are consistently preserved.

static int32_t imul24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (; b; b >>= 1)
        r = (r >> 1) + (a & -getbit(b, 0));
    return r;
}

Since the input argument b is always 24 bits long, I can easily use the loop unrolling technique as below to reduce the execution cycles.

static int32_t imul24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (; b; b >>= 4){
        r = (r >> 1) + (a & -getbit(b, 0));
        r = (r >> 1) + (a & -getbit(b, 1));
        r = (r >> 1) + (a & -getbit(b, 2));
        r = (r >> 1) + (a & -getbit(b, 3));
    }
    return r;
}
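As a quick sanity check (my own test sketch, not part of the assignment), the result of imul24 on two 24-bit mantissas can be compared against the upper bits of a 64-bit product, since for such inputs it should equal the product truncated and shifted right by 23 bits:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch: for mantissas with bit 23 set, imul24(a, b) should equal
 * (a * b) >> 23 with truncation. Assumes imul24/getbit from above. */
static void check_imul24(void)
{
    for (int i = 0; i < 100000; i++) {
        uint32_t ra = ((uint32_t)rand() << 12) ^ (uint32_t)rand();
        uint32_t rb = ((uint32_t)rand() << 12) ^ (uint32_t)rand();
        int32_t a = (int32_t)((ra & 0x7FFFFF) | 0x800000);
        int32_t b = (int32_t)((rb & 0x7FFFFF) | 0x800000);
        int32_t expect = (int32_t)(((int64_t)a * b) >> 23);
        assert(imul24(a, b) == expect);
    }
}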

fdiv32

In the fdiv32 function, NaN, Inf and denormalized numbers are not allowed as input, just as in fmul32. I still provide basic floating-point division using the same techniques mentioned above.

float fdiv32(float a, float b)
{
    int32_t ia = *(int32_t *)&a, ib = *(int32_t *)&b;
    if (a == 0) return a;
    if (b == 0) return *(float*)&(int){0x7f800000};
    /* mantissa */
    int32_t ma = (ia & 0x7FFFFF) | 0x800000;
    int32_t mb = (ib & 0x7FFFFF) | 0x800000;

    /* sign and exponent */
    int32_t sea = ia & 0xFF800000;
    int32_t seb = ib & 0xFF800000;

    /* result of mantissa */
    int32_t m = idiv24(ma, mb);
    int32_t mshift = !getbit(m, 31);
    m <<= mshift;

    int32_t r = ((sea - seb + 0x3f800000) - (0x800000 & -mshift)) | (m & 0x7fffff00) >> 8;
    int32_t ovfl = (ia ^ ib ^ r) >> 31;
    r = r ^ ((r ^ 0x7f800000) & ovfl);
    
    return *(float *) &r;
    // return a / b;
}

idiv24

idiv24 handles 32-bit integer division. The dividend and the quotient are updated by shifting 1 bit left on each iteration. If the dividend is greater than or equal to the divisor, the LSB of the quotient is set to 1. This is basically the same as long division in binary.

static int32_t idiv24(int32_t a, int32_t b){
    uint32_t r = 0;
    for (int i = 0; i < 32; i++){
        r <<= 1;
        if (a - b < 0){
            a <<= 1;
            continue;
        }

        r |= 1;
        a -= b;
        a <<= 1;
    }

    return r;
}

In the above code, we can see that a always needs to be shifted left; the only difference between the two branches is whether b is subtracted from a. Turning this around, we can always subtract b first and then use -(a < 0) to decide whether b should be added back to a, which is known as restoring division.

static int32_t idiv24(int32_t a, int32_t b) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        a -= b;
        r = (r << 1) | (a >= 0);
        a = (a + (b & -(a < 0))) << 1;
    }

    return r;
}

Also, since the loop count is fixed, the loop unrolling technique fits very well here.

The loop-unrolled version can be produced by simply copy-pasting the loop body; a sketch is shown below.
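For completeness, this is roughly what the unrolled idiv24 could look like (my own sketch, unrolled by a factor of four to match the assembly version later):

static int32_t idiv24(int32_t a, int32_t b) {
    uint32_t r = 0;
    /* Sketch: 32 iterations, unrolled by 4; the loop body is simply repeated. */
    for (int i = 0; i < 32; i += 4) {
        a -= b; r = (r << 1) | (a >= 0); a = (a + (b & -(a < 0))) << 1;
        a -= b; r = (r << 1) | (a >= 0); a = (a + (b & -(a < 0))) << 1;
        a -= b; r = (r << 1) | (a >= 0); a = (a + (b & -(a < 0))) << 1;
        a -= b; r = (r << 1) | (a >= 0); a = (a + (b & -(a < 0))) << 1;
    }
    return r;
}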

fadd32

In floating-point addition, it is essential to align the smaller value with the greater one. In the straightforward approach, the comparison typically checks the exponent first and then the mantissa. However, by leveraging the IEEE 754 format, this comparison can be done in one step by comparing the bit patterns directly (with the sign bit masked off), as cmp_a and cmp_b show.

After aligning the values, we check whether the sign bits are equal. If the sign bits differ, the mantissas are subtracted instead of added.

float fadd32(float a, float b) {
    int32_t ia = *(int32_t *)&a, ib = *(int32_t *)&b;

    int32_t cmp_a = ia & 0x7fffffff;
    int32_t cmp_b = ib & 0x7fffffff;

    if (cmp_a < cmp_b)
        iswap(ia, ib);
    /* exponent */
    int32_t ea = (ia >> 23) & 0xff;
    int32_t eb = (ib >> 23) & 0xff;

    /* mantissa */
    int32_t ma = ia & 0x7fffff | 0x800000;
    int32_t mb = ib & 0x7fffff | 0x800000;

    int32_t align = (ea - eb > 24) ? 24 : (ea - eb);

    mb >>= align;
    if ((ia ^ ib) >> 31) {
        ma -= mb;
    } else {
        ma += mb;
    }

    int32_t clz = count_leading_zeros(ma);
    int32_t shift = 0;
    if (clz <= 8) {
        shift = 8 - clz;
        ma >>= shift;
        ea += shift;
    } else {
        shift = clz - 8;
        ma <<= shift;
        ea -= shift;
    }

    int32_t r = ia & 0x80000000 | ea << 23 | ma & 0x7fffff;
    return *(float *)&r;
}
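A quick way to spot-check fadd32 (my own sketch, assuming the helpers iswap and count_leading_zeros are linked in) is to compare it against the hardware addition for a few values; small differences in the last bits are expected because the mantissa is truncated:

#include <math.h>
#include <stdio.h>

/* Sketch: compare fadd32 against native addition. Exact cancellation
 * (a == -b) is avoided since special cases are not handled. */
static void check_fadd32(void)
{
    float tests[][2] = {{1.5f, 2.25f}, {3.75f, -1.5f}, {0.1f, 100.0f}};
    for (int i = 0; i < 3; i++) {
        float got = fadd32(tests[i][0], tests[i][1]);
        float ref = tests[i][0] + tests[i][1];
        printf("%g + %g = %.9g (native %.9g, diff %g)\n",
               tests[i][0], tests[i][1], got, ref, fabsf(got - ref));
    }
}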

f2i32 & i2f32

The functions i2f32 and f2i32 are used for converting values between floating-point and integer representations. The "2" stands for "to", so i2f means "int to float".

int f2i32(float x) {
    int32_t a = *(int32_t *)&x;
    int32_t ma = (a & 0x7FFFFF) | 0x800000;
    int32_t ea = ((a >> 23) & 0xFF) - 127;
    if (ea < 0)
        return 0;
    else if (ea <= 23)
        ma >>= (23 - ea);
    else
        ma <<= (ea - 23);

    return ma; /* note: the sign bit is ignored; only non-negative values are needed here */
}

float i2f32(int x) {
    if (x == 0) return 0;

    int32_t s = x & 0x80000000;
    if (s) x = -x;

    int32_t clz = count_leading_zeros(x);
    int32_t e = 31 - clz + 127;

    if (clz <= 8) {
        x >>= 8 - clz;
    } else {
        x <<= clz - 8;
    }

    int32_t r = s | e << 23 | (x & 0x7fffff);
    return *(float *)&r;
}
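A short usage sketch (my own) showing the round trip for a few non-negative values; negative inputs are not handled by f2i32 as written, which is fine for how it is used in mySin:

#include <stdio.h>

/* Sketch: round-trip small integers through i2f32/f2i32. */
int main(void)
{
    for (int i = 0; i <= 5; i++) {
        float f = i2f32(i);
        printf("i2f32(%d) = %f, f2i32 back = %d\n", i, f, f2i32(f));
    }
    return 0;
}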

Sin

The sine value can be obtained using the Taylor series. The following code is the corresponding implementation of the Taylor series for the sine calculation.

float myPow(float x, int n) {
    float r = 1.0;
    while (n) {
        if (n & 0x1) {
            r = fmul32(r, x);
            n -= 1;
        } else {
            x = fmul32(x, x);
            n >>= 1;
        }
    }
    return r;
}

// n!
float factorial(int n) {
    float r = 1.0;
    for (int i = 1; i <= n; i++) {
        r = fmul32(r, i2f32(i));
    }
    return r;
}

// Sine by Taylor series
float mySin(float x) {
    float r = 0.0;
    for (int n = 0; n < 5; n++) {
        int k = f2i32(fadd32(fmul32(i2f32(2), i2f32(n)), i2f32(1)));
        int s = 1 ^ ((-2) & -(n & 0x1));
        r = fadd32(r, fdiv32(fmul32(i2f32(s), myPow(x, k)), factorial(k)));
    }
    return r;
}
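A usage sketch (my own assumption of how the test values at the end of this note might be produced): convert degrees to radians with the helpers above and evaluate mySin:

#include <stdio.h>

/* Sketch: evaluate mySin for the degrees used in the test table.
 * radians = degrees * pi / 180, computed with fmul32/fdiv32. */
int main(void)
{
    const float degrees[] = {45.0f, 20.0f, 90.0f};
    const float pi = 3.14159265f;

    for (int i = 0; i < 3; i++) {
        float rad = fdiv32(fmul32(degrees[i], pi), 180.0f);
        printf("mySin(%.1f deg) = %.10f\n", degrees[i], mySin(rad));
    }
    return 0;
}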

Bare C code performance

with -O0 option

with -O2 option

with -O3 option

ABI code

fmul32 & imul24

By directly incorporating the logic of imul24 within fmul32, we eliminate the need for an additional function call, which can improve performance and reduce the store/load operation overhead. This technique is often employed in optimization to enhance the efficiency of code execution.

fmul32:
    # prologue
    addi sp, sp, -4
    sw ra, 0(sp)
    
    seqz t0, a0
    seqz t1, a1
    or t0, t0, t1
    beqz t0, 2f
    li a0, 0
    j 3f

2:  li t2, 0x7FFFFF
    and t0, a0, t2
    and t1, a1, t2
    addi t2, t2, 1
    or t0, t0, t2 # t0 = ma
    or t1, t1, t2 # t1 = mb
imul24:
    li t3, 0 # t3 = m, r(in imul24)
1:  
    andi t4, t1, 0x1
    neg t4, t4
    and t4, t0, t4
    srli t3, t3, 1
    add t3, t3, t4
    srli t1, t1, 1
    bnez t1, 1b

    mv t0, a0 # t0 = a
    mv t1, a1 # t1 = b
    mv a0, t3
    li a1, 24
    jal getbit # a0 = mshift

# m(t3) value computed
    srl t3, t3, a0

    li t2, 0xFF800000
    and t0, t0, t2 # t0 = sea
    and t1, t1, t2 # t1 = seb

    li t2, 0x3f800000
    sub t4, t0, t2
    add t4, t4, t1
    li t2, 0xFF800000
    and t4, t4, t2 # t4 = ((sea - 0x3f800000 + seb) & 0xFF800000)
    
    li t2, 0x7fffff
    slli a0, a0, 23
    or a0, a0, t2
    and a0, a0, t3
    add a0, a0, t4 # a0 = r(in fmul32)

    # check overflow
    xor t3, t0, t1
    xor t3, t3, a0
    srai t3, t3, 31 # t3 = ovfl (all ones if overflow)

    li t2, 0x7f800000
    xor t4, t2, a0
    and t4, t4, t3
    xor a0, a0, t4

3:  # epilogue
    lw ra, 0(sp)
    addi sp, sp, 4
    ret

fdiv32 & idiv24

The same idea applies to fdiv32 and idiv24: we embed idiv24 directly inside fdiv32 to avoid the additional call and load/store overhead.

fdiv32:
    # prologue
    addi sp, sp, -4
    sw ra, 0(sp)

    beqz a0, 3f
    bnez a1, 1f
    li a0, 0x7f800000
    j 3f
1:
    li t2, 0x7FFFFF
    and t0, t2, a0
    and t1, t2, a1
    addi t2, t2, 1
    or t0, t0, t2 # t0 = ma
    or t1, t1, t2 # t1 = mb
idiv24:
    li t3, 0 # t3 = m, r(in idiv24)
    li t4, 32 # t4 = end condition
2:
    sub t0, t0, t1
    sltz t2, t0
    seqz t2, t2
    slli t3, t3, 1
    or t3, t3, t2
    seqz t2, t2
    neg t2, t2
    and t5, t2, t1 # t5 = b & -(a < 0)
    add t0, t0, t5
    slli t0, t0, 1
    
    addi t4, t4, -1
    bnez t4, 2b 

    li t2, 0xFF800000
    and t0, a0, t2 # t0 = sea
    and t1, a1, t2 # t1 = seb
    mv a0, t3
    li a1, 31
    jal getbit
    seqz a0, a0 # a0 = mshift
    sll t3, t3, a0 # t3 = m

    li t2, 0x3f800000
    sub t4, t0, t1
    add t4, t4, t2 # t4 = sea - seb + 0x3f800000
    neg a0, a0
    li t2, 0x800000
    and a0, a0, t2 # a0 = 0x800000 & -mshift
    srli t3, t3, 8
    addi t2, t2, -1
    and t3, t3, t2 # t3 = m
    sub a0, t4, a0
    or a0, a0, t3

    # check overflow
    xor t3, t1, t0
    xor t3, t3, a0
    srai t3, t3, 31

    li t2, 0x7f800000
    xor t4, t2, a0
    and t4, t4, t3
    xor a0, a0, t4
3:
    # epilogue
    lw ra, 0(sp)
    addi sp, sp, 4
    ret

fadd32

fadd32:
    # prologue
    addi sp, sp, -4
    sw ra, 0(sp)
    
    li t2, 0x7fffffff
    and t0, a0, t2 # t0 = cmp_a
    and t1, a1, t2 # t1 = cmp_b
    bge t0, t1, 1f
    mv t2, a0 # swap a0 = ia, a1 = ib
    mv a0, a1
    mv a1, t2
1:  
    srli t0, a0, 23
    srli t1, a1, 23
    andi t0, t0, 0xff # t0 = ea
    andi t1, t1, 0xff # t1 = eb

    li t2, 0x7fffff
    and t3, a0, t2
    and t4, a1, t2
    addi t2, t2, 1
    or t3, t3, t2 # t3 = ma
    or t4, t4, t2 # t4 = mb
    
    sub t2, t0, t1 # t2 = align
    li t5, 24
    bge t5, t2, 2f
    li t2, 24
2:
    srl t4, t4, t2 # mb >>= align
    xor t2, a0, a1
    srli t2, t2, 31
    beqz t2, 3f
    neg t4, t4
3:
    add t3, t3, t4 # t3 = result of ma
    # t1(eb) and t4(mb) are free to use
    mv t4, a0 # t4 = ia
    mv a0, t3
    jal count_leading_zeros # a0 = clz
    li t2, 8
    blt t2, a0, 4f
    sub t5, t2, a0
    srl t3, t3, t5
    add t0, t0, t5
    j 5f
4: 
    sub t5, a0, t2 # t5 = shift
    sll t3, t3, t5
    sub t0, t0, t5
5:
    li t2, 0x80000000
    and a0, t2, t4
    slli t0, t0, 23
    or a0, a0, t0
    li t2, 0x7fffff
    and t3, t3, t2
    or a0, a0, t3
    
    # epilogue
    lw ra, 0(sp)
    addi sp, sp, 4
    ret

Performance

Upon analysis, it is clear that both the CPI and the number of executed instructions (Instrs. Retired) fall somewhere between the values achieved with the -O0 and -O2 options of the GCC compiler. This indicates there is ample opportunity for improvement.

Assembly optimization

Loop unrolling

In the selected Ripes CPU configuration, there is no branch prediction mechanism, and the computation of branch addresses takes place in the 3rd stage. Consequently, a branch miss incurs a significant penalty of 2 cycles, which is notably high in this case.

Based on the given scenario, loop unrolling appears to be an ideal optimization technique in this case. I have applied the unrolling method to the imul24 and idiv24 functions, both of which are extensively used in this homework assignment.

imul24:
    li t3, 0 # t3 = m, r(in imul24)
1:  
    andi t4, t1, 0x1
    neg t4, t4
    and t4, t0, t4
    srli t3, t3, 1
    add t3, t3, t4
    srli t1, t1, 1
-    bnez t1, 1b
    
+    andi t4, t1, 0x1 # unroll 2
+    neg t4, t4
+    and t4, t0, t4
+    srli t3, t3, 1
+    add t3, t3, t4
+    srli t1, t1, 1
+    andi t4, t1, 0x1 # unroll 3
+    neg t4, t4
+    and t4, t0, t4
+    srli t3, t3, 1
+    add t3, t3, t4
+    srli t1, t1, 1
+    andi t4, t1, 0x1 # unroll 4
+    neg t4, t4
+    and t4, t0, t4
+    srli t3, t3, 1
+    add t3, t3, t4
+    srli t1, t1, 1
+    bnez t1, 1b
idiv24:
    li t3, 0 # t3 = m, r(in idiv24)
    li t4, 32 # t4 = end condition
2:
    sub t0, t0, t1
    sltz t2, t0
    seqz t2, t2
    slli t3, t3, 1
    or t3, t3, t2
    seqz t2, t2
    neg t2, t2
    and t5, t2, t1 # t5 = b & -(a < 0)
    add t0, t0, t5
    slli t0, t0, 1

+    sub t0, t0, t1
+    sltz t2, t0
+    seqz t2, t2
+    slli t3, t3, 1
+    or t3, t3, t2
+    seqz t2, t2
+    neg t2, t2
+    and t5, t2, t1 # t5 = b & -(a < 0)
+    add t0, t0, t5
+    slli t0, t0, 1
+    sub t0, t0, t1
+    sltz t2, t0
+    seqz t2, t2
+    slli t3, t3, 1
+    or t3, t3, t2
+    seqz t2, t2
+    neg t2, t2
+    and t5, t2, t1 # t5 = b & -(a < 0)
+    add t0, t0, t5
+    slli t0, t0, 1
+    sub t0, t0, t1
+    sltz t2, t0
+    seqz t2, t2
+    slli t3, t3, 1
+    or t3, t3, t2
+    seqz t2, t2
+    neg t2, t2
+    and t5, t2, t1 # t5 = b & -(a < 0)
+    add t0, t0, t5
+    slli t0, t0, 1
    
-    addi t4, t4, -1
+    addi t4, t4, -4
    bnez t4, 2b

Performance of Loop unrolling

Upon analysis, it is apparent that the CPI is lower than at the -O3 optimization level, yet the number of executed instructions (Instrs. Retired) is higher; we still execute about 2000 more instructions in this implementation. This implies that there is still room for optimizing our assembly code: there is likely some redundant code that can be eliminated for better efficiency.

Still thinking

Test Case

The values exhibit slight differences when compared to sinf in the C math library. The primary deviation arises from the rounding strategy employed: our implementation always rounds down (truncates the mantissa), whereas IEEE 754 defaults to "round to nearest, ties to even".

However, these discrepancies are acceptable, since single-precision floating point only guarantees accuracy up to about 6 to 7 significant decimal digits.

degree   mySin          sinf (math.h)
45.0     0.7071068287   0.7071067691
20.0     0.3420201540   0.3420201242
90.0     1.0000035763   1.0000000000
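For reference, here is a sketch (my own, not part of the assignment) of how round-to-nearest-even could be applied when discarding the extra low-order bits of a mantissa; the variable names are illustrative only, and the carry out of the top bit that may require renormalizing the exponent is omitted:

#include <stdint.h>

/* Sketch: drop `extra` low-order bits of m with round-to-nearest-even. */
static uint32_t round_nearest_even(uint64_t m, int extra)
{
    uint64_t halfway = 1ull << (extra - 1);
    uint64_t low = m & ((1ull << extra) - 1);  /* bits being discarded */
    uint64_t result = m >> extra;

    if (low > halfway || (low == halfway && (result & 1)))
        result += 1;  /* round up, or up to even on a tie */
    /* note: a carry out of bit 23 would need exponent renormalization */
    return (uint32_t)result;
}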