
Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by < kc71486 >

tags: RISC-V, jserv

Matrix multiplication with floating point

Forked from problem C. All code in this section comes from quizcv2.c.

In order to perform matrix multiplication correctly, properly handling floating point addition and multiplication is important.

Below is the implementation of floating point multiplication.

float fmul32(float a, float b) {
    int32_t ia = *(int32_t *) &a;
    int32_t ib = *(int32_t *) &b;
    /* define sign */
    int32_t sa = ia >> 31;
    int32_t sb = ib >> 31;
    int32_t sr;
    /* define mantissa */
    int32_t ma = (ia & 0x7FFFFF) | 0x800000;
    int32_t mb = (ib & 0x7FFFFF) | 0x800000;
    int32_t mr;
    /* define exponent */
    int32_t ea = ((ia >> 23) & 0xFF);
    int32_t eb = ((ib >> 23) & 0xFF);
    int32_t er;
    /* define result */
    int32_t result;
    /* special values */
    if(ea == 0xFF && ma != 0x800000) {
        int32_t f_nan = 0x7FF80001;
        return *(float *) &f_nan;
    }
    if(eb == 0xFF && mb != 0x800000) {
        int32_t f_nan = 0x7FF80001;
        return *(float *) &f_nan;
    }
    if(ea == 0xFF && ma == 0x800000) {
        if(eb == 0) {
            int32_t f_nan = 0x7F800001;
            return *(float *) &f_nan;
        } else {
            int32_t f_inf = 0x7F800000 | (sa ^ sb) << 31;
            return *(float *) &f_inf;
        }
    }
    if(eb == 0xFF && mb == 0x800000) {
        if(ea == 0) {
            int32_t f_nan = 0x7F800001;
            return *(float *) &f_nan;
        } else {
            int32_t f_inf = 0x7F800000 | (sa ^ sb) << 31;
            return *(float *) &f_inf;
        }
    }
    if(ea == 0 || eb == 0) {
        int32_t f_zero = 0 | (sa ^ sb) << 31;
        return *(float *) &f_zero;
    }
    /* multiplication */
    sr = sa ^ sb;
    int64_t mrtmp = imul32(ma, mb) >> 23;
    int32_t ertmp = ea + eb - 127;
    /* realign mantissa */
    int32_t mshift = getbit(mrtmp, 24);
    mr = mrtmp >> mshift;
    er = mshift ? inc(ertmp) : ertmp;
    /* overflow and underflow */
    if(er < 0) {
        int32_t f_zero = 0 | (sa ^ sb) << 31;
        return *(float *) &f_zero;
    }
    if(er >= 0xFF) {
        int32_t f_inf = 0x7F800000 | (sa ^ sb) << 31;
        return *(float *) &f_inf;
    }
    /* result */
    result = (sr << 31) | ((er & 0xFF) << 23) | (mr & 0x7FFFFF);
    return *(float *) &result;
}

And below is the implementation of floating point addition.

float fadd32(float a, float b) {
    int32_t ia = *(int32_t *) &a;
    int32_t ib = *(int32_t *) &b;
    /* define sign */
    int32_t sa = ia >> 31;
    int32_t sb = ib >> 31;
    int32_t sr;
    /* define mantissa */
    int32_t ma = (ia & 0x7FFFFF) | 0x800000;
    int32_t mb = (ib & 0x7FFFFF) | 0x800000;
    int32_t mr;
    /* define exponent */
    int32_t ea = ((ia >> 23) & 0xFF);
    int32_t eb = ((ib >> 23) & 0xFF);
    int32_t er;
    /* define result */
    int32_t result;
    /* special values */
    if(ea == 0xFF && ma != 0x800000) {
        int32_t f_nan = 0x7FF80001;
        return *(float *) &f_nan;
    }
    if(eb == 0xFF && mb != 0x800000) {
        int32_t f_nan = 0x7FF80001;
        return *(float *) &f_nan;
    }
    if(ea == 0xFF && ma == 0x800000) {
        if(eb == 0xFF && mb == 0x800000 && sb != sa) {
            int32_t f_nan = 0x7F800001;
            return *(float *) &f_nan;
        } else {
            int32_t f_inf = 0x7F800000 | sa << 31;
            return *(float *) &f_inf;
        }
    }
    if(eb == 0xFF && mb == 0x800000) {
        if(ea == 0xFF && ma == 0x800000 && sa != sb) {
            int32_t f_nan = 0x7F800001;
            return *(float *) &f_nan;
        } else {
            int32_t f_inf = 0x7F800000 | sb << 31;
            return *(float *) &f_inf;
        }
    }
    /* exponent align */
    if(ea >= eb) {
        if(ea - eb <= 24) {
            mb = mb >> (ea - eb);
        } else {
            mb = 0;
        }
        er = ea;
    } else {
        if(eb - ea <= 24) {
            ma = ma >> (eb - ea);
        } else {
            ma = 0;
        }
        er = eb;
    }
    /* addition or subtraction */
    sr = sa;
    int32_t madd;
    if((sa ^ sb) == 0) {
        madd = ma + mb;
    } else {
        madd = ma - mb;
        if((madd >> 31) != 0) {
            sr ^= 1;
            madd = ~madd + 1;
        }
    }
    /* realign mantissa */
    int32_t digits = get_highest_digit(madd);
    if(digits == 25) {
        mr = (madd + 1) >> 1;
    } else {
        mr = madd << (24 - digits);
    }
    er = er - (24 - digits);
    /* overflow and underflow */
    if(er < 0) {
        int32_t f_zero = 0 | (sa ^ sb) << 31;
        return *(float *) &f_zero;
    }
    if(er >= 0xFF) {
        int32_t f_inf = 0x7F800000 | (sa ^ sb) << 31;
        return *(float *) &f_inf;
    }
    /* result */
    result = (sr << 31) | ((er & 0xFF) << 23) | (mr & 0x7FFFFF);
    return *(float *) &result;
}
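A quick way to sanity-check both routines is to compare them against the hardware float operators. The sketch below is only illustrative; it assumes fadd32 and fmul32 above are compiled into the same program.

#include <stdio.h>

float fadd32(float a, float b);
float fmul32(float a, float b);

int main(void) {
    float a = 1.5f, b = -2.25f;
    /* each pair should print identical values */
    printf("add: %f %f\n", fadd32(a, b), a + b);
    printf("mul: %f %f\n", fmul32(a, b), a * b);
    return 0;
}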

C Implementation

You can find the source code here. Feel free to fork and modify it.

Implement int64 support in several functions

All code in this section comes from quizcv4.c.

Let's first look at the exponent increment:

er = mshift ? inc(ertmp) : ertmp;

The sole purpose of inc in this program is to add 1, which can easily be substituted with:

er = ertmp + mshift;

The functions inc and mask_lowest_zero are now unused, so we can safely delete them.

Now let's look at the mantissa multiplication:

int64_t mrtmp = imul32(ma, mb) >> 23;

My initial thought was to split an int64_t into two int32_t words, which looks like:

jint64_t imul32(int32_t a, int32_t b) {
    uint32_t ua = a;
    int32_t ah; // addend higher word
    int32_t al; // addend lower word
    int32_t carry;
    int32_t tmp;
    int32_t rh = 0; //result higher word
    int32_t rl = 0; //result lower word
    for(int32_t i = 0; i < 32; i++) {
        if((b >> i & 1) == 1) {
            al = ua << i;
            ah = ua >> (31 - i) >> 1; // split shift avoids an undefined shift by 32 when i == 0
            tmp = rl;
            rl = rl + al;
            carry = ((tmp ^ al) & ~rl | (tmp & al)) >> 31 & 1; // carry out of bit 31
            rh = rh + ah + carry;
        }
    }
    jint64_t r64 = {.h = rh, .l = rl};
    return r64;
}

The worst part of this code is the overflow detection, which requires a lot of cycles. To improve on that, we can exploit some special properties of mantissa multiplication.

Since we only need the highest 24 bits, we can rewrite imul32: change the output type to int32_t and fold the trailing 23-bit right shift into the same function.

int32_t mmul(int32_t a, int32_t b) {
    int32_t r = 0;
    a = a << 1; /* to counter last right shift */
    while(b != 0) {
        if((b & 1) != 0) {
            r = r + a;
        }
        b = b >> 1;
        r = r >> 1;
    }
    return r;
}
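With this change, the mantissa step in fmul32 no longer needs a 64-bit temporary. A sketch of the updated call site (keeping the variable name mrtmp is my assumption, not confirmed from quizcv4.c):

int32_t mrtmp = mmul(ma, mb); /* equivalent to imul32(ma, mb) >> 23 */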

Implement a dedicated struct for matrices

All code in this section comes from quizcv4.c.

In standard C, if we want a 2-D array, we can either fix the column count at compile time (like float a[][3]) or use a pointer to pointers (like float **a). The second approach is much more versatile, but it requires an extra level of indirection and assumptions about row length, which is not ideal.
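For contrast, the two standard options look roughly like this (the signatures are illustrative only, not from quizcv4.c):

void matmul_fixed(float a[][3], float b[][3], float c[][3]); /* column count fixed at compile time */
void matmul_ptr(float **a, float **b, float **c, int32_t n); /* extra pointer dereference per element */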

typedef struct {
    int32_t row;
    int32_t col;
    float *data;
} matf32_t;

So we use a 1-D array instead of a 2-D one, and store the dimension information alongside the data.
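Under this layout, element (i, j) of a matrix lives at data[i * col + j]. A minimal accessor sketch (mat_get is hypothetical, not part of quizcv4.c):

static inline float mat_get(const matf32_t *m, int32_t i, int32_t j) {
    /* row-major: skip i full rows, then move j columns in */
    return m->data[i * m->col + j];
}

The matmul below avoids even this multiplication by keeping running indices.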

matf32_t *matmul(matf32_t *first, matf32_t *second) {
    /* (m * n) * (n * o) -> (m * o) */
    int32_t m = first->row;
    int32_t n = first->col;
    int32_t o = second->col;
    if(n != second->row) {
        return NULL;
    }
    matf32_t *ret = &retmat; /* replace malloc struct */
    ret->row = m;
    ret->col = o;
    ret->data = retdata; /* replace malloc array */
    float *a = first->data;
    float *b = second->data;
    float *c = ret->data;
    float subtotal;
    int32_t arow = 0;
    int32_t aidx;
    int32_t bidx;
    int32_t cidx = 0;
    for(int32_t i = 0; i < m; i++) {
        for(int32_t j = 0; j < o; j++) {
            subtotal = 0;
            aidx = arow;
            bidx = j;
            for(int32_t k = 0; k < n; k++) {
                subtotal = fadd32(subtotal, fmul32(a[aidx], b[bidx]));
                aidx += 1;
                bidx += o;
            }
            c[cidx] = subtotal;
            cidx += 1;
        }
        arow += n;
    }
    return ret;
}

Without the double indirection, indexing becomes more involved. A naive approach would compute each index with a multiplication, but software multiplication is even slower, so we instead keep running indices and increment or step them as the loops advance.

Double indirection also hurts cache locality, which would slow the program down once cache miss latency is taken into account.

Assembly Implementation

The first version is quite unoptimized: it has 2044 bytes of instructions, 13890 cycles, a 0.988 data cache hit rate, and a 0.8421 instruction cache hit rate (direct-mapped).

cpu analysis

[image: CPU statistics of the first version]


In this program, we multiply a 3*3 matrix by a 3*3 matrix: each of the 9 output elements takes 3 multiply-accumulate steps, so it requires 27 fadd32 and 27 fmul32 calls, along with 27 mmul calls. Considering how many cycles a software integer multiplication takes, it would be a challenge to drastically decrease the instruction count.

cache analysis

[images: cache statistics, direct-mapped]


42 cache misses in the direct-mapped cache is surprisingly good compared to the total instructions retired.
If we had used the conventional approach and allocated on the heap, data fragmentation would probably cause a lot more cache misses.

[images: cache statistics, 2-way set associative]


From my perspective, it is odd that the 2-way set associative cache performs worse than the direct-mapped one.

possible improvements

The performance loss is probably caused mainly by excessive memory access, such as:

# addition or subtraction
lw   t0, 52(sp)
lw   t1, 56(sp)
sw   t0, 60(sp)          # sr = sa
xor  t0, t0, t1
bne  t0, x0, addaxn      # skip if (sa ^ sb) != 0
lw   t0, 68(sp)
lw   t1, 72(sp)
add  t0, t0, t1
sw   t0, 92(sp)          # madd = ma + mb
j    addaxe
addaxn:
lw   t0, 68(sp)
lw   t1, 72(sp)
sub  t0, t0, t1

Most of these loads and stores can be replaced by saved registers, eliminating a large share of the memory accesses.
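For example, the same segment with operands kept in saved registers shrinks to a sketch like this (the register assignment is illustrative, following the C-equivalent comments further below):

# sketch: operands live in saved registers (assumed: s3 = sa, s4 = sb, s6 = ma, s7 = mb)
xor  t0, s3, s4          # sa ^ sb
bne  t0, x0, addaxn      # skip if signs differ
add  t2, s6, s7          # madd = ma + mb
j    addaxe
addaxn:
sub  t2, s6, s7          # madd = ma - mb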

Because the mantissa multiplication always loops 24 times, we can unroll it at the expense of code size and instruction cache locality (see the mmul improvement below).

Also, there are some bugs in the special-value handling of both fadd32 and fmul32, although this is not a big issue for matrix multiplication, since infinity and NaN are equally invalid inside a matrix.

Our previous design didn't handle matrices of flexible size, so we also implement a custom low-complexity dynamic allocator.

Finally, there are two competing matrix traversal methods (the numbers in each matrix represent the time step at which that location is first used). The first is:

$$\begin{bmatrix}1&2&3\\4&5&6\\7&8&9\end{bmatrix}\begin{bmatrix}1&4&7\\2&5&8\\3&6&9\end{bmatrix}=\begin{bmatrix}1&4&7\\10&13&16\\19&22&25\end{bmatrix}$$

and the second:

$$\begin{bmatrix}1&10&19\\4&13&22\\7&16&25\end{bmatrix}\begin{bmatrix}1&2&3\\10&11&12\\19&20&21\end{bmatrix}=\begin{bmatrix}1&2&3\\4&5&6\\7&8&9\end{bmatrix}$$


With some basic analysis, we find that the second method requires fewer cycles.

Assembly improvement

To address the above issues, we update our assembly to the next version (shown with its C code equivalent).

fmul32 and fadd32 improvement

We move each variable from memory into saved registers, and merge variables with non-overlapping lifetimes.

/* fmul32 */
/* define const */
register int s0 = 0x7fffff;
register int s1 = 0x800000;
register int s2 = 0xff;
/* define sign */
register int32_t sr = (ia ^ ib) >> 31 << 31; /*s3*/
/* define mantissa */
register int32_t mar = (ia & s0) | s1; /*t4, ma and mr*/
register int32_t mb = (ib & s0) | s1; /*t5*/
/* define exponent */
register int32_t ear = ((ia >> 23) & 0xff); /*s4, ea and er*/
register int32_t eb = ((ib >> 23) & 0xff); /*s5*/

/* fadd32 */
/* define const */
register int s0 = 0x7fffff;
register int s1 = 0x800000;
register int s2 = 0xff;
/* define sign */
register int32_t sa = ia >> 31 << 31; /*s3*/
register int32_t sb = ib >> 31 << 31; /*s4*/
register int32_t sr = sa; /*s5*/
/* define mantissa */
register int32_t mar = (ia & 0x7fffff) | 0x800000; /*s6, ma and mr*/
register int32_t mb = (ib & 0x7fffff) | 0x800000; /*s7*/
/* define exponent */
register int32_t ear = ((ia >> 23) & 0xff); /*s8, ea and er*/
register int32_t eb = ((ib >> 23) & 0xff); /*s9*/

mmul improvement

Because the mantissa always has 24 digits, we can unroll the loop (four bits per iteration, six iterations total) to greatly decrease branch misses, at the cost of code size and instruction cache hit rate.

register int32_t r = 0;
a = a << 1; /* to counter last right shift */
do {
    if((b & 1) != 0) {
        r = r + a;
    }
    r = r >> 1;
    if((b & 2) != 0) {
        r = r + a;
    }
    r = r >> 1;
    if((b & 4) != 0) {
        r = r + a;
    }
    r = r >> 1;
    if((b & 8) != 0) {
        r = r + a;
    }
    b = b >> 4;
    r = r >> 1;
} while(b != 0);
return r;

dynamic allocation implementation

The complexity of this implementation is $O(n^2)$, with zero-initializing the matrix being the slowest part. We treat the heap as a stack and bump the top pointer upward when needed (all heap variables have unlimited lifetime in this implementation).

static inline matf32_t *new_mat() {
    register char *ptr = (char *) heap_top; /* for exact pointer calculation */
    heap_top = ptr + 12;
    return (matf32_t *) ptr;
}
static inline int32_t *new_arr(int32_t size) {
    register char *ptr = (char *) heap_top; /* for exact pointer calculation */
    size = size << 2;
    heap_top = ptr + size;
    register int32_t *whptr = (int32_t *) ptr;
    do {
        *whptr = 0;
        whptr++; /* advances 4 bytes */
    } while(whptr < (int32_t *) heap_top);
    return (int32_t *) ptr;
}
/* integer multiply; callers never pass 0 */
static inline int32_t imul32(register int32_t a, register int32_t b) {
    register int32_t r = 0;
    do {
        if((b & 1) != 0) {
            r = r + a;
        }
        a = a << 1;
        b = b >> 1;
    } while(b != 0);
    return r;
}
static matf32_t *matmul(matf32_t *first, matf32_t *second) {
    /*some code*/
    matf32_t *ret = new_mat();/*temp #t3 = (sp+56)*/
    /*some code*/
    ret->data = (float *) new_arr(imul32(m, o)); /*temp #t2*/
    /*some code*/
}
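A hypothetical allocation sequence (assuming heap_top initially points at free heap space) then looks like:

matf32_t *a = new_mat();                    /* 12 bytes for the header */
a->row = 3;
a->col = 3;
a->data = (float *) new_arr(imul32(3, 3)); /* 9 zero-initialized words */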

other possible improvements (considered, but not applied)

If we never overflow or underflow, we can simply remove the overflow and underflow sections, and even free some registers.

If there is no infinity or NaN in the matrix, the special-case handling can be removed.

If the input matrices are always valid, those if statements can also be removed.

highestbit and mmul can be inlined into their calling functions, but that loses some readability, and the call overhead is already quite small.

cpu analysis

[image: CPU statistics of the improved version]


The total cycle count is 39% lower, which is quite considerable. Note that since my gcc with -O1 or above cannot translate the assembly correctly (the function is inlined and the argument is completely ignored), the printed result will be wrong (the result in the heap is still correct, though).

Unsurprisingly, gcc -O1 outperforms my handwritten assembly.

cache analysis

[images: cache statistics, direct-mapped]


Data cache misses and writebacks are much lower. Instruction cache misses are also lower, probably due to the smaller code size.

[images: cache statistics, 2-way set associative]


The 2-way set associative cache makes data cache misses and writebacks drastically lower, because matrices a and c map to overlapping cache regions and the second way resolves those conflicts.
