# Matrix multiplication with floating point addition and multiplication

contributed by [Kuanch](https://github.com/Kuanch/ComputerArchitecture2023Fall_NCKU/tree/main/hw2).

This is inspired by the brilliant work of [kc71486](https://github.com/kc71486/Computer-Architecture/tree/main/hw1), together with the [Hackmd document](https://github.com/kc71486/Computer-Architecture/tree/main/hw1).

## Motivation
Matrix multiplication is essential in modern applications, and kc71486's work is a clean, solid implementation that achieves it with rv32i only. Optimizing the code is not the main purpose of this study; I simply try my best.

## C Code
Version 3 of kc71486's work already compiles and runs seamlessly with `-march=rv32i -mabi=ilp32` on rv32emu; I only added testing code, see [here](https://github.com/Kuanch/ComputerArchitecture2023Fall_NCKU/blob/main/hw2/fmul32.c).

## Disassembly instructions
After disassembling the ELF file with `riscv-none-elf-objdump -d`, we get the output below.

Since the original code is lengthy and hard to read, let's **focus on the main part of this task, the floating-point multiplication**. I reviewed all the assembly and annotated it with comments that refer back to the C code.

Consider the following C code:
```c
int32_t fmul32(int32_t ia, int32_t ib) {
    /* define sign */
    int32_t sr = (ia ^ ib) >> 31;
    /* define mantissa */
    int32_t ma = (ia & 0x7fffff) | 0x800000;
    int32_t mb = (ib & 0x7fffff) | 0x800000;
    int32_t mr;
    /* define exponent */
    int32_t ea = ((ia >> 23) & 0xff);
    int32_t eb = ((ib >> 23) & 0xff);
    int32_t er;
    /* special values */
    if(ea == 0xff) {
        if(ma == 0x800000 && eb != 0) {
            return 0x7f800000 | sr << 31;
        }
        return 0x7f800001;
    }
    if(eb == 0xff) {
        if(mb == 0x800000 && ea != 0) {
            return 0x7f800000 | sr << 31;
        }
        return 0x7f800001;
    }
    if(ea == 0) {
        return sr << 31;
    }
    if(eb == 0) {
        return sr << 31;
    }
    /* multiplication */
    register int32_t mrtmp = mmul(ma, mb);
    register int32_t ertmp = ea + eb - 127;
    /* realign mantissa */
    register int32_t mshift = (mrtmp >> 24) & 1;
    mr = mrtmp >> mshift;
    er = ertmp + mshift;
    /* overflow and underflow */
    if(er <= 0) {
        return sr << 31;
    }
    if(er >= 0xff) {
        return 0x7f800000 | sr << 31;
    }
    /* result */
    return (sr << 31) | ((er & 0xff) << 23) | (mr & 0x7fffff);
}
```

The following blocks are dumped by `riscv-none-elf-objdump -d` at different optimization levels; I added some comments for easier understanding. The complete listings are shown here, and only selected differences between the flags are discussed in the sections below.

### `-O0`
* **Statistics**
    * LOC (Line Of Code): `433`
    * Allocate `144` bytes on stack
    * Branching and jump used: `101`
        * 90 Branching
        * 11 Jump
    * Number of registers used: `7`
        * `ax` registers: `a0` `a1` `a2` `a3` `a4` `a5`
        * `sx` registers: `s0`
        * `tx` registers: None
    * Number of `lw` and `sw`: `91`
        * 61 `lw`
        * 30 `sw`
* **Observations**
    1. Lengthy code.
    2. A lot of branching, since the C is translated directly.
    3. Only a few registers are used.
    4. A lot of time is spent storing and loading words on the stack.

:::warning
:warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text.
:notes: jserv
:::

:::success
Guan: I corrected this and put the total cycle and size information at [the end of this section](#Cycle-Count-from-perfcounter).
::: ```c 000103d0 <fmul32>: 103d0: f7010113 add sp,sp,-144 103d4: 08112623 sw ra,140(sp) 103d8: 08812423 sw s0,136(sp) 103dc: 09010413 add s0,sp,144 103e0: f6a42e23 sw a0,-132(s0) # a0 = a 103e4: f6b42c23 sw a1,-136(s0) # a1 = b 103e8: f7c40793 add a5,s0,-132 103ec: 0007a783 lw a5,0(a5) 103f0: fef42623 sw a5,-20(s0) 103f4: f7840793 add a5,s0,-136 103f8: 0007a783 lw a5,0(a5) 103fc: fef42423 sw a5,-24(s0) 10400: fec42783 lw a5,-20(s0) /* define sign */ 10404: 41f7d793 sra a5,a5,0x1f # ia >> 31; 10408: fef42223 sw a5,-28(s0) 1040c: fe842783 lw a5,-24(s0) 10410: 41f7d793 sra a5,a5,0x1f # ib >> 31; 10414: fef42023 sw a5,-32(s0) 10418: fec42703 lw a4,-20(s0) /* define mantissa */ 1041c: 008007b7 lui a5,0x800 10420: fff78793 add a5,a5,-1 # 7fffff <__BSS_END__+0x7db0ef> 10424: 00f77733 and a4,a4,a5 # (ia & 0x7FFFFF) 10428: 008007b7 lui a5,0x800 1042c: 00f767b3 or a5,a4,a5 # | 0x800000; 10430: fcf42e23 sw a5,-36(s0) 10434: fe842703 lw a4,-24(s0) 10438: 008007b7 lui a5,0x800 1043c: fff78793 add a5,a5,-1 # 7fffff <__BSS_END__+0x7db0ef> 10440: 00f77733 and a4,a4,a5 # (ib & 0x7FFFFF) 10444: 008007b7 lui a5,0x800 10448: 00f767b3 or a5,a4,a5 # | 0x800000; 1044c: fcf42c23 sw a5,-40(s0) 10450: fec42783 lw a5,-20(s0) /* define exponent */ 10454: 4177d793 sra a5,a5,0x17 # (ia >> 23) 10458: 0ff7f793 zext.b a5,a5 # & 0xFF 1045c: fcf42a23 sw a5,-44(s0) 10460: fe842783 lw a5,-24(s0) 10464: 4177d793 sra a5,a5,0x17 # (ib >> 23) 10468: 0ff7f793 zext.b a5,a5 # & 0xFF 1046c: fcf42823 sw a5,-48(s0) 10470: fd442703 lw a4,-44(s0) # if(ea == 0xFF && ma != 0x800000) ... 10474: 0ff00793 li a5,255 10478: 02f71463 bne a4,a5,104a0 <fmul32+0xd0> 1047c: fdc42703 lw a4,-36(s0) 10480: 008007b7 lui a5,0x800 10484: 00f70e63 beq a4,a5,104a0 <fmul32+0xd0> 10488: 7ff807b7 lui a5,0x7ff80 1048c: 00178793 add a5,a5,1 # 7ff80001 <__BSS_END__+0x7ff5b0f1> 10490: faf42423 sw a5,-88(s0) 10494: fa840793 add a5,s0,-88 10498: 0007a783 lw a5,0(a5) 1049c: 2540006f j 106f0 <fmul32+0x320> # if(eb == 0xFF && mb != 0x800000) ... 104a0: fd042703 lw a4,-48(s0) 104a4: 0ff00793 li a5,255 104a8: 02f71463 bne a4,a5,104d0 <fmul32+0x100> 104ac: fd842703 lw a4,-40(s0) 104b0: 008007b7 lui a5,0x800 104b4: 00f70e63 beq a4,a5,104d0 <fmul32+0x100> 104b8: 7ff807b7 lui a5,0x7ff80 104bc: 00178793 add a5,a5,1 # 7ff80001 <__BSS_END__+0x7ff5b0f1> 104c0: faf42223 sw a5,-92(s0) 104c4: fa440793 add a5,s0,-92 104c8: 0007a783 lw a5,0(a5) 104cc: 2240006f j 106f0 <fmul32+0x320> # if(ea == 0xFF && ma == 0x800000) ... 104d0: fd442703 lw a4,-44(s0) 104d4: 0ff00793 li a5,255 104d8: 04f71c63 bne a4,a5,10530 <fmul32+0x160> 104dc: fdc42703 lw a4,-36(s0) 104e0: 008007b7 lui a5,0x800 104e4: 04f71663 bne a4,a5,10530 <fmul32+0x160> 104e8: fd042783 lw a5,-48(s0) 104ec: 00079e63 bnez a5,10508 <fmul32+0x138> 104f0: 7f8007b7 lui a5,0x7f800 104f4: 00178793 add a5,a5,1 # 7f800001 <__BSS_END__+0x7f7db0f1> 104f8: faf42023 sw a5,-96(s0) 104fc: fa040793 add a5,s0,-96 10500: 0007a783 lw a5,0(a5) 10504: 1ec0006f j 106f0 <fmul32+0x320> 10508: fe442703 lw a4,-28(s0) 1050c: fe042783 lw a5,-32(s0) 10510: 00f747b3 xor a5,a4,a5 10514: 01f79713 sll a4,a5,0x1f 10518: 7f8007b7 lui a5,0x7f800 1051c: 00f767b3 or a5,a4,a5 10520: f8f42e23 sw a5,-100(s0) 10524: f9c40793 add a5,s0,-100 10528: 0007a783 lw a5,0(a5) # 7f800000 <__BSS_END__+0x7f7db0f0> 1052c: 1c40006f j 106f0 <fmul32+0x320> # if(eb == 0xFF && mb == 0x800000) ... 
10530: fd042703 lw a4,-48(s0) 10534: 0ff00793 li a5,255 10538: 04f71c63 bne a4,a5,10590 <fmul32+0x1c0> 1053c: fd842703 lw a4,-40(s0) 10540: 008007b7 lui a5,0x800 10544: 04f71663 bne a4,a5,10590 <fmul32+0x1c0> 10548: fd442783 lw a5,-44(s0) 1054c: 00079e63 bnez a5,10568 <fmul32+0x198> 10550: 7f8007b7 lui a5,0x7f800 10554: 00178793 add a5,a5,1 # 7f800001 <__BSS_END__+0x7f7db0f1> 10558: f8f42c23 sw a5,-104(s0) 1055c: f9840793 add a5,s0,-104 10560: 0007a783 lw a5,0(a5) 10564: 18c0006f j 106f0 <fmul32+0x320> 10568: fe442703 lw a4,-28(s0) 1056c: fe042783 lw a5,-32(s0) 10570: 00f747b3 xor a5,a4,a5 10574: 01f79713 sll a4,a5,0x1f 10578: 7f8007b7 lui a5,0x7f800 1057c: 00f767b3 or a5,a4,a5 10580: f8f42a23 sw a5,-108(s0) 10584: f9440793 add a5,s0,-108 10588: 0007a783 lw a5,0(a5) # 7f800000 <__BSS_END__+0x7f7db0f0> 1058c: 1640006f j 106f0 <fmul32+0x320> # if(ea == 0 || eb == 0) ... 10590: fd442783 lw a5,-44(s0) 10594: 00078663 beqz a5,105a0 <fmul32+0x1d0> 10598: fd042783 lw a5,-48(s0) 1059c: 02079263 bnez a5,105c0 <fmul32+0x1f0> 105a0: fe442703 lw a4,-28(s0) 105a4: fe042783 lw a5,-32(s0) 105a8: 00f747b3 xor a5,a4,a5 105ac: 01f79793 sll a5,a5,0x1f 105b0: f8f42823 sw a5,-112(s0) 105b4: f9040793 add a5,s0,-112 105b8: 0007a783 lw a5,0(a5) 105bc: 1340006f j 106f0 <fmul32+0x320> 105c0: fe442703 lw a4,-28(s0) 105c4: fe042783 lw a5,-32(s0) /* multiplication */ 105c8: 00f747b3 xor a5,a4,a5 # sr = sa ^ sb; 105cc: fcf42623 sw a5,-52(s0) 105d0: fd842583 lw a1,-40(s0) 105d4: fdc42503 lw a0,-36(s0) 105d8: cc9ff0ef jal 102a0 <imul32> # imul32(ma, mb) 105dc: 00050713 mv a4,a0 105e0: 00058793 mv a5,a1 105e4: fae42623 sw a4,-84(s0) 105e8: faf42823 sw a5,-80(s0) 105ec: fb042783 lw a5,-80(s0) 105f0: 0177d793 srl a5,a5,0x17 # >> 23 105f4: 00078713 mv a4,a5 105f8: fac42783 lw a5,-84(s0) 105fc: 00979793 sll a5,a5,0x9 10600: 00f767b3 or a5,a4,a5 10604: fcf42423 sw a5,-56(s0) 10608: fd442703 lw a4,-44(s0) 1060c: fd042783 lw a5,-48(s0) 10610: 00f707b3 add a5,a4,a5 # ea + eb 10614: f8178793 add a5,a5,-127 # - 127 10618: fcf42223 sw a5,-60(s0) 1061c: fc842783 lw a5,-56(s0) 10620: 4187d793 sra a5,a5,0x18 # getbit(mrtmp, 24) 10624: 0017f793 and a5,a5,1 # & 1; 10628: fcf42023 sw a5,-64(s0) 1062c: fc042783 lw a5,-64(s0) 10630: fc842703 lw a4,-56(s0) 10634: 40f757b3 sra a5,a4,a5 # mr = mrtmp >> mshift; 10638: faf42e23 sw a5,-68(s0) 1063c: fc042783 lw a5,-64(s0) # er = mshift ? inc(ertmp) : ertmp; 10640: 00078863 beqz a5,10650 <fmul32+0x280> 10644: fc442783 lw a5,-60(s0) 10648: 00178793 add a5,a5,1 1064c: 0080006f j 10654 <fmul32+0x284> 10650: fc442783 lw a5,-60(s0) # if(er < 0) ... 10654: faf42c23 sw a5,-72(s0) 10658: fb842783 lw a5,-72(s0) 1065c: 0207d263 bgez a5,10680 <fmul32+0x2b0> 10660: fe442703 lw a4,-28(s0) 10664: fe042783 lw a5,-32(s0) 10668: 00f747b3 xor a5,a4,a5 # (sa ^ sb) 1066c: 01f79793 sll a5,a5,0x1f # << 31 10670: f8f42623 sw a5,-116(s0) 10674: f8c40793 add a5,s0,-116 10678: 0007a783 lw a5,0(a5) 1067c: 0740006f j 106f0 <fmul32+0x320> # if(er >= 0xFF) ... 
   10680:	fb842703	lw	a4,-72(s0)
   10684:	0fe00793	li	a5,254
   10688:	02e7d663	bge	a5,a4,106b4 <fmul32+0x2e4>
   1068c:	fe442703	lw	a4,-28(s0)
   10690:	fe042783	lw	a5,-32(s0)
   10694:	00f747b3	xor	a5,a4,a5           # (sa ^ sb) << 31
   10698:	01f79713	sll	a4,a5,0x1f
   1069c:	7f8007b7	lui	a5,0x7f800
   106a0:	00f767b3	or	a5,a4,a5           # 0x7F800000 |
   106a4:	f8f42423	sw	a5,-120(s0)
   106a8:	f8840793	add	a5,s0,-120
   106ac:	0007a783	lw	a5,0(a5)           # 7f800000 <__BSS_END__+0x7f7db0f0>
   106b0:	0400006f	j	106f0 <fmul32+0x320>

# result = (sr << 31) | ((er & 0xFF) << 23) | (mr & 0x7FFFFF);
   106b4:	fcc42783	lw	a5,-52(s0)
   106b8:	01f79713	sll	a4,a5,0x1f         # << 31
   106bc:	fb842783	lw	a5,-72(s0)
   106c0:	01779693	sll	a3,a5,0x17         # << 23
   106c4:	7f8007b7	lui	a5,0x7f800
   106c8:	00f6f7b3	and	a5,a3,a5           # (er & 0xFF)
   106cc:	00f76733	or	a4,a4,a5           # (sr << 31) | ((er & 0xFF) << 23)
   106d0:	fbc42683	lw	a3,-68(s0)
   106d4:	008007b7	lui	a5,0x800
   106d8:	fff78793	add	a5,a5,-1           # 7fffff <__BSS_END__+0x7db0ef>
   106dc:	00f6f7b3	and	a5,a3,a5           # (mr & 0x7FFFFF)
   106e0:	00f767b3	or	a5,a4,a5
   106e4:	faf42a23	sw	a5,-76(s0)
   106e8:	fb440793	add	a5,s0,-76
   106ec:	0007a783	lw	a5,0(a5)

# return *(float *) &result;
   106f0:	00078513	mv	a0,a5
   106f4:	08c12083	lw	ra,140(sp)
   106f8:	08812403	lw	s0,136(sp)
   106fc:	09010113	add	sp,sp,144
   10700:	00008067	ret
```

### -O1
* **Statistics**
    * LOC (Line Of Code): `94`
    * Allocate `32` bytes on stack
    * Branching and jump used: `55`
        * 44 Branching
        * 11 Jump
    * Number of registers used: `10`
        * `ax` registers: `a0` `a1` `a2` `a3` `a4` `a5` `a6`
        * `sx` registers: `s0` `s1` `s2`
        * `tx` registers: None
    * Number of `lw` and `sw`: `14`
        * 10 `lw`
        * 4 `sw`
* **Observations**
    1. All statistics improved compared with -O0.
    2. Fewer store-word operations: -O0 spills the arguments with `sw a0,-132(s0)` and `sw a1,-136(s0)` right at the beginning, while -O1 avoids that.
    3. More registers are reused, which reduces the number of store operations.
    4. Smarter branch layout: -O0 is a straightforward translation, while -O1 splits the code, such as the special-value and overflow/underflow handling, into separate parts, so that code is only executed when its condition actually holds.
    5. From 1 and 3, the code is more compact; a smaller size means fewer cycles and a smaller cache footprint.
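Whatever optimization level is chosen, the numerical result of `fmul32` should stay the same; a tiny host-side check like the one below can confirm that while experimenting with flags. This is only a sketch (the repository's own test code in `fmul32.c` plays this role); it assumes the `int32_t`-returning signature shown earlier and must be linked against the `fmul32` object file.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int32_t fmul32(int32_t ia, int32_t ib);   /* provided by the code above */

/* reinterpret float <-> bit pattern without violating aliasing rules */
static int32_t f2i(float f) { int32_t i; memcpy(&i, &f, sizeof i); return i; }
static float   i2f(int32_t i) { float f; memcpy(&f, &i, sizeof f); return f; }

int main(void)
{
    float a = 1.5f, b = 1.25f;
    float soft = i2f(fmul32(f2i(a), f2i(b)));   /* software multiply */
    float hard = a * b;                         /* hardware multiply */
    printf("soft = %f, hard = %f\n", soft, hard);
    return 0;
}
```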
```c 0001027c <fmul32>: 1027c: fe010113 add sp,sp,-32 10280: 00112e23 sw ra,28(sp) 10284: 00812c23 sw s0,24(sp) 10288: 00912a23 sw s1,20(sp) /* define sign */ 1028c: 41f55713 sra a4,a0,0x1f # ia >> 31; 10290: 41f5d813 sra a6,a1,0x1f # ib >> 31; /* define mantissa */ 10294: 008007b7 lui a5,0x800 10298: fff78793 add a5,a5,-1 # 7fffff <__BSS_END__+0x7db0a7> 1029c: 00a7f633 and a2,a5,a0 102a0: 00b7f7b3 and a5,a5,a1 /* define exponent */ 102a4: 41755513 sra a0,a0,0x17 # ia >> 23; 102a8: 0ff57493 zext.b s1,a0 102ac: 4175d593 sra a1,a1,0x17 # ib >> 23; 102b0: 0ff5f413 zext.b s0,a1 # special values 1 102b4: 0ff00693 li a3,255 102b8: 06d48e63 beq s1,a3,10334 <fmul32+0xb8> # if(ea == 0xff) 102bc: 0ff00693 li a3,255 102c0: 0ad40063 beq s0,a3,10360 <fmul32+0xe4> # if(eb == 0xff) 102c4: 0c048063 beqz s1,10384 <fmul32+0x108> # if(ea == 0) 102c8: 0a040e63 beqz s0,10384 <fmul32+0x108> # if(eb == 0) 102cc: 01212823 sw s2,16(sp) # multiplication 102d0: 01074933 xor s2,a4,a6 102d4: 00800537 lui a0,0x800 102d8: 00a7e5b3 or a1,a5,a0 102dc: 00a66533 or a0,a2,a0 102e0: f21ff0ef jal 10200 <imul32> # imul32(ma, mb); 102e4: 0175d793 srl a5,a1,0x17 # mr.l >> 23 102e8: 00951513 sll a0,a0,0x9 # (jmr.h << 9) 102ec: 00a7e7b3 or a5,a5,a0 102f0: 00848433 add s0,s1,s0 102f4: 4187d713 sra a4,a5,0x18 # (mrtmp >> 24) 102f8: 00177693 and a3,a4,1 # & 1; 102fc: 00868733 add a4,a3,s0 # er = ertmp + mshift; 10300: f8170713 add a4,a4,-127 # ea + eb - 127; # overflow and underflow 1 10304: 08074e63 bltz a4,103a0 <fmul32+0x124> # if(er < 0) ... 10308: 0fe00613 li a2,254 1030c: 0ae64063 blt a2,a4,103ac <fmul32+0x130> # if(er >= 0xFF) ... 10310: 40d7d7b3 sra a5,a5,a3 # mr = mrtmp >> mshift; 10314: 00979793 sll a5,a5,0x9 # 10318: 0097d793 srl a5,a5,0x9 # (mr & 0x7FFFFF); 1031c: 01f91513 sll a0,s2,0x1f # (sr << 31) 10320: 00a7e7b3 or a5,a5,a0 10324: 01771713 sll a4,a4,0x17 # ((er & 0xFF) << 23) 10328: 00e7e533 or a0,a5,a4 # result 1032c: 01012903 lw s2,16(sp) 10330: 05c0006f j 1038c <fmul32+0x110> # Why we have the below code if `j 1038c`? 
# jump from 102b8
# special values 2
   10334:	08061663	bnez	a2,103c0 <fmul32+0x144>
   10338:	02d40463	beq	s0,a3,10360 <fmul32+0xe4>
   1033c:	0ff00693	li	a3,255
   10340:	f8d492e3	bne	s1,a3,102c4 <fmul32+0x48>
   10344:	f80612e3	bnez	a2,102c8 <fmul32+0x4c>
   10348:	08040463	beqz	s0,103d0 <fmul32+0x154>
   1034c:	01074533	xor	a0,a4,a6
   10350:	01f51513	sll	a0,a0,0x1f
   10354:	7f8007b7	lui	a5,0x7f800
   10358:	00f56533	or	a0,a0,a5
   1035c:	0300006f	j	1038c <fmul32+0x110>
   10360:	06079463	bnez	a5,103c8 <fmul32+0x14c>
   10364:	0ff00693	li	a3,255
   10368:	06d48c63	beq	s1,a3,103e0 <fmul32+0x164>
   1036c:	06048663	beqz	s1,103d8 <fmul32+0x15c>
   10370:	01074533	xor	a0,a4,a6
   10374:	01f51513	sll	a0,a0,0x1f
   10378:	7f8007b7	lui	a5,0x7f800
   1037c:	00f56533	or	a0,a0,a5
   10380:	00c0006f	j	1038c <fmul32+0x110>
   10384:	01074533	xor	a0,a4,a6
   10388:	01f51513	sll	a0,a0,0x1f

# restore and return
   1038c:	01c12083	lw	ra,28(sp)
   10390:	01812403	lw	s0,24(sp)
   10394:	01412483	lw	s1,20(sp)
   10398:	02010113	add	sp,sp,32
   1039c:	00008067	ret

# overflow and underflow 2
   103a0:	01f91513	sll	a0,s2,0x1f
   103a4:	01012903	lw	s2,16(sp)
   103a8:	fe5ff06f	j	1038c <fmul32+0x110>
   103ac:	01f91513	sll	a0,s2,0x1f
   103b0:	7f8007b7	lui	a5,0x7f800
   103b4:	00f56533	or	a0,a0,a5
   103b8:	01012903	lw	s2,16(sp)
   103bc:	fd1ff06f	j	1038c <fmul32+0x110>
   103c0:	f281a503	lw	a0,-216(gp)        # 248f8 <__SDATA_BEGIN__+0x68>
   103c4:	fc9ff06f	j	1038c <fmul32+0x110>
   103c8:	f281a503	lw	a0,-216(gp)        # 248f8 <__SDATA_BEGIN__+0x68>
   103cc:	fc1ff06f	j	1038c <fmul32+0x110>
   103d0:	f2c1a503	lw	a0,-212(gp)        # 248fc <__SDATA_BEGIN__+0x6c>
   103d4:	fb9ff06f	j	1038c <fmul32+0x110>
   103d8:	f2c1a503	lw	a0,-212(gp)        # 248fc <__SDATA_BEGIN__+0x6c>
   103dc:	fb1ff06f	j	1038c <fmul32+0x110>
   103e0:	f60606e3	beqz	a2,1034c <fmul32+0xd0>
   103e4:	0ff00693	li	a3,255
   103e8:	eed410e3	bne	s0,a3,102c8 <fmul32+0x4c>
   103ec:	ee0790e3	bnez	a5,102cc <fmul32+0x50>
   103f0:	f81ff06f	j	10370 <fmul32+0xf4>
```

### -O2
* **Statistics**
    * LOC (Line Of Code): `96`
    * Allocate `48` bytes on stack
    * Branching and jump used: `42`
        * 38 Branching
        * 4 Jump
    * Number of registers used: `10`
        * `ax` registers: `a0` `a1` `a2` `a3` `a4` `a5`
        * `sx` registers: `s0` `s1` `s2` `s3`
        * `tx` registers: None
    * Number of `lw` and `sw`: `31`
        * 26 `lw`
        * 5 `sw`
* **Observations**
    1. Pretty much the same statistics as -O1.
    2. Fewer branches and jumps than -O1 (42 vs. 55), but more `lw`/`sw` instructions (31 vs. 14).
    3. More of the checks are split apart by branching: once an earlier condition is satisfied, the rest of the code becomes irrelevant and is skipped, which acts as an early exit.
```c 000104a4 <fmul32>: 104a4: fd010113 add sp,sp,-48 104a8: 02812423 sw s0,40(sp) 104ac: 008006b7 lui a3,0x800 # define exponent 1 104b0: 41755413 sra s0,a0,0x17 # ia >> 23 104b4: 03212023 sw s2,32(sp) 104b8: 01312e23 sw s3,28(sp) 104bc: 4175d913 sra s2,a1,0x17 # ib >> 23 104c0: fff68993 add s3,a3,-1 # 7fffff <__BSS_END__+0x7db0a7> 104c4: 02112623 sw ra,44(sp) 104c8: 0ff47413 zext.b s0,s0 104cc: 0ff00713 li a4,255 # define sign 1 104d0: 41f55793 sra a5,a0,0x1f # ia >> 31; 104d4: 41f5d613 sra a2,a1,0x1f # ib >> 31; # define mantissa 1 104d8: 00a9f533 and a0,s3,a0 # (ia & 0x7FFFFF) 104dc: 00b9f5b3 and a1,s3,a1 # (ib & 0x7FFFFF) 104e0: 0ff97913 zext.b s2,s2 # special values 1 104e4: 0ee40463 beq s0,a4,105cc <fmul32+0x128> # if(ea == 0xFF) 104e8: 08e90263 beq s2,a4,1056c <fmul32+0xc8> # if(ea == 0xFF) # define sign 2 104ec: 02912223 sw s1,36(sp) 104f0: 00c7c4b3 xor s1,a5,a2 # sr = #s1 # if(ea == 0 || eb == 0) 104f4: 08040e63 beqz s0,10590 <fmul32+0xec> 104f8: 08090c63 beqz s2,10590 <fmul32+0xec> # define mantissa 2 104fc: 00d5e5b3 or a1,a1,a3 # ma = #a1 10500: 00d56533 or a0,a0,a3 # mb = #a0 # multiplication 10504: f2dff0ef jal 10430 <imul32> # imul32(ma, mb); 10508: 00951513 sll a0,a0,0x9 # (jmr.h << 9) 1050c: 0175d793 srl a5,a1,0x17 # (jmr.l >> 23) 10510: 00a7e7b3 or a5,a5,a0 # mrtmp = #a5 10514: 4187d693 sra a3,a5,0x18 # (mrtmp >> 24) 10518: 0016f713 and a4,a3,1 # mshift = #a4 1051c: 01240433 add s0,s0,s2 10520: 00870733 add a4,a4,s0 10524: f8170713 add a4,a4,-127 # er = #a4 10528: 0016f693 and a3,a3,1 1052c: 01f49513 sll a0,s1,0x1f # (sr << 31) # overflow and underflow 1 10530: 0c074c63 bltz a4,10608 <fmul32+0x164> 10534: 0fe00613 li a2,254 10538: 0ce64c63 blt a2,a4,10610 <fmul32+0x16c> # restore 1053c: 02c12083 lw ra,44(sp) 10540: 02812403 lw s0,40(sp) 10544: 40d7d7b3 sra a5,a5,a3 10548: 0137f7b3 and a5,a5,s3 1054c: 00a7e7b3 or a5,a5,a0 # (sr << 31) | (mr & 0x7FFFFF) 10550: 01771713 sll a4,a4,0x17 # ((er & 0xFF) << 23) 10554: 02412483 lw s1,36(sp) 10558: 02012903 lw s2,32(sp) 1055c: 01c12983 lw s3,28(sp) 10560: 00e7e533 or a0,a5,a4 # result 10564: 03010113 add sp,sp,48 # return 10568: 00008067 ret # special values 2 1056c: 04059263 bnez a1,105b0 <fmul32+0x10c> 10570: 06041863 bnez s0,105e0 <fmul32+0x13c> 10574: 02c12083 lw ra,44(sp) 10578: 02812403 lw s0,40(sp) 1057c: f281a503 lw a0,-216(gp) # 248f8 <__SDATA_BEGIN__+0x68> 10580: 02012903 lw s2,32(sp) 10584: 01c12983 lw s3,28(sp) 10588: 03010113 add sp,sp,48 1058c: 00008067 ret 10590: 01f49513 sll a0,s1,0x1f 10594: 02412483 lw s1,36(sp) 10598: 02c12083 lw ra,44(sp) 1059c: 02812403 lw s0,40(sp) 105a0: 02012903 lw s2,32(sp) 105a4: 01c12983 lw s3,28(sp) 105a8: 03010113 add sp,sp,48 105ac: 00008067 ret 105b0: 02c12083 lw ra,44(sp) 105b4: 02812403 lw s0,40(sp) 105b8: f2c1a503 lw a0,-212(gp) # 248fc <__SDATA_BEGIN__+0x6c> 105bc: 02012903 lw s2,32(sp) 105c0: 01c12983 lw s3,28(sp) 105c4: 03010113 add sp,sp,48 105c8: 00008067 ret 105cc: fe0512e3 bnez a0,105b0 <fmul32+0x10c> 105d0: 00890663 beq s2,s0,105dc <fmul32+0x138> 105d4: fa0900e3 beqz s2,10574 <fmul32+0xd0> 105d8: 0080006f j 105e0 <fmul32+0x13c> 105dc: fc059ae3 bnez a1,105b0 <fmul32+0x10c> 105e0: 02c12083 lw ra,44(sp) 105e4: 02812403 lw s0,40(sp) 105e8: 00c7c533 xor a0,a5,a2 105ec: 01f51513 sll a0,a0,0x1f 105f0: 7f8007b7 lui a5,0x7f800 105f4: 02012903 lw s2,32(sp) 105f8: 01c12983 lw s3,28(sp) 105fc: 00f56533 or a0,a0,a5 10600: 03010113 add sp,sp,48 10604: 00008067 ret # overflow and underflow 2 10608: 02412483 lw s1,36(sp) 1060c: f8dff06f j 10598 <fmul32+0xf4> 10610: 7f8007b7 lui a5,0x7f800 
10614: 02412483 lw s1,36(sp) 10618: 00f56533 or a0,a0,a5 1061c: f7dff06f j 10598 <fmul32+0xf4> ``` ### -O3 * **Statistics** * LOC (Line Of Code): `85` * Allocate `0` bytes on stack * Branching and jump used: `46` * 45 Branching * 1 Jump * Number of registers used: `16` * `ax` registers: `a0` `a1` `a2` `a3` `a4` `a5` `a6` `a7` `a8` * `sx` registers: None * `tx` registers: `t0` `t1` `t2` `t3` `t4` `t5` `t6` * Number of `lw` and `sw`: `2` * 2 `lw` * 0 `sw` * **Observations** 1. Take more checking apart, and even the normal operation like defining s, e, m. 2. Extremely less load and save. 3. More strategies, not straightforward translation, like below * In -O2, it coded as ```c lui a3,0x800 add s3,a3,-1 and a5,a5,s3 ``` in -O3, it directly does ```c slli a5,a5,0x9 srli a5,a5,0x9 ``` to replace any `& 0x7FFFFF`, since the latter reduces the instructions and register usage. ```c 000104a4 <fmul32>: # define exponent 1 104a4: 41755f13 sra t5,a0,0x17 # ia >> 23 104a8: 00800737 lui a4,0x800 104ac: fff70e13 add t3,a4,-1 # 7fffff <__BSS_END__+0x7db0a7> 104b0: 4175d293 sra t0,a1,0x17 # ib >> 23 104b4: 0fff7f13 zext.b t5,t5 104b8: 0ff00793 li a5,255 # defina sign 1 104bc: 41f55f93 sra t6,a0,0x1f # ia >> 31 104c0: 41f5d693 sra a3,a1,0x1f # ib >> 31 # define mantissa 1 104c4: 00ae7533 and a0,t3,a0 # ia & 0x7FFFFF 104c8: 0ff2f293 zext.b t0,t0 104cc: 00be7e33 and t3,t3,a1 # ib & 0x7FFFFF # special values 1 104d0: 0eff0663 beq t5,a5,105bc <fmul32+0x118> # if(ea == 0xFF) 104d4: 0cf28463 beq t0,a5,1059c <fmul32+0xf8> # if(eb == 0xFF) # defina sign 2 104d8: 00dfcfb3 xor t6,t6,a3 # sa = #t6 # special values 2 104dc: 0c0f0863 beqz t5,105ac <fmul32+0x108> # if(ea == 0) 104e0: 0c028663 beqz t0,105ac <fmul32+0x108> # if(eb == 0) # define mantissa 2 104e4: 00e568b3 or a7,a0,a4 # ma = #a7 104e8: 00ee6e33 or t3,t3,a4 # mb = #t3 # imul32, not jumping 104ec: 00000e93 li t4,0 104f0: 00000813 li a6,0 104f4: 00000693 li a3,0 # counter 104f8: 01f00393 li t2,31 104fc: 02000513 li a0,32 10500: 00d89633 sll a2,a7,a3 10504: 01060333 add t1,a2,a6 10508: 010645b3 xor a1,a2,a6 1050c: 40d387b3 sub a5,t2,a3 10510: fff34713 not a4,t1 10514: 01067633 and a2,a2,a6 10518: 00b77733 and a4,a4,a1 1051c: 00f8d7b3 srl a5,a7,a5 10520: 40de55b3 sra a1,t3,a3 # mb >> 10524: 00c76733 or a4,a4,a2 10528: 0017d793 srl a5,a5,0x1 1052c: 0015f613 and a2,a1,1 10530: 00168693 add a3,a3,1 10534: 01d787b3 add a5,a5,t4 10538: 01f75713 srl a4,a4,0x1f 1053c: 00060663 beqz a2,10548 <fmul32+0xa4> # if (mb == 0) 10540: 00f70eb3 add t4,a4,a5 10544: 00030813 mv a6,t1 10548: faa69ce3 bne a3,a0,10500 <fmul32+0x5c> # imul32 end 1054c: 01785813 srl a6,a6,0x17 # (jmr.l >> 23) 10550: 009e9e93 sll t4,t4,0x9 # (jmr.h << 9) 10554: 01d86833 or a6,a6,t4 # mrtmp = #a6 10558: 41885793 sra a5,a6,0x18 # (mrtmp >> 24) 1055c: 0017f713 and a4,a5,1 # mshift = #a4 10560: 005f0f33 add t5,t5,t0 10564: 01e70733 add a4,a4,t5 10568: f8170713 add a4,a4,-127 # er = #a4 1056c: 0017f793 and a5,a5,1 10570: 01ff9f93 sll t6,t6,0x1f # (sr << 31) # overflow and underflow 1 10574: 06074863 bltz a4,105e4 <fmul32+0x140> 10578: 0fe00693 li a3,254 1057c: 06e6c863 blt a3,a4,105ec <fmul32+0x148> 10580: 40f857b3 sra a5,a6,a5 # mrtmp >> mshift; mr = #a5 # (mr & 0x7FFFFF) 10584: 00979793 sll a5,a5,0x9 10588: 0097d793 srl a5,a5,0x9 ################# 1058c: 01f7e7b3 or a5,a5,t6 10590: 01771713 sll a4,a4,0x17 # ((er & 0xFF) << 23) 10594: 00e7e533 or a0,a5,a4 # result # return 10598: 00008067 ret # special values 3 1059c: 000e1c63 bnez t3,105b4 <fmul32+0x110> 105a0: 020f1863 bnez t5,105d0 <fmul32+0x12c> 105a4: 
f281a503 lw a0,-216(gp) # 248f8 <__SDATA_BEGIN__+0x68> 105a8: 00008067 ret 105ac: 01ff9513 sll a0,t6,0x1f 105b0: 00008067 ret 105b4: f2c1a503 lw a0,-212(gp) # 248fc <__SDATA_BEGIN__+0x6c> 105b8: 00008067 ret 105bc: fe051ce3 bnez a0,105b4 <fmul32+0x110> 105c0: 01e28663 beq t0,t5,105cc <fmul32+0x128> 105c4: fe0280e3 beqz t0,105a4 <fmul32+0x100> 105c8: 0080006f j 105d0 <fmul32+0x12c> 105cc: fe0e14e3 bnez t3,105b4 <fmul32+0x110> 105d0: 00dfc533 xor a0,t6,a3 105d4: 7f8007b7 lui a5,0x7f800 105d8: 01f51513 sll a0,a0,0x1f 105dc: 00f56533 or a0,a0,a5 105e0: 00008067 ret # overflow and underflow 2 105e4: 000f8513 mv a0,t6 105e8: 00008067 ret 105ec: 7f8007b7 lui a5,0x7f800 105f0: 00ffe533 or a0,t6,a5 105f4: 00008067 ret ``` ### -Ofast * **Observations** * No difference between -Ofast and -O3 on fmul32. ### -Os * **Statistics** * LOC (Line Of Code): `67` * Allocate `32` bytes on stack * Branching and jump used: `40` * 35 Branching * 5 Jump * Number of registers used: `12` * `ax` registers: `a0` `a1` `a2` `a3` `a4` `a5` `a6` `a7` `a8` * `sx` registers: `s0` `s1` `s2` * `tx` registers: None * Number of `lw` and `sw`: `10` * 6 `lw` * 4 `sw` * **Observations** 1. Even fewer LOC. 2. Lesser branching but more jumping. ```c 00010460 <fmul32>: 10460: fe010113 add sp,sp,-32 10464: 00800837 lui a6,0x800 10468: fff80713 add a4,a6,-1 # 7fffff <__BSS_END__+0x7db0a7> 1046c: 00912a23 sw s1,20(sp) 10470: 41755493 sra s1,a0,0x17 10474: 01212823 sw s2,16(sp) 10478: 41f5d613 sra a2,a1,0x1f 1047c: 4175d913 sra s2,a1,0x17 10480: 00a776b3 and a3,a4,a0 10484: 00112e23 sw ra,28(sp) 10488: 00b77733 and a4,a4,a1 1048c: 00812c23 sw s0,24(sp) 10490: 0ff4f493 zext.b s1,s1 10494: 0ff00593 li a1,255 10498: 41f55793 sra a5,a0,0x1f 1049c: 0ff97913 zext.b s2,s2 104a0: 0ab49063 bne s1,a1,10540 <fmul32+0xe0> 104a4: 00069663 bnez a3,104b0 <fmul32+0x50> 104a8: 00991863 bne s2,s1,104b8 <fmul32+0x58> 104ac: 08070063 beqz a4,1052c <fmul32+0xcc> 104b0: f2c1a503 lw a0,-212(gp) # 248fc <__SDATA_BEGIN__+0x6c> 104b4: 0a00006f j 10554 <fmul32+0xf4> 104b8: 08090063 beqz s2,10538 <fmul32+0xd8> 104bc: 00c7c533 xor a0,a5,a2 104c0: 01f51513 sll a0,a0,0x1f 104c4: 7f8007b7 lui a5,0x7f800 104c8: 00f56533 or a0,a0,a5 104cc: 0880006f j 10554 <fmul32+0xf4> 104d0: 010765b3 or a1,a4,a6 104d4: 0106e533 or a0,a3,a6 104d8: f15ff0ef jal 103ec <imul32> 104dc: 0175d593 srl a1,a1,0x17 104e0: 00951513 sll a0,a0,0x9 104e4: 00a5e5b3 or a1,a1,a0 104e8: 4185d713 sra a4,a1,0x18 104ec: 012484b3 add s1,s1,s2 104f0: 00177713 and a4,a4,1 104f4: f8248793 add a5,s1,-126 104f8: 00071463 bnez a4,10500 <fmul32+0xa0> 104fc: f8148793 add a5,s1,-127 10500: 01f41513 sll a0,s0,0x1f 10504: 0407c863 bltz a5,10554 <fmul32+0xf4> 10508: 0fe00693 li a3,254 1050c: faf6cce3 blt a3,a5,104c4 <fmul32+0x64> 10510: 40e5d5b3 sra a1,a1,a4 10514: 00959593 sll a1,a1,0x9 10518: 0095d593 srl a1,a1,0x9 1051c: 00a5e5b3 or a1,a1,a0 10520: 01779793 sll a5,a5,0x17 10524: 00f5e533 or a0,a1,a5 10528: 02c0006f j 10554 <fmul32+0xf4> 1052c: 0ff00713 li a4,255 10530: f8e486e3 beq s1,a4,104bc <fmul32+0x5c> 10534: f80494e3 bnez s1,104bc <fmul32+0x5c> 10538: f281a503 lw a0,-216(gp) # 248f8 <__SDATA_BEGIN__+0x68> 1053c: 0180006f j 10554 <fmul32+0xf4> 10540: f6b906e3 beq s2,a1,104ac <fmul32+0x4c> 10544: 00c7c433 xor s0,a5,a2 10548: 00048463 beqz s1,10550 <fmul32+0xf0> 1054c: f80912e3 bnez s2,104d0 <fmul32+0x70> 10550: 01f41513 sll a0,s0,0x1f 10554: 01c12083 lw ra,28(sp) 10558: 01812403 lw s0,24(sp) 1055c: 01412483 lw s1,20(sp) 10560: 01012903 lw s2,16(sp) 10564: 02010113 add sp,sp,32 10568: 00008067 ret ``` ### Compare 
with ASM code from kc71486 The following code is from [here](https://github.com/kc71486/Computer-Architecture/blob/main/hw1/quizc_v5.s). * **Statistics** * LOC (Line Of Code): `129` * Allocate `120` bytes on stack * Branching and jump used: `18` * 10 Branching * 8 Jump * Number of registers used: `10` * `ax` registers: `a0` `a1` * `sx` registers: `s0` `s1` * `tx` registers: `t0` `t1` `t2` `t4` `t5` * Number of `lw` and `sw`: `66` * 22 `lw` * 34 `sw` * **Observations** 1. There is lots of `sw` and `lw` just for preserving the registers. 2. Need actually about 80 byte. ```c fmul32: addi sp, sp, -120 # allocate stack sw ra, 0(sp) # save registers sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp) sw s3, 16(sp) sw s4, 20(sp) sw s5, 24(sp) sw s6, 28(sp) sw s7, 32(sp) sw s8, 36(sp) sw s9, 40(sp) sw s10, 44(sp) sw s11, 48(sp) li s0, 0x7fffff # #s0 = 0x7fffff li s1, 0x800000 # #s1 = 0x800000 li s2, 0xff # #s2 = 0xff xor t0, a0, a1 srli t0, t0, 31 sw t0, 52(sp) # sr = (sp + 52) and t0, a0, s0 or t0, t0, s1 sw t0, 56(sp) # ma = (sp + 56) and t0, a1, s0 or t0, t0, s1 sw t0, 60(sp) # mb = (sp + 60) sw x0, 64(sp) # mr = (sp + 64) srai t0, a0, 23 and t0, t0, s2 sw t0, 68(sp) # ea = (sp + 68) srai t0, a1, 23 and t0, t0, s2 sw t0, 72(sp) # eb = (sp + 72) sw x0, 76(sp) # er = (sp + 76) # special values lw t0, 68(sp) bne t0, s2, mulsan # skip if ea != 0xff lw t0, 56(sp) lw t1, 72(sp) bne t0, s1, mulsaan beq t1, x0, mulsaan # skip if ma != 0x800000 || eb == 0 lw t0, 52(sp) li a0, 0x7f800000 slli t0, t0, 31 or a0, a0, t0 # return 0x7f800000 | sr << 31 j fmul32ret mulsaan: li a0, 0x7f800001 # return 0x7f800001 j fmul32ret mulsan: lw t0, 72(sp) bne t0, s2, mulsbn # skip if eb != 0xff lw t0, 60(sp) lw t1, 68(sp) bne t0, s1, mulsbbn beq t1, x0, mulsbbn # skip if mb != 0x800000 || ea == 0 lw t0, 52(sp) li a0, 0x7f800000 slli t0, t0, 31 or a0, a0, t0 # return 0x7f800000 | sr << 31 mulsbbn: li a0, 0x7f800001 # return 0x7f800001 j fmul32ret mulsbn: lw t0, 68(sp) lw t1, 72(sp) bne t0, x0, mulsz1n # skip if ea != 0 lw a0, 52(sp) slli a0, a0, 31 # return sr << 31; j fmul32ret mulsz1n: bne t1, x0, mulsz2n # skip if eb != 0 lw a0, 52(sp) slli a0, a0, 31 # return sr << 31; j fmul32ret mulsz2n: # multiplication lw a0, 56(sp) lw a1, 60(sp) call mmul # mrtmp = #a0 = mmul(ma, mb) lw t0, 68(sp) lw t1, 72(sp) add t0, t0, t1 addi t4, t0, -127 # ertmp = #t4 = ea + eb - 127 # realign mantissa srli t5, a0, 24 andi t5, t5, 1 # mshift = #t5 = (mrtmp >>> 24) & 1 srl t0, a0, t5 sw t0, 64(sp) # mr = mrtmp >> mshift add t0, t4, t5 sw t0, 76(sp) # er = ertmp + mshift # overflow and underflow bgt t0, x0, mulun # skip if er > 0 lw a0, 52(sp) slli a0, a0, 31 # return sr << 31 j fmul32ret mulun: blt t0, s2, mulon # skip if er < 0xff lw t0, 52(sp) li a0, 0x7f800000 slli t0, t0, 31 or a0, a0, t0 # return 0x7f800000 | sr << 31 j fmul32ret mulon: lw t0, 52(sp) lw t1, 76(sp) lw t2, 64(sp) slli t0, t0, 31 and t1, t1, s2 slli t1, t1, 23 and t2, t2, s0 or a0, t0, t1 or a0, a0, t2 # return (sr << 31) | ((er & 0xff) << 23) | (mr & 0x7fffff) fmul32ret: lw ra, 0(sp) # restore registers lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4, 20(sp) lw s5, 24(sp) lw s6, 28(sp) lw s7, 32(sp) lw s8, 36(sp) lw s9, 40(sp) lw s10, 44(sp) lw s11, 48(sp) addi sp, sp, 120 # free stack jr ra ``` ### Cycle Count from perfcounter The following number is obtained from ```bash riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 ${OPT} -Wall -c -o getcycles.o getcycles.S riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 ${OPT} -Wall -c -o getinstret.o 
getinstret.S
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 ${OPT} -Wall -c -o mmul32.o mmul32.c
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 ${OPT} -Wall -c -o main.o main.c
riscv-none-elf-gcc -o perfcount.elf getcycles.o getinstret.o mmul32.o main.o
```
where `${OPT}` represents the optimization level.

|             | O0    | O1    | O2    | O3        | Os        |
| ----------- | ----- | ----- | ----- | --------- | --------- |
| cycle       | 81151 | 64486 | 69720 | **62941** | 64270     |
| size(bytes) | 85384 | 83852 | 83924 | 84268     | **83548** |

* Observation

**I double-checked and the `-O2` cycle count is correct: it is almost 10% higher than `-O1`.** The symptom is already visible in the earlier statistics: `-O2` uses 31 `lw`/`sw` instructions while `-O1` needs only 14, and that is for the `fmul32` function alone. Two guesses about this phenomenon:
1. In the size statistics, `-O1` produces 83852 bytes while `-O2` produces 83924 bytes, which suggests that **the trade-off between code size and execution efficiency did not work out favorably** at `-O2`.
2. The `-O3` optimization level, on the other hand, seems to have found a better balance, **reducing the cycle count again while possibly employing other optimization strategies** that mitigate the code-size increase or improve execution efficiency in other ways.

## Potential Improvement
Inspired by my HW1, I would like to make matrix multiplication built on `fmul32` faster; more specifically, I will focus on making `mmul32` faster in certain situations.

### 3-Number Floating Multiplication
Consider the following C code that performs the mantissa multiplication:
```c
int32_t mmul(register int32_t a, register int32_t b) {
    register int32_t r = 0;
    a = a << 1; /* to counter last right shift */
    do {
        if((b & 1) != 0) {
            r = r + a;
        }
        b = b >> 1;
        r = r >> 1;
    } while(b != 0);
    return r;
}
```
I adapted the following assembly so it can be linked against C code compiled with `riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32`:
```c
.text
.globl mmul32_s
.align 2

mmul32_s:
    addi sp, sp, -20
    sw ra, 0(sp)
    sw s0, 4(sp)
    sw s1, 8(sp)
    sw t0, 12(sp)
    sw t1, 16(sp)

    li t0, 0                  # r = #t0
    slli a0, a0, 1
loop:
    andi t1, a1, 1
    beq t1, x0, skip_add_a    # skip add if (b & 1) == 0
    add t0, t0, a0
skip_add_a:
    srli a1, a1, 1
    srli t0, t0, 1
    bne a1, x0, loop          # loop back if b != 0

    mv a0, t0
    lw ra, 0(sp)
    lw s0, 4(sp)
    lw s1, 8(sp)
    lw t0, 12(sp)
    lw t1, 16(sp)
    addi sp, sp, 20
    ret
```

For any normalized IEEE 754 float the implicit leading bit is always set, so the loop runs over all 24 mantissa bits even though the `while(b != 0)` early exit is applied; this is unavoidable in most cases. Multiplying serially is therefore wasteful, because every multiplication has to walk through all 24 bits again. We could "steal" iterations by processing several numbers at the same time; see my [homework1](https://hackmd.io/DfukrcL1RFmXJ-zoTjj80A).

A naive way to multiply three numbers, inherited from ordinary long multiplication, looks like the following:
```c
int32_t imul32_3num(int32_t a, int32_t b, int32_t c) {
    int32_t r = 0;
    for (int i = 0; i < 32; i++) {
        if (getbit(b, i)) {
            for (int j = 0; j < 32; j++) {
                if (getbit(c, j)) {
                    r += a << (i + j);
                }
            }
        }
    }
    return r;
}
```
But we could actually record the bit positions of b and c up front, avoiding scanning them again and again:
```c
int32_t imul32_3num(int32_t a, int32_t b, int32_t c) {
    int32_t r = 0;
    for (int i = 0; i < 32; i++) {
        if (getbit(b, i)) ...  /* record the bit position */
        if (getbit(c, i)) ...  /* record the bit position */
```
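To make the idea concrete, here is a rough stand-alone C version of the recorded-positions approach. This is only a sketch, not the assembly implementation below; `getbit` is replaced by an explicit shift-and-mask so the snippet compiles on its own.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch: scan b and c once, remember where their 1 bits are, then combine
 * only the recorded positions.  It computes the same value as the naive
 * imul32_3num above (modulo 32-bit overflow), but b and c are each scanned
 * a single time instead of once per outer-loop iteration. */
int32_t imul32_3num_rec(int32_t a, int32_t b, int32_t c)
{
    int pos_b[32], pos_c[32];
    int nb = 0, nc = 0;
    for (int i = 0; i < 32; i++) {          /* single scan of b and c */
        if ((b >> i) & 1)
            pos_b[nb++] = i;
        if ((c >> i) & 1)
            pos_c[nc++] = i;
    }
    int32_t r = 0;
    for (int i = 0; i < nb; i++)            /* combine recorded positions only */
        for (int j = 0; j < nc; j++)
            r += a << (pos_b[i] + pos_c[j]);
    return r;
}

int main(void)
{
    printf("%d\n", imul32_3num_rec(3, 5, 7));   /* expect 105 */
    return 0;
}
```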
Here is how I modified `mmul` into a 3-number version in assembly:
```c
.text
.globl mmul32_3num_s
.align 2

mmul32_3num_s:
    addi sp, sp, -96
    sw ra, 0(sp)
    sw s0, 4(sp)
    sw s1, 8(sp)
    sw t0, 12(sp)
    sw t1, 16(sp)

    addi s0, sp, 20
    addi s1, sp, 52
    li t2, 23
scan_bc:
    beqz t2, end_scan_bc
    andi t0, a1, 1
    andi t1, a2, 1
    srli a1, a1, 1
    srli a2, a2, 1
    bnez t0, save_pos_counterB
    bnez t1, save_pos_counterC
    addi t2, t2, -1           # t2 = counter
    j scan_bc
save_pos_counterB:
    sw t2, 0(s0)
    addi s0, s0, 4            # s0 = &counterB
    addi t3, t3, 1            # t3 = # of 1s in b
    bnez t1, save_pos_counterC
    addi t2, t2, -1
    j scan_bc
save_pos_counterC:
    sw t2, 0(s1)
    addi s1, s1, 4            # s1 = &counterC
    addi t4, t4, 1            # t4 = # of 1s in c
    addi t2, t2, -1
    j scan_bc
end_scan_bc:
    addi s0, s0, -4
    addi s1, s1, -4
    addi t2, t2, -1
    sw s1, 88(sp)
    sw t4, 92(sp)
loop_b_bit:
    beqz t3, end_multiply
    addi t3, t3, -1
    lw t0, 0(s0)
loop_c_bit:
    addi t4, t4, -1
    lw t1, 0(s1)
    add t2, t0, t1
    srl t2, a0, t2
    add t5, t5, t2
    srl t2, a0, t0
    add t5, t5, t2
    srl t2, a0, t1
    add t5, t5, t2
    addi s1, s1, -4
    bnez t4, loop_c_bit
    lw s1, 88(sp)
    lw t4, 92(sp)
    addi s0, s0, -4
    j loop_b_bit
end_multiply:
    add a0, a0, t5
    lw ra, 0(sp)
    lw s0, 4(sp)
    lw s1, 8(sp)
    lw t0, 12(sp)
    lw t1, 16(sp)
    addi sp, sp, 96
    ret
```

Their cycle counts are measured with the following program:
```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

extern uint64_t get_cycles();
extern uint64_t get_instret();

/*
 * Taken from the Sparkle-suite which is a collection of lightweight symmetric
 * cryptographic algorithms currently in the final round of the NIST
 * standardization effort.
 * See https://sparkle-lwc.github.io/
 */
extern int32_t mmul32_s(int32_t, int32_t);
extern int32_t mmul32_3num_s(int32_t, int32_t, int32_t);

int main(void)
{
    int32_t a = 0x3fc00000;
    int32_t b = 0x3fa00000;
    int32_t c = 0x3f900000;
    int32_t ma = (a & 0x7fffff) | 0x800000;
    int32_t mb = (b & 0x7fffff) | 0x800000;
    int32_t mc = (c & 0x7fffff) | 0x800000;

    /* measure cycles */
    uint64_t instret = get_instret();
    uint64_t oldcount = get_cycles();
    int32_t r_3num = mmul32_3num_s(ma, mb, mc);
    uint64_t cyclecount = get_cycles() - oldcount;

    printf("cycle count: %u\n", (unsigned int) cyclecount);
    printf("instret: %x\n", (unsigned) (instret & 0xffffffff));

    instret = get_instret();
    oldcount = get_cycles();
    int32_t temp = mmul32_s(ma, mb);
    int32_t r = mmul32_s(temp, mc);
    cyclecount = get_cycles() - oldcount;

    printf("cycle count: %u\n", (unsigned int) cyclecount);
    printf("instret: %x\n", (unsigned) (instret & 0xffffffff));

    assert(r == r_3num);

    return 0;
}
```

We get the following results:

|          | O0  | O1  | O2  | O3  | Os  |
| -------- | --- | --- | --- | --- | --- |
| Original | 292 | 289 | 289 | 289 | 289 |
| 3num     | 268 | 267 | 268 | 268 | 268 |

:::danger
Unfortunately, in most other cases this approach cannot beat the original multiplication. For example, with `a=0x3fc10000, b=0x3fa12c00, c=0x3f953000` the cycle counts become `298` (original) vs. `585` (3num), because the number of 1 bits increases. This optimization actually fails in most cases; I am still looking for possible improvements.
:::

### 3-Matrix Multiplication
#### The Real Application
In some scenarios we need to multiply exactly three floating-point numbers, as in [Coordinate Transformation](). Coordinate transformation is essential in real-world applications whenever a posture (attitude) change is involved; when the object (a vehicle, a missile, etc.) moves frequently over time, computing the transformation efficiently becomes critical.

:::warning
:warning: Rewrite the above with LaTeX.
:notes: jserv
:::

$$
R_xR_yR_z =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta_x) & -\sin(\theta_x) \\ 0 & \sin(\theta_x) & \cos(\theta_x) \end{bmatrix}
\begin{bmatrix} \cos(\theta_y) & 0 & \sin(\theta_y) \\ 0 & 1 & 0 \\ -\sin(\theta_y) & 0 & \cos(\theta_y) \end{bmatrix}
\begin{bmatrix} \cos(\theta_z) & -\sin(\theta_z) & 0 \\ \sin(\theta_z) & \cos(\theta_z) & 0 \\ 0 & 0 & 1 \end{bmatrix}
$$

$$
= \begin{bmatrix} \cos(\theta_y)\cos(\theta_z) & -\sin(\theta_z)\cos(\theta_y) & \sin(\theta_y) \\ \sin(\theta_x)\sin(\theta_y)\cos(\theta_z) + \sin(\theta_z)\cos(\theta_x) & -\sin(\theta_x)\sin(\theta_y)\sin(\theta_z) + \cos(\theta_x)\cos(\theta_z) & -\sin(\theta_x)\cos(\theta_y) \\ \sin(\theta_x)\sin(\theta_z) - \sin(\theta_y)\cos(\theta_x)\cos(\theta_z) & \sin(\theta_x)\cos(\theta_z) + \sin(\theta_y)\sin(\theta_z)\cos(\theta_x) & \cos(\theta_x)\cos(\theta_y) \end{bmatrix}
$$

Even though some entries of the matrix above share duplicated factors, consider applying it to a vector, that is

> $\begin{bmatrix} x'\\y'\\z' \end{bmatrix} = R_xR_yR_z\begin{bmatrix} x\\y\\z \end{bmatrix}$

Products of three numbers still occur frequently here (see the sketch below). On the other hand, applications that require high throughput when computing rotation matrices may rely on dedicated hardware, in which case this algorithm might be of little use.
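To make the connection concrete, here is a small C sketch showing where a fused three-number multiply could slot into computing one entry of $R_xR_yR_z$. The function `fmul3` is a hypothetical stand-in for a routine built on `mmul32_3num_s` and is implemented here as a plain product so the example compiles; it is not part of the repository.

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical stand-in for a fused 3-number multiply; a real version could
 * be built on top of mmul32_3num_s.  Here it is just a plain product. */
static float fmul3(float a, float b, float c)
{
    return a * b * c;
}

/* Entry (3,1) of RxRyRz: sin(tx)sin(tz) - sin(ty)cos(tx)cos(tz).
 * The second term is a product of three factors, which is exactly the
 * pattern a 3-number multiply targets. */
static float r31(float tx, float ty, float tz)
{
    return sinf(tx) * sinf(tz) - fmul3(sinf(ty), cosf(tx), cosf(tz));
}

int main(void)
{
    printf("r31 = %f\n", r31(0.1f, 0.2f, 0.3f));
    return 0;
}
```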