# Assignment 2: RISC-V Toolchain

contributed by AgainTW

> [Code of this project](https://github.com/AgainTW/b16-SIMD/tree/main/SIMD/V4/linux)

## Why I chose this topic

In HW1, my initial goal was to create a reduced yet efficient backpropagation Multi-Layer Perceptron (MLP) that could run in RISC-V environments such as Ripes or rv32emu. Due to time constraints, however, I only got as far as implementing SIMD instructions for floating-point multiplication, with some progress recorded in [HW1](https://hackmd.io/@nfUUgsYRTGy81y5d9AYOyg/r1Oy_iLga). Therefore, in HW2 I chose to build on 廖泓博's "Matrix multiplication with floating point addition and multiplication." My aim was not only to port it to rv32emu, but also to implement the matrix multiplication with a SIMD-style layout.

> 廖泓博 uses bitwise techniques to implement the multiplication of 32-bit floating-point numbers.
> To keep the comparison baseline consistent in HW2, I did not change 廖泓博's core algorithm.
> I modified it into matrix multiplication in the bf16 format, computed in the high half-word. [Source code](https://github.com/kc71486/Computer-Architecture/blob/main/hw1/quizc_v4.c)

The original implementation of the quiz problem, in 32-bit float format, is as follows.
:::spoiler
```c
/* mmul() is the shift-add mantissa multiply (see m_mul further below). */
float fmul32(float a, float b)
{
    int32_t ia = *(int32_t *) &a;
    int32_t ib = *(int32_t *) &b;

    /* define sign */
    int32_t sr = (ia ^ ib) >> 31;

    /* define mantissa */
    int32_t ma = (ia & 0x7fffff) | 0x800000;
    int32_t mb = (ib & 0x7fffff) | 0x800000;
    int32_t mr;

    /* define exponent */
    int32_t ea = ((ia >> 23) & 0xff);
    int32_t eb = ((ib >> 23) & 0xff);
    int32_t er;

    /* define result */
    int32_t result;

    /* special values */
    if (ea == 0xff) {
        if (ma != 0x800000 || eb == 0) {
            int32_t f_nan = 0x7f800001;
            return *(float *) &f_nan;
        } else {
            int32_t f_inf = 0x7f800000 | sr << 31;
            return *(float *) &f_inf;
        }
    }
    if (eb == 0xff) {
        if (mb != 0x800000 || ea == 0) {
            int32_t f_nan = 0x7f800001;
            return *(float *) &f_nan;
        } else {
            int32_t f_inf = 0x7f800000 | sr << 31;
            return *(float *) &f_inf;
        }
    }
    if (ea == 0 || eb == 0) {
        int32_t f_zero = 0 | sr << 31;
        return *(float *) &f_zero;
    }

    /* multiplication */
    int32_t mrtmp = mmul(ma, mb);
    int32_t ertmp = ea + eb - 127;

    /* realign mantissa */
    int32_t mshift = (mrtmp >> 24) & 1;
    mr = mrtmp >> mshift;
    er = ertmp + mshift;

    /* overflow and underflow */
    if (er <= 0) {
        int32_t f_zero = 0 | sr << 31;
        return *(float *) &f_zero;
    }
    if (er >= 0xff) {
        /* note: the original wrote 0x7f800001 here, which is a NaN
         * pattern; infinity is 0x7f800000 */
        int32_t f_inf = 0x7f800000 | sr << 31;
        return *(float *) &f_inf;
    }

    /* result */
    result = (sr << 31) | ((er & 0xff) << 23) | (mr & 0x7fffff);
    return *(float *) &result;
}
```
:::

I modified it into an implementation in bf16 float format.
:::spoiler
```c
unsigned int fmul_b16(unsigned int a, unsigned int b)
{
    unsigned int MD1 = a;
    unsigned int MD2 = b;
    unsigned int h_flag = 0;

    /* define sign */
    int32_t s_MD1_h = MD1 >> 31;
    int32_t s_MD2_h = MD2 >> 31;
    int32_t sr_h;

    /* define mantissa */
    int32_t m_MD1_h = ((MD1 & 0x7F0000) | 0x800000) >> 16;
    int32_t m_MD2_h = ((MD2 & 0x7F0000) | 0x800000) >> 16;
    int32_t mr_h;

    /* define exponent */
    int32_t e_MD1_h = ((MD1 >> 23) & 0xFF);
    int32_t e_MD2_h = ((MD2 >> 23) & 0xFF);
    int32_t er_h;

    /* define result */
    unsigned int h_result;

    /* special values (high) */
    if (e_MD1_h == 0xFF && m_MD1_h != 0x80 && h_flag == 0) {
        h_result = 0x7F81; // f_nan_h
        h_flag = 1;
    }
    if (e_MD2_h == 0xFF && m_MD2_h != 0x80 && h_flag == 0) {
        h_result = 0x7F81; // f_nan_h
        h_flag = 1;
    }
    if (e_MD1_h == 0xFF && m_MD1_h == 0x80 && h_flag == 0) {
        if (e_MD2_h == 0) {
            h_result = 0x7F81; // f_nan_h
            h_flag = 1;
        } else {
            h_result = 0x7F80 | (s_MD1_h ^ s_MD2_h) << 15; // f_inf_h
            h_flag = 1;
        }
    }
    if (e_MD2_h == 0xFF && m_MD2_h == 0x80 && h_flag == 0) {
        if (e_MD1_h == 0) {
            h_result = 0x7F81; // f_nan_h
            h_flag = 1;
        } else {
            h_result = 0x7F80 | (s_MD1_h ^ s_MD2_h) << 15; // f_inf_h
            h_flag = 1;
        }
    }
    if ((e_MD1_h == 0 || e_MD2_h == 0) && h_flag == 0) {
        h_result = 0 | (s_MD1_h ^ s_MD2_h) << 15; // f_zero_h
        h_flag = 1;
    }

    /* calculate */
    if (h_flag == 0) {
        /* multiplication (high) */
        sr_h = s_MD1_h ^ s_MD2_h;
        int32_t mrtmp_h = m_mul(m_MD1_h, m_MD2_h);
        int32_t ertmp_h = e_MD1_h + e_MD2_h - 127;

        /* realign mantissa (high) */
        int32_t mshift_h = (mrtmp_h >> 8) & 1;
        mr_h = mrtmp_h >> mshift_h;
        er_h = ertmp_h + mshift_h;

        /* overflow and underflow (high) */
        if (er_h <= 0 && h_flag == 0) {
            h_result = 0 | (s_MD1_h ^ s_MD2_h) << 15; // f_zero_h
            h_flag = 1;
        }
        if (er_h >= 0xff && h_flag == 0) {
            h_result = 0x7F80 | (s_MD1_h ^ s_MD2_h) << 15; // f_inf_h
            h_flag = 1;
        }
    }

    /* result */
    if (h_flag == 0)
        h_result = (sr_h << 15) | ((er_h & 0xFF) << 7) | (mr_h & 0x7F);
    h_result = (h_result << 16);
    return h_result;
}
```
:::

# Performance

## Cycles
Here are the cycle counts at the different optimization levels.

* **mat_int32**: 廖泓博's code modified into a bf16 implementation.
* **b16_SIMD**: matrix multiplication computed on the high half-word and the low half-word together inside one function.
* **b16_SIMD_opt**: b16_SIMD with extra optimizations, for instance jumping out early when the high half-word and the low half-word hit special-value results at the same time.

| compile level | mat_int32 | b16_SIMD | b16_SIMD_opt |
| :--------------: | :-------: | :------: | :----------: |
| cycles on -O0 | 11565 | 11866 | 13751 |
| cycles on -O1 | 5604 | 6215 | 6955 |
| cycles on -O2 | 4980 | 5408 | 6813 |
| cycles on -O3 | 4468 | 5068 | 6173 |
| cycles on -Ofast | 4468 | 5068 | 6173 |

From these results, the SIMD-style layout is not very effective. One possible reason is that the high half-word and the low half-word of a SIMD instruction constrain each other. Suppose the high half-word needs only $3$ m_mul(SIMD) iterations while the low half-word needs the full $7$: under the SIMD layout, $7$ m_mul(SIMD) iterations are still required. Furthermore, one m_mul(SIMD) iteration in this program costs roughly 1.6 times one m_mul(SISD) iteration (per HW1), so SIMD effectively pays $7 \times 1.6 = 11.2$ m_mul(SISD) iterations, whereas plain SISD would pay only $3 + 7 = 10$.

A possible reason b16_SIMD falls further behind at higher optimization levels: I use too many branch instructions. To keep the instruction semantics correct, a large number of branches is needed to handle the exceptional cases, as in m_mul(SIMD).
```c
// b16_SIMD
int m_mul(int32_t a, int32_t b)
{
    int32_t r = 0;
    a = a << 1; /* to counter the last right shift */
    while (b != 0) {
        if ((b & 1) != 0)
            r = r + a;
        b = b >> 1;
        r = r >> 1;
    }
    return r;
}
```

```c
// b16_SIMD_opt
unsigned int m_mul(unsigned int a, unsigned int b)
{
    unsigned int r = 0;
    unsigned int a_h, a_l, b_h, b_l;
    a_h = (a & 0xFFFF0000) << 1; /* to counter the last right shift */
    a_l = (a & 0xFFFF) << 1;     /* to counter the last right shift */
    b_h = (b & 0xFFFF0000) >> 16;
    b_l = b & 0xFFFF;
    while (b_h != 0 || b_l != 0) {
        if ((b_h & 1) != 0)
            r = r + a_h;
        if ((b_l & 1) != 0)
            r = r + a_l;
        /* shift both halves; clear bit 15 so the high product's LSB
         * does not leak into the low half */
        if (b_h != 0 && b_l != 0)
            r = (r >> 1) & 0xFFFF7FFF;
        else if (b_h == 0)
            r = (r & 0xFFFF0000) | ((r & 0xFFFE) >> 1);
        else if (b_l == 0)
            r = (r & 0xFFFF) | ((r & 0xFFFE0000) >> 1);
        b_h = b_h >> 1;
        b_l = b_l >> 1;
    }
    return r;
}
```

## Consider real SIMD

In matrix multiplication, apart from accumulating the inner products of the vectors, the operations have no data-dependence issues, so the SIMD of real chips can be realized by adding hardware (at increased cost). Here I do not consider that this may lengthen the hardware's clock period; I only count how many cycles the operations take when the extended instructions are not supported, and assume each such operation takes $1$ cycle under SIMD.

### data level SIMD

Considering real data-level SIMD:

* At compile level -O0, the cycles spent on matrix element arithmetic drop from 11383 to 18. Using the data simulated previously, $13751 - 11383 + 18 = 2386$, so the overall cycle count falls by **82%**.
* At compile level -O3, they drop from 1756 to 18. Using the data simulated previously, $6173 - 1756 + 18 = 4435$, a reduction of **28%**.

> One of the reasons why multiplication takes 0 cycles under "cycles on -O3":
> > In the rv32 toolchain, the compiler's optimizer can recognize that the code performs matrix operations and apply matrix-oriented optimization techniques such as "blocked" or "column" traversal, so the compiled code does not always perform the calculations in the loop order the designer originally planned.

* i, j, k represent the position in the matrix where the operation occurs.
* The overall cycle cost increases because adding the getcycles function causes a large amount of function I/O during the matrix operations.

| i | j | k | cycles on -O0 | cycles on -O3 |
| :--: | :--: | :--: | :-----------: | :-----------: |
| 0 | 0 | 0 | 639 | 284 |
| 0 | 0 | 1 | 629 | 287 |
| 0 | 1 | 0 | 646 | 263 |
| 0 | 1 | 1 | 649 | 0 |
| 0 | 2 | 0 | 547 | 0 |
| 0 | 2 | 1 | 649 | 0 |
| 1 | 0 | 0 | 656 | 0 |
| 1 | 0 | 1 | 419 | 0 |
| 1 | 1 | 0 | 773 | 0 |
| 1 | 1 | 1 | 649 | 0 |
| 1 | 2 | 0 | 793 | 0 |
| 1 | 2 | 1 | 411 | 0 |
| 2 | 0 | 0 | 658 | 256 |
| 2 | 0 | 1 | 656 | 0 |
| 2 | 1 | 0 | 765 | 333 |
| 2 | 1 | 1 | 411 | 0 |
| 2 | 2 | 0 | 785 | 333 |
| 2 | 2 | 1 | 648 | 0 |
| | | avg | 632 | 98 |
| | | sum | 11383 | 1756 |
| | | total | 60575 | 51037 |

### vector level SIMD

* i, j represent the position in the matrix where the operation occurs.
| i | j | cycles on -O0 | cycles on -O3 |
| :--: | :--: | :------------: | :-----------: |
| 0 | 0 | 1504 | 666 |
| 0 | 1 | 1459 | 254 |
| 0 | 2 | 1604 | 0 |
| 1 | 0 | 1239 | 0 |
| 1 | 1 | 1585 | 0 |
| 1 | 2 | 1367 | 0 |
| 2 | 0 | 1478 | 240 |
| 2 | 1 | 1339 | 320 |
| 2 | 2 | 1596 | 320 |
| | avg | 1463 | 200 |
| | sum | 13171 | 1800 |
| | total | 34511 | 25757 |

### matrix level SIMD

| cycles | cycles on -O0 | cycles on -O3 |
| :----: | :------------: | :-----------: |
| matrix | 13675 | 6118 |
| total | 15992 | 8347 |

## Instruction Frequency Histogram

### mat_int32_O0

:::spoiler
```
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
 1. cmv    10.14% [1787 ] ██████████████████████
 2. clwsp   8.30% [1463 ] ██████████████████
 3. addi    7.54% [1329 ] ████████████████▎
 4. cli     6.88% [1213 ] ██████████████▉
 5. cswsp   6.50% [1145 ] ██████████████
 6. caddi   4.24% [747  ] █████████▏
 7. cj      3.86% [681  ] ████████▍
 8. jal     3.78% [666  ] ████████▏
 9. lw      3.38% [595  ] ███████▎
10. sw      3.17% [559  ] ██████▉
11. beq     2.80% [494  ] ██████
12. clw     2.28% [402  ] ████▉
13. cadd    2.18% [385  ] ████▋
14. bne     2.04% [360  ] ████▍
15. andi    2.03% [357  ] ████▍
16. cbeqz   1.84% [325  ] ████
17. cjr     1.80% [317  ] ███▉
18. csw     1.67% [295  ] ███▋
19. sub     1.49% [263  ] ███▏
20. bge     1.28% [225  ] ██▊
21. lbu     1.24% [218  ] ██▋
22. blt     1.21% [213  ] ██▌
23. auipc   1.17% [206  ] ██▌
24. slli    1.11% [195  ] ██▍
25. cbnez   1.11% [195  ] ██▍
26. or      1.10% [193  ] ██▍
27. cslli   1.04% [184  ] ██▎
28. srli    1.01% [178  ] ██▏
```
:::

### mat_int32_O1

:::spoiler
```
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
 1. cmv    10.27% [1787 ] ██████████████████████
 2. clwsp   8.41% [1463 ] ██████████████████
 3. addi    7.67% [1335 ] ████████████████▍
 4. cli     6.97% [1213 ] ██████████████▉
 5. cswsp   6.58% [1145 ] ██████████████
 6. caddi   4.29% [747  ] █████████▏
 7.
cj 3.91% [681 ] ████████▍ 8. jal 3.87% [673 ] ████████▎ 9. beq 2.95% [514 ] ██████▎ 10. sw 2.67% [465 ] █████▋ 11. lw 2.49% [433 ] █████▎ 12. clw 2.31% [402 ] ████▉ 13. cadd 2.21% [385 ] ████▋ 14. andi 2.06% [358 ] ████▍ 15. bne 2.01% [349 ] ████▎ 16. cbeqz 1.87% [325 ] ████ 17. cjr 1.82% [317 ] ███▉ 18. csw 1.70% [295 ] ███▋ 19. sub 1.50% [261 ] ███▏ 20. bge 1.29% [224 ] ██▊ 21. lbu 1.25% [218 ] ██▋ 22. blt 1.22% [213 ] ██▌ 23. auipc 1.18% [206 ] ██▌ 24. cbnez 1.12% [195 ] ██▍ 25. or 1.11% [193 ] ██▍ 26. slli 1.10% [192 ] ██▎ 27. cslli 1.06% [184 ] ██▎ 28. srli 1.04% [181 ] ██▏ ``` ::: ### mat_int32_O2 :::spoiler ``` +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. cmv 10.30% [1787 ] ██████████████████████ 2. clwsp 8.43% [1463 ] ██████████████████ 3. addi 7.56% [1312 ] ████████████████▏ 4. cli 6.99% [1213 ] ██████████████▉ 5. cswsp 6.60% [1145 ] ██████████████ 6. caddi 4.31% [747 ] █████████▏ 7. cj 3.93% [682 ] ████████▍ 8. jal 3.76% [653 ] ████████ 9. beq 2.92% [507 ] ██████▏ 10. sw 2.65% [460 ] █████▋ 11. lw 2.51% [436 ] █████▎ 12. clw 2.32% [402 ] ████▉ 13. cadd 2.22% [385 ] ████▋ 14. andi 2.09% [362 ] ████▍ 15. bne 2.02% [350 ] ████▎ 16. cbeqz 1.87% [325 ] ████ 17. cjr 1.83% [317 ] ███▉ 18. csw 1.70% [295 ] ███▋ 19. sub 1.51% [262 ] ███▏ 20. bge 1.30% [226 ] ██▊ 21. lbu 1.26% [218 ] ██▋ 22. blt 1.22% [212 ] ██▌ 23. auipc 1.19% [206 ] ██▌ 24. cbnez 1.12% [195 ] ██▍ 25. or 1.09% [189 ] ██▎ 26. slli 1.08% [188 ] ██▎ 27. cslli 1.06% [184 ] ██▎ 28. srli 1.04% [181 ] ██▏ ``` ::: ### mat_int32_O3 :::spoiler ``` +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. cmv 10.21% [1787 ] ██████████████████████ 2. clwsp 8.36% [1463 ] ██████████████████ 3. addi 7.66% [1340 ] ████████████████▍ 4. cli 6.93% [1213 ] ██████████████▉ 5. cswsp 6.55% [1145 ] ██████████████ 6. 
caddi 4.27% [747 ] █████████▏ 7. cj 3.89% [681 ] ████████▍ 8. jal 3.78% [662 ] ████████▏ 9. beq 2.93% [513 ] ██████▎ 10. sw 2.62% [459 ] █████▋ 11. lw 2.42% [424 ] █████▏ 12. clw 2.30% [402 ] ████▉ 13. cadd 2.20% [385 ] ████▋ 14. andi 2.12% [370 ] ████▌ 15. bne 2.04% [357 ] ████▍ 16. cbeqz 1.86% [325 ] ████ 17. cjr 1.81% [317 ] ███▉ 18. csw 1.69% [295 ] ███▋ 19. sub 1.55% [271 ] ███▎ 20. bge 1.30% [227 ] ██▊ 21. lbu 1.25% [218 ] ██▋ 22. blt 1.24% [217 ] ██▋ 23. auipc 1.18% [206 ] ██▌ 24. srli 1.17% [204 ] ██▌ 25. or 1.15% [202 ] ██▍ 26. cbnez 1.11% [195 ] ██▍ 27. slli 1.10% [192 ] ██▎ 28. cslli 1.05% [184 ] ██▎ ``` ::: ### mat_int32_Ofast :::spoiler ``` +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. cmv 10.21% [1787 ] ██████████████████████ 2. clwsp 8.36% [1463 ] ██████████████████ 3. addi 7.66% [1340 ] ████████████████▍ 4. cli 6.93% [1213 ] ██████████████▉ 5. cswsp 6.55% [1145 ] ██████████████ 6. caddi 4.27% [747 ] █████████▏ 7. cj 3.89% [681 ] ████████▍ 8. jal 3.78% [662 ] ████████▏ 9. beq 2.93% [513 ] ██████▎ 10. sw 2.62% [459 ] █████▋ 11. lw 2.42% [424 ] █████▏ 12. clw 2.30% [402 ] ████▉ 13. cadd 2.20% [385 ] ████▋ 14. andi 2.12% [370 ] ████▌ 15. bne 2.04% [357 ] ████▍ 16. cbeqz 1.86% [325 ] ████ 17. cjr 1.81% [317 ] ███▉ 18. csw 1.69% [295 ] ███▋ 19. sub 1.55% [271 ] ███▎ 20. bge 1.30% [227 ] ██▊ 21. lbu 1.25% [218 ] ██▋ 22. blt 1.24% [217 ] ██▋ 23. auipc 1.18% [206 ] ██▌ 24. srli 1.17% [204 ] ██▌ 25. or 1.15% [202 ] ██▍ 26. cbnez 1.11% [195 ] ██▍ 27. slli 1.10% [192 ] ██▎ 28. cslli 1.05% [184 ] ██▎ ``` ::: ### b16_SIMD_O0 :::spoiler ``` +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. cmv 9.97% [1787 ] ██████████████████████ 2. clwsp 8.16% [1463 ] ██████████████████ 3. addi 7.59% [1361 ] ████████████████▊ 4. cli 6.77% [1213 ] ██████████████▉ 5. 
cswsp 6.39% [1145 ] ██████████████ 6. caddi 4.17% [747 ] █████████▏ 7. lw 3.89% [697 ] ████████▌ 8. cj 3.80% [681 ] ████████▍ 9. jal 3.78% [677 ] ████████▎ 10. sw 3.41% [612 ] ███████▌ 11. beq 2.77% [497 ] ██████ 12. clw 2.24% [402 ] ████▉ 13. cadd 2.15% [385 ] ████▋ 14. bne 2.11% [379 ] ████▋ 15. andi 2.04% [366 ] ████▌ 16. cbeqz 1.81% [325 ] ████ 17. cjr 1.77% [317 ] ███▉ 18. csw 1.65% [295 ] ███▋ 19. sub 1.47% [264 ] ███▎ 20. bge 1.26% [226 ] ██▊ 21. lbu 1.22% [218 ] ██▋ 22. blt 1.20% [215 ] ██▋ 23. slli 1.19% [214 ] ██▋ 24. auipc 1.15% [206 ] ██▌ 25. or 1.10% [198 ] ██▍ 26. cbnez 1.09% [195 ] ██▍ 27. srli 1.04% [187 ] ██▎ 28. cslli 1.03% [184 ] ██▎ ``` ::: ### b16_SIMD_O1 :::spoiler ``` +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. cmv 10.13% [1788 ] ██████████████████████ 2. clwsp 8.29% [1463 ] ██████████████████ 3. addi 8.07% [1425 ] █████████████████▌ 4. cli 6.88% [1214 ] ██████████████▉ 5. cswsp 6.49% [1145 ] ██████████████ 6. caddi 4.23% [747 ] █████████▏ 7. jal 3.97% [701 ] ████████▋ 8. cj 3.86% [681 ] ████████▍ 9. beq 3.00% [530 ] ██████▌ 10. sw 2.72% [480 ] █████▉ 11. lw 2.59% [458 ] █████▋ 12. clw 2.28% [402 ] ████▉ 13. cadd 2.19% [386 ] ████▋ 14. andi 2.07% [366 ] ████▌ 15. bne 2.04% [361 ] ████▍ 16. cbeqz 1.85% [326 ] ████ 17. cjr 1.80% [318 ] ███▉ 18. csw 1.67% [295 ] ███▋ 19. sub 1.48% [261 ] ███▏ 20. bge 1.30% [229 ] ██▊ 21. lbu 1.23% [218 ] ██▋ 22. blt 1.21% [213 ] ██▌ 23. auipc 1.17% [206 ] ██▌ 24. slli 1.13% [199 ] ██▍ 25. cbnez 1.11% [196 ] ██▍ 26. or 1.10% [195 ] ██▍ 27. srli 1.05% [185 ] ██▎ 28. cslli 1.05% [185 ] ██▎ ``` ::: ### b16_SIMD_O2 :::spoiler ``` +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. cmv 10.16% [1788 ] ██████████████████████ 2. clwsp 8.31% [1463 ] ██████████████████ 3. addi 7.88% [1387 ] █████████████████ 4. 
cli 6.90% [1214 ] ██████████████▉ 5. cswsp 6.51% [1145 ] ██████████████ 6. caddi 4.24% [747 ] █████████▏ 7. cj 3.87% [681 ] ████████▍ 8. jal 3.85% [677 ] ████████▎ 9. beq 2.97% [523 ] ██████▍ 10. sw 2.67% [470 ] █████▊ 11. lw 2.59% [456 ] █████▌ 12. clw 2.28% [402 ] ████▉ 13. cadd 2.19% [386 ] ████▋ 14. andi 2.13% [375 ] ████▌ 15. bne 2.04% [359 ] ████▍ 16. cbeqz 1.85% [326 ] ████ 17. cjr 1.81% [318 ] ███▉ 18. csw 1.68% [295 ] ███▋ 19. sub 1.49% [262 ] ███▏ 20. bge 1.29% [227 ] ██▊ 21. lbu 1.24% [218 ] ██▋ 22. blt 1.22% [215 ] ██▋ 23. auipc 1.17% [206 ] ██▌ 24. slli 1.17% [206 ] ██▌ 25. cbnez 1.11% [196 ] ██▍ 26. or 1.11% [195 ] ██▍ 27. srli 1.07% [189 ] ██▎ 28. cslli 1.05% [185 ] ██▎ ``` ::: ### b16_SIMD_O3 :::spoiler ``` +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. cmv 10.15% [1788 ] ██████████████████████ 2. clwsp 8.31% [1463 ] ██████████████████ 3. addi 7.89% [1389 ] █████████████████ 4. cli 6.89% [1214 ] ██████████████▉ 5. cswsp 6.50% [1145 ] ██████████████ 6. caddi 4.24% [747 ] █████████▏ 7. cj 3.87% [681 ] ████████▍ 8. jal 3.85% [678 ] ████████▎ 9. beq 2.98% [524 ] ██████▍ 10. sw 2.64% [465 ] █████▋ 11. lw 2.48% [436 ] █████▎ 12. clw 2.28% [402 ] ████▉ 13. cadd 2.19% [386 ] ████▋ 14. andi 2.14% [376 ] ████▋ 15. bne 2.04% [360 ] ████▍ 16. cbeqz 1.85% [326 ] ████ 17. cjr 1.81% [318 ] ███▉ 18. csw 1.68% [295 ] ███▋ 19. sub 1.49% [263 ] ███▏ 20. bge 1.28% [226 ] ██▊ 21. lbu 1.24% [218 ] ██▋ 22. blt 1.22% [215 ] ██▋ 23. auipc 1.17% [206 ] ██▌ 24. slli 1.17% [206 ] ██▌ 25. or 1.14% [200 ] ██▍ 26. srli 1.12% [197 ] ██▍ 27. cbnez 1.11% [196 ] ██▍ 28. cslli 1.05% [185 ] ██▎ ``` ::: ### b16_SIMD_Ofast :::spoiler ``` +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. cmv 10.15% [1788 ] ██████████████████████ 2. clwsp 8.31% [1463 ] ██████████████████ 3. 
addi 7.89% [1389 ] █████████████████ 4. cli 6.89% [1214 ] ██████████████▉ 5. cswsp 6.50% [1145 ] ██████████████ 6. caddi 4.24% [747 ] █████████▏ 7. cj 3.87% [681 ] ████████▍ 8. jal 3.85% [678 ] ████████▎ 9. beq 2.98% [524 ] ██████▍ 10. sw 2.64% [465 ] █████▋ 11. lw 2.48% [436 ] █████▎ 12. clw 2.28% [402 ] ████▉ 13. cadd 2.19% [386 ] ████▋ 14. andi 2.14% [376 ] ████▋ 15. bne 2.04% [360 ] ████▍ 16. cbeqz 1.85% [326 ] ████ 17. cjr 1.81% [318 ] ███▉ 18. csw 1.68% [295 ] ███▋ 19. sub 1.49% [263 ] ███▏ 20. bge 1.28% [226 ] ██▊ 21. lbu 1.24% [218 ] ██▋ 22. blt 1.22% [215 ] ██▋ 23. auipc 1.17% [206 ] ██▌ 24. slli 1.17% [206 ] ██▌ 25. or 1.14% [200 ] ██▍ 26. srli 1.12% [197 ] ██▍ 27. cbnez 1.11% [196 ] ██▍ 28. cslli 1.05% [185 ] ██▎ ``` ::: ### b16_SIMD_opt_O0 :::spoiler ``` +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. cmv 9.85% [1787 ] ██████████████████████ 2. clwsp 8.06% [1463 ] ██████████████████ 3. addi 7.55% [1370 ] ████████████████▊ 4. cli 6.69% [1213 ] ██████████████▉ 5. cswsp 6.31% [1145 ] ██████████████ 6. lw 4.17% [757 ] █████████▎ 7. caddi 4.12% [747 ] █████████▏ 8. jal 3.93% [714 ] ████████▊ 9. cj 3.75% [681 ] ████████▍ 10. sw 3.26% [591 ] ███████▎ 11. beq 2.87% [520 ] ██████▍ 12. clw 2.22% [402 ] ████▉ 13. cadd 2.12% [385 ] ████▋ 14. bne 2.09% [379 ] ████▋ 15. andi 2.07% [376 ] ████▋ 16. cbeqz 1.79% [325 ] ████ 17. cjr 1.75% [317 ] ███▉ 18. csw 1.63% [295 ] ███▋ 19. sub 1.45% [264 ] ███▎ 20. or 1.25% [227 ] ██▊ 21. bge 1.25% [226 ] ██▊ 22. lbu 1.20% [218 ] ██▋ 23. blt 1.18% [215 ] ██▋ 24. auipc 1.14% [206 ] ██▌ 25. slli 1.12% [204 ] ██▌ 26. cbnez 1.07% [195 ] ██▍ 27. srli 1.04% [189 ] ██▎ 28. cslli 1.01% [184 ] ██▎ ``` ::: ### b16_SIMD_opt_O1 :::spoiler ``` +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. cmv 10.12% [1788 ] ██████████████████████ 2. 
clwsp 8.28% [1463 ] ██████████████████ 3. addi 7.77% [1373 ] ████████████████▉ 4. cli 6.87% [1214 ] ██████████████▉ 5. cswsp 6.48% [1145 ] ██████████████ 6. caddi 4.23% [747 ] █████████▏ 7. jal 3.96% [700 ] ████████▌ 8. cj 3.86% [681 ] ████████▍ 9. beq 2.93% [518 ] ██████▎ 10. sw 2.67% [472 ] █████▊ 11. lw 2.55% [451 ] █████▌ 12. clw 2.28% [402 ] ████▉ 13. cadd 2.19% [386 ] ████▋ 14. andi 2.13% [377 ] ████▋ 15. bne 2.01% [355 ] ████▎ 16. cbeqz 1.85% [326 ] ████ 17. cjr 1.80% [318 ] ███▉ 18. csw 1.67% [295 ] ███▋ 19. sub 1.49% [263 ] ███▏ 20. bge 1.30% [230 ] ██▊ 21. or 1.30% [229 ] ██▊ 22. lbu 1.23% [218 ] ██▋ 23. blt 1.20% [212 ] ██▌ 24. auipc 1.17% [206 ] ██▌ 25. cbnez 1.11% [196 ] ██▍ 26. slli 1.10% [195 ] ██▍ 27. srli 1.07% [189 ] ██▎ 28. cslli 1.05% [185 ] ██▎ ``` ::: ### b16_SIMD_opt_O2 :::spoiler ``` +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. cmv 10.08% [1788 ] ██████████████████████ 2. clwsp 8.25% [1463 ] ██████████████████ 3. addi 7.93% [1407 ] █████████████████▎ 4. cli 6.84% [1214 ] ██████████████▉ 5. cswsp 6.45% [1145 ] ██████████████ 6. caddi 4.21% [747 ] █████████▏ 7. jal 3.92% [696 ] ████████▌ 8. cj 3.84% [681 ] ████████▍ 9. beq 3.00% [532 ] ██████▌ 10. sw 2.68% [475 ] █████▊ 11. lw 2.62% [465 ] █████▋ 12. clw 2.27% [402 ] ████▉ 13. cadd 2.18% [386 ] ████▋ 14. andi 2.13% [378 ] ████▋ 15. bne 2.09% [371 ] ████▌ 16. cbeqz 1.84% [326 ] ████ 17. cjr 1.79% [318 ] ███▉ 18. csw 1.66% [295 ] ███▋ 19. sub 1.48% [262 ] ███▏ 20. bge 1.31% [233 ] ██▊ 21. or 1.25% [221 ] ██▋ 22. lbu 1.23% [218 ] ██▋ 23. blt 1.20% [213 ] ██▌ 24. auipc 1.16% [206 ] ██▌ 25. cbnez 1.10% [196 ] ██▍ 26. slli 1.10% [195 ] ██▍ 27. srli 1.09% [193 ] ██▎ 28. cslli 1.04% [185 ] ██▎ ``` ::: ### b16_SIMD_opt_O3 :::spoiler ``` +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. 
cmv 10.05% [1788 ] ██████████████████████ 2. clwsp 8.22% [1463 ] ██████████████████ 3. addi 8.03% [1429 ] █████████████████▌ 4. cli 6.82% [1214 ] ██████████████▉ 5. cswsp 6.44% [1145 ] ██████████████ 6. caddi 4.20% [747 ] █████████▏ 7. jal 3.91% [696 ] ████████▌ 8. cj 3.83% [681 ] ████████▍ 9. beq 2.95% [524 ] ██████▍ 10. sw 2.62% [466 ] █████▋ 11. lw 2.48% [441 ] █████▍ 12. clw 2.26% [402 ] ████▉ 13. cadd 2.17% [386 ] ████▋ 14. andi 2.15% [382 ] ████▋ 15. bne 2.07% [369 ] ████▌ 16. cbeqz 1.83% [326 ] ████ 17. cjr 1.79% [318 ] ███▉ 18. csw 1.66% [295 ] ███▋ 19. sub 1.48% [263 ] ███▏ 20. or 1.32% [235 ] ██▉ 21. bge 1.30% [232 ] ██▊ 22. lbu 1.23% [218 ] ██▋ 23. blt 1.20% [213 ] ██▌ 24. srli 1.17% [208 ] ██▌ 25. auipc 1.16% [206 ] ██▌ 26. slli 1.10% [196 ] ██▍ 27. cbnez 1.10% [196 ] ██▍ 28. cslli 1.04% [185 ] ██▎ ``` ::: ### b16_SIMD_opt_Ofast :::spoiler ``` +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. cmv 10.05% [1788 ] ██████████████████████ 2. clwsp 8.22% [1463 ] ██████████████████ 3. addi 8.03% [1429 ] █████████████████▌ 4. cli 6.82% [1214 ] ██████████████▉ 5. cswsp 6.44% [1145 ] ██████████████ 6. caddi 4.20% [747 ] █████████▏ 7. jal 3.91% [696 ] ████████▌ 8. cj 3.83% [681 ] ████████▍ 9. beq 2.95% [524 ] ██████▍ 10. sw 2.62% [466 ] █████▋ 11. lw 2.48% [441 ] █████▍ 12. clw 2.26% [402 ] ████▉ 13. cadd 2.17% [386 ] ████▋ 14. andi 2.15% [382 ] ████▋ 15. bne 2.07% [369 ] ████▌ 16. cbeqz 1.83% [326 ] ████ 17. cjr 1.79% [318 ] ███▉ 18. csw 1.66% [295 ] ███▋ 19. sub 1.48% [263 ] ███▏ 20. or 1.32% [235 ] ██▉ 21. bge 1.30% [232 ] ██▊ 22. lbu 1.23% [218 ] ██▋ 23. blt 1.20% [213 ] ██▌ 24. srli 1.17% [208 ] ██▌ 25. auipc 1.16% [206 ] ██▌ 26. slli 1.10% [196 ] ██▍ 27. cbnez 1.10% [196 ] ██▍ 28. 
28. cslli 1.04% [185 ] ██▎
```
:::

## Execute the code

:::spoiler
![](https://hackmd.io/_uploads/Sy2zYWVfT.png)
![](https://hackmd.io/_uploads/ByCXKb4GT.png)
![](https://hackmd.io/_uploads/BJs4KbVzT.png)
:::

## To overcome possible bugs

### VirtualBox

* Introduction
    * Oracle VM VirtualBox is a type-2 hypervisor for x86 virtualization developed by Oracle Corporation.
    * VirtualBox may be installed on Microsoft Windows, macOS, Linux, Solaris and OpenSolaris, and there are also ports to FreeBSD and Genode. It supports the creation and management of guest virtual machines running Windows, Linux, BSD, OS/2, Solaris, Haiku, and OSx86, as well as limited virtualization of macOS guests on Apple hardware.
* Why do I use VirtualBox?
    * It saves computer resources.
    * Precise performance measurements are not required here.
    * [QEMU cannot run correctly on Win11](https://answers.microsoft.com/en-us/windows/forum/all/qemu-not-working-on-windows-11-why/690883d3-c121-48c5-b37b-56cf30b74bb9)
* [Tutorial on installing VirtualBox and installing Ubuntu on it](https://hackmd.io/@SCIST/VirtualBox)

## Some problems when installing the RISC-V toolchain on **VirtualBox Ubuntu**

* VirtualBox Ubuntu cannot open Terminal
    * [Solution](https://blog.csdn.net/qq_33583069/article/details/129518845)
* Cannot use the `sudo` command
    * Cause: vboxuser is not in the sudoers file
    * [Solution](https://prathapreddy-mudium.medium.com/vboxuser-is-not-in-the-sudoers-file-this-incident-will-be-reported-enable-sudo-in-ubuntu-resolved-305e7988c6bc)
* Remember that every time you restart VirtualBox, you need to reset the environment variables of the RISC-V toolchain.

# Test on Windows

* I use the Dev-C++ IDE
    * Introducing Dev-C++: Dev-C++ is a free full-featured integrated development environment (IDE) distributed under the GNU General Public License for programming in C and C++.
* If you need to link multiple files, you must create a new **"project file"** instead of a new "source file".
* [How to create a new project file](https://www.cs.pu.edu.tw/~tsay/course/objprog/slides/newproj.html)
* Remember to put all files to be linked, together with the bridging .h files, into the same project.
* Dev-C++ will then generate the corresponding makefile to link the necessary files.

## .h file

* Just as Python can easily import a self-written function library, C can do this too. But...
    * It needs to be bridged through a .h file.
    * Common prologue: **#ifndef**, **#define**, **#endif**
    * [reference](https://huenlil.pixnet.net/blog/post/24339151)
    * [Function signature](https://medium.com/@alan81920/c-c-%E4%B8%AD%E7%9A%84-static-extern-%E7%9A%84%E8%AE%8A%E6%95%B8-9b42d000688f)

![image.png](https://hackmd.io/_uploads/SJvcn_I76.png)

* How to share the same **struct** between two different source files?
    * [Solution](https://stackoverflow.com/questions/3041797/how-to-use-a-defined-struct-from-another-source-file)
    * Shared **`struct`** definitions usually go in a .h file.
    * For instance, please refer to [my code](https://github.com/AgainTW/b16-SIMD/tree/main/SIMD/V4/linux/test%20in%20windows)

## Reference

### Purpose code

* [Assignment 1](https://hackmd.io/@sysprog/2023-arch-homework1)
* [Assignment 2](https://hackmd.io/@sysprog/2023-arch-homework2)
* [廖泓博 HW1](https://hackmd.io/@kc71486/computer_architecture_hw1)
* [廖泓博 HW2](https://hackmd.io/@kc71486/ca2023-hw2-redacted)

### Linux VM bug

* [vboxuser is not in the sudoers file](https://prathapreddy-mudium.medium.com/vboxuser-is-not-in-the-sudoers-file-this-incident-will-be-reported-enable-sudo-in-ubuntu-resolved-305e7988c6bc)
* [VirtualBox Ubuntu 22.10 can't open Terminal](https://blog.csdn.net/qq_33583069/article/details/129518845)
* [QEMU install (but I ended up using VirtualBox)](https://www.youtube.com/watch?v=TVF3SRIJSDA)

### Prepare GNU Toolchain for RISC-V

* [jserv](https://hackmd.io/@sysprog/rJAufgHYS)
* Every time you restart the virtual machine, you must reset the riscv-none-embed-gcc environment variable.
* [Install git on Ubuntu and clone the code](https://hackmd.io/@sam-liaw/BkzQ9zC0B)
* [sdl2-config not found error](https://stackoverflow.com/questions/54968758/sdl2-config-not-found-error-installing-pygame-sdl2)
* [riscv-none-embed-gcc compilation problem](https://blog.csdn.net/humphreyandkate/article/details/109641664)

### How to code a C "project"

* [static, extern variables in C/C++](https://medium.com/@alan81920/c-c-%E4%B8%AD%E7%9A%84-static-extern-%E7%9A%84%E8%AE%8A%E6%95%B8-9b42d000688f)
* [how to use #ifndef, #define, #endif, etc.](https://huenlil.pixnet.net/blog/post/24339151)
* [How to use a defined struct from another source file?](https://stackoverflow.com/questions/3041797/how-to-use-a-defined-struct-from-another-source-file)

### Calculation

* [uint32 to IEEE-754](https://www.h-schmidt.net/FloatConverter/IEEE754.html)
* [IEEE-754 to uint32](https://baseconvert.com/ieee-754-floating-point)
* [Optimize Options (Using the GNU Compiler Collection (GCC))](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html)
* [Optimizing Matrix Multiplication](https://coffeebeforearch.github.io/2020/06/23/mmul.html)