Assignment2: RISC-V Toolchain

--- title: Computer_Architecture_homework2 tags: Computer Architecture --- # Assignment2: RISC-V Toolchain Contributed by < [Sun970053](https://github.com/Sun970053) > ## Install riscv rv32ima toolchain Follow the instruction from [this note](https://hackmd.io/@sysprog/rJAufgHYS) Replace this line ```bash cd $HOME/riscv-none-elf-gcc echo "export PATH=`pwd`/bin:$PATH" > setenv ``` to this line ```bash echo "export PATH=`pwd`/bin:$PATH" >> .bashrc ``` then ```source``` the .bashrc file ```bash source .bashrc ``` validate installation succeed by executing ```bash riscv-none-elf-gcc -v ``` and the following result is printed out on the terminal. ```bash Using built-in specs. COLLECT_GCC=riscv-none-elf-gcc ... gcc version 13.2.0 (xPack GNU RISC-V Embedded GCC x86_64) ``` :::warning :warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text. :notes: jserv ::: ## Commonly used commands in riscv rv32ima toolchain Compile c code ```bash riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 hw2_origin.c -o hw2_origin.elf ``` Run the elf file in the path ```~/rv32emu``` ```bash address/build/rv32emu hw2.elf ``` Output ```bash After 5 iterations, the square root of 500 is 22.360680 After 4 iterations, the square root of 400 is 20.000000 After 5 iterations, the square root of 31321 is 176.977386 ... inferior exit code 0 ``` objdump: -d: Display the assembler mnemonics for the machine instructions ```bash! riscv-none-elf-objdump -d hw2_origin.elf ``` Expected output: ```bash! ... 221c8: 40b005b3 neg a1,a1 221cc: 00008293 mv t0,ra 221d0: f91ff0ef jal 22160 <__hidden___udivsi3> 221d4: 40a00533 neg a0,a0 221d8: 00028067 jr t0 000221dc <__modsi3>: 221dc: 00008293 mv t0,ra 221e0: 0005ca63 bltz a1,221f4 <__modsi3+0x18> 221e4: 00054c63 bltz a0,221fc <__modsi3+0x20> 221e8: f79ff0ef jal 22160 <__hidden___udivsi3> 221ec: 00058513 mv a0,a1 221f0: 00028067 jr t0 221f4: 40b005b3 neg a1,a1 221f8: fe0558e3 bgez a0,221e8 <__modsi3+0xc> 221fc: 40a00533 neg a0,a0 22200: f61ff0ef jal 22160 <__hidden___udivsi3> 22204: 40b00533 neg a0,a1 22208: 00028067 jr t0 ``` readelf: -h : Display the ELF file header ```bash! riscv-none-elf-readelf -h hw2_origin.elf ``` Excepted output: ```bash ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100d8 Start of program headers: 52 (bytes into file) Start of section headers: 94936 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` ## Pick a Question The following question is picked from the Assignment 1: > 高紹捷 To [find the square root of a integer number](https://hackmd.io/@F3cNkb4bSWKg00J7O-_y8w/rJvzACwgT) > Newton-Raphson is a numerical, iterative method. It has a quadratic convergence. A limitation of this method is that, the speed of convergence is dependent on the chosen initial value, so the choice of initial value is important. In this work, computation of square root is implemented using Newton-Raphson iterative method, with the initial value obtained using an algorithm based on CLZ. [Source code](https://github.com/kkkkk1109/Implement-square-root-using-CLZ) ### Algorithm According to the reference paper, [Square Root Computation Using CLZ](https://ieeexplore.ieee.org/document/8290400), we can show the following computational steps: \begin{gather*}x=\sqrt{2}\end{gather*} \begin{align*} &x^{2}=\mathrm{a}\\ &f(x)=x^{2}-a\\ &x_{n+1}=x_{n}-\frac{({x_{n}}^{2}-a)}{2x_{n}}\\ \end{align*} And then we simplify it: \begin{gather*}x_{n+1}=0.5(x_n+\dfrac{a}{x_n})\end{gather*} The original implementation of the questions is as follows. ```c #include <stdint.h> #include <stdio.h> #define iteration 5 int lz, lzc, i; float ans; int count_leading_zeros(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); /* count ones (population count) */ x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x7f)); } float uint_to_float(uint32_t u){ int c = count_leading_zeros(u); unsigned int exp=127+31-c; // This line is improved from while loop. // ex. c-(1+8-1)=c-8 u <<= (c-8); uint32_t flt= (exp << 23) | (u & 0x7FFFFF); return * (float *) &flt; } float division(float a, float b){ //change float into int int32_t ia = *(int32_t *)& a; int32_t ib = *(int32_t *)& b; //get the exponential int32_t expa = (ia >> 23) & 0xff; int32_t expb = (ib >> 23) & 0xff; // parameter used in rounded output // get the significand uint32_t siga= ((ia & 0x7fffff) | 0x800000); uint32_t sigb= ((ib & 0x7fffff) | 0x800000); // exponential output int32_t expout= expa - expb +127; // allign two numbers' significand // borrow number of digits bit from its exponent if(siga < sigb){ siga = siga << 1; expout--; } // r means result uint32_t r = 0; // start division for(int i = 0 ; i < 25 ; i++){ r= r << 1; if(siga >= sigb){ siga = siga - sigb; r = r | 1; } siga = siga << 1 ; } r=(r >> 1); int32_t sigout=r & 0x7fffff; int32_t out= ((expout & 0xff) << 23) | (sigout); return *(float *) &out; } float addition(float a,float b){ //change float into int int32_t ia = *(int32_t *)& a; int32_t ib = *(int32_t *)& b; //get the exponential int32_t expa = (ia >> 23) & 0xff ; int32_t expb = (ib >> 23) & 0xff; //get the significand uint32_t siga= ((ia & 0x7fffff) | 0x800000); uint32_t sigb= ((ib & 0x7fffff) | 0x800000); //exponential output int32_t expout=0; int dif= expa - expb; if(dif >= 0){ sigb = sigb >> dif; expout = expa; } else{ dif = -dif; siga = siga >> dif; expout = expb; } uint32_t sigout = siga + sigb; if(sigout >> 24 == 1){ sigout = sigout >>1; expout = expout + 1; } int32_t out=(0 & 1 << 31) | ((expout & 0xff) << 23) | ((sigout) & 0x7fffff); return *(float *) &out; } int main(){ uint32_t r=500; lz =count_leading_zeros(r); lzc = (32 - lz) / 2; ans=uint_to_float(r >> lzc); float t,c; for (i = 0 ; i < iteration ; i++){ t = division(r,ans); t = addition(t,ans); t = division(t,2); ans=t; if(ans == c){ break; } c = ans; } printf("After %d iterations, the square root of %d is %f",i+1,r,ans); return 0; } ``` * Firstly, I modified the main function to ensure that every input will result in a reasonable output. In the main function, I created an input array: $[500, 400, 31321, 102, 0]$. Among these numbers, there are some special inputs, such as ```0```, ```1``` and ```400```. If the input are ```0```, ```1``` and ```400```, I predict the ouput will be ```0```, ```1``` and ```20```. * The original code has a flaw. If I input the same number continuously, such as $[64, 64]$, the output will be: :::info After 2 iterations, the square root of 64 is 8.000000 After 1 iterations, the square root of 64 is 8.000000 ::: * The second line of output for the iteration will be 1. It didn't initialize the ```c``` variable in the iteration loop, so it used the incorrect variable. I replaced the location of ```c = ans[j];``` in line 10. * I noticed that the square root of ```64``` or ```1``` is computed in the initial estimation, but it exits the loop only after completing the second iteration in the original code. * Although the original code's author declared that the default input is a positive integer, excluding ```0```, I still want to address this issue. In this case, I need to deal with divide by 0, while the input value is ```0```. The code I improved in main: ```c int main(){ uint32_t r[6]={500, 400, 31321, 64, 1, 0}; float ans[6] = {}; float t,c; for(int j = 0; j < 6;j++){ lz =count_leading_zeros(r[j]); lzc = (32 - lz) / 2; ans[j]=uint_to_float(r[j] >> lzc); for (i = 0 ; i < iteration ; i++){ c = ans[j]; t = division(r[j],ans[j]); t = addition(t,ans[j]); t = division(t,2); ans[j]=t; if(ans[j] == c){ break; } } printf("After %d iterations, the square root of %d is %f\n",i+1,r[j],ans[j]); } return 0; } ``` Previous output: > After 5 iterations, the square root of 500 is 22.360680 > After 4 iterations, the square root of 400 is 20.000000 > After 5 iterations, the square root of 31321 is 176.977386 > After 2 iterations, the square root of 64 is 8.000000 > After 2 iterations, the square root of 1 is 1.000000 > After 6 iterations, the square root of 0 is 0.015625 > Program exited with code: 0 Current output: > After 5 iterations, the square root of 500 is 22.360680 > After 4 iterations, the square root of 400 is 20.000000 > After 5 iterations, the square root of 31321 is 176.977386 > After 1 iterations, the square root of 64 is 8.000000 > After 1 iterations, the square root of 1 is 1.000000 > After 6 iterations, the square root of 0 is 0.015625 > Program exited with code: 0 * I discovered something interesting when the input is ```0```, so I added ```if(u == 0) return 0;``` on line 2 * Also, I added ```if(ans[j] == 0) break;``` within the iteration for loop in the subsequent modification. ```c float uint_to_float(uint32_t u){ if(u == 0) return 0; int c = count_leading_zeros(u); unsigned int exp=127+31-c; // This line is improved from while loop. // ex. c-(1+8-1)=c-8 u <<= (c-8); uint32_t flt= (exp << 23) | (u & 0x7FFFFF); return * (float *) &flt; } ``` In the case where $r[1]=400$, I observed that there is an opportunity to reduce the calculation cycles in the shift_sub loop. If the ```siga``` variable is equal to zero after the if-statement, it's not necessary to execute the loop because it's only possible to set the zero bits in the subsequent processing. ```c float division(float a,float b){ ... for(int i = 0 ; i < 25 ; i++){ ... // If the remainder is zero, jump out the loop // And then set all the remaining bits to zero if(siga == 0){ r = r << (24-i); break; } ... } ... } ``` I think I can combine it with [my assignment 1 clz algorithm](https://hackmd.io/VxdSGcBHQHSLw3oIrp0_zA?view#Harley%E2%80%99s-algorithm) to achieve the greater performance in this case, so I try to replace the original ```uint16_t count_leading_zeros(uint32_t x);``` with the following code: ```c uint16_t count_leading_zeros(uint32_t x) { static uint8_t const Table[] = { 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF, 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF, 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF, 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF, 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF, 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13, 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25, 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF }; /* Propagate leftmost 1-bit to the right */ x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); /* x = x * 0x6EB14F9 */ x = (x << 3) - x; /* Multiply by 7. */ x = (x << 8) - x; /* Multiply by 255. */ x = (x << 8) - x; /* Again. */ x = (x << 8) - x; /* Again. */ return 31-Table[x >> 26]; } ``` After the modification, the cycles in this program have obviously reduced according to the execution information. Meanwhile, the output has changed to: Output: > After 5 iterations, the square root of 500 is 22.360680 > After 4 iterations, the square root of 400 is 20.000000 > After 5 iterations, the square root of 31321 is 176.977386 > After 1 iterations, the square root of 64 is 8.000000 > After 1 iterations, the square root of 1 is 1.000000 > After 1 iterations, the square root of 0 is 0.000000 > Program exited with code: 0 #### The original hand written Assembly The hand written assembly looks like this ```c .data intput: .word 500, 400, 31321, 64, 1, 0 str1: .string "After " str2: .string " iterations, the square root of " str3: .string " is " str4: .string "\n" .text # s0 = integer(input) # s1 = float(input) # s3 = test data address # s4 = test dataset len # s5 = j iterator # a5 = square root of input main: la s3, intput # get test data address li s4, 6 # load test dataset len li s5, 0 # j iterator = 0 loop: lw s0, 0(s3) # r[j] = test data j; mv a1, s0 # utf(a1) jal ra, utf # goto utf mv s1, a1 # store float(input) to s1 jal ra, clz # jump to clz and get a0 =clz(r) addi t0, x0, 32 # t0 = 32 sub t0, t0, a0 # 32 -lz srli t0, t0, 1 # t0 = lzc =(32 - lz)/2 srl a1, s0, t0 # a1 = r >> lzc jal ra, utf # jump to utf and get ans (a1 = utf(a1)) mv a0, x0 # i =0 addi a2, x0, 5 # iteration times in 5 newton: bge a0, a2, print mv a3, s1 # dividend = s0 mv a4, a1 # division = a1 jal ra, division # go to division (r / ans) mv a6, a5 # a6 = t mv a3, a6 # a3 = a6 mv a4, a1 # a4 = ans jal ra, addition # go to addition (t + ans) addi t0, x0, 1 # t0 = 1 slli t0, t0, 23 # t0 = << 1 sub a5, a5, t0 # a5 - t0 (exp-1) do the divide 2 mv a1, a5 # ans = t beq s2, a1 print # if last est == now est mv s2, a1 # c = ans addi a0, a0, 1 # i++ beq x0, x0 newton print: addi t0, a0, 1 la a0, str1 # after li a7, 4 ecall mv a0, t0 # i li a7, 1 ecall la a0, str2 # iteration, the square root of : li a7, 4 ecall mv a0, s0 # input li a7, 1 ecall la a0, str3 # iteration, the square root of : li a7, 4 ecall mv a0, a5 li a7, 2 ecall la a0, str4 # \n li a7, 4 ecall addi s3, s3, 4 # r[j] address + 4 addi s5, s5, 1 # j++ blt s5, s4, loop end: li a7, 10 ecall clz: mv t0, s0 # t0 = x mv t1, t0 # t1 = t0 srli t0, t0, 1 # x >> 1 or t0, t0, t1 # x = x | x >> 1 srli t1, t0, 2 # x >> 2 or t0, t0, t1 # x = x | x >> 2 srli t1, t0, 4 # x >> 4 or t0, t0, t1 # x = x | x >> 4 srli t1, t0, 8 # x >> 8 or t0, t0, t1 # x = x | x >> 8 # count ones (population count) srli t1, t0, 1 # x >> 1 li t6, 0x55555555 # load mask 0x55555555 and t1, t1, t6 # (x >> 1) & 0x55555555 sub t0, t0, t1 # x -= ((x >> 1) & 0x55555555) srli t1, t0, 2 # x >> 2 li t6, 0x33333333 # load mask 0x33333333 and t1, t1, t6 # (x >> 2) & 0x33333333 and t0, t0, t6 # x & 0x33333333 add t0, t0, t1 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) srli t1, t0, 4 # x >> 4 add t0, t0, t1 # x + (x >> 4) li t6, 0x0f0f0f0f # load mask 0x0f0f0f0f and t0, t0, t6 # x = ((x >> 4) + x) & 0x0f0f0f0f srli t1, t0, 8 # x >> 8 add t0, t1, t0 # x + (x >> 8) srli t1, t0, 16 # x >> 16 add t0, t1, t0 # x + (x >> 16) andi t0, t0, 0x7f # x & 0x7f addi t1, x0, 32 # t1 = 32 sub a0, t1, t0 # return (32 - (x & 0x7f)) jr ra utf: addi sp, sp, -12 # make space on stack sw ra, 8(sp) # save ra sw s0, 4(sp) # save s0 sw a0, 0(sp) # save a0 mv s0, a1 jal ra, clz mv t0, a1 # t0 = a1 = u ,a0 = clz(u) addi t1, x0, 158 # t1 = 127 + 31 sub t1, t1, a0 # exp = 158 - clz(u) addi a0, a0,-8 # c-8 sll t0, t0, a0 # u << (c - 8) outloop: slli t1, t1, 23 # exp << 23 li t6, 0x7fffff # load mask 0x7fffff and t0, t0, t6 # u & 0x7fffff or a1, t1, t0 # a1= (exp << 23) | (u & 0x7FFFFF); lw ra, 8(sp) # get return adress lw s0, 4(sp) # get s0 lw a0, 0(sp) # get a0 jr ra # back to main #/////////////////division #a3= dividend #a4= divisor #a5= quotient #/////////////// #t0 = expout; #t1 = siga #t2 = sigb #t3 = i #t4 = 25 division: srli t0, a3, 23 # ia >> 23 andi t0, t0, 0xff # (ia >> 23) & 0xff expa srli t1, a4, 23 # ib >> 23 andi t1, t1, 0xff # (ib >> 23) & 0xff expb sub t0, t0, t1 # (expa - expb) addi t0, t0, 127 # expout = expa -expb +127 li t6, 0x7fffff # load mask 0x7fffff and t1, a3, t6 # ia & 0x7fffff siga and t2, a4, t6 # ib & 0x7fffff sigb li t6, 0x800000 # load mask 0x7fffff or t1, t1, t6 # siga = (ia & 0x7fffff ) | 0x800000 or t2, t2, t6 # sigb = (ib & 0x7fffff ) | 0x800000 div1: bge t1, t2, div2 # siga > sigb go to div2 slli t1, t1, 1 # siga = siga << 1 addi t0, t0, -1 # expout-- mv a5, x0 # r = 0 beq x0, x0, div1 # back to div1 div2: mv t3, x0 # i = 0 addi t4, x0, 25 # t4 = 25 shift_sub: bge t3, t4, div3 # if i >= 25 go to div3 slli a5, a5, 1 # r = r << 1 bltu t1, t2, next # if (siga < sigb) go to next sub t1, t1, t2 # siga = siga - sigb ori a5, a5, 1 # r = r | 1 next: slli t1, t1, 1 # siga = siga << 1 addi t3, t3, 1 # i = i + 1 beq x0, x0, shift_sub# back to shift_sub div3: srli a5, a5, 1 # r = r >> 1 li t6, 0x7fffff # load mask 0x7fffff and a5, a5, t6 # r & 0x7fffff andi t0, t0, 0xff # expout & 0xff slli t0, t0, 23 # expout = expout << 23 or a5, a5, t0 # out= ((expout & 0xff) << 23) | (sigout) jr ra # back to main #///////////////////addition #a3= a #a4= b #a5= a+b #/////////////// #t0 = expa / expout; #t1 = expb / sigout #t2 = dif #t3 = siga #t4 = sigb addition: srli t0, a3, 23 # ia >> 23 andi t0, t0, 0xff # expa = (ia >> 23) & 0xff srli t1, a4, 23 # ib >> 23 andi t1, t1, 0xff # expb = (ib >> 23) & 0xff sub t2, t0, t1 # dif = expa - expb li t6, 0x7fffff # load mask 0x7fffff and t3, a3, t6 # ia & 0x7fffff and t4, a4, t6 # ib & 0x7fffff li t6, 0x800000 # load mask 0x800000 or t3, t3, t6 # siga = ((ia & 0x7fffff) | 0x800000) or t4, t4, t6 # sigb = ((ib & 0x7fffff) | 0x800000) blt t2, x0, out_b # if dif < 0 , expout = expb, else expout=expa srl t4, t4, t2 # sigb = sigb >> dif beq x0, x0, plus # go to plus out_b: sub t2, x0, t2 # dif= - dif change dif to positive srl t3, t3, t2 # siga = siga >> dif mv t0, t1 # expout = expb plus: add t1, t3, t4 # sigout= siga + sigb srli t2, t1, 24 # sigout >> 24 addi t3, x0, 1 # t3 = 1 bne t2, t3, add_out # if (sigout >> 24) != 1 end addition srli t1, t1, 1 # sigout = sigout >> 1 addi t0, t0, 1 # expout ++ add_out: andi t0, t0, 0xff # expout & 0xff slli t0, t0, 23 # (expout & 0xff) << 23 li t6, 0x7fffff # load mask 0x7fffff and t1, t1, t6 # sigout& 0x7fffff or a5, t0, t1 # out = ((expout & 0xff) <<23) | ((sigout) & 0x7fffff) jr ra ``` The assembly codes I replaced are with optimized versions that improve performance, including byte array of harley hash table, count leading zero function and the if-statement in sub_shift. ```diff! .data intput: .word 500, 400, 31321, 64, 1, 0 +# hash map +table: + .byte 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF + .byte 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF + .byte 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF + .byte 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF + .byte 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF + .byte 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13 + .byte 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25 + .byte 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF str1: .string "After " str2: .string " iterations, the square root of " str3: .string " is " str4: .string "\n" ``` ```diff! clz: mv t0, s0 # t0 = x mv t1, t0 # t1 = t0 srli t0, t0, 1 # x >> 1 or t0, t0, t1 # x = x | x >> 1 srli t1, t0, 2 # x >> 2 or t0, t0, t1 # x = x | x >> 2 srli t1, t0, 4 # x >> 4 or t0, t0, t1 # x = x | x >> 4 srli t1, t0, 8 # x >> 8 or t0, t0, t1 # x = x | x >> 8 srli t1, t0, 16 # x >> 16 or t0, t0, t1 # x = x | x >> 16 # count ones (population count) - srli t1, t0, 1 # x >> 1 - li t6, 0x55555555 # load mask 0x55555555 - and t1, t1, t6 # (x >> 1) & 0x55555555 - sub t0, t0, t1 # x -= ((x >> 1) & 0x55555555) - srli t1, t0, 2 # x >> 2 - li t6, 0x33333333 # load mask 0x33333333 - and t1, t1, t6 # (x >> 2) & 0x33333333 - and t0, t0, t6 # x & 0x33333333 - add t0, t0, t1 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) - srli t1, t0, 4 # x >> 4 - add t0, t0, t1 # x + (x >> 4) - li t6, 0x0f0f0f0f # load mask 0x0f0f0f0f - and t0, t0, t6 # x = ((x >> 4) + x) & 0x0f0f0f0f - srli t1, t0, 8 # x >> 8 - add t0, t1, t0 # x + (x >> 8) - srli t1, t0, 16 # x >> 16 - add t0, t1, t0 # x + (x >> 16) - andi t0, t0, 0x7f # x & 0x7f - addi t1, x0, 32 # t1 = 32 - sub a0, t1, t0 # return (32 - (x & 0x7f)) + slli t1, t0, 3 + sub t0, t1, t0 # x = (x << 3) - x + slli t1, t0, 8 + sub t0, t1, t0 # x = (x << 8) - x + slli t1, t0, 8 + sub t0, t1, t0 # x = (x << 8) - x + slli t1, t0, 8 + sub t0, t1, t0 # x = (x << 8) - x + srli t0, t0, 26 # x >> 26 + la t1, table + add t1, t1, t0 + lbu t0, 0(t1) # Table[x >> 26] + li t1, 31 + sub a0, t1, t0 # 31 - Table[x >> 26] jr ra ``` ```diff! shift_sub: bge t3, t4, div3 # if i >= 25 go to div3 slli a5, a5, 1 # r = r << 1 bltu t1, t2, next # if (siga < sigb) go to next sub t1, t1, t2 # siga = siga - sigb ori a5, a5, 1 # r = r | 1 + bne t1, zero, next # check if the reminder is zero + li t6, 24 # load integer 24 + sub t6, t6, t3 # 24-i + sll a5, a5, t6 # r = r << (24-i) + jal zero, div3 # break, jump to div3 ``` ### Compiling using RISC-V gcc Commands about RISC-V gcc compiler: * Generate assembly file from .c file ```bash! riscv-none-elf-gcc -S hw2_origin.c -O0 -o hw2_origin_O0.s -march=rv32i -mabi=ilp32 ``` * Generate .o file from .S file ```bash! riscv-none-elf-as -R -march=rv32i -mabi=ilp32 -o hw2_origin_O0.o hw2_origin_O0.s ``` * Combine the assembly with the linking info, create program in executable and linkable format (.elf) ```bash! riscv-none-elf-ld -o hw2_origin_O0.elf -T hw2_origin_O0.ld --oformat=elf32-littleriscv hw2_origin_O0.o ``` Retrieve the ELF file information using the following command in the terminal: ```bash! riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O'x' hw2_origin.c -o hw2_origin.elf ``` , where ```'x'``` represents the optimization level, which can be set to 0, 1, 2, and so on. To obtain the section sizes and the total size of the object file, I can use this command to display its argument list: ```bash! riscv-none-elf-size hw2_origin.elf ``` Expected output: ```bash text data bss dec hex filename 77700 2320 1544 81564 13e9c hw2_origin.elf ``` I compile the original c file, ```hw2_origin.c``` with different optimization level, the result is the following: | Level | O0 | O1 | O2 | O3 | Ofast | | ----- | ----- | ----- | ----- | ----- | ----- | | text | 77700 | 76792 | 76816 | 77456 | 77436 | | data | 2320 | 2324 | 2324 | 2320 | 2320 | | bss | 1544 | 1544 | 1544 | 1544 | 1544 | | dec | 81564 | 80660 | 80684 | 81320 | 81300 | | hex | 13e9c | 13b14 | 13b2c | 13da8 | 13d94 | I compile the improved c file, ```hw2_imporved.c``` with different optimization level, the result is the following: | Level | O0 | O1 | O2 | O3 | Ofast | | ----- | ----- | ----- | ----- | ----- | ----- | | text | 77792 | 76848 | 76884 | 77492 | 77472 | | data | 2320 | 2324 | 2324 | 2320 | 2320 | | bss | 1544 | 1544 | 1544 | 1544 | 1544 | | dec | 81656 | 80716 | 80752 | 81356 | 81336 | | hex | 13ef8 | 13b4c | 13b70 | 13dcc | 13db8 | ### Evaluate computation cycles in C file Use the code in the [perfcounter](https://github.com/sysprog21/rv32emu/tree/master/tests/perfcounter) to obtain the execution clock cycle of the program. This code utilizes the CSR register. In other words, In other words, I am only measuring the clock cycles taken by the ```clz``` and ```sqrt``` functions. Write a makefile for c file, taking ```hw2_origin.c``` for example: ``` .PHONY: clean include ../../../rv32emu/mk/toolchain.mk CFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32 -O0 -Wall OBJS = \ getcycles.o \ getinstret.o \ sparkle.o \ hw2_origin.o BIN = hw2_origin.elf %.o: %.S $(CROSS_COMPILE)gcc $(CFLAGS) -c -o $@ $< %.o: %.c $(CROSS_COMPILE)gcc $(CFLAGS) -c -o $@ $< all: $(BIN) $(BIN): $(OBJS) $(CROSS_COMPILE)gcc -o $@ $^ clean: $(RM) $(BIN) $(OBJS) ``` #### hw2_origin.c Output on terminal: ```bash! After 5 iterations, the square root of 500 is 22.360680 Cycle count: 6411 Instructions retired: 6401 After 4 iterations, the square root of 400 is 20.000000 Cycle count: 4859 Instructions retired: 4859 After 5 iterations, the square root of 31321 is 176.977386 Cycle count: 6563 Instructions retired: 6563 After 2 iterations, the square root of 64 is 8.000000 Cycle count: 2520 Instructions retired: 2520 After 2 iterations, the square root of 1 is 1.000000 Cycle count: 2520 Instructions retired: 2520 After 6 iterations, the square root of 0 is 0.015625 Cycle count: 5832 Instructions retired: 5832 inferior exit code 0 ``` #### hw2_improved.c Output on terminal: ```bash After 5 iterations, the square root of 500 is 22.360680 Cycle count: 6885 Instructions retired: 6875 After 4 iterations, the square root of 400 is 20.000000 Cycle count: 3456 Instructions retired: 3456 After 5 iterations, the square root of 31321 is 176.977386 Cycle count: 6903 Instructions retired: 6903 After 1 iterations, the square root of 64 is 8.000000 Cycle count: 744 Instructions retired: 744 After 1 iterations, the square root of 1 is 1.000000 Cycle count: 744 Instructions retired: 744 After 1 iterations, the square root of 0 is 0.000000 Cycle count: 289 Instructions retired: 289 inferior exit code 0 ``` #### hw2_improved_harley.c Output on terminal: ```bash! After 5 iterations, the square root of 500 is 22.360680 Cycle count: 6865 Instructions retired: 6855 After 4 iterations, the square root of 400 is 20.000000 Cycle count: 3436 Instructions retired: 3436 After 5 iterations, the square root of 31321 is 176.977386 Cycle count: 6883 Instructions retired: 6883 After 1 iterations, the square root of 64 is 8.000000 Cycle count: 724 Instructions retired: 724 After 1 iterations, the square root of 1 is 1.000000 Cycle count: 724 Instructions retired: 724 After 1 iterations, the square root of 0 is 0.000000 Cycle count: 227 Instructions retired: 227 inferior exit code 0 ``` #### CSR I arranged the following sheet based on the results printed in the terminal. We can observe that ```hw2_origin.c``` performs the worst, excluding the input value ```500``` and ```31321```. When comparing ```hw2_origin.c``` to ```hw2_improved.c```, the latter handles the other input values quite well. This improvement is due to the if-statement in ```div_shift```, which noticeably reduces CSR cycles. However, the disadvantage of this modification is an increase in the cycle count during each iteration. ```hw2_improved_harley.c``` is an advanced improvement over ```hw2_improved.c```, which implements the Harley hash table that covers the ```clz``` function. This sheet represents the O0, O1 and Ofast optimized C code. **O0 Optimization** | file name | ```500``` | ```400``` | ```31321``` | ```64``` | ```1``` | ```0``` | | --------------------- | ---- | ---- | ----- | ---- | ---- | ---- | | hw2_origin.c | ==6411== | 4859 | ==6563== | 2520 | 2520 | 5832 | | hw2_improved.c | 6885 | 3456 | 6903 | 744 | 744 | 289 | | hw2_improved_harley.c | 6865 | ==3436== | 6883 | ==724== | ==724== | ==227== | **O1 Optimization** | file name | ```500``` | ```400``` | ```31321``` | ```64``` | ```1``` | ```0``` | | --------------------- | ---- | ---- | ----- | ---- | ---- | ---- | | hw2_origin.c | ==2436== | 1837 | ==2565== | 1032 | 1032 | 2186 | | hw2_improved.c | 2740 | 1454 | 2824 | 417 | 416 | 209 | | hw2_improved_harley.c | 2731 | ==1444== | 2814 | ==407== | ==406== | ==178== | **Ofast Optimization** | file name | ```500``` | ```400``` | ```31321``` | ```64``` | ```1``` | ```0``` | | --------------------- | ---- | ---- | ----- | ---- | ---- | ---- | | hw2_origin.c | ==2080== | 1641 | ==2197== | 984 | 984 | 2092 | | hw2_improved.c | 2249 | 1105 | 2324 | 354 | 354 | 220 | | hw2_improved_harley.c | 2243 | ==1099== | 2318 | ==348== | ==348== | ==187== | ### Write a Faster / Smaller Assembly program **hw2_origin.S** In ```hw2_origin.S```, I did some modification to print out the cycle count numbers and output values : ```diff! print: ... - la a0, str1 # after - li a7, 4 + li a0, 1 + la a1, str1 # "after " + la a2, str1_size + li a7, SYSWRITE ecall ... - mv a0, s0 # input - li a7, 1 + li a0, 1 + mv a1, s0 # input + li a7, PRINT_INT ecall ... - mv a0, a1 - li a7, 2 + li a0, 1 + mv a1, s2 # sqrt + li a7, PRINT_HEX ecall ... end: - li a7, 10 + mv ra ,x0 + li a7, SYSEXIT ecall ... + get_cycles: + csrr a1, cycleh + csrr a0, cycle + csrr a2, cycleh + bne a1, a2, get_cycles + ret ``` To use the cssr instruction, we need to modify the makefile: ```diff! - ASFLAGS = -march=rv32i -mabi=ilp32 + ASFLAGS = -march=rv32i_zicsr -mabi=ilp32 -mabi=ilp32 ``` Since ```fprintf``` has some problem in printing floating point, I replace it by the number in hex format. After checking, the output values are correct. Output in terminal: ```bash! After 1363 iterations, the square root of 500 is 41b2e2ac Cycle count: 1363 After 1095 iterations, the square root of 400 is 41a00000 Cycle count: 1095 After 1374 iterations, the square root of 31321 is 4330fa36 Cycle count: 1374 After 603 iterations, the square root of 64 is 41000000 Cycle count: 603 After 603 iterations, the square root of 1 is 3f800000 Cycle count: 603 After 1372 iterations, the square root of 0 is 3f3504f3 Cycle count: 1372 inferior exit code 2 ``` **hw2_improved.S** ```bash! After 1408 iterations, the square root of 500 is 41b2e2ac Cycle count: 1408 After 839 iterations, the square root of 400 is 41a00000 Cycle count: 839 After 1423 iterations, the square root of 31321 is 4330fa36 Cycle count: 1423 After 225 iterations, the square root of 64 is 41000000 Cycle count: 225 After 225 iterations, the square root of 1 is 3f800000 Cycle count: 225 After 70 iterations, the square root of 0 is 00000000 Cycle count: 70 inferior exit code 2 ``` | info | hw2_improved.S | hw2_origin.S | | ---- | -------------- |:------------ | | text | 1036 | 968 | | data | 0 | 0 | | bss | 0 | 0 | | dec | 1036 | 968 | | hex | 40c | 3c8 | | file name | ```500``` | ```400``` | ```31321``` | ```64``` | ```1``` | ```0``` | |:-------------- |:-------- | ------- | -------- | ------- |:------- |:------ | | hw2_origin.S | ==1363== | 1095 | ==1374== | 603 | 603 | 1372 | | hw2_improved.S | 1408 | ==839== | 1423 | ==225== | ==225== | ==70== | ## Compare computing cycles between Ripes and RV32emu According to these figures, there are two set of information in Ripe. Example in ```hw2_improved.S``` performs better than example in ```hw2_origin.S``` . It's worth noting that both sets of values are derived from the results of running the entire test input suite. **hw2_origin.S** ![](https://hackmd.io/_uploads/SkUOhmpza.png) **hw2_improved.S** ![](https://hackmd.io/_uploads/Skcus7aMp.png) ## Estimating a histogram in the rv32emu toolchain Look into the instructions histogram generated by `rv_histogram` ## Reference * [rv32emu](https://github.com/sysprog21/rv32emu/tree/master) * [Why to Learn C programming?](https://www.tutorialspoint.com/cprogramming/index.htm) * [Assignment1: RISC-V Assembly and Instruction Pipeline-高紹捷](https://hackmd.io/@F3cNkb4bSWKg00J7O-_y8w/rJvzACwgT?utm_source=preview-mode&utm_medium=rec) * [Power of CLZ instruction in numerical computations](https://ieeexplore.ieee.org/document/8290400) :::warning :warning: Refer to the official **C standard** documentation for accurate information, rather than relying on web pages that may not be considered authoritative sources. :notes: jserv :::

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.