黃定山
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    --- title: Computer_Architecture_homework2 tags: Computer Architecture --- # Assignment2: RISC-V Toolchain Contributed by < [Sun970053](https://github.com/Sun970053) > ## Install riscv rv32ima toolchain Follow the instruction from [this note](https://hackmd.io/@sysprog/rJAufgHYS) Replace this line ```bash cd $HOME/riscv-none-elf-gcc echo "export PATH=`pwd`/bin:$PATH" > setenv ``` to this line ```bash echo "export PATH=`pwd`/bin:$PATH" >> .bashrc ``` then ```source``` the .bashrc file ```bash source .bashrc ``` validate installation succeed by executing ```bash riscv-none-elf-gcc -v ``` and the following result is printed out on the terminal. ```bash Using built-in specs. COLLECT_GCC=riscv-none-elf-gcc ... gcc version 13.2.0 (xPack GNU RISC-V Embedded GCC x86_64) ``` :::warning :warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text. :notes: jserv ::: ## Commonly used commands in riscv rv32ima toolchain Compile c code ```bash riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 hw2_origin.c -o hw2_origin.elf ``` Run the elf file in the path ```~/rv32emu``` ```bash address/build/rv32emu hw2.elf ``` Output ```bash After 5 iterations, the square root of 500 is 22.360680 After 4 iterations, the square root of 400 is 20.000000 After 5 iterations, the square root of 31321 is 176.977386 ... inferior exit code 0 ``` objdump: -d: Display the assembler mnemonics for the machine instructions ```bash! riscv-none-elf-objdump -d hw2_origin.elf ``` Expected output: ```bash! ... 221c8: 40b005b3 neg a1,a1 221cc: 00008293 mv t0,ra 221d0: f91ff0ef jal 22160 <__hidden___udivsi3> 221d4: 40a00533 neg a0,a0 221d8: 00028067 jr t0 000221dc <__modsi3>: 221dc: 00008293 mv t0,ra 221e0: 0005ca63 bltz a1,221f4 <__modsi3+0x18> 221e4: 00054c63 bltz a0,221fc <__modsi3+0x20> 221e8: f79ff0ef jal 22160 <__hidden___udivsi3> 221ec: 00058513 mv a0,a1 221f0: 00028067 jr t0 221f4: 40b005b3 neg a1,a1 221f8: fe0558e3 bgez a0,221e8 <__modsi3+0xc> 221fc: 40a00533 neg a0,a0 22200: f61ff0ef jal 22160 <__hidden___udivsi3> 22204: 40b00533 neg a0,a1 22208: 00028067 jr t0 ``` readelf: -h : Display the ELF file header ```bash! riscv-none-elf-readelf -h hw2_origin.elf ``` Excepted output: ```bash ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100d8 Start of program headers: 52 (bytes into file) Start of section headers: 94936 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` ## Pick a Question The following question is picked from the Assignment 1: > 高紹捷 To [find the square root of a integer number](https://hackmd.io/@F3cNkb4bSWKg00J7O-_y8w/rJvzACwgT) > Newton-Raphson is a numerical, iterative method. It has a quadratic convergence. A limitation of this method is that, the speed of convergence is dependent on the chosen initial value, so the choice of initial value is important. In this work, computation of square root is implemented using Newton-Raphson iterative method, with the initial value obtained using an algorithm based on CLZ. [Source code](https://github.com/kkkkk1109/Implement-square-root-using-CLZ) ### Algorithm According to the reference paper, [Square Root Computation Using CLZ](https://ieeexplore.ieee.org/document/8290400), we can show the following computational steps: \begin{gather*}x=\sqrt{2}\end{gather*} \begin{align*} &x^{2}=\mathrm{a}\\ &f(x)=x^{2}-a\\ &x_{n+1}=x_{n}-\frac{({x_{n}}^{2}-a)}{2x_{n}}\\ \end{align*} And then we simplify it: \begin{gather*}x_{n+1}=0.5(x_n+\dfrac{a}{x_n})\end{gather*} The original implementation of the questions is as follows. ```c #include <stdint.h> #include <stdio.h> #define iteration 5 int lz, lzc, i; float ans; int count_leading_zeros(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); /* count ones (population count) */ x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x7f)); } float uint_to_float(uint32_t u){ int c = count_leading_zeros(u); unsigned int exp=127+31-c; // This line is improved from while loop. // ex. c-(1+8-1)=c-8 u <<= (c-8); uint32_t flt= (exp << 23) | (u & 0x7FFFFF); return * (float *) &flt; } float division(float a, float b){ //change float into int int32_t ia = *(int32_t *)& a; int32_t ib = *(int32_t *)& b; //get the exponential int32_t expa = (ia >> 23) & 0xff; int32_t expb = (ib >> 23) & 0xff; // parameter used in rounded output // get the significand uint32_t siga= ((ia & 0x7fffff) | 0x800000); uint32_t sigb= ((ib & 0x7fffff) | 0x800000); // exponential output int32_t expout= expa - expb +127; // allign two numbers' significand // borrow number of digits bit from its exponent if(siga < sigb){ siga = siga << 1; expout--; } // r means result uint32_t r = 0; // start division for(int i = 0 ; i < 25 ; i++){ r= r << 1; if(siga >= sigb){ siga = siga - sigb; r = r | 1; } siga = siga << 1 ; } r=(r >> 1); int32_t sigout=r & 0x7fffff; int32_t out= ((expout & 0xff) << 23) | (sigout); return *(float *) &out; } float addition(float a,float b){ //change float into int int32_t ia = *(int32_t *)& a; int32_t ib = *(int32_t *)& b; //get the exponential int32_t expa = (ia >> 23) & 0xff ; int32_t expb = (ib >> 23) & 0xff; //get the significand uint32_t siga= ((ia & 0x7fffff) | 0x800000); uint32_t sigb= ((ib & 0x7fffff) | 0x800000); //exponential output int32_t expout=0; int dif= expa - expb; if(dif >= 0){ sigb = sigb >> dif; expout = expa; } else{ dif = -dif; siga = siga >> dif; expout = expb; } uint32_t sigout = siga + sigb; if(sigout >> 24 == 1){ sigout = sigout >>1; expout = expout + 1; } int32_t out=(0 & 1 << 31) | ((expout & 0xff) << 23) | ((sigout) & 0x7fffff); return *(float *) &out; } int main(){ uint32_t r=500; lz =count_leading_zeros(r); lzc = (32 - lz) / 2; ans=uint_to_float(r >> lzc); float t,c; for (i = 0 ; i < iteration ; i++){ t = division(r,ans); t = addition(t,ans); t = division(t,2); ans=t; if(ans == c){ break; } c = ans; } printf("After %d iterations, the square root of %d is %f",i+1,r,ans); return 0; } ``` * Firstly, I modified the main function to ensure that every input will result in a reasonable output. In the main function, I created an input array: $[500, 400, 31321, 102, 0]$. Among these numbers, there are some special inputs, such as ```0```, ```1``` and ```400```. If the input are ```0```, ```1``` and ```400```, I predict the ouput will be ```0```, ```1``` and ```20```. * The original code has a flaw. If I input the same number continuously, such as $[64, 64]$, the output will be: :::info After 2 iterations, the square root of 64 is 8.000000 After 1 iterations, the square root of 64 is 8.000000 ::: * The second line of output for the iteration will be 1. It didn't initialize the ```c``` variable in the iteration loop, so it used the incorrect variable. I replaced the location of ```c = ans[j];``` in line 10. * I noticed that the square root of ```64``` or ```1``` is computed in the initial estimation, but it exits the loop only after completing the second iteration in the original code. * Although the original code's author declared that the default input is a positive integer, excluding ```0```, I still want to address this issue. In this case, I need to deal with divide by 0, while the input value is ```0```. The code I improved in main: ```c int main(){ uint32_t r[6]={500, 400, 31321, 64, 1, 0}; float ans[6] = {}; float t,c; for(int j = 0; j < 6;j++){ lz =count_leading_zeros(r[j]); lzc = (32 - lz) / 2; ans[j]=uint_to_float(r[j] >> lzc); for (i = 0 ; i < iteration ; i++){ c = ans[j]; t = division(r[j],ans[j]); t = addition(t,ans[j]); t = division(t,2); ans[j]=t; if(ans[j] == c){ break; } } printf("After %d iterations, the square root of %d is %f\n",i+1,r[j],ans[j]); } return 0; } ``` Previous output: > After 5 iterations, the square root of 500 is 22.360680 > After 4 iterations, the square root of 400 is 20.000000 > After 5 iterations, the square root of 31321 is 176.977386 > After 2 iterations, the square root of 64 is 8.000000 > After 2 iterations, the square root of 1 is 1.000000 > After 6 iterations, the square root of 0 is 0.015625 > Program exited with code: 0 Current output: > After 5 iterations, the square root of 500 is 22.360680 > After 4 iterations, the square root of 400 is 20.000000 > After 5 iterations, the square root of 31321 is 176.977386 > After 1 iterations, the square root of 64 is 8.000000 > After 1 iterations, the square root of 1 is 1.000000 > After 6 iterations, the square root of 0 is 0.015625 > Program exited with code: 0 * I discovered something interesting when the input is ```0```, so I added ```if(u == 0) return 0;``` on line 2 * Also, I added ```if(ans[j] == 0) break;``` within the iteration for loop in the subsequent modification. ```c float uint_to_float(uint32_t u){ if(u == 0) return 0; int c = count_leading_zeros(u); unsigned int exp=127+31-c; // This line is improved from while loop. // ex. c-(1+8-1)=c-8 u <<= (c-8); uint32_t flt= (exp << 23) | (u & 0x7FFFFF); return * (float *) &flt; } ``` In the case where $r[1]=400$, I observed that there is an opportunity to reduce the calculation cycles in the shift_sub loop. If the ```siga``` variable is equal to zero after the if-statement, it's not necessary to execute the loop because it's only possible to set the zero bits in the subsequent processing. ```c float division(float a,float b){ ... for(int i = 0 ; i < 25 ; i++){ ... // If the remainder is zero, jump out the loop // And then set all the remaining bits to zero if(siga == 0){ r = r << (24-i); break; } ... } ... } ``` I think I can combine it with [my assignment 1 clz algorithm](https://hackmd.io/VxdSGcBHQHSLw3oIrp0_zA?view#Harley%E2%80%99s-algorithm) to achieve the greater performance in this case, so I try to replace the original ```uint16_t count_leading_zeros(uint32_t x);``` with the following code: ```c uint16_t count_leading_zeros(uint32_t x) { static uint8_t const Table[] = { 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF, 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF, 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF, 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF, 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF, 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13, 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25, 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF }; /* Propagate leftmost 1-bit to the right */ x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); /* x = x * 0x6EB14F9 */ x = (x << 3) - x; /* Multiply by 7. */ x = (x << 8) - x; /* Multiply by 255. */ x = (x << 8) - x; /* Again. */ x = (x << 8) - x; /* Again. */ return 31-Table[x >> 26]; } ``` After the modification, the cycles in this program have obviously reduced according to the execution information. Meanwhile, the output has changed to: Output: > After 5 iterations, the square root of 500 is 22.360680 > After 4 iterations, the square root of 400 is 20.000000 > After 5 iterations, the square root of 31321 is 176.977386 > After 1 iterations, the square root of 64 is 8.000000 > After 1 iterations, the square root of 1 is 1.000000 > After 1 iterations, the square root of 0 is 0.000000 > Program exited with code: 0 #### The original hand written Assembly The hand written assembly looks like this ```c .data intput: .word 500, 400, 31321, 64, 1, 0 str1: .string "After " str2: .string " iterations, the square root of " str3: .string " is " str4: .string "\n" .text # s0 = integer(input) # s1 = float(input) # s3 = test data address # s4 = test dataset len # s5 = j iterator # a5 = square root of input main: la s3, intput # get test data address li s4, 6 # load test dataset len li s5, 0 # j iterator = 0 loop: lw s0, 0(s3) # r[j] = test data j; mv a1, s0 # utf(a1) jal ra, utf # goto utf mv s1, a1 # store float(input) to s1 jal ra, clz # jump to clz and get a0 =clz(r) addi t0, x0, 32 # t0 = 32 sub t0, t0, a0 # 32 -lz srli t0, t0, 1 # t0 = lzc =(32 - lz)/2 srl a1, s0, t0 # a1 = r >> lzc jal ra, utf # jump to utf and get ans (a1 = utf(a1)) mv a0, x0 # i =0 addi a2, x0, 5 # iteration times in 5 newton: bge a0, a2, print mv a3, s1 # dividend = s0 mv a4, a1 # division = a1 jal ra, division # go to division (r / ans) mv a6, a5 # a6 = t mv a3, a6 # a3 = a6 mv a4, a1 # a4 = ans jal ra, addition # go to addition (t + ans) addi t0, x0, 1 # t0 = 1 slli t0, t0, 23 # t0 = << 1 sub a5, a5, t0 # a5 - t0 (exp-1) do the divide 2 mv a1, a5 # ans = t beq s2, a1 print # if last est == now est mv s2, a1 # c = ans addi a0, a0, 1 # i++ beq x0, x0 newton print: addi t0, a0, 1 la a0, str1 # after li a7, 4 ecall mv a0, t0 # i li a7, 1 ecall la a0, str2 # iteration, the square root of : li a7, 4 ecall mv a0, s0 # input li a7, 1 ecall la a0, str3 # iteration, the square root of : li a7, 4 ecall mv a0, a5 li a7, 2 ecall la a0, str4 # \n li a7, 4 ecall addi s3, s3, 4 # r[j] address + 4 addi s5, s5, 1 # j++ blt s5, s4, loop end: li a7, 10 ecall clz: mv t0, s0 # t0 = x mv t1, t0 # t1 = t0 srli t0, t0, 1 # x >> 1 or t0, t0, t1 # x = x | x >> 1 srli t1, t0, 2 # x >> 2 or t0, t0, t1 # x = x | x >> 2 srli t1, t0, 4 # x >> 4 or t0, t0, t1 # x = x | x >> 4 srli t1, t0, 8 # x >> 8 or t0, t0, t1 # x = x | x >> 8 # count ones (population count) srli t1, t0, 1 # x >> 1 li t6, 0x55555555 # load mask 0x55555555 and t1, t1, t6 # (x >> 1) & 0x55555555 sub t0, t0, t1 # x -= ((x >> 1) & 0x55555555) srli t1, t0, 2 # x >> 2 li t6, 0x33333333 # load mask 0x33333333 and t1, t1, t6 # (x >> 2) & 0x33333333 and t0, t0, t6 # x & 0x33333333 add t0, t0, t1 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) srli t1, t0, 4 # x >> 4 add t0, t0, t1 # x + (x >> 4) li t6, 0x0f0f0f0f # load mask 0x0f0f0f0f and t0, t0, t6 # x = ((x >> 4) + x) & 0x0f0f0f0f srli t1, t0, 8 # x >> 8 add t0, t1, t0 # x + (x >> 8) srli t1, t0, 16 # x >> 16 add t0, t1, t0 # x + (x >> 16) andi t0, t0, 0x7f # x & 0x7f addi t1, x0, 32 # t1 = 32 sub a0, t1, t0 # return (32 - (x & 0x7f)) jr ra utf: addi sp, sp, -12 # make space on stack sw ra, 8(sp) # save ra sw s0, 4(sp) # save s0 sw a0, 0(sp) # save a0 mv s0, a1 jal ra, clz mv t0, a1 # t0 = a1 = u ,a0 = clz(u) addi t1, x0, 158 # t1 = 127 + 31 sub t1, t1, a0 # exp = 158 - clz(u) addi a0, a0,-8 # c-8 sll t0, t0, a0 # u << (c - 8) outloop: slli t1, t1, 23 # exp << 23 li t6, 0x7fffff # load mask 0x7fffff and t0, t0, t6 # u & 0x7fffff or a1, t1, t0 # a1= (exp << 23) | (u & 0x7FFFFF); lw ra, 8(sp) # get return adress lw s0, 4(sp) # get s0 lw a0, 0(sp) # get a0 jr ra # back to main #/////////////////division #a3= dividend #a4= divisor #a5= quotient #/////////////// #t0 = expout; #t1 = siga #t2 = sigb #t3 = i #t4 = 25 division: srli t0, a3, 23 # ia >> 23 andi t0, t0, 0xff # (ia >> 23) & 0xff expa srli t1, a4, 23 # ib >> 23 andi t1, t1, 0xff # (ib >> 23) & 0xff expb sub t0, t0, t1 # (expa - expb) addi t0, t0, 127 # expout = expa -expb +127 li t6, 0x7fffff # load mask 0x7fffff and t1, a3, t6 # ia & 0x7fffff siga and t2, a4, t6 # ib & 0x7fffff sigb li t6, 0x800000 # load mask 0x7fffff or t1, t1, t6 # siga = (ia & 0x7fffff ) | 0x800000 or t2, t2, t6 # sigb = (ib & 0x7fffff ) | 0x800000 div1: bge t1, t2, div2 # siga > sigb go to div2 slli t1, t1, 1 # siga = siga << 1 addi t0, t0, -1 # expout-- mv a5, x0 # r = 0 beq x0, x0, div1 # back to div1 div2: mv t3, x0 # i = 0 addi t4, x0, 25 # t4 = 25 shift_sub: bge t3, t4, div3 # if i >= 25 go to div3 slli a5, a5, 1 # r = r << 1 bltu t1, t2, next # if (siga < sigb) go to next sub t1, t1, t2 # siga = siga - sigb ori a5, a5, 1 # r = r | 1 next: slli t1, t1, 1 # siga = siga << 1 addi t3, t3, 1 # i = i + 1 beq x0, x0, shift_sub# back to shift_sub div3: srli a5, a5, 1 # r = r >> 1 li t6, 0x7fffff # load mask 0x7fffff and a5, a5, t6 # r & 0x7fffff andi t0, t0, 0xff # expout & 0xff slli t0, t0, 23 # expout = expout << 23 or a5, a5, t0 # out= ((expout & 0xff) << 23) | (sigout) jr ra # back to main #///////////////////addition #a3= a #a4= b #a5= a+b #/////////////// #t0 = expa / expout; #t1 = expb / sigout #t2 = dif #t3 = siga #t4 = sigb addition: srli t0, a3, 23 # ia >> 23 andi t0, t0, 0xff # expa = (ia >> 23) & 0xff srli t1, a4, 23 # ib >> 23 andi t1, t1, 0xff # expb = (ib >> 23) & 0xff sub t2, t0, t1 # dif = expa - expb li t6, 0x7fffff # load mask 0x7fffff and t3, a3, t6 # ia & 0x7fffff and t4, a4, t6 # ib & 0x7fffff li t6, 0x800000 # load mask 0x800000 or t3, t3, t6 # siga = ((ia & 0x7fffff) | 0x800000) or t4, t4, t6 # sigb = ((ib & 0x7fffff) | 0x800000) blt t2, x0, out_b # if dif < 0 , expout = expb, else expout=expa srl t4, t4, t2 # sigb = sigb >> dif beq x0, x0, plus # go to plus out_b: sub t2, x0, t2 # dif= - dif change dif to positive srl t3, t3, t2 # siga = siga >> dif mv t0, t1 # expout = expb plus: add t1, t3, t4 # sigout= siga + sigb srli t2, t1, 24 # sigout >> 24 addi t3, x0, 1 # t3 = 1 bne t2, t3, add_out # if (sigout >> 24) != 1 end addition srli t1, t1, 1 # sigout = sigout >> 1 addi t0, t0, 1 # expout ++ add_out: andi t0, t0, 0xff # expout & 0xff slli t0, t0, 23 # (expout & 0xff) << 23 li t6, 0x7fffff # load mask 0x7fffff and t1, t1, t6 # sigout& 0x7fffff or a5, t0, t1 # out = ((expout & 0xff) <<23) | ((sigout) & 0x7fffff) jr ra ``` The assembly codes I replaced are with optimized versions that improve performance, including byte array of harley hash table, count leading zero function and the if-statement in sub_shift. ```diff! .data intput: .word 500, 400, 31321, 64, 1, 0 +# hash map +table: + .byte 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF + .byte 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF + .byte 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF + .byte 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF + .byte 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF + .byte 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13 + .byte 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25 + .byte 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF str1: .string "After " str2: .string " iterations, the square root of " str3: .string " is " str4: .string "\n" ``` ```diff! clz: mv t0, s0 # t0 = x mv t1, t0 # t1 = t0 srli t0, t0, 1 # x >> 1 or t0, t0, t1 # x = x | x >> 1 srli t1, t0, 2 # x >> 2 or t0, t0, t1 # x = x | x >> 2 srli t1, t0, 4 # x >> 4 or t0, t0, t1 # x = x | x >> 4 srli t1, t0, 8 # x >> 8 or t0, t0, t1 # x = x | x >> 8 srli t1, t0, 16 # x >> 16 or t0, t0, t1 # x = x | x >> 16 # count ones (population count) - srli t1, t0, 1 # x >> 1 - li t6, 0x55555555 # load mask 0x55555555 - and t1, t1, t6 # (x >> 1) & 0x55555555 - sub t0, t0, t1 # x -= ((x >> 1) & 0x55555555) - srli t1, t0, 2 # x >> 2 - li t6, 0x33333333 # load mask 0x33333333 - and t1, t1, t6 # (x >> 2) & 0x33333333 - and t0, t0, t6 # x & 0x33333333 - add t0, t0, t1 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) - srli t1, t0, 4 # x >> 4 - add t0, t0, t1 # x + (x >> 4) - li t6, 0x0f0f0f0f # load mask 0x0f0f0f0f - and t0, t0, t6 # x = ((x >> 4) + x) & 0x0f0f0f0f - srli t1, t0, 8 # x >> 8 - add t0, t1, t0 # x + (x >> 8) - srli t1, t0, 16 # x >> 16 - add t0, t1, t0 # x + (x >> 16) - andi t0, t0, 0x7f # x & 0x7f - addi t1, x0, 32 # t1 = 32 - sub a0, t1, t0 # return (32 - (x & 0x7f)) + slli t1, t0, 3 + sub t0, t1, t0 # x = (x << 3) - x + slli t1, t0, 8 + sub t0, t1, t0 # x = (x << 8) - x + slli t1, t0, 8 + sub t0, t1, t0 # x = (x << 8) - x + slli t1, t0, 8 + sub t0, t1, t0 # x = (x << 8) - x + srli t0, t0, 26 # x >> 26 + la t1, table + add t1, t1, t0 + lbu t0, 0(t1) # Table[x >> 26] + li t1, 31 + sub a0, t1, t0 # 31 - Table[x >> 26] jr ra ``` ```diff! shift_sub: bge t3, t4, div3 # if i >= 25 go to div3 slli a5, a5, 1 # r = r << 1 bltu t1, t2, next # if (siga < sigb) go to next sub t1, t1, t2 # siga = siga - sigb ori a5, a5, 1 # r = r | 1 + bne t1, zero, next # check if the reminder is zero + li t6, 24 # load integer 24 + sub t6, t6, t3 # 24-i + sll a5, a5, t6 # r = r << (24-i) + jal zero, div3 # break, jump to div3 ``` ### Compiling using RISC-V gcc Commands about RISC-V gcc compiler: * Generate assembly file from .c file ```bash! riscv-none-elf-gcc -S hw2_origin.c -O0 -o hw2_origin_O0.s -march=rv32i -mabi=ilp32 ``` * Generate .o file from .S file ```bash! riscv-none-elf-as -R -march=rv32i -mabi=ilp32 -o hw2_origin_O0.o hw2_origin_O0.s ``` * Combine the assembly with the linking info, create program in executable and linkable format (.elf) ```bash! riscv-none-elf-ld -o hw2_origin_O0.elf -T hw2_origin_O0.ld --oformat=elf32-littleriscv hw2_origin_O0.o ``` Retrieve the ELF file information using the following command in the terminal: ```bash! riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O'x' hw2_origin.c -o hw2_origin.elf ``` , where ```'x'``` represents the optimization level, which can be set to 0, 1, 2, and so on. To obtain the section sizes and the total size of the object file, I can use this command to display its argument list: ```bash! riscv-none-elf-size hw2_origin.elf ``` Expected output: ```bash text data bss dec hex filename 77700 2320 1544 81564 13e9c hw2_origin.elf ``` I compile the original c file, ```hw2_origin.c``` with different optimization level, the result is the following: | Level | O0 | O1 | O2 | O3 | Ofast | | ----- | ----- | ----- | ----- | ----- | ----- | | text | 77700 | 76792 | 76816 | 77456 | 77436 | | data | 2320 | 2324 | 2324 | 2320 | 2320 | | bss | 1544 | 1544 | 1544 | 1544 | 1544 | | dec | 81564 | 80660 | 80684 | 81320 | 81300 | | hex | 13e9c | 13b14 | 13b2c | 13da8 | 13d94 | I compile the improved c file, ```hw2_imporved.c``` with different optimization level, the result is the following: | Level | O0 | O1 | O2 | O3 | Ofast | | ----- | ----- | ----- | ----- | ----- | ----- | | text | 77792 | 76848 | 76884 | 77492 | 77472 | | data | 2320 | 2324 | 2324 | 2320 | 2320 | | bss | 1544 | 1544 | 1544 | 1544 | 1544 | | dec | 81656 | 80716 | 80752 | 81356 | 81336 | | hex | 13ef8 | 13b4c | 13b70 | 13dcc | 13db8 | ### Evaluate computation cycles in C file Use the code in the [perfcounter](https://github.com/sysprog21/rv32emu/tree/master/tests/perfcounter) to obtain the execution clock cycle of the program. This code utilizes the CSR register. In other words, In other words, I am only measuring the clock cycles taken by the ```clz``` and ```sqrt``` functions. Write a makefile for c file, taking ```hw2_origin.c``` for example: ``` .PHONY: clean include ../../../rv32emu/mk/toolchain.mk CFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32 -O0 -Wall OBJS = \ getcycles.o \ getinstret.o \ sparkle.o \ hw2_origin.o BIN = hw2_origin.elf %.o: %.S $(CROSS_COMPILE)gcc $(CFLAGS) -c -o $@ $< %.o: %.c $(CROSS_COMPILE)gcc $(CFLAGS) -c -o $@ $< all: $(BIN) $(BIN): $(OBJS) $(CROSS_COMPILE)gcc -o $@ $^ clean: $(RM) $(BIN) $(OBJS) ``` #### hw2_origin.c Output on terminal: ```bash! After 5 iterations, the square root of 500 is 22.360680 Cycle count: 6411 Instructions retired: 6401 After 4 iterations, the square root of 400 is 20.000000 Cycle count: 4859 Instructions retired: 4859 After 5 iterations, the square root of 31321 is 176.977386 Cycle count: 6563 Instructions retired: 6563 After 2 iterations, the square root of 64 is 8.000000 Cycle count: 2520 Instructions retired: 2520 After 2 iterations, the square root of 1 is 1.000000 Cycle count: 2520 Instructions retired: 2520 After 6 iterations, the square root of 0 is 0.015625 Cycle count: 5832 Instructions retired: 5832 inferior exit code 0 ``` #### hw2_improved.c Output on terminal: ```bash After 5 iterations, the square root of 500 is 22.360680 Cycle count: 6885 Instructions retired: 6875 After 4 iterations, the square root of 400 is 20.000000 Cycle count: 3456 Instructions retired: 3456 After 5 iterations, the square root of 31321 is 176.977386 Cycle count: 6903 Instructions retired: 6903 After 1 iterations, the square root of 64 is 8.000000 Cycle count: 744 Instructions retired: 744 After 1 iterations, the square root of 1 is 1.000000 Cycle count: 744 Instructions retired: 744 After 1 iterations, the square root of 0 is 0.000000 Cycle count: 289 Instructions retired: 289 inferior exit code 0 ``` #### hw2_improved_harley.c Output on terminal: ```bash! After 5 iterations, the square root of 500 is 22.360680 Cycle count: 6865 Instructions retired: 6855 After 4 iterations, the square root of 400 is 20.000000 Cycle count: 3436 Instructions retired: 3436 After 5 iterations, the square root of 31321 is 176.977386 Cycle count: 6883 Instructions retired: 6883 After 1 iterations, the square root of 64 is 8.000000 Cycle count: 724 Instructions retired: 724 After 1 iterations, the square root of 1 is 1.000000 Cycle count: 724 Instructions retired: 724 After 1 iterations, the square root of 0 is 0.000000 Cycle count: 227 Instructions retired: 227 inferior exit code 0 ``` #### CSR I arranged the following sheet based on the results printed in the terminal. We can observe that ```hw2_origin.c``` performs the worst, excluding the input value ```500``` and ```31321```. When comparing ```hw2_origin.c``` to ```hw2_improved.c```, the latter handles the other input values quite well. This improvement is due to the if-statement in ```div_shift```, which noticeably reduces CSR cycles. However, the disadvantage of this modification is an increase in the cycle count during each iteration. ```hw2_improved_harley.c``` is an advanced improvement over ```hw2_improved.c```, which implements the Harley hash table that covers the ```clz``` function. This sheet represents the O0, O1 and Ofast optimized C code. **O0 Optimization** | file name | ```500``` | ```400``` | ```31321``` | ```64``` | ```1``` | ```0``` | | --------------------- | ---- | ---- | ----- | ---- | ---- | ---- | | hw2_origin.c | ==6411== | 4859 | ==6563== | 2520 | 2520 | 5832 | | hw2_improved.c | 6885 | 3456 | 6903 | 744 | 744 | 289 | | hw2_improved_harley.c | 6865 | ==3436== | 6883 | ==724== | ==724== | ==227== | **O1 Optimization** | file name | ```500``` | ```400``` | ```31321``` | ```64``` | ```1``` | ```0``` | | --------------------- | ---- | ---- | ----- | ---- | ---- | ---- | | hw2_origin.c | ==2436== | 1837 | ==2565== | 1032 | 1032 | 2186 | | hw2_improved.c | 2740 | 1454 | 2824 | 417 | 416 | 209 | | hw2_improved_harley.c | 2731 | ==1444== | 2814 | ==407== | ==406== | ==178== | **Ofast Optimization** | file name | ```500``` | ```400``` | ```31321``` | ```64``` | ```1``` | ```0``` | | --------------------- | ---- | ---- | ----- | ---- | ---- | ---- | | hw2_origin.c | ==2080== | 1641 | ==2197== | 984 | 984 | 2092 | | hw2_improved.c | 2249 | 1105 | 2324 | 354 | 354 | 220 | | hw2_improved_harley.c | 2243 | ==1099== | 2318 | ==348== | ==348== | ==187== | ### Write a Faster / Smaller Assembly program **hw2_origin.S** In ```hw2_origin.S```, I did some modification to print out the cycle count numbers and output values : ```diff! print: ... - la a0, str1 # after - li a7, 4 + li a0, 1 + la a1, str1 # "after " + la a2, str1_size + li a7, SYSWRITE ecall ... - mv a0, s0 # input - li a7, 1 + li a0, 1 + mv a1, s0 # input + li a7, PRINT_INT ecall ... - mv a0, a1 - li a7, 2 + li a0, 1 + mv a1, s2 # sqrt + li a7, PRINT_HEX ecall ... end: - li a7, 10 + mv ra ,x0 + li a7, SYSEXIT ecall ... + get_cycles: + csrr a1, cycleh + csrr a0, cycle + csrr a2, cycleh + bne a1, a2, get_cycles + ret ``` To use the cssr instruction, we need to modify the makefile: ```diff! - ASFLAGS = -march=rv32i -mabi=ilp32 + ASFLAGS = -march=rv32i_zicsr -mabi=ilp32 -mabi=ilp32 ``` Since ```fprintf``` has some problem in printing floating point, I replace it by the number in hex format. After checking, the output values are correct. Output in terminal: ```bash! After 1363 iterations, the square root of 500 is 41b2e2ac Cycle count: 1363 After 1095 iterations, the square root of 400 is 41a00000 Cycle count: 1095 After 1374 iterations, the square root of 31321 is 4330fa36 Cycle count: 1374 After 603 iterations, the square root of 64 is 41000000 Cycle count: 603 After 603 iterations, the square root of 1 is 3f800000 Cycle count: 603 After 1372 iterations, the square root of 0 is 3f3504f3 Cycle count: 1372 inferior exit code 2 ``` **hw2_improved.S** ```bash! After 1408 iterations, the square root of 500 is 41b2e2ac Cycle count: 1408 After 839 iterations, the square root of 400 is 41a00000 Cycle count: 839 After 1423 iterations, the square root of 31321 is 4330fa36 Cycle count: 1423 After 225 iterations, the square root of 64 is 41000000 Cycle count: 225 After 225 iterations, the square root of 1 is 3f800000 Cycle count: 225 After 70 iterations, the square root of 0 is 00000000 Cycle count: 70 inferior exit code 2 ``` | info | hw2_improved.S | hw2_origin.S | | ---- | -------------- |:------------ | | text | 1036 | 968 | | data | 0 | 0 | | bss | 0 | 0 | | dec | 1036 | 968 | | hex | 40c | 3c8 | | file name | ```500``` | ```400``` | ```31321``` | ```64``` | ```1``` | ```0``` | |:-------------- |:-------- | ------- | -------- | ------- |:------- |:------ | | hw2_origin.S | ==1363== | 1095 | ==1374== | 603 | 603 | 1372 | | hw2_improved.S | 1408 | ==839== | 1423 | ==225== | ==225== | ==70== | ## Compare computing cycles between Ripes and RV32emu According to these figures, there are two set of information in Ripe. Example in ```hw2_improved.S``` performs better than example in ```hw2_origin.S``` . It's worth noting that both sets of values are derived from the results of running the entire test input suite. **hw2_origin.S** ![](https://hackmd.io/_uploads/SkUOhmpza.png) **hw2_improved.S** ![](https://hackmd.io/_uploads/Skcus7aMp.png) ## Estimating a histogram in the rv32emu toolchain Look into the instructions histogram generated by `rv_histogram` ## Reference * [rv32emu](https://github.com/sysprog21/rv32emu/tree/master) * [Why to Learn C programming?](https://www.tutorialspoint.com/cprogramming/index.htm) * [Assignment1: RISC-V Assembly and Instruction Pipeline-高紹捷](https://hackmd.io/@F3cNkb4bSWKg00J7O-_y8w/rJvzACwgT?utm_source=preview-mode&utm_medium=rec) * [Power of CLZ instruction in numerical computations](https://ieeexplore.ieee.org/document/8290400) :::warning :warning: Refer to the official **C standard** documentation for accurate information, rather than relying on web pages that may not be considered authoritative sources. :notes: jserv :::

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully