---
title: Computer_Architecture_homework2
tags: Computer Architecture
---
# Assignment2: RISC-V Toolchain
Contributed by < [Sun970053](https://github.com/Sun970053) >
## Install riscv rv32ima toolchain
Follow the instruction from [this note](https://hackmd.io/@sysprog/rJAufgHYS)
Replace this line
```bash
cd $HOME/riscv-none-elf-gcc
echo "export PATH=`pwd`/bin:$PATH" > setenv
```
to this line
```bash
echo "export PATH=`pwd`/bin:$PATH" >> .bashrc
```
then ```source``` the .bashrc file
```bash
source .bashrc
```
validate installation succeed by executing
```bash
riscv-none-elf-gcc -v
```
and the following result is printed out on the terminal.
```bash
Using built-in specs.
COLLECT_GCC=riscv-none-elf-gcc
...
gcc version 13.2.0 (xPack GNU RISC-V Embedded GCC x86_64)
```
:::warning
:warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text.
:notes: jserv
:::
## Commonly used commands in riscv rv32ima toolchain
Compile c code
```bash
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 hw2_origin.c -o hw2_origin.elf
```
Run the elf file in the path ```~/rv32emu```
```bash
address/build/rv32emu hw2.elf
```
Output
```bash
After 5 iterations, the square root of 500 is 22.360680
After 4 iterations, the square root of 400 is 20.000000
After 5 iterations, the square root of 31321 is 176.977386
...
inferior exit code 0
```
objdump:
-d: Display the assembler mnemonics for the machine instructions
```bash!
riscv-none-elf-objdump -d hw2_origin.elf
```
Expected output:
```bash!
...
221c8: 40b005b3 neg a1,a1
221cc: 00008293 mv t0,ra
221d0: f91ff0ef jal 22160 <__hidden___udivsi3>
221d4: 40a00533 neg a0,a0
221d8: 00028067 jr t0
000221dc <__modsi3>:
221dc: 00008293 mv t0,ra
221e0: 0005ca63 bltz a1,221f4 <__modsi3+0x18>
221e4: 00054c63 bltz a0,221fc <__modsi3+0x20>
221e8: f79ff0ef jal 22160 <__hidden___udivsi3>
221ec: 00058513 mv a0,a1
221f0: 00028067 jr t0
221f4: 40b005b3 neg a1,a1
221f8: fe0558e3 bgez a0,221e8 <__modsi3+0xc>
221fc: 40a00533 neg a0,a0
22200: f61ff0ef jal 22160 <__hidden___udivsi3>
22204: 40b00533 neg a0,a1
22208: 00028067 jr t0
```
readelf:
-h : Display the ELF file header
```bash!
riscv-none-elf-readelf -h hw2_origin.elf
```
Excepted output:
```bash
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x100d8
Start of program headers: 52 (bytes into file)
Start of section headers: 94936 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
```
## Pick a Question
The following question is picked from the Assignment 1:
> 高紹捷 To [find the square root of a integer number](https://hackmd.io/@F3cNkb4bSWKg00J7O-_y8w/rJvzACwgT)
> Newton-Raphson is a numerical, iterative method. It has a quadratic convergence. A limitation of this method is that, the speed of convergence is dependent on the chosen initial value, so the choice of initial value is important. In this work, computation of square root is implemented using Newton-Raphson iterative method, with the initial value obtained using an algorithm based on CLZ.
[Source code](https://github.com/kkkkk1109/Implement-square-root-using-CLZ)
### Algorithm
According to the reference paper, [Square Root Computation Using CLZ](https://ieeexplore.ieee.org/document/8290400), we can show the following computational steps:
\begin{gather*}x=\sqrt{2}\end{gather*}
\begin{align*} &x^{2}=\mathrm{a}\\ &f(x)=x^{2}-a\\ &x_{n+1}=x_{n}-\frac{({x_{n}}^{2}-a)}{2x_{n}}\\ \end{align*}
And then we simplify it:
\begin{gather*}x_{n+1}=0.5(x_n+\dfrac{a}{x_n})\end{gather*}
The original implementation of the questions is as follows.
```c
#include <stdint.h>
#include <stdio.h>
#define iteration 5
int lz, lzc, i;
float ans;
int count_leading_zeros(uint32_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
/* count ones (population count) */
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x7f));
}
float uint_to_float(uint32_t u){
int c = count_leading_zeros(u);
unsigned int exp=127+31-c;
// This line is improved from while loop.
// ex. c-(1+8-1)=c-8
u <<= (c-8);
uint32_t flt= (exp << 23) | (u & 0x7FFFFF);
return * (float *) &flt;
}
float division(float a, float b){
//change float into int
int32_t ia = *(int32_t *)& a;
int32_t ib = *(int32_t *)& b;
//get the exponential
int32_t expa = (ia >> 23) & 0xff;
int32_t expb = (ib >> 23) & 0xff;
// parameter used in rounded output
// get the significand
uint32_t siga= ((ia & 0x7fffff) | 0x800000);
uint32_t sigb= ((ib & 0x7fffff) | 0x800000);
// exponential output
int32_t expout= expa - expb +127;
// allign two numbers' significand
// borrow number of digits bit from its exponent
if(siga < sigb){
siga = siga << 1;
expout--;
}
// r means result
uint32_t r = 0;
// start division
for(int i = 0 ; i < 25 ; i++){
r= r << 1;
if(siga >= sigb){
siga = siga - sigb;
r = r | 1;
}
siga = siga << 1 ;
}
r=(r >> 1);
int32_t sigout=r & 0x7fffff;
int32_t out= ((expout & 0xff) << 23) | (sigout);
return *(float *) &out;
}
float addition(float a,float b){
//change float into int
int32_t ia = *(int32_t *)& a;
int32_t ib = *(int32_t *)& b;
//get the exponential
int32_t expa = (ia >> 23) & 0xff ;
int32_t expb = (ib >> 23) & 0xff;
//get the significand
uint32_t siga= ((ia & 0x7fffff) | 0x800000);
uint32_t sigb= ((ib & 0x7fffff) | 0x800000);
//exponential output
int32_t expout=0;
int dif= expa - expb;
if(dif >= 0){
sigb = sigb >> dif;
expout = expa;
}
else{
dif = -dif;
siga = siga >> dif;
expout = expb;
}
uint32_t sigout = siga + sigb;
if(sigout >> 24 == 1){
sigout = sigout >>1;
expout = expout + 1;
}
int32_t out=(0 & 1 << 31) | ((expout & 0xff) << 23) | ((sigout) & 0x7fffff);
return *(float *) &out;
}
int main(){
uint32_t r=500;
lz =count_leading_zeros(r);
lzc = (32 - lz) / 2;
ans=uint_to_float(r >> lzc);
float t,c;
for (i = 0 ; i < iteration ; i++){
t = division(r,ans);
t = addition(t,ans);
t = division(t,2);
ans=t;
if(ans == c){
break;
}
c = ans;
}
printf("After %d iterations, the square root of %d is %f",i+1,r,ans);
return 0;
}
```
* Firstly, I modified the main function to ensure that every input will result in a reasonable output. In the main function, I created an input array: $[500, 400, 31321, 102, 0]$. Among these numbers, there are some special inputs, such as ```0```, ```1``` and ```400```. If the input are ```0```, ```1``` and ```400```, I predict the ouput will be ```0```, ```1``` and ```20```.
* The original code has a flaw. If I input the same number continuously, such as $[64, 64]$, the output will be:
:::info
After 2 iterations, the square root of 64 is 8.000000
After 1 iterations, the square root of 64 is 8.000000
:::
* The second line of output for the iteration will be 1. It didn't initialize the ```c``` variable in the iteration loop, so it used the incorrect variable. I replaced the location of ```c = ans[j];``` in line 10.
* I noticed that the square root of ```64``` or ```1``` is computed in the initial estimation, but it exits the loop only after completing the second iteration in the original code.
* Although the original code's author declared that the default input is a positive integer, excluding ```0```, I still want to address this issue. In this case, I need to deal with divide by 0, while the input value is ```0```.
The code I improved in main:
```c
int main(){
uint32_t r[6]={500, 400, 31321, 64, 1, 0};
float ans[6] = {};
float t,c;
for(int j = 0; j < 6;j++){
lz =count_leading_zeros(r[j]);
lzc = (32 - lz) / 2;
ans[j]=uint_to_float(r[j] >> lzc);
for (i = 0 ; i < iteration ; i++){
c = ans[j];
t = division(r[j],ans[j]);
t = addition(t,ans[j]);
t = division(t,2);
ans[j]=t;
if(ans[j] == c){
break;
}
}
printf("After %d iterations, the square root of %d is %f\n",i+1,r[j],ans[j]);
}
return 0;
}
```
Previous output:
> After 5 iterations, the square root of 500 is 22.360680
> After 4 iterations, the square root of 400 is 20.000000
> After 5 iterations, the square root of 31321 is 176.977386
> After 2 iterations, the square root of 64 is 8.000000
> After 2 iterations, the square root of 1 is 1.000000
> After 6 iterations, the square root of 0 is 0.015625
> Program exited with code: 0
Current output:
> After 5 iterations, the square root of 500 is 22.360680
> After 4 iterations, the square root of 400 is 20.000000
> After 5 iterations, the square root of 31321 is 176.977386
> After 1 iterations, the square root of 64 is 8.000000
> After 1 iterations, the square root of 1 is 1.000000
> After 6 iterations, the square root of 0 is 0.015625
> Program exited with code: 0
* I discovered something interesting when the input is ```0```, so I added ```if(u == 0) return 0;``` on line 2
* Also, I added ```if(ans[j] == 0) break;``` within the iteration for loop in the subsequent modification.
```c
float uint_to_float(uint32_t u){
if(u == 0) return 0;
int c = count_leading_zeros(u);
unsigned int exp=127+31-c;
// This line is improved from while loop.
// ex. c-(1+8-1)=c-8
u <<= (c-8);
uint32_t flt= (exp << 23) | (u & 0x7FFFFF);
return * (float *) &flt;
}
```
In the case where $r[1]=400$, I observed that there is an opportunity to reduce the calculation cycles in the shift_sub loop. If the ```siga``` variable is equal to zero after the if-statement, it's not necessary to execute the loop because it's only possible to set the zero bits in the subsequent processing.
```c
float division(float a,float b){
...
for(int i = 0 ; i < 25 ; i++){
...
// If the remainder is zero, jump out the loop
// And then set all the remaining bits to zero
if(siga == 0){
r = r << (24-i);
break;
}
...
}
...
}
```
I think I can combine it with [my assignment 1 clz algorithm](https://hackmd.io/VxdSGcBHQHSLw3oIrp0_zA?view#Harley%E2%80%99s-algorithm) to achieve the greater performance in this case, so I try to replace the original ```uint16_t count_leading_zeros(uint32_t x);``` with the following code:
```c
uint16_t count_leading_zeros(uint32_t x)
{
static uint8_t const Table[] = {
0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF,
16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF,
0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF,
0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF,
14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF,
18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13,
26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25,
0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF
};
/* Propagate leftmost 1-bit to the right */
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >> 16);
/* x = x * 0x6EB14F9 */
x = (x << 3) - x; /* Multiply by 7. */
x = (x << 8) - x; /* Multiply by 255. */
x = (x << 8) - x; /* Again. */
x = (x << 8) - x; /* Again. */
return 31-Table[x >> 26];
}
```
After the modification, the cycles in this program have obviously reduced according to the execution information. Meanwhile, the output has changed to:
Output:
> After 5 iterations, the square root of 500 is 22.360680
> After 4 iterations, the square root of 400 is 20.000000
> After 5 iterations, the square root of 31321 is 176.977386
> After 1 iterations, the square root of 64 is 8.000000
> After 1 iterations, the square root of 1 is 1.000000
> After 1 iterations, the square root of 0 is 0.000000
> Program exited with code: 0
#### The original hand written Assembly
The hand written assembly looks like this
```c
.data
intput: .word 500, 400, 31321, 64, 1, 0
str1: .string "After "
str2: .string " iterations, the square root of "
str3: .string " is "
str4: .string "\n"
.text
# s0 = integer(input)
# s1 = float(input)
# s3 = test data address
# s4 = test dataset len
# s5 = j iterator
# a5 = square root of input
main:
la s3, intput # get test data address
li s4, 6 # load test dataset len
li s5, 0 # j iterator = 0
loop:
lw s0, 0(s3) # r[j] = test data j;
mv a1, s0 # utf(a1)
jal ra, utf # goto utf
mv s1, a1 # store float(input) to s1
jal ra, clz # jump to clz and get a0 =clz(r)
addi t0, x0, 32 # t0 = 32
sub t0, t0, a0 # 32 -lz
srli t0, t0, 1 # t0 = lzc =(32 - lz)/2
srl a1, s0, t0 # a1 = r >> lzc
jal ra, utf # jump to utf and get ans (a1 = utf(a1))
mv a0, x0 # i =0
addi a2, x0, 5 # iteration times in 5
newton:
bge a0, a2, print
mv a3, s1 # dividend = s0
mv a4, a1 # division = a1
jal ra, division # go to division (r / ans)
mv a6, a5 # a6 = t
mv a3, a6 # a3 = a6
mv a4, a1 # a4 = ans
jal ra, addition # go to addition (t + ans)
addi t0, x0, 1 # t0 = 1
slli t0, t0, 23 # t0 = << 1
sub a5, a5, t0 # a5 - t0 (exp-1) do the divide 2
mv a1, a5 # ans = t
beq s2, a1 print # if last est == now est
mv s2, a1 # c = ans
addi a0, a0, 1 # i++
beq x0, x0 newton
print:
addi t0, a0, 1
la a0, str1 # after
li a7, 4
ecall
mv a0, t0 # i
li a7, 1
ecall
la a0, str2 # iteration, the square root of :
li a7, 4
ecall
mv a0, s0 # input
li a7, 1
ecall
la a0, str3 # iteration, the square root of :
li a7, 4
ecall
mv a0, a5
li a7, 2
ecall
la a0, str4 # \n
li a7, 4
ecall
addi s3, s3, 4 # r[j] address + 4
addi s5, s5, 1 # j++
blt s5, s4, loop
end:
li a7, 10
ecall
clz:
mv t0, s0 # t0 = x
mv t1, t0 # t1 = t0
srli t0, t0, 1 # x >> 1
or t0, t0, t1 # x = x | x >> 1
srli t1, t0, 2 # x >> 2
or t0, t0, t1 # x = x | x >> 2
srli t1, t0, 4 # x >> 4
or t0, t0, t1 # x = x | x >> 4
srli t1, t0, 8 # x >> 8
or t0, t0, t1 # x = x | x >> 8
# count ones (population count)
srli t1, t0, 1 # x >> 1
li t6, 0x55555555 # load mask 0x55555555
and t1, t1, t6 # (x >> 1) & 0x55555555
sub t0, t0, t1 # x -= ((x >> 1) & 0x55555555)
srli t1, t0, 2 # x >> 2
li t6, 0x33333333 # load mask 0x33333333
and t1, t1, t6 # (x >> 2) & 0x33333333
and t0, t0, t6 # x & 0x33333333
add t0, t0, t1 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
srli t1, t0, 4 # x >> 4
add t0, t0, t1 # x + (x >> 4)
li t6, 0x0f0f0f0f # load mask 0x0f0f0f0f
and t0, t0, t6 # x = ((x >> 4) + x) & 0x0f0f0f0f
srli t1, t0, 8 # x >> 8
add t0, t1, t0 # x + (x >> 8)
srli t1, t0, 16 # x >> 16
add t0, t1, t0 # x + (x >> 16)
andi t0, t0, 0x7f # x & 0x7f
addi t1, x0, 32 # t1 = 32
sub a0, t1, t0 # return (32 - (x & 0x7f))
jr ra
utf:
addi sp, sp, -12 # make space on stack
sw ra, 8(sp) # save ra
sw s0, 4(sp) # save s0
sw a0, 0(sp) # save a0
mv s0, a1
jal ra, clz
mv t0, a1 # t0 = a1 = u ,a0 = clz(u)
addi t1, x0, 158 # t1 = 127 + 31
sub t1, t1, a0 # exp = 158 - clz(u)
addi a0, a0,-8 # c-8
sll t0, t0, a0 # u << (c - 8)
outloop:
slli t1, t1, 23 # exp << 23
li t6, 0x7fffff # load mask 0x7fffff
and t0, t0, t6 # u & 0x7fffff
or a1, t1, t0 # a1= (exp << 23) | (u & 0x7FFFFF);
lw ra, 8(sp) # get return adress
lw s0, 4(sp) # get s0
lw a0, 0(sp) # get a0
jr ra # back to main
#/////////////////division
#a3= dividend
#a4= divisor
#a5= quotient
#///////////////
#t0 = expout;
#t1 = siga
#t2 = sigb
#t3 = i
#t4 = 25
division:
srli t0, a3, 23 # ia >> 23
andi t0, t0, 0xff # (ia >> 23) & 0xff expa
srli t1, a4, 23 # ib >> 23
andi t1, t1, 0xff # (ib >> 23) & 0xff expb
sub t0, t0, t1 # (expa - expb)
addi t0, t0, 127 # expout = expa -expb +127
li t6, 0x7fffff # load mask 0x7fffff
and t1, a3, t6 # ia & 0x7fffff siga
and t2, a4, t6 # ib & 0x7fffff sigb
li t6, 0x800000 # load mask 0x7fffff
or t1, t1, t6 # siga = (ia & 0x7fffff ) | 0x800000
or t2, t2, t6 # sigb = (ib & 0x7fffff ) | 0x800000
div1:
bge t1, t2, div2 # siga > sigb go to div2
slli t1, t1, 1 # siga = siga << 1
addi t0, t0, -1 # expout--
mv a5, x0 # r = 0
beq x0, x0, div1 # back to div1
div2:
mv t3, x0 # i = 0
addi t4, x0, 25 # t4 = 25
shift_sub:
bge t3, t4, div3 # if i >= 25 go to div3
slli a5, a5, 1 # r = r << 1
bltu t1, t2, next # if (siga < sigb) go to next
sub t1, t1, t2 # siga = siga - sigb
ori a5, a5, 1 # r = r | 1
next:
slli t1, t1, 1 # siga = siga << 1
addi t3, t3, 1 # i = i + 1
beq x0, x0, shift_sub# back to shift_sub
div3:
srli a5, a5, 1 # r = r >> 1
li t6, 0x7fffff # load mask 0x7fffff
and a5, a5, t6 # r & 0x7fffff
andi t0, t0, 0xff # expout & 0xff
slli t0, t0, 23 # expout = expout << 23
or a5, a5, t0 # out= ((expout & 0xff) << 23) | (sigout)
jr ra # back to main
#///////////////////addition
#a3= a
#a4= b
#a5= a+b
#///////////////
#t0 = expa / expout;
#t1 = expb / sigout
#t2 = dif
#t3 = siga
#t4 = sigb
addition:
srli t0, a3, 23 # ia >> 23
andi t0, t0, 0xff # expa = (ia >> 23) & 0xff
srli t1, a4, 23 # ib >> 23
andi t1, t1, 0xff # expb = (ib >> 23) & 0xff
sub t2, t0, t1 # dif = expa - expb
li t6, 0x7fffff # load mask 0x7fffff
and t3, a3, t6 # ia & 0x7fffff
and t4, a4, t6 # ib & 0x7fffff
li t6, 0x800000 # load mask 0x800000
or t3, t3, t6 # siga = ((ia & 0x7fffff) | 0x800000)
or t4, t4, t6 # sigb = ((ib & 0x7fffff) | 0x800000)
blt t2, x0, out_b # if dif < 0 , expout = expb, else expout=expa
srl t4, t4, t2 # sigb = sigb >> dif
beq x0, x0, plus # go to plus
out_b:
sub t2, x0, t2 # dif= - dif change dif to positive
srl t3, t3, t2 # siga = siga >> dif
mv t0, t1 # expout = expb
plus:
add t1, t3, t4 # sigout= siga + sigb
srli t2, t1, 24 # sigout >> 24
addi t3, x0, 1 # t3 = 1
bne t2, t3, add_out # if (sigout >> 24) != 1 end addition
srli t1, t1, 1 # sigout = sigout >> 1
addi t0, t0, 1 # expout ++
add_out:
andi t0, t0, 0xff # expout & 0xff
slli t0, t0, 23 # (expout & 0xff) << 23
li t6, 0x7fffff # load mask 0x7fffff
and t1, t1, t6 # sigout& 0x7fffff
or a5, t0, t1 # out = ((expout & 0xff) <<23) | ((sigout) & 0x7fffff)
jr ra
```
The assembly codes I replaced are with optimized versions that improve performance, including byte array of harley hash table, count leading zero function and the if-statement in sub_shift.
```diff!
.data
intput: .word 500, 400, 31321, 64, 1, 0
+# hash map
+table:
+ .byte 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF
+ .byte 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF
+ .byte 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF
+ .byte 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF
+ .byte 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF
+ .byte 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13
+ .byte 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25
+ .byte 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF
str1: .string "After "
str2: .string " iterations, the square root of "
str3: .string " is "
str4: .string "\n"
```
```diff!
clz:
mv t0, s0 # t0 = x
mv t1, t0 # t1 = t0
srli t0, t0, 1 # x >> 1
or t0, t0, t1 # x = x | x >> 1
srli t1, t0, 2 # x >> 2
or t0, t0, t1 # x = x | x >> 2
srli t1, t0, 4 # x >> 4
or t0, t0, t1 # x = x | x >> 4
srli t1, t0, 8 # x >> 8
or t0, t0, t1 # x = x | x >> 8
srli t1, t0, 16 # x >> 16
or t0, t0, t1 # x = x | x >> 16
# count ones (population count)
- srli t1, t0, 1 # x >> 1
- li t6, 0x55555555 # load mask 0x55555555
- and t1, t1, t6 # (x >> 1) & 0x55555555
- sub t0, t0, t1 # x -= ((x >> 1) & 0x55555555)
- srli t1, t0, 2 # x >> 2
- li t6, 0x33333333 # load mask 0x33333333
- and t1, t1, t6 # (x >> 2) & 0x33333333
- and t0, t0, t6 # x & 0x33333333
- add t0, t0, t1 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
- srli t1, t0, 4 # x >> 4
- add t0, t0, t1 # x + (x >> 4)
- li t6, 0x0f0f0f0f # load mask 0x0f0f0f0f
- and t0, t0, t6 # x = ((x >> 4) + x) & 0x0f0f0f0f
- srli t1, t0, 8 # x >> 8
- add t0, t1, t0 # x + (x >> 8)
- srli t1, t0, 16 # x >> 16
- add t0, t1, t0 # x + (x >> 16)
- andi t0, t0, 0x7f # x & 0x7f
- addi t1, x0, 32 # t1 = 32
- sub a0, t1, t0 # return (32 - (x & 0x7f))
+ slli t1, t0, 3
+ sub t0, t1, t0 # x = (x << 3) - x
+ slli t1, t0, 8
+ sub t0, t1, t0 # x = (x << 8) - x
+ slli t1, t0, 8
+ sub t0, t1, t0 # x = (x << 8) - x
+ slli t1, t0, 8
+ sub t0, t1, t0 # x = (x << 8) - x
+ srli t0, t0, 26 # x >> 26
+ la t1, table
+ add t1, t1, t0
+ lbu t0, 0(t1) # Table[x >> 26]
+ li t1, 31
+ sub a0, t1, t0 # 31 - Table[x >> 26]
jr ra
```
```diff!
shift_sub:
bge t3, t4, div3 # if i >= 25 go to div3
slli a5, a5, 1 # r = r << 1
bltu t1, t2, next # if (siga < sigb) go to next
sub t1, t1, t2 # siga = siga - sigb
ori a5, a5, 1 # r = r | 1
+ bne t1, zero, next # check if the reminder is zero
+ li t6, 24 # load integer 24
+ sub t6, t6, t3 # 24-i
+ sll a5, a5, t6 # r = r << (24-i)
+ jal zero, div3 # break, jump to div3
```
### Compiling using RISC-V gcc
Commands about RISC-V gcc compiler:
* Generate assembly file from .c file
```bash!
riscv-none-elf-gcc -S hw2_origin.c -O0 -o hw2_origin_O0.s -march=rv32i -mabi=ilp32
```
* Generate .o file from .S file
```bash!
riscv-none-elf-as -R -march=rv32i -mabi=ilp32 -o hw2_origin_O0.o hw2_origin_O0.s
```
* Combine the assembly with the linking info, create program in executable and linkable format (.elf)
```bash!
riscv-none-elf-ld -o hw2_origin_O0.elf -T hw2_origin_O0.ld --oformat=elf32-littleriscv hw2_origin_O0.o
```
Retrieve the ELF file information using the following command in the terminal:
```bash!
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O'x' hw2_origin.c -o hw2_origin.elf
```
, where ```'x'``` represents the optimization level, which can be set to 0, 1, 2, and so on. To obtain the section sizes and the total size of the object file, I can use this command to display its argument list:
```bash!
riscv-none-elf-size hw2_origin.elf
```
Expected output:
```bash
text data bss dec hex filename
77700 2320 1544 81564 13e9c hw2_origin.elf
```
I compile the original c file, ```hw2_origin.c``` with different optimization level, the result is the following:
| Level | O0 | O1 | O2 | O3 | Ofast |
| ----- | ----- | ----- | ----- | ----- | ----- |
| text | 77700 | 76792 | 76816 | 77456 | 77436 |
| data | 2320 | 2324 | 2324 | 2320 | 2320 |
| bss | 1544 | 1544 | 1544 | 1544 | 1544 |
| dec | 81564 | 80660 | 80684 | 81320 | 81300 |
| hex | 13e9c | 13b14 | 13b2c | 13da8 | 13d94 |
I compile the improved c file, ```hw2_imporved.c``` with different optimization level, the result is the following:
| Level | O0 | O1 | O2 | O3 | Ofast |
| ----- | ----- | ----- | ----- | ----- | ----- |
| text | 77792 | 76848 | 76884 | 77492 | 77472 |
| data | 2320 | 2324 | 2324 | 2320 | 2320 |
| bss | 1544 | 1544 | 1544 | 1544 | 1544 |
| dec | 81656 | 80716 | 80752 | 81356 | 81336 |
| hex | 13ef8 | 13b4c | 13b70 | 13dcc | 13db8 |
### Evaluate computation cycles in C file
Use the code in the [perfcounter](https://github.com/sysprog21/rv32emu/tree/master/tests/perfcounter) to obtain the execution clock cycle of the program. This code utilizes the CSR register. In other words, In other words, I am only measuring the clock cycles taken by the ```clz``` and ```sqrt``` functions.
Write a makefile for c file, taking ```hw2_origin.c``` for example:
```
.PHONY: clean
include ../../../rv32emu/mk/toolchain.mk
CFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32 -O0 -Wall
OBJS = \
getcycles.o \
getinstret.o \
sparkle.o \
hw2_origin.o
BIN = hw2_origin.elf
%.o: %.S
$(CROSS_COMPILE)gcc $(CFLAGS) -c -o $@ $<
%.o: %.c
$(CROSS_COMPILE)gcc $(CFLAGS) -c -o $@ $<
all: $(BIN)
$(BIN): $(OBJS)
$(CROSS_COMPILE)gcc -o $@ $^
clean:
$(RM) $(BIN) $(OBJS)
```
#### hw2_origin.c
Output on terminal:
```bash!
After 5 iterations, the square root of 500 is 22.360680
Cycle count: 6411
Instructions retired: 6401
After 4 iterations, the square root of 400 is 20.000000
Cycle count: 4859
Instructions retired: 4859
After 5 iterations, the square root of 31321 is 176.977386
Cycle count: 6563
Instructions retired: 6563
After 2 iterations, the square root of 64 is 8.000000
Cycle count: 2520
Instructions retired: 2520
After 2 iterations, the square root of 1 is 1.000000
Cycle count: 2520
Instructions retired: 2520
After 6 iterations, the square root of 0 is 0.015625
Cycle count: 5832
Instructions retired: 5832
inferior exit code 0
```
#### hw2_improved.c
Output on terminal:
```bash
After 5 iterations, the square root of 500 is 22.360680
Cycle count: 6885
Instructions retired: 6875
After 4 iterations, the square root of 400 is 20.000000
Cycle count: 3456
Instructions retired: 3456
After 5 iterations, the square root of 31321 is 176.977386
Cycle count: 6903
Instructions retired: 6903
After 1 iterations, the square root of 64 is 8.000000
Cycle count: 744
Instructions retired: 744
After 1 iterations, the square root of 1 is 1.000000
Cycle count: 744
Instructions retired: 744
After 1 iterations, the square root of 0 is 0.000000
Cycle count: 289
Instructions retired: 289
inferior exit code 0
```
#### hw2_improved_harley.c
Output on terminal:
```bash!
After 5 iterations, the square root of 500 is 22.360680
Cycle count: 6865
Instructions retired: 6855
After 4 iterations, the square root of 400 is 20.000000
Cycle count: 3436
Instructions retired: 3436
After 5 iterations, the square root of 31321 is 176.977386
Cycle count: 6883
Instructions retired: 6883
After 1 iterations, the square root of 64 is 8.000000
Cycle count: 724
Instructions retired: 724
After 1 iterations, the square root of 1 is 1.000000
Cycle count: 724
Instructions retired: 724
After 1 iterations, the square root of 0 is 0.000000
Cycle count: 227
Instructions retired: 227
inferior exit code 0
```
#### CSR
I arranged the following sheet based on the results printed in the terminal. We can observe that ```hw2_origin.c``` performs the worst, excluding the input value ```500``` and ```31321```. When comparing ```hw2_origin.c``` to ```hw2_improved.c```, the latter handles the other input values quite well. This improvement is due to the if-statement in ```div_shift```, which noticeably reduces CSR cycles. However, the disadvantage of this modification is an increase in the cycle count during each iteration. ```hw2_improved_harley.c``` is an advanced improvement over ```hw2_improved.c```, which implements the Harley hash table that covers the ```clz``` function. This sheet represents the O0, O1 and Ofast optimized C code.
**O0 Optimization**
| file name | ```500``` | ```400``` | ```31321``` | ```64``` | ```1``` | ```0``` |
| --------------------- | ---- | ---- | ----- | ---- | ---- | ---- |
| hw2_origin.c | ==6411== | 4859 | ==6563== | 2520 | 2520 | 5832 |
| hw2_improved.c | 6885 | 3456 | 6903 | 744 | 744 | 289 |
| hw2_improved_harley.c | 6865 | ==3436== | 6883 | ==724== | ==724== | ==227== |
**O1 Optimization**
| file name | ```500``` | ```400``` | ```31321``` | ```64``` | ```1``` | ```0``` |
| --------------------- | ---- | ---- | ----- | ---- | ---- | ---- |
| hw2_origin.c | ==2436== | 1837 | ==2565== | 1032 | 1032 | 2186 |
| hw2_improved.c | 2740 | 1454 | 2824 | 417 | 416 | 209 |
| hw2_improved_harley.c | 2731 | ==1444== | 2814 | ==407== | ==406== | ==178== |
**Ofast Optimization**
| file name | ```500``` | ```400``` | ```31321``` | ```64``` | ```1``` | ```0``` |
| --------------------- | ---- | ---- | ----- | ---- | ---- | ---- |
| hw2_origin.c | ==2080== | 1641 | ==2197== | 984 | 984 | 2092 |
| hw2_improved.c | 2249 | 1105 | 2324 | 354 | 354 | 220 |
| hw2_improved_harley.c | 2243 | ==1099== | 2318 | ==348== | ==348== | ==187== |
### Write a Faster / Smaller Assembly program
**hw2_origin.S**
In ```hw2_origin.S```, I did some modification to print out the cycle count numbers and output values :
```diff!
print:
...
- la a0, str1 # after
- li a7, 4
+ li a0, 1
+ la a1, str1 # "after "
+ la a2, str1_size
+ li a7, SYSWRITE
ecall
...
- mv a0, s0 # input
- li a7, 1
+ li a0, 1
+ mv a1, s0 # input
+ li a7, PRINT_INT
ecall
...
- mv a0, a1
- li a7, 2
+ li a0, 1
+ mv a1, s2 # sqrt
+ li a7, PRINT_HEX
ecall
...
end:
- li a7, 10
+ mv ra ,x0
+ li a7, SYSEXIT
ecall
...
+ get_cycles:
+ csrr a1, cycleh
+ csrr a0, cycle
+ csrr a2, cycleh
+ bne a1, a2, get_cycles
+ ret
```
To use the cssr instruction, we need to modify the makefile:
```diff!
- ASFLAGS = -march=rv32i -mabi=ilp32
+ ASFLAGS = -march=rv32i_zicsr -mabi=ilp32 -mabi=ilp32
```
Since ```fprintf``` has some problem in printing floating point, I replace it by the number in hex format. After checking, the output values are correct.
Output in terminal:
```bash!
After 1363 iterations, the square root of 500 is 41b2e2ac
Cycle count: 1363
After 1095 iterations, the square root of 400 is 41a00000
Cycle count: 1095
After 1374 iterations, the square root of 31321 is 4330fa36
Cycle count: 1374
After 603 iterations, the square root of 64 is 41000000
Cycle count: 603
After 603 iterations, the square root of 1 is 3f800000
Cycle count: 603
After 1372 iterations, the square root of 0 is 3f3504f3
Cycle count: 1372
inferior exit code 2
```
**hw2_improved.S**
```bash!
After 1408 iterations, the square root of 500 is 41b2e2ac
Cycle count: 1408
After 839 iterations, the square root of 400 is 41a00000
Cycle count: 839
After 1423 iterations, the square root of 31321 is 4330fa36
Cycle count: 1423
After 225 iterations, the square root of 64 is 41000000
Cycle count: 225
After 225 iterations, the square root of 1 is 3f800000
Cycle count: 225
After 70 iterations, the square root of 0 is 00000000
Cycle count: 70
inferior exit code 2
```
| info | hw2_improved.S | hw2_origin.S |
| ---- | -------------- |:------------ |
| text | 1036 | 968 |
| data | 0 | 0 |
| bss | 0 | 0 |
| dec | 1036 | 968 |
| hex | 40c | 3c8 |
| file name | ```500``` | ```400``` | ```31321``` | ```64``` | ```1``` | ```0``` |
|:-------------- |:-------- | ------- | -------- | ------- |:------- |:------ |
| hw2_origin.S | ==1363== | 1095 | ==1374== | 603 | 603 | 1372 |
| hw2_improved.S | 1408 | ==839== | 1423 | ==225== | ==225== | ==70== |
## Compare computing cycles between Ripes and RV32emu
According to these figures, there are two set of information in Ripe. Example in ```hw2_improved.S``` performs better than example in ```hw2_origin.S``` . It's worth noting that both sets of values are derived from the results of running the entire test input suite.
**hw2_origin.S**

**hw2_improved.S**

## Estimating a histogram in the rv32emu toolchain
Look into the instructions histogram generated by `rv_histogram`
## Reference
* [rv32emu](https://github.com/sysprog21/rv32emu/tree/master)
* [Why to Learn C programming?](https://www.tutorialspoint.com/cprogramming/index.htm)
* [Assignment1: RISC-V Assembly and Instruction Pipeline-高紹捷](https://hackmd.io/@F3cNkb4bSWKg00J7O-_y8w/rJvzACwgT?utm_source=preview-mode&utm_medium=rec)
* [Power of CLZ instruction in numerical computations](https://ieeexplore.ieee.org/document/8290400)
:::warning
:warning: Refer to the official **C standard** documentation for accurate information, rather than relying on web pages that may not be considered authoritative sources.
:notes: jserv
:::