# Assignment2: RISC-V Toolchain
contributed by < `OscarShiang` >
## Rewrite [Sqrt(x)](https://hackmd.io/@30vhEV7FQECcWeCF1eAN5A/ry44_j8rK)
I implement my code based on [李政憲](https://hackmd.io/@30vhEV7FQECcWeCF1eAN5A/ry44_j8rK)'s one. Because I am curious about the way to compute the square root of a number without using brute force.
I have rewritten the code to find `floor(sqrt(x))` with binary search approach.
```c
#include <stdint.h>
#include "math.h"
#include "print.h"
uint32_t mysqrt(uint32_t x)
{
uint64_t lo = 0, hi = x + 1u;
while (lo < hi) {
uint64_t mi = (hi - lo) / 2 + lo;
if (mul(mi, mi) <= x)
lo = mi + 1;
else
hi = mi;
}
return lo - 1;
}
void _start()
{
int a = 4096;
print_str("sqrt(");
print_int(a);
print_str(") = ");
print_int(mysqrt(a));
print_str("\n");
}
```
### Execution Result
```shell
$ emu-rv32i ./sqrt
sqrt(4096) = 64
>>> Execution time: 4930683 ns
>>> Instruction count: 78360 (IPS=15892321)
>>> Jumps: 4700 (6.00%) - 110 forwards, 4590 backwards
>>> Branching T=4592 (98.18%) F=85 (1.82%)
```
## GNU tools
### Optimization
The code is generated by `riscv-none-embed-gcc` with `-O3` optimization and disassembled bt `riscv-none-embed-objdump`
```
./sqrt: file format elf32-littleriscv
Disassembly of section .text:
00010054 <mysqrt>:
10054: fe010113 addi sp,sp,-32
10058: 01512223 sw s5,4(sp)
1005c: 00112e23 sw ra,28(sp)
10060: 00812c23 sw s0,24(sp)
10064: 00912a23 sw s1,20(sp)
10068: 01212823 sw s2,16(sp)
1006c: 01312623 sw s3,12(sp)
10070: 01412423 sw s4,8(sp)
10074: 01612023 sw s6,0(sp)
10078: 00150a93 addi s5,a0,1
1007c: 0a0a8e63 beqz s5,10138 <mysqrt+0xe4>
10080: 00050993 mv s3,a0
10084: 00000b13 li s6,0
10088: 00000a13 li s4,0
1008c: 00000913 li s2,0
10090: 414a87b3 sub a5,s5,s4
10094: 00fab733 sltu a4,s5,a5
10098: 412b0433 sub s0,s6,s2
1009c: 40e40433 sub s0,s0,a4
100a0: 01f41713 slli a4,s0,0x1f
100a4: 0017d793 srli a5,a5,0x1
100a8: 00f767b3 or a5,a4,a5
100ac: 014784b3 add s1,a5,s4
100b0: 00145413 srli s0,s0,0x1
100b4: 01240433 add s0,s0,s2
100b8: 00f4b7b3 sltu a5,s1,a5
100bc: 00048593 mv a1,s1
100c0: 00048513 mv a0,s1
100c4: 00878433 add s0,a5,s0
100c8: 50c000ef jal ra,105d4 <mul>
100cc: 04059863 bnez a1,1011c <mysqrt+0xc8>
100d0: 00148793 addi a5,s1,1
100d4: 0097b733 sltu a4,a5,s1
100d8: 04a9e263 bltu s3,a0,1011c <mysqrt+0xc8>
100dc: 00870933 add s2,a4,s0
100e0: 00078a13 mv s4,a5
100e4: fb6966e3 bltu s2,s6,10090 <mysqrt+0x3c>
100e8: 012b1463 bne s6,s2,100f0 <mysqrt+0x9c>
100ec: fb57e2e3 bltu a5,s5,10090 <mysqrt+0x3c>
100f0: fffa0513 addi a0,s4,-1
100f4: 01c12083 lw ra,28(sp)
100f8: 01812403 lw s0,24(sp)
100fc: 01412483 lw s1,20(sp)
10100: 01012903 lw s2,16(sp)
10104: 00c12983 lw s3,12(sp)
10108: 00812a03 lw s4,8(sp)
1010c: 00412a83 lw s5,4(sp)
10110: 00012b03 lw s6,0(sp)
10114: 02010113 addi sp,sp,32
10118: 00008067 ret
1011c: 00897863 bgeu s2,s0,1012c <mysqrt+0xd8>
10120: 00048a93 mv s5,s1
10124: 00040b13 mv s6,s0
10128: f69ff06f j 10090 <mysqrt+0x3c>
1012c: fd2412e3 bne s0,s2,100f0 <mysqrt+0x9c>
10130: fe9a68e3 bltu s4,s1,10120 <mysqrt+0xcc>
10134: fbdff06f j 100f0 <mysqrt+0x9c>
10138: fff00513 li a0,-1
1013c: fb9ff06f j 100f4 <mysqrt+0xa0>
```
Compare the code generated with `-O3` and `-O0`, I find that
- the `-O3` code size is bigger than `-O0` one. Causing several loops to be unrolled.
- `-O3` code deals with corner cases. (case `0` at `1007c` in above code)
- the compiler with `-O3` try to detect the functions whose usage is similar to gcc builtin and replace them with gcc ones.
- Functions like (`put_char`) may be inlined to eliminate the allocation and release of stack memory.
- The constant data like the address of `tx` has been coded into `.text` section instead of storing in `.data` section.
Accoring to the above differences, the code with `O3` optimization has less branches and smaller execution time.
```shell
$ emu-rv32i ./sqrt
sqrt(4096) = 64
>>> Execution time: 331105 ns
>>> Instruction count: 2105 (IPS=6357499)
>>> Jumps: 341 (16.20%) - 164 forwards, 177 backwards
>>> Branching T=260 (71.04%) F=106 (28.96%)
```
### Code size
We can see that the size of text section
```shell
# with -O0 flag
$ riscv-none-embed-size ./sqrt
text data bss dec hex filename
1192 4 0 1196 4ac ./sqrt
# with -O3 flag
$ riscv-none-embed-size ./sqrt
text data bss dec hex filename
1666 0 0 1666 682 ./sqrt
```
## Implement print function
To print out the result of computing, we need to implement our own print function.
`rv32emu` provides the address of UART output pin, which is at `0x40002000`.
Base on this address, a simple print function , which can print integer in decimal form, can be implemented as below:
```cpp
void print_int(int a)
{
bool printed = false;
for (int i = 0; i < 10; i++) {
if (a >= pow10[i]) {
int rem;
printed = true;
put_char(div(a, pow10[i], &rem) + '0');
a = rem;
} else if (printed) {
put_char('0');
}
}
if (!printed)
put_char('0');
}
```
The function `put_char` is a wrapper of the following statement:
```cpp
static void put_char(char c)
{
*tx = c;
}
```
The reason we named `put_char` is to avoid confliction to the glibc `putchar` signature.
Since we need to print the number in decimal form, we must do division and take remainder.
However we only use `rv32i` as the default architecture. I append `div` and `mul` function which implemented by loops to archieve it.
The `div` function can return the remainder if we provide a address of a variable to the third parameter of that function.
```cpp
int div(int dividend, int divisor, int *rem)
{
int quotient = 0;
while (dividend >= divisor) {
quotient++;
dividend -= divisor;
}
if (rem) {
*rem = dividend;
}
return quotient;
}
uint64_t mul(uint32_t a, uint32_t b)
{
uint64_t res = 0;
while (b--) {
res += a;
}
return res;
}
```
As a result, we can have some output like
```shell
sqrt(4096) = 64
```
## Reference
###### tags: `arch2021`