# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < `guaneec` >
###### tags: `arch2020`
## The algorithm
Reference algorithm in Python:
```python3=
def printpi(n):
m = n * 10 // 3
a = [2] * m
pred = []
for _ in range(n):
a = [x * 10 for x in a]
for i in range(m - 1, 0, -1):
q, r = divmod(a[i], 2 * i + 1)
a[i] = r
a[i - 1] += q * i
d, a[0] = divmod(a[0], 10)
if d <= 8:
for p in pred:
print(p, end='')
pred = [d]
else:
if d == 9:
pred.append(9)
else:
for p in pred:
print((p + 1) % 10, end='')
pred = [0]
```
```python
>>> printpi(50)
3141592653589793238462643383279502884197169399375
```
For how it works, see [A Spigot Algorithm for the Digits of π (1995)](https://www.maa.org/sites/default/files/pdf/pubs/amm_supplements/Monthly_Reference_12.pdf).
TL;DR:
$$
\pi = 2 + \frac 1 3 \left(2 + \frac 2 5 \left(2 + \frac37 \left( 2 + \frac49 \left( 2+\dots \right) \right) \right) \right)
$$
$\pi$ in the mixed radix base $c=(\frac13, \frac25, \frac37, \frac49, \dots)$ is $(2;2, 2,2, 2, ...)$. Retrieving digits of $\pi$ in decimal is a matter of conversion of bases.

(taken from the paper)
:::info
**Spigot**: *noun*, a tap
The algorithm has the name because the digits flow out 1 by 1, like a spigot:
```
=()=
,/'\_||_
( (___ `.
`\./ `=='
:
:
9
5
1
4
1
3
```
[Tap art source](https://ascii.co.uk/art/faucet)
:::
A few properties of this algorithm:
* Only integer arithmetic is involved, so no floating point precision issues
* $O(n^2)$ time, $O(n)$ space
* The values in the buffer does not exceed $20 \lfloor 10n/3 \rfloor - 10$
:::info
**See also**
15000 digits of π in 133 characters of C:
> ```cpp
> a[52514],b,c=52514,d,e,f=1e4,g,h;main(){for(;b=c-=14;h=printf("%04d",
> e+d/f))for(e=d%=f;g=--b*2;d/=g)d=d*b+f*(h?a[b]:f/5),a[b]=d%--g;}
> ```
> by Arndt and Haenel in *π Unleashed, page 37*
[Explanation](https://stackoverflow.com/a/20288006)
Comment: This method doesn't work for arbitrarily large n because the calculation of predigits is omitted.
:::
:::info
**See also**
A "constant space" algorithm can be found in [Unbounded spigot algorithms for the digits of pi (Gibbons, J., 2006)](https://www.cs.ox.ac.uk/people/jeremy.gibbons/publications/spigot.pdf). A Haskell implementation `piG3` is provided in the paper, and [Python implementations](http://davidbau.com/downloads/pi.py) have been written by others. Unfortunately, the algorithm requires arbitrary precision integer arithmetic and therefore is not that simple to implement in C or RISC-V assembly.
:::
## Preprocessing
[Ripes' version of RISC-V](https://github.com/mortbopet/Ripes/wiki/RISC-V-Assembly-Programmer's-Manual-(Adapted-for-Ripes)#pseudo-ops) lacks the directive `.equ` available in the [original version](https://github.com/riscv/riscv-asm-manual/blob/master/riscv-asm.md), which allows definition of compile time constants. I actually need this because I need to allocate memory beforehand (how do I malloc?). To overcome this, I wrote a [little template in Python](https://github.com/guaneec/arch2020/blob/master/a1/template.py).
There are a few benefits that come with this.
* The size of an array element can be selected easily (byte/half/word)
* Arrays can be pre-filled with values
* Compile time constant expressions (`m = 10 * n / 3`)
### Representing arrays in the memory
In the code above, there are two arrays, `a` and `pred`, whose size can exceed the number of registers and thus must be stored in the memory. The size of `a` is fixed once it's initalized while `pred` can grow to at most n. The arrays are placed in the data section.
```x
# RISC-V
.data
arr: .half 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
pred: .zero 200
.text
main:
addi sp sp -28
sw s0 0(sp)
sw s1 4(sp)
sw s2 8(sp)
sw s4 12(sp)
sw s5 16(sp)
sw s7 20(sp)
sw ra 24(sp)
la s0 arr # arr.begin()
la s1 pred # arr.end(), pred.begin()
la s2 pred # pred.end()
# s4: main loop counter
# s5: digit output
# s7: m
li s4 100
li s7 333
L1: # for _ in range(n)
ble s4 zero L1E
li t0 10
mv t1 s0
L2: # [x * 10 for x in a]
beq t1 s1 L2E
lh t2 0(t1)
mul t2 t2 t0
sh t2 0(t1)
addi t1 t1 2
j L2
L2E:
addi t0 s7 -1 # t0: i
slli t1 t0 1
add t1 s0 t1 # t1: a.begin() + i
L3: # for i in range(m - 1, 0, -1):
ble t0 zero L3E
lh t5 0(t1)
slli t2 t0 1
addi t2 t2 1
divu t3 t5 t2
remu t4 t5 t2
sh t4 0(t1)
mul t3 t3 t0
lh t4 -2(t1)
add t3 t3 t4
sh t3 -2(t1)
addi t0 t0 -1
addi t1 t1 -2
j L3
L3E:
# d, a[0] = divmod(a[0], 10)
lh t0 0(s0)
li t1 10
divu s5 t0 t1
remu t3 t0 t1
sh t3 0(s0)
li t1 8
# C0: if d <= 8
bgt s5 t1 C0
mv t0 s1 # t0: pred.begin() + i
L4: # for p in pred:
beq t0 s2 L4E
lb t3 0(t0)
# print
mv a0 t3
li a7 1
ecall
addi t0 t0 1
j L4
L4E:
addi s2 s1 1 # pred.end() = pred.begin() + 1
sb s5 0(s1) # pred[0] = d
j C0E
C0:
li t0 9
# if d == 9:
bne s5 t0 C1
sb s5 0(s2)
addi s2 s2 1
j C1E
C1:
mv t0 s1 # t0: pred.begin() + i
L5: # for p in pred
beq t0 s2 L5E
lb t2 0(t0)
addi t2 t2 1
li t3 10
remu a0 t2 t3
li a7 1
ecall
addi t0 t0 1
j L5
L5E:
# pred = [0]
sb zero 0(s1)
addi s2 s1 1
C1E:
C0E:
addi s4 s4 -1
j L1
L1E:
lw s0 0(sp)
lw s1 4(sp)
lw s2 8(sp)
lw s4 12(sp)
lw s5 16(sp)
lw s7 20(sp)
lw ra 24(sp)
addi sp sp 28
```
Instructions used:
```
add
addi
beq
bgt
ble
bne
divu
ecall
j
la
lb
lh
li
lw
mul
mv
remu
sb
sh
slli
sw
```
## Running
### `addi sp sp -32`
Let's look at how a single instruction `addi sp sp -32` go through the five stages
#### IF

In the IF stage, the instruction is fetched from the instruction memory and written into the IFID registers.
#### ID

In the ID stage, at the bottom, the index of the destination register is decoded (sp=x2). Also extracted is the immediate value. The indices of the source register (x2) is sent into the registers and its content is sent to the next stage.
#### EX

In EX, The values are then passed to the ALU, which performs an addition according to the opcode.
#### MEM

In MEM, since this is not a load/store instruction, the memory unit is disabled.
#### WB

In WB, the result is fed back to the registers. WrEn of the registers is enabled and only after this stage the modified x2 is set to the new value.
### `sh t3 -2(t1)`
#### IF
Same
#### ID

The offset -2 is extracted as an immediate.
#### EX

The ALU adds the address and offset. The value of reg2 is not actually used because a modified value (0x14) from a later stage is forwarded.
#### MEM

Write is enabled, the value 0x14 is written to the address with offset.
#### WB

The registers' wr_en is cleared since there's no register to write to.
---
Pipeline flush caused by a jump

Bubble caused by data hazard (RAW)

## Getting compiled code to work
Ripes comes with support of C source code, provided a compiler targeting RISC-V is installed.
For those who don't want to or cannot install the compiler (Windows user), the online [Compiler Explorer](https://godbolt.org) is useful.
There's a page titled [Adapting Compiler Explorer generated RISC V assembly code](https://github.com/mortbopet/Ripes/wiki/Adapting-Compiler-Explorer-generated-RISC-V-assembly-code) in Ripes' Wiki.
Let's run our li'l snippet of C for pi in Ripes.
```cpp
a[52514],b,c=52514,d,e,f=1e4,g,h;main(){for(;b=c-=14;h=printf("%04d",
e+d/f))for(e=d%=f;g=--b*2;d/=g)d=d*b+f*(h?a[b]:f/5),a[b]=d%--g;}
```
The compiled assembly can be grabbed from the [Compiler Explorer](https://godbolt.org/z/oaaddf). Just copy-n-paste and it should work, right?
Alas, things are not so simple.

Red squiggly lines. Lots of them.
To make this work, I had to make a few changes.
* change instructions with `%lo` and `%hi` to equivalent calls of `la`
* replace the printf function
* mark the `.text` and `.data` section
* set the size of the array to be under 65536 bytes
* Replace the `ret` at the end of `main` with a jump to the end of the file (Ripes won't halt otherwise)
I replaced the `printf` in the C source with a custom function
```cpp
int print4(int x);
a[364], b, c = 364, d, e, f = 1e4, g, h;
main() {
for (; b = c -= 14; h = print4(e + d / f))
for (e = d %= f; g = --b * 2; d /= g)
d = d * b + f * (h ? a[b] : f / 5), a[b] = d % --g;
}
// int print4(int x) { return printf("%04d", x); }
```
The size of the array is also reduced.
The definition is in handcrafted in RISC-V assembly (ripes)
```python
# print4.s
print4:
addi sp sp -8
sw s0 0(sp)
sw s1 4(sp)
li s1 10
mv s0 a0
li t0 1000
div a0 s0 t0
li a7 1
ecall
li t0 100
div a0 s0 t0
rem a0 a0 s1
li a7 1
ecall
li t0 10
div a0 s0 t0
rem a0 a0 s1
li a7 1
ecall
rem a0 s0 s1
li a7 1
ecall
li a0 4
lw s0 0(sp)
lw s1 4(sp)
addi sp sp 8
ret
```
I wrote a [conversion script](https://github.com/guaneec/arch2020/blob/master/a1/ripes.sh) to automate this.
Full workflow:
* Write C
* Remove implementation of functions that requires an `ecall`
* Rewrite these functions in Ripes RISC-V in `lib.s`
* Grab [compiled RISC-V from Compiler Explorer](https://godbolt.org/z/d6ohrz), save as IN_FILENAME
* Run `./ripes.sh <IN_FILENAME >OUT_FILENAME`
Generated assembly:
```
.data
c:
.word 364 # 0x16c
f:
.word 10000 # 0x2710
b:
.word 0 # 0x0
d:
.word 0 # 0x0
e:
.word 0 # 0x0
g:
.word 0 # 0x0
h:
.word 0 # 0x0
a:
.zero 1456
.text
main: # @main
addi sp, sp, -48
sw ra, 44(sp)
sw s0, 40(sp)
sw s1, 36(sp)
sw s2, 32(sp)
sw s3, 28(sp)
sw s4, 24(sp)
sw s5, 20(sp)
sw s6, 16(sp)
sw s7, 12(sp)
sw s8, 8(sp)
sw s9, 4(sp)
sw s10, 0(sp)
la s3, c
lw a0, 0(s3)
addi a3, a0, -14
sw a3, 0(s3)
la s9, b
sw a3, 0(s9)
beqz a3, .LBB0_9
la s2, f
la s10, d
la s4, e
la s5, g
la s6, h
lui a0, 419430
addi s7, a0, 1639
la a0, a
addi s8, a0, 0
j .LBB0_4
.LBB0_2: # in Loop: Header=BB0_4 Depth=1
sw a1, 0(s10)
sw zero, 0(s5)
sw zero, 0(s9)
.LBB0_3: # in Loop: Header=BB0_4 Depth=1
div a0, a1, a7
add a0, a0, a6
call print4
lw a1, 0(s3)
sw a0, 0(s6)
addi a3, a1, -14
sw a3, 0(s3)
sw a3, 0(s9)
beqz a3, .LBB0_9
.LBB0_4: # =>This Loop Header: Depth=1
lw a7, 0(s2)
lw a1, 0(s10)
rem a6, a1, a7
sw a6, 0(s10)
sw a6, 0(s4)
addi a2, a3, -1
sw a2, 0(s9)
slli a1, a2, 1
sw a1, 0(s5)
add a1, zero, a6
beqz a2, .LBB0_3
lw a1, 0(s6)
seqz a4, a1
mulh a1, a7, s7
srli a5, a1, 31
srai a1, a1, 1
add s1, a1, a5
slli a1, a3, 1
addi s0, a1, -3
slli a1, a3, 2
add a1, a1, s8
addi a3, a1, -4
add a1, zero, a6
j .LBB0_7
.LBB0_6: # in Loop: Header=BB0_7 Depth=2
mul a1, a1, a2
mul a5, a5, a7
add a5, a5, a1
div a1, a5, s0
mul a0, a1, s0
sub a0, a5, a0
sw a0, 0(a3)
addi a2, a2, -1
addi s0, s0, -2
addi a3, a3, -4
beqz a2, .LBB0_2
.LBB0_7: # Parent Loop BB0_4 Depth=1
add a5, zero, s1
bnez a4, .LBB0_6
lw a5, 0(a3)
j .LBB0_6
.LBB0_9:
mv a0, zero
lw s10, 0(sp)
lw s9, 4(sp)
lw s8, 8(sp)
lw s7, 12(sp)
lw s6, 16(sp)
lw s5, 20(sp)
lw s4, 24(sp)
lw s3, 28(sp)
lw s2, 32(sp)
lw s1, 36(sp)
lw s0, 40(sp)
lw ra, 44(sp)
addi sp, sp, 48
j end
print4:
addi sp sp -8
sw s0 0(sp)
sw s1 4(sp)
li s1 10
mv s0 a0
li t0 1000
div a0 s0 t0
li a7 1
ecall
li t0 100
div a0 s0 t0
rem a0 a0 s1
li a7 1
ecall
li t0 10
div a0 s0 t0
rem a0 a0 s1
li a7 1
ecall
rem a0 s0 s1
li a7 1
ecall
li a0 4
lw s0 0(sp)
lw s1 4(sp)
addi sp sp 8
ret
end:
```