Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < `guaneec` > ###### tags: `arch2020` ## The algorithm Reference algorithm in Python: ```python3= def printpi(n): m = n * 10 // 3 a = [2] * m pred = [] for _ in range(n): a = [x * 10 for x in a] for i in range(m - 1, 0, -1): q, r = divmod(a[i], 2 * i + 1) a[i] = r a[i - 1] += q * i d, a[0] = divmod(a[0], 10) if d <= 8: for p in pred: print(p, end='') pred = [d] else: if d == 9: pred.append(9) else: for p in pred: print((p + 1) % 10, end='') pred = [0] ``` ```python >>> printpi(50) 3141592653589793238462643383279502884197169399375 ``` For how it works, see [A Spigot Algorithm for the Digits of π (1995)](https://www.maa.org/sites/default/files/pdf/pubs/amm_supplements/Monthly_Reference_12.pdf). TL;DR: $$ \pi = 2 + \frac 1 3 \left(2 + \frac 2 5 \left(2 + \frac37 \left( 2 + \frac49 \left( 2+\dots \right) \right) \right) \right) $$ $\pi$ in the mixed radix base $c=(\frac13, \frac25, \frac37, \frac49, \dots)$ is $(2;2, 2,2, 2, ...)$. Retrieving digits of $\pi$ in decimal is a matter of conversion of bases. ![](https://i.imgur.com/HFm7Q33.png) (taken from the paper) :::info **Spigot**: *noun*, a tap The algorithm has the name because the digits flow out 1 by 1, like a spigot: ``` =()= ,/'\_||_ ( (___ `. `\./ `==' : : 9 5 1 4 1 3 ``` [Tap art source](https://ascii.co.uk/art/faucet) ::: A few properties of this algorithm: * Only integer arithmetic is involved, so no floating point precision issues * $O(n^2)$ time, $O(n)$ space * The values in the buffer does not exceed $20 \lfloor 10n/3 \rfloor - 10$ :::info **See also** 15000 digits of π in 133 characters of C: > ```cpp > a[52514],b,c=52514,d,e,f=1e4,g,h;main(){for(;b=c-=14;h=printf("%04d", > e+d/f))for(e=d%=f;g=--b*2;d/=g)d=d*b+f*(h?a[b]:f/5),a[b]=d%--g;} > ``` > by Arndt and Haenel in *π Unleashed, page 37* [Explanation](https://stackoverflow.com/a/20288006) Comment: This method doesn't work for arbitrarily large n because the calculation of predigits is omitted. ::: :::info **See also** A "constant space" algorithm can be found in [Unbounded spigot algorithms for the digits of pi (Gibbons, J., 2006)](https://www.cs.ox.ac.uk/people/jeremy.gibbons/publications/spigot.pdf). A Haskell implementation `piG3` is provided in the paper, and [Python implementations](http://davidbau.com/downloads/pi.py) have been written by others. Unfortunately, the algorithm requires arbitrary precision integer arithmetic and therefore is not that simple to implement in C or RISC-V assembly. ::: ## Preprocessing [Ripes' version of RISC-V](https://github.com/mortbopet/Ripes/wiki/RISC-V-Assembly-Programmer's-Manual-(Adapted-for-Ripes)#pseudo-ops) lacks the directive `.equ` available in the [original version](https://github.com/riscv/riscv-asm-manual/blob/master/riscv-asm.md), which allows definition of compile time constants. I actually need this because I need to allocate memory beforehand (how do I malloc?). To overcome this, I wrote a [little template in Python](https://github.com/guaneec/arch2020/blob/master/a1/template.py). There are a few benefits that come with this. * The size of an array element can be selected easily (byte/half/word) * Arrays can be pre-filled with values * Compile time constant expressions (`m = 10 * n / 3`) ### Representing arrays in the memory In the code above, there are two arrays, `a` and `pred`, whose size can exceed the number of registers and thus must be stored in the memory. The size of `a` is fixed once it's initalized while `pred` can grow to at most n. The arrays are placed in the data section. ```x # RISC-V .data arr: .half 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 pred: .zero 200 .text main: addi sp sp -28 sw s0 0(sp) sw s1 4(sp) sw s2 8(sp) sw s4 12(sp) sw s5 16(sp) sw s7 20(sp) sw ra 24(sp) la s0 arr # arr.begin() la s1 pred # arr.end(), pred.begin() la s2 pred # pred.end() # s4: main loop counter # s5: digit output # s7: m li s4 100 li s7 333 L1: # for _ in range(n) ble s4 zero L1E li t0 10 mv t1 s0 L2: # [x * 10 for x in a] beq t1 s1 L2E lh t2 0(t1) mul t2 t2 t0 sh t2 0(t1) addi t1 t1 2 j L2 L2E: addi t0 s7 -1 # t0: i slli t1 t0 1 add t1 s0 t1 # t1: a.begin() + i L3: # for i in range(m - 1, 0, -1): ble t0 zero L3E lh t5 0(t1) slli t2 t0 1 addi t2 t2 1 divu t3 t5 t2 remu t4 t5 t2 sh t4 0(t1) mul t3 t3 t0 lh t4 -2(t1) add t3 t3 t4 sh t3 -2(t1) addi t0 t0 -1 addi t1 t1 -2 j L3 L3E: # d, a[0] = divmod(a[0], 10) lh t0 0(s0) li t1 10 divu s5 t0 t1 remu t3 t0 t1 sh t3 0(s0) li t1 8 # C0: if d <= 8 bgt s5 t1 C0 mv t0 s1 # t0: pred.begin() + i L4: # for p in pred: beq t0 s2 L4E lb t3 0(t0) # print mv a0 t3 li a7 1 ecall addi t0 t0 1 j L4 L4E: addi s2 s1 1 # pred.end() = pred.begin() + 1 sb s5 0(s1) # pred[0] = d j C0E C0: li t0 9 # if d == 9: bne s5 t0 C1 sb s5 0(s2) addi s2 s2 1 j C1E C1: mv t0 s1 # t0: pred.begin() + i L5: # for p in pred beq t0 s2 L5E lb t2 0(t0) addi t2 t2 1 li t3 10 remu a0 t2 t3 li a7 1 ecall addi t0 t0 1 j L5 L5E: # pred = [0] sb zero 0(s1) addi s2 s1 1 C1E: C0E: addi s4 s4 -1 j L1 L1E: lw s0 0(sp) lw s1 4(sp) lw s2 8(sp) lw s4 12(sp) lw s5 16(sp) lw s7 20(sp) lw ra 24(sp) addi sp sp 28 ``` Instructions used: ``` add addi beq bgt ble bne divu ecall j la lb lh li lw mul mv remu sb sh slli sw ``` ## Running ### `addi sp sp -32` Let's look at how a single instruction `addi sp sp -32` go through the five stages #### IF ![](https://i.imgur.com/8OhQyBZ.png) In the IF stage, the instruction is fetched from the instruction memory and written into the IFID registers. #### ID ![](https://i.imgur.com/PGgeJaM.png) In the ID stage, at the bottom, the index of the destination register is decoded (sp=x2). Also extracted is the immediate value. The indices of the source register (x2) is sent into the registers and its content is sent to the next stage. #### EX ![](https://i.imgur.com/pvwM9Oc.png) In EX, The values are then passed to the ALU, which performs an addition according to the opcode. #### MEM ![](https://i.imgur.com/L5rURdt.png) In MEM, since this is not a load/store instruction, the memory unit is disabled. #### WB ![](https://i.imgur.com/BbC9cmn.png) In WB, the result is fed back to the registers. WrEn of the registers is enabled and only after this stage the modified x2 is set to the new value. ### `sh t3 -2(t1)` #### IF Same #### ID ![](https://i.imgur.com/XEzHdho.png) The offset -2 is extracted as an immediate. #### EX ![](https://i.imgur.com/DGmq1P7.png) The ALU adds the address and offset. The value of reg2 is not actually used because a modified value (0x14) from a later stage is forwarded. #### MEM ![](https://i.imgur.com/1jWiEof.png) Write is enabled, the value 0x14 is written to the address with offset. #### WB ![](https://i.imgur.com/n4kWg31.png) The registers' wr_en is cleared since there's no register to write to. --- Pipeline flush caused by a jump ![](https://i.imgur.com/noJcCA3.png) Bubble caused by data hazard (RAW) ![](https://i.imgur.com/MjSvj82.png) ## Getting compiled code to work Ripes comes with support of C source code, provided a compiler targeting RISC-V is installed. For those who don't want to or cannot install the compiler (Windows user), the online [Compiler Explorer](https://godbolt.org) is useful. There's a page titled [Adapting Compiler Explorer generated RISC V assembly code](https://github.com/mortbopet/Ripes/wiki/Adapting-Compiler-Explorer-generated-RISC-V-assembly-code) in Ripes' Wiki. Let's run our li'l snippet of C for pi in Ripes. ```cpp a[52514],b,c=52514,d,e,f=1e4,g,h;main(){for(;b=c-=14;h=printf("%04d", e+d/f))for(e=d%=f;g=--b*2;d/=g)d=d*b+f*(h?a[b]:f/5),a[b]=d%--g;} ``` The compiled assembly can be grabbed from the [Compiler Explorer](https://godbolt.org/z/oaaddf). Just copy-n-paste and it should work, right? Alas, things are not so simple. ![](https://i.imgur.com/Blbikz8.png) Red squiggly lines. Lots of them. To make this work, I had to make a few changes. * change instructions with `%lo` and `%hi` to equivalent calls of `la` * replace the printf function * mark the `.text` and `.data` section * set the size of the array to be under 65536 bytes * Replace the `ret` at the end of `main` with a jump to the end of the file (Ripes won't halt otherwise) I replaced the `printf` in the C source with a custom function ```cpp int print4(int x); a[364], b, c = 364, d, e, f = 1e4, g, h; main() { for (; b = c -= 14; h = print4(e + d / f)) for (e = d %= f; g = --b * 2; d /= g) d = d * b + f * (h ? a[b] : f / 5), a[b] = d % --g; } // int print4(int x) { return printf("%04d", x); } ``` The size of the array is also reduced. The definition is in handcrafted in RISC-V assembly (ripes) ```python # print4.s print4: addi sp sp -8 sw s0 0(sp) sw s1 4(sp) li s1 10 mv s0 a0 li t0 1000 div a0 s0 t0 li a7 1 ecall li t0 100 div a0 s0 t0 rem a0 a0 s1 li a7 1 ecall li t0 10 div a0 s0 t0 rem a0 a0 s1 li a7 1 ecall rem a0 s0 s1 li a7 1 ecall li a0 4 lw s0 0(sp) lw s1 4(sp) addi sp sp 8 ret ``` I wrote a [conversion script](https://github.com/guaneec/arch2020/blob/master/a1/ripes.sh) to automate this. Full workflow: * Write C * Remove implementation of functions that requires an `ecall` * Rewrite these functions in Ripes RISC-V in `lib.s` * Grab [compiled RISC-V from Compiler Explorer](https://godbolt.org/z/d6ohrz), save as IN_FILENAME * Run `./ripes.sh <IN_FILENAME >OUT_FILENAME` Generated assembly: ``` .data c: .word 364 # 0x16c f: .word 10000 # 0x2710 b: .word 0 # 0x0 d: .word 0 # 0x0 e: .word 0 # 0x0 g: .word 0 # 0x0 h: .word 0 # 0x0 a: .zero 1456 .text main: # @main addi sp, sp, -48 sw ra, 44(sp) sw s0, 40(sp) sw s1, 36(sp) sw s2, 32(sp) sw s3, 28(sp) sw s4, 24(sp) sw s5, 20(sp) sw s6, 16(sp) sw s7, 12(sp) sw s8, 8(sp) sw s9, 4(sp) sw s10, 0(sp) la s3, c lw a0, 0(s3) addi a3, a0, -14 sw a3, 0(s3) la s9, b sw a3, 0(s9) beqz a3, .LBB0_9 la s2, f la s10, d la s4, e la s5, g la s6, h lui a0, 419430 addi s7, a0, 1639 la a0, a addi s8, a0, 0 j .LBB0_4 .LBB0_2: # in Loop: Header=BB0_4 Depth=1 sw a1, 0(s10) sw zero, 0(s5) sw zero, 0(s9) .LBB0_3: # in Loop: Header=BB0_4 Depth=1 div a0, a1, a7 add a0, a0, a6 call print4 lw a1, 0(s3) sw a0, 0(s6) addi a3, a1, -14 sw a3, 0(s3) sw a3, 0(s9) beqz a3, .LBB0_9 .LBB0_4: # =>This Loop Header: Depth=1 lw a7, 0(s2) lw a1, 0(s10) rem a6, a1, a7 sw a6, 0(s10) sw a6, 0(s4) addi a2, a3, -1 sw a2, 0(s9) slli a1, a2, 1 sw a1, 0(s5) add a1, zero, a6 beqz a2, .LBB0_3 lw a1, 0(s6) seqz a4, a1 mulh a1, a7, s7 srli a5, a1, 31 srai a1, a1, 1 add s1, a1, a5 slli a1, a3, 1 addi s0, a1, -3 slli a1, a3, 2 add a1, a1, s8 addi a3, a1, -4 add a1, zero, a6 j .LBB0_7 .LBB0_6: # in Loop: Header=BB0_7 Depth=2 mul a1, a1, a2 mul a5, a5, a7 add a5, a5, a1 div a1, a5, s0 mul a0, a1, s0 sub a0, a5, a0 sw a0, 0(a3) addi a2, a2, -1 addi s0, s0, -2 addi a3, a3, -4 beqz a2, .LBB0_2 .LBB0_7: # Parent Loop BB0_4 Depth=1 add a5, zero, s1 bnez a4, .LBB0_6 lw a5, 0(a3) j .LBB0_6 .LBB0_9: mv a0, zero lw s10, 0(sp) lw s9, 4(sp) lw s8, 8(sp) lw s7, 12(sp) lw s6, 16(sp) lw s5, 20(sp) lw s4, 24(sp) lw s3, 28(sp) lw s2, 32(sp) lw s1, 36(sp) lw s0, 40(sp) lw ra, 44(sp) addi sp, sp, 48 j end print4: addi sp sp -8 sw s0 0(sp) sw s1 4(sp) li s1 10 mv s0 a0 li t0 1000 div a0 s0 t0 li a7 1 ecall li t0 100 div a0 s0 t0 rem a0 a0 s1 li a7 1 ecall li t0 10 div a0 s0 t0 rem a0 a0 s1 li a7 1 ecall rem a0 s0 s1 li a7 1 ecall li a0 4 lw s0 0(sp) lw s1 4(sp) addi sp sp 8 ret end: ```