Assignment1: RISC-V Assembly and Instruction Pipeline

--- tags: RISC-V, jserv --- # Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [`MicahDoo`](https://github.com/MicahDoo) > [Notes on Using Ripes](https://hackmd.io/wr2NJwD5TJiENanuV_hiSQ) ## Invert Binary Tree I base this assignment upon [LeetCode Problem 226. Invert Binary Tree](https://leetcode.com/problems/invert-binary-tree/), an <font color=lime>Easy</font> Tree problem: > Given the `root` of a binary tree, invert the tree, and return its `root`. > >![](https://i.imgur.com/Hesq5th.jpg) :point_right: Straight to solution: Click [here](#Final-Assembly-Code) to skip my investigation and thought process. ## Problem Evaluation ### Data Organization - **Nodes**: contiguous class objects, each consisting of three elements: value (`val`), left branch (`left`) and right branch (`right`) in order. - **Links**: pointer values, each either a null pointer value or points to a child (a node). ```graphviz digraph g { fontname="Helvetica,Arial,sans-serif" node [fontname="Helvetica,Arial,sans-serif"] edge [fontname="Helvetica,Arial,sans-serif"] graph [ rankdir = "LR" ]; node [ fontsize = "16" shape = "ellipse" ]; edge [ ]; "node0" [ label = " <f0> val | <f1> left | <f2> right" shape = "record" ]; "node1" [ label = " <f0> val | left | right" shape = "record" ]; "node2" [ label = " <f0> val | left | right" shape = "record" ]; "node0":f1 -> "node1":f0 [ id = 0 ]; "node0":f2 -> "node2":f0 [ id = 1 ]; } ``` ### Strategy To concoct a strategy to attack this problem, it helps to first construct a solution using a higher level of abstraction. Below is a function written in C++ to tackle this problem: ```cpp void invert(TreeNode* node) { if (node == nullptr) return; TreeNode * temp = node -> left; node -> left = node -> right; node -> right = temp; invert(node -> left); invert(node -> right); } ``` A rough flowchart of the above function: ```flow st=>start: Invert(node) e=>end: Return cond=>condition: node == nullptr ? op=>operation: Swap node->left & node->right call=>subroutine: Invert(node->left) cond2=>condition: node->right == null ? call2=>subroutine: Invert(node->right) st->cond cond(no@y)->e cond(yes@n)->op op->call->call2->e ``` ### Evaluation of the solution in the context of assembly In assembly language, extra care has to be taken apropos a number of things: - There's no naming of a variable, only data in memory. - A class object is organized as a contiguous memory block where each word is an element in order. - Instead of refering to the elements by their names, we use **offsets**. In the case of Invert Binary Tree, assuming a 4-byte word length, `->val` is `+0`, `->left` is `+4` and `->right` is `+8`. - Function calls are handled on a stack. - Instead of the easy parenthesizing and the handy `return` keyword, we need to manually handle function calls by pushing arguments, local variables(or pointers to local variables) and return addresses on the stack and update the stack pointer. :::info - [ ] Is a frame pointer necessary or benefitial at all? ::: [RISCV Cheatsheet](https://itnext.io/risc-v-instruction-set-cheatsheet-70961b4bbe8) ## Thought Process ### Dataset A I will use the following data for desmonstration throughout the development of the code, unless a modification is needed: ```c .data n: # 0x10000000 .word 1, 0x1000000c, 0x10000018 nl: # 0x1000000c .word 2, 0x10000024, 0x10000030 nr: # 0x10000018 .word 3, 0x1000003c, 0x10000048 nll: # 0x10000024 .word 4, 0, 0 nlr: # 0x10000030 .word 5, 0, 0 nrl: # 0x1000003c .word 6, 0, 0 nrr: # 0x10000048 .word 7, 0, 0 ``` :::spoiler Original Tree ``` Tree DFS: 1 2 4 5 3 6 7 ``` ```graphviz digraph g { fontname="Helvetica,Arial,sans-serif" node [fontname="Helvetica,Arial,sans-serif"] edge [fontname="Helvetica,Arial,sans-serif"] graph [ rankdir = "LR" ]; node [ fontsize = "16" shape = "ellipse" ]; edge [ ]; "n" [ label = " <f0> 1 | <f1> 0x1000000c | <f2> 0x10000018" shape = "record" ]; "nl" [ label = " <f0> 2 | <f1> 0x10000024 | <f2> 0x10000030" shape = "record" ]; "nr" [ label = " <f0> 3 | <f1> 0x1000003c | <f2> 0x10000048" shape = "record" ]; "nll" [ label = " <f0> 4 | <f1> 0 | <f2> 0" shape = "record" ]; "nlr" [ label = " <f0> 5 | <f1> 0 | <f2> 0" shape = "record" ]; "nrl" [ label = " <f0> 6 | <f1> 0 | <f2> 0" shape = "record" ]; "nrr" [ label = " <f0> 7 | <f1> 0 | <f2> 0" shape = "record" ]; "n":f1 -> "nl":f0 "n":f2 -> "nr":f0 "nl":f1 -> "nll":f0 "nl":f2 -> "nlr":f0 "nr":f1 -> "nrl":f0 "nr":f2 -> "nrr":f0 } ``` ::: :::spoiler Desired Result ``` Tree: 1 3 7 6 2 5 4 ``` ```graphviz digraph g { fontname="Helvetica,Arial,sans-serif" node [fontname="Helvetica,Arial,sans-serif"] edge [fontname="Helvetica,Arial,sans-serif"] graph [ rankdir = "LR" ]; node [ fontsize = "16" shape = "ellipse" ]; edge [ ]; "n" [ label = " <f0> 1 | <f1> 0x10000018 | <f2> 0x1000000c" shape = "record" ]; "nr" [ label = " <f0> 3 | <f1> 0x10000048 | <f2> 0x1000003c" shape = "record" ]; "nl" [ label = " <f0> 2 | <f1> 0x10000030 | <f2> 0x10000024" shape = "record" ]; "nll" [ label = " <f0> 4 | <f1> 0 | <f2> 0" shape = "record" ]; "nlr" [ label = " <f0> 5 | <f1> 0 | <f2> 0" shape = "record" ]; "nrl" [ label = " <f0> 6 | <f1> 0 | <f2> 0" shape = "record" ]; "nrr" [ label = " <f0> 7 | <f1> 0 | <f2> 0" shape = "record" ]; "n":f1 -> "nr":f0 "n":f2 -> "nl":f0 "nl":f1 -> "nlr":f0 "nl":f2 -> "nll":f0 "nr":f1 -> "nrr":f0 "nr":f2 -> "nrl":f0 } ``` ::: ### Traversal Working from the bottom, I first try to implement a traversal of the tree: ```c .text main: la a0, n jal traverse # Exit program li a7, 10 ecall traverse: # PUSH addi sp, sp, -12 sw ra, 0, sp sw s0, 4, sp sw s1, 8, sp ### DO THINGS lw s0, 4, a0 # s0 <- M[n+4] = node->left lw s1, 8, a0 # s1 <- M[n+8] = node->right # BRANCH beq s0, x0, left_traversed # node->left == 0? # no mv a0, s0 jal traverse # yes left_traversed: beq s1, x0, right_traversed # node->right == 0 # no mv a0, s1 jal traverse # yes right_traversed: # POP from stack lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret ``` The above could serve as a template. To use it: 1. Re-label `traverse`, `left_traversed`, and `right_traversed` to avoid conflicts. 2. Replace `### DO THINGS` with operations. :::spoiler For example: Print DFS ```c .data # 0x10000000 n: # 0x10000000 .word 1, 0x1000000c, 0x10000018 nl: # 0x1000000c .word 2, 0x10000024, 0x10000030 nr: # 0x10000018 .word 3, 0x1000003c, 0x10000048 nll: # 0x10000024 .word 4, 0, 0 nlr: # 0x10000030 .word 5, 0, 0 nrl: # 0x1000003c .word 6, 0, 0 nrr: # 0x10000048 .word 7, 0, 0 str1: .string "\nTree DFS: " .text main: la a0, n jal printResult exit: # Exit program li a7, 10 ecall printResult: # PUSH addi sp, sp, -4 sw ra, 0, sp # print mv t0, a0 la a0, str1 # print string li a7, 4 ecall mv a0, t0 # mv a0, a0 # arg is carried over from the caller jal traverse # POP lw ra, 0, sp addi sp, sp, 4 ret traverse: # PUSH addi sp, sp, -12 sw ra, 0, sp sw s0, 4, sp sw s1, 8, sp mv t0, a0 lw a0, 0, a0 li a7, 1 # print integer ecall mv a0, t0 lw s0, 4, a0 # s0 <- M[n+4] = node->left lw s1, 8, a0 # s1 <- M[n+8] = node->right # BRANCH beq s0, x0, left_traversed # node->left == 0? # no mv a0, s0 jal traverse # yes left_traversed: beq s1, x0, right_traversed # node->right == 0 # no mv a0, s1 jal traverse # yes right_traversed: # POP from stack lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret ``` Output: ```shell Tree DFS: 1245367 Program exited with code: 0 ``` ::: ## Final Assembly Code After adding the print feature, I use the `traversal` template to swap `node->left` and `node->right` for all traversed `node` (I include only the function in this synopsis for simplicity): ```c invert: # PUSH addi sp, sp, -12 sw ra, 0, sp sw s0, 4, sp sw s1, 8, sp # ASSUME: address to node is already in a0 lw s0, 4, a0 # s0 <- M[a0+4] = node->left lw s1, 8, a0 # s1 <- M[a0+8] = node->right # SWAP: sw s0, 8, a0 # s0 -> M[a0+8] sw s1, 4, a0 # s1 -> M[a0+4] # BRANCH beq s0, x0, left_done # node->left == 0? # no mv a0, s0 jal invert # yes left_done: beq s1, x0, right_done # node->right == 0? # no mv a0, s1 jal invert # yes right_done: # POP lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret ``` :::spoiler Full Code ```c .data # 0x10000000 n: # 0x10000000 .word 1, 0x1000000c, 0x10000018 nl: # 0x1000000c .word 2, 0x10000024, 0x10000030 nr: # 0x10000018 .word 3, 0x1000003c, 0x10000048 nll: # 0x10000024 .word 4, 0, 0 nlr: # 0x10000030 .word 5, 0, 0 nrl: # 0x1000003c .word 6, 0, 0 nrr: # 0x10000048 .word 7, 0, 0 str1: .string "\nTree DFS: " space: .string " " .text main: # Load argument(s) from static data (pseudo instruction) # PRE-VALIDATION la a0, n # Preload node address # Equivalence: # auipc x10 0x10000 # lw x10 24 x10 # why 24? [12:0] 0x18 = 24 jal printResult # EXECUTION la a0, n jal invert # VALIDATION la a0, n jal printResult exit: # Exit program li a7, 10 ecall invert: # PUSH addi sp, sp, -12 sw ra, 0, sp sw s0, 4, sp sw s1, 8, sp # ASSUME: address to node is already in a0 lw s0, 4, a0 # s0 <- M[a0+4] = node->left lw s1, 8, a0 # s1 <- M[a0+8] = node->right # SWAP: sw s0, 8, a0 # s0 -> M[a0+8] sw s1, 4, a0 # s1 -> M[a0+4] # TRAVERSE: beq s0, x0, left_done # node->left == 0? # no mv a0, s0 jal invert # yes left_done: beq s1, x0, right_done # node->right == 0? # no mv a0, s1 jal invert # yes right_done: # POP lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret # invert printResult: # PUSH addi sp, sp, -4 sw ra, 0, sp mv t0, a0 la a0, str1 li a7, 4 # print string ecall mv a0, t0 # mv a0, a0 jal traverse # POP lw ra, 0, sp addi sp, sp, 4 ret traverse: # PUSH addi sp, sp, -12 sw ra, 0, sp sw s0, 4, sp sw s1, 8, sp mv t0, a0 # supply lw a0, 0, a0 li a7, 1 # print integer ecall mv a0, t0 # restore lw s0, 4, a0 # s0 <- M[a0+4] = node->left lw s1, 8, a0 # s1 <- M[a0+8] = node->right # BRANCH # CHECK node->left == 0 beq s0, x0, left_traversed # no mv a0, s0 jal traverse # yes left_traversed: # CHECK node->right == 0 beq s1, x0, right_traversed # no mv a0, s1 jal traverse # yes right_traversed: # POP lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret # traverse ``` ::: :::spoiler Older Version ```c .data # 0x10000000 n: # 0x10000000 .word 1, 0x1000000c, 0x10000018 nl: # 0x1000000c .word 2, 0x10000024, 0x10000030 nr: # 0x10000018 .word 3, 0x1000003c, 0x10000048 nll: # 0x10000024 .word 4, 0, 0 nlr: # 0x10000030 .word 5, 0, 0 nrl: # 0x1000003c .word 6, 0, 0 nrr: # 0x10000048 .word 7, 0, 0 str1: .string "\nTree DFS: " space: .string " " .text main: # Load argument from static data (pseudo instruction) # Load node address la a0, n jal printResult # Equivalence: # auipc x10 0x10000 # lw x10 24 x10 # why 24? [12:0] 0x18 = 24 # j invert # = # jal x0, invert # assign program return address to x0 but x0 will be automatically reset to 0 anyway -> no effect # save return address # current return addresss in ra, if callee calls another function, push it la a0, n jal invert la a0, n jal printResult exit: # Exit program li a7, 10 ecall invert: # ASSUME: address to node is already in a0 # Calculate address to node->left addi a0, a0, 4 # n + 4 # Get the address stored in that address # use saved register s0 for local variables lw s0, 0, a0 # s0 <- [a0] = [n+4] # Calculate address to node->right addi a0, a0, 4 # n + 4 + 4 # Get the address stored in that address # use saved register s1 for local variables lw s1, 0, a0 # s1 <- [a0] = [n+8] # SWAP: store node->left in node->right sw s0, 0, a0 # s0 -> [n+8] # SWAP: store node->right in node->left addi a0, a0, -4 sw s1, 0, a0 # s1 -> [n+4] # TRAVERSE: # PUSH addi sp, sp, -12 sw ra, 0, sp sw s0, 4, sp sw s1, 8, sp # PASS arguments lw a0, 4, sp # = # addi a0, s0, 0 # CHECK node->left == 0 beq a0, x0, left_done # no jal invert # yes left_done: # CHECK node->right == 0 lw a0, 8, sp beq a0, x0, right_done # no jal invert # yes right_done: lw ra, 12, sp addi sp, sp, 12 ret # invert printResult: mv t0, a0 la a0, str1 li a7, 4 # print string ecall mv a0, t0 # PUSH onto stack sw ra, 0, sp addi sp, sp, -4 jal traverse # POP from stack lw ra, 0, sp addi sp, sp, 4 ret traverse: ### DO THINGS mv t0, a0 lw a0, 0, a0 li a7, 1 # print integer ecall mv a0, t0 # Calculate address to node->left addi a0, a0, 4 # n + 4 lw s0, 0, a0 # s0 <- [a0] = [n+4] # Calculate address to node->right addi a0, a0, 4 # n + 4 + 4 lw s1, 0, a0 # s1 <- [a0] = [n+8] # PUSH onto stack sw ra, 0, sp sw s0, -4, sp sw s1, -8, sp addi sp, sp, -12 # BRANCH # PASS left argument lw a0, 8, sp # CHECK node->left == 0 beq a0, x0, left_traversed # no jal traverse # yes left_traversed: # PASS right argument lw a0, 4, sp # CHECK node->right == 0 beq a0, x0, right_traversed # no jal traverse # yes right_traversed: # POP lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret # traverse ``` ::: :::warning :warning: Can you use fewer instructions? :notes: jserv ::: :::warning My attempt at shortening the code: * Access struct members (`node->left` & `node->right`) by using the offset. (Don't use addi + lw). * Loop through left and right? ::: :::info Takeaway from Code Review with [`teimeikichengmingchi`](https://hackmd.io/X4DFhZwETxSo5uwU0scsLA?view): - Printing and Checking is an inefficient validation method. - Rewrite my code so as to make my validation method more efficient. ::: Output: ```shell Tree DFS: 1245367 Tree DFS: 1376254 Program exited with code: 0 ``` As shown in the above output, the tree is thus successfully inverted. ## Ripes Simulator Pipeline Features Here I will use the following processor and instruction to demonstrate Ripes' pipelining stages: 1. 5-Stage Processor: This is the most comprehensive 5-stage processor (has both forwarding and hazard detection) 2. `lw`: since load instructions are the only type that is active in all five stages. ### Initial State Instruction: `lw a1 10 a0` ### IF (Instruction Fetch) ![](https://i.imgur.com/PAMi5EY.jpg) PC increases by 4 to advance to the next instruction in Instruction Memory. ### ID (Instruction Decode) ![](https://i.imgur.com/mM9s6mi.jpg) Instruction Memory passes the instruction on to the Decoder and the Immediate Generator. The Decoder then parse the instruction to get the following values: * `a0`: number of the register which contains the address which, after added the immediate value, gives us the memory of the data to be loaded. It is fed into the Registers Module to get `R[a0]`. * `M[R[a0]+imm]` is the data to be loaded. * `LW`: the instruction type. Will be used by control unit to control the flow. This also tells Immediate Generator how to parse the immediate value. The Immediate Generator gives us `10`. * `a1`: the write-back register, will be passed on to the next few stages until the write-back happens. ### EX (Execution) ![](https://i.imgur.com/MtxZhhK.jpg) * `R[a0]` from the Registers and the immediate value `10` from the Immediate Generator are fed into the ALU for addition. The result is `R[a0] + 10`. * Note that `a1` is being passed through this stage without engaging in any operations. ### MEM (Memory) ![](https://i.imgur.com/RdPzKm3.jpg) * The address `R[a0]+10` is passed into Data Memory and it spits out the word `M[R[a0]+10]` stored in that address. * Notice that WrEn (Write Enable) is off, which means nothing is being written into Data Memory. The Data In port is thus not important. ### WB (Write Back) ![](https://i.imgur.com/QLQVD5P.jpg) * The data `M[R[a0]+10]` is passed into the Registers. * The `a1`, having looped all the way also back into Registers, tells it where to store `M[R[a0]+10]`. * Another destination is through the Forwarding Unit (the two multiplexers to the right of IDEX) into either the Branch Comparator or the ALU. If an instruction happens to need `R[a1]`, then it doesn't need to stay in the ID stage to wait for the write back; it can progress to the EX stage and get the value from the WB stage directly. ## Issues with Other Processors While the full code above works without error and as intended with a 5-stage processor with forwarding and hazard detection, other processors present a different story: ### 5-Stage Processor wo/ Forwarding (Still has Hazard Detection) No error occurs, but since the time saved by forwarding is replaced by`nop`s injected by Control Unit's Hazard Detection, time efficiency is compromised. * In general, for any instrution that writes a value, the value is not going to be saved back in the registers until the WB stage of that instruction. If the next instruction needs the WB value, it needs to **stay in the ID stage until the previous instruction reaches WB**. That means the two stages in between, EX and MEM stages are empty. That is **two stalls/nops**. * If the circuit is designed so that the value can be forwarded from a stage before WB, then stalls can be avoided. * For R-Type instructions, the WB value first appears as an output of ALU in the EX stage. Forwarding can be done from the end of EX directly to the start of the EX, which means when the next instruction advances into the Ex stage, the WB value of the previous instruction will have already been made available. :::info ### Example: PREV: `addi x10 x0 0` followed by NEXT: `addi x10 x10 3` (a RAW data hazard) #### :-1: Without Forwarding PREV: EX NEXT: ID The addition result from PREV is fresh out of ALU. It hasn't been written back yet. ![](https://i.imgur.com/809RXsJ.jpg) **Two cycles later...** PREV: EX NEXT: WB ![](https://i.imgur.com/qzIoD0n.jpg) PREV's output successfully written back to `x10`. Now NEXT can safely proceed to the stage EX with the right value for addition. **Conclusion:** Two cycles were wasted. That's two stalls. #### :+1: With Forwarding PREV: EX NEXT: ID The addition result from PREV is fresh out of ALU. It hasn't been written back yet, but there is a path out of ALU through the EX/MEM registers that can feed the value directly back into the ALU next stage. ![](https://i.imgur.com/TraOWdC.jpg) **Next cycle...** PREV: MEM NEXT: EX ![](https://i.imgur.com/fTV21GU.jpg) As illustrated above, both instructions advance to the next stage and the WB value from PREV can be directly pushed into the ALU unit currently used by NEXT. **Conclusion** No delay. Better than without forwarding. ::: * For Load instructions, the WB value first appears as an output of DMEM in the MEM stage. The next instruction, if it needs the WB value, will have to stay in ID until the Load instruction reaches MEM. That translates to one empty stage in between, which means still one `nop` is needed for this to work. That is still a step up from 2 `nop`s without forwarding, though. Control's hazard detection will automatically add that for you. :::info ### Example: PREV: `lw x10 0 x12` followed by NEXT: `addi x10 x10 3` (a Load Before Use data hazard) #### :-1: Without Forwarding PREV: EX NEXT: ID PREV has calculated the address of the data it needs to read from, but has yet to read it. Hence NEXT can't get the value at this stage. ![](https://i.imgur.com/mEJw0qB.jpg) **Next cycle...** PREV: EX NEXT: MEM PREV just fetched the data from the address, but still has not written it back. ![](https://i.imgur.com/EcQxIg0.jpg) **Next cycle...** PREV: EX NEXT: WB ![](https://i.imgur.com/qzIoD0n.jpg) PREV's output successfully written back to `x10`. Now NEXT can safely proceed to the next stage with the right value for addition. **Conclusion:** Two cycles were wasted. That's two stalls. #### :+1: With Forwarding PREV: MEM NEXT: ID ![](https://i.imgur.com/zUHfLqg.jpg) PREV has already fetched the data at `R[x12]` out of Data Memory. It hasn’t been written back yet, but there is a path out of DMEM through the MEM/WB registers that can feed the value directly back into the ALU next stage. **Next cycle...** PREV: WB NEXT: EX ![](https://i.imgur.com/SdIYS1w.jpg) Because now at the WB stage, PREV can feed `M[R[x12]]` directly back into ALU, NEXT doesn't have to stay in ID to fetch from Registers. **Conclusion:** Only one cycle is wasted. Better than without forwarding. ::: ### 5-Stage Processor wo/ Hazard Detection (Still has Forwarding) * Errors can occur. * The appropriate number of with-forwarding nops need to be manually injected for the code to work correctly: * For R-Type instructions, no `nop`s need to be inserted. The forwarding already takes care of everything. * For Load instructions, one `nop` still need to be hand coded. * There seems to be a potential design flaw with branch instructions: refer to [here](#Issue:-Branching-before-Forwarding). #### Nops for Processor with Forwarding but without HD * As discussed above, in order for the program to work correctly, we need to ensure the following: 1. If a load instruction's `rd` is used in the next instruction, put one `nop` in between. 2. :small_red_triangle: **(Might not be typical of a regular RISC-V machine due to the design flaw discussed in the [next section](#Issue:-Branching-before-Forwarding))** If a load/R-Type instruction's `rd` is used in the next or the one after the next instruction, which happens to be a branch instruction, make sure there are two non-conflicting instructions in between. Pad with `nop`s if no. 3. :small_red_triangle: **(Yet to figure out why, refer to the [ecall section](#Issue-with-Ecalls))** Three `nop`s between an `ecall` and instructions that set `a7` and `a0`. :::spoiler Code ```c .text main: # PRE-VALIDATION la a0, n # Preload node address # Equivalence: # auipc x10 0x10000 # lw x10 24 x10 # why 24? [12:0] 0x18 = 24 jal printResult # EXECUTION la a0, n jal invert # VALIDATION la a0, n jal printResult exit: # Exit program li a7, 10 nop nop nop ecall invert: # PUSH addi sp, sp, -12 sw ra, 0, sp sw s0, 4, sp sw s1, 8, sp lw s0, 4, a0 # s0 <- M[a0+4] = node->left lw s1, 8, a0 # s1 <- M[a0+8] = node->right nop # SWAP: sw s0, 8, a0 # s0 -> M[a0+8] sw s1, 4, a0 # s1 -> M[a0+4] # TRAVERSE: beq s0, x0, left_done # node->left == 0? # no mv a0, s0 jal invert # yes left_done: beq s1, x0, right_done # node->right == 0? # no mv a0, s1 jal invert # yes right_done: # POP lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret # invert printResult: # PUSH addi sp, sp, -4 sw ra, 0, sp mv t0, a0 la a0, str1 li a7, 4 # print string nop nop nop ecall mv a0, t0 # mv a0, a0 jal traverse # POP lw ra, 0, sp addi sp, sp, 4 ret traverse: # PUSH addi sp, sp, -12 sw ra, 0, sp sw s0, 4, sp sw s1, 8, sp mv t0, a0 # supply lw a0, 0, a0 li a7, 1 # print integer nop nop nop ecall mv a0, t0 # restore lw s0, 4, a0 # s0 <- M[a0+4] = node->left lw s1, 8, a0 # s1 <- M[a0+8] = node->right nop # BRANCH # CHECK node->left == 0 beq s0, x0, left_traversed # no mv a0, s0 jal traverse # yes left_traversed: # CHECK node->right == 0 beq s1, x0, right_traversed # no mv a0, s1 jal traverse # yes right_traversed: # POP lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret # traverse ``` ::: ## Issue: Branching before Forwarding (Detail on [Github](https://github.com/mortbopet/Ripes/issues/233)) I discovered an oddity in the design layout for 5-Stage Processor without Hazard Detection: #### 5-Stage ![](https://i.imgur.com/ndmRjRB.jpg) #### 5-Stage without Hazard Detection Only Here is the layout for one with Hazard Detection for comparison: ![](https://i.imgur.com/wdDx9RR.jpg) ### Why are the with-forwarding layouts different for w/ HD and wo/ HD? Hazard Detection (HD) should be implemented in the Control component, which means it doesn't affect the arrangement of datapath elements, however, a difference is clearly seen in the two layouts: 1. For the one with HD, the Branch Comparator takes its inputs through the two forwarding multiplexers. This way the BC is allowed to take inputs from the Registers, or from WB values forwarded from ALU or DMEM. As it should be. ![](https://i.imgur.com/XcboT2w.jpg) 2. For the one without HD, the Branch Comparator is instead located before the forwarding multiplexers, taking it's input only from the Registers. This means WB values can never reach a branch instruction by forwarding. ![](https://i.imgur.com/JEmMVio.jpg) For comparison, in the other two cases where there are no forwarding, whether HD is in play doesn't make a difference: #### 5-Stage without Forwarding Only ![](https://i.imgur.com/5vDwqi0.jpg) #### 5-Stage without Either ![](https://i.imgur.com/JYUXbC2.jpg) As can be seen, although the wirings near the Branch Comparator are drawn different, they are, in fact, the same picture. ![](https://i.imgur.com/XFqHtXE.jpeg) :::warning ### :warning: Design Flaw The problem mentioned above seems to be a design flaw. I published an [issue](https://github.com/mortbopet/Ripes/issues/233) on Ripes github repository to bring this problem to attention. ::: :::spoiler Load before Use Test Code ``` .data branch_str: .string "Branched.\n" not_branch_str: .string "Not branched.\n" .text main: # a0 set to 10 addi a0, zero, 10 nop nop # a0 set to 0 addi a1, zero, 0 addi a2, zero, 64 sw a1, 0, a2 lw a0, 0, a2 # nop # zero stalls # no hazard: a0 = 0 + 3 = 3 / hazard: a0 = 10 + 3 = 13 addi a0, a0, 3 li a7, 1 nop nop nop ecall exit: # Exit li a7, 10 nop nop nop ecall ``` ::: ## Issue with Ecalls ![](https://i.imgur.com/iRn59It.jpg) Without enough `nop`s, the value in `a7` might not be appropriate and the above message might pop up. :::danger - There must be three stalls before ecall even if there is a forwarding unit. - One of the reasons might be because ecal uses the Branch Comparator. However, that still doesn't explain the following: - With a normal forwarding component as in **5-Stage Processor**, a `beq` instruction experiences no delay. There is no reason `ecalls` should incur a delay for using the BC. - Delays are typically two `nop`s. Where is the 3rd one coming from? ::: :::danger Can't print anything with `ecall`s when neither the forwarding unit or hazard detection is present. - Why? - How does `ecall` work? ::: ## Optimization: Code Scheduling to Avoid Stalls and Hazards * Assume the worst (no forwarding and no hazard detection). * Two benefits: 1. With Hazard Detection, performance is better. 2. Without Hazard Detection, performance AND correctness are both better. ### Example: Push to Stack #### Original ```c= addi sp, sp, 0 sw ra, 0, sp sw s0, 4, sp sw s1, 8, sp ``` Line 1 and 2 is a typical RAW hazard. To avoid that, I move the `addi` part to the end and add a -12 bias to all the offsets. #### Modified ```c= sw ra, -12, sp sw s0, -8, sp sw s1, -4, sp addi sp, sp, -12 ``` ### Example: SWAP Two Values #### Original ```c= sw s0, -8, sp sw s1, -4, sp sw ra, -12, sp addi sp, sp, -12 lw s0, 4, a0 lw s1, 8, a0 sw s0, 8, a0 sw s1, 4, a0 ``` Line 6 and 7 are only one line away from the next line using the register that they load values into. I would be a good idea to move one instruction from the top to below these two instructions, given the change or order doesn't change the results. Since line 4 doesn't use the same registers, it is safe to move it down. #### Modified ```c= sw s0, -8, sp sw s1, -4, sp sw ra, -12, sp lw s0, 4, a0 lw s1, 8, a0 addi sp, sp, -12 sw s0, 8, a0 sw s1, 4, a0 ``` ### Example Recursion ```c= traverse: # PUSH sw ra, -12, sp sw s0, -8, sp sw s1, -4, sp addi sp, sp, -12 mv t0, a0 # supply lw a0, 0, a0 li a7, 1 # print integer nop nop nop ecall mv a0, t0 # restore nop nop lw s0, 4, a0 # s0 <- M[a0+4] = node->left lw s1, 8, a0 # s1 <- M[a0+8] = node->right nop # BRANCH beq s0, x0, left_traversed mv a0, s0 jal traverse ``` ```c= traverse: li a7, 1 # don't use a0 on the first line, because a0 was just stored 2 instructions ago before jal traverse was called mv t0, a0 # supply lw a0, 0, a0 sw ra, -12, sp sw s0, -8, sp sw s1, -4, sp lw s0, 4, t0 # s0 <- M[a0+4] = node->left lw s1, 8, t0 # s1 <- M[a0+8] = node->right addi sp, sp, -12 ecall mv a0, t0 # restore # BRANCH # CHECK node->left == 0 beq s0, x0, left_traversed # no mv a0, s0 jal traverse ``` :::spoiler Full Revamped Code ```c .data n: # 0x10000000 .word 1, 0x1000000c, 0x10000018 nl: # 0x1000000c .word 2, 0x10000024, 0x10000030 nr: # 0x10000018 .word 3, 0x1000003c, 0x10000048 nll: # 0x10000024 .word 4, 0, 0 nlr: # 0x10000030 .word 5, 0, 0 nrl: # 0x1000003c .word 6, 0, 0 nrr: # 0x10000048 .word 7, 0, 0 str1: .string "\nTree DFS: " .text main: # Load argument(s) from static data (pseudo instruction) # PRE-VALIDATION la a0, n # Preload node address # Equivalence: # auipc x10 0x10000 # lw x10 24 x10 # why 24? [12:0] 0x18 = 24 jal printResult # EXECUTION la a0, n jal invert # VALIDATION la a0, n nop jal printResult exit: # Exit program li a7, 10 nop nop nop ecall invert: # PUSH sw s0, -8, sp sw s1, -4, sp sw ra, -12, sp lw s0, 4, a0 # s0 <- M[a0+4] = node->left lw s1, 8, a0 # s1 <- M[a0+8] = node->right addi sp, sp, -12 # SWAP: sw s0, 8, a0 # s0 -> M[a0+8] sw s1, 4, a0 # s1 -> M[a0+4] # TRAVERSE: beq s0, x0, left_done # node->left == 0? # no mv a0, s0 jal invert # yes left_done: beq s1, x0, right_done # node->right == 0? # no mv a0, s1 jal invert # yes right_done: # POP lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret # invert printResult: # PUSH li a7, 4 # print string mv t0, a0 la a0, str1 sw ra, -4, sp addi sp, sp, -4 nop ecall mv a0, t0 # mv a0, a0 jal traverse # POP lw ra, 0, sp addi sp, sp, 4 ret traverse: li a7, 1 # don't use a0 on the first line, because a0 was just stored 2 instructions ago before jal traverse was called mv t0, a0 # supply lw a0, 0, a0 sw ra, -12, sp sw s0, -8, sp sw s1, -4, sp lw s0, 4, t0 # s0 <- M[a0+4] = node->left lw s1, 8, t0 # s1 <- M[a0+8] = node->right addi sp, sp, -12 ecall mv a0, t0 # restore # BRANCH # CHECK node->left == 0 beq s0, x0, left_traversed # no mv a0, s0 jal traverse # yes left_traversed: # CHECK node->right == 0 beq s1, x0, right_traversed # no mv a0, s1 nop jal traverse # yes right_traversed: # POP lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret # traverse ``` ::: ### Tests | Cycles | Full | wo/ F w/ HD | w/F wo/ HD | wo/ F wo/ HD| | ---------------- | ---- | ----------- | ------------ | ------------ | | Original | 648 | 741 | Ecall Error | Ecall Error | | Code Rescheduled | 660 | 672 | 584 | 191 (Crash) | ## Optimization: Branch Reordering to Minimize Flushes/Kills - As shown in this very recent (only 10 days before this homework is due) [discussion](https://github.com/mortbopet/Ripes/discussions/230) on Ripes' github repository, no branch prediction is implemented in this simulator. In such case, it will do good for the programmer to write the code so that the less taken route is in the branch and the more frequently occuring follows immediately after the branch instruction. This way, we can ensure that the number of times we have to flush immediately-following instructions is minimized. - In my case, the branch instruction involves whether the branch is a real branch (attached to another node) or a null branch: - Suppose there are $n$ nodes. 1. Since this is a binary treem each node except the root has a from-branch: - There are $n-1$ from branches -> there are $n-1$ real branches. 2. Since each node has two branches (left and right), there are $2n$ branches in total: - The rest of the $2n - (n-1) = n+1$ branches are null branches. 3. Thus, we have proven that there are more null branches than real branches. 4. Conclusion, we should put the case where `left` or `right` equals `0` immediately after the branch instruction. - I try the following code: ```c .text main: la a0, n jal traverse # Exit program li a7, 10 ecall traverse: # PUSH addi sp, sp, -12 sw ra, 0, sp sw s0, 4, sp sw s1, 8, sp ### DO THINGS lw s0, 4, a0 # s0 <- M[n+4] = node->left lw s1, 8, a0 # s1 <- M[n+8] = node->right # BRANCH bne s0, x0, left_branch # node->left != 0? next_branch: bne s1, x0, right_branch # node->right != 0 traverse_done: # POP from stack lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret # no left_branch: mv a0, s0 jal traverse j next_branch right_branch: mv a0, s1 jal traverse j traverse_done ``` As seen from the above revised code, the addition of two `j` commands doesn't seem to help with time saving that much: * In the original version, if there are $n$ nodes, $(n+1)\times{2} = 2n + 2$ kills result. * in the modified version, if there are $n$ nodes, $(n-1)\times{2}$ results from `bne` taking the branch and $(n-1)\times{2}$ more result from the `j` that is executed after the branch is taken. That is $4n-4$ in total. That's definitely more than the original if for all $n>=2$. The following version might solve the extra jump problem, but will require chunks of duplicate code and may make it less readable. More memory will be needed to save it. Plus, only $4$ cycles can be saved from this optimization, which becomes increasingly less significant as the number of nodes go up: ```c .text main: la a0, n jal traverse # Exit program li a7, 10 ecall traverse: # PUSH addi sp, sp, -12 sw ra, 0, sp sw s0, 4, sp sw s1, 8, sp ### DO THINGS lw s0, 4, a0 # s0 <- M[n+4] = node->left lw s1, 8, a0 # s1 <- M[n+8] = node->right # BRANCH bne s0, x0, left_branch # node->left != 0? bne s1, x0, right_branch # node->right != 0 # POP from stack lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret left_branch: mv a0, s0 jal traverse bne s1, x0, right_branch # node->right != 0 # POP from stack lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret right_branch: mv a0, s1 jal traverse # POP from stack lw ra, 0, sp lw s0, 4, sp lw s1, 8, sp addi sp, sp, 12 ret ``` ### Testing Using [Dataset A](#Dataset-A), the number of cycles required for each versions are a follows: 1. Original `beq` version: 221 cycles. 2. `bne` version with two `j`s: 235 cycles. 3. `bne` version without `j`, with duplicate code: 217 cycles.