Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline ###### tags: `CA` contributed by < [`reputation0809`](https://github.com/reputation0809/ca_hw1) > ## Find the smallest letter greater than target Given a characters array letters that is sorted in non-decreasing order and a character target, return the smallest character in the array that is larger than target. ## Implementation You can find the source code [here](https://github.com/reputation0809/ca_hw1). Feel free to fork and modify it. ### C code I use length=3 letter[3]={'c','f','j'} to be the sorted and non-decreasing array. And target='a' to be the target. You can change the length or element inside them to see if the result is still correct. ```cpp #include<stdio.h> #include <stdlib.h> #include <string.h> int main(){ char letter[3]={'c','f','j'}; char target='a'; char output; for(int i=0;i<sizeof(letter);i++){ if((int)letter[i]>(int)target){ output=letter[i]; break; } } printf("%c",output); } ``` :::info Think of the following implementation: ```cpp char nextGreatestLetter(char *letters, int lettersSize, char target) { int lo = 0, hi = lettersSize - 1; if (letters[hi] <= target) return letters[0]; while (lo < hi) { int mi = (hi - lo) / 2 + lo; if (letters[mi] <= target) lo = mi + 1; else hi = mi; } return letters[lo]; } ``` Can we do even better? :notes: jserv ::: ### Assembly code In this section, the above C code is transferred to RISCV assembly code. ```c .data arr: .word 99, 102, 106 #letter[3]={'c','f','j'} in ASCII target: .word 97 #target in ASCII len: .word 3 #array length .text #s1=arr base address #s2=target value #s3=output #s4=length of arr #t0=i #t1=letter[i] main: la s1, arr #s1=letter lw s2, target #s2='a' add s3, x0, x0 #s3=output lw s4, len #s4=3 add t0, x0, x0 #i=0 jal ra, loop #jump to for loop jal ra, print #jump to print ecall loop: lw t1, 0(s1) #t1=letter[i] addi s1, s1, 4 #address move forward blt t1, s2, loop #if(t1<s2) jump to loop ret print: add a0, t1, x0 #load char li a7, 1 #print char ecall ``` ## Analysis I test my code with [Ripes](https://github.com/mortbopet/Ripes) simulator. ### Pseudo instruction When I put above code into editor and I see that instead of executing literally, Ripes replace pseudo instruction into equivalent one, and change register name from ABI name to sequencial one. The translated code looks like: ``` 00000000 <main>: 0: 10000497 auipc x9 0x10000 4: 00048493 addi x9 x9 0 8: 10000917 auipc x18 0x10000 c: 00492903 lw x18 4 x18 10: 000009b3 add x19 x0 x0 14: 10000a17 auipc x20 0x10000 18: ffca2a03 lw x20 -4 x20 1c: 000002b3 add x5 x0 x0 20: 00c000ef jal x1 12 <loop> 24: 018000ef jal x1 24 <print> 28: 00000073 ecall 0000002c <loop>: 2c: 0004a303 lw x6 0 x9 30: 00448493 addi x9 x9 4 34: ff234ce3 blt x6 x18 -8 <loop> 38: 00008067 jalr x0 x1 0 0000003c <print>: 3c: 00030533 add x10 x6 x0 40: 00100893 addi x17 x0 1 44: 00000073 ecall ``` In each row it denotes address in instruction memory, instruction’s machine code (in hex) and instruction itself respectively. ### 5-stage pipelined processor Now I choose **5 stage processor** provided by Ripes to run this code. Its block look like this: ![](https://i.imgur.com/OwUFsrt.png) The "5-stage" means this processor using five-stage pipeline to parallelize instructions. The stages are: 1. Instruction fetch (IF) 2. Instruction decode and register fetch (ID) 3. Execute (EX) 4. Memory access (MEM) 5. Register write back (WB) You can see that each stage is separated by **pipeline registers** (the rectangle block with stage names on its each side) in the block diagram. Instruction in different type of format will go through 5 stages with different signal turned on. Let's take `blt t1, s2, loop` for example to discuss each stage in detail. ### B-type format Let's see how it go through each stage. 1. Instruction Fetch(IF) ![](https://i.imgur.com/4EMVOZ0.png) * We start from instruction put at `0x00000034`, so `addr` is equal to `0x00000034` * The machine code of first instruction is `0xff234ce3` (you can look it up in the code snippet above), so `instr` is equal to `0xff234ce3`. * PC will increment by 4 automatically using the above adder. * Because there is no branch occur, next instruction will be at PC + 4, so the multiplexer before PC choose input come from adder. 2. Instruction decode and register fetch(ID) ![](https://i.imgur.com/uaYdFUO.png) * Instruction `0xff234ce3` is decoded to several parts: * `opcode` = `blt` * `R1 idr` = `0x06` * `R2 idr` = `0x12` * `imm` = `0xfffffff8` (relative address) * `Reg 1` and `Reg 2` read value based on `R1 idr` and `R2 idr` , so their value are `0x00000000` and `0x00000061`. :::warning why Reg 1 value is 0x00000000 not 0x00000063? Because this `blt t1, s2, loop` instruction has **data dependency** with `lw t1, 0(s1)`, which means the lw instruction hasn't written back the data yet while the branch instruction moves to ID-stage. That's why Reg 1 value is 0x00000000 not 0x00000063. ::: * Current PC value (`0x00000034`) and next PC value (`0x00000038`) are just send through this stage, we don't use them. 3. Execute(EX) ![](https://i.imgur.com/24xXwD9.png) * The first level multiplexers choose value come from `Reg1` and `Reg2`. * As I mentioned above, there has **data dependency** between lw and blt, so `0x00000063` should be the right value forwarded by the WB-stage of lw. * The second level multiplexers choose `0x00000034` and `0xfffffff8` to be the operands of ALU. * ALU adds two operands to calculate the next PC if branch is taken. * Because `0x00000063` is greater than `0x00000061`, branch is not taken. 4. Memory access(MEM) ![](https://i.imgur.com/6X6tsXU.png) * `Reg 2` is send to `Data in`, but memory doesn't enable writing. * Next PC value (`0x00000038`) and `Wr idx` (`0x19`) are just send through this stage, we don't use them. 5. Register wright back(WB) ![](https://i.imgur.com/U0qTDZa.png) * The multiplexer choose `Res` from ALU as final output. So the output value is `0x0000002c`. * B-type instruction doesn't need to write data back to register, so write_enable signal in register will be false. ### Memory updates ![](https://i.imgur.com/YE8EC24.png) After finishing running above code, register a0 is `0x00000063` which is the ASCII of `C` and register a7 is used to print. ### Pipeline Hazard * There are three hazards: * Structural Hazard : * Structural hazards occur when more than one instruction needs to use the same datapath resource at the same time. * Data Hazard : * Data hazards are caused by data dependencies between instructions as I early mentioned. * Control Hazard : * Control hazards are caused by **jump and branch** instructions, because for all jumps and some branches, the next PC is not PC + 4, but the result of the computation completed in the EX stage. * For example : ![](https://i.imgur.com/K6dVyp3.png) As we can see in above image, in EX stage, the computation completes and decides the next PC will jump to. Therefore, we have to flush the two former wrong stages and re-fetch、re-decode the right instruction. ![](https://i.imgur.com/m7aQn9Y.png) You may see clearer in above cycle diagram. `jal x1 24 <loop>` is to jump to section loop, but processor doesn't know until EX stage, in the meanwhile, `jal x1 24 <print>` and `ecall` have already joined in the pipeline but next instruction should be `lw x6 0 x9`. That's why we need two nop(flush) in the pipeline. ## References * [leetcode 744](https://leetcode.com/problems/find-smallest-letter-greater-than-target/) * [Ripes](https://github.com/mortbopet/Ripes) * [RISCV pipeline and hazards](https://robotics.shanghaitech.edu.cn/courses/ca/19s/notes/Discussion8_2.pdf)