Assignment1: RISC-V Assembly and Instruction Pipeline

--- tags: CA2020 --- # Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < `kevin30292` > ==Requirement== - [x] Don’t implement the same subject as others do. Your program shall be different. - [x] Your program(s) MUST contain loops (or recursive calls) and conditional branches. - [x] You have to ensure the program fully functioned with Ripes simulator. - [x] Explain your program with the visualization for multiplexer input selection, register write/enable signals and more. - [x] Illustrate each stage such as IF, ID, IE, MEM, and WB. - [x] Discuss the steps of memory updates accordingly. ## Selection Sort Use selection sort to sort a integer array which includes 5 integers :::info Selection sort: selection sort is a simple sorting algorithm. This sorting algorithm divided the list into two parts, `sorted-part` at the left end and `unsorted-part` at the right end. Initially, the sorted part is empty and the unsorted part is the entire list. Everytime choose the minimum value in `unsorted-part` and swapped with the most left value of `unsorted-part`, and it will become a part of `sorted-part`. ::: ### C code ```c= int main() { // inputs int i, j, min_id, temp; int arr[5]={1,2,4096,16,256}; // sort for (i = 0; i<4; i++) { min_id = i; for (j=i+1; j<5; j++) { if (arr[j] < arr[min_id]) { min_id = j; } } temp = arr[i]; arr[i] = arr[min_id]; arr[min_id] = temp; } return 0; } ``` #### Result arr={1,2,16,256,4096} ### Assembly code #### `version 1` | info | | |-------|-----| |cycles |210 | |retired|168 | |CPI |1.25 | |IPC |0.800| ```cpp= .data arr: .word 1,2,4096,16,256 .text main: #get base address la x8, arr #init i add x15, x0, x0 #for loop i L1: #init min_id add x16, x0, x15 #init j addi x6, x15, 1 #for loop j L2: slli x7, x6, 2 slli x28, x16, 2 add x5, x8, x28 lw x31, 0(x5) add x5, x8, x7 lw x30, 0(x5) bge x30, x31, nochange add x16, x0, x6 nochange: addi x6, x6, 1 addi x9, x0, 5 blt x6, x9, L2 swap: #get a[i] slli x29, x15, 2 add x5, x29, x8 #load a[i] in x30 lw x30, 0(x5) #get a[min_id] slli x28, x16, 2 add x5, x28, x8 #load a{min_id] in x31 lw x31, 0(x5) #put a[min_id] in a[i] add x5, x29, x8 sw x31, 0(x5) #get a[min_id] add x5, x28, x8 sw x30, 0(x5) addi x1, x0, 4 addi x15, x15, 1 blt x15, x1, L1 ``` #### `version 2` :::info organized the variable assignment ::: | variable | register | note | |-----------------|----------|-------| | arr -> base | x3 = gp | | | i | x5 = t0 | | | i condition | x18 = s2 | | | arr[i] offset | x28 = t3 | | | arr[i] address | x12 = a2 |x28(x3)| | j | x6 = t1 | | | j condition | x19 = s3 | | | arr[j] offset | x29 = t4 | | | arr[j] address | x13 = a3 |x29(x3)| | min_id | x7 = t2 | | | arr[j] offset | x30 = t5 | | | arr[j] address | x14 = a4 |x30(x3)| | compare1 & temp | x10 = a0 | | | compare2 | x11 = a1 | | | info | | |-------|-----| |cycles |209 | |retired|157 | |CPI |1.33 | |IPC |0.751| ```cpp= .data arr: .word 1,2,4096,16,256 .text main: #get array base address la x3, arr #init i addi x5, x0, 0 #for loop i L1: #init min_id = i addi x7, x5, 0 #init j = i + 1 addi x6, x5, 1 #for loop j L2: #get arr[j] address offset and address slli x29, x6, 2 add x13, x3, x29 lw x10, 0(x13) #get arr[min_id] address offset and address slli x30, x7, 2 add x14, x3, x30 lw x11, 0(x14) #check if condition x10 = arr[j] ; x11 = arr[min_id] bge x10, x11, nochange addi x7, x6, 0 nochange: #j++ addi x6, x6, 1 addi x19, x0, 5 blt x6, x19, L2 #get arr[i] address offset and address slli x28, x5, 2 add x12, x3, x28 #get arr[min_id] address offset and address slli x30, x7, 2 add x14, x3, x30 # temp = arr[i] lw x10, 0(x12) # arr[i] = arr[min_id] lw x11, 0(x14) sw x11, 0(x12) # arr[min_id] = temp sw x10, 0(x14) #i++ addi x5, x5, 1 addi x18, x0, 4 blt x5, x18, L1 ``` #### Result initial: ![](https://i.imgur.com/aBm4MoF.png =250x) after: ![](https://i.imgur.com/rf2CDlD.png =250x) #### code explain This code have four different parts. 1. main: Initializing value of i and `base-address` of the array. 2. L1: The big for-loop to sort every element in the array. 3. L2: The for-loop in L1 to find the smallest element in `unsorted-part`. 4. nochange: Label for skipping `change-min_id` if it have the smallest element's ID. ## Ripes simulator :::info RISC-V instruction format ![](https://i.imgur.com/zMmei2u.png) ::: ### la x3, arr :::warning ==la== will be transform to ==auipc== and ==addi== ==auipc==: add the imm value with `PC` and store in the `rd` ==addi==: subtract the `PC` value from `rd` ::: #### IF(Instruction Fetch) The `PC` will initialize with 0x0000_0000 which is the address of first instruction. :::info The `PC` value will also pass to `adder` plus 4(PC+4) and update the `PC`. If no branch or jump, the multiplexer will choose PC+4 for next instruction address. ::: In the picture, the first instruction(0x10000197) will pass to decoder. ![](https://i.imgur.com/cJe55GG.png =300x) #### ID(Instruction Decode) :::info la: ![](https://i.imgur.com/yhlOxRW.png) ::: 0x1000_0197 0b00010000000000000000_00011_0010111 U type op -> 0b0010111 = `auipc` rd -> 0b00011 = `x3` imm -> 0b0001_0000_0000_0000_0000 = 0x10000 :::info U type: 20-bit immediate is shifted left by 12 bits to form U immediates. ::: imm -> 0x10000==000== The first instruction pass into decoder and come aparts to `AUPIC`, `0x03`, `0x1000_0000` ![](https://i.imgur.com/1miw253.png =350x) #### EX(Execute) `0x03` directly pass to next stage. ALU plus `imm` and `PC` and pass the sum to next stage. ![](https://i.imgur.com/IJoyLuS.png =300x) #### MEM(memory access) just pass value #### WB(Write Back) The multiplexer choose value pass from ALU, not use the Memory Unit value. Write the value(`imm` + `pc`) back to register(`x3`). The `Wr En` will turn green to allow write back. ![](https://i.imgur.com/tMVH0LG.png =800x) :::info The register write back will delay one stage to actually update the Registers. Example: The x5 doesn't updata when instruction in WB stage ![](https://i.imgur.com/AH6XuLs.png =300x) --- Next stage: ![](https://i.imgur.com/esQoSn2.png =300x) ::: :::warning Now the x3 value is `pc` + `imm`, but we want the value is `imm`, so next instruction is subtract `pc`. ::: addi x3, x3, 0(-pc) >if pc is 0x08 addi x3, x3, -8 :::spoiler ![](https://i.imgur.com/hULQhJC.png) imm value will change automatically ![](https://i.imgur.com/Tv7mQko.png) ::: #### Result: `x3` = 0x1000_0000 (address of arr[]) ![](https://i.imgur.com/g1QxKp6.png =250x) --- ### addi x5, x0, 0 Initialize `i` #### IF(Instruction Fetch) Fetch instruction at `0x08` -> 0x0000_0293 ![](https://i.imgur.com/kkH8HU3.png =350x) #### ID(Instruction Decode) :::info addi ![](https://i.imgur.com/illAPNp.png) ::: 0x0000_0293 0b000000000000_00000_000_00101_0010011 op -> 0b0010011 = `imm` rd -> 0b00101 = `x5` funct -> 0b000 = `addi` rs1 -> 0b00000 = `x0` imm -> 0b0 = `0` :::info I type instruction: sign-extended 12-bit immediate. ::: The instruction pass into decoder and come aparts to `ADDI`, `0x00` ,`0x05`, `0x0000_0000` The rs1 pass into R1 idx and `Registers` pass the value of `x0` to next stage. There no rs2, so the value input to `Registers` is invalid, of course `Reg 2` output are no meaning. ![](https://i.imgur.com/W5AUDmO.png =350x) #### EX(Execute) `0x05` directly pass to next stage. Multuplexer choose `Reg 1` and `imm` value to ALU. ALU plus `imm` and `reg1` and pass the sum to next stage. ![](https://i.imgur.com/91vPGBW.png =350x) #### MEM(memory access) Just pass value. :::info The memory unit still have the `Read out` value, but it is invalid. In this case, the memory unit read the value at 0x0000_0000, but 0x0000_0000 actually doesn't mean address. ![](https://i.imgur.com/4vMY6bb.png =300x) ::: #### WB(Write Back) So in this stage, the multiplexer will choose the value pass from `ALU`, by pass the `memory unit`. Write the value(`imm` + `0x00`) back to register(`x5`). (`imm` + `0x00`) is directly pass from ALU to WB stage when the instruction still in the EX stage. Also, the `Registers' Wr En` will turn green to allow register writing. ![](https://i.imgur.com/FFTCI0i.png) #### Result: `x5` = 0x0000_0000 ![](https://i.imgur.com/QSVjmrQ.png =250x) --- ### Get array element address ```cpp= slli x29, x6, 2 add x13, x3, x29 ``` 1. Get the array element's offset. `2`: shift imm `x6`: i `x29`: offset 2. Get the array element's address, plus base and offset. `x3`: array base address `x29`: offset `x13`: array element's address #### IF(Instruction Fetch) 1. Fetch instruction at `0x14` -> 0x0023_1e93 ![](https://i.imgur.com/gzoOdoi.png =300x) 2. Fetch instruction at `0x18` -> 0x01d1_86b3 ![](https://i.imgur.com/wCmi7Tm.png =300x) #### ID(Instruction Decode) :::info slli: ![](https://i.imgur.com/1E5TvkI.png) --- add: ![](https://i.imgur.com/3TaLfXx.png) ::: 1. 0x0023_1e93 0b0000000_00010_00110_001_11101_0010011 op -> 0b0010011 = `op-imm` rd -> 0b11101 = `29` funct3-> 0b001 = `slli` rs1 -> 0b00110 = `6` imm -> 0b00010 = `2` funct7-> 0b0000000 = `slli` The instruction pass into decoder and come aparts to `SLLI`, `0x06`, `0x02`, `0x1d` The rs1 pass into R1 idx and `Registers` pass the value of `x6` to next stage. The rs2 pass into R2 idx and `Registers` pass the value of `x2` to next stage, ==but it is invalid==. ![](https://i.imgur.com/WZCsIKA.png =300x) 2. 0x01d1_86b3 0b0000000_11101_00011_000_01101_0110011 op -> 0b0110011 = `op` rd -> 0b01101 = `13` funct3-> 0b000 = `add` rs1 -> 0b00011 = `3` rs2 -> 0b11101 = `29` funct7-> 0b0000000 = `add` The instruction pass into decoder and come aparts to `ADD`, `0x03`, `0x1d`, `0x0d`4 The rs1 pass into R1 idx and `Registers` pass the value of `x3` to next stage. The rs2 pass into R1 idx and `Registers` pass the value of `x29` to next stage. The `imm.` value is invalid. ![](https://i.imgur.com/eNV5wIl.png =300x) #### EX(Execute) 1. slli x29, x6, 2 `0x1d` directly pass to next stage. :::info Because the last instruction is addi x6, x5, 1, but the new value of x6 haven't update to `Registers`, so cause ==data hazard==. `x6` mutiplexer need choose x6 value ==forwarding== from EX/MEM(==red mark==). ![](https://i.imgur.com/dMDFhUA.png =500x) ::: ALU shift `reg1` 2(`0x02`) bits and pass the value to next stage. ![](https://i.imgur.com/V9zHlrQ.png =300x) 2. add x13, x3, x29 Because the last instruction have change the register(x29), ![](https://i.imgur.com/mWskVmO.png =200x) but it doesn't update to Rrgisters, so the multiplxer choose to use the value ==forwarding== from EX/MEM.(==red mark==) ![](https://i.imgur.com/tmar89g.png =500x) #### MEM(Memory access) Just pass value. #### WB(Write Back) 1. slli x29, x6, 2 Write back 0x1d(x29), value = 0x04 ![](https://i.imgur.com/1rPxuUZ.png =200x) 2. add x13, x3, x29 Write back 0x0d(x13), value = 0x1000_0004 ![](https://i.imgur.com/4xTWgsR.png =200x) #### Result: 1. slli x29, x6, 2 ![](https://i.imgur.com/xeiWsUW.png =250x) 2. add x13, x3, x29 ![](https://i.imgur.com/AFEKpS1.png =250x) --- ### bge x10, x11, nochange ```c= if (arr[j] < arr[min_id]) { mid_id = j; } ``` change logic to ```c= if (arr[j] >= arr[min_id]) { skip(mid_id = j); } ``` bge x10, x11, nochange rs1 = `x10` = arr[j] rs2 = `x11` = arr[min_id] `bge`: take the branch if rs1 is greater than or equal to rs2 #### IF(Instruction Fetch) Fetch instruction at `0x2c` -> 0x00b55463 ![](https://i.imgur.com/xguzKfI.png =300x) #### ID(Instruction Decode) :::info bge: ![](https://i.imgur.com/lc3P5PW.png) ::: 0x00b55463 0b0_000000_01011_01010_101_0100_0_1100011 op -> 0b1100011 = `branch` funct3-> 0b101 = `bge` src1 -> 0b01010 = `10` src2 -> 0b01011 = `11` imm -> 0b000000000100 = `4` turn into -> 0b0000000001000 = `8` bytes :::spoiler Cite from CS61 2018 [Lecture 7](https://inst.eecs.berkeley.edu/~cs61c/resources/su18_lec/Lecture7.pdf) ![](https://i.imgur.com/z07oqed.png) --- Cite from CS61 2020 [Lecture 6](https://cs61c.org/su20/lectures/?file=lec06.pdf) ![](https://i.imgur.com/uLYHzV3.png) ::: --- The instruction pass into decoder and come aparts to `BGE`, `0x0a`, `0x0b`, `0x08` The src1 pass into R1 idx and `Registers` pass the value of `x10` to next stage. The src2 pass into R2 idx and `Registers` pass the value of `x11` to next stage. ![](https://i.imgur.com/xemCaiH.png =350x) #### EX(Execute) :::info ==one stall insert== ![](https://i.imgur.com/Xt350hZ.png) Because the last instruction is lw x11, 0(x14), and the bge need value of x11, so cause ==data hazard==, and it can't be deal with forwarding because MEM stage doesn't have the forwarding unit. So insert one stall, and ==forwarding== from WB can solve the hazard. ::: The multiplexer choose the value to branch unit: 1. `Register`(bule mark) 2. forwarding value(red mark) and branch unit decide to turn on the `Branch taken` signal. ![](https://i.imgur.com/xpafSO6.png) Also multiplexer choose the `PC` and `imm(offset)` value to ALU unit. ALU plus PC and offset and pass to PC unit and fetch the code, ![](https://i.imgur.com/yjUHn0k.png) but the code had ==be fetched and decoded== need to be ==flush==. ![](https://i.imgur.com/oSowecF.png) #### MEM(Memory Access) no operation #### WB(Write Back) on operation #### Result: branch to flag: nochange --- ### blt x6, x19, L2 #### IF(Instruction Fetch) Fetch instruction at `0x3c` -> 0xfd33_4ce3 ![](https://i.imgur.com/9Rteuho.png =200x) #### ID(Instruction Decode) :::info blt: ![](https://i.imgur.com/tsK5j87.png) ::: 0xfd334ce3 0b1_111110_10011_00110_100_1100_1_1100011 op -> 0b1100011 = `branch` funct3-> 0b100 = `blt` src1 -> 0b00110 = `6` src2 -> 0b10011 = `19` imm -> 0b111111101100 = `-20` 0b000000010100 = `20` -> 20*2===40== The instruction pass into decoder and come aparts to `BLT`, `0x06`, `0x13`, `0xffffffd8` The src1 pass into R1 idx and `Registers` pass the value of `x06` to next stage. The src2 pass into R2 idx and `Registers` pass the value of `x19` to next stage. ![](https://i.imgur.com/BLCbPex.png =300x) #### EX(Execute) Because the registers, `x19` and `x6`, didn't update to `Registers`, so the two value need to use forwarding value. ![](https://i.imgur.com/1L3mfXY.png) So the multiplxer choose to use the `x6` value forwarding from WB(==red mark==), and the `x19` value forwarding from EX/MEM(==blue mark==). ![](https://i.imgur.com/W5TnEFd.png) The branch unit get the forwarding value, and decide to turn on the `Branch taken` signal. ![](https://i.imgur.com/8iVvoKD.png =200x) ALU plus PC and offset and pass to PC unit and fetch the code, ![](https://i.imgur.com/P97qWcH.png) but the code had ==be fetched and decoded== need to be ==flush==. ![](https://i.imgur.com/66uM5ne.png) #### MEM(Memory Access) on operation #### WB(Write Back) on operation #### Result: branch to flag: L2 --- ### lw x10, 0(x12) load data from memory to register #### IF(Instruction Fetch) Fetch instruction at `0x1c` -> 0x0006a503 ![](https://i.imgur.com/QvIH18t.png =250x) #### ID(Instruction Decode) :::info lw: ![](https://i.imgur.com/CjrJ2Qt.png ) ::: 0x0006_a503 0b000000000000_01101_010_01010_0000011 op -> 0b110001 = `LOAD` rd -> 0b11001 = `25` funct3-> 0b010 = `load word` rs1 -> 0b01101 = `13` (base) imm -> 0b0 = `0` (offset) The instruction pass into decoder and come aparts to `LW`, `0x0d`, `0x0a`, `0x00000000` The rs1(base) pass into R1 idx and `Registers` pass the value of `x13` to next stage. ![](https://i.imgur.com/jRigb9X.png =300x) #### EX(Execute) Because the registers, `x19`, didn't update to `Registers`, so the value need to use forwarding value. ![](https://i.imgur.com/Pa7rASk.png) So the multiplxer choose to use the `x13` value forwarding from EX/MEM(==red mark==). ![](https://i.imgur.com/d2OQjSu.png =300x) ALU plus `x13` and offset, then pass the value(0x10000004) to next stage. #### MEM(Memory Access) Pass rd value to WB. Data memory unit get the address(0x1000_0004) from `ALU Res`, and output the corresponding data(0x0000_0002). ![](https://i.imgur.com/O8ehZiV.png =250x) #### WB(Write Back) Write back to `Registers`: 0x0a(x10), value = 0x0000_0002 ![](https://i.imgur.com/nyVoaLT.png) #### Result: x10 load data from 0(x13) = 0x02 ![](https://i.imgur.com/xScp1Pl.png =250x) --- ### sw x11, 0(x12) store data from register to memory #### IF(Instruction Fetch) Fetch instruction at `0x58` -> 0x00b62023 ![](https://i.imgur.com/zMDXhbJ.png =300x) #### ID(Instruction Decode) :::info sw: ![](https://i.imgur.com/u0l0kER.png) ::: 0x00b6_2023 0b0000000_01011_01100_010_00000_0100011 op -> 0b110001 = `STORE` funct3-> 0b010 = `store word` rs1 -> 0b01100 = `12` (base) rs2 -> 0b01011 = `11` imm -> 0b0 = `0` (offset) The instruction pass into decoder and come aparts to `SW`, `0x0b`, `0x0c`, `0x00000000` The rs1 pass into R1 idx and `Registers` pass the value of `x12` to next stage. The rs2 pass into R2 idx and `Registers` pass the value of `x11` to next stage. ![](https://i.imgur.com/WxEvclF.png =300x) #### EX(Execute) :::info ==one stall insert== Because the last instruction is lw x11, 0(x14), and the SW need the data of x11, but MEM stage doesn't have the forwarding unit. ![](https://i.imgur.com/sWq3xhj.png) So insert one stall, let LW to WB stage when SW in EX stage, and use the forwarding unit in WB to pass the new x11 value to SW. ![](https://i.imgur.com/w1RMQiR.png) ::: Multiplexer pass `imm`(offset) value and Reg 1(base) to ALU, ALU plus them and pass the address to MEM stage. And multiplexer let the forwarding value pass to `Data in`. ![](https://i.imgur.com/YfxDIrZ.png) #### MEM(memory access) The address pass from ALU Res and the Data in pass from last stage forwarding value. The `Wr en` turn green to allow Memory unit store data to memory. ![](https://i.imgur.com/4dyYB3z.png =300x) #### WB(Write Back) no operation #### Result: The value at the address(0x1000_0000) is 0x0000_0001 ![](https://i.imgur.com/pm55hnZ.png =250x) --- ### Memory update The i index for-loop(big loop), will swap a couple of memory. example: > i=0 is initialize | ==i== | arr[0] | arr[1] | arr[2] | arr[3] | arr[4] | |-------|--------|--------|--------|--------|--------| | ==0== | 5 | 3 | 8 | 6 | 1 | | ==1== | 1 | 3 | 8 | 6 | 5 | | ==2== | 1 | 3 | 8 | 6 | 5 | | ==3== | 1 | 3 | 5 | 6 | 8 | | ==4== | 1 | 3 | 5 | 6 | 8 | --- ## Refine assembly code According to the above analysis, some instruction can be reorginized. 1. The register always same value can move out from loop. x19 x18 > cycles 209 -> 197 2. i++ move front to before ==sw x11, 0(x12)== > cycles 197 -> 193 | info | | |-------|-----| |cycles |193 | |retired|145 | |CPI |1.33 | |IPC |0.751| ```cpp .data arr: .word 1,2,16,256,8 .text main: #get array base address la x3, arr #init i addi x5, x0, 0 addi x19, x0, 5 addi x18, x0, 4 #for loop i L1: #init mid_id = i addi x7, x5, 0 #init j = i + 1 addi x6, x5, 1 #for loop j L2: #get arr[j] address offset and address slli x29, x6, 2 add x13, x3, x29 lw x10, 0(x13) #get arr[mid_id] address offset and address slli x30, x7, 2 add x14, x3, x30 lw x11, 0(x14) #check if condition x10 = arr[j] ; x11 = arr[mid_id] bge x10, x11, nochange addi x7, x6, 0 nochange: #j++ addi x6, x6, 1 blt x6, x19, L2 #get arr[i] address offset and address slli x28, x5, 2 add x12, x3, x28 #get arr[mid_id] address offset and address slli x30, x7, 2 add x14, x3, x30 # temp = arr[i] lw x10, 0(x12) # arr[i] = arr[mid_id] lw x11, 0(x14) addi x5, x5, 1 sw x11, 0(x12) # arr[mid_id] = temp sw x10, 0(x14) blt x5, x18, L1 ``` ## Reference 1. ==Instruction format pictures== from [The RISC-V Instruction Set Manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) 2. ==CPU architecture pictures== generate from [Ripes Simulator](https://github.com/mortbopet/Ripes) 3. [CS61C_Su18_Lecture 7](https://inst.eecs.berkeley.edu/~cs61c/resources/su18_lec/Lecture7.pdf) P.45 4. [CS61C_Su20_Letture 6](https://cs61c.org/su20/lectures/?file=lec06.pdf) P.62