# Assignment1: RISC-V Assembly and Instruction Pipeline ###### tags: `Computer Architecture` `RISC-V` `jserv` > contributed by <[Tony (tonych1997)](https://github.com/tonych1997/Computer-Architecture)> > [Requirements for this assignment](https://hackmd.io/llVsw3WOTtSq4tf7GlrwRw?view) ## Leetcode [1929. Concatenation of Array](https://leetcode.com/problems/concatenation-of-array/) Given an integer array `nums` of length `n`, you want to create an array `ans` of length `2n` where `ans[i] == nums[i]` and `ans[i + n] == nums[i]` for `0 <= i < n` (**0-indexed**). Specifically, `ans` is the **concatenation** of two `nums` arrays. Return the array `ans`. ## Implementation You can find the source code [here](https://github.com/tonych1997/Computer-Architecture). Feel free to fork and modify it. ### C code Since the code used by Leetcode is in the form of a subroutine, I have modified the code slightly. In code for assignment, the input array lengths and contents, and the output array lengths, are predefined. #### Code for Leetcode ```clike= int* getConcatenation(int* nums, int numsSize, int* returnSize) { *returnSize = numsSize*2; int * ans=(int*)malloc(sizeof(int)*(numsSize*2)); int i = 0; for(i=0; i<numsSize; i++){ ans[i]=nums[i]; ans[i+numsSize]=nums[i]; } return ans; } ``` #### Code for assignment I defined an integer array of length 3 to do the concatenation action. After changing the length and content of the array, the program should still be correct. ```clike= #include <stdio.h> #include <stdlib.h> int main() { int a[3] = {1, 2, 3}; int b[6]; int i; for(i=0; i<3; i++){ b[i]=a[i]; b[i+3]=a[i]; } for(i=0; i<3*2; i++) { printf("%d ", b[i]); } return 0; } ``` ### Assembly code #### Correct output code **This code is longer, but it will output the correct answer.** I found that i (t0) is not set to 0 in main when the array length is more than 5, which causes an error. So we can add two nop in lines 18-19 and 20-21 to solve the problem. ```clike= .data arr: .word 1, 2, 3 # a[3] = {1, 2, 3} len1: .word 3 # array length of a is 3 space: .string " " # space .text # s1 = arr a base address # s2 = arr b base address # s3 = array length of a # s4 = array length of b # t0 = i for loopCon1, loopCon2, print main: la s1, arr # s1 = address of a lw s3, len1 # s3 = length of a add s4, s3, s3 # s4 = length of b (a*2) add t0, x0, x0 # i = 0 jal ra, loopCon1 add t0, x0, x0 # i = 0 jal ra, loopCon2 add t0, x0, x0 # i = 0 jal ra, print jal ra, exit loopCon1: add t1, t0, x0 # t1 = i add t1, t1, t1 # t1 = t1*2 add t1, t1, t1 # t1 = t1*2 (t1*4) add t1, t1, s1 # address of a[i] (base addr. + 4i) lw t2, 0(t1) # t2 = s1 (content of a[i]) add t1, t0, x0 # t1 = i add t1, t1, t1 # t1 = t1*2 add t1, t1, t1 # t1 = t1*2 (t1*4) add t1, t1, s2 # address of b[i] (base addr. + 4i) sw t2, 0(t1) # b[i] = t2 (content of a[i]) addi t0, t0, 1 # i++ blt t0, s3, loopCon1 # if i < length, go to loopCon1 ret # else, return to main loopCon2: add t1, t0, x0 # t1 = i add t1 t1 t1 # t1 i*2 add t1 t1 t1 # t1 i*2 (t1*4) add t1 t1 s1 # t1=i*4+base_of_arr lw t2 0(t1) # t2 = s1 (content of a[i]) add t1 t0 s3 # t1 = i + length add t1 t1 t1 # t1 = t1*2 add t1 t1 t1 # t1 = t1*2 (t1*4) add t1 t1 s2 # t1 = address of b[i+length] (base addr. + 4*(i+length)) sw t2 0(t1) # t2 = content in arr[n+1] addi t0 t0 1 # i++ blt t0, s3, loopCon2 # if i < length, go to loopCon2 ret # else, return to main print: lw t2 0(s2) # t2 = content of b[i] add a0, t2, x0 # load result of array b li a7, 1 # print integer ecall # systemcall la a0, space # load string - space li a7, 4 # print string ecall addi s2, s2, 4 # address move forward addi t0, t0, 1 # i++ blt t0, s4, print ret exit: li a7 10 # end program ecall ``` #### Wrong output code This code is **shorter and more efficient**, but when exported, the first half of the value is correct and the second half is wrong. **I think the error occurred on line 29, but I don't know the reason why.** ```clike= .data arr1: .word 1, 2, 3 # a[3] = {1, 2, 3} len1: .word 3 # array length of a is 3 len2: .word 6 # array length of b is 6 space: .string " " # space .text # s1 = arr a base address # s2 = arr b base address # s3 = array length of a # s4 = array length of b # t0 = i # t1 = a[i] # t2 = b[i] main: la s1, arr1 # s1 = a lw s3, len1 # s3 = 3 lw s4, len2 # s4 = 6 add t0, x0, x0 # i = 0 jal ra, loopCon # jump to loop for concatenation add t0, x0, x0 # i = 0 jal ra, loopPrint # jump to loop for print jal ra, exit # jump to exit loopCon: lw t1, 0(s1) # t1 = a[i] sw t1, 0(s2) # b[i] = t1 sw t1, 12(s2) # b[i+3] = t1 addi s1, s1, 4 # address move forward addi s2, s2, 4 # address move forward addi t0, t0, 1 # i++ blt t0, s3, loopCon # if i < 3, go to loopCon ret # else, return to main loopPrint: lw t2, 0(s2) # t2 = b[i] addi s2, s2, 4 # address move forward add a0, t2, x0 # load result of array b li a7, 1 # print integer ecall la a0, space # load string - space li a7, 4 # print string ecall addi t0, t0, 1 # i++ blt t0, s4, loopPrint # if i < 6, go to loopPrint ret # else, return to main exit: li a7, 10 # end program ecall ``` :::warning :warning: If you have a comprehensive automated test suite, no `print` is really needed. Instead, everything should be self-contained. :notes: jserv ::: ## Analysis I test my code using [Ripes](https://github.com/mortbopet/Ripes) simulator. ### Pseudo instuction Put code above into editor and we will see that Ripe doesn’t execute it literally. Instead, it replace [pseudo instruction (p.110)](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) into equivalent one, and change register name from [ABI (application binary interface)](https://zh.wikipedia.org/zh-tw/%E5%BA%94%E7%94%A8%E4%BA%8C%E8%BF%9B%E5%88%B6%E6%8E%A5%E5%8F%A3) name to sequencial one. The translated code as follows: ``` 00000000 <main>: 0: 10000497 auipc x9 0x10000 4: 00048493 addi x9 x9 0 8: 10000997 auipc x19 0x10000 c: 0049a983 lw x19 4 x19 10: 01398a33 add x20 x19 x19 14: 000002b3 add x5 x0 x0 18: 018000ef jal x1 24 <loopCon1> 1c: 000002b3 add x5 x0 x0 20: 044000ef jal x1 68 <loopCon2> 24: 000002b3 add x5 x0 x0 28: 070000ef jal x1 112 <print> 2c: 09c000ef jal x1 156 <exit> 00000030 <loopCon1>: 30: 00028333 add x6 x5 x0 34: 00630333 add x6 x6 x6 38: 00630333 add x6 x6 x6 3c: 00930333 add x6 x6 x9 40: 00032383 lw x7 0 x6 44: 00028333 add x6 x5 x0 48: 00630333 add x6 x6 x6 4c: 00630333 add x6 x6 x6 50: 01230333 add x6 x6 x18 54: 00732023 sw x7 0 x6 58: 00128293 addi x5 x5 1 5c: fd32cae3 blt x5 x19 -44 <loopCon1> 60: 00008067 jalr x0 x1 0 00000064 <loopCon2>: 64: 00028333 add x6 x5 x0 68: 00630333 add x6 x6 x6 6c: 00630333 add x6 x6 x6 70: 00930333 add x6 x6 x9 74: 00032383 lw x7 0 x6 78: 01328333 add x6 x5 x19 7c: 00630333 add x6 x6 x6 80: 00630333 add x6 x6 x6 84: 01230333 add x6 x6 x18 88: 00732023 sw x7 0 x6 8c: 00128293 addi x5 x5 1 90: fd32cae3 blt x5 x19 -44 <loopCon2> 94: 00008067 jalr x0 x1 0 00000098 <print>: 98: 00092383 lw x7 0 x18 9c: 00038533 add x10 x7 x0 a0: 00100893 addi x17 x0 1 a4: 00000073 ecall a8: 10000517 auipc x10 0x10000 ac: f6850513 addi x10 x10 -152 b0: 00400893 addi x17 x0 4 b4: 00000073 ecall b8: 00490913 addi x18 x18 4 bc: 00128293 addi x5 x5 1 c0: fd42cce3 blt x5 x20 -40 <print> c4: 00008067 jalr x0 x1 0 000000c8 <exit>: c8: 00a00893 addi x17 x0 10 cc: 00000073 ecall ``` In each row it denotes address in instruction memory, instruction’s machine code (in hex) and instruction itself respectively. ## 5-stage pipeline processor There are many types RISC-V processor in Ripes, as shown as follows. (RISC-V processors are similar to MIPS processors. MIPS is use in Computer Organization textbook of 白算盤) ![](https://i.imgur.com/t02SWLC.png) I choose the 5-stage proceesor, and it's block diagram as shown as follows. ![](https://i.imgur.com/lpeikBY.png) 5-stage means the processor has a pipeline of five stages, running instructions in parallel. The five stages as follows. 1. **Instruction Fetch (IF)** * Get Instructions from memory (mostly cache) by using PC . * PC + 4 ($\because$ each instruction is 4 bytes) 2. **Instruction decode and register file read (ID)** * Command decode and read Register 3. **EXecution/address calculation (EX)** * ALU Execute or address operation * Memory reference (load/store) * Register-Register ALU instruction * Register-Immediate ALU instruction * Conditional branch 4. **Memory access (MEM)** * Load: the memory does a read using the effective address computed in the previous cycle. * Store: the memory writes the data from the second register read from the register file using the effective address. 5. **Write back (WB)** * Write the result into the register file * Load: from the memory system * Register-Register ALU instruction: from the ALU ### Pipeline process Different types of instructions will use different stages. |Instruction|<--| --- | step name |---|-->|CPI| |:---------:|:-:|:---:|:------------------:|:-:|:-:|:-:| |R-type |IF |ID/RF|Execution | |WB | 4 | |lw |IF |ID/RF|addr. calc. |MA |WB | 5 | |sw |IF |ID/RF|addr. calc. |MA | | 4 | |beq |IF |ID/RF|compare reg. | | | 3 | |j |IF | ID |compose target addr.| | | 3 | | | | | | | | | |func. unit |IM |Reg. |ALU |DM |Reg.| 3| > Function Unit: The function unit used by the stage. > IM: Instruction Memory > Reg: Register file > DM: Data Memory The following diagram is a simple assembly language program that uses Ripes to understand how the pipeline works. ![](https://i.imgur.com/yrzCJqs.png)![](https://i.imgur.com/nZn7E46.png) #### Assembly code ```clike= main: addi s1, x0, 1 # s1 = 0 + 1 addi s2, x0, 6 # s2 = 0 + 6 addi s3, x0, 13 # s3 = 0 + 13 sw s1, 0(t0) # t0[0] content = s1 content sub t1, s3, s1 # t1 = s3(13) - s1(1) ``` #### Pseudo code ``` 00000000 <main>: 0: 00100493 addi x9 x0 1 4: 00600913 addi x18 x0 6 8: 00d00993 addi x19 x0 13 c: 0092a023 sw x9 0 x5 10: 40998333 sub x6 x19 x9 ``` #### IF In the begging, the PC has been added 4, and the first instruction has been put into pipline. ![](https://i.imgur.com/YKoiLOF.png) #### ID In the ID stage, the PC is set to 8 and then the value of the register or source is read in. R1 corresponds to x0 and the value is 0x00. R2 corresponds to the value 1 and the value is 0x01. ![](https://i.imgur.com/byaCYwl.png) #### EX In the EX stage, 0x00 and 0x01 are put into the ALU for the operation. ![](https://i.imgur.com/1qfTJG4.png) #### MEM In the MEM stage, read and write (lw / sw) accesses to memory are performed in this stage, and non-lw / sw instructions are passed. ![](https://i.imgur.com/VOME7TN.png) #### WB In the WB stage, the result will be written to the destination registers (R-type / lw). If you do not have to write back to the register, it will pass directly. Result (0x00000001) is the yellow line, and the above of yello line is the destination register 0x09 (x9 / s1). ![](https://i.imgur.com/kYwYfcb.png) ### Register value After five stages of the first instruction, the value of the s1 register is updated to 1. ![](https://i.imgur.com/2No2eCy.png) ## Pipeline Hazards Pipeline is efficient, but can cause hazard problems. There are three main types of Hazard that Pipeline encounters |Hazard type |Reasons for occurrence| |:---------------:|:--------------------:| |Structual Hazards|Hardware resource is limited. e.g. only single memory| |Data Hazards|An instruction in the pipeline requires a result that has not yet been generated by the previous instruction (which is still in the pipeline) (data dependency)| |Control Hazards|The result specified by Branch has not yet been generated (not yet decided whether to jump or not), and it has been executed by other commands.| ### Structual Hazards #### Solution * add hardware (e.g. memory) * e.g. If only have one memory, then every instruction use the same memory, should wait the previous instruction use memory first. ### Data Hazards * Solution * Software * Insert nop * Delay branch * Hardware * Stall + Forwarding #### Data Hazards example If we choose a 5-stage processor that without forwarding or hazard detection, then process these code. ![](https://i.imgur.com/2lawSzh.png) The result of register t1, it's value is 0x00. This is the wrong value. Becasuse the code in line 4 occured the Data Hazards. ![](https://i.imgur.com/kQ7RcRX.png) So, we can insert 2 nop to solve the Data Hazards. When the first instruction addi in WB stage, the third instruction add in ID stage. So after the first addi write back, then the third add will from ID to IF, then the third add can read the right value on t1(x9) in IF stage. ![](https://i.imgur.com/WVD72V7.png) After inserting two nop, t1 successfully writes the value. ![](https://i.imgur.com/Hx1t5H5.png) ### Control Hazards * Solution * Software * Insert nop * Delay branch * Hardware * Predict not taken * Flush wrong instruction #### Control Hazards example ```clike= main: addi s1, x0, 1 # s1 = 0 + 1 addi s2, x0, 1 # s2 = 0 + 1 beq s1, s2, jump # if(s1==s2) then jump to 'jump' label add t0, s1, s2 # t0 = s1 + s2 (set to = 2) jump: add t1, s1, s2 # t1 = s1 + s2 (set t1 = 2) ``` If Control Hazards occured, the beq in Ex stage decided to jump, but the back instructions are in the IF and ID stage. ![](https://i.imgur.com/Gtzp28h.png) ![](https://i.imgur.com/hkg6ein.png) Because the branch instruction decided to jump, so the pipeline flush the instruction in IF and ID stages, then read the branch's target instruction. Thus, solved the Control Hazards by using nop wastes two clocks. ![](https://i.imgur.com/EY5xfsZ.png) ## Reference ### Leetcode * [Leetcode Discuss - with C](https://leetcode.com/problems/concatenation-of-array/discuss/1341644/with-C) * [Leetcode Discuss - C / C++ / Python Simple and Short Solutions](https://leetcode.com/problems/concatenation-of-array/discuss/1418404/C-C%2B%2B-Python-Simple-and-Short-Solutions) ### RISC-V ISA * [RISC-V 指令集架構介紹 - Integer Calling convention](https://tclin914.github.io/77838749/) * [RISC-V指令集讲解(5)条件和无条件跳转指令](https://zhuanlan.zhihu.com/p/394860078) ### Pipeline * [RISC-V流水线CPU 计算机组成与设计第四章第二部分](https://blog.csdn.net/Photice/article/details/125240998?ops_request_misc=&request_id=&biz_id=102&utm_term=riscv%20pipeline&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-3-125240998.nonecase) * [Pipelining: Basic and Intermediate Concepts](https://blog.csdn.net/weixin_42437114/article/details/115285005) * [从零开始设计RISC-V处理器——五级流水线之数据冒险](https://blog.csdn.net/qq_45677520/article/details/122806845?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522166489581516800182773726%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fall.%2522%257D&request_id=166489581516800182773726&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~first_rank_ecpm_v1~pc_rank_v39-12-122806845-null-null.142^v51^pc_rank_34_2,201^v3^add_ask&utm_term=riscv%20%E6%B5%81%E6%B0%B4%E7%B7%9A%20%E4%BB%8B%E7%B4%B9) * [RISC-V Datapath Part4: Pipeline](https://zhuanlan.zhihu.com/p/267576239) * [Day-7 Pipeline](https://ithelp.ithome.com.tw/articles/10261505) * [Hazard](https://hackmd.io/@dZfCcN4hT8aUuDPv3B8CWQ/rk7NPf2kK) ### Computer Organization / Architecture * [計算機組織結構](https://hackmd.io/@8bFA57f7SRG-K7AAT8s62g/ryv1NT3S#%E8%A8%88%E7%AE%97%E6%A9%9F%E7%B5%84%E7%B9%94%E7%B5%90%E6%A7%8B) * [wjungle 筆記](https://www.ptt.cc/bbs/graduate/M.1476022765.A.174.html) ### Past assignment in this class * [Assignment1: RISC-V Assembly and Instruction Pipeline - contributed by kaeteyaruyo](https://hackmd.io/@kaeteyaruyo/risc-v-hw1)