Assignment1 : RISC-V Assembly and Instruction Pipeline

# Assignment1 : RISC-V Assembly and Instruction Pipeline ###### tags: `2021_CS Computer Architecture` `LeetCode13` `RomanToInteger` ## Introduction Roman numerals transfer to integer in RISC-V assembly and C. ## Roman numerals Roman numerals are represented by seven different symbols: ```I```, ```V```, ```X```, ```L```, ```C```, ```D``` and ```M```. |Symbol|Value| |:--:|:--:| |I|1| |V|5| |X|10| |L|50| |C|100| |D|500| |M|1000| :::warning Use the Markdown syntax for table to represent. :notes: jserv ::: For example, ```2``` is written as ```II``` in Roman numeral, just two one's added together. ```12``` is written as ```XII```, which is simply ```X + II```. The number 27 is written as ```XXVII```, which is ```XX + V + II```. Roman numerals are usually written largest to smallest from left to right. However, the numeral for four is not ```IIII```. Instead, the number four is written as ```IV```. Because the one is before the five we subtract it making four. **For example:** - ```I``` can be placed before ```V``` (5) and ```X``` (10) to make 4 and 9. - ```X``` can be placed before ```L``` (50) and ```C``` (100) to make 40 and 90. - ```C``` can be placed before ```D``` (500) and ```M``` (1000) to make 400 and 900. **Constraints:** - ```1 <= s.length <= 15``` - ```s``` contains only the characters ```('I', 'V', 'X', 'L', 'C', 'D', 'M')```. - It is **guaranteed** that ```s``` is a valid roman numeral in the range ```[1, 3999]```. ## C implementation ([RomanToInteger.c](https://github.com/amberlin0516/LeetCode13_RomanToInteger/blob/main/LeetCode13_RomanToInteger.c)) ```c= #include<stdio.h> int transINT(char c) { if(c == 'I') return 1; else if(c == 'V') return 5; else if(c == 'X') return 10; else if(c == 'L') return 50; else if(c == 'C') return 100; else if(c == 'D') return 500; else return 1000; } int romanToInt(char * s){ int i; int n_size = 0; int sum; int cur, pre; //strSize i = 0; while(s[i] != '\0') i++; n_size = i; cur = transINT(*(s + n_size - 1)); sum = cur; //the rightest element n[n_size-1] must be added in sum //if sum = 0, it would be out of bound in forloop for(i = (n_size-1); i > 0 ; i--) { pre = transINT(*(s+i-1)); if(cur <= pre) sum += pre; else sum -= pre; cur = pre; } return sum; } int main() { char s[] = "MCMXCIV"; int number; number = romanToInt(s); printf("%d", number); //system("PAUSE"); return 0; } ``` ## RISC-V Assembly implementation ([RomanToInteger.s](https://github.com/amberlin0516/LeetCode13_RomanToInteger/blob/main/LeetCode13_RomanToInteger.s)) ```assembly= .data str: .byte 77, 67, 77, 88, 67, 73, 86 #MCMXCIV #s0 = str addr #s1 = str[s0], str data bit from s0 #a3 = sum = return value from [romantoInt], it is result to whole program #a2 = return value from [transInt] #s3 = n_size, how many elements in str #t0 is alaways i .text main: la s0, str #"la" is psedo inst. to load 1st addr of str jal ra, romantoInt addi a0, a3, 0 addi a7, x0, 1 ecall addi a7, x0, 10 ecall romantoInt: add t6, ra, x0 jal ra, strSize addi s3, t0, 0 #store n_size t0 to s3 #in [Loop] of [Compare], I call another function [transInt], than ra would be covered by line 51 #so i use stack store return address of line 25. jal ra, Compare add ra, t6, x0 jalr ra strSize: addi t0, t0, 0 #initialize i, t0 = i = 0 for: lb s1, 0(s0) beq s1, x0, zero #if s1 is zero, go to [zero] and return addi s0, s0, 1 #(t1 = n_size), n_size++ addi t0, t0, 1 #n_size++ jal x0, for zero: jalr ra #return value is t0. return to line 21. Compare: #t2 = cur / t3 = pre / t0 = i / sum = a3 #do sum = transINT(s[n_size-1]); addi sp, sp, -4 #preserve a space for stack sw ra, 0(sp) #save ra in stack addi s0, s0, -1 #through process [strSize], s0 = (str addr + n_size), s0 has to go back to (str addr + n_size - 1) lb s1, 0(s0) #s1 = str[s0] = str[str addr + n_size - 1] jal ra, transInt addi a3, a2, 0 #a3 = sum, a2 = return value from [transINT] addi t2, a2, 0 #t2 = cur addi t0, s3, -1 #initialize i, i = n_size - 1 Loop: addi t0, t0, -1 #i-- addi s0, s0, -1 lb s1, 0(s0) #s1 = str[n_size - i] jal ra, transInt addi t3, a2, 0 #t3 = pre blt t3, t2, PreLessthanCur #if pre < cur add a3, a3, t3 #else, sum += pre; addi t2, t3, 0 #cur = pre bne t0, x0, Loop #if i != 0, go to [Loop] lw ra, 0(sp) #load ra from stack addi sp, sp, 4 #restore stack jalr ra #return value is a3, return to line 25. PreLessthanCur: sub a3, a3, t3 #if pre < cur,sum -= pre; addi t2, t3, 0 #cur = pre bne t0, x0, Loop #if i != 0, go to [Loop] lw ra, 0(sp) #load ra from stack addi sp, sp, 4 #restore stack jalr ra #return value is a3, return to line 25. transInt: #return = a2 / t1 = temp addi t1, x0, 73 #I = 73 beq s1, t1, I addi t1, x0, 86 #V = 86 beq s1, t1, V addi t1, x0, 88 #X = 88 beq s1, t1, X addi t1, x0, 76 #L = 76 beq s1, t1, L addi t1, x0, 67 #C = 67 beq s1, t1, C addi t1, x0, 68 #D = 68 beq s1, t1, D M: addi a2, x0, 1000 #else M jalr ra #return value is a2. return to line 45 or 53. I: addi a2, x0, 1 jalr ra #return value is a2. return to line 45 or 53. V: addi a2, x0, 5 jalr ra #return value is a2. return to line 45 or 53. X: addi a2, x0, 10 jalr ra #return value is a2. return to line 45 or 53. L: addi a2, x0, 50 jalr ra #return value is a2. return to line 45 or 53. C: addi a2, x0, 100 jalr ra #return value is a2. return to line 45 or 53. D: addi a2, x0, 500 jalr ra #return value is a2. return to line 45 or 53. ``` ## How to input? ```str``` contains only the characters ```'I', 'V', 'X', 'L', 'C', 'D', 'M'```, which ASCII is ```73, 86, 88, 76, 67, 68, 77```. Just modify ASCII here to change Roman numerals. ![](https://i.imgur.com/v4k7up6.png) ## Result **Result in C:** Let's check the result in LeetCode Output window: (Btw, I had some unknown problems to install C compiler in my lab's computer. My classmate told me leetcode already has several good compilers to compile my program! It is pitty that I hadn't known before...) ![](https://i.imgur.com/EnCg6eY.png) **Result in RISC-V:** Let's check the result in Ripes console window: ![](https://i.imgur.com/RjnanPz.png) ## Ripes Simulator **Concept of pipeline** *Throughput* refers to the performance of tasks by a computing service or device over a specific period. It measures the amount of completed work against time consumed and may be used to measure the performance of a processor, memory and/or network communications. Instruction pipelined design make massive improvement to throughput. ![](https://i.imgur.com/6dkdbRf.png) The picture below is Ripes which is 5-stages pipeline CPU. ![](https://i.imgur.com/DD7Ob1e.png) The pipeline registers separate each pipeline stage, which are transmit data in each clock. There is 5-stages pipeline CPU: ![](https://i.imgur.com/oOdOxV6.png) **IF:** Instruction fetch **ID:** Instruction decode and register file read **EX:** Execution or address calculation **MEM:** Data memory access **WB:** Write back In order to make myself understand every stage operation simply and clearly, I redraw all stages picture and add some netname. Drawio is a pretty good tool to draw this diagram. The two arrow nets represent the data is to or from another stage. ### IF - Instruction Fetch IF stage not only fetchs instruction but also determines PC of next clock. The rightest selector chooses the input data of PC block which is PC+4 from adder or Jump instruction address from EX stage. The net name PC is the output of PC block, which would pass to 3 places: 1. Pass to instruction memory, and let it get instruction . 2. Pass to IF/ID register, and let another stage use it 3. Pass to adder to make next step PC+4. ![](https://i.imgur.com/0cR6j6L.png) ### ID - Instruction decode ID stage decode instruction and read register file. First, IF/ID register output instruction to Decode block and Immediate block. After Decode block decode the instructions, it will output rs1 input and rs2 input to Registers block. At the same time, transmit rd to EX/MEM register and send opcode to immediate block to do signed extension. Finally, EX/MEM register store three data, rs1 output, rs2 output, and signed extandsion immediate. ![](https://i.imgur.com/7DfyV27.png) ### EX - Execution EX stage is responsible for mathematical & logic calculation, and address calculation. It looks a little bit complicated, but separate and analyze it would be easier to know how it works. ![](https://i.imgur.com/496buxa.png) First satge selectors determine one of these input signal, and decide who can be the output reg1_out / reg2_out. 1. write back data which is bypassed from WB stage. 2. ALU output which is bypassed from EX/MEM register. 3. reg1 / reg2 ![](https://i.imgur.com/Ys8J1uT.png) Second stage selectors determine one of these input signal below and decide who can be the output ALU input1 / ALU input2. Upper selector input: 1. PC 2. reg1_out Lower selector input: 1. immediate 2. reg2_out The branch block deals with any branch instructions. ![](https://i.imgur.com/40uFvhc.png) Third stage, this is ALU (Arithmetic logic unit) that is digital circuit used to perform arithmetic and logic operations. ALU output its result to EX/MEM register, meanwhile, bypass Jump instruction address to IF stage. ![](https://i.imgur.com/pBnxiG0.png) ### MEM - Data memory access MEM stage deals with data Load/Store between register and memory, than there are some instructions like *lw*, *sw*...etc. - If the instruction is Load (blue path), Data memory would access ALU output from EX/MEM register, than output data to MEM/WB register. - If the instruction is Store (green path), Data memory would access reg2 from EX/MEM register. - if the instruction is neither Load nor Store (red path), EX/MEM register would transmit ALU output directly to MEM/WB rsgister. ![](https://i.imgur.com/3ftFzeu.png) ### WB - Write back WB stage is responsible for write back data to Register file which is in EX stage. EX/MEM register input one of these three data to WB stage: 1. PC+4 (originally comes from IF stage) 2. ALU Output (originally comes from EX stage ALU) 3. Load data (originally comes from MEM stage Data Memory) Than transmit rd and write back data to Registers where is in EX stage, and made Registers store rd. ![](https://i.imgur.com/xpn0hhK.png) ## Pipeline Hazards pipeline hazard is a condition that the next instruction cannot execute in the following clock cycle. the following is 3 different types of pipeline hazard: 1. structural hazard. 2. data hazard. 3. control hazard. ### Structural Hazards It means that the hardware cannot support the combination of instructions that we want to execute in the same clock cycle. **how to solve it?** In IF & MEM stage: The RV12 implements a Harvard architecture for simultaneous instruction and data memory accesses. It can effectively avoid structural hazard. To put it simply, two memories can solve structural hazard. In ID & WB stage: If read and write data to Register file comtemporary, just write at clock rising edge, and read data at falling edge to solve structural harzard. ### Data Hazards Data hazards occur when the pipeline must be stalled because one step must wait for another to complete. For example, Inst 2. in EX stage needs the result of Inst 1. in EX stage. ![](https://i.imgur.com/ooXq7FW.png) **how to solve it?** Just ***forwarding*** or ***bypassing*** to another stage. The two arrow path ALU output (red) is forwording fom MEM to EX stage. **load-use data hazard** A specific form of data hazard in which the data being loaded by a load instruction have not yet become available when they are needed by another instruction. This form of data hazard can not be solved by forwarding. For example, when Inst 3. in EXE stage needs the result of inst 1. which is Load instruction in MEM stage. ![](https://i.imgur.com/HHcSqN7.png) **how to solve it?** The ***hazard detection unit*** works in ID stage, it make the instruction that next to load instruction stall. When Inst 2. go to ID stage, hazard detection unit detect whether Inst 2. is Load instruction, and compare Inst 1. rd whether is the same as Inst 2. register(rd & rs). If both of the codition is true, flush the decode data of inst 2., and send Interlock signal. At next clock, Inst 2. would be redecoded. ![](https://i.imgur.com/CSAPrHk.png) ### Control Hazards Also called **branch hazard**. Arising from the need to make a decision based on the results of one instruction while others are executing. (*pipeline stall*, also call it *bubble*.) For example, Inst 1. is Branch or Jump instruction. When Inst 1. execute after EX stage, the next 2 instruction would be flushed. ![](https://i.imgur.com/oNeaQPi.png)