# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [`dhiptmc`](https://github.com/dhiptmc) > ###### tags: `RISC-V` `Computer Architecture 2022` ## Problem Decompress Run-Length Encoded List([LeetCode1313](https://leetcode.com/problems/decompress-run-length-encoded-list/)) >We are given a list nums of integers representing a list compressed with run-length encoding. > >Consider each adjacent pair of elements [freq, val] = [nums[2*i], nums[2*i+1]] (with i >= 0). For each such pair, there are freq elements with value val concatenated in a sublist. Concatenate all the sublists from left to right to generate the decompressed list. > >Return the decompressed list. #### Example 1: ``` Input: nums = [1,2,3,4] Output: [2,4,4,4] Explanation: The first pair [1,2] means we have freq = 1 and val = 2 so we generate the array [2]. The second pair [3,4] means we have freq = 3 and val = 4 so we generate [4,4,4]. At the end the concatenation [2] + [4,4,4] is [2,4,4,4]. ``` #### Example 2: ``` Input: nums = [1,1,2,3] Output: [1,3,3] ``` ## Solution Input array and numsSize are known when we start this program. We first traverse the input array *(int i = 0; i < numsSize; i += 2)* to find the size of the output array. Next we collect the element after "freq" so as to get "val". Note that each "val" needs to be insert into output array for "freq" times. ## Implementation You can find the source code [here](https://github.com/dhiptmc/ComputerArchitecture/tree/main/Hw1). Feel free to fork and modify it. ### C code ```clike= /** * Note: The returned array must be malloced, assume caller calls free(). */ int *decompressRLElist(int *nums, int numsSize, int *returnSize) { *returnSize = 0; for (int i = 0; i < numsSize; i += 2) { *returnSize += nums[i]; } int *result = malloc(sizeof(int) * (*returnSize)); int count = 0; for (int i = 0; i < numsSize; i += 2) { for (int j = 0; j < nums[i]; j++) { result[count++] = nums[i+1]; } } return result; } ``` ### Assembly code ``` .data nums1: .word 1,2,3,4 #2444 nums2: .word 6,5,1,3,2,1 #555555311 nums3: .word 1,3,2,4,2,7,1,8 #344778 numsSize1: .word 4 numsSize2: .word 6 numsSize3: .word 8 returnSize1: .word 0 returnSize2: .word 0 returnSize3: .word 0 enter: .string "\n" .text main_1: # execute first test data la a1, nums1 # get nums[i] address a1->nums1 la a2, numsSize1 # get numsSize address a2->numsSize1 lw a2, 0(a2) # a2 = numsSize la a3, returnSize1 # get returnSize address a3->returnSize1 lw a3, 0(a3) # a3 = returnSize addi a4, zero, 0 # a4 = i = 0 #sw 0, 0(a3) # *returnSize1 = 0 addi s2, zero, 2 # record number of data set need to be execute j outer_loop #####loop##### proc: lw t1, 4(t0) # nums[i+1] slli t2, a6, 2 # calculate result[count] address add t2, t2, s1 # calculate result[count] address //s1 -> result base address sw t1, 0(t2) # result[count] = nums[i+1]; addi a6, a6, 1 # count++ addi a5, a5, 1 # j++ inner_loop: slli t0, a4, 2 # calculate nums[i] address add t0, t0, a1 # calculate nums[i] address //a1 -> nums base address lw t1, 0(t0) # load nums[i] to t1 blt a5, t1, proc # if j < nums[i], jump to proc add a3, a3, t1 # *returnSize += nums[i] ##modified addi a5, zero, 0 # a5 = j = 0 addi a4, a4, 2 # i += 2 outer_loop: blt a4, a2, inner_loop # if i < numsSize, jump to inner_loop addi a4, zero, 0 # set i = 0 addi a5, zero, 0 # a5 = j = 0 addi a6, zero, 0 # a6 = count = 0 j print #####load different test data##### main_2: la a1, nums2 la a2, numsSize2 # get numsSize2 address a2->numsSize2 lw a2, 0(a2) # a2 = numsSize2 la a3, returnSize2 # get returnSize2 address a3->returnSize2 lw a3, 0(a3) # a3 = returnSize addi a4, zero, 0 # reset i j outer_loop main_3: la a1, nums3 la a2, numsSize3 # get numsSize3 address a2->numsSize3 lw a2, 0(a2) # a2 = numsSize3 la a3, returnSize3 # get returnSize3 address a3->returnSize3 lw a3, 0(a3) # a3 = returnSize addi a4, zero, 0 # reset i j outer_loop #####print result##### print: slli t2, a4, 2 # calculate result[i] address add t2, t2 ,s1 # calculate result[i] address //s1 -> result base address lw a0, 0(t2) # load result[i] to a0 li a7, 1 # system call print ecall # execute system call addi a4, a4, 1 # i++ blt a4, a3, print # if i < returnSize, jump to print la a0, enter # print \n li a7, 4 # a7 = 4 means print string ecall addi s2, s2, -1 bgtz s2,main_2 # if s2 > 0 jump to main_2 beqz s2,main_3 # if s2 == 0 jump to main_3 j exit exit: li a7, 10 # system call exit ecall # execute system call ``` :::warning :warning: Can you use fewer instructions? :notes: jserv ::: :::success Indeed, there are possible ways to use fewer instructions. Modified version of Assembly code is shown below. ``` .data nums: .word 1,2,3,4, 6,5,1,3,2,1, 1,3,2,4,2,7,1,8 numsSize: .word 4,6,8 returnSize: .word 0,0,0 enter: .string "\n" result: .word 0,0,0,0,0,0,0,0,0,0 ##used to avoid EXCEEDING boundaries #####function starts##### .text initial: addi t5, zero, 3 # record number of testing datas need to be executed addi s4, zero, 0 # identify which testing data is executing la a1, nums # get nums[i] address a1 -> nums la a2, numsSize # get numsSize address a2 -> numsSize la a3, returnSize # get returnSize address a3 -> returnSize main: lw s3, 0(a3) # s3 = returnSize lw s2, 0(a2) # s2 = numsSize lw a4, result # base address of result lw t4, result # the desired result address j loop #####loop##### proc: sw t2, 0(t4) # result[count] = nums[i+1] addi t4, t4, 4 # t4 -> result[count+1] addi a6, a6, 1 # j++ blt a6, t1, proc # if j < nums[i], jump back to proc add s3, s3, t1 # *returnSize += nums[i] addi a6, zero, 0 # a6 = j = 0 addi a5, a5, 2 # i += 2 addi a1, a1, 8 # a1 -> nums[i+2] loop: lw t1, 0(a1) # t1 = nums[i] lw t2, 4(a1) # t2 = nums[i+1] bge a5, s2, print # if i >= numsSize, jump to print blt a6, t1, proc # if j < nums[i] , jump to proc #####print result and initialize next test set##### print: lw a0, 0(a4) # load result[i] to a0 li a7, 1 # system call print ecall # execute system call addi a4, a4, 4 # move to next result word bne a4, t4, print # if a4 != final address of t4, jump to print la a0, enter # print \n li a7, 4 # a7 = 4 means print string ecall ##initialize addi a5, zero, 0 # a5 = i = 0 addi s4, s4, 1 # move to the next testing data addi a2, a2, 4 # move to the next numsSize addi a3, a3, 4 # move to the next returnSize blt s4, t5, main # if s4 < 3 jump to main j exit exit: li a7, 10 # system call exit ecall # execute system call ``` As a result, clock cycle as the figure shown below decreases compared to the original work. Origin: ![](https://i.imgur.com/aNPQlad.png "Original") After modified: ![](https://i.imgur.com/iU2bs2X.png "Modified") ::: ## Analysis We test our code using [Ripes](https://github.com/mortbopet/Ripes) simulator ## Results * Test data 1 ``` Input: nums = [1,2,3,4] Output: [244] ``` * Test data 2 ``` Input: nums = [6,5,1,3,2,1] Output: [555555311] ``` * Test data 3 ``` Input: nums = [1,3,2,4,2,7,1,8] Output: [344778] ``` #### Ripes print ![](https://i.imgur.com/5vBIbvP.png "Ripes 模擬結果") ## 5-stage pipelined processor ![](https://i.imgur.com/fBgoWHd.png "5-stage pipelined processor") ### IF: instruction fetch stage ![](https://i.imgur.com/NRD5wpL.png "IF") **Fetch instruction from memory.** At this example, ***lw x12 0 x12*** is in IF stage. Current PC(Program counter) is 0x00000010, which leads to an instruction decoded as 0x00062603. We can find out that there is a simple ALU that adds 4 to the current PC which eventually returns the next PC. Meanwhile, current PC and PC+4 will be delivered to *IFID* for later use. Noted that there is also a possibility that *PC*(PC Unit) receives a value from **EXE** stage. ### ID: instruction decode/register file read stage ![](https://i.imgur.com/yVK3go4.png "ID") **Read registers while decoding the instruction.** At this example, ***addi x12 x12 64*** is in ID stage. *Decode*(Decode Unit) outputs the x12 address(0x0c) for the *Registers*, and *Registers* outputs the corresponding value stored in x12 to *IDEX*. In this circumstance, *Reg2* is unnecessary for *addi* operation, so we can ignore it. *Imm.* unit(Immediate Generator) outputs the immediaite value, in this case, 0x00000040(32'd64) will be delivered to *IDEX* for later use. Noted that target register (x12) address for storing back to the *Registers*, PC, and PC+4 are also delivered to *IDEX*. ### EXE: execution stage ![](https://i.imgur.com/2WiXv8X.png "EXE") **Execute the operation or calculate an address.** At this example, ***auipc x12 0x10000*** is in EXE stage. **EXE** stage is somewhat complicated. 4 MUXs(multiplexer) can be found. The left two MUXs choose the source of the input registers, which can be directly come from *IDEX* (*Reg1*/*Reg2*) or from two different forwarding paths. The right two MUXs choose whether the *Op* source comes from PC, *IMM.* unit or register. In this case, *Op1* comes from PC, and *Op2* comes from *IMM.* unit. *ALU* unit eventually finsihes the arithmetic operation and outputs ths result to *EXMEM*, and outputs the target PC address to **IF** stage in some branch instructions. In addition, a *Branch* unit takes value of two different registers in order to perform branch control, which splits the traditional operation from the *ALU* unit so as to alleviate the workload for *ALU* unit, and may have better performance for branch operations. Noted that target register, one register input, and PC+4 is also delivered to *EX/MEM*. ### MEM: memory access stage ![](https://i.imgur.com/pnJ3ufK.png "MEM") **Access an operand in data memory.** At this example, ***addi x11 x11 0*** is in MEM stage. *Data memory* unit recieves the *Addr.* and *Data in* signals to perform memory read/write. In this circumstance, there is no need to use *Data memory* unit, so we can simply ignore the *Read out* result. In addition, result from the previous **EXE** stage can be forwarded to the current **EXE** stage. ### WB: write-back stage ![](https://i.imgur.com/cOKEhLD.png "WB") **Write the result into a register.** At this example, ***auipc x11 0x10000*** is in WB stage. MUX chooses the input signal from PC+4, *Data memory* unit, or result from the **EXE** stage. In this circumstance, we choose the result from the **EXE** stage(0x10000000), and stores the value to the register whose target address is 0x0b. In addition, output signal from the MUX can be forwarded to the **EXE** stage. ## Memory update steps **Memory condition when program starts** ![](https://i.imgur.com/GrxdPlG.png "initialization") ![](https://i.imgur.com/32zglXl.png "Instruction memory") From the Instruction memory list, we can discover that the next few instructions has been loaded into the pipeline. For instance, ***lw x12 0 x12*** is in IF stage. **Memory updates during main_1** ![](https://i.imgur.com/LarbRMw.png "main_1") ![](https://i.imgur.com/DjoZaWg.png "main_1 memory") a1 gets the base address of nums1. a2 gets the numsSize1. a3 = returnSize which initials as 0 for later use. a4 = i = 0 for later branch deciding. s2 = 2 records numbers of testing sets need to be execute. **Memory updates during proc and inner_loop** ![](https://i.imgur.com/wokOIvQ.png "proc and inner_loop") ![](https://i.imgur.com/3mPqhFa.png "proc and inner_loop memory") t0 gets the nums1[i] address. t1 first loads nums[i], and is used to decide whether to branch to proc or not. Then, t1 loads nums[i+1], t2 gets result[count] address, and nums[i+1] is stored in "result[count] address". Noted that we shift the index 2 bits so as to match the memory address. Eventually, we make a5 += 1(j++), a6 += 1(count++). t1 reloads nums[i], and we check the relationship between the value in a5 and t1, which leads us to decide whether go backs to proc or not. If we don't need to branch to proc again, we add t1(nums[i]) to a3(returnSize), set a5(j) to 0, and add 2 to a4(i). **Memory updates during outer_loop** ![](https://i.imgur.com/Uha5tQE.png "outer_loop") ![](https://i.imgur.com/S5zvoU2.png "outer_loop memory") We reset a4(i), a5(j), a6(count) to zero, prepare for later use. **Memory updates after print** ![](https://i.imgur.com/nXaSb8t.png "after print") ![](https://i.imgur.com/7J1Qmgt.png "after print memory") s2 value is decreased by 1. ## Hazard ### Data Hazard Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Ignoring potential data hazards can result in race conditions (also termed race hazards). There are three situations in which a data hazard can occur: * read after write (RAW), a true dependency * write after read (WAR), an anti-dependency * write after write (WAW), an output dependency Read after read (RAR) is not a hazard case. * **Solution** 1. Forwarding ![](https://i.imgur.com/Z3hN6H2.png "Forwarding example") By forwarding, we can use the result from **MEM** stage or **WB** stage immediately. As the picture shown above, the current instruction in **EXE** stage is ***addi x11 x11 0***, while the previous instruction is ***auipc x11 0x10000***. Therefore, we pass the value from *EX/MEM* to **EXE** stage as the yellow line shows in order to solve the data hazard. 2. Stall when Load-Use ![](https://i.imgur.com/f22nrff.png "Load-Use example") One case where forwarding cannot save the day is when an instruction tries to read a register following a load instruction that writes the same register. The picture shown above illustrates the problem. In this situation, a stall is needed. ### Control Hazard Control hazard occurs when the pipeline makes wrong decisions on branch prediction and therefore brings instructions into the pipeline that must subsequently be discarded. The term branch hazard also refers to a control hazard. * Solution: ![](https://i.imgur.com/FUNLJ7J.png "Control Hazard") Control hazard occurs when the branch is taken. Hence, we need to flush the sequential instructions that follow the branch. As a matter of fact, we will waste 2 cycles since branch is dicided at **EXE** stage. One way to improve branch performance is to reduce the cost of the taken branch. Moving the branch execution to the **ID** stage will be a wise choice, since it reduces the penalty of a branch to only one instruction if the branch is taken, namely, the one currently being fetched. Despite there may be several difficulties to conquer (several new units and mechanisms needed), it is still an improvement to the performance. ![](https://i.imgur.com/bwt3Asg.png "Better solution to Control Hazard") > David A. PATTERSON, JOHN L. HENNESSY. Computer Organization and Design 4/e, page 379 FIGURE 4.62. ## Pseudo instruction Origin: ``` 00000000 <main_1>: 0: 10000597 auipc x11 0x10000 4: 00058593 addi x11 x11 0 8: 10000617 auipc x12 0x10000 c: 04060613 addi x12 x12 64 10: 00062603 lw x12 0 x12 14: 10000697 auipc x13 0x10000 18: 04068693 addi x13 x13 64 1c: 0006a683 lw x13 0 x13 20: 00000713 addi x14 x0 0 24: 00200913 addi x18 x0 2 28: 0380006f jal x0 56 <outer_loop> 0000002c <proc>: 2c: 0042a303 lw x6 4 x5 30: 00281393 slli x7 x16 2 34: 009383b3 add x7 x7 x9 38: 0063a023 sw x6 0 x7 3c: 00180813 addi x16 x16 1 40: 00178793 addi x15 x15 1 00000044 <inner_loop>: 44: 00271293 slli x5 x14 2 48: 00b282b3 add x5 x5 x11 4c: 0002a303 lw x6 0 x5 50: fc67cee3 blt x15 x6 -36 <proc> 54: 006686b3 add x13 x13 x6 58: 00000793 addi x15 x0 0 5c: 00270713 addi x14 x14 2 00000060 <outer_loop>: 60: fec742e3 blt x14 x12 -28 <inner_loop> 64: 00000713 addi x14 x0 0 68: 00000793 addi x15 x0 0 6c: 00000813 addi x16 x0 0 70: 0540006f jal x0 84 <print> 00000074 <main_2>: 74: 10000597 auipc x11 0x10000 78: f9c58593 addi x11 x11 -100 7c: 10000617 auipc x12 0x10000 80: fd060613 addi x12 x12 -48 84: 00062603 lw x12 0 x12 88: 10000697 auipc x13 0x10000 8c: fd068693 addi x13 x13 -48 90: 0006a683 lw x13 0 x13 94: 00000713 addi x14 x0 0 98: fc9ff06f jal x0 -56 <outer_loop> 0000009c <main_3>: 9c: 10000597 auipc x11 0x10000 a0: f8c58593 addi x11 x11 -116 a4: 10000617 auipc x12 0x10000 a8: fac60613 addi x12 x12 -84 ac: 00062603 lw x12 0 x12 b0: 10000697 auipc x13 0x10000 b4: fac68693 addi x13 x13 -84 b8: 0006a683 lw x13 0 x13 bc: 00000713 addi x14 x0 0 c0: fa1ff06f jal x0 -96 <outer_loop> 000000c4 <print>: c4: 00271393 slli x7 x14 2 c8: 009383b3 add x7 x7 x9 cc: 0003a503 lw x10 0 x7 d0: 00100893 addi x17 x0 1 d4: 00000073 ecall d8: 00170713 addi x14 x14 1 dc: fed744e3 blt x14 x13 -24 <print> e0: 10000517 auipc x10 0x10000 e4: f8050513 addi x10 x10 -128 e8: 00400893 addi x17 x0 4 ec: 00000073 ecall f0: fff90913 addi x18 x18 -1 f4: f92040e3 blt x0 x18 -128 <main_2> f8: fa0902e3 beq x18 x0 -92 <main_3> fc: 0040006f jal x0 4 <exit> 00000100 <exit>: 100: 00a00893 addi x17 x0 10 104: 00000073 ecall ``` :::success Modified: ``` 00000000 <initial>: 0: 00300f13 addi x30 x0 3 4: 00000a13 addi x20 x0 0 8: 10000597 auipc x11 0x10000 c: ff858593 addi x11 x11 -8 10: 10000617 auipc x12 0x10000 14: 03860613 addi x12 x12 56 18: 10000697 auipc x13 0x10000 1c: 03c68693 addi x13 x13 60 00000020 <main>: 20: 0006a983 lw x19 0 x13 24: 00062903 lw x18 0 x12 28: 10000717 auipc x14 0x10000 2c: 03a72703 lw x14 58 x14 30: 10000e97 auipc x29 0x10000 34: 032eae83 lw x29 50 x29 38: 0240006f jal x0 36 <loop> 0000003c <proc>: 3c: 007ea023 sw x7 0 x29 40: 004e8e93 addi x29 x29 4 44: 00180813 addi x16 x16 1 48: fe684ae3 blt x16 x6 -12 <proc> 4c: 006989b3 add x19 x19 x6 50: 00000813 addi x16 x0 0 54: 00278793 addi x15 x15 2 58: 00858593 addi x11 x11 8 0000005c <loop>: 5c: 0005a303 lw x6 0 x11 60: 0045a383 lw x7 4 x11 64: 0127d463 bge x15 x18 8 <print> 68: fc684ae3 blt x16 x6 -44 <proc> 0000006c <print>: 6c: 00072503 lw x10 0 x14 70: 00100893 addi x17 x0 1 74: 00000073 ecall 78: 00470713 addi x14 x14 4 7c: ffd718e3 bne x14 x29 -16 <print> 80: 10000517 auipc x10 0x10000 84: fe050513 addi x10 x10 -32 88: 00400893 addi x17 x0 4 8c: 00000073 ecall 90: 00000793 addi x15 x0 0 94: 001a0a13 addi x20 x20 1 98: 00460613 addi x12 x12 4 9c: 00468693 addi x13 x13 4 a0: f9ea40e3 blt x20 x30 -128 <main> a4: 0040006f jal x0 4 <exit> 000000a8 <exit>: a8: 00a00893 addi x17 x0 10 ac: 00000073 ecall ``` ::: ## Reference * [Computer Organization and Design 4/e](http://staff.ustc.edu.cn/~llxx/cod/reference_books/Computer%20Organization%20and%20Design%204th.pdf) * [Hazard (computer architecture)](https://en.wikipedia.org/wiki/Hazard_(computer_architecture))