Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [`Jack`](https://github.com/Chi-you) > ###### tags: `RISC-V`, `Computer Architure 2021` ## Majority Element ### Description This problem is based on [LeetCode 169](https://leetcode.com/problems/majority-element/). Given an array nums of size n, return the majority element. The majority element is the element that appears more than $\lfloor n / 2 \rfloor$ times. **You may assume that the majority element always exists in the array.** ### Solution I solve the problem by `Boyer-Moore Voting Algorithm`, which is an algorithm for finding the majority of a sequence of elements using linear time and constant space. The algorithm works on the fact that if an element occurs more than $n/2$ times, it means that the remaining elements other than this would definitely be less than $n/2$. First, I choose a candidate from the array. Then, I check that the other elements in array are the same as the candidate element. If it is the same, I increase the count. Otherwise, I decrease the count. If the count become 0, which means that there are the same number of different elements, it might not be the majority element. Therefore, I select another new element as the new candidate. The time complexity is $O(n)$ and space complexity is $O(1)$. ## C code ``` #include <stdio.h> int majorityElement(int num[], int n){ int cnt = 1; int major = num[0]; for(int i = 1; i < n; i++){ if(num[i] == major) cnt++; else if(cnt == 0){ major = num[i]; cnt++; } else cnt--; } return major; } int main(){ int A[] = {1, 2, 2, 5, 2, 2, 4, 1, 2, 2}; int ans = majorityElement(A, 10); printf("The majority element is %d\n", ans); return 0; } ``` ### Output ![](https://i.imgur.com/Hws9PBB.png) Compiled by gcc and executed in WSL2 (Ubuntu 20.04). ## Assembly code ``` .data arr: .word 1, 2, 2, 5, 2, 2, 4, 1, 2, 2 len: .word 10 str: .string "The majority element is " .text main: jal ra, majorityElement print: la a0, str li a7,4 # print string ecall mv a0, t4 li a7, 1 # print integer ecall exit: li a7, 10 # exit ecall majorityElement: la t0, len lw t0, 0(t0) # t0: len la t1, arr # t1: base_addr of arr li t2, 1 # cnt = 1 li t3, 1 # i = 1 lw t4, 0(t1) # major = num[0] addi t1, t1, 4 # base_addr += 4 (Because array start from index 1) for: blt t3, t0, loop # if i < len, branch to label "loop" mv a0, t4 jr ra # return main loop: lw t5, 0(t1) # t5 = num[i] beq t5, t4, if # if(num[i] == major) beq t2, zero, elseif # else if(cnt == 0) addi t2, t2, -1 # cnt-- increment: addi t1, t1, 4 # base_addr += 4 addi t3, t3, 1 # i++ j for if: addi t2, t2, 1 # cnt++ j increment elseif: mv t4, t5 # major = num[i]; addi t2, t2, 1 # cnt++ j increment ``` ### Code Explanation The assembly code is written by RISC-V RV32IM instruction. It includes two functions, `main` and `majorityElement` function. The `majorityElement` function is to find the majority element in an array. Here is the detail of each section in my assembly code | Section | Description | |:---------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | .data | The data storage, which stores the element of array, the length of array and the string to print out. | | .text | The code storage. | | main | The main function, it will call `majorityElement` function to find the majority element of the array and call `print` function to print out the result. | | print | Print out some message and the result of `majorityElement` function. | | exit | Exit the program. | | majorityElement | The major part of the program, which finds the majority element in the array. In this section, I initialize the variables and add 4 to the base address of array in order to access the second element. | | for | To check the condition of the for loop, if it is true, it will jump to `loop` and execute the statements in for. Otherwise, it saves the return value and returns back to the caller function(main). | | loop | The statements in for loop, including if-else statement. | | increment | It increase the base address by 4 of arr and the index i by 1. | | if | The if statement in c code. | | elseif | The elseif statement in c code. | ### Output ![](https://i.imgur.com/UYGb8BZ.png) Verify by Ripes. The output is the same as the c code. ## Analysis ### Pseudo Instruction For the programmer's benefit, it's useful to have additional instructions that aren't really implemented by the hardware, which is called pseudo instruction. With pseudo instructions, we can easily program and it will add clarity to the program. The examples below are the pseudo instructions I use in my program and their corresponding real instructions. | Pseudo instruction | Assembly code translated by Ripes | |:------------------ |:----------------------------------- | | li t2, 1 | addi x7 x0 1 | | mv a0, t4 | addi x10 x29 0 | | j for | jal x0 -36 <for> | | jr ra | jalr x0 x1 0 | | la t0, len | auipc x5 0x10000; addi x5 x5 0 | >The other pesudo instructions to the real instructions mapping can be found in [RISC-V Assembly Programmer’s Manual](https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md#a-listing-of-standard-risc-v-pseudoinstructions) #### UJ-format For `UJ-format` instruction, if it is just jumping without linking, then the rd will be set to `x0`, which means that the return address is not important. #### la (load address) For instruction `la rd, symbol`, it is to load the address of symbol into rd register. It will be translated into two instruction `auipc` and `addi`. The instruction `auipc rd, imm20` is the abbreviation of “add upper immediate to PC”. It will set the higher 20 bits of target address by shifting left imm 12 bits and adding it to PC. Then, addi instruction will set the lower 12 bits of target address by adding the result above to imm. For example, to get the address of `len` in my program, the instruction `auipc x5 0x10000` save $PC+0x10000000=0x10000028$ into `x5` register. Then, `addi x5 x5 0` save $0x10000028+0=10000000$ into `x5` register, which is the same as the address of len and it can be verified by the `memory viewer`. ![](https://i.imgur.com/PUN671k.png) >The memory map of `len` --- ### 5-stage pipeline processor In Ripes, there are several processor types, including single-cycle processor, 5-stage processor, 6-stage dual-issue processor. The 5-stage pipeline processor: | Stage | Description | |:-----:|:------------------------------------------------------------------------------------------------ | | `IF` | Fetch instruction from memory | | `ID` | Decode the instruction and get the value from the register | | `EX` | Execute arithmetic operation according to the decoded instruction, including address calculation | | `MEM` | Load/Store the value from/to the memory if needed | | `WB` | Write the result back to the destination register | The one I demostrate here is 5-stage processor with hazard detection and forwarding unit. This processor has the less instruction count, because the processor will bypass the value instead of stalling some cycles by NOP, when the hazard occurs. The architecture of the processor is shown below: ![](https://i.imgur.com/5SvBnYs.png) The box diagram between each stage is called pipeline register, which holds the information produced in previous cycle. ### Data dependency (Hazard) In the design of the standard 5-stage pipeline, it exists `RAW (Read after Write)` data hazard. That is, when an instruction needs to use the result of the previous instruction or the instruction before previous instruction, but the result is still in the pipeline and has not been written back into the Register File, the instruction will get the wrong data. In order to solve the problems, we need to bypass the data from EX/MEM register or MEM/WB register to EX stage, which can be done by forwarding unit. The forwarding datapaths in Ripes are shown below. The green line is the datapath of bypassing EX/MEM pipeine register to EX stage; the yellow one is the datapath of bypassing MEM/WB to EX stage. ![](https://i.imgur.com/90JtVpf.jpg) ### Load-Use Data Hazard Most of the data hazard problems can be solved by forwarding unit. However, we cannot avoid `load-use data hazard`. That is, when the instruction (e.g. lw) needs the data from memory, we cannot get the correct data just through forwarding; instead, we need to stall the pipeline for a cycle. The following code is an example for `load-use data hazard` and its corrosponding pipeline diagram are shown below. ![](https://i.imgur.com/dt5igtu.png) We can see that the first `beq` instruction try to compare with the value from `x30` register, which is the rd register of `lw` instruction, while the `lw` instruction is not completed. Therefore, it exists `load-use data hazard`. We can solve the problem by stalling a cycle; that is, the following two instruction `beq x20 x29 24` and `beq x7 x0 28` will stall in their current stage (ID and IF), but the instruction `lw x30 0 x6` will move on to MEM stage. Therefore, it can bypass the value from MEM/WB register to EX stage just like the picture below. ![](https://i.imgur.com/GfcUEku.jpg) **The detailed process of solving load-use data hazard** To stall the following two instruction, Ripes will * disable the PC * disable the IF/ID pipeline register * insert an NOP by clearing the ID/EX pipeline register ![](https://i.imgur.com/y4m0467.jpg) The stalling cycle. ![](https://i.imgur.com/fig5uUI.png) After stalling a cycle, it bypasses the value from MEM/WB register to EX stage. Then, it can compare the value by branch unit. ![](https://i.imgur.com/tWx2vVo.jpg) ### Control Hazard Control hazard occurs when the pipeline makes wrong decisions on branch prediction and therefore brings instructions into the pipeline that must subsequently be discarded. To solve the problem, we can predict branch always not taken and flush the instruction if it is wrong. In Ripes, the `branch unit` is in EX stage and the jump address for `UJ-format` is calculated in ALU, which is also in EX stage. Therefore, if the prediction is wrong, Ripes need to flush the first two stage, i.e. * clear the IF/ID pipeline register * clear the ID/EX pipeline register Here I demonstrate the process of solving control hazard with instruction `jal x1 40`. As you see, the output of ALU is the new target address that will be jump into and the clear signal in IF/ID and ID/EX pipeline register is high. ![](https://i.imgur.com/U5W3Mcg.jpg) Therefore, the `PC`(blue box), `PC+4`(purple box) which stores in pipeline register is set to `0x00000000`. Also, the instruction(red box) is clear to zero, so it is decoded into NOP. ![](https://i.imgur.com/X0FTz8Q.jpg) Let's look at the pipeline diagram of `jal` instruction. As `jal` instruction is in MEM stage, the following two instruction will be flushed and the new instruction which is at the label `majoityElement` will be fetched. ![](https://i.imgur.com/sIRWu4E.png) --- ### The detail of `lw x29 0 x6` instruction in each stage #### IF stage In IF stage, the PC will be sent into instruction memory to fetch the instruction from instruction memory. Here, the `PC=0x00000044`. ![](https://i.imgur.com/0TGfsFX.png =40%x) For the cache in ripes, it adopts the `Harvard Architecture`, which seperates the instruction cache and data cache. Also, by default, it is a `32-entry 4-word direct-mapped cache`, which is shown below. In order to get the instruction, it will first check whether the instruction is in the cache or not. If cache miss, then it will load the instruction from memory. In the picture below, the cache hits, which is because the instruction has been loaded into the cache advancedly since the previous instruction. ![](https://i.imgur.com/0FhUo1d.png) #### ID stage In ID stage, the `lw x29 0 x6` instruction will be decoded. As you see the picture below, the opcode is LW, the immediate is 0x00000000, R1 index is 0x06 and the write index is 0x1d, which will be sent into the pipeline register to ensure that when the instruction is in WB stage, it can get the correct index of write back register. ![](https://i.imgur.com/xEulS7b.png =50%x) #### EX stage In EX stage, it will add the data in R1 and the immediate, so the output of ALU is 0x10000000, which it the address for data memory. ![](https://i.imgur.com/FPfESMq.png =50%x) #### MEM stage In MEM stage, it use the address that we calculated (0x10000000) at the previous stage to load the data in `Data memory`. Also, we can note that the write enable signal is low because the instruction is to load the data from memory not to write the data into memory. ![](https://i.imgur.com/p6Ae64S.png =40%x) Here we can see the data of each address by looking at the `memory viewer`. The data that stores in address 0x10000000 is 1. ![](https://i.imgur.com/TEkSCSj.png) #### WB stage In WB stage, it will write the result (0x00000001) back to the destination register, which is `0x1d (x29)` in this example. ![](https://i.imgur.com/YJPwL06.png) Finally, we can check the value in `x29` register, which is the same as we want. ![](https://i.imgur.com/tyOgtE8.png) ## Reference * [Leetcode 169 Majority Element](https://leetcode.com/problems/majority-element/) * [Boyer–Moore majority vote algorithm (Wikipedia)](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_majority_vote_algorithm) * [Boyer–Moore majority vote algorithm (GeeksforGeeks)](https://www.geeksforgeeks.org/boyer-moore-majority-voting-algorithm/) * [Environment Calls for Ripes](https://github.com/mortbopet/Ripes/wiki/Environment-calls) ## Appendix ### Assembly code with real instruction ``` 00000000 <main>: 0: 028000ef jal x1 40 <majorityElement> 4: 10000517 auipc x10 0x10000 8: 02850513 addi x10 x10 40 c: 00400893 addi x17 x0 4 10: 00000073 ecall 14: 000e8513 addi x10 x29 0 18: 00100893 addi x17 x0 1 1c: 00000073 ecall 20: 00a00893 addi x17 x0 10 24: 00000073 ecall 00000028 <majorityElement>: 28: 10000297 auipc x5 0x10000 2c: 00028293 addi x5 x5 0 30: 0002a283 lw x5 0 x5 34: 10000317 auipc x6 0x10000 38: fcc30313 addi x6 x6 -52 3c: 00100393 addi x7 x0 1 40: 00100e13 addi x28 x0 1 44: 00032e83 lw x29 0 x6 48: 00430313 addi x6 x6 4 0000004c <for>: 4c: 005e4663 blt x28 x5 12 <loop> 50: 000e8513 addi x10 x29 0 54: 00008067 jalr x0 x1 0 00000058 <loop>: 58: 00032f03 lw x30 0 x6 5c: 01df0c63 beq x30 x29 24 <if> 60: 00038e63 beq x7 x0 28 <elseif> 64: fff38393 addi x7 x7 -1 00000068 <increment>: 68: 00430313 addi x6 x6 4 6c: 001e0e13 addi x28 x28 1 70: fddff06f jal x0 -36 <for> 00000074 <if>: 74: 00138393 addi x7 x7 1 78: ff1ff06f jal x0 -16 <increment> 0000007c <elseif>: 7c: 000f0e93 addi x29 x30 0 80: 00138393 addi x7 x7 1 84: fe5ff06f jal x0 -28 <increment> ```