Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [`tobychui`](https://github.com/tobychui) > ## Shuffle the Array This is one of the "Easy" questions on LeetCode as requested in the assignment. The question is as follows. >Given the array nums consisting of 2n elements in the form [x1,x2,...,xn,y1,y2,...,yn]. >Return the array in the form [x1,y1,x2,y2,...,xn,yn]. ## Implementaion You can find the source code [here](https://gist.github.com/tobychui/032db06f31c7b72be1df5b30ab8d937c). ### C code The question given the following structure for implementing the array shuffling function: ```c /** * Note: The returned array must be malloced, assume caller calls free(). */ int *shuffle(int *nums, int numsSize, int n, int* returnSize) { ... } ``` The following extracted the core logic of the shuffle program. For a full program including sample data and pretty printing logics, see the source code attached in the link above. ```c #include <stdio.h> #include <stdlib.h> // Sample data defination here... int* shuffle(int* nums, int numsSize, int n, int* returnSize){ if (numsSize <= 2){ return nums; } int counter = 0; for (int i = 0; i < n; i++){ returnSize[counter] = nums[i]; counter++; returnSize[counter] = nums[i+n]; counter++; } return returnSize; } // Pretty printing logic here... ``` ### Assembly code #### Main Logic In assembly code, the procedure is basically the same as what we have done in the C version of the program. The following shows the assembly code for the shuffleLoop logic which is equivalent to the for loop in the C program above. :::warning Can you avoid the use of `mul` instruction? :notes: jserv ::: ```c ######################## Shuffle Loop ############################# shuffleLoop: # Shuffle the s0 into s1 main loop, offset for outputNums will be stored in t1 bge t0, s3, printResult #If t0 (i) >= s2(number of item in inputNums), print results # result[counter] = nums[i]; lw t3, 0(s0) sw t3, 0(s1) addi s1, s1, 4 # counter++ addi t1, t1, 4 # Array shifting accumlation counter # Calculate pointer offset for inputNums, put offset in t2 li t3, 4 mul t2, s3, t3 # result[counter] = nums[i+n]; add s0, s0, t2 lw t3, 0(s0) sw t3, 0(s1) # Reset pointer offsets sub s0, s0, t2 addi s0, s0, 4 addi s1, s1, 4 # counter++ addi t1, t1, 4 # Array shifting accumlation counter addi t0, t0, 1 # Add 1 to i j shuffleLoop ``` ### Updates Nov 1 2021 As requested by **:notes: jserv** for not using mul in that section, slli can be used to replace these two lines: ``` # Calculate pointer offset for inputNums, put offset in t2 li t3, 4 mul t2, s3, t3 ``` to this: ``` # Calculate pointer offset for inputNums, put offset in t2 slli t2, s3, 2 ``` Shift operator is far cheaper (performance / computational wise) than multiplexor. Hence, it is sensible to use shift operation than mulitplication whenever possible. **END OF UPDATES** The section above was jump from the main logic where it also contains an edge case detection if-else statement which detect if the array only contains 2 numbers. ```c # Check if array length <= 2. If yes, clone array to results and print it li t2, 3 blt s2, t2, cloneAndReturn j shuffleLoop # Optional, but exists for clean code purpose ``` For the "cloneAndReturn" section, it simply copy the first two numbers from s0 (The inputNums array) to the s1 (The outputNums array) and continue the logic by jumping to the printing section. I tried to copy the pointer of s0 to s1 by using the ```mv``` instruction, but not sure why it didn't work as expected. Thus, this copy method was used. ```c ####################### Edge Case Handling ############################## cloneAndReturn: # Handle cases where length of array <= 2, just copy the [0] and [1] to the new array and print it mv s1, s0 # Copy first item in array addi s0, s0, 4 # Move pointer up by 1 word addi s1, s1, 4 mv s1, s0 # Copy the 2nd item in array addi s1, s1, -4 # Restore the pointer of the result array j printResult ``` #### Result Outputs In the assembly code, the printf in C program was replaced by a series of secitons of assembly code that perform different type of print-outs to the terminal. The following two labels are mostly used when pretty-printing the array. ```c printInt: li a7, 1 ecall jr ra printString: li a7, 4 ecall jr ra ``` The value of register ```a7``` before ```ecall``` is not defined on any offical documentation of the standard of RISC-V but only in Ripes' [Github repo](https://github.com/mortbopet/Ripes/wiki/Environment-calls). This might implies that these specific ecall values are only specific to this emulator but not an industry standard for RISC-V assembly code. The list below shows different type of possible ecalls for the printing / outputs. ![](https://i.imgur.com/FYvysr7.png) In here, as this program will only involve integer and string datatype, ecall with ```a7=1``` and ```a7=4``` is mostly used. ## Analysis I tested the code using [Ripes](https://github.com/mortbopet/Ripes) simulator (because it is part of the requirement) and it did run. ### Pseudo instruction Ripes converted the sudo instruction into sequential instructions supported instruction by the RISC-V processor (or emulator). Here are the final executable assembly code for the shuffleLoop logic. ``` 00000058 <shuffleLoop>: 58: 0d32d063 bge x5 x19 192 <printResult> 5c: 00042e03 lw x28 0 x8 60: 01c4a023 sw x28 0 x9 64: 00448493 addi x9 x9 4 68: 00430313 addi x6 x6 4 6c: 00400e13 addi x28 x0 4 70: 03c983b3 mul x7 x19 x28 74: 00740433 add x8 x8 x7 78: 00042e03 lw x28 0 x8 7c: 01c4a023 sw x28 0 x9 80: 40740433 sub x8 x8 x7 84: 00440413 addi x8 x8 4 88: 00448493 addi x9 x9 4 8c: 00430313 addi x6 x6 4 90: 00128293 addi x5 x5 1 94: fc5ff06f jal x0 -60 <shuffleLoop> ``` From the output above, it can be observed that some instriuction like ```mv``` are replaced by ```addi``` instructions. The jal instruction's destination label is also replaced with a numeric offset value (e.g. on line 94: -60 is injected into the instruction code). For the full version of the sequential instructions, see the appendix at the end of this document. ### 5-stage pipelined processor This emulated RISC-V processor (pretty much the same as MIPS processor), uses a 5 stage pipeline to fetch and execute an instruction loaded from memory. 1. Instruction fetch (IF) 2. Instruction decode and register fetch (ID) 3. Execute (EX) 4. Memory access (MEM) 5. Register write back (WB) ### Pipeline Process ![](https://i.imgur.com/Whk6xDq.png) The diagram above show the first few instruction executed by my program. The first one is set x8 register to 0. #### IF - Instruction fetch The instruction is feched into the IFID and the Program Counter has added 4 in this step. Before: ![](https://i.imgur.com/cAh0jJ7.png) After: ![](https://i.imgur.com/jSYSLwt.png) The program counter become 0x8 as there are no jump operations. #### ID - Instruction decode and register fetch In this step, the opcode has been decoded from the IFID and optained the ADDI instruction which is labeled on the output od the decoder. ![](https://i.imgur.com/RJuqGmp.png) This will be the instruction that will be executed in the next step. You can also noted the 0x08 value has been decoded which means the ADDI destination wil be the x8 register. #### EX - Execute In this step, the instructin in the previous stage will be executed. You can see from the emulator, the ALU receiving all 0x0 inputs (as the target operations is to set x8 to zero). ![](https://i.imgur.com/hmTN1ZH.png) The 0x08 is being passed into the EX/MEME section of the CPU where it will be executed and the result will be temporarily saved. #### MEM - Memory access In this step, the result of the calculation will be buffered to the Data Memory. ![](https://i.imgur.com/HwAfsGa.png) The Data In is 0x0 here and the same time, the address of the target register (x8, bottom line in yellow) is being passed to the MEM/WB block. This means the CPU is ready to write the new data (0x0) back to the cache memory. ![](https://i.imgur.com/HV8tELa.png) #### WB - Register write back In this operation, the data (0x0) and the register address (0x08) is being pushed out of the MEM/WB block as shown in the diagram below. ![](https://i.imgur.com/SaGkphi.png) The circuit connects the output of the MEM/WB block to the Registers where the data and the register has been passed to the Register block and write it back to the Registers on the cache. The following two diagrams shows the register address and data bus highlighted in yellow in operations. ![](https://i.imgur.com/z6N0hCL.png) Caption 1: Register address passed back into the register block from the yellow bus above ![](https://i.imgur.com/dXXWrdc.png) Caption 2: The value to write back to register is passed in the yellow bus above ### Memory Updates #### Memory updates during register initialization (set to zero / imm value) After the operation performed in the steps above, the memory x8 has been reset to zero as shown in the diagram below. ![](https://i.imgur.com/xv9Y1ra.png) From the Instruction memory list, it can be observed that the next few instructions has been loaded into the pipeline as well. For example, the ```addi x5 x0 5``` (i.e. ```li x5 5```) is currently in the IF section of the pipeline. ![](https://i.imgur.com/rM0fPXP.png) #### Memory updates during constructing array During the array construction section of the code, the t0 (x5) which is used as a temp variable to buffer integer into the s0 array, is changed according to the assembly code. ![](https://i.imgur.com/wEkVLHQ.png) #### Memory updates during array shuffle loop During the array shuffle loop, the logic in assembly code check if the array has been finished the iteration and jump to the start of the loop section if there are still integers to be iterated. ![](https://i.imgur.com/3tmrvUF.png) The jump and link instruction stored the address of the callee to the ra register as shown in the screenshot above. #### Memory updates during printing results to terminal During the result printing section of the code, the value that will be printed will be laoded into a0 and the datatype of the printing data will be loaded into a7. ![](https://i.imgur.com/neamTqr.png) The above screenshot shows the moment just before a printing to terminal ecall occurs where a0 is set to the value to be printed (0x5) and the datatype is set to a7 as integer (0x1) ### Control Hazards - Pipeline Performance Drop on JAL instruction From the experiment above, there was another observation regarding the special behaviour of JAL instruction. For example, this is one of the pipeline snapshot just before a JAL instruction is executed. ![](https://i.imgur.com/KxI4D8y.png) This is a snapshot just after the JAL instruction is executed ![](https://i.imgur.com/wY9wH06.png) By comparing two snapshots, I noticed that the IF and ID pipeline actually bypassed the JAL command and continues to fetch and decode the next instruction after the first JAL instruction. When the jal instruction is executed, the fetched and decoded instruction from the pipeline is discarded and reloaded from the endpoint starting from the jal target. This explained the reason for the clockspeed of the emulated RISC-V processor slow down significantly during the "Printing" section of the code where a lot of JAL is required to print the string onto the terminal due to JAL instruction overhead. ![](https://i.imgur.com/sW7VBzn.png) Caption: The clockspeed drops to mHz range when performing consecutive JAL commands during the ```printOriginal``` section of the assembly code This might concluded that abstraction (using functions) in assembly code will not gain performance boost. ## Reference * [RISC-V Instruction Set Manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) * [Environmental Calls](https://github.com/mortbopet/Ripes/wiki/Environment-calls) # Appendix ## Full version of the RIPES seq. instruction ```c 00000000 <main>: 0: 00000413 addi x8 x0 0 4: 01840493 addi x9 x8 24 8: 00200293 addi x5 x0 2 c: 00542023 sw x5 0 x8 10: 00500293 addi x5 x0 5 14: 00542223 sw x5 4 x8 18: 00100293 addi x5 x0 1 1c: 00542423 sw x5 8 x8 20: 00300293 addi x5 x0 3 24: 00542623 sw x5 12 x8 28: 00400293 addi x5 x0 4 2c: 00542823 sw x5 16 x8 30: 00700293 addi x5 x0 7 34: 00542a23 sw x5 20 x8 38: 00600913 addi x18 x0 6 3c: 00300993 addi x19 x0 3 40: 00000293 addi x5 x0 0 44: 00000313 addi x6 x0 0 48: 00300393 addi x7 x0 3 4c: 04794663 blt x18 x7 76 <cloneAndReturn> 50: 06000fef jal x31 96 <printOriginal> 54: 0040006f jal x0 4 <shuffleLoop> 00000058 <shuffleLoop>: 58: 0d32d063 bge x5 x19 192 <printResult> 5c: 00042e03 lw x28 0 x8 60: 01c4a023 sw x28 0 x9 64: 00448493 addi x9 x9 4 68: 00430313 addi x6 x6 4 6c: 00400e13 addi x28 x0 4 70: 03c983b3 mul x7 x19 x28 74: 00740433 add x8 x8 x7 78: 00042e03 lw x28 0 x8 7c: 01c4a023 sw x28 0 x9 80: 40740433 sub x8 x8 x7 84: 00440413 addi x8 x8 4 88: 00448493 addi x9 x9 4 8c: 00430313 addi x6 x6 4 90: 00128293 addi x5 x5 1 94: fc5ff06f jal x0 -60 <shuffleLoop> 00000098 <cloneAndReturn>: 98: 00040493 addi x9 x8 0 9c: 00440413 addi x8 x8 4 a0: 00448493 addi x9 x9 4 a4: 00040493 addi x9 x8 0 a8: ffc48493 addi x9 x9 -4 ac: 06c0006f jal x0 108 <printResult> 000000b0 <printOriginal>: b0: 00050f13 addi x30 x10 0 b4: 10000517 auipc x10 0x10000 b8: f6150513 addi x10 x10 -159 bc: 0c8000ef jal x1 200 <printString> c0: 00042503 lw x10 0 x8 c4: 0b4000ef jal x1 180 <printInt> c8: 0c8000ef jal x1 200 <printSepeartor> cc: 00442503 lw x10 4 x8 d0: 0a8000ef jal x1 168 <printInt> d4: 0bc000ef jal x1 188 <printSepeartor> d8: 00842503 lw x10 8 x8 dc: 09c000ef jal x1 156 <printInt> e0: 0b0000ef jal x1 176 <printSepeartor> e4: 00c42503 lw x10 12 x8 e8: 090000ef jal x1 144 <printInt> ec: 0a4000ef jal x1 164 <printSepeartor> f0: 01042503 lw x10 16 x8 f4: 084000ef jal x1 132 <printInt> f8: 098000ef jal x1 152 <printSepeartor> fc: 01442503 lw x10 20 x8 100: 078000ef jal x1 120 <printInt> 104: 10000517 auipc x10 0x10000 108: f2450513 addi x10 x10 -220 10c: 078000ef jal x1 120 <printString> 110: 000f0513 addi x10 x30 0 114: 000f8067 jalr x0 x31 0 00000118 <printResult>: 118: 406484b3 sub x9 x9 x6 11c: 10000517 auipc x10 0x10000 120: ee450513 addi x10 x10 -284 124: 060000ef jal x1 96 <printString> 128: 0004a503 lw x10 0 x9 12c: 04c000ef jal x1 76 <printInt> 130: 060000ef jal x1 96 <printSepeartor> 134: 0044a503 lw x10 4 x9 138: 040000ef jal x1 64 <printInt> 13c: 00300393 addi x7 x0 3 140: 06794263 blt x18 x7 100 <exit> 144: 04c000ef jal x1 76 <printSepeartor> 148: 0084a503 lw x10 8 x9 14c: 02c000ef jal x1 44 <printInt> 150: 040000ef jal x1 64 <printSepeartor> 154: 00c4a503 lw x10 12 x9 158: 020000ef jal x1 32 <printInt> 15c: 034000ef jal x1 52 <printSepeartor> 160: 0104a503 lw x10 16 x9 164: 014000ef jal x1 20 <printInt> 168: 028000ef jal x1 40 <printSepeartor> 16c: 0144a503 lw x10 20 x9 170: 008000ef jal x1 8 <printInt> 174: 0300006f jal x0 48 <exit> 00000178 <printInt>: 178: 00100893 addi x17 x0 1 17c: 00000073 ecall 180: 00008067 jalr x0 x1 0 00000184 <printString>: 184: 00400893 addi x17 x0 4 188: 00000073 ecall 18c: 00008067 jalr x0 x1 0 00000190 <printSepeartor>: 190: 10000517 auipc x10 0x10000 194: e8350513 addi x10 x10 -381 198: 00400893 addi x17 x0 4 19c: 00000073 ecall 1a0: 00008067 jalr x0 x1 0 000001a4 <exit>: 1a4: 00000513 addi x10 x0 0 1a8: 05d00893 addi x17 x0 93 1ac: 00000073 ecall ```