# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [HenryChiang](https://github.com/HenryChaing/Computer_Arch_112) >

###### tags: `RISC-V` `Computer Architecture` `CLZ`

## 1. Introduction to Problem
### 1.1. Normalizing an integer
The problem I have chosen is an extension of Problem A, "Counting leading zeros." It normalizes an integer into a binary floating-point representation, expressed in the standard "1.XX*2^n" form. For example, 6 (binary 110) normalizes to 1.1 * 2^2.

### 1.2. Prepare solution
First, use CLZ to determine the shift amount N, which is exactly the number of leading zeros plus one, so that the leading 1 is shifted out together with the leading zeros. Then use the `sll` instruction to perform the shift, and finally print the result to the console using `ecall` (system call).

## 2. Assembly code
### 2.1. Code section (with comments)
```c=
.data
# test array
arr:  .word 6,9,25
# strings to output to the console
str1: .string "The normalization for number "
str2: .string " is 1."
str3: .string "\n"
str4: .string " * 2^"
str5: .string "please ignore 0b\n"

.text
# main code
main:
    la a0, str5              # print str5
    li a7, 4
    ecall

    addi t5, zero, 3         # t5 = number of test elements
    la t6, arr               # t6 = address of the current element
loop:
    # load element into a0, run CLZ
    lw a0, 0(t6)
    jal ra, count_leading_zeros

    # print the result to the console
    mv a1, a0
    lw a0, 0(t6)
    jal ra, printResult

    # test next element
    addi t5, t5, -1
    addi t6, t6, 4
    bne t5, zero, loop

    # exit program
    li a7, 10
    ecall

# Problem A assembly code
count_leading_zeros:
    # smear the highest set bit into every lower position
    srai t0, a0, 1
    or a0, a0, t0
    srai t0, a0, 2
    or a0, a0, t0
    srai t0, a0, 4
    or a0, a0, t0
    srai t0, a0, 8
    or a0, a0, t0
    srai t0, a0, 16
    or a0, a0, t0
    # repeated shift by 16: has no further effect on 32-bit registers
    srai t0, a0, 16
    srai t0, a0, 16
    or a0, a0, t0

    # population count of the smeared value
    # (andi takes a 12-bit immediate, so the masks only cover the low bits;
    #  this is enough for the small test values)
    srai t0, a0, 1
    andi t0, t0, 0x555
    sub a0, a0, t0
    srai t0, a0, 2
    andi t0, t0, 0x333
    andi t1, a0, 0x333
    add a0, t0, t1
    srai t0, a0, 4
    add t0, a0, t0
    andi a0, t0, 0xf0f
    srai t0, a0, 8
    add a0, a0, t0
    srai t0, a0, 16
    add a0, a0, t0
    srai t0, a0, 16
    srai t0, t0, 16
    add a0, a0, t0

    # t1 = 64 is loaded but never subtracted, so a0 returns the bit count,
    # i.e. the number of significant bits of the input
    andi t0, a0, 0x7f
    addi t1, zero, 64
    sub a0, t0, zero
    jr ra

# --- printResult ---
# a0: Original number
# a1: value returned by count_leading_zeros
printResult:
    mv t0, a0
    mv t1, a1
    lw t3, 0(t6)             # reload the current element

    la a0, str1
    li a7, 4
    ecall
    mv a0, t0                # print the original number
    li a7, 1
    ecall
    la a0, str2
    li a7, 4
    ecall

    addi t2, zero, 33        # shift amount N = 33 - a1
    sub t2, t2, t1
    sll t3, t3, t2           # shift out the leading 1, keep the fraction bits
    mv a0, t3
    li a7, 35                # print the fraction in binary
    ecall
    la a0, str4
    li a7, 4
    ecall
    addi t1, t1, -1          # exponent = a1 - 1
    mv a0, t1
    li a7, 1
    ecall
    la a0, str3
    li a7, 4
    ecall
    ret
```

## 3. 5-stage Pipeline Analysis
### 3.1. The 5-stage graph
![](https://hackmd.io/_uploads/r1RNI0jlT.jpg)
![](https://hackmd.io/_uploads/ByjB3Csxp.jpg)

This is the program segment that the pipeline is currently executing. It is the main loop of the program, in which each element is analyzed and printed. We are about to enter the next iteration of the loop: the EX stage has just determined that the branch will be taken, so we examine the state of each stage at this moment.

### 3.2. IF stage
![](https://hackmd.io/_uploads/SyLGoRjg6.jpg)

This stage is explained in two parts. The first is the instruction fetch itself. The PC holds 0x44, the location in instruction memory where the `ecall` instruction is stored (see the code in section 2.1). Reading instruction memory at that address returns 0x73, the machine code of `ecall`, which then waits to enter the ID stage for decoding.

The second part is the PC update. Since the EX stage has already decided to branch to the `loop` label, the next PC value is the address of `loop` (0x1C). The PC+4 value produced by the adder (0x48) is not written to the PC.

### 3.3. ID stage
![](https://hackmd.io/_uploads/ryRyH1neT.jpg)

The ID stage performs decoding and register fetching. First, the instruction 0xa00893 is decoded as `addi x17, x0, 10`. Next, the source register value and the immediate constant need to be extracted.
The EX stage can only compute once the register operand and the immediate have been read. The register file returns 0x00 for rs1 (x0), and the immediate value (imm) is 0x0a. The second read port also outputs a value (reported as 0x100000031, probably 0x10000031 since registers are 32 bits wide), but `addi` does not use it; x17 is the destination register and is written later in the WB stage rather than read here.

### 3.4. EX stage
![](https://hackmd.io/_uploads/Hkixu1nga.jpg)

The EX stage can be explained in two parts. The first is the ALU (Arithmetic Logic Unit). Since the instruction in the EX stage is `bne`, the ALU computes the address to which the branch should go. Here op1 holds the address of the `bne` instruction, while op2 holds the relative distance between the `loop` label and the `bne` instruction. The result, 0x1C, is the address of the `loop` label in instruction memory.

![](https://hackmd.io/_uploads/ByXq1gnxp.jpg)

The second part is the branch unit. The branch unit compares rs1 and rs2. For `bne`, the branch-taken signal is set to 1 when the two values differ and to 0 when they are equal. This signal determines whether the branch is taken.

### 3.5. MEM stage
![](https://hackmd.io/_uploads/HyVWelnlp.jpg)

Next is the MEM stage. The instruction currently in this stage is neither `lw` nor `sw`, so no read or write is performed on the data memory. With the write enable set to 0, the data memory performs no write, and the value it reads out is not selected by the multiplexer, so no incorrect data is passed on either.

### 3.6. WB stage
![](https://hackmd.io/_uploads/HJZnfx3ea.jpg)

Finally, the WB (Write Back) stage returns the value obtained from data memory or from the ALU to the register file. The instruction currently in this stage is `addi t5, t5, -1`, so the result of adding -1 to x30 (t5), computed earlier in the EX stage, is written back to x30. The value written back is 0x02 (t5 changes from 0x03 to 0x02).
![](https://hackmd.io/_uploads/rJwQNenx6.jpg)

Note that nothing read from data memory is written back in this cycle.

## 4. Optimize the code
### 4.1. Before optimization
![](https://hackmd.io/_uploads/Bk65cW3xT.jpg)
![](https://hackmd.io/_uploads/B1Cscb2x6.jpg)

### 4.2. After optimization
![](https://hackmd.io/_uploads/Bkvpq-2gT.jpg)
![](https://hackmd.io/_uploads/ryap5bhlp.jpg)

### 4.3. Conclusion
Originally, each loop iteration executed two branch instructions; the loop was restructured so that one iteration needs only one. This reduces the control hazards caused by branches. Even with only three iterations, the change saves 6 cycles, which shortens execution time and lowers the CPI (cycles per instruction).