Assignment3: Soft CPU

--- ###### tags: `Computer Architecture 2022` --- # Assignment3: Soft CPU ## Environment First, update the environment, this `$PATH` need previously install the [GNU Toolchain for RISC-V](https://xpack.github.io/riscv-none-elf-gcc/). ```c=1 cd $HOME source riscv-none-elf-gcc/setenv ```  ## [srv32](https://github.com/sysprog21/srv32)^MIT^: Simple 3-stage pipeline RISC-V processor ![](https://i.imgur.com/jIo3h96.png) ## Derived from Assignment2 ### Modify Makefile #### C code ```c=1 #include <stdio.h> void twoSum(int* nums,int numSize, int target){ int i,j; for(i=0;i<numSize;i++){ for(j=i+1;j<numSize;j++){ if(nums[i]+nums[j]==target) printf("%d %d\n",i,j); } } } int main(){ int nums1[4] = {2,7,11,15}, target1 = 9, size1=4; int nums2[3] = {3,2,4}, target2 = 6, size2=3; int nums3[2] = {3,3}, target3 = 6, size3=2; twoSum(nums1,size1,target1); twoSum(nums2,size2,target2); twoSum(nums3,size3,target3); return 0; } ``` #### Edit the Makefile ``` include ../common/Makefile.common EXE = .elf SRC = twosum_.c CFLAGS += -L../common LDFLAGS += -T ../common/default.ld TARGET = twosum_ OUTPUT = $(TARGET)$(EXE) .PHONY: all clean all: $(TARGET) $(TARGET): $(SRC) $(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) $(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin $(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin $(OBJCOPY) -O binary $(OUTPUT) memory.bin $(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis $(READELF) -a $(OUTPUT) > $(TARGET).symbol clean: $(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin ``` ### Run code ``` $ cd verilator/srv32/ $make twosum_ ``` ### RTL sim result ``` make[2]: Leaving directory '/home/lab41206/verilator/srv32/sw/twosum_' make[1]: Leaving directory '/home/lab41206/verilator/srv32/sw' make[1]: Entering directory '/home/lab41206/verilator/srv32/sim' 0 1 1 2 0 1 Excuting 5410 instructions, 7258 cycles, 1.341 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.217 s Simulation cycles: 7269 Simulation speed : 0.0334977 MHz ``` ### ISS sim result ``` make[1]: Leaving directory '/home/lab41206/verilator/srv32/sim' make[1]: Entering directory '/home/lab41206/verilator/srv32/tools' ./rvsim --memsize 128 -l trace.log ../sw/twosum_/twosum_.elf 0 1 1 2 0 1 Excuting 5410 instructions, 7258 cycles, 1.342 CPI Program terminate Simulation statistics ===================== Simulation time : 0.009 s Simulation cycles: 7258 Simulation speed : 0.784 MHz make[1]: Leaving directory '/home/lab41206/verilator/srv32/tools' Compare the trace between RTL and ISS simulato ``` ### Assembly Code Generated By `srv32` ``` 0000003c <twoSum>: 3c: 0cb05063 blez a1,fc <twoSum+0xc0> 40: fd010113 addi sp,sp,-48 44: 02812423 sw s0,40(sp) 48: 03212023 sw s2,32(sp) 4c: 01312e23 sw s3,28(sp) 50: 01412c23 sw s4,24(sp) 54: 01712623 sw s7,12(sp) 58: 02112623 sw ra,44(sp) 5c: 02912223 sw s1,36(sp) 60: 01512a23 sw s5,20(sp) 64: 01612823 sw s6,16(sp) 68: 00100413 li s0,1 6c: 00058993 mv s3,a1 70: 00060a13 mv s4,a2 74: 00450913 addi s2,a0,4 78: 00020bb7 lui s7,0x20 7c: 04858a63 beq a1,s0,d0 <twoSum+0x94> 80: 00040b13 mv s6,s0 84: fff40a93 addi s5,s0,-1 88: 00090493 mv s1,s2 8c: 00c0006f j 98 <twoSum+0x5c> 90: 00140413 addi s0,s0,1 94: 03345863 bge s0,s3,c4 <twoSum+0x88> 98: 0004a703 lw a4,0(s1) 9c: ffc92783 lw a5,-4(s2) a0: 00448493 addi s1,s1,4 a4: 00e787b3 add a5,a5,a4 a8: ff4794e3 bne a5,s4,90 <twoSum+0x54> ac: 00040613 mv a2,s0 b0: 000a8593 mv a1,s5 b4: 090b8513 addi a0,s7,144 # 20090 <environ+0x4> b8: 00140413 addi s0,s0,1 bc: 114000ef jal ra,1d0 <printf> c0: fd344ce3 blt s0,s3,98 <twoSum+0x5c> c4: 001b0413 addi s0,s6,1 c8: 00490913 addi s2,s2,4 cc: fa899ae3 bne s3,s0,80 <twoSum+0x44> d0: 02c12083 lw ra,44(sp) d4: 02812403 lw s0,40(sp) d8: 02412483 lw s1,36(sp) dc: 02012903 lw s2,32(sp) e0: 01c12983 lw s3,28(sp) e4: 01812a03 lw s4,24(sp) e8: 01412a83 lw s5,20(sp) ec: 01012b03 lw s6,16(sp) f0: 00c12b83 lw s7,12(sp) f4: 03010113 addi sp,sp,48 f8: 00008067 ret fc: 00008067 ret 00000100 <main>: 100: 000207b7 lui a5,0x20 104: 09878793 addi a5,a5,152 # 20098 <environ+0xc> 108: 00c7a703 lw a4,12(a5) 10c: fc010113 addi sp,sp,-64 110: 0007a883 lw a7,0(a5) 114: 0047a803 lw a6,4(a5) 118: 0087a683 lw a3,8(a5) 11c: 02e12623 sw a4,44(sp) 120: 00200713 li a4,2 124: 00300793 li a5,3 128: 02010513 addi a0,sp,32 12c: 00e12c23 sw a4,24(sp) 130: 00900613 li a2,9 134: 00400713 li a4,4 138: 00400593 li a1,4 13c: 02112e23 sw ra,60(sp) 140: 03112023 sw a7,32(sp) 144: 03012223 sw a6,36(sp) 148: 02d12423 sw a3,40(sp) 14c: 00f12a23 sw a5,20(sp) 150: 00e12e23 sw a4,28(sp) 154: 00f12623 sw a5,12(sp) 158: 00f12823 sw a5,16(sp) 15c: ee1ff0ef jal ra,3c <twoSum> 160: 01410513 addi a0,sp,20 164: 00600613 li a2,6 168: 00300593 li a1,3 16c: ed1ff0ef jal ra,3c <twoSum> 170: 00c10513 addi a0,sp,12 174: 00600613 li a2,6 178: 00200593 li a1,2 17c: ec1ff0ef jal ra,3c <twoSum> 180: 03c12083 lw ra,60(sp) 184: 00000513 li a0,0 188: 04010113 addi sp,sp,64 18c: 00008067 ret ``` ### Optimized for assignment2 Trying to use handwritten assembly code to decrease the number of cycle. ```asm= .data # first test data nums1: .word 0, 0, 0, 0, 2, 7, 11, 15 target1: .word 9 size1: .word 8 # second test data nums2: .word 3, 2, 4 target2: .word 6 size2: .word 3 # third test data nums3: .word 3, 3 target3: .word 6 size3: .word 2 endl: .string "\n" .text main: la s0, nums1 #load nums1 address to s0 lw s1, target1 #load target1 to s1z lw s2, size1 #load size1 to s2 jal twoSum la s0, nums2 lw s1, target2 lw s2, size2 jal twoSum la s0, nums3 lw s1, target3 lw s2, size3 jal twoSum li a7, 10 #end ecall twoSum: mv t0, zero #int i=0 addi t1, t0, 1 #int j=i+1 mv a0, s0 #store nums1 address for i mv a1, s0 #store nums1 address for j loop2: lw a2, 0(a0) #load nums[i] lw a3, 4(a1) #load nums[j] add a3, a2, a3 #nums1[i]+nums2[j] beq a3, s1, print #if (nums1[i]+nums2[j]==target) go to print addi a1, a1, 4 #point to next nums1[j] address addi t1, t1, 1 #j++ #modified blt t1, s2, loop1 lw a3, 4(a1) add a3, a2, a3 #a3=j -> a3=i+j beq a3, s1, print addi a1, a1, 4 #point to next nums1[j] address addi t1, t1, 1 #j++ blt t1, s2, loop2 loop1: addi t0, t0, 1 #i++ addi a0, a0, 4 #point to next nums[i] address mv a1, a0 addi t1, t0, 1 #j=i+1 bltu t0, s2, loop2 #if (i<numsize) go to loop2 print: mv a0, t0 #move i li a7, 1 #print integer ecall mv a0, t1 #move j ecall la a0, endl #load \n li a7, 4 #print string ecall jr ra ``` ### RTL sim result ``` 0 1 1 2 0 1 Excuting 5268 instructions, 7084 cycles, 1.344 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.133 s Simulation cycles: 7095 Simulation speed : 0.0533459 MHz ``` ### ISS sim result ``` 0 1 1 2 0 1 Excuting 5268 instructions, 7084 cycles, 1.345 CPI Program terminate Simulation statistics ===================== Simulation time : 0.009 s Simulation cycles: 7084 Simulation speed : 0.790 MHz ``` ### Forwarding #### Control hazard Using the GTKWave tool to observed the hazard. ![](https://i.imgur.com/qQo4qfZ.png) ``` 30: fd410113 addi sp,sp,-44 # 40000 <_stack> 34: 0cc000ef jal ra,100 <main> 38: 3a41006f j 103dc <exit> 00000100 <main>: 100: 000207b7 lui a5,0x20 104: 09878793 addi a5,a5,152 # 20098 <environ+0xc> 108: 00c7a703 lw a4,12(a5) ``` Control hazard happened when meet the jump instruction, wrong pc address(pc+4) already enter into the first pipeline stage, `flush` signal need to help the other pipeline judge the current instruction right or not. In this problem, because of the `jal` instruction, instruction fectch stage will get the wrong pc address, and the `ex_jal` signal turn to 1. The `ex_flush` signal will turn to 1 in the next cycle, other pipeline wouldn't take the wrong instruction. ### Load-use hazard ![](https://i.imgur.com/LdNaBep.png) ``` 20: 0002a023 sw zero,0(t0) 24: 00428293 addi t0,t0,4 ``` In the 5-stage pipeline processor, load-use hazard happend because the instruction need to wait the MEM stage load/save the data to the memory. In the `srv32`, this processor have the 3-stage pipeline, the `MEM` and `ALU` in the same stage, so after `MEM` save the data to the memory and directly forwarding to the `ALU`, waiting for the next cycle, the forwading data and `addi`(next instruction) can enter to the `ALU` together. So design the `ALU` and `MEM` in the same stage can avoid the load-use hazard.`` ## LeetCode problem with medium difficulty — [11. Container With Most Water](https://leetcode.com/problems/container-with-most-water/description/) ### Origin code ```c= #include<stdio.h> int maxArea(int* height, int heightSize){ int temp, max=0; int left=0, right=heightSize-1; while(right>left){ if(height[left]<height[right]){ temp = (right-left)*height[left]; if(temp > max){max = temp;} left++; } else{ temp = (right-left)*height[right]; if(temp > max){max = temp;} right--; } } return max; } int main(){ int height[]={1,8,6,2,5,4,8,3,7}; int heightSize = 9; int max = maxArea(height,heightSize); printf("height(1)=%d\n",max); int height2[]={1,2,1}; int heightSize2 = 3; max = maxArea(height2,heightSize2); printf("height(2)=%d\n",max); int height3[]={1,3,6,9,5,4,8,3,7,2,1,2,3,4}; int heightSize3 = 14; max = maxArea(height3,heightSize3); printf("height(3)=%d",max); return 0 ; } ``` ### RTL sim result ``` height(1)=49 height(2)=2 height(3)=44 Excuting 5555 instructions, 7601 cycles, 1.368 CPI Program terminate - ../rtl/../testbench/testbench.v:434: Verilog $finish Simulation statistics ===================== Simulation time : 0.137 s Simulation cycles: 7612 Simulation speed : 0.055562 MHz ``` ### ISS sim result ``` ./rvsim --memsize 128 -l trace.log ../sw/container/container.elf height(1)=49 height(2)=2 height(3)=44 Excuting 5555 instructions, 7601 cycles, 1.368 CPI Program terminate Simulation statistics ===================== Simulation time : 0.006 s Simulation cycles: 7601 Simulation speed : 1.232 MHz make[1]: Leaving directory '/home/lab41206/verilator/srv32/tools' Compare the trace between RTL and ISS simulator ```