# [Assignment3: SoftCPU](https://hackmd.io/@sysprog/2021-arch-homework3) contributed by <[chinghongfang](https://github.com/chinghongfang)> ###### tags: `RISC-V` `jserv` ## Before Everything ### Installlations 1. Get [xPack GNU RISC-V Embedded GCC](https://github.com/xpack-dev-tools/riscv-none-embed-gcc-xpack/). ```bash= $ cd /tmp $ wget https://github.com/xpack-dev-tools/riscv-none-embed-gcc-xpack/releases/download/v10.2.0-1.1/xpack-riscv-none-embed-gcc-10.2.0-1.1-linux-x64.tar.gz $ tar zxvf xpack-riscv-none-embed-gcc-10.2.0-1.1-linux-x64.tar.gz $ cp -af xpack-riscv-none-embed-gcc-10.2.0-1.1 $HOME/riscv-none-embed-gcc # Set path (example on fish) set -a PATH /home/hong/Documents/course/Computer_Architecture/riscv-none-gcc/bin ``` 2. Get [srv32](https://github.com/sysprog21/srv32). ```bash= git clone https://github.com/sysprog21/srv32 ``` 3. Get [gtkwave](http://gtkwave.sourceforge.net). (Find your version on the website.) 4. Modify the makefile ```bash= # In srv32/sw/common directory vim Makefile.common # Search for CFLAGS, and change if needed ``` 5. Put your source in srv32/sw ```bash= # In srv32/ mkdir sw/hw3 # Copy existing make file to your directory cp sw/hello/Makefile sw/hw3/ # Write down your code vim sw/hw3/hw3.c # Run it make hw3 ``` ### Workflow ![](https://i.imgur.com/ZqC81m0.png) --- ## Start ### Analysis srv32 CPU srv32 CPU architecture is a **S**imple **r**isc-**v** **32**-bit CPU. By the author of this CPU, this CPU has only 3 pipeline stages. #### Datapath and Pipeline This picture is drawn according to the file `srv32/rtl/riscv.v`. There are two parts that is out of the file. They are **instruction memory** and **data memory**. (By observing the wave, I think that they are positive edge trigger.) ![](https://i.imgur.com/oPL7ZB0.png) 1. Start from the `fetch_pc` register. It is the register that store the program counter and fetch instruction from memory. 2. `Instruction Memory`: As the clk positive edge triggerred, it output the instruction the last `fetch_pc` want. 4. `ex_reg`: Get register data on positive edge triggerred. 5. `wb_reg`: Get executed result. Also the memory data. The executed result may forward to ALU. 6. `fetch_pc`, `Register Files`: Change the pc, flush the ex_reg, flush the wb_reg **if branch taken**. Write date to register files. The memory data may forward to ALU. #### Hazzard No data hazzard. Eliminate by full fowarding. By the pipeline, the **memory data** read by last instruction can forward to EX stage. #### Stall Branches cause stalls. (To know when the stall happened, observe the `ex_flush` wire and `wb_flush` wire in the wave.fst file. They are the **signals** to clear the pipeline register) By the pipeline, the branching result is determined at **EX stage**. So we may fetch **wrong inst**ructions. We can flush the wrong instructions and make it looks like a stall. By the pipeline, branch taken will cause **2 stalls**. --- ### [Min Patches](https://github.com/chinghongfang/ComputerArchitecture/blob/main/hw1/patchingArray.c) By reading disassembled file, the optimizer inline the function. Original function: ```c= int minPatches(const int *nums, int numsSize, int n){ unsigned int m = 1, res = 0, i = 0; while (m <= n) m += (i < numsSize && nums[i] <= m) ? nums[i++] : (res++, m); return res; } ``` ``` ./rvsim --memsize 128 -l trace.log ../sw/hw/hw.elf 29 Excuting 1649 instructions, 2135 cycles, 1.295 CPI Program terminate Simulation statistics ===================== Simulation time : 0.000 s Simulation cycles: 2135 Simulation speed : 4.886 MHz ``` To lower down the simulation cycles, we start from branch. Because only when the branch not taken takes stalls. #### Find out branches ![](https://i.imgur.com/zBhu9ef.png) (Around 20 flushes looping) We found that there are **many branch prediction failed** in assembly code 124. And 11C~124 is `m+=(res++, m);` ```asm= 11c: 00179793 slli a5,a5,0x1 120: 00158593 addi a1,a1,1 124: fef77ce3 bgeu a4,a5,11c <main+0x70> ``` Many branch prediction **failed** here. #### Optimize So I expand this section as [following](https://github.com/chinghongfang/ComputerArchitecture/blob/main/hw1/reduceBranch.c). (I know that this loop execute **at most 31 times**. So I can expand 31 times by hand.) ```c= int minPatches(const int *nums, int numsSize, int n){ unsigned int m = 1, res = 0, i = 0; //while (m <= n) //m += (i < numsSize && nums[i] > m) ? nums[i++] : (res++, m); if (m > n) return 0; select_add: if (i >= numsSize) goto only_plus_m; if (nums[i] > m) goto plus_m; m += nums[i++]; if (m > n) goto ret; goto select_add; plus_m: m <<= 1; ++res; if (m > n) goto ret; goto select_add; only_plus_m: m <<= 1; ++res; if (m > n) goto ret; m <<= 1; ++res; if (m > n) goto ret; m <<= 1; ++res; if (m > n) goto ret; // ... ret: return res; } ``` Execution wave: ![](https://i.imgur.com/mb70wZB.png) (10 flushes in total) Result ``` ./rvsim --memsize 128 -l trace.log ../sw/hw/hw.elf 29 Excuting 1625 instructions, 2065 cycles, 1.271 CPI Program terminate Simulation statistics ===================== Simulation time : 0.000 s Simulation cycles: 2065 Simulation speed : 4.791 MHz ``` We reduced 2135-2065 = 70 cycles.