# Assignment3: SoftCPU contributed by <`OscarShiang` > ## Port `count_bits` into srv32 Because we now move our code from Ripes to bare metal environment, we need to follow the calling convention carefully. Thus we should ensure that all we use saved registers and temporary registers properly. :::info :pencil: The revised code can be found at [`6cc195f`](https://github.com/OscarShiang/srv32/blob/6cc195f/sw/count_bits/count_bits.s) ::: To verify the result of computing, I use newlib `printf` to print out the result. Then provide `%d` as format string to output decimal numbers. We can first debug our code on ISS infrastrature. The result is: ```shell $ tools/rvsim count_bits.elf [0,1,1,2,1] Excuting 4726 instructions, 5988 cycles, 1.267 CPI Program terminate Simulation statistics ===================== Simulation time : 0.000 s Simulation cycles: 5988 Simulation speed : 15.008 MHz ``` Next, we enter `make count_bits.run` under `sim/` directory to copy the memory layout of `count_bits` to the directory and start to simulate the result. ```shell $ make count_bits.run [0,1,1,2,1] Excuting 4726 instructions, 5988 cycles, 1.267 CPI Program terminate - ../rtl/../testbench/testbench.v:418: Verilog $finish Simulation statistics ===================== Simulation time : 0.068 s Simulation cycles: 5999 Simulation speed : 0.0882206 MHz ``` We can get the result similar to ISS one. Run the following command to generate `wave.fst` ```shell $ ./sim +trace ``` ## Analyze the waveform ### Latency of instrcution fetching As the below waveform shows, we can consider `imem` is a combinatial circuit on this CPU. We can retrieve the instruction with given address (`raddr`) in the same clock period. ![](https://i.imgur.com/fAfJYwe.png) ### Reason of stall Since srv32 does have fully bypassing, there is no stall resulting from data hazards. All nops are generated after control hazards happended. ![](https://i.imgur.com/WfbO8IY.png) So we should try to reduce the stalls because of control hazards. ### Control hazards srv32 is a 3 staged pipeline architecture. It has `F/D`, `E`, and `WB` stages ![](https://github.com/sysprog21/srv32/raw/devel/images/forwarding.svg) Its branch penalty will be 2 cycles since it has to flush the incorrect instruction in `F/D` and then load the new instruction. ![](https://github.com/sysprog21/srv32/raw/devel/images/branch.svg) In the following waveform, we can see that when the signal `branch_taken` is set, it generates a flush signal which takes 2 cycles to complete. Then reload the right instruction after flushing. ![](https://i.imgur.com/ceSrNSu.png) ## Improve the performance ### Branchless popcount According to the section we discussed before, we know that the stalls are all generated by control hazards. To reduce the executing cycles, we can try to apply branchless algorithm to compute `popcount` For comparison, I first compute the branch_taken count in the original code: there are `632` branches been taken. Then we apply the algorithm mentioned in [amacc](https://github.com/jserv/amacc/blob/master/amacc.c#L504) to revise our code. The code is now changed to: ``` popcount: srli t0, a0, 1 li t1, 0x55555555 and t0, t0, t1 sub a0, a0, t0 li t1, 0x33333333 srli t2, a0, 2 and t2, t2, t1 and a0, a0, t1 add a0, a0, t2 srli t0, a0, 4 add a0, a0, t0 li t0, 0x0f0f0f0f and a0, a0, t0 li t0, 0x01010101 mul a0, a0, t0 srli a0, a0, 24 ret ``` We can see from the output of the simulation: ``` [0,1,1,2,1] Excuting 4781 instructions, 6023 cycles, 1.259 CPI Program terminate - ../rtl/../testbench/testbench.v:418: Verilog $finish Simulation statistics ===================== Simulation time : 0.043 s Simulation cycles: 6034 Simulation speed : 0.140326 MHz ``` Although the count of branch_taken is slightly reduced to `622`, we need more instructions to compute the popcount for each numbers. In result, it cancels out the benefit of branchless implementation and cause the total executing cycles to increase. ### Loop unrolling In order to eliminate the stall caused by for-loop, we simply expand the loop body and rewrite the `count_bits` and `print` part. ``` [0,1,1,2,1] Excuting 4754 instructions, 5982 cycles, 1.258 CPI Program terminate Simulation statistics ===================== Simulation time : 0.000 s Simulation cycles: 5982 Simulation speed : 14.918 MHz ``` As the result shown above, we get 41 cycles reduced after this modification. The count of `branch_taken` is eliminated to `615` ### Inline function Instead of adding `inline` prefix in function prototype, I use a macro to replace the use of `popcount` to reduce the jump instruction to function body. The behavior of the macro is similar to `count_bits` function, but it will not result in function call. It expand in compiling time and increase the code size instead. ``` .macro pop_cnt srli t0, a0, 1 li t1, 0x55555555 and t0, t0, t1 sub a0, a0, t0 li t1, 0x33333333 srli t2, a0, 2 and t2, t2, t1 and a0, a0, t1 add a0, a0, t2 srli t0, a0, 4 add a0, a0, t0 li t0, 0x0f0f0f0f and a0, a0, t0 li t0, 0x01010101 mul a0, a0, t0 srli a0, a0, 24 .endm ``` In the result of simulation, we receive another 30 cycles reduction. ``` [0,1,1,2,1] Excuting 4744 instructions, 5952 cycles, 1.255 CPI Program terminate Simulation statistics ===================== Simulation time : 0.000 s Simulation cycles: 5952 Simulation speed : 15.030 MHz ``` The count of `branch_taken` is down to `605`. ###### tags: `arch2021`