# Final Project - Analyze and Improve srv32 ###### tags: `computer architure 2021` ## Object 1. Looking how the static branch prediction works in srv32. 2. If there are something out of expection, how can we improve the static branch prediction. 3. From the implement of RTL, discussing about branch penalty. 4. Analyze and improve the branch predictor implementation. ## Intorduction ### Branch Prediction In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g., an if–then–else structure) will go before this is known definitively. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures. #### Static Prediction Static branch prediction is the method of picking the speculative instruction based on just looking at the branch itself. There are different type of static prediction method.For example, In the early time of MIPS,they used single-direction branch prediction:they always predict that a conditional jump will not be taken, so they always fetch the next sequential instruction. Only when the branch or jump is evaluated and found to be taken that does the instruction pointer get set to a non-squential address. But there is a shortage about static prediction, as the pic show below, we can find out the frequency of backward prediction might be greater than forward prediction, so we need a new system called `Dynamic Prediction` to make the compiler more efficiently. ![](https://i.imgur.com/8rM3aS2.jpg) #### Dynamic Prediction For dynamic prediction, the branch will make a educated guess about the branch will be taken or not. The hardward can look for clues based on instructions or the usetage history. Learning based on past behavior: * **1-bit predictior:** use 1-bit counter to record the last outcome of the branch. * **2-bits predictor:** It will change if the branch mispredict twice. ![](https://i.imgur.com/AneOokt.png) ### Branch Penalty When the branch is taken during the execute phase, it needs to flush the instructions that have been fetched into the pipeline, which causes a delay of two instructions, so the extra cost of the branch is two. ![](https://i.imgur.com/JMi8Kz2.png) ### srv32 Architecture ![](https://i.imgur.com/fk92IF2.png) ### Branch Prediction in srv32 A table called branch prediction table is newly created. It is indexed by `iaddr` values and stores a Taken/Not Taken flag (T/NT flag) and the target address wherever applicable. In the code, the T/NT flags are stored in a register called tnt_tab and target addresses are stored in a register called targ_tab. These 2 registers together form the branch prediction table. For every new iaddr value, this table will be checked to see if the Taken flag is 1. If it is 1, a flag called ‘pc_taken’ is set and the corresponding target address is stored in a register called ‘pc_pred’ and loaded as the next iaddr value. A flag called ‘branched' is set when the new iaddr value is loaded. The decision of whether branching should take place or not is known only in the EX stage. ## Observe srv32 branch prediction To see how the branch prediction work in srv32, we write a simple code with some loop and I-instruction. ```verilog= .global main main: li s1, 0 li s2, 3 loop1: #for loop, if i = 3, jump to the end beq s1, s2, end1 addi s1, s1, 1 j loop1 end1: #exit li a0, 0 ret ``` before run the code, we need to change the makefile to make it run ``` include ../common/Makefile.common EXE = .elf SRC = test1.s CFLAGS += -L../common LDFLAGS += -T ../common/default.ld TARGET = hw3 OUTPUT = $(TARGET)$(EXE) .PHONY: all clean all: $(TARGET) $(TARGET): $(SRC) $(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) -g $(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin $(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin $(OBJCOPY) -O binary $(OUTPUT) memory.bin $(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis $(READELF) -a $(OUTPUT) > $(TARGET).symbol clean: $(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin ``` ```shell= $ make hw3 make[1]: 進入目錄「/home/evelyn/srv32/sw」 make -C common make[2]: 進入目錄「/home/evelyn/srv32/sw/common」 make[2]: 對「all」無需做任何事。 make[2]: 離開目錄「/home/evelyn/srv32/sw/common」 make[1]: 離開目錄「/home/evelyn/srv32/sw」 make[1]: 進入目錄「/home/evelyn/srv32/sim」 Excuting 32 instructions, 46 cycles, 1.437 CPI Program terminate - ../rtl/../testbench/testbench.v:418: Verilog $finish Simulation statistics ===================== Simulation time : 0.012 s Simulation cycles: 57 Simulation speed : 0.00475 MHz make[1]: 離開目錄「/home/evelyn/srv32/sim」 make[1]: 進入目錄「/home/evelyn/srv32/tools」 ./rvsim --memsize 128 -l trace.log ../sw/hw3/hw3.elf Excuting 32 instructions, 46 cycles, 1.438 CPI Program terminate Simulation statistics ===================== Simulation time : 0.000 s Simulation cycles: 46 Simulation speed : 1.586 MHz make[1]: 離開目錄「/home/evelyn/srv32/tools」 Compare the trace between RTL and ISS simulator === Simulation passed === ``` In order to see how it did on srv32, we open `wave.fst` file under sim floder with GTKWave. In order to know the addr in code, open `hw3.dis` file under hw3 floder ``` 0000003c <main>: 3c: 00000493 li s1,0 40: 00300913 li s2,3 00000044 <loop1>: 44: 01248663 beq s1,s2,50 <end1> 48: 00148493 addi s1,s1,1 4c: ff9ff06f j 44 <loop1> 00000050 <end1>: 50: 00000513 li a0,0 54: 00008067 ret ``` and open the trace.log file which generate by ISS simulator . ``` 19 00000044 01248663 20 00000048 00148493 x09 (s1) <= 0x00000001 21 0000004c ff9ff06f x00 (zero) <= 0x00000050 24 00000044 01248663 25 00000048 00148493 x09 (s1) <= 0x00000002 26 0000004c ff9ff06f x00 (zero) <= 0x00000050 29 00000044 01248663 30 00000048 00148493 x09 (s1) <= 0x00000003 31 0000004c ff9ff06f x00 (zero) <= 0x00000050 34 00000044 01248663 37 00000050 00000513 x10 (a0) <= 0x00000000 ``` After analyze the wave we can find out that it will asume the branch is not taken everytime (as the pic above). When the branch is taken, `ex_flush` will abandon the result. ![](https://i.imgur.com/pyfnh5E.png) So, we can find out wether the `beq` is equal or not, the branch will always not be taken,they are always single-direction, so we can conclude that `srv32` use static forward prediction. ## Implementation of Branch Penalty in srv32 To see how does the pipeline work in srv32, we have to write a simple code, test.s for showing off. Here is our sample assembly code. In order to make the branch penalty more obvious, we use `beq` to construct a small for loop. ```verilog= .global main main: li s1, 0 li s2, 3 li s3, 0 loop1: #for loop, if i = 3, jump to the end beq s1, s2, end1 addi s1, s1, 1 addi s3, s3, 1 j loop1 end1: #exit li a0, 0 ret ``` Before we complie our assembly code on RTL, we have to modify a little bit in makefile. ```verilog= include ../common/Makefile.common EXE = .elf SRC = test.s CFLAGS += -L../common LDFLAGS += -T ../common/default.ld TARGET = test OUTPUT = $(TARGET)$(EXE) .PHONY: all clean all: $(TARGET) $(TARGET): $(SRC) $(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) -g $(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin $(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin $(OBJCOPY) -O binary $(OUTPUT) memory.bin $(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis $(READELF) -a $(OUTPUT) > $(TARGET).symbol clean: $(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin ``` ```shell $ make test make[1]: Entering directory '/home/peishan/srv32/sw' make -C common make[2]: Entering directory '/home/peishan/srv32/sw/common' make[2]: Nothing to be done for 'all'. make[2]: Leaving directory '/home/peishan/srv32/sw/common' make[2]: Entering directory '/home/peishan/srv32/sw/test' riscv-none-embed-gcc -O1 -Wall -march=rv32im -mabi=ilp32 -nostartfiles -nostdlib -L../common -o test.elf test.s -lc -lm -lgcc -lsys -T ../common/default.ld -g riscv-none-embed-objcopy -j .text -O binary test.elf imem.bin riscv-none-embed-objcopy -j .data -O binary test.elf dmem.bin riscv-none-embed-objcopy -O binary test.elf memory.bin riscv-none-embed-objdump -d test.elf > test.dis riscv-none-embed-readelf -a test.elf > test.symbol make[2]: Leaving directory '/home/peishan/srv32/sw/test' make[1]: Leaving directory '/home/peishan/srv32/sw' make[1]: Entering directory '/home/peishan/srv32/sim' Excuting 32 instructions, 46 cycles, 1.437 CPI Program terminate - ../rtl/../testbench/testbench.v:418: Verilog $finish Simulation statistics ===================== Simulation time : 0.069 s Simulation cycles: 57 Simulation speed : 0.000826087 MHz make[1]: Leaving directory '/home/peishan/srv32/sim' make[1]: Entering directory '/home/peishan/srv32/tools' ./rvsim --memsize 128 -l trace.log ../sw/test/test.elf Excuting 32 instructions, 46 cycles, 1.438 CPI Program terminate Simulation statistics ===================== Simulation time : 0.000 s Simulation cycles: 46 Simulation speed : 0.630 MHz make[1]: Leaving directory '/home/peishan/srv32/tools' Compare the trace between RTL and ISS simulator === Simulation passed === ``` Open `wave.fst` in GTKWave, and take the screenshot below for example. We can find there are some bubbles in the pipeline. ![](https://i.imgur.com/wEAlJVE.png) ### Branch Penalty According to [CA2020-Team Prject-Analyze and improve srv32](https://hackmd.io/@WeiCheng14159/HkNvemOpD), we can get the implementation of branch penalty in srv32. Branch penalty is the number of instructions killed after a branch instruction if a branch is TAKEN. Branch result is resolved at the end EX stage by ALU so the instruction fetch in IF/ID might need to be killed if a branch is taken. In this processor; however, the address of next instruction (next PC) should be fed into I-MEM a cycle ahead. Thus, the branch penalty for srv32 is 2. To clarify, by the time next PC is resolved, one instruction has been fetch into pipeline and another PC has been calculated because address should be computed one cycle ahead. The number of instructions that should be killed (a.k.a. set to NOP) is 2 instruction after a branch instruction if the branch is actually taken. ### Pipelining Conditional Branches ![](https://i.imgur.com/9oig4qF.png) If the branch is taken, there wiil come out following actions. First, kill the two following instructions. Second, the instruction at the decode stage is not valid ⇒ stall signal is not valid We can see how does it truly work in srv32. Take the waveform below for example, and compare with the assembly code from test.dis and trace.log. ``` 20 00000048 01248863 21 0000004c 00148493 x09 (s1) <= 0x00000001 22 00000050 00198993 x19 (s3) <= 0x00000001 23 00000054 ff5ff06f x00 (zero) <= 0x00000058 26 00000048 01248863 27 0000004c 00148493 x09 (s1) <= 0x00000002 28 00000050 00198993 x19 (s3) <= 0x00000002 29 00000054 ff5ff06f x00 (zero) <= 0x00000058 ``` ``` 00000048 <loop1>: 48: 01248863 beq s1,s2,58 <end1> 4c: 00148493 addi s1,s1,1 50: 00198993 addi s3,s3,1 54: ff5ff06f j 48 <loop1> 00000058 <end1>: 58: 00000513 li a0,0 5c: 00008067 ret ``` ![](https://i.imgur.com/SIHLXJn.png) In the picture above, we can see flush works in EX stage and WB stage during the red circles. And there comes out two instructions been killed after the branch taken. If we add one instruction into the loop, we can see there is still flushing for two cycle in waveform. ![](https://i.imgur.com/DihOTEZ.png) ### Pipelining Jumps After reading the passage from [lecture09_RISCV_Impl_pipeline](https://passlab.github.io/CSE564/notes/lecture09_RISCV_Impl_pipeline.pdf), we can find that a jump instruction will kill the follow instructions. Here we are going to discuss about how does jumps work in srv32. ![](https://i.imgur.com/run6M54.png) The meaning of the red circles is `NOP`. Here we can see the instruction `j 48<loop>` is `FF5FF06F`, `li a0, 0` is `00000513`, `ret` is `00008067`. In this section, `li a0, 0` and `ret` represent `NOP`. We can also see that the flush works during the red cricles. Furthermore, I sort out the pipeline to the form below and then it's easier to do investigation. ![](https://i.imgur.com/KARz8VT.png) | clock | clk1 | clk2 | clk3 | clk4 | clk5 | clk6 | clk7 | clk8 | | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | | IF/ID |`j`| `li ` | `ret` |`beq `|`addi s1`|`addi s3`| | EX| | `j` |`li` | `ret` | `beq` | `addi s1` |`addi s3`| | WB| | | `j` | `li ` | `ret` | `beq` | `addi s1` |`addi s3`| | clock | clk1 | clk2 | clk3 | clk4 | clk5 | clk6 | clk7 | clk8 | | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | | `j 48<loop1>` | IF/ID | EX | WB | | `beq s1, s2, end1`| | NOP | NOP | IF/ID | EX | WB | | `addi s1, s1, 1` | | | NOP | NOP | IF/ID | EX | WB | | `addi s3, s3, 1` | | | | NOP | NOP | IF/ID | EX | WB | ## Proposal of Improving Branch Prediction in srv32 In order to improve the performance of branch predictor, we have to understand how does the branch prediction works in srv32. So in the articles above, we have already discussed something about it and different kinds of branch prediction. In srv32, we can see there is static branch prediction inside the processor, and it always predict that the branch does not be taken. If the branch is taken, srv32 will flush the two instructions behind. We need more information to find out the method for improving branch prediction, we read a lot research for RISCV. And we found there is an related article [Design and Development of an Efficient Branch Predictor for an In-order RISC-V Processor](https://www.researchgate.net/publication/346893502_Design_and_Development_of_an_Efficient_Branch_Predictor_for_an_In-order_RISC-V_Processor). In this passage, we can see some implementation of improving the branch prediction in RISCV. They use an Branch Buffer Predictor(BTB) and Pattern History Tables(PHT) to make a dynamic branch predictor. Here is the introduction of their design. We think that is more likely to Two-level predictor The two-tier predictor separates the branch history into the branch history record and pattern history table. The pattern history table lists the frequency of each occurrence of the branch. The content of the branch history record is used to index the pattern history table. might be the good material of implementation of prench predictor. The branch predictor consists of PHT and BTB, which are commonly used data structures. The PHT and BTB have been indexed according to the branch address. The PHT forecasts whether the branch is taken or not. The following instruction address is taken from the BTB when the branch is taken. The next command address if the branch is not taken is the current branch address plus instruction size. The PHT uses the 2-bit saturation counter state transition, which increases when the prediction is correct and decreases if the prediction is wrong. The PHT uses a valid bit to ensure the finishing of the training period. The valid bit will be initially set to zero. When the branch is encountered for the first time, it changes the valid bit to zero. The target branch address of the current branch instruction is stored in the BTB. The BTB is updated by the execution unit of the pipelined processor. Picture below show block diagram of the branch predictor. ![](https://i.imgur.com/dZA86q8.png) Picture below show architecture of the branch predictor. ![](https://i.imgur.com/w2T6Boi.png) There is another research of the proformance of implemantation of dynamic branch prediction Dynamic Branch Prediction, [Modeller for RISC Architecture](https://ieeexplore.ieee.org/document/6918861). And we can see the conclusion inside. In the paper, they compare three different type of branch predictor: one-bit branch predictior, always taken branch predictor and saturating counter. The prediction of saturating counter is depend on how strong that the state is, as the picture shows below: ![](https://i.imgur.com/CJGBELI.png) The paper conducted four type patten to compare the three predictor. The hit indicates the accurate branch prediction and miss indicates the wrong prediction. ![](https://i.imgur.com/yW9y3yy.png) ![](https://i.imgur.com/IjfkypB.png) ![](https://i.imgur.com/vWfbyos.png) ![](https://i.imgur.com/MmNbe3F.png) Using the function to calculate the performance, saturating counter has the better precision to hit the branch. ![](https://i.imgur.com/A2OgEmr.png) ![](https://i.imgur.com/U7L070u.png) In conclusion, the result might be better than original processor because the dynamic branch prediction does better than static branch prediction. The branch prediction in srv32 always predict the same direction, so if might come out some problem when the probability of forward or backward taken is much larger than the other. If we can use dynamic branch predictor, than it can modify the way of decide how does the branch be taken. The method above might be the good material of implementation of improving branch predictor. ## Reference 1. [Branch predictor](https://en.wikipedia.org/wiki/Branch_predictor) 2. [Design and Development of an Efficient Branch Predictor for an In-order RISC-V Processor](https://www.researchgate.net/publication/346893502_Design_and_Development_of_an_Efficient_Branch_Predictor_for_an_In-order_RISC-V_Processor). 3. [Dynamic Branch Prediction](https://www.cs.umd.edu/~meesh/411/CA-online/chapter/dynamic-branch-prediction/index.html) 4. [lecture09_RISCV_Impl_pipeline](https://passlab.github.io/CSE564/notes/lecture09_RISCV_Impl_pipeline.pdf) 5. [CA2020-Team Prject-Analyze and improve srv32](https://hackmd.io/@WeiCheng14159/HkNvemOpD)