Final Project - Analyze and Improve srv32

tags: `computer architecture 2021`

Objectives

Examine how static branch prediction works in srv32.
If anything behaves unexpectedly, work out how the static branch prediction can be improved.
Discuss the branch penalty based on the RTL implementation.
Analyze and improve the branch predictor implementation.

Introduction

Branch Prediction

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g., an if–then–else structure) will go before this is known definitively. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures.

Static Prediction

Static branch prediction picks the speculative instruction by looking only at the branch itself.
There are several static prediction methods. For example, early MIPS processors used single-direction branch prediction: they always predicted that a conditional jump would not be taken, so they always fetched the next sequential instruction. Only when the branch or jump was evaluated and found to be taken was the instruction pointer set to a non-sequential address.
Static prediction has a shortcoming, however. Backward branches may be taken more often than forward branches, so a single fixed direction cannot suit both cases. This motivates dynamic prediction, which lets the processor adapt its guess at run time.

Dynamic Prediction

With dynamic prediction, the hardware makes an educated guess about whether the branch will be taken. It can look for clues in the instructions themselves or in the branch's execution history.
Learning based on past behavior:

1-bit predictor: uses a 1-bit counter to record the last outcome of the branch.
2-bit predictor: changes its prediction only after the branch has been mispredicted twice.
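
The difference between the two predictors can be sketched in a few lines of Python. This is only an illustrative model, not srv32 code: the function names, the starting states, and the outcome sequence are assumptions chosen for the example.

```python
def one_bit_trace(outcomes, last=False):
    """Prediction made before each outcome by a 1-bit predictor."""
    predictions = []
    for taken in outcomes:
        predictions.append(last)
        last = taken  # remember only the most recent outcome
    return predictions


def two_bit_trace(outcomes, counter=3):
    """Prediction made before each outcome by a 2-bit saturating counter.

    Counter values 0 and 1 predict not taken, 2 and 3 predict taken.
    """
    predictions = []
    for taken in outcomes:
        predictions.append(counter >= 2)
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return predictions


# Hypothetical sequence: one not-taken outcome between taken outcomes.
outcomes = [True, False, True, True]
one_bit = one_bit_trace(outcomes, last=True)
two_bit = two_bit_trace(outcomes, counter=3)
```

In this run the 1-bit predictor mispredicts twice (the not-taken outcome and the taken outcome right after it), while the 2-bit counter mispredicts only once because a single miss does not flip its prediction.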

Branch Penalty

When a branch is found to be taken in the execute stage, the instructions already fetched into the pipeline must be flushed. This causes a delay of two instructions, so the extra cost of the branch is two.

srv32 Architecture

[Figure: srv32 architecture diagram]

Branch Prediction in srv32

A new table, called the branch prediction table, is created. It is indexed by iaddr values and stores a Taken/Not Taken flag (T/NT flag) and, where applicable, the target address.

In the code, the T/NT flags are stored in a register called tnt_tab and the target addresses are stored in a register called targ_tab. These two registers together form the branch prediction table.

For every new iaddr value, this table is checked to see whether the Taken flag is 1. If it is, a flag called 'pc_taken' is set and the corresponding target address is stored in a register called 'pc_pred' and loaded as the next iaddr value. A flag called 'branched' is set when the new iaddr value is loaded. Whether the branch should actually be taken is known only in the EX stage.
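
The lookup described above can be modeled roughly as follows. This is a Python sketch rather than the RTL: the names tnt_tab, targ_tab, pc_taken and pc_pred come from the text, while the table size, the indexing function, the fall-through value iaddr + 4, and the update function are assumptions made for illustration.

```python
TABLE_BITS = 4                 # hypothetical table size: 16 entries
TABLE_SIZE = 1 << TABLE_BITS

tnt_tab = [0] * TABLE_SIZE     # Taken / Not Taken flags
targ_tab = [0] * TABLE_SIZE    # predicted target addresses


def index_of(iaddr):
    """Assumed indexing: drop the 2 byte-offset bits, keep the low bits."""
    return (iaddr >> 2) & (TABLE_SIZE - 1)


def predict(iaddr):
    """Return (pc_taken, pc_pred) for the instruction at iaddr."""
    i = index_of(iaddr)
    pc_taken = tnt_tab[i] == 1
    # Assumption: without a taken flag, fetch continues sequentially.
    pc_pred = targ_tab[i] if pc_taken else iaddr + 4
    return pc_taken, pc_pred


def update(iaddr, taken, target):
    """Record the outcome once it is known in the EX stage."""
    i = index_of(iaddr)
    tnt_tab[i] = 1 if taken else 0
    if taken:
        targ_tab[i] = target


first = predict(0x4C)       # nothing learned yet
update(0x4C, True, 0x44)    # a jump at 0x4c resolved as taken to 0x44
second = predict(0x4C)      # now predicted taken with target 0x44
```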

Observe srv32 branch prediction

To see how branch prediction works in srv32, we wrote a short program with a loop and a few I-type instructions.

.global main
main:
    li   s1, 0
    li   s2, 3
loop1: #for loop, if i = 3, jump to the end 
    beq  s1, s2, end1
    addi s1, s1, 1
    j    loop1
end1:  #exit
    li a0, 0
    ret

Before running the code, we need to adapt the Makefile so that it builds our program:

include ../common/Makefile.common

EXE      = .elf
SRC      = test1.s
CFLAGS  += -L../common
LDFLAGS += -T ../common/default.ld
TARGET   = hw3
OUTPUT   = $(TARGET)$(EXE)

.PHONY: all clean

all: $(TARGET)

$(TARGET): $(SRC)
	$(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) -g
	$(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin
	$(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin
	$(OBJCOPY) -O binary $(OUTPUT) memory.bin
	$(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis
	$(READELF) -a $(OUTPUT) > $(TARGET).symbol

clean:
	$(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin

$ make hw3
make[1]: Entering directory '/home/evelyn/srv32/sw'
make -C common
make[2]: Entering directory '/home/evelyn/srv32/sw/common'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/evelyn/srv32/sw/common'
make[1]: Leaving directory '/home/evelyn/srv32/sw'
make[1]: Entering directory '/home/evelyn/srv32/sim'

Excuting 32 instructions, 46 cycles, 1.437 CPI
Program terminate
- ../rtl/../testbench/testbench.v:418: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.012 s
Simulation cycles: 57
Simulation speed : 0.00475 MHz

make[1]: Leaving directory '/home/evelyn/srv32/sim'
make[1]: Entering directory '/home/evelyn/srv32/tools'
./rvsim --memsize 128 -l trace.log ../sw/hw3/hw3.elf

Excuting 32 instructions, 46 cycles, 1.438 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.000 s
Simulation cycles: 46
Simulation speed : 1.586 MHz

make[1]: Leaving directory '/home/evelyn/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===

To see how the program behaves on srv32, we open the wave.fst file in the sim folder with GTKWave.
To find the instruction addresses, we open the hw3.dis file in the hw3 folder:

0000003c <main>:
  3c:	00000493          	li	s1,0
  40:	00300913          	li	s2,3

00000044 <loop1>:
  44:	01248663          	beq	s1,s2,50 <end1>
  48:	00148493          	addi	s1,s1,1
  4c:	ff9ff06f          	j	44 <loop1>

00000050 <end1>:
  50:	00000513          	li	a0,0
  54:	00008067          	ret

We also open the trace.log file generated by the ISS simulator:

        19 00000044 01248663
        20 00000048 00148493 x09 (s1) <= 0x00000001
        21 0000004c ff9ff06f x00 (zero) <= 0x00000050
        24 00000044 01248663
        25 00000048 00148493 x09 (s1) <= 0x00000002
        26 0000004c ff9ff06f x00 (zero) <= 0x00000050
        29 00000044 01248663
        30 00000048 00148493 x09 (s1) <= 0x00000003
        31 0000004c ff9ff06f x00 (zero) <= 0x00000050
        34 00000044 01248663
        37 00000050 00000513 x10 (a0) <= 0x00000000

Analyzing the waveform shows that srv32 assumes the branch is not taken every time. When the branch is actually taken, ex_flush discards the results of the wrongly fetched instructions.

So whether or not the beq condition holds, the branch is always predicted not taken. The prediction is always single-direction, so we conclude that srv32 uses static always-not-taken prediction.
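
The same observation can be cross-checked against the trace with a short script. The sketch below assumes that the first column of trace.log is a cycle count and the second is the program counter; the rows are copied from the excerpt above with the register write-back text dropped, and the helper names are hypothetical.

```python
TRACE = """\
19 00000044 01248663
20 00000048 00148493
21 0000004c ff9ff06f
24 00000044 01248663
25 00000048 00148493
26 0000004c ff9ff06f
29 00000044 01248663
30 00000048 00148493
31 0000004c ff9ff06f
34 00000044 01248663
37 00000050 00000513
"""


def parse(trace):
    """Return a list of (cycle, pc) pairs, one per trace line."""
    rows = []
    for line in trace.splitlines():
        fields = line.split()
        rows.append((int(fields[0]), int(fields[1], 16)))
    return rows


def idle_cycles(rows):
    """Return (pc, redirected, idle) for each pair of consecutive rows."""
    result = []
    for (c0, pc0), (c1, pc1) in zip(rows, rows[1:]):
        redirected = pc1 != pc0 + 4   # control flow left the sequential path
        result.append((pc0, redirected, c1 - c0 - 1))
    return result


stats = idle_cycles(parse(TRACE))
taken_bubbles = {idle for _, redirected, idle in stats if redirected}
sequential_bubbles = {idle for _, redirected, idle in stats if not redirected}
```

Under these assumptions, every redirected transfer (the j back to loop1 and the final taken beq) is followed by two idle cycles, while sequential instructions follow each other without a gap.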

Implementation of Branch Penalty in srv32

To see how the pipeline works in srv32, we wrote a simple program, test.s.
Here is our sample assembly code. To make the branch penalty more obvious, we use beq to build a small for loop.

.global main
main:
    li   s1, 0
    li   s2, 3
    li   s3, 0
loop1: #for loop, if i = 3, jump to the end 
    beq  s1, s2, end1
    addi s1, s1, 1
    addi s3, s3, 1
    j    loop1
end1:  #exit
    li a0, 0
    ret

Before we build our assembly code for the RTL simulation, we have to modify the Makefile slightly.

include ../common/Makefile.common

EXE      = .elf
SRC      = test.s
CFLAGS  += -L../common
LDFLAGS += -T ../common/default.ld
TARGET   = test
OUTPUT   = $(TARGET)$(EXE)

.PHONY: all clean

all: $(TARGET)

$(TARGET): $(SRC)
	$(CC) $(CFLAGS) -o $(OUTPUT) $(SRC) $(LDFLAGS) -g
	$(OBJCOPY) -j .text -O binary $(OUTPUT) imem.bin
	$(OBJCOPY) -j .data -O binary $(OUTPUT) dmem.bin
	$(OBJCOPY) -O binary $(OUTPUT) memory.bin
	$(OBJDUMP) -d $(OUTPUT) > $(TARGET).dis
	$(READELF) -a $(OUTPUT) > $(TARGET).symbol

clean:
	$(RM) *.o $(OUTPUT) $(TARGET).dis $(TARGET).symbol [id]mem.bin memory.bin

$ make test
make[1]: Entering directory '/home/peishan/srv32/sw'
make -C common
make[2]: Entering directory '/home/peishan/srv32/sw/common'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/peishan/srv32/sw/common'
make[2]: Entering directory '/home/peishan/srv32/sw/test'
riscv-none-embed-gcc -O1 -Wall -march=rv32im -mabi=ilp32 -nostartfiles -nostdlib -L../common -o test.elf test.s -lc -lm -lgcc -lsys -T ../common/default.ld -g
riscv-none-embed-objcopy -j .text -O binary test.elf imem.bin
riscv-none-embed-objcopy -j .data -O binary test.elf dmem.bin
riscv-none-embed-objcopy -O binary test.elf memory.bin
riscv-none-embed-objdump -d test.elf > test.dis
riscv-none-embed-readelf -a test.elf > test.symbol
make[2]: Leaving directory '/home/peishan/srv32/sw/test'
make[1]: Leaving directory '/home/peishan/srv32/sw'
make[1]: Entering directory '/home/peishan/srv32/sim'

Excuting 32 instructions, 46 cycles, 1.437 CPI
Program terminate
- ../rtl/../testbench/testbench.v:418: Verilog $finish

Simulation statistics
=====================
Simulation time  : 0.069 s
Simulation cycles: 57
Simulation speed : 0.000826087 MHz

make[1]: Leaving directory '/home/peishan/srv32/sim'
make[1]: Entering directory '/home/peishan/srv32/tools'
./rvsim --memsize 128 -l trace.log ../sw/test/test.elf

Excuting 32 instructions, 46 cycles, 1.438 CPI
Program terminate

Simulation statistics
=====================
Simulation time  : 0.000 s
Simulation cycles: 46
Simulation speed : 0.630 MHz

make[1]: Leaving directory '/home/peishan/srv32/tools'
Compare the trace between RTL and ISS simulator
=== Simulation passed ===

Open wave.fst in GTKWave and take the screenshot below as an example. We can see some bubbles in the pipeline.

Branch Penalty

The report CA2020-Team Prject-Analyze and improve srv32 describes how the branch penalty is implemented in srv32.
Branch penalty is the number of instructions killed after a branch instruction if the branch is taken. The branch result is resolved by the ALU at the end of the EX stage, so the instruction fetched in IF/ID may need to be killed if the branch is taken. In this processor, however, the address of the next instruction (next PC) must be fed into I-MEM a cycle ahead. Thus, the branch penalty for srv32 is 2. To clarify, by the time the next PC is resolved, one instruction has been fetched into the pipeline and another PC has been calculated, because the address must be computed one cycle ahead. The number of instructions that must be killed (i.e., set to NOP) is therefore 2 after a branch instruction, if the branch is actually taken.
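
As a rough illustration of what a penalty of 2 means for execution time, the sketch below estimates the cycle count and CPI from the number of taken branches and jumps. It is a simplified model that ignores pipeline fill and any other stall source, and the workload numbers are hypothetical rather than measured on srv32.

```python
BRANCH_PENALTY = 2  # instructions killed after a taken branch, as stated above


def estimated_cycles(instructions, taken_transfers, penalty=BRANCH_PENALTY):
    """One cycle per instruction plus the slots lost to flushes."""
    return instructions + penalty * taken_transfers


def estimated_cpi(instructions, taken_transfers, penalty=BRANCH_PENALTY):
    """Cycles per instruction under the same simplified model."""
    return estimated_cycles(instructions, taken_transfers, penalty) / instructions


# Hypothetical workload: 100 instructions, 10 of them taken branches or jumps.
cycles = estimated_cycles(100, 10)
cpi = estimated_cpi(100, 10)
```

With these numbers the model gives 120 cycles, so the CPI rises from 1.0 to 1.2 purely because of the flushed slots.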

Pipelining Conditional Branches

If the branch is taken, the following actions occur.
First, the two following instructions are killed.
Second, the instruction in the decode stage is not valid
⇒ the stall signal is not valid

We can see how this actually works in srv32.
Take the waveform below as an example, and compare it with the assembly code from test.dis and with trace.log.

  20 00000048 01248863
  21 0000004c 00148493 x09 (s1) <= 0x00000001
  22 00000050 00198993 x19 (s3) <= 0x00000001
  23 00000054 ff5ff06f x00 (zero) <= 0x00000058
  26 00000048 01248863
  27 0000004c 00148493 x09 (s1) <= 0x00000002
  28 00000050 00198993 x19 (s3) <= 0x00000002
  29 00000054 ff5ff06f x00 (zero) <= 0x00000058

00000048 <loop1>:
  48: 01248863           beq s1,s2,58 <end1>
  4c: 00148493           addi s1,s1,1
  50: 00198993           addi s3,s3,1
  54: ff5ff06f           j 48 <loop1>
  
00000058 <end1>:
  58: 00000513           li a0,0
  5c: 00008067           ret

In the picture above, the red circles mark where the flush takes effect in the EX and WB stages. Two instructions are killed after the branch is taken.
If we add one instruction to the loop, the waveform still shows a flush lasting two cycles.

Pipelining Jumps

After reading lecture09_RISCV_Impl_pipeline, we find that a jump instruction kills the instructions that follow it. Here we discuss how jumps work in srv32.

The red circles mark NOPs. Here the instruction j 48 <loop1> is FF5FF06F, li a0, 0 is 00000513, and ret is 00008067. In this section, li a0, 0 and ret act as NOPs because they are flushed. We can also see that the flush is active during the red circles. To make this easier to investigate, we arranged the pipeline into the tables below.

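| clock | clk1 | clk2 | clk3 | clk4 | clk5 | clk6 | clk7 | clk8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IF/ID | `j` | `li` | `ret` | `beq` | `addi s1` | `addi s3` | | |
| EX | | `j` | `li` | `ret` | `beq` | `addi s1` | `addi s3` | |
| WB | | | `j` | `li` | `ret` | `beq` | `addi s1` | `addi s3` |

| clock | clk1 | clk2 | clk3 | clk4 | clk5 | clk6 | clk7 | clk8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `j 48<loop1>` | IF/ID | EX | WB | | | | | |
| `beq s1, s2, end1` | | NOP | NOP | IF/ID | EX | WB | | |
| `addi s1, s1, 1` | | | NOP | NOP | IF/ID | EX | WB | |
| `addi s3, s3, 1` | | | | NOP | NOP | IF/ID | EX | WB |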
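
The stage-by-clock layout can be reproduced with a small script. This is a minimal sketch that assumes an ideal three-stage pipeline in which every fetched instruction advances one stage per clock; the function name and the fetch list are chosen for illustration, and the flushed li and ret are simply carried through as placeholders for NOPs.

```python
STAGES = ["IF/ID", "EX", "WB"]


def stage_table(fetch_order):
    """Return {stage: [instruction name or '' for each clock]}.

    Instruction i enters IF/ID at clock i, EX at i + 1 and WB at i + 2.
    """
    clocks = len(fetch_order) + len(STAGES) - 1
    table = {stage: [""] * clocks for stage in STAGES}
    for i, name in enumerate(fetch_order):
        for offset, stage in enumerate(STAGES):
            table[stage][i + offset] = name
    return table


# Fetch order around the jump: li and ret are fetched but then flushed.
fetched = ["j", "li", "ret", "beq", "addi s1", "addi s3"]
table = stage_table(fetched)
for stage in STAGES:
    print(stage.ljust(6), " | ".join(cell.ljust(7) for cell in table[stage]))
```

The two slots between j and beq in every row are the two flushed instructions, which is the branch penalty of 2 seen from the pipeline's point of view.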

Proposal of Improving Branch Prediction in srv32

To improve the branch predictor, we first have to understand how branch prediction works in srv32. The sections above discussed this, along with the different kinds of branch prediction.
srv32 uses static branch prediction and always predicts that the branch is not taken.
If the branch is taken, srv32 flushes the two instructions that follow it.

To find a way to improve branch prediction, we read several research papers on RISC-V. We found a related article, Design and Development of an Efficient Branch Predictor for an In-order RISC-V Processor.

This article describes an implementation that improves branch prediction in RISC-V. It uses a Branch Target Buffer (BTB) and a Pattern History Table (PHT) to build a dynamic branch predictor.
Here is an introduction to their design. We think it is closest to a two-level predictor.
A two-level predictor separates the branch history into a branch history record and a pattern history table. The pattern history table records how often each branch outcome occurs. The content of the branch history record is used to index the pattern history table. This design might be good material for implementing a branch predictor.
The branch predictor consists of a PHT and a BTB, which are commonly used data structures. Both are indexed by the branch address. The PHT predicts whether the branch is taken or not. When the branch is predicted taken, the next instruction address is read from the BTB. Otherwise, the next instruction address is the current branch address plus the instruction size.
The PHT uses a 2-bit saturating counter state machine, which increments when the prediction is correct and decrements when the prediction is wrong. The PHT uses a valid bit to mark the end of the training period. The valid bit is initially set to zero. When the branch is encountered for the first time, the valid bit is set to one. The target address of the current branch instruction is stored in the BTB. The BTB is updated by the execution unit of the pipelined processor.
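
A PHT and BTB pair of this kind can be sketched as follows. This is a Python model under several assumptions, not the paper's design: the class name, table size, and indexing are hypothetical, the tables carry no tags, and the counter uses the conventional textbook update (move toward taken on a taken branch, toward not taken otherwise) instead of the correct/wrong update described above.

```python
class BranchPredictor:
    """PHT of 2-bit counters plus a BTB, both indexed by branch address."""

    def __init__(self, index_bits=4):    # table size is a hypothetical choice
        size = 1 << index_bits
        self.mask = size - 1
        self.pht = [1] * size        # 0,1: predict not taken; 2,3: predict taken
        self.valid = [False] * size  # set once the branch has been seen
        self.btb = [0] * size        # last resolved target address

    def _index(self, pc):
        return (pc >> 2) & self.mask

    def predict(self, pc, insn_size=4):
        """Return the predicted next instruction address."""
        i = self._index(pc)
        if self.valid[i] and self.pht[i] >= 2:
            return self.btb[i]
        return pc + insn_size

    def update(self, pc, taken, target):
        """Called by the execute stage once the branch is resolved."""
        i = self._index(pc)
        self.valid[i] = True
        if taken:
            self.pht[i] = min(self.pht[i] + 1, 3)
            self.btb[i] = target
        else:
            self.pht[i] = max(self.pht[i] - 1, 0)


bp = BranchPredictor()
before = bp.predict(0x54)      # untrained entry: fall through to 0x58
bp.update(0x54, True, 0x48)    # j 48 <loop1> resolved as taken
after = bp.predict(0x54)       # counter now at 2: use the BTB target
```

For the loop in test.s, one resolved iteration is enough in this model for the jump at 0x54 to be predicted back to 0x48, which would remove the two flushed slots on later iterations.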
The picture below shows the block diagram of the branch predictor.

The picture below shows the architecture of the branch predictor.

Another study, Dynamic Branch Prediction, Modeller for RISC Architecture, evaluates the performance of dynamic branch prediction implementations. Its conclusions are summarized here.

In the paper, the authors compare three types of branch predictor: a one-bit branch predictor, an always-taken branch predictor, and a saturating counter. The saturating counter's prediction depends on how strong its current state is, as the picture below shows:

The paper uses four types of pattern to compare the three predictors.
A hit indicates a correct branch prediction and a miss indicates a wrong prediction.

According to the paper's performance calculation, the saturating counter achieves the best prediction accuracy.
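
A comparison of this kind can be imitated with a small experiment. The sketch below is not the paper's setup: the outcome pattern is made up for illustration, accuracy is assumed to be hits divided by total predictions, and the class names and starting states are hypothetical.

```python
class AlwaysTaken:
    def predict(self):
        return True

    def update(self, taken):
        pass  # static predictor: nothing to learn


class OneBit:
    def __init__(self):
        self.last = False

    def predict(self):
        return self.last

    def update(self, taken):
        self.last = taken


class SaturatingCounter:
    def __init__(self):
        self.counter = 1  # start in the weakly not-taken state

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        self.counter = min(self.counter + 1, 3) if taken else max(self.counter - 1, 0)


def run(predictor, outcomes):
    """Return (hits, misses) for a predictor over a list of branch outcomes."""
    hits = 0
    for taken in outcomes:
        hits += predictor.predict() == taken
        predictor.update(taken)
    return hits, len(outcomes) - hits


# Made-up pattern: a mostly taken phase followed by a mostly not-taken phase.
pattern = [True, True, True, False] * 2 + [False, False, False, True] * 2
results = {cls.__name__: run(cls(), pattern)
           for cls in (AlwaysTaken, OneBit, SaturatingCounter)}
accuracy = {name: hits / (hits + misses) for name, (hits, misses) in results.items()}
```

On this particular pattern the saturating counter scores 10 hits out of 16, the one-bit predictor 9, and the always-taken predictor 8. Other patterns can rank the predictors differently, so this only illustrates the kind of measurement involved.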

In conclusion, a dynamic predictor would likely perform better than the original processor, because dynamic branch prediction generally does better than static branch prediction. srv32 always predicts the same direction, which may cause problems when either forward or backward branches are taken much more often than the other. A dynamic branch predictor can instead adapt its decision to how each branch actually behaves. The designs above might be good material for improving the branch predictor.
