Assignment 1: Reducing memory usage with bfloat16 and bfloat16 multiplication
contributed by <bclegend>
Lab1: RV32I Simulator
Reducing memory usage with bfloat16 and bfloat16 multiplication
The problem comes from Problem B of the 2023 Computer Architecture Quiz 1, which converts single-precision floating-point values to the corresponding bfloat16 floating-point format.
Reducing memory usage with bfloat16 works as follows. When a bfloat16 value is loaded from memory it normally occupies one 32-bit register, so the lower 16 bits are wasted as zeros, and loading two bfloat16 values takes two registers. If we pack two bfloat16 values into one 32-bit word in memory, we can fetch both with a single load into one register.
This packing needs an encoder and a decoder: the encoder combines two bfloat16 values into one 32-bit register, and the decoder separates them back into two registers.
The bfloat16 multiplication function multiplies two bfloat16 numbers and produces a bfloat16 result.
Solution
My idea starts from the fact that a bfloat16 stored in a 32-bit word keeps its value in the upper 16 bits, while the lower 16 bits are 0.
To merge two bfloat16 values into one 32-bit register, the encoder keeps the first bfloat16 in place and shifts the second bfloat16 right by 16 bits so that it fills the zeroed lower half of the first; the two are then ORed together to get the merged value.
The decoder uses two masks, 0xFFFF0000 and 0x0000FFFF, to extract the two bfloat16 values, and shifts the second one left by 16 bits to put it back in the right position.
To multiply two floating-point numbers, we first XOR the two sign bits to get the sign of the result, and exponent1 + exponent2 - 127 gives the exponent of the result.
Finally we handle the fraction. We take the fractions of both bfloat16 values and add the leading 1 to each. If the first (most significant) bit of the first significand is 1, we add the second significand to a register that starts at zero. We then shift the second significand right by 1 bit, and if the second bit of the first significand is 1, we add the shifted second significand to the same register. These steps repeat until all 8 bits (1 integer bit and 7 fraction bits) have been processed. If the resulting significand overflows, we shift it right by one bit and add 1 to the exponent.
Implementation
C Code
This is my code on Github
RISC-V Assembly Code
This is my code on Github
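.data
# Test inputs: pairs of single-precision values (see Results)
test0: .word 0x4141f9a7,0x423645a2
test1: .word 0x3fa66666,0x42c63333
test2: .word 0x43e43a5e,0x42b1999a
# mask0: exponent mask, fraction mask, hidden bit, rounding bit,
#        7-bit fraction mask, exponent bias (127<<23), top bit
mask0: .word 0x7F800000,0x007FFFFF,0x800000,0x8000,0x7f,0x3F800000,0x80000000
mask1: .word 0x8000
# mask2: upper and lower 16-bit halves of a packed word
mask2: .word 0xFFFF0000,0x0000FFFF
str: .string "\n"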
.text
main:
li a7,1
la a2,test0
lw a6,0(a2)
jal ra,f32_b16_p1
add a5,a6,x0
lw a6,4(a2)
jal ra,f32_b16_p1
add a4,a6,x0
jal ra,encoder
add s9,s3,x0
jal ra,decoder
jal ra,Multi_bfloat
li a7,2
add a0,x0,s5
ecall
jal ra,cl
li a7,2
add a0,x0,s6
ecall
jal ra,cl
li a7,2
add a0,x0,s3
ecall
j exit
f32_b16_p1:
sw a6,0(sp)
add t0,a6,x0
la a3,mask0
lw t6,0(a3)
and t1,t0,t6
lw t6,4(a3)
and t2,t0,t6
lw t6,0(a3)
beq t1,t6,inf_or_zero
or t3,t1,t2
beq t3,x0,inf_or_zero
lw t6,8(a3)
or t2,t2,t6
lw t6,12(a3)
add t2,t2,t6
srli t5,t2,24
beq t5,x0,no_overflow
lw t6,8(a3)
add t1,t1,t6
srli t2,t2,17
lw t6,16(a3)
and t2,t2,t6
slli t2,t2,16
j f32_b16_p2
no_overflow:
srli t2,t2,16
lw t6,16(a3)
and t2,t2,t6
slli t2,t2,16
f32_b16_p2:
srli t0,t0,31
slli t0,t0,31
or t0,t0,t1
or t0,t0,t2
add a6,t0,x0
ret
inf_or_zero:
srli a6,a6,16
slli a6,a6,16
ret
encoder:
add t0,a5,x0
add t1,a4,x0
srli t1,t1,16
or t0,t0,t1
add s3,t0,x0
ret
decoder:
add t0,s9,x0
la a1,mask2
lw s2,0(a1)
and t1,t0,s2
lw s2,4(a1)
and t2,t0,s2
slli t2,t2,16
add s6,t1,x0
add s5,t2,x0
ret
cl:
li a7,4
la a0,str
ecall
ret
Multi_bfloat:
add t0,s5,x0
add t1,s6,x0
lw t6,0(a3)
and t3,t0,t6
and t2,t1,t6
add t3,t3,t2
lw t6,20(a3)
sub t3,t3,t6
xor t2,t0,t1
srli t2,t2,31
slli t2,t2,31
or t3,t3,t2
slli t0,t0,9
srli t0,t0,9
or t0,t3,t0
lw t6,16(a3)
slli t6,t6,16
and t2,t0,t6
and t3,t1,t6
slli t2,t2,9
srli t2,t2,1
lw t6,24(a3)
or t2,t2,t6
srli t2,t2,1
slli t3,t3,8
or t3,t3,t6
srli t3,t3,1
add s11,x0,x0
addi s10,x0,8
add t1,x0,x0
lw t6,24(a3)
loop:
addi s11,s11,1
srli t6,t6,1
and t4,t2,t6
beq t4,x0,not_add
add t1,t1,t3
not_add:
srli t3,t3,1
bne s11,s10,loop
lw t6,24(a3)
and t4,t1,t6
beq t4,x0,not_overflow
slli t1,t1,1
lw t6,8(a3)
add t0,t0,t6
j Mult_end
not_overflow:
slli t1,t1,2
Mult_end:
srli t1,t1,24
addi t1,t1,1
srli t1,t1,1
slli t1,t1,16
srli t0,t0,23
slli t0,t0,23
or t0,t0,t1
add s3,t0,x0
ret
exit:
li a7,10
ecall
Results
test0
- floating 1 : 12.123 (Hexadecimal : 0x4141f9a7)
- floating 2 : 45.568 (Hexadecimal : 0x423645a2)
- Result:

test1
- floating 1 : 1.2999999 (Hexadecimal : 0x3fa66666)
- floating 2 : 99.1 (Hexadecimal : 0x42c63333)
- Result:

test2
- floating 1 : 456.456 (Hexadecimal : 0x43e43a5e)
- floating 2 : 88.8 (Hexadecimal : 0x42b1999a)
- Result:

Analysis
We use Ripes to simulate the RISC-V processor.

Single-Cycle RV32I Datapath

five-stage execution pipeline simulator


Pipeline instructions with my code
RISC-V Assembly
Disassembled
Instruction Fetch (IF)

The PC input address is 0x00000004, and the instruction memory outputs the word 0x00100893, which is the machine encoding of the first instruction (li a7,1, assembled as addi x17,x0,1).
The I-format layout of this addi instruction is as follows.
imm[11:0] | rs1 | funct3 | rd | opcode
000000000001 | 00000 | 000 | 10001 | 0010011
Instruction Decode/Register Read (ID)

In this stage the instruction is decoded:
R1 idx = 0x00, R2 idx = 0x01, Reg out 1 = 0x00000000, and Reg out 2 = 0x00000000.
The immediate is sent to the next stage.
ALU Execute (EX)

The four multiplexers (MUX) select the operands for the add operation: op1 = 0x00000000 and op2 = 0x00000001, so the ALU result is Res = 0x00000001.
Memory Access (MEM)

The red indicator in the memory stage shows that this instruction does not store data to memory.
Write Back (WB)

In this stage 0x00000001 is written to the destination register.
Pipeline Hazard
A hazard is a situation that prevents the next instruction from starting in the next clock cycle
(1) Structural hazard
- Two or more instructions in pipeline compete for access to a single physical resource.
- A required resource is busy
(2) Data hazard
- Data dependency between instructions
- Need to wait for previous instruction to complete its data write
(3) Control hazard
- Flow of execution depends on previous instruction
Structural Hazard
- Problem
- Two or more instructions in the pipeline compete for access to a single physical resource.
- As the picture below shows, the RegFile is used by both ID and WB in the same clock cycle.


- Each instruction can only
read : two operands in the decode stage
write : one value in the write-back stage
- Avoid structural hazards by having separate ports
- Solution
- Build RegFile with independent read and write ports
- Conclusion
- Read and Write to registers during same clock cycle is okay
Data Hazard
- Problem
- Conflict for use of a resource
- In a RISC-V pipeline with a single memory unit
- Without separate memory units, instruction fetch would have to stall for that cycle
-> all other operations in the pipeline would have to wait
- The memory unit is used by two stages at the same time

- Stalls and performance
- Stalls reduce performance
- Compiler can arrange code to avoid hazards and stalls
R-type instructions
- Solution
- Forwarding
- Forward the result as soon as it is available, even though it is not stored in the RegFile yet

Loads
- Load delay slot
- If the instruction after a load uses the result of the load, the hardware will stall for one cycle
- Solution
- Code Scheduling to avoid Stalls

Control Hazard
- Problem
- Branch determines flow control

- Moving the branch comparator to the ID stage would add redundant hardware and introduce new problems
- Kill instructions after Branch if Taken
- the instructions fetched after the branch, before its target is known, will be killed and wasted

- Solution for RISC-V : Branch Prediction
- guess the outcome of the branch

Find Hazard in my code
As the picture shows, branch and jump instructions may cause control hazards, so to reduce hazards we should reduce the number of branches and jumps we use.

Doing so gives better performance in this code.
- Optimized Execution info from my code

Appendix 1 : Pseudo Instruction
Reference