Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < Hotmercury >
we can find instruction reference
code on github
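Find the zero bit closest to the LSB, then clear that bit and every bit to its left, so that only the run of ones below it survives.
We can use this mask to compute the carry bit of an increment.
example
mask = 1010 0111
1010 0111 -> 0000 0111
Applying ((mask << 1) | 1) xor mask to that result then isolates the carry bit, which sits at the position of the lowest zero:
0000 1111 xor 0000 0111 -> 0000 1000
The immediate of ori is only 12 bits wide, so a larger constant such as 0xFFFF must first be loaded into a register with li.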
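The masking steps can be sketched in C as follows. This is a minimal sketch, assuming a 64-bit input as in the assembly; the helper name carry_bit is made up here for illustration.

```c
#include <stdint.h>

/* Keep only the run of ones below the lowest zero bit of x.
 * After the step with shift s, a bit survives only if the s bits
 * below it were set as well, so six doubling steps cover 64 bits. */
static uint64_t mask_lowest_zero(uint64_t x)
{
    uint64_t mask = x;
    mask &= (mask << 1) | 0x1ULL;
    mask &= (mask << 2) | 0x3ULL;
    mask &= (mask << 4) | 0xFULL;
    mask &= (mask << 8) | 0xFFULL;
    mask &= (mask << 16) | 0xFFFFULL;
    mask &= (mask << 32) | 0xFFFFFFFFULL;
    return mask;
}

/* Hypothetical helper: isolate the bit just above the run of ones,
 * i.e. the position of the lowest zero. */
static uint64_t carry_bit(uint64_t mask)
{
    return mask ^ ((mask << 1) | 1ULL);
}
```

With the example above, mask_lowest_zero(0xA7) yields 0x07 and carry_bit(0x07) yields 0x08.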
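.data
test_1: .dword 0x10FFFFFFFFFF3333
.text
main:
la t0, test_1 # pseudo-instruction: auipc + addi
lw a0, 0(t0) # a0 = low 32 bits
lw a1, 4(t0) # a1 = high 32 bits
jal mask_lowest_zero
li a7, 10 # exit
ecall
mask_lowest_zero:
# mask &= (mask << 1) | 0x1
slli t1, a1, 1
srli t0, a0, 31
or t1, t1, t0
slli t0, a0, 1
ori t0, t0, 1
and a0, a0, t0
and a1, a1, t1
# mask &= (mask << 2) | 0x3
slli t1, a1, 2
srli t0, a0, 30
or t1, t1, t0
slli t0, a0, 2
ori t0, t0, 3
and a0, a0, t0
and a1, a1, t1
# mask &= (mask << 4) | 0xF
slli t1, a1, 4
srli t0, a0, 28
or t1, t1, t0
slli t0, a0, 4
ori t0, t0, 0xF
and a0, a0, t0
and a1, a1, t1
# mask &= (mask << 8) | 0xFF
slli t1, a1, 8
srli t0, a0, 24
or t1, t1, t0
slli t0, a0, 8
ori t0, t0, 0xFF
and a0, a0, t0
and a1, a1, t1
# mask &= (mask << 16) | 0xFFFF (0xFFFF does not fit in ori's 12-bit immediate)
li t3 , 0xFFFF # lui + addi
slli t1, a1, 16
srli t0, a0, 16
or t1, t1, t0
slli t0, a0, 16
or t0, t0, t3
and a0, a0, t0
and a1, a1, t1
# mask &= (mask << 32) | 0xFFFFFFFF: the low word is unchanged, the high word is ANDed with the low word
and a1, a1, a0
ret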
x = x + 1
flow
- test for overflow (x is all ones)
- find the carry bit with mask_lowest_zero
- return the carry bit combined with the bits of x above it: (x & ~mask) | carry
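The flow above can be sketched in C as an increment that never uses the + operator on x. This is a sketch under assumptions: the name inc matches the C snippet quoted later in this note, but its real signature is not shown there, and mask_lowest_zero is written as a loop here only to keep the example short.

```c
#include <stdint.h>

/* Loop form of the mask: keep the run of ones below the lowest zero. */
static uint64_t mask_lowest_zero(uint64_t x)
{
    uint64_t mask = x;
    for (unsigned s = 1; s < 64; s <<= 1)
        mask &= (mask << s) | ((1ULL << s) - 1);
    return mask;
}

/* x + 1 built from bit operations only (assumed signature). */
static uint64_t inc(uint64_t x)
{
    if (~x == 0)                 /* all ones: the increment wraps to 0 */
        return 0;
    uint64_t mask = mask_lowest_zero(x);
    uint64_t carry = mask ^ ((mask << 1) | 1ULL);
    /* clear the trailing ones, then set the lowest zero */
    return (x & ~mask) | carry;
}
```

For x = 1010 0111 the mask is 0000 0111, the carry is 0000 1000, and the result is 1010 1000.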
find the value of nth bit
multiplicand * multiplier = result
flow
- get bit i of the multiplier b
- if that bit is 1, add a << i to the result
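The two steps above are the classic shift-and-add multiplication. A minimal C sketch, assuming a getbit helper that returns bit n of a 32-bit value (its real signature is not shown in this note) and a native 64-bit accumulator, which the RV32 code has to emulate with two registers:

```c
#include <stdint.h>

/* Assumed helper: bit n of value, with n below 32. */
static uint32_t getbit(uint32_t value, unsigned n)
{
    return (value >> n) & 1u;
}

/* For every set bit i of b, add a << i to the 64-bit result. */
static uint64_t imul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    uint64_t a64 = a;            /* widen first so a64 << i keeps the high bits */
    for (unsigned i = 0; i < 32; i++) {
        if (getbit(b, i))
            r += a64 << i;
    }
    return r;
}
```

The assembly version keeps r in the pair s3:s2 and splits a64 << i into a low word (a << i) and a high word (a >> (32 - i)).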
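imul32:
addi sp, sp, -28
sw ra,0(sp)
sw s0,4(sp)
sw s1,8(sp)
sw s2,12(sp)
sw s3,16(sp)
sw s4,20(sp)
sw s5,24(sp)
mv s0, a0 # s0 = multiplicand a
mv s1, a1 # s1 = multiplier b
li s2, 0 # s2 = result, low 32 bits
li s3, 0 # s3 = result, high 32 bits
li s4, 0 # s4 = bit index i
li s5, 32 # s5 = loop bound
imul32_loop:
beq s4, s5, imul32_end
mv a0, s1 # read b from s1: a1 is caller-saved and may be clobbered by getbit
mv a2, s4
jal getbit # a0 = bit i of b
beq a0, zero, imul_skip
sll t0, s0, s4 # t0 = low word of (a << i)
li t2, 31
sub t2, t2, s4 # t2 = 31 - i
srli t1, s0, 1
srl t1, t1, t2 # t1 = a >> (32 - i), which is 0 when i = 0
add s2, s2, t0 # add the low words
sltu t5, s2, t0 # t5 = 1 if the low-word add wrapped (carry out)
add s3, s3, t1 # add the high words
add s3, s3, t5 # plus the carry
imul_skip:
addi s4, s4, 1
j imul32_loop
imul32_end:
mv a0, s2
mv a1, s3
lw ra,0(sp)
lw s0,4(sp)
lw s1,8(sp)
lw s2,12(sp)
lw s3,16(sp)
lw s4,20(sp)
lw s5,24(sp)
addi sp,sp,28 # must match the 28 bytes reserved on entry
ret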
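Merge the result and b into one 64-bit register pair: result (upper 32 bits), b (lower 32 bits).
That way, implementing the 64-bit product on RV32 no longer needs a separate overflow (carry) check.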
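One possible reading of this note is the textbook multiplier that stores b in the low half of the product register and shifts the whole pair right once per step. The sketch below is an assumption about what was meant, not the author's code. The name imul32_merged is made up, and the version without a carry check is only valid when a is below 2^31, which holds for the 24-bit mantissas used later.

```c
#include <stdint.h>

/* Shift-right multiplier: hi accumulates the product, lo starts as b.
 * Assumes a < 2^31, so hi + a can never overflow 32 bits. */
static uint64_t imul32_merged(uint32_t a, uint32_t b)
{
    uint32_t hi = 0;
    uint32_t lo = b;
    for (int i = 0; i < 32; i++) {
        if (lo & 1u)             /* current multiplier bit */
            hi += a;
        /* shift the 64-bit pair hi:lo right by one */
        lo = (lo >> 1) | (hi << 31);
        hi >>= 1;
    }
    return ((uint64_t)hi << 32) | lo;
}
```

Each iteration tests bit 0 directly, so no getbit call and no variable shift amount are needed.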
float32 multiply
flow
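- interpret both inputs as IEEE 754 bit patterns
- decide the sign bit
- extract the 23-bit mantissa and restore the implicit leading 1 (23 + 1 = 24 bits)
- extract the exponent
- use imul32 to multiply the mantissas of the two numbers
- a 24-bit * 24-bit product has at most 48 bits; after shifting right by 23, up to 25 bits remain, but we only preserve 24, so we need to check whether one more shift is required
int mshift = getbit(mrtmp, 24);
This reads bit 24, the 25th bit counting from the LSB.
int32_t er = mshift ? inc(ertmp) : ertmp;
If the extra shift happened, we add another 1 to the exponent.
- combine sign, exponent, and mantissa into the new IEEE 754 word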
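The flow above can be sketched in C on raw bit patterns. This is a simplified sketch under assumptions: the function name fmul32_bits is made up, a native 64-bit multiply stands in for imul32, the result is truncated rather than rounded, and zero, subnormal, infinity, and NaN inputs are not handled. The names mrtmp, mshift, ertmp, and er follow the snippets quoted above.

```c
#include <stdint.h>

/* Multiply two float32 bit patterns (normal numbers only, truncating). */
static uint32_t fmul32_bits(uint32_t a, uint32_t b)
{
    uint32_t sign = (a ^ b) & 0x80000000u;

    /* 23-bit fraction plus the implicit leading 1 */
    uint32_t ma = (a & 0x7FFFFFu) | 0x800000u;
    uint32_t mb = (b & 0x7FFFFFu) | 0x800000u;

    int32_t ea = (int32_t)((a >> 23) & 0xFFu);
    int32_t eb = (int32_t)((b >> 23) & 0xFFu);

    uint64_t product = (uint64_t)ma * mb;        /* stands in for imul32 */
    uint32_t mrtmp = (uint32_t)(product >> 23);  /* 24 or 25 significant bits */
    uint32_t mshift = (mrtmp >> 24) & 1u;        /* getbit(mrtmp, 24) */
    uint32_t mr = mrtmp >> mshift;

    int32_t ertmp = ea + eb - 127;               /* remove the doubled bias */
    int32_t er = mshift ? ertmp + 1 : ertmp;

    return sign | (((uint32_t)er & 0xFFu) << 23) | (mr & 0x7FFFFFu);
}
```

For example, 1.5f (0x3FC00000) times 1.5f produces mrtmp = 0x1200000, so mshift is 1 and the result is 0x40100000, which is 2.25f.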
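fmul32:
addi sp, sp, -24
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
# sign = sign(a) xor sign(b), kept in bit 0 of s0
srli s0, a0, 31
srli s1, a1, 31
xor s0, s0, s1
# mantissas with the implicit leading 1 restored (24 bits each)
li t0, 0x7FFFFF
li t1, 0x800000
and s1, a0, t0
or s1, s1, t1
and s2, a1, t0
or s2, s2, t1
# biased exponents
srli s3, a0, 23
andi s3, s3, 0xFF
srli s4, a1, 23
andi s4, s4, 0xFF
# 48-bit mantissa product, returned in a1:a0
mv a0, s1
mv a1, s2
jal imul32
mv s1, a0
mv s2, a1
# mrtmp = low 32 bits of (product >> 23)
srli s1, s1, 23
slli s2, s2, 9
or s1, s1, s2
# mshift = getbit(mrtmp, 24)
mv a0, s1
li a2, 24
jal getbit
srl s1, s1, a0 # normalize: shift right by mshift
# exponent = ea + eb - 127 + mshift
add s3, s3, s4
addi s3, s3, -127
add s3, s3, a0
# pack sign | exponent | mantissa
slli s0, s0, 31 # move the sign from bit 0 back to bit 31
andi s3, s3, 0xFF
slli s3, s3, 23
li t0, 0x7FFFFF
and s1, s1, t0
or s0, s0, s3
or s0, s0, s1
mv a0, s0
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
addi sp, sp, 24
ret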
Analyze
- What is CPI?
Cycles per instruction: CPI = total cycles / instructions executed. A lower CPI is better.
- What is IPC?
Instructions per cycle: IPC = instructions executed / total cycles = 1 / CPI. A higher IPC is better.
When calculating floating-point multiplication, we need to compute the sign bit, exponent, and mantissa. The mantissa step multiplies the 23-bit mantissa fields of the two operands (each with its implicit leading 1 restored), taken from the IEEE 754 layout |1|8|23| (sign, exponent, mantissa). Observing the actual operation in a simulator, we found that the performance bottleneck lies mainly in this mantissa multiplication. The two diagrams below show the original and improved data, a reduction of nearly 50% in cycles.


difference
original code
- Calling the getbit function with jal getbit adds a significant cycle overhead. Observing the pipeline, we can also see that each jal costs two additional NOPs, because the two instructions fetched right behind it are discarded.


The wrongly fetched instructions are then cleared (flushed) from the pipeline:

- When getbit returns 1, we need to perform the operation r += a64 << i;. We first shift a64 left by i and then add it to r. However, this operation introduces a potential issue: register overflow.
In our simulation we use the RV32 architecture, whose registers are 32 bits wide. Since a 32-bit multiplication can produce a 64-bit value, we need two registers to hold the result, and this is where the overflow concern comes from: adding the lower 32 bits may produce a carry, which must then be added into the register holding the upper 32 bits. This is why the RISC-V code above contains explicit carry handling.
Without it, the upper word of the product would silently be off by one whenever the lower word wraps around.
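example
Take a hypothetical 4-bit processor ("RV4").
When adding 1001(a1) 1001(a0) + 0100(t1) 1000(t0):
the low halves give a0 + t0 = 1001 + 1000 = 1 0001, so the low result is 0001 with a carry of 1;
the high halves then need sum(a1, t1, carry) = 1001 + 0100 + 1 = 1110.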
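The same idea on 32-bit halves can be sketched in C. The function name add64_split is made up for illustration, and the parameter names simply mirror the registers in the example. The carry test (sum < addend) is what a single sltu instruction computes on RV32.

```c
#include <stdint.h>

/* Add the 64-bit values a1:a0 and t1:t0 using only 32-bit arithmetic.
 * The halves are packed into one uint64_t only to return the result. */
static uint64_t add64_split(uint32_t a1, uint32_t a0, uint32_t t1, uint32_t t0)
{
    uint32_t lo = a0 + t0;          /* wraps modulo 2^32 */
    uint32_t carry = lo < t0;       /* 1 if the low add wrapped */
    uint32_t hi = a1 + t1 + carry;
    return ((uint64_t)hi << 32) | lo;
}
```

Unsigned wraparound is well defined in C, so lo < t0 holds exactly when the low-word addition produced a carry.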

- reordering the instructions
For example, in the following code, instructions have been deliberately reordered. This can be an effective way to avoid "Read after Write" hazards. However, through observation, we can see that this doesn't have an impact because data forwarding can resolve this issue. You can notice that the gray selectors differ, indicating that the value in t3 comes from the calculation in the previous pipeline stage.
order


reorder


improve
Hazard
Examples of the various hazards
- Structural hazards
- e.g. two instructions try to write to the register file at the same time
- Data hazards
- Control hazards
problem
Branches are taken often, and each taken branch comes with two NOPs. Can we avoid this?
