# Reciprocal integer square root in RISC-V
In Quiz1 problem A, it mentioned about getting square root from Bfloat16.
Here, I want to discuss relevent topic. To compute reciprocal square root of an integer without FPU.
A famous technique in Quake III souce code to compute inverse square root of float.
```clike
float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y; // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what the fuck?
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;
}
```
But this magical technique only effictive to floating point 32. If the environment restricts from using FPU, we can not use it.
To compute reciprocal square root of integer, we need Newton's method.
## Newton's method
Newton's method is a algorithm to compute root of function.

$f'(x)$ can be slope of $f(x)$ near $x_n$.
According to slope formula:
$f'(x) = \frac{f(x_n)-0}{x_n-x_{n+1}}$
so
$x_{n+1} = x_n-\frac{f(x_n)}{f'(x_n)}$
Then to compute reciprocal square root. We let $y=\frac{1}{\sqrt{x}}$ and let $f(y) = \frac{1}{y^2}-x$
Make a initial guess for root of $f(y)$ as $y_0$. We gets
$y_1=y_0-\frac{y_0^{-2}-x}{-2y_0^{-3}}=y_0+\frac{y_0}{2}-\frac{xy_0^3}{2}=y_0(\frac{3}{2}-\frac{xy_0^2}{2})$
We know that reciprocal square root defintely smaller than $1$, it must be represent in floating point or fixed point.
Here, we assume not using floating point and FPU, so we represent it in fixed point Q0.32
```clike
/*
* new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2)
*
* Here, invsqrt is a fixed point number (< 1.0), 32bit mantissa, aka Q0.32
*/
static void newton_step(u32 *rec_inv_sqrt, u32 x)
{
u32 invsqrt, invsqrt2;
u64 val;
invsqrt = rec_inv_sqrt;
invsqrt2 = ((u64)invsqrt * invsqrt) >> 32;
val = (3LL << 32) - ((u64)x * invsqrt2);
val >>= 2; /* avoid overflow in following multiply */
val = (val * invsqrt) >> (32 - 2 + 1);
*rec_inv_sqrt = val;
}
```
But Newton's method is not an efficient way to approximate root.
For example, $\frac{1}{\sqrt{5}}$ with starter guess 0.5 still needs 4 times approximation to reach 32 bits precision.
It's better to make a lookup table
```clike
#define REC_INV_SQRT_CACHE (16)
static const u32 inv_sqrt_cache[REC_INV_SQRT_CACHE] = {
~0, ~0, 3037000500, 2479700525,
2147483647, 1920767767, 1753413056, 1623345051,
1518500250, 1431655765, 1358187914, 1294981364,
1239850263, 1191209601, 1147878294, 1108955788
};
```
### Follow up in Quiz3
[Quiz 3 problem C](https://hackmd.io/@sysprog/arch2025-quiz3-sol?stext=27539%3A9%3A0%3A1761019229%3AjEhMzK)
Why scaling factor needed? What the benefits? How to determine lookuptable
## Transform C code to RISC-V instructions
RISC-V have 31 registers, each one is general purpose. (why not more registers? because more register, more propagation delay.)

For **argument register** holds function argument from x10(a0) to x17(a7) totally 8 registers.
x9 and x18(s2) to x27(s11) are **saved registers**, which should be maintain while jump to different function. So for example, if a function using s0, it should first store original value of s0 to memory before changing it. After operation down, then move s0 original from memory to s0.
Further in RISC-V ABI, saying **stack pointer** have to align 16 bytes. Stack pointer is accumulate from high address to low address. So push a data to stack means substract stack pointer, pop means add.
**Program counter** is a separate special internal register, not a general purpose register.


pseudo instructions are group of instruction representing abstruct idea.
* `mv rd rs1` = `addi rd rs1 0`
* `li rd immd` = `addi rd rd immd` or `lui rd immd + addi rd immd` which `lui` for upper 20 bits(clear lower 12 bits) and `addi` for lower 12 bits
* `jr re` = `jalr x0 x1 0`
* `j label` = `jal x0 label`
* `nop` = `addi x0 x0 0`
Immediate instruction only performs sign-extend 12 bits immediate. If
Formats, group same encoding layout instruction into same format:
* R-format: three registers, `rd rs1 rs2`
* I-format: 12 bits immediate, `addi` or `lw` and `jalr`
* S-format: store, `sw`
* B-format: branch, `beq` (aware of b-format encoding)
* J-format: jump, `jal`
* U-format: load upper 20 bits and clear lower 12 bits, `lui`
**auipc** is special instruction for handling super large offsets. Because B-format only handles $\pm2^{10}$ offsets, while J-format is $\pm2^{18}$. So for larger ones, we need `auipc` which `auipc rd immd => rd = (immed << 12) + PC`
Also, `jalr` is I-format, so it won't multiply immed by $2$ like B-format and J-format. However, `jalr rd rs1 immed` means it can do arithmatic beforehand to make super large offsets.
In assembly, symbol are those human readable representation, such as label or varible.
```
.data
varible:
.text
label:
```
`if`, `while`, function calls all represent jump in assembly.
Translating `bf16_add`
:::spoiler bf16_add in RISC-V instructions
```c
.data
BF16_NAN: 0x7FC0
.text
bf16_add:
srli t1 a0 15
andi t1 t1 1 # sign_a
srli t2 a1 15
andi t2 t2 1 # sign_b
srli t3 a0 7
li a3 0xFF
and t3 t3 a3 # exp_a
srli t4 a1 7
and t4 t4 a3 # exp_b
andi t5 a0 0x7F # mant_a
andi t6 a1 0x7F # mant_b
beq t3 ta3 is_inf
beq t4 a3 ret_b
not t7 t3
not t8 t5
and t7 t7 t8
bnez t7 ret_b
not t7 t4
not t8 t6
and t7 t7 t8
bnez t7 ret_a
is_inf:
beqz t5 is_nan
jal x0 ret_a
is_nan:
beq t4 a3 ret_nan
jal x0 ret_a
ret_nan:
bnez t6 ret_b
beq t1 t2 ret_b
la a0 BF16_NAN
lw a0 0(a0)
jal x0 ret_a
ret_a:
jr ra
ret_b:
addi a0 a1 0
jr ra
```
:::
`jal rgs label` jal is J-type, calculates its relative to program count. while `jalr rgs immd(rgs)` jalr is I-type, calculates based on immediate + register's value.
### Reciprocal square root in RV32I
Initial guess:
$$y_0 = \frac{1}{\sqrt{2^{\lfloor log_2x \rfloor}}}$$
Original code:
```c
static uint64_t mul32(uint32_t a, uint32_t b)
{
uint64_t r = 0;
for (int i = 0; i < 32; i++) {
if (b & (1U << i)) {
r += (uint64_t)a << i;
}
}
return r;
}
static const uint16_t rsqrt_table[32] = {
65536, 46341, 32768, 23170, 16384, /* 2^0 to 2^4 */
11585, 8192, 5793, 4096, 2896, /* 2^5 to 2^9 */
2048, 1448, 1024, 724, 512, /* 2^10 to 2^14 */
362, 256, 181, 128, 90, /* 2^15 to 2^19 */
64, 45, 32, 23, 16, /* 2^20 to 2^24 */
11, 8, 6, 4, 3, /* 2^25 to 2^29 */
2, 1 /* 2^30, 2^31 */
};
/* 65536/sqrt(x) */
uint32_t fast_rsqrt(uint32_t x)
{
if (x == 0) return 0xFFFFFFFF;
if (x == 1) return 65536;
int exp = 31 - clz(x);
uint32_t y = rsqrt_table[exp];
if (x > (1u << exp)) {
uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0;
uint32_t delta = y - y_next;
uint32_t frac = ((x - (1U << exp)) << 16) / (1U << exp);
y -= (uint32_t) (delta*frac) >> 16;
}
for (int iter = 0; iter < 2; iter++) {
uint32_t y2 = (uint32_t)(mul32(y, y) >> 16);
uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16);
y = (uint32_t) (mul32(y, (3u << 16) - xy2) >> 17);
}
return y;
}
uint32_t testcase[4] = {0x0, 0x1, 0x4, 0xFFFFFFFF};
uint32_t answer[4] = {0x0, 0x1, 0x8000, 0x1};
int main()
{
for (int i = 0; i < 4; i++) {
uint32_t y = fast_rsqrt(testcase[i]);
}
}
```
In interpolation,
```c
if (x > (1u << exp)) {
uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0;
uint32_t delta = y - y_next;
uint32_t frac = ((x - (1U << exp)) << 16) / (1U << exp);
y -= (uint32_t) (delta*frac) >> 16;
}
```
if x is larger than $2^{17}$, then `(x - (1U << exp)) << 16)` will overflow.
so avoid overflow, extends to `uint64_t`:
```c
uint32_t frac = (uint32_t) ((((uint64_t)x - (1UL << exp)) << 16) / (1UL << exp));
```
In Newton-Raphson iteration,
```c
uint32_t y2 = (uint32_t)(mul32(y, y) >> 16);
uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16);
y = (uint32_t) (mul32(y, (3u << 16) - xy2) >> 17);
```
when y is less than $2^8$(which also means x is larger than $2^{16}$), then `y2` will become zero, which makes `xy2` becomes zero, too.
Then we observes that because `y` is less than $2^{16}=65536$. `y2` will definitely less than $2^{32}$. So we can express `y2` in Q0.32
and for `xy2`, just maintain Q16.16
```c
uint32_t y2 = (uint32_t) mul32(y, y); // Q0.32
uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16); // Q16.16
uint64_t tmp = mul32(y, (3u << 16) - xy2);
y = (uint32_t)((tmp >> 17) + ((tmp >> 16) & 1)); // round to nearest
```
Testcase below:
```c
#define testcase_num 10
uint32_t testcase[testcase_num] = {
0, 1, 4, 5, 1600, 1000000, 12582912, 4294967294, 0xFFFFFFFE, 0xFFFFFFFF};
uint32_t answer[testcase_num] = {0xFFFFFFFF, 65536, 32768, 29309, 1638,
65, 18, 1, 1, 1};
int main() {
for (int i = 0; i < testcase_num; i++) {
uint32_t y = fast_rsqrt(testcase[i]);
if (y != answer[i]) {
int32_t delta = (int32_t)answer[i] - y;
double deltaf = (double)delta;
double perc = (double)answer[i];
printf("testcase %d: y=%d , correct=%d ,error rate=%lf\n", i, y, answer[i],
deltaf / perc);
} else
printf("testcase %d pass\n", i);
}
return 0;
}
```
Numerical difference is less than 1.5%.
```shell
testcase 0 pass
testcase 1 pass
testcase 2 pass
testcase 3 differce: y=29308 , correct:29309 ,error rate: 0.000034
testcase 4 pass
testcase 5 differce: y=66 , correct:65 ,error rate: -0.015385
testcase 6 pass
testcase 7 pass
testcase 8 pass
testcase 9 pass
```
Full code after correction:
```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
static uint32_t clz(uint32_t a) {
uint32_t lz = 0;
for (uint32_t i = 16; i > 0; i >>= 1) {
if (!(a >> i)) {
lz += i;
} else {
a >>= i;
}
}
return lz;
}
static uint64_t mul32(uint32_t a, uint32_t b) {
uint64_t r = 0;
for (int i = 0; i < 32; i++) {
if (b & (1U << i)) {
r += (uint64_t)a << i;
}
}
return r;
}
static const uint16_t rsqrt_table[32] = {
65535, 46341, 32768, 23170, 16384, /* 2^0 to 2^4 */
11585, 8192, 5793, 4096, 2896, /* 2^5 to 2^9 */
2048, 1448, 1024, 724, 512, /* 2^10 to 2^14 */
362, 256, 181, 128, 90, /* 2^15 to 2^19 */
64, 45, 32, 23, 16, /* 2^20 to 2^24 */
11, 8, 6, 4, 3, /* 2^25 to 2^29 */
2, 1 /* 2^30, 2^31 */
};
/* 65536/sqrt(x) */
uint32_t fast_rsqrt(uint32_t x) {
if (x == 0)
return 0xFFFFFFFF;
if (x == 1)
return 65536;
int exp = 31 - clz(x);
uint32_t y = rsqrt_table[exp]; // Q16.16
if (x > (1u << exp)) {
uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0;
uint32_t delta = y - y_next;
uint32_t frac =
(uint32_t)((((uint64_t)x - (1UL << exp)) << 16) >> exp);
y -= (uint32_t)(delta * frac) >> 16;
for (int iter = 0; iter < 2; iter++) {
uint32_t y2 = (uint32_t)mul32(y, y); // Q0.32
uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16); // Q16.16
uint64_t tmp = mul32(y, (3u << 16) - xy2);
y = (uint32_t)((tmp >> 17) + ((tmp >> 16) & 1)); // round to nearest
}
}
return y;
}
```
```c
.data
table:
.short 65535, 46341, 32768, 23170, 16384, # 2^0 to 2^4
.short 11585, 8192, 5793, 4096, 2896, # 2^5 to 2^9
.short 2048, 1448, 1024, 724, 512, # 2^10 to 2^14
.short 362, 256, 181, 128, 90, # 2^15 to 2^19
.short 64, 45, 32, 23, 16, # 2^20 to 2^24
.short 11, 8, 6, 4, 3, # 2^25 to 2^29
.short 2, 1
testcase:
.word 0, 1, 4, 5, 1600,
.word 1000000, 12582912, 4294967294, 0xFFFFFFFE,
.word 0xFFFFFFFF
testcase_num: .word 10
.text
.global main
main:
li s10 0
jal x0 test_loop
test_loop:
lw t0 testcase_num
beq s10 t0 end
la t0 testcase
slli t1 s10 2
add t0 t0 t1
lw a0 0(t0)
jal ra fast_rsqrt
li a7 36 # print unsigned
ecall
addi s10 s10 1
li a0 10 # print \n
li a7 11
ecall
jal x0 test_loop
end:
li a7, 10
ecall
# init r, i
init_clz:
li a5 0 # r
li a4 16 # i
jal x0 clz
clz_loop:
srl a3 a0 a4
beq a3 x0 add_lz
srl a0 a0 a4
srli a4 a4 1
jal x0 clz
add_lz:
add a5 a5 a4
srli a4 a4 1
jal x0 clz
clz:
bne a4 x0 clz_loop
mv a0 a5
jr ra
init_mul32:
li t1 0 # r_low
li t6 0 # r_hi
li t2 0 # i
li t3 32 # 32
jal x0 mul32
mul32_loop:
li t4, 1
sll t4, t4, t2
and t5, a1, t4
beq t5, x0, 1f # 若該位是 0,跳過加法
mv a5, a0
sll a4, a0, t2 # a64_low
sub a6, t3, t2 # 32 - i
srl a5, a5, a6 # a64_hi
add t1, t1, a4 # r_low += a64_low
add t6, t6, a5 # r_hi += a64_hi
sltu a7, t1, a4 # carry
add t6, t6, a7
1: addi t2, t2, 1 # i++
blt t2, t3, mul32_loop
mul32:
blt t2 t3 mul32_loop
mv a0 t1 # r_low
mv a1 t6 # r_hi
jr ra
fast_rsqrt:
beq a0 x0 ret_max # x==0
li t0 1 # 1
beq a0 t0 ret_scale1 # x==1
mv s1 a0 # transfer x to saved register
addi sp sp -4
sw ra, 0(sp) # saved current return address
jal ra init_clz # jump clz()
lw ra, 0(sp) # recover return address
addi sp sp 4
li t0 31
sub s2 t0 a0 # exp=31-clz()
la t1 table
slli t3 s2 1
add t1 t1 t3
lhu s3, 0(t1) # y=rsqrt_table[exp]
li t0 1
sll t0 t0 s2
bge s1 t0 intepol
mv a0 s3
jr ra
intepol:
jal t4 y_next
sub s7 s3 s6 # delta
li t0 1
sll t0 t0 s2 # 1UL<<exp
sub t1 s1 t0 # x-(1UL<<exp), t1 is x_low
mv t2 t1 # t2 is x_hi
slli t1 t1 16
srli t2 t2 16
srl t1 t1 s2
li t3 32
sub t3 t3 s2
sll t2 t2 t3
add s8 t1 t2 # frac
mv a0 s7
mv a1 s8
addi sp sp -4
sw ra, 0(sp) # saved current return address
jal ra call_mul32
lw ra, 0(sp) # recover return address
addi sp sp 4
slli t0, a1, 16
srli t1, a0, 16
or t0, t0, t1 # (delta*frac)>>16
sub s3 s3 t0
addi sp sp -4
sw ra, 0(sp) # saved current return address
jal ra newton_iter
lw ra, 0(sp) # recover return address
addi sp sp 4
mv a0 s3
jr ra # return y
call_mul32:
addi sp sp -4
sw ra, 0(sp) # saved current return address
jal ra init_mul32 # delta*frac
lw ra, 0(sp) # recover return address
addi sp sp 4
jr ra
y_next:
li t0 31
blt s2 t0 b1
li s6 0 # y_next=0
jalr x0 t4 0
b1:
la t1, table
addi t2, s2, 1
slli t2, t2, 1 # ×2,16-bit
add t1, t1, t2
lhu s6, 0(t1) # y_next=rsqrt_table[exp+1]
jalr x0 t4 0
newton_iter:
mv a0 s3
mv a1 s3
addi sp sp -4
sw ra, 0(sp) # saved current return address
jal ra call_mul32
lw ra, 0(sp) # recover return address
addi sp sp 4
mv s7 a0 # y2=(uint32_t)mul32(y,y)
mv a0 s1
mv a1 s7
addi sp sp -4
sw ra, 0(sp) # saved current return address
jal ra call_mul32
lw ra, 0(sp) # recover return address
addi sp sp 4
slli t0, a1, 16 # hi << 16
srli t1, a0, 16 # lo >> 16
or s8, t0, t1 # xy2=(uint32_t)(mul32(x,y2)>>16)
mv a0 s3
li t0 3
slli t0 t0 16
sub t0 t0 s8
mv a1 t0
addi sp sp -4
sw ra, 0(sp) # saved current return address
jal ra call_mul32
lw ra, 0(sp) # recover return address
addi sp sp 4
slli t0, a1, 15 # high part for >>17
srli t1, a0, 17 # low part for >>17
or s3, t0, t1 # s3 = tmp >> 17
srli t2, a0, 16 # (tmp >> 16) 的最低位
andi t2, t2, 1
add s3, s3, t2 # y = (tmp >> 17) + round_bit
# loop unrooling
mv a0 s3
mv a1 s3
addi sp sp -4
sw ra, 0(sp) # saved current return address
jal ra call_mul32
lw ra, 0(sp) # recover return address
addi sp sp 4
mv s7 a0 # y2=(uint32_t)mul32(y,y)
mv a0 s1
mv a1 s7
addi sp sp -4
sw ra, 0(sp) # saved current return address
jal ra call_mul32
lw ra, 0(sp) # recover return address
addi sp sp 4
slli t0, a1, 16 # hi << 16
srli t1, a0, 16 # lo >> 16
or s8, t0, t1 # xy2=(uint32_t)(mul32(x,y2)>>16)
mv a0 s3
li t0 3
slli t0 t0 16
sub t0 t0 s8
mv a1 t0
addi sp sp -4
sw ra, 0(sp) # saved current return address
jal ra call_mul32
lw ra, 0(sp) # recover return address
addi sp sp 4
slli t0, a1, 15 # high part for >>17
srli t1, a0, 17 # low part for >>17
or s3, t0, t1 # s3 = tmp >> 17
srli t2, a0, 16 # (tmp >> 16) 的最低位
andi t2, t2, 1
add s3, s3, t2 # y = (tmp >> 17) + round_bit
jr ra
ret_max:
addi a0 x0 -1
jr ra
ret_scale1:
slli a0 a0 16
jr ra
```
running in Ripes:
```
4294967295
65536
32768
29308
1638
66
18
1
1
1
```
## trace in 5 staged pipeline
First program counter at 0x00000000.
Instruction memory output is 0x00000000
Instruction code is 0x00000d13




we can see that `jal test_loop` cause fetch and decode of fail
```
8: 10000297 auipc x5 0x10000
c: 0602a283 lw x5 96 x5
```
original cycle count

After we delete `jal x0 test_loop`

With C compiler -O0

-O2

We can see that after C compiler, the instruction become so much, including printf, memset, ... lots of library function.
After we eliminates unecessary library function:
```
.data
rsqrt_table:
.short 65535, 46341, 32768, 23170, 16384
.short 11585, 8192, 5793, 4096, 2896
.short 2048, 1448, 1024, 724, 512
.short 362, 256, 181, 128, 90
.short 64, 45, 32, 23, 16
.short 11, 8, 6, 4, 3
.short 2, 1
testcase:
.word 0, 1, 4, 5, 1600,
.word 1000000, 12582912, 4294967294, 0xFFFFFFFE,
.word 0xFFFFFFFF
testcase_num: .word 10
.text
.global main
main:
li s10 0
test_loop:
lw t0 testcase_num
beq s10 t0 end
la t0 testcase
slli t1 s10 2
add t0 t0 t1
lw a0 0(t0)
jal ra fast_rsqrt
li a7 36 # print unsigned
ecall
addi s10 s10 1
li a0 10 # print \n
li a7 11
ecall
jal x0 test_loop
end:
li a7, 10
ecall
fast_rsqrt.part:
addi x2 x2 -16
sw x1 12 x2
sw x8 8 x2
sw x9 4 x2
addi x12 x10 0
addi x14 x0 5
addi x11 x0 0
addi x15 x0 16
srl x13 x12 x15
bne x13 x0 312
add x11 x11 x15
addi x14 x14 -1
srli x15 x15 1
bne x14 x0 -20
addi x15 x0 31
sub x11 x15 x11
la x14 rsqrt_table
slli x13 x11 1
addi x15 x0 1
add x13 x14 x13
sll x15 x15 x11
lhu x8 0 x13
bgeu x15 x10 264
addi x12 x0 30
addi x13 x8 0
blt x12 x11 24
addi x13 x11 1
slli x13 x13 1
add x14 x14 x13
lhu x13 0 x14
sub x13 x8 x13
sub x15 x10 x15
sltu x14 x10 x15
sub x14 x0 x14
srli x16 x15 16
slli x14 x14 16
addi x12 x11 -32
or x14 x16 x14
slli x15 x15 16
blt x12 x0 220
srl x11 x14 x12
addi x9 x10 0
addi x10 x13 0
jal x1 408
srli x10 x10 16
sub x8 x8 x10
addi x6 x0 2
addi x11 x0 1
addi x17 x0 31
addi x16 x0 32
srli x10 x9 1
lui x28 0x30
addi x30 x0 0
addi x15 x0 0
addi x12 x15 -32
sll x13 x11 x15
sll x14 x8 x15
srai x12 x12 31
and x13 x13 x8
and x14 x14 x12
addi x15 x15 1
beq x13 x0 8
add x30 x30 x14
bne x15 x16 -36
addi x12 x0 0
addi x29 x0 0
addi x15 x0 0
jal x0 40
sll x13 x9 x13
addi x14 x0 0
add x14 x12 x14
sltu x31 x14 x12
add x13 x29 x13
addi x12 x14 0
add x29 x31 x13
addi x15 x15 1
beq x15 x16 100
sll x14 x11 x15
and x14 x14 x30
addi x13 x15 -32
beq x14 x0 -20
sub x31 x17 x15
bge x13 x0 -56
sll x14 x9 x15
srl x13 x10 x31
jal x0 -60
addi x12 x13 0
jal x0 -308
addi x10 x8 0
lw x1 12 x2
lw x8 8 x2
lw x9 4 x2
addi x2 x2 16
jalr x0 x1 0
addi x12 x0 31
sub x12 x12 x11
slli x14 x14 1
sll x14 x14 x12
srl x11 x15 x11
or x11 x14 x11
jal x0 -236
slli x29 x29 16
srli x12 x12 16
or x12 x29 x12
sub x12 x28 x12
addi x30 x0 0
addi x29 x0 0
addi x15 x0 0
srli x5 x8 1
jal x0 40
sll x13 x8 x13
addi x14 x0 0
add x14 x30 x14
sltu x31 x14 x30
add x13 x29 x13
addi x30 x14 0
add x29 x31 x13
addi x15 x15 1
beq x15 x16 40
sll x14 x11 x15
and x14 x14 x12
addi x13 x15 -32
beq x14 x0 -20
sub x31 x17 x15
bge x13 x0 -56
sll x14 x8 x15
srl x13 x5 x31
jal x0 -60
srli x8 x30 16
slli x29 x29 15
srli x30 x30 17
andi x8 x8 1
or x30 x29 x30
add x8 x8 x30
beq x6 x11 -184
addi x6 x0 1
jal x0 -336
fast_rsqrt:
beq x10 x0 16
addi x14 x0 1
beq x10 x14 16
jal x0 -564
addi x10 x0 -1
jalr x0 x1 0
lui x10 0x10
jalr x0 x1 0
__mulsi3:
addi x12 x10 0
addi x10 x0 0
andi x13 x11 1
beq x13 x0 8
add x10 x10 x12
srli x11 x11 1
slli x12 x12 1
bne x11 x0 -20
jalr x0 x1 0
```
Output is same as previous hand-made instruction program. But the cycles count is less.

We know that CPI larger than 1 have several reasons, such as branchmiss, dependency.
In our case, there is lots of dependency in instructions.
If using forwarding but no hazard detection/eliminating 5 staged pipeline CPU, hazards cause program error, for example:

in this situation, `lw t0 testcase_num` write back the value to t0 and `beq s10 t0` happen simutaneously, this means t0 won't get right value while executing `beq s10 t0`, cause wrong result.
If no hazard handling but is a forwaring CPU, it should arrage two `nop` instruction to avoid hazard. Because register write back stage is two stages after exection stage.
With hazard handling but no forwarding, program won't crush, but every dependency instructions won't execute until it dependency register write back complete.
With forwarding, after every stage of memeory access and register right back, execution stage will get new data.
But if it is hazard right after load, forward not help, so it still need wait 1 cycle(1 nop).
So in above exxample three instructions:
1. With hazard detection and forward
**1 nop** automatically insert avoid hazard error for **load-use**. Through forward and 1 nop, exec `beq s10 t0 end` will get result of `lw t0 testcase`, so it will successfully run. (aware, memory access stage forward will transfer memory read to exec but value hasn't update to t0, update is write back stage. Even t0 register was not right, but `beq s10 t0 end` stage has got right value)
```
4: 10000297 auipc x5 0x10000
8: 0642a283 lw x5 100 x5
nop
c: 025d0c63 beq x26 x5 56 <end>
```
2. With hazard detection but without forward
every dependent instructions will automatically insert **2 nop** instructions to avoid hazard error.
```
4: 10000297 auipc x5 0x10000
nop
nop
8: 0642a283 lw x5 100 x5
nop
nop
c: 025d0c63 beq x26 x5 56 <end>
```
3. Without hazard detecion but with forward
because no nop for load-use, causing hazard error.
4. Without hazard detection and forward
same, no nop cause not only load-use but every dependencies have hazard error.
Qestion:
1. unconditional jump 可以被 pipeline 嗎?
2. 執行指令時,使用 Stack 的方式與想像中不一樣,我以為會是宣告變數後立即將它放入 stack,但實際上寫的時候,放入 stack 的卻是 saved register 的內容,也就是 caller function 存放其中的資料。
3. slt 指令為何老師作業都沒用到
4. riscv 官方文件寫 stack pointer 要對齊 16 byte?
5. 在執行 stack 寫入 `sw x1 0 x2` 時,當此指令執行完 memory access 階段,值就已被寫入記憶體,但 cache 卻等了 4 個 cycles 才更新剛剛寫入的記憶體值。
6. ripes cache 的運作很奇怪,單步執行時,假如運行到第 570 cycles,寫入記憶體某地址,值從 0x180 變成 0x1a4,但到了 574 cycles cache 的值才發生改變,從 0x180 變成 0x1a4,但逆向執行,cache 的值卻在 568 cycles 才變回 0x180,但記憶體則會在 570 cycles 時變回 0x180。
7. ripes 貌似沒有 branch prediction
[Modern Microprocessors: A 90-Minute Guide!](https://www.lighterra.com/papers/modernmicroprocessors/)
https://shakti.org.in/docs/risc-v-asm-manual.pdf