Reciprocal integer square root in RISC-V

# Reciprocal integer square root in RISC-V In Quiz1 problem A, it mentioned about getting square root from Bfloat16. Here, I want to discuss relevent topic. To compute reciprocal square root of an integer without FPU. A famous technique in Quake III souce code to compute inverse square root of float. ```clike float Q_rsqrt( float number ) { long i; float x2, y; const float threehalfs = 1.5F; x2 = number * 0.5F; y = number; i = * ( long * ) &y; // evil floating point bit level hacking i = 0x5f3759df - ( i >> 1 ); // what the fuck? y = * ( float * ) &i; y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration // y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed return y; } ``` But this magical technique only effictive to floating point 32. If the environment restricts from using FPU, we can not use it. To compute reciprocal square root of integer, we need Newton's method. ## Newton's method Newton's method is a algorithm to compute root of function. ![image](https://hackmd.io/_uploads/BkWs0d5all.png) $f'(x)$ can be slope of $f(x)$ near $x_n$. According to slope formula: $f'(x) = \frac{f(x_n)-0}{x_n-x_{n+1}}$ so $x_{n+1} = x_n-\frac{f(x_n)}{f'(x_n)}$ Then to compute reciprocal square root. We let $y=\frac{1}{\sqrt{x}}$ and let $f(y) = \frac{1}{y^2}-x$ Make a initial guess for root of $f(y)$ as $y_0$. We gets $y_1=y_0-\frac{y_0^{-2}-x}{-2y_0^{-3}}=y_0+\frac{y_0}{2}-\frac{xy_0^3}{2}=y_0(\frac{3}{2}-\frac{xy_0^2}{2})$ We know that reciprocal square root defintely smaller than $1$, it must be represent in floating point or fixed point. Here, we assume not using floating point and FPU, so we represent it in fixed point Q0.32 ```clike /* * new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2) * * Here, invsqrt is a fixed point number (< 1.0), 32bit mantissa, aka Q0.32 */ static void newton_step(u32 *rec_inv_sqrt, u32 x) { u32 invsqrt, invsqrt2; u64 val; invsqrt = rec_inv_sqrt; invsqrt2 = ((u64)invsqrt * invsqrt) >> 32; val = (3LL << 32) - ((u64)x * invsqrt2); val >>= 2; /* avoid overflow in following multiply */ val = (val * invsqrt) >> (32 - 2 + 1); *rec_inv_sqrt = val; } ``` But Newton's method is not an efficient way to approximate root. For example, $\frac{1}{\sqrt{5}}$ with starter guess 0.5 still needs 4 times approximation to reach 32 bits precision. It's better to make a lookup table ```clike #define REC_INV_SQRT_CACHE (16) static const u32 inv_sqrt_cache[REC_INV_SQRT_CACHE] = { ~0, ~0, 3037000500, 2479700525, 2147483647, 1920767767, 1753413056, 1623345051, 1518500250, 1431655765, 1358187914, 1294981364, 1239850263, 1191209601, 1147878294, 1108955788 }; ``` ### Follow up in Quiz3 [Quiz 3 problem C](https://hackmd.io/@sysprog/arch2025-quiz3-sol?stext=27539%3A9%3A0%3A1761019229%3AjEhMzK) Why scaling factor needed? What the benefits? How to determine lookuptable ## Transform C code to RISC-V instructions RISC-V have 31 registers, each one is general purpose. (why not more registers? because more register, more propagation delay.) ![image](https://hackmd.io/_uploads/By3yZSM1-l.png) For **argument register** holds function argument from x10(a0) to x17(a7) totally 8 registers. x9 and x18(s2) to x27(s11) are **saved registers**, which should be maintain while jump to different function. So for example, if a function using s0, it should first store original value of s0 to memory before changing it. After operation down, then move s0 original from memory to s0. Further in RISC-V ABI, saying **stack pointer** have to align 16 bytes. Stack pointer is accumulate from high address to low address. So push a data to stack means substract stack pointer, pop means add. **Program counter** is a separate special internal register, not a general purpose register. ![image](https://hackmd.io/_uploads/ByQTZSzy-e.png) ![圖片](https://hackmd.io/_uploads/H18gSqVy-e.png) pseudo instructions are group of instruction representing abstruct idea. * `mv rd rs1` = `addi rd rs1 0` * `li rd immd` = `addi rd rd immd` or `lui rd immd + addi rd immd` which `lui` for upper 20 bits(clear lower 12 bits) and `addi` for lower 12 bits * `jr re` = `jalr x0 x1 0` * `j label` = `jal x0 label` * `nop` = `addi x0 x0 0` Immediate instruction only performs sign-extend 12 bits immediate. If Formats, group same encoding layout instruction into same format: * R-format: three registers, `rd rs1 rs2` * I-format: 12 bits immediate, `addi` or `lw` and `jalr` * S-format: store, `sw` * B-format: branch, `beq` (aware of b-format encoding) * J-format: jump, `jal` * U-format: load upper 20 bits and clear lower 12 bits, `lui` **auipc** is special instruction for handling super large offsets. Because B-format only handles $\pm2^{10}$ offsets, while J-format is $\pm2^{18}$. So for larger ones, we need `auipc` which `auipc rd immd => rd = (immed << 12) + PC` Also, `jalr` is I-format, so it won't multiply immed by $2$ like B-format and J-format. However, `jalr rd rs1 immed` means it can do arithmatic beforehand to make super large offsets. In assembly, symbol are those human readable representation, such as label or varible. ``` .data varible: .text label: ``` `if`, `while`, function calls all represent jump in assembly. Translating `bf16_add` :::spoiler bf16_add in RISC-V instructions ```c .data BF16_NAN: 0x7FC0 .text bf16_add: srli t1 a0 15 andi t1 t1 1 # sign_a srli t2 a1 15 andi t2 t2 1 # sign_b srli t3 a0 7 li a3 0xFF and t3 t3 a3 # exp_a srli t4 a1 7 and t4 t4 a3 # exp_b andi t5 a0 0x7F # mant_a andi t6 a1 0x7F # mant_b beq t3 ta3 is_inf beq t4 a3 ret_b not t7 t3 not t8 t5 and t7 t7 t8 bnez t7 ret_b not t7 t4 not t8 t6 and t7 t7 t8 bnez t7 ret_a is_inf: beqz t5 is_nan jal x0 ret_a is_nan: beq t4 a3 ret_nan jal x0 ret_a ret_nan: bnez t6 ret_b beq t1 t2 ret_b la a0 BF16_NAN lw a0 0(a0) jal x0 ret_a ret_a: jr ra ret_b: addi a0 a1 0 jr ra ``` ::: `jal rgs label` jal is J-type, calculates its relative to program count. while `jalr rgs immd(rgs)` jalr is I-type, calculates based on immediate + register's value. ### Reciprocal square root in RV32I Initial guess: $$y_0 = \frac{1}{\sqrt{2^{\lfloor log_2x \rfloor}}}$$ Original code: ```c static uint64_t mul32(uint32_t a, uint32_t b) { uint64_t r = 0; for (int i = 0; i < 32; i++) { if (b & (1U << i)) { r += (uint64_t)a << i; } } return r; } static const uint16_t rsqrt_table[32] = { 65536, 46341, 32768, 23170, 16384, /* 2^0 to 2^4 */ 11585, 8192, 5793, 4096, 2896, /* 2^5 to 2^9 */ 2048, 1448, 1024, 724, 512, /* 2^10 to 2^14 */ 362, 256, 181, 128, 90, /* 2^15 to 2^19 */ 64, 45, 32, 23, 16, /* 2^20 to 2^24 */ 11, 8, 6, 4, 3, /* 2^25 to 2^29 */ 2, 1 /* 2^30, 2^31 */ }; /* 65536/sqrt(x) */ uint32_t fast_rsqrt(uint32_t x) { if (x == 0) return 0xFFFFFFFF; if (x == 1) return 65536; int exp = 31 - clz(x); uint32_t y = rsqrt_table[exp]; if (x > (1u << exp)) { uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0; uint32_t delta = y - y_next; uint32_t frac = ((x - (1U << exp)) << 16) / (1U << exp); y -= (uint32_t) (delta*frac) >> 16; } for (int iter = 0; iter < 2; iter++) { uint32_t y2 = (uint32_t)(mul32(y, y) >> 16); uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16); y = (uint32_t) (mul32(y, (3u << 16) - xy2) >> 17); } return y; } uint32_t testcase[4] = {0x0, 0x1, 0x4, 0xFFFFFFFF}; uint32_t answer[4] = {0x0, 0x1, 0x8000, 0x1}; int main() { for (int i = 0; i < 4; i++) { uint32_t y = fast_rsqrt(testcase[i]); } } ``` In interpolation, ```c if (x > (1u << exp)) { uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0; uint32_t delta = y - y_next; uint32_t frac = ((x - (1U << exp)) << 16) / (1U << exp); y -= (uint32_t) (delta*frac) >> 16; } ``` if x is larger than $2^{17}$, then `(x - (1U << exp)) << 16)` will overflow. so avoid overflow, extends to `uint64_t`: ```c uint32_t frac = (uint32_t) ((((uint64_t)x - (1UL << exp)) << 16) / (1UL << exp)); ``` In Newton-Raphson iteration, ```c uint32_t y2 = (uint32_t)(mul32(y, y) >> 16); uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16); y = (uint32_t) (mul32(y, (3u << 16) - xy2) >> 17); ``` when y is less than $2^8$(which also means x is larger than $2^{16}$), then `y2` will become zero, which makes `xy2` becomes zero, too. Then we observes that because `y` is less than $2^{16}=65536$. `y2` will definitely less than $2^{32}$. So we can express `y2` in Q0.32 and for `xy2`, just maintain Q16.16 ```c uint32_t y2 = (uint32_t) mul32(y, y); // Q0.32 uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16); // Q16.16 uint64_t tmp = mul32(y, (3u << 16) - xy2); y = (uint32_t)((tmp >> 17) + ((tmp >> 16) & 1)); // round to nearest ``` Testcase below: ```c #define testcase_num 10 uint32_t testcase[testcase_num] = { 0, 1, 4, 5, 1600, 1000000, 12582912, 4294967294, 0xFFFFFFFE, 0xFFFFFFFF}; uint32_t answer[testcase_num] = {0xFFFFFFFF, 65536, 32768, 29309, 1638, 65, 18, 1, 1, 1}; int main() { for (int i = 0; i < testcase_num; i++) { uint32_t y = fast_rsqrt(testcase[i]); if (y != answer[i]) { int32_t delta = (int32_t)answer[i] - y; double deltaf = (double)delta; double perc = (double)answer[i]; printf("testcase %d: y=%d , correct=%d ,error rate=%lf\n", i, y, answer[i], deltaf / perc); } else printf("testcase %d pass\n", i); } return 0; } ``` Numerical difference is less than 1.5%. ```shell testcase 0 pass testcase 1 pass testcase 2 pass testcase 3 differce: y=29308 , correct:29309 ,error rate: 0.000034 testcase 4 pass testcase 5 differce: y=66 , correct:65 ,error rate: -0.015385 testcase 6 pass testcase 7 pass testcase 8 pass testcase 9 pass ``` Full code after correction: ```c #include <math.h> #include <stdint.h> #include <stdio.h> static uint32_t clz(uint32_t a) { uint32_t lz = 0; for (uint32_t i = 16; i > 0; i >>= 1) { if (!(a >> i)) { lz += i; } else { a >>= i; } } return lz; } static uint64_t mul32(uint32_t a, uint32_t b) { uint64_t r = 0; for (int i = 0; i < 32; i++) { if (b & (1U << i)) { r += (uint64_t)a << i; } } return r; } static const uint16_t rsqrt_table[32] = { 65535, 46341, 32768, 23170, 16384, /* 2^0 to 2^4 */ 11585, 8192, 5793, 4096, 2896, /* 2^5 to 2^9 */ 2048, 1448, 1024, 724, 512, /* 2^10 to 2^14 */ 362, 256, 181, 128, 90, /* 2^15 to 2^19 */ 64, 45, 32, 23, 16, /* 2^20 to 2^24 */ 11, 8, 6, 4, 3, /* 2^25 to 2^29 */ 2, 1 /* 2^30, 2^31 */ }; /* 65536/sqrt(x) */ uint32_t fast_rsqrt(uint32_t x) { if (x == 0) return 0xFFFFFFFF; if (x == 1) return 65536; int exp = 31 - clz(x); uint32_t y = rsqrt_table[exp]; // Q16.16 if (x > (1u << exp)) { uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0; uint32_t delta = y - y_next; uint32_t frac = (uint32_t)((((uint64_t)x - (1UL << exp)) << 16) >> exp); y -= (uint32_t)(delta * frac) >> 16; for (int iter = 0; iter < 2; iter++) { uint32_t y2 = (uint32_t)mul32(y, y); // Q0.32 uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16); // Q16.16 uint64_t tmp = mul32(y, (3u << 16) - xy2); y = (uint32_t)((tmp >> 17) + ((tmp >> 16) & 1)); // round to nearest } } return y; } ``` ```c .data table: .short 65535, 46341, 32768, 23170, 16384, # 2^0 to 2^4 .short 11585, 8192, 5793, 4096, 2896, # 2^5 to 2^9 .short 2048, 1448, 1024, 724, 512, # 2^10 to 2^14 .short 362, 256, 181, 128, 90, # 2^15 to 2^19 .short 64, 45, 32, 23, 16, # 2^20 to 2^24 .short 11, 8, 6, 4, 3, # 2^25 to 2^29 .short 2, 1 testcase: .word 0, 1, 4, 5, 1600, .word 1000000, 12582912, 4294967294, 0xFFFFFFFE, .word 0xFFFFFFFF testcase_num: .word 10 .text .global main main: li s10 0 jal x0 test_loop test_loop: lw t0 testcase_num beq s10 t0 end la t0 testcase slli t1 s10 2 add t0 t0 t1 lw a0 0(t0) jal ra fast_rsqrt li a7 36 # print unsigned ecall addi s10 s10 1 li a0 10 # print \n li a7 11 ecall jal x0 test_loop end: li a7, 10 ecall # init r, i init_clz: li a5 0 # r li a4 16 # i jal x0 clz clz_loop: srl a3 a0 a4 beq a3 x0 add_lz srl a0 a0 a4 srli a4 a4 1 jal x0 clz add_lz: add a5 a5 a4 srli a4 a4 1 jal x0 clz clz: bne a4 x0 clz_loop mv a0 a5 jr ra init_mul32: li t1 0 # r_low li t6 0 # r_hi li t2 0 # i li t3 32 # 32 jal x0 mul32 mul32_loop: li t4, 1 sll t4, t4, t2 and t5, a1, t4 beq t5, x0, 1f # 若該位是 0，跳過加法 mv a5, a0 sll a4, a0, t2 # a64_low sub a6, t3, t2 # 32 - i srl a5, a5, a6 # a64_hi add t1, t1, a4 # r_low += a64_low add t6, t6, a5 # r_hi += a64_hi sltu a7, t1, a4 # carry add t6, t6, a7 1: addi t2, t2, 1 # i++ blt t2, t3, mul32_loop mul32: blt t2 t3 mul32_loop mv a0 t1 # r_low mv a1 t6 # r_hi jr ra fast_rsqrt: beq a0 x0 ret_max # x==0 li t0 1 # 1 beq a0 t0 ret_scale1 # x==1 mv s1 a0 # transfer x to saved register addi sp sp -4 sw ra, 0(sp) # saved current return address jal ra init_clz # jump clz() lw ra, 0(sp) # recover return address addi sp sp 4 li t0 31 sub s2 t0 a0 # exp=31-clz() la t1 table slli t3 s2 1 add t1 t1 t3 lhu s3, 0(t1) # y=rsqrt_table[exp] li t0 1 sll t0 t0 s2 bge s1 t0 intepol mv a0 s3 jr ra intepol: jal t4 y_next sub s7 s3 s6 # delta li t0 1 sll t0 t0 s2 # 1UL<<exp sub t1 s1 t0 # x-(1UL<<exp), t1 is x_low mv t2 t1 # t2 is x_hi slli t1 t1 16 srli t2 t2 16 srl t1 t1 s2 li t3 32 sub t3 t3 s2 sll t2 t2 t3 add s8 t1 t2 # frac mv a0 s7 mv a1 s8 addi sp sp -4 sw ra, 0(sp) # saved current return address jal ra call_mul32 lw ra, 0(sp) # recover return address addi sp sp 4 slli t0, a1, 16 srli t1, a0, 16 or t0, t0, t1 # (delta*frac)>>16 sub s3 s3 t0 addi sp sp -4 sw ra, 0(sp) # saved current return address jal ra newton_iter lw ra, 0(sp) # recover return address addi sp sp 4 mv a0 s3 jr ra # return y call_mul32: addi sp sp -4 sw ra, 0(sp) # saved current return address jal ra init_mul32 # delta*frac lw ra, 0(sp) # recover return address addi sp sp 4 jr ra y_next: li t0 31 blt s2 t0 b1 li s6 0 # y_next=0 jalr x0 t4 0 b1: la t1, table addi t2, s2, 1 slli t2, t2, 1 # ×2,16-bit add t1, t1, t2 lhu s6, 0(t1) # y_next=rsqrt_table[exp+1] jalr x0 t4 0 newton_iter: mv a0 s3 mv a1 s3 addi sp sp -4 sw ra, 0(sp) # saved current return address jal ra call_mul32 lw ra, 0(sp) # recover return address addi sp sp 4 mv s7 a0 # y2=(uint32_t)mul32(y,y) mv a0 s1 mv a1 s7 addi sp sp -4 sw ra, 0(sp) # saved current return address jal ra call_mul32 lw ra, 0(sp) # recover return address addi sp sp 4 slli t0, a1, 16 # hi << 16 srli t1, a0, 16 # lo >> 16 or s8, t0, t1 # xy2=(uint32_t)(mul32(x,y2)>>16) mv a0 s3 li t0 3 slli t0 t0 16 sub t0 t0 s8 mv a1 t0 addi sp sp -4 sw ra, 0(sp) # saved current return address jal ra call_mul32 lw ra, 0(sp) # recover return address addi sp sp 4 slli t0, a1, 15 # high part for >>17 srli t1, a0, 17 # low part for >>17 or s3, t0, t1 # s3 = tmp >> 17 srli t2, a0, 16 # (tmp >> 16) 的最低位 andi t2, t2, 1 add s3, s3, t2 # y = (tmp >> 17) + round_bit # loop unrooling mv a0 s3 mv a1 s3 addi sp sp -4 sw ra, 0(sp) # saved current return address jal ra call_mul32 lw ra, 0(sp) # recover return address addi sp sp 4 mv s7 a0 # y2=(uint32_t)mul32(y,y) mv a0 s1 mv a1 s7 addi sp sp -4 sw ra, 0(sp) # saved current return address jal ra call_mul32 lw ra, 0(sp) # recover return address addi sp sp 4 slli t0, a1, 16 # hi << 16 srli t1, a0, 16 # lo >> 16 or s8, t0, t1 # xy2=(uint32_t)(mul32(x,y2)>>16) mv a0 s3 li t0 3 slli t0 t0 16 sub t0 t0 s8 mv a1 t0 addi sp sp -4 sw ra, 0(sp) # saved current return address jal ra call_mul32 lw ra, 0(sp) # recover return address addi sp sp 4 slli t0, a1, 15 # high part for >>17 srli t1, a0, 17 # low part for >>17 or s3, t0, t1 # s3 = tmp >> 17 srli t2, a0, 16 # (tmp >> 16) 的最低位 andi t2, t2, 1 add s3, s3, t2 # y = (tmp >> 17) + round_bit jr ra ret_max: addi a0 x0 -1 jr ra ret_scale1: slli a0 a0 16 jr ra ``` running in Ripes: ``` 4294967295 65536 32768 29308 1638 66 18 1 1 1 ``` ## trace in 5 staged pipeline First program counter at 0x00000000. Instruction memory output is 0x00000000 Instruction code is 0x00000d13 ![圖片](https://hackmd.io/_uploads/rJxW354J-x.png) ![Screenshot from 2025-11-02 16-38-57](https://hackmd.io/_uploads/rJnhsqNJ-g.png) ![Screenshot from 2025-11-02 16-45-27](https://hackmd.io/_uploads/BJp3i94kZe.png) ![Screenshot from 2025-11-02 16-45-42](https://hackmd.io/_uploads/Skanjq4kbe.png) we can see that `jal test_loop` cause fetch and decode of fail ``` 8: 10000297 auipc x5 0x10000 c: 0602a283 lw x5 96 x5 ``` original cycle count ![圖片](https://hackmd.io/_uploads/B1Pmi5EkWl.png) After we delete `jal x0 test_loop` ![圖片](https://hackmd.io/_uploads/HJ1uocEk-g.png) With C compiler -O0 ![圖片](https://hackmd.io/_uploads/SkDEkoEyZx.png) -O2 ![圖片](https://hackmd.io/_uploads/r17Txo4kZx.png) We can see that after C compiler, the instruction become so much, including printf, memset, ... lots of library function. After we eliminates unecessary library function: ``` .data rsqrt_table: .short 65535, 46341, 32768, 23170, 16384 .short 11585, 8192, 5793, 4096, 2896 .short 2048, 1448, 1024, 724, 512 .short 362, 256, 181, 128, 90 .short 64, 45, 32, 23, 16 .short 11, 8, 6, 4, 3 .short 2, 1 testcase: .word 0, 1, 4, 5, 1600, .word 1000000, 12582912, 4294967294, 0xFFFFFFFE, .word 0xFFFFFFFF testcase_num: .word 10 .text .global main main: li s10 0 test_loop: lw t0 testcase_num beq s10 t0 end la t0 testcase slli t1 s10 2 add t0 t0 t1 lw a0 0(t0) jal ra fast_rsqrt li a7 36 # print unsigned ecall addi s10 s10 1 li a0 10 # print \n li a7 11 ecall jal x0 test_loop end: li a7, 10 ecall fast_rsqrt.part: addi x2 x2 -16 sw x1 12 x2 sw x8 8 x2 sw x9 4 x2 addi x12 x10 0 addi x14 x0 5 addi x11 x0 0 addi x15 x0 16 srl x13 x12 x15 bne x13 x0 312 add x11 x11 x15 addi x14 x14 -1 srli x15 x15 1 bne x14 x0 -20 addi x15 x0 31 sub x11 x15 x11 la x14 rsqrt_table slli x13 x11 1 addi x15 x0 1 add x13 x14 x13 sll x15 x15 x11 lhu x8 0 x13 bgeu x15 x10 264 addi x12 x0 30 addi x13 x8 0 blt x12 x11 24 addi x13 x11 1 slli x13 x13 1 add x14 x14 x13 lhu x13 0 x14 sub x13 x8 x13 sub x15 x10 x15 sltu x14 x10 x15 sub x14 x0 x14 srli x16 x15 16 slli x14 x14 16 addi x12 x11 -32 or x14 x16 x14 slli x15 x15 16 blt x12 x0 220 srl x11 x14 x12 addi x9 x10 0 addi x10 x13 0 jal x1 408 srli x10 x10 16 sub x8 x8 x10 addi x6 x0 2 addi x11 x0 1 addi x17 x0 31 addi x16 x0 32 srli x10 x9 1 lui x28 0x30 addi x30 x0 0 addi x15 x0 0 addi x12 x15 -32 sll x13 x11 x15 sll x14 x8 x15 srai x12 x12 31 and x13 x13 x8 and x14 x14 x12 addi x15 x15 1 beq x13 x0 8 add x30 x30 x14 bne x15 x16 -36 addi x12 x0 0 addi x29 x0 0 addi x15 x0 0 jal x0 40 sll x13 x9 x13 addi x14 x0 0 add x14 x12 x14 sltu x31 x14 x12 add x13 x29 x13 addi x12 x14 0 add x29 x31 x13 addi x15 x15 1 beq x15 x16 100 sll x14 x11 x15 and x14 x14 x30 addi x13 x15 -32 beq x14 x0 -20 sub x31 x17 x15 bge x13 x0 -56 sll x14 x9 x15 srl x13 x10 x31 jal x0 -60 addi x12 x13 0 jal x0 -308 addi x10 x8 0 lw x1 12 x2 lw x8 8 x2 lw x9 4 x2 addi x2 x2 16 jalr x0 x1 0 addi x12 x0 31 sub x12 x12 x11 slli x14 x14 1 sll x14 x14 x12 srl x11 x15 x11 or x11 x14 x11 jal x0 -236 slli x29 x29 16 srli x12 x12 16 or x12 x29 x12 sub x12 x28 x12 addi x30 x0 0 addi x29 x0 0 addi x15 x0 0 srli x5 x8 1 jal x0 40 sll x13 x8 x13 addi x14 x0 0 add x14 x30 x14 sltu x31 x14 x30 add x13 x29 x13 addi x30 x14 0 add x29 x31 x13 addi x15 x15 1 beq x15 x16 40 sll x14 x11 x15 and x14 x14 x12 addi x13 x15 -32 beq x14 x0 -20 sub x31 x17 x15 bge x13 x0 -56 sll x14 x8 x15 srl x13 x5 x31 jal x0 -60 srli x8 x30 16 slli x29 x29 15 srli x30 x30 17 andi x8 x8 1 or x30 x29 x30 add x8 x8 x30 beq x6 x11 -184 addi x6 x0 1 jal x0 -336 fast_rsqrt: beq x10 x0 16 addi x14 x0 1 beq x10 x14 16 jal x0 -564 addi x10 x0 -1 jalr x0 x1 0 lui x10 0x10 jalr x0 x1 0 __mulsi3: addi x12 x10 0 addi x10 x0 0 andi x13 x11 1 beq x13 x0 8 add x10 x10 x12 srli x11 x11 1 slli x12 x12 1 bne x11 x0 -20 jalr x0 x1 0 ``` Output is same as previous hand-made instruction program. But the cycles count is less. ![圖片](https://hackmd.io/_uploads/SJFP90VyWx.png) We know that CPI larger than 1 have several reasons, such as branchmiss, dependency. In our case, there is lots of dependency in instructions. If using forwarding but no hazard detection/eliminating 5 staged pipeline CPU, hazards cause program error, for example: ![圖片](https://hackmd.io/_uploads/Sk6su1SkWg.png) in this situation, `lw t0 testcase_num` write back the value to t0 and `beq s10 t0` happen simutaneously, this means t0 won't get right value while executing `beq s10 t0`, cause wrong result. If no hazard handling but is a forwaring CPU, it should arrage two `nop` instruction to avoid hazard. Because register write back stage is two stages after exection stage. With hazard handling but no forwarding, program won't crush, but every dependency instructions won't execute until it dependency register write back complete. With forwarding, after every stage of memeory access and register right back, execution stage will get new data. But if it is hazard right after load, forward not help, so it still need wait 1 cycle(1 nop). So in above exxample three instructions: 1. With hazard detection and forward **1 nop** automatically insert avoid hazard error for **load-use**. Through forward and 1 nop, exec `beq s10 t0 end` will get result of `lw t0 testcase`, so it will successfully run. (aware, memory access stage forward will transfer memory read to exec but value hasn't update to t0, update is write back stage. Even t0 register was not right, but `beq s10 t0 end` stage has got right value) ``` 4: 10000297 auipc x5 0x10000 8: 0642a283 lw x5 100 x5 nop c: 025d0c63 beq x26 x5 56 <end> ``` 2. With hazard detection but without forward every dependent instructions will automatically insert **2 nop** instructions to avoid hazard error. ``` 4: 10000297 auipc x5 0x10000 nop nop 8: 0642a283 lw x5 100 x5 nop nop c: 025d0c63 beq x26 x5 56 <end> ``` 3. Without hazard detecion but with forward because no nop for load-use, causing hazard error. 4. Without hazard detection and forward same, no nop cause not only load-use but every dependencies have hazard error. Qestion: 1. unconditional jump 可以被　pipeline　嗎？ 2. 執行指令時，使用 Stack 的方式與想像中不一樣，我以為會是宣告變數後立即將它放入 stack，但實際上寫的時候，放入 stack 的卻是 saved register 的內容，也就是 caller function 存放其中的資料。 3. slt 指令為何老師作業都沒用到 4. riscv 官方文件寫　stack pointer 要對齊 16 byte? 5. 在執行　stack 寫入 `sw x1 0 x2` 時，當此指令執行完 memory access 階段，值就已被寫入記憶體，但 cache 卻等了 4 個 cycles 才更新剛剛寫入的記憶體值。 6. ripes cache 的運作很奇怪，單步執行時，假如運行到第 570 cycles，寫入記憶體某地址，值從 0x180 變成 0x1a4，但到了　574 cycles cache 的值才發生改變，從 0x180 變成 0x1a4，但逆向執行，cache 的值卻在　568 cycles 才變回 0x180，但記憶體則會在 570 cycles 時變回 0x180。 7. ripes 貌似沒有 branch prediction [Modern Microprocessors: A 90-Minute Guide!](https://www.lighterra.com/papers/modernmicroprocessors/) https://shakti.org.in/docs/risc-v-asm-manual.pdf