# Fast Approximation of Reciprocal Square Root

contributed by < [`dotandlog`](https://github.com/dotandlog) >

> [Source code](https://github.com/dotandlog)

:::danger
Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data.
:::

In computer graphics, calculating lighting and shading often involves normalizing vectors, which requires computing the [reciprocal square root](https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Iterative_methods_for_reciprocal_square_roots) of a number, $\frac{1}{\sqrt{x}}$. This operation is computationally expensive if done directly. The [Fast Inverse Square Root algorithm](https://en.wikipedia.org/wiki/Fast_inverse_square_root), made famous by its use in Quake III Arena, provides a fast approximation. In this work, we demonstrate how `bits_to_fp32` and `__builtin_clz` can be part of such an approximation; they are retrieved from [Problem A](https://hackmd.io/@sysprog/arch2024-quiz1-sol#Problem-A) and [Problem C](https://hackmd.io/@sysprog/arch2024-quiz1-sol#Problem-C) of Quiz 1, respectively.

## FP32 Format

Here is the format of the [single-precision floating-point](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) representation, also known as `FP32` or `float32`:

$$
\begin{align}
{FP32 = {(-1)}^{sign}\times {2}^{E-127} \times (1+\sum_{i=1}^{23}b_{23-i}2^{-i})}
\end{align}
$$

:::danger
Re-draw the following with Graphviz.
> I've already re-drawn the following graph with Graphviz. [name=dotandlog]
:::

```graphviz
digraph {
    node [shape=record];

    // Main structure nodes
    main_struct [label="<f0> Sign\n1 bit\nbit 31|<f1> Exponent\n8 bits\nbits 30-23|<f2> Mantissa\n23 bits\nbits 22-0", style=filled];

    // Details nodes
    sign_details [label="Sign bit 31\n0 → Positive\n1 → Negative", style=filled, fillcolor=lightblue];
    exp_details [label="Exponent bits 30-23\n00000000 → Subnormal\n11111111 → Inf/NaN\nOther → 2^(E-127)", style=filled, fillcolor=lightpink];
    mantissa_details [label="Mantissa bits 22-0\n• Normalized: 1.fraction\n• Each bit represents power of 2\n• b₂₂ = 2⁻¹, b₂₁ = 2⁻², ... b₀ = 2⁻²³", style=filled, fillcolor=lightyellow];

    main_struct:f0 -> sign_details;
    main_struct:f1 -> exp_details;
    main_struct:f2 -> mantissa_details;
}
```

## How it works

```mermaid
flowchart TD;
    A["Input float x"]
    B["Convert to integer"]
    A --> B
    B --> C["Count leading zeros"]
    C --> D["Initial approximation"]
    D --> E["Apply 'magic' number"]
    E --> F["Convert to float"]
    F --> G["Newton-Raphson"]
    G --> H["Output 1/sqrt(x)"]
```

<!--
```graphviz
digraph {
    rankdir="TD";
    A [label="Input float x"];
    B [label="Convert to integer"];
    C [label="Count leading zeros"];
    D [label="Initial approximation"];
    E [label="Apply 'magic' number"];
    F [label="Convert to float"];
    G [label="Newton-Raphson"];
    H [label="Output 1/sqrt(x)"];
    A -> B;
    B -> C;
    C -> D;
    D -> E;
    E -> F;
    F -> G;
    G -> H;
}
```
-->

1. **Input Conversion:** The input float is reinterpreted as a 32-bit integer.
2. **Leading Zero Count:** `__builtin_clz` is used to count leading zeros in this integer, helping to normalize the input.
3. **Initial Approximation:** The integer is manipulated using bitwise operations, incorporating the leading zero count to create a better initial guess.
4. **Bit Magic:** A "magic" number (`0x5f3759df`) is used in further bitwise manipulation to refine the approximation.
5. **Float Conversion:** The manipulated integer is converted back to a float using `bits_to_fp32`.
6. **Newton-Raphson Iteration:** Iterations of the [Newton-Raphson method](https://en.wikipedia.org/wiki/Newton%27s_method) are applied to improve accuracy.
   According to the article "[Understanding Quake’s Fast Inverse Square Root](https://betterexplained.com/articles/understanding-quakes-fast-inverse-square-root/)", the author explores how accuracy improves with the number of iterations. As the article shows, the accuracy reaches about 98.06% by the third iteration, so we choose 3 as the maximum number of iterations.
7. **Result:** The function returns this improved approximation of $\frac{1}{\sqrt{x}}$.

## Conversion Between FP32 and Bits

- **C Code**

```cpp=
float bits_to_fp32(uint32_t w) {
    union {
        uint32_t as_bits;
        float as_value;
    } fp32 = {.as_bits = w};
    return fp32.as_value;
}

uint32_t fp32_to_bits(float f) {
    union {
        float as_value;
        uint32_t as_bits;
    } fp32 = {.as_value = f};
    return fp32.as_bits;
}
```

- **RISC-V Assembly Code**

```asm=
# Function: bits_to_fp32
# Input:  a0 - integer part, a1 - fractional part (as fixed-point),
#         a2 - sign (0 for positive, 1 for negative)
# Output: a0 - 32-bit floating-point representation
bits_to_fp32:
    # Combine sign, exponent, and mantissa
    li   t0, 127                # Bias for exponent
    li   t1, 0x7FFFFF           # Mask for 23-bit mantissa

    # Handle sign
    slli t2, a2, 31             # Shift sign to bit 31

    # Normalize and calculate exponent
    addi t3, zero, 23           # Start with exponent 23 (assuming integer part)
normalize_loop:
    bge  a0, t1, end_normalize
    slli a0, a0, 1              # Shift left
    addi t3, t3, -1             # Decrease exponent
    j    normalize_loop
end_normalize:
    # Combine exponent and mantissa
    add  t3, t3, t0             # Add bias to exponent
    slli t3, t3, 23             # Shift exponent to correct position
    and  a0, a0, t1             # Mask mantissa
    or   a0, a0, t3             # Combine exponent and mantissa
    or   a0, a0, t2             # Add sign
    ret

# Function: fp32_to_bits
# Input:  a0 - 32-bit floating-point representation
# Output: a0 - integer part, a1 - fractional part (as fixed-point), a2 - sign
fp32_to_bits:
    # Extract sign, exponent, and mantissa
    li   t0, 0x7F800000         # Exponent mask
    li   t1, 0x007FFFFF         # Mantissa mask

    # Extract sign
    srli a2, a0, 31

    # Extract and unbias exponent
    and  t2, a0, t0
    srli t2, t2, 23
    addi t2, t2, -127           # Unbias exponent

    # Extract mantissa and add implied 1
    and  a0, a0, t1
    li   t4, 0x800000           # 0x800000 exceeds the 12-bit I-immediate, so load it first
    or   a0, a0, t4             # Add implied 1

    # Adjust based on exponent
    li   t3, 23
    sub  t3, t3, t2
    bge  t3, zero, shift_right
    neg  t3, t3
    sll  a0, a0, t3
    j    end_shift
shift_right:
    srl  a0, a0, t3
end_shift:
    # Split into integer and fractional parts
    li   t4, 0x00FFFFFF
    and  a1, a0, t4             # Fractional part
    srli a0, a0, 24             # Integer part
    ret
```

:::danger
You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking.
:::

## Counting Leading Zeros

- **C Code**

```cpp=
int clz(uint32_t x) {
    int count = 0;
    for (int i = 31; i >= 0; --i) {
        if (x & (1U << i))
            break;
        count++;
    }
    return count;
}
```

- **RISC-V Assembly Code**

```asm=
# Function: clz (direct translation of the C loop above)
# Input:  a0 - 32-bit value x
# Output: a0 - number of leading zeros (32 when x == 0)
clz:
    li   t0, 31                 # i = 31
    li   t1, 0                  # count = 0
clz_loop:
    blt  t0, zero, clz_done     # stop when i < 0
    li   t2, 1
    sll  t2, t2, t0             # 1 << i
    and  t2, a0, t2             # x & (1 << i)
    bne  t2, zero, clz_done     # break at the first set bit
    addi t1, t1, 1              # count++
    addi t0, t0, -1             # --i
    j    clz_loop
clz_done:
    mv   a0, t1
    ret
```

## Reciprocal Square Root

- **C Code**

```cpp=
float fastInvSqrt(float x) {
    uint32_t i = fp32_to_bits(x);
    int leadingZeros = clz(i);

    // Magic number and shift (simplified for demonstration)
    i = 0x5f3759df - (i >> 1);
    float y = bits_to_fp32(i);

    // Dynamically determine iteration count based on leadingZeros
    int iterations = (32 - leadingZeros) / 4;  // Example
    if (iterations > 3)
        iterations = 3;  // Cap at 3 iterations, which already gives ~98.06% accuracy
    for (int j = 0; j < iterations; j++) {
        y = y * (1.5f - 0.5f * x * y * y);
    }
    return y;
}
```

## Testing Results

### Test Cases

For each case, there are 4 input `float` numbers `a`, `b`, `c`, and `d`. The program will convert each of them into `bf16_t`, pack `a` and `b` into a `pbf16_t` number `p`, pack `c` and `d` into a `pbf16_t` number `q`, and finally perform `a * c` and `b * d` within a register by applying `pbf16_mul` on `p` and `q`.
| | input `a` | input `b` | input `c` | input `d` | answer <br> `{ac, bd}` |
| --- | ----------------------- | ----------------------- | ------------------------ | ------------------------ | ----------------------------------- |
| 1 | 1.200000 (`0x3F99999A`) | 2.312500 (`0x40140000`) | 2.310000 (`0x4013D70A`) | 1.203125 (`0x3F9A0000`) | 2.781250, 2.781250 (`0x40324032`) |
| 2 | 0.000000 (`0x00000000`) | 2.312500 (`0x40140000`) | 2.310000 (`0x4013D70A`) | 1.203125 (`0x3F9A0000`) | 0.000000, 2.781250 (`0x00004032`) |
| 3 | 0.000000 (`0x00000000`) | 2.312500 (`0x40140000`) | -2.310000 (`0xC013D70A`) | -1.203125 (`0xBF9A0000`) | -0.000000, -2.781250 (`0x8000C032`) |

### C Output

- Case 1
![](https://hackmd.io/_uploads/SJujn0fba.png)
- Case 2
- Case 3

### Assembly Output

- Case 1
- Case 2
- Case 3

## Analysis

![](https://hackmd.io/_uploads/S1Pur6zWa.png)

### Pipeline Stage Explanation

[Ripes](https://github.com/mortbopet/Ripes/tree/master) provides different processors to run the code. In this assignment, we choose the **5-stage processor** to execute the program.

![](https://hackmd.io/_uploads/S1o1Opz-T.png)

Take the instruction `jal x1 112 <fp32_to_bf16>` for example. The jump and link (JAL) instruction uses the J-type format, where the J-immediate encodes a signed offset in multiples of 2 bytes. The offset is sign-extended and added to the `pc` to form the jump target address. Jumps can therefore target a ±1 MiB range. JAL stores the address of the instruction following the jump (`pc+4`) into register `rd`. The standard software calling convention uses `x1` (`ra`) as the return address register and `x5` (`t0`) as an alternate link register.
> reference: [RISC-V Spec v2.2 (p.15-16)](https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf)

<!-- ![p12](https://hackmd.io/_uploads/rkRgCyX-6.png) -->
<!-- ![p12](https://hackmd.io/_uploads/HypHCkmZT.png) -->
![p16](https://hackmd.io/_uploads/Sy9cAkXbp.png)

```
rd = pc + 4
pc += offset
```

#### Instruction Fetch (IF) Stage

![JAL IF](https://hackmd.io/_uploads/BJJJOlmZp.png =50%x)

1. The PC of `jal x1 112 <fp32_to_bf16>` in my program is `0x1c`.
2. The instruction fetched from memory is `0x070000ef`.
3. Both PC (`0x0000001c`) and PC+4 (`0x00000020`) are passed to the next stage.

#### Instruction Decode (ID) Stage

![](https://hackmd.io/_uploads/SyAAOlX-a.png =50%x)

1. From the instruction `jal x1 112` (`0x070000ef`), the opcode is `0b1101111`, the destination register is `ra` (`0x01`), and the immediate is 112 (`0x00000070`).
2. PC (`0x0000001c`), PC+4 (`0x00000020`), and rd (`0x01`) are passed to the next stage.

#### Execute (EX) Stage

![](https://hackmd.io/_uploads/SJRVilmb6.png =50%x)

1. The ALU adds PC (`0x0000001c`) and the immediate (`0x00000070`) passed from the previous stage; the result (`0x0000008c`) becomes the next PC.
2. PC+4 (`0x00000020`) and rd (`0x01`) are passed to the next stage.

![](https://hackmd.io/_uploads/H169axQZa.png)

#### Memory Access (MEM) Stage

![](https://hackmd.io/_uploads/HydnClXWp.png =50%x)

1. There is no memory access in the JAL instruction, so the `Wr en` signal of the data memory is cleared.
2. PC+4 (`0x00000020`) and rd (`0x01`) are passed to the next stage.
3. The PC in the IF stage is now `0x0000008c`, which is the first instruction of the `fp32_to_bf16` function.
4. Because the target address of the JAL instruction is determined in the EX stage, the next two instructions have already entered the pipeline by then. Thus, it is necessary to flush both the ID and EX stages to ensure correct execution of the program.
![](https://hackmd.io/_uploads/rJ4MGZ7Wp.png)

#### Write Back (WB) Stage

![](https://hackmd.io/_uploads/r1G2N-Xbp.png)

1. In the WB stage, PC+4 (`0x00000020`) and rd (`0x01`) passed from the previous stages are used to write the register file: PC+4 (`0x00000020`) is written into the `ra` (`0x01`) register.

<!--
## Optimization
### Use temporary registers
### Avoid redundant branches
### Loop unrolling
### Obviate data hazards by reordering instructions
### Eliminate redundant branch instructions by strip-mining-like technique
-->

## Future Work

- Handling Infinity and NaN

<!-- ## References -->