# Optimize sqrtf without FPU > 黃書堯 ## Problem C explanation This `sqrtf` RISC-V code originates from [Quiz3 - Problem C](https://hackmd.io/@sysprog/arch2024-quiz3-sol#Problem-C). Below is an explanation of the code logic. We have a floating point number $x$ and we want to calculate its square root. A floating-point number can be represented as : $x = 2^e\times m$. where $2^e$ is the exponent part of $x$ and $m$ is the mantissa part of $x$. ### Exponent part Since $$ \sqrt{2^e}=2^{\frac{e}{2}} $$ The square root of the exponent part can be easily calculated by right-shifting the exponent by 1 bit. ### Mantissa part We aim to compute $\sqrt{m}$, without using division instructions. Instead, we can achieve this using multiplication operations only with: $$ \sqrt{m}=m\times{\frac{1}{\sqrt{m}}} $$ From the above, we realize that we only need to calculate $\frac{1}{\sqrt{m}}$. ### Newton-Raphson method The formula of the Newton-Raphson method is as follows: $$ X_{n+1} = X_n -\frac{f(x)}{f'(x)} $$ Assuming the function is: $$ f(x) = x^2 - \frac{1}{m} $$ The iterative form becomes: $$ x_{n+1} = x_n + \frac{1}{2}x_n(1 - mx_n^2) $$ ## Remove B extension The B extension part shows as follows: ```c bseti a1, a1, 31 ``` This line sets the hidden leading bit of the mantissa. To remove the B extension dependency, we replace it with `lui` and `or`. ```diff -bseti a1, a1, 31 +lui t0, 0x80000 +or a1, a1, t0 ``` ## Running with rv32emu ### rv32emu To emulate the code, follow these steps: 1. Clone the repository: ```shell $ git clone https://github.com/sysprog21/rv32emu $ cd rv32emu $ make ``` 2. To verify the installation, run: ```shell $ make check ``` Initially, an error might occur: ```shell $ make check Running hello.elf ... [OK] Fetching SHA-1 of prebuilt binaries ... [OK] Checking SHA-1 of prebuilt binaries ... [OK] rv32emu: src/riscv.c:415: rv_create: Assertion `elf && elf_open(elf, attr->data. user.elf_program)' failed. Aborted (core dumped) rv32emu: src/riscv.c:415: rv_create: Assertion `elf && elf_open(elf, attr->data. user.elf_program)' failed. Aborted (core dumped) rv32emu: src/riscv.c:415: rv_create: Assertion `elf && elf_open(elf, attr->data. user.elf_program)' failed. Aborted (core dumped) Running puzzle ... Failed. make: *** [Makefile:354: check] 錯誤 1 ``` To resolve this, change the system language to English (United States) via: Select `Setting` -> `Region & Language` -> `Language English(United States)`. After changing the language, rerun: ```shell $ make check Check the file build/.config for configured items. Running hello.elf ... [OK] Fetching SHA-1 of prebuilt binaries ... [OK] Checking SHA-1 of prebuilt binaries ... SHA-1 verification fails! Re-fetching prebuilt binaries from "rv32emu-prebuilt" ... - 100%[===================>] 4.49M 14.7MB/s in 0.3s Running puzzle ... [OK] Running fcalc ... [OK] Running pi ... [OK] ``` This should pass all tests. ### Performance Measurement To evaluate the performance of the `sqrt.S` implementation, we can utilize the performance counter example available in `rv32emu/tests/perfcounter`. 1. Copy the example to the main directory: ```shell $ cp -r tests/perfcounter . ``` 2. Modify `Makefile`: ```diff .PHONY: clean include ../mk/toolchain.mk -CFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32 -O2 -Wall +CFLAGS = -march=rv32im_zicsr_zifencei -mabi=ilp32 -O2 -Wall OBJS = \ getcycles.o \ getinstret.o \ - sparkle.o \ + sqrtf.o \ main.o \ BIN = perfcount.elf %.o: %.S $(CROSS_COMPILE)gcc $(CFLAGS) -c -o $@ $< %.o: %.c $(CROSS_COMPILE)gcc $(CFLAGS) -c -o $@ $< all: $(BIN) $(BIN): $(OBJS) $(CROSS_COMPILE)gcc -o $@ $^ clean: $(RM) $(BIN) $(OBJS) ``` 3. Update `main.c`: ```diff #include <stdint.h> #include <stdio.h> #include <string.h> extern uint64_t get_cycles(); extern uint64_t get_instret(); /* * Taken from the Sparkle-suite which is a collection of lightweight symmetric * cryptographic algorithms currently in the final round of the NIST * standardization effort. * See https://sparkle-lwc.github.io/ */ -extern void sparkle_asm(unsigned int *state, unsigned int ns); +extern float my_sqrtf(float input); -#define WORDS 12 -#define ROUNDS 7 int main(void) { - unsigned int state[WORDS] = {0}; + float input = 12.4; /* measure cycles */ uint64_t instret = get_instret(); uint64_t oldcount = get_cycles(); - sparkle_asm(state, ROUNDS); + float result = sqrt(x); uint64_t cyclecount = get_cycles() - oldcount; + printf("square root of %f is %f\n", input, result); printf("cycle count: %u\n", (unsigned int) cyclecount); printf("instret: %x\n", (unsigned) (instret & 0xffffffff)); - memset(state, 0, WORDS * sizeof(uint32_t)); - sparkle_asm(state, ROUNDS); - printf("Sparkle state:\n"); - for (int i = 0; i < WORDS; i += 2) - printf("%X %X\n", state[i], state[i + 1]); return 0; } ``` 4. Compile and run the test: ```shell $ cd perfcounter $ make $ ../build/rv32emu perfcount.elf ``` Initially, the compilation might result in an error: ```shell (leetcode) (base) phlab@phlab-System-Product-Name:~/rv32emu/perfcounter$ make riscv-none-elf-gcc -march=rv32im_zicsr_zifencei -mabi=ilp32 -O2 -Wall -c -o getcycles.o getcycles.S riscv-none-elf-gcc -march=rv32im_zicsr_zifencei -mabi=ilp32 -O2 -Wall -c -o getinstret.o getinstret.S riscv-none-elf-gcc -march=rv32im_zicsr_zifencei -mabi=ilp32 -O2 -Wall -c -o sparkle.o sparkle.S riscv-none-elf-gcc -march=rv32im_zicsr_zifencei -mabi=ilp32 -O2 -Wall -c -o main.o main.c main.c:7:13: warning: conflicting types for built-in function 'sqrtf'; expected 'float(float)' [-Wbuiltin-declaration-mismatch] 7 | extern void sqrtf(); | ^~~~~ main.c:4:1: note: 'sqrtf' is declared in header '<math.h>' 3 | #include <string.h> +++ |+#include <math.h> 4 | riscv-none-elf-gcc -march=rv32im_zicsr_zifencei -mabi=ilp32 -O2 -Wall -c -o sqrtf.o sqrtf.S sqrtf.S: Assembler messages: sqrtf.S:24: Error: unrecognized opcode `bseti a1,a1,31', extension `zbs' required make: *** [Makefile:16: sqrtf.o] Error 1 ``` To resolve this, rename the function and file from `sqrtf` to `my_sqrt`. ## Performance of Square root After removing the B extension, the test output is: ```shell ../build/rv32emu perfcount.elf square root of 16.900000 is 4.110961 cycle count: 67 instret: 2d6 inferior exit code 0 ```