# Assignment2: Complete Applications
contributed by [< ukp66482 >](https://github.com/ukp66482/ca2025)
[TOC]
## Project Hierarchy
- Assignment2
- rv32emu — RISC-V emulator
- bin - Executables and configs
-- rv32emu (executable file)
-- .config
- mk -Makefile configurations for compiler and toolchain
-- toolchain\.mk
- hanoi_tower
- rsqrt
- uf8
## rv32emu
The rv32emu source code was cloned from the [original GitHub repository](https://github.com/sysprog21/rv32emu), compiled locally, and the resulting binaries `rv32emu` and `.config` were then placed in this assignment directory.
## How to run RISC-V program on a bare-metal env
I will use q1-UF8 project demonstrates how to run a RISC-V program in a bare-metal environment using the rv32emu emulator.
It need some essential files to run a RISC-V program on bare-metal env such as
- main.c
- start.S
- linker.ld
- Makefile
This setup shows how to compile, link, and execute a program without an operating system, directly interacting with the hardware through startup and assembly routines.
#### linker.ld
``` ld
OUTPUT_ARCH( "riscv" )
ENTRY(_start)
SECTIONS
{
. = 0x10000;
.text : {
*(.text._start)
*(.text)
}
.data : { *(.data) }
.bss : {
__bss_start = .;
*(.bss)
__bss_end = .;
}
.stack (NOLOAD) : {
. = ALIGN(16);
. += 4096;
__stack_top = .;
}
}
```
The `linker.ld` file defines how different sections of the program are mapped into memory in a bare-metal environment.
Since there is no operating system, the linker script explicitly assigns memory regions for code, data, stack, and heap.
It also defines key symbols that are referenced in start.S during program initialization.
| Section | Description |
|----------|-------------|
| `.text` | Stores executable code (instructions).|
| `.data` | Global/static variables with initial values.|
| `.bss` | Uninitialized global/static variables. Must be cleared to zero in startup code. |
| `.stack` | The .stack section reserves memory space for the program stack |
##### .text
The `.text` section **stores all executable instructions of the program**, including both **C functions compiled by the compiler** and **assembly routines**.
Since each assembly file explicitly declares **.text**, all these functions are placed in the same code section during linking.
``` ld
. = 0x10000;
.text : {
*(.text._start)
*(.text)
}
```
The linker first places all code labeled under `.text._start` — which contains the `_start` symbol defined in start.S — at the beginning of memory.
This guarantees that the CPU begins execution from the startup routine immediately after reset.
After the startup code, all other `.text` sections from the remaining object files (such as main.c, uf8.c, and uf8_asm.S) are placed sequentially in memory.
This layout ensures a predictable execution flow: **the program always starts from _start**, which initializes the system (stack, .bss, and .data), and then jumps to the main() function in C.
All assembly and C functions therefore share the same contiguous .text region, **starting at address 0x10000** and expanding upward in memory as code size increases.
##### ENTRY(_start)
The `ENTRY(_start)` directive defines the entry point of the program — the exact memory address where execution begins after the CPU is reset or the emulator starts.
In this project, `_start` is a label defined inside start.S, which contains the startup code responsible for **system initialization** (setting up the stack pointer, clearing .bss, copying .data, and calling main()).
Although the .text._start section already places _start at the beginning of memory, **the ENTRY(_start) directive is still required to explicitly tell the linker: **this is the starting point of the entire program**.
Without this directive, the linker might not know which symbol should be used as the program’s entry point, which could lead to incorrect or undefined startup behavior.
##### .data
The `.data` section **stores all initialized global and static variables**.
``` ld
.data : { *(.data) }
```
It collects all `.data` sections from every object file and places them together in memory right after the `.text` section.
##### .bss
The `.bss` section holds all **uninitialized global and static variables**—that is, variables that are declared but not given an explicit initial value in the source code.
``` ld
.bss : {
__bss_start = .;
*(.bss)
__bss_end = .;
}
```
- `__bss_start` and `__bss_end` are symbols marking the beginning and end of the `.bss` region.
- `*(.bss)` collects all `.bss` sections from the compiled object files and places them in this reserved space.
- During startup, the assembly code in `start.S` uses these two symbols to **clear this memory region to zero**, ensuring that **all uninitialized variables are properly initialized before main() runs**.
>The symbols __bss_start and __bss_end do not have fixed addresses in the source code.
>
>Instead, they are automatically assigned by the linker during the linking stage.
>
>When the .bss section is laid out in memory, the current location counter (.) determines their values, marking the beginning and end of the uninitialized data region.
>
>Their actual addresses can be viewed in the ELF symbol table after linking.
##### ._stack
``` ld
.stack (NOLOAD) : {
. = ALIGN(16);
. += 4096;
__stack_top = .;
}
```
The `.stack` section reserves memory space for the **program stack**, which is used to store function call frames, local variables, and return addresses during execution.
Unlike other sections, it is declared with the `NOLOAD` attribute, **meaning this region is not loaded from the program** (no data is written into it by the linker).
Instead, it simply reserves an empty block of memory in RAM for use at runtime.
#### Start.S
``` asm
# Startup code for bare metal RISC-V
.section .text._start
.globl _start
.type _start, @function
_start:
# Set up stack pointer
la sp, __stack_top
# Clear BSS
la t0, __bss_start
la t1, __bss_end
1:
bge t0, t1, 2f
sw zero, 0(t0)
addi t0, t0, 4
j 1b
2:
# Call main
call main
# Exit syscall (if main returns)
li a7, 93 # exit syscall number
li a0, 0 # exit code
ecall
# Infinite loop (should never reach here)
3:
j 3b
.size _start, .-_start
# Provide BSS markers if linker script doesn't define them
.weak __bss_start
.weak __bss_end
.weak __stack_top
```
This startup code **initializes the stack**, **clears the uninitialized data section (.bss)**, and then **transfers control to the main() function** to begin program execution in a bare-metal RISC-V environment.
#### Makefile
A simplified structure of the Makefile looks like this:
``` makefile
CC = riscv64-unknown-elf-gcc
CFLAGS = -O2 -Wall -march=rv32i -mabi=ilp32
LDFLAGS = -T linker.ld
OBJS = start.o main.o uf8.o uf8_asm.o perfcounter.o
all: uf8.elf
uf8.elf: $(OBJS)
$(CC) $(OBJS) $(LDFLAGS) -o $@
clean:
rm -f $(OBJS) uf8.elf
```
The Makefile performs the following steps:
1. Compiles each `.c` and `.S` file into an object file `.o`.
2. Links all object files together using `linker.ld`, producing `.elf`.
#### Run rv32emu
After building the project, the `compiled ELF file` can be executed on the rv32emu emulator.
The emulator loads the `.elf` binary, starts execution from the _start entry point defined in linker.ld,
and simulates the RISC-V CPU running the program in a bare-metal environment.
>An ELF (Executable and Linkable Format) file is the standard binary format used for executables, object files, and firmware images in Unix-like systems.
>
>It contains not only the program’s machine instructions, but also section information (like .text, .data, .bss), symbol tables, and the entry point address defined by ENTRY(_start) in the linker script.
>
>In a bare-metal RISC-V program, the .elf file is the final output produced after compiling and linking all source files.
The emulator (rv32emu) reads this ELF file, maps each section into simulated memory according to linker.ld, and begins execution at the entry address _start.
## 1.UF8
I compared the performance of the UF8 encode and decode functions implemented in different ways.
Specifically, **I compiled the same C code with different optimization flags** (`-O0`, `-O2`, and `-O3`) and also wrote a `hand-optimized assembly` version.
By testing these versions under the same conditions, I evaluated how compiler optimization levels and manual assembly coding affect execution efficiency on a bare-metal RISC-V environment.
### UF8 encode/decode in assembly
```asm
#////////////////////////////////////////////
#
# uf8_decode_asm Function
#
# input a0 = x, output a0 = decode_result
#
#////////////////////////////////////////////
.globl uf8_decode_asm
.type uf8_decode_asm, %function
uf8_decode_asm:
addi sp, sp, -8
sw ra, 4(sp)
sw a0, 0(sp)
andi t0, a0, 0x0f # t0 = mantissa
srli t1, a0, 4 # t1 = exponent
addi t2, x0, 15
sub t2, t2, t1 # t2 = 15 - exponent
lui t3, 0x8 # 0x8 << 12 = 0x8000 upper 20 bits
addi t3, t3, -1 # 0x8000 - 1 = 0x7FFF
srl t3, t3, t2 # t3 = 0x7FFF >> (15 - exponent)
slli t3, t3, 4 # t3 = t3 << 4 (offset)
sll t0, t0, t1
add t0, t0, t3
addi a0, t0, 0
lw ra, 4(sp)
lw a1, 0(sp)
addi sp, sp, 8
ret # a0 decode_result, a1 input
.size uf8_decode_asm, .-uf8_decode_asm
#////////////////////////////////////////////
#
# uf8_encode_asm Function
#
# input a0 = x, return a0 = encode_result
#
#////////////////////////////////////////////
.globl uf8_encode_asm
.type uf8_encode_asm, %function
uf8_encode_asm:
addi sp, sp, -8
sw ra, 4(sp)
sw a0, 0(sp)
addi t0, x0, 16
blt a0, t0, ENCODE_RET # if x < 16 return a0
jal ra, CLZ
addi t0, x0, 31
sub t0, t0, a0 # t0 = msb = 31 - clz_result
lw a0, 0(sp) # reload a0 = value
addi t1, x0, 0 # t1 = exponent = 0
addi t4, x0, 0 # t4 = overflow = 0
addi t3, x0, 5
blt t0, t3, ENCODE_FIND_E # if msb < 5 goto ENCODE_FIND
addi t1, t0, -4 # exponent = msb - 4
addi t3, x0, 15
blt t1, t3, SKIP # if exponent < 15 goto SKIP
addi t1, x0, 15 # exponent = 15
SKIP:
addi t5, x0, 0 # t5 = e = 0
lw a0, 0(sp) # reload a0 = value
ENCODE_CAL_OVERFLOW:
bge t5, t1, ENCODE_ADJUST # if e >= exponent
slli t4, t4, 1 # overflow = overflow << 1
addi t4, t4, 16 # overflow = overflow + 16
addi t5, t5, 1 # e++
jal x0, ENCODE_CAL_OVERFLOW
ENCODE_ADJUST:
bge x0, t1, ENCODE_FIND_E # if 0 >= exponent
bge a0, t4, ENCODE_FIND_E # if value >= overflow
addi t4, t4, -16 # overflow = overflow - 16
srli t4, t4, 1 # overflow = overflow >> 1
addi t1, t1, -1 # exponent--
jal x0, ENCODE_ADJUST
ENCODE_FIND_E:
addi t3, x0, 15
bge t1, t3, ENCODE_COMBINE # if exponent >= 15 break
slli t6, t4, 1 # t6 = next_overflow = overflow << 1
addi t6, t6, 16 # next_overflow += 16
blt a0, t6, ENCODE_COMBINE # if value < next_overflow break
addi t1, t1, 1 # exponent++
addi t4, t6, 0 # overflow = next_overflow
jal x0, ENCODE_FIND_E
ENCODE_COMBINE:
sub t2, a0, t4 # t2 = value - overflow
srl t2, t2, t1 # t2 = mantissa = (value - overflow) >> exponent
slli t1, t1, 4 # t1 = exponent << 4
or a0, t1, t2
ENCODE_RET:
lw ra, 4(sp)
lw a1, 0(sp)
addi sp, sp, 8
ret
```
### UF8 encode/decode in C
```c
int32_t UF8_DECODE_FN(uint32_t fl)
{
uint32_t mantissa = fl & 0x0f;
uint8_t exponent = fl >> 4;
uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
return (mantissa << exponent) + offset;
}
/* Encode uint32_t to uf8 */
uint32_t UF8_ENCODE_FN(uint32_t value){
/* Use CLZ for fast exponent calculation */
if (value < 16)
return value;
/* Find appropriate exponent using CLZ hint */
int lz = clz(value);
int msb = 31 - lz;
/* Start from a good initial guess */
uint8_t exponent = 0;
uint32_t overflow = 0;
if (msb >= 5) {
/* Estimate exponent - the formula is empirical */
exponent = msb - 4;
if (exponent > 15)
exponent = 15;
/* Calculate overflow for estimated exponent */
for (uint8_t e = 0; e < exponent; e++)
overflow = (overflow << 1) + 16;
/* Adjust if estimate was off */
while (exponent > 0 && value < overflow) {
overflow = (overflow - 16) >> 1;
exponent--;
}
}
/* Find exact exponent */
while (exponent < 15) {
uint32_t next_overflow = (overflow << 1) + 16;
if (value < next_overflow)
break;
overflow = next_overflow;
exponent++;
}
uint8_t mantissa = (value - overflow) >> exponent;
return (exponent << 4) | mantissa;
}
```
### Main.c and Makefile compile flag
#### Main.c:
```c
/* ============= UF8 Encode / Decode Declaration ============= */
/* Assembly versions */
extern uint32_t uf8_decode_asm(uint32_t x);
extern uint32_t uf8_encode_asm(uint32_t x);
/* C versions (linked with different optimization levels) */
extern uint32_t uf8_decode_O0(uint32_t x);
extern uint32_t uf8_encode_O0(uint32_t x);
extern uint32_t uf8_decode_O2(uint32_t x);
extern uint32_t uf8_encode_O2(uint32_t x);
extern uint32_t uf8_decode_O3(uint32_t x);
extern uint32_t uf8_encode_O3(uint32_t x);
```
#### Makefile:
In this setup, the same C source code (uf8.c) is compiled multiple times using different optimization flags (-O0, -O2, and -O3).
```makefile
%.o: %.S
$(AS) $(AFLAGS) $< -o $@
uf8_O0.o: uf8.c
$(CC) $(CFLAGS) -O0 -DUF8_SUFFIX=_O0 -c $< -o $@
uf8_O2.o: uf8.c
$(CC) $(CFLAGS) -O2 -DUF8_SUFFIX=_O2 -c $< -o $@
uf8_O3.o: uf8.c
$(CC) $(CFLAGS) -O3 -DUF8_SUFFIX=_O3 -c $< -o $@
```
### Test Function
The following test function runs a complete encode/decode verification for all possible input values from 0 to 255.
It checks that each value, when decoded and then re-encoded, matches the original input.
If any mismatch occurs, the function prints a detailed log for debugging.
```c
static void test_UF8(uf8_func encode, uf8_func decode) {
for (int i = 0; i != 256; i++) {
uint32_t val = decode(i);
uint32_t back = encode(val);
if (back != i) {
TEST_LOGGER("UF8 Encode/Decode test failed for input: ");
print_hex(i);
TEST_LOGGER(" Decoded value: ");
print_hex(val);
TEST_LOGGER(" Re-encoded value: ");
print_hex(back);
}
}
TEST_LOGGER("UF8 Encode/Decode test passed for all inputs.\n");
}
```
### Result
```
=== UF8 Encode/Decode Tests ===
Test0 : UF8 Encode/Decode by my self assembly
UF8 Encode/Decode test passed for all inputs.
Cycles: 35472
Instructions: 35472
Test1 : UF8 Encode/Decode by O0 C version
UF8 Encode/Decode test passed for all inputs.
Cycles: 67831
Instructions: 67831
Test2 : UF8 Encode/Decode by O2 C version
UF8 Encode/Decode test passed for all inputs.
Cycles: 33467
Instructions: 33467
Test3 : UF8 Encode/Decode by O3 C version
UF8 Encode/Decode test passed for all inputs.
Cycles: 21048
Instructions: 21048
```
## 2.hanoi tower
In this experiment, I implemented the Hanoi Tower problem in three different ways to compare algorithmic approaches and compiler optimizations on a bare-metal RISC-V environment.
The implementations include:
- Recursive C version — the traditional approach that calls itself recursively to move disks between towers.
- Iterative C version using Gray code — a non-recursive method that determines each move based on Gray code sequence generation.
- Iterative Assembly version using Gray code — a hand-optimized RISC-V assembly implementation of the same Gray code–based algorithm for maximum performance.
Each C version was compiled with different compiler optimization levels (-O0, -O2, and -O3) to evaluate how recursion, iteration, and assembly implementation affect instruction count.
### Iterative Hanoi Tower in Assembly
This implementation uses the Gray code algorithm to iteratively determine which disk to move in each step of the Hanoi Tower problem.
``` asm
hanoi_iter_asm:
addi sp, sp, -16
sw s0, 0(sp) # counter i
sw s1, 4(sp) # total moves
sw s2, 8(sp) # limit (2^n)
sw s3, 12(sp)
li t0, 1
sll s2, t0, a0
addi s2, s2, -1 # s2 = 2^n - 1 (total moves)
mv s1, zero # move counter = 0
li s0, 1 # current step = 1
loop_iter:
bgt s0, s2, done # while (i <= 2^n - 1)
# gray(n) = n XOR (n >> 1)
srli t1, s0, 1
xor t2, s0, t1 # gray(i)
# gray(n-1)
addi t3, s0, -1
srli t4, t3, 1
xor t5, t3, t4 # gray(i-1)
# diff = gray(i) ^ gray(i-1)
xor t6, t2, t5
addi s1, s1, 1 # move++
addi s0, s0, 1 # i++
jal zero, loop_iter
```
### Iterative Hanoi Tower in C
This C implementation uses the same Gray code–based iterative algorithm as the assembly version
```c
#ifndef HANOI_SUFFIX
#define HANOI_SUFFIX _default
#endif
#define CONCAT2(x, y) x##y
#define CONCAT(x, y) CONCAT2(x, y)
#include <stdint.h>
uint32_t CONCAT(hanoi_iter, HANOI_SUFFIX)(uint32_t n) {
if (n == 0)
return 0;
uint32_t total_moves = (1u << n) - 1; // 2^n - 1 moves total
uint32_t count = 0;
volatile uint32_t gray_now, gray_prev, diff;
// Gray code iteration:
for (uint32_t i = 1; i <= total_moves; i++) {
// gray(i) = i ^ (i >> 1)
gray_now = i ^ (i >> 1);
gray_prev = (i - 1) ^ ((i - 1) >> 1);
// which disk moves -> bit position that changed
diff = gray_now ^ gray_prev;
count++;
}
return count;
}
```
The volatile keyword (gray_now, gray_prev, diff) is used to prevent compiler optimizations, ensuring every Gray code computation is executed so that **all implementations are compared under the same baseline conditions**.
### Recursive Hanoi Tower in C
This version implements the classic recursive algorithm for solving the Hanoi Tower problem.
Unlike the iterative Gray code versions, this implementation relies entirely on function recursion to control the sequence of moves.
It serves as a reference for evaluating the performance cost of recursion compared with iterative and assembly approaches under the same problem size.
```c
#ifndef HANOI_SUFFIX
#define HANOI_SUFFIX _default
#endif
#define CONCAT2(x, y) x##y
#define CONCAT(x, y) CONCAT2(x, y)
#include <stdint.h>
uint32_t CONCAT(hanoi_rec, HANOI_SUFFIX)(uint32_t n, int from, int to, int aux) {
if (n == 0)
return 0;
uint32_t moves_before = CONCAT(hanoi_rec, HANOI_SUFFIX)(n - 1, from, aux, to);
uint32_t moves_after = CONCAT(hanoi_rec, HANOI_SUFFIX)(n - 1, aux, to, from);
return moves_before + 1 + moves_after;
}
uint32_t CONCAT(hanoi_rec_wrapper, HANOI_SUFFIX)(uint32_t n) {
return CONCAT(hanoi_rec, HANOI_SUFFIX)(n, 1, 3, 2);
}
```
### Main.c and Makefile
#### Main.c
```c
/* ============= Hanoi Tower Solver Declaration ============= */
/* Assembly versions */
extern uint32_t hanoi_iter_asm(uint32_t n);
/* C versions (linked with different optimization levels) */
extern uint32_t hanoi_rec_wrapper_O0(uint32_t n);
extern uint32_t hanoi_rec_wrapper_O2(uint32_t n);
extern uint32_t hanoi_rec_wrapper_O3(uint32_t n);
extern uint32_t hanoi_iter_O0(uint32_t n);
extern uint32_t hanoi_iter_O2(uint32_t n);
extern uint32_t hanoi_iter_O3(uint32_t n);
```
#### Makefile
The Makefile compiles both the recursive (hanoi_rec.c) and iterative (hanoi_iter.c) versions of the Hanoi Tower algorithm using different compiler optimization levels (-O0, -O2, -O3).
``` makefile
%.o: %.S
$(AS) $(AFLAGS) $< -o $@
hanoi_rec_O0.o: hanoi_rec.c
$(CC) $(CFLAGS) -O0 -DHANOI_SUFFIX=_O0 -c $< -o $@
hanoi_rec_O2.o: hanoi_rec.c
$(CC) $(CFLAGS) -O2 -DHANOI_SUFFIX=_O2 -c $< -o $@
hanoi_rec_O3.o: hanoi_rec.c
$(CC) $(CFLAGS) -O3 -DHANOI_SUFFIX=_O3 -c $< -o $@
hanoi_iter_O0.o: hanoi_iter.c
$(CC) $(CFLAGS) -O0 -DHANOI_SUFFIX=_O0 -c $< -o $@
hanoi_iter_O2.o: hanoi_iter.c
$(CC) $(CFLAGS) -O2 -DHANOI_SUFFIX=_O2 -c $< -o $@
hanoi_iter_O3.o: hanoi_iter.c
$(CC) $(CFLAGS) -O3 -DHANOI_SUFFIX=_O3 -c $< -o $@
%.o: %.c
$(CC) $(CFLAGS) $< -o $@ -c
```
### Test function
```c
static void hanoi_test_suite(hanoi_func solver, uint32_t disks)
{
uint32_t moves = solver(disks);
TEST_LOGGER("Disks: ");
print_dec(disks);
TEST_LOGGER("Moves: ");
print_dec(moves);
TEST_LOGGER("\n");
}
```
### Result
#### Number of disk = 3
```
=== Hanoi Tower Tests ===
Testing hanoi_tower (Assembly iterative)...
Disks: 3
Moves: 7
Cycles: 2656
Instructions: 2656
Testing hanoi_tower (C iterative O0)...
Disks: 3
Moves: 7
Cycles: 2770
Instructions: 2770
Testing hanoi_tower (C iterative O2)...
Disks: 3
Moves: 7
Cycles: 2668
Instructions: 2668
Testing hanoi_tower (C iterative O3)...
Disks: 3
Moves: 7
Cycles: 2668
Instructions: 2668
Testing hanoi_tower (C recursive O0)...
Disks: 3
Moves: 7
Cycles: 2967
Instructions: 2967
Testing hanoi_tower (C recursive O2)...
Disks: 3
Moves: 7
Cycles: 2672
Instructions: 2672
Testing hanoi_tower (C recursive O3)...
Disks: 3
Moves: 7
Cycles: 2672
Instructions: 2672
```
#### Number of disks = 7
```
=== Hanoi Tower Tests ===
Testing hanoi_tower (Assembly iterative)...
Disks: 7
Moves: 127
Cycles: 6320
Instructions: 6320
Testing hanoi_tower (C iterative O0)...
Disks: 7
Moves: 127
Cycles: 8234
Instructions: 8234
Testing hanoi_tower (C iterative O2)...
Disks: 7
Moves: 127
Cycles: 6692
Instructions: 6692
Testing hanoi_tower (C iterative O3)...
Disks: 7
Moves: 127
Cycles: 6692
Instructions: 6692
Testing hanoi_tower (C recursive O0)...
Disks: 7
Moves: 127
Cycles: 11671
Instructions: 11671
Testing hanoi_tower (C recursive O2)...
Disks: 7
Moves: 127
Cycles: 6018
Instructions: 6018
Testing hanoi_tower (C recursive O3)...
Disks: 7
Moves: 127
Cycles: 6205
Instructions: 6205
```
#### Number of disks = 15
```
=== Hanoi Tower Tests ===
Testing hanoi_tower (Assembly iterative)...
Disks: 15
Moves: 32767
Cycles: 336521
Instructions: 336521
Testing hanoi_tower (C iterative O0)...
Disks: 15
Moves: 32767
Cycles: 828035
Instructions: 828035
Testing hanoi_tower (C iterative O2)...
Disks: 15
Moves: 32767
Cycles: 434813
Instructions: 434813
Testing hanoi_tower (C iterative O3)...
Disks: 15
Moves: 32767
Cycles: 434813
Instructions: 434813
Testing hanoi_tower (C recursive O0)...
Disks: 15
Moves: 32767
Cycles: 1712752
Instructions: 1712752
Testing hanoi_tower (C recursive O2)...
Disks: 15
Moves: 32767
Cycles: 394962
Instructions: 394962
Testing hanoi_tower (C recursive O3)...
Disks: 15
Moves: 32767
Cycles: 323261
Instructions: 323261
```
## 3.rsqrt
### rsqrt.c
```c
#ifndef RSQRT_SUFFIX
#define RSQRT_SUFFIX
#endif
#define CONCAT2(x, y) x##y
#define CONCAT(x, y) CONCAT2(x, y)
#define RSQRT_FN CONCAT(rsqrt, RSQRT_SUFFIX)
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
static int clz(uint32_t x){
if(!x) return 32;
int n = 0;
if((x & 0xFFFF0000) == 0) { n += 16; x <<= 16; }
if((x & 0xFF000000) == 0) { n += 8; x <<= 8; }
if((x & 0xF0000000) == 0) { n += 4; x <<= 4; }
if((x & 0xC0000000) == 0) { n += 2; x <<= 2; }
if((x & 0x80000000) == 0) { n += 1; }
return n;
}
static uint64_t mul32(uint32_t a, uint32_t b){
uint64_t result = 0;
for(int i = 0; i < 32; i++) {
if (b & (1U << i)) {
result += ((uint64_t)a << i);
}
}
return result;
}
static const uint32_t rsqrt_table[32] = {
65536, 46341, 32768, 23170, 16384,
11585, 8192, 5793, 4096, 2896,
2048, 1448, 1024, 724, 512,
362, 256, 181, 128, 90,
64, 45, 32, 23, 16,
11, 8, 6, 4, 3,
2, 1
};
uint32_t CONCAT(rsqrt, RSQRT_SUFFIX)(uint32_t x){
if (x == 0) return 0xFFFFFFFF;
if (x == 1) return 65536;
int exp = 31 - clz(x);
uint32_t y = rsqrt_table[exp];
// Linear interpolation
if(x > (1u << exp)){
uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0;
uint32_t delta = y - y_next;
uint64_t numer = ((uint64_t)(x - (1u << exp)) << 16);
uint32_t frac = (uint32_t)(numer >> exp);
y = y - (uint32_t)((mul32(delta, frac)) >> 16);
}
// Newton-Raphson iterations
for(int iter = 0; iter < 2; iter++) {
uint64_t y_sq = mul32(y, y);
uint32_t y2 = (uint32_t)(y_sq >> 16);
uint64_t xy2 = (uint64_t)mul32(x, y2);
uint64_t factor_num = (3u << 16);
factor_num = (xy2 >= factor_num) ? 0 : (factor_num - xy2);
y = (uint32_t)((mul32(y, factor_num)) >> 17);
}
return y;
}
```
### Main.c and Makefile
#### Main.c
```c
/* ============= rsqrt Declaration ============= */
extern uint32_t rsqrt_O0(uint32_t x);
extern uint32_t rsqrt_O2(uint32_t x);
extern uint32_t rsqrt_O3(uint32_t x);
extern uint32_t rsqrt_Ofast(uint32_t x);
```
#### Makefile
``` makefile
rsqrt_O0.o: rsqrt.c
$(CC) $(CFLAGS) -O0 -DRSQRT_SUFFIX=_O0 -c $< -o $@
rsqrt_O2.o: rsqrt.c
$(CC) $(CFLAGS) -O2 -DRSQRT_SUFFIX=_O2 -c $< -o $@
rsqrt_O3.o: rsqrt.c
$(CC) $(CFLAGS) -O3 -DRSQRT_SUFFIX=_O3 -c $< -o $@
rsqrt_Ofast.o: rsqrt.c
$(CC) $(CFLAGS) -Ofast -DRSQRT_SUFFIX=_Ofast -c $< -o $@
%.o: %.S
$(AS) $(AFLAGS) $< -o $@
%.o: %.c
$(CC) $(CFLAGS) $< -o $@ -c
```
### Test function
```c
static void run_test(test_func_t test_func, rsqrt_func rsqrt)
{
uint64_t start_cycles, end_cycles, cycles_elapsed;
uint64_t start_instret, end_instret, instret_elapsed;
start_cycles = get_cycles();
start_instret = get_instret();
test_func(rsqrt);
end_cycles = get_cycles();
end_instret = get_instret();
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
TEST_LOGGER(" Cycles: ");
print_dec((unsigned long) cycles_elapsed);
TEST_LOGGER(" Instructions: ");
print_dec((unsigned long) instret_elapsed);
TEST_LOGGER("\n\n");
}
```
### Result
```
Testing rsqrt_O0:
Cycles: 40435
Instructions: 40435
Testing rsqrt_O2:
Cycles: 18369
Instructions: 18369
Testing rsqrt_O3:
Cycles: 18380
Instructions: 18380
Testing rsqrt_Ofast:
Cycles: 18380
Instructions: 18380
```