# Performance tuning for Quake/rv32emu
> 余紹桓, 郭子敬
[Github](https://github.com/bentck719/quake-embedded)
## Goal
To utilize the mechanisms of the B extension to accelerate the execution performance of the [Quake](https://github.com/sysprog21/quake-embedded) program.
> Currently, regarding the B extension, we have only included it in the compilation parameters to allow the compiler to perform automatic optimizations. The rest of our optimizations focus more on reducing the number of RISC-V instructions.
For future optimization directions of this project, utilizing the Vector extension may be a good choice.
## Context of the Task
The [Quake](https://github.com/sysprog21/quake-embedded) program is written in C, and we aim to run it using rv32emu, a RISC-V emulator. Therefore, before execution, the Quake program must first be compiled into RISC-V instructions using the [GNU toolchain for RISC-V](https://github.com/riscv-collab/riscv-gnu-toolchain).
## Prerequisites
The project requires the following tools: [rv32emu](https://github.com/sysprog21/rv32emu/tree/master) and [riscv-gnu-toolchain](https://github.com/riscv-collab/riscv-gnu-toolchain)
### rv32emu Installation
1. Install the required packages:
```shell
sudo apt install libsdl2-dev libsdl2-mixer-dev
```
2. Clone the repository and build the emulator:
```shell
git clone https://github.com/sysprog21/rv32emu.git
cd rv32emu && make
```
After building, `./build/rv32emu` is the executable file for rv32emu.
Additionally, run the following command to obtain files related to Quake and execute it:
```shell
make quake
```
### riscv-gnu-toolchain Installation
> Note: The installation of this toolchain can take approximately 1-3 hours.
:::warning
Alternatively, use prebuilt [xPack](https://xpack-dev-tools.github.io/riscv-none-elf-gcc-xpack/) toolchain.
:::
1. Install the required dependencies:
```shell
sudo apt-get install autoconf automake autotools-dev curl python3 libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev
```
2. Clone the repository and configure the build:
```shell
git clone https://github.com/riscv/riscv-gnu-toolchain
cd riscv-gnu-toolchain/
sudo ./configure --prefix=/opt/riscv --enable-multilib
sudo make
```
3. Add the installed toolchain to your PATH:
```shell
export PATH=$PATH:/opt/riscv/bin
```
## Resolving the Build Issue
The `README.md` in [quake-embedded](https://github.com/sysprog21/quake-embedded) provides the following build instructions:
```shell
git clone https://github.com/sysprog21/quake-embedded && cd quake-embedded
mkdir build && cd build
cmake -DCMAKE_TOOLCHAIN_FILE=../port/boards/rv32emu/toolchain.cmake \
-DCMAKE_BUILD_TYPE=RELEASE -DBOARD_NAME=rv32emu ..
make
```
However, during the initial build, the following error was encountered:
```
CMake Error at CMakeLists.txt:2 (project):
The CMAKE_C_COMPILER:
riscv-none-elf-gcc
is not a full path and was not found in the PATH.
Tell CMake where to find the compiler by setting either the environment
variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
the compiler, or to the compiler name if it is in the PATH.
CMake Error at CMakeLists.txt:2 (project):
The CMAKE_ASM_COMPILER:
riscv-none-elf-gcc
is not a full path and was not found in the PATH.
Tell CMake where to find the compiler by setting either the environment
variable "ASM" or the CMake cache entry CMAKE_ASM_COMPILER to the full path
to the compiler, or to the compiler name if it is in the PATH.
```
I checked my previously installed riscv-gnu-toolchain:
```shell
ls /opt/riscv/bin/
```
It revealed that my toolchain prefix is `riscv64-unknown-elf` instead of `riscv-none-elf`. Therefore, I modify `port/boards/rv32emu/toolchain.cmake`:
```diff
set(CMAKE_SYSTEM_NAME Generic)
set(CMAKE_SYSTEM_PROCESSOR riscv)
if(NOT DEFINED CROSS_COMPILE)
-# set(CROSS_COMPILE "riscv-none-elf-")
+# set(CROSS_COMPILE "riscv64-unknown-elf-")
endif()
set(TOOLCHAIN_PREFIX ${CROSS_COMPILE})
set(CMAKE_TRY_COMPILE_TARGET_TYPE STATIC_LIBRARY)
set(CMAKE_C_COMPILER ${TOOLCHAIN_PREFIX}gcc)
set(CMAKE_CXX_COMPILER ${TOOLCHAIN_PREFIX}g++)
set(CMAKE_ASM_COMPILER ${TOOLCHAIN_PREFIX}gcc)
set(CMAKE_OBJCOPY ${TOOLCHAIN_PREFIX}objcopy)
set(CMAKE_SIZE ${TOOLCHAIN_PREFIX}size)
set(ARCH_FLAGS "-march=rv32imf -mabi=ilp32 -Ofast -flto")
set(CMAKE_C_FLAGS "${ARCH_FLAGS} -std=gnu11 -Wall -ffunction-sections -fdata-sections -Wdouble-promotion")
set(CMAKE_ASM_FLAGS "${ARCH_FLAGS} -x assembler-with-cpp")
set(CMAKE_EXE_LINKER_FLAGS "-Wl,-Map=${CMAKE_CURRENT_BINARY_DIR}/output.map -Wl,--gc-sections -u _printf_float -u _scanf_float -lm")
```
After modifications, build again, and it will be successful:
```shell
mkdir build && cd build
cmake -DCMAKE_TOOLCHAIN_FILE=../port/boards/rv32emu/toolchain.cmake \
-DCMAKE_BUILD_TYPE=RELEASE -DBOARD_NAME=rv32emu ..
make
cp ./port/boards/rv32emu/quake ./
```
## Mac rv32emu
While `make ENABLE_SYSTEM=1 system`, the make file won't download the Linux/Image and prebuild test file spontanously. Therefore, we have to download the files on [rv32emu-prebuilt-release](https://github.com/sysprog21/rv32emu-prebuilt/releases) to fetch the imperative files, .
After downloading the files, we have to move it to the directory ./build/riscv32
```
brew install
brew install dtc
make ENABLE_SYSTEM=1 system
```
## Performance Analysis
Before optimizing Quake, it is essential to identify which functions consume the most execution time. Tools like [gprof](https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_mono/gprof.html) and [perf](https://perfwiki.github.io/main/) are capable of profiling a program to provide insights into the execution time distribution across its functions. Since we are new to these tools, We plan to start by using [gprof](https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_mono/gprof.html), as it was recommended by instructor.
### Current Issue
How to perform performance profiling on the Quake program compiled with riscv64-unknown-elf is a challenge to us. Using gprof to directly analyze the Quake program compiled with riscv64-unknown-elf appears to be infeasible, as riscv64-unknown-elf uses the newlib library and does not include the glibc library required for gprof.
Therefore, we are currently exploring the following approaches:
1. Compile the quake program using GCC. This allows us to use gprof for performance analysis. However, the results might differ from the Quake program compiled with riscv64-unknown-elf.
2. Look for other tools or methods that can perform performance profiling on programs compiled with riscv64-unknown-elf.
### Analysis Method 1: Compile Quake using GCC to support gprof
We later researched the [quakegeneric](https://github.com/erysdren/quakegeneric/tree/main) project and found that it already provides a Makefile for compiling Quake using GCC. Therefore, the first method mentioned earlier can be used. By simply adding the -pg flag to the Makefile and running make:
```diff
CC ?= gcc
AR ?= ar
RM ?= rm -f
ifeq ($(CC), tcc)
override CFLAGS += -DSDL_DISABLE_IMMINTRIN_H=1
endif
ifeq ($(DEBUG), 1)
override CFLAGS += -g3
endif
ifeq ($(PARANOID), 1)
override CFLAGS += -fsanitize=address,undefined -Wall -Wextra -pedantic
override LDFLAGS += -fsanitize=address,undefined
endif
ifeq ($(RELEASE), 1)
override CFLAGS += -O3
override LDFLAGS += -O3
endif
- override CFLAGS += -m32 -std=gnu99 $(shell sdl2-config --cflags
- override LDFLAGS += -m32 -lm $(shell sdl2-config --libs)
+ override CFLAGS += -pg -m32 -std=gnu99 $(shell sdl2-config --cflags
+ override LDFLAGS += -pg -m32 -lm $(shell sdl2-config --libs)
...
```
we can obtain a `quakegeneric` executable capable of generating the gmon.out file.
After playing `quakegeneric` for a while to generate the `gmon.out` file, use `gprof` to analyze it:
```shell
$ gprof ./quakegeneric gmon.out
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
25.60 1.39 1.39 418979 0.00 0.00 D_DrawSpans8
18.60 2.40 1.01 5043 0.20 0.20 QG_DrawFrame
10.87 2.99 0.59 427019 0.00 0.00 D_DrawZSpans
3.13 3.16 0.17 2291446 0.00 0.00 R_EmitEdge
2.39 3.29 0.13 42023 0.00 0.00 R_DrawSurfaceBlock8_mip0
2.39 3.42 0.13 5027 0.03 0.09 R_RecursiveWorldNode
2.03 3.53 0.11 17759842 0.00 0.00 R_LeadingEdge
2.03 3.64 0.11 1023746 0.00 0.00 R_RenderFace
1.84 3.74 0.10 852572 0.00 0.00 D_PolysetDrawSpans8
1.47 3.82 0.08 605440 0.00 0.00 D_PolysetScanLeftEdge
1.47 3.90 0.08 227812 0.00 0.00 SV_HullPointContents
1.47 3.98 0.08 main
1.29 4.05 0.07 16334920 0.00 0.00 R_TrailingEdge
1.10 4.11 0.06 661078 0.00 0.00 D_PolysetCalcGradients
1.10 4.17 0.06 5033 0.01 0.01 CL_RelinkEntities
......
```
It can be observed that the most time-consuming functions in quakegeneric are `D_DrawSpans8`, `QG_DrawFrame`, and `D_DrawZSpans`. Although there may be differences in the code between quakegeneric and quake-embedded, we currently plan to use these results as a reference to identify opportunities for optimization in quake.
However, it seems that the `QG_DrawFrame` function does not exist in the `quake-embedded`. Therefore, we plan to focus on optimizing `D_DrawSpans8` and `D_DrawZSpans` as our initial targets.
### Bottleneck of D_DrawSpans8
`D_DrawSpans8` is a function responsible for applying texture mapping to a horizontal span of pixels on the screen. It calculates texture coordinates and depth values for each pixel in the span, fetches the appropriate texture color, and writes it to the frame buffer.
We discovered that Quake provides the `timedemo` command, which plays a fixed sequence of animations and outputs the average FPS. This allows us to test program performance under identical conditions after making modifications to the code (e.g., `timedemo demo3`).
Using this method, we identified which parts of the function have the greatest impact on FPS. When we commented out the code segment containing `WRITEPDEST`, the FPS improvement was nearly equivalent to modifying the `D_DrawSpans8` function to do nothing (by adding a `return` at the top):
1. The original result of executing timedemo demo3: **35.8 FPS**

2. The result of making the function do nothing (by adding `return` at the top): **49.2 FPS**

4. The result of running with the `WRITEPDEST` code commented out:: **48.5 FPS**

> Note: The above results are from running quake-embedded on rv32emu
Therefore, the `WRITEPDEST` segment is the most critical part of the optimization task:
```c
...
/* This macro is the most time-consuming part of th D_DrawSpans8 function */
#define WRITEPDEST(i) \
{ \
pdest[i] = *(pbase + (s >> 16) + (t >> 16) * _cachewidth); \
s += sstep; \
t += tstep; \
}
WRITEPDEST(-16);
WRITEPDEST(-15);
WRITEPDEST(-14);
WRITEPDEST(-13);
WRITEPDEST(-12);
WRITEPDEST(-11);
WRITEPDEST(-10);
WRITEPDEST(-9);
WRITEPDEST(-8);
WRITEPDEST(-7);
WRITEPDEST(-6);
WRITEPDEST(-5);
WRITEPDEST(-4);
WRITEPDEST(-3);
WRITEPDEST(-2);
WRITEPDEST(-1);
...
```
### Bottleneck of D_DrawZSpans
1. The original result of executing timedemo demo3: **34.6 FPS**

2. The result of making the function do nothing (by adding `return` at the top): **37.3 FPS**

3. The result of running with the most time-consuming part commented out: **36.5 FPS**

```c
/* This is the most time-consuming part of th D_DrawZSpans function */
...
if ((doublecount = count >> 1) > 0)
{
do
{
ltemp = izi >> 16;
izi += izistep;
ltemp |= izi & 0xFFFF0000;
izi += izistep;
*(int *)pdest = ltemp;
pdest += 2;
} while (--doublecount > 0);
}
...
```
## Optimization
:::danger
Before performing optimizations, you MUST figure out the performance bottlencks.
:::
### [D_DrawSpans8](https://github.com/bentck719/quake-embedded/blob/rv32emu/winquake/d_scan.c#L267)
#### Reducing Texture Lookups
To optimize the performance of `D_DrawSpans8`, we propose reducing the number of texture lookups by sharing the retrieved texture value between adjacent pixels. Specifically, **the texture value looked up for one pixel will be reused for the next pixel**, effectively halving the number of texture lookup operations.
This approach **trades off rendering quality** for performance. However, the visual degradation is less noticeable on devices with smaller screen resolutions, such as embedded systems. In these cases, the performance gain is likely to outweigh the quality loss.
We implemented a new macro called `LOW_WRITEPDEST`, which modifies the way pixels are processed in `D_DrawSpans8`. The macro retrieves the texture value once and assigns it to two consecutive pixels, reducing the number of texture lookups by approximately half:
```c
fixed16_t sstep2 = sstep<<1;
fixed16_t tstep2 = tstep<<1;
...
#define LOW_WRITEPDEST(i) \
{ \
pdest[i] = pdest[i+1] = *(pbase + (s >> 16) + (t >> 16) * _cachewidth); \
s += sstep2; \
t += tstep2; \
}
LOW_WRITEPDEST(-16);
LOW_WRITEPDEST(-14);
LOW_WRITEPDEST(-12);
LOW_WRITEPDEST(-10);
LOW_WRITEPDEST(-8);
LOW_WRITEPDEST(-6);
LOW_WRITEPDEST(-4);
LOW_WRITEPDEST(-2);
```
#### Replace four `sb` instructions with a single `sw` instruction.
Additionally, we noticed that since `pdest` is of type `unsigned char*`, when using `LOW_WRITEPDEST`, each write operation results in the risc-v compiler generating an `sb` (store byte) instruction:
```asm
# 120 "d_scan.c" 1
/* START_CRITICAL_SECTION */
# 0 "" 2
#NO_APP
srai t5,a5,16
mul t5,t5,t3
add a5,a1,a5
srai s7,a5,16
add a5,a1,a5
srai s8,a4,16
srai t0,a5,16
add a5,a1,a5
add a4,a2,a4
mul s7,s7,t3
add t5,a3,t5
add t5,t5,s8
lbu t5,0(t5)
srai s8,a4,16
add a4,a2,a4
sb t5,-15(t4) // store the result to &pdest[-15]
sb t5,-16(t4) // store the result to &pdest[-16]
srai t5,a5,16
add a5,a1,a5
mul t0,t0,t3
add s7,a3,s7
add s7,s7,s8
lbu s8,0(s7)
srai s7,a5,16
add a5,a1,a5
sb s8,-13(t4) // store the result to &pdest[-13]
sb s8,-14(t4) // store the result to &pdest[-14]
srai s8,a4,16
add a4,a2,a4
mul t5,t5,t3
add t0,a3,t0
add t0,t0,s8
lbu t0,0(t0)
srai s8,a5,16
add a5,a1,a5
sb t0,-11(t4) // store the result to &pdest[-11]
sb t0,-12(t4) // store the result to &pdest[-12]
srai t0,a4,16
add a4,a2,a4
mul s7,s7,t3
add t5,a3,t5
add t5,t5,t0
lbu t5,0(t5)
srai t0,a5,16
add a5,a1,a5
sb t5,-9(t4) // store the result to &pdest[-9]
sb t5,-10(t4) // store the result to &pdest[-10]
.............. // continue to process &pdest[-8] ~ &pdest[-1]
```
If the number of texture values to store exceeds 4, it is possible to calculate 4 texture values (4 bytes) at once and write them using an `unsigned int*`. This allows the riscv compiler to generate a single `sw` (store word) instruction, replacing four `sb` instructions and achieving a slight performance improvement.
Since `pdest` might not always be 4-byte aligned, we added an alignment check before applying this optimization. This ensures that using `sw` will not lead to undefined behavior or memory misalignment issues.
```c
unsigned char is_pdest_align = !((uintptr_t)pdest & 0x3);
...
#define LOW_WRITEPDEST(i) \
{ pdest[i] = pdest[i+1] = *(pbase + (s >> 16) + (t >> 16) * _cachewidth); s += sstep<<1; t += tstep<<1; }
#define LOW_WRITEPDEST_ALIGN(i) \
{ \
tmp = (unsigned)*(pbase + (s >> 16) + (t >> 16) * _cachewidth); \
s += sstep2; \
t += tstep2; \
tmp |= (unsigned)*(pbase + (s >> 16) + (t >> 16) * _cachewidth) << 16; \
s += sstep2; \
t += tstep2; \
tmp |= tmp << 8; \
*(uintptr_t *)(pdest + i) = tmp; \
}
if(is_pdest_align){
LOW_WRITEPDEST_ALIGN(-16);
LOW_WRITEPDEST_ALIGN(-12);
LOW_WRITEPDEST_ALIGN(-8);
LOW_WRITEPDEST_ALIGN(-4);
}
else{
LOW_WRITEPDEST(-16);
LOW_WRITEPDEST(-14);
LOW_WRITEPDEST(-12);
LOW_WRITEPDEST(-10);
LOW_WRITEPDEST(-8);
LOW_WRITEPDEST(-6);
LOW_WRITEPDEST(-4);
LOW_WRITEPDEST(-2);
}
...
```
> Note: We attempted an alternative approach where we preprocessed a few pixels to align `pdest` before dividing the remaining `pspan->count` into groups of 16 pixels and a `spancount`. This would have eliminated the need to repeatedly check alignment during the subsequent calculations. However, this approach resulted in lower FPS, likely due to the overhead introduced by the preprocessing steps. Consequently, we opted to simply check for alignment during each iteration without any special handling for unaligned cases.
The modified compiled RISC-V code (the mul instruction seems to have been optimized elsewhere by the compiler):
```asm
# 134 "d_scan.c" 1
/* START_CRITICAL_SECTION */
# 0 "" 2
#NO_APP
lbu t4,0(t4)
lbu s11,0(a2)
slli a2,t4,16
or a2,a2,s11
slli t4,a2,8
add t4,t4,a2
sw t4,-16(a4) // store the result to &pdest[-16] ~ &pdest[-13]
lbu a2,0(t3)
lbu t1,0(t1)
slli a2,a2,16
or a2,a2,t1
slli t1,a2,8
add a2,t1,a2
sw a2,-12(a4) // store the result to &pdest[-12] ~ &pdest[-9]
lbu a2,0(a6)
lbu a0,0(a0)
slli a2,a2,16
or a2,a2,a0
slli a0,a2,8
add a2,a0,a2
sw a2,-8(a4) // store the result to &pdest[-8] ~ &pdest[-5]
lbu a2,0(t5)
lbu a1,0(a1)
slli a2,a2,16
or a2,a2,a1
slli a1,a2,8
add a2,a1,a2
sw a2,-4(a4) // store the result to &pdest[-4] ~ &pdest[-1]
#APP
# 139 "d_scan.c" 1
/* END_CRITICAL_SECTION */
```
#### Results
The original result of executing timedemo demo3: **35.5 FPS**

After [modifying D_DrawSpans8](https://github.com/bentck719/quake-embedded/blob/rv32emu/winquake/d_scan.c#L267) (with the tradeoff of reduced visual quality): **39.9 FPS**

The FPS performance improved by approximately 10%, with the trade-off being reduced visual quality.
### D_DrawZSpans
Haven't figured out a way to optimize yet.
## Reference
[(Github) RISC-V GNU Toolchain](https://github.com/riscv-collab/riscv-gnu-toolchain)
[(Github) quake-embedded](https://github.com/sysprog21/quake-embedded)
[(Github) rv32emu](https://github.com/sysprog21/rv32emu)
[(Github) quakegeneric](https://github.com/erysdren/quakegeneric)
[RISC-V Bit-manipulation A, B, C and S Extensions](https://five-embeddev.com/riscv-bitmanip/1.0.0/bitmanip.html#zbs)