余紹桓, 郭子敬
The goal of this project is to use the B extension to accelerate the Quake program.
So far, we have only enabled the B extension through the compilation parameters, which lets the compiler apply it automatically. Our other optimizations focus on reducing the number of RISC-V instructions.
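As an illustration of what automatic use of the B extension can look like, the sketch below shows three small C helpers whose bodies correspond to single Zbb instructions (cpop, andn, min). The helper names are hypothetical and are not taken from Quake. Whether the compiler actually emits these instructions depends on the toolchain and on the -march string (for example rv32im_zbb, which is an assumption here, not the project's documented setting).

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; none of them come from Quake. */

/* Count set bits. With Zbb enabled, a compiler may lower this to cpop. */
static inline int popcount32(uint32_t x)
{
    return __builtin_popcount(x);
}

/* AND with the complement of b. A candidate for the Zbb andn instruction. */
static inline uint32_t and_not(uint32_t a, uint32_t b)
{
    return a & ~b;
}

/* Signed minimum. A candidate for the Zbb min instruction. */
static inline int32_t min_i32(int32_t a, int32_t b)
{
    return a < b ? a : b;
}
```

Without Zbb, the same source still compiles, but each helper expands to a longer sequence of base-ISA instructions, which is why enabling the extension alone can shorten the generated code.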
As a future optimization direction for this project, using the Vector extension may be a good choice.
The Quake program is written in C, and we aim to run it using rv32emu, a RISC-V emulator. Therefore, before execution, the Quake program must first be compiled into RISC-V instructions using the GNU toolchain for RISC-V.
The project requires the following tools: rv32emu and riscv-gnu-toolchain
After building, ./build/rv32emu is the executable file for rv32emu.
Additionally, run the following command to obtain files related to Quake and execute it:
Note: The installation of this toolchain can take approximately 1-3 hours.
Alternatively, use the prebuilt xPack toolchain.
The README.md in quake-embedded provides the following build instructions:
However, during the initial build, the following error was encountered:
I checked my previously installed riscv-gnu-toolchain:
This revealed that my toolchain prefix is riscv64-unknown-elf instead of riscv-none-elf. Therefore, I modified port/boards/rv32emu/toolchain.cmake:
After this modification, building again succeeds:
When running make ENABLE_SYSTEM=1 system, the Makefile does not automatically download the Linux/Image and the prebuilt test files. Therefore, we have to download the required files from rv32emu-prebuilt-release.
After downloading the files, we have to move them to the directory ./build/riscv32
Before optimizing Quake, it is essential to identify which functions consume the most execution time. Tools like gprof and perf can profile a program and show how execution time is distributed across its functions. Since we are new to these tools, we plan to start with gprof, as it was recommended by the instructor.
Profiling the Quake program compiled with riscv64-unknown-elf is a challenge for us. Analyzing that build directly with gprof appears to be infeasible, as riscv64-unknown-elf uses the newlib library and does not include the glibc library required for gprof.
Therefore, we are currently exploring the following approaches:
We later examined the quakegeneric project and found that it already provides a Makefile for compiling Quake using GCC. Therefore, the first method mentioned earlier can be used. By simply adding the -pg flag to the Makefile and running make:
we can obtain a quakegeneric executable capable of generating the gmon.out file.
After playing quakegeneric for a while to generate the gmon.out file, use gprof to analyze it:
The most time-consuming functions in quakegeneric are D_DrawSpans8, QG_DrawFrame, and D_DrawZSpans. Although the code of quakegeneric and quake-embedded may differ, we currently plan to use these results as a reference to identify optimization opportunities in Quake.
However, the QG_DrawFrame function does not seem to exist in quake-embedded. Therefore, we plan to focus on optimizing D_DrawSpans8 and D_DrawZSpans as our initial targets.
D_DrawSpans8 is a function responsible for applying texture mapping to a horizontal span of pixels on the screen. It calculates texture coordinates and depth values for each pixel in the span, fetches the appropriate texture color, and writes it to the frame buffer.
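The core of such a span renderer can be sketched as follows. This is a simplified, hypothetical loop and not the actual D_DrawSpans8 code: it assumes 16.16 fixed-point texture coordinates that advance linearly along the span, and it leaves out the perspective correction and depth handling of the real function. All names are chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical, simplified span renderer (not the real D_DrawSpans8).
 * s and t are 16.16 fixed-point texture coordinates that advance by
 * sstep and tstep per pixel; tex_width is the texture row stride in bytes. */
static void draw_span_sketch(uint8_t *pdest, const uint8_t *pbase,
                             int tex_width, int32_t s, int32_t t,
                             int32_t sstep, int32_t tstep, int count)
{
    while (count-- > 0) {
        /* The integer parts of (s, t) select the texel to copy. */
        *pdest++ = pbase[(s >> 16) + (t >> 16) * tex_width];
        s += sstep;
        t += tstep;
    }
}
```

Each iteration performs one texture load and one byte store, so the per-pixel cost of the function is dominated by this inner loop.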
We discovered that Quake provides the timedemo command, which plays a fixed sequence of animations and outputs the average FPS. This allows us to test program performance under identical conditions after modifying the code (e.g., timedemo demo3).
Using this method, we identified which parts of the function have the greatest impact on FPS. When we commented out the code segment containing WRITEPDEST, the FPS improvement was nearly equivalent to making the D_DrawSpans8 function do nothing (by adding a return at the top):
D_DrawSpans8 does nothing (return at the top): 49.2 FPS
WRITEPDEST code commented out: 48.5 FPS
Note: The above results are from running quake-embedded on rv32emu.
Therefore, the WRITEPDEST segment is the most critical part of the optimization task:
return at the top): 37.3 FPS
Before performing optimizations, you MUST figure out the performance bottlenecks.
To optimize the performance of D_DrawSpans8, we propose reducing the number of texture lookups by sharing the retrieved texture value between adjacent pixels. Specifically, the texture value looked up for one pixel is reused for the next pixel, effectively halving the number of texture lookup operations.
This approach trades rendering quality for performance. However, the visual degradation is less noticeable on devices with smaller screen resolutions, such as embedded systems. In these cases, the performance gain is likely to outweigh the quality loss.
We implemented a new macro called LOW_WRITEPDEST, which modifies the way pixels are processed in D_DrawSpans8. The macro retrieves the texture value once and assigns it to two consecutive pixels, reducing the number of texture lookups by approximately half:
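The idea of sharing one lookup between two pixels can be sketched as a plain function. This is a hypothetical illustration under the assumption of linear 16.16 fixed-point stepping, and it is not the LOW_WRITEPDEST macro itself; the function and parameter names are invented for the example.

```c
#include <stdint.h>

/* Hypothetical sketch of the shared-lookup idea (not the LOW_WRITEPDEST
 * macro). One texel is fetched and written to two adjacent pixels, so the
 * texture coordinates advance by two steps per fetch. */
static void draw_span_shared(uint8_t *pdest, const uint8_t *pbase,
                             int tex_width, int32_t s, int32_t t,
                             int32_t sstep, int32_t tstep, int count)
{
    while (count >= 2) {
        uint8_t texel = pbase[(s >> 16) + (t >> 16) * tex_width];
        pdest[0] = texel;   /* pixel that owns the lookup */
        pdest[1] = texel;   /* neighbor reuses the same texel */
        pdest += 2;
        s += 2 * sstep;
        t += 2 * tstep;
        count -= 2;
    }
    if (count == 1) {
        /* An odd trailing pixel gets its own lookup. */
        *pdest = pbase[(s >> 16) + (t >> 16) * tex_width];
    }
}
```

For a span of n pixels this performs about n/2 texture loads instead of n, at the cost of halving the horizontal texture resolution.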
Replacing four sb instructions with a single sw instruction
Additionally, we noticed that since pdest is of type unsigned char*, each write operation in LOW_WRITEPDEST results in the RISC-V compiler generating an sb (store byte) instruction:
If the number of texture values to store exceeds 4, it is possible to calculate 4 texture values (4 bytes) at once and write them using an unsigned int*. This allows the RISC-V compiler to generate a single sw (store word) instruction, replacing four sb instructions and achieving a slight performance improvement.
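A minimal sketch of the packing step is shown below. The helper names are hypothetical, and the store uses memcpy together with the GCC/Clang builtin __builtin_assume_aligned instead of a direct unsigned int* cast, which is a substitution made here to keep the example free of aliasing concerns. With the alignment known, compilers typically lower the 4-byte copy to one word store, but that is not guaranteed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pack four 8-bit texels into one 32-bit word.
 * Assumes a little-endian target (as RISC-V is), so p0 ends up at the
 * lowest address when the word is stored. */
static inline uint32_t pack4(uint8_t p0, uint8_t p1, uint8_t p2, uint8_t p3)
{
    return (uint32_t)p0 | ((uint32_t)p1 << 8) |
           ((uint32_t)p2 << 16) | ((uint32_t)p3 << 24);
}

/* Hypothetical helper: store four pixels with one 4-byte copy.
 * The caller must guarantee that dest is 4-byte aligned. */
static inline void store4_aligned(uint8_t *dest, uint32_t word)
{
    uint32_t *p = __builtin_assume_aligned(dest, 4);
    memcpy(p, &word, sizeof word);
}
```

In a span loop, four texels would be looked up first, combined with pack4, and then written with one call to store4_aligned instead of four separate byte writes.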
Since pdest might not always be 4-byte aligned, we added an alignment check before applying this optimization. This ensures that sw is only used on 4-byte-aligned addresses and therefore cannot cause undefined behavior or misaligned memory accesses.
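The per-iteration alignment check can be sketched like this. The function is a hypothetical illustration rather than the project's code: it tests the two low address bits of pdest and takes the word-sized path only when both are zero, falling back to byte stores otherwise.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write four pixels at pdest.
 * Returns 1 if the word-sized path was taken, 0 for the bytewise fallback. */
static int write4_checked(uint8_t *pdest, const uint8_t px[4])
{
    if (((uintptr_t)pdest & 3u) == 0) {
        /* Aligned: a 4-byte copy with known alignment, which the compiler
         * may lower to a single word store. */
        memcpy(__builtin_assume_aligned(pdest, 4), px, 4);
        return 1;
    }
    /* Unaligned: fall back to four byte stores. */
    pdest[0] = px[0];
    pdest[1] = px[1];
    pdest[2] = px[2];
    pdest[3] = px[3];
    return 0;
}
```

The check costs one AND and one branch per group of four pixels, which is small compared with the three stores saved on the aligned path.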
Note: We attempted an alternative approach where we preprocessed a few pixels to align pdest before dividing the remaining pspan->count into groups of 16 pixels and a spancount. This would have eliminated the need to repeatedly check alignment during the subsequent calculations. However, this approach resulted in lower FPS, likely due to the overhead introduced by the preprocessing steps. Consequently, we opted to simply check for alignment during each iteration without any special handling for unaligned cases.
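The bookkeeping behind that alternative can be sketched as follows. The struct and function names are hypothetical, and this is one reading of the approach described in the note, not the code that was actually measured: it computes how many leading pixels must be written bytewise to reach a 4-byte boundary, then splits the rest into groups of 16 and a remainder.

```c
#include <stdint.h>

/* Hypothetical result type: how a span of pixels is divided. */
typedef struct {
    int head;      /* leading pixels written bytewise until pdest is aligned */
    int groups16;  /* number of full 16-pixel groups after alignment */
    int tail;      /* leftover pixels after the last full group */
} span_split_t;

/* Hypothetical helper: split count pixels starting at pdest. */
static span_split_t split_span(const uint8_t *pdest, int count)
{
    span_split_t r;
    /* Bytes needed to reach the next 4-byte boundary (0 if already there). */
    int head = (int)((4u - ((uintptr_t)pdest & 3u)) & 3u);
    if (head > count)
        head = count;
    r.head = head;
    count -= head;
    r.groups16 = count / 16;
    r.tail = count % 16;
    return r;
}
```

Because 16 is a multiple of 4, every group after the head starts on an aligned address, so no further checks are needed inside the groups. The extra setup per span is the overhead that the note suspects of lowering FPS.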
The modified compiled RISC-V code (the mul instruction seems to have been optimized elsewhere by the compiler):
The original result of executing timedemo demo3: 35.5 FPS
After modifying D_DrawSpans8 (with the tradeoff of reduced visual quality): 39.9 FPS
FPS improved by approximately 12% (from 35.5 to 39.9), with the trade-off being reduced visual quality.
We have not yet figured out a way to optimize D_DrawZSpans.
(GitHub) RISC-V GNU Toolchain
(GitHub) quake-embedded
(GitHub) rv32emu
(GitHub) quakegeneric
RISC-V Bit-manipulation A, B, C and S Extensions