余紹桓, 郭子敬
To utilize the mechanisms of the B extension to accelerate the execution performance of the Quake program.
Currently, regarding the B extension, we have only included it in the compilation parameters to allow the compiler to perform automatic optimizations. The rest of our optimizations focus more on reducing the number of RISC-V instructions.
For future optimization directions of this project, utilizing the Vector extension may be a good choice.
The Quake program is written in C, and we aim to run it using rv32emu, a RISC-V emulator. Therefore, before execution, the Quake program must first be compiled into RISC-V instructions using the GNU toolchain for RISC-V.
The project requires the following tools: rv32emu and riscv-gnu-toolchain
sudo apt install libsdl2-dev libsdl2-mixer-dev
git clone https://github.com/sysprog21/rv32emu.git
cd rv32emu && make
After building, ./build/rv32emu
is the executable file for rv32emu.
Additionally, run the following command to obtain files related to Quake and execute it:
make quake
Note: The installation of this toolchain can take approximately 1-3 hours.
Alternatively, use prebuilt xPack toolchain.
sudo apt-get install autoconf automake autotools-dev curl python3 libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev
git clone https://github.com/riscv/riscv-gnu-toolchain
cd riscv-gnu-toolchain/
sudo ./configure --prefix=/opt/riscv --enable-multilib
sudo make
export PATH=$PATH:/opt/riscv/bin
The README.md
in quake-embedded provides the following build instructions:
git clone https://github.com/sysprog21/quake-embedded && cd quake-embedded
mkdir build && cd build
cmake -DCMAKE_TOOLCHAIN_FILE=../port/boards/rv32emu/toolchain.cmake \
-DCMAKE_BUILD_TYPE=RELEASE -DBOARD_NAME=rv32emu ..
make
However, during the initial build, the following error was encountered:
CMake Error at CMakeLists.txt:2 (project):
The CMAKE_C_COMPILER:
riscv-none-elf-gcc
is not a full path and was not found in the PATH.
Tell CMake where to find the compiler by setting either the environment
variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
the compiler, or to the compiler name if it is in the PATH.
CMake Error at CMakeLists.txt:2 (project):
The CMAKE_ASM_COMPILER:
riscv-none-elf-gcc
is not a full path and was not found in the PATH.
Tell CMake where to find the compiler by setting either the environment
variable "ASM" or the CMake cache entry CMAKE_ASM_COMPILER to the full path
to the compiler, or to the compiler name if it is in the PATH.
I checked my previously installed riscv-gnu-toolchain:
ls /opt/riscv/bin/
It revealed that my toolchain prefix is riscv64-unknown-elf
instead of riscv-none-elf
. Therefore, I modify port/boards/rv32emu/toolchain.cmake
:
set(CMAKE_SYSTEM_NAME Generic)
set(CMAKE_SYSTEM_PROCESSOR riscv)
if(NOT DEFINED CROSS_COMPILE)
-# set(CROSS_COMPILE "riscv-none-elf-")
+# set(CROSS_COMPILE "riscv64-unknown-elf-")
endif()
set(TOOLCHAIN_PREFIX ${CROSS_COMPILE})
set(CMAKE_TRY_COMPILE_TARGET_TYPE STATIC_LIBRARY)
set(CMAKE_C_COMPILER ${TOOLCHAIN_PREFIX}gcc)
set(CMAKE_CXX_COMPILER ${TOOLCHAIN_PREFIX}g++)
set(CMAKE_ASM_COMPILER ${TOOLCHAIN_PREFIX}gcc)
set(CMAKE_OBJCOPY ${TOOLCHAIN_PREFIX}objcopy)
set(CMAKE_SIZE ${TOOLCHAIN_PREFIX}size)
set(ARCH_FLAGS "-march=rv32imf -mabi=ilp32 -Ofast -flto")
set(CMAKE_C_FLAGS "${ARCH_FLAGS} -std=gnu11 -Wall -ffunction-sections -fdata-sections -Wdouble-promotion")
set(CMAKE_ASM_FLAGS "${ARCH_FLAGS} -x assembler-with-cpp")
set(CMAKE_EXE_LINKER_FLAGS "-Wl,-Map=${CMAKE_CURRENT_BINARY_DIR}/output.map -Wl,--gc-sections -u _printf_float -u _scanf_float -lm")
After modifications, build again, and it will be successful:
mkdir build && cd build
cmake -DCMAKE_TOOLCHAIN_FILE=../port/boards/rv32emu/toolchain.cmake \
-DCMAKE_BUILD_TYPE=RELEASE -DBOARD_NAME=rv32emu ..
make
cp ./port/boards/rv32emu/quake ./
While make ENABLE_SYSTEM=1 system
, the make file won't download the Linux/Image and prebuild test file spontanously. Therefore, we have to download the files on rv32emu-prebuilt-release to fetch the imperative files, .
After downloading the files, we have to move it to the directory ./build/riscv32
brew install
brew install dtc
make ENABLE_SYSTEM=1 system
Before optimizing Quake, it is essential to identify which functions consume the most execution time. Tools like gprof and perf are capable of profiling a program to provide insights into the execution time distribution across its functions. Since we are new to these tools, We plan to start by using gprof, as it was recommended by instructor.
How to perform performance profiling on the Quake program compiled with riscv64-unknown-elf is a challenge to us. Using gprof to directly analyze the Quake program compiled with riscv64-unknown-elf appears to be infeasible, as riscv64-unknown-elf uses the newlib library and does not include the glibc library required for gprof.
Therefore, we are currently exploring the following approaches:
We later researched the quakegeneric project and found that it already provides a Makefile for compiling Quake using GCC. Therefore, the first method mentioned earlier can be used. By simply adding the -pg flag to the Makefile and running make:
CC ?= gcc
AR ?= ar
RM ?= rm -f
ifeq ($(CC), tcc)
override CFLAGS += -DSDL_DISABLE_IMMINTRIN_H=1
endif
ifeq ($(DEBUG), 1)
override CFLAGS += -g3
endif
ifeq ($(PARANOID), 1)
override CFLAGS += -fsanitize=address,undefined -Wall -Wextra -pedantic
override LDFLAGS += -fsanitize=address,undefined
endif
ifeq ($(RELEASE), 1)
override CFLAGS += -O3
override LDFLAGS += -O3
endif
- override CFLAGS += -m32 -std=gnu99 $(shell sdl2-config --cflags
- override LDFLAGS += -m32 -lm $(shell sdl2-config --libs)
+ override CFLAGS += -pg -m32 -std=gnu99 $(shell sdl2-config --cflags
+ override LDFLAGS += -pg -m32 -lm $(shell sdl2-config --libs)
...
we can obtain a quakegeneric
executable capable of generating the gmon.out file.
After playing quakegeneric
for a while to generate the gmon.out
file, use gprof
to analyze it:
$ gprof ./quakegeneric gmon.out
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
25.60 1.39 1.39 418979 0.00 0.00 D_DrawSpans8
18.60 2.40 1.01 5043 0.20 0.20 QG_DrawFrame
10.87 2.99 0.59 427019 0.00 0.00 D_DrawZSpans
3.13 3.16 0.17 2291446 0.00 0.00 R_EmitEdge
2.39 3.29 0.13 42023 0.00 0.00 R_DrawSurfaceBlock8_mip0
2.39 3.42 0.13 5027 0.03 0.09 R_RecursiveWorldNode
2.03 3.53 0.11 17759842 0.00 0.00 R_LeadingEdge
2.03 3.64 0.11 1023746 0.00 0.00 R_RenderFace
1.84 3.74 0.10 852572 0.00 0.00 D_PolysetDrawSpans8
1.47 3.82 0.08 605440 0.00 0.00 D_PolysetScanLeftEdge
1.47 3.90 0.08 227812 0.00 0.00 SV_HullPointContents
1.47 3.98 0.08 main
1.29 4.05 0.07 16334920 0.00 0.00 R_TrailingEdge
1.10 4.11 0.06 661078 0.00 0.00 D_PolysetCalcGradients
1.10 4.17 0.06 5033 0.01 0.01 CL_RelinkEntities
......
It can be observed that the most time-consuming functions in quakegeneric are D_DrawSpans8
, QG_DrawFrame
, and D_DrawZSpans
. Although there may be differences in the code between quakegeneric and quake-embedded, we currently plan to use these results as a reference to identify opportunities for optimization in quake.
However, it seems that the QG_DrawFrame
function does not exist in the quake-embedded
. Therefore, we plan to focus on optimizing D_DrawSpans8
and D_DrawZSpans
as our initial targets.
D_DrawSpans8
is a function responsible for applying texture mapping to a horizontal span of pixels on the screen. It calculates texture coordinates and depth values for each pixel in the span, fetches the appropriate texture color, and writes it to the frame buffer.
We discovered that Quake provides the timedemo
command, which plays a fixed sequence of animations and outputs the average FPS. This allows us to test program performance under identical conditions after making modifications to the code (e.g., timedemo demo3
).
Using this method, we identified which parts of the function have the greatest impact on FPS. When we commented out the code segment containing WRITEPDEST
, the FPS improvement was nearly equivalent to modifying the D_DrawSpans8
function to do nothing (by adding a return
at the top):
return
at the top): 49.2 FPSWRITEPDEST
code commented out:: 48.5 FPSNote: The above results are from running quake-embedded on rv32emu
Therefore, the WRITEPDEST
segment is the most critical part of the optimization task:
...
/* This macro is the most time-consuming part of th D_DrawSpans8 function */
#define WRITEPDEST(i) \
{ \
pdest[i] = *(pbase + (s >> 16) + (t >> 16) * _cachewidth); \
s += sstep; \
t += tstep; \
}
WRITEPDEST(-16);
WRITEPDEST(-15);
WRITEPDEST(-14);
WRITEPDEST(-13);
WRITEPDEST(-12);
WRITEPDEST(-11);
WRITEPDEST(-10);
WRITEPDEST(-9);
WRITEPDEST(-8);
WRITEPDEST(-7);
WRITEPDEST(-6);
WRITEPDEST(-5);
WRITEPDEST(-4);
WRITEPDEST(-3);
WRITEPDEST(-2);
WRITEPDEST(-1);
...
return
at the top): 37.3 FPS/* This is the most time-consuming part of th D_DrawZSpans function */
...
if ((doublecount = count >> 1) > 0)
{
do
{
ltemp = izi >> 16;
izi += izistep;
ltemp |= izi & 0xFFFF0000;
izi += izistep;
*(int *)pdest = ltemp;
pdest += 2;
} while (--doublecount > 0);
}
...
Before performing optimizations, you MUST figure out the performance bottlencks.
To optimize the performance of D_DrawSpans8
, we propose reducing the number of texture lookups by sharing the retrieved texture value between adjacent pixels. Specifically, the texture value looked up for one pixel will be reused for the next pixel, effectively halving the number of texture lookup operations.
This approach trades off rendering quality for performance. However, the visual degradation is less noticeable on devices with smaller screen resolutions, such as embedded systems. In these cases, the performance gain is likely to outweigh the quality loss.
We implemented a new macro called LOW_WRITEPDEST
, which modifies the way pixels are processed in D_DrawSpans8
. The macro retrieves the texture value once and assigns it to two consecutive pixels, reducing the number of texture lookups by approximately half:
fixed16_t sstep2 = sstep<<1;
fixed16_t tstep2 = tstep<<1;
...
#define LOW_WRITEPDEST(i) \
{ \
pdest[i] = pdest[i+1] = *(pbase + (s >> 16) + (t >> 16) * _cachewidth); \
s += sstep2; \
t += tstep2; \
}
LOW_WRITEPDEST(-16);
LOW_WRITEPDEST(-14);
LOW_WRITEPDEST(-12);
LOW_WRITEPDEST(-10);
LOW_WRITEPDEST(-8);
LOW_WRITEPDEST(-6);
LOW_WRITEPDEST(-4);
LOW_WRITEPDEST(-2);
sb
instructions with a single sw
instruction.Additionally, we noticed that since pdest
is of type unsigned char*
, when using LOW_WRITEPDEST
, each write operation results in the risc-v compiler generating an sb
(store byte) instruction:
# 120 "d_scan.c" 1
/* START_CRITICAL_SECTION */
# 0 "" 2
#NO_APP
srai t5,a5,16
mul t5,t5,t3
add a5,a1,a5
srai s7,a5,16
add a5,a1,a5
srai s8,a4,16
srai t0,a5,16
add a5,a1,a5
add a4,a2,a4
mul s7,s7,t3
add t5,a3,t5
add t5,t5,s8
lbu t5,0(t5)
srai s8,a4,16
add a4,a2,a4
sb t5,-15(t4) // store the result to &pdest[-15]
sb t5,-16(t4) // store the result to &pdest[-16]
srai t5,a5,16
add a5,a1,a5
mul t0,t0,t3
add s7,a3,s7
add s7,s7,s8
lbu s8,0(s7)
srai s7,a5,16
add a5,a1,a5
sb s8,-13(t4) // store the result to &pdest[-13]
sb s8,-14(t4) // store the result to &pdest[-14]
srai s8,a4,16
add a4,a2,a4
mul t5,t5,t3
add t0,a3,t0
add t0,t0,s8
lbu t0,0(t0)
srai s8,a5,16
add a5,a1,a5
sb t0,-11(t4) // store the result to &pdest[-11]
sb t0,-12(t4) // store the result to &pdest[-12]
srai t0,a4,16
add a4,a2,a4
mul s7,s7,t3
add t5,a3,t5
add t5,t5,t0
lbu t5,0(t5)
srai t0,a5,16
add a5,a1,a5
sb t5,-9(t4) // store the result to &pdest[-9]
sb t5,-10(t4) // store the result to &pdest[-10]
.............. // continue to process &pdest[-8] ~ &pdest[-1]
If the number of texture values to store exceeds 4, it is possible to calculate 4 texture values (4 bytes) at once and write them using an unsigned int*
. This allows the riscv compiler to generate a single sw
(store word) instruction, replacing four sb
instructions and achieving a slight performance improvement.
Since pdest
might not always be 4-byte aligned, we added an alignment check before applying this optimization. This ensures that using sw
will not lead to undefined behavior or memory misalignment issues.
unsigned char is_pdest_align = !((uintptr_t)pdest & 0x3);
...
#define LOW_WRITEPDEST(i) \
{ pdest[i] = pdest[i+1] = *(pbase + (s >> 16) + (t >> 16) * _cachewidth); s += sstep<<1; t += tstep<<1; }
#define LOW_WRITEPDEST_ALIGN(i) \
{ \
tmp = (unsigned)*(pbase + (s >> 16) + (t >> 16) * _cachewidth); \
s += sstep2; \
t += tstep2; \
tmp |= (unsigned)*(pbase + (s >> 16) + (t >> 16) * _cachewidth) << 16; \
s += sstep2; \
t += tstep2; \
tmp |= tmp << 8; \
*(uintptr_t *)(pdest + i) = tmp; \
}
if(is_pdest_align){
LOW_WRITEPDEST_ALIGN(-16);
LOW_WRITEPDEST_ALIGN(-12);
LOW_WRITEPDEST_ALIGN(-8);
LOW_WRITEPDEST_ALIGN(-4);
}
else{
LOW_WRITEPDEST(-16);
LOW_WRITEPDEST(-14);
LOW_WRITEPDEST(-12);
LOW_WRITEPDEST(-10);
LOW_WRITEPDEST(-8);
LOW_WRITEPDEST(-6);
LOW_WRITEPDEST(-4);
LOW_WRITEPDEST(-2);
}
...
Note: We attempted an alternative approach where we preprocessed a few pixels to align
pdest
before dividing the remainingpspan->count
into groups of 16 pixels and aspancount
. This would have eliminated the need to repeatedly check alignment during the subsequent calculations. However, this approach resulted in lower FPS, likely due to the overhead introduced by the preprocessing steps. Consequently, we opted to simply check for alignment during each iteration without any special handling for unaligned cases.
The modified compiled RISC-V code (the mul instruction seems to have been optimized elsewhere by the compiler):
# 134 "d_scan.c" 1
/* START_CRITICAL_SECTION */
# 0 "" 2
#NO_APP
lbu t4,0(t4)
lbu s11,0(a2)
slli a2,t4,16
or a2,a2,s11
slli t4,a2,8
add t4,t4,a2
sw t4,-16(a4) // store the result to &pdest[-16] ~ &pdest[-13]
lbu a2,0(t3)
lbu t1,0(t1)
slli a2,a2,16
or a2,a2,t1
slli t1,a2,8
add a2,t1,a2
sw a2,-12(a4) // store the result to &pdest[-12] ~ &pdest[-9]
lbu a2,0(a6)
lbu a0,0(a0)
slli a2,a2,16
or a2,a2,a0
slli a0,a2,8
add a2,a0,a2
sw a2,-8(a4) // store the result to &pdest[-8] ~ &pdest[-5]
lbu a2,0(t5)
lbu a1,0(a1)
slli a2,a2,16
or a2,a2,a1
slli a1,a2,8
add a2,a1,a2
sw a2,-4(a4) // store the result to &pdest[-4] ~ &pdest[-1]
#APP
# 139 "d_scan.c" 1
/* END_CRITICAL_SECTION */
The original result of executing timedemo demo3: 35.5 FPS
After modifying D_DrawSpans8 (with the tradeoff of reduced visual quality): 39.9 FPS
The FPS performance improved by approximately 10%, with the trade-off being reduced visual quality.
Haven't figured out a way to optimize yet.
(Github) RISC-V GNU Toolchain
(Github) quake-embedded
(Github) rv32emu
(Github) quakegeneric
RISC-V Bit-manipulation A, B, C and S Extensions