
Performance tuning for Quake/rv32emu

余紹桓, 郭子敬

Github

Goal

To utilize the RISC-V B (bit-manipulation) extension to accelerate the execution of the Quake program.

So far, we have only enabled the B extension in the compilation parameters, letting the compiler apply it automatically; the rest of our optimizations focus on reducing the number of RISC-V instructions executed.
For future optimization directions of this project, utilizing the Vector extension may be a good choice.
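As a small illustration of the kind of pattern the B extension targets (this is our own example, not code from Quake, and describes what the compiler may do rather than what we have measured), scaled array indexing can be lowered to a single sh2add instead of a separate shift and add:

#include <stdint.h>

/* Illustrative only: with the Zba sub-extension enabled (e.g. an -march string
 * containing _zba), GCC can lower the scaled address computation below,
 * base + (idx << 2), to a single sh2add instruction instead of slli + add. */
uint32_t load_scaled(const uint32_t *base, uint32_t idx)
{
    return base[idx];
}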

Context of the Task

The Quake program is written in C, and we aim to run it using rv32emu, a RISC-V emulator. Therefore, before execution, the Quake program must first be compiled into RISC-V instructions using the GNU toolchain for RISC-V.

Prerequisites

The project requires the following tools: rv32emu and riscv-gnu-toolchain

rv32emu Installation

  1. Install the required packages:
sudo apt install libsdl2-dev libsdl2-mixer-dev
  2. Clone the repository and build the emulator:
git clone https://github.com/sysprog21/rv32emu.git
cd rv32emu && make

After building, ./build/rv32emu is the executable file for rv32emu.

Additionally, run the following command to obtain the Quake-related files and run the game:

make quake

riscv-gnu-toolchain Installation

Note: The installation of this toolchain can take approximately 1-3 hours.

Alternatively, you can use the prebuilt xPack toolchain.

  1. Install the required dependencies:
sudo apt-get install autoconf automake autotools-dev curl python3 libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev
  2. Clone the repository and configure the build:
git clone https://github.com/riscv/riscv-gnu-toolchain
cd riscv-gnu-toolchain/
sudo ./configure --prefix=/opt/riscv --enable-multilib
sudo make
  3. Add the installed toolchain to your PATH:
export PATH=$PATH:/opt/riscv/bin

Resolving the Build Issue

The README.md in quake-embedded provides the following build instructions:

git clone https://github.com/sysprog21/quake-embedded && cd quake-embedded
mkdir build && cd build
cmake -DCMAKE_TOOLCHAIN_FILE=../port/boards/rv32emu/toolchain.cmake \
      -DCMAKE_BUILD_TYPE=RELEASE -DBOARD_NAME=rv32emu ..
make

However, during the initial build, the following error was encountered:

CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_C_COMPILER:
    riscv-none-elf-gcc
  is not a full path and was not found in the PATH.

  Tell CMake where to find the compiler by setting either the environment
  variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
  the compiler, or to the compiler name if it is in the PATH.

CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_ASM_COMPILER:
    riscv-none-elf-gcc
  is not a full path and was not found in the PATH.

  Tell CMake where to find the compiler by setting either the environment
  variable "ASM" or the CMake cache entry CMAKE_ASM_COMPILER to the full path
  to the compiler, or to the compiler name if it is in the PATH.

I checked my previously installed riscv-gnu-toolchain:

ls /opt/riscv/bin/

This revealed that my toolchain prefix is riscv64-unknown-elf rather than riscv-none-elf. Therefore, I modified port/boards/rv32emu/toolchain.cmake accordingly:

set(CMAKE_SYSTEM_NAME Generic)
set(CMAKE_SYSTEM_PROCESSOR riscv)

if(NOT DEFINED CROSS_COMPILE)
-    set(CROSS_COMPILE "riscv-none-elf-")
+    set(CROSS_COMPILE "riscv64-unknown-elf-")
endif()

set(TOOLCHAIN_PREFIX ${CROSS_COMPILE})

set(CMAKE_TRY_COMPILE_TARGET_TYPE STATIC_LIBRARY)
set(CMAKE_C_COMPILER ${TOOLCHAIN_PREFIX}gcc)
set(CMAKE_CXX_COMPILER ${TOOLCHAIN_PREFIX}g++)
set(CMAKE_ASM_COMPILER ${TOOLCHAIN_PREFIX}gcc)
set(CMAKE_OBJCOPY ${TOOLCHAIN_PREFIX}objcopy)
set(CMAKE_SIZE ${TOOLCHAIN_PREFIX}size)

set(ARCH_FLAGS "-march=rv32imf -mabi=ilp32 -Ofast -flto")

set(CMAKE_C_FLAGS "${ARCH_FLAGS} -std=gnu11 -Wall -ffunction-sections -fdata-sections -Wdouble-promotion")
set(CMAKE_ASM_FLAGS "${ARCH_FLAGS} -x assembler-with-cpp")
set(CMAKE_EXE_LINKER_FLAGS "-Wl,-Map=${CMAKE_CURRENT_BINARY_DIR}/output.map -Wl,--gc-sections -u _printf_float -u _scanf_float -lm")

After this modification, build again; this time the build succeeds:

mkdir build && cd build
cmake -DCMAKE_TOOLCHAIN_FILE=../port/boards/rv32emu/toolchain.cmake \
      -DCMAKE_BUILD_TYPE=RELEASE -DBOARD_NAME=rv32emu ..
make
cp ./port/boards/rv32emu/quake ./

rv32emu on macOS

When building with make ENABLE_SYSTEM=1 system on macOS, the Makefile does not download the Linux kernel image and the prebuilt test files automatically. Therefore, we have to download the required files from the rv32emu-prebuilt release ourselves.
After downloading the files, move them into the ./build/riscv32 directory:

brew install dtc
make ENABLE_SYSTEM=1 system

Performance Analysis

Before optimizing Quake, it is essential to identify which functions consume the most execution time. Tools like gprof and perf can profile a program and show how its execution time is distributed across functions. Since we are new to these tools, we plan to start with gprof, as recommended by our instructor.

Current Issue

Performing performance profiling on the Quake program compiled with riscv64-unknown-elf is a challenge for us. Using gprof directly on that binary appears to be infeasible, because the riscv64-unknown-elf toolchain is based on newlib and does not provide the glibc support that gprof requires.

Therefore, we are currently exploring the following approaches:

  1. Compile the Quake program with the host GCC. This allows us to use gprof for performance analysis. However, the results might differ from those of the Quake program compiled with riscv64-unknown-elf.
  2. Look for other tools or methods that can perform performance profiling on programs compiled with riscv64-unknown-elf.

Analysis Method 1: Compile Quake using GCC to support gprof

We later looked into the quakegeneric project and found that it already provides a Makefile for compiling Quake with GCC, so the first approach above can be used. Simply add the -pg flag to CFLAGS and LDFLAGS in the Makefile and run make:

CC ?= gcc
AR ?= ar
RM ?= rm -f

ifeq ($(CC), tcc)
override CFLAGS += -DSDL_DISABLE_IMMINTRIN_H=1
endif

ifeq ($(DEBUG), 1)
override CFLAGS += -g3
endif

ifeq ($(PARANOID), 1)
override CFLAGS += -fsanitize=address,undefined -Wall -Wextra -pedantic
override LDFLAGS += -fsanitize=address,undefined
endif

ifeq ($(RELEASE), 1)
override CFLAGS += -O3
override LDFLAGS += -O3
endif

- override CFLAGS += -m32 -std=gnu99 $(shell sdl2-config --cflags)
- override LDFLAGS += -m32 -lm $(shell sdl2-config --libs)
+ override CFLAGS += -pg -m32 -std=gnu99 $(shell sdl2-config --cflags)
+ override LDFLAGS += -pg -m32 -lm $(shell sdl2-config --libs)
...

we obtain a quakegeneric executable that writes a gmon.out profile on exit.
After playing quakegeneric for a while to generate gmon.out, run gprof to analyze it:

$ gprof ./quakegeneric gmon.out 
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 25.60      1.39     1.39   418979     0.00     0.00  D_DrawSpans8
 18.60      2.40     1.01     5043     0.20     0.20  QG_DrawFrame
 10.87      2.99     0.59   427019     0.00     0.00  D_DrawZSpans
  3.13      3.16     0.17  2291446     0.00     0.00  R_EmitEdge
  2.39      3.29     0.13    42023     0.00     0.00  R_DrawSurfaceBlock8_mip0
  2.39      3.42     0.13     5027     0.03     0.09  R_RecursiveWorldNode
  2.03      3.53     0.11 17759842     0.00     0.00  R_LeadingEdge
  2.03      3.64     0.11  1023746     0.00     0.00  R_RenderFace
  1.84      3.74     0.10   852572     0.00     0.00  D_PolysetDrawSpans8
  1.47      3.82     0.08   605440     0.00     0.00  D_PolysetScanLeftEdge
  1.47      3.90     0.08   227812     0.00     0.00  SV_HullPointContents
  1.47      3.98     0.08                             main
  1.29      4.05     0.07 16334920     0.00     0.00  R_TrailingEdge
  1.10      4.11     0.06   661078     0.00     0.00  D_PolysetCalcGradients
  1.10      4.17     0.06     5033     0.01     0.01  CL_RelinkEntities
......

It can be observed that the most time-consuming functions in quakegeneric are D_DrawSpans8, QG_DrawFrame, and D_DrawZSpans. Although the code of quakegeneric and quake-embedded may differ, we currently plan to use these results as a reference for identifying optimization opportunities in quake-embedded.

However, the QG_DrawFrame function does not appear to exist in quake-embedded, so we plan to focus on optimizing D_DrawSpans8 and D_DrawZSpans as our initial targets.

Bottleneck of D_DrawSpans8

D_DrawSpans8 is a function responsible for applying texture mapping to a horizontal span of pixels on the screen. It calculates texture coordinates and depth values for each pixel in the span, fetches the appropriate texture color, and writes it to the frame buffer.
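Conceptually, the per-pixel work inside each span looks like the following simplified sketch (our own model, not the actual quake-embedded code; it assumes 16.16 fixed-point s/t coordinates and a row-major 8-bit surface cache, and omits the per-group perspective correction):

#include <stdint.h>

typedef int32_t fixed16_t; /* 16.16 fixed point, as used throughout Quake */

/* Simplified model of the span inner loop: for each pixel, convert the
 * fixed-point texture coordinates (s, t) to integer texel indices, fetch one
 * byte from the surface cache, store it into the frame buffer, then step s/t. */
static void draw_span_sketch(unsigned char *pdest, const unsigned char *pbase,
                             int cachewidth, int count,
                             fixed16_t s, fixed16_t t,
                             fixed16_t sstep, fixed16_t tstep)
{
    while (count-- > 0) {
        *pdest++ = pbase[(s >> 16) + (t >> 16) * cachewidth];
        s += sstep;
        t += tstep;
    }
}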

We discovered that Quake provides the timedemo command, which plays a fixed sequence of animations and outputs the average FPS. This allows us to test program performance under identical conditions after making modifications to the code (e.g., timedemo demo3).

Using this method, we identified which parts of the function have the greatest impact on FPS. When we commented out the code segment containing WRITEPDEST, the FPS improvement was nearly equivalent to modifying the D_DrawSpans8 function to do nothing (by adding a return at the top):

  1. The original result of executing timedemo demo3: 35.8 FPS
  2. The result of making the function do nothing (by adding a return at the top): 49.2 FPS
  3. The result of running with the WRITEPDEST code commented out: 48.5 FPS

Note: The above results are from running quake-embedded on rv32emu

Therefore, the WRITEPDEST segment is the most critical part of the optimization task:

...
/* This macro is the most time-consuming part of the D_DrawSpans8 function */
#define WRITEPDEST(i)                                          \
  {                                                            \
    pdest[i] = *(pbase + (s >> 16) + (t >> 16) * _cachewidth); \
    s += sstep;                                                \
    t += tstep;                                                \
  }

      WRITEPDEST(-16);
      WRITEPDEST(-15);
      WRITEPDEST(-14);
      WRITEPDEST(-13);
      WRITEPDEST(-12);
      WRITEPDEST(-11);
      WRITEPDEST(-10);
      WRITEPDEST(-9);
      WRITEPDEST(-8);
      WRITEPDEST(-7);
      WRITEPDEST(-6);
      WRITEPDEST(-5);
      WRITEPDEST(-4);
      WRITEPDEST(-3);
      WRITEPDEST(-2);
      WRITEPDEST(-1);
...

Bottleneck of D_DrawZSpans

  1. The original result of executing timedemo demo3: 34.6 FPS
  2. The result of making the function do nothing (by adding a return at the top): 37.3 FPS
  3. The result of running with the most time-consuming part commented out: 36.5 FPS

/* This is the most time-consuming part of the D_DrawZSpans function */
...
    if ((doublecount = count >> 1) > 0)
    {
      do
      {
        ltemp = izi >> 16;
        izi += izistep;
        ltemp |= izi & 0xFFFF0000;
        izi += izistep;
        *(int *)pdest = ltemp;
        pdest += 2;
      } while (--doublecount > 0);
    }
...
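The loop above packs two 16-bit 1/z values into one 32-bit store. For reference, here is the equivalent per-pixel form as we understand it (a sketch under the assumption of a little-endian target and a 16-bit z-buffer, not code taken from quake-embedded):

#include <stdint.h>

/* Per-pixel equivalent of the packed loop: each z-buffer entry receives the
 * top 16 bits of the 16.16 fixed-point 1/z value, which then steps linearly. */
static void write_zspan_sketch(uint16_t *pdest, int count,
                               int32_t izi, int32_t izistep)
{
    while (count-- > 0) {
        *pdest++ = (uint16_t)(izi >> 16);
        izi += izistep;
    }
}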

Optimization

Before performing optimizations, you MUST first figure out where the performance bottlenecks are.

D_DrawSpans8

Reducing Texture Lookups

To optimize the performance of D_DrawSpans8, we propose reducing the number of texture lookups by sharing the retrieved texture value between adjacent pixels. Specifically, the texture value looked up for one pixel will be reused for the next pixel, effectively halving the number of texture lookup operations.

This approach trades off rendering quality for performance. However, the visual degradation is less noticeable on devices with smaller screen resolutions, such as embedded systems. In these cases, the performance gain is likely to outweigh the quality loss.

We implemented a new macro called LOW_WRITEPDEST, which modifies the way pixels are processed in D_DrawSpans8. The macro retrieves the texture value once and assigns it to two consecutive pixels, reducing the number of texture lookups by approximately half:

fixed16_t sstep2 = sstep<<1;
fixed16_t tstep2 = tstep<<1;
...
#define LOW_WRITEPDEST(i)                                                    \
  {                                                                          \
    pdest[i] = pdest[i+1]  = *(pbase + (s >> 16) + (t >> 16) * _cachewidth); \
    s += sstep2;                                                            \
    t += tstep2;                                                            \
  }
        LOW_WRITEPDEST(-16);
        LOW_WRITEPDEST(-14);
        LOW_WRITEPDEST(-12);
        LOW_WRITEPDEST(-10);
        LOW_WRITEPDEST(-8);
        LOW_WRITEPDEST(-6);
        LOW_WRITEPDEST(-4);
        LOW_WRITEPDEST(-2);

Replacing Four sb Instructions with a Single sw Instruction

Additionally, we noticed that since pdest is of type unsigned char *, each write in LOW_WRITEPDEST makes the RISC-V compiler generate an sb (store byte) instruction:

# 120 "d_scan.c" 1
	/* START_CRITICAL_SECTION */
# 0 "" 2
 #NO_APP
	srai	t5,a5,16
	mul	t5,t5,t3
	add	a5,a1,a5
	srai	s7,a5,16
	add	a5,a1,a5
	srai	s8,a4,16
	srai	t0,a5,16    
	add	a5,a1,a5
	add	a4,a2,a4
	mul	s7,s7,t3
	add	t5,a3,t5
	add	t5,t5,s8
	lbu	t5,0(t5)
	srai	s8,a4,16
	add	a4,a2,a4
	sb	t5,-15(t4)      // store the result to &pdest[-15]
	sb	t5,-16(t4)      // store the result to &pdest[-16]
	srai	t5,a5,16    
	add	a5,a1,a5
	mul	t0,t0,t3
	add	s7,a3,s7
	add	s7,s7,s8
	lbu	s8,0(s7)
	srai	s7,a5,16
	add	a5,a1,a5
	sb	s8,-13(t4)       // store the result to &pdest[-13]
	sb	s8,-14(t4)       // store the result to &pdest[-14]
	srai	s8,a4,16    
	add	a4,a2,a4
	mul	t5,t5,t3
	add	t0,a3,t0
	add	t0,t0,s8
	lbu	t0,0(t0)
	srai	s8,a5,16
	add	a5,a1,a5
	sb	t0,-11(t4)       // store the result to &pdest[-11]
	sb	t0,-12(t4)       // store the result to &pdest[-12]
	srai	t0,a4,16    
	add	a4,a2,a4
	mul	s7,s7,t3
	add	t5,a3,t5
	add	t5,t5,t0
	lbu	t5,0(t5)
	srai	t0,a5,16
	add	a5,a1,a5
	sb	t5,-9(t4)       // store the result to &pdest[-9]
	sb	t5,-10(t4)      // store the result to &pdest[-10]
        ..............         // continue to process &pdest[-8] ~ &pdest[-1]

When at least four texture bytes need to be stored, it is possible to compute four texture values (4 bytes) at once and write them through an unsigned int *. This allows the RISC-V compiler to generate a single sw (store word) instruction in place of four sb instructions, yielding a slight performance improvement.

Since pdest might not always be 4-byte aligned, we added an alignment check before applying this optimization. This ensures that using sw will not lead to undefined behavior or memory misalignment issues.

unsigned char is_pdest_align = !((uintptr_t)pdest & 0x3);
...
#define LOW_WRITEPDEST(i)                                         \
	{ pdest[i] = pdest[i+1]  = *(pbase + (s >> 16) + (t >> 16) * _cachewidth); s += sstep<<1; t += tstep<<1; }

#define LOW_WRITEPDEST_ALIGN(i)                                            \
  {                                                                        \
    tmp = (unsigned)*(pbase + (s >> 16) + (t >> 16) * _cachewidth);        \
    s += sstep2;                                                           \
    t += tstep2;                                                           \
    tmp |= (unsigned)*(pbase + (s >> 16) + (t >> 16) * _cachewidth) << 16; \
    s += sstep2;                                                           \
    t += tstep2;                                                           \
    tmp |= tmp << 8;                                                       \
    *(uintptr_t *)(pdest + i) = tmp;                                       \
  }

	
	if(is_pdest_align){
		LOW_WRITEPDEST_ALIGN(-16);
		LOW_WRITEPDEST_ALIGN(-12);
		LOW_WRITEPDEST_ALIGN(-8);
		LOW_WRITEPDEST_ALIGN(-4);
	}
	else{
		LOW_WRITEPDEST(-16);
		LOW_WRITEPDEST(-14);
		LOW_WRITEPDEST(-12);
		LOW_WRITEPDEST(-10);
		LOW_WRITEPDEST(-8);
		LOW_WRITEPDEST(-6);
		LOW_WRITEPDEST(-4);
		LOW_WRITEPDEST(-2);
	}
...
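To see why tmp |= tmp << 8 produces the intended byte pattern, the small standalone check below (our own, assuming a little-endian target such as rv32) shows that the packed word ends up holding the first texel in bytes 0-1 and the second texel in bytes 2-3, i.e. each texel duplicated across an adjacent pixel pair:

#include <assert.h>
#include <string.h>

int main(void)
{
    unsigned char a = 0x11, b = 0x22;    /* the two fetched texels */
    unsigned tmp = (unsigned)a;          /* texel A in bits 0-7 */
    tmp |= (unsigned)b << 16;            /* texel B in bits 16-23 */
    tmp |= tmp << 8;                     /* duplicate each texel into the next byte */

    unsigned char out[4];
    memcpy(out, &tmp, 4);                /* observe the stored byte order (little-endian) */
    assert(out[0] == a && out[1] == a);  /* pdest[i]   and pdest[i+1] */
    assert(out[2] == b && out[3] == b);  /* pdest[i+2] and pdest[i+3] */
    return 0;
}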

Note: We attempted an alternative approach where we preprocessed a few pixels to align pdest before dividing the remaining pspan->count into groups of 16 pixels and a spancount. This would have eliminated the need to repeatedly check alignment during the subsequent calculations. However, this approach resulted in lower FPS, likely due to the overhead introduced by the preprocessing steps. Consequently, we opted to simply check for alignment during each iteration without any special handling for unaligned cases.
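A hypothetical sketch of that rejected variant, reconstructed from the description above (the helper name and signature are our own, not code from quake-embedded):

#include <stdint.h>

typedef int32_t fixed16_t; /* 16.16 fixed point */

/* Hypothetical prologue for the rejected approach: draw leading pixels one at
 * a time until pdest reaches a 4-byte boundary, so that every following group
 * could take the aligned sw path unconditionally. Returns the pixels left. */
static int align_pdest_prologue(unsigned char **pdest, const unsigned char *pbase,
                                int cachewidth, int count,
                                fixed16_t *s, fixed16_t *t,
                                fixed16_t sstep, fixed16_t tstep)
{
    while (((uintptr_t)*pdest & 0x3) && count > 0) {
        *(*pdest)++ = pbase[(*s >> 16) + (*t >> 16) * cachewidth];
        *s += sstep;
        *t += tstep;
        count--;
    }
    return count;
}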

The modified compiled RISC-V code (the mul instruction seems to have been optimized elsewhere by the compiler):

# 134 "d_scan.c" 1
	/* START_CRITICAL_SECTION */
# 0 "" 2
 #NO_APP
	lbu	t4,0(t4)        
	lbu	s11,0(a2)
	slli	a2,t4,16
	or	a2,a2,s11
	slli	t4,a2,8
	add	t4,t4,a2
	sw	t4,-16(a4)      // store the result to &pdest[-16] ~ &pdest[-13]
	lbu	a2,0(t3)        
	lbu	t1,0(t1)
	slli	a2,a2,16
	or	a2,a2,t1
	slli	t1,a2,8
	add	a2,t1,a2
	sw	a2,-12(a4)      // store the result to &pdest[-12] ~ &pdest[-9]
	lbu	a2,0(a6)        
	lbu	a0,0(a0)
	slli	a2,a2,16
	or	a2,a2,a0
	slli	a0,a2,8
	add	a2,a0,a2
	sw	a2,-8(a4)       // store the result to &pdest[-8] ~ &pdest[-5]
	lbu	a2,0(t5)        
	lbu	a1,0(a1)
	slli	a2,a2,16
	or	a2,a2,a1
	slli	a1,a2,8
	add	a2,a1,a2
	sw	a2,-4(a4)       // store the result to &pdest[-4] ~ &pdest[-1]
 #APP
# 139 "d_scan.c" 1
	/* END_CRITICAL_SECTION */

Results

The original result of executing timedemo demo3: 35.5 FPS

After modifying D_DrawSpans8 (with the tradeoff of reduced visual quality): 39.9 FPS

The frame rate improved from 35.5 to 39.9 FPS, roughly a 12% gain, with the trade-off being reduced visual quality.

D_DrawZSpans

We have not yet figured out a way to optimize this function.

Reference

(GitHub) RISC-V GNU Toolchain
(GitHub) quake-embedded
(GitHub) rv32emu
(GitHub) quakegeneric
RISC-V Bit-manipulation A, B, C and S Extensions