Fast usermode x86-64 emulator for RISC-V

邱繼寬

Introduction

Software vendors rarely ship executables for the RISC-V architecture. Because proprietary software is distributed without source code, users cannot recompile it for RISC-V themselves, which limits its availability on the platform.

Therefore, we propose a high-speed dynamic binary translation mechanism that enables existing x86 applications to run on RISC-V processors at approximately half the performance of native execution. This solution supports both Linux and Windows applications (with WINE).

Why Not QEMU?

Compared to the industry-standard QEMU, box64 offers several advantages:

Performance
Memory usage
MIT License

Performance

The JIT compiler in box64 is more efficient, and its codebase is more streamlined. box64 also intercepts calls into dynamically linked libraries and maps them directly to the host's native RISC-V libraries, a technique referred to as "wrapping". This significantly boosts performance compared to QEMU, which emulates the entire program along with its dynamically linked libraries. Thanks to these wrappers, box64 can also run x64 executables without a complete x64 file system, which speeds up startup and substantially reduces runtime cost.

Box64 Emulation Internals

box64

At startup, box64 extracts the information it needs from the executable being run: where the executable code is, where to load it, which libraries it needs, and so on.

For each required library, box64 first checks whether a native version is registered. If so, the native library is loaded and handled separately from the emulated code. If not, box64 looks for an x86-64 version to emulate, and that library receives the same treatment as the executable.

Finally, box64 applies relocations: in the executable, function addresses are not hardcoded but left for the linker (or, here, box64) to fill in. This is where box64 does its magic: it makes each native function’s address point to a specific signature plus metadata, which alerts box64’s emulator and dynarec to this so-called “bridge” between the emulated and native worlds. These bridges enable the “native calls” that are the main strength of box64.
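
The library lookup and the bridge check described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names resolve_library, bridge, BRIDGE_MAGIC and call_target are hypothetical and are not taken from the box64 source, and the real bridge format is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical registry of libraries that have a native wrapper. */
static const char *const wrapped_libs[] = { "libc.so.6", "libm.so.6", "libGL.so.1" };

typedef enum { LIB_NATIVE_WRAPPED, LIB_EMULATED } lib_kind;

/* Prefer a registered native library; otherwise fall back to emulating
 * the x86-64 version, exactly like the main executable. */
static lib_kind resolve_library(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(wrapped_libs) / sizeof(wrapped_libs[0]); i++)
        if (strcmp(name, wrapped_libs[i]) == 0)
            return LIB_NATIVE_WRAPPED;
    return LIB_EMULATED;
}

/* Hypothetical bridge record: a signature the emulator can recognize,
 * followed by metadata (here just the native function to call). */
#define BRIDGE_MAGIC 0xB0640001u

typedef struct {
    unsigned magic;
    long (*native)(long);
} bridge;

/* Stand-in for a function living in a native RISC-V library. */
static long native_double(long x) { return 2 * x; }

/* What an emulator could do when guest code calls a relocated address:
 * returns 1 and stores the result if the target is a bridge,
 * returns 0 if the target is ordinary guest code that must be emulated. */
static int call_target(const void *target, long arg, long *result)
{
    const bridge *b = target;
    if (b->magic != BRIDGE_MAGIC)
        return 0;
    *result = b->native(arg);
    return 1;
}
```

With this sketch, a relocation for a wrapped function would store the address of a bridge record, so call_target runs native_double directly on the host instead of translating the library's x86-64 code.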

Windows Application Emulation

The best way to demonstrate the capabilities of box64 is by emulating Microsoft Windows applications.

We leverage another open-source project, WINE, which provides a compatibility layer for Windows on Linux. This allows Windows applications originally designed for the x64 instruction set to run on RV64 without any modifications, thanks to box64 dynamically translating x64 instructions to RV64 in real time.

Demo:

Chess

$ WINEDLLOVERRIDES="CardGames,chess,slc=n,b" box64 wine64 chess1.exe

Solitaire

$ WINEDLLOVERRIDES="CardGames,slc=n,b" box64 wine64 Solitaire1.exe

Minesweeper

$ WINEDLLOVERRIDES="CardGames,slc,Minesweeper=n,b" box64 wine64 Minesweeper1.exe

Memory Usage

We track the memory usage of the red-black tree (RBtree) during box64 execution, recording each event in the format "op_name, op_time, memory usage". A "producer" integrated into box64 stores these records in shared memory, where a separate "consumer" can read and process them.

We use ringbuf-shm to implement the shared array interface.
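
The producer/consumer idea can be sketched as a single-producer, single-consumer ring of fixed-size records. This is a minimal in-process illustration, not the ringbuf-shm API: the names rb_record, rb_ring, ring_push and ring_pop are hypothetical, and a real setup would place the rb_ring structure in a shared memory mapping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_CAPACITY 1024u
static_assert((RING_CAPACITY & (RING_CAPACITY - 1)) == 0,
              "capacity must be a power of two");

typedef enum { OP_ADD_NODE, OP_ADD_TREE, OP_DEL_NODE, OP_DEL_TREE } rb_op;

/* One "op_name, op_time, memory usage" record. */
typedef struct {
    uint32_t op;        /* one of rb_op */
    uint64_t time_ns;   /* timestamp of the operation */
    uint64_t mem_bytes; /* rbtree memory in use after the operation */
} rb_record;

typedef struct {
    _Atomic uint32_t head; /* next slot to write, advanced by the producer */
    _Atomic uint32_t tail; /* next slot to read, advanced by the consumer */
    rb_record slots[RING_CAPACITY];
} rb_ring;

/* Producer side: returns false when the ring is full, so the emulator
 * drops a sample instead of blocking. */
static bool ring_push(rb_ring *r, rb_record rec)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_CAPACITY)
        return false;
    r->slots[head & (RING_CAPACITY - 1)] = rec;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when there is nothing to read. */
static bool ring_pop(rb_ring *r, rb_record *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[tail & (RING_CAPACITY - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Dropping records when the ring is full is a design choice in this sketch: it keeps the tracing overhead bounded, at the cost of possibly undercounting operations if the consumer falls behind.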

Name	Real Time (s)	Add node	Add tree	Delete node	Delete tree	Total ops	Operations/sec	Avg Throughput
Chess	34.93	51041	19	28355	19	79434	2243.091	79.96 KB/sec
Solitaire	16.32	44972	18	22689	18	67697	4148.100	150.81 KB/sec
Minesweeper	14.69	45509	18	22718	18	68263	4646.902	169.54 KB/sec

See: Box64 Red-Black Tree Performance Analysis (Box64 紅黑樹效能分析)

Chess

raw

Minesweeper

raw

Solitaire

raw

Proposed Enhancements

Although box64 offers many advantages, it is challenging to fully leverage its dynamic binary translation capabilities in resource-constrained RISC-V environments such as the Tinker V. To address this, we implemented a series of improvements to performance and adaptability:

Implement remote X Window System display support with tweaked network transport
Optimize memory management during JIT compilation by reducing translated code block footprint
Add peephole optimization passes for RV64 JIT code generation
Implement standard compiler optimizations, including constant propagation and folding
Fix compatibility issues with legacy Linux kernels and older glibc versions
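
To make the peephole and constant-folding items concrete, here is a minimal sketch of one such pass over a toy instruction list. It is not box64's dynarec code: the insn type and the peephole function are hypothetical, and the pass assumes straight-line code with no branch target between the instructions it merges. It removes addi rd, rd, 0 and folds two consecutive addi instructions on the same register into one when the combined immediate still fits the 12-bit signed range of RISC-V ADDI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy instruction form: only ADDI is modelled; anything else is an
 * opaque instruction that the pass never merges across. */
typedef enum { OP_ADDI, OP_OTHER } opcode;

typedef struct {
    opcode op;
    uint8_t rd, rs1;
    int32_t imm; /* assumed to already fit in 12 signed bits */
} insn;

static int fits_imm12(int32_t v) { return v >= -2048 && v <= 2047; }

static int is_nop(const insn *i)
{
    return i->op == OP_ADDI && i->rd == i->rs1 && i->imm == 0;
}

/* Rewrites code[0..n) in place and returns the new length. */
static size_t peephole(insn *code, size_t n)
{
    size_t out = 0;
    assert(code != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        insn cur = code[i];
        if (is_nop(&cur))
            continue; /* addi rd, rd, 0 has no effect */
        if (out > 0 && cur.op == OP_ADDI && cur.rs1 == cur.rd) {
            insn *prev = &code[out - 1];
            /* addi rd, rs, a ; addi rd, rd, b  ->  addi rd, rs, a+b */
            if (prev->op == OP_ADDI && prev->rd == cur.rd &&
                fits_imm12(prev->imm + cur.imm)) {
                prev->imm += cur.imm;
                if (is_nop(prev))
                    out--; /* the pair cancelled out entirely */
                continue;
            }
        }
        code[out++] = cur;
    }
    return out;
}
```

For example, the sequence addi x5, x6, 4 followed by addi x5, x5, 8 becomes the single instruction addi x5, x6, 12, which shrinks the translated block as well as saving a cycle.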

RBtree

While the current version of Box64 works quite well, the red-black tree behind its memory management still has room for improvement, particularly in per-node memory overhead.

typedef struct rbnode {
    struct rbnode *left, *right, *parent;
    uintptr_t start, end;
    uint64_t data;
    uint8_t meta;
} rbnode;

On a 64-bit target, each node in the current Box64 occupies 56 bytes (the one-byte meta field is padded out to 8 bytes). For example, chess.exe creates 51041 red-black tree nodes, which amounts to 51041 × 56 = 2858296 bytes.

We can save around 15% of memory usage if we modify the structure of the rbnode:

typedef struct rbnode {
    rb_node_t node;
    uintptr_t start, end;
    uint64_t data;
} rbnode;

Each node in the rbtree is represented by an rb_node_t structure that lives in user-managed memory, embedded within the rbnode that the tree tracks. Unlike linked-list structures, however, the data within an rb_node_t is entirely opaque, and users cannot manually traverse the tree's binary topology as they can with a list.
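
The size arithmetic and the embedding can be checked with a short sketch. One loud assumption: the rb_node_t below is a stand-in that packs the color into the low bit of the parent pointer (one common intrusive layout), and the actual rb_node_t in jserv/rbtree may be laid out differently. The names rbnode_old, rbnode_new, rbnode_from_link and saving_percent are also hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Node layout quoted above from the current Box64. */
typedef struct rbnode_old {
    struct rbnode_old *left, *right, *parent;
    uintptr_t start, end;
    uint64_t data;
    uint8_t meta;
} rbnode_old;

/* ASSUMPTION: illustrative stand-in, not the real jserv/rbtree type. */
typedef struct rb_node {
    uintptr_t parent_color; /* parent pointer with the color in bit 0 */
    struct rb_node *left, *right;
} rb_node_t;

/* Proposed layout: the tree link is embedded, meta is gone. */
typedef struct rbnode_new {
    rb_node_t node;
    uintptr_t start, end;
    uint64_t data;
} rbnode_new;

static_assert(sizeof(rbnode_new) < sizeof(rbnode_old),
              "the embedded layout should be smaller");

/* Recover the enclosing structure from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

/* The tree only hands back rb_node_t pointers; this maps one back to
 * the rbnode that contains it. */
static rbnode_new *rbnode_from_link(rb_node_t *link)
{
    return container_of(link, rbnode_new, node);
}

/* Percentage of per-node memory saved by the proposed layout. */
static double saving_percent(void)
{
    return 100.0 * (double) (sizeof(rbnode_old) - sizeof(rbnode_new)) /
           (double) sizeof(rbnode_old);
}
```

Under this assumed layout on an LP64 target, the node shrinks from 56 to 48 bytes, a saving of 8/56, or roughly 14%, which is in line with the figure of around 15% quoted above.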

jserv/rbtree

Enhance benchmark suite for essential operations
Implement cached red-black tree
Dynamic stack buffer overflow

Benchmark

CoreMark

The CoreMark benchmark is a product of the Embedded Microprocessor Benchmark Consortium (EEMBC). It aims to provide a modern alternative to the venerable integer-focused Dhrystone benchmark, addressing many of its shortcomings and improving its relevance, primarily for low-power embedded systems, where vendors still predominantly use Dhrystone for rough performance estimates and comparisons with competing products.

Emulated x86 programs achieve approximately one-quarter the performance of native RISC-V applications.

Dhrystone

A pun on the name of the floating-point focused "Whetstone" benchmark, the Dhrystone benchmark is, like Whetstone, a synthetic benchmark derived from statistical analysis with the goal of representing an "average" program. Dhrystone is unlike Whetstone in that it focuses entirely on integer arithmetic, making it more relevant to small systems and typical computing use cases in general than the more scientific and technically-oriented Whetstone.

The emulated x86 programs achieve approximately one-quarter the performance of native RISC-V programs.

rv8-bench

The rv8 benchmark suite contains a small set of currently integer-centric benchmarks for regression testing of the rv8 binary translation engine. The suite contains the following test programs:

Box64 performs adequately on test cases that primarily involve basic ALU operations with minimal branching. However, floating-point-heavy benchmarks like Whetstone reveal significant performance shortcomings, indicating substantial room for improvement.