邱繼寬
Software vendors rarely provide executables for the RISC-V architecture, and proprietary software, whose source code is not available, cannot simply be recompiled for RISC-V. This limits the software available on the platform.
Therefore, we propose a high-speed dynamic binary translation mechanism that enables existing x86 applications to run on RISC-V processors at approximately half the performance of native execution. This solution supports both Linux and Windows applications (with WINE).
Compared to the industry-standard QEMU, box64 offers several advantages:
The JIT compiler in box64 is more efficient, the codebase is more streamlined, and it can intercept calls into dynamically linked libraries and map them directly to native libraries on the RISC-V host, a technique referred to as "wrapping". This significantly boosts performance compared to QEMU, which emulates the entire program along with its dynamically linked libraries. Additionally, thanks to this wrapping, box64 can emulate x64 executables without requiring a complete x64 file system. This not only speeds up startup but also substantially reduces runtime costs.
At startup, box64 extracts useful information about the executable being run: where the executable code is, where to load it, which libraries it needs, and so on.
For each library, box64 first tries to load a native version if one is registered; if none is registered, it tries to find a version to emulate. An emulated library receives the same treatment as the executable. A native library is handled separately and loaded directly on the host.
Finally, box64 applies relocations: in the executable, function addresses are not hardcoded but left for the linker (or, here, box64) to fill in. This is where box64 does its magic: it sets each native function’s address to point to a specific signature plus metadata, which tells box64’s emulator and dynarec that this is a so-called “bridge” (between the emulated and native worlds). This enables the “native calls” that are the main strength of box64.
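The per-library decision described above can be sketched in a few lines of C. This is only an illustration: the names `lib_kind`, `wrapped_libs`, `is_wrapped`, and `resolve_library` are hypothetical and do not correspond to box64's actual internals, and the registry contents are example values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome of resolving one needed library. */
typedef enum { LIB_NATIVE_WRAPPED, LIB_EMULATED, LIB_NOT_FOUND } lib_kind;

/* Illustrative registry of libraries for which a native wrapper exists. */
static const char *const wrapped_libs[] = {
    "libc.so.6", "libm.so.6", "libSDL2-2.0.so.0"
};

static int is_wrapped(const char *name) {
    for (size_t i = 0; i < sizeof wrapped_libs / sizeof wrapped_libs[0]; i++)
        if (strcmp(wrapped_libs[i], name) == 0)
            return 1;
    return 0;
}

/* Prefer the native (wrapped) library; fall back to emulating a guest copy.
 * emulated_available: non-zero if a guest build of the library was found. */
lib_kind resolve_library(const char *name, int emulated_available) {
    assert(name != NULL);
    if (is_wrapped(name))
        return LIB_NATIVE_WRAPPED;   /* calls go through a bridge to the host */
    if (emulated_available)
        return LIB_EMULATED;         /* treated like the executable itself */
    return LIB_NOT_FOUND;
}
```

The ordering is the point of the sketch: the native version is preferred whenever it is registered, and emulation is only the fallback.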
The best way to demonstrate the capabilities of box64 is by emulating Microsoft Windows applications.
Demo:
$ WINEDLLOVERRIDES="CardGames,chess,slc=n,b" box64 wine64 chess1.exe
$ WINEDLLOVERRIDES="CardGames,slc=n,b" box64 wine64 Solitaire1.exe
$ WINEDLLOVERRIDES="CardGames,slc,Minesweeper=n,b" box64 wine64 Minesweeper1.exe
Track the memory usage of the RBtree during box64 execution, recording data in the format: "op_name, op_time, memory usage." Integrate a "producer" mechanism into box64 to store this recorded data in shared memory, which can then be accessed and processed by a "consumer."
In this case, we use ringbuf-shm to implement the interface for the shared array.
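The producer/consumer idea can be illustrated with a minimal single-producer, single-consumer ring buffer. This sketch does not use the ringbuf-shm API; the type and function names (`rb_record`, `rb_ring`, `ring_push`, `ring_pop`), the field widths, and the capacity are all assumptions chosen for illustration. In a real setup, the `rb_ring` structure would be placed in a shared memory mapping rather than in ordinary process memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define RING_CAPACITY 1024u          /* power of two, so indices wrap with a mask */

/* One recorded event: "op_name, op_time, memory usage". */
typedef struct {
    char     op_name[16];            /* e.g. "add_node" */
    uint64_t op_time_ns;             /* assumed unit: nanoseconds */
    uint64_t mem_usage;              /* assumed unit: bytes */
} rb_record;

typedef struct {
    atomic_uint head;                /* next slot to write (producer side) */
    atomic_uint tail;                /* next slot to read (consumer side) */
    rb_record   slots[RING_CAPACITY];
} rb_ring;

/* Producer: returns 1 on success, 0 if the ring is full. */
int ring_push(rb_ring *r, const char *op, uint64_t t_ns, uint64_t mem) {
    assert(r != NULL && op != NULL);
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_CAPACITY)
        return 0;
    rb_record *rec = &r->slots[head & (RING_CAPACITY - 1)];
    memset(rec->op_name, 0, sizeof rec->op_name);
    strncpy(rec->op_name, op, sizeof rec->op_name - 1);
    rec->op_time_ns = t_ns;
    rec->mem_usage = mem;
    /* Publish the record only after it is fully written. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

/* Consumer: returns 1 and fills *out on success, 0 if the ring is empty. */
int ring_pop(rb_ring *r, rb_record *out) {
    assert(r != NULL && out != NULL);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return 0;
    *out = r->slots[tail & (RING_CAPACITY - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}
```

With exactly one producer and one consumer, no lock is needed: each side writes only its own index, and the release/acquire pair ensures the consumer never reads a half-written record.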
Name | Real Time (s) | Add node | Add tree | Delete node | Delete tree | Total ops | Operations/sec | Avg Throughput |
---|---|---|---|---|---|---|---|---|
Chess | 34.93 | 51041 | 19 | 28355 | 19 | 79434 | 2243.091 | 79.96 KB/sec |
Solitaire | 16.32 | 44972 | 18 | 22689 | 18 | 67697 | 4148.100 | 150.81 KB/sec |
Minesweeper | 14.69 | 45509 | 18 | 22718 | 18 | 68263 | 4646.902 | 169.54 KB/sec |
See "Box64 Red-Black Tree Performance Analysis" (Box64 紅黑樹效能分析).
Although box64 offers many advantages, it is challenging to fully leverage its dynamic binary translation capabilities in resource-constrained RISC-V environments like the Tinker V.
To address this, we implemented a series of improvements to enhance performance and adaptability:
While the current version of Box64 works quite well, the red-black tree behind its memory management leaves room for reducing memory usage.
typedef struct rbnode {
    struct rbnode *left, *right, *parent;
    uintptr_t start, end;
    uint64_t data;
    uint8_t meta;
} rbnode;
On a 64-bit target, the node size in the current Box64 is 56 bytes. For example, chess.exe requires at least 51,041 red-black tree nodes, which amounts to 2,858,296 bytes (about 2.7 MiB).
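The 56-byte figure can be checked directly. The sketch below assumes a typical 64-bit (LP64) target with 8-byte pointers and 8-byte alignment: three pointers (24 bytes), two `uintptr_t` values (16 bytes), one `uint64_t` (8 bytes), and one `uint8_t` followed by 7 bytes of padding. The helper `rbnode_bytes` is a hypothetical name introduced here for the arithmetic only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as the node shown above. */
typedef struct rbnode {
    struct rbnode *left, *right, *parent;  /* 3 * 8 = 24 bytes */
    uintptr_t start, end;                  /* 2 * 8 = 16 bytes */
    uint64_t data;                         /*          8 bytes */
    uint8_t meta;                          /* 1 byte + 7 bytes tail padding */
} rbnode;

/* Total bytes needed for `count` nodes (node payload only, no allocator overhead). */
size_t rbnode_bytes(size_t count) {
    assert(count <= SIZE_MAX / sizeof(rbnode));  /* guard against overflow */
    return count * sizeof(rbnode);
}
```

On such a target, `sizeof(rbnode)` is 56 and `rbnode_bytes(51041)` is 2858296. The single `meta` byte costs a full 8 bytes once alignment padding is included.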
We can save around 15% of memory usage if we modify the structure of the rbnode:
typedef struct rbnode {
    rb_node_t node;
    uintptr_t start, end;
    uint64_t data;
} rbnode;
Nodes within an rbtree are represented as an rb_node_t structure, which resides in user-managed memory and is embedded within the rbnode tracked by the tree. However, unlike in linked list structures, the data within an rb_node_t is entirely opaque, and users cannot manually traverse the tree's binary topology as they can with lists.
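The definition of `rb_node_t` is not shown here, so the following is only one plausible layout, not necessarily the one used. It assumes the node color is packed into the lowest bit of the parent pointer (which is always zero for aligned nodes), so that no separate metadata byte and no padding are needed. All helper names (`rb_parent`, `rb_get_color`, `rb_set_parent_color`, `rbnode_from_link`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { RB_RED = 0, RB_BLACK = 1 } rb_color;

/* Assumed layout: parent pointer and color share one word. */
typedef struct rb_node {
    uintptr_t parent_color;          /* parent address | color in bit 0 */
    struct rb_node *left, *right;
} rb_node_t;

typedef struct rbnode {
    rb_node_t node;                  /* embedded, opaque tree linkage */
    uintptr_t start, end;
    uint64_t data;
} rbnode;

static inline rb_node_t *rb_parent(const rb_node_t *n) {
    return (rb_node_t *)(n->parent_color & ~(uintptr_t)1);
}

static inline rb_color rb_get_color(const rb_node_t *n) {
    return (rb_color)(n->parent_color & 1);
}

static inline void rb_set_parent_color(rb_node_t *n, rb_node_t *parent, rb_color c) {
    assert(((uintptr_t)parent & 1) == 0);   /* relies on pointer alignment */
    n->parent_color = (uintptr_t)parent | (uintptr_t)c;
}

/* Recover the enclosing rbnode from its embedded linkage. */
static inline rbnode *rbnode_from_link(rb_node_t *n) {
    return (rbnode *)((char *)n - offsetof(rbnode, node));
}
```

Under these assumptions, and on a 64-bit target, `sizeof(rbnode)` drops from 56 to 48 bytes, a reduction of about 14%, which is roughly in line with the "around 15%" estimate above.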
The CoreMark benchmark is a product of the Embedded Microprocessor Benchmark Consortium (EEMBC). It aims to provide a modern alternative to the venerable integer-focused Dhrystone benchmark, addressing many of its shortcomings and improving its relevance, primarily for low-power embedded systems, where vendors still predominantly use Dhrystone to provide rough performance estimates and a means of comparison with competing products.
Emulated x86 programs achieve approximately one-quarter the performance of native RISC-V applications.
A pun on the name of the floating-point focused "Whetstone" benchmark, the Dhrystone benchmark is, like Whetstone, a synthetic benchmark derived from statistical analysis with the goal of representing an "average" program. Dhrystone is unlike Whetstone in that it focuses entirely on integer arithmetic, making it more relevant to small systems and typical computing use cases in general than the more scientific and technically-oriented Whetstone.
The emulated x86 programs achieve approximately one-quarter the performance of native RISC-V programs.
The rv8 benchmark suite contains a small set of currently integer-centric benchmarks for regression testing of the rv8 binary translation engine. The suite contains the following test programs: