# Fast usermode x86-64 emulator for RISC-V
> 邱繼寬

## Introduction
Software vendors rarely ship RISC-V builds of their products, and because proprietary software is distributed without source code, users cannot recompile it for RISC-V themselves. This limits the software available on the platform.
Therefore, we propose a high-speed dynamic binary translation mechanism that lets existing x86-64 applications run on RISC-V processors at approximately half the performance of native execution. It supports Linux applications and, with [WINE](https://www.winehq.org/), Windows applications.
## Why Not QEMU?
Compared to the industry-standard QEMU, box64 offers several advantages:
- Performance
- Memory usage
- MIT License
- Simplicity

## Box64
Box64's Dynamic Binary Translation (DBT) system, called Dynarec (Dynamic Recompiler), is a Just-In-Time (JIT) compiler that translates x86-64 instructions into native host instructions (ARM64, RISC-V 64, or LoongArch 64) at runtime.

## Box64 + Wine
> [steam](https://drive.google.com/file/d/1kTDWWlh4HZzV5r2G1jzcDHQZVjLbEift/view?usp=sharing)
### Windows Application Emulation
The best way to demonstrate the capabilities of box64 is by emulating Microsoft Windows applications.
We leverage another open-source project, [WINE](https://www.winehq.org/), which provides a compatibility layer for Windows on Linux. This allows Windows applications originally designed for the x64 instruction set to run on RV64 without any modifications, thanks to box64 dynamically translating x64 instructions to RV64 in real time.

```
+---------------------+ \
| Windows EXE | } application
+---------------------+ /
+---------+ +---------+ \
| Windows | | Windows | \ application & system DLLs
| DLL | | DLL | /
+---------+ +---------+ /
+---------+ +---------+ +-----------+ +--------+ \
| GDI32 | | USER32 | | | | | \
| DLL | | DLL | | | | Wine | \
+---------+ +---------+ | | | Server | \ core system DLLs
+---------------------+ | | | | / (on the left side)
| Kernel32 DLL | | Subsystem | | NT-like| /
| (Win32 subsystem) | |Posix, OS/2| | Kernel | /
+---------------------+ +-----------+ | | /
| |
+---------------------------------------+ | |
| NTDLL | | |
+---------------------------------------+ +--------+
+---------------------------------------+ \
| Loader and Wrappers } Emulator
+---------------------------------------+ /
+---------------------------------------+ \
| box64 (X64 -> RV64) } Dynamic binary translator
+---------------------------------------+ /
+---------------------------------------------------+ \
| Wine drivers | } Wine specific DLLs
+---------------------------------------------------+ /
+------------+ +------------+ +--------------+ \
| libc | | libX11 | | other libs | } Linux shared libraries
+------------+ +------------+ +--------------+ / (user space)
+---------------------------------------------------+ \
| Linux Kernel }
+---------------------------------------------------+ /
+---------------------------------------------------+ \
| Linux device drivers | }
+---------------------------------------------------+ /
```
## Box64 Emulation Internals
Box64 is a Linux userspace x86-64 emulator that enables running x86-64 programs on non-x86 architectures (ARM64, RISC-V, LoongArch). It uses a hybrid approach combining an interpreter with platform-specific **Dynamic Binary Translation (DBT)** for 5-10x performance gains over pure interpretation.
## High-Level Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ x86-64 Binary │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ ELF Loader │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────┐ ┌──────────────────┐ │
│ │Parse Headers│→ │Map Segments │→ │ Relocate │→ │ Load Libraries │ │
│ └─────────────┘ └──────────────┘ └─────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Execution Engine │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ EmuRun() Loop │ │
│ │ │ │ │
│ │ ┌───────────────┴───────────────┐ │ │
│ │ ▼ ▼ │ │
│ │ ┌─────────────────────┐ ┌─────────────────────┐ │ │
│ │ │ Interpreter │ │ Dynarec │ │ │
│ │ │ (x64run*.c) │ │ (ARM64/RV64/LA64) │ │ │
│ │ │ │ │ │ │ │
│ │ │ Decode → Execute │ │ Translate → Cache │ │ │
│ │ │ one instruction │ │ → Execute native │ │ │
│ │ └─────────────────────┘ └─────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Syscalls │ │ Signals │ │ Libraries │
│ (translated) │ │ (converted) │ │ (wrapped) │
└──────────────┘ └──────────────┘ └──────────────┘
```
---
## Translation Process
### Overview
Box64's dynarec translates x86-64 code to native ARM64 (or RV64/LA64) code through a **multi-pass compilation system**:
```
┌─────────────────────────────────────────────────────────────────────┐
│ TRANSLATION PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ x86-64 Code │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Pass 0: Decode x86-64 instructions, find block boundaries │ │
│ │ Output: instruction count, x64 size, control flow │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Pass 1: Analyze FPU/SSE state, flag dependencies │ │
│ │ Output: per-instruction metadata, register liveness │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Pass 2: Calculate native code sizes │ │
│ │ Output: total ARM64 bytes needed, relocations │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Pass 3: Emit final ARM64 code, resolve relocations │ │
│ │ Output: executable native block │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Native ARM64 Code (cached in dynablock) │
└─────────────────────────────────────────────────────────────────────┘
```
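The size-then-emit idea behind passes 2 and 3 can be illustrated with a toy translator. This is a minimal sketch under stated assumptions, not Box64 code: the guest opcodes, the `toy_translate` and `toy_compile` names, and the emitted word values are all hypothetical. The same routine runs once to count native words and once to write them into an exactly sized buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy guest ISA: each opcode expands to 1-3 native words. */
enum toy_op { TOY_NOP = 0, TOY_ADD = 1, TOY_CALL = 2 };

/* Translate one guest op. With out == NULL only the size is counted
   (sizing pass); otherwise the native words are written (emit pass).
   Returns the number of native words for this op. */
static size_t toy_translate(enum toy_op op, uint32_t *out)
{
    static const uint32_t words[3][3] = {   /* placeholder encodings */
        { 0xD503201F },
        { 0x11111111, 0x22222222 },
        { 0x33333333, 0x44444444, 0x55555555 },
    };
    static const size_t count[3] = { 1, 2, 3 };
    assert((unsigned)op <= TOY_CALL);
    if (out)
        for (size_t i = 0; i < count[op]; i++)
            out[i] = words[op][i];
    return count[op];
}

/* Sizing pass, exact allocation, then emit pass. Caller frees the result. */
static uint32_t *toy_compile(const enum toy_op *code, size_t n, size_t *native_words)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += toy_translate(code[i], NULL);

    uint32_t *buf = malloc((total ? total : 1) * sizeof *buf);
    if (!buf)
        return NULL;

    size_t pos = 0;
    for (size_t i = 0; i < n; i++)
        pos += toy_translate(code[i], buf + pos);
    assert(pos == total);   /* both passes must agree on the size */

    *native_words = total;
    return buf;
}
```

Running the same translation routine in both modes keeps the size computed in the first pass and the code written in the second pass from drifting apart.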
### State Mapping: x86-64 → ARM64 Registers
Box64 maps x86-64 CPU state to ARM64 registers for efficient execution:
| x86-64 Register | ARM64 Register | Purpose |
|-----------------|----------------|---------|
| RAX | x10 | Accumulator |
| RCX | x11 | Counter |
| RDX | x12 | Data |
| RBX | x13 | Base |
| RSP | x14 | Stack Pointer |
| RBP | x15 | Base Pointer |
| RSI | x16 | Source Index |
| RDI | x17 | Destination Index |
| R8-R15 | x18-x25 | Extended registers |
| EFLAGS | x26 | Flags register |
| RIP | x27 | Instruction Pointer |
| SavedSP | x28 | Call/ret optimization |
| x64emu_t* | x0 | Emulator context |
| Scratch | x1-x9 | Temporary values |
**FPU/SSE Mapping:**
- x87 FPU stack (ST0-ST7) → NEON v0-v31 (cached, spilled to `emu->x87[]`)
- XMM0-XMM15 → NEON v0-v31 (direct mapping via neoncache)
- YMM0-YMM15 → Upper 128 bits stored in `emu->ymm[]`
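
Because the general-purpose registers in the table above are laid out in x86-64 encoding order, the mapping reduces to a constant offset. The helper below is a hypothetical illustration (`arm64_reg_for` and the enum are not Box64 identifiers) that only restates the table.

```c
#include <assert.h>

/* x86-64 general-purpose registers in encoding order (RAX = 0 ... R15 = 15). */
enum x64_reg {
    X64_RAX, X64_RCX, X64_RDX, X64_RBX, X64_RSP, X64_RBP, X64_RSI, X64_RDI,
    X64_R8,  X64_R9,  X64_R10, X64_R11, X64_R12, X64_R13, X64_R14, X64_R15
};

/* Hypothetical helper: index of the ARM64 x-register that holds the given
   x86-64 register, per the table above (RAX -> x10 ... R15 -> x25). */
static int arm64_reg_for(enum x64_reg r)
{
    assert((unsigned)r <= X64_R15);
    return 10 + (int)r;
}
```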
### Prolog/Epilog: Native Entry/Exit
**Complete execution flow diagram:**
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ DYNAREC EXECUTION LIFECYCLE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ C Code: DynaRun(emu, addr) │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ arm64_prolog │ │
│ │ 1. Save ARM64 callee-saved registers │ │
│ │ 2. Load x86-64 state: emu->regs[] → x10-x27 │ │
│ │ 3. Setup SavedSP (x28) │ │
│ │ 4. br x1 → Jump to native block │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ Native ARM64 Code (Translated Block) │ │
│ │ │ │
│ │ Executes x86-64 instructions as ARM64... │ │
│ │ │ │
│ │ On direct jump to known block: │ │
│ │ → br <target_block> (direct chaining, no overhead) │ │
│ │ │ │
│ │ On indirect jump (call ptr, ret, jmp reg): │ │
│ │ → bl arm64_next (resolve via jump table) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │
│ │ │ arm64_next │ │ │
│ │ │ 1. Save x86-64 registers (volatile) │ │ │
│ │ │ 2. Call LinkNext(emu, target_rip, from, &rip) │ │ │
│ │ │ 3. Restore registers │ │ │
│ │ │ 4. br x3 → Jump to resolved block │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ (continues in new block) │ │
│ │ │ │
│ │ On exit condition (syscall, exception, unhandled opcode): │ │
│ │ → br <arm64_epilog address> │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ arm64_epilog │ │
│ │ 1. Store x86-64 state: x10-x27 → emu->regs[] │ │
│ │ 2. Restore SP from SavedSP (x28) │ │
│ │ 3. Restore ARM64 callee-saved registers │ │
│ │ 4. ret → Return to C caller │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ C Code: Returns from DynaRun() │
│ Check emu->quit, emu->fork, handle syscalls, etc. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
**arm64_prolog.S** - Entry to native code:
```asm
arm64_prolog:
    // 1. Save ARM64 callee-saved registers (x19-x28, d8-d15)
    stp x19, x20, [sp, ...]
    stp d8, d9, [sp, ...]
    // 2. Load x86-64 state from emu structure into ARM64 registers
    ldp x10, x11, [x0, (8*0)] // RAX, RCX ← emu->regs[0,1]
    ldp x12, x13, [x0, (8*2)] // RDX, RBX ← emu->regs[2,3]
    ...
    ldp x26, x27, [x0, (8*16)] // EFLAGS, RIP
    // 3. Jump to translated native code
    br x1
```
**arm64_epilog.S** - Exit from native code:
```asm
arm64_epilog:
    // 1. Store x86-64 state back to emu structure
    stp x10, x11, [x0, (8*0)] // emu->regs[0,1] ← RAX, RCX
    ...
    stp x26, x27, [x0, (8*16)] // EFLAGS, RIP written back
    // 2. Restore ARM64 callee-saved registers
    ldp x19, x20, [sp, ...]
    ...
    ret
```
### Deferred Flags Optimization
x86-64 updates the flags on almost every arithmetic instruction, but most of those flag results are overwritten before anything reads them. Box64 therefore uses **lazy flag evaluation**:
```c
// Instead of computing flags immediately:
// adds x10, x10, x11; mrs x26, nzcv // SLOW!
// Box64 stores operation info:
emu->df_type = d_add64; // Operation type
emu->op1 = operand1; // Operand 1
emu->op2 = operand2; // Operand 2
emu->res = result; // Result
// Flags computed only when needed (branch, PUSHF, syscall)
```
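To make the deferred computation concrete, here is a minimal sketch of lazy flags for 64-bit add and subtract. The type and function names (`lazy_flags_t`, `lazy_add64`, `materialize_flags`) are hypothetical; only the idea of recording the operation, operands, and result comes from the snippet above. The flag bit positions are the standard EFLAGS ones.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { DF_ADD64, DF_SUB64 } lazy_op_t;   /* hypothetical op tags */

typedef struct {
    lazy_op_t op;
    uint64_t  op1, op2, res;
} lazy_flags_t;

/* EFLAGS bit positions: CF = 0, ZF = 6, SF = 7, OF = 11. */
enum { FLAG_CF = 1 << 0, FLAG_ZF = 1 << 6, FLAG_SF = 1 << 7, FLAG_OF = 1 << 11 };

/* Arithmetic only records what happened; no flag is computed here. */
static uint64_t lazy_add64(lazy_flags_t *lf, uint64_t a, uint64_t b)
{
    lf->op = DF_ADD64; lf->op1 = a; lf->op2 = b; lf->res = a + b;
    return lf->res;
}

static uint64_t lazy_sub64(lazy_flags_t *lf, uint64_t a, uint64_t b)
{
    lf->op = DF_SUB64; lf->op1 = a; lf->op2 = b; lf->res = a - b;
    return lf->res;
}

/* Called only when a consumer (branch, PUSHF, ...) needs the flags. */
static uint32_t materialize_flags(const lazy_flags_t *lf)
{
    uint64_t a = lf->op1, b = lf->op2, r = lf->res;
    uint32_t f = 0;
    if (r == 0)  f |= FLAG_ZF;
    if (r >> 63) f |= FLAG_SF;
    if (lf->op == DF_ADD64) {
        if (r < a)                      f |= FLAG_CF; /* unsigned carry out */
        if (((a ^ r) & (b ^ r)) >> 63)  f |= FLAG_OF; /* signed overflow */
    } else {
        if (a < b)                      f |= FLAG_CF; /* unsigned borrow */
        if (((a ^ b) & (a ^ r)) >> 63)  f |= FLAG_OF; /* signed overflow */
    }
    return f;
}
```

A sequence of arithmetic instructions then costs four stores each, and the flag logic runs once, for the last operation before a consumer.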
### Dynablock Structure
Each compiled block is tracked by a `dynablock_t`:
```
┌─────────────────────────────────────────────────────────────┐
│ Dynablock Memory Layout │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────┐ │
│ │ dynablock_t* (8 bytes) ← Self-reference │ │
│ ├─────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ Native ARM64 Code │ │
│ │ (native_size bytes) │ │
│ │ │ │
│ ├─────────────────────────────────────────────────────┤ │
│ │ Table64 (constants for 64-bit immediates) │ │
│ ├─────────────────────────────────────────────────────┤ │
│ │ JmpNext stub (branch to next block) │ │
│ ├─────────────────────────────────────────────────────┤ │
│ │ InstSize[] (x64 ↔ native instruction mapping) │ │
│ ├─────────────────────────────────────────────────────┤ │
│ │ Architecture-specific metadata │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
```c
typedef struct dynablock_s {
    void*       block;       // Native code (block-8 = self pointer)
    void*       x64_addr;    // Original x86-64 address
    uintptr_t   x64_size;    // x86-64 code size
    size_t      native_size; // Native ARM64 code size
    uint32_t    in_used;     // Reference count (execution safety)
    uint32_t    hash;        // Hash for SMC detection
    uint8_t     done;        // Block compilation complete
    uint8_t     gone;        // Marked for deletion
    uint8_t     dirty;       // Needs validation
    int         isize;       // Number of entries in instsize[]
    instsize_t* instsize;    // x64↔native instruction mapping
    void*       jmpnext;     // Jump to next block stub
} dynablock_t;
```
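The self-reference stored just before the native code lets a runtime recover the block descriptor from a code address alone. The sketch below illustrates that layout with hypothetical names (`toy_block_t`, `toy_block_new`, `toy_block_from_code`) and plain `malloc` memory instead of an executable mapping; it is an illustration of the idea, not Box64's allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct toy_block {
    void  *block;        /* start of native code; header sits right before it */
    size_t native_size;
} toy_block_t;

/* Allocate [descriptor pointer][code bytes] and fill both. size must be > 0. */
static toy_block_t *toy_block_new(const void *code, size_t size)
{
    assert(code != NULL && size > 0);
    toy_block_t   *db  = malloc(sizeof *db);
    unsigned char *mem = malloc(sizeof db + size);
    if (!db || !mem) {
        free(db);
        free(mem);
        return NULL;
    }
    memcpy(mem, &db, sizeof db);          /* self-reference header */
    memcpy(mem + sizeof db, code, size);  /* "native code" */
    db->block = mem + sizeof db;
    db->native_size = size;
    return db;
}

/* Recover the descriptor from the address of the first code byte. */
static toy_block_t *toy_block_from_code(const void *code_start)
{
    toy_block_t *db;
    memcpy(&db, (const unsigned char *)code_start - sizeof db, sizeof db);
    return db;
}

static void toy_block_free(toy_block_t *db)
{
    if (db) {
        free((unsigned char *)db->block - sizeof db);
        free(db);
    }
}
```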
**Key Files:**
- `src/dynarec/dynarec.c` - Main execution loop, EmuRun, LinkNext
- `src/dynarec/dynablock.c` - Block lifecycle management
- `src/dynarec/dynarec_native.c` - FillBlock64, multi-pass compilation
- `src/dynarec/arm64/arm64_prolog.S` - Native entry
- `src/dynarec/arm64/arm64_epilog.S` - Native exit
- `src/dynarec/arm64/dynarec_arm64_*.c` - Opcode translation (60+ files)
---
## Program Counter Mapping (x86-64 RIP to ARM64)
### Jump Table Architecture
Box64 uses a **multi-level sparse lookup table** to map x86-64 addresses to native ARM64 code:
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Jump Table Hierarchy │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ x86-64 Address: 0x00007F1234567890 │
│ │
│ Split into indices: │
│ ┌────────────┬────────────┬────────────┬────────────┐ │
│ │ idx4 (16b) │ idx3 (16b) │ idx2 (16b) │ idx1 (16b) │ │
│ │ 0x0000 │ 0x7F12 │ 0x3456 │ 0x7890 │ │
│ └─────┬──────┴─────┬──────┴─────┬──────┴─────┬──────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │jmptbl4[0]│→│jmptbl3 │→│jmptbl2 │→│jmptbl1 │→ native_code_addr │
│ │ (65536) │ │ [0x7F12] │ │ [0x3456] │ │ [0x7890] │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │ │
│ └────────────┴────────────┴────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Default: native_next│ (unmapped → interpreter fallback) │
│ │ Mapped: block->block│ (dynablock native code address) │
│ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
```
**Properties:**
- **Fixed-depth lookup** - Four 16-bit table levels, so address translation takes a constant number of memory loads
- **Lazy allocation** - Only creates table entries when addresses are used
- **Default tables** - Unmapped regions point to `native_next` handler
- **Atomic updates** - Thread-safe via compare-and-swap
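
A single-threaded sketch of such a table follows. It assumes four levels of 16 bits each, as in the diagram, and uses shared default tables so a lookup never needs a NULL check. All names (`jt_init`, `jt_lookup`, `jt_set`) are hypothetical, and the compare-and-swap updates mentioned above are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define JT_SIZE 65536u   /* 16 index bits per level */
#define JT_IDX(addr, lvl) ((size_t)(((addr) >> (16 * (lvl))) & 0xFFFFu))

/* Shared default tables: every unmapped path ends at jt_fallback. */
static uintptr_t    jt_default1[JT_SIZE];
static uintptr_t   *jt_default2[JT_SIZE];
static uintptr_t  **jt_default3[JT_SIZE];
static uintptr_t ***jt_top[JT_SIZE];
static uintptr_t    jt_fallback;

static void jt_init(uintptr_t fallback)
{
    jt_fallback = fallback;
    for (size_t i = 0; i < JT_SIZE; i++) {
        jt_default1[i] = fallback;
        jt_default2[i] = jt_default1;
        jt_default3[i] = jt_default2;
        jt_top[i]      = jt_default3;
    }
}

/* Four dependent loads, no branches. */
static uintptr_t jt_lookup(uint64_t addr)
{
    return jt_top[JT_IDX(addr, 3)][JT_IDX(addr, 2)][JT_IDX(addr, 1)][JT_IDX(addr, 0)];
}

/* Lazily replace default tables on the path to addr. Returns 0 or -1. */
static int jt_set(uint64_t addr, uintptr_t native)
{
    size_t i4 = JT_IDX(addr, 3), i3 = JT_IDX(addr, 2);
    size_t i2 = JT_IDX(addr, 1), i1 = JT_IDX(addr, 0);

    if (jt_top[i4] == jt_default3) {
        uintptr_t ***t = malloc(JT_SIZE * sizeof *t);
        if (!t) return -1;
        for (size_t i = 0; i < JT_SIZE; i++) t[i] = jt_default2;
        jt_top[i4] = t;
    }
    if (jt_top[i4][i3] == jt_default2) {
        uintptr_t **t = malloc(JT_SIZE * sizeof *t);
        if (!t) return -1;
        for (size_t i = 0; i < JT_SIZE; i++) t[i] = jt_default1;
        jt_top[i4][i3] = t;
    }
    if (jt_top[i4][i3][i2] == jt_default1) {
        uintptr_t *t = malloc(JT_SIZE * sizeof *t);
        if (!t) return -1;
        for (size_t i = 0; i < JT_SIZE; i++) t[i] = jt_fallback;
        jt_top[i4][i3][i2] = t;
    }
    jt_top[i4][i3][i2][i1] = native;
    return 0;
}
```

The default tables are what make the lookup branch-free: an address that was never translated still resolves through four valid pointers and lands on the fallback handler.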
### Block Lookup Flow
```
DBGetBlock(emu, x64_addr):
│
├─► Check hot page (self-modifying code region)
│ └─► If hot: return NULL (use interpreter)
│
├─► Lookup in jump table: getDB(x64_addr)
│ └─► Returns dynablock_t* or NULL
│
├─► If block exists and needs validation:
│ └─► Compute hash, compare with db->hash
│ └─► Mismatch: Invalidate, recreate block
│
└─► If no block: FillBlock64() → register in jump table
```
### Instruction Size Mapping
Each dynablock tracks the correspondence between x86-64 and ARM64 instructions:
```c
typedef struct instsize_s {
    unsigned char x64:4; // x86-64 instruction bytes (0-15)
    unsigned char nat:4; // ARM64 instructions × 4 bytes
} instsize_t;
```
This enables:
1. **Signal handling** - Map native PC back to x86-64 RIP
2. **Debugging** - Determine which x86-64 instruction is executing
3. **SMC detection** - Know exactly which instructions were modified
### Direct Block Chaining (LinkNext)
When one block jumps to another, Box64 avoids jump table lookup overhead:
```
Block A (x64: 0x1000): Block B (x64: 0x1020):
┌──────────────────────┐ ┌──────────────────────┐
│ ... ARM64 code ... │ │ ... ARM64 code ... │
│ mov x27, #0x1020 │ │ │
│ br <Block B native> │───────────►│ │
└──────────────────────┘ └──────────────────────┘
```
**arm64_next.S** - For indirect jumps or first-time lookups:
```asm
arm64_next:
    // Save x86-64 state
    stp x10, x11, [sp, ...]
    // Call C function to resolve target
    mov x1, x27 // x86-64 RIP
    bl LinkNext // Returns native code address in x0
    mov x3, x0 // Keep the resolved address in a scratch register
    // Restore state and jump
    ldp x10, x11, [sp, ...]
    br x3 // Jump to resolved native code
```
**Key Files:**
- `src/custommem.c` - Jump table implementation (lines 1881-2028)
- `src/dynarec/arm64/arm64_next.S` - Block linking helper
---
## Memory Management Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Memory Management Overview │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Dynarec Code Pool (2MB chunks) │ │
│ │ ┌───────────┬───────────┬───────────┬───────────┬──────────────┐ │ │
│ │ │ Block 1 │ Block 2 │ FREE │ Block 3 │ FREE │ │ │
│ │ │ (in_used) │ │ │ (dirty) │ │ │ │
│ │ └───────────┴───────────┴───────────┴───────────┴──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Protection Tracking (memprot rbtree) │ │
│ │ │ │
│ │ Page 0x1000: PROT_READ | PROT_EXEC | PROT_DYNAREC │ │
│ │ Page 0x2000: PROT_READ | PROT_WRITE │ │
│ │ Page 0x3000: PROT_READ | PROT_EXEC | PROT_NEVERCLEAN (hot page) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Self-Modifying Code Detection: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Block compiled → Page write-protected (PROT_DYNAREC) │ │
│ │ 2. Write attempt → SIGSEGV triggered │ │
│ │ 3. Signal handler → unprotectDB() + MarkDynablock() │ │
│ │ 4. Next execution → hash mismatch → block invalidated │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
```
### Custom Allocator
Box64 uses a custom memory allocator for dynarec code:
**Block Sizes:**
- `MMAPSIZE = 512 KB` - Standard allocation for regular memory pools
- `DYNMMAPSZ = 2 MB` - Primary allocation for dynarec code blocks
- `DYNMMAPSZ0 = 128 KB` - Smaller initial block (memory optimization)
**Allocation Flow:**
1. Tries to reuse free space in existing mmaplist chunks
2. If insufficient space, triggers `PurgeDynarecMap()` to reclaim stale blocks
3. Allocates new 2MB chunks using `mmap(PROT_READ|PROT_WRITE|PROT_EXEC)`
4. Tracks allocation in `rbt_dynmem` red-black tree for fast lookup
### Memory Protection for SMC Detection
Box64 uses page-level protection to detect self-modifying code:
**Custom Protection Flags:**
```c
#define PROT_DYNAREC 0x80 // Write-protected for SMC detection
#define PROT_DYNAREC_R 0x40 // Read-only dynarec page
#define PROT_NEVERCLEAN 0x100 // Hot page - never clean (mixed code/data)
```
**SMC Detection Flow:**
```
1. After block compilation → protectDB(addr, size)
└─► Removes PROT_WRITE, adds PROT_DYNAREC
2. Write to protected page → SIGSEGV (SEGV_ACCERR)
└─► Signal handler: unprotectDB(), MarkDynablock()
3. Next execution → hash mismatch detected
└─► InvalidDynablock(), create new block
```
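The protect, fault, unprotect cycle can be reproduced in a few lines on Linux. This is a minimal sketch assuming a single watched page and a process-wide `SIGSEGV` handler; `smc_watch`, `smc_handler`, and `block_dirty` are hypothetical names. Note that `mprotect` is not on the POSIX list of async-signal-safe functions, although calling it from a fault handler is common practice on Linux.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static unsigned char *watched_page;        /* page holding translated guest code */
static size_t         watched_size;
static volatile sig_atomic_t block_dirty;  /* set when a write hit the page */

static void smc_handler(int sig, siginfo_t *info, void *uctx)
{
    (void)sig;
    (void)uctx;
    unsigned char *fault = (unsigned char *)info->si_addr;
    if (watched_page && fault >= watched_page && fault < watched_page + watched_size) {
        /* Make the page writable again so the faulting store is retried,
           and remember that the translation may now be stale. */
        mprotect(watched_page, watched_size, PROT_READ | PROT_WRITE);
        block_dirty = 1;
        return;
    }
    /* Not our page: restore the default action so the retry terminates. */
    signal(SIGSEGV, SIG_DFL);
}

/* Install the handler and write-protect [page, page + size). Returns 0 or -1. */
static int smc_watch(void *page, size_t size)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = smc_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    watched_page = page;
    watched_size = size;
    block_dirty  = 0;
    return mprotect(page, size, PROT_READ);
}
```

After the handler returns, the kernel restarts the faulting store, which now succeeds. A runtime would then check the dirty mark before the next execution of any block on that page.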
### Hot Page Tracking
Pages with frequent code/data mixing are marked to avoid continuous invalidation:
```c
typedef union hotpage_s {
    struct {
        uint64_t addr:36; // Page address
        uint64_t cnt:28;  // Reference counter
    };
    uint64_t x;           // Atomic update
} hotpage_t;
#define N_HOTPAGE 32 // Track 32 hot pages
#define HOTPAGE_MARK 64 // Marking threshold
```
### Process and Thread Handling
```
┌─────────────────────────────────────────────────────────────────────┐
│ PROCESS │
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ box64context_t *my_context (SINGLETON) │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ SHARED by ALL threads: │ │ │
│ │ │ │ │ │
│ │ │ • maplib (symbol tables) │ │ │
│ │ │ • elfs[] (loaded ELF headers) │ │ │
│ │ │ • system (bridge table) │ │ │
│ │ │ • dlprivate (dlopen handles) │ │ │
│ │ │ • signals[] (signal handlers) │ │ │
│ │ │ • atforks[] (fork handlers) │ │ │
│ │ │ • globdata (global data relocations) │ │ │
│ │ │ • tlsdata (initial TLS template) │ │ │
│ │ │ • db_sizes (dynarec block sizes) │ │ │
│ │ │ • stacksizes (thread stack tracking) │ │ │
│ │ │ • box64_path, box64_ld_lib (PATH vars) │ │ │
│ │ │ • argc, argv, envv (program args) │ │ │
│ │ │ • stack (main thread stack) │ │ │
│ │ │ • seggdt[] (GDT segments) │ │ │
│ │ │ • canary[] (stack protector) │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ SYNCHRONIZATION (protecting shared resources): │ │ │
│ │ │ │ │ │
│ │ │ • mutex_dyndump (dynarec block creation) │ │ │
│ │ │ • mutex_trace (trace output) │ │ │
│ │ │ • mutex_tls (TLS operations) │ │ │
│ │ │ • mutex_thread (thread list) │ │ │
│ │ │ • mutex_bridge (bridge table) │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ x64emu_t #0 │ │ x64emu_t #1 │ │ x64emu_t #2 │ │
│ │ (Thread 0) │ │ (Thread 1) │ │ (Thread 2) │ │
│ │ │ │ │ │ │ │
│ │ PER-THREAD: │ │ PER-THREAD: │ │ PER-THREAD: │ │
│ │ • regs[] │ │ • regs[] │ │ • regs[] │ │
│ │ • eflags │ │ • eflags │ │ • eflags │ │
│ │ • xmm[] │ │ • xmm[] │ │ • xmm[] │ │
│ │ • x87stack │ │ • x87stack │ │ • x87stack │ │
│ │ • init_stack│ │ • init_stack│ │ • init_stack│ │
│ │ • segldt[] │ │ • segldt[] │ │ • segldt[] │ │
│ │ • tlsdata │ │ • tlsdata │ │ • tlsdata │ │
│ │ • jmpbuf │ │ • jmpbuf │ │ • jmpbuf │ │
│ │ │ │ │ │ │ │
│ │ context ────┼──────┼─────────────┼──────┼─► my_context│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
Each thread's `x64emu_t` holds a pointer back to the shared `my_context`.
### Fork Handling
Fork is handled through a **deferred execution model**:
```c
// In wrapper (wrappedlibc.c)
pid_t my_fork(x64emu_t* emu) {
    emu->quit = 1; // Signal to exit main loop
    emu->fork = 1; // Mark fork pending
    return 0;
}
// In main loop (dynarec.c / x64run.c)
if(emu->fork) {
    int forktype = emu->fork;
    emu->fork = 0;
    emu = x64emu_fork(emu, forktype); // Execute actual fork
}
```
**x64emu_fork() Flow:**
1. Execute `pthread_atfork()` prepare callbacks
2. Perform actual `fork()` syscall
3. In parent: Execute parent callbacks, return child PID
4. In child: Execute child callbacks, return 0
**Key insight:** The wrapper does not fork immediately. It only sets flags, and the actual fork runs once control has returned to the main loop, which guarantees that the current x86-64 instruction has completed first.
### pthread_create Handling
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Thread Creation Flow │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ x86-64: pthread_create(&thread, attr, start_routine, arg) │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ my_pthread_create(): │ │
│ │ 1. Allocate stack (mmap 2MB) │ │
│ │ 2. Create new x64emu_t context │ │
│ │ 3. Copy parent CPU state (SetupX64Emu) │ │
│ │ 4. Pre-compile entry point with dynarec │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Native pthread_create(pthread_routine, emuthread) │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┴────────────────────┐ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Parent Thread │ │ Child Thread │ │
│ │ (continues) │ │ pthread_routine: │ │
│ │ │ │ - Set TLS │ │
│ │ │ │ - Setup stack │ │
│ │ │ │ - R_RIP = entry │ │
│ │ │ │ - DynaRun() │ │
│ └──────────────────┘ └──────────────────┘ │
│ │
│ Shared: dynarec code cache, box64context_t, symbols │
│ Per-thread: x64emu_t, TLS data, stack │
└─────────────────────────────────────────────────────────────────────────────┘
```
Thread creation allocates a new emulator context for each thread:
```c
int my_pthread_create(x64emu_t *emu, pthread_t* t, void* attr,
                      void* start_routine, void* arg) {
    // 1. Allocate stack (default 2MB)
    void* stack = mmap(NULL, stacksize, PROT_READ|PROT_WRITE, ...);
    // 2. Create new x64emu_t for thread
    x64emu_t *emuthread = NewX64Emu(my_context, start_routine, stack, ...);
    SetupX64Emu(emuthread, emu); // Copy parent's CPU state
    // 3. Bundle context, entry point, and argument for pthread_routine
    emuthread_t *et = (emuthread_t*)calloc(1, sizeof(emuthread_t));
    et->emu = emuthread;
    et->fnc = (uintptr_t)start_routine;
    et->arg = arg;
    // 4. Pre-compile entry point
    if(BOX64ENV(dynarec))
        DBGetBlock(emu, start_routine, 1, 0);
    // 5. Create native thread
    return pthread_create(t, attr, pthread_routine, et);
}
```
**Thread Execution (pthread_routine):**
```c
static void* pthread_routine(void* p) {
    emuthread_t *et = (emuthread_t*)p;
    x64emu_t *emu = et->emu;      // Emulator context created for this thread
    // Register thread's emulator context in TLS
    pthread_setspecific(thread_key, p);
    // Setup x86-64 stack frame
    Push64(emu, 0);               // Backtrace marker
    R_RIP = et->fnc;              // Set entry point
    R_RDI = (uintptr_t)et->arg;   // Thread argument
    // Execute thread
    DynaRun(emu);
    return (void*)R_RAX;          // Return thread result
}
```
**Key insight:** All threads share the same dynarec code cache (`my_context->dynablocks`) but each has its own `x64emu_t` (CPU state).
**Key Files:**
- `src/custommem.c` - Memory management (2700+ lines)
- `src/libtools/threads.c` - Thread handling
- `src/wrapped/wrappedlibc.c` - Fork wrapper
- `src/emu/x64int3.c` - x64emu_fork implementation
---
## Signal Handling
When a signal arrives while a native block is executing, Box64 must map the ARM64 PC back to the corresponding x86-64 RIP before handling it:
### Signal Flow
```
Signal (SIGSEGV/SIGBUS/etc.) during native block execution
│
▼
my_box64signalhandler():
│
├─► Get native PC from ucontext (ARM64 register context)
│
├─► FindDynablockFromNativeAddress(native_pc)
│ └─► Search through compiled dynablocks
│
├─► getX64Address(db, native_pc)
│ └─► Walk instsize[] array to map native→x86-64
│
├─► Is this Self-Modifying Code?
│ ├─► Yes: unprotectDB(), mark dirty
│ │ copyUCTXreg2Emu(), siglongjmp(2)
│ └─► No: Construct x86-64 ucontext, call user handler
│
└─► Return or longjmp based on handler result
```
### RIP Determination from Native PC
```c
uintptr_t getX64Address(dynablock_t* db, uintptr_t native_addr) {
    uintptr_t x64addr = (uintptr_t)db->x64_addr;
    uintptr_t armaddr = (uintptr_t)db->block;
    for (int i = 0; i < db->isize; i++) {
        int x64sz = db->instsize[i].x64;
        int armsz = db->instsize[i].nat * 4; // ARM64 = 4 bytes each
        if (native_addr >= armaddr && native_addr < armaddr + armsz)
            return x64addr; // Found the instruction!
        x64addr += x64sz;
        armaddr += armsz;
    }
    return 0; // Not found
}
```
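As a worked example, the same walk can be run standalone on a three-instruction block. The types and the function name below (`toy_instsize_t`, `toy_native_to_x64`) are hypothetical stand-ins for the Box64 structures, and the addresses are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned char x64:4;   /* x86-64 instruction length in bytes */
    unsigned char nat:4;   /* number of 4-byte native instructions */
} toy_instsize_t;

/* Walk the size table until the range containing native_pc is found.
   Returns the matching x86-64 address, or 0 if native_pc is outside the block. */
static uintptr_t toy_native_to_x64(uintptr_t x64_start, uintptr_t native_start,
                                   const toy_instsize_t *sizes, int count,
                                   uintptr_t native_pc)
{
    assert(sizes != 0 && count >= 0);
    uintptr_t x64addr = x64_start;
    uintptr_t nataddr = native_start;
    for (int i = 0; i < count; i++) {
        uintptr_t natsz = (uintptr_t)sizes[i].nat * 4;
        if (native_pc >= nataddr && native_pc < nataddr + natsz)
            return x64addr;
        x64addr += sizes[i].x64;
        nataddr += natsz;
    }
    return 0;
}
```

With sizes `{3,2}, {1,1}, {5,4}`, an x86-64 start of `0x1000`, and a native start of `0x8000`, the native ranges are `0x8000-0x8007`, `0x8008-0x800B`, and `0x800C-0x801B`, which map to `0x1000`, `0x1003`, and `0x1004` respectively.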
### setjmp/longjmp for Control Flow
```c
// In EmuRun() - dynarec.c
JUMPBUFF jmpbuf[1];
emu->jmpbuf = jmpbuf;
if ((skip = SigSetJmp(emu->jmpbuf, 1))) {
// Returned from signal via longjmp
// skip == 1: use interpreter for one instruction
// skip == 2: block was dirty, regenerate
// skip == 3: retry with dynarec
}
```
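The resume-code pattern can be tried in isolation with POSIX `sigsetjmp`/`siglongjmp`. In this minimal sketch the "signal handler" is simulated by a plain function that jumps back with a reason code. The enum values mirror the three `skip` cases listed above, while the names (`toy_emu_run`, `toy_run_block`) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>

enum resume_reason {
    RESUME_NONE       = 0,  /* block ran to completion */
    RESUME_INTERP_ONE = 1,  /* interpret one instruction */
    RESUME_REGEN      = 2,  /* block was dirty, regenerate */
    RESUME_RETRY      = 3   /* retry with dynarec */
};

static sigjmp_buf run_jmpbuf;

/* Stand-in for a translated block: a non-zero reason simulates a signal
   handler deciding to abandon the block and jump back to the run loop. */
static void toy_run_block(int reason)
{
    if (reason != RESUME_NONE)
        siglongjmp(run_jmpbuf, reason);
}

/* Returns how the run ended. */
static int toy_emu_run(int simulated_fault)
{
    switch (sigsetjmp(run_jmpbuf, 1)) {
    case 0:                 /* direct return from sigsetjmp */
        toy_run_block(simulated_fault);
        return RESUME_NONE;
    case RESUME_INTERP_ONE: return RESUME_INTERP_ONE;
    case RESUME_REGEN:      return RESUME_REGEN;
    case RESUME_RETRY:      return RESUME_RETRY;
    default:                return -1;
    }
}
```

Passing `1` as the second argument to `sigsetjmp` saves the signal mask, so a jump out of a real handler also restores the mask that was in effect in the run loop.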
### Signal Handler Translation
For x86-64 signal handlers, Box64 constructs a fake x86-64 ucontext:
```c
// Stack layout for x86-64 signal handler:
[xstate - FPU/SSE/AVX state]
[siginfo_t]
[x64_ucontext_t with gregs, eflags, segments]
```
**copyUCTXreg2Emu()** extracts ARM64 register values and maps them to x86-64:
```c
#define GO(R) emu->regs[_##R].q[0] = CONTEXT_REG(p, x##R)
GO(RAX); GO(RCX); GO(RDX); // ... all 16 GP registers
emu->ip.q[0] = x64_rip; // Computed from getX64Address()
```
**Key Files:**
- `src/libtools/signals.c` - Signal handling (2500+ lines)
- `src/dynarec/dynablock.c` - getX64Address()
- `src/os/emit_signals_linux.c` - Signal emission
---
## Library Wrapping Mechanism
Box64 calls the host's native libraries instead of emulating their x86-64 counterparts, so library code (OpenGL, SDL, libc, etc.) runs at native speed rather than through the translator.
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Library Wrapping Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ x86-64 Code calling SDL_Init(): │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ mov edi, 0x20 ; SDL_INIT_VIDEO │ │
│ │ call <bridge_addr> ; Call to bridge (INT3 instruction) │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Bridge (onebridge_t) │ │
│ │ ┌──────┬──────┬────────────────┬─────────────────┬──────┐ │ │
│ │ │ 0xCC │ 'SC' │ wrapper_fn_ptr │ native_SDL_Init │ 0xC3 │ │ │
│ │ │ INT3 │ │ │ │ RET │ │ │
│ │ └──────┴──────┴────────────────┴─────────────────┴──────┘ │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Wrapper Function (iFu) │ │
│ │ // Extract x86-64 args from registers │ │
│ │ uint32_t flags = R_EDI; │ │
│ │ │ │
│ │ // Call native function (ARM64 ABI) │ │
│ │ int result = native_SDL_Init(flags); │ │
│ │ │ │
│ │ // Store result in x86-64 return register │ │
│ │ R_EAX = result; │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Native ARM64 SDL_Init() │ │
│ │ (runs at full native speed) │ │
│ └────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
```
### Wrapper Generation System
The `rebuild_wrappers.py` script generates wrapper code from declarative specifications:
**Input:** `*_private.h` files define wrapped functions:
```c
GO(SDL_Init, iFu) // int SDL_Init(uint32_t flags)
GO(SDL_Quit, vFv) // void SDL_Quit(void)
GOM(SDL_AddEventWatch, vFEpp) // Special: needs emu context for callbacks
DATA(SDL_version, 4) // Data symbol, 4 bytes
```
**Output:**
- `wrapper.c` - Function implementations (1.2MB)
- `wrapper.h` - Function declarations
- Per-library `*types.h`, `*defs.h` files
### Signature System
Format: `[ReturnType]F[Arguments...]`
| Code | Type | Code | Type |
|------|------|------|------|
| `v` | void | `p` | void* |
| `i` | int32 | `I` | int64 |
| `u` | uint32 | `U` | uint64 |
| `f` | float | `d` | double |
| `c` | int8 | `C` | uint8 |
| `E` | x64emu_t* | `V` | varargs |
**Examples:**
- `iFpp` → `int func(void*, void*)`
- `vFEpp` → `void func(emu, void*, void*)` (special handling)
- `dFd` → `double func(double)`
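
A small decoder makes the format explicit. This sketch handles only the type codes listed in the table above, and `sig_type_name` and `sig_parse` are hypothetical helpers, not functions from `rebuild_wrappers.py` or Box64.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Name of a type code from the table above, or NULL if unknown. */
static const char *sig_type_name(char code)
{
    switch (code) {
    case 'v': return "void";      case 'p': return "void*";
    case 'i': return "int32";     case 'I': return "int64";
    case 'u': return "uint32";    case 'U': return "uint64";
    case 'f': return "float";     case 'd': return "double";
    case 'c': return "int8";      case 'C': return "uint8";
    case 'E': return "x64emu_t*"; case 'V': return "varargs";
    default:  return NULL;
    }
}

/* Split "<ret>F<args>" into its parts. On success stores the return code in
   *ret, points *args at the argument codes, and returns the argument count
   (0 for the "v" placeholder, as in "vFv"). Returns -1 if malformed. */
static int sig_parse(const char *sig, char *ret, const char **args)
{
    assert(sig && ret && args);
    if (!sig[0] || sig[1] != 'F' || !sig[2] || !sig_type_name(sig[0]))
        return -1;
    for (const char *p = sig + 2; *p; p++)
        if (!sig_type_name(*p))
            return -1;
    *ret  = sig[0];
    *args = sig + 2;
    if (strcmp(*args, "v") == 0)
        return 0;
    return (int)strlen(*args);
}
```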
### Bridge Mechanism
Bridges translate between x86-64 code and native functions:
```c
typedef union onebridge_s {
    struct {
        uint8_t     CC;   // 0xCC (INT3 breakpoint)
        uint8_t     S, C; // 'S' 'C' signature marker
        wrapper_t   w;    // Wrapper function pointer
        uintptr_t   f;    // Native function address
        uint8_t     C3;   // 0xC3 (RET)
        const char* name; // Function name (debug)
    };
} onebridge_t;
```
**Bridge Invocation Flow:**
1. x86-64 code calls bridge address (executes INT3)
2. Emulator catches INT3, identifies bridge
3. Wrapper extracts arguments from x86-64 registers
4. Native function called with ARM64 ABI
5. Result stored in x86-64 return register (RAX/XMM0)
6. RET returns to x86-64 code
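
Steps 3 to 5 can be sketched for the `iFu` signature as follows. The emulator state is reduced to a hypothetical `toy_emu_t` with a bare register array, and `toy_bridge_t`, `toy_iFu`, and `toy_native_init` are illustrative stand-ins rather than Box64 definitions. The sketch also assumes a platform where function pointers round-trip through `uintptr_t`, as on mainstream Linux targets.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical minimal CPU state: 16 GPRs in x86-64 encoding order. */
enum { T_RAX = 0, T_RDI = 7 };
typedef struct { uint64_t regs[16]; } toy_emu_t;

typedef void (*toy_wrapper_t)(toy_emu_t *emu, uintptr_t fnc);

/* Reduced bridge: which wrapper to run and which native function to call. */
typedef struct {
    toy_wrapper_t w;
    uintptr_t     f;
} toy_bridge_t;

/* Wrapper for the signature iFu: int32 func(uint32). */
typedef int32_t (*iFu_t)(uint32_t);
static void toy_iFu(toy_emu_t *emu, uintptr_t fnc)
{
    iFu_t fn = (iFu_t)fnc;
    uint32_t arg0 = (uint32_t)emu->regs[T_RDI];   /* first integer argument: EDI */
    /* Writing EAX zero-extends into RAX on x86-64. */
    emu->regs[T_RAX] = (uint32_t)fn(arg0);
}

/* Stand-in for a native library function such as SDL_Init(). */
static int32_t toy_native_init(uint32_t flags)
{
    return flags == 0x20 ? 0 : -1;
}

/* What the emulator does once it has identified the bridge. */
static void toy_bridge_call(toy_emu_t *emu, const toy_bridge_t *b)
{
    assert(emu && b && b->w);
    b->w(emu, b->f);
}
```

The wrapper itself is ordinary C, so the host compiler places `arg0` in the correct native argument register; no hand-written ABI shuffling is needed.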
### Callback Handling
When native code calls back into x86-64 code (e.g., SDL event callbacks):
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Callback Flow (Native → x86-64) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ x86-64 code registers callback: │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ SDL_AddEventWatch(my_x64_callback, userdata); │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Wrapper: find_eventfilter_Fct(my_x64_callback) │ │
│ │ → Stores x64 callback address in slot │ │
│ │ → Returns native_callback_wrapper address │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Native SDL receives: native_callback_wrapper + userdata │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ [Event occurs] │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Native SDL calls: native_callback_wrapper(userdata, event) │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ native_callback_wrapper(): │ │
│ │ return RunFunctionFmt(x64_callback_addr, "pp", userdata, event);│ │
│ │ → Sets up x86-64 registers (RDI=userdata, RSI=event) │ │
│ │ → Calls DynaRun() to execute x86-64 callback │ │
│ │ → Returns result from R_EAX │ │
│ └────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
```
```c
// Callback slot pattern
static uintptr_t my_callback_fct_0 = 0;   // x86-64 function pointer

static int my_callback_0(void* arg) {
    // Call back into x86-64 code
    return (int)RunFunctionFmt(my_callback_fct_0, "p", arg);
}

static void* find_callback_Fct(void* fct) {
    // Check if already wrapped
    if(my_callback_fct_0 == (uintptr_t)fct) return my_callback_0;
    // Find empty slot
    if(my_callback_fct_0 == 0) {
        my_callback_fct_0 = (uintptr_t)fct;
        return my_callback_0;
    }
    // ... more slots
    return NULL;   // no free slot left
}
```
**RunFunctionFmt()** - Execute x86-64 callback:
```c
uint64_t RunFunctionFmt(uintptr_t fnc, const char* fmt, ...) {
    // Parse format string ("pp", "Up", etc.)
    // Set up x86-64 calling convention:
    //   RDI, RSI, RDX, RCX, R8, R9 for integer args
    //   XMM0-XMM7 for float args
    // Call x86-64 function via DynaCall
    // Return result from R_RAX
}
```
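The argument-marshalling half of this can be sketched as a small variadic function. `guest_args_t` and `marshal_args` are hypothetical names, and the letter meanings (`p` pointer, `i` int, `U` 64-bit unsigned, `d` double) are assumptions made for this example; arguments that would spill to the stack are not modelled.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical guest argument registers. */
typedef struct {
    uint64_t ireg[6];   /* RDI, RSI, RDX, RCX, R8, R9 in that order */
    double   xmm[8];    /* low lane of XMM0..XMM7 */
    int      n_int, n_fp;
} guest_args_t;

/* Fill guest argument registers from a format string.
 * Returns 0 on success, -1 on an unknown letter or too many arguments. */
static int marshal_args(guest_args_t* g, const char* fmt, ...) {
    assert(g && fmt);
    memset(g, 0, sizeof *g);
    int rc = 0;
    va_list va;
    va_start(va, fmt);
    for (const char* p = fmt; *p && rc == 0; ++p) {
        switch (*p) {
        case 'p':
        case 'i':
        case 'U':
            if (g->n_int == 6) { rc = -1; break; }
            if (*p == 'p')
                g->ireg[g->n_int++] = (uint64_t)(uintptr_t)va_arg(va, void*);
            else if (*p == 'i')
                g->ireg[g->n_int++] = (uint64_t)(int64_t)va_arg(va, int);
            else
                g->ireg[g->n_int++] = va_arg(va, uint64_t);
            break;
        case 'd':
            if (g->n_fp == 8) { rc = -1; break; }
            g->xmm[g->n_fp++] = va_arg(va, double);
            break;
        default:
            rc = -1;   /* letter not covered by this sketch */
        }
    }
    va_end(va);
    return rc;
}
```

Integer and floating-point arguments are counted separately, which matches the System V rule that each class consumes its own register sequence.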
### ABI Translation
| Aspect | x86-64 (System V) | ARM64 |
|--------|-------------------|-------|
| Int args | RDI, RSI, RDX, RCX, R8, R9 | X0-X7 |
| Float args | XMM0-XMM7 | V0-V7 |
| Return (int) | RAX | X0 |
| Return (float) | XMM0 | V0 |
Wrappers handle the translation automatically: they extract arguments from the emulated x86-64 registers and let the ARM64 compiler handle the native ABI.
### ~100+ Wrapped Libraries
Box64 wraps major system and application libraries:
- **Core:** libc, libm, libpthread, libdl
- **Graphics:** OpenGL, Vulkan, SDL1/2, GTK2/3
- **Audio:** OpenAL, PulseAudio, ALSA
- **System:** DBus, Xlib, XCB
- **Wine:** Wine compatibility libraries
**Key Files:**
- `src/wrapped/rebuild_wrappers.py` - Wrapper generator (1836 lines)
- `src/wrapped/*_private.h` - Function declarations
- `src/wrapped/generated/wrapper.c` - Generated wrappers
- `src/tools/bridge.c` - Bridge creation
- `src/tools/callback.c` - Callback execution
- `src/librarian/library.c` - Library loading
---
## ELF Loading
Box64 loads x86-64 ELF binaries through a comprehensive loading and relocation system.
### Loading Flow
```
LoadAndCheckElfHeader()
│
├─► ParseElfHeader64() - Validate ELF magic, class, machine (EM_X86_64)
│
├─► CalcLoadAddr() - Calculate memory requirements from PT_LOAD segments
│
├─► AllocLoadElfMemory() - Map segments into memory
│ ├─► For PIE: find47bitBlockElf() to locate address space
│ ├─► mmap() each PT_LOAD segment
│ └─► Zero-fill BSS (p_memsz > p_filesz)
│
├─► LoadNeededLibs() - Load DT_NEEDED dependencies
│ ├─► Expand RPATH/RUNPATH ($ORIGIN, ${PLATFORM})
│ └─► Recursively load shared libraries
│
├─► RelocateElfRELA() - Apply relocations
│ ├─► R_X86_64_RELATIVE: *p = delta + addend
│ ├─► R_X86_64_GLOB_DAT: *p = symbol_addr
│ ├─► R_X86_64_JUMP_SLOT: PLT entry (lazy binding)
│ └─► R_X86_64_COPY: memcpy(p, symbol, size)
│
└─► RunElfInit() - Execute constructors (.init, .init_array)
```
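As an illustration of the simplest relocation in the flow above, here is a sketch that applies only `R_X86_64_RELATIVE` entries. `apply_relative_rela` is a hypothetical helper (it is not `RelocateElfRELA()`); it uses the types and macros from the Linux `<elf.h>` header and skips every relocation type that needs a symbol lookup.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Apply the RELA entries that need no symbol resolution.
 * image: host pointer to where the ELF image is mapped
 * delta: actual load address minus link-time address
 * Returns the number of relocations applied. */
static size_t apply_relative_rela(uint8_t* image, uint64_t delta,
                                  const Elf64_Rela* rela, size_t count) {
    assert(image && rela);
    size_t applied = 0;
    for (size_t i = 0; i < count; ++i) {
        if (ELF64_R_TYPE(rela[i].r_info) != R_X86_64_RELATIVE)
            continue;   /* GLOB_DAT, JUMP_SLOT, COPY need a symbol resolver */
        /* *p = delta + addend */
        uint64_t value = delta + (uint64_t)rela[i].r_addend;
        memcpy(image + rela[i].r_offset, &value, sizeof value);
        ++applied;
    }
    return applied;
}
```

`memcpy` is used for the store so the sketch does not depend on the relocation target being 8-byte aligned.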
### TLS (Thread-Local Storage)
x86-64 uses the FS segment register for TLS. Box64 emulates this:
```c
// TLS structure at FS:0
Offset Purpose
0x00 TCB (self-pointer)
0x10 Self pointer
0x28 Stack canary
0x30 Pointer guard
```
The **`arch_prctl` syscall** sets or reads the FS/GS base addresses used for TLS access.
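The offsets in the table can be pinned down with a struct and compile-time checks. `guest_tcb_t`, `guest_tcb_init` and `fs_read64` are hypothetical names for this sketch, and the fields marked "unlisted" are placeholders for offsets the table does not describe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the start of the block that FS points to.
 * All fields are 64-bit because guest pointers are 64-bit on any host. */
typedef struct {
    uint64_t tcb;            /* 0x00: pointer to this block */
    uint64_t unlisted_08;    /* 0x08: not described in the table */
    uint64_t self;           /* 0x10: self pointer */
    uint64_t unlisted_18;    /* 0x18 */
    uint64_t unlisted_20;    /* 0x20 */
    uint64_t stack_canary;   /* 0x28: stack protector value */
    uint64_t pointer_guard;  /* 0x30 */
} guest_tcb_t;

_Static_assert(offsetof(guest_tcb_t, self) == 0x10, "self pointer offset");
_Static_assert(offsetof(guest_tcb_t, stack_canary) == 0x28, "canary offset");
_Static_assert(offsetof(guest_tcb_t, pointer_guard) == 0x30, "guard offset");

/* Initialise a TCB that will live at guest address `guest_addr`. */
static void guest_tcb_init(guest_tcb_t* t, uint64_t guest_addr, uint64_t canary) {
    assert(t);
    memset(t, 0, sizeof *t);
    t->tcb = guest_addr;
    t->self = guest_addr;
    t->stack_canary = canary;
}

/* Emulate a 64-bit load from FS:off, given the host view of the FS base. */
static uint64_t fs_read64(const void* fs_base, uint64_t off) {
    uint64_t v;
    memcpy(&v, (const uint8_t*)fs_base + off, sizeof v);
    return v;
}
```

A guest instruction that reads `%fs:0x28` then becomes a call like `fs_read64(fs_base, 0x28)` on the emulated FS base.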
### Entry Point Execution
```c
// Stack layout at entry (top to bottom):
[argc]
[argv[0]] ... [argv[n]] [NULL]
[env[0]] ... [env[n]] [NULL]
[auxv entries: AT_PAGESZ, AT_ENTRY, AT_PLATFORM="x86_64", AT_RANDOM, ...]
[AT_NULL]
```
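A minimal sketch of how such a stack image could be laid out, lowest address first (the `argc` word is the one the stack pointer refers to at entry). `build_entry_stack` is a hypothetical helper; it emits only `AT_PAGESZ`, `AT_ENTRY` and `AT_NULL`, and it stores host pointers directly as 64-bit values, which assumes guest and host share one address space.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Write the initial stack image into `out`.
 * Returns the number of 64-bit words written, or 0 if `cap` is too small. */
static size_t build_entry_stack(uint64_t* out, size_t cap,
                                int argc, char* const argv[],
                                char* const envp[],
                                uint64_t entry, uint64_t pagesz) {
    assert(out && argv && argc >= 0);
    size_t envc = 0;
    while (envp && envp[envc]) ++envc;

    /* argc + argv[] + NULL + envp[] + NULL + three auxv pairs */
    size_t need = 1 + (size_t)argc + 1 + envc + 1 + 2 * 3;
    if (need > cap) return 0;

    size_t n = 0;
    out[n++] = (uint64_t)argc;
    for (int i = 0; i < argc; ++i)
        out[n++] = (uint64_t)(uintptr_t)argv[i];
    out[n++] = 0;                                   /* argv terminator */
    for (size_t i = 0; i < envc; ++i)
        out[n++] = (uint64_t)(uintptr_t)envp[i];
    out[n++] = 0;                                   /* envp terminator */
    out[n++] = AT_PAGESZ; out[n++] = pagesz;        /* auxv: (type, value) */
    out[n++] = AT_ENTRY;  out[n++] = entry;
    out[n++] = AT_NULL;   out[n++] = 0;
    return n;
}
```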
**Key Files:** `src/elfs/elfloader.c`, `src/elfs/elfparser.c`, `src/emu/x64tls.c`
---
## Syscall Handling
Box64 intercepts x86-64 syscalls (opcode 0x0F 0x05) and translates them to native syscalls.
### Syscall Flow
```
SYSCALL instruction detected (x64run0f.c)
│
▼
x64Syscall_linux(emu)
│
├─► Fast path: syscallwrap[] table (128+ common syscalls)
│ └─► Direct native syscall with register mapping
│
└─► Custom handlers (switch statement)
├─► Memory: mmap, mprotect, munmap, mremap
├─► Process: clone, fork, execve
├─► Signals: rt_sigaction, sigaltstack
└─► Special: futex, arch_prctl
```
### Register Mapping (x86-64 Syscall ABI)
| Register | Purpose |
|----------|---------|
| RAX | Syscall number (in), return value (out) |
| RDI | Argument 1 |
| RSI | Argument 2 |
| RDX | Argument 3 |
| R10 | Argument 4 (not RCX) |
| R8 | Argument 5 |
| R9 | Argument 6 |
**Return:** Success = non-negative value, Error = `-errno` (a negative value in RAX)
### Special Syscalls
**clone (56)** - Thread/process creation:
- Creates new `x64emu_t` context for child
- Handles vfork detection by flag pattern
- Manages thread-local storage
**mmap (9)** - Memory mapping:
- Tracks protection for SMC detection
- Handles MAP_32BIT flag
**rt_sigaction (13)** - Signal handling:
- Translates x86-64 signal structures
- Manages signal handlers with restorer functions
**Key Files:** `src/emu/x64syscall.c`, `src/os/os_linux.c`
---
## Wine/Proton Support
Box64 includes special handling for Wine to run Windows applications.
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Wine/Proton Execution Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Windows Application (.exe) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Wine (x86-64 Linux binary) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ ntdll.so │ │ kernel32.so │ │ user32.so │ │ │
│ │ │ (Win API) │ │ (Win API) │ │ (Win API) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Box64 Emulator │ │
│ │ │ │
│ │ x86-64 Wine code ──► Dynarec ──► Native ARM64 execution │ │
│ │ │ │
│ │ Library calls ──► Wrapped native libs (SDL, OpenGL, Vulkan) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Native ARM64 System │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Mesa/GPU │ │ PulseAudio │ │ libc │ │ │
│ │ │ (native) │ │ (native) │ │ (native) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
```
### Wine Detection
Wine binaries are auto-detected: `wine64`, `wine64-preloader`, `wineserver`
- Sets `box64_wine = 1` flag
- Enables Wine-specific memory handling
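A name-based check like this might look roughly as follows. `is_wine_binary` is a hypothetical helper that compares only the basename against the three names listed above; the real detection may look at more than the file name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the program path names one of the Wine binaries listed above. */
static int is_wine_binary(const char* path) {
    static const char* const names[] = {
        "wine64", "wine64-preloader", "wineserver"
    };
    assert(path);
    const char* base = strrchr(path, '/');
    base = base ? base + 1 : path;   /* strip the directory part */
    for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i)
        if (strcmp(base, names[i]) == 0)
            return 1;
    return 0;
}
```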
### Memory Layout
```c
// Wine memory pre-reservation
{0x00010000, 0x00008000}, // Low memory
{0x00110000, 0x30000000}, // Main allocation region
{0x7f000000, 0x03000000}, // High memory
```
**Proton Address Hack:** Bridges loaded at `0x700000000000` to pass seccomp checks.
### WOW64 Support (32-bit Windows Apps)
| Mode | Description |
|------|-------------|
| Box86 + Box64 | Box86 for x86, Box64 for x86-64 |
| Wine WOW64 | Single Wine binary, cpu.dll handles 32-bit |
| Box64 `-DWOW64=ON` | Box64 compiles 32-bit WOW64 DLL |
### Game-Specific Settings
**Unity Games:**
```bash
MESA_GL_VERSION_OVERRIDE=3.2 BOX64_DYNAREC_STRONGMEM=1 box64 wine game.exe
```
**Configuration file** (`~/.box64rc`):
```ini
[game.exe]
BOX64_DYNAREC_BIGBLOCK=3
BOX64_DYNAREC_CALLRET=1
[*setup*]
BOX64_DYNAREC_SAFEFLAGS=1
```
---
## Key Source Files Reference
| File | Purpose |
|------|---------|
| `src/dynarec/dynarec.c` | Main execution loop, EmuRun, LinkNext |
| `src/dynarec/dynablock.c` | Block lifecycle, SMC detection |
| `src/dynarec/dynarec_native.c` | FillBlock64, multi-pass compilation |
| `src/dynarec/arm64/arm64_prolog.S` | Native entry |
| `src/dynarec/arm64/arm64_epilog.S` | Native exit |
| `src/custommem.c` | Memory management, jump table |
| `src/libtools/signals.c` | Signal handling |
| `src/libtools/threads.c` | Thread management |
| `src/elfs/elfloader.c` | ELF loading, relocation |
| `src/emu/x64syscall.c` | Syscall handling |
| `src/tools/bridge.c` | Library bridge mechanism |
| `src/steam.c` | Steam/Proton integration |
# Code cache management
> [2985](https://github.com/ptitSeb/box64/issues/2985)
I found that ~80.6% of blocks (12,478 blocks in total) are freed only at the final exit.

I've been experimenting with tracking dynarec block usage by adding an atomic `usage_count` field to `dynablock_t` and instrumenting each block to increment it at runtime. I also implemented a global linked list (protected by the existing `mutex_dyndump` lock) that tracks all live blocks, allowing statistics to be collected about the block lifecycle.
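The instrumentation described above might look roughly like this. It is a sketch with hypothetical names (`tracked_block_t`, `track_block`, `note_run`, `collect_stats`), a plain `pthread_mutex_t` stands in for `mutex_dyndump`, and in the real experiment the increment is emitted as native code at block entry rather than called as a C function.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical reduced block descriptor. */
typedef struct tracked_block_s {
    atomic_uint usage_count;        /* bumped each time the block runs */
    size_t      native_size;        /* bytes of generated code */
    struct tracked_block_s* next;   /* global list of live blocks */
} tracked_block_t;

static tracked_block_t* live_blocks = NULL;
static pthread_mutex_t  live_lock = PTHREAD_MUTEX_INITIALIZER;

/* Register a freshly created block in the global list. */
static void track_block(tracked_block_t* b, size_t size) {
    assert(b);
    atomic_init(&b->usage_count, 0);
    b->native_size = size;
    pthread_mutex_lock(&live_lock);
    b->next = live_blocks;
    live_blocks = b;
    pthread_mutex_unlock(&live_lock);
}

/* Record one execution of the block. */
static void note_run(tracked_block_t* b) {
    atomic_fetch_add_explicit(&b->usage_count, 1, memory_order_relaxed);
}

typedef struct { size_t total, cold, total_bytes, cold_bytes; } block_stats_t;

/* Count blocks executed at most once, and the code size they occupy. */
static block_stats_t collect_stats(void) {
    block_stats_t s = {0, 0, 0, 0};
    pthread_mutex_lock(&live_lock);
    for (tracked_block_t* b = live_blocks; b; b = b->next) {
        s.total++;
        s.total_bytes += b->native_size;
        if (atomic_load_explicit(&b->usage_count, memory_order_relaxed) <= 1) {
            s.cold++;
            s.cold_bytes += b->native_size;
        }
    }
    pthread_mutex_unlock(&live_lock);
    return s;
}
```

Relaxed ordering is enough here because the counter is only used for statistics and nothing synchronises on its value.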
The testing shows that **most dynamic blocks are executed very infrequently**:
**Simple program (qsort):**
```
Total: 72 blocks (15.4 KB)
Executed ≤1 time: 22 blocks (30.6%) → 3.4 KB (22.1%)
```
**Complex game ([Petsitting](https://goose-nest.itch.io/petsitting)):**
```
Total: 399,513 blocks (49.24 MB)
Executed ≤1 time: 323,780 blocks (81.0%) → 32.93 MB (66.9%)
```
#### First attempt: Hot method (LFU)
> [81b080](https://github.com/ptitSeb/box64/commit/81b080eb5d5dd2a23e5b952d11efea24bfe4c2fa)
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ PURGE FLOW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ AllocDynarecMap() needs memory but no free space available │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ if(BOX64ENV(dynarec_purge)) │ │
│ │ PurgeDynarecMap(list, size); │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ For each memory chunk (blocklist_t) in mmaplist: │
│ ├─► Skip if bl->nopurge == 1 (optimization: no purgeable blocks) │
│ │ │
│ ├─► For each allocated block in chunk: │
│ │ │ │
│ │ ▼ │
│ │ ┌─────────────────────────────────────────────────────────┐ │
│ │ │ hot = atomic_load(db->hot) │ │
│ │ │ │ │
│ │ │ if (hot == 1 && db->done) { │ │
│ │ │ in_used = atomic_load(db->in_used) │ │
│ │ │ if (in_used == 0) { │ │
│ │ │ FreeDynablock(db) ← PURGE THIS BLOCK! │ │
│ │ │ if (freed enough) return 1; │ │
│ │ │ } │ │
│ │ │ } │ │
│ │ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ └─► Set bl->nopurge = 1 if no purgeable blocks remain │
│ │
│ Return: 1 if freed a block, 0 otherwise │
└─────────────────────────────────────────────────────────────────────────────┘
```
1. in_used counter - Prevents purging blocks that are actively executing
2. Atomic operations - Thread-safe counter updates
3. nopurge flag - Optimization to skip chunks with no purgeable blocks
4. hot=0 blocks - Never touched (reserved for system blocks)
It uses a simple "hot" approach: only blocks that have `hot == 1` (and are not in use) get purged.
The amount of memory freed, however, is not that much.
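The per-block decision in the purge flow can be sketched as below. `hot_block_t`, `hot_block_init` and `purge_cold` are hypothetical reductions: a `freed` flag stands in for `FreeDynablock()`, and the chunk-level `nopurge` shortcut is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical reduced block descriptor. */
typedef struct {
    atomic_int hot;      /* 0 = never touched, 1 = cold, >= 2 = hot */
    atomic_int in_used;  /* non-zero while a thread is executing the block */
    int        done;     /* block fully built */
    int        freed;    /* stands in for FreeDynablock() */
    size_t     size;     /* bytes of native code */
} hot_block_t;

static void hot_block_init(hot_block_t* db, int hot, int in_used,
                           int done, size_t size) {
    assert(db);
    atomic_init(&db->hot, hot);
    atomic_init(&db->in_used, in_used);
    db->done = done;
    db->freed = 0;
    db->size = size;
}

/* Free cold, idle blocks until at least `need` bytes are recovered.
 * Returns 1 if enough was freed, 0 otherwise. */
static int purge_cold(hot_block_t* blocks, size_t n, size_t need) {
    assert(blocks);
    size_t got = 0;
    for (size_t i = 0; i < n; ++i) {
        hot_block_t* db = &blocks[i];
        if (db->freed) continue;
        if (atomic_load(&db->hot) != 1 || !db->done) continue;  /* SKIP / TOO_HOT */
        if (atomic_load(&db->in_used) != 0) continue;           /* BLOCKED */
        db->freed = 1;
        got += db->size;
        if (got >= need) return 1;
    }
    return 0;
}
```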
```
Unique x64 addresses: 396,666
SUCCESS (cold, hot=1): 326,313 (82.26%)
TOO_HOT (hot ≥ 2): 70,349 (17.74%)
BLOCKED / SKIP: 4 ( 0.00%)
```
```
Total Attempts: 16,341,787
SUCCESS (purged): 507,920 ( 3.1%)
BLOCKED (in use): 47 ( 0.0%)
SKIP (hot != 1): 6,323 ( 0.0%)
TOO_HOT (hot >= 2): 15,827,497 ( 96.9%)
```
#### Second attempt: Tick method (LRU)
Instead of a hot counter, this attempt uses a "tick" counter. The tick is a global variable that is incremented each time a dynablock is created. A pseudo "age" can then be computed for each block, and only older blocks get purged (blocks in use stay young because the dynablock's tick is updated each time it runs).
An alternative would be to use a hardware counter so that the tick is closer to a real timer, but that would need to be calibrated first.
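A minimal sketch of the tick/age bookkeeping, assuming a single global counter and a per-block timestamp. All names (`global_tick`, `tick_block_t`, `tick_block_created`, `tick_block_run`, `tick_block_age`, `tick_block_purgeable`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Global tick, advanced once per created block. */
static atomic_uint_fast64_t global_tick;

typedef struct {
    atomic_uint_fast64_t tick;   /* global tick at creation or last run */
} tick_block_t;

/* Creating a block advances the global tick and stamps the block. */
static void tick_block_created(tick_block_t* b) {
    assert(b);
    uint_fast64_t now = atomic_fetch_add(&global_tick, 1) + 1;
    atomic_store(&b->tick, now);
}

/* Running a block refreshes its stamp, which keeps used blocks young. */
static void tick_block_run(tick_block_t* b) {
    atomic_store(&b->tick, atomic_load(&global_tick));
}

/* Age = number of blocks created since this one was created or last run. */
static uint64_t tick_block_age(tick_block_t* b) {
    return (uint64_t)(atomic_load(&global_tick) - atomic_load(&b->tick));
}

/* Only blocks older than the threshold are candidates for purging. */
static int tick_block_purgeable(tick_block_t* b, uint64_t threshold) {
    return tick_block_age(b) > threshold;
}
```

With this scheme "age" is measured in block creations rather than wall-clock time, so a process that compiles little ages its blocks slowly.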
My previous idea was to find a function r(t) that fits the following form:

where r(t) represents the percentage of blocks we need to free at each tick t. I assumed that memory usage would grow like a logarithmic function, so the current tick could be used to decide the threshold and the total free rate would be 80%.
However, it turned out to be a complete mess (I tried sinh, logistic, exponential, linear and quadratic forms). The skip rate became too high (75%~90%), which made the purge system unusable.
So I started observing the behavior across different processes and found that dynamic block usage varies significantly from process to process. This might explain why a global threshold is far less effective than I expected. Below is the test case for steamwebhelper with a fixed threshold of 1024:

---
I gathered stats showing the recreation rate against both tick and age (Box64 + Wine + Steam).
I used a fixed threshold of 256, which is intentionally small.

Currently, a purge is triggered whenever `AllocDynarecMap()` cannot find space, but according to the statistics most purge attempts fail. (SKIP means the block is too young; counts are per purge event, not per x64 address.)
| Event | Count | Percent |
|---------|-------|---------|
| SUCCESS | 4.8M | 3.7% |
| BLOCKED | 99.4M | 76.1% |
| SKIP | 26.3M | 20.2% |


