陳孟鴻
Just-in-time (JIT) compilation is a technique that translates or compiles a program during execution (at run time). The input to the translation may include, but is not limited to, the source code, intermediate representation (IR), or bytecode of the program. The JIT compiler collects and analyzes profiling information at run time and detects the hotspots of the program. A hotspot is a frequently executed region of a program, such as a loop body or a recursive function. Once a hotspot is located, the JIT compiler compiles it and emits machine code for the host platform. The emitted machine code can be reused, which reduces the number of compilations, each of which may carry a large overhead.
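The profiling-then-compile flow described above can be sketched roughly as follows. This is only an illustration in Python, not the project's implementation: the names TinyJit, HOT_THRESHOLD, compile_block, and run_block are hypothetical, and a Python closure stands in for the emitted host machine code.

```python
from collections import defaultdict

HOT_THRESHOLD = 3  # hypothetical: executions before a block counts as hot


class TinyJit:
    def __init__(self):
        self.exec_count = defaultdict(int)  # profiling data: runs per block
        self.code_cache = {}                # block address -> "compiled" callable
        self.compilations = 0

    def compile_block(self, addr, block):
        # Stand-in for machine-code emission: a closure plays the role
        # of the emitted host code.
        self.compilations += 1
        return lambda state: block(state)

    def run_block(self, addr, block, state):
        compiled = self.code_cache.get(addr)
        if compiled is not None:
            return compiled(state)          # reuse previously emitted code
        self.exec_count[addr] += 1
        if self.exec_count[addr] >= HOT_THRESHOLD:
            # The block just became a hotspot: compile it once and cache it.
            self.code_cache[addr] = self.compile_block(addr, block)
        return block(state)                 # this run is still interpreted


def loop_body(state):
    # A toy "hot" region: accumulate the loop counter.
    state["acc"] += state["i"]
    state["i"] += 1
    return state


jit = TinyJit()
state = {"acc": 0, "i": 0}
for _ in range(10):
    jit.run_block(0x1000, loop_body, state)
```

In this sketch the block is compiled exactly once, on its third execution, and the remaining iterations reuse the cached result instead of compiling again.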
The other techniques commonly used in program simulation are interpretation and ahead-of-time (AOT) compilation. The former typically reads one instruction of the program at a time and performs the corresponding operation on the simulated virtual CPU. The latter compiles the instructions of a program before it actually runs, as most compiled languages do, except that the source is the guest binary (machine code).
Compared with interpretation or AOT compilation, JIT compilation has access to more run-time information, which it can use to further optimize the local execution path and improve performance.
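To make the interpretation approach concrete, here is a minimal fetch-decode-execute loop over a toy instruction list. The opcodes borrow RISC-V mnemonics for readability, but the tuple encoding, the interpret function, and the register dictionary are assumptions made for this sketch, not the simulator's real data structures.

```python
def interpret(program, regs):
    """Execute a toy program one instruction at a time."""
    pc = 0
    while pc < len(program):
        op, *args = program[pc]          # fetch and decode one instruction
        if op == "addi":                 # rd = rs + imm
            rd, rs, imm = args
            regs[rd] = regs[rs] + imm
        elif op == "add":                # rd = rs1 + rs2
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        elif op == "bne":                # branch to target if rs1 != rs2
            rs1, rs2, target = args
            if regs[rs1] != regs[rs2]:
                pc = target
                continue
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return regs


# Sum 0..4 into a0 with a small loop (indices are positions in the list).
program = [
    ("addi", "t0", "zero", 0),   # 0: t0 = 0 (loop counter)
    ("addi", "t1", "zero", 5),   # 1: t1 = 5 (loop bound)
    ("addi", "a0", "zero", 0),   # 2: a0 = 0 (accumulator)
    ("add", "a0", "a0", "t0"),   # 3: a0 += t0
    ("addi", "t0", "t0", 1),     # 4: t0 += 1
    ("bne", "t0", "t1", 3),      # 5: loop back to 3 while t0 != t1
]
regs = interpret(program, {"zero": 0, "t0": 0, "t1": 0, "a0": 0})
```

Every iteration of the loop pays the decode-and-dispatch cost again, which is the overhead that compiling the loop body once is meant to remove.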
The original implementation used a least-frequently-used (LFU) cache replacement policy in userspace program simulation. The advantage of this policy is that we can read the hit count of each cached block and decide whether to activate the JIT compiler. However, if the cache is not large enough, replacement occurs frequently and hurts overall performance.
For example, suppose two cache entries, A and B, both with hit count = 1, are inserted into a full cache.
A replaces the least frequently used entry in the cache.
Then B replaces A, because A has the lowest hit count (== 1).
When we reach the region of A again, B is replaced for the same reason.
This repeated replacement hurts performance.
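The scenario above can be reproduced with a small LFU cache model. The class and helper names (LFUCache, lookup_or_translate) and the capacity of two are assumptions for illustration; the long-lived entry X stands for any block with a high hit count that already sits in the full cache.

```python
class LFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}     # key -> [value, hit_count]
        self.evictions = 0

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry[1] += 1         # count the hit
        return entry[0]

    def put(self, key, value):
        if key in self.entries:
            self.entries[key][0] = value
            return
        if len(self.entries) >= self.capacity:
            # Evict the entry with the lowest hit count.
            victim = min(self.entries, key=lambda k: self.entries[k][1])
            del self.entries[victim]
            self.evictions += 1
        self.entries[key] = [value, 1]   # a new entry starts at hit count 1


def lookup_or_translate(cache, key):
    """Return True on a cache hit; on a miss, 'translate' and insert."""
    if cache.get(key) is not None:
        return True
    cache.put(key, f"translated:{key}")
    return False


cache = LFUCache(2)
cache.put("X", "translated:X")
cache.put("Y", "translated:Y")
for _ in range(5):
    cache.get("X")            # X becomes a long-lived, frequently hit entry
for _ in range(3):
    cache.get("Y")

# The program now alternates between regions A and B.
hits = [lookup_or_translate(cache, k) for k in ["A", "B", "A", "B", "A", "B"]]
```

In this model every access to A or B is a miss: the newcomer always has the lowest hit count, so A and B keep evicting each other while X stays resident.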
The least-recently-used (LRU) cache replacement policy fits our use case here.
A recently reached region is more likely to be executed again.
However, some functions, such as malloc(), free(), memset(), and printf(), may be called frequently over the whole simulation, with no guarantee on how far apart two consecutive calls are.
This is not a problem, because a frequently called region is captured by the profiler as a potential hotspot, which then triggers JIT compilation.
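A minimal sketch of an LRU cache that still keeps per-entry hit counts for hotspot detection is shown below. It is built on Python's collections.OrderedDict; the names LRUCache, lookup_or_translate, and HOT_THRESHOLD, as well as the capacity and threshold values, are hypothetical choices for this example.

```python
from collections import OrderedDict

HOT_THRESHOLD = 3  # hypothetical hit count that marks a block as hot


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # oldest first; key -> [value, hit_count]
        self.evictions = 0

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry[1] += 1                  # keep the hit count for the profiler
        self.entries.move_to_end(key)  # mark as most recently used
        return entry[0]

    def put(self, key, value):
        if key in self.entries:
            self.entries[key][0] = value
            self.entries.move_to_end(key)
            return
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used
            self.evictions += 1
        self.entries[key] = [value, 1]


def lookup_or_translate(cache, key):
    """Return True on a cache hit; on a miss, 'translate' and insert."""
    if cache.get(key) is not None:
        return True
    cache.put(key, f"translated:{key}")
    return False


cache = LRUCache(2)
cache.put("X", "translated:X")
cache.put("Y", "translated:Y")

# The program alternates between regions A and B.
hits = [lookup_or_translate(cache, k) for k in ["A", "B", "A", "B", "A", "B"]]

# Blocks whose hit count reached the threshold are JIT candidates.
hot = [k for k, (_, count) in cache.entries.items() if count >= HOT_THRESHOLD]
```

Here A and B each miss once, then stay in the cache and accumulate hits, so both reach the threshold and would be handed to the JIT compiler in this model.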
The emitted machine code serves a purpose similar to that of the cached blocks, namely reducing the overhead of interpreting guest instructions, but the former is host machine code whereas the latter is translated IR.
Related materials:
In system simulation, the MMU raises a page fault exception on an invalid or illegal memory access while translating a virtual address (VA) to a physical address (PA). Causes of the fault include, but are not limited to, access to an invalid or unallocated physical address, or mismatched permissions (R/W/X). When running with the interpreter, execution can simply jump to the exception handler by calling the handler function. Under JIT compilation, however, the minimal unit of translation and execution is the basic block, so the page fault would not be raised at the right moment.
To address this problem, we add an execution path that acts as a page fault checker and decides whether we need to jump to the exception handler.
This extra path redirects the current control flow to the page fault checker.
The checker stores the guest state (RISC-V registers), which has been mapped to physical registers (x64 / Aarch64 registers), back to the stack, where the guest state is kept in interpreter mode.
When needed, it then uses a jump instruction to redirect the control flow to the function pointer of the exception handler, which is compiled by GCC / LLVM and is more efficient than one written from scratch.
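The control flow of the page fault checker can be modeled at a high level as follows. This is a Python simulation of the idea only: the real JIT emits host machine code, and every name here (GuestState, translate, page_fault_handler, compiled_block, the page table layout) is an assumption made for the sketch. Local variables play the role of host registers, and GuestState.regs plays the role of the in-memory guest state used in interpreter mode.

```python
PAGE_SIZE = 4096
R, W = 1, 2  # permission bits (hypothetical encoding)


class GuestState:
    def __init__(self):
        self.regs = {"a0": 0, "a1": 0, "pc": 0}  # in-memory guest state
        self.fault_log = []


def translate(page_table, vaddr, perm):
    """Return the physical address, or None when the access must fault."""
    entry = page_table.get(vaddr // PAGE_SIZE)   # entry = (phys_page, perms)
    if entry is None or not (entry[1] & perm):
        return None
    return entry[0] * PAGE_SIZE + vaddr % PAGE_SIZE


def page_fault_handler(state, vaddr):
    # Stand-in for the precompiled handler reached through a function pointer.
    state.fault_log.append((state.regs["pc"], vaddr))


def compiled_block(state, page_table, memory):
    # Guest registers are held in "host registers" (locals) inside the block.
    a0, a1, pc = state.regs["a0"], state.regs["a1"], state.regs["pc"]

    a1 = a1 + 1                              # guest: addi a1, a1, 1
    pc += 4

    paddr = translate(page_table, a0, R)     # guest: lw a1, 0(a0)
    if paddr is None:
        # Page fault checker path: write the guest registers back, then
        # hand control to the handler before the load takes effect.
        state.regs.update(a0=a0, a1=a1, pc=pc)
        page_fault_handler(state, a0)
        return False                         # leave the block early
    a1 = memory[paddr]
    pc += 4

    state.regs.update(a0=a0, a1=a1, pc=pc)   # normal block exit
    return True


page_table = {0: (2, R | W)}                 # virtual page 0 -> physical page 2
memory = {2 * PAGE_SIZE + 8: 42}

ok = GuestState()
ok.regs["a0"] = 8                            # mapped address
completed = compiled_block(ok, page_table, memory)

bad = GuestState()
bad.regs["a0"] = 5 * PAGE_SIZE               # unmapped page
faulted = compiled_block(bad, page_table, memory)
```

In the faulting run, the handler observes the guest state exactly as it was at the load instruction (pc pointing at the load, a1 already incremented), which is the property the extra path is meant to preserve.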
Before this implementation, we had to mark memory operations, such as lw and sw, as well as instruction fetches, as non-translatable instructions and switch execution to interpreter mode.
This mode switching hurts the efficiency and the overall performance of JIT compilation.
Related materials: