owned this note
owned this note
Published
Linked with GitHub
# lua: system-wide profiler
* **Status**: In progress
* **Start date**: 02-07-2021
* **Authors**: Mikhail Shishatskiy @Shishqa m.shishatskiy@tarantool.org, Maxim Kokryashkin @fckxorg m.kokryashkin@tarantool.org
* **Issues**: [#781](https://github.com/tarantool/tarantool/issues/781), [#4001](https://github.com/tarantool/tarantool/issues/4001)
## Summary
The document describes the platform profiler for LuaJIT. It is needed to obtain a complete view of platform performance. Existing LuaJIT profiler only able to give you information about virtual machine states and guest stack. Hence, the document proposes to provide a new sampling profiler, so it will be able to gather stack traces from both C and Lua.
## Background and motivation
Currently, available options for profiling LuaJIT are not fine enough to get an understanding of performance. For example, perf only able to show host stack, so all the Lua calls are seen as single pcall. Oppositely, jit.p module provided with LuaJIT is not able to give any information about the host stack.
To get a detailed perspective of the platform performance, a more advanced profiler is needed. The desired profiler must be able to capture both guest and host stacks simultaneously, along with virtual machine states.
Assume we have the following chunks of code:
```c=
// file c_payload.c
// fibonnacci number
double c_payload(double n) {
if (n <= 1) {
return n;
}
return c_payload(n - 1) + c_payload(n - 2);
}
```
```lua=
-- file lua_payload.lua
local jit = require("jit")
local ffi = require("ffi")
local os = require("os")
------------------------------------------------
-- fibonacci number
function lua_payload(n)
if n <= 1 then
return n
end
return lua_payload(n - 1) + lua_payload(n - 2)
end
-- mind project-specific package location --
local c = ffi.load("c_payload")
ffi.cdef[[
double c_payload(double n);
]]
------------------------------------------------
local start_time = os.clock()
local benchmark_time = 10
jit.off()
jit.flush()
local i = 0
while os.clock() - start_time < benchmark_time do
if i % 2 == 0 then
c.c_payload(20)
else
lua_payload(20)
end
i = i + 1
end
```
The `c_payload.c` code compiled into a shared library and loaded by LuaJIT via the FFI.
To get the difference, you can take a look at the flamegraphs (learn more about [flamegraphs](https://www.brendangregg.com/flamegraphs.html)) for this code generated by perf, jit.p, and PoC for the proposed profiler below.
### perf
![perf](https://i.imgur.com/DlKbFpo.png)
### jit.p
![jit.p](https://i.imgur.com/sDZZDZx.png)
### sysprof
![sysprof](https://i.imgur.com/Yf80MDE.png)
As you can see, the sysprof's output looks like merged flamegraphs produced by perf and jit.p.
## Detailed design
The proposed approach is to reuse the backend implemented in the existing `jit.p` profiler and add host stack capturing on top of it. It is important, that we create a different API for the new profiler, so the user can choose, which one is preferable for a particular case. But only one profiler can be used at a time.
### Prerequisites
Firstly, it is necessary to make timer machinery in `lj_profile.c` self-reliant by introducing `lj_profile_timer` structure and start/stop interface for it. This makes the timer useful for not only for jit.p but for sysprof too.
Another thing to do is to fix an invalid vmstate during stack resize. The issue itself is similar to the `tarantool/tarantool#5842`.
### Host stack
The default sampling profiler implementation in LuaJIT, which can be seen [here](https://github.com/tarantool/luajit/blob/tarantool/src/lj_profile.c), follows this flow:
```
luaJIT_profile_start --> profile_timer_start
...
|lock VM state
[signal emmited] --> profile_signal_trigger: __|prepare args for a callback
|schedule callback execution
|unlock VM state
...
luaJIT_profile_stop --> profile_timer_stop
```
Callback, which is scheduled by `profile_signal_trigger` can be used to dump needed information, including VM stack. However, the mechanism of VM hooks is used for a callback, so it executes later than the signal was emitted. That is why even though the guest stack is still the same by the time when a callback is executed, the host stack has been changed already, so the final stack dump can not be considered valid.
Hence, to get a valid final snapshot of both stacks, a dump should be done right at the signal, like [there](https://github.com/Shishqa/luajit/blob/c0da971640512696f5c166e8f2dc1ed982a8f451/src/profile/sysprof.c#L63).
The host stack can be dumped with `backtrace(3)`. However, for complex stack layouts (e.g. fiber stacks in Tarantool), it may not be enough, so one can change default backtracing function to another.
### VM stack
Depends on the VM state when signal occured, we consider guest stack valid or not. For example, we can not dump Lua stack, when VM is in the INTERP state because stack pointers can be inconsistent.
We are using an implementation similar to the one, which is used in [lj_debug_dumpstack](https://github.com/tarantool/luajit/blob/af889e4608e6eca495dd85e6161d8bcd7d3628e6/src/lj_debug.c#L580) to dump guest stack. But there is a problem with that because sometimes the VM stack can be invalid, thanks to this [bug](https://github.com/tarantool/luajit/blob/af889e4608e6eca495dd85e6161d8bcd7d3628e6/src/vm_x64.dasc#L4594). As you can see down the link, VM state changes to LFUNC, and after that stack reallocation takes place. So if our signal takes place in between, we will get a segmentation fault. Anyway, that issue is easy to fix, so this approach is suitable.
### Traces
If JIT is enabled while the profiler is running, traces are recorded as well. For now, we report only a traceno and info about a proto, where trace recording started.
### Event streaming
The event stream has the following format:
```
Event stream format:
stream := symtab sysprof
symtab := see symtab description
sysprof := prologue sample* epilogue
prologue := 'l' 'j' 'p' version reserved
version := <BYTE>
reserved := <BYTE> <BYTE> <BYTE>
sample := sample-guest | sample-host | sample-trace
sample-guest := sample-header stack-lua stack-host
sample-host := sample-header stack-host
sample-trace := sample-header traceno sym-addr line-no
sample-header := <BYTE>
stack-lua := frame-lua* frame-lua-last
stack-host := frame-host* frame-host-last
frame-lua := frame-lfunc | frame-cfunc | frame-ffunc
frame-lfunc := frame-header sym-addr line-no
frame-cfunc := frame-header exec-addr
frame-ffunc := frame-header ffid
frame-lua-last := frame-header
frame-header := <BYTE>
frame-host := exec-addr
frame-host-last := <ULEB128>
line-no := <ULEB128>
traceno := <ULEB128>
ffid := <ULEB128>
sym-addr := <ULEB128>
exec-addr := <ULEB128>
epilogue := sample-header
<BYTE> : A single byte (no surprises here)
<ULEB128>: Unsigned integer represented in ULEB128 encoding
(Order of bits below is hi -> lo)
version: [VVVVVVVV]
* VVVVVVVV: Byte interpreted as a plain integer version number
sample-header: [FUUUUEEE]
* EEE : 3 bits for representing vmstate (LJ_VMST_*)
* UUUU : 4 unused bits
* F : 0 for regular samples, 1 for epilogue's Final header
(if F is set to 1, all other bits are currently ignored)
frame-header: [FUUUUUEE]
* EE : 2 bits for representing frame type (FRAME_*)
* UUUUU : 5 unused bits
* F : 0 for regular frames, 1 for final frame
(if F is set to 1, all other bits are currently ignored)
frame-host-last = NULL
```
### Symtab
It is a heavy task to dump names of functions every time, so instead, we will dump a symbol table in the beginning. Later on, it will be sufficient to dump only a function's address. However, some functions can be loaded and unloaded several times, and their addresses will be different each time. Hence, we will update the symbol table accordingly. To carry out the symtab update, we will drop in new symtab record into the main event stream.
A symbol table looks like this (the same format as symtab in memprof):
<!--
```
1 byte 8 bytes 8 bytes
_______________________________________________________________
| type | address of function | function name | first line number|
---------------------------------------------------------------
```
-->
```
symtab := prologue sym*
prologue := 'l' 'j' 's' version reserved
version := <BYTE>
reserved := <BYTE> <BYTE> <BYTE>
sym := sym-lua | sym-final
sym-lua := sym-header sym-addr sym-chunk sym-line
sym-header := <BYTE>
sym-addr := <ULEB128>
sym-chunk := string
sym-line := <ULEB128>
sym-final := sym-header
string := string-len string-payload
string-len := <ULEB128>
string-payload := <BYTE> {string-len}
<BYTE> : A single byte (no surprises here)
<ULEB128>: Unsigned integer represented in ULEB128 encoding
(Order of bits below is hi -> lo)
version: [VVVVVVVV]
* VVVVVVVV: Byte interpreted as a plain numeric version number
sym-header: [FUUUUUTT]
* TT : 2 bits for representing symbol type
* UUUUU : 5 unused bits
* F : 1 bit marking the end of the symtab (final symbol)
```
## API
For the system profiler both C API and Lua API are implemented. The C API for the sysprof is described in the <lmisclib.h> header.
The C API looks like this:
```c=
/* Configure the proflier. */
LUAMISC_API int luaM_sysprof_configure(const struct luam_sysprof_config *config);
/* Start profiling. */
LUAMISC_API int luaM_sysprof_start(lua_State *L,
const struct luam_sysprof_options *opt);
/* Stop profiling. */
LUAMISC_API int luaM_sysprof_stop(lua_State *L);
/* Get counters for each VM state. */
LUAMISC_API int luaM_sysprof_report(struct luam_sysprof_counters *counters);
```
The profiler can work in three modes:
```c=
enum luam_sysprof_mode {
/*
** DEFAULT mode collects only data for luam_sysprof_counters, which is stored
** in memory and can be collected with luaM_sysprof_report after profiler
** stops.
*/
LUAM_SYSPROF_DEFAULT,
/*
** LEAF mode = DEFAULT + streams samples with only top frames of host and
** guests stacks in format described above.
*/
LUAM_SYSPROF_LEAF,
/*
** CALLGRAPH mode = DEFAULT + streams samples with full callchains of host
** and guest stacks in format described above.
*/
LUAM_SYSPROF_CALLGRAPH
};
```
Available profiler options are stored in the `luam_sysprof_options` structure.
```c=
/* Profiler options */
struct luam_sysprof_options {
/* Profiling mode */
enum luam_sysprof_mode mode;
/* Sampling interval in msec */
uint64_t interval;
/* Custom buffer to write data. */
uint8_t *buf;
/* The buffer's size. */
size_t len;
/* Context for the profile writer and final callback. */
void *ctx;
};
```
As you can see, writer backend is customizable, and our implementation does not lock one at the C-API level.
Backtracing function, frame writer and on-stop callback is a one more degree of freedom:
```c=
struct luam_sysprof_config {
/*
** Writer function for profile events.
** Should return amount of written bytes on success or zero in case of error.
** Setting *data to NULL means end of profiling.
** For details see <lj_wbuf.h>.
*/
size_t (*writer)(const void **data, size_t len, void *ctx);
/*
** Callback on profiler stopping. Required for correctly cleaning
** at VM finalization when profiler is still running.
** Returns zero on success.
*/
int (*on_stop)(void *ctx, uint8_t *buf);
/*
** Backtracing function for the host stack. Should call `frame_writer` on
** each frame in the stack in the order from the stack top to the stack
** bottom. The `frame_writer` function is implemented inside the sysprof
** and will be passed to the `backtracer` function. If `frame_writer` returns
** NULL, backtracing should be stopped. If `frame_writer` returns not NULL,
** the backtracing should be continued if there are frames left.
*/
void (*backtracer)(void *(*frame_writer)(int frame_no, void *addr));
};
```
One can replace backtracing function by passing properly filled `luam_sysprof_config` structure to `luaM_sysprof_configure` function.
To start the profiler one needs to fill the `luam_sysprof_options` with desired values and pass it to the `luaM_sysprof_start` function. In case of successful start, it returns `SYSPROF_SUCCESS`.
To stop the profiler, one should call `luaM_sysprof_stop` function, which will return `SYSPROF_SUCCESS` in case of successful profiling finalization.
Starting profiler from Lua is simple:
```lua
local started, err, errno = misc.sysprof.start{
mode = profiling_mode,
interval = sampling_interval,
path = fname,
}
```
where `profiling_mode` is one of the following:
```
"D" - DEFAULT mode
"L" - LEAF mode
"C" - CALLGRAPH mode
```
In the `DEFAULT` mode, only the counters for each state will be collected. In the `LEAF` mode the sysprof provides you with information about a top stack frame for each sample. And in the `CALLCHAIN` mode, the profiler gives you a complete call stack for each sample.
Take a look at the examples below to understand the difference.
| ![](https://i.imgur.com/4J9ItbN.png) |
|:---:|
|`DEFAULT` mode |
| ![Leaf mode](https://i.imgur.com/ALIVNn1.png) |
|:--:|
|`LEAF` mode|
| ![](https://i.imgur.com/X0nL83r.png) |
| :--: |
| `CALLCHAIN` mode |
`sampling_interval` is the sampling period in milliseconds. `fname` is name of the file where profile events are written. Writer for this function perform `fwrite()` for each call retrying in case of EINTR. When the profiling is stopped `fclose()` is called. The profiler's function's contract is similar to standard io.* interfaces. If it is impossible to open a file for writing or profiler fails to start, nil is returned on failure (plus an error message as a second result and a system-dependent error code as a third result). Otherwise, returns true value.
Stopping profiler from Lua is simple too:
```
local stopped, err, errno = misc.sysprof.stop()
```
If there is any error occurred at profiling stopping (an error when file descriptor was closed) `sysprof.stop()` returns `nil` (plus an error message as a second result and a system-dependent error code as a third result). Returns true otherwise.
If you want to build LuaJIT without the Lua and platform profiler, you should build it with `-DLUAJIT_DISABLE_SYSPROF=ON`. If it is disabled `misc.sysprof.start()` and `misc.sysprof.stop()` always return false.
### Retrieving the results
Profiler's binary output can be parsed with the `luajit-parse-sysprof` utility:
```
luajit-parse-sysprof - parser of the profile collected
with LuaJIT's sysprof.
SYNOPSIS
luajit-parse-sysprof [options] sysprof.bin
Supported options are:
--help Show this help and exit
--split Split callchains by vmstate
```
Or, alternatively, you can do it like this:
```
tarantool -e 'require("sysprof")(arg)' - sysprof_output.bin
```
It produces another file, which can be read by `flamegraph.pl` utility to generate a flamegraph image like this
![sysprof](https://i.imgur.com/Yf80MDE.png)
## Rationale and alternatives
Another way to implement such a thing is to make perf to see guest stack. To do so, we need to map virtual machine symbols (and that functionality is present in LuaJIT ([link](https://github.com/tarantool/luajit/blob/d4e12d7ac28e3bc857d30971dd77deec66a67297/src/lj_trace.c#L96))) and do something so perf could unwind the virtual machine stack.
Stack unwinding from outside of the LuaJIT is the problem we didn’t manage to solve for today. There are different approaches to do this:
- *Save rsp register value to rbp and preserve rbp.* However, LuaJIT uses rbp as a general-purpose register, and it is hard not to break everything trying to use it only for stack frames.
- *Coordinated work of `jit.p` and perf.* This approach requires modifying perf the way it will send LuaJIT suspension signal, and after getting info about the host stack, it will receive information about the guest stack and join them. This solution is quite possible, but modified perf doesn't seem like a production-ready solution.
- *Dwarf unwinding*