lua: system-wide profiler

Status: In progress
Start date: 02-07-2021
Authors: Mikhail Shishatskiy @Shishqa m.shishatskiy@tarantool.org, Maxim Kokryashkin @fckxorg m.kokryashkin@tarantool.org
Issues: #781, #4001

Summary

The document describes the platform profiler for LuaJIT. It is needed to obtain a complete view of platform performance. Existing LuaJIT profiler only able to give you information about virtual machine states and guest stack. Hence, the document proposes to provide a new sampling profiler, so it will be able to gather stack traces from both C and Lua.

Background and motivation

Currently, available options for profiling LuaJIT are not fine enough to get an understanding of performance. For example, perf only able to show host stack, so all the Lua calls are seen as single pcall. Oppositely, jit.p module provided with LuaJIT is not able to give any information about the host stack.

To get a detailed perspective of the platform performance, a more advanced profiler is needed. The desired profiler must be able to capture both guest and host stacks simultaneously, along with virtual machine states.

Assume we have the following chunks of code:









// file c_payload.c

// fibonnacci number
double c_payload(double n) {
  if (n <= 1) {
    return n;
  }
  return c_payload(n - 1) + c_payload(n - 2);
}








































-- file lua_payload.lua

local jit = require("jit")
local ffi = require("ffi")
local os = require("os")

------------------------------------------------

-- fibonacci number
function lua_payload(n)
  if n <= 1 then
    return n
  end
  return lua_payload(n - 1) + lua_payload(n - 2)
end

-- mind project-specific package location --
local c = ffi.load("c_payload")

ffi.cdef[[
double c_payload(double n);
]]

------------------------------------------------

local start_time = os.clock()
local benchmark_time = 10

jit.off()
jit.flush()

local i = 0
while os.clock() - start_time < benchmark_time do
    if i % 2 == 0 then
        c.c_payload(20)
    else
        lua_payload(20)
    end
    i = i + 1
end

The c_payload.c code compiled into a shared library and loaded by LuaJIT via the FFI.

To get the difference, you can take a look at the flamegraphs (learn more about flamegraphs) for this code generated by perf, jit.p, and PoC for the proposed profiler below.

perf

jit.p

sysprof

As you can see, the sysprof's output looks like merged flamegraphs produced by perf and jit.p.

Detailed design

The proposed approach is to reuse the backend implemented in the existing jit.p profiler and add host stack capturing on top of it. It is important, that we create a different API for the new profiler, so the user can choose, which one is preferable for a particular case. But only one profiler can be used at a time.

Prerequisites

Firstly, it is necessary to make timer machinery in lj_profile.c self-reliant by introducing lj_profile_timer structure and start/stop interface for it. This makes the timer useful for not only for jit.p but for sysprof too.

Another thing to do is to fix an invalid vmstate during stack resize. The issue itself is similar to the tarantool/tarantool#5842.

Host stack

The default sampling profiler implementation in LuaJIT, which can be seen here, follows this flow:

luaJIT_profile_start --> profile_timer_start

...                                                  
                                                    |lock VM state
[signal emmited] --> profile_signal_trigger:      __|prepare args for a callback
                                                    |schedule callback execution
                                                    |unlock VM state
...

luaJIT_profile_stop --> profile_timer_stop

Callback, which is scheduled by profile_signal_trigger can be used to dump needed information, including VM stack. However, the mechanism of VM hooks is used for a callback, so it executes later than the signal was emitted. That is why even though the guest stack is still the same by the time when a callback is executed, the host stack has been changed already, so the final stack dump can not be considered valid.

Hence, to get a valid final snapshot of both stacks, a dump should be done right at the signal, like there.

The host stack can be dumped with backtrace(3). However, for complex stack layouts (e.g. fiber stacks in Tarantool), it may not be enough, so one can change default backtracing function to another.

VM stack

Depends on the VM state when signal occured, we consider guest stack valid or not. For example, we can not dump Lua stack, when VM is in the INTERP state because stack pointers can be inconsistent.

We are using an implementation similar to the one, which is used in lj_debug_dumpstack to dump guest stack. But there is a problem with that because sometimes the VM stack can be invalid, thanks to this bug. As you can see down the link, VM state changes to LFUNC, and after that stack reallocation takes place. So if our signal takes place in between, we will get a segmentation fault. Anyway, that issue is easy to fix, so this approach is suitable.

Traces

If JIT is enabled while the profiler is running, traces are recorded as well. For now, we report only a traceno and info about a proto, where trace recording started.

Event streaming

The event stream has the following format:

Event stream format:

stream          := symtab sysprof
symtab          := see symtab description
sysprof         := prologue sample* epilogue
prologue        := 'l' 'j' 'p' version reserved
version         := <BYTE>
reserved        := <BYTE> <BYTE> <BYTE>
sample          := sample-guest | sample-host | sample-trace
sample-guest    := sample-header stack-lua stack-host
sample-host     := sample-header stack-host
sample-trace    := sample-header traceno sym-addr line-no
sample-header   := <BYTE>
stack-lua       := frame-lua* frame-lua-last
stack-host      := frame-host* frame-host-last
frame-lua       := frame-lfunc | frame-cfunc | frame-ffunc
frame-lfunc     := frame-header sym-addr line-no
frame-cfunc     := frame-header exec-addr
frame-ffunc     := frame-header ffid
frame-lua-last  := frame-header
frame-header    := <BYTE>
frame-host      := exec-addr
frame-host-last := <ULEB128>
line-no         := <ULEB128>
traceno         := <ULEB128>
ffid            := <ULEB128>
sym-addr        := <ULEB128>
exec-addr       := <ULEB128>
epilogue        := sample-header

<BYTE>   :  A single byte (no surprises here)
<ULEB128>:  Unsigned integer represented in ULEB128 encoding

(Order of bits below is hi -> lo)

version: [VVVVVVVV]
 * VVVVVVVV: Byte interpreted as a plain integer version number

sample-header: [FUUUUEEE]
 * EEE  : 3 bits for representing vmstate (LJ_VMST_*)
 * UUUU : 4 unused bits
 * F    : 0 for regular samples, 1 for epilogue's Final header
          (if F is set to 1, all other bits are currently ignored)

frame-header: [FUUUUUEE]
 * EE    : 2 bits for representing frame type (FRAME_*)
 * UUUUU : 5 unused bits
 * F     : 0 for regular frames, 1 for final frame
           (if F is set to 1, all other bits are currently ignored)

frame-host-last = NULL

Symtab

It is a heavy task to dump names of functions every time, so instead, we will dump a symbol table in the beginning. Later on, it will be sufficient to dump only a function's address. However, some functions can be loaded and unloaded several times, and their addresses will be different each time. Hence, we will update the symbol table accordingly. To carry out the symtab update, we will drop in new symtab record into the main event stream.

A symbol table looks like this (the same format as symtab in memprof):

symtab         := prologue sym*
prologue       := 'l' 'j' 's' version reserved
version        := <BYTE>
reserved       := <BYTE> <BYTE> <BYTE>
sym            := sym-lua | sym-final
sym-lua        := sym-header sym-addr sym-chunk sym-line
sym-header     := <BYTE>
sym-addr       := <ULEB128>
sym-chunk      := string
sym-line       := <ULEB128>
sym-final      := sym-header
string         := string-len string-payload
string-len     := <ULEB128>
string-payload := <BYTE> {string-len}

<BYTE>   :  A single byte (no surprises here)
<ULEB128>:  Unsigned integer represented in ULEB128 encoding

(Order of bits below is hi -> lo)

version: [VVVVVVVV]
 * VVVVVVVV: Byte interpreted as a plain numeric version number

sym-header: [FUUUUUTT]
 * TT    : 2 bits for representing symbol type
 * UUUUU : 5 unused bits
 * F     : 1 bit marking the end of the symtab (final symbol)

API

For the system profiler both C API and Lua API are implemented. The C API for the sysprof is described in the <lmisclib.h> header.

The C API looks like this:











/* Configure the proflier. */
LUAMISC_API int luaM_sysprof_configure(const struct luam_sysprof_config *config);
/* Start profiling. */
LUAMISC_API int luaM_sysprof_start(lua_State *L,
                                   const struct luam_sysprof_options *opt);
/* Stop profiling. */
LUAMISC_API int luaM_sysprof_stop(lua_State *L);

/* Get counters for each VM state. */
LUAMISC_API int luaM_sysprof_report(struct luam_sysprof_counters *counters);

The profiler can work in three modes:



















enum luam_sysprof_mode {
  /*
  ** DEFAULT mode collects only data for luam_sysprof_counters, which is stored
  ** in memory and can be collected with luaM_sysprof_report after profiler
  ** stops.
  */
  LUAM_SYSPROF_DEFAULT,
  /*
  ** LEAF mode = DEFAULT + streams samples with only top frames of host and
  ** guests stacks in format described above.
  */
  LUAM_SYSPROF_LEAF,
  /*
  ** CALLGRAPH mode = DEFAULT + streams samples with full callchains of host
  ** and guest stacks in format described above.
  */
  LUAM_SYSPROF_CALLGRAPH
};

Available profiler options are stored in the luam_sysprof_options structure.













/* Profiler options */
struct luam_sysprof_options {
  /* Profiling mode */
  enum luam_sysprof_mode mode;
  /* Sampling interval in msec */
  uint64_t interval;
  /* Custom buffer to write data. */
  uint8_t *buf;
  /* The buffer's size. */
  size_t len;
  /* Context for the profile writer and final callback. */
  void *ctx;
};

As you can see, writer backend is customizable, and our implementation does not lock one at the C-API level.

Backtracing function, frame writer and on-stop callback is a one more degree of freedom:
























struct luam_sysprof_config {
  /*
  ** Writer function for profile events.
  ** Should return amount of written bytes on success or zero in case of error.
  ** Setting *data to NULL means end of profiling.
  ** For details see <lj_wbuf.h>.
  */
  size_t (*writer)(const void **data, size_t len, void *ctx);
  /*
  ** Callback on profiler stopping. Required for correctly cleaning
  ** at VM finalization when profiler is still running.
  ** Returns zero on success.
  */
  int (*on_stop)(void *ctx, uint8_t *buf);
  /*
  ** Backtracing function for the host stack. Should call `frame_writer` on
  ** each frame in the stack in the order from the stack top to the stack
  ** bottom. The `frame_writer` function is implemented inside the sysprof
  ** and will be passed to the `backtracer` function. If `frame_writer` returns
  ** NULL, backtracing should be stopped. If `frame_writer` returns not NULL,
  ** the backtracing should be continued if there are frames left.
  */
  void (*backtracer)(void *(*frame_writer)(int frame_no, void *addr));
};

One can replace backtracing function by passing properly filled luam_sysprof_config structure to luaM_sysprof_configure function.

To start the profiler one needs to fill the luam_sysprof_options with desired values and pass it to the luaM_sysprof_start function. In case of successful start, it returns SYSPROF_SUCCESS.

To stop the profiler, one should call luaM_sysprof_stop function, which will return SYSPROF_SUCCESS in case of successful profiling finalization.

Starting profiler from Lua is simple:

local started, err, errno = misc.sysprof.start{
  mode = profiling_mode,
  interval = sampling_interval,
  path = fname,
}

where profiling_mode is one of the following:

"D" - DEFAULT mode
"L" - LEAF mode
"C" - CALLGRAPH mode

In the DEFAULT mode, only the counters for each state will be collected. In the LEAF mode the sysprof provides you with information about a top stack frame for each sample. And in the CALLCHAIN mode, the profiler gives you a complete call stack for each sample.

Take a look at the examples below to understand the difference.


`DEFAULT` mode


`LEAF` mode


`CALLCHAIN` mode

sampling_interval is the sampling period in milliseconds. fname is name of the file where profile events are written. Writer for this function perform fwrite() for each call retrying in case of EINTR. When the profiling is stopped fclose() is called. The profiler's function's contract is similar to standard io.* interfaces. If it is impossible to open a file for writing or profiler fails to start, nil is returned on failure (plus an error message as a second result and a system-dependent error code as a third result). Otherwise, returns true value.

Stopping profiler from Lua is simple too:

local stopped, err, errno = misc.sysprof.stop()

If there is any error occurred at profiling stopping (an error when file descriptor was closed) sysprof.stop() returns nil (plus an error message as a second result and a system-dependent error code as a third result). Returns true otherwise.

If you want to build LuaJIT without the Lua and platform profiler, you should build it with -DLUAJIT_DISABLE_SYSPROF=ON. If it is disabled misc.sysprof.start() and misc.sysprof.stop() always return false.

Retrieving the results

Profiler's binary output can be parsed with the luajit-parse-sysprof utility:

luajit-parse-sysprof - parser of the profile collected
                       with LuaJIT's sysprof.

SYNOPSIS

luajit-parse-sysprof [options] sysprof.bin

Supported options are:

  --help                            Show this help and exit
  --split                           Split callchains by vmstate

Or, alternatively, you can do it like this:

tarantool -e 'require("sysprof")(arg)' - sysprof_output.bin

It produces another file, which can be read by flamegraph.pl utility to generate a flamegraph image like this

sysprof

Rationale and alternatives

Another way to implement such a thing is to make perf to see guest stack. To do so, we need to map virtual machine symbols (and that functionality is present in LuaJIT (link)) and do something so perf could unwind the virtual machine stack.
Stack unwinding from outside of the LuaJIT is the problem we didn’t manage to solve for today. There are different approaches to do this:

Save rsp register value to rbp and preserve rbp. However, LuaJIT uses rbp as a general-purpose register, and it is hard not to break everything trying to use it only for stack frames.
Coordinated work of jit.p and perf. This approach requires modifying perf the way it will send LuaJIT suspension signal, and after getting info about the host stack, it will receive information about the guest stack and join them. This solution is quite possible, but modified perf doesn't seem like a production-ready solution.
Dwarf unwinding

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`	在筆記中貼入程式碼
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.