--- tags: PyPy --- # Tracking down a segfault I am trying to get the [numpy-feedstock](https://github.com/conda-forge/numpy-feedstock/pull/238) from conda-forge to work on PyPy + win64. There is a segfault that only appears - on windows 64-bit - ~~on hardware that has a AVX512 CPU~~ the crash happens without this - when using the `/GL` compiler flag and the corresponding `/LTCG` linker flag for whole program optimization (the default in Python3) - when using the JIT - in NumPy Of course this is the particular setup that I found reproduces consistently. It may be possible to trigger the segfault in other ways. ## The segfault There is a struct `ufunc_full_args` defined as `{PyObject*, PyObject*}` passed to [a function](https://github.com/numpy/numpy/blob/v1.21.0/numpy/core/src/umath/ufunc_object.c#L2619) with the declaraction ``` static int f(PyUFuncObject *ufunc, PyArrayObject *op[], ufunc_full_args full_args, PyObject *type_tup, PyObject *extobj, NPY_CASTING casting, NPY_ORDER order, npy_bool subok, PyArrayObject *wheremask` ) ``` Before [the single call](https://github.com/numpy/numpy/blob/v1.21.0/numpy/core/src/umath/ufunc_object.c#L4885) to the function, `full_args` is valid. Occasionally once inside the function, it is invalid (zeroed out), at least according to `fprintf` statements sprinkled in the code to print the struct together with the `__LINE__` of code. ## Reproducing Here is how to reproduce ### Set up Get a conda environment and activate it ``` conda create -n pypy3.7 pypy conda activate pypy3.7 conda install cython libblas libcblas liblapack \ pytest hypothesis m2-patch ``` This is a maximal list of packages in order to run the whole NumPy test suite. Later on I was able to cut down the reproducer to a simpler `py` file, so simplification may be possible Get the numpy source code, build NumPy to work in-place ``` git clone https://github.com/numpy/numpy.git cd numpy git checkout v1.21.0 python setup.py build_ext -i ``` ### The minimum reproducer (so far) Put this code in a file `test_loop.py` in the numpy root directory so it can run without needed to change `PYTHONPATH`. It is taken from the NumPy tests and wrapped to loop 100,000 times ``` import numpy as np import warnings # Writing to a broadcasted array with overlap # should warn, gh-2705 for i in range(100_000): a = np.arange(2) b = np.arange(4).reshape((2, 2)) u, v = np.broadcast_arrays(a, b) assert u.strides[0] == 0 x = u + v print(11) with warnings.catch_warnings(record=True) as w: warnings.simplefilter("always") print(14, i) u += v print(16) print('done') ``` and run it `python test_loop.py`. It should crash after the `print(14, i)` and before the `print(16)`, never reaching `done` ### Instrumentation Here are two ways to catch the exact place the segfault happens #### Via the Visual Studio debugger <rant> There seems to be no easy way to build a c-extension module via `distutils` and set the compiler to "add debug info" in windows. On Linux I can simply export `CFLAGS=-O0 -g`</rant>. Modify the file `<conda-root>/envs/pypy3.7/lib-python/3/distutils/_msvccompiler.py` to add debug info: - at line 249 add the `/Zi` flag ``` self.compile_options = [ '/nologo', '/Ox', '/W3', '/GL', '/DNDEBUG', '/MD', '/Zi' ] ``` - at line 256 add the `/DEBUG` flag ``` ldflags = [ '/nologo', '/INCREMENTAL:NO', '/LTCG', '/DEBUG' ] ``` - add a `import pdb;pdb.set_trace()` in the test file - run the test. When you hit the `pdb` prompt, attach to the python process with the Visual Studio debugger. Then continue the test. Now you should hit an exception. But what is causing it? Hard to tell. > [seberg] The `full_args.in` value says `Error reading register value.`? #### Via print statements I ended up using print statements because I wanted to pinpoint the change. - Apply the diff below as a patch (that is why you install m2-patch in the conda environment :) ) (TBD: add instructions for running `patch`, I think it is `patch -i <path-to-file`, there may be a way to get `git` to do it) - Rebuild `python setup.py build_ext -i` - Run the test. For me this crashed after 14618 iterations, printing (@seberg: For me it crashes after a random number of iterations 12478, 1582) ``` 14 14616 4884 000002D206A80610 2631 000002D206A80610 4898 16 4884 000002D205D8A770 2631 000002D205D8A770 4898 11 14 14617 4884 000002D2059A35A0 2631 000002D2059A35A0 4898 16 4884 000002D206CBB000 2631 000002D206CBB000 4898 11 14 14618 4884 000002D205F0FAF0 2631 0000000000000000 ``` The `00000000` is problematic ## Things I tried ### Changing compilation flags Any change in optimization like `/Od` or removing the `/GL` and` /LTCG` global optimization flags cleared the segfault. Changing the function signature to use `ufunc_full_args *full_args` instead cleared the segfault. Running with `--jit off` cleared the segfault Changing the reproducing code tended to clear the segfault. The reproducer must call the `+=` ufunc with a view on the left-hand-side. Which is weirder still since, as far as I can tell, the segfault happens before entering the ufunc loop ### Getting a debug brreakpoint I could not figure out how to set a watchpoint on the value that is getting zeroed out since the code is heavily optimized. In each invocation of the outside function (`ufunc_generic_fastcall`), the `full_args` struct has a different address. The variable itself is optimized out of `ufunc_generic_fastcall`. ## Things @seberg tried * ✘ Checking whether `tp_call` is at fault, Python >3.7 would be using `tp_vectorcall` so a subtle bug in the `tp_call` code might go unnoticed. (Checking python debug/valgrind, with `tp_vectorcall` manually removed) - no success. - Doesn't really make sense, then we should see something in `pypy` on linux... * ✔ Add `gc.collect()` at the end of the loop: - Script finishes successfully (very slow) * ✔ Change `ufunc_full_args full_args = {NULL, NULL};` to **`volatile ufunc_full_args full_args = {NULL, NULL};`** - Script finishes successfully. * ?? changing the code to `np.add(u, v, u, subok=False)` (equivlanet without `subok`) prints out that `full_args.in` gets NULL (as an `if (full_args.in == NULL)` in `PyUFunc_GenericFunctionInternal`), but does not crash. Even if I change it to `Py_DECREF(full_args.in)` from `Py_XDECREF(full_args.in)` (maybe the compiler knows to optimize it out?) * Tracing the value through breakpoints (tracing points) agrees that it first is `NULL` when `PyUFunc_GenericFunctionInternal` is called... * Use `gc.hooks.on_gc_minor` and `gc.hooks.on_gc_collect`... see below... * ✔ Removing the warning, "fixes" the issue (this may be calling back into Python!?) * Was unable to break in `array_might_be_written` easily? Or even `PyErr_WarnEx`? **OK, The final proof(?) that this is GC related:** (Note about the plot: The failures do not start immediately, they are only this regular after they started first?) ![Event positions](https://i.imgur.com/waAeQtg.png) In that plot dark means a failure (run with `subok=False`, printing out `"X"` if a `NULL` happened, else printing out `"."`). Basically, whenever NO failure occured, there is white space, whenever a failure occurred there is purple. All failures are aligned at 0, and before/afer events are plotted along the y-Axis. The failures occure about every 50 events here, so the line repeats (approximately). (Also note that the loop contains `u + v`, which also adds a blank space "success event") Te grean colors are GC related runs. **Result:** Whenever a failure is printed, a GC run happened just before or sometimes after. If the gc run is indeed threaded, this is probably expected to fluctuate. But, it never happens that there are many calls in between. <details><summary> Adapted script, plotting script, and diff: </summary> The script to generate is: ```python import numpy as np import gc import sys import warnings gc.hooks.on_gc_minor = lambda x: (print("M"), sys.stdout.flush()) gc.hooks.on_gc_collect_step = lambda x: (print("C"), sys.stdout.flush()) gc.hooks.on_gc_collect = lambda x: (print("F"), sys.stdout.flush()) # input(" <Press Enter to start>") # Writing to a broadcasted array with overlap # should warn, gh-2705 for i in range(100_000): a = np.arange(2) b = np.arange(4).reshape((2, 2)) u, v = np.broadcast_arrays(a, b) assert u.strides[0] == 0 x = u + v # print(11) with warnings.catch_warnings(record=True) as w: warnings.simplefilter("always") # print(14, i) np.add(u, v, u, subok=False) # print(16) # gc.collect() <-- makes things pass?! print('done') ``` The diff to print in the ufunc code (sorry, not minimal, but retaining anyway): ```diff diff --git a/numpy/core/src/umath/ufunc_object.c b/numpy/core/src/umath/ufunc_object.c index 9e73dfd94..fc003c783 100644 --- a/numpy/core/src/umath/ufunc_object.c +++ b/numpy/core/src/umath/ufunc_object.c @@ -2615,6 +2615,14 @@ fail: } +static int +breaking_point() +{ + printf("X"); + fflush(stdout); + return 0; +} + static int PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject **op, ufunc_full_args full_args, PyObject *type_tup, PyObject *extobj, @@ -2629,6 +2637,14 @@ PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject **op, npy_intp default_op_out_flags; PyArray_Descr *dtypes[NPY_MAXARGS]; + + if (full_args.in == NULL) { + breaking_point(); + } + else { + printf("."); + fflush(stdout); + } /* These parameters come from extobj= or from a TLS global */ int buffersize = 0, errormask = 0; @@ -4944,8 +4960,12 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc, retobj[i] = wrapped; } + //if (full_args.in == NULL) { + // printf("huhuhuhu %p!\n", full_args.in); + //} + Py_XDECREF(typetup); - Py_XDECREF(full_args.in); + Py_DECREF(full_args.in); Py_XDECREF(full_args.out); if (ufunc->nout == 1) { return retobj[0]; @@ -4960,6 +4980,7 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc, return (PyObject *)ret; } + fail: Py_XDECREF(typetup); Py_XDECREF(full_args.in); ``` The script to analyze: ```python # -*- coding: utf-8 -*- import numpy as np from matplotlib import pyplot as plt r = open("numpy/output.txt").read().replace("\n", "").replace("\x00", "").split("warnings")[-1] arr = np.frombuffer(r.encode("ascii"), dtype="S1") colors = np.full(len(arr), np.nan, dtype=np.float32) colors[arr == b"X"] = 1 colors[arr == b"M"] = 3 colors[arr == b"C"] = 3.5 colors[arr == b"F"] = 4 w = np.flatnonzero(arr == b"X")[3:2000] # find failures, skip some for padding delay = np.arange(-70, 70)[:, None] embed = w + delay plt.pcolormesh(w, delay[:, 0], colors[embed]) plt.ylabel("Event before/after failure") plt.xlabel("Global event count") ax = plt.gca() axins = ax.inset_axes([0.55, 0.55, 0.4, 0.4]) axins.pcolormesh(w, delay[:, 0], colors[embed]) x1, x2, y1, y2 = 41000, 44000, -10, 10 axins.set_xlim(x1, x2) axins.set_ylim(y1, y2) axins.set_xticklabels("") axins.set_yticklabels("") ax.indicate_inset_zoom(axins, edgecolor="black") plt.savefig("events.png", dpi=800) plt.show() ``` </details> ## What is next? The two instrumentation techiniques give different results. The prints make it look like the function call I noted above, the debugger says the segfault is elsewhere. Here are a few "shower thoughts". Maybe any of these, or maybe something else: - Some other thread is messing things up. What else might be running in a thread? ~~The PyPy garbage collector comes to mind.~~ (silly me, the GC runs in the main thread) The full_args.in pointer that is being zeroed-out is a `PyObject*`, maybe it is being collected? But that doesn't make sense, the pointer would not be replaced, and I checked the `refcount`, the object is alive. - GC collector seems a very likely related (see "Things @seberg tried"). I suppose this is somehow in registers (and that is why `volatile`, etc. helps?). - A compiler bug (yeah, right). But even if so, the failure should be consistent. - The printing is not actually doing what I think it is and the problem is elsewhere. Maybe code reordering is messing with the prints - Maybe I should go back to the debugger and try to look at the assembler code for the jump into the function: what is the register layout for this call? (since it is `static`, is it being inlined? ) ## @antocuni notes I don't have a Windows machine to try but let me add a couple of comments: * GC/thread: AFAIK, the PyPy GC does **not** run on its own thread. If the GC kicks in, it runs in the current thread. * The `volatile` thing is interesting, because it probably means that the **memory** where the `full_args` resides is not changed (else you would get the crash even with volatile), but that somehow the code generated by the compiler thinks that `full_args.in` is zeroed. * **Tenative explanation**: the compiler optimizes the code heavily and stores `full_args.in` in some register. At some point the GC kicks in, the register is cleared and it's never restored. This could happend because of a compiler bug (if it's in a caller saved register which is not saved properly) or a pypy GC/JIT bug (if it's in a callee saved register which is overwritten by mistake). ##### @seberg, register inspection: * I tried inspecting registers with racing points on: ``` volatile ones: {(void *)$RAX}, {(void *)$RCX}, {(void *)$RDX}, {(void *)$R8}, {(void *)$R9}, {(void *)$R10}, {(void *)$R11} {*(void **)$RAX}, {*(void **)$RCX}, {*(void **)$RDX}, {*(void **)$R8}, {*(void **)$R9}, {*(void **)$R10}, {*(void **)$R11} non-volatile ones: {(void *)$RBX}, {(void *)$RBP}, {(void *)$RDI}, {(void *)$RSI}, {(void *)$RSP}, {(void *)$R12}, {(void *)$R13}, {(void *)$R14}, {(void *)$R15} {*(void **)$RBX}, {(void *)$RBP}, {*(void **)$RDI}, {*(void **)$RSI}, {*(void **)$RSP}, {*(void **)$R12}, {*(void **)$R13}, {*(void **)$R14}, {*(void **)$R15} ``` On entry to `PyUFunc_GenericFunctionInternal`, `full_args.in` is definitely stored as `*(void **)$R8`, i.e. the R8 register contains a reference to `full_args.in`. But: That is a volatile (caller) saved register... ### Things which are worth to try 1. try to make the struct `ufunc_full_args` bigger: maybe the compiler applies this optimization only if the sizeof is <=16. 2. try to put `ufunc_generic_fastcall` and/or `PyUFunc_GenericFunctionInternal` in their own separate C files. Maybe if they are in different compilation unit the compiler does not apply this optimization * (seberg) Did not try this, but adding `__forceinline` to each function seems to "fix" things somewhat: Adding it to `PyUFunc_GenericFunctionInternal` fixes the print there, adding it to `_find_array_prepare` fixes the segfault with `subok=True`. 3. Try to trigger the GC repeteadly. If our theory is correct, the following should eventually trigger the assert: ```c= for(int i=0; i<1000000; i++) { assert(full_args.in != 0); PyRun_SimpleString("'a'*1000") } ``` * (seberg) tried briefly, but couldn't make it run into the error. 4. If the above suggestion doesn't work, it might be that the culprit is not the GC alone but the GC+JIT. So we could try with a more complex example in which we run some Python code which eventually triggers the JIT. ## @antocuni notes I don't have a Windows machine to try but let me add a couple of comments: * GC/thread: AFAIK, the PyPy GC does **not** run on its own thread. If the GC kicks in, it runs in the current thread. * The `volatile` thing is interesting, because it probably means that the **memory** where the `full_args` resides is not changed (else you would get the crash even with volatile), but that somehow the code generated by the compiler thinks that `full_args.in` is zeroed. * **Tenative explanation**: the compiler optimizes the code heavily and stores `full_args.in` in some register. At some point the GC kicks in, the register is cleared and it's never restored. This could happend because of a compiler bug (if it's in a caller saved register which is not saved properly) or a pypy GC/JIT bug (if it's in a callee saved register which is overwritten by mistake). ## Patch for printing ``` diff --git a/numpy/core/src/umath/ufunc_object.c b/numpy/core/src/umath/ufunc_object.c index 0644a28c0..3f9702940 100644 --- a/numpy/core/src/umath/ufunc_object.c +++ b/numpy/core/src/umath/ufunc_object.c @@ -1263,8 +1263,8 @@ iterator_loop(PyUFuncObject *ufunc, void *innerloopdata, npy_uint32 *op_flags) { - npy_intp i, nin = ufunc->nin, nout = ufunc->nout; - npy_intp nop = nin + nout; + int i, nin = ufunc->nin, nout = ufunc->nout; + int nop = nin + nout; NpyIter *iter; char *baseptrs[NPY_MAXARGS]; @@ -1766,7 +1766,7 @@ _parse_axes_arg(PyUFuncObject *ufunc, int op_core_num_dims[], PyObject *axes, PyErr_SetString(PyExc_TypeError, "axes should be a list."); return -1; } - list_size = PyList_Size(axes); + list_size = (int)PyList_Size(axes); if (list_size != nop) { if (list_size != nin || _has_output_coredims(ufunc)) { PyErr_Format(PyExc_ValueError, @@ -2616,7 +2616,7 @@ fail: static int -PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject **op, +PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject *op[], ufunc_full_args full_args, PyObject *type_tup, PyObject *extobj, NPY_CASTING casting, NPY_ORDER order, npy_bool subok, PyArrayObject *wheremask) @@ -2626,8 +2626,10 @@ PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject **op, const char *ufunc_name; int retval = -1; npy_uint32 op_flags[NPY_MAXARGS]; - npy_intp default_op_out_flags; + npy_uint32 default_op_out_flags; + fprintf(stdout, "%d %p ", __LINE__, full_args.in); + fflush(stdout); PyArray_Descr *dtypes[NPY_MAXARGS]; /* These parameters come from extobj= or from a TLS global */ @@ -4206,7 +4208,7 @@ PyUFunc_GenericReduction(PyUFuncObject *ufunc, } } else if (PyTuple_Check(axes_obj)) { - naxes = PyTuple_Size(axes_obj); + naxes = (int)PyTuple_Size(axes_obj); if (naxes < 0 || naxes > NPY_MAXDIMS) { PyErr_SetString(PyExc_ValueError, "too many values for 'axis'"); @@ -4767,7 +4769,6 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc, if (kwnames != NULL) { if (!ufunc->core_enabled) { NPY_PREPARE_ARGPARSER; - if (npy_parse_arguments(ufunc->name, args + len_args, 0, kwnames, "$out", NULL, &out_obj, "$where", NULL, &where_obj, @@ -4880,6 +4881,8 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc, keepdims_obj, &keepdims) < 0) { goto fail; } + fprintf(stdout, "%d %p ", __LINE__, full_args.in); + fflush(stdout); if (!ufunc->core_enabled) { errval = PyUFunc_GenericFunctionInternal(ufunc, operands, @@ -4892,6 +4895,8 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc, full_args, typetup, extobj, casting, order, subok, axis_obj, axes_obj, keepdims); } + fprintf(stdout, "%d ", __LINE__); + fflush(stdout); if (errval < 0) { goto fail; ```