Tracking down a segfault

I am trying to get the numpy-feedstock from conda-forge to work on PyPy + win64. There is a segfault that only appears

on windows 64-bit
~~on hardware that has a AVX512 CPU~~ the crash happens without this
when using the /GL compiler flag and the corresponding /LTCG linker flag for whole program optimization (the default in Python3)
when using the JIT
in NumPy

Of course this is the particular setup that I found reproduces consistently. It may be possible to trigger the segfault in other ways.

The segfault

There is a struct ufunc_full_args defined as {PyObject*, PyObject*} passed to a function with the declaraction

static int
f(PyUFuncObject *ufunc, 
  PyArrayObject *op[],
  ufunc_full_args full_args, 
  PyObject *type_tup, 
  PyObject *extobj,
  NPY_CASTING casting, 
  NPY_ORDER order, 
  npy_bool subok,
  PyArrayObject *wheremask`
 )

Before the single call to the function, full_args is valid. Occasionally once inside the function, it is invalid (zeroed out), at least according to fprintf statements sprinkled in the code to print the struct together with the __LINE__ of code.

Reproducing

Here is how to reproduce

Set up

Get a conda environment and activate it

conda create -n pypy3.7 pypy
conda activate pypy3.7
conda install cython libblas libcblas liblapack \
    pytest hypothesis m2-patch

This is a maximal list of packages in order to run the whole NumPy test suite. Later on I was able to cut down the reproducer to a simpler py file, so simplification may be possible

Get the numpy source code, build NumPy to work in-place

git clone https://github.com/numpy/numpy.git
cd numpy
git checkout v1.21.0
python setup.py build_ext -i

The minimum reproducer (so far)

Put this code in a file test_loop.py in the numpy root directory so it can run without needed to change PYTHONPATH. It is taken from the NumPy tests and wrapped to loop 100,000 times

import numpy as np
import warnings

# Writing to a broadcasted array with overlap 
# should warn, gh-2705
for i in range(100_000):
    a = np.arange(2)
    b = np.arange(4).reshape((2, 2))
    u, v = np.broadcast_arrays(a, b)
    assert u.strides[0] == 0
    x = u + v
    print(11)
    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always")
        print(14, i)
        u += v
        print(16)
print('done')

and run it python test_loop.py. It should crash after the print(14, i) and before the print(16), never reaching done

Instrumentation

Here are two ways to catch the exact place the segfault happens

Via the Visual Studio debugger

<rant> There seems to be no easy way to build a c-extension module via distutils and set the compiler to "add debug info" in windows. On Linux I can simply export CFLAGS=-O0 -g</rant>.

Modify the file <conda-root>/envs/pypy3.7/lib-python/3/distutils/_msvccompiler.py to add debug info:

at line 249 add the /Zi flag

self.compile_options = [
  '/nologo', '/Ox', '/W3', '/GL', '/DNDEBUG', '/MD', '/Zi'
]

at line 256 add the /DEBUG flag

ldflags = [
  '/nologo', '/INCREMENTAL:NO', '/LTCG', '/DEBUG'
]

add a import pdb;pdb.set_trace() in the test file
run the test. When you hit the pdb prompt, attach to the python process with the Visual Studio debugger. Then continue the test.

Now you should hit an exception. But what is causing it? Hard to tell.

[seberg] The full_args.in value says Error reading register value.?

Via print statements

I ended up using print statements because I wanted to pinpoint the change.

Apply the diff below as a patch (that is why you install m2-patch in the conda environment :) ) (TBD: add instructions for running patch, I think it is patch -i <path-to-file, there may be a way to get git to do it)
Rebuild python setup.py build_ext -i

Run the test. For me this crashed after 14618 iterations, printing (@seberg: For me it crashes after a random number of iterations 12478, 1582)

14 14616
4884 000002D206A80610 2631 000002D206A80610 4898 16
4884 000002D205D8A770 2631 000002D205D8A770 4898 11
14 14617
4884 000002D2059A35A0 2631 000002D2059A35A0 4898 16
4884 000002D206CBB000 2631 000002D206CBB000 4898 11
14 14618
4884 000002D205F0FAF0 2631 0000000000000000

The 00000000 is problematic

Things I tried

Changing compilation flags

Any change in optimization like /Od or removing the /GL and /LTCG global optimization flags cleared the segfault.

Changing the function signature to use ufunc_full_args *full_args instead cleared the segfault.

Running with --jit off cleared the segfault

Changing the reproducing code tended to clear the segfault. The reproducer must call the += ufunc with a view on the left-hand-side. Which is weirder still since, as far as I can tell, the segfault happens before entering the ufunc loop

Getting a debug brreakpoint

I could not figure out how to set a watchpoint on the value that is getting zeroed out since the code is heavily optimized. In each invocation of the outside function (ufunc_generic_fastcall), the full_args struct has a different address. The variable itself is optimized out of ufunc_generic_fastcall.

Things @seberg tried

✘ Checking whether tp_call is at fault, Python >3.7 would be using tp_vectorcall so a subtle bug in the tp_call code might go unnoticed. (Checking python debug/valgrind, with tp_vectorcall manually removed)
- no success.
- Doesn't really make sense, then we should see something in pypy on linux…
✔ Add gc.collect() at the end of the loop:
- Script finishes successfully (very slow)
✔ Change ufunc_full_args full_args = {NULL, NULL}; to volatile ufunc_full_args full_args = {NULL, NULL};
- Script finishes successfully.
?? changing the code to np.add(u, v, u, subok=False) (equivlanet without subok) prints out that full_args.in gets NULL (as an if (full_args.in == NULL) in PyUFunc_GenericFunctionInternal), but does not crash. Even if I change it to Py_DECREF(full_args.in) from Py_XDECREF(full_args.in) (maybe the compiler knows to optimize it out?)
- Tracing the value through breakpoints (tracing points) agrees that it first is NULL when PyUFunc_GenericFunctionInternal is called…
- Use gc.hooks.on_gc_minor and gc.hooks.on_gc_collect… see below…
✔ Removing the warning, "fixes" the issue (this may be calling back into Python!?)
- Was unable to break in array_might_be_written easily? Or even PyErr_WarnEx?

OK, The final proof(?) that this is GC related:

(Note about the plot: The failures do not start immediately, they are only this regular after they started first?)

In that plot dark means a failure (run with subok=False, printing out "X" if a NULL happened, else printing out "."). Basically, whenever NO failure occured, there is white space, whenever a failure occurred there is purple. All failures are aligned at 0, and before/afer events are plotted along the y-Axis. The failures occure about every 50 events here, so the line repeats (approximately).

(Also note that the loop contains u + v, which also adds a blank space "success event")

Te grean colors are GC related runs.

Result: Whenever a failure is printed, a GC run happened just before or sometimes after. If the gc run is indeed threaded, this is probably expected to fluctuate. But, it never happens that there are many calls in between.

Adapted script, plotting script, and diff:

The script to generate is:

import numpy as np
import gc
import sys
import warnings

gc.hooks.on_gc_minor = lambda x: (print("M"), sys.stdout.flush())
gc.hooks.on_gc_collect_step = lambda x: (print("C"), sys.stdout.flush())
gc.hooks.on_gc_collect = lambda x: (print("F"), sys.stdout.flush())

# input("  <Press Enter to start>")

# Writing to a broadcasted array with overlap 
# should warn, gh-2705
for i in range(100_000):
    a = np.arange(2)
    b = np.arange(4).reshape((2, 2))
    u, v = np.broadcast_arrays(a, b)
    assert u.strides[0] == 0
    x = u + v
    # print(11)
    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always")
        # print(14, i)
        np.add(u, v, u, subok=False)
        # print(16)
    # gc.collect()  <-- makes things pass?!
print('done')

The diff to print in the ufunc code (sorry, not minimal, but retaining anyway):

diff --git a/numpy/core/src/umath/ufunc_object.c b/numpy/core/src/umath/ufunc_object.c
index 9e73dfd94..fc003c783 100644
--- a/numpy/core/src/umath/ufunc_object.c
+++ b/numpy/core/src/umath/ufunc_object.c
@@ -2615,6 +2615,14 @@ fail:
 }


+static int
+breaking_point()
+{
+    printf("X");
+    fflush(stdout);
+    return 0;
+}
+
 static int
 PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject **op,
         ufunc_full_args full_args, PyObject *type_tup, PyObject *extobj,
@@ -2629,6 +2637,14 @@ PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject **op,
     npy_intp default_op_out_flags;

     PyArray_Descr *dtypes[NPY_MAXARGS];
+
+    if (full_args.in == NULL) {
+        breaking_point();
+    }
+    else {
+        printf(".");
+        fflush(stdout);
+    }

     /* These parameters come from extobj= or from a TLS global */
     int buffersize = 0, errormask = 0;
@@ -4944,8 +4960,12 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc,
         retobj[i] = wrapped;
     }

+    //if (full_args.in == NULL) {
+    // printf("huhuhuhu %p!\n", full_args.in);
+    //}
+
     Py_XDECREF(typetup);
-    Py_XDECREF(full_args.in);
+    Py_DECREF(full_args.in);
     Py_XDECREF(full_args.out);
     if (ufunc->nout == 1) {
         return retobj[0];
@@ -4960,6 +4980,7 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc,
         return (PyObject *)ret;
     }

+
 fail:
     Py_XDECREF(typetup);
     Py_XDECREF(full_args.in);

The script to analyze:

# -*- coding: utf-8 -*-

import numpy as np
from matplotlib import pyplot as plt

r = open("numpy/output.txt").read().replace("\n", "").replace("\x00", "").split("warnings")[-1]
arr = np.frombuffer(r.encode("ascii"), dtype="S1")

colors = np.full(len(arr), np.nan, dtype=np.float32)
colors[arr == b"X"] = 1
colors[arr == b"M"] = 3
colors[arr == b"C"] = 3.5
colors[arr == b"F"] = 4

w = np.flatnonzero(arr == b"X")[3:2000]  # find failures, skip some for padding

delay = np.arange(-70, 70)[:, None] 
embed = w + delay


plt.pcolormesh(w, delay[:, 0], colors[embed])
plt.ylabel("Event before/after failure")
plt.xlabel("Global event count")

ax = plt.gca()
axins = ax.inset_axes([0.55, 0.55, 0.4, 0.4])
axins.pcolormesh(w, delay[:, 0], colors[embed])
x1, x2, y1, y2 = 41000, 44000, -10, 10
axins.set_xlim(x1, x2)
axins.set_ylim(y1, y2)
axins.set_xticklabels("")
axins.set_yticklabels("")

ax.indicate_inset_zoom(axins, edgecolor="black")

plt.savefig("events.png", dpi=800)

plt.show()

What is next?

The two instrumentation techiniques give different results. The prints make it look like the function call I noted above, the debugger says the segfault is elsewhere.

Here are a few "shower thoughts". Maybe any of these, or maybe something else:

Some other thread is messing things up. What else might be running in a thread? ~~The PyPy garbage collector comes to mind.~~ (silly me, the GC runs in the main thread) The full_args.in pointer that is being zeroed-out is a PyObject*, maybe it is being collected? But that doesn't make sense, the pointer would not be replaced, and I checked the refcount, the object is alive.
- GC collector seems a very likely related (see "Things @seberg tried"). I suppose this is somehow in registers (and that is why volatile, etc. helps?).
A compiler bug (yeah, right). But even if so, the failure should be consistent.
The printing is not actually doing what I think it is and the problem is elsewhere. Maybe code reordering is messing with the prints
Maybe I should go back to the debugger and try to look at the assembler code for the jump into the function: what is the register layout for this call? (since it is static, is it being inlined? )

@antocuni notes

I don't have a Windows machine to try but let me add a couple of comments:

GC/thread: AFAIK, the PyPy GC does not run on its own thread. If the GC kicks in, it runs in the current thread.
The volatile thing is interesting, because it probably means that the memory where the full_args resides is not changed (else you would get the crash even with volatile), but that somehow the code generated by the compiler thinks that full_args.in is zeroed.
Tenative explanation: the compiler optimizes the code heavily and stores full_args.in in some register. At some point the GC kicks in, the register is cleared and it's never restored. This could happend because of a compiler bug (if it's in a caller saved register which is not saved properly) or a pypy GC/JIT bug (if it's in a callee saved register which is overwritten by mistake).

@seberg, register inspection:

I tried inspecting registers with racing points on:

volatile ones:
  {(void *)$RAX}, {(void *)$RCX}, {(void *)$RDX}, {(void *)$R8}, {(void *)$R9}, {(void *)$R10}, {(void *)$R11}
  {*(void **)$RAX}, {*(void **)$RCX}, {*(void **)$RDX}, {*(void **)$R8}, {*(void **)$R9}, {*(void **)$R10}, {*(void **)$R11}

non-volatile ones:
  {(void *)$RBX}, {(void *)$RBP}, {(void *)$RDI}, {(void *)$RSI}, {(void *)$RSP}, {(void *)$R12}, {(void *)$R13}, {(void *)$R14}, {(void *)$R15}
  {*(void **)$RBX}, {(void *)$RBP}, {*(void **)$RDI}, {*(void **)$RSI}, {*(void **)$RSP}, {*(void **)$R12}, {*(void **)$R13}, {*(void **)$R14}, {*(void **)$R15}

On entry to PyUFunc_GenericFunctionInternal, full_args.in is definitely stored as *(void **)$R8, i.e. the R8 register contains a reference to full_args.in.

But: That is a volatile (caller) saved register…

Things which are worth to try

try to make the struct ufunc_full_args bigger: maybe the compiler applies this optimization only if the sizeof is <=16.
try to put ufunc_generic_fastcall and/or PyUFunc_GenericFunctionInternal in their own separate C files. Maybe if they are in different compilation unit the compiler does not apply this optimization
- (seberg) Did not try this, but adding __forceinline to each function seems to "fix" things somewhat: Adding it to PyUFunc_GenericFunctionInternal fixes the print there, adding it to _find_array_prepare fixes the segfault with subok=True.
Try to trigger the GC repeteadly. If our theory is correct, the following should eventually trigger the assert:




for(int i=0; i<1000000; i++) {
    assert(full_args.in != 0);
    PyRun_SimpleString("'a'*1000")
}

(seberg) tried briefly, but couldn't make it run into the error.

If the above suggestion doesn't work, it might be that the culprit is not the GC alone but the GC+JIT. So we could try with a more complex example in which we run some Python code which eventually triggers the JIT.

@antocuni notes

I don't have a Windows machine to try but let me add a couple of comments:

GC/thread: AFAIK, the PyPy GC does not run on its own thread. If the GC kicks in, it runs in the current thread.
The volatile thing is interesting, because it probably means that the memory where the full_args resides is not changed (else you would get the crash even with volatile), but that somehow the code generated by the compiler thinks that full_args.in is zeroed.
Tenative explanation: the compiler optimizes the code heavily and stores full_args.in in some register. At some point the GC kicks in, the register is cleared and it's never restored. This could happend because of a compiler bug (if it's in a caller saved register which is not saved properly) or a pypy GC/JIT bug (if it's in a callee saved register which is overwritten by mistake).

Patch for printing

diff --git a/numpy/core/src/umath/ufunc_object.c b/numpy/core/src/umath/ufunc_object.c
index 0644a28c0..3f9702940 100644
--- a/numpy/core/src/umath/ufunc_object.c
+++ b/numpy/core/src/umath/ufunc_object.c
@@ -1263,8 +1263,8 @@ iterator_loop(PyUFuncObject *ufunc,
                     void *innerloopdata,
                     npy_uint32 *op_flags)
 {
-    npy_intp i, nin = ufunc->nin, nout = ufunc->nout;
-    npy_intp nop = nin + nout;
+    int i, nin = ufunc->nin, nout = ufunc->nout;
+    int nop = nin + nout;
     NpyIter *iter;
     char *baseptrs[NPY_MAXARGS];

@@ -1766,7 +1766,7 @@ _parse_axes_arg(PyUFuncObject *ufunc, int op_core_num_dims[], PyObject *axes,
         PyErr_SetString(PyExc_TypeError, "axes should be a list.");
         return -1;
     }
-    list_size = PyList_Size(axes);
+    list_size = (int)PyList_Size(axes);
     if (list_size != nop) {
         if (list_size != nin || _has_output_coredims(ufunc)) {
             PyErr_Format(PyExc_ValueError,
@@ -2616,7 +2616,7 @@ fail:


 static int
-PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject **op,
+PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject *op[],
         ufunc_full_args full_args, PyObject *type_tup, PyObject *extobj,
         NPY_CASTING casting, NPY_ORDER order, npy_bool subok,
         PyArrayObject *wheremask)
@@ -2626,8 +2626,10 @@ PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject **op,
     const char *ufunc_name;
     int retval = -1;
     npy_uint32 op_flags[NPY_MAXARGS];
-    npy_intp default_op_out_flags;
+    npy_uint32 default_op_out_flags;

+    fprintf(stdout, "%d %p ", __LINE__, full_args.in);
+    fflush(stdout);
     PyArray_Descr *dtypes[NPY_MAXARGS];

     /* These parameters come from extobj= or from a TLS global */
@@ -4206,7 +4208,7 @@ PyUFunc_GenericReduction(PyUFuncObject *ufunc,
         }
     }
     else if (PyTuple_Check(axes_obj)) {
-        naxes = PyTuple_Size(axes_obj);
+        naxes = (int)PyTuple_Size(axes_obj);
         if (naxes < 0 || naxes > NPY_MAXDIMS) {
             PyErr_SetString(PyExc_ValueError,
                     "too many values for 'axis'");
@@ -4767,7 +4769,6 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc,
     if (kwnames != NULL) {
         if (!ufunc->core_enabled) {
             NPY_PREPARE_ARGPARSER;
-
             if (npy_parse_arguments(ufunc->name, args + len_args, 0, kwnames,
                     "$out", NULL, &out_obj,
                     "$where", NULL, &where_obj,
@@ -4880,6 +4881,8 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc,
             keepdims_obj, &keepdims) < 0) {
         goto fail;
     }
+    fprintf(stdout, "%d %p ", __LINE__, full_args.in);
+    fflush(stdout);

     if (!ufunc->core_enabled) {
         errval = PyUFunc_GenericFunctionInternal(ufunc, operands,
@@ -4892,6 +4895,8 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc,
                 full_args, typetup, extobj, casting, order, subok,
                 axis_obj, axes_obj, keepdims);
     }
+    fprintf(stdout, "%d ", __LINE__);
+    fflush(stdout);

     if (errval < 0) {
         goto fail;