I am trying to get the numpy-feedstock from conda-forge to work on PyPy + win64. There is a segfault that only appears when building with the /GL compiler flag and the corresponding /LTCG linker flag for whole program optimization (the default in Python3). Of course this is the particular setup that I found reproduces consistently. It may be possible to trigger the segfault in other ways.
There is a struct ufunc_full_args defined as {PyObject*, PyObject*} that is passed to a function with the declaration
static int
f(PyUFuncObject *ufunc,
  PyArrayObject *op[],
  ufunc_full_args full_args,
  PyObject *type_tup,
  PyObject *extobj,
  NPY_CASTING casting,
  NPY_ORDER order,
  npy_bool subok,
  PyArrayObject *wheremask
)
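For reference, the struct is just a pair of object pointers. A paraphrase of the definition (the field comments here are mine, not the original ones):

/* Paraphrased; the two members are what the debugging below refers to
 * as full_args.in and full_args.out. */
typedef struct {
    PyObject *in;   /* tuple of input operands */
    PyObject *out;  /* tuple of output operands, or NULL */
} ufunc_full_args;

On win64 this is a 16-byte aggregate (two 8-byte pointers).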
Before the single call to the function, full_args is valid. Occasionally, once inside the function, it is invalid (zeroed out), at least according to fprintf statements sprinkled in the code to print the struct together with the __LINE__ of code.
Here is how to reproduce
Get a conda environment and activate it
conda create -n pypy3.7 pypy
conda activate pypy3.7
conda install cython libblas libcblas liblapack \
pytest hypothesis m2-patch
This is a maximal list of packages in order to run the whole NumPy test suite. Later on I was able to cut down the reproducer to a simpler py file, so simplification may be possible.
Get the numpy source code and build NumPy to work in-place:
git clone https://github.com/numpy/numpy.git
cd numpy
git checkout v1.21.0
python setup.py build_ext -i
Put this code in a file test_loop.py in the numpy root directory so it can run without needing to change PYTHONPATH. It is taken from the NumPy tests and wrapped to loop 100,000 times:
import numpy as np
import warnings
# Writing to a broadcasted array with overlap
# should warn, gh-2705
for i in range(100_000):
    a = np.arange(2)
    b = np.arange(4).reshape((2, 2))
    u, v = np.broadcast_arrays(a, b)
    assert u.strides[0] == 0
    x = u + v
    print(11)
    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always")
        print(14, i)
        u += v
    print(16)
print('done')
and run it: python test_loop.py. It should crash after the print(14, i) and before the print(16), never reaching done.
Here are two ways to catch the exact place the segfault happens.
<rant> There seems to be no easy way to build a c-extension module via distutils and set the compiler to "add debug info" on Windows. On Linux I can simply export CFLAGS="-O0 -g". </rant>
Modify the file <conda-root>/envs/pypy3.7/lib-python/3/distutils/_msvccompiler.py to add debug info.
Add the /Zi compiler flag:
self.compile_options = [
    '/nologo', '/Ox', '/W3', '/GL', '/DNDEBUG', '/MD', '/Zi'
]
and the /DEBUG linker flag:
ldflags = [
    '/nologo', '/INCREMENTAL:NO', '/LTCG', '/DEBUG'
]
Put import pdb;pdb.set_trace() in the test file. At the pdb prompt, attach to the python process with the Visual Studio debugger. Then continue the test. Now you should hit an exception. But what is causing it? Hard to tell.
[seberg] The full_args.in value says "Error reading register value."?
I ended up using print statements because I wanted to pinpoint the change.
Apply the fprintf patch (the diff at the bottom of this report) using patch; I think it is patch -i <path-to-file>, though there may be a way to get git to do it. Then rebuild with python setup.py build_ext -i and rerun. The tail of the output:
14 14616
4884 000002D206A80610 2631 000002D206A80610 4898 16
4884 000002D205D8A770 2631 000002D205D8A770 4898 11
14 14617
4884 000002D2059A35A0 2631 000002D2059A35A0 4898 16
4884 000002D206CBB000 2631 000002D206CBB000 4898 11
14 14618
4884 000002D205F0FAF0 2631 0000000000000000
The 0000000000000000 is problematic.
Any change in optimization like /Od or removing the /GL and /LTCG global optimization flags cleared the segfault.
Changing the function signature to use ufunc_full_args *full_args instead cleared the segfault (a sketch of that variant follows below).
Running with --jit off cleared the segfault.
Changing the reproducing code tended to clear the segfault. The reproducer must call the += ufunc with a view on the left-hand side, which is weirder still since, as far as I can tell, the segfault happens before entering the ufunc loop.
I could not figure out how to set a watchpoint on the value that is getting zeroed out since the code is heavily optimized. In each invocation of the outside function (ufunc_generic_fastcall), the full_args struct has a different address. The variable itself is optimized out of ufunc_generic_fastcall.
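A minimal sketch of the by-pointer variant mentioned above (the real function is PyUFunc_GenericFunctionInternal; the call site would pass &full_args instead of the struct by value):

static int
f(PyUFuncObject *ufunc,
  PyArrayObject *op[],
  ufunc_full_args *full_args,   /* by pointer instead of by value */
  PyObject *type_tup,
  PyObject *extobj,
  NPY_CASTING casting,
  NPY_ORDER order,
  npy_bool subok,
  PyArrayObject *wheremask
)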
Maybe tp_call is at fault: Python >3.7 would be using tp_vectorcall, so a subtle bug in the tp_call code might go unnoticed. (Checking python debug/valgrind, with tp_vectorcall manually removed.)
It does not reproduce with pypy on linux…
Adding gc.collect() at the end of the loop makes things pass.
Changing ufunc_full_args full_args = {NULL, NULL}; to volatile ufunc_full_args full_args = {NULL, NULL}; clears the segfault.
np.add(u, v, u, subok=False) (equivalent without subok) prints out that full_args.in gets NULL (as an if (full_args.in == NULL) in PyUFunc_GenericFunctionInternal), but does not crash. Even if I change it to Py_DECREF(full_args.in) from Py_XDECREF(full_args.in) (maybe the compiler knows to optimize it out? see the note on Py_XDECREF vs Py_DECREF below).
full_args.in is NULL when PyUFunc_GenericFunctionInternal is called… I added the gc.hooks.on_gc_minor and gc.hooks.on_gc_collect hooks… see below…
Can we check array_might_be_written easily? Or even PyErr_WarnEx?
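(A reminder on the Py_XDECREF/Py_DECREF point above; this is not NumPy code, just the CPython macros' NULL behaviour, which is why the swap was expected to turn a NULL full_args.in into a hard crash.)

#include <Python.h>

/* Py_XDECREF(NULL) is a no-op; Py_DECREF(NULL) dereferences the NULL
 * pointer to adjust the refcount, so it should crash if full_args.in
 * really were NULL at that point. */
static void
drop_reference(PyObject *obj_or_null)
{
    Py_XDECREF(obj_or_null);      /* safe when obj_or_null == NULL */
    /* Py_DECREF(obj_or_null); */ /* would segfault for NULL */
}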
OK, the final proof(?) that this is GC related:
(Note about the plot: the failures do not start immediately; they are only this regular after they first started?)
In that plot dark means a failure (run with subok=False, printing out "X" if a NULL happened, else printing out "."). Basically, whenever NO failure occurred, there is white space; whenever a failure occurred, there is purple. All failures are aligned at 0, and before/after events are plotted along the y-axis. The failures occur about every 50 events here, so the line repeats (approximately).
(Also note that the loop contains u + v, which also adds a blank space "success event".)
The green colors are GC-related runs.
Result: whenever a failure is printed, a GC run happened just before or sometimes after. If the GC run is indeed threaded, this is probably expected to fluctuate. But it never happens that there are many calls in between.
The script to generate it is:
import numpy as np
import gc
import sys
import warnings
gc.hooks.on_gc_minor = lambda x: (print("M"), sys.stdout.flush())
gc.hooks.on_gc_collect_step = lambda x: (print("C"), sys.stdout.flush())
gc.hooks.on_gc_collect = lambda x: (print("F"), sys.stdout.flush())
# input(" <Press Enter to start>")
# Writing to a broadcasted array with overlap
# should warn, gh-2705
for i in range(100_000):
    a = np.arange(2)
    b = np.arange(4).reshape((2, 2))
    u, v = np.broadcast_arrays(a, b)
    assert u.strides[0] == 0
    x = u + v
    # print(11)
    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always")
        # print(14, i)
        np.add(u, v, u, subok=False)
    # print(16)
    # gc.collect() <-- makes things pass?!
print('done')
The diff used for the prints in the ufunc code (sorry, not minimal, but retaining it anyway):
diff --git a/numpy/core/src/umath/ufunc_object.c b/numpy/core/src/umath/ufunc_object.c
index 9e73dfd94..fc003c783 100644
--- a/numpy/core/src/umath/ufunc_object.c
+++ b/numpy/core/src/umath/ufunc_object.c
@@ -2615,6 +2615,14 @@ fail:
}
+static int
+breaking_point()
+{
+ printf("X");
+ fflush(stdout);
+ return 0;
+}
+
static int
PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject **op,
ufunc_full_args full_args, PyObject *type_tup, PyObject *extobj,
@@ -2629,6 +2637,14 @@ PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject **op,
npy_intp default_op_out_flags;
PyArray_Descr *dtypes[NPY_MAXARGS];
+
+ if (full_args.in == NULL) {
+ breaking_point();
+ }
+ else {
+ printf(".");
+ fflush(stdout);
+ }
/* These parameters come from extobj= or from a TLS global */
int buffersize = 0, errormask = 0;
@@ -4944,8 +4960,12 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc,
retobj[i] = wrapped;
}
+ //if (full_args.in == NULL) {
+ // printf("huhuhuhu %p!\n", full_args.in);
+ //}
+
Py_XDECREF(typetup);
- Py_XDECREF(full_args.in);
+ Py_DECREF(full_args.in);
Py_XDECREF(full_args.out);
if (ufunc->nout == 1) {
return retobj[0];
@@ -4960,6 +4980,7 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc,
return (PyObject *)ret;
}
+
fail:
Py_XDECREF(typetup);
Py_XDECREF(full_args.in);
The script to analyze:
# -*- coding: utf-8 -*-
import numpy as np
from matplotlib import pyplot as plt
r = open("numpy/output.txt").read().replace("\n", "").replace("\x00", "").split("warnings")[-1]
arr = np.frombuffer(r.encode("ascii"), dtype="S1")
colors = np.full(len(arr), np.nan, dtype=np.float32)
colors[arr == b"X"] = 1
colors[arr == b"M"] = 3
colors[arr == b"C"] = 3.5
colors[arr == b"F"] = 4
w = np.flatnonzero(arr == b"X")[3:2000] # find failures, skip some for padding
delay = np.arange(-70, 70)[:, None]
embed = w + delay
plt.pcolormesh(w, delay[:, 0], colors[embed])
plt.ylabel("Event before/after failure")
plt.xlabel("Global event count")
ax = plt.gca()
axins = ax.inset_axes([0.55, 0.55, 0.4, 0.4])
axins.pcolormesh(w, delay[:, 0], colors[embed])
x1, x2, y1, y2 = 41000, 44000, -10, 10
axins.set_xlim(x1, x2)
axins.set_ylim(y1, y2)
axins.set_xticklabels("")
axins.set_yticklabels("")
ax.indicate_inset_zoom(axins, edgecolor="black")
plt.savefig("events.png", dpi=800)
plt.show()
The two instrumentation techniques give different results. The prints make it look like it is the function call I noted above; the debugger says the segfault is elsewhere.
Here are a few "shower thoughts". Maybe any of these, or maybe something else:
full_args.in is a PyObject*, maybe it is being collected? But that doesn't make sense: the pointer would not be replaced, and I checked the refcount, the object is alive.
Something about how the compiler handles the struct (does marking it volatile, etc. help?).
The function is static, is it being inlined?
I don't have a Windows machine to try but let me add a couple of comments:
GC/thread: AFAIK, the PyPy GC does not run on its own thread. If the GC kicks in, it runs in the current thread.
The volatile thing is interesting, because it probably means that the memory where the full_args resides is not changed (else you would get the crash even with volatile), but that somehow the code generated by the compiler thinks that full_args.in is zeroed.
Tentative explanation: the compiler optimizes the code heavily and stores full_args.in in some register. At some point the GC kicks in, the register is cleared and it's never restored. This could happen because of a compiler bug (if it's in a caller-saved register which is not saved properly) or a pypy GC/JIT bug (if it's in a callee-saved register which is overwritten by mistake).
I tried inspecting registers with tracepoints on:
volatile ones:
{(void *)$RAX}, {(void *)$RCX}, {(void *)$RDX}, {(void *)$R8}, {(void *)$R9}, {(void *)$R10}, {(void *)$R11}
{*(void **)$RAX}, {*(void **)$RCX}, {*(void **)$RDX}, {*(void **)$R8}, {*(void **)$R9}, {*(void **)$R10}, {*(void **)$R11}
non-volatile ones:
{(void *)$RBX}, {(void *)$RBP}, {(void *)$RDI}, {(void *)$RSI}, {(void *)$RSP}, {(void *)$R12}, {(void *)$R13}, {(void *)$R14}, {(void *)$R15}
{*(void **)$RBX}, {(void *)$RBP}, {*(void **)$RDI}, {*(void **)$RSI}, {*(void **)$RSP}, {*(void **)$R12}, {*(void **)$R13}, {*(void **)$R14}, {*(void **)$R15}
On entry to PyUFunc_GenericFunctionInternal, full_args.in is definitely stored as *(void **)$R8, i.e. the R8 register contains a reference to full_args.in.
But: that is a volatile (caller-saved) register…
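As background, here is my reading of the Microsoft x64 calling convention (an assumption about why the struct shows up behind R8, not something verified in the disassembly here):

/* Microsoft x64 calling convention, as I understand it:
 *   - the first four integer/pointer arguments go in RCX, RDX, R8, R9;
 *   - an aggregate whose size is not 1, 2, 4 or 8 bytes is passed by
 *     reference to a caller-made copy.
 * For
 *   PyUFunc_GenericFunctionInternal(ufunc, op, full_args, type_tup, ...)
 * that puts ufunc in RCX, op in RDX, and the address of the 16-byte
 * full_args copy in R8, so *(void **)$R8 is full_args.in (offset 0).
 * RAX, RCX, RDX, R8, R9, R10 and R11 are all volatile (caller-saved),
 * so anything called in between is free to clobber them. */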
Try to make the struct ufunc_full_args bigger: maybe the compiler applies this optimization only if the sizeof is <=16 (a sketch of such an edit follows this list).
Try to put ufunc_generic_fastcall and/or PyUFunc_GenericFunctionInternal in their own separate C files. Maybe if they are in different compilation units the compiler does not apply this optimization.
Adding __forceinline to each function seems to "fix" things somewhat: adding it to PyUFunc_GenericFunctionInternal fixes the print there, adding it to _find_array_prepare fixes the segfault with subok=True.
Try to trigger the GC repeatedly. If our theory is correct, the following should eventually trigger the assert:
for (int i = 0; i < 1000000; i++) {
    assert(full_args.in != 0);
    PyRun_SimpleString("'a'*1000");  /* allocate, to give the GC a chance to run */
}
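Back to the first suggestion (making the struct bigger), a minimal sketch of that hypothetical local edit (not from the NumPy sources):

/* Hypothetical experiment: grow ufunc_full_args past 16 bytes to test
 * the theory that the optimizer only keeps the whole struct in
 * registers when sizeof(ufunc_full_args) <= 16. */
typedef struct {
    PyObject *in;   /* tuple of input operands */
    PyObject *out;  /* tuple of output operands, or NULL */
    void *_pad[2];  /* padding only; never read or written */
} ufunc_full_args;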
diff --git a/numpy/core/src/umath/ufunc_object.c b/numpy/core/src/umath/ufunc_object.c
index 0644a28c0..3f9702940 100644
--- a/numpy/core/src/umath/ufunc_object.c
+++ b/numpy/core/src/umath/ufunc_object.c
@@ -1263,8 +1263,8 @@ iterator_loop(PyUFuncObject *ufunc,
void *innerloopdata,
npy_uint32 *op_flags)
{
- npy_intp i, nin = ufunc->nin, nout = ufunc->nout;
- npy_intp nop = nin + nout;
+ int i, nin = ufunc->nin, nout = ufunc->nout;
+ int nop = nin + nout;
NpyIter *iter;
char *baseptrs[NPY_MAXARGS];
@@ -1766,7 +1766,7 @@ _parse_axes_arg(PyUFuncObject *ufunc, int op_core_num_dims[], PyObject *axes,
PyErr_SetString(PyExc_TypeError, "axes should be a list.");
return -1;
}
- list_size = PyList_Size(axes);
+ list_size = (int)PyList_Size(axes);
if (list_size != nop) {
if (list_size != nin || _has_output_coredims(ufunc)) {
PyErr_Format(PyExc_ValueError,
@@ -2616,7 +2616,7 @@ fail:
static int
-PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject **op,
+PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject *op[],
ufunc_full_args full_args, PyObject *type_tup, PyObject *extobj,
NPY_CASTING casting, NPY_ORDER order, npy_bool subok,
PyArrayObject *wheremask)
@@ -2626,8 +2626,10 @@ PyUFunc_GenericFunctionInternal(PyUFuncObject *ufunc, PyArrayObject **op,
const char *ufunc_name;
int retval = -1;
npy_uint32 op_flags[NPY_MAXARGS];
- npy_intp default_op_out_flags;
+ npy_uint32 default_op_out_flags;
+ fprintf(stdout, "%d %p ", __LINE__, full_args.in);
+ fflush(stdout);
PyArray_Descr *dtypes[NPY_MAXARGS];
/* These parameters come from extobj= or from a TLS global */
@@ -4206,7 +4208,7 @@ PyUFunc_GenericReduction(PyUFuncObject *ufunc,
}
}
else if (PyTuple_Check(axes_obj)) {
- naxes = PyTuple_Size(axes_obj);
+ naxes = (int)PyTuple_Size(axes_obj);
if (naxes < 0 || naxes > NPY_MAXDIMS) {
PyErr_SetString(PyExc_ValueError,
"too many values for 'axis'");
@@ -4767,7 +4769,6 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc,
if (kwnames != NULL) {
if (!ufunc->core_enabled) {
NPY_PREPARE_ARGPARSER;
-
if (npy_parse_arguments(ufunc->name, args + len_args, 0, kwnames,
"$out", NULL, &out_obj,
"$where", NULL, &where_obj,
@@ -4880,6 +4881,8 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc,
keepdims_obj, &keepdims) < 0) {
goto fail;
}
+ fprintf(stdout, "%d %p ", __LINE__, full_args.in);
+ fflush(stdout);
if (!ufunc->core_enabled) {
errval = PyUFunc_GenericFunctionInternal(ufunc, operands,
@@ -4892,6 +4895,8 @@ ufunc_generic_fastcall(PyUFuncObject *ufunc,
full_args, typetup, extobj, casting, order, subok,
axis_obj, axes_obj, keepdims);
}
+ fprintf(stdout, "%d ", __LINE__);
+ fflush(stdout);
if (errval < 0) {
goto fail;