I am trying to get the numpy-feedstock from conda-forge to work on PyPy + win64. There is a segfault that only appears with the `/GL` compiler flag and the corresponding `/LTCG` linker flag for whole-program optimization (the default in Python 3). Of course this is the particular setup that I found reproduces consistently; it may be possible to trigger the segfault in other ways.
There is a struct `ufunc_full_args`, defined as `{PyObject*, PyObject*}`, that is passed by value to a function. Before the single call to the function, `full_args` is valid. Occasionally, once inside the function, it is invalid (zeroed out), at least according to `fprintf` statements sprinkled in the code to print the struct together with the `__LINE__` of code.
Here is how to reproduce:

1. Get a conda environment and activate it. This is a maximal list of packages in order to run the whole NumPy test suite; later on I was able to cut the reproducer down to a simpler py file, so simplification may be possible.
2. Get the numpy source code and build NumPy to work in-place.
3. Put this code in a file `test_loop.py` in the numpy root directory, so it can run without needing to change `PYTHONPATH`. It is taken from the NumPy tests and wrapped to loop 100,000 times. Run it with `python test_loop.py`. It should crash after the `print(14, i)` and before the `print(16)`, never reaching `done`.
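The test file itself is not reproduced here; a hypothetical sketch of its shape (the real code is lifted from the NumPy test suite) combines a long loop, print markers, and a `+=` ufunc call with a view on the left-hand side:

```python
# Hypothetical sketch of test_loop.py, NOT the actual NumPy test code.
# Ingredients from the issue: a 100,000-iteration loop, markers before
# and after the suspect call, and a += ufunc with a view on the LHS.
import numpy as np

def loop(n):
    for i in range(n):
        u = np.zeros(8)
        v = np.ones(8)
        print(14, i)
        u[:4] += v[:4]   # += ufunc with a view on the left-hand side
        print(16)
    print("done")

loop(100_000)  # on PyPy + win64 with /GL this crashed between the markers
```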
Here are two ways to catch the exact place the segfault happens.

<rant>There seems to be no easy way to build a c-extension module via `distutils` and set the compiler to "add debug info" on windows. On Linux I can simply `export CFLAGS="-O0 -g"`.</rant>
Modify the file `<conda-root>/envs/pypy3.7/lib-python/3/distutils/_msvccompiler.py` to add debug info: the `/Zi` compiler flag and the `/DEBUG` linker flag.
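A hypothetical sketch of that edit (the exact option lists differ between Python versions, so treat the context lines as approximate):

```diff
--- a/lib-python/3/distutils/_msvccompiler.py
+++ b/lib-python/3/distutils/_msvccompiler.py
@@ def initialize(self, ...): @@
-        self.compile_options = ['/nologo', '/Ox', '/W3', '/GL', '/DNDEBUG']
+        self.compile_options = ['/nologo', '/Ox', '/W3', '/GL', '/DNDEBUG', '/Zi']
@@
-        ldflags = ['/nologo', '/INCREMENTAL:NO', '/LTCG']
+        ldflags = ['/nologo', '/INCREMENTAL:NO', '/LTCG', '/DEBUG']
```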
Alternatively: put an `import pdb; pdb.set_trace()` in the test file. At the `pdb` prompt, attach to the python process with the Visual Studio debugger, then continue the test. Now you should hit an exception. But what is causing it? Hard to tell.
[seberg] The `full_args.in` value says `Error reading register value.`?
I ended up using print statements because I wanted to pinpoint the change.
Apply the instrumentation patch (I used `patch`, I think it is `patch -i <path-to-file>`; there may be a way to get `git` to do it) and rebuild with `python setup.py build_ext -i`. In the printed output, a value of `00000000` is problematic.

- Any change in optimization, like `/Od` or removing the `/GL` and `/LTCG` global optimization flags, cleared the segfault.
- Changing the function signature to use `ufunc_full_args *full_args` instead cleared the segfault.
- Running with `--jit off` cleared the segfault.
- Changing the reproducing code tended to clear the segfault. The reproducer must call the `+=` ufunc with a view on the left-hand side. Which is weirder still since, as far as I can tell, the segfault happens before entering the ufunc loop.
I could not figure out how to set a watchpoint on the value that is getting zeroed out, since the code is heavily optimized. In each invocation of the outside function (`ufunc_generic_fastcall`), the `full_args` struct has a different address, and the variable itself is optimized out of `ufunc_generic_fastcall`.
If `tp_call` is at fault: Python > 3.7 would be using `tp_vectorcall`, so a subtle bug in the `tp_call` code might go unnoticed. (Checking a python debug/valgrind build, with `tp_vectorcall` manually removed.)
`pypy` on linux… `gc.collect()` at the end of the loop:
Changing `ufunc_full_args full_args = {NULL, NULL};` to `volatile ufunc_full_args full_args = {NULL, NULL};`
`np.add(u, v, u, subok=False)` (equivalent without `subok`) prints out that `full_args.in` gets `NULL` (via an `if (full_args.in == NULL)` check in `PyUFunc_GenericFunctionInternal`), but does not crash. Even if I change the `Py_XDECREF(full_args.in)` to `Py_DECREF(full_args.in)` (maybe the compiler knows to optimize it out?).
`full_args.in` is `NULL` when `PyUFunc_GenericFunctionInternal` is called… `gc.hooks.on_gc_minor` and `gc.hooks.on_gc_collect`… see below… Can we see `array_might_be_written` easily? Or even `PyErr_WarnEx`? OK, the final proof(?) that this is GC related:
(Note about the plot: the failures do not start immediately; they are only this regular after they first started?)

In that plot, dark means a failure (run with `subok=False`, printing out `"X"` if a `NULL` happened, else printing out `"."`). Basically, whenever NO failure occurred there is white space; whenever a failure occurred there is purple. All failures are aligned at 0, and before/after events are plotted along the y-axis. The failures occur about every 50 events here, so the line repeats (approximately).

(Also note that the loop contains `u + v`, which also adds a blank-space "success event".)

The green colors are GC-related runs.

Result: whenever a failure is printed, a GC run happened just before, or sometimes after. If the GC run is indeed threaded, this is probably expected to fluctuate. But it never happens that there are many calls in between.
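For reference, correlating events with GC runs on PyPy uses the `gc.hooks` API mentioned above; a minimal sketch (hook names are the real PyPy ones, the rest is illustrative):

```python
# Sketch of PyPy-only GC instrumentation: record every minor/major
# collection so failures can be lined up against GC events.
# gc.hooks exists on PyPy but not on CPython, hence the hasattr guard.
import gc

events = []

def on_minor(stats):
    events.append("minor")

def on_collect(stats):
    events.append("collect")

if hasattr(gc, "hooks"):   # PyPy only
    gc.hooks.on_gc_minor = on_minor
    gc.hooks.on_gc_collect = on_collect
```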
The script to generate it is:

The diff to print in the ufunc code (sorry, not minimal, but retaining it anyway):

The script to analyze:

The two instrumentation techniques give different results. The prints make it look like the function call I noted above; the debugger says the segfault is elsewhere.
Here are a few "shower thoughts". Maybe any of these, or maybe something else:

- `full_args.in` is a `PyObject*`; maybe it is being collected? But that doesn't make sense: the pointer would not be replaced, and I checked the refcount, the object is alive.
- Maybe it is a compiler problem (does `volatile`, etc., help?).
- The function is `static`; is it being inlined?

I don't have a Windows machine to try but let me add a couple of comments:
- GC/thread: AFAIK, the PyPy GC does not run on its own thread. If the GC kicks in, it runs in the current thread.
- The `volatile` thing is interesting, because it probably means that the memory where the `full_args` resides is not changed (else you would get the crash even with `volatile`), but that somehow the code generated by the compiler thinks that `full_args.in` is zeroed.
- Tentative explanation: the compiler optimizes the code heavily and stores `full_args.in` in some register. At some point the GC kicks in, the register is cleared and it's never restored. This could happen because of a compiler bug (if it's in a caller-saved register which is not saved properly) or a pypy GC/JIT bug (if it's in a callee-saved register which is overwritten by mistake).
I tried inspecting registers with tracepoints on:

On entry to `PyUFunc_GenericFunctionInternal`, `full_args.in` is definitely stored as `*(void **)$R8`, i.e. the R8 register contains a reference to `full_args.in`.

But: that is a volatile (caller-saved) register…
Try to make the struct `ufunc_full_args` bigger: maybe the compiler applies this optimization only if the sizeof is <= 16.
Try to put `ufunc_generic_fastcall` and/or `PyUFunc_GenericFunctionInternal` in their own separate C files. Maybe if they are in different compilation units the compiler does not apply this optimization.
Adding `__forceinline` to each function seems to "fix" things somewhat: adding it to `PyUFunc_GenericFunctionInternal` fixes the print there; adding it to `_find_array_prepare` fixes the segfault with `subok=True`.

Try to trigger the GC repeatedly. If our theory is correct, the following should eventually trigger the assert: