Tracking down a segfault
I am trying to get the numpy-feedstock from conda-forge to work on PyPy + win64. There is a segfault that only appears

- on hardware that has an AVX512 CPU (correction: the crash also happens without this)
- with the `/GL` compiler flag and the corresponding `/LTCG` linker flag for whole program optimization (the default in Python3)

Of course this is the particular setup that I found reproduces consistently. It may be possible to trigger the segfault in other ways.
The segfault
There is a struct `ufunc_full_args`, defined as `{PyObject*, PyObject*}`, passed to a function with the declaration

Before the single call to the function, `full_args` is valid. Occasionally, once inside the function, it is invalid (zeroed out), at least according to `fprintf` statements sprinkled in the code to print the struct together with the `__LINE__` of code.

Reproducing
Here is how to reproduce
Set up
Get a conda environment and activate it
This is a maximal list of packages in order to run the whole NumPy test suite. Later on I was able to cut down the reproducer to a simpler `py` file, so simplification may be possible.

Get the numpy source code, build NumPy to work in-place
The minimum reproducer (so far)
Put this code in a file `test_loop.py` in the numpy root directory so it can run without needing to change `PYTHONPATH`. It is taken from the NumPy tests and wrapped to loop 100,000 times

and run it: `python test_loop.py`. It should crash after the `print(14, i)` and before the `print(16)`, never reaching `done`.
Instrumentation
Here are two ways to catch the exact place the segfault happens
Via the Visual Studio debugger
<rant> There seems to be no easy way to build a c-extension module via `distutils` and set the compiler to "add debug info" on Windows. On Linux I can simply export `CFLAGS="-O0 -g"` </rant>.

- Modify the file `<conda-root>/envs/pypy3.7/lib-python/3/distutils/_msvccompiler.py` to add debug info:
  - the `/Zi` flag
  - the `/DEBUG` flag
- Put `import pdb;pdb.set_trace()` in the test file
- At the `pdb` prompt, attach to the python process with the Visual Studio debugger. Then continue the test.

Now you should hit an exception. But what is causing it? Hard to tell.
Via print statements
I ended up using print statements because I wanted to pinpoint the change.
- Apply the patch (with `patch`, I think it is `patch -i <path-to-file>`, there may be a way to get `git` to do it)
- Rebuild with `python setup.py build_ext -i`
- Run the test; a printed value of `00000000` is problematic

Things I tried
Changing compilation flags
Any change in optimization like `/Od` or removing the `/GL` and `/LTCG` global optimization flags cleared the segfault.

Changing the function signature to use `ufunc_full_args *full_args` instead cleared the segfault.

Running with `--jit off` cleared the segfault.

Changing the reproducing code tended to clear the segfault. The reproducer must call the `+=` ufunc with a view on the left-hand side, which is weirder still since, as far as I can tell, the segfault happens before entering the ufunc loop.

Getting a debug breakpoint
I could not figure out how to set a watchpoint on the value that is getting zeroed out since the code is heavily optimized. In each invocation of the outside function (`ufunc_generic_fastcall`), the `full_args` struct has a different address. The variable itself is optimized out of `ufunc_generic_fastcall`.

Things @seberg tried
- … `tp_call` is at fault, Python >3.7 would be using `tp_vectorcall` so a subtle bug in the `tp_call` code might go unnoticed. (Checking python debug/valgrind, with `tp_vectorcall` manually removed)
- … `pypy` on linux…
- … `gc.collect()` at the end of the loop:
- … `ufunc_full_args full_args = {NULL, NULL};` to `volatile ufunc_full_args full_args = {NULL, NULL};`
- … `np.add(u, v, u, subok=False)` (equivalent without `subok`) prints out that `full_args.in` gets NULL (as an `if (full_args.in == NULL)` in `PyUFunc_GenericFunctionInternal`), but does not crash. Even if I change it to `Py_DECREF(full_args.in)` from `Py_XDECREF(full_args.in)` (maybe the compiler knows to optimize it out?)
- … `NULL` when `PyUFunc_GenericFunctionInternal` is called…
- … `gc.hooks.on_gc_minor` and `gc.hooks.on_gc_collect` … see below…
- … `array_might_be_written` easily? Or even `PyErr_WarnEx`?
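
For reference, the `gc.hooks.on_gc_minor` and `gc.hooks.on_gc_collect` hooks can be wired up to emit a marker into the same output stream as the test loop. This is only a minimal sketch, not the script actually used here: the helper name `install_gc_markers` and the marker characters `"m"` and `"G"` are assumptions made for illustration. As far as I understand the PyPy documentation, the hooks are delivered at a safe point shortly after the GC event rather than during it, so a marker may land slightly later than the collection it reports.

```python
import gc
import sys


def install_gc_markers(stream=sys.stdout):
    """Write a marker per GC event; return False where gc.hooks is absent."""
    hooks = getattr(gc, "hooks", None)  # PyPy-only attribute, missing on CPython
    if hooks is None:
        return False

    def on_minor(stats):
        stream.write("m")  # hypothetical marker for a minor collection

    def on_collect(stats):
        stream.write("G")  # hypothetical marker for a major collection

    hooks.on_gc_minor = on_minor
    hooks.on_gc_collect = on_collect
    return True
```

Calling `install_gc_markers()` once before the reproducer loop would interleave GC markers with whatever the loop itself prints.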
?OK, The final proof(?) that this is GC related:
(Note about the plot: The failures do not start immediately, they are only this regular after they started first?)
In that plot dark means a failure (run with `subok=False`, printing out `"X"` if a `NULL` happened, else printing out `"."`). Basically, whenever NO failure occurred, there is white space; whenever a failure occurred, there is purple. All failures are aligned at 0, and before/after events are plotted along the y-axis. The failures occur about every 50 events here, so the line repeats (approximately).

(Also note that the loop contains `u + v`, which also adds a blank space "success event".)

The green colors are GC related runs.
Result: Whenever a failure is printed, a GC run happened just before or sometimes after. If the gc run is indeed threaded, this is probably expected to fluctuate. But, it never happens that there are many calls in between.
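
The same check can be done numerically instead of by eye. The sketch below is not the analysis script used for the plot; it assumes the run was captured as one string in which `"X"` is a failure, `"."` is a success, and the hypothetical characters `"m"` and `"G"` mark GC events. For each failure it reports the signed offset to the closest GC marker (negative means the GC marker came first).

```python
def nearest_gc_distance(events, failure="X", gc_marks="mG"):
    """Return, per failure, the signed offset to the closest GC marker.

    A failure gets None when the log contains no GC marker at all.
    """
    gc_positions = [i for i, ch in enumerate(events) if ch in gc_marks]
    distances = []
    for i, ch in enumerate(events):
        if ch != failure:
            continue
        if not gc_positions:
            distances.append(None)
            continue
        # Ties resolve to the earlier marker because min() keeps the first hit.
        closest = min(gc_positions, key=lambda p: abs(p - i))
        distances.append(closest - i)
    return distances


log = "....GX......X.m...."  # made-up sample, not real output
print(nearest_gc_distance(log))  # prints [-1, 2]
```

If the GC theory holds, all offsets should stay small; a failure far from any GC marker would count against it.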
Adapted script, plotting script, and diff:
The script to generate is:
The diff to print in the ufunc code (sorry, not minimal, but retaining anyway):
The script to analyze:
What is next?
The two instrumentation techniques give different results. The prints make it look like the problem is in the function call I noted above; the debugger says the segfault is elsewhere.
Here are a few "shower thoughts". Maybe any of these, or maybe something else:
- The PyPy garbage collector comes to mind. (Silly me, the GC runs in the main thread.) The `full_args.in` pointer that is being zeroed out is a `PyObject*`, maybe it is being collected? But that doesn't make sense, the pointer would not be replaced, and I checked the `refcount`, the object is alive.
- … `volatile`, etc. helps?).
- … `static`, is it being inlined? )

@antocuni notes
I don't have a Windows machine to try but let me add a couple of comments:
GC/thread: AFAIK, the PyPy GC does not run on its own thread. If the GC kicks in, it runs in the current thread.
The `volatile` thing is interesting, because it probably means that the memory where the `full_args` resides is not changed (else you would get the crash even with volatile), but that somehow the code generated by the compiler thinks that `full_args.in` is zeroed.

Tentative explanation: the compiler optimizes the code heavily and stores `full_args.in` in some register. At some point the GC kicks in, the register is cleared and it's never restored. This could happen because of a compiler bug (if it's in a caller-saved register which is not saved properly) or a PyPy GC/JIT bug (if it's in a callee-saved register which is overwritten by mistake).

@seberg, register inspection:
I tried inspecting registers with tracing points on:

On entry to `PyUFunc_GenericFunctionInternal`, `full_args.in` is definitely stored as `*(void **)$R8`, i.e. the R8 register contains a reference to `full_args.in`.

But: That is a volatile (caller-saved) register…
Things worth trying

- Try to make the struct `ufunc_full_args` bigger: maybe the compiler applies this optimization only if the sizeof is <=16.
- Try to put `ufunc_generic_fastcall` and/or `PyUFunc_GenericFunctionInternal` in their own separate C files. Maybe if they are in different compilation units the compiler does not apply this optimization.
  - Adding `__forceinline` to each function seems to "fix" things somewhat: adding it to `PyUFunc_GenericFunctionInternal` fixes the print there, adding it to `_find_array_prepare` fixes the segfault with `subok=True`.
- Try to trigger the GC repeatedly. If our theory is correct, the following should eventually trigger the assert:
Patch for printing