# Refcounting crash in ANGLE 4 rebase
| | |
| --- | --- |
| Document owner | @ErichDonGubler |
| Last updated | 2022-04-19 |
| Status | **Resolved**. This document's contents are out-of-date. |
| Backlink to rebase report | [Link](https://hackmd.io/XxvU5HgHQVWw-kxKkKGA_A?both=#Refcounting-crash-around-ID3DDevice) |
Currently, Mozilla's rebase of ANGLE v4 (see [here](https://hackmd.io/XxvU5HgHQVWw-kxKkKGA_A) for version/commit details) is running into a crash while calling [`IUnknown::Release`](https://learn.microsoft.com/en-us/windows/win32/api/unknwn/nf-unknwn-iunknown-release) on a specific COM object. This crash can be reproduced straightforwardly in WebGL with relatively simple shaders and `gl` commands that use a [`video` element](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/video) as the source of a 2D texture (see the sample HTML source in the [repro steps](#Reproducing-the-crash) below). This issue appears to be due to imbalanced COM reference counting operations on an [`ID3D11Device`](https://learn.microsoft.com/en-us/windows/win32/api/d3d11/nn-d3d11-id3d11device) instance created and assigned to the `rx::Renderer11::mDevice` member in the `rx::Renderer11::callD3D11CreateDevice()` method, such that a double-free occurs. Therefore, one of the following must be true for the COM object pointed to by `rx::Renderer11::mDevice`:
* [`IUnknown::AddRef`](https://learn.microsoft.com/en-us/windows/win32/api/unknwn/nf-unknwn-iunknown-addref) is not being called enough.
* [`IUnknown::Release`](https://learn.microsoft.com/en-us/windows/win32/api/unknwn/nf-unknwn-iunknown-release) is being called too many times.
This imbalance is exposed by whichever of several execution paths happens to run _last_; the paths that we currently know of are:
* In `rx::Renderer11::release()`, where several dozen decrements of `mDevice` occur.
* In `mozilla::gfx::SharedSurface_ANGLEShareHandle::~SharedSurface_ANGLEShareHandle()`, while destroying the `IDXGIKeyedMutex` it extracts from ANGLE (viz., the one assigned to the `mKeyedMutex` member).
It is currently unclear whether this defective behavior is rooted in new ANGLE source, an older issue in Firefox code that is only now being exposed, or something else.
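To make the failure mode concrete, here is a toy model of COM-style reference counting. It is only a sketch: `FakeComObject` and `run_scenario` are hypothetical names invented for illustration, not ANGLE, Firefox, or D3D11 code, and a native double-free corrupts memory rather than raising a tidy exception.

```python
class FakeComObject:
    """Toy stand-in for a COM object (hypothetical; not real D3D11 code)."""

    def __init__(self):
        # A freshly created COM object is handed out with one reference held.
        self.refcount = 1
        self.destroyed = False

    def add_ref(self):
        if self.destroyed:
            raise RuntimeError("AddRef on a destroyed object")
        self.refcount += 1
        return self.refcount

    def release(self):
        if self.destroyed:
            # Native code would touch freed memory here instead of raising.
            raise RuntimeError("Release on a destroyed object")
        self.refcount -= 1
        if self.refcount == 0:
            self.destroyed = True
        return self.refcount


def run_scenario(extra_releases):
    """Balanced AddRef/Release calls, followed by `extra_releases` extras."""
    device = FakeComObject()  # creation: refcount is 1
    device.add_ref()          # a second owner takes a reference: 2
    device.release()          # second owner is done: 1
    device.release()          # first owner is done: 0, object destroyed
    for _ in range(extra_releases):
        device.release()      # unbalanced: whoever runs last hits the error
    return device
```

With `run_scenario(0)` the object is destroyed exactly once. With `run_scenario(1)` the final `release()` raises, which mirrors how the crash surfaces in whichever code path performs the last decrement, regardless of where the missing `AddRef` or surplus `Release` actually lives.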
TODO: add source links
## Reproducing the crash
### Dependencies
You will need:
1. A Windows machine with a recent version of the Windows OS. The remaining dependencies here are assumed to be installed on this device.
1. A [working development environment for Firefox](https://firefox-source-docs.mozilla.org/setup/windows_build.html).
1. A 2022 edition of the MSVC toolchain (most of which should already be installed by the previous step), plus some configuration steps.
:::spoiler
1. Install the Visual Studio editor as part of this.
1. You will need the [Child Process Debugging Power Tool](https://marketplace.visualstudio.com/items?itemName=vsdbgplat.MicrosoftChildProcessDebuggingPowerTool) in order to automatically attach to `firefox.exe`'s child processes.
1. You will need the proper version of the Windows 10 SDK. ATOW, that's version 10.0.20348.0.
1. The Windows SDK itself can be installed from [this archive page](https://developer.microsoft.com/en-us/windows/downloads/sdk-archive/).
1. The Windows SDK component for your Visual Studio installation will also need to have a matching version installed. The easiest way to do this is to use the `Visual Studio Installer` application to `Modify` your installation's `Individual components`, like in these screenshots:
:::spoiler Screenshots
![](https://hackmd.io/_uploads/BJwPqIEqo.png)
![](https://hackmd.io/_uploads/ByKu9LV5j.png)
:::
:::
1. A checkout of Firefox with the ANGLE rebase implemented. For reference, Erich most recently reproduced this issue with the revision [`631c67e3`](https://hg.mozilla.org/try/rev/631c67e36dcb1c59f903878e37ec10e45e6c649d), as observable from CI runs in Mozilla's [Treeherder](https://treeherder.mozilla.org/jobs?repo=try&revision=8e078b45b35b4f36785a8970e22805319da77ee4).
1. A static file server to serve the reproduction files with. @ErichDonGubler used [`sfz`](https://crates.io/crates/sfz) with no arguments in a folder with [these files (AKA `repro.zip`)](https://github.com/mozilla/angle/files/10352343/repro.zip).
### Reproduction steps
The basic flow for reproduction is:
1. Change working directory to your Firefox checkout (affectionately referred to as `$GECKO_CHECKOUT` going forward).
1. Run `./mach build` with [optimizations disabled](https://firefox-source-docs.mozilla.org/setup/configuring_build_options.html#optimization) and [debug symbols enabled](https://firefox-source-docs.mozilla.org/contributing/debugging/debugging_on_windows.html#debugging-optimized-builds); for convenience, you can just use the following `mozconfig` file at the root of your checkout of Firefox:
```
ac_add_options --disable-optimize
ac_add_options --enable-debug
```
1. Run `./mach run --debug`. This will generate and open a new Visual Studio solution.
1. Configure the solution's debugging to automatically attach to child processes.
:::spoiler
1. Navigate to the app menu strip > `Debug` > `Other Debug Targets` > `Child Process Debugging Settings...`
![](https://hackmd.io/_uploads/HkhpnSNqo.png)
1. Tick the `Enable child process debugging` checkbox, and save it (i.e., <kbd>Ctrl</kbd> + <kbd>S</kbd>).
![](https://hackmd.io/_uploads/By0C3rE9i.png)
:::
1. Configure the invocation of `firefox.exe` to open your `repro.zip`'s `index.html` page automatically.
:::spoiler
1. Extract the `repro.zip` archive from earlier into a directory somewhere. From this point, we'll call that "the `repro` directory".
1. Start serving files from your `repro` directory using a local HTTP server. As mentioned above, @ErichDonGubler used a naive invocation of `sfz`, which serves on port 5000 by default:
```
# With your `repro` directory as the CWD:
$ sfz
Files served on http://127.0.0.1:5000
```
3. Open the solution's `Properties` from the `Solution Explorer` view, which by default is on the left side.
![](https://hackmd.io/_uploads/H1TERrN9s.png)
1. Add `-new-tab` arguments that point to where your local file server will be serving up the files from `repro.zip`. Continuing the above example, you can use `-new-tab http://127.0.0.1:5000/index.html`, as seen in this screenshot:
![](https://hackmd.io/_uploads/HyTbkUVcj.png)
:::
All you need to do now is `Debug` in Visual Studio (i.e., use the <kbd>F5</kbd> shortcut), and the problem should reproduce within twenty seconds or so. If this does not happen, it's probably because the `index.html` tab didn't load all the way; refresh the page, and it should happen quickly, like in the following screenshot:
![](https://hackmd.io/_uploads/SyuZWPEqo.png)
:::info
ℹ️ N.B.: you will encounter a debug break from what seems to be an assert in Microsoft code when the refcount of `mDevice` goes to `0`. This is expected (and probably related to Microsoft's code detecting that something is off :sweat_drops:), but is _not_ the crash that we're debugging. Example screenshot:
![](https://hackmd.io/_uploads/BJFg-vV5o.png)
:::
FIXME: s/751/756 in screenshots of breakpoint lines
### Watching the refcount change in Visual Studio
1. Set a breakpoint in Visual Studio at `$GECKO_CHECKOUT/gfx/angle/checkout/src/libANGLE/renderer/d3d/d3d11/Renderer11.cpp:756`, immediately after where the `createDevice` function pointer arg is called within `Renderer11::callD3D11CreateDevice`.
Using the address that `mDevice` gets set to, you will be able to get the address of the reference count itself.
2. You can then watch the refcount in the `Watch` view:
1. Click on the _`Add item to watch`_ row and enter the following expression:
```
(unsigned __int64*)(*(unsigned __int64*)($ADDRESS_OF_COM_PTR + 0x130) + 8)
```
...where `$ADDRESS_OF_COM_PTR` is the address stored in the `mDevice` variable from the previous step.
4. You can now set a memory breakpoint on the above `Watch` expression by right-clicking on it, and clicking the `Break When Value Changes` item.
![](https://hackmd.io/_uploads/rysIhL4qo.png)
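The pointer arithmetic in the `Watch` expression above can be easier to follow when replayed against a mock memory layout. The sketch below uses Python's `ctypes` for that. The offsets `0x130` and `8` are copied from the expression; everything else (`make_mock_device`, `refcount_address`, and the buffer sizes) is hypothetical, and nothing here says what the real D3D11 object keeps at those offsets.

```python
import ctypes

OUTER_OFFSET = 0x130  # offset of the inner pointer, from the Watch expression
INNER_OFFSET = 8      # offset of the refcount inside the inner block


def make_mock_device(refcount):
    """Build two buffers laid out the way the Watch expression assumes."""
    outer = ctypes.create_string_buffer(OUTER_OFFSET + 16)
    inner = ctypes.create_string_buffer(INNER_OFFSET + 8)
    # Store the refcount inside the inner block.
    ctypes.c_uint64.from_buffer(inner, INNER_OFFSET).value = refcount
    # Store the inner block's address inside the outer object.
    ctypes.c_uint64.from_buffer(outer, OUTER_OFFSET).value = ctypes.addressof(inner)
    # Return both buffers so the caller keeps them alive.
    return outer, inner


def refcount_address(com_ptr):
    """Mirror (unsigned __int64*)(*(unsigned __int64*)(com_ptr + 0x130) + 8)."""
    inner_addr = ctypes.c_uint64.from_address(com_ptr + OUTER_OFFSET).value
    return inner_addr + INNER_OFFSET
```

Calling `refcount_address(ctypes.addressof(outer))` on a mock built with `make_mock_device(3)` yields an address whose 64-bit contents are `3`. That address is the one the data breakpoint in the next step ends up watching in the real debugging session.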
### Generating a log of refcounting operations
Building on the previous section, @ErichDonGubler has generated reports of the callstack for each refcounting operation via the `Output` window in Visual Studio. This is done by adding `Action`s to the settings of the breakpoints created in the previous section. Concretely:
1. Set the following messages on the breakpoints created in the previous section:
:::warning
:warning: This content has been outmoded by the live collaboration on GitHub below.
:::
:::spoiler Outdated
* For the `Renderer11::callD3D11CreateDevice` breakpoint, @ErichDonGubler has been using:
```
--!! Starting to track refs `mDevice` at {mDevice}:$CALLSTACK
```
...and yes, `\n` does not get translated to a newline, but it's workable in the scripting he is doing on top of this to balance the refcount operations tree he's building.
* For the data breakpoint on `mDevice`'s inner refcount, @ErichDonGubler has been using:
```
--!! Ref count for device was modified:$CALLSTACK
```
:::
1. Run the debugger (viz., <kbd>F5</kbd>) once, and wait until the crash reproduces.
:::warning
Great news! @ErichDonGubler has started writing a tool to consume these logs with Nical on [GitHub](https://github.com/nical/angle-refcnt-dbg/).
:::
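For a rough idea of what post-processing such an `Output` log involves, here is a minimal sketch that splits the text into one record per breakpoint hit, keyed on the `--!! ` prefix used in the messages above. It is not the tool linked above. The sample frames (`frame_a` and so on), the sample address, and the assumption that call stack lines follow the message on separate lines are all invented for illustration; real `$CALLSTACK` output may be laid out differently.

```python
MARKER = "--!! "  # prefix used by the breakpoint messages in this note

# Invented sample input; real Visual Studio output will differ.
SAMPLE_LOG = """\
--!! Starting to track refs `mDevice` at 0x0000000012345678:
    frame_a
    frame_b
--!! Ref count for device was modified:
    frame_c
"""


def split_records(log_text):
    """Group log lines into records, one per line that starts with MARKER."""
    records = []
    for line in log_text.splitlines():
        if line.startswith(MARKER):
            # A new breakpoint hit: start a fresh record.
            records.append({"message": line[len(MARKER):], "frames": []})
        elif records and line.strip():
            # Any other non-empty line belongs to the most recent record.
            records[-1]["frames"].append(line.strip())
    return records
```

Running `split_records(SAMPLE_LOG)` gives two records, each carrying its message and the stack lines that followed it. Pairing increments with decrements would then be a separate pass over those records.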