
Refcounting crash in ANGLE 4 rebase

Document owner @ErichDonGubler
Last updated 2022-04-19
Status Resolved. This document's contents are out-of-date.
Backlink to rebase report Link

Currently, Mozilla's rebase of ANGLE v4 (see here for version/commit details) is running into a crash while calling IUnknown::Release on a specific COM object. This crash can be straightforwardly reproduced in WebGL with relatively simple shaders and GL commands that use a video element as the source of a 2D texture (see the sample HTML source in the repro steps below). This issue appears to be due to imbalanced COM reference counting operations on an ID3D11Device instance created and assigned to the Renderer11::mDevice member in the rx::Renderer11::callD3D11CreateDevice() method, such that a double-free occurs. Therefore, one of the following must be true for the COM object pointed to by rx::Renderer11::mDevice:

This imbalance is exposed in one of several execution paths when they are the last path to run; the paths that we currently know of are:

  • In rx::Renderer11::release(), where several dozen decrements of mDevice's reference count occur.
  • In mozilla::gfx::SharedSurface_ANGLEShareHandle::~SharedSurface_ANGLEShareHandle(), while destroying the IDXGIKeyedMutex it extracts from ANGLE (viz., the one assigned to the mKeyedMutex member).

It is currently unclear whether this defective behavior is rooted in new ANGLE source, an older issue in Firefox code that is only now being exposed, or something else.
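To make the failure mode concrete, here is a minimal sketch of what an AddRef/Release imbalance looks like. `FakeComObject` and `countReleasesAfterDestroy` are hypothetical names invented for this illustration; this is not ANGLE, Firefox, or D3D11 code, and the object deliberately records the bad call instead of freeing itself so that the sketch stays safe to run.

```cpp
#include <cassert>

// Hypothetical stand-in for a COM object. This is NOT ID3D11Device; it only
// mimics the AddRef/Release counting contract.
class FakeComObject {
public:
    unsigned long AddRef() { return ++mRefCount; }

    unsigned long Release() {
        if (mDestroyed) {
            // A real COM object has already run `delete this` by now, so this
            // call would touch freed memory (the kind of crash being chased).
            ++mReleasesAfterDestroy;
            return 0;
        }
        if (--mRefCount == 0) {
            mDestroyed = true;  // real code would `delete this` here
        }
        return mRefCount;
    }

    int releasesAfterDestroy() const { return mReleasesAfterDestroy; }

private:
    // Like a freshly created COM object, it starts with one reference that
    // belongs to whoever created it.
    unsigned long mRefCount = 1;
    bool mDestroyed = false;
    int mReleasesAfterDestroy = 0;
};

// Replays `addRefs` AddRef calls followed by `releases` Release calls and
// reports how many Release calls landed on an already-destroyed object.
int countReleasesAfterDestroy(int addRefs, int releases) {
    assert(addRefs >= 0 && releases >= 0);
    FakeComObject obj;
    for (int i = 0; i < addRefs; ++i) obj.AddRef();
    for (int i = 0; i < releases; ++i) obj.Release();
    return obj.releasesAfterDestroy();
}
```

Called from a `main`, `countReleasesAfterDestroy(2, 3)` is balanced (one creation reference plus two AddRef calls, three Release calls), while `countReleasesAfterDestroy(2, 4)` has one Release too many. Whichever code path happens to issue that last Release is the one that appears to crash, which fits the observation that more than one path exposes the bug.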

TODO: add source links

Reproducing the crash

Dependencies

You will need:

  1. A Windows machine with a recent version of the Windows OS. The remaining dependencies here are assumed to be installed on this device.
  2. A working development environment for Firefox.
  3. A 2022 edition of the MSVC toolchain (which should be mostly completed by the previous step), plus some configuration steps.
    1. Install the Visual Studio editor as part of this.
      1. You will need the Child Process Debugging Power Tool in order to automatically attach to firefox.exe's child processes.
    2. You will need the proper version of the Windows 10 SDK. ATOW, that's version 10.0.20348.0.
      1. The Windows SDK itself can be installed from this archive page.

      2. The Windows SDK component for your Visual Studio installation will also need to have a matching version installed. The easiest way to do this is to use the Visual Studio Installer application to Modify your installation's Individual components, like in these screenshots:

        Screenshots


  4. A checkout of Firefox with the ANGLE rebase implemented. For reference, Erich most recently reproduced this issue with the revision 631c67e3, as observable from CI runs in Mozilla's Treeherder.
  5. A static file server to serve the reproduction files with. @ErichDonGubler used sfz with no arguments in a folder with these files (AKA repro.zip).

Reproduction steps

The basic flow for reproduction is:

  1. Change working directory to your Firefox checkout (affectionately referred to as $GECKO_CHECKOUT going forward).

  2. Run ./mach build with optimizations disabled and debug symbols enabled; for convenience, you can just use the following mozconfig file at the root of your checkout of Firefox:

    ac_add_options --disable-optimize
    ac_add_options --enable-debug
    
  3. Run ./mach run --debug. This will generate and open a new Visual Studio solution.

  4. Configure the solution's debugging to automatically attach to child processes.

    1. Navigate to the app menu strip > Debug > Other Debug Targets > Child Process Debugging Settings...

    2. Tick the Enable child process debugging checkbox, and save the settings (i.e., Ctrl + S).

  5. Configure the invocation of firefox.exe to open your repro.zip's index.html page automatically.

    1. Extract the repro.zip archive from earlier into a directory somewhere. From this point, we'll call that "the repro directory".

    2. Start serving files from your repro directory using a local HTTP server. As mentioned above, @ErichDonGubler used a naive invocation of sfz, which serves on port 5000 by default:

      # With your `repro` directory as the CWD:
      $ sfz
      Files served on http://127.0.0.1:5000
      
    3. Open the solution's Properties from the Solution Explorer view, which by default is on the left side.

    4. Add -new-tab arguments that point to where your local file server will be serving up the files from repro.zip. Continuing the above example, you can use -new-tab http://127.0.0.1:5000/index.html, as seen in this screenshot:

All you need to do now is Debug in Visual Studio (i.e., use the F5 shortcut), and the problem should reproduce within about twenty seconds. If this does not happen, it's probably because the index.html tab didn't load all the way; refresh the page, and it should happen quickly, like in the following screenshot:

[Screenshot unavailable]

ℹ️ N.B.: you will encounter a debug break from what seems to be an assert in Microsoft code when the refcount of mDevice goes to 0. This is expected (and probably related to Microsoft's code detecting that something is off :sweat_drops:), but it is not the crash that we're debugging. Example screenshot:

FIXME: s/751/756 in screenshots of breakpoint lines

Watching the refcount change in Visual Studio

  1. Set a breakpoint in Visual Studio at $GECKO_CHECKOUT/gfx/angle/checkout/src/libANGLE/renderer/d3d/d3d11/Renderer11.cpp:756, immediately after where the createDevice function pointer arg is called within Renderer11::callD3D11CreateDevice.

    Using the address that mDevice gets set to, you will be able to get the address of the reference count itself.

  2. You can create a data breakpoint for the refcount from the Watch view:

    1. Click on the Add item to watch row and enter the following expression:
    (unsigned __int64*)(*(unsigned __int64*)($ADDRESS_OF_COM_PTR + 0x130) + 8)
    

    …where $ADDRESS_OF_COM_PTR is the address stored in the mDevice variable from the previous step.

  3. You can now set a memory breakpoint on the above Watch expression by right-clicking on it, and clicking the Break When Value Changes item.

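For readers who find the Watch expression opaque, the following sketch spells out the pointer arithmetic it performs: read a 64-bit pointer stored 0x130 bytes into the object, add 8 bytes, and treat the result as the address of a 64-bit counter. The offsets 0x130 and 8 come from the expression above. Everything else (`FakeInner`, `refCountAddress`, `demoReadRefCount`, and the idea that the field at 0x130 points at an inner object holding the count) is an assumption made for illustration, exercised against a fake buffer rather than a real D3D11 device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical inner object: 8 bytes of "something" (for example a vtable
// pointer) followed by the reference count at offset 8.
struct FakeInner {
    std::uint64_t first8Bytes;
    std::uint64_t refCount;
};
static_assert(offsetof(FakeInner, refCount) == 8, "refcount must sit at +8");

constexpr std::size_t kInnerPtrOffset = 0x130;  // from the Watch expression
constexpr std::size_t kRefCountOffset = 8;      // from the Watch expression

// Equivalent of:
//   (unsigned __int64*)(*(unsigned __int64*)(addr + 0x130) + 8)
std::uint64_t* refCountAddress(std::uintptr_t addressOfComPtr) {
    std::uint64_t innerAddress = 0;
    std::memcpy(&innerAddress,
                reinterpret_cast<const void*>(addressOfComPtr + kInnerPtrOffset),
                sizeof innerAddress);
    return reinterpret_cast<std::uint64_t*>(
        static_cast<std::uintptr_t>(innerAddress + kRefCountOffset));
}

// Builds a fake "device" buffer with the assumed layout and reads the count
// back through refCountAddress.
std::uint64_t demoReadRefCount(std::uint64_t initialCount) {
    FakeInner inner{0, initialCount};
    alignas(8) unsigned char fakeDevice[kInnerPtrOffset + sizeof(std::uint64_t)] = {};
    const std::uint64_t innerAddress = reinterpret_cast<std::uintptr_t>(&inner);
    std::memcpy(fakeDevice + kInnerPtrOffset, &innerAddress, sizeof innerAddress);

    std::uint64_t* refCount =
        refCountAddress(reinterpret_cast<std::uintptr_t>(fakeDevice));
    assert(refCount == &inner.refCount);
    return *refCount;
}
```

The offsets are specific to whatever D3D11 runtime build was being debugged, so they may need to be rediscovered on another machine or after an OS update.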

Generating a log of refcounting operations

Building on the previous section, @ErichDonGubler has generated reports of the callstack for each refcounting operation via the Output window in Visual Studio. This is done by adding Actions to the settings of the breakpoints created in the previous section. Concretely:

  1. Set the following messages on the breakpoints created in the previous section:

    :warning: This content has been outmoded by the live collaboration on GitHub below.

    Outdated
    • For the Renderer11::callD3D11CreateDevice breakpoint, @ErichDonGubler has been using:

      --!! Starting to track refs `mDevice` at {mDevice}:$CALLSTACK
      

      …and yes, \n does not get translated to a newline, but it's workable in the scripting he's doing on top of this to balance the refcount operations tree he's building.

    • For the data breakpoint on mDevice's inner refcount, @ErichDonGubler has been using:

      --!! Ref count for device was modified:$CALLSTACK
      
  2. Run the debugger (viz., F5) once, and wait until the crash reproduces.
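As a rough idea of how such a log could be post-processed, the sketch below splits captured Output-window text into one event per `--!! ` marker, collecting the lines that follow each marker as that event's call stack. It is not the tool mentioned in this document; `RefEvent` and `splitEvents` are hypothetical names, and the assumption that each stack frame lands on its own line after the marker line has not been checked against real Visual Studio output.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One refcount-related event from the log (hypothetical structure).
struct RefEvent {
    std::string header;               // text after the "--!! " marker
    std::vector<std::string> frames;  // non-empty lines up to the next marker
};

// Splits raw log text into events. Lines before the first marker are ignored.
std::vector<RefEvent> splitEvents(const std::string& log) {
    const std::string marker = "--!! ";
    std::vector<RefEvent> events;
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings in a saved log.
        if (!line.empty() && line.back() == '\r') line.pop_back();

        if (line.compare(0, marker.size(), marker) == 0) {
            events.push_back({line.substr(marker.size()), {}});
        } else if (!events.empty() && !line.empty()) {
            events.back().frames.push_back(line);
        }
    }
    return events;
}
```

With events separated like this, increments and decrements could then be paired up by comparing call stacks, which is the balancing step described above.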

Great news! @ErichDonGubler has started writing a tool to consume these logs with Nical on GitHub.