# GFX 2019 work week
# Monday
## Interning positions?
(gw, kvark)
- we intern sizes but not positions of clips and primitives
- gecko bakes the scroll offsets
- new API now allows us to ask for the scroll offsets and “unbake” them
- however, hit testing still is a problem (TODO: clarify)
- need `get_relative_transform` to be used consistently
- blocked on flattening rework (TODO: resolve)
## Render task graph
(nical, gw, kvark)
Problem 1: a task can depend on multiple other tasks
Case: multiple drop shadows on the same text item
Opportunity: only downscale the text once, use for all shadow tasks
![](https://i.imgur.com/rLWv1NL.jpg)
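The shared-downscale opportunity amounts to an interned task graph: adding an identical task returns the existing id, so the downscale is rendered once and feeds all the shadow tasks. A minimal sketch (task kinds and fields are made up for illustration, not WR's actual types):

```rust
use std::collections::HashMap;

// Hypothetical task kinds; the names and fields are assumptions.
#[derive(Clone, PartialEq, Eq, Hash)]
enum TaskKind {
    Downscale { prim: u32, scale_log2: u8 },
    Shadow { prim: u32, blur_radius_px: u32 },
}

struct TaskGraph {
    tasks: Vec<TaskKind>,
    deps: Vec<Vec<usize>>, // edges: task -> its input tasks
    interned: HashMap<TaskKind, usize>,
}

impl TaskGraph {
    fn new() -> Self {
        TaskGraph { tasks: Vec::new(), deps: Vec::new(), interned: HashMap::new() }
    }

    // Adding the same task twice returns the existing id, so a downscale
    // shared by several drop shadows is only scheduled once.
    fn add(&mut self, kind: TaskKind, inputs: &[usize]) -> usize {
        if let Some(&id) = self.interned.get(&kind) {
            return id;
        }
        let id = self.tasks.len();
        self.tasks.push(kind.clone());
        self.deps.push(inputs.to_vec());
        self.interned.insert(kind, id);
        id
    }
}

fn main() {
    let mut g = TaskGraph::new();
    // Three drop shadows on the same text item share one downscale task.
    let down = g.add(TaskKind::Downscale { prim: 7, scale_log2: 1 }, &[]);
    for radius in [2, 4, 8] {
        let d = g.add(TaskKind::Downscale { prim: 7, scale_log2: 1 }, &[]);
        assert_eq!(d, down); // deduplicated
        g.add(TaskKind::Shadow { prim: 7, blur_radius_px: radius }, &[d]);
    }
    assert_eq!(g.tasks.len(), 4); // 1 downscale + 3 shadows
    assert_eq!(g.deps[3], vec![0]); // each shadow depends on the shared downscale
}
```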
Note: render task cache can't be used, since it doesn't handle dependencies well (scheduling issue).
- solution: schedule the RT cache as late as possible
- when dependencies are in the texture cache, we'd need to render in a pass and blit back to the texture cache
Concern: render task rect allocation assumes that the source is in the previous pass.
- solution: blit contents of a task across passes
- schedule as late as possible
- retain some of the render task slices/rects as opposed to clearing
Blits are expensive:
- bounds are not tight
- the perf on Intel scales with the number of pixels we touch
TODO: check ARM/Mali for when the tiles are resolved:
- does it happen if the tile is unchanged?
- what if it was just cleared?
Motivation:
1. reduce redundant shadow tasks
2. remove "mark for saving"
3. SVG filters are expressed as graphs
Q: retain across frames?
A: currently, not retaining any shadow tasks
Q: debugging tools for RT graph resolver?
A: a fun thing to write, given that it's fairly standalone
N: need tooling to find the best scheduling off-line, and compare it with the run-time result by the number of pixels
Current "best" strategy:
- ping-pong as current WR
- schedule late
Q: incremental deployment of the new allocator?
1. first, integrate with existing behavior
2. enable for shadows and other things
3. play with strategies
![](https://i.imgur.com/ijT0O1s.jpg)
Future:
1. Since we use a texture array, sub-manage the slices. Try to work with slices, not rects.
2. Render pass as the whole render target, select the slice in VS.
3. Try identifying the 1:1 pass work, use sub-passes.
- tile cache memory limits? know how many mask slices there are
- can provide the whole frame as one giant render pass
Q: can we exploit the axes and auto-rotate things?
- would be good!
- segmentation solves the problem to some extent
- could also exploit the symmetry
Rounded corners optimizations:
- only render the corners into the mask
- exploit the symmetry
- more precise bounding/geometry to reduce fill rate
- can't apply the local clip rect in this case!
- quad tree subdivision (or a regular grid) - still draw rects
- can't multiply the clip mask (TODO: discuss)
### Current PLS "optimization"
(gw, kvark, Gankro)
![](https://i.imgur.com/5rVXtUq.jpg)
Ideas:
1. use a test case that doesn't rely on tile elimination
2. avoid unorm <-> f32 conversions
3. bind as write-only more often (requires 32-bit chunks to be written)
4. don't multiply clip values early, only do in the combination pass
TODO: compile a list of questions for ARM
1. how exactly can we take advantage of tile elimination?
- does it work for off-screen targets?
- what are the supported formats?
- what are the states that affect it?
2. is anything happening to a tile we don't touch by geometry?
3. TODO
## WebGPU integration
(kats, kvark, jgilbert)
Process of vendoring wgpu-rs:
- move into the tree
- improve the remote layer
- establish scripts to/from GitHub
Gecko will have 2 implementations as well, in the shape of different structs with the same virtual interface: local and remote. Differences are:
1. `Client` parameter in all functions of the remote layer
2. Swapchain integration (unknown on Gecko side)
3. Pass dependencies collection in the remote layer
Q: how do we reduce the JS calls in client apps?
Moving into Gecko:
1. copy into tree as "gfx/webgpu"
2. connect it into libgkrust's [Cargo.toml](https://searchfox.org/mozilla-central/source/toolkit/library/rust/shared/Cargo.toml) and [lib.rs](https://searchfox.org/mozilla-central/source/toolkit/library/rust/shared/lib.rs)
3. Run `./mach vendor rust`, check for complaints about licensing. This will add dependencies to third_party/rust; make sure it looks sane
4. add build option "--enable-webgpu" similar to WR [here](https://searchfox.org/mozilla-central/rev/201450283cddc9e409cec707acb65ba6cf6037b1/toolkit/moz.configure#681-711). make the libgkrust integration stuff from step 2 behind a feature flag controlled by the build option
- JG: Why allow non-webgpu builds? We don't let you build without webgl.
- DM: switch ON when ready at least in some form? No need to slow down everyone for now
- JG: This is done with a pref, not a build option, usually.
- DM: WR was a build option at the beginning, before it was able to consider shipping anywhere
- JG: We know we're going to be shipping it, and that we want a prototype to play with sooner rather than later. To that end, it seems like all downside to have this be a build option. Just leave it as a pref, if that's acceptable. (which I think it is!)
- DM: OK, sounds reasonable
- DM: A tricky part is selecting which backend to build with. If it's optional, we can straightforwardly enable Vulkan build on Linux CI. If it's mandatory, we'll need to resolve the backend selection logic right here, which complicates the integration a bit.
5. To add a taskcluster job, first decide if you want a full Firefox build with webgpu enabled, or a standalone webgpu build. I think the former might make more sense if you just need to catch build regressions. For the latter, copy and modify the existing webrender standalone jobs such as [this one](https://searchfox.org/mozilla-central/rev/201450283cddc9e409cec707acb65ba6cf6037b1/taskcluster/ci/webrender/kind.yml#1-26,44-67) - copy it into a new `taskcluster/ci/webgpu/kind.yml` file. You won't need the wrench-deps stuff, just run `cargo build` in the gfx/webgpu folder
## WebRender architecture overview
(gw)
![](https://i.imgur.com/EKUhg3p.jpg)
Display Lists:
- Items
- text
- box shadow
- image
- Stacking contexts
- filters
- Clip chains
- Reference frames
Scene ("model"):
- picture tree
- spatial tree
- clip chains
Q: "scene" term vs WR capturing
Q: interning? (see picture caching)
Q: tile caching?
can be scrolled around
Frame ("view"):
- update spatial tree
- update picture tree
- update visibility
- update primitives
- generate render tasks
- assign passes
- batch
Submit:
- apply resource updates
- for each pass (see GPU work topic)
### Picture caching
Q: stuff moved but not marked as changed by the debug overlay?
A: could be fixed-position element that isn't cached, drawn on top
Interning key:
1. item itself
2. clipping
3. transform
4. animated properties (e.g. opacity)
Picture = (prim uuid, uuid, uuid, ..)
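A sketch of that interning scheme: equal keys always resolve to the same stable uid, so a picture is just a list of uids (field types here are assumptions for illustration):

```rust
use std::collections::HashMap;

// Hypothetical interning key mirroring the four components listed above;
// the field types are made up for this sketch.
#[derive(Clone, PartialEq, Eq, Hash)]
struct PrimKey {
    item: u64,         // 1. the item itself (hashed content)
    clip_chain: u32,   // 2. clipping
    spatial_node: u32, // 3. transform
    animated: u32,     // 4. animated property binding (e.g. opacity)
}

#[derive(Default)]
struct Interner {
    map: HashMap<PrimKey, u32>,
}

impl Interner {
    // Equal keys resolve to the same stable uid, so
    // Picture = (uid, uid, uid, ..) stays stable across frames.
    fn intern(&mut self, key: PrimKey) -> u32 {
        let next = self.map.len() as u32;
        *self.map.entry(key).or_insert(next)
    }
}

fn main() {
    let mut it = Interner::default();
    let k = PrimKey { item: 1, clip_chain: 2, spatial_node: 3, animated: 0 };
    let a = it.intern(k.clone());
    let b = it.intern(k);
    assert_eq!(a, b); // unchanged primitive -> same uid
}
```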
Tiles are 1024x256. Dirty regions are identified and updated with a scissor rect.
Q: what happens with complex regions?
A: draw the whole thing
- should set the Z on tiles to reject the pixels over the valid tiles (TODO: verify)
- blog-post-like pages are still the bad case
Q: tile coordinate space?
A: world. If stuff is scrolled, the positions are adjusted, so we get the same world results.
Q: how does the valid content get into the new frame?
A: copied through the texture cache
New API in development to expose the scroll offsets to WR from Gecko, allows removing hacks in WR and caching more surfaces.
Clusters are built during flattening:
- bounds
- spatial node
### GPU work in depth
![](https://i.imgur.com/ZnGAnoj.jpg)
`draw_tile_frame`:
- for each pass
- bind pass n-1 as input
- for each A8 target
- draw clips
- draw blurs
- for each RGBA8 target
- draw borders
- draw alpha batches
- draw blurs
- draw scalings
Most drawing looks like:
1. bind textures
2. bind shader
3. update VAO/instances
4. draw instanced
How the shader looks: brush_solid -> brush.glsl
```glsl
void main() { // VS
    fetch_brush();
    brush_vs();
}

void main() { // FS
    brush_fs();
    do_clip();
}
```
Data is passed to shaders:
1. `PrimitiveInstance` written to the instance buffer - 16 bytes with prim address, clip address, flags
2. brush common and specific data is written to the GPU cache
3. read by `fetch_brush`
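The 16-byte constraint on the instance data can be pinned down with a `repr(C)` struct; the field names beyond prim address, clip address, and flags are guesses, not WR's actual layout:

```rust
// A guess at the 16-byte instance layout; the real WR struct differs in
// field names and packing, but the size constraint is the point.
#[repr(C)]
#[derive(Clone, Copy)]
struct PrimitiveInstanceData {
    prim_address: i32,
    clip_address: i32,
    flags: i32,
    extra: i32, // remaining word (hypothetical; segment/user data in practice)
}

fn main() {
    // Each instance must stay at 16 bytes to keep the vertex stream small.
    assert_eq!(std::mem::size_of::<PrimitiveInstanceData>(), 16);
}
```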
List of all brush and scene shaders:
![](https://i.imgur.com/h6wWhqA.jpg)
### GPU cache
Rows are associated with a block count per element (16, 64, 512, etc.).
A simple slab allocator finds the next entry after the user provides all the data via `request()`.
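A minimal sketch of that slab scheme: each row holds entries of one size class, and allocation finds a row with free space or opens a new one. The row width is a made-up constant:

```rust
// Each row holds entries of one size class (16, 64, 512, ... blocks).
// ROW_BLOCKS is an illustrative constant, not the real texture width.
const ROW_BLOCKS: usize = 1024;

struct Row {
    block_count: usize, // size class of this row
    used: usize,        // entries handed out so far
}

#[derive(Default)]
struct GpuCache {
    rows: Vec<Row>,
}

impl GpuCache {
    /// Returns (row index, entry index within the row).
    fn alloc(&mut self, block_count: usize) -> (usize, usize) {
        let per_row = ROW_BLOCKS / block_count;
        for (i, row) in self.rows.iter_mut().enumerate() {
            if row.block_count == block_count && row.used < per_row {
                row.used += 1;
                return (i, row.used - 1);
            }
        }
        // No row of this size class has space: open a new one.
        self.rows.push(Row { block_count, used: 1 });
        (self.rows.len() - 1, 0)
    }
}

fn main() {
    let mut cache = GpuCache::default();
    assert_eq!(cache.alloc(16), (0, 0));
    assert_eq!(cache.alloc(16), (0, 1)); // same size class reuses the row
    assert_eq!(cache.alloc(64), (1, 0)); // different size class opens a new row
}
```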
TODO: validation could be more comprehensive
Q: do we have i32 textures?
A: segmentation! The primitive header (color, UVs, and GPU cache address) is written into the f32 prim header
## Texture uploads
(mstange, gw, kvark, jrmuizel, dan, ..)
Client storage on Mac:
- alignment (stride % 32 == 0)
- use rectangle textures
- don't use texture storage
- don't modify texture in flight
- don't use it with texture data
Use as an upload vector only, not as direct texture storage.
Problems:
1. stalls! no proper PBO renames
2. forced format conversion: no BGRA8 internal format on Mac
Potential path to fight stalls:
- switch to Scatter
- re-initialize GPU cache texture
Q: remove GPU cache texture in favor of vertex data only?
Idea: a small test suite to figure out what works well and what doesn't on a platform (texture uploads, UBOs, depth testing, etc.)
## WebRender debugging workshop
https://paper.dropbox.com/doc/WR-debugging-workshop--AalZzf941wQkvDMDIAURIDBqAQ-RC4fgQlmYHmrU83Sd8Nds
# Tuesday
## Battery life
(jrmuizel, markus)
Need to:
1. avoid copy of the frame contents
2. avoid rendering when we only need to scroll
3. do partial presents
4. for video, give YUV surface to WS
5. not composite occluded windows
6. use the same D3D device for video decoding to avoid a copy
Initial plan:
- document splitting. Separate OS layers for documents.
Current GL behavior:
- we call glSwapBuffers
- Window Server knows what window is changed, considers the opaque-ness
- unfortunately, our window is transparent, forcing window server to composite it on top of its window background
Short-term measures for baseline:
1. switch to opaque by using core animation layer (instead of NSView-backed GL context), with multiple layers (with different opaque-ness) using the same surface
2. use scissor rect in GL compositor to save GPU work from our process (doesn't save the window server any work)
3. Possibly use different framebuffers for different layers (maybe tile the window?) to get partial present and better overdraw cache locality
Chrome on Mac uses aggressive CA layerization. Can enable debug visualization of the layer borders to see how they are composited and invalidated.
Android is a problem, and Chrome is bad at it. Android has recent APIs (both SDK and NDK) similar to CA (SurfaceControl); we need to use them!
Solution ideas:
- identify multiple scroll roots in WR (ideally, pipe through this information from Gecko)
- use tile caching and CA layers for our scroll roots
- fall back to the current way of compositing if something is unusual
- at first, have one giant WR tile per layer
- picture caching is relevant since it has tracking for dependencies and knows when it needs to be invalidated
- could draw directly into the tiles/layers, but that would require batching to be separate as well during invalidation
### Video
Current:
1. clear
2. draw video
3. WR copy to screen
Use WebGL? It's already drawn into an `IOSurface`, so we could put it on screen easily.
WR needs to be able to mark images as "special" (aka "WebGL" image). WR needs to expose "this layer can go here" semantics.
WR makes a decision to make layers, attaches to a root layer that Gecko provides. There is a bit of compositor logic in WR to know about platform layers.
Planeshift crate? Take it and shape into what we need.
### Document splitting
One layer for document, one for chrome.
Need to get that first, play with partial present, and then go into layerization of the content.
Q: priorities for platforms
A: Windows and Android first
Q: measuring battery perf, avoiding regressions
- Intel power gadget on Mac
- Resize many windows with throbbers
- Mobile has comprehensive tools
- Need to measure the number of dirty pixels we send to WS, periodically compare against the power metrics
TODO: find a champion to experiment with compositing on document splitting
Jessie to follow up with perf team on Q2 OKRs for measuring battery perf
Problem: currently WR regresses versus non-WR on Windows, it's not using partial present.
First step in partial present: compute the dirty rect for chrome (throbber!). Currently WR doesn't do any dirty tracking on chrome, since it's outside of the tile cache.
## Fission update
(rhunt)
Essence: enhance process isolation.
Old: 4 content processes for all tabs. Spectre attack proved that we can't trust JS memory.
New: a website is the only thing in an OS process. Any iframes inside it from different sites are in other processes.
![](https://i.imgur.com/zWweCXi.jpg)
Tricky case: site A contains site B that contains site A. Still needs the inner and outer sites in the same process...
Problem: memory usage rises with the many processes.
- WebRender helps here because it shares a lot of context between tabs/iframes
- ImageLib is still a problem though, since it lives in the content process
Fission roughly targets Nightly Windows desktop at the end of 2019. Currently in phase 2 (preffable, partially functioning).
### Filters
Limitations: no filters on iframes.
TODO: look at how Chrome is doing compositing w.r.t. filters
We have telemetry to know if we are painting a sub-document inside a filter. Could be an SVG filter as well.
WR should eventually support SVG filters natively. Rough plan and prototyping is taking place. Lots of incremental work to follow.
### New API
... to recursively draw the contents starting from chrome in its process. Useful for screenshots and other things, where the compositor results aren't enough.
## Non-WR codepath
(jrmuizel)
non-accelerated WR
Q: how do we indicate to the user that they are on non-WR and some bugs aren't going to be fixed?
Q: is it worth cleaning up the frame layer builder (FLB)?
A: need to establish the timeline
WR ships in 67 on NV, 68 on AMD, then Intel.
Q: Win7 doesn't have direct composition, so WR path will be slow?
- Win7 end of life is Jan 2020, we'll probably be fine with a slowdown, considering users will be migrating to Win10 next year
- Main blockers: direct composition and driver quality
- Today ~50% of Windows users are on 7 and 8 variants
- half of those have D2D unavailable or blacklisted
- we can drop D2D (for content only? as opposed to canvas2d) once we have the majority of current D2D users moved to WR
### Android
- We don't support ES2. Missing things:
- integer texture (can work around)
- fp32 textures (should switch universally to fp16)
- [bug 1541135](https://bugzilla.mozilla.org/show_bug.cgi?id=1541135)
- array textures
- instanced arrays
- dynamic loops in the blur GLSL (can work around)
- should be able to enable on ES3
### Software
Changes since Orlando: picture caching helps.
Problem: supporting both WR and non-WR code paths for FLB is becoming more of a pain.
Options:
1. LLVM pipe
- tried before on desktops, was somewhat usable
- on the spot benchmark shows it to be usable, spends 90% of time in the text run rendering (presumably, because of subpixel AA), 5% in clip shaders, and the rest in blits
2. [SwiftShader](https://swiftshader.googlesource.com/SwiftShader) (in the future - Vulkan version)
- TODO: need to benchmark with WR
- currently being integrated into Gecko for WebGL (~Q2)
- exposed [GLES extensions](https://opengles.gpuinfo.org/displayreport.php?id=2653)
3. D3D11 WARP
4. Skia backend for WR?
- would allow us to take more shortcuts
Depth testing
- slower in software than on HW (because of memory bandwidth?)
- could be fast? not at the moment
- can do more aggressive culling on WR side for this case
Suggestion: "safe" WR mode:
- where it doesn't run any complex shaders and basically just does compositing
- we can run software WR with the existing GPU compositor in Gecko to get that
- can use the same facilities as we need for direct composition
- call `skia` to do all the non-trivial work?
Printing requires SW rendering. It can go through the same path as SVG images.
Plan:
- FLB stays at least throughout 2019
- reach the point where enough configurations have moved to WR so that FLB's performance doesn't matter any more
- gradual removal of features:
- drop support for component alpha
- drop AGRs (only have async scrolling and active layers)
- turn D2D off for content in 2019
Idea: disable texture array for the texture cache (and more). Start with 512, gradually increase to the max size.
### Next:
- Prioritize getting someone to test WR on software GL to see how well it works
- Need to determine specific performance problems
- could also then get feedback from swiftshader to see if they have suggestions to help determine if that is the path to follow
- swiftshader integrated behind a pref at some point?
- When to do Windows 7?
- Determine the blockers for shipping on more Windows
## WebGPU status
Link to [Dzmitry's doc](https://paper.dropbox.com/doc/GFX-WebGPU-Status--AadnWwLTdH8NeGAUaFvBI3cwAQ-YwHQXrbiI66UVGfwUnR1D)
![](https://i.imgur.com/MbgXr74.jpg)
This summer: MVP snapshot of API where we can work towards release version of spec
Q. Binding and shader language resolved by then?
TBD. MVPs might not consume same shader language. What to consume is largest outstanding question.
## Display Lists
Items we'd definitely like to address this quarter (non-WR display lists):
https://bugzilla.mozilla.org/show_bug.cgi?id=1534549
https://bugzilla.mozilla.org/show_bug.cgi?id=1539597
https://bugzilla.mozilla.org/show_bug.cgi?id=1502049
For WR display lists
- Potential low hanging fruit there we could address to improve performance
- Alexis is already looking into some of that
- But we can discuss in more detail during WR planning
For non-WR:
- Focus on work that isn't blocked by removing frame layer builder
- See how far we get in terms of performance improvements this quarter
- In June: we can make a call around how much more time we want to spend for now
## Android GPU debugging and optimization workshop
(gw, kvark, jnicol)
Mali GPU debugger:
- needs `build.rs` changes on unrooted devices in order to pre-load their library
- need the tool to be launched first, then the app
- shows render passes, but no tile invalidation info
- shows shader costs for cycles, ALU, loads/stores, TU instructions, and register occupation
We spend 200us CPU time per draw call!
Things to keep in mind for Android:
1. minimize framebuffer switches (causing resolves): bind, clear, draw, done
2. invalidate earlier and more, i.e. the depth buffer
3. reduce shader complexity
4. try using pixel local storage to reduce the writes
## Flattening semantics
(kvark, gw)
Two ways to interpret flattening as a concept:
1. input Z is ignored
2. output Z is zeroed, but input can affect X and Y
Problem: if Z is zeroed, the transform becomes uninvertible... We rely on this in several places throughout the WR code. The current clip-scroll tree computes the world transform and its inverse. The latter is used in CPU and GPU code.
Case: plane splitting. We can move the splitting logic into the space of the preserve3D root. But then we need to figure out the view vector for sorting, and that requires the inverse for the world transform of the preserve3D root...
### Preserve Z during flattening
The first idea to explore is turning Z into identity transformation on flattening instead of zeroing out. That would still "work" within an assumption that input Z is zero.
### Another concept of inversion
The flattened transform is not generally invertible in a sense that you can't build an un-transform from it that is usable on any vector. However, we know that we are only going to be using it on 2D vectors, effectively, so the un-transform should have concrete sense for these cases.
We might need to come up with a mathematical primitive that explains this 2.5D transform and allows both CPU and GPU code to use it.
## Mix-blend rollback
(kvark, gw)
RIP picture-ization of mix-blend.
CSS is written with a real read-back in mind.
## CSS Shooter perf
(kvark, gw)
Main cost is on:
- CPU compositor
- GPU
Lots of draw calls and FBO switches with blits...
TODO: double check sample queries, ensure we take offscreen surfaces into account, also include the blits!
CSS uses the background blend mode with 2+ textures; Gecko supports it but provides it to WR as a regular mix-blend SC.
Figured out the following solutions:
1. use `KHR_blend_equation_advanced` and IHV specific variants
- fall back to `EXT_shader_framebuffer_fetch` on some Mobile platforms
- fall back to the current slow path
2. special background blend code path?
3. picture caching on plane splitting SCs:
- depends on general picture caching work (ability to intern dependencies of all the pictures and find out which need to be cached)
- blocked by the flattening rewrite...
4. draw some of the plane split output as opaque
# Wednesday
## Planning!
End of 2019 goals:
- all platforms (a bit of everything)
- Linux (wayland prioritized)
- MVP Android
- Laptops
- more Windows (include Win7/8)
- Beta
![](https://i.imgur.com/BrhuyiP.jpg)
Android:
- glyph zooming $
- PLS optimizations
- GLES 2.0
- RGBA/swizzling
- disable array textures $
- tests
- fix `getScaledFont` crash
Linux:
- shader binary support
- blacklisting
- Wayland support
- fix vsync
Mac:
- Core Animation (CA) presentation
- CA document splitting
- CA WebGL
- blob recording of native themes:
- don't rasterize themes in content process
- texture uploads
- testing coverage
Picture caching:
- version 2.0
- universal picture interning
- cache filter outputs (blur)
- cache and share clip masks
- use blits for tiles $
Direct composition:
- scrolling
- document rendering
- WebGL
- video
- Windows 7 presentation
- Angle subpx extension
- test suites run with WR
- replace D2D with WR/Canvas2D
Threading:
- use fewer threads
- multi-thread scene building
- parallel task scheduling that isn't Rayon $$$
- non-blocking hit-testing
- remove IPC channel support
Display list:
- delta encoding
- spatial/clip trees
- make items tighter $
- reduce scene building times
- data pipe:
- better way to pass data through IPC
- more animated properties $$
Blobs:
- recoordination
- bounds changing invalidation
- SVG filters
- (some form of) path rendering
- image font performance
- global locks in Skia
- single global context
- clip paths on GPU
Software WR:
- test LLVM pipe
- ship SwiftShader
- pick low-hanging fruits
Performance:
- enable document splitting
- gradient fast path
- make box shadows first-class primitives
- improve opaque pass fragment count
- optimize resource bindings
- optimize clip mask renderings
- local space raster scale $$
- SIMD optimizations
- render task graph 2.0
- proof of concept Vulkan/D3D12/Metal $$$
- better primitive culling
- animation jank at 60fps (frame scheduling)
- consider spatial culling structure
- optimize Intel GPU perf
- make FPS shooter fast
- BGRA8 and swizzling support
- glyph cache optimizations
- mipping
- size/scaling re-use
- sharing between windows
Tooling:
- Android mobile profiling tools
- multi-frame WR captures
- WR capture tiled blobs
- picture caching debugging infrastructure $
Correctness:
- WR 67 bugs
- WR 68 bugs
- snapping!
Refactor:
- remove Cairo
- rename Document and Pipeline terms
- tech debt cleanup
- rename some modules: tiling.rs, clip_scroll_tree.rs
Other:
- hire more engineers
- compile the list of websites we are good at (better than Chrome)
Fission:
- move ImageLib to WR process
- move font management to WR process
Security:
- fuzzing
- font sanitation
## APZ discussion
(??)
![](https://i.imgur.com/JlVOmZq.jpg)
## Better clip mask rendering
Goals:
- avoid doing work more than once (when a clip affects multiple primitives)
- avoid doing work on fully opaque areas of the clip
- simplify the cs_clip shaders
![](https://i.imgur.com/4jj2tBQ.jpg)
Ideas to explore:
1. Clip mask inversion if we know that it's more 0 than 1.
2. Use stencil. Potentially, test stencil for each clip.
3. Share clips between items (under conditions).
Gather data about:
- number of clips affecting items
- average ratio of a clip area to the sum area of all clips
- how widely image masks are used
- how often the clip is shared between primitives
- what is the ratio of total primitive area versus the clip area
- are sub-pixel offsets of the primitives different?
### Fast clears
We need to get back to a point where clips are fast-cleared to 1. This requires disabling scissor and re-evaluating performance against the current path that tries to render the first clip without blending. We can still render the corners of the first clip without blending. We don't need to render the opaque areas at all.
## Snapping
Need automated infrastructure that modifies the reftests:
- scaling both ref and image
- switch some of the pictures to have their own surfaces
- change the node in spatial tree where we switch to screen space rasterization
## Document splitting overview
![](https://i.imgur.com/JRN8UNG.jpg)
## BGRA8 on Mac and Android
On Android, we don't always have BGRA8 internal format with `glTexStorage`. On MacOS, we never have that.
Choices are:
1. use `glTexImage2D` to make BGRA8 our internal format. Pay for mipmap allocation in VRAM. (Currently used on Android.)
2. use `glTexStorage(RGBA8)` and pay for conversion of data from BGRA8. (Currently used on Mac.)
- we can convince ImageLib to produce RGBA8 data in the first place
3. use `glTexStorage(RGBA8)` and *pretend* the data is in RGBA8, but use a swizzling sampler state when reading from it.
- as a follow up, we can make some of the cached render tasks produce BGRA8 right away, so that texture cache entries have more consistent swizzling
4. use texture rectangles with BGRA8 internal format. Requires us to remove texture arrays.
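For reference, the conversion cost paid in option 2 is a per-texel B<->R swap done on the CPU before upload. A naive sketch (real code would vectorize this):

```rust
// Swap B and R in place for a BGRA8 pixel buffer; G and A stay put.
fn bgra8_to_rgba8(pixels: &mut [u8]) {
    for px in pixels.chunks_exact_mut(4) {
        px.swap(0, 2); // B <-> R
    }
}

fn main() {
    let mut data = [0x10, 0x20, 0x30, 0xff]; // B, G, R, A
    bgra8_to_rgba8(&mut data);
    assert_eq!(data, [0x30, 0x20, 0x10, 0xff]); // R, G, B, A
}
```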
## DL interning API
Core idea:
- move interning logic from scene building to the API side
- the current DL builder would just intern everything as an implementation detail
- another DL builder would work with primitive handles and update vectors
- there is a benefit of providing structure of update arrays, especially if those don't have any variable-encoded enums inside
DL restructuring:
- provide spatial tree, clip tree, picture tree and potentially a hit test tree
![](https://i.imgur.com/95Ca4Lm.jpg)
## Picture caching improvements
Picture cache slices:
- Introduced by:
- WebGL, canvas, video elements
- Scroll roots (if using for performance within WR / low-end GPUs)
- Don't want to do component alpha blend, because:
- It's not supported by OS compositors.
- If we are doing slices for internal WR reasons (performance) we probably don't want to render twice anyway.
- For each slice:
- Try to determine if opaque.
- If yes, enable subpixel AA.
- Otherwise, use grayscale AA.
- Various possible options for switching between subpx / gray AA:
- Consider a sticky downgrade where an interned text run stays gray after downgrading.
- Might be OK to switch between them.
- Consider using framebuffer fetch as a follow up.
- Interpolate between subpx / grayscale based on fragment alpha.
# Thursday
## Planning - part 2
WebRender 2019 Roadmap: https://github.com/orgs/FirefoxGraphics/projects/1
## GLES2 limitations
(jrmuizel, gw, kvark, nical)
FWIW, we should expect 75%+ of Android devices to support ES3.
No texture arrays:
- use big 4k by 4k textures
- Some very old devices are 2k, but should be vanishingly few.
- can still use the rect packer over smaller 512x512 portions of it
- if out of space, allocate a new texture and break batches. Hopefully, not often
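A sketch of carving the big 4k texture into 512x512 regions; only region-level bookkeeping is shown here, and the per-region rect packer is omitted:

```rust
// Treat the 4k texture as an 8x8 grid of 512x512 regions, each of which
// would get its own rect packer. Constants are from the discussion above.
const TEX_SIZE: u32 = 4096;
const REGION_SIZE: u32 = 512;

struct Atlas {
    free_regions: Vec<(u32, u32)>, // origins of unused 512x512 regions
}

impl Atlas {
    fn new() -> Self {
        let mut free_regions = Vec::new();
        for y in (0..TEX_SIZE).step_by(REGION_SIZE as usize) {
            for x in (0..TEX_SIZE).step_by(REGION_SIZE as usize) {
                free_regions.push((x, y));
            }
        }
        Atlas { free_regions }
    }

    // Hand out a whole region; a real packer would sub-allocate within it
    // and allocate a new texture (breaking batches) when this returns None.
    fn alloc_region(&mut self) -> Option<(u32, u32)> {
        self.free_regions.pop()
    }
}

fn main() {
    let mut atlas = Atlas::new();
    assert_eq!(atlas.free_regions.len(), 64); // 8x8 grid of regions
    assert!(atlas.alloc_region().is_some());
    assert_eq!(atlas.free_regions.len(), 63);
}
```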
Around 50% ES2 devices support half-float textures:
- [texture_float](https://opengles.gpuinfo.org/listreports.php?extension=GL_OES_texture_float)
- [texture_half_float](https://opengles.gpuinfo.org/listreports.php?extension=GL_OES_texture_half_float)
- pack data into RGBA8?
- move some stuff into vertex attributes
Vertex texel fetch:
- Mali-450 (an Amazon device?) doesn't support vertex texture units.
[Nothing supports instancing](https://opengles.gpuinfo.org/listreports.php?extension=GL_EXT_instanced_arrays):
- instead of using instanced attributes, copy them over per vertex
- roughly 4x memory/bandwidth requirements for vertex buffers, could be improved
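The duplication workaround could be sketched like this (the types are illustrative, and a quad is assumed to take 4 vertices):

```rust
// Without instanced arrays, per-instance data must be repeated for every
// vertex of the quad, roughly 4x the vertex-buffer memory/bandwidth.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Instance {
    prim_address: i32, // placeholder for the real per-instance attributes
}

fn expand_instances(instances: &[Instance], verts_per_quad: usize) -> Vec<Instance> {
    let mut out = Vec::with_capacity(instances.len() * verts_per_quad);
    for inst in instances {
        for _ in 0..verts_per_quad {
            out.push(*inst); // same attributes duplicated per vertex
        }
    }
    out
}

fn main() {
    let instances = [Instance { prim_address: 1 }, Instance { prim_address: 2 }];
    let expanded = expand_instances(&instances, 4);
    assert_eq!(expanded.len(), 8); // 4x the entries vs. one per instance
    assert_eq!(expanded[3], Instance { prim_address: 1 });
    assert_eq!(expanded[4], Instance { prim_address: 2 });
}
```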
## DL interning - part 2
(kvark, gw)
Idea: hide interning semantics as an implementation detail. Expose the API as `device.createSomeObject() -> objectHandle`. If the implementation decides to return an old handle, the client (Gecko in our case) doesn't care. Could be done gradually, object by object.
## Scene building threads
(nical, kvark)
Move from Low/High priority scene builds to a scene-builder (SB) thread per document. Essentially doing the same thing as now, just more granular and explicit.
Still needs a way to synchronize scene builds for both documents when resizing Gecko window.
## Frame scheduling
Situation: we don't get more than 2 frames ahead.
Problem: if we fire 2 frames in a row, we won't have enough time for the 3rd frame to proceed through the pipeline. Big stall.
![](https://i.imgur.com/E3c3YOe.jpg)
Option: don't limit the pipeline to 2 frames
- coalesce display lists on WR side when needed instead of throttling
- conflict with WebGL requirements
- can still throttle in Gecko, just to a higher number of pipeline stages
Q: why do we even have a renderer thread?
- we go through compositor because of language barrier (impl detail)
- no real reason, just convenient to implement
Idea: don't go through RB when asking for a scene build:
- texture cache isn't needed
- fonts can be shared
Tasks:
1. serialize DL creation with the end of scene building
2. remove the RB visit
RB needs to be Vsync synchronized, because it uses the results of inputs.
WebGL:
- the fewer frames in flight, the better for latency
- not very clean, has half a frame in flight
- transaction = drawn frame + fence
- we only pass the transaction when the fence is reached
Idea: the best way to budget frames and pipelining is to have some heuristics that predict frame consistency.
- but we don't really want to put heuristics, web is too complex
- but we already have a heuristic to estimate the time from input sampling to VSync...
Q: how do we reproduce the scheduling problems in general?
Time is only sampled at the start of the compositor. So by the time inputs are sampled, we live in the past.
Chrome approach: DL building starts at -1 vsync, rendering starts 5ms before the vsync.
Current WR approach: DL building starts at -2 vsync, rendering starts at -1 vsync.
Note: Chrome has less latency but not necessarily higher throughput. The goal is to make the input latency stable (not necessarily constant).
Idea: both of these periods before vsync-0 are not related to vsyncs, strictly speaking. We need some heuristics to know when to start that work, to finish before the vsync on the GPU at the end of the day. We need to:
1. detach them from the refresh driver, at first have them fixed to current numbers
2. start making the heuristics more flexible, based on the previous frames
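Step 2 might start from something as simple as a moving-average predictor. This is entirely hypothetical (numbers, names, and API are made up), not what Gecko does today:

```rust
// Start work `predicted + margin` milliseconds before the vsync deadline,
// where `predicted` is a moving average of recent frame times.
fn predicted_start_before_vsync(recent_frame_times_ms: &[f32], margin_ms: f32) -> f32 {
    let avg: f32 =
        recent_frame_times_ms.iter().sum::<f32>() / recent_frame_times_ms.len() as f32;
    avg + margin_ms
}

fn main() {
    // Three recent frames took 4, 5, and 6 ms; keep a 1 ms safety margin.
    let lead = predicted_start_before_vsync(&[4.0, 5.0, 6.0], 1.0);
    assert!((lead - 6.0).abs() < 1e-6); // start ~6 ms before vsync
}
```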
Problem: we only know how to wake up threads on the refresh driver at the moment
- solution doesn't have to be exact: an error within 1ms is still acceptable
- need to look up the way Chrome does it
## Tiling with direct composition
(mstange, jrmuizel, gw, kvark)
Example Intel-based macBook has:
- 720k of L2
- 8M of L3
Total byte size of the screen buffer is 20M; it doesn't fit into the L3 cache, causing us to wait for RAM a lot. Solution:
- draw to tiles instead of blitting from the full screen into tiles
- either blit or direct-composite the tiles on screen
- don't wait for a picture to repeat itself a few frames, always go the tiling code path
Q: What is the best tile size?
- having it fit in 256K makes us fully within L2 cache and has some benefits
- current tiles are 4x bigger: 1024x256, still fit in the L3 cache. Can make them 2-4 times bigger if we want to.
- small tiles cause a lot of batches
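The cache arithmetic behind these numbers, assuming RGBA8 tiles and a 2880x1800 Retina framebuffer (an assumption that matches the ~20M figure above):

```rust
// Bytes occupied by an RGBA8 tile or framebuffer.
fn tile_bytes(w: u32, h: u32) -> u32 {
    w * h * 4 // 4 bytes per pixel
}

fn main() {
    assert_eq!(tile_bytes(256, 256), 256 * 1024);      // fits a 256K L2 budget
    assert_eq!(tile_bytes(1024, 256), 1024 * 1024);    // current tiles: 1M, still L3-resident
    assert!(tile_bytes(2880, 1800) > 8 * 1024 * 1024); // full screen (~20M) spills the 8M L3
}
```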
Q: why does drawing many instances of full-tile blends not scale linearly in GPU time?
- there is a fixed cost to load the initial framebuffer color as well as write it down at the end
- just like with tilers on mobile!
# Friday
## Backface visibility and Preserve3D
(kvark, matt)
![](https://i.imgur.com/Nq5rjCb.jpg)
Basic order of operations on a stacking context is the following:
- gather all the children into a "surface"
- apply filters, e.g. opacity
- transform
Current handling of backface-visibility is not fully correct in WR, since we treat backface visibility in relation to the parent SC.
If Preserve3D is enabled, there is no "surface" to flatten/bake into, so we evaluate the ordering and backface visibility in the parent coordinate system. Opacity overrides Preserve3D, so there is no conflict of where the surface is considered.
Rules of backface-visibility evaluation are the following:
1. if self has a transform (in non-preserve-3D context), this is what we evaluate visibility for
2. if parent has Preserve3D, we evaluate visibility in the parent context
## Intel 4k performance
(kvark, gw)
Bugzilla [page](https://bugzilla.mozilla.org/show_bug.cgi?id=1474294) runs badly; we identified 2 major issues:
1. GPU utilization reaches 10ms and suffers from B_Image and B_Clip items. Could be explained simply by memory bandwidth and bad utilization of the L2/L3 caches.
2. The CPU struggles to process 200-450 draw calls. Upon further inspection, it turns out that we break the batches because of the blend mode, which (for the platforms that don't support dual-source blending natively) depends on the color of the text.
One solution to the CPU issue would be adding the dual-source-blending extension to Angle, which Lee is looking into.
There is another thing we could do: draw the text into an off-screen target with the shader output pre-multiplied by the text color. Then we can draw all the text into the main target as "B_Image" with simple blending and no batch breaks.
TODO: prototype this solution
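A minimal grayscale-AA sketch of what the off-screen pass would output per pixel; the real subpixel-AA story (per-channel coverage) is more involved:

```rust
// Glyph coverage multiplied by the text color up front, so the main-target
// composite is plain premultiplied-alpha blending regardless of text color.
fn premultiply(coverage: f32, text_color: [f32; 3], opacity: f32) -> [f32; 4] {
    let a = coverage * opacity;
    [text_color[0] * a, text_color[1] * a, text_color[2] * a, a]
}

fn main() {
    // Half-covered white pixel at full opacity.
    let px = premultiply(0.5, [1.0, 1.0, 1.0], 1.0);
    assert_eq!(px, [0.5, 0.5, 0.5, 0.5]);
}
```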
Another observation from the Bugzilla capture is the number of scissor changes (regions). We've seen cases where regions could be coalesced better than they are.
TODO: investigate why dirty regions are not aligned to tile boundaries