# Plan of piet-gpu tour

## Introduction

`vello` is an experimental 2D graphics renderer, written in Rust. In this post, we aim to describe `vello`'s architecture.

## Pipeline overview

Executing a task efficiently on a GPU requires breaking it up into parallelizable primitive tasks. For example, in the final stage of 3D rendering, the parallelizable primitive is the task of rendering a triangle. In `vello`, the "canvas" is imagined as being tiled by squares, and the rendering of each tile's contents happens in parallel.

Here's the overall pipeline for rendering:

**piet --> scene graph --> per-tile command list --> tile renderer**

* (CPU) An application defines a "scene" using `piet`'s API. This scene is organized as a graph, where the nodes are operations such as translation, rotation, etc., or groups of objects.
* (CPU) The scene graph is encoded into a GPU-suitable format, then dispatched to the GPU.
* (GPU) The scene graph is re-interpreted into lists of drawing commands per tile (per-tile command lists, abbreviated as PTCLs).
* (GPU) Tiles are rendered in parallel, by executing the drawing commands stored in their command lists.

The CPU/GPU label given to each pipeline stage describes whether it is executed on the CPU or the GPU. Any task marked GPU must have been parallelized in some manner. We've discussed tiles, but the parallel re-interpretation of the scene graph into PTCLs is really the heart of `piet-gpu`. A sketch of what a per-tile command list might look like is given below.
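To make the data flow concrete, here is a minimal Rust sketch of a hypothetical per-tile command list and the loop a tile renderer might run over it. The `Cmd` variants, `Tile` struct, tile size, and field names are invented for illustration; they do not match `piet-gpu`'s actual (packed, byte-stream) encoding.

```rust
/// A hypothetical per-tile command list (PTCL) entry. The real piet-gpu
/// encoding is a packed byte stream; an enum is used here only for readability.
#[derive(Clone, Copy)]
enum Cmd {
    /// Fill the whole tile with a solid color (premultiplied RGBA).
    Solid { color: [f32; 4] },
    /// Composite a color weighted by coverage. (In reality coverage varies
    /// per pixel; a single value per tile keeps the sketch short.)
    Fill { coverage: f32, color: [f32; 4] },
}

const TILE_SIZE: usize = 16; // square tiles, e.g. 16x16 pixels

/// One tile's output: a small block of pixels.
struct Tile {
    pixels: [[f32; 4]; TILE_SIZE * TILE_SIZE],
}

/// Execute a tile's command list. On the GPU, one workgroup (or one thread
/// per pixel) would run this for every tile in parallel; here it is plain
/// sequential Rust.
fn render_tile(ptcl: &[Cmd]) -> Tile {
    let mut tile = Tile { pixels: [[0.0; 4]; TILE_SIZE * TILE_SIZE] };
    for cmd in ptcl {
        for px in tile.pixels.iter_mut() {
            match *cmd {
                Cmd::Solid { color } => *px = color,
                Cmd::Fill { coverage, color } => {
                    // Premultiplied "over" blend, weighted by coverage.
                    for i in 0..4 {
                        px[i] = coverage * color[i] + (1.0 - coverage * color[3]) * px[i];
                    }
                }
            }
        }
    }
    tile
}

fn main() {
    // A toy PTCL: clear to white, then blend opaque red at 50% coverage.
    let ptcl = [
        Cmd::Solid { color: [1.0, 1.0, 1.0, 1.0] },
        Cmd::Fill { coverage: 0.5, color: [1.0, 0.0, 0.0, 1.0] },
    ];
    let tile = render_tile(&ptcl);
    println!("first pixel: {:?}", tile.pixels[0]);
}
```

The point of the sketch is only the shape of the data flow: each tile ends up with a short, self-contained list of commands that can be replayed independently of every other tile, which is what makes the final rendering stage embarrassingly parallel.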
## Architecture background

### Modern GPU APIs, and modern GPGPU

[Recommended reading.](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf)

**CPU:** the central processing unit in a computer.

**GPU:** a graphics processing unit. Separate hardware installed in many computers. Contains a large number of processors. Each processor is relatively weak compared to the processors in a CPU, but their large number makes the GPU powerful.

**GPGPU:** "general purpose GPU", a GPU that can be programmed for use beyond traditional graphics problems. Most GPUs today are GPGPUs.

**Parallelism/concurrency:** given multiple processors capable of computing, we can set up a task so that each processor does a little bit of it, in parallel (at the same time, or concurrently) with the other processors. This allows us to bring to bear, on the task at hand, the power of all the processors available to us.

**Concurrent activity:** a process composed of multiple items that happen concurrently (see Fig. 3.1 below, taken from [Volkov, 2016](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf)). The items could be: instructions, warps, memory transactions, (further examples?)

![](https://i.imgur.com/38afbsW.png)

**Latency:** the start-to-end time of an item in a concurrent activity. Units of $[T]$.

**Latency hiding:** the overlapping of latencies in a concurrent activity, so that part of the latency of one item is "hidden in" the latency of another item.

**Mean latency:** the mean latency of the items in a given time period $\delta t$. (Units of $[T]$.)

**Throughput:** the number of items in a concurrent activity that fall within a particular time period $\delta t$, divided by the length of that period. (Units of $[\textrm{thing}]/[T]$.) *Note that throughput is a quantification of latency hiding.*

**Concurrency:** the number of items processed at the same time, in a given time interval $\delta t$.

**Little's law:**

$$\textrm{mean concurrency in } \delta t = \textrm{mean latency in } \delta t \times \textrm{throughput in } \delta t$$

We want to maximize throughput (latency hiding) as much as possible. Little's law tells us that, assuming the items of interest line up neatly within the time period $\delta t$, we can increase throughput in one of two ways: 1) increasing concurrency, i.e. the number of items in flight during $\delta t$, or 2) decreasing their mean latency. A toy numerical example is sketched at the end of this section.

**Threads:** a name for processors that can compute independently of each other.

**Physical threads (warps/subgroups). Logical threads. Lanes.** A GPU, unlike a CPU, has many physical threads. Each physical thread has a number of sub-processors, called "logical threads" or "lanes". The sub-processors are "logical" threads because they are not truly independent: every lane in a physical thread must execute the same instruction each clock cycle. However, each lane can operate on an independent piece of data. On Nvidia devices, the physical threads are referred to as "warps"; AMD calls them "wavefronts", and Vulkan uses the vendor-neutral term "subgroups". In both vendors' terminology, the logical threads are just called "threads". In the SIMD world, the logical threads are often called "lanes".

**SIMD:** a GPU is, at its heart, a SIMD ("single instruction, multiple data") machine. That is, its processors at the most basic level (the lanes in a warp) are structured so that they execute the same instruction per cycle, with each lane operating on an independent piece of data. All basic instructions on a GPU are vectorial: they operate on N-length vectors rather than scalars, where N is the number of lanes in a physical thread.

**SIMT:** a GPU is not a strict SIMD machine, as its logical lanes can emulate independently executing threads. Recall that SIMD lanes are not independent, since they must execute the same instruction. A program that requires the lanes within a warp to execute different instructions could not run on a strictly SIMD machine, but modern GPUs can operate as SIMT ("single instruction, multiple (logical) threads") machines, which overcomes this difficulty at the cost of decreased throughput. Suppose a program requires some lanes (mark them as collection A) to execute instruction ALPHA, and the remaining lanes (collection B) to execute instruction BETA. The GPU can temporarily idle the lanes in collection B and let the SIMD warp execute instruction ALPHA; it then idles the lanes in collection A and executes instruction BETA. Therefore, rather than writing pure SIMD programs, the programmer may occasionally choose to write portions of their program that execute in SIMT mode. This comes at the cost of decreased throughput (a decrease in the number of lanes concurrently doing useful work).

Things that increase latency:

* SIMD/SIMT divergence during branching.
* Uncoalesced memory access: the lanes of a warp should access memory in a contiguous, aligned pattern, so that their loads and stores can be combined into as few memory transactions as possible.
* Shared memory bank conflicts.

**Occupancy:** the number of warps resident on the machine at the same time, or, more commonly, that number divided by the maximum number of warps that can be resident at the same time on the machine.

> Each transaction is delivered to the appropriate memory controller via the interconnection network. The controllers accumulate the incoming transactions and serve them in the same or a different order for a better DRAM operating efficiency. Reordering goals include, for example, improving row buffer hit rates and reducing the number of switches between reads and writes.
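As a toy illustration of how Little's law and occupancy interact, the sketch below estimates how many warps need to be in flight to hide a given memory latency. The numbers (400-cycle latency, a target of one completed transaction per cycle, 48 resident warps per SM) are made up for illustration, not measurements of any particular GPU, and it assumes, for simplicity, one outstanding transaction per warp.

```rust
/// Little's law: concurrency = latency * throughput.
/// Given a target throughput (items completed per cycle) and a mean latency
/// (cycles per item), return the number of items that must be in flight.
fn required_concurrency(throughput_per_cycle: f64, latency_cycles: f64) -> f64 {
    throughput_per_cycle * latency_cycles
}

fn main() {
    // Hypothetical numbers, chosen only to make the arithmetic concrete.
    let memory_latency_cycles = 400.0; // mean latency of one memory transaction
    let target_throughput = 1.0; // we want to complete 1 transaction per cycle

    // Assuming each in-flight transaction is issued by one warp, the required
    // concurrency is also the number of warps that must be resident and
    // waiting on memory at any given moment.
    let warps_in_flight = required_concurrency(target_throughput, memory_latency_cycles);

    // If one SM (streaming multiprocessor) can hold at most 48 resident warps,
    // this is how many SMs we need at full occupancy to reach the target.
    let max_resident_warps_per_sm = 48.0;
    let sms_needed = warps_in_flight / max_resident_warps_per_sm;

    println!("warps in flight needed: {warps_in_flight}");
    println!("SMs needed at full occupancy: {sms_needed:.1}");
}
```

The point is just the shape of the trade-off: with latency fixed, more throughput demands more concurrency, which on a GPU means more resident warps, i.e. higher occupancy.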
#### Common optimization pitfalls

**Branching/SIMT:** just because the GPU can do it doesn't mean the GPU likes it.

**Uncoalesced memory access:** is a picture worth a thousand words?

![](https://i.imgur.com/2Mr0kvR.png)

**Shared memory bank conflicts.**

![](https://i.imgur.com/ePDgXIM.png)

### Historical background - piet-gpu-hal

There are three major modern graphics APIs: DirectX 12, Vulkan, and Metal. There are also some older graphics APIs, such as OpenGL, and some web-related graphics APIs (WebGL?). Which API should piet-gpu use? That is not an easy question to answer, and the Rust crate `gfx-hal` answers it with: "use none of those APIs; instead use `gfx-hal`, which is a Vulkan-like generic API that can target any one of DX12, OpenGL, Metal, Vulkan, etc. in the background, as needed". Neat!

However, piet-gpu was aiming for two things: 1) experimentation and 2) simplicity, by focusing on only what is needed. Among other reasons, this is why Raph chose to write piet-gpu-hal, which, while heavily inspired by gfx-hal, is in contrast very barebones and focused on exposing a narrow subset of the features the modern graphics APIs offer: not all the functionality needed to one day have a smooth piet-gpu experience, but only the functionality needed *today*. Some of the features piet-gpu relies on are experimental (related to using Vulkan 1.2) and are not high-priority items for gfx-hal. Last, but not least, piet-gpu-hal aims to compile shaders ahead of time as much as possible, while gfx-hal tends to compile shaders at run time.

It is worth noting that the gfx-rs project, of which gfx-hal is a sub-project, is rapidly evolving in capability and elegance. gfx-hal is *the* machine behind another gfx-rs project, `wgpu` (an implementation of the WebGPU API), an intermediate graphics layer that may end up being the `hal` that replaces `piet-gpu-hal` in the future, once a fully-featured hardware abstraction layer is required. Other gfx-rs based projects such as `naga` may also come in useful someday.

## Algorithm background

> conceptual background needed to understand the code

### Backdrop calculations

Use the "main ray", "vertical ray" and "horizontal ray" explanation from Li et al. 2016.

#### Flat clipping

Concept from MPVG; the paper doesn't explain it very well, so maybe use an alternative formulation (one idea is to represent the clip test as pseudocode and explain how the if statements do not consume stack space).

### Coarse raster?

### Tile-based deferred renderers

### GPU-side flattening

## Kernel deep-dive - data structures

## Conclusion

Do it later ™