VCS #1 Tile-based Renderer

# VCS #1 Tile-based Renderer This is the documentation for the *Assignment 1: Implementing a Parallel Sort-Middle Tiled Renderer* of CMU 15-473/673 Visual Computing System. The task it to implement a few missing parts of the rasterizer and to parallelize the renderer using a sort-middle tiled approach (a.k.a. a *tiled-based renderer*). :::warning For academic integrity reason, most of the codes for this project will be hidden. If for any reason you would like to get the access to the codebase, please contact Tony at linghent@andrew.cmu.edu or lockbrains@gmail.com. ::: ## Data Handling The first step in the assignment is to fix the renderer by implementing a triangle rasterizer (a quad fragment generator). I will do this by implementing two functions currently stubbed out in `Rasterizer.h: RasterizeTriangle()}` and `TriangleSIMD::TestQuadFragment()`. ### SIMD and `__m128` Instructions In this assignment, we will pass the coordinates using the Intel **SIMD** (Single Instruction, Multiple Data) data type, specifically we will be using `__m128`. It is a data type specific to Intel's SIMD instruction set, particularly used in the Streaming SIMD Extensions (SSE) family of instructions. It represents a 128-bit-sized register that can store and operate on *four* 32 floating-point values *simultaneously*. To handle `__m128`, we will need to use specific API to do basic arithmetic operations. Specifically in this assignment, we will need the following API's to work around. ```cpp= #include <xmmintrin.h> int main () { // addition __m128 result_addition = _mm_add_ps(1.0f, 2.0f); __m128i result_addition_int = _mm_add_epi32(1,2); // subtraction __m128 result_subtraction = _mm_sub_ps(1.0f, 2.0f); __m128i result_subtraction_int = _mm_sub_epi32(1,2); // multiplication __m128 result_mul = _mm_mul_ps(1.0f, 2.0f); __m128i result_mul_int = _mm_mullo_epi32(1, 2); // division __m128 result_div = _mm_div_ps(1.0f, 2.0f); // set to zero __m128 zero = _mm_setzero_ps(); // set to value __m128 one = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f); } ``` We can also handle comparisons and bit operations using SIMD instructions. ```cpp= #include <xmmintrin.h> int main () { // Compare Equal __m128 result_equal = _mm_cmpeq_ps(1.0f, 1.0f); __m128i result_equal_int = _mm_cmpeq_si128(1,1); // Compare Less Than __m128 result_lessThan = _mm_cmplt_ps(1.0f, 2.0f); __m128i result_lessThan_int = _mm_cmplt_si128(1,2); // Compare Greater Than __m128 result_greaterThan = _mm_cmpgt_ps(1.0f, 2.0f); __m128i result_greaterThan_int = _mm_cmpgt_si128(1,2); // Bitwise And __m128 result_and = _mm_and_ps(1.0f, 2.0f); __m128i result_and_int = _mm_and_si128(1,2); // Bitwise Or __m128 result_or = _mm_or_ps(1.0f, 2.0f); __m128i result_or_int = _mm_or_si128(1,2); } ``` ### N.4 Format Handling In this assignment, the vertex positions will be stored in a fixed-point representation with 4 bits of subpixel precision, and this is called **N.4 format** of floating point values - $n$ bits for integer and 4 bits for decimal. For example, a binary expression `1001.0110` can be understood as: - Integer part:`1001` = 9. - Decimal part:`0110` = $\frac{6}{16}$ = 0.375 So this number is 9.375. N.4 format is widely used in GPU to reduce the cost of floating-point computation. Furthermore, one of the most important operations of N.4 format should be *rounding*. Using the definition, for a binary notation of a N.4 format floating-point number $a$, we can use bitwise shift to do floor and ceil operations. \begin{align} \lfloor a \rfloor &= a >> 4. \\ \lceil a \rceil &= (a + \texttt{0xF}) >> 4. \end{align} ## Rasterization ### Point-in-triangle Test When we want to rasterize the scene, we will make a call to the `RasterizeTriangle()` function. This function accepts a region of the output image and a triangle as input. It then identify all $2\times2$ pixel chunks within this screen region that may be at least partially covered by this input triangle. During this process, we will need to determine if the triangle actually *covers* samples in this $2\times2$ pixel block -- this is called the **triangle coverage test**. This part is done with the function `TestQuadFragment()` -- it accepts the coordinates of four screen points as input, corresponding to the pixels in this pixel block. Its output is a bitmask indicating which of these points lies within the triangle, as shown in the following figure. ![Frame 140](https://hackmd.io/_uploads/Skk7iXmCR.png) Since a `__m128` register can save upto 4 integers, we will use two registers for storing the $x$ and $y$ coordinates for the four samples in the quad $2\times2$ fragments (in N.4 fixed-point format as stated in the comments). #### Edge Equation To test if a pixel is covered by the triangle, we use the **Edge Equation**. Label the $i$-th vertex of the triangle as $(x_i, y_i)$. For a query point $\textbf{q} = (x,y)$, we first calculate \begin{align*} A_i &= (y_i - y_{i+1}),\\ B_i &= (x_{i+1} - x_i).\\ \end{align*} The Edge Equation the $i$-th edge is computed by \begin{equation} E_i = A_i(x - x_i) + B_i(y - y_i). \end{equation} The necessary and sufficient conditions for the point $\textbf{q}$ to pass the inside-triangle test is \begin{equation} (E_0 > 0 \And E_1 >0 \And E_2 >0) || (E_0 < 0 \And E_1 <0 \And E_2 <0). \end{equation} i.e. the Edge Equations have to be samely signed. In case a point falls exactly on the edge (i.e. $E_i$ = 0), we check if the edge is *owned* by the current triangle. We will be able to check this by accessing `isOwnerEdge[i]`. If `isOwnerEdge[i]` is true, it means this edge is owned by the triangle, so we consider all the points on this edge in the triangle. We will use the Edge Equation in the mask calculation in the function `TestQuadFragment()`. ### Rasterize Triangle Now that we have the function `TestQuadFragment()` which returns a bit mask indicating if the sample for each fragment is covered, we will be able to rasterize the triangle. We will do this in the `RasterizeTriangle()` method. The way we rasterize a triangle can be summarized as find all the candidate pixels, and test each pixel for coverage by using `TestQuadFragment()` we have already written. We will take as input, - `regionX0, regionY0`: the top-left corner coordinate of current working tile. - `regionW, regionH`: the width and height of the current working tile in pixels. - `tri`: set up triangle equations, where we keep the coordinates of the corners of the triangle. - `triSIMD`: all values of `tri` in SIMD registers. - `processQuadFragmentFunc`: A function value -- for every quad fragment that may generate coverage, this method should call this function. ## Sort-middle Tile-based Renderer Now we will start the implementation of a sort-middle tiled parallel renderer. The tiled renderer will have two phases: - Place the triangles into *bins*. It will build a data structure where there is one bin per screen tile, and each bin contains a list of triangles that potentially overlap the bin. - Process the bins in parallel, dynamically distributing the next unprocessed bin to the next available core. The process can be shown as the Figure 2. ![Frame 141 (1)](https://hackmd.io/_uploads/S1nk_Xj0C.png) ### Initialization In tile-based rendering, we partition the screen into smaller tiles, which we call them `bins`. Each tile will be responsible for processing the triangles inside. Usually, in order to improve the performance, we will classify triangles by the tiles they covered. The process is called binning. A intuitive way to keep track of the bins is to use a vector of tiles, where each tile is a vector of projected triangles, let’s call it bins. However, when binning is processed in a multi-thread context, multiple threads may handle triangles simultaneously, and place them into corresponding bins. If they write to the global `bins` directly, it may result in: - *Race conditions.* Multiple threads writing into the same data structure – it may cause a data disagreement or even a program crash. - *Performance bottleneck.* In order to avoid race conditions, we will need to apply locks to ensure the thread safety – but this is almost guaranteed to be slow, as it lowers the parallelism. #### Thread Local Bins To solve the above issues, we introduce the thread local bins, maintained by the data structure, which we call it the `threadLocalBins`. Now, each thread has its own independent data structure for maintaining bins. When a thread `i` is binning, it writes data into its own `threadLocalBins[i]`. By using the thread local bins, we also avoid using locks, which lowers the cost for synchronization. Once every thread finish its own binning, we will need to merge everyone’s local bins into the global bins, and hand over it to the `ProcessBins()` to handle them. #### Steps The following steps describe the things we need to get done in the `BinTriangles()`. 1. **Initialization.** Assume the screen is partitioned into `gridWidth` × `gridHeight` tiles. - Global bins. It will be a vector of vectors of `ProjectedTriangles`, with size `gridWidth` × `gridHeight`. - Thread local bins. It will be a vector of vector of vectors of `ProjectedTriangles`, because it will keep track of all the local bins structures for all the threads. i.e. `threadLocalBins[threadId]` return the local bins for the `threadId`-th thread. 2. **Thread local binning.** Each thread will only handle a portion of the triangles in the scene. They will put their own triangles into the `threadLocalBins[threadId]` according to the triangles’ coverages. Notice that this also indicates that the majority of the bins in a `threadLocalBins[threadId]` will be empty, because a thread will only handle one or a few tiles. For each thread, the tiles out of their responsibility can be perceived as empty, as shown in following figure.![Frame 142 (1)](https://hackmd.io/_uploads/rycQFXs0R.png) 3. **Merge into global bin.** After all the threads finish binning, we will merge all the local bins in `threadLocalBins` into a global `bins`, where triangles are categorized by tiles. For each tile, we will merge the triangle sublists of all the threads. In these sublists, the triangles live in this tile. 4. **Process the global bin.** Finally we traverse the global bins, and handle each triangle in the tiles. Since each tile is handled independently, we will run `ProcessBin()` on multiple threads. ### Binning Triangles #### Local Bins Recall that in the BinTriangles(), we will handle the triangles in the tiles, of which the thread is in charge. ### Processing the Bins The next thing we want to do is to process the bins in the global bins. For each tile, we will call a `ProcessBin()` to rasterize the fragments in the bin. Notice that in a tile-based renderer, parallelism happens among tiles, instead of fragments. Processing within a tile is carried out sequentially. Thus, we will call `ShadeFragment()` to shade a single quad fragment at a time, rather than call `ShadeFragments()` as was done on the full fragment buffer in the non-tiled implementation. #### Frame Buffer Tiles Instead of writing directly into the frame buffer, we also update the color and depth into the frame buffer. One advantage of a tiled renderer is that an entire frame-buffer tile will fit in cache. For maximum performance we will write results to tile-major frame buffer structure while processing the tile, and then copy the results back out to the renderer’s frame buffer at the end. To do this, we will also initialize the tiled frame buffers in the initialization. #### Adaptions Since the rasterization happens at a granularity of quad fragments, the loop runs on triangles in the tile, and loops over each of the four fragments. We will call the `RasterizeTriangle()` we wrote in Section 3.2, with a region size specified to be the size of the sub-frame buffer for the current tile. Here are some technical challenges that we may need to solve: - Since we directly shade the fragments (instead of shading the fragments in the fragment buffer), we will call `ShadeFragment()` instead of `ShadeFragments()`. Looking at the interface of `ShadeFragment()`, we will need to know which thread handles the triangle being processed. Otherwise, we will use the wrong vertex output buffer or index output buffer. - Since we write into tiled frame-buffer, we need to carefully treat the indexing. - Specifically, there is a `triId` parameter in the interface of `ShadeFragment()` – this is not the member variable in the `ProjectedTriangle` class. This is the index of the loop when in the `BinTriangles()` when we binned this triangle. To solve these issues, we need to carry and propagate these information. To facilitate the processing, we will maintain a vector of BinnedTriangle structs, where the structure keeps the thread ID, the loop index, and the projected triangle. ### Finish After the processing, we will need to write the results back to the frame buffer.