This is the documentation for the Assignment 1: Implementing a Parallel Sort-Middle Tiled Renderer of CMU 15-473/673 Visual Computing System. The task it to implement a few missing parts of the rasterizer and to parallelize the renderer using a sort-middle tiled approach (a.k.a. a tiled-based renderer).
For academic integrity reason, most of the codes for this project will be hidden.
If for any reason you would like to get the access to the codebase, please contact Tony at linghent@andrew.cmu.edu or lockbrains@gmail.com.
The first step in the assignment is to fix the renderer by implementing a triangle rasterizer (a quad fragment generator). I will do this by implementing two functions currently stubbed out in Rasterizer.h: RasterizeTriangle()}
and TriangleSIMD::TestQuadFragment()
.
__m128
InstructionsIn this assignment, we will pass the coordinates using the Intel SIMD (Single Instruction, Multiple Data) data type, specifically we will be using __m128
. It is a data type specific to Intel's SIMD instruction set, particularly used in the Streaming SIMD Extensions (SSE) family of instructions. It represents a 128-bit-sized register that can store and operate on four 32 floating-point values simultaneously.
To handle __m128
, we will need to use specific API to do basic arithmetic operations. Specifically in this assignment, we will need the following API's to work around.
We can also handle comparisons and bit operations using SIMD instructions.
In this assignment, the vertex positions will be stored in a fixed-point representation with 4 bits of subpixel precision, and this is called N.4 format of floating point values - bits for integer and 4 bits for decimal.
For example, a binary expression 1001.0110
can be understood as:
1001
= 9.0110
= = 0.375So this number is 9.375.
N.4 format is widely used in GPU to reduce the cost of floating-point computation. Furthermore, one of the most important operations of N.4 format should be rounding. Using the definition, for a binary notation of a N.4 format floating-point number , we can use bitwise shift to do floor and ceil operations.
When we want to rasterize the scene, we will make a call to the RasterizeTriangle()
function. This function accepts a region of the output image and a triangle as input. It then identify all pixel chunks within this screen region that may be at least partially covered by this input triangle.
During this process, we will need to determine if the triangle actually covers samples in this pixel block – this is called the triangle coverage test. This part is done with the function TestQuadFragment()
– it accepts the coordinates of four screen points as input, corresponding to the pixels in this pixel block. Its output is a bitmask indicating which of these points lies within the triangle, as shown in the following figure.
Since a __m128
register can save upto 4 integers, we will use two registers for storing the and coordinates for the four samples in the quad fragments (in N.4 fixed-point format as stated in the comments).
To test if a pixel is covered by the triangle, we use the Edge Equation. Label the -th vertex of the triangle as . For a query point , we first calculate
The Edge Equation the -th edge is computed by
The necessary and sufficient conditions for the point to pass the inside-triangle test is
i.e. the Edge Equations have to be samely signed. In case a point falls exactly on the edge (i.e. = 0), we check if the edge is owned by the current triangle. We will be able to check this by accessing isOwnerEdge[i]
. If isOwnerEdge[i]
is true, it means this edge is owned by the triangle, so we consider all the points on this edge in the triangle.
We will use the Edge Equation in the mask calculation in the function TestQuadFragment()
.
Now that we have the function TestQuadFragment()
which returns a bit mask indicating if the sample for each fragment is covered, we will be able to rasterize the triangle. We will do this in the RasterizeTriangle()
method.
The way we rasterize a triangle can be summarized as find all the candidate pixels, and test each pixel for coverage by using TestQuadFragment()
we have already written. We will take as input,
regionX0, regionY0
: the top-left corner coordinate of current working tile.regionW, regionH
: the width and height of the current working tile in pixels.tri
: set up triangle equations, where we keep the coordinates of the corners of the triangle.triSIMD
: all values of tri
in SIMD registers.processQuadFragmentFunc
: A function value – for every quad fragment that may generate coverage, this method should call this function.Now we will start the implementation of a sort-middle tiled parallel renderer. The tiled renderer will have two phases:
The process can be shown as the Figure 2.
In tile-based rendering, we partition the screen into smaller tiles, which we call them bins
. Each tile will be responsible for processing the triangles inside. Usually, in order to improve the performance, we will classify triangles by the tiles they covered. The process is called binning.
A intuitive way to keep track of the bins is to use a vector of tiles, where each tile is a vector of projected triangles, let’s call it bins. However, when binning is processed in a multi-thread context, multiple threads may handle triangles simultaneously, and place them into corresponding bins. If they write to the global bins
directly, it may result in:
To solve the above issues, we introduce the thread local bins, maintained by the data structure, which we call it the threadLocalBins
. Now, each thread has its own independent data structure for maintaining bins. When a thread i
is binning, it writes data into its own threadLocalBins[i]
. By using the thread local bins, we also avoid using locks, which lowers the cost for synchronization.
Once every thread finish its own binning, we will need to merge everyone’s local bins into the global bins, and hand over it to the ProcessBins()
to handle them.
The following steps describe the things we need to get done in the BinTriangles()
.
gridWidth
× gridHeight
tiles.
ProjectedTriangles
, with size gridWidth
× gridHeight
.ProjectedTriangles
, because it will keep track of all the local bins structures for all the threads. i.e. threadLocalBins[threadId]
return the local bins for the threadId
-th thread.threadLocalBins[threadId]
according to the triangles’ coverages. Notice that this also indicates that the majority of the bins in a threadLocalBins[threadId]
will be empty, because a thread will only handle one or a few tiles. For each thread, the tiles out of their responsibility can be perceived as empty, as shown in following figure.threadLocalBins
into a global bins
, where triangles are categorized by tiles. For each tile, we will merge the triangle sublists of all the threads. In these sublists, the triangles live in this tile.ProcessBin()
on multiple threads.Recall that in the BinTriangles(), we will handle the triangles in the tiles, of which the thread is in charge.
The next thing we want to do is to process the bins in the global bins. For each tile, we will call a ProcessBin()
to rasterize the fragments in the bin. Notice that in a tile-based renderer, parallelism happens among tiles, instead of fragments. Processing within a tile is carried out sequentially. Thus, we will call ShadeFragment()
to shade a single quad fragment at a time, rather than call ShadeFragments()
as was done on the full fragment buffer in the non-tiled implementation.
Instead of writing directly into the frame buffer, we also update the color and depth into the frame buffer. One advantage of a tiled renderer is that an entire frame-buffer tile will fit in cache. For maximum performance we will write results to tile-major frame buffer structure while processing the tile, and then copy the results back out to the renderer’s frame buffer at the end.
To do this, we will also initialize the tiled frame buffers in the initialization.
Since the rasterization happens at a granularity of quad fragments, the loop runs on triangles in the tile, and loops over each of the four fragments. We will call the RasterizeTriangle()
we wrote in Section 3.2, with a region size specified to be the size of the sub-frame buffer for the current tile.
Here are some technical challenges that we may need to solve:
ShadeFragment()
instead of ShadeFragments()
. Looking at the interface of ShadeFragment()
, we will need to know which thread handles the triangle being processed. Otherwise, we will use the wrong vertex output buffer or index output buffer.triId
parameter in the interface of ShadeFragment()
– this is not the member variable in the ProjectedTriangle
class. This is the index of the loop when in the BinTriangles()
when we binned this triangle.To solve these issues, we need to carry and propagate these information. To facilitate the processing, we will maintain a vector of BinnedTriangle structs, where the structure keeps the thread ID, the loop index, and the projected triangle.
After the processing, we will need to write the results back to the frame buffer.