Try   HackMD

VCS #1 Tile-based Renderer

This is the documentation for the Assignment 1: Implementing a Parallel Sort-Middle Tiled Renderer of CMU 15-473/673 Visual Computing System. The task it to implement a few missing parts of the rasterizer and to parallelize the renderer using a sort-middle tiled approach (a.k.a. a tiled-based renderer).

For academic integrity reason, most of the codes for this project will be hidden.

If for any reason you would like to get the access to the codebase, please contact Tony at linghent@andrew.cmu.edu or lockbrains@gmail.com.

Data Handling

The first step in the assignment is to fix the renderer by implementing a triangle rasterizer (a quad fragment generator). I will do this by implementing two functions currently stubbed out in Rasterizer.h: RasterizeTriangle()} and TriangleSIMD::TestQuadFragment().

SIMD and __m128 Instructions

In this assignment, we will pass the coordinates using the Intel SIMD (Single Instruction, Multiple Data) data type, specifically we will be using __m128. It is a data type specific to Intel's SIMD instruction set, particularly used in the Streaming SIMD Extensions (SSE) family of instructions. It represents a 128-bit-sized register that can store and operate on four 32 floating-point values simultaneously.

To handle __m128, we will need to use specific API to do basic arithmetic operations. Specifically in this assignment, we will need the following API's to work around.

#include <xmmintrin.h> int main () { // addition __m128 result_addition = _mm_add_ps(1.0f, 2.0f); __m128i result_addition_int = _mm_add_epi32(1,2); // subtraction __m128 result_subtraction = _mm_sub_ps(1.0f, 2.0f); __m128i result_subtraction_int = _mm_sub_epi32(1,2); // multiplication __m128 result_mul = _mm_mul_ps(1.0f, 2.0f); __m128i result_mul_int = _mm_mullo_epi32(1, 2); // division __m128 result_div = _mm_div_ps(1.0f, 2.0f); // set to zero __m128 zero = _mm_setzero_ps(); // set to value __m128 one = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f); }

We can also handle comparisons and bit operations using SIMD instructions.

#include <xmmintrin.h> int main () { // Compare Equal __m128 result_equal = _mm_cmpeq_ps(1.0f, 1.0f); __m128i result_equal_int = _mm_cmpeq_si128(1,1); // Compare Less Than __m128 result_lessThan = _mm_cmplt_ps(1.0f, 2.0f); __m128i result_lessThan_int = _mm_cmplt_si128(1,2); // Compare Greater Than __m128 result_greaterThan = _mm_cmpgt_ps(1.0f, 2.0f); __m128i result_greaterThan_int = _mm_cmpgt_si128(1,2); // Bitwise And __m128 result_and = _mm_and_ps(1.0f, 2.0f); __m128i result_and_int = _mm_and_si128(1,2); // Bitwise Or __m128 result_or = _mm_or_ps(1.0f, 2.0f); __m128i result_or_int = _mm_or_si128(1,2); }

N.4 Format Handling

In this assignment, the vertex positions will be stored in a fixed-point representation with 4 bits of subpixel precision, and this is called N.4 format of floating point values -

n bits for integer and 4 bits for decimal.

For example, a binary expression 1001.0110 can be understood as:

  • Integer part:1001 = 9.
  • Decimal part:0110 =
    616
    = 0.375

So this number is 9.375.

N.4 format is widely used in GPU to reduce the cost of floating-point computation. Furthermore, one of the most important operations of N.4 format should be rounding. Using the definition, for a binary notation of a N.4 format floating-point number

a, we can use bitwise shift to do floor and ceil operations.
a=a>>4.a=(a+0xF)>>4.

Rasterization

Point-in-triangle Test

When we want to rasterize the scene, we will make a call to the RasterizeTriangle() function. This function accepts a region of the output image and a triangle as input. It then identify all

2×2 pixel chunks within this screen region that may be at least partially covered by this input triangle.

During this process, we will need to determine if the triangle actually covers samples in this

2×2 pixel block this is called the triangle coverage test. This part is done with the function TestQuadFragment() it accepts the coordinates of four screen points as input, corresponding to the pixels in this pixel block. Its output is a bitmask indicating which of these points lies within the triangle, as shown in the following figure.

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

Since a __m128 register can save upto 4 integers, we will use two registers for storing the

x and
y
coordinates for the four samples in the quad
2×2
fragments (in N.4 fixed-point format as stated in the comments).

Edge Equation

To test if a pixel is covered by the triangle, we use the Edge Equation. Label the

i-th vertex of the triangle as
(xi,yi)
. For a query point
q=(x,y)
, we first calculate
Ai=(yiyi+1),Bi=(xi+1xi).

The Edge Equation the
i
-th edge is computed by
Ei=Ai(xxi)+Bi(yyi).

The necessary and sufficient conditions for the point

q to pass the inside-triangle test is
(E0>0&E1>0&E2>0)||(E0<0&E1<0&E2<0).

i.e. the Edge Equations have to be samely signed. In case a point falls exactly on the edge (i.e.

Ei = 0), we check if the edge is owned by the current triangle. We will be able to check this by accessing isOwnerEdge[i]. If isOwnerEdge[i] is true, it means this edge is owned by the triangle, so we consider all the points on this edge in the triangle.

We will use the Edge Equation in the mask calculation in the function TestQuadFragment().

Rasterize Triangle

Now that we have the function TestQuadFragment() which returns a bit mask indicating if the sample for each fragment is covered, we will be able to rasterize the triangle. We will do this in the RasterizeTriangle() method.

The way we rasterize a triangle can be summarized as find all the candidate pixels, and test each pixel for coverage by using TestQuadFragment() we have already written. We will take as input,

  • regionX0, regionY0: the top-left corner coordinate of current working tile.
  • regionW, regionH: the width and height of the current working tile in pixels.
  • tri: set up triangle equations, where we keep the coordinates of the corners of the triangle.
  • triSIMD: all values of tri in SIMD registers.
  • processQuadFragmentFunc: A function value for every quad fragment that may generate coverage, this method should call this function.

Sort-middle Tile-based Renderer

Now we will start the implementation of a sort-middle tiled parallel renderer. The tiled renderer will have two phases:

  • Place the triangles into bins. It will build a data structure where there is one bin per screen tile, and each bin contains a list of triangles that potentially overlap the bin.
  • Process the bins in parallel, dynamically distributing the next unprocessed bin to the next available core.

The process can be shown as the Figure 2.

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

Initialization

In tile-based rendering, we partition the screen into smaller tiles, which we call them bins. Each tile will be responsible for processing the triangles inside. Usually, in order to improve the performance, we will classify triangles by the tiles they covered. The process is called binning.

A intuitive way to keep track of the bins is to use a vector of tiles, where each tile is a vector of projected triangles, let’s call it bins. However, when binning is processed in a multi-thread context, multiple threads may handle triangles simultaneously, and place them into corresponding bins. If they write to the global bins directly, it may result in:

  • Race conditions. Multiple threads writing into the same data structure – it may cause a data disagreement or even a program crash.
  • Performance bottleneck. In order to avoid race conditions, we will need to apply locks to ensure the thread safety – but this is almost guaranteed to be slow, as it lowers the parallelism.

Thread Local Bins

To solve the above issues, we introduce the thread local bins, maintained by the data structure, which we call it the threadLocalBins. Now, each thread has its own independent data structure for maintaining bins. When a thread i is binning, it writes data into its own threadLocalBins[i]. By using the thread local bins, we also avoid using locks, which lowers the cost for synchronization.

Once every thread finish its own binning, we will need to merge everyone’s local bins into the global bins, and hand over it to the ProcessBins() to handle them.

Steps

The following steps describe the things we need to get done in the BinTriangles().

  1. Initialization. Assume the screen is partitioned into gridWidth × gridHeight tiles.
    • Global bins. It will be a vector of vectors of ProjectedTriangles, with size gridWidth × gridHeight.
    • Thread local bins. It will be a vector of vector of vectors of ProjectedTriangles, because it will keep track of all the local bins structures for all the threads. i.e. threadLocalBins[threadId] return the local bins for the threadId-th thread.
  2. Thread local binning. Each thread will only handle a portion of the triangles in the scene. They will put their own triangles into the threadLocalBins[threadId] according to the triangles’ coverages. Notice that this also indicates that the majority of the bins in a threadLocalBins[threadId] will be empty, because a thread will only handle one or a few tiles. For each thread, the tiles out of their responsibility can be perceived as empty, as shown in following figure.
    Frame 142 (1)
  3. Merge into global bin. After all the threads finish binning, we will merge all the local bins in threadLocalBins into a global bins, where triangles are categorized by tiles. For each tile, we will merge the triangle sublists of all the threads. In these sublists, the triangles live in this tile.
  4. Process the global bin. Finally we traverse the global bins, and handle each triangle in the tiles. Since each tile is handled independently, we will run ProcessBin() on multiple threads.

Binning Triangles

Local Bins

Recall that in the BinTriangles(), we will handle the triangles in the tiles, of which the thread is in charge.

Processing the Bins

The next thing we want to do is to process the bins in the global bins. For each tile, we will call a ProcessBin() to rasterize the fragments in the bin. Notice that in a tile-based renderer, parallelism happens among tiles, instead of fragments. Processing within a tile is carried out sequentially. Thus, we will call ShadeFragment() to shade a single quad fragment at a time, rather than call ShadeFragments() as was done on the full fragment buffer in the non-tiled implementation.

Frame Buffer Tiles

Instead of writing directly into the frame buffer, we also update the color and depth into the frame buffer. One advantage of a tiled renderer is that an entire frame-buffer tile will fit in cache. For maximum performance we will write results to tile-major frame buffer structure while processing the tile, and then copy the results back out to the renderer’s frame buffer at the end.

To do this, we will also initialize the tiled frame buffers in the initialization.

Adaptions

Since the rasterization happens at a granularity of quad fragments, the loop runs on triangles in the tile, and loops over each of the four fragments. We will call the RasterizeTriangle() we wrote in Section 3.2, with a region size specified to be the size of the sub-frame buffer for the current tile.
Here are some technical challenges that we may need to solve:

  • Since we directly shade the fragments (instead of shading the fragments in the fragment buffer), we will call ShadeFragment() instead of ShadeFragments(). Looking at the interface of ShadeFragment(), we will need to know which thread handles the triangle being processed. Otherwise, we will use the wrong vertex output buffer or index output buffer.
  • Since we write into tiled frame-buffer, we need to carefully treat the indexing.
  • Specifically, there is a triId parameter in the interface of ShadeFragment() – this is not the member variable in the ProjectedTriangle class. This is the index of the loop when in the BinTriangles() when we binned this triangle.

To solve these issues, we need to carry and propagate these information. To facilitate the processing, we will maintain a vector of BinnedTriangle structs, where the structure keeps the thread ID, the loop index, and the projected triangle.

Finish

After the processing, we will need to write the results back to the frame buffer.