# Intersection Prediction for Accelerated GPU Ray Tracing
###### tags: `GPUs`
###### paper origin: MICRO 2021
###### papers: [link](https://people.ece.ubc.ca/aamodt/publications/papers/liu.micro2021.pdf)
# 1. Introduction
## 1.1 Motivation
* The AO workload has high computational requirements because each intersection point must be sampled with many rays.
* Repeated BVH Node Accesses in Figure 1 (left) form around 88% of memory accesses.
* Since they are not part of the final ray intersection computation, there is an opportunity to predict the traversal and skip them.
* Simply using a cache would require a prohibitively large structure to achieve similar speedups seen in Figure 1 (right) due to the large working set size.

## 1.2 Contribution
* They propose and evaluate a ray intersection predictor module that speculatively skips ray-box intersection tests in a BVH tree traversal and proceeds directly to nodes deep in the AS.
* They introduce a warp repacking extension in the predictor that provides similar work distribution amongst threads in a warp and creates memory access coalescing opportunities.
* We model a detailed baseline ray tracing unit in the cyclelevel GPU simulator GPGPU-Sim correlated with the NVIDIA RTX 2080Ti.
# 2. Background
## 2.1 Ray Tracing
* Rays are mathematically characterized as a semi-infinite line of o + t · d with an origin, direction, and length
* Rays with similar origins (o) and directions (d) take similar paths through the AS
* there is an opportunity to memoize traversal and optimize future rays in the frame
## 2.2 Ambient Occlusion
* Ambient occlusion is a visual effect where crevices appear darker because less ambient light can reach them
* Ray tracing produces AO by tracing many short rays covering a hemisphere that originate from the point of interest to determine the amount of occlusion
* Any rayobject intersection in AO indicates that the point is shadowed from some ambient light. The proportion of rays that intersect objects represents the amount of blocked ambient light

## 2.3 Ray Traversal Algorithm

* The outer while loop iterates until the ray has completed its traversal.
* The inner while loops perform a depth-first traversal through the BVH
* If the first inner while loop reaches a leaf node, then a second inner while loop tests for primitive intersections
# 3. RAY INTERSECTION PREDICTOR ALGORITHM

We can estimate the number of BVH nodes skipped by a predictor as follows.
Terms for the predictor :
* A ray **hits** if it intersects the scene, regardless of whether the predictor is used. Only rays that hit can skip BVH nodes
* Rays are **predicted** if a prediction is found in the predictor table
* Rays are **verified** if the ray finds an intersection using the prediction
* Rays are **mispredicted** if the ray is predicted but not verified
A ray must, on average :
* fetch **n** nodes on a full traversal,
* evaluate **k** predictions from the predictor entry,
* fetch **m** nodes when traversing from each of the predictions
* ***p*** : percentage of all rays that are predicted
* ***v*** : percentage of all rays that are verified
All rays can be divided into 3 kinds of rays :
1. (1 − p)% of rays are not predicted and traverse n nodes.
2. v% of rays are verified and traverse only km nodes.
3. (p − v)% of rays are mispredicted and must traverse the predicted nodes as well as the full traversal, for a total of km + n nodes.
the number of nodes **N** that an average ray traverses can be estimated as:
$$
N = (1 − p)n + vkm + (p − v)(km + n) \\ = n − pn + vkm + pkm + pn − vkm − vn \\ = n + pkm − vn
$$
As a result, the **number of nodes skipped n − N** is:
$n − N = vn − pkm$
* vn : nodes skipped from verified rays
* pkm : the overhead of verifing the predicted nodes
As a result, the number of nodes skipped is increased by having a high verified rate v and decreased by overpredicting (increasing p) or traversing more nodes on predictions (increasing k or m).
# 4. PROPOSED ARCHITECTURE
## 4.1 Predictor Table Architecture

* Node predictions are stored in a predictor table per SM
* Each row in a set-associative way is a predictor entry consisting of a valid bit, a ray hash tag, and one or more slots to store predicted nodes
* a 4-way set-associative predictor table, with 1024 total entries, one node per entry, and 15 bits for the tag, performs the best
* When accessing the predictor, the ray parameters are hashed and the resulting hash pattern is employed for predictor lookup.
* The hash pattern is used to index the table and compared against tags in all ways for the selected row.
* If a tag match is found, the ray is considered predicted and the corresponding node addresses are returned for verification.
* Verifying a ray prediction involves traversing the BVH tree starting from the predicted node.
* increase the number of nodes per entry can increase the number of verified rays, each ray must evaluate more nodes
* vn will increase but pkm (k) increase
## 4.2 Hashing
* They similarly encode the ray origin and direction into a single value, but rather than using it to sort rays, they use it to index the predictor table.
* The goal is to maximize predictor table collisions between similar rays while minimizing collisions between different rays.
### 4.2.1 Grid Spherical

* This hash function combines the quantized ray origin and direction in cartesian coordinates and spherical coordinates, respectively
* The ray origin components x,y, z are each mapped to the integer range \[0, 2^n) using the scene bounding box, which is represented by the two extreme corners
* These are then concatenated to a single value. They refer to this as the Grid Hash block
* Likewise, the ray direction components θ ∈ \[0, 180),ϕ ∈ \[0, 360) are quantized by discretizing to an integer and extracting the most significant m bits from integer θ, and m + 1 bits from integer ϕ, before being concatenated.
* The two values are combined using bitwise-xor.
* The choice of n (the number of bits used for the origin) and m (the number of bits used for the direction) control how tight the hash function is.
* using more bits will cause fewer rays to map to the same entry.
### 4.2.2 Two Point

* The ray origin is hashed with the same method as in Grid Spherical.
* The estimated target intersection point is computed as t = o + r · l · d, where t is the target point, o is the origin, d is the normalized direction, l is the length of the maximum extent of the scene bounding box, and r is a fixed estimated length ratio to be chosen.
* This estimated target point is then passed through the Grid Hash block and xor-ed with the hashed origin.
They find that using the Grid Spherical hash function produces slightly better results than using Two Point. They use five origin bits and three direction bits, resulting in a ray hash of 15 bits.
## 4.3 Go Up Level

* They observed that such rays will often intersect different leaf nodes, but similar ancestor nodes of the leaves
* Go Up Level: an optimization in which instead of predicting leaf nodes, they predict interior nodes covering a larger region of space
* the BVH level of the prediction relative to the leaf node with the intersected primitive.
* with a Go Up Level of k, the k-th ancestor of the leaf node is stored in the predictor table to be used as a prediction for similar future rays.
* higher Go Up Level => higher probability of verifying a ray
* however, for each prediction, the ray must then traverse a larger portion of the BVH tree to verify the prediction and find an intersection.
* This tradeoff is presented in Equation 1; increasing the Go Up Level increases v, but also m. they find a Go Up Level of around 3 works best.
## 4.4 Warp Repacking

* Mispredicted rays access more nodes than without the predictor, introducing a “long tail” problem and prolonging warp execution
* All other threads must wait for Thread 5 to finish before the warp can complete
* Thread 5 mispredicts and executes intersection tests with the predicted primitives before performing regular BVH traversal.
* There are three possible outcomes:
* an intersection is predicted correctly (Threads 1, 4, 7, 8),
* mispredicted (Thread 5),
* not predicted at all (Threads 2, 3, 6)
* keep those rays together which have same condition
### 4.4.1 Implementation

* They perform repacking in the predictor unit. After threads have completed their predictor table lookup, the **predicted rays are removed from the warp** and not predicted rays remain
* This structure only stores the ray IDs of predicted rays until it fills up with 32 rays or reaches a timeout, at which point they are queued for traversal.
* We only track ray IDs because the remainder of the data associated with each ray is stored in the ray buffer, indexed by the ray ID. We impose a short timeout to ensure timely processing when there are insufficient rays.
# 5. METHODOLOGY
## 5.1 Ray Tracing Unit
In the context of ray tracing, the entire traversal and intersection process can be treated as a single ``__traceray()`` instruction like the NVIDIA RT Core Ray Query
we model a ray tracing unit consisting of a traversal block and a ray-triangle intersection block similar to the NVIDIA RT Core
The RT unit is accessed using the special ``__traceray()`` instruction that passes in all necessary ray data and the root node of the AS. The instruction is intercepted from normal GPU execution, redirected to execute in the RT unit, and the results are written back.
### 5.1.1 Interface
Our CUDA ray tracing kernel traces one ray per thread. The ray data is retrieved using the thread ID and includes origin, direction, and t-parameters that define the ray length. This data, along with the root node of the BVH tree, is passed into the RT unit for each thread. The RT unit stores the data in the ray buffer, indexed by the ray ID,
The RT unit executes up to eight warps at once, so the ray buffer stores a maximum of 32 × 8 = 256 rays.
### 5.1.2 Traversal Unit
Every new ``__traceray()`` command received triggers the controller to initialize a traversal stack. We configure the traversal stack to accomodate eight entries, which is sufficient for many traversals
When the data returns, the node is broadcasted to the ray buffer. Any matching entries trigger an intersection test, and the ray and node data are forwarded to the intersection unit.
### 5.1.3 Intersection Unit
The intersection unit accelerates two types of intersection tests, the ray-box test while traversing through BVH nodes and the ray triangle test once a leaf node is. There are 32 pipelined hardware units for each test type, such that all 32 threads in a warp can compute their ray intersection in parallel.
# 6. Result


* These results show that the predictor reduces memory accesses by 13% and overall execution time by a geometric average of 26%.
* Sorted rays benefit less from the predictor because similar rays are traced close together and do not have an opportunity to train the predictor

* Approximately 7% of energy is saved using the predictor