T&I Engine Notes

# T&I Engine Notes ## Big Picture ![](https://i.imgur.com/h5Pz3Qm.png) * 使用 fixed pipeline * Fixed logic can achieve higher performance than programmable logic through reduction of unused resources and the removal of instruction fetching and decoding * TRVs, LISTs, and IST1s require data fetching. For efficient memory access, we assigned a level-one (L1) cache memory and a ray accumulation buffer to each pipeline * Three level-two (L2) caches for TRVs, LISTs, and IST1s between off-chip memory and L1 caches reduce memory traffic * Ray generation and shading are processed by programmable shaders. Use of external DRAM as the interface between the T&I and the shader would be a bandwidth burden to the memory system. Hence, we configure on-chip FIFO (First In, First Out) buffers to store 128 rays for both input (ray generation to T&I) and output (T&I to shading), respectively. ### Ray Dispatcher * An RD gets a ray from the programmable shader and transfers the ray to the TRVs * The RD first calculates inverse direction vectors for traversal * It then performs an intersection test between the ray and scene’s axis-aligned bounding box (AABB) and gets the scene-Max value (the maximum t distance in the scene) * Because only one ray can be supplied for each cycle, rays are supplied to the TRVs in a FIFO manner ### TRV * each has four-entry shortstack memory to perform traversal * It uses a small, n-entry stack to maintain the last n pushes. If the stack is empty, the ray restarts the traversal on the restart node determined by the push-down method. * [Interactive k-D Tree GPU Raytracing](https://graphics.stanford.edu/papers/i3dkdtree/gpu-kd-i3d.pdf) ![](https://i.imgur.com/N66QY8W.png) * one FP adder, one FP multiplier, and three FP comparators ### LIST * searches the primitive list in a leaf node ### IST1 * ray-plane test & barycentric test * seven adders, seven multipliers, three comparators ### IST2 * compute the final $u$, $v$, $t$ values of the hit point * The main goal of separating IST2 is a reduction in area-expensive reciprocal units. IST1 does not require reciprocal operations * one reciprocal unit, three multipliers ### Data flow ![](https://i.imgur.com/MY2v5Ul.png) ## Ordered Depth-First Layout * reduce memory bandwidth * In kd-trees, a compact eight-byte depth-first layout (DFL) [Pharr and Humphreys 2010] is widely used for ray tracing because of its small memory footprint and the depth-first search manner of ray traversal. In this layout, the parent node and the left child node are adjacent, so they have parent-child locality. Therefore, this layout has only a single pointer for the right child node. In the DFL, the arrangement criterion of child nodes is a geometric position. We improve the DFL by utilizing parent-child locality. * In contrast to the original DFL, the ODFL uses surface area rather than the geometric position to arrange child nodes. This means that the child node with the larger surface area is stored next to its parent node * Because the probability of a ray intersecting with a node is proportional to its surface area, the probability that a ray accesses the same cache line also increases ![](https://i.imgur.com/CE6OIEc.png) * to access $k$, we need to access three cache lines with the original DFL. In contrast, we only need to access one cache line with the ODFL ## Three-phase Ray-triangle Intersection Test * Early exit: phase 1 或 phase 2 沒有被相交的光線不必進到 phase 3，可以減少 computation 與 memory 請求 * Phase 1 和 phase 2 使用相同的 module 因為他們使用相似的計算，並且需要 data fetching * Phase 3 不需要 data fetching，但是有倒數運算，因此把 IST2 獨立出來 ![](https://i.imgur.com/Wad3JWs.png) * the early t version of Shevtsov et al.’s method [2007] has the lowest costs for non-intersecting triangles and Wald’s method [2004] has the lowest costs for intersecting triangles ### Phase 1: Ray-plane Test * determines whether the ray instersects with the triangle within the $t$ intervals ![](https://i.imgur.com/DX9lIhL.png) * $nu$, $nv$, and $np$ are normal data. $pu$ and $pv$ are vertex data. $ci$ is the projection axis. $e0u$, $e0v$, $e1u$, and $e1v$ are edges data. $ou$, $ov$, and $ow$ are the ray’s origin. $du$, $dv$, and $dw$ are the ray’s direction * Our strategy is also effective for reducing memory access. First, we assumed that the 2-bit $ci$ value could be embedded into other values. We then only allotted 16 bytes ($np$, $nu$, $nv$, and $pu$) for a ray-plane test. If the ray does not pass this test, then fetching of other data for the barycentric test in the triangle is not required. ### Phase 2: Barycentric Coordinate Test * determine whether the hit point is inside or outside of the triangle ![](https://i.imgur.com/GLzmFhL.png) ### Phase 3: Final Hit Point Calculation * calculate the final $t$, $u$, and $v$ ![](https://i.imgur.com/LyF2Tt2.png) ## Ray Accumulation Unit * memory latency hiding * If a ray causes cache miss, it should wait until the shape data is fetched. In this case, the ray is stored in a small ray accumulation (RA) buffer. Other rays can be processed during this period, so the long memory latency can be effectively hidden. * the rays that reference the same cache line are accumulated in the same row in an RA buffer. When the shape data requested by the accumulated rays is fetched from the L2 cache, these rays with the shape data are transferred from the RA buffer to the operation pipeline. Note that a non-blocking cache is needed to support our method. ![](https://i.imgur.com/EKDfsGr.png) * Each RA unit is located between an input buffer, an L1 cache, and an operation pipeline (e.g., TRV, LIST, and IST1) and fetches shape data from L2 cache/memory. * 這部份還沒看完，改天再看 ## Simulation Results and Analysis * 這部份還沒看，改天再看