HART Notes - HackMD

# HART Notes * dedicated ray-tracing hardware for BVH updates and ray traversal * BVH reconstruction on CPU * the traversal unit in the raytracing hardware performs both BVH traversal and ray-primAABB intersection tests * we maintain two consecutive sets of shape data (node and primAABB) in an L1 cache block to reuse the data in the next iteration ## Big Picture ![](https://i.imgur.com/CcuRKca.png) ## Traversal Scheme Using PrimAABBs * reusing the primAABBs for traversal * Originally, primAABBs are used for BVH building and BV refitting. We maintain primAABBs data after BV refitting and reuse them for traversal. When a ray reaches a leaf node, we perform a ray-primAABB test using the existing traversal unit before sending the ray to the intersection (IST) unit ## Traversal and Intersection Unit ![](https://i.imgur.com/c7amKv5.png) * ray data transmission between the T&I units and the programmable shaders use small FIFO buffers to reduce memory traffic. * The ratio of TRV units to IST units is 16:1, which is different from that in the previous fixed ray-tracing pipelines. The reason for this is that ray-primAABB tests in TRV units minimize the number of ray-triangle intersection tests ### RD * The RD gets rays from the programmable shaders and dispatches the rays to the TRV units * The RD also calculates the inverse direction vector for TRV units ### TRV * The TRV units perform both BVH traversal and ray-primAABB intersection tests * Each TRV unit includes a ray-AABB intersection routine, stack memory, a ray accumulation buffer, and an L1 cache * The ray-AABB intersection calculation part is fully pipelined and it consists of six floating-point adders, six floating-point multipliers, and 13 floatingpoint comparators to achieve a throughput of one ray-AABB intersection test per cycle * 使用 [Restart trail for stackless BVH traversal](https://dl.acm.org/doi/pdf/10.5555/1921479.1921496) * shadow ray 使用 [SATO: Surface Area Traversal Order for Shadow Ray Tracing](http://gamma.cs.unc.edu/SATO/SATO_files/sato_preprint.pdf) #### FSM ![](https://i.imgur.com/LItohSK.png) * STAT_TRV: initial traversal stage to fetch data. If the parent node is an inner node, we fetch the child node’s AABB and the next state is set to STAT_LCHD to visit the left child node. If the parent node is a leaf node, the next state is STAT_PRIM. If the traversal is finished, the next state is STAT_SHD * STAT_LCHD: performs the left child traversal. After the ray-AABB intersection test, the next state is set to STAT_RCHD * STAT_RCHD: performs the right child traversal. After the ray-AABB intersection test, the next state is set to STAT_TRV_POST * STAT_TRV_POST: determines the next visit node. The next state is set to STAT_TRV * STAT_PRIM: performs ray-primAABB intersection tests. If the ray passes the test, the next state is STAT_IST. Additionally, if there are remaining primAABBs for further intersection in the leaf node, the processing is iterated with the current state (STAT_PRIM). When we find the hit point of an occluded ray or have visited all primAABBs in the leaf, we change the state into STAT_TRV POST * STAT_IST: passes the ray into the IST unit * STAT_SHD: passes the ray into the shaders when the final hit point of the ray is found ### IST * Each IST unit consists of 11 floating-point adders, 11 floating-point multipliers, one reciprocal unit, and four floating-point comparators * The IST unit, like the TRV unit, includes an L1 cache and a ray accumulator buffer * The ray-triangle intersection unit (IST) is based on [Wald’s algorithm](http://www.sci.utah.edu/~wald/PhD/wald_phd.pdf) * this algorithm has the lowest cost of all ray-triangle intersection algorithms for hit triangles * Because of the increased possibility that the ray could hit the triangles in a leaf node after filtering of the ray-primAABB scheme, Wald’s algorithm is a good choice for our architecture * Additionally, precomputation-based algorithms such as Wald’s algorithm can be used to design an effective H/W intersection unit; consecutive memory access to precomputed data can simplify cache configuration and increase pipeline utilization compared to the Moller-Trumbore algorithm, which requires one index and three vertices ## Cache-Data Reuse Scheme * The block size of an L1 traversal cache is 64 bytes, so two sets of BVH node data or primAABB data can be stored in a cache block * In the case of BVH node data, left and right child nodes are stored consecutively * PrimAABBs in a leaf node are also stored consecutively * Therefore, we can reuse the cache-block data for the next iteration after the data are obtained * From the L1 cache, we obtain two sets of shape data (node or primAABB data) in an entire cache block and continuously maintain these data in the next pipeline stages * After dozens of cycles, when the ray comes back to the top of the traversal pipeline for the next iteration, we can reuse the shape data. If the required shape data exist in the maintained cache-block data, a cache access for the shape data is bypassed and the processing of the ray is treated as a cache hit