# PLOCTree
A Fast, High-Quality Hardware BVH Builder
---
## Background
<!--
Let's start with some basic concepts which are required in order to understand the paper.
-->
---
### Bounding Volume Hierarchy (BVH)
Accelleration data structure for ray tracing.
<!--
In case you ̶d̶o̶n̶'̶t̶ ̶k̶n̶o̶w̶ ̶a̶l̶r̶e̶a̶d̶y̶ are stupid, BVHs are kind of needed for raytracing...
... and this is relevant
... because raytracing is f***ing cool.
-->
---
<!--
Paper: "Automatic Creation of Object Hierarchies for Ray Tracing"
Authors: Jeffrey Goldsmith and John Salmon
Year: 1987
-->

<!--
So you use volumes like Axis Alligned Bounding Boxes (AABB:S) to enclose geometry primitives. AABB:s are one really common as the ray intersection test is really fast and easy to compute.
-->
---

<!--
The next step is to construct a hierachy of such boxes.
-->
---

<!--
This results in a tree of AABB:s which is used as accelleration data structure for the ray intersection tests reducing the cost form O(n) to O(logn).
-->
---
### Surface Area Heuristic (SAH)
<!-- Because optimal BVH construction is NP-hard. -->
Estimates the likelihood of intersection with a given tree node based on the surface area of its bounding box.

<!-- assumption: keeping thing close together in the same boundiong box is good -->
---
### Morton Code

<!--
morton order maps two dimentions into a single dimention (array) while preserving approximate distancing
also, easy to compute from binary coordinates
-->
---
### Streaming algorithm
<!-- maybe move in backgound? -->
From [wikipedia](https://en.wikipedia.org/wiki/Streaming_algorithm#Data_stream_model): "streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes"
---
### BVH builders
---
#### SAH-based
Top-down construction via "SAH sweeps"
**PRO**: High quality
**CON**: Costly
---
#### LBVH-based
Sort triangles according to their Morton Codes, and then produces a hierarchy based on that
**PRO**: Linear cost
**CON**: Inaccurate
---
#### Parallel Locality Ordered Clustering (PLOC)
Parallel, bottom-up BVH construction algorithm
Uses Morton Code order for efficient neighboring distance computation
---
##### The PLOC algorithm steps:
1. Assign each triangle a Morton Code based on its centroid
2. Sort triangles in Morton code order
3. Apply multiple "PLOC sweeps" <!-- iterative approach -->
---
##### PLOC sweeps


----
##### PLOC sweeps


----
##### PLOC sweeps


----
##### PLOC sweeps


---
##### Nearest Neighbors using Morton Code ordering

<!-- Illustration of the nearest neighbor search for r ¼ 2. Two clusters (red triangles) search for their nearest neighbors (blue triangles) in the neighborhood (red curve). Notice how the algorithm adapts to the den- sity of clusters in the neighborhood. -->
---
## PLOCTree
<!--
* Hardware implemented version of PLOC
* Compare with other HW implementations
* Compare with GPU PLOC
-->
---
### Algorithm adaptation
Changes to the PLOC algorithm have to be made on order for an hardware implementation
---
#### Basic, serial PLOC
<!-- should'n this be "Serial PLOC" ... -->

* All the data is required at start (line 1-3)
* Runs kernel multiple times
---
#### Streaming, hardware-oriented PLOC
<!-- ... and this "Streaming PLOC"? -->
* Fuses the two loops into a single one
* Exploits having a small execution window
---
### Hardware implementation
* Pipeline for a fixed-radius PLOC sweep
* System made out of several pipelines
---
#### PLOC sweep pipeline

---
#### System architecture

---
#### Evaluation
* RTL level in SystemVerilog
* Various scenes <!-- Fifteen test scenes, verify from RTL as hexdump and render using it-->
* Performance metrics <!-- SAH cost of output trees -->
* GPU PLOC and on 1080 Ti <!-- memory traffic extracted -->
---
## Results
---
<!-- 1-3W mobile gpu, nothing left for rendering -->
<!-- Run at full power on desktop -->
* PLOCTree consumes 1.4–1.9W
* 7% slower than LBVH
* More power/energy consumtion than LBVH
* 5 x faster with 3 x less bandwith usage, 5 x less silicon than "binned SAH builder"
* 4 x faster and 8 x less bandwidth than CUDA PLOC implementation
---

---

---
## Shortcomings
* Shaky performance metrics, not accurate - is performance really better than LBVH?
* Not acounting for newer memory or on-chip memory technology
* Limited number of pipelines, small evaluation scope
* Justifcation for mobile platform
###### tags: `chalmers`
{"metaMigratedAt":"2023-06-15T07:58:40.154Z","metaMigratedFrom":"Content","title":"PLOCTree","breaks":false,"contributors":"[{\"id\":\"807f00c8-2fd2-461f-8f36-6208c1dfcf36\",\"add\":9805,\"del\":6234},{\"id\":\"cabcfdb6-b825-4a88-a8c5-d9a8a8e815b7\",\"add\":622,\"del\":854},{\"id\":\"2d395e15-de4c-4b46-8e91-90d30f216231\",\"add\":2183,\"del\":190}]"}