# Vulkan-Sim: A GPU Architecture Simulator for Ray Tracing
###### tags: `GPUs`
###### paper origin: MICRO-2022
###### paper: [link](https://people.ece.ubc.ca/aamodt/publications/papers/saed.micro2022.pdf)
## 1. Background
* Ray tracing has become more prevalent in recent graphics workloads.
* By exploring ray tracing acceleration techniques, architects can improve hardware to enable more complex scenes in real time
* Architecture research is enabled by simulators, and for graphics workloads, hardware simulators model system behaviour better than software rendering tools
* Most prior graphics hardware research evaluates proposals using specially created simulators that are often not publicly available
## 2. Introduction
1. They introduce Vulkan-Sim, a detailed cycle-level simulator that enables architecture research for ray tracing
2. They extend GPGPU-Sim, integrating it with Mesa, an open-source graphics library, to support the Vulkan API, and add dedicated ray traversal and intersection units
3. They demonstrate an explicit mapping of the Vulkan ray tracing pipeline to a modern GPU using a technique they call **delayed intersection and any-hit execution**
4. They evaluate several ray tracing workloads with Vulkan-Sim, identifying bottlenecks and inefficiencies of the modeled ray tracing hardware
## 3. Need to know before this paper
### Baseline GPU Architecture

* The GPU consists of multiple compute units called SMs
* The memory partitions include memory access scheduling logic to interface with off-chip DRAM
* Threads in the application are organized into warps, a collection of 32 threads, which run together using SIMT
* To model such hardware, Vulkan-Sim models a ray tracing accelerator, called the RT Unit, in each SM
### Ray Tracing Accelerators
* RT Unit accelerates pointer chasing through a key data structure used in ray tracing, the **bounding volume hierarchy**
* RT Unit computes intersections between rays and geometry in the scene using **Box Intersection Evaluators** and **Triangle Intersection Evaluators**
* In Vulkan, **traceRayEXT** calls within a shader initiate activity on the RT Unit
* The RT Unit modeled in Vulkan-Sim is based on the T&I Engine and SGRT designs
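The Box Intersection Evaluators above essentially perform a ray-AABB slab test. A minimal scalar sketch (names and structure are illustrative; the hardware version is pipelined and tests up to six child boxes per node):

```cpp
#include <algorithm>
#include <array>

struct Ray {
    std::array<float, 3> origin, inv_dir;  // inv_dir = 1/direction, precomputed
    float t_min, t_max;                    // valid parametric interval of the ray
};

// Slab test: intersect the ray's interval with the box's interval on each axis.
// Returns true if the ray hits the box within [t_min, t_max].
inline bool ray_box_hit(const Ray& r,
                        const std::array<float, 3>& lo,
                        const std::array<float, 3>& hi) {
    float t0 = r.t_min, t1 = r.t_max;
    for (int a = 0; a < 3; ++a) {
        float tnear = (lo[a] - r.origin[a]) * r.inv_dir[a];
        float tfar  = (hi[a] - r.origin[a]) * r.inv_dir[a];
        if (tnear > tfar) std::swap(tnear, tfar);  // handle negative directions
        t0 = std::max(t0, tnear);
        t1 = std::min(t1, tfar);
    }
    return t0 <= t1;  // intervals overlap => the ray passes through the box
}
```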
### Ray Tracing With Vulkan

#### How ray tracing works
1. Primary rays originate from the camera, passing through a pixel on the image plane and into the scene to calculate that pixel's color
2. If the primary ray hits an object, we can cast a secondary ray towards the light source to determine whether the point is shadowed
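The two steps above can be sketched against a toy sphere list (all types and names here are hypothetical; a real renderer traces against a BVH, not a flat list):

```cpp
#include <cmath>
#include <optional>
#include <vector>

struct Vec3 { float x, y, z; };
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline float length(Vec3 a) { return std::sqrt(dot(a, a)); }

struct Sphere { Vec3 center; float radius; };

// Nearest intersection distance t (> eps) of ray o + t*d (d unit length).
std::optional<float> hit_sphere(Vec3 o, Vec3 d, const Sphere& s) {
    const float eps = 1e-4f;
    Vec3 oc = o - s.center;
    float b = dot(oc, d), c = dot(oc, oc) - s.radius * s.radius;
    float disc = b * b - c;
    if (disc < 0) return std::nullopt;
    float root = std::sqrt(disc);
    float t = -b - root;
    if (t < eps) t = -b + root;  // ray starts inside or on the sphere
    return t > eps ? std::optional<float>(t) : std::nullopt;
}

// Step 1: primary ray, nearest hit over the whole scene.
std::optional<float> closest_hit(Vec3 o, Vec3 d, const std::vector<Sphere>& scene) {
    std::optional<float> best;
    for (const Sphere& s : scene)
        if (auto t = hit_sphere(o, d, s); t && (!best || *t < *best)) best = t;
    return best;
}

// Step 2: shadow (secondary) ray, checking if anything sits between p and the light.
bool occluded(Vec3 p, Vec3 light, const std::vector<Sphere>& scene) {
    Vec3 to_light = light - p;
    float dist = length(to_light);
    Vec3 d = to_light * (1.0f / dist);  // normalize
    for (const Sphere& s : scene)
        if (auto t = hit_sphere(p, d, s); t && *t < dist) return true;
    return false;
}
```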
#### How ray tracing works with Vulkan

1. Data structures and functions
* The **VK_KHR_acceleration_structure** extension handles AS building and management (CPU)
* A bottom-level AS (BLAS) holds each unique object's geometry
* A top-level AS (TLAS) positions the objects within the scene via BLAS instances, each with its corresponding transformation matrix
* **VK_KHR_ray_tracing_pipeline** is responsible for the ray tracing shader stages and pipeline (CPU)
* A **vkCmdTraceRaysKHR** call is invoked to launch a ray tracing kernel on the GPU
2. Data flow
1. Each GPU thread maps to a ray and begins at the ray generation shader
2. The shader calls the traceRayEXT function to begin ray tracing
3. Ray-triangle hits are validated by the any-hit shader
4. Ray intersections with custom geometry are validated by the intersection shader
5. If the ray misses all objects in the scene, a miss shader is executed; otherwise, a closest-hit shader is executed
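The data flow above can be sketched functionally, with the shader stages reduced to callbacks (types and names are invented for illustration):

```cpp
#include <functional>
#include <optional>
#include <vector>

// A candidate hit produced during BVH traversal.
struct Hit { float t; int primitive_id; };

// Callbacks standing in for the SPIR-V shader stages.
struct Shaders {
    std::function<bool(const Hit&)> any_hit;      // return false to ignore a hit
    std::function<int(const Hit&)>  closest_hit;  // returns a payload/color
    std::function<int()>            miss;
};

// traceRayEXT, functionally: validate candidates with the any-hit shader,
// keep the nearest accepted one, then run closest-hit or miss.
int trace_ray(const std::vector<Hit>& candidates, const Shaders& s) {
    std::optional<Hit> best;
    for (const Hit& h : candidates)
        if (s.any_hit(h) && (!best || h.t < best->t)) best = h;
    return best ? s.closest_hit(*best) : s.miss();
}
```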
### Simulating Vulkan ray tracing
#### Challenges of simulating Vulkan
* **Finding a way to map the high-level ray tracing pipeline onto a programmable GPU**
* Possible solutions
* How a GPU thread can map to a ray tracing workload
1. One thread per raygen shader (the approach they use)
* Each thread executes a raygen shader, then traverses the BVH tree and calls the other shaders sequentially whenever a traceRayEXT function is encountered
* Any thread divergence in the raygen shader carries over into the other shaders
2. One thread per shader
* Each shader is executed on a separate thread. When a shader call is needed, the calling thread writes to a work queue; one kernel is launched per shader type, which consumes from the work queue
* This adds synchronization overhead
* Each ray in a warp can intersect multiple different leaf nodes, and each ray can execute a different set of shaders
1. Immediate execution
* When a thread reaches an any-hit or intersection shader during traversal, it calls the shader immediately
* This causes high divergence when threads in a warp call different shaders
2. Delayed intersection and any-hit execution (they use this)
* This method executes intersection and any-hit shaders after the traversal of all threads in the warp is completed
* Afterwards, each thread executes all any-hit and intersection shaders for its ray sequentially
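A sketch of the delayed scheme (structure hypothetical): phase 1 completes traversal for every thread in the warp while only recording the shader calls each leaf would trigger; phase 2 then has each thread replay its recorded calls back-to-back, so the warp executes one shader region at a time instead of diverging mid-traversal:

```cpp
#include <functional>
#include <vector>

// One deferred any-hit/intersection invocation recorded during traversal.
struct DeferredCall { int shader_id; int leaf_id; };

// per_thread_leaves[t] = the shader calls thread t's traversal would trigger.
// Phase 1 only queues them; phase 2 drains each thread's queue sequentially.
std::vector<int> run_warp_delayed(
    const std::vector<std::vector<DeferredCall>>& per_thread_leaves,
    const std::function<int(const DeferredCall&)>& execute_shader) {
    std::vector<std::vector<DeferredCall>> queues(per_thread_leaves.size());
    for (size_t tid = 0; tid < per_thread_leaves.size(); ++tid)  // phase 1: traversal
        for (const DeferredCall& c : per_thread_leaves[tid])
            queues[tid].push_back(c);                            // defer, don't call
    std::vector<int> results;
    for (const auto& q : queues) {                               // phase 2: shading
        int acc = 0;
        for (const DeferredCall& c : q) acc += execute_shader(c);
        results.push_back(acc);
    }
    return results;
}
```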
#### Functional Simulation
##### 1. Acceleration Structure
* The acceleration structure implementation in Vulkan-Sim is a 6-wide BVH tree adopted from Mesa
* The TLAS is made up of **internal nodes** and **leaf nodes**
* An internal node is 64 bytes
* A top-level leaf node is 128 bytes
* The BLAS consists of internal nodes, triangle leaves, and procedural leaves
* A triangle leaf is 64 bytes
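Hypothetical C++ layouts matching the node sizes above (only the sizes and the 6-wide branching come from the notes; every field is an illustrative guess):

```cpp
#include <cstdint>

// Internal node of the 6-wide BVH: 64 bytes.
struct InternalNode {
    float    grid_origin[3];   // 12B  quantization frame for the child boxes
    uint32_t child_base;       //  4B  offset of the first child node
    uint8_t  exponents[3];     //  3B  per-axis quantization scale
    uint8_t  child_count;      //  1B
    uint8_t  q_lo[6][3];       // 18B  quantized child-box minima (6-wide)
    uint8_t  q_hi[6][3];       // 18B  quantized child-box maxima
    uint8_t  pad[8];           //  8B  padding up to 64B
};
static_assert(sizeof(InternalNode) == 64, "internal node is 64 bytes");

// Top-level (TLAS) leaf: 128 bytes, one per object instance.
struct InstanceLeaf {
    float    transform[12];      // 48B  3x4 object-to-world matrix
    float    inv_transform[12];  // 48B  3x4 world-to-object matrix
    uint64_t blas_address;       //  8B  pointer to the instanced BLAS
    uint32_t instance_id;        //  4B
    uint32_t mask_and_flags;     //  4B
    uint8_t  pad[16];            // 16B  padding up to 128B
};
static_assert(sizeof(InstanceLeaf) == 128, "TLAS leaf is 128 bytes");

// Triangle leaf of a BLAS: 64 bytes.
struct TriangleLeaf {
    float    vertices[3][3];   // 36B  three vertex positions
    uint32_t primitive_id;     //  4B
    uint32_t geometry_id;      //  4B
    uint8_t  pad[20];          // 20B  padding up to 64B
};
static_assert(sizeof(TriangleLeaf) == 64, "triangle leaf is 64 bytes");
```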
##### 2. Shader Translation
* PTX is extended with new ray tracing instructions
* **traceRayEXT** is an important GLSL function in RT shaders that starts AS traversal and executes the other shaders
##### 3. Shader Binding Table
* Vulkan uses a shader binding table to record all shaders in a ray tracing pipeline
* Vulkan-Sim assigns an ID to each shader when it is registered, which is returned to the user program and used as a shader handle
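A minimal sketch of such an ID-based registry (interface invented; Vulkan-Sim's actual bookkeeping may differ):

```cpp
#include <cstdint>
#include <vector>

// Stand-in for Vulkan-Sim's shader binding table: registering a shader returns
// an integer ID that the user program later passes around as the shader handle.
class ShaderRegistry {
public:
    // blob would be the translated shader code; we just store it.
    uint32_t register_shader(std::vector<uint8_t> blob) {
        shaders_.push_back(std::move(blob));
        return static_cast<uint32_t>(shaders_.size() - 1);  // ID = insertion index
    }
    const std::vector<uint8_t>& lookup(uint32_t id) const { return shaders_.at(id); }

private:
    std::vector<std::vector<uint8_t>> shaders_;
};
```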
##### 4. Traversal and Intersection Implementation
* Memory addresses that are accessed are recorded, along with their size and data type, to a transactions buffer, which is then sent to the timing model to simulate memory access latencies
##### 5. Kernel Invocation
* Vulkan-Sim starts a kernel with block size of (32,1,1) and grid size of (launch width, launch height, launch depth)
* Each thread executes a raygen shader
* Shader input and output are handled through special Vulkan buffers called descriptor sets
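One plausible reading of that launch shape is that each 32-thread block covers a 32-pixel-wide tile of an image row; the index math would then be (convention assumed, not confirmed by the paper):

```cpp
#include <cstdint>
#include <optional>

struct uint3 { uint32_t x, y, z; };

// With blocks of (32,1,1), block (bx,by,bz) thread tx covers ray/pixel
// (bx*32+tx, by, bz); threads past the image edge get no ray.
// (Assumes grid.x = ceil(launch_width / 32); the note's "(launch width, ...)"
// grid may instead mean the pre-division width.)
std::optional<uint3> thread_to_ray(uint3 block, uint32_t tx, uint32_t launch_width) {
    uint32_t x = block.x * 32 + tx;
    if (x >= launch_width) return std::nullopt;  // guard the partial last block
    return uint3{x, block.y, block.z};
}
```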
#### Timing Model

* model the full Vulkan ray tracing pipeline, including shader execution
* treat the RT unit as an execution unit
1. RT Unit Overview
* The performance model of the RT unit focuses on two sources of latency: **BVH operations** and **memory accesses**
* In each cycle, a warp is selected, and memory requests from the threads in the warp are scheduled to be issued to the L1 cache
* The returning ray tracing data is directed into the Response FIFO to be processed, modeling the latency of memory accesses
* The Operation Scheduler determines the requesting thread and forwards its ray properties, along with the returned geometry data, to the Operation Units
2. Warp Management
* Up to eight warps can co-exist within the RT unit
* Each warp maintains a Ray Buffer
* The traversal stack is maintained as a short stack with eight entries in per-thread memory
3. Memory Scheduling
* Once a warp has been scheduled, the memory scheduler evaluates the Ray Status and the Traversal Stack in the Ray Buffer for each thread in the warp
* The Ray Status indicates if the thread is ready to issue a memory request
* The Memory Scheduler collects each of these addresses from all the threads in the warp, merging any identical requests
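The collect-and-merge step can be sketched as follows (the 32-byte line granularity is an assumption for illustration):

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Collect the address each ready thread wants, align it to line granularity,
// and merge identical lines so the warp issues each request only once.
std::vector<uint64_t> merge_requests(const std::vector<uint64_t>& thread_addrs,
                                     uint64_t line_bytes = 32) {
    std::vector<uint64_t> merged;
    std::unordered_set<uint64_t> seen;
    for (uint64_t a : thread_addrs) {
        uint64_t line = a / line_bytes * line_bytes;      // align down to a line
        if (seen.insert(line).second) merged.push_back(line);  // first occurrence only
    }
    return merged;
}
```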
4. BVH Operations
* At the beginning of each cycle, the RT unit pops from the Response FIFO if there is data and forwards it to the Operation Units
* A flag in the returned data determines which of the three pipelined hardware units should be used
* ray-box intersection unit
* ray-triangle intersection unit
* transformation units
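The flag-based dispatch can be sketched as a simple switch (enum and return strings invented):

```cpp
#include <string>

// Flag embedded in the returned node data selects the pipelined hardware unit.
enum class NodeKind { Internal, TriangleLeaf, InstanceLeaf };

std::string dispatch_unit(NodeKind k) {
    switch (k) {
        case NodeKind::Internal:     return "ray-box intersection unit";
        case NodeKind::TriangleLeaf: return "ray-triangle intersection unit";
        case NodeKind::InstanceLeaf: return "transformation unit";
    }
    return "";
}
```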
#### Software Architecture

## 4. METHODOLOGY



## 5. RESULTS
* In these workloads, ALU operations account for 60% of the measured instruction mix, followed by memory operations at 25%, with only around 1% trace ray instructions
* EXT is the most realistic workload evaluated; there, memory accesses from trace ray instructions make up around 60% of all memory accesses, and the RT units are active for 92% of total cycles on average
* All workloads fall in the memory-bound region of the roofline but sit far below both the memory and compute bounds, implying that **ray tracing performance is generally limited by memory accesses, but these workloads underutilize the available resources**
## 6. Improvement methods (appendix)
### Independent Thread Scheduling
* Warp divergence

* ITS

* ITS performs best when warp splits are similar in length and both execute long-latency instructions that create stalls, allowing ITS to schedule the splits to execute in parallel