# Emerald: Graphics Modeling for SoC Systems
###### tags: `GPUs` `GPU simulator` `Graphics model` `SoC`
###### paper: [link](https://www.ece.ubc.ca/~aamodt/publications/papers/gubran.isca2019.pdf)
###### slide: [link](https://www.ece.ubc.ca/~aamodt/publications/papers/Emerald_ISCA_final_public.pptx)
###### video: [link](https://www.youtube.com/watch?v=F32FgYePzXc)
###### paper origin: ISCA 2019
# Introduction
* **Motivation:**
* gem5 and MARSSx86 cannot capture full-system behavior because they lack models for key components such as graphics rendering.
* Researchers develop solutions that ignore significant system-wide behaviors.
* Software rendering instead of a hardware model can severely misrepresent the behavior of a system.
* **Paper goal:**
To provide an architecture model for a contemporary heterogeneous SoC that meets the following three requirements:
* support a full operating system software stack
* model the main hardware components (e.g., CPU, GPUs, and the memory hierarchy)
* provide a flexible platform to model additional specialized components
* **Emerald:**
* Builds on GPGPU-Sim and extends gem5 full-system simulation to support graphics.
* Emerald's graphics model executes graphics shaders using the same model used by GPGPU-Sim, enabling the same microarchitecture model to be used for both graphics and GPGPU workloads.
* Adds additional models to support graphics-specific functions.
* **Contributions:**
* Emerald, a simulation infrastructure that extends GPGPU-Sim to provide a unified simulator for graphics and GPGPU workloads.
* An infrastructure to simulate SoC systems using Android and gem5.
* A study of the behavior of memory in SoC systems highlighting the importance of detailed modeling that incorporates dependencies between components, feedback from the system, and the timing of events.
* DFSL, a dynamic fragment shading load-balancing technique.
# Architecture

1. A CPU cluster with multiple cache levels
2. A GPU cluster consisting of multiple GPU shader cores; GPU threads are organized into groups of 32 threads (warps).
3. GPU L1 caches are connected to an interconnection network.
4. The GPU L2 cache is coherent with the CPU caches.
5. The system network connects the CPU cluster, the GPU cluster, DMA devices, and the main memory (a toy topology sketch follows this list).
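
Below is a minimal, illustrative Python sketch of the simulated SoC topology described above. All dictionary keys and parameter values are assumptions for illustration only; they do not correspond to actual gem5/Emerald configuration names.

```python
# Illustrative sketch of the simulated SoC topology; all names/values are hypothetical,
# not real gem5 or Emerald configuration parameters.
soc_config = {
    "cpu_cluster": {
        "num_cores": 4,
        "caches": ["L1I", "L1D", "L2"],       # multiple cache levels
    },
    "gpu_cluster": {
        "num_shader_cores": 8,
        "warp_size": 32,                      # GPU threads grouped into warps of 32
        "l1_on_gpu_interconnect": True,       # GPU L1 caches connect to an interconnect
        "l2_coherent_with_cpu": True,         # GPU L2 kept coherent with CPU caches
    },
    # The system network ties everything together.
    "system_network": ["cpu_cluster", "gpu_cluster", "dma_devices", "main_memory"],
}

if __name__ == "__main__":
    for component, params in soc_config.items():
        print(component, "->", params)
```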
# Emerald Graphics Pipeline

* A draw call initiates rendering by providing a set of input vertices (**A**).
* Vertex data includes 3D positions and other optional vertex attributes.
* Vertices are distributed across SIMT cores in batches (**B**, **C**).
* Vertex shading stage transforms 3D vertex coordinates to 2D screen-space coordinates.
* Once vertices complete vertex shading, they proceed through primitive assembly, which converts a stream of screen-space vertices into a sequence of base primitives (**D**).
* In the clipping & culling stage (**E**), back-facing primitives and primitives outside the rendering volume are discarded.
* Primitives are distributed to GPU clusters based on screen-space position (**F**).
* Each cluster performs primitive setup (**G**).
* Coarse rasterization (**H**) identifies the screen tiles each primitive covers.
* In fine rasterization (**I**), input attributes for individual pixels are generated.
* In the hierarchical-Z/stencil stage (**J**), a low-resolution on-chip depth/stencil buffer is used to eliminate invisible fragments.
* Surviving fragments are assembled in tiles (**K**) and shaded by the SIMT cores.
* In stages **L**, **M**, **N**, raster operations are performed as part of the shader program.
* Fragment data is written to the corresponding pixel position in the framebuffer (a simplified pipeline sketch follows this list).
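
The toy Python sketch below walks a single triangle through stages **A**-**N**. The geometry math, the bounding-box "rasterization", and the constant-color shading are placeholder assumptions used only to keep the sketch self-contained and runnable; this is not Emerald's implementation.

```python
# Illustrative walk-through of the pipeline stages A-N on a tiny 8x8 "screen".
def vertex_shade(vertices):
    # A-C: vertex batches are shaded on SIMT cores; the "transform" to screen space
    # is a trivial placeholder scale.
    return [(x * 0.5, y * 0.5, z) for (x, y, z) in vertices]

def primitive_assembly(verts):
    # D: every three consecutive screen-space vertices form one triangle primitive.
    return [tuple(verts[i:i + 3]) for i in range(0, len(verts) - 2, 3)]

def clip_and_cull(prims):
    # E: drop primitives fully outside the unit screen square (stand-in for the view volume).
    inside = lambda v: 0.0 <= v[0] <= 1.0 and 0.0 <= v[1] <= 1.0
    return [p for p in prims if any(inside(v) for v in p)]

def rasterize_and_shade(prims, framebuffer, zbuffer, size=8):
    # F-K: rasterization is approximated by covering the primitive's screen bounding box;
    # a simple per-pixel depth test stands in for hierarchical-Z/stencil (J).
    # L-N: "fragment shading" just writes a constant color to the framebuffer.
    for tri in prims:
        xs = [v[0] for v in tri]; ys = [v[1] for v in tri]; z = min(v[2] for v in tri)
        for px in range(int(min(xs) * size), int(max(xs) * size) + 1):
            for py in range(int(min(ys) * size), int(max(ys) * size) + 1):
                if 0 <= px < size and 0 <= py < size and z < zbuffer[py][px]:
                    zbuffer[py][px] = z
                    framebuffer[py][px] = 1   # shaded fragment written to the framebuffer

size = 8
fb = [[0] * size for _ in range(size)]
zb = [[float("inf")] * size for _ in range(size)]
verts = [(0.1, 0.1, 0.5), (1.5, 0.2, 0.5), (0.2, 1.5, 0.5)]   # one draw call (A)
rasterize_and_shade(clip_and_cull(primitive_assembly(vertex_shade(verts))), fb, zb, size)
print(sum(map(sum, fb)), "pixels shaded")
```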
# Emerald GPU Architecture

Emerald's hardware architecture consists of a set of single-instruction multiple-thread (SIMT) core clusters (**1**) and an L2 cache with an atomic operation unit (**2**). An interconnection network (**3**) connects the GPU clusters, the atomic operation unit, and the L2 cache to each other and to the rest of the system (i.e., DRAM and the rest of the SoC through the system NoC).
## GPU SIMT cluster

* Units **2** to **8** implement stages **G** to **K** of the Emerald graphics pipeline.
* SIMT cores execute shader programs for vertices and fragments by grouping them into warps (sets of 32 threads), which execute on SIMT lanes in lock-step.
* Screen-space is divided into tiles (TC tiles), where each TC tile position is assigned to a single SIMT core.
* The Vertex Processing and Operations (VPO) unit assigns primitives to clusters for screen-space processing; a sketch of the TC-tile-to-core assignment follows this list.
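
A small sketch of the static TC-tile-to-core mapping described above; the interleaving function below is an assumed example, not Emerald's exact mapping.

```python
# Each TC tile position in screen space maps to exactly one SIMT core; fragments in that
# tile are always shaded (in warps of 32 threads) on the owning core.
def tc_tile_to_core(tile_x, tile_y, tiles_per_row, num_simt_cores):
    # Interleave tiles across cores so neighboring tiles tend to go to different cores.
    return (tile_y * tiles_per_row + tile_x) % num_simt_cores

tiles_per_row, num_cores = 6, 4
for ty in range(3):
    print([tc_tile_to_core(tx, ty, tiles_per_row, num_cores) for tx in range(tiles_per_row)])
```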
## VPO Unit

* SIMT cores write the position data of vertex warps into one of the vertex warp buffers.
* A bounding-box calculation unit consumes position data from each warp and calculates the bounding box of each primitive covered by the warp.
* Bounding-box calculations generate a warp-sized primitive mask for each SIMT cluster.
* The primitive mask generation stage sends each created mask to its corresponding cluster.
* The PMRB unit collects primitive masks from all clusters and processes them in draw-call order to check whether the corresponding primitive covers part of the screen-space area assigned to the current cluster (a toy sketch of this flow follows this list).
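
The sketch below illustrates the flow described above: per-primitive bounding boxes are computed from warp position data, and a per-cluster, warp-sized primitive mask marks which primitives of the warp overlap each cluster's screen region. The region shapes and mask layout are simplifying assumptions.

```python
# Warp-sized primitive masks: bit i of cluster c's mask is set if primitive i's
# bounding box overlaps the screen-space region assigned to cluster c.
WARP_SIZE = 32

def bounding_box(prim):
    xs = [v[0] for v in prim]; ys = [v[1] for v in prim]
    return (min(xs), min(ys), max(xs), max(ys))

def primitive_masks(warp_primitives, cluster_regions):
    masks = [0] * len(cluster_regions)
    for i, prim in enumerate(warp_primitives[:WARP_SIZE]):
        x0, y0, x1, y1 = bounding_box(prim)
        for c, (rx0, ry0, rx1, ry1) in enumerate(cluster_regions):
            if x0 <= rx1 and x1 >= rx0 and y0 <= ry1 and y1 >= ry0:
                masks[c] |= 1 << i
    return masks

# Two clusters splitting the screen left/right; one triangle falls in each half.
regions = [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)]
prims = [[(0.1, 0.1), (0.2, 0.3), (0.3, 0.1)], [(0.7, 0.6), (0.9, 0.8), (0.8, 0.9)]]
print([bin(m) for m in primitive_masks(prims, regions)])   # ['0b1', '0b10']
```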
## Model Performance
Emerald was compared against the Tegra K1 SoC using a set of 14 benchmarks.
Execution time has a correlation of 98% with a 32.2% average absolute relative error, while pixel fill rate (pixels per cycle) has a correlation of 76.5% with a 33% average absolute relative error.
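
For reference, the snippet below shows one way these two accuracy metrics can be computed: a Pearson correlation and a mean absolute relative error over hardware/simulator value pairs. The numbers are made-up placeholders, not the paper's data.

```python
# Toy computation of correlation and average absolute relative error between
# hardware measurements and simulator estimates (placeholder values).
from statistics import correlation   # Python 3.10+

hw  = [10.0, 22.0, 35.0, 48.0]       # e.g., measured execution times on the Tegra K1
sim = [13.0, 25.0, 30.0, 60.0]       # corresponding simulator estimates (illustrative)

corr = correlation(hw, sim)
avg_abs_rel_err = sum(abs(s - h) / h for h, s in zip(hw, sim)) / len(hw)
print(f"correlation = {corr:.2f}, avg abs relative error = {avg_abs_rel_err:.1%}")
```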
# Case Study 1
## DASH Scheduler
* DASH aims to balance access to DRAM by classifying CPU and IP traffic into a set of priority levels:
1. Urgent IPs
2. Memory non-intensive CPU applications
3. Non-urgent IPs
4. Memory intensive CPU applications
* DASH categorizes IP deadlines into short and long deadlines (a toy priority-classification sketch follows).
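
A toy classification of memory requests into the four DASH priority levels might look like the sketch below; the request fields and the urgency/intensity flags are assumptions, and only the ordering of the four classes comes from the description above.

```python
# Map a memory request to one of DASH's four priority levels (lower = served first).
def dash_priority(source, is_urgent=False, memory_intensive=False):
    if source == "ip":
        return 1 if is_urgent else 3            # 1: urgent IPs, 3: non-urgent IPs
    if source == "cpu":
        return 4 if memory_intensive else 2     # 2: non-intensive CPU, 4: intensive CPU
    raise ValueError(f"unknown source: {source}")

requests = [
    ("display_ip", dash_priority("ip", is_urgent=True)),        # deadline about to expire
    ("cpu_app_a",  dash_priority("cpu", memory_intensive=False)),
    ("gpu_ip",     dash_priority("ip", is_urgent=False)),
    ("cpu_app_b",  dash_priority("cpu", memory_intensive=True)),
]
for name, prio in sorted(requests, key=lambda r: r[1]):
    print(prio, name)
```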
## HMC Controller
* HMC proposes a heterogeneous memory controller that aims to support locality or parallelism depending on the traffic source.
* CPU-assigned channels use an address mapping that improves locality by assigning consecutive addresses to the same row buffer; IP-assigned channels use an address mapping that improves parallelism by spreading consecutive addresses across DRAM banks (contrasted in the sketch after this list).
* HMC defines separate memory regions by using different DRAM channels for handling CPU and IP accesses.
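
The contrast between the two mappings can be sketched as follows for a toy DRAM; the field sizes (1 KB rows, 64 B lines, 4 banks) are arbitrary assumptions.

```python
# Locality-oriented vs. parallelism-oriented bank mapping (toy field sizes).
ROW_BYTES, LINE_BYTES, NUM_BANKS = 1024, 64, 4

def cpu_channel_bank(addr):
    # Locality-oriented: consecutive addresses fill one open row before changing banks.
    return (addr // ROW_BYTES) % NUM_BANKS

def ip_channel_bank(addr):
    # Parallelism-oriented: consecutive cache lines are striped across banks.
    return (addr // LINE_BYTES) % NUM_BANKS

for addr in range(0, 4 * LINE_BYTES, LINE_BYTES):
    print(f"addr {addr:4d}: cpu-channel bank {cpu_channel_bank(addr)}, "
          f"ip-channel bank {ip_channel_bank(addr)}")
# A streaming access pattern stays in one open row on CPU-assigned channels,
# but is spread over all banks on IP-assigned channels.
```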
## Evaluation
### Experimental Setup
* Emerald runs Android in full-system mode using Emerald's GPU model and gem5's CPU and display controller models.
* The system is run under regular-load and high-load configurations.
### Result
#### Regular-load scenario

* DASH prioritizes CPU requests over the GPU's while a frame is being rendered.
* Longer GPU execution time may increase energy consumption.
* For HMC, traffic from the CPU and the GPU was not balanced throughout the frame.

* Lower row-buffer locality at IP-assigned channels leads to lower row-buffer hit rates and fewer bytes fetched per row-buffer activation.
#### High-load scenario

* HMC shows behavior similar to that of the regular-load scenario.
* DASH reduces frame rates compared to the baseline by an average of 8.9% and 9.7% for the DCB and DTC configurations, respectively.
# Case Study 2
## Goal
To propose and evaluate a method for dynamically load-balancing the fragment shading stage on the GPU by controlling the granularity of the work assigned to each GPU core.

## Experimental Setup
* Uses Emerald's standalone mode.
* The GPU configuration resembles a high-end mobile GPU.
* Uses a *work tile (**WT**)* to define the granularity of work assigned to GPU cores (a WT of size N covers N×N TC tiles, N ≥ 1); see the sketch after this list.
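
One possible (assumed, round-robin) way to assign WTs of size N to GPU cores is sketched below: a smaller N interleaves TC tiles finely across cores (better load balance), while a larger N keeps neighboring TC tiles on the same core (better locality).

```python
# Assign work tiles (WTs) of size N x N TC tiles to cores in round-robin order.
def assign_work_tiles(tc_tiles_x, tc_tiles_y, wt_size, num_cores):
    assignment, wt_index = {}, 0
    for wy in range(0, tc_tiles_y, wt_size):
        for wx in range(0, tc_tiles_x, wt_size):
            core = wt_index % num_cores            # round-robin over whole WTs
            for ty in range(wy, min(wy + wt_size, tc_tiles_y)):
                for tx in range(wx, min(wx + wt_size, tc_tiles_x)):
                    assignment[(tx, ty)] = core
            wt_index += 1
    return assignment

# 8x8 TC tiles, 4 cores: WT size 1 = maximum load balance, WT size 4 = more locality.
for n in (1, 4):
    a = assign_work_tiles(8, 8, n, 4)
    row0 = [a[(tx, 0)] for tx in range(8)]
    print(f"WT size {n}: core assignment of first tile row -> {row0}")
```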
## Load-Balance vs. Locality

The WT size that achieves the optimal performance varies from one workload to another.

Factors that may contribute to the variation in execution time with WT size:
* L2 misses/DRAM traffic are very similar across WT sizes.
* L1 cache locality is a significant factor in performance: execution time correlates by 78% with L1 misses, 79% with L1 depth misses, and 82% with texture misses.
## Dynamic Fragment Shading Load-Balancing
* Dynamically adjust work distribution across GPU cores so as to reduce rendering time.
* The goal of DFSL is to lower GPU energy consumption by reducing average rendering time per frame.
* DFSL works by running two phases, an evaluation phase and a run phase.
* DFSL Algorithm

* Lines 13-25 of the algorithm are the evaluation phase; a simplified two-phase sketch follows.
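
A minimal Python sketch of the two phases is given below. The per-frame timing model and the re-evaluation window are hypothetical stand-ins, not the paper's exact algorithm.

```python
# DFSL sketch: an evaluation phase times one frame per candidate WT size, then a run
# phase keeps the best WT size for a window of frames before re-evaluating.
def dfsl(render_frame_with_wt, candidate_wt_sizes, total_frames, run_phase_frames=20):
    frame, chosen = 0, []
    while frame < total_frames:
        # Evaluation phase: render one frame at each candidate WT size and time it.
        timings = {wt: render_frame_with_wt(frame + i, wt)
                   for i, wt in enumerate(candidate_wt_sizes)}
        frame += len(candidate_wt_sizes)
        best_wt = min(timings, key=timings.get)

        # Run phase: keep the best WT size for a window of frames, then re-evaluate
        # to adapt to workload changes.
        for _ in range(run_phase_frames):
            if frame >= total_frames:
                break
            render_frame_with_wt(frame, best_wt)
            chosen.append(best_wt)
            frame += 1
    return chosen

def fake_render(frame, wt):
    # Toy per-frame cost model (ms) where WT size 2 happens to be fastest.
    return abs(wt - 2) + 1.0

print(set(dfsl(fake_render, [1, 2, 4, 8], total_frames=50)))   # {2}
```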
## Result

* MLB: maximum load balance, using a WT size of 1.
* MLC: maximum locality, using a WT size of 10.
* SOPT: the static optimum, found by averaging all frames across all configurations and choosing the best WT size.
* DFSL is able to speed up frame rendering by an average of 19% compared to MLB and by 7.3% compared to **SOPT**.
## Future work
* Emerald only supports vertex and fragment shading. Future work includes adding support for geometry and tessellation shaders.
* Adding details to the graphics model: e.g., texturing & compression.