# Emerald: Graphics Modeling for SoC Systems
###### tags: `GPUs` `GPU simulator` `Graphics model` `SoC`
###### paper: [link](https://www.ece.ubc.ca/~aamodt/publications/papers/gubran.isca2019.pdf)
###### slide: [link](https://www.ece.ubc.ca/~aamodt/publications/papers/Emerald_ISCA_final_public.pptx)
###### video: [link](https://www.youtube.com/watch?v=F32FgYePzXc)
##### paper origin: ISCA 2019

# Introduction
* **Motivation:**
    * gem5 and MARSSx86 cannot capture all of the hardware because they lack models for key components such as graphics rendering.
    * As a result, researchers develop solutions that ignore significant system-wide behaviors.
    * Using software rendering instead of a hardware model can severely misrepresent the behavior of a system.
* **Paper goal:** to provide an architecture model for a contemporary heterogeneous SoC that meets the following three requirements:
    * support a full operating system software stack
    * model the main hardware components (e.g., CPUs, GPUs, and the memory hierarchy)
    * provide a flexible platform for modeling additional specialized components
* **Emerald:**
    * builds on GPGPU-Sim and extends gem5 full-system simulation to support graphics
    * executes graphics shaders on the same SIMT core model used by GPGPU-Sim, so a single microarchitecture model serves both graphics and GPGPU workloads
    * adds additional models to support graphics-specific functions
* **Contributions:**
    * Emerald, a simulation infrastructure that extends GPGPU-Sim to provide a unified simulator for graphics and GPGPU workloads.
    * An infrastructure to simulate SoC systems using Android and gem5.
    * A study of memory behavior in SoC systems highlighting the importance of detailed modeling that incorporates dependencies between components, feedback from the system, and the timing of events.
    * DFSL, dynamic fragment-shading load-balancing.

# Architecture
![](https://i.imgur.com/n1HQK75.png)
1. A CPU cluster with multiple cache levels.
2. A GPU cluster consisting of multiple GPU shader cores; GPU threads are organized in groups of 32 threads (warps).
3. GPU L1 caches are connected to an interconnection network.
4. The GPU L2 cache is coherent with the CPU caches.
5. The system network connects the CPU cluster, the GPU cluster, DMA devices, and main memory.

# Emerald Graphics Pipeline
![](https://i.imgur.com/HKaFPA9.png)
* A draw call initiates rendering by providing a set of input vertices (**A**).
* Vertex data includes 3D positions and other optional vertex attributes.
* Vertices are distributed across SIMT cores in batches (**B**, **C**).
* The vertex shading stage transforms 3D vertex coordinates into 2D screen-space coordinates.
* Once vertices complete vertex shading, they proceed through primitive assembly, which converts the stream of screen-space vertices into a sequence of base primitives (**D**).
* In the clipping & culling stage (**E**), back-facing primitives and primitives outside the rendering volume are discarded.
* Primitives are distributed to GPU clusters based on their screen-space position (**F**).
* Each cluster performs primitive setup (**G**).
* Coarse rasterization (**H**) identifies the screen tiles each primitive covers.
* In fine rasterization (**I**), input attributes for individual pixels are generated.
* In the hierarchical-Z/stencil stage (**J**), a low-resolution on-chip depth/stencil buffer is used to eliminate invisible fragments.
* Surviving fragments are assembled into tiles (**K**) and shaded by the SIMT cores.
* In stages **L**, **M**, and **N**, raster operations are performed as part of the shader program.
* Fragment data is written to the corresponding pixel position in the framebuffer.
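
To make the dataflow above concrete, here is a minimal functional sketch of the draw-call path (stages **A**–**K**) in plain C++. It only illustrates the flow described in the paper, not Emerald's actual code; the names, tile size, screen resolution, and the bounding-box-based coarse rasterization are all assumptions made for the example.

```cpp
// Functional sketch of a draw call flowing through stages A-K (illustrative only).
#include <algorithm>
#include <array>
#include <utility>
#include <vector>

constexpr int kScreenW = 256, kScreenH = 256; // assumed screen-space resolution
constexpr int kTileDim = 16;                  // assumed screen-tile size in pixels

struct Vertex   { float x, y, z; };            // screen-space position after vertex shading
struct Triangle { std::array<Vertex, 3> v; };  // base primitive from primitive assembly (D)
struct Fragment { int px, py; float depth; };  // per-pixel input produced by fine rasterization (I)

// (B, C) Vertex shading: placeholder transform from normalized coordinates to screen space.
Vertex vertexShade(const Vertex& in) { return {in.x * kScreenW, in.y * kScreenH, in.z}; }

// (D) Primitive assembly: every three shaded vertices form one triangle.
std::vector<Triangle> assemble(const std::vector<Vertex>& verts) {
    std::vector<Triangle> prims;
    for (std::size_t i = 0; i + 2 < verts.size(); i += 3)
        prims.push_back({{verts[i], verts[i + 1], verts[i + 2]}});
    return prims;
}

// (H) Coarse rasterization: list the screen tiles touched by the primitive's bounding box.
std::vector<std::pair<int, int>> coarseRaster(const Triangle& t) {
    float minx = kScreenW, maxx = 0, miny = kScreenH, maxy = 0;
    for (const Vertex& v : t.v) {
        minx = std::min(minx, v.x); maxx = std::max(maxx, v.x);
        miny = std::min(miny, v.y); maxy = std::max(maxy, v.y);
    }
    std::vector<std::pair<int, int>> tiles;
    for (int ty = static_cast<int>(miny) / kTileDim; ty <= static_cast<int>(maxy) / kTileDim; ++ty)
        for (int tx = static_cast<int>(minx) / kTileDim; tx <= static_cast<int>(maxx) / kTileDim; ++tx)
            tiles.emplace_back(tx, ty);
    return tiles;
}

// (J) Hierarchical-Z: drop a fragment if it lies behind the tile's coarse depth value.
bool hizPass(const Fragment& f, const std::vector<float>& hizBuffer) {
    int tileIdx = (f.py / kTileDim) * (kScreenW / kTileDim) + (f.px / kTileDim);
    return f.depth <= hizBuffer[tileIdx];
}
```

In Emerald itself these stages are performed by the hardware models described in the following sections, with fine rasterization (**I**) generating per-pixel attributes and surviving fragments being grouped into warps for shading on the SIMT cores (**K**).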
# Emerald GPU Architecture
![](https://i.imgur.com/MKRIV7J.png)
Emerald's hardware architecture consists of a set of single-instruction, multiple-thread (SIMT) core clusters (**1**) and an L2 cache with an atomic operation unit (**2**). An interconnection network (**3**) connects the GPU clusters, the atomic operation unit, and the L2 cache to each other and to the rest of the system (i.e., DRAM and the SoC NoC).

## GPU SIMT Cluster
![](https://i.imgur.com/5YC5b2f.png)
* Blocks **2** to **8** implement stages **G** to **K** of the Emerald graphics pipeline.
* SIMT cores execute shader programs for vertices and fragments by grouping them into warps (sets of 32 threads), which execute on the SIMT lanes in lock-step.
* Screen space is divided into tiles (TC tiles), and each TC tile position is assigned to a single SIMT core.
* The Vertex Processing and Operations (VPO) unit assigns primitives to clusters for screen-space processing.

## VPO Unit
![](https://i.imgur.com/khwTjkC.png)
* SIMT cores write the position data of vertex warps into one of the vertex warp buffers.
* A bounding-box calculation unit consumes position data from each warp and calculates the bounding box of each primitive covered by the warp.
* Bounding-box calculations generate a warp-sized primitive mask for each SIMT cluster.
* The primitive mask generation stage sends each created mask to its corresponding cluster.
* The PMRB unit collects primitive masks from all clusters and processes them in draw-call order to check whether each corresponding primitive covers part of the screen-space area assigned to the current cluster.

## Model Performance
Emerald was compared against a Tegra K1 SoC using a set of 14 benchmarks. Execution time has a correlation of 98% with a 32.2% average absolute relative error, while pixel fill rate (pixels per cycle) has a correlation of 76.5% with a 33% average absolute relative error.

# Case Study 1
## DASH Scheduler
* DASH aims to balance access to DRAM by classifying CPU and IP traffic into a set of priority levels:
    1. Urgent IPs
    2. Memory non-intensive CPU applications
    3. Non-urgent IPs
    4. Memory-intensive CPU applications
* DASH further categorizes IP deadlines into short and long deadlines.

## HMC Controller
* HMC proposes a heterogeneous memory controller that supports locality or parallelism depending on the traffic source.
* CPU-assigned channels use an address mapping that improves locality by assigning consecutive addresses to the same row buffer; IP-assigned channels use an address mapping that improves parallelism by distributing consecutive addresses across DRAM banks.
* HMC defines separate memory regions by using different DRAM channels to handle CPU and IP accesses.

## Evaluation
### Experimental Setup
* Emerald runs Android in full-system mode using Emerald's GPU model and gem5's CPU and display controller models.
* The system is run under regular-load and high-load configurations.

### Results
#### Regular-load scenario
![](https://i.imgur.com/G7chP7z.png)
* DASH prioritizes CPU requests over the GPU's while a frame is being rendered.
* The longer GPU execution time may increase energy consumption.
* For HMC, traffic from the CPU thread and the GPU thread was not balanced throughout the frame.

![](https://i.imgur.com/S46DzDp.png)
* Lower row-buffer locality at IP-assigned channels leads to lower row-buffer hit rates and fewer bytes fetched per row-buffer activation.

#### High-load scenario
![](https://i.imgur.com/KK83Unz.png)
* HMC shows behavior similar to the regular-load scenario.
* DASH reduces frame rates compared to the baseline by an average of 8.9% and 9.7% for the DCB and DTC configurations, respectively.

# Case Study 2
## Goal
To propose and evaluate a method for dynamically load-balancing the fragment shading stage on the GPU by controlling the granularity of the work assigned to each GPU core.

![](https://i.imgur.com/kNvNbqx.png)

## Experimental Setup
* Emerald is used in standalone mode.
* The GPU configuration resembles a high-end mobile GPU.
* A *work tile* (**WT**) defines the granularity of work assigned to GPU cores (a WT of size N is N×N TC tiles, N ≥ 1).

## Load-Balance vs. Locality
![](https://i.imgur.com/SwWc4u5.png)
The WT size that achieves the best performance varies from one workload to another.

![](https://i.imgur.com/2KZ8m3l.png)
Factors that may contribute to the variation in execution time with WT size:
* L2 misses and DRAM traffic are very similar across WT sizes.
* L1 cache locality is a significant factor in performance: execution time correlates by 78% with L1 misses, 79% with L1 depth misses, and 82% with texture misses.

## Dynamic Fragment Shading Load-Balancing
* DFSL dynamically adjusts the work distribution across GPU cores so as to reduce rendering time.
* The goal of DFSL is to lower GPU energy consumption by reducing the average rendering time per frame.
* DFSL works by alternating between two phases, an evaluation phase and a run phase (see the sketch after the algorithm below).
* DFSL Algorithm
![](https://i.imgur.com/3PiiFek.png)
    * Lines 13-25 are the evaluation phase.
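
As a rough illustration of the two-phase scheme described above, the following is a minimal C++ sketch of an evaluation/run loop. It is not the paper's algorithm; the frame-timing hook `renderFrameWithWT`, the WT range, and the run-phase length are assumptions made for the example.

```cpp
// Hypothetical sketch of a DFSL-style evaluation/run loop (illustrative only).
#include <cstdint>
#include <limits>

// Assumed hook into the renderer: render one frame using WT size `wt`
// (a WT of size N assigns N x N TC tiles per core) and return its render time.
extern std::uint64_t renderFrameWithWT(int wt);

constexpr int kMinWT     = 1;    // maximum load balance (smallest work granularity)
constexpr int kMaxWT     = 10;   // maximum locality in this sketch
constexpr int kRunFrames = 100;  // frames rendered per run phase (assumed)

void dfslLoop(int totalFrames) {
    int frame = 0;
    while (frame < totalFrames) {
        // Evaluation phase: try each WT size for one frame and keep the fastest.
        int bestWT = kMinWT;
        std::uint64_t bestTime = std::numeric_limits<std::uint64_t>::max();
        for (int wt = kMinWT; wt <= kMaxWT && frame < totalFrames; ++wt, ++frame) {
            std::uint64_t t = renderFrameWithWT(wt);
            if (t < bestTime) { bestTime = t; bestWT = wt; }
        }
        // Run phase: render the next batch of frames with the best WT found.
        for (int i = 0; i < kRunFrames && frame < totalFrames; ++i, ++frame)
            renderFrameWithWT(bestWT);
    }
}
```

The sketch assumes that consecutive frames are similar enough that timing one frame per WT size during the evaluation phase is representative of the frames rendered during the run phase.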
## Results
![](https://i.imgur.com/yjuA9yv.png)
* **MLB**: maximum load balance, using a WT size of 1.
* **MLC**: maximum locality, using a WT size of 10.
* **SOPT**: the best static WT size, found by averaging all frames across all configurations.
* DFSL speeds up frame rendering by an average of 19% compared to MLB and by 7.3% compared to **SOPT**.

## Future work
* Emerald currently supports only vertex and fragment shading; future work includes adding support for geometry and tessellation shading.
* Adding more detail to the graphics model, e.g., texturing and compression.