# Inter-kernel Reuse-aware Thread Block Scheduling

###### tags: `GPUs`, `memory systems`, `scheduling`, `caches`
###### paper origin: IEEE TACO 2020
###### paper: [link](https://dl.acm.org/doi/pdf/10.1145/3406538)

# 1. INTRODUCTION

* **Motivation:**
    * There is a growing gap between the computational and memory capabilities of GPUs.
    * Multi-level cache hierarchies are typically poorly utilized.
    * One source of locality that current GPU caches do not exploit is *inter-kernel locality*.
* **Research problem:**
    * At a synchronization point (a kernel boundary), a core writes through all of its dirty data to the shared last-level cache and invalidates all data in its private cache (typically the L1), so any locality across kernels is lost.
* **Proposed solution:**
    * Thread block (TB) schedulers that exploit recently proposed GPU coherence protocols to increase inter-kernel reuse, thereby improving performance and energy.
    * Four new decentralized hardware TB schedulers that exploit inter-kernel reuse (locality) by scheduling threads that consume a piece of data on the same core as the threads (from a previous kernel) that produced or shared that data.
    * Work stealing to balance work across the entire GPU while still enabling inter-kernel reuse.
    * This approach also inherently exploits some intra-kernel reuse as a side effect.
* **Results:**
    * For "regular" GPU applications, the best-performing scheduler decreases, on average:
        * execution time by 19% (max 43%)
        * energy by 11% (max 21%)
        * network traffic by 42% (max 86%)
    * Regular applications are inherently load-balanced, so stealing only marginally improves performance there.
    * For irregular, iterative GPU applications, it decreases execution time by 10% (max 18%), energy by 8% (max 19%), and network traffic by 16% (max 37%).

# 2. BACKGROUND

## 2.1 Cache Coherence for GPUs

* Several new GPU coherence protocols have been proposed that do not completely invalidate caches at synchronization points and thus allow data to be reused across kernels.
    * For example, DeNovo ensures that cache invalidations only occur for data that are not owned by the core.
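The kernel-boundary behavior described above can be illustrated with a toy model (my own sketch, not from the paper): a conventional GPU L1 invalidates everything at a kernel boundary, while a DeNovo-style L1 self-invalidates only non-owned data, so a core that wrote data in one kernel can still hit on it in the next. The `ToyL1` class and the two-kernel `run` workload are illustrative assumptions.

```python
# Toy model (not from the paper): contrast a conventional GPU L1, which
# invalidates all lines at a kernel boundary, with a DeNovo-style L1 that
# self-invalidates only data the core does not own.

class ToyL1:
    def __init__(self, keep_owned):
        self.keep_owned = keep_owned  # True -> DeNovo-style behavior
        self.lines = {}               # addr -> "owned" or "shared"
        self.hits = self.misses = 0

    def write(self, addr):
        self.lines[addr] = "owned"    # the writing core registers/owns the line

    def read(self, addr):
        if addr in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            self.lines[addr] = "shared"

    def kernel_boundary(self):
        # Conventional: write through dirty data and invalidate the whole L1.
        # DeNovo-style: owned data stay valid; only non-owned data invalidated.
        if self.keep_owned:
            self.lines = {a: s for a, s in self.lines.items() if s == "owned"}
        else:
            self.lines.clear()

def run(cache):
    for addr in range(8):        # kernel k: this core writes its tile
        cache.write(addr)
    cache.kernel_boundary()      # synchronization between kernels
    for addr in range(8):        # kernel k+1: same core reads the same tile
        cache.read(addr)
    return cache.hits, cache.misses

assert run(ToyL1(keep_owned=False)) == (0, 8)  # all inter-kernel reuse lost
assert run(ToyL1(keep_owned=True)) == (8, 0)   # reuse survives the boundary
```

The point of the contrast: the schedulers proposed in this paper only pay off under coherence protocols of the second kind, where data can legally survive a kernel boundary in the L1.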
## 2.2 GPU Scheduling

Modern GPUs use three levels of schedulers to schedule threads:

* A first-level global scheduler selects which kernels to launch on the GPU.
* A second-level global scheduler allocates TBs to Compute Units (CUs).
    * This is the scheduler this paper targets.
* A third-level per-CU scheduler schedules the warps of those TBs onto the available hardware.

# 3. THREAD BLOCK DEPENDENCIES

* Thread block *i* of the current kernel often depends on thread block *i* of the previous kernel; such dependencies are called "self-dependencies". Only producer-consumer dependencies are counted:
    * Accesses to read-only data are not included, even though they often exhibit inter-kernel reuse as well.
* Dependency classification:
    * If the data were written by the thread's own TB in the previous kernel, a "self" dependency is counted.
    * If the data were at the boundary of the TB and were written by a neighboring TB in the previous kernel, a "ghost zone" dependency is counted.
    * If the data were written by any other TB, or by the same TB but in an older kernel, an "other" dependency is counted.

![](https://dl.acm.org/cms/attachment/d4ba4f7a-c5fe-4047-8098-18d31756f55b/taco1703-24-f01.jpg)
Fig. 1. Example code template for a generic iterative GPU kernel. The data loads into variables me, left, and right show a self-dependency on the data update performed in the previous kernel.

# 4. IKRA SCHEDULER DESIGN

Four hardware TB schedulers that improve on the baseline round-robin (RR) scheduler:

* The first three use a predetermined static allocation of TBs to CUs.
* The fourth employs work stealing to handle cases where the static schedule results in load imbalance.
* To keep hardware cost extremely low, the paper uses chunking of TBs, which also increases intra-kernel reuse.

## 4.1 Chunk

The Chunk scheduler (C) is a simple extension of RR. Instead of allocating TBs one by one like RR, Chunk groups contiguous TBs into chunks and assigns one chunk to each CU at the beginning of the kernel.

![](https://dl.acm.org/cms/attachment/a31cb3b2-fac5-4454-9344-91792790cd3d/taco1703-24-f02.jpg)
Fig. 2.
Example kernel execution with the Chunk scheduler. T=0: chunk entries of four different CUs when a kernel with a 6x4 TB grid is launched on a system using Chunk. T=1: chunk entries after each CU has executed one TB. T=6: chunk entries after each CU has executed its entire chunk. The chunk entries are local to each CU. For clarity, the entries have been color-coded to match the section of the TB grid that they map to.

## 4.2 Reset

* The Reset scheduler (R) is identical to Chunk in all aspects except that it always starts allocating chunks from the first CU.
* This allows the scheduler to implicitly handle self-dependencies, as TBs with the same ID are always assigned to the same CU.

## 4.3 Flip

* The Flip scheduler (F) builds on Reset.
* TBs that finished executing last in the previous kernel have the highest likelihood of still having data in the L1 cache, so Flip reverses ("flips") the order in which each CU walks through its chunk on each new kernel, executing those TBs first.

## 4.4 Steal

* The Steal scheduler (S) adds support for work stealing to find a more balanced allocation of TBs to CUs, while still maintaining as much inter-kernel locality as possible.
* Victim selection order when a CU runs out of work:
    * left neighbor
    * right neighbor
    * sequential search

![](https://dl.acm.org/cms/attachment/235a462e-afad-439b-b68e-00b699a80095/taco1703-24-f03.jpg)
Fig. 3. Example kernel execution with the Steal scheduler. T=0: chunk entries of four different CUs when a kernel with a 6x4 TB grid is launched on a system using Steal. T=8: CU 0 finishes executing its entire chunk and attempts to steal a TB from CU 1. T=9: CU 0 steals TB (5,1,0) from CU 1 and (1) inserts it in its steal queue and (2) updates CU 1's chunk entry by decrementing the end ID. The chunk entries are local to each CU. For clarity, the entries have been color-coded to match the section of the TB grid that they map to.

# 5. RESULTS WITH BASELINE CONFIGURATION

## 5.1 Microbenchmarks

* GPU-small
* GPU-medium
* GPU-large
* Work-Stealing

![](https://dl.acm.org/cms/attachment/f743d1a0-0809-4ef3-a517-c07bf6561991/taco1703-24-f04.jpg)
Fig. 4.
Comparison of (a) execution time, (b) dynamic energy, (c) network traffic, and (d) GPU memory access breakdowns of all schedulers for four different microbenchmarks. All bars are normalized to G.

## 5.2 Regular Applications

* Compared to G, S is the fastest scheduler, reducing execution time by 19% (max 43%), energy by 11% (max 21%), and network traffic by 42% (max 86%) on average.
* F and S have approximately the same benefits on average.

![](https://dl.acm.org/cms/attachment/dbe5c3ef-e285-4e5d-aaa1-555a7abc6148/taco1703-24-f05.jpg)
Fig. 5. Comparison of (a) execution time, (b) dynamic energy, (c) network traffic, and (d) GPU memory access breakdowns for the four schedulers for regular applications with multiple kernel invocations. All bars are normalized to G.

## 5.3 Irregular Applications

* Compared to G, S reduces average execution time by 10% (max 18%), energy by 8% (max 19%), and network traffic by 16% (max 37%).
* C causes a significant slowdown in six out of the eight experiments: the static chunking and allocation of TBs to CUs is no longer load-balanced.
* Stealing is indispensable for performance on irregular applications.

![](https://dl.acm.org/cms/attachment/40dfa4e1-a1ee-460f-9f52-ca548c917918/taco1703-24-f06.jpg)
Fig. 6. Comparison of (a) execution time, (b) dynamic energy, (c) network traffic, and (d) GPU memory access breakdowns for the four schedulers for irregular applications with two inputs each (normalized to G). Part (d) contains both data and synchronization accesses.

# 6. Questions and Suggestions

* Strengths
    * Exploits inter-kernel locality to increase cache hits.
    * Low hardware overhead.
* Weaknesses
    * Shows clearly better results only on regular applications.
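To make the scheduler family from Section 4 concrete, here is a minimal Python sketch of the Chunk/Reset split, the Flip ordering, and the steal operation. The even-split chunking, the 1-D TB grid, and the modular left/right neighbor indexing are assumptions for illustration, not the paper's exact hardware design.

```python
# Minimal software sketch (assumed details, not the paper's hardware) of the
# four TB scheduling policies, for a 1-D grid of `num_tbs` thread blocks.

def chunk_ranges(num_tbs, num_cus):
    """Chunk (C) / Reset (R): split contiguous TB IDs evenly across CUs,
    always starting from CU 0, so TB i lands on the same CU every kernel."""
    base, extra = divmod(num_tbs, num_cus)
    ranges, start = [], 0
    for cu in range(num_cus):
        end = start + base + (1 if cu < extra else 0)
        ranges.append(list(range(start, end)))
        start = end
    return ranges

def flip(ranges, kernel_id):
    """Flip (F): on every other kernel, each CU walks its chunk in reverse,
    so the TBs that finished last (hottest in L1) execute first."""
    if kernel_id % 2 == 1:
        return [list(reversed(r)) for r in ranges]
    return ranges

def steal(ranges, thief):
    """Steal (S): an idle CU takes one TB from the *end* of a victim's chunk
    (decrementing the victim's end ID), probing its left neighbor, right
    neighbor, then searching sequentially. Returns the victim CU or None."""
    n = len(ranges)
    order = [(thief - 1) % n, (thief + 1) % n] + list(range(n))
    for victim in order:
        if victim != thief and ranges[victim]:
            ranges[thief].append(ranges[victim].pop())
            return victim
    return None

# Usage: a 24-TB kernel on 4 CUs.
rs = chunk_ranges(24, 4)
assert rs[0] == list(range(0, 6)) and rs[3] == list(range(18, 24))
assert flip(rs, 1)[0][0] == 5        # odd kernel: CU 0 starts from TB 5
victim = steal(rs, 0)                # CU 0 runs dry and steals one TB
assert victim == 3 and 23 in rs[0] and len(rs[3]) == 5
```

Stealing from the end of the victim's chunk is the locality-preserving choice: the victim's earliest remaining TBs are the ones most likely to match data already resident in its L1 from the previous kernel.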