# PAVER: Locality Graph-Based Thread Block Scheduling for GPUs

###### tags: `GPUs`
###### Paper: [link](https://dl.acm.org/doi/fullHtml/10.1145/3451164)
###### From: ACM Transactions on Architecture and Code Optimization
###### Published: 08 June 2021
###### School: University of California, Riverside, USA

## INTRODUCTION

* <img src="https://dl.acm.org/cms/attachment/7d1545d5-2ce5-4129-8230-109a51941ccf/taco1803-32-f01.jpg">
* memory traffic has become a major bottleneck for present-day GPUs
    * only a small amount of private cache can be allocated to each thread
    * constant demand for data from the GPU's many computation cores
    * ever-increasing data size of GPU applications
* L1 and L2 caches are therefore much more likely to suffer from cache thrashing
    * leads to more cache operations
    * more energy consumption
    * tremendous under-utilization of the other GPU resources
* threads sharing the cache
    * read-after-write (**RAW**): structured parallelism
    * read-after-read (**RAR**): unstructured parallelism
* this paper
    * Since the grid structure of the tasks is application-specific, PAVER addresses this problem with a generic **graph-based approach** to improve performance and memory efficiency.
    * A just-in-time (JIT) compilation approach gathers the data-sharing statistics among the thread blocks (TBs) that constitute the same kernel and are able to use the same allocated memory. The JIT analysis is performed after compilation and before the kernel launch.
    * The locality graph is partitioned in order to assign the TB groups with the most locality to SMs.
    * A task-stealing process moves a TB from one SM queue to another when all the TBs in the latter finish early.

## BACKGROUND AND MOTIVATION

* baseline TB scheduling is assumed to be LRR (loose round-robin)
* miss categories
    * cold miss
    * capacity miss
    * conflict miss (set conflicts)
* locality categories
    * temporal locality
    * spatial locality
* GPU caches' locality
    * L1 locality
        * If TBs that use distant parts of the memory also share the same SM, it will lead to unnecessary L1 evictions and thrashing, hurting performance.
    * L2 locality
        * If TBs executing on different SMs lack block locality, they will suffer from L2 evictions.
* cache hit and miss distribution for different benchmarks
    * L1 misses
        * 33.6% of all misses are conflict or capacity misses on average
    * L2 misses
        * 9% conflict or capacity misses on average

***

* TBs' locality categories
    * <img src="https://dl.acm.org/cms/attachment/77e58722-4a71-49dc-97eb-58a8ffacb72a/taco1803-32-f02.jpg">
    * Unshared
        * a data block is accessed by only a single warp in a single TB
    * Intra-TB
        * multiple warps within a TB access the same data
    * Inter-TB
        * multiple TBs access the same data
    * Intra-TB ∩ Inter-TB
        * a data block is accessed by multiple warps within a TB, and across multiple TBs
    * this paper targets **inter-TB** and **intra-TB ∩ inter-TB** locality

## OVERVIEW

* locality-aware TB scheduling framework

<img src="https://dl.acm.org/cms/attachment/25339737-d9a6-49db-a3cb-1d8b647a93ce/taco1803-32-f03.jpg">

1. Extract the **load address ranges** for each TB from the PTX code
2. Construct the **locality graph**
3. Partition the graph such that each **partition** contains TBs having maximum locality among them
4. The locality-aware **TB scheduler** assigns the TBs to the SMs

## GENERATING LOCALITY GRAPHS

### Locality Graph

* vertices: thread block (TB) IDs
* edge weight: the number of **common shared data references** accessed by those TBs
    * every element at location (i, j) specifies the number of data references shared between TBi and TBj
    * symmetric matrix

<img src="https://dl.acm.org/cms/attachment/c13715f9-e269-48ef-a212-5abc3a2f5665/taco1803-32-f04.jpg" width="50%" height="50%">

***

* the **color** density of a point represents the **weight** of that particular edge
* "consecutive TBs have maximum locality" is not necessarily true
* locality patterns
    * same row and same column (SYRK, SYR2K, MM)
    * same row and first column of 2D grid (HS, DWT, SRAD)
    * same row of 1D grid (MGS, STO, PF, HTW)
    * arbitrary (BFS, BTR, SPMV)

<img src="https://dl.acm.org/cms/attachment/be6c7de8-3a7a-4873-8ab2-593d83d69f71/taco1803-32-f05.jpg" width="70%" height="70%">

### Identifying Locality Information

* the analysis is performed on the PTX code before the kernel launch in order to extract locality information from the kernel PTX
* JIT-based **static analysis** extracts TB locality information from each kernel
    * static (can be known at launch time)
        * A[0], A[i] (where i is a loop parameter), A[tid]
    * non-static
        * A[A[0]], since the value stored in A[0] cannot be known except at runtime
* <img src="https://dl.acm.org/cms/attachment/05f800af-bbc2-4856-891c-1e7b3aaec0db/taco1803-32-f06.jpg">
* formula: L(TBi, TBj) = |Ai ∩ Aj| + |Bi ∩ Bj|, where Ai and Bi are the statically extracted address sets of the loads (e.g., arrays A and B) accessed by TBi (a small sketch of this computation follows below)
* Applications with large sharing are likely to benefit from proper TB scheduling
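A minimal Python sketch of this locality computation, assuming the JIT pass has already produced per-TB, per-load address sets. The `tb_addrs` input and the block granularity are illustrative assumptions, not the paper's actual data structures:

```python
from itertools import combinations

def build_locality_graph(tb_addrs):
    """Build a symmetric TB locality graph.

    tb_addrs: dict mapping TB id -> list of per-load address sets,
              e.g. {0: [{addresses of A}, {addresses of B}], ...}
              (a hypothetical stand-in for the JIT-extracted load
              address ranges).
    Returns a dict {(i, j): weight} with
    L(TBi, TBj) = |Ai ∩ Aj| + |Bi ∩ Bj| + ...
    """
    weights = {}
    for i, j in combinations(sorted(tb_addrs), 2):
        # sum the overlap of the address sets of each load instruction
        w = sum(len(a & b) for a, b in zip(tb_addrs[i], tb_addrs[j]))
        if w > 0:                      # only keep edges with sharing
            weights[(i, j)] = w
            weights[(j, i)] = w        # the matrix is symmetric
    return weights

# toy example: two loads (arrays A and B), four TBs
tb_addrs = {
    0: [{0, 1, 2, 3}, {100, 101}],
    1: [{2, 3, 4, 5}, {100, 101}],
    2: [{8, 9},       {200, 201}],
    3: [{9, 10},      {200, 201}],
}
print(build_locality_graph(tb_addrs))
# TB0-TB1 share {2, 3} of A and {100, 101} of B -> edge weight 4, etc.
```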
***

* overhead
    * does not affect the **kernel-execution time** but increases the **host-side time**
    * ![](https://i.imgur.com/d9TKHD4.png)
    * the device-to-host cudaMemcpy has been excluded from the overhead calculation as it does not delay the kernel load time
    * The JIT overhead for applications with high inter-TB locality, SYRK, SYR2K, and MM, is 0.26%, 0.2%, and 0.01%, respectively

## PAVER TB SCHEDULING

* <img src="https://dl.acm.org/cms/attachment/151dbcfd-cc95-4d83-a518-d9d99916f4e2/taco1803-32-f07.jpg" width="70%" height="70%">
    * (a) MST-TS
        * a naive approach
    * (b) kway-TS
        * improves the hit rate of the L1 caches
    * (c) RB-TS
        * accounts for both L1 and L2 cache performance

### Baseline

* BCS
    * groups two consecutive TBs into a pair and assigns them to the same SM
    * the TB-pair assignment is delayed until the TB contexts for the pair are available, leading to resource starvation in the SM

### Maximum Spanning Tree-Based TB Scheduler (MST-TS)

* can use Prim's or Kruskal's algorithm; this paper uses the former
* aims at achieving high L1 locality and load balancing across the SMs
* shortcoming
    * only captures one-dimensional (pairwise) sharing, without considering data shared among larger groups of TBs

### k-Way Partition-Based TB Scheduler (kway-TS)

* k is the total number of partitions, equal to the number of SMs
* METIS is used to partition the graph in a **load-balanced manner** (evenly sized partitions) while maximizing the sum of edge weights within partitions
    * the resulting groups have higher connectivity than those produced by MST-TS
* the TBs in each partition are re-ordered using **Prim's MST** such that the subset of TBs in each partition executing concurrently has maximum locality with each other (see the sketch after this section)
* shortcoming
    * multiple partitions without mutual locality run simultaneously on different SMs, so the locality between the SMs may be lost, leading to L2 thrashing; this type of scheduling prioritizes L1 locality over L2 locality
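A minimal Python sketch of the kway-TS idea under simplifying assumptions: a greedy balanced partitioner stands in for METIS, and each partition is then ordered with a lazy Prim's maximum-spanning-tree pass. The function names and the greedy heuristic are illustrative, not the paper's implementation:

```python
import heapq

def kway_partition(weights, num_tbs, k):
    """Greedy stand-in for METIS: place each TB in the not-yet-full
    partition it shares the most locality with, keeping sizes balanced."""
    cap = -(-num_tbs // k)                 # ceiling(num_tbs / k)
    parts = [[] for _ in range(k)]
    for tb in range(num_tbs):
        best, best_gain = None, -1
        for p in parts:
            if len(p) >= cap:              # keep partitions evenly sized
                continue
            gain = sum(weights.get((tb, o), 0) for o in p)
            if gain > best_gain:
                best, best_gain = p, gain
        best.append(tb)
    return parts

def mst_order(weights, part):
    """Visit the TBs of one partition in Prim's maximum-spanning-tree
    order, so TBs issued back-to-back have high mutual locality."""
    if not part:
        return []
    order, seen = [part[0]], {part[0]}
    heap = [(-weights.get((part[0], o), 0), o) for o in part[1:]]
    heapq.heapify(heap)
    while heap:
        _, tb = heapq.heappop(heap)        # heaviest remaining edge first
        if tb in seen:
            continue
        seen.add(tb)
        order.append(tb)
        for o in part:
            if o not in seen:
                heapq.heappush(heap, (-weights.get((tb, o), 0), o))
    return order

# usage with the toy locality graph from the previous sketch
weights = {(0, 1): 4, (1, 0): 4, (2, 3): 3, (3, 2): 3}
parts = kway_partition(weights, num_tbs=4, k=2)
print([mst_order(weights, p) for p in parts])   # e.g. [[0, 1], [2, 3]]
```

In the real scheduler k equals the number of SMs, so each ordered partition becomes one SM's TB queue.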
### Recursive Bi-Partition-Based TB Scheduler (RB-TS)

<img src="https://dl.acm.org/cms/attachment/203e7935-9774-4532-b3c9-cfb418c90704/taco1803-32-algo1.svg">

* creates a binary tree where the leaf nodes (G1, G2, G3, and G4) are prioritized from left to right
* This preserves L1 locality by grouping TBs within the same leaf node, and preserves L2 data locality by concurrently scheduling adjacent groups that share the same parent on different SMs (a sketch follows below).
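A minimal Python sketch of the RB-TS idea: recursively bisect the TB set into two balanced halves until each leaf fits one SM's queue, then emit the leaves in left-to-right priority order. The greedy bisection and the `leaf_size` parameter are illustrative stand-ins for the graph bisection used in the paper's Algorithm 1:

```python
def bisect(weights, tbs):
    """Greedy balanced bisection stand-in: start the left half from the
    TB with the largest total locality inside this group, then pull in
    the TBs best connected to the growing half."""
    if len(tbs) <= 1:
        return tbs, []
    half = len(tbs) // 2
    seed = max(tbs, key=lambda t: sum(weights.get((t, o), 0) for o in tbs))
    left, rest = [seed], [t for t in tbs if t != seed]
    while len(left) < half:
        best = max(rest, key=lambda t: sum(weights.get((t, l), 0) for l in left))
        left.append(best)
        rest.remove(best)
    return left, rest

def rb_ts(weights, tbs, leaf_size):
    """Recursively bi-partition the TBs; return the leaf groups
    (G1, G2, ...) in left-to-right priority order."""
    if len(tbs) <= leaf_size:
        return [tbs]
    left, right = bisect(weights, tbs)
    return rb_ts(weights, left, leaf_size) + rb_ts(weights, right, leaf_size)

# usage with the toy locality graph from the earlier sketches
weights = {(0, 1): 4, (1, 0): 4, (2, 3): 3, (3, 2): 3}
groups = rb_ts(weights, tbs=[0, 1, 2, 3], leaf_size=2)
print(groups)   # e.g. [[0, 1], [2, 3]]
```

Adjacent leaves under the same parent are then dispatched to different SMs, so the data they share tends to stay resident in the shared L2.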
## PAVER RUNTIME

### Hardware Implementation

* a global queue (located in global memory) is used to store the pointers to the TB groups
* in global memory
    * global queue
    * TB groups
* in each SM
    * three registers (next, tail, and nextTB)
* ![](https://i.imgur.com/s3PYeRS.png)
* storage overhead:
    * 2 × 64-bit registers (for next and tail)
    * 1 × 16-bit register (for nextTB)
* timing overhead:
    * off the critical path

### Task Stealing

* PAVER focuses on RAR locality
* the granularity of stolen tasks is determined as: MaxWaitingTB − AverageWaitingTB

![](https://i.imgur.com/BhdFeGK.png)

* If the Stolen TB count is > 2, it still captures some locality
* stealing affects the L1 locality within the donor SM, but the overall kernel execution saves the extra cycles by load-balancing the SMs
* storage overhead:
    * MaxWaitingTB (16-bit)
    * AverageWaitingTB (16-bit)
    * Stolen TB count (16-bit)
* control-logic overhead:
    * 3 adders, 1 comparator, and 1 divider

### Generalized Runtime Algorithm for all TB Policies

![](https://i.imgur.com/yX8Mlrg.png)

## EVALUATION

### Methodology

* GPGPU-Sim
* warp scheduling policy:
    * greedy-then-oldest (GTO)
    * CCWS can be applied on top of PAVER to further improve performance
* compared against Loose Round Robin (**LRR**) and the Block CTA Scheduler (**BCS**) as baselines
    * LRR exploits locality mainly in the L2
    * BCS schedules two neighboring TBs at the same time
        * has better L1 locality than LRR
        * same L2 locality as LRR

### Results

#### Execution time

* kernel execution time as well as the JIT analysis overhead of the different TB scheduling policies with respect to LRR and BCS
* ![](https://i.imgur.com/jJC8kHp.png)

#### Speedup

* ![](https://i.imgur.com/vQfnDYp.png)
* ![](https://i.imgur.com/g5fPhSa.png)
* Fermi
    * high locality
        * 2.8%, 23.8%, and 29%
    * low locality
        * 1.8%, 2.5%, and 3.8%
    * no locality
        * reduced by 0.15%, 0.3%, and 0.5%
    * average speedup
        * 1.8%, 10.1%, and 12.6%
* Pascal
    * high locality
        * 10.8%, 32.5%, and 49.1%
    * low locality
        * 0.2%, 1.8%, and 3.7%
    * overall speedup
        * 3.3%, 13.32%, and 20.49%
* Volta
    * high locality
        * 4.5%, 28.6%, and 41.2%
    * low locality
        * 0.4%, 3.5%, and 3.8%
    * overall speedup
        * 0.18%, 12.4%, and 17.4%
* warp scheduling techniques can further increase the speedup
* BCS leads to thread-block throttling
* Applications such as BTR, BFS, and SPMV, which have irregular data-locality patterns, are also captured by the graph-based approach and show significant speedups

#### Effect of Task Stealing

* average performance benefits of 3% and 2% were observed for kway-TS and RB-TS, respectively, compared with the no-task-stealing case on Fermi

#### Effect of Cache Size

* with an increased cache size, PAVER outperforms the baseline configuration by 16%

#### Miss Rates

* lower L1 miss rates in RB-TS as compared to kway-TS

![](https://i.imgur.com/wwMan7v.png)

* L1 miss rate
    * reduced by 43.3%, 10%, and 21%
    * the L1 miss rate shown is proportional to the L2 accesses
* L2 accesses
    * Fermi
        * high locality
            * reduced by 78.4%, 61.4%, and 56.7%
        * low locality
            * reduced by 95%, 91.88%, and 90.22%
        * overall
            * reduced by 89.6%, 81.7%, and 79.3%
    * Pascal
        * high locality
            * reduced by 79%, 67%, and 51.59%
        * low locality
            * reduced by 94.2%, 92.6%, and 89.1%
        * overall
            * reduced by 89.6%, 84.3%, and 76.94%
    * Volta
        * high locality
            * reduced by 76.4%, 65%, and 59.8%
        * low locality
            * reduced by 90.9%, 85.5%, and 85.3%
        * overall
            * reduced by 87%, 80.5%, and 78.3%
* In kway-TS
    * the partition size is **huge**, and the concurrently running TBs do not necessarily have the **maximum sharing within the partition**
* In RB-TS
    * all the concurrently running TBs, i.e., all the TBs inside a partition, are likely to have maximum sharing