# Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs
###### tags: `GPUs`
###### paper origin: ISCA 2015
###### papers: [link](https://dl.acm.org/doi/epdf/10.1145/2872887.2750393)
# 1. Introduction
## 1.1 Problem

* Irregular applications are challenging to map effectively onto data-parallel accelerators such as GPUs because they are dominated by irregular control, data, and memory access flows.
* Recently introduced nested device-side kernel launching functionality in the GPU is a step in the right direction, but still falls short of being able to effectively harness the GPU's performance potential
## 1.2 Solution
* The Dynamic Thread Block Launch (DTBL) mechanism launches lightweight thread blocks dynamically and on demand from GPU threads
* The overhead of launching a thread block is considerably smaller than that of launching a kernel
* They extend the GPU microarchitecture to integrate the execution of these dynamically spawned parallel work units into the base GPU microarchitecture
## 1.3 Contribution
1. This paper introduces DTBL as an effective lightweight execution mechanism for spawning dynamically created parallel work.
2. This paper implements a device-side runtime API that supports DTBL with a single call, by leveraging the CUDA Dynamic Parallelism (CDP) API.
3. The base GPU microarchitecture is extended at a modest cost to support the dynamic generation, tracking, and efficient management of the spawned thread blocks.
4. DTBL achieves an average 1.21x speedup over the original implementations that use a flat GPU programming methodology and an average 1.40x speedup over implementations that use device-side kernel launches with CDP
# 2. Background

## 2.1 Baseline GPU Architecture and Execution Model
* GPU is connected to the host CPU by the interconnection bus and accepts operation commands such as memory copy and kernel launching from the CPU.
* Thread blocks are scheduled and executed on the GPU computation units called Streaming Multiprocessors (SMXs), each of which features multiple cores, registers and a scratch memory space used as L1 cache or shared memory
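
A minimal sketch of this execution model, with illustrative kernel and buffer names (not from the paper): the host launches a grid of thread blocks, and each block runs on one SMX with its own shared-memory allocation.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each thread block stages data in per-block shared
// memory on the SMX that executes it, then writes the scaled values back.
__global__ void vecScale(float *data, float alpha, int n) {
    __shared__ float tile[256];                 // scratch space used as shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i] * alpha;
        data[i] = tile[threadIdx.x];
    }
}

int main() {
    const int N = 1 << 20;
    float *d;
    cudaMalloc(&d, N * sizeof(float));
    // A grid of 4096 thread blocks with 256 threads each; the hardware
    // distributes the blocks across the available SMXs until the grid finishes.
    vecScale<<<(N + 255) / 256, 256>>>(d, 2.0f, N);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```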
## 2.2 Host-side Kernel Launching and Scheduling
* All kernel launch commands are passed to the GPU through streams (e.g. CUDA streams)
* The streams are mapped to Hardware Work Queues (HWQ) in the GPU
* The Kernel Management Unit (KMU) manages multiple HWQs by dispatching the kernel at the head of each queue to the Kernel Distributor
* The Kernel Distributor holds all the active kernels ready for execution.
* The SMX scheduler takes one entry from the Kernel Distributor in first-come-first-serve (FCFS) order
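
A brief sketch of this host-side launch path, with illustrative kernel names: kernels placed in different CUDA streams map to different HWQs, so the KMU can dispatch them to the Kernel Distributor and the SMX scheduler can run them concurrently.

```cuda
#include <cuda_runtime.h>

__global__ void kernelA() { /* ... */ }
__global__ void kernelB() { /* ... */ }

int main() {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Launches in different streams land in different HWQs and may execute
    // concurrently; launches within the same stream are serialized in order.
    kernelA<<<64, 128, 0, s0>>>();
    kernelB<<<64, 128, 0, s1>>>();

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return 0;
}
```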
## 2.3 Device-side kernel launching
* In this model, parent kernels are able to launch nested child kernels by invoking several device-side API calls that specify the child kernel configuration and set up its parameters
* There is a path from each SMX to the KMU so that all SMXs are able to issue new kernel launch commands to the KMU
* Device-launched kernels also take advantage of the concurrent kernel execution capability, since the KMU dispatches them together with host-launched kernels
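
A minimal CDP sketch of such a device-side launch (requires `-rdc=true`, `-lcudadevrt`, and `sm_35` or newer); the BFS-style names (`bfsIteration`, `expandNeighbors`, CSR arrays) are illustrative, not the paper's code.

```cuda
// Child kernel: one thread per neighbor of a frontier vertex.
__global__ void expandNeighbors(const int *adj, int degree, int *nextFrontier) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < degree)
        nextFrontier[t] = adj[t];   // output compaction omitted for brevity
}

// Parent kernel: each thread handles one frontier vertex and launches a
// fine-grained child kernel for its neighbor list (a nested, device-side launch).
__global__ void bfsIteration(const int *frontier, const int *rowPtr,
                             const int *colIdx, int *nextFrontier, int frontierSize) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < frontierSize) {
        int u      = frontier[v];
        int degree = rowPtr[u + 1] - rowPtr[u];
        if (degree > 0)
            expandNeighbors<<<(degree + 31) / 32, 32>>>(colIdx + rowPtr[u],
                                                        degree, nextFrontier);
    }
}
```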
# 3. Dynamic Parallelism
## 3.1 Nested Device-side Kernel Launching
### High Kernel Density
* a BFS iteration can have about 3K device-side kernel launches
* this results in high kernel launching overhead and a large memory footprint
### Low compute Intensity
* device-launched kernels are usually fine-grained and have low degrees of parallelism
* the average number of threads in each device-launched kernel is around 40, which is close to the warp size (32)
### Low Concurrency and Scheduling Efficiency
* because of the low compute intensity of these device kernels and their low resource usage when executed on an SMX, multiple device kernels are able to execute concurrently
* the number of independent HWQs determines the maximum number of kernels that can be executed concurrently
* this imposes a limit on kernel-level concurrency under the current GPU kernel scheduling strategy
* as a result, there may not be enough warps to fully occupy the SMXs
* if some SMXs are not assigned any warps => low utilization
* if SMXs are not assigned enough warps => poor memory latency hiding
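
A rough worked example of this concurrency limit, under assumed Kepler-class values (32 HWQs, roughly 40 threads per device kernel, 13 SMXs with up to 2048 resident threads each; the hardware numbers are illustrative, not from the paper):

$$
32 \times 40 = 1280 \ \text{concurrent threads} \;\ll\; 13 \times 2048 = 26624 \ \text{resident thread slots}
$$

so even at maximum kernel-level concurrency, fine-grained device kernels leave most of the SMX capacity idle.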
## 3.2 Light Weight Thread Blocks
* They propose DTBL, where thread blocks rather than entire kernels can be dynamically launched from a GPU thread. Thread blocks (TBs) can be viewed as lightweight versions of a kernel.
* the dynamic creation of thread blocks can effectively increase the SMX occupancy => higher utilization
* Both the launch overhead and the memory footprint of a dynamic TB are significantly lower than those of a kernel launch.
# 4. Dynamic Thread Block Launch
## 4.1. DTBL Execution Model

* TBs with the same configuration can be scheduled together to achieve the designed occupancy for the original kernel, possibly leading to higher GPU utilization
### Thread Hierarchy Within an Aggregated TB
* When coalesced to an active kernel, the number of threads in each dimension of an aggregated TB should be the same as that of a native TB
### Synchronization
* threads within an aggregated TB can be synchronized explicitly by calling a barrier function
* because the base programming model has no explicit barrier valid across native or aggregated TBs, aggregated groups cannot be explicitly synchronized by their invoking kernel.
* Therefore, it is the programmers’ responsibility to ensure the correctness of the program without any assumption on the execution order of aggregated groups.
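
A small device-code sketch of these rules (the kernel body is illustrative): the barrier below is valid only among the threads of one native or aggregated TB.

```cuda
// Assumes the TB configuration is 128 threads per block.
__global__ void aggregatedWork(float *buf) {
    __shared__ float stage[128];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    stage[threadIdx.x] = buf[i];

    __syncthreads();   // valid: explicit barrier within one (native or aggregated) TB

    // No barrier spans TBs or aggregated groups, so nothing here may assume an
    // execution order relative to other TBs or to other aggregated groups.
    buf[i] = stage[blockDim.x - 1 - threadIdx.x];
}
```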
### Programming Interface


* The similarity between the two code segments demonstrates that DTBL introduces minimal extensions to the programming interface (see the sketch below)
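
A sketch contrasting the two code segments referred to above, under the assumption that the DTBL device call looks roughly like the prototype below; `cudaLaunchAggGroup` and its signature are an approximation of the paper's API, and `childKernel`/`workItem` are illustrative.

```cuda
struct workItem { float *data; int count; };

__global__ void childKernel(float *data, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) data[t] *= 2.0f;
}

// Hypothetical DTBL device API (approximation, not the paper's exact signature):
// launch an aggregated group of gridDim TBs of blockDim threads executing func;
// the hardware coalesces the group onto func's kernel if it is already active.
template <typename... Args>
__device__ void cudaLaunchAggGroup(void (*func)(Args...), dim3 gridDim,
                                   dim3 blockDim, Args... args);

// CDP version: each thread launches a fine-grained child *kernel*.
__global__ void parentCDP(workItem *items) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (items[i].count > 0)
        childKernel<<<(items[i].count + 127) / 128, 128>>>(items[i].data,
                                                           items[i].count);
}

// DTBL version: the same call site launches an aggregated *group of TBs* with
// the same TB configuration, to be coalesced with an active kernel.
__global__ void parentDTBL(workItem *items) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (items[i].count > 0)
        cudaLaunchAggGroup(childKernel, dim3((items[i].count + 127) / 128),
                           dim3(128), items[i].data, items[i].count);
}
```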
## 4.2 Architecture Extensions and SMX scheduling

### Launching Aggregated Groups (1)~(3)
This is the first step, which happens when the aggregated group launching API is invoked by one or more GPU threads.
* For each newly formed aggregated group, the SMX (1) allocates global memory blocks through the memory controller to (2) store the parameters and configuration information.
* After parameters are loaded to the parameter buffer, the SMX (3) passes the aggregation operation command to the SMX scheduler with the information for each aggregated group
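
A C-style software model of steps (1)-(3); the structure, field, and helper names below are assumptions standing in for hardware actions, not real APIs.

```cuda
#include <cstddef>

// Configuration of one aggregated group as seen by the SMX scheduler (illustrative).
struct AggGroupCmd {
    void *kernelEntry;   // function entry of the kernel whose TBs are being launched
    dim3  groupDim;      // number of TBs in the aggregated group
    dim3  blockDim;      // threads per TB (must match the target kernel's TBs)
    void *paramBuf;      // global-memory buffer holding the launch parameters
};

// Placeholders for hardware actions (hypothetical, not real APIs):
void *allocateGlobalMemory(size_t bytes);                       // via the memory controller
void  copyToGlobal(void *dst, const void *src, size_t bytes);   // fill the parameter buffer
void  sendToSMXScheduler(const AggGroupCmd &cmd);               // aggregation operation command

// Model of what the SMX does when a thread invokes the aggregated-group launch API.
void smxLaunchAggGroup(AggGroupCmd cmd, const void *params, size_t paramSize) {
    cmd.paramBuf = allocateGlobalMemory(paramSize);   // (1) allocate a global memory block
    copyToGlobal(cmd.paramBuf, params, paramSize);    // (2) store parameters and configuration
    sendToSMXScheduler(cmd);                          // (3) pass the command to the SMX scheduler
}
```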
### Thread Blocks Coalescing (4)~(9)
In this step, the SMX scheduler receives the aggregation operation command and attempts to match the newly launched aggregated groups with the existing kernels in the Kernel Distributor Entries (KDE) for TB coalescing based on aggregated group configurations.
* (4) The process is implemented as a new part of the DTBL scheduling policy
* (5) For each aggregation group, the SMX scheduler first searches the KDE to locate any existing eligible kernels that can accept the new TBs
* If none are found, the aggregated group is launched as a new device kernel.
* Their experiments show that an aggregated group is able to match eligible kernels on average 98% of the time
* (6) If an eligible kernel is found, the SMX scheduler allocates an entry in the Aggregated Group Table (AGT) with the three-dimensional aggregated group size and the parameter address
* (7) The AGT is composed of multiple Aggregated Group Entries (AGE) and serves to track all the aggregated groups. Aggregated groups that are coalesced to the same eligible kernel are linked together with the Next field of the AGE
* (8) The AGEI (or the global memory pointer of the new aggregated group information) is used to update the two KDE registers, Next AGEI (NAGEI) and Last AGEI (LAGEI), if necessary
* NAGEI indicates the next aggregated group to be scheduled in the kernel
* LAGEI indicates the last aggregated group to be coalesced to this kernel
* (9) All kernels in the Kernel Distributor are marked by the FCFS controller with a single bit when they are queued to be scheduled and unmarked when all of their TBs have been scheduled
* when a kernel is unmarked (its TBs have been distributed to the SMXs and are waiting to finish), the FCFS controller marks the kernel again so the new aggregated group can be scheduled the next time the kernel is selected by the SMX scheduler.
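
A software model of the structures used in steps (4)-(9); all field names and the simplified matching rule are assumptions for illustration.

```cuda
// Aggregated Group Entry (one AGE per aggregated group in the AGT).
struct AGE {
    dim3  groupDim;    // (6) three-dimensional aggregated group size (number of TBs)
    void *paramAddr;   // (6) parameter buffer address in global memory
    int   next;        // (7) next AGE coalesced to the same kernel, -1 if none
    int   exeBL;       // (12)(13) TBs of this group currently executing
};

// Kernel Distributor Entry (one KDE per active kernel).
struct KDE {
    void *kernelEntry; // kernel function entry, used for the eligibility check
    dim3  blockDim;    // TB configuration, also used for the eligibility check
    int   nagei;       // (8) Next AGEI: next aggregated group to be scheduled
    int   lagei;       // (8) Last AGEI: last aggregated group coalesced so far
    bool  marked;      // (9) FCFS bit: set while the kernel is queued for scheduling
};

// (5) Search the KDE array for an eligible kernel: same function entry and the
// same TB configuration (the check is simplified here).
int findEligibleKernel(const KDE *kde, int numKDE, void *entry, dim3 tbDim) {
    for (int i = 0; i < numKDE; ++i)
        if (kde[i].kernelEntry == entry &&
            kde[i].blockDim.x == tbDim.x && kde[i].blockDim.y == tbDim.y &&
            kde[i].blockDim.z == tbDim.z)
            return i;   // (6)-(8) the caller then allocates an AGE and updates NAGEI/LAGEI
    return -1;          // none found: launch the group as a new device kernel instead
}
```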
### Aggregated Thread Blocks Scheduling on SMX (10)~(13)
The last step in DTBL scheduling distributes all the aggregated TBs to the SMXs.
* (10) Once the SMX scheduler determines the native kernel or aggregated group to schedule, it records the corresponding index of KDE (KDEI) and AGEI in its control registers (SSCR)
* the NextBL field stores the index of the next TB to be distributed to the SMXs
* (11) The SMX scheduler then distributes TBs to each SMX. The Thread Block Control Register (TBCR) on each SMX is updated correspondingly using the same value of KDEI and AGEI in SSCR to record the kernel index in the Kernel Distributor and the aggregated group index in the AGT so the SMX can locate the function entry and parameter address correctly for the scheduled TB.
* (12) (13) Once the TB finishes execution, the SMX notifies the SMX scheduler to update the ExeBL field in the KDE and AGE which track the number of TBs in execution.
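
A software model of steps (10)-(13); the register and field names mirror the description above, but the code itself is only an illustrative assumption.

```cuda
// SMX Scheduler Control Registers (SSCR).
struct SSCR {
    int kdei;     // (10) index of the KDE currently being scheduled
    int agei;     // (10) index of the AGE being scheduled, -1 for native TBs
    int nextBL;   // index of the next TB to distribute to an SMX
};

// Per-SMX Thread Block Control Register (TBCR).
struct TBCR {
    int kdei;     // (11) lets the SMX locate the kernel's function entry via the KDE
    int agei;     // (11) lets the SMX locate the aggregated group's parameters via the AGE
};

// (11) Distribute the next TB of the selected kernel or aggregated group to one SMX.
void distributeTB(SSCR &sscr, TBCR &tbcr) {
    tbcr.kdei = sscr.kdei;   // copy the Kernel Distributor index
    tbcr.agei = sscr.agei;   // copy the AGT index (or -1 for a native TB)
    sscr.nextBL += 1;        // advance to the next TB of this kernel/group
}

// (12)(13) On TB completion, the SMX notifies the SMX scheduler, which
// decrements the ExeBL counters in the corresponding KDE and AGE (not modeled here).
```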
## 4.3 Benefits of DTBL
* Dynamic TB launches have less overhead compared to device-side kernel launching.
* Instead of processing the device-side launching kernel command through a long path from the SMX to KMU and then to the Kernel Distributor, TBs are directly grouped with active kernels in the Kernel Distributor by the SMX scheduler.
* Dynamically generated TBs are very likely to be coalesced to the same kernel, which enables more TBs to be executed concurrently and hence increases SMX occupancy
* this is due to the similarity of the dynamic workloads in irregular applications
* in contrast, concurrent execution of fine-grained device kernels is limited by the size of the Kernel Distributor
# 5. Evaluation

* CDP and DTBL without modeling launch latency are denoted as CDP-Ideal (CDPI) and DTBL-Ideal (DTBLI), respectively
* Because CDPI and DTBLI decrease control flow divergence and increase memory efficiency, they achieve average ideal speedups of 1.43x and 1.63x, respectively
* However, the non-trivial kernel launching overhead negates the CDP performance gain, resulting in an average 1.16x slowdown relative to the flat implementations
* DTBL, on the other hand, shows an average 1.21x speedup over the flat implementations and 1.40x over CDP. This demonstrates that DTBL preserves CDP's ability to increase control flow and memory regularity for irregular applications while using a more efficient scheduling strategy with lower launch overhead to increase overall performance

* decreasing the AGT size from 1024 to 512 causes a 1.31x slowdown
* increasing it to 2048 causes a 1.20x speedup