# TaskStream: Accelerating Task-Parallel Workloads by Recovering Program Structure

###### paper: [link](https://polyarch.cs.ucla.edu/papers/asplos2022-taskstream.pdf)
###### tags: `CGRA`
###### conference: ASPLOS, 2022

## 1. Introduction

#### Motivation
* CGRAs and dataflow architectures have come to prominence for addressing data-processing problems.
* However, they are largely limited to workloads with regular parallelism, precluding their applicability to prevalent task-parallel workloads.
* While promising for generality, most reconfigurable architectures remain fairly limited to regular computations.
* This paper focuses on irregular parallelism, also known as task parallelism: this occurs when a program's work is created and scheduled to execution resources dynamically, based on runtime computations.
* There are at least three clear benefits to supporting task parallelism in reconfigurable accelerators:
    1. Many workloads have inherent data dependences in forming parallel work, so this enables broader applicability.
    2. Sometimes the amount of work per task can only be determined at runtime, so dynamic assignment can balance load.
    3. Many irregular workloads have multiple task types, where each type stresses the system differently in terms of compute, memory, network, or other resources (e.g. memory-bound graph aggregation and compute-bound multiplication in Graph Convolutional Networks (GCNs)); running different task types in parallel can balance shared resource usage.

#### Contribution
* Novel model of task parallelism for reconfigurable accelerators, which enables a new class of structure-recovery optimizations.
* Hardware/software co-design to support three different classes of communication patterns efficiently:
    1. lightweight task creation,
    2. inter-task streaming dependences with co-scheduling of dependent tasks,
    3. inter-task reuse, exploiting spatial or temporal structure with batching.
* Evaluation against static and naïve task-parallel models, demonstrating the value of structure recovery in irregular workloads.

## 2. TaskStream Execution Model

### 2.1 Opportunities for Structure Recovery

![](https://i.imgur.com/7nRksWL.png)

#### a. Variable-sized Tasks
* Figure 1a shows an example where inner-loop tasks have a data-dependent length, based on B[i].
* A naïve task-parallel model would assign the inner-loop tasks irrespective of the work involved in each task.
* The opportunity here is to distribute tasks with knowledge of the work involved (see the scheduling sketch at the end of Section 2.1):
    * core 1 gets the smallest and second-largest task (i.e. with total work = 3 + 7 = 10), so that all cores get similar total work.

#### b. Coarse-grain Pipeline Reuse
* Figure 1b demonstrates a global reduction example where each core gets a tile of data.
* In the naïve task-parallel implementation, all cores need to perform updates on the reduction variable through memory.
* The opportunity here is to identify the ordered reuse, and pipeline or stream the data from a producer to one or more consumer tasks.
* This transforms the memory traffic into direct network traffic, reduces shared-memory overhead from coherence, and also allows overlapped execution of tasks for more concurrency.

#### c. Coarse-grain Read Reuse
* Figure 1c shows that the duplicates in B are expected to create multiple tasks with shared read data, providing an opportunity for reuse.
* A naïve task-parallel model schedules tasks without respecting locality, so tasks that access the same data may not be scheduled on the same core or at the same time.
* Such coarse-grain reuse can be exploited by identifying tasks that access the same data, and reordering them to execute at the same time on different cores;
* the responses can then be multicast to significantly reduce network traffic and memory bandwidth usage.
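The work-aware distribution of variable-sized tasks (Figure 1a) can be approximated by a greedy least-loaded scheduler. Below is a minimal sketch of that idea; the function name and the size hints are illustrative, not the paper's scheduler or its exact figure values.

```python
# Minimal sketch of work-aware task distribution (in the spirit of Figure 1a).
# Each task carries a size hint (e.g. B[i], the inner-loop trip count); the
# scheduler greedily sends the next-largest task to the least-loaded core.
def assign_tasks(size_hints, num_cores):
    load = [0] * num_cores                        # cumulative work per core
    placement = {c: [] for c in range(num_cores)}
    # Largest-first tightens the balance (classic LPT heuristic).
    for task, work in sorted(enumerate(size_hints), key=lambda t: -t[1]):
        core = load.index(min(load))              # least-loaded core so far
        placement[core].append(task)
        load[core] += work
    return placement, load

# Illustrative size hints (not the figure's exact values): both cores end up
# with total work 11, whereas a work-oblivious split can be badly skewed.
print(assign_tasks([3, 8, 2, 9], num_cores=2))
```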
### 2.2 TaskStream Model

TaskStream is a task-parallel execution model that adds sufficient information to identify and exploit structure-recovery opportunities.

#### TaskStream Basics
Nodes:
* a program is represented as a set of nodes, one for each task type
* a task can be in one of three states:
    1. Created: the arguments for a task instance are constructed on the originating core
    2. Scheduled: the task is bound to execution resources
    3. Executing: task computation is in progress

Edges:
* represent inter-task dependences
* edges are typed, and each type indicates the potential for structure recovery:
    * creation (standard)
    * streaming (for pipeline reuse)
    * batching (for read reuse)

Processing steps:
* Tasks are created when they receive a set of values for all incoming creation (standard) edges.
* Next, a task is scheduled to storage/execution resources (e.g. buffer/core), after which it is assigned a TaskID that represents this location.
* The TaskID may be returned to the parent if a streaming communication will be established.

Core configuration:
* Tasks may only be scheduled to a core which is configured for their task type.
* To convey the configuration information, each task node is annotated with a coreMask: a bitmap that describes the legal mapping locations.
* If there are sufficient resources, some task types can be co-located on the same core.

Pending tasks:
* Tasks that are not yet ready to execute may be waiting on streaming or batched data; we call these tasks pending.

![](https://i.imgur.com/jAwjVya.png)

#### a. Task Creation & Work-aware Load Balance
* The outer-loop task gets core 1 (coreMask: 001) while accumulation gets the remaining 2 cores (coreMask: 110), because the accumulation task has B[i] times more work than the outer loop.
* In the example, B[i] is the number of iterations of the inner loop, and is therefore used for that task's sizehint.
* The scheduler could then assign tasks of size 3, 8 to core 1 and tasks of size 2, 9 to core 2, resulting in a balanced load.

#### b. Task Streaming
* To facilitate dynamic pipelining between tasks, edges may be of the task streaming type.
* For a streaming edge, the programmer can specify a dependence distance (depDistance), which establishes a streaming relationship between task instances separated by a fixed number of tasks.
* In the example, the depDistance is 1.
* Start of streaming: to set up the communication, start-of-stream handshaking messages are exchanged to ensure that the children are ready and producers have their scheduling information.
* End of streaming: to close the communication, an end-of-stream message is sent when the required number of bytes has been streamed; this byte count must be specified by the producer.

#### c. Task Batching
* To enable multicasting of shared reads, we implement a task batching edge.
* This edge requires three parameters when it is activated: DataID indicates whether the reads are to the same data, TaskID indicates the dependent dynamic task, and bytes indicates the length of these reads.
* The task scheduler can use this information to record which tasks depend on the same reads, and reorder them so they are scheduled together.
* The outputs from B[i] are batched, resulting in only 3 unique requests instead of 6.
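To make the model's annotations concrete, here is a hedged sketch of how a TaskStream graph could be declared. Only the annotation names (coreMask, sizehint, depDistance, DataID) come from the paper; the API itself (`TaskNode`, `creation_edge`, `streaming_edge`, `batching_edge`) is invented for illustration and is not the paper's programming interface.

```python
# Hypothetical declaration of a TaskStream graph (illustrative API only).
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    name: str
    core_mask: int                      # bitmap of legal mapping locations
    edges: list = field(default_factory=list)

def creation_edge(parent, child, sizehint=None):
    # Standard edge: the child is created once values arrive on all of its
    # incoming creation edges; sizehint drives work-aware load balancing.
    parent.edges.append(("creation", child.name, {"sizehint": sizehint}))

def streaming_edge(parent, child, dep_distance):
    # Pipeline-reuse edge: the producer streams directly to the task instance
    # dep_distance instances later, instead of round-tripping through memory.
    parent.edges.append(("streaming", child.name, {"depDistance": dep_distance}))

def batching_edge(parent, child):
    # Read-reuse edge: instances carrying the same DataID are grouped so the
    # shared read is fetched once and multicast to every batched task.
    parent.edges.append(("batching", child.name, {}))

# A graph loosely modeled on Figure 2: an outer-loop task confined to a single
# core (coreMask 001) creates inner accumulation tasks on the remaining two
# cores, and successive accumulation tasks stream to each other (depDistance 1).
outer = TaskNode("outer_loop", core_mask=0b001)
accum = TaskNode("accumulate", core_mask=0b110)
creation_edge(outer, accum, sizehint="B[i]")
streaming_edge(accum, accum, dep_distance=1)
```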
## 3. TaskStream for Reconfigurable Accelerators

### 3.1 Hierarchical TaskStream Dataflow

#### Decoupled Dataflow Background
* In this model, computation and memory instructions are represented as nodes, and edges represent ordered dependences between instructions.
* Computation is decoupled from memory to allow for efficient prefetching.
* To execute a computation, the memory and compute nodes are first configured.

![](https://i.imgur.com/22wO30x.png)

#### Hierarchical Integration
* When a task is scheduled, its inputs are delivered to the corresponding ports, which triggers instruction-level execution.
* Tasks are then executed in pipelined fashion in the order they are scheduled.
* Producer ports at the parent deliver data to the consumer ports at the child task. A consumer port must be "acquired" by a parent task before communication can begin, as explained later.

#### Task Protocol
Scheduling:
* TaskStream checks whether any task argument is annotated with a sizehint; the task is sent to the core with the least cumulative work so far, and that core's cumulative-work counter is then incremented.
* Task instances (identified by their arguments) are held in a ready state if all arguments are available, or in a pending state otherwise.
* In Figure 3, the T2 task is ready after receiving B[i]; however, the T3 task will be in the pending state, as it is still waiting on M[j].

Streaming:
* Whenever there is data at the producer port of a task-streaming edge, an ack is sent to the child task along with the producer port information (Figure 3, step 3).
* This ack triggers a check of whether the child task can be concurrently scheduled:
    * this requires that the current task has finished and the consumer port is free.
* When both conditions are met: the child task is set ready and scheduled (Figure 3, step 4), the consumer port is set busy (i.e. it is acquired), and the ack response is sent back to start streaming (Figure 3, steps 4 and 5).
* After the last data is sent, the producer sends another ack to close the communication and free the remote port (Figure 3, step 8).

#### Deadlock Prevention
Self-loop in the TaskStream graph:
* Consider the scenario where the parent and child tasks, set up for streaming communication, are scheduled for execution on the same core.
* The child task may never be able to acquire the consumer port if the parent is already using it, and the parent cannot release the producer port until the streaming data is sent to the child, as it is waiting for the child to acquire the consumer port.
* Step 7: the child can never acquire the consumer port while the parent is using it (step 5).
* The solution is to allocate mutually exclusive sets of resources/cores to the parent and child tasks.

### 3.2 Programming

![](https://i.imgur.com/vdelrUJ.png)

#### 1. Defining Task Types
* For each task type, the programmer first defines the instruction-level dataflow graphs.
* For the example in Figure 4a, Cholesky has three task types, one for each loop nesting depth:
    1. Point: performs only one inverse and square root for every outer-loop iteration.
    2. Vector: performs O(n) multiplication operations.
    3. Matrix: performs O(n×n) multiplication and subtraction operations.
* Since the work required for Point and Vector is much smaller than for Matrix, the coreMask assigns one core to both Point and Vector, while all other cores are assigned to the Matrix task (see the sketch below).
* Figure 4d shows the mapping of Cholesky to the accelerator.
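Continuing the illustration, the Cholesky task types and core masks might be written as follows, reusing the hypothetical `TaskNode` / `*_edge` helpers sketched in Section 2.2; a 4-core accelerator and the `tiles_per_iter` placeholder are assumptions for concreteness, and the edges anticipate step 2 below.

```python
# Hypothetical Cholesky setup (Figure 4), reusing the TaskNode / *_edge
# helpers from the Section 2.2 sketch. A 4-core accelerator is assumed.
point  = TaskNode("point",  core_mask=0b0001)   # one inverse + sqrt per outer iteration
vector = TaskNode("vector", core_mask=0b0001)   # O(n) multiplies, co-located with Point
matrix = TaskNode("matrix", core_mask=0b1110)   # O(n*n) work, gets all remaining cores

# Looking ahead to step 2: the Point task creates Vector and Matrix-tile tasks,
# and Matrix-tile tasks stream to the Matrix-tile tasks of the next iteration.
tiles_per_iter = 8   # placeholder: number of matrix tiles in the k-th iteration
creation_edge(point, vector)
creation_edge(point, matrix)
streaming_edge(matrix, matrix, dep_distance=tiles_per_iter)
```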
#### 2. Defining the TaskStream Graph
* Next, the programmer uses algorithmic knowledge to identify edges among the task-type nodes.
* In Figure 4c, there is a creation edge from Point to Vector and Matrix-tile.
* There are data dependences among multiple Matrix tasks, hence there is a streaming edge from the Matrix-tile task to itself.
* The dependence distance is the number of matrix tiles in the k-th iteration of the outermost loop.

#### 3. Managing Program Phases
* A program phase starts when the programmer pushes an explicit task of any type.
* The phase is complete when all tasks have finished execution.
* Cholesky is initiated by creating a task for the outer-loop Point task.

### 3.3 Workload Mapping

![](https://i.imgur.com/zuFIJ2w.png)

#### kNN search
* For every query, a binary kd-tree is searched. When the leaf node is reached, data associated with the leaf is accessed to perform a linear search.
* Two task types are defined:
    1. Tree node: a tree-traversal step. Since each traversal will incur long latencies to access pointers, it is split into small tasks, each of which compares the query with the current node and outputs the next tree node.
    2. Leaf search: here the query is searched linearly in a long vector associated with the leaf. Many queries may search in the same leaf, generating coarse-grained reuse.

#### Graph Convolutional Network (GCN)
* Every vertex accumulates its feature vector into its outgoing neighbors, and when all the incoming feature vectors are received, the accumulated vector is multiplied with a weight matrix.
* Three task types are defined:
    * graph access,
    * feature vector updates (together performing aggregation),
    * matrix-vector multiplication.

#### Database + Machine Learning
* Three task types are defined:
    1. Sort: sort requires sufficient work to utilize all resources. Moreover, it reuses its outputs as inputs to the next iteration of sort, and therefore its output cannot form pipelined communication with other dependent task types. Hence, the subsequent computation is treated as another phase, executed after a task barrier.
    2. Join: requires O(n) comparison operations.
    3. k-means groupby: requires O(#matched-rows × d) operations to find the minimum distance.

#### Sparse Matrix Multiply
* A task type is defined as the product of a column of matrix 1 and a tile of matrix 2's corresponding row.
* Since different tiles of matrix 2's row access a common column, there is an opportunity for batching reuse.

## 4. Delta: A TaskStream Accelerator

Delta is the proposed multicore accelerator, which implements the TaskStream execution model.

![](https://i.imgur.com/bX2kbBR.png)

* The computation unit is a coarse-grained reconfigurable array, connected via hardware ports (vector ports).
* The stream controller generates memory requests from stream access patterns, and the responses are sent to the port interface for further communication.
* The novel aspects of the hardware are for task management:
    * task-creation unit (used for storing arguments for ready and pending tasks),
    * task-batching unit (used for detecting and scheduling tasks that read the same data).

#### Memory Hierarchy
* Each core has a small private scratchpad, and all cores share access to a shared, distributed on-chip cache.
* There are three predominant forms of reuse for memory accesses (routed as sketched below):
    1. Small, read-only data shared among tasks: this data should be cached on-chip during the entire algorithm; therefore, each core has a small private scratchpad.
    2. Shared data across a subset of tasks: here the use of a scratchpad would require software coherence, which would be difficult to manage. Therefore, Delta has a shared cache, and the reuse across tasks is exploited using task batching.
    3. Streaming data with no reuse: this data can bypass the cache hierarchy and is directly streamed from memory.
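A conceptual way to see how each reuse class maps to a level of the memory system is sketched below; the enum and routing function simply paraphrase the list above and are illustrative, not the hardware's actual control logic.

```python
# Conceptual mapping of Delta's three memory-reuse classes to the memory
# system (illustrative only; not the hardware's decision logic).
from enum import Enum, auto

class ReuseClass(Enum):
    SMALL_READ_ONLY = auto()   # shared by tasks for the entire algorithm
    TASK_SUBSET     = auto()   # reused across a subset of tasks
    STREAMING       = auto()   # touched once, no reuse

def route_access(reuse: ReuseClass) -> str:
    if reuse is ReuseClass.SMALL_READ_ONLY:
        # Keep on-chip for the whole phase in the per-core private scratchpad.
        return "private scratchpad"
    if reuse is ReuseClass.TASK_SUBSET:
        # Served by the shared distributed cache; task batching groups the
        # dependent tasks so one fetch can be multicast to all of them.
        return "shared cache + task batching"
    # No reuse: bypass the cache hierarchy and stream directly from memory.
    return "memory stream (cache bypass)"
```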
#### Memory Stream Controller
* The memory stream controller generates memory addresses using the access patterns and sizes determined by the task type and arguments.
* Along with an address generator, the memory stream controller also has a stream table that holds the running state of the streams.

#### Task Creation Unit
* This unit includes a free list, a task argument buffer, a dedicated acknowledgment buffer, and a FIFO scheduler.
* The task argument buffer is a simple SRAM memory that stores task arguments sequentially.
* The free-list queue maintains the free entries in this buffer.

Usage:
* When data at the producer ports is available and the free list has space, the arguments are pushed into the task creation buffer, and a unique TaskID is assigned.
* When a task is ready (i.e. it does not require an explicit acknowledgment or already has one), the address location of the current task (TaskID) is pushed to the FIFO scheduler.
* When all consumer ports have sufficient space, the FIFO scheduler releases the task arguments to the input ports in the given order. At that time, the corresponding entry is pushed back onto the free list.

#### Task Batch Buffer
* This unit also has a free list, a batching buffer, a small CAM, and a FIFO scheduler.
* The batching buffer is a banked SRAM memory.
* The difference here is that the free list maintains TaskIDs of a "batch" of task arguments instead of just one, as in the task creation unit.
* When the data at the producer ports is ready, the CAM is searched for the current DataID entry.
* If the entry is found, the new task arguments are stored at this DataID's next empty entry. Otherwise, a new entry is popped from the free list for this purpose (see the software model at the end of this section).

#### The Stream Table
* This table maintains the streams associated with TaskStream edges.
* These include the streams that transfer data:
    1. from producer ports to the corresponding task creation/batching buffer,
    2. from the task creation/batching buffer to the corresponding consumer ports,
    3. any streams for streaming data to remote cores.

Reordering circular queue:
* Messages need to be reordered so that the correct parents are matched to the correct children.
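Before moving to methodology, here is a small software model of the task-batching flow described above. The dict stands in for the CAM keyed by DataID, and all names are illustrative rather than the hardware's interface.

```python
# Software model of the task-batching flow (Task Batch Buffer): a CAM keyed
# by DataID groups the TaskIDs of tasks that read the same data; the dict
# below stands in for that CAM and is purely illustrative.
from collections import defaultdict

class BatchBuffer:
    def __init__(self):
        self.batches = defaultdict(list)     # DataID -> tasks sharing the read

    def add_task(self, data_id, task_id, args):
        # CAM hit: append to the existing batch; CAM miss: start a new batch
        # (hardware would pop a fresh entry from the free list instead).
        self.batches[data_id].append((task_id, args))

    def release(self, data_id, data):
        # Once the shared read returns, multicast it to every batched task,
        # so the data is fetched from memory only once.
        return [(task_id, args, data) for task_id, args in self.batches.pop(data_id, [])]

# Example: two tasks reading the same B[i] generate a single memory request.
buf = BatchBuffer()
buf.add_task(data_id=42, task_id=0, args=("query0",))
buf.add_task(data_id=42, task_id=1, args=("query1",))
print(buf.release(42, data="B[42]"))         # both tasks receive the one response
```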
## 5. Methodology

#### Delta Power/Area
* DSAGEN, a Chisel-based framework
* Components were synthesized using a 28 nm UMC library
* Cacti 7.0 for estimating the overhead of the SRAM buffers

#### Baseline Architectures
* They compare against a 24-core Skylake CPU running optimized libraries:
    * MKL for Cholesky and SpMSpM,
    * MADlib for DB+ML,
    * the popular FLANN library for kNN,
    * the PyG library for GCN.
* They developed a simulator for Delta and integrated it with gem5, using a RISC-V ISA for the control core.
* For accelerator comparison points, they evaluated three designs:
    * Static Parallel
    * Delta, OnlyTasks
    * Delta, TaskStream

![](https://i.imgur.com/ZzlHvQc.png)

## 6. Evaluation

#### Overall Performance

![](https://i.imgur.com/Iy6AnX4.png)
![](https://i.imgur.com/Aa1fzJf.png)

* Figure 7 shows that Delta-OnlyTasks achieves a 1.3× speedup over the static reconfigurable accelerator that represents the state of the art.
* Cholesky does not benefit, due to low parallelism in both the static-parallel and OnlyTasks designs.
* Figure 7 also shows that with structure-recovery optimizations, the speedup increases to 2.2× over the static-parallel accelerator.

#### Memory Traffic Reduction

![](https://i.imgur.com/WhRAdtC.png)

* To explain the source of the performance improvement from structure recovery, Figure 8 shows what percentage of memory traffic is converted into network traffic in Delta-TaskStream.

#### Fine-grained Core-wise Throughput Comparison
* TaskStream optimizations improve core utilization by alleviating communication bottlenecks while providing load balance.
* Core 0 is used for lower-rate (or less compute-intensive) tasks, and is therefore generally under-utilized.

![](https://i.imgur.com/HuxoOV1.png)