# PolyGraph: Exposing the Value of Flexibility for Graph Processing Accelerators
###### tags: `Accelerators`
#### [paper](http://web.cs.ucla.edu/~tjn/papers/isca2021-polygraph.pdf)
#### [slide](https://drive.google.com/file/d/1P6spVs5Uszh1zwtIGwR2Wt5YNLXJNh3z/view)
#### ISCA, 2021
## Abstract
* Motivation
* importance of graph workloads
* limitations of CPUs/GPUs
* Prior accelerators
* single graph algorithm variant
* This work
* identify a taxonomy of key algorithm variants
* develop a template architecture (PolyGraph) that is flexible across these variants while being able to modularly integrate specialization features for each
* find that flexibility in graph acceleration is critical
## 1. INTRODUCTION
* Motivation
* Challenging for CPUs/GPUs due to data-dependent memory access, reuse, and parallelism
* Opportunities:
* commutative updates
* repetitive structure in memory access and computation
* Prior work's assumption
* input graph type (e.g. high vs. low diameter)
* workload property (e.g. order resilience, frontier density)
* graph algorithm variants
* Update Visibility (the granularity at which graph updates become visible)
* Vertex Scheduling (the fine-grain scheduling policy for vertex updates)
* Slice Scheduling (whether and how the graph working set is controlled)
* Update Direction (push vs. pull)
* performance
* limitation
* lack of support for fine-grain data-dependent parallelism
* Insight
* having the <span style="color:red">flexibility</span> to use the right algorithm variant for the right graph and workload
* Challenge
* design an architecture with sufficient algorithm/architecture flexibility, and little performance, area, and power overhead
* task granularity (synchronous vs. asynchronous updates)
* fine-grain task scheduling
* flexibly controlling the working set
* having flexibility for different data structures
* Approach
* efficient decoupled spatial accelerators
* support general data-structures
* suit both memory-intensive and compute-intensive workloads
* Evaluation and Results
* 16.79× (up to 275× for high diameter graphs) faster than a Titan V GPU
* By statically choosing the best algorithm variant, we gain 2.71× speedup. Dynamic flexibility provides 1.09× further speedup
## 2. GRAPH ACCELERATION BACKGROUND
### A. Vertex-centric, Sliced Graph Execution Model
* vertex-centric graph execution
* a user-defined function is executed over vertices
* This function accesses properties from adjacent vertices and/or edges
* execution continues until these properties have converged
* Preprocessing the graph
* better spatial and/or temporal locality
* temporal slices
* fit into on-chip memory
* spatial slices
* divide the graph among cores for load-balance or locality
* Graph Data-structures (a minimal CSR sketch below)
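To make the vertex-centric model and a typical graph data structure concrete, here is a minimal CSR (compressed sparse row) sketch; the class, the convergence test, and the toy PageRank-style update are illustrative assumptions, not the paper's code:

```python
# A minimal CSR graph: offsets[v]..offsets[v+1] indexes into neighbors.
class CSRGraph:
    def __init__(self, offsets, neighbors):
        self.offsets = offsets      # len == num_vertices + 1
        self.neighbors = neighbors  # all adjacency lists, flattened

    def out_neighbors(self, v):
        return self.neighbors[self.offsets[v]:self.offsets[v + 1]]

def vertex_centric(g, prop, update_fn, max_iters=100, eps=1e-6):
    # Run the user-defined function over every vertex until the
    # property vector converges (synchronous, dense-frontier style).
    for _ in range(max_iters):
        new_prop = [update_fn(g, prop, v) for v in range(len(prop))]
        if max(abs(a - b) for a, b in zip(prop, new_prop)) < eps:
            return new_prop
        prop = new_prop
    return prop

# Example: toy PageRank-style update over a 3-vertex graph.
g = CSRGraph([0, 2, 3, 4], [1, 2, 0, 0])
ranks = vertex_centric(
    g, [1.0, 1.0, 1.0],
    lambda g, p, v: 0.15 + 0.85 * sum(
        p[u] / max(1, len(g.out_neighbors(u))) for u in g.out_neighbors(v)))
```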
### B. Key Workload/Graph Properties
* Graph Property:
* Diameter
* largest distance between two vertices
* Uniform-degree graphs
* high diameter
* similar (and low) number of edges per vertex
* power-law graphs
* low diameter
* some vertices are highly connected
* Workload Property:
* Order Sensitivity
* sensitive
* SSSP
* less sensitive
* BFS
* insensitive
* GCN
* Frontier Density
* sparse frontier
* SSSP, BFS
* dense frontier
* PR, CF
* sparse frontier workloads require fewer passes through the graph until convergence (a direction heuristic sketch follows this list)
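Frontier density is also what drives the push/pull choice later in the taxonomy; a common rule of thumb (my assumption, not the paper's exact policy) looks like:

```python
def choose_direction(frontier_size, num_vertices, threshold=0.05):
    # Rule of thumb: scatter (push) when few vertices are active,
    # gather (pull) when most are. The 5% threshold is illustrative.
    return "push" if frontier_size / num_vertices < threshold else "pull"
```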
## 3. GRAPH ALGORITHM TAXONOMY
* Update Visibility
* synchronous: updates become visible only at the next iteration
* asynchronous: updates become visible immediately
* Vertex Scheduling
* for asynchronous variants
* e.g. FIFO vs. priority-ordered dispatch of vertex updates
* Temporal Slicing
* Slices are determined during offline partitioning and are generally sized to fit in on-chip memory
* Updates to data outside the current slice are deferred
* an explicit phase is required to switch slices (sketch below)
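A minimal software model of sliced execution under these rules; `process_update`, the slice layout, and the buffering scheme are illustrative placeholders, assuming slices were produced offline:

```python
def run_sliced(slice_ids, initial, process_update):
    # deferred[s] buffers updates whose destination lives in slice s;
    # they are replayed when the scheduler switches to that slice.
    deferred = {s: list(initial.get(s, [])) for s in slice_ids}
    while any(deferred.values()):
        for s in slice_ids:                 # explicit slice-switch phase
            work = deferred[s]              # (re)load slice s on-chip here
            deferred[s] = []
            while work:
                upd = work.pop()
                for new_upd, dest in process_update(s, upd):
                    if dest == s:
                        work.append(new_upd)            # stays in-slice
                    else:
                        deferred[dest].append(new_upd)  # deferred update
```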
* Update Direction
* updates its own property (pull/remote read)
* updates its neighbor’s properties (push/remote atomic update)
* Notation
* push is the default (sketches of both directions below)
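A sketch of one iteration in each direction, reusing the CSRGraph sketch from Section 2; `reduce_fn` and `edge_fn` are hypothetical stand-ins for the user-defined vertex function:

```python
def push_iteration(g, prop, frontier, reduce_fn, edge_fn):
    # Push: active vertices scatter updates into neighbors' properties
    # (in hardware, a remote atomic update).
    for v in frontier:
        for u in g.out_neighbors(v):
            prop[u] = reduce_fn(prop[u], edge_fn(prop[v]))

def pull_iteration(g, prop, reduce_fn, edge_fn):
    # Pull: every vertex gathers from its in-neighbors and writes only
    # its own property (in hardware, remote reads; no atomics needed).
    # Assumes g holds in-edges, or the graph is symmetric.
    new_prop = list(prop)
    for v in range(len(prop)):
        for u in g.out_neighbors(v):
            new_prop[v] = reduce_fn(new_prop[v], edge_fn(prop[u]))
    return new_prop
```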
* Summary
## 4. UNIFIED GRAPH PROCESSING REPRESENTATION
* data plane (pipelined task execution)
* control plane (slice scheduling)
### A. Data Plane Representation: Taskflow
* major requirements
* Need for <span style="color:red">fully pipelined</span> execution of per-vertex computation
* Need to support <span style="color:red">data-dependent creation of new tasks</span>, including programmatically specifying and updating the <span style="color:red">priority ordering</span>
* Need for streaming/memory reuse
* a task is invoked as `<t, args>`: a <span style="color:red">type t</span> and <span style="color:red">input arguments</span>
* Each <span style="color:red">task type</span> is defined by a graph of nodes
* Compute nodes
* are passive, and may maintain a <span style="color:red">single state item</span>.
* Memory nodes
* represent <span style="color:red">decoupled patterns of memory access</span>, called <span style="color:red">streams</span>
* Atomics
* correct handling of <span style="color:red">memory conflicts</span> on vertex updates
* Task nodes
* <span style="color:red">represent arguments</span>, and are ingress and egress points of the graph
* Priority Scheduling and Coalescing
* task arguments include the task's priority and an ID
* the ID is unique across all active tasks (e.g. the vertex id)
* the ID enables task coalescing and sliced execution
* Taskflow Examples (an illustrative SSSP-style sketch below)
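The example figures did not survive in this note; as a stand-in, here is a hedged SSSP-style sketch of fine-grain tasks carrying a priority hint (the tentative distance). The software priority queue stands in for the task management hardware; all names are illustrative:

```python
import heapq

def sssp_taskflow(g, weights, src):
    # weights[i] is the weight of edge g.neighbors[i] (CSR-parallel array).
    n = len(g.offsets) - 1
    dist = [float("inf")] * n
    dist[src] = 0.0
    tasks = [(0.0, src)]                 # <priority hint, vertex id>
    while tasks:
        d, v = heapq.heappop(tasks)      # priority-ordered dispatch
        if d > dist[v]:
            continue                     # stale task; a better one already ran
        lo, hi = g.offsets[v], g.offsets[v + 1]
        for u, w in zip(g.neighbors[lo:hi], weights[lo:hi]):
            if d + w < dist[u]:
                dist[u] = d + w          # hardware: atomic min-update
                heapq.heappush(tasks, (d + w, u))  # data-dependent new task
    return dist
```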
* Taskflow Flexibility Summary
* Synchronous variants use <span style="color:red">coarse grain tasks</span> that pass through the (per graph/per slice) active list
* Asynchrony is supported with <span style="color:red">explicit fine-grain tasks</span>, optionally with a priority-hint argument
### B. Slice Scheduling Interface and Operation
* Slice scheduler
* configures on-chip memory, decides which slice to execute next, and manages data/task orchestration
* runs on a simple <span style="color:red">control core</span> with limited extensions for <span style="color:red">data pinning operations</span>
* also responsible for <span style="color:red">creating initial tasks</span>
* Data Pinning
* provide the slice scheduler an interface to <span style="color:red">pin</span> a range of data to the on-chip memory at a particular offset, essentially reserving a portion of the cache
* Non-pinned data is treated like a normal cache access (interface sketch below)
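A software model of what the pinning interface could look like; the class and method names are assumptions, since the real interface is a hardware extension of the control core:

```python
class PinnableCache:
    """Software model of a cache whose capacity can be partly pinned."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pins = []                   # [(base, size, offset), ...]

    def pin(self, base, size, offset):
        # Reserve on-chip bytes [offset, offset+size) for the address
        # range [base, base+size), e.g. the current slice's properties.
        assert offset + size <= self.capacity, "pin exceeds on-chip memory"
        self.pins.append((base, size, offset))

    def access(self, addr):
        for base, size, offset in self.pins:
            if base <= addr < base + size:
                return ("pinned", offset + addr - base)
        return ("cached", addr)          # non-pinned: normal cache path
```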
* Slice Switching for Asynchronous Variants
* tradeoff between <span style="color:red">work-efficiency</span> (switch sooner) and <span style="color:red">reuse</span> (switch later); a toy policy below
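One plausible shape for the switch policy (an assumption, not the paper's rule):

```python
def should_switch(tasks_in_current_slice, total_active_tasks, frac=0.1):
    # Switch sooner -> better work-efficiency (don't starve other slices);
    # switch later -> better reuse of the pinned slice data.
    # `frac` is an illustrative knob, not a value from the paper.
    return tasks_in_current_slice < frac * total_active_tasks
```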
* Slice Preprocessing
* Slices are preprocessed to keep all edges (and hence updates) within each slice
* Slice Transition
### C. Scheduling of Algorithm Variants
* Quantitative Motivation
* Notice that the highest-performance variant changes during execution
* Heuristics for Algorithm Variant Scheduling (an illustrative sketch after this list)
* Variant Transition
* Given the algorithm variant, the <span style="color:red">control core</span> will
* Initialize data-structures and configure taskflow graph
* Perform pinning operations
* If a <span style="color:red">dynamic switch</span> is invoked, on-chip memories are flushed, and taskflow may require <span style="color:red">reconfiguration</span>.
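For intuition only, a sketch of what a variant-selection heuristic might look like, combining the taxonomy axes from Section 3; the thresholds and the policy itself are assumptions, not the paper's heuristics:

```python
def pick_variant(order_sensitive, frontier_density, fits_on_chip):
    # Map workload/graph properties onto the four taxonomy axes.
    visibility = "async" if order_sensitive else "sync"
    scheduling = ("priority" if visibility == "async"
                  and frontier_density < 0.05 else "fifo")
    slicing = "none" if fits_on_chip else "temporal"
    direction = "push" if frontier_density < 0.05 else "pull"
    return visibility, scheduling, slicing, direction
```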
## 5. POLYGRAPH HARDWARE IMPLEMENTATION
* A [Softbrain](https://hackmd.io/EJXvQdj2RgKrQ6P7Ny9UTQ)-like CGRA executes compute nodes in pipelined fashion
* Multicore decoupled-spatial accelerator connected by 2D triple mesh networks
* The <span style="color:red">data plane</span> is comprised of all modules besides the <span style="color:red">control core</span>, and is responsible for executing <span style="color:red">taskflow graphs</span>
* <span style="color:red">Memory nodes</span> are maintained on <span style="color:red">stream address generators</span>, and accesses are decoupled to hide memory latency
* <span style="color:red">Compute nodes</span> execute in pipelined fashion
* Between the <span style="color:red">stream controller</span> and <span style="color:red">CGRA</span> are several “ports” or FIFOs, providing latency insensitive communication
* Task management
* A <span style="color:red">priority-ordered task queue</span> holds <span style="color:red">waiting tasks</span>
* <span style="color:red">Task nodes</span> define how incoming task arguments from the queue are consumed by the <span style="color:red">stream controller</span> to perform memory requests for new tasks
* Stream controller
* If the stream controller can accept a new task, the task queue will issue the highest priority task.
* issue memory requests from <span style="color:red">memory nodes</span> of any active task
* CGRA
* pipeline the computation of any <span style="color:red">compute nodes</span>
* <span style="color:red">create new tasks</span> by forwarding data to output ports designated for task creation, and these are consumed by the <span style="color:red">task management hardware</span>.
* Tasks may be triggered remotely to perform processing near data.
* Control core
* <span style="color:red">Initial tasks</span> may be created by the <span style="color:red">control core</span>, by explicitly pushing task arguments to the task creation port
* Task management unit
* enables high-throughput priority-ordered task dispatch
* <span style="color:red">coalesces superfluous tasks</span> at high throughput
* Tasks can <span style="color:red">overflow</span> the task queue
* Slice scheduling
* is implemented on core 0’s <span style="color:red">control core</span>
### A. Task Hardware Details
* Task Queue and Priority Scheduling
* A <span style="color:red">task argument buffer</span> maintains the arguments of each task instance before their execution
* The <span style="color:red">task argument pointers</span> to ready tasks are stored in the <span style="color:red">task scheduler</span>
* use the <span style="color:red">priority task scheduler</span> only for graph access tasks and <span style="color:red">FIFO scheduling</span> for others (e.g. vertex update)
* Overflow and Reserved Entries
* If the task queue is full, new tasks will <span style="color:red">overflow</span> into a buffer in <span style="color:red">main memory</span>
* <span style="color:red">Re-calculation</span> is required as the priority might have been updated due to <span style="color:red">coalescing</span>
* Task Coalescing
* To reduce the number of active tasks, we allow <span style="color:red">coalescing</span> of tasks with the <span style="color:red">same ID</span> (sketch below)
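Coalescing can be modeled as a keyed insert that merges duplicates; the class below is a software sketch (names and the min-merge are assumptions matching the SSSP example above, not the hardware design):

```python
class CoalescingTaskQueue:
    def __init__(self):
        self.best = {}                # task ID (e.g. vertex id) -> priority

    def insert(self, task_id, priority):
        # Coalesce: a duplicate ID merges into the existing entry,
        # keeping the better (smaller) priority, instead of occupying
        # another task-queue slot.
        old = self.best.get(task_id, float("inf"))
        self.best[task_id] = min(old, priority)

    def pop_best(self):
        # Dispatch the best-priority task (smallest value, e.g. an
        # SSSP distance hint).
        task_id = min(self.best, key=self.best.get)
        return task_id, self.best.pop(task_id)
```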
### B. Memory Architecture
* Shared Memory
* Our on-chip memory is a <span style="color:red">shared address-partitioned cache</span>, with multiple banks per <span style="color:red">PolyGraph core</span>
* Atomic Updates
## 6. SPATIAL PARTITIONING
* While <span style="color:red">offline partitioning</span> is common for creating temporal slices, we find that <span style="color:red">spatial partitioning</span> is also needed to map the graph across the cores of the mesh-based design
* tradeoff between <span style="color:red">locality</span> and <span style="color:red">load balance</span>
* Multi-level Scheme (sketch at the end of this section)
* the graph is split into many small clusters of <span style="color:red">fixed size</span> to preserve <span style="color:red">locality</span>, then these clusters are distributed equally among cores for <span style="color:red">balanced load</span>
* high diameter graphs
* load-balanced because the number of active vertices is usually low across iterations
* low diameter graphs
* larger clusters are helpful for locality
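A minimal sketch of the multi-level scheme as described: fixed-size clusters of consecutive vertex ids for locality, dealt round-robin across cores for balance (the cluster size is a knob, not the paper's value):

```python
def multilevel_partition(num_vertices, num_cores, cluster_size=1024):
    # Level 1: fixed-size clusters of consecutive vertex ids (locality).
    # Level 2: deal clusters round-robin across cores (load balance).
    core_of = [0] * num_vertices
    for cluster, start in enumerate(range(0, num_vertices, cluster_size)):
        core = cluster % num_cores
        for v in range(start, min(start + cluster_size, num_vertices)):
            core_of[v] = core
    return core_of
```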
## 7. METHODOLOGY
* PolyGraph Power/Area
* extending DSAGEN
* task scheduling hardware
* stream-dataflow ISA
* synthesized PolyGraph cores and NoC at 1GHz, with a 28nm UMC library.
* used Cacti 7.0 for modeling eDRAM
* For performance modeling across variants, we developed a custom <span style="color:red">cycle-level modular simulator</span>
* Main memory is modeled using <span style="color:red">DRAMSim2</span>
* We assume <span style="color:red">preprocessing</span> is done <span style="color:red">offline</span> and reused across queries
## 8. EVALUATION
### A. Algorithm Variants Performance Comparison
### B. Comparison to Prior Accelerators
### C. Algorithm Sensitivity