# PolyGraph: Exposing the Value of Flexibility for Graph Processing Accelerators
###### tags: `Accelerators`
#### [paper](http://web.cs.ucla.edu/~tjn/papers/isca2021-polygraph.pdf)
#### [slide](https://drive.google.com/file/d/1P6spVs5Uszh1zwtIGwR2Wt5YNLXJNh3z/view)
#### ISCA, 2021

## Abstract
* Motivation
    * importance of graph workloads
    * limitations of CPUs/GPUs
* Prior accelerators
    * target a single graph algorithm variant
* This work
    * identifies a taxonomy of key algorithm variants
    * develops a template architecture (PolyGraph) that is flexible across these variants while modularly integrating specialization features for each
    * finds that flexibility in graph acceleration is critical

## 1. INTRODUCTION
* Motivation
    * graph workloads are challenging for CPUs/GPUs due to data-dependent memory access, reuse, and parallelism
    * opportunities:
        * commutative updates
        * repetitive structure in memory access and computation
* Prior work's assumptions
    * input graph type (e.g. high vs. low diameter)
    * workload property (e.g. order resilience, frontier density)
* Graph algorithm variants
    * Update Visibility (the granularity at which graph updates become visible)
    * Vertex Scheduling (the fine-grain scheduling policy for vertex updates)
    * Slice Scheduling (whether and how the graph working set is controlled)
    * Update Direction (pull/push)
    * ![](https://i.imgur.com/YGGaeJH.png)
* Performance
    * ![](https://i.imgur.com/y1Sgtpz.png)
* Limitation of prior accelerators
    * lack of support for fine-grain data-dependent parallelism
* Insight
    * having the <span style="color:red">flexibility</span> to use the right algorithm variant for the right graph and workload
* Challenge
    * design an architecture with sufficient algorithm/architecture flexibility and little performance, area, and power overhead
        * task granularity (synchronous vs. asynchronous updates)
        * fine-grain task scheduling
        * flexibly controlling the working set
        * flexibility for different data structures
* Approach
    * efficient decoupled spatial accelerators
    * support general data structures
    * suit both memory-intensive and compute-intensive workloads
* Evaluation and Results
    * ![](https://i.imgur.com/rk26ij2.png)
    * 16.79× (up to 275× for high-diameter graphs) faster than a Titan V GPU
    * statically choosing the best algorithm variant gains 2.71× speedup; dynamic flexibility provides a further 1.09× speedup

## 2. GRAPH ACCELERATION BACKGROUND
### A. Vertex-centric, Sliced Graph Execution Model
* Vertex-centric graph execution (a minimal sketch follows at the end of this section)
    * a user-defined function is executed over vertices
    * this function accesses properties from adjacent vertices and/or edges
    * execution continues until these properties have converged
* Preprocessing the graph
    * better spatial and/or temporal locality
    * temporal slices
        * fit into on-chip memory
    * spatial slices
        * divide the graph among cores for load balance or locality
    * ![](https://i.imgur.com/9XrwyAz.png)
* Graph data structures
    * ![](https://i.imgur.com/hZxUHo2.png)

### B. Key Workload/Graph Properties
* Graph properties:
    * Diameter
        * largest distance between two vertices
    * uniform-degree graphs
        * high diameter
        * similar/low number of edges per vertex
    * power-law graphs
        * low diameter
        * some vertices are highly connected
* Workload properties:
    * Order Sensitivity
        * sensitive: SSSP
        * less sensitive: BFS
        * insensitive: GCN
    * Frontier Density
        * sparse frontier: SSSP, BFS
        * dense frontier: PR, CF
        * sparse-frontier workloads require fewer passes through the graph until convergence
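
To make the vertex-centric execution model and the CSR layout concrete, below is a minimal sketch of push-style, vertex-centric SSSP over a CSR graph. The function and variable names (`sssp_push`, `offsets`, `neighbors`, `weights`) are illustrative assumptions for this note, not PolyGraph's programming interface.

```python
# Minimal sketch of push-style, vertex-centric execution over a CSR graph.
# SSSP-style relaxation stands in for the user-defined update function; the
# names here are illustrative, not PolyGraph's API.
import math

def sssp_push(offsets, neighbors, weights, num_vertices, source):
    dist = [math.inf] * num_vertices
    dist[source] = 0.0
    frontier = {source}                      # active vertices this pass
    while frontier:                          # run until properties converge
        next_frontier = set()
        for u in frontier:
            # CSR row u: outgoing edges live in [offsets[u], offsets[u+1])
            for e in range(offsets[u], offsets[u + 1]):
                v, w = neighbors[e], weights[e]
                if dist[u] + w < dist[v]:    # "push" an update to the neighbor
                    dist[v] = dist[u] + w
                    next_frontier.add(v)     # v becomes active next pass
        frontier = next_frontier
    return dist

# Tiny example graph: 0 -> 1 (w=2), 0 -> 2 (w=5), 1 -> 2 (w=1)
offsets   = [0, 2, 3, 3]
neighbors = [1, 2, 2]
weights   = [2.0, 5.0, 1.0]
print(sssp_push(offsets, neighbors, weights, 3, 0))   # [0.0, 2.0, 3.0]
```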
## 3. GRAPH ALGORITHM TAXONOMY
* Update Visibility
    * ![](https://i.imgur.com/7RcfVg0.png)
* Vertex Scheduling
    * for asynchronous variants
    * ![](https://i.imgur.com/n2UdZOG.png)
* Temporal Slicing
    * slices are determined during offline partitioning and are generally sized to fit in on-chip memory
    * updates to data outside the current slice are deferred
    * an explicit phase is required to switch slices
    * ![](https://i.imgur.com/IwGbREb.png)
* Update Direction
    * pull: a vertex updates its own property (remote read)
    * push: a vertex updates its neighbors' properties (remote atomic update)
* Notation
    * push is the default
    * ![](https://i.imgur.com/8S0QoF4.png)
* Summary
    * ![](https://i.imgur.com/QO9UqWK.png)

## 4. UNIFIED GRAPH PROCESSING REPRESENTATION
* data plane (pipelined task execution)
* control plane (slice scheduling)

### A. Data Plane Representation: Taskflow
* Major requirements
    * need for <span style="color:red">fully pipelined</span> execution of per-vertex computation
    * need to support <span style="color:red">data-dependent creation of new tasks</span>, including programmatically specifying and updating the <span style="color:red">priority ordering</span>
    * need for streaming/memory reuse
* A task is invoked by `<t, args>`: <span style="color:red">type t</span> and <span style="color:red">input arguments</span>
* Each <span style="color:red">task type</span> is defined by a graph of nodes
    * Compute nodes
        * are passive, and may maintain a <span style="color:red">single state item</span>
    * Memory nodes
        * represent <span style="color:red">decoupled patterns of memory access</span>, called <span style="color:red">streams</span>
        * ![](https://i.imgur.com/Mh8N7t1.png)
        * Atomics
            * correct handling of <span style="color:red">memory conflicts</span> on vertex updates
            * ![](https://i.imgur.com/GDFr2I3.png)
    * Task nodes
        * <span style="color:red">represent arguments</span>, and are the ingress and egress points of the graph
* Priority Scheduling and Coalescing
    * task arguments include
        * the task's priority
        * an ID
            * unique for all active tasks (e.g. the vertex id)
            * used for task coalescing and sliced execution
* Taskflow Examples
    * ![](https://i.imgur.com/mH7H4zt.png)
    * ![](https://i.imgur.com/tfY4sDC.png)
* Taskflow Flexibility Summary
    * synchronous variants use <span style="color:red">coarse-grain tasks</span> that pass through the (per-graph/per-slice) active list
    * asynchrony is supported with <span style="color:red">explicit fine-grain tasks</span>, optionally with a priority-hint argument

### B. Slice Scheduling Interface and Operation
* Slice scheduler
    * configures on-chip memory, decides which slice to execute next, and manages data/task orchestration
    * runs on a simple <span style="color:red">control core</span> with limited extensions for <span style="color:red">data pinning operations</span>
    * also responsible for <span style="color:red">creating initial tasks</span>
* Data Pinning
    * gives the slice scheduler an interface to <span style="color:red">pin</span> a range of data to the on-chip memory at a particular offset, essentially reserving a portion of the cache
    * non-pinned data is treated like a normal cache access
* Slice Switching for Asynchronous Variants
    * tradeoff between <span style="color:red">work-efficiency</span> (switch sooner) and <span style="color:red">reuse</span> (switch later); see the sketch at the end of this subsection
* Slice Preprocessing
    * slices are preprocessed to keep all edges (and hence updates) within each slice
* Slice Transition
    * ![](https://i.imgur.com/YWn0wou.png)
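
Building on the `sssp_push` sketch above, the following illustrates temporal slicing and slice switching: updates that target a vertex outside the current slice are deferred, and applied only during an explicit slice-transition phase. The vertex-to-slice mapping and the "most pending work" switch policy are illustrative assumptions, not the paper's exact mechanism.

```python
# Minimal sketch of temporally sliced, push-style execution (illustrative only).
# Vertices are partitioned into slices sized to fit on-chip; out-of-slice
# updates are deferred and applied when their slice becomes current.

def sliced_push(offsets, neighbors, weights, dist, slices, active):
    """slices: list of vertex sets; active: initially active vertices."""
    slice_of = {v: s for s, vs in enumerate(slices) for v in vs}
    deferred = [dict() for _ in slices]   # slice -> {vertex: best pending distance}
    for v in active:
        deferred[slice_of[v]][v] = dist[v]

    while any(deferred):
        # pick the slice with the most pending updates (one possible policy)
        s = max(range(len(slices)), key=lambda i: len(deferred[i]))
        # slice-transition phase: apply deferred updates, forming the local frontier
        frontier = set()
        for v, d in deferred[s].items():
            if d <= dist[v]:
                dist[v] = d
                frontier.add(v)
        deferred[s] = {}
        while frontier:                   # drain work that stays inside slice s
            next_frontier = set()
            for u in frontier:
                for e in range(offsets[u], offsets[u + 1]):
                    v, w = neighbors[e], weights[e]
                    nd = dist[u] + w
                    t = slice_of[v]
                    if t == s:
                        if nd < dist[v]:          # in-slice: apply immediately
                            dist[v] = nd
                            next_frontier.add(v)
                    elif nd < dist[v] and nd < deferred[t].get(v, float("inf")):
                        deferred[t][v] = nd       # out-of-slice: defer the update
            frontier = next_frontier
        # switching sooner favors work-efficiency; draining fully favors reuse
    return dist

# Usage (same tiny graph as before, two slices {0,1} and {2}):
offsets, neighbors, weights = [0, 2, 3, 3], [1, 2, 2], [2.0, 5.0, 1.0]
dist = [0.0, float("inf"), float("inf")]
print(sliced_push(offsets, neighbors, weights, dist, [{0, 1}, {2}], {0}))
# [0.0, 2.0, 3.0]
```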
### C. Scheduling of Algorithm Variants
* Quantitative motivation
    * ![](https://i.imgur.com/HUnTjlc.png)
    * notice that the highest-performance variant changes during the execution
* Heuristics for Algorithm Variant Scheduling
    * ![](https://i.imgur.com/4fzYget.png)
* Variant Transition
    * given the algorithm variant, the <span style="color:red">control core</span> will
        * initialize data structures and configure the taskflow graph
        * perform pinning operations
    * if a <span style="color:red">dynamic switch</span> is invoked, on-chip memories are flushed, and the taskflow may require <span style="color:red">reconfiguration</span>

## 5. POLYGRAPH HARDWARE IMPLEMENTATION
* A [Softbrain](https://hackmd.io/EJXvQdj2RgKrQ6P7Ny9UTQ)-like CGRA executes compute nodes in pipelined fashion
* Multicore decoupled-spatial accelerator connected by 2D triple-mesh networks
    * ![](https://i.imgur.com/xWLNrWv.png)
* The <span style="color:red">data plane</span> comprises all modules besides the <span style="color:red">control core</span>, and is responsible for executing <span style="color:red">taskflow graphs</span>
    * <span style="color:red">Memory nodes</span> are maintained on <span style="color:red">stream address generators</span>, and accesses are decoupled to hide memory latency
    * <span style="color:red">Compute nodes</span> execute in pipelined fashion
    * between the <span style="color:red">stream controller</span> and the <span style="color:red">CGRA</span> are several "ports" (FIFOs), providing latency-insensitive communication
* Task management
    * a <span style="color:red">priority-ordered task queue</span> holds <span style="color:red">waiting tasks</span>
    * <span style="color:red">task nodes</span> define how incoming task arguments from the queue are consumed by the <span style="color:red">stream controller</span> to perform memory requests for new tasks
* Stream controller
    * if the stream controller can accept a new task, the task queue issues the highest-priority task
    * issues memory requests from <span style="color:red">memory nodes</span> of any active task
* CGRA
    * pipelines the computation of any <span style="color:red">compute nodes</span>
    * <span style="color:red">creates new tasks</span> by forwarding data to output ports designated for task creation; these are consumed by the <span style="color:red">task management hardware</span>
    * tasks may be triggered remotely to perform processing near data
* Control core
    * <span style="color:red">initial tasks</span> may be created by the <span style="color:red">control core</span>, by explicitly pushing task arguments to the task creation port
* Task management unit
    * enables high-throughput, priority-ordered task dispatch
    * <span style="color:red">coalesces superfluous tasks</span> at high throughput (see the sketch after this list)
    * tasks can <span style="color:red">overflow</span> the task queue
* Slice scheduling
    * is implemented on core 0's <span style="color:red">control core</span>
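
A minimal software sketch of the priority-ordered, coalescing task queue described above, assuming tasks are keyed by a unique ID (e.g. the vertex id) and that coalescing keeps the more urgent (lower) priority. This illustrates the idea only; the task management unit implements it in hardware.

```python
import heapq

# Sketch of a priority-ordered task queue with coalescing, keyed by task ID.
# A new task coalesces with an already-queued task of the same ID, keeping the
# lower (more urgent) priority. Illustration only, not the hardware design.
class CoalescingTaskQueue:
    def __init__(self):
        self.heap = []          # (priority, task_id)
        self.best = {}          # task_id -> best priority currently queued

    def push(self, task_id, priority):
        if task_id in self.best and priority >= self.best[task_id]:
            return              # coalesced away: an equal-or-better task exists
        self.best[task_id] = priority
        heapq.heappush(self.heap, (priority, task_id))

    def pop(self):
        while self.heap:
            priority, task_id = heapq.heappop(self.heap)
            if self.best.get(task_id) == priority:   # skip stale (coalesced) entries
                del self.best[task_id]
                return task_id, priority
        return None

# Usage: three pushes to vertex 7 coalesce into one dispatch at priority 1.
q = CoalescingTaskQueue()
q.push(7, 5); q.push(7, 1); q.push(7, 3); q.push(2, 4)
print(q.pop())   # (7, 1)
print(q.pop())   # (2, 4)
print(q.pop())   # None
```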
### A. Task Hardware Details
* Task Queue and Priority Scheduling
    * a <span style="color:red">task argument buffer</span> maintains the arguments of each task instance before their execution
    * <span style="color:red">task argument pointers</span> to ready tasks are stored in the <span style="color:red">task scheduler</span>
    * the <span style="color:red">priority task scheduler</span> is used only for graph access tasks, with <span style="color:red">FIFO scheduling</span> for others (e.g. vertex update)
* Overflow and Reserved Entries
    * if the task queue is full, new tasks <span style="color:red">overflow</span> into a buffer in <span style="color:red">main memory</span>
    * <span style="color:red">re-calculation</span> is required, as the priority might have been updated due to <span style="color:red">coalescing</span>
* Task Coalescing
    * to reduce active tasks, tasks with the <span style="color:red">same ID</span> may be <span style="color:red">coalesced</span>

### B. Memory Architecture
* Shared memory
    * the on-chip memory is a <span style="color:red">shared address-partitioned cache</span>, with multiple banks per <span style="color:red">PolyGraph core</span>
* Atomic updates

## 6. SPATIAL PARTITIONING
* While <span style="color:red">offline partitioning</span> is common for creating temporal slices, <span style="color:red">spatial partitioning</span> also matters for the mesh-based multicore design
* Tradeoff between <span style="color:red">locality</span> and <span style="color:red">load balance</span>
* Multi-level scheme (a minimal sketch is given at the end of this note)
    * the graph is split into many small clusters of <span style="color:red">fixed size</span> to preserve <span style="color:red">locality</span>, then these clusters are distributed equally among cores for <span style="color:red">balanced load</span>
    * ![](https://i.imgur.com/628yXjs.png)
    * high-diameter graphs
        * load-balanced because the number of active vertices is usually low across iterations
    * low-diameter graphs
        * larger clusters are helpful for locality

## 7. METHODOLOGY
* PolyGraph power/area
    * extending DSAGEN
        * task scheduling hardware
        * stream-dataflow ISA
    * PolyGraph cores and NoC synthesized at 1 GHz with a 28nm UMC library
    * Cacti 7.0 used for modeling eDRAM
* For performance modeling across variants, a custom <span style="color:red">cycle-level modular simulator</span> was developed
    * main memory is modeled using <span style="color:red">DRAMSim2</span>
    * <span style="color:red">preprocessing</span> is assumed to be done <span style="color:red">offline</span> and reused across queries

## 8. EVALUATION
### A. Algorithm Variants Performance Comparison
* ![](https://i.imgur.com/AJFBCAI.png)
* ![](https://i.imgur.com/ANAKDYr.png)

### B. Comparison to Prior Accelerators
* ![](https://i.imgur.com/eURJhWy.png)
* ![](https://i.imgur.com/2RvDfdd.png)

### C. Algorithm Sensitivity
* ![](https://i.imgur.com/fjTaU1t.png)
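
As referenced in Section 6, a minimal sketch of the multi-level spatial partitioning idea: vertices are grouped into many small fixed-size clusters for locality, and clusters are then dealt out to cores round-robin for load balance. The cluster size and round-robin assignment are illustrative assumptions, not the paper's exact scheme.

```python
# Multi-level spatial partitioning sketch: fixed-size clusters keep nearby
# vertices together (locality); round-robin cluster assignment spreads the
# clusters evenly across cores (load balance). Illustrative only.

def multilevel_partition(num_vertices, num_cores, cluster_size=64):
    clusters = [list(range(start, min(start + cluster_size, num_vertices)))
                for start in range(0, num_vertices, cluster_size)]
    core_of = [None] * num_vertices
    for i, cluster in enumerate(clusters):
        core = i % num_cores                 # distribute clusters evenly
        for v in cluster:
            core_of[v] = core
    return core_of

# Usage: 10 vertices, 2 cores, clusters of 3 vertices each.
print(multilevel_partition(10, 2, cluster_size=3))
# [0, 0, 0, 1, 1, 1, 0, 0, 0, 1]
```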