# PolyGraph: Exposing the Value of Flexibility for Graph Processing Accelerators
###### tags: `Accelerators`
#### [paper](http://web.cs.ucla.edu/~tjn/papers/isca2021-polygraph.pdf)
#### [slide](https://drive.google.com/file/d/1P6spVs5Uszh1zwtIGwR2Wt5YNLXJNh3z/view)
#### ISCA, 2021
## Abstract
* Motivation
* importance of graph workloads
* limitations of CPUs/GPUs
* Prior accelerators
* single graph algorithm variant
* This work
* identify a taxonomy of key algorithm variants
* develop a template architecture (PolyGraph) that is flexible across these variants while being able to modularly integrate specialization features for each
* find that flexibility in graph acceleration is critical
## 1. INTRODUCTION
* Motivation
* Challenging for CPUs/GPUs due to data-dependent memory access, reuse, and parallelism
* Opportunities:
* commutative updates
* repetitive structure in memory access and computation
* Prior work's assumption
* input graph type (e.g. high vs. low diameter)
* workload property (e.g. order resilience, frontier density)
* graph algorithm variants
* Update Visibility (the granularity at which graph updates become visible)
* Vertex Scheduling (the fine-grain scheduling policy for vertex updates)
* Slice Scheduling (whether and how the graph working set is controlled)
* Update Direction (push vs. pull)
* performance
* limitation
* lack of support for fine-grain data-dependent parallelism
* Insight
* having the <span style="color:red">flexibility</span> to use the right algorithm variant for the right graph and workload
* Challenge
* design an architecture with sufficient algorithm/architecture flexibility, and little performance, area, and power overhead
* task granularity (synchronous vs. asynchronous updates)
* fine-grain task scheduling
* flexibly controlling the working set
* having flexibility for different data structures
* Approach
* efficient decoupled spatial accelerators
* support general data-structures
* suit both memory-intensive and compute-intensive workloads
* Evaluation and Results
* 16.79× (up to 275× for high diameter graphs) faster than a Titan V GPU
* By statically choosing the best algorithm variant, we gain 2.71× speedup. Dynamic flexibility provides 1.09× further speedup
## 2. GRAPH ACCELERATION BACKGROUND
### A. Vertex-centric, Sliced Graph Execution Model
* vertex-centric graph execution
* a user-defined function is executed over vertices
* This function accesses properties from adjacent vertices and/or edges
* execution continues until these properties have converged
* Preprocessing the graph
* better spatial and/or temporal locality
* temporal slices
* fit into on-chip memory
* spatial slices
* divide the graph among cores for load-balance or locality
* Graph Data-structures (a minimal CSR sketch below)
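To make the vertex-centric model and a typical graph data structure concrete, here is a minimal CSR (compressed sparse row) sketch; the class, the convergence test, and the toy PageRank-style update are illustrative assumptions, not the paper's code:

```python
# A minimal CSR graph: offsets[v]..offsets[v+1] indexes into neighbors.
class CSRGraph:
    def __init__(self, offsets, neighbors):
        self.offsets = offsets      # len == num_vertices + 1
        self.neighbors = neighbors  # all adjacency lists, flattened

    def out_neighbors(self, v):
        return self.neighbors[self.offsets[v]:self.offsets[v + 1]]

def vertex_centric(g, prop, update_fn, max_iters=100, eps=1e-6):
    # Run the user-defined function over every vertex until the
    # property vector converges (synchronous, dense-frontier style).
    for _ in range(max_iters):
        new_prop = [update_fn(g, prop, v) for v in range(len(prop))]
        if max(abs(a - b) for a, b in zip(prop, new_prop)) < eps:
            return new_prop
        prop = new_prop
    return prop

# Example: toy PageRank-style update over a 3-vertex graph.
g = CSRGraph([0, 2, 3, 4], [1, 2, 0, 0])
ranks = vertex_centric(
    g, [1.0, 1.0, 1.0],
    lambda g, p, v: 0.15 + 0.85 * sum(
        p[u] / max(1, len(g.out_neighbors(u))) for u in g.out_neighbors(v)))
```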
### B. Key Workload/Graph Properties
* Graph Property:
* Diameter
* largest distance between two vertices
* Uniform-degree graphs
* high diameter
* similar (and low) number of edges per vertex
* power-law graphs
* low diameter
* some vertices are highly connected
* Workload Property:
* Order Sensitivity
* sensitive
* SSSP
* less sensitive
* BFS
* insensitive
* GCN
* Frontier Density
* sparse frontier
* SSSP, BFS
* dense frontier
* PR, CF
* sparse frontier workloads require fewer passes through the graph until convergence (a direction heuristic sketch follows this list)
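Frontier density is also what drives the push/pull choice later in the taxonomy; a common rule of thumb (my assumption, not the paper's exact policy) looks like:

```python
def choose_direction(frontier_size, num_vertices, threshold=0.05):
    # Rule of thumb: scatter (push) when few vertices are active,
    # gather (pull) when most are. The 5% threshold is illustrative.
    return "push" if frontier_size / num_vertices < threshold else "pull"
```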
## 3. GRAPH ALGORITHM TAXONOMY
* Update Visibility
* synchronous: updates become visible only at the next iteration
* asynchronous: updates become visible immediately
* Vertex Scheduling
* for asynchronous variants
* e.g. FIFO vs. priority-ordered dispatch of vertex updates
* Temporal Slicing
* Slices are determined during offline partitioning and are generally sized to fit in on-chip memory
* Updates to data outside the current slice are deferred
* an explicit phase is required to switch slices (sketch below)
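A minimal software model of sliced execution under these rules; `process_update`, the slice layout, and the buffering scheme are illustrative placeholders, assuming slices were produced offline:

```python
def run_sliced(slice_ids, initial, process_update):
    # deferred[s] buffers updates whose destination lives in slice s;
    # they are replayed when the scheduler switches to that slice.
    deferred = {s: list(initial.get(s, [])) for s in slice_ids}
    while any(deferred.values()):
        for s in slice_ids:                 # explicit slice-switch phase
            work = deferred[s]              # (re)load slice s on-chip here
            deferred[s] = []
            while work:
                upd = work.pop()
                for new_upd, dest in process_update(s, upd):
                    if dest == s:
                        work.append(new_upd)            # stays in-slice
                    else:
                        deferred[dest].append(new_upd)  # deferred update
```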
* Update Direction
* updates its own property (pull/remote read)
* updates its neighbor’s properties (push/remote atomic update)
* Notation
* push is the default (sketches of both directions below)
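A sketch of one iteration in each direction, reusing the CSRGraph sketch from Section 2; `reduce_fn` and `edge_fn` are hypothetical stand-ins for the user-defined vertex function:

```python
def push_iteration(g, prop, frontier, reduce_fn, edge_fn):
    # Push: active vertices scatter updates into neighbors' properties
    # (in hardware, a remote atomic update).
    for v in frontier:
        for u in g.out_neighbors(v):
            prop[u] = reduce_fn(prop[u], edge_fn(prop[v]))

def pull_iteration(g, prop, reduce_fn, edge_fn):
    # Pull: every vertex gathers from its in-neighbors and writes only
    # its own property (in hardware, remote reads; no atomics needed).
    # Assumes g holds in-edges, or the graph is symmetric.
    new_prop = list(prop)
    for v in range(len(prop)):
        for u in g.out_neighbors(v):
            new_prop[v] = reduce_fn(new_prop[v], edge_fn(prop[u]))
    return new_prop
```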
* Summary
## 4. UNIFIED GRAPH PROCESSING REPRESENTATION
* data plane (pipelined task execution)
* control plane (slice scheduling)
### A. Data Plane Representation: Taskflow
* major requirements
* Need for <span style="color:red">fully pipelined</span> execution of per-vertex computation
* Need to support <span style="color:red">data-dependent creation of new tasks</span>, including programmatically specifying and updating the <span style="color:red">priority ordering</span>
* Need for streaming/memory reuse
* a task is invoked as `<t, args>`: a <span style="color:red">type t</span> and <span style="color:red">input arguments</span>
* Each <span style="color:red">task type</span> is defined by a graph of nodes
* Compute nodes
* are passive, and may maintain a <span style="color:red">single state item</span>.
* Memory nodes
* represent <span style="color:red">decoupled patterns of memory access</span>, called <span style="color:red">streams</span>
* Atomics
* correct handling of <span style="color:red">memory conflicts</span> on vertex updates
* Task nodes
* <span style="color:red">represent arguments</span>, and are ingress and egress points of the graph
* Priority Scheduling and Coalescing
* task arguments include the task's priority and an ID
* the ID is unique across all active tasks (e.g. the vertex id)
* the ID enables task coalescing and sliced execution
* Taskflow Examples (an illustrative SSSP-style sketch below)
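The example figures did not survive in this note; as a stand-in, here is a hedged SSSP-style sketch of fine-grain tasks carrying a priority hint (the tentative distance). The software priority queue stands in for the task management hardware; all names are illustrative:

```python
import heapq

def sssp_taskflow(g, weights, src):
    # weights[i] is the weight of edge g.neighbors[i] (CSR-parallel array).
    n = len(g.offsets) - 1
    dist = [float("inf")] * n
    dist[src] = 0.0
    tasks = [(0.0, src)]                 # <priority hint, vertex id>
    while tasks:
        d, v = heapq.heappop(tasks)      # priority-ordered dispatch
        if d > dist[v]:
            continue                     # stale task; a better one already ran
        lo, hi = g.offsets[v], g.offsets[v + 1]
        for u, w in zip(g.neighbors[lo:hi], weights[lo:hi]):
            if d + w < dist[u]:
                dist[u] = d + w          # hardware: atomic min-update
                heapq.heappush(tasks, (d + w, u))  # data-dependent new task
    return dist
```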
* Taskflow Flexibility Summary
* Synchronous variants use <span style="color:red">coarse grain tasks</span> that pass through the (per graph/per slice) active list
* Asynchrony is supported with <span style="color:red">explicit fine-grain tasks</span>, optionally with a priority-hint argument
### B. Slice Scheduling Interface and Operation
* Slice scheduler
* configures on-chip memory, decides which slice to execute next, and manages data/task orchestration
* runs on a simple <span style="color:red">control core</span> with limited extensions for <span style="color:red">data pinning operations</span>
* also responsible for <span style="color:red">creating initial tasks</span>
* Data Pinning
* provide the slice scheduler an interface to <span style="color:red">pin</span> a range of data to the on-chip memory at a particular offset, essentially reserving a portion of the cache
* Non-pinned data is treated like a normal cache access (interface sketch below)
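A software model of what the pinning interface could look like; the class and method names are assumptions, since the real interface is a hardware extension of the control core:

```python
class PinnableCache:
    """Software model of a cache whose capacity can be partly pinned."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pins = []                   # [(base, size, offset), ...]

    def pin(self, base, size, offset):
        # Reserve on-chip bytes [offset, offset+size) for the address
        # range [base, base+size), e.g. the current slice's properties.
        assert offset + size <= self.capacity, "pin exceeds on-chip memory"
        self.pins.append((base, size, offset))

    def access(self, addr):
        for base, size, offset in self.pins:
            if base <= addr < base + size:
                return ("pinned", offset + addr - base)
        return ("cached", addr)          # non-pinned: normal cache path
```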
* Slice Switching for Asynchronous Variants
* tradeoff between <span style="color:red">work-efficiency</span> (switch sooner) and <span style="color:red">reuse</span> (switch later); a toy policy below
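One plausible shape for the switch policy (an assumption, not the paper's rule):

```python
def should_switch(tasks_in_current_slice, total_active_tasks, frac=0.1):
    # Switch sooner -> better work-efficiency (don't starve other slices);
    # switch later -> better reuse of the pinned slice data.
    # `frac` is an illustrative knob, not a value from the paper.
    return tasks_in_current_slice < frac * total_active_tasks
```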
* Slice Preprocessing
* Slices are preprocessed to keep all edges (and hence updates) within each slice
* Slice Transition
### C. Scheduling of Algorithm Variants
* Quantitative Motivation
* Notice that the highest-performance variant changes during execution
* Heuristics for Algorithm Variant Scheduling (an illustrative sketch after this list)
* Variant Transition
* Given the algorithm variant, the <span style="color:red">control core</span> will
* Initialize data-structures and configure taskflow graph
* Perform pinning operations
* If a <span style="color:red">dynamic switch</span> is invoked, on-chip memories are flushed, and taskflow may require <span style="color:red">reconfiguration</span>.
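For intuition only, a sketch of what a variant-selection heuristic might look like, combining the taxonomy axes from Section 3; the thresholds and the policy itself are assumptions, not the paper's heuristics:

```python
def pick_variant(order_sensitive, frontier_density, fits_on_chip):
    # Map workload/graph properties onto the four taxonomy axes.
    visibility = "async" if order_sensitive else "sync"
    scheduling = ("priority" if visibility == "async"
                  and frontier_density < 0.05 else "fifo")
    slicing = "none" if fits_on_chip else "temporal"
    direction = "push" if frontier_density < 0.05 else "pull"
    return visibility, scheduling, slicing, direction
```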
## 5. POLYGRAPH HARDWARE IMPLEMENTATION
* A [Softbrain](https://hackmd.io/EJXvQdj2RgKrQ6P7Ny9UTQ)-like CGRA executes compute nodes in pipelined fashion
* Multicore decoupled-spatial accelerator connected by 2D triple mesh networks
* The <span style="color:red">data plane</span> is comprised of all modules besides the <span style="color:red">control core</span>, and is responsible for executing <span style="color:red">taskflow graphs</span>
* <span style="color:red">Memory nodes</span> are maintained on <span style="color:red">stream address generators</span>, and accesses are decoupled to hide memory latency
* <span style="color:red">Compute nodes</span> execute in pipelined fashion
* Between the <span style="color:red">stream controller</span> and <span style="color:red">CGRA</span> are several “ports” or FIFOs, providing latency insensitive communication
* Task management
* A <span style="color:red">priority-ordered task queue</span> holds <span style="color:red">waiting tasks</span>
* <span style="color:red">Task nodes</span> define how incoming task arguments from the queue are consumed by the <span style="color:red">stream controller</span> to perform memory requests for new tasks
* Stream controller
* If the stream controller can accept a new task, the task queue will issue the highest priority task.
* issue memory requests from <span style="color:red">memory nodes</span> of any active task
* CGRA
* pipeline the computation of any <span style="color:red">compute nodes</span>
* <span style="color:red">create new tasks</span> by forwarding data to output ports designated for task creation, and these are consumed by the <span style="color:red">task management hardware</span>.
* Tasks may be triggered remotely to perform processing near data.
* Control core
* <span style="color:red">Initial tasks</span> may be created by the <span style="color:red">control core</span>, by explicitly pushing task arguments to the task creation port
* Task management unit
* enables high-throughput priority-ordered task dispatch
* <span style="color:red">coalesces superfluous tasks</span> at high throughput
* Tasks can <span style="color:red">overflow</span> the task queue
* Slice scheduling
* is implemented on core 0’s <span style="color:red">control core</span>
### A. Task Hardware Details
* Task Queue and Priority Scheduling
* A <span style="color:red">task argument buffer</span> maintains the arguments of each task instance before their execution
* The <span style="color:red">task argument pointers</span> to ready tasks are stored in the <span style="color:red">task scheduler</span>
* use the <span style="color:red">priority task scheduler</span> only for graph access tasks and <span style="color:red">FIFO scheduling</span> for others (e.g. vertex update)
* Overflow and Reserved Entries
* If the task queue is full, new tasks will <span style="color:red">overflow</span> into a buffer in <span style="color:red">main memory</span>
* <span style="color:red">Re-calculation</span> is required as the priority might have been updated due to <span style="color:red">coalescing</span>
* Task Coalescing
* To reduce the number of active tasks, we allow <span style="color:red">coalescing</span> of tasks with the <span style="color:red">same ID</span> (sketch below)
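Coalescing can be modeled as a keyed insert that merges duplicates; the class below is a software sketch (names and the min-merge are assumptions matching the SSSP example above, not the hardware design):

```python
class CoalescingTaskQueue:
    def __init__(self):
        self.best = {}                # task ID (e.g. vertex id) -> priority

    def insert(self, task_id, priority):
        # Coalesce: a duplicate ID merges into the existing entry,
        # keeping the better (smaller) priority, instead of occupying
        # another task-queue slot.
        old = self.best.get(task_id, float("inf"))
        self.best[task_id] = min(old, priority)

    def pop_best(self):
        # Dispatch the best-priority task (smallest value, e.g. an
        # SSSP distance hint).
        task_id = min(self.best, key=self.best.get)
        return task_id, self.best.pop(task_id)
```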
### B. Memory Architecture
* Shared Memory
* Our on-chip memory is a <span style="color:red">shared address-partitioned cache</span>, with multiple banks per <span style="color:red">PolyGraph core</span>
* Atomic Updates
## 6. SPATIAL PARTITIONING
* While <span style="color:red">offline partitioning</span> is common for creating temporal slices, we find that <span style="color:red">spatial partitioning</span> is also needed to map the graph across the cores of the mesh-based design
* tradeoff between <span style="color:red">locality</span> and <span style="color:red">load balance</span>
* Multi-level Scheme (sketch at the end of this section)
* the graph is split into many small clusters of <span style="color:red">fixed size</span> to preserve <span style="color:red">locality</span>, then these clusters are distributed equally among cores for <span style="color:red">balanced load</span>
* high diameter graphs
* load-balanced because the number of active vertices is usually low across iterations
* low diameter graphs
* larger clusters are helpful for locality
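A minimal sketch of the multi-level scheme as described: fixed-size clusters of consecutive vertex ids for locality, dealt round-robin across cores for balance (the cluster size is a knob, not the paper's value):

```python
def multilevel_partition(num_vertices, num_cores, cluster_size=1024):
    # Level 1: fixed-size clusters of consecutive vertex ids (locality).
    # Level 2: deal clusters round-robin across cores (load balance).
    core_of = [0] * num_vertices
    for cluster, start in enumerate(range(0, num_vertices, cluster_size)):
        core = cluster % num_cores
        for v in range(start, min(start + cluster_size, num_vertices)):
            core_of[v] = core
    return core_of
```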
## 7. METHODOLOGY
* PolyGraph Power/Area
* extending DSAGEN
* task scheduling hardware
* stream-dataflow ISA
* synthesized PolyGraph cores and NoC at 1GHz, with a 28nm UMC library.
* used Cacti 7.0 for modeling eDRAM
* For performance modeling across variants, we developed a custom <span style="color:red">cycle-level modular simulator</span>
* Main memory is modeled using <span style="color:red">DRAMSim2</span>
* We assume <span style="color:red">preprocessing</span> is done <span style="color:red">offline</span> and reused across queries
## 8. EVALUATION
### A. Algorithm Variants Performance Comparison
### B. Comparison to Prior Accelerators
### C. Algorithm Sensitivity