# Scalable Interconnects for Reconfigurable Spatial Architectures

###### tags: `Accelerators`
###### paper origin: ISCA 2019
###### papers: [link](https://ieeexplore.ieee.org/abstract/document/8980344)

# 1. INTRODUCTION

* FPGAs achieve energy efficiency by providing **statically reconfigurable compute elements** and **on-chip memories** in a bit-level programmable interconnect
* Long compile times and relatively low compute density have hampered the widespread adoption of FPGAs for several years
* CGRAs use coarser-grained building blocks, such as **ALUs, register files, and memory controllers**, distributed in a **programmable, word-level** static interconnect
* On CGRAs, applications are distributed by **parallelizing** and **pipelining**; pipelining introduces **frequent** and **throughput-sensitive** communication
* Interconnects can be classified into two broad categories: **static and dynamic**
    * Static interconnects use switches programmed at compile time to reserve high-bandwidth links between communicating units for the lifetime of the application
        * Static networks are fast, but they require **over-provisioned bandwidth** and can be **underutilized**
    * Dynamic interconnects, or NoCs, contain routers that allow links to be shared between more than one pair of communicating units
* The key contributions of this paper are:
    * An analysis of key communication patterns exhibited by spatial architectures
    * A network-aware compiler flow that efficiently targets static, dynamic, and hybrid networks with varying granularities
    * A quantitative analysis of the performance, area, and energy trade-offs involved in choosing a CGRA network, using benchmarks drawn from various application domains

# 2. BACKGROUND

* The CGRA contains **Physical Blocks (PBs)** corresponding to distributed hardware resources, including compute units, scratchpads, and DRAM controllers
* In this study, the authors focus on two categories of CGRA architectures

![](https://i.imgur.com/fAjh3RT.jpg)

* The first architecture type uses **pipelining in compute PBs** (e.g., Plasticine), as shown in Figure 1
    * To provide high throughput, each stage of the pipeline exploits **SIMD parallelism**
* The second architecture type uses **time-scheduled execution**, where each PB executes a small loop of instructions (e.g., 6) repeatedly
    * The scheduling window is small enough that instructions are stored as part of the configuration fabric, without dynamic instruction fetch overhead
    * Many proposed CGRAs and domain-specific architectures use this time-scheduled form of computation

## 2.1 Application Characteristics

### 2.1.1 Vectorized communication

* Coarser-grained computation typically increases the size of communication because of **vectorized compute units**, but loops with carried dependencies still contribute **scalar communication**
* Therefore, the authors use separate networks for different communication granularities

### 2.1.2 Broadcast and incast communication

* When a consumer stage is parallelized, the producer sends a **one-to-many broadcast** to all of its consumers
* When a producer stage is parallelized, all partial results are sent to the consumer, forming a **many-to-one incast** link (see the sketch below, after §2.1.3)

### 2.1.3 Compute to memory communication

* NoCs in multiprocessors serve cores that each have a **local cache to buffer intermediate results**, so they are primarily tuned for latency
* For spatial accelerators, compute performance is limited by network throughput, and latency is comparatively less important
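As a toy illustration of the broadcast and incast patterns in §2.1.2 (the names and the `expand_edge` helper are hypothetical, not part of the paper's compiler), the sketch below shows how parallelizing one side of a logical producer-consumer edge multiplies it into one-to-many or many-to-one physical links:

```python
# Minimal sketch: how parallelizing a stage turns one logical
# producer->consumer edge into broadcast or incast links.
from itertools import product

def expand_edge(producer: str, consumer: str, par_producer: int, par_consumer: int):
    """Return the physical links implied by one logical edge after parallelization."""
    producers = [f"{producer}_{i}" for i in range(par_producer)]
    consumers = [f"{consumer}_{j}" for j in range(par_consumer)]
    # Every producer copy must reach every consumer copy:
    #   par_producer == 1, par_consumer > 1  -> one-to-many broadcast
    #   par_producer > 1,  par_consumer == 1 -> many-to-one incast
    return list(product(producers, consumers))

print(expand_edge("load", "mult", 1, 4))    # broadcast: 1 source, 4 destinations
print(expand_edge("mult", "reduce", 4, 1))  # incast: 4 sources, 1 destination
```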
### 2.1.4 Communication-aware compilation

* Communication on spatial architectures is created statically by compiling and mapping the compute graph onto the distributed PB resources
* As the compiler performs optimization passes, such as unrolling and banking, it can determine which network flows correspond to throughput-critical inner-loop traffic and which correspond to low-bandwidth outer-loop traffic

## 2.2 Design Space for Network Architectures

### 2.2.1 Static networks

* Flow control: the authors study the impact of flow-control schemes in static switches
    * With **credit-based flow control**, the source and destination PBs coordinate to ensure that the destination buffer does not overflow
    * With **skid-buffered flow control**, each switch holds a two-entry skid-buffered queue; two entries enable per-hop backpressure while covering the one-cycle delay in stalling the upstream switch
* Bandwidth
    * The authors vary the number of connections between switches in each direction, which trades off area and energy for bandwidth

### 2.2.2 Dynamic networks

* The design is a dynamic NoC using per-hop virtual channel (VC) flow control
* The compiler performs static routing and VC allocation, and the results are loaded as part of the routers' configurations at runtime
* VC allocation ensures that all logical paths traversing the same physical link are placed into separate buffers, guaranteeing that no two logical paths conflict

### 2.2.3 Hybrid networks

* The highest-bandwidth logical links from the program graph are mapped onto the static network; once the static network is full, the remaining links are mapped to the dynamic network (see the sketch at the end of this section)

## 2.3 High-Level Abstraction

![](https://i.imgur.com/puCaowW.jpg)

* Spatial is an open-source domain-specific language for reconfigurable accelerators, used here to target spatial architectures
* For spatial architectures, Design Space Exploration (DSE) of parameters (e.g., op1, op2, ip, tsA, tsB) is critical to achieving good resource utilization and performance
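A minimal sketch of the hybrid mapping policy in §2.2.3, under a simplified global capacity model (the `LogicalLink` type and `assign_networks` helper are assumptions for illustration; the paper's allocator works on per-channel resources during place-and-route):

```python
# Sketch: keep the hottest logical links on the static network,
# spill the rest to the dynamic NoC once static capacity is exhausted.
from dataclasses import dataclass

@dataclass
class LogicalLink:
    src: str
    dst: str
    activations: int  # expected transfers per execution (from link analysis)

def assign_networks(links, static_capacity: int):
    """Return (static_links, dynamic_links) under a simplified capacity model."""
    ranked = sorted(links, key=lambda l: l.activations, reverse=True)
    static, dynamic = [], []
    for link in ranked:
        if len(static) < static_capacity:
            static.append(link)   # reserved, dedicated bandwidth
        else:
            dynamic.append(link)  # shared, packet-switched escape path
    return static, dynamic

links = [LogicalLink("A", "B", 1024), LogicalLink("B", "C", 1024), LogicalLink("ctrl", "B", 1)]
hot, cold = assign_networks(links, static_capacity=2)
```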
# 3. METHODOLOGY

## 3.1 Compilation

![](https://i.imgur.com/4GZ4t7F.jpg)

* Compilation starts by taking the Intermediate Representation (IR) output by the Spatial compiler and lowering it into a target-specific IR: a distributed dataflow graph consisting of Virtual Blocks (VBs)
* The VB graph is then mapped onto a PB graph, and the mapped VB graph is placed and routed onto the physical array and its network

### 3.1.1 Virtual block allocation

![](https://i.imgur.com/BBGW564.jpg)

* Figure 3(a) shows the compiler's input: the hierarchical controller-based representation of the Spatial program shown in Figure 2
* Figure 3(b) shows how the hierarchical model is converted into a streaming dataflow graph
* The VB dataflow graph is shown in Figure 3(c), where edges indicate single-bit, scalar, or vector data dependencies between VBs

### 3.1.2 Resource pruning and VB partitioning

* Software can contain arbitrarily large basic blocks that do not fit in fixed-size hardware
* The compiler partitions a VB whenever its compute resources, on-chip memory capacity, or network bandwidth exceed hardware constraints
* When a VB in a pipelined CGRA consumes too many inputs or produces too many distinct outputs, it is partitioned to meet the input/output bandwidth constraints of a purely static network

### 3.1.3 Link analysis and heuristic generation

* For loop controllers, the number of iterations per parent iteration is $\lceil (max - min) / (step \cdot par) \rceil$ (a worked example appears at the end of this section)
* Using the computed activation counts, the placer prioritizes highly used links for allocation to the static network and leaves infrequently used links on the dynamic network

## 3.2 Placement and Routing

### 3.2.1 Congestion-aware routing

* Routing starts with the highest-priority routes and uses **Dijkstra's algorithm** with a hop weighting function
* For broadcasts, a search based on **Prim's algorithm** for minimum spanning trees builds a broadcast tree, starting with only the source reached

### 3.2.2 VC allocation for deadlock avoidance

![](https://i.imgur.com/vGOrlf5.jpg)

* In Figure 5, if C fills the buffer shared with B, VB-3 will never receive any packets from B and will not make forward progress
* Deadlock avoidance using cycle-free routing does not work in this scenario because of loop-carried dependencies
* Allocating VCs so that no two logical links share the same physical buffer is consequently the most practical option for deadlock avoidance

## 3.3 Simulation

* The authors use a **cycle-accurate simulator** to model the pipeline and scheduling delay of the two architecture types, integrated with DRAMSim to model DRAM access latency
* To efficiently evaluate large networks, the authors first characterize the **area and power consumption** of the individual routers and switches used in the various network configurations
* Power consumption can be broken into two types
    * **Inactive power** is consumed when switches and routers are at zero load
    * **Active power** is proportional to the amount of data transmitted
* The marginal energy to transmit a single flit of data (**flit energy**, $E_{flit}$) is
    * $E_{flit} = \dfrac{(P - P_{inactive}) \times T_{testbench}}{\#flit}$
* The total network energy is
    * $E_{net} = \displaystyle\sum_{allocated} \left( P_{inactive} \cdot T_{sim} + E_{flit} \cdot \#flit \right)$
* A sketch of this energy model appears at the end of this section
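A worked form of the loop trip-count expression from §3.1.3; the function name and example values are illustrative only:

```python
import math

# iterations per parent iteration = ceil((max - min) / (step * par))
def iters_per_parent(lo: int, hi: int, step: int, par: int) -> int:
    return math.ceil((hi - lo) / (step * par))

# e.g. a loop from 0 to 100 with step 1, unrolled by 4, runs 25 times per parent iteration
assert iters_per_parent(0, 100, 1, 4) == 25

# Activation counts derived this way let the placer rank links so the most
# frequently used ones are allocated to the static network first.
```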
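A minimal sketch of the energy model in §3.3; the data layout and all numeric values below are assumed for illustration:

```python
# Flit energy is derived from a characterization testbench; total network energy
# then combines zero-load (inactive) power with per-flit active energy.

def flit_energy(p_avg: float, p_inactive: float, t_testbench: float, n_flits: int) -> float:
    """E_flit = (P - P_inactive) * T_testbench / #flits, from the testbench run."""
    return (p_avg - p_inactive) * t_testbench / n_flits

def network_energy(units, t_sim: float) -> float:
    """Sum over allocated switches/routers: inactive power * sim time + E_flit * #flits."""
    return sum(u["p_inactive"] * t_sim + u["e_flit"] * u["flits"] for u in units)

units = [
    {"p_inactive": 0.5e-3, "e_flit": 2.0e-12, "flits": 1_000_000},  # a router (illustrative)
    {"p_inactive": 0.2e-3, "e_flit": 0.8e-12, "flits": 5_000_000},  # a switch (illustrative)
]
total_joules = network_energy(units, t_sim=1e-3)
```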
# 4. EVALUATION

![](https://i.imgur.com/8WLjXa3.jpg)

* Table 2 summarizes the notation used to describe network configurations in this section

## 4.1 Application Characterization

![](https://i.imgur.com/T95aR9X.jpg)

* Table 1 lists the applications and their data sizes

![](https://i.imgur.com/kAC8OGK.jpg)

* Figure 6 shows, for each design, which resource limits performance: **compute, on-chip memory, or DRAM bandwidth**

![](https://i.imgur.com/OmeQmWq.jpg)

* Compute-bound and on-chip memory-bound applications show a significant amount of high-bandwidth communication

![](https://i.imgur.com/sHiVEGw.jpg)

* The output degree does not change with partitioning because most outputs with a large degree come from broadcast links

## 4.2 Area and Energy Characterization

![](https://i.imgur.com/belbOtp.jpg)

* Figure 9 shows that switch and router power scale linearly with the rate of data transmission, but that there is non-zero power at zero load

![](https://i.imgur.com/wOSurNU.jpg)

* A router consumes more energy than a switch to transmit a single bit of data, even though the router consumes less total power

## 4.3 Network Architecture Exploration

### 4.3.1 Bandwidth scaling with network size

![](https://i.imgur.com/rBf1oMI.jpg)

* The performance of compute-bound applications (GEMM and SGD) improves with increased resources, but plateaus at a level determined by on-chip network bandwidth

### 4.3.2 Bandwidth and flow control in switches

![](https://i.imgur.com/HtRUuZy.jpg)

* The left side of Figure 14 shows that increased static bandwidth results in a linear performance increase but a superlinear increase in area and power
* The right side of Figure 14 shows that, although credit-based flow control reduces the amount of buffering in switches and decreases network area and energy, application performance is significantly degraded

### 4.3.3 VC count and reduced flit width in routers

![](https://i.imgur.com/l3v56kU.jpg)

* Figure 13 shows that the best hybrid network configurations with 2x and 3x static bandwidth require at most 2 VCs, whereas the pure dynamic network requires 4 VCs to map all applications

![](https://i.imgur.com/LNF3dWG.jpg)

* The left side of Figure 15 shows minimal performance improvement from using more VCs
* The right side of Figure 15 shows that, for a hybrid network, reducing flit width improves area efficiency with minimal performance loss

### 4.3.4 Static vs. hybrid vs. dynamic networks

![](https://i.imgur.com/hFzBJ0r.jpg)

* For DRAM-bound applications, the performance variation between different networks is negligible because only a small fraction of the network is used

![](https://i.imgur.com/s8bu3Rb.jpg)

* The highest-bandwidth static network uses the most PBs, as shown in Figures 18(b,e), because it permits more parallelization
* It also has more data movement, as shown in Figures 18(c,f), because PBs can be distributed farther apart
* With the same static bandwidth, most hybrid networks have better energy efficiency than the corresponding pure static networks

![](https://i.imgur.com/r4OVxdr.jpg)

* Figure 16 summarizes the best perf/watt and perf/area for pipelined and scheduled CGRA architectures
* Pure dynamic networks are not shown because they perform poorly due to insufficient bandwidth
* The hybrid networks deliver better energy efficiency with shorter routing distances by allowing an escape path on the dynamic network
# 5. CONCLUSION

* In this work, the authors describe the mapping process from a high-level program to a distributed spatial architecture
* **Hybrid networks** tend to provide the best energy efficiency by reducing data movement through static place and route, with a 2.3x improvement over the worst configuration
* Both **hybrid and static networks** achieve a performance-per-area improvement of around 7x for pipelined CGRAs and 2x for time-scheduled CGRAs
* **Pure dynamic networks** are unsuitable for CGRA architectures due to insufficient bandwidth