# Scalable Interconnects for Reconfigurable Spatial Architectures
###### tags: `Accelerators`
###### paper origin: ISCA 2019
###### papers: [link](https://ieeexplore.ieee.org/abstract/document/8980344)
# 1. INTRODUCTION
* FPGAs achieve energy efficiency by providing **statically reconfigurable compute elements** and **on-chip memories** connected by a bit-level programmable interconnect
* Long compile times and relatively low compute density have hampered their widespread adoption
* CGRAs use increasingly coarse-grained building blocks, such as **ALUs, register files, and memory controllers**, distributed in a **programmable, word-level** static interconnect
* On CGRAs, applications are distributed by **parallelizing** and **pipelining**; pipelining introduces **frequent** and **throughput-sensitive** communication
* Interconnects can be classified into two broad categories: **static and dynamic**
* Static interconnects use switches programmed at compile time to reserve high-bandwidth links between communicating units for the lifetime of the application
* Static networks are fast, but they require **over-provisioned bandwidth** and the reserved links can be **underutilized**
* Dynamic interconnects, or NoCs, contain routers that allow links to be shared between more than one pair of communicating units
* The key contributions of this paper are:
* An analysis of key communication patterns exhibited by spatial architectures.
* A network-aware compiler flow that efficiently targets static, dynamic, and hybrid networks with varying granularities.
* A quantitative analysis of the performance, area, and energy trade-offs involved in choosing a CGRA network, using benchmarks drawn from various application domains.
# 2. BACKGROUND
* The CGRA contains **Physical Blocks (PBs)** corresponding to distributed hardware resources, including compute units, scratchpads, and DRAM controllers
* In this study, the authors focus on two categories of CGRA architectures

* The first architecture, exemplified by Plasticine, uses **pipelining within compute PBs**, as shown in Figure 1
* To provide high throughput, each stage of the pipeline exploits **SIMD parallelism**
* The second architecture uses **time-scheduled execution**, where each PB executes a small loop of instructions (e.g., 6) repeatedly
* The scheduling window is small enough that instructions are stored as part of the configuration fabric, without dynamic instruction fetch overhead
* Many proposed CGRAs and domain-specific architectures use this time-scheduled form of computation
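A minimal Python sketch (names hypothetical, not the paper's implementation) of this time-scheduled execution model: each PB cycles through a small instruction window loaded at configuration time, so no dynamic instruction fetch is needed.

```python
# Hypothetical sketch of a time-scheduled PB: a fixed window of instructions,
# loaded as part of the static configuration, executes cyclically.
class TimeScheduledPB:
    def __init__(self, instructions, max_window=6):
        assert len(instructions) <= max_window, "schedule must fit the configuration window"
        self.instructions = instructions  # part of the static configuration
        self.pc = 0

    def step(self, operand):
        """Execute one instruction per cycle, wrapping around the window."""
        op = self.instructions[self.pc]
        self.pc = (self.pc + 1) % len(self.instructions)
        return op(operand)

pb = TimeScheduledPB([lambda x: x + 1, lambda x: x * 2])
print(pb.step(3), pb.step(3))  # 4 6
```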
## 2.1 Application Characteristics
### 2.1.1 Vectorized communication
* Coarser-grained computation typically increases the granularity of communication because of **vectorized compute units**, but loops with carried dependencies still contribute **scalar communication**
* Therefore, the authors provide separate networks for the different communication granularities
### 2.1.2 Broadcast and incast communication
* When a consumer stage is parallelized, the producer sends a **one-to-many broadcast** to all of its consumers
* When a producer stage is parallelized, all partial results are sent to the consumer, forming a **many-to-one incast** link
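As a small illustration (hypothetical names, not the authors' code), parallelizing either side of a producer-consumer pair expands one logical edge into a broadcast or an incast:

```python
# Hypothetical sketch: parallelizing a stage of the dataflow graph expands one
# logical edge into one-to-many (broadcast) or many-to-one (incast) links.
def expand_edges(producer_copies, consumer_copies):
    """Return the logical links after parallelizing either side."""
    return [(p, c) for p in producer_copies for c in consumer_copies]

# One producer, consumer parallelized by 4 -> a 1-to-4 broadcast.
print(expand_edges(["prod"], [f"cons_{i}" for i in range(4)]))
# Producer parallelized by 4, one consumer -> a 4-to-1 incast of partial results.
print(expand_edges([f"prod_{i}" for i in range(4)], ["cons"]))
```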
### 2.1.3 Compute to memory communication
* Unlike the NoCs used in multi-processors, where each core has a **local cache to buffer intermediate results**, spatial accelerators stream data directly between PBs
* As a result, compute performance is limited by network throughput, and latency is comparatively less important
### 2.1.4 Communication-aware compilation
* Communication on spatial architectures is created statically by compiling and mapping the compute graph onto the distributed PB resources
* As the compiler performs optimization passes, such as unrolling and banking, it can determine which network flows correspond to throughput-critical inner-loop traffic and which correspond to low-bandwidth outer-loop traffic
## 2.2 Design Space for Network Architectures
### 2.2.1 Static networks
* Flow control: the authors compare two schemes for static switches
* In **credit-based flow control**, the source and destination PBs coordinate to ensure that the destination buffer does not overflow
* **Skid-buffered queues** with two entries at each switch; two entries enable per-hop backpressure while accounting for the one-cycle delay in stalling the upstream switch (see the sketch after this list)
* Bandwidth
* The authors vary the number of connections between switches in each direction, which trades off area and energy for bandwidth
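A behavioral Python sketch of the skid-buffered scheme (deliberately simplified; not the paper's RTL): the stall signal reaches the upstream switch one cycle late, and the second buffer entry absorbs the word that is already in flight.

```python
# Simplified model of a two-entry skid buffer at a static switch port.
from collections import deque

class SkidBufferedSwitchPort:
    DEPTH = 2

    def __init__(self):
        self.buf = deque()

    def cycle(self, incoming_word, downstream_ready):
        """One clock cycle; returns the stall signal seen upstream next cycle."""
        # Forward one buffered word if the downstream switch accepts it.
        if downstream_ready and self.buf:
            self.buf.popleft()
        # Accept the word sent this cycle: the upstream switch acted on last
        # cycle's stall signal, so at most one extra word can still arrive.
        if incoming_word is not None:
            assert len(self.buf) < self.DEPTH, "second entry covers the 1-cycle stall delay"
            self.buf.append(incoming_word)
        # Raise the stall once an entry is occupied, keeping the other entry
        # free for the in-flight word.
        return len(self.buf) >= 1

port = SkidBufferedSwitchPort()
print(port.cycle("w0", downstream_ready=False))  # True: backpressure raised
print(port.cycle("w1", downstream_ready=False))  # the in-flight word still fits
```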
### 2.2.2 Dynamic networks
* The design is a dynamic NoC using per-hop virtual channel flow control
* The compiler performs static routing and VC allocation, and the results are loaded as part of each router's configuration
* VC allocation places all logical paths that traverse the same physical link into separate buffers, guaranteeing that no two logical paths conflict
### 2.2.3 Hybrid networks
* The highest-bandwidth logical links from the program graph are mapped onto the static network; once the static network is full, further links are mapped to the dynamic network
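A hedged sketch of that assignment heuristic; the capacity check is simplified to a link count (`static_capacity` is an illustrative stand-in), whereas in the real flow "full" is determined during place-and-route.

```python
# Greedy link assignment: heaviest logical links go to the static network
# until its capacity is exhausted; the rest spill onto the dynamic NoC.
def assign_links(links, static_capacity):
    """links: iterable of (name, bandwidth); returns (static, dynamic) name lists."""
    static, dynamic = [], []
    for name, bw in sorted(links, key=lambda l: l[1], reverse=True):
        (static if len(static) < static_capacity else dynamic).append(name)
    return static, dynamic

# The two heaviest flows stay static; the low-bandwidth outer-loop flow escapes to the NoC.
print(assign_links([("inner_vec", 16.0), ("partial_sum", 8.0), ("outer_ctrl", 0.1)], 2))
```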
## 2.3 High-Level Abstraction

* Spatial is an open-source, domain-specific language for reconfigurable accelerators; the authors use it as the high-level abstraction for targeting spatial architectures
* For spatial architectures, Design Space Exploration (DSE) of parameters (e.g., op1, op2, ip, tsA, tsB) is critical to achieve good resource utilization and performance
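An illustrative sketch of such a DSE sweep over the parameters named above (op1, op2, ip, tsA, tsB); the cost model below is a toy stand-in, not the authors' estimator.

```python
import itertools

# Exhaustive sweep over parallelization factors and tile sizes, keeping the
# best-performing point that fits the (normalized) resource budget.
def dse(estimate, op1s=(1, 2, 4), op2s=(1, 2, 4), ips=(8, 16), tsAs=(32, 64), tsBs=(32, 64)):
    best = None
    for op1, op2, ip, tsA, tsB in itertools.product(op1s, op2s, ips, tsAs, tsBs):
        perf, resources = estimate(op1, op2, ip, tsA, tsB)
        if resources <= 1.0 and (best is None or perf > best[0]):
            best = (perf, dict(op1=op1, op2=op2, ip=ip, tsA=tsA, tsB=tsB))
    return best

# Toy cost model: performance scales with parallelism, resources with everything.
toy = lambda op1, op2, ip, tsA, tsB: (op1 * op2 * ip, op1 * op2 * ip * tsA * tsB / 65536)
print(dse(toy))
```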
# 3. METHODOLOGY
## 3.1 Compilation

* The compiler flow starts by taking the Intermediate Representation (IR) output by the Spatial compiler and lowering it into a target-specific IR: a distributed dataflow graph consisting of Virtual Blocks (VBs)
* The VB graph is mapped onto a PB graph and the mapped VB graph is placed and routed onto the physical array and associated network
### 3.1.1 Virtual block allocation

* Figure 3(a) shows the input to the compiler: the hierarchical controller-based representation of the Spatial program shown in Figure 2
* Figure 3(b) shows the way to convert from the hierarchical model to a streaming dataflow graph
* The VB dataflow graph is shown in Figure 3(c), where edges indicate single-bit, scalar, or vector data dependencies between VBs
### 3.1.2 Resource pruning and VB partitioning
* Software can contain arbitrarily large basic blocks that do not fit in fixed-size hardware
* The compiler may partition a VB whenever its compute resources, on-chip memory capacity, or network bandwidth exceed hardware constraints
* When a VB in a pipelined CGRA consumes too many inputs or produces too many distinct outputs, the compiler partitions it to meet the input/output bandwidth constraints of a purely static network
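A hedged sketch of that input/output-driven partitioning rule (the in-order split is a simplification; the real partitioner also balances compute and on-chip memory constraints, and all names are illustrative):

```python
# Split a virtual block's operations into chunks whose distinct inputs and
# outputs fit the PB's static-network port budget.
def partition_vb(ops, inputs_of, outputs_of, max_in, max_out):
    parts, current, cur_in, cur_out = [], [], set(), set()
    for op in ops:
        new_in = cur_in | set(inputs_of[op])
        new_out = cur_out | set(outputs_of[op])
        if current and (len(new_in) > max_in or len(new_out) > max_out):
            parts.append(current)  # close the current partition and start a new one
            current, new_in, new_out = [], set(inputs_of[op]), set(outputs_of[op])
        current.append(op)
        cur_in, cur_out = new_in, new_out
    if current:
        parts.append(current)
    return parts

ops = ["a", "b", "c"]
print(partition_vb(ops, {"a": [1], "b": [2, 3], "c": [4]}, {o: [o] for o in ops}, max_in=2, max_out=2))
```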
### 3.1.3 Link analysis and heuristic generation
* For loop controllers, the number of iterations per parent iteration is $\lceil (\text{max} - \text{min}) / (\text{step} \cdot \text{par}) \rceil$
* Using the computed activation counts, the placer prioritizes highly used links for allocation to the static network and leaves infrequently used links on the dynamic network
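A sketch of that link analysis, assuming (for illustration) that a link's activation count is the product of per-controller iteration counts along its controller ancestry; the loop bounds below are made up.

```python
import math

# Iterations of a loop controller per parent iteration: ceil((max - min) / (step * par)).
def iters_per_parent(lo, hi, step, par):
    return math.ceil((hi - lo) / (step * par))

# Illustrative activation count: product of iteration counts down the controller chain.
def activation_count(controller_chain):
    count = 1
    for lo, hi, step, par in controller_chain:
        count *= iters_per_parent(lo, hi, step, par)
    return count

def prioritize(links):
    """links: {name: controller_chain}; heaviest links get the static network first."""
    return sorted(links, key=lambda n: activation_count(links[n]), reverse=True)

# An inner-loop link fires far more often than an outer-loop link.
links = {"inner": [(0, 64, 1, 1), (0, 1024, 1, 16)], "outer": [(0, 64, 1, 1)]}
print(prioritize(links))  # ['inner', 'outer']
```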
## 3.2 Placement and Routing
### 3.2.1 Congestion-aware routing
* Routing starts with the highest-priority routes and uses **Dijkstra’s algorithm** and a hop weighting function
* A search algorithm based on **Prim’s algorithm** for minimum spanning trees is run to build a tree for the broadcast, starting with only the source being reached
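A hedged sketch of the congestion-aware routing step: Dijkstra over the switch graph with a hop weight that grows with a link's current load, so lower-priority routes placed later avoid congested links. The weight function is a placeholder, not the paper's exact cost.

```python
import heapq

def route(graph, usage, src, dst, capacity=1):
    """graph: {switch: [neighbor, ...]}; usage: {(u, v): routes already using the link}."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v in graph[u]:
            nd = d + 1.0 + usage.get((u, v), 0) / capacity  # hop cost + congestion penalty
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    path.reverse()
    for u, v in zip(path, path[1:]):  # record the new route's load for later routes
        usage[(u, v)] = usage.get((u, v), 0) + 1
    return path

grid = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(route(grid, {}, "A", "D"))  # ['A', 'B', 'C', 'D']
```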
### 3.2.2 VC allocation for deadlock avoidance

* In Figure 5, if C fills the buffer shared with B, VB-3 will never receive any packets from B and will not make forward progress
* Deadlock avoidance via cycle-free routing does not work in this scenario because of loop-carried dependencies
* Allocating VCs to prevent multiple logical links from sharing the same physical buffer is consequently the most practical option for deadlock avoidance
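A sketch of that VC allocation as greedy graph coloring (the routes and VC budget below are illustrative): two logical links conflict if they traverse the same physical channel, and conflicting links must receive different VCs.

```python
from collections import defaultdict

def allocate_vcs(routes, num_vcs):
    """routes: {logical_link: [physical_channel, ...]}; returns {logical_link: vc}."""
    conflicts = defaultdict(set)
    for a, chans_a in routes.items():
        for b, chans_b in routes.items():
            if a != b and set(chans_a) & set(chans_b):
                conflicts[a].add(b)
    vc_of = {}
    # Color the most-constrained links first.
    for link in sorted(routes, key=lambda l: len(conflicts[l]), reverse=True):
        used = {vc_of[n] for n in conflicts[link] if n in vc_of}
        free = [vc for vc in range(num_vcs) if vc not in used]
        if not free:
            raise ValueError(f"{link} needs more than {num_vcs} VCs; reroute or add VCs")
        vc_of[link] = free[0]
    return vc_of

# Two logical links sharing a physical channel land in separate buffers.
print(allocate_vcs({"B->VB3": ["sw1->sw2"], "C->VB3": ["sw1->sw2"]}, num_vcs=2))
```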
## 3.3 Simulation
* Use a **cycle-accurate simulator** to model the pipeline and scheduling delay for the two types of architectures, integrated with DRAMSim to model DRAM access latency
* To efficiently evaluate large networks, the authors start by characterizing the **area and power consumption** of individual routers and switches used in various network configurations
* Power consumption can be broken into two types
* **Inactive power** consumed when switches and routers are at zero-load
* The **active power** is proportional to the amount of data transmitted
* The marginal energy to transmit a single flit of data **(flit energy, E~flit~)**
* $E_{flit} = \dfrac{(P - P_{inactive}) \times T_{testbench}}{\#flit}$
* The total network energy
* $E_{net} = \displaystyle\sum_{\text{allocated}}\left(P_{inactive} \cdot T_{sim} + E_{flit} \cdot \#flit\right)$, summed over allocated switches and routers
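The two formulas above, transcribed directly into code; the characterization numbers in the example are placeholders, not the paper's data.

```python
# Flit energy from the power characterization of a single switch/router.
def flit_energy(p_total, p_inactive, t_testbench, n_flits):
    return (p_total - p_inactive) * t_testbench / n_flits

# Total network energy: for every allocated switch/router, inactive power over
# the simulated time plus flit energy times the flits it transmitted.
def network_energy(units, t_sim):
    """units: list of (p_inactive, e_flit, flits_transmitted) per allocated unit."""
    return sum(p_inactive * t_sim + e_flit * flits for p_inactive, e_flit, flits in units)

e_flit = flit_energy(p_total=5e-3, p_inactive=2e-3, t_testbench=1e-6, n_flits=1000)
print(network_energy([(2e-3, e_flit, 500), (2e-3, e_flit, 0)], t_sim=1e-3))
```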
# 4. EVALUATION

* Table 2 summarizes the notation used to describe network configurations in this section
## 4.1 Application Characterization

* Table 1 lists the applications and their data size

* Figure 6 shows, for each design, which resource limits performance: **compute, on-chip memory, or DRAM bandwidth**

* Compute-bound and on-chip memory-bound applications show a significant amount of high-bandwidth communication

* The output degree does not change with partitioning because most outputs with a large degree are from broadcast links
## 4.2 Area and Energy Characterization

* Figure 9 shows that switch and router power scale linearly with the rate of data transmission but that there is non-zero power at zero-load

* A router consumes more energy than a switch to transmit a single bit of data, even though the router as a whole consumes less power
## 4.3 Network Architecture Exploration
### 4.3.1 Bandwidth scaling with network size

* The performance of compute-bound applications (GEMM and SGD) improves with increased resources, but plateaus at a level that is determined by on-chip network bandwidth.
### 4.3.2 Bandwidth and flow control in switches

* The left side of Figure 14 shows that increased static bandwidth results in a linear performance increase and a superlinear increase in area and power
* The right side of Figure 14 shows that, although credit-based flow control reduces the amount of buffering in switches and decreases network area and energy, application performance is significantly impacted
### 4.3.3 VC count and reduced flit width in routers

* Figure 13 shows that the best hybrid network configurations with 2x and 3x static bandwidth require at most 2 VCs, whereas the pure dynamic network requires 4 VCs to map all applications

* The left side of Figure 15 shows minimal performance improvement from using more VCs
* The right side of Figure 15 shows that, for a hybrid network, reducing flit width improves area efficiency with minimal performance loss
### 4.3.4 Static vs. hybrid vs. dynamic networks

* For DRAM-bound applications, the performance variation between different networks is negligible because only a small fraction of the network is being used

* The highest bandwidth static network uses the most PBs, as shown in Figures 18(b,e), because it permits more parallelization
* It also has more data movement, as shown in (c,f), because PBs can be distributed farther apart
* With the same static bandwidth, most hybrid networks have better energy efficiency than the corresponding pure static networks.

* Figure 16 summarizes the best perf/watt and perf/area for pipelined and scheduled CGRA architectures
* Pure dynamic networks are not shown because they perform poorly due to insufficient bandwidth
* Hybrid networks deliver better energy efficiency and shorter routing distances by providing an escape path on the dynamic network for low-bandwidth traffic
# 5. CONCLUSION
* In this work, the authors describe the mapping process from a high-level program to a distributed spatial architecture
* **Hybrid networks** tend to provide the best energy efficiency by reducing data movement using static place and route, with a 2.3x improvement over the worst configuration
* Both **hybrid networks and static networks** have a performance per area improvement of around 7x for pipelined CGRAs and 2x for time-scheduled CGRAs
* **Pure dynamic networks** are unsuitable for CGRA architectures due to insufficient bandwidth