# Ultra-Elastic CGRAs for Irregular Loop Specialization
###### tags: `Accelerators`
###### paper origin: HPCA 2021
###### papers: [link](https://ieeexplore.ieee.org/document/9407079)
# 1. Introduction
* Architects face three challenges in efficiently accelerating irregular loops, which are characterized by:
* irregular memory accesses with variable latencies and non-uniform access patterns
* irregular control flow
* performance bottlenecks due to inter-iteration loop dependencies
* Figure 1(a-d) shows a code example with a multi-cycle inter-iteration loop dependency.
* Figure 1(e) shows the system-level view for the UE-CGRA platform
* All PEs are connected with elastic buffers
* Unutilized PEs are power-gated and remaining PEs are configured for different voltages and frequencies to execute the dataflow graph more efficiently
* The authors make the following contributions:
* The ultra-elastic CGRA compiler, architecture, and VLSI techniques that enable a CGRA to efficiently accelerate irregular loops with inter-iteration loop dependencies
* an analytical model for rapid UE-CGRA design space exploration
* the UE-CGRA compiler with a heuristic three-phase power-mapping pass
* the first CGRA with per-PE fine-grain DVFS
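
The kind of loop the paper targets can be sketched as follows (a hypothetical example in the spirit of Figure 1, not the paper's exact code):

```python
# Hypothetical irregular loop with a multi-cycle inter-iteration dependency:
# each iteration's memory access depends on the value loaded in the previous
# iteration, so iterations serialize on the load latency.
class Node:
    def __init__(self, val, nxt=None):
        self.val, self.nxt = val, nxt

def llist_search(head, key):
    node = head
    while node is not None and node.val != key:
        node = node.nxt   # next address is known only after this load completes
    return node
```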
# 2. UE-CGRA ANALYTICAL MODELING
## Discrete-Event Performance Model
* This is a simple discrete-event simulator that models the performance of a dataflow graph (DFG) executing on both an elastic CGRA and an ultra-elastic CGRA.
* This simulator assumes that all nodes map to a unique PE and any PE can communicate with any other PE
* Figure 2(a) illustrates a toy dataflow graph with six nodes, with three connected in a cycle
* Figure 2(b) shows nodes A1 and A2 resting at a lower voltage, corresponding to one-third the nominal frequency, without impacting the kernel throughput
* Figure 2(c) shows how the power slack can be used to sprint the critical DFG cycle, boosting throughput to one result every two cycles for a 1.5x speedup
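
The recurrence-bound intuition can be sketched as follows (our own illustration, not the authors' simulator; node names and clock periods are assumptions):

```python
# Toy recurrence-bound model (not the paper's simulator). Assume each node
# consumes one local clock tick; a token then needs the sum of the cycle's
# clock periods to traverse a recurrence once, and steady-state throughput
# is bounded by the slowest recurrence cycle.
def cycle_latency(periods_ns):
    return sum(periods_ns)

def throughput(recurrences, period_of):
    # results per ns, limited by the longest (slowest) recurrence cycle
    return 1.0 / max(cycle_latency([period_of[n] for n in cyc])
                     for cyc in recurrences)

NOM, SPRINT = 3.0, 2.0                # hypothetical clock periods in ns
cycles = [["B", "C", "D"]]            # a three-node critical cycle as in Fig. 2
base  = throughput(cycles, {n: NOM for n in "BCD"})
boost = throughput(cycles, {n: SPRINT for n in "BCD"})
print(boost / base)                   # sprinting only the cycle: ~1.5x speedup
```

Resting any node off the critical cycle changes neither term in this bound, which is why A1 and A2 can rest for free.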
## First-Order Energy Model
* Use discrete-event performance simulator to estimate both throughput and latency, and then calculate dynamic energy and static energy
* Figure 3 sweeps different voltage and frequency settings for each node across all nodes in the DFG.
* The circled point combines sprinting and resting for 1.4x speedup and 1.2x energy efficiency
* Sprinting the six-node cycle increases energy, but resting non-critical nodes reduces energy
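
A first-order version of this calculation might look like the following sketch (our own simplification; all coefficients are illustrative placeholders, not the paper's):

```python
# First-order energy sketch: dynamic energy scales with C*V^2 per operation,
# static energy integrates leakage power over the kernel latency. The
# coefficients c_eff and p_leak_per_v are made-up placeholders.
def node_energy(ops, vdd, latency_ns, c_eff=1.0, p_leak_per_v=0.01):
    dynamic = ops * c_eff * vdd ** 2
    static  = p_leak_per_v * vdd * latency_ns
    return dynamic + static

# Resting a non-critical node (lower vdd, same kernel latency) cuts both terms
nominal = node_energy(ops=100, vdd=0.90, latency_ns=300)
rested  = node_energy(ops=100, vdd=0.60, latency_ns=300)
print(rested < nominal)   # True
```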
# 3. UE-CGRA COMPILER

* Figure 4 overviews the compiler toolchain
* Build compiler on LLVM to transform simple C programs into logical dataflow graphs
* Conduct a simple mapping of nodes to physical PEs
* Place and route the design onto the UE-CGRA
* Conduct a heuristic power mapping pass to configure the voltages and frequencies to optimize performance and energy efficiency

* Complexity-Reduction Phase
* Naively, selecting from M possible power modes (e.g., rest, nominal, sprint) for each of N logical DFG nodes has a worst-case time complexity of **O(M^N^)**, which grows quickly for larger DFGs
* The throughput of an entire chain is determined by its slowest PE, so all nodes in a chain can be grouped (i.e., GroupNodes()) to effectively reduce N
* Energy-Delay-Optimization Phase
* The compiler initializes each logical node n to sprint voltage such that power mode M(n) = V(sprint)
* It then rests each group of nodes where doing so improves the energy-delay product
* The algorithm greedily tries rest first (the greatest potential energy-efficiency benefit), then nominal, and rolls back to sprint if neither helps
* Constraint Phase
* Logical nodes in the DFG may fold onto the same physical PE (i.e., since size(N) ≥ size(T)) and therefore be limited to run at the same voltage and frequency
* The ConstrainPEModes() function implements a small energy-delay optimization search across the options, estimating each option internally with the MeasureEnergyDelay() function
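
The second phase might be sketched as follows (a hypothetical reconstruction from the description above, not the paper's implementation; `power_map` and the cost callback are our own names):

```python
# Hypothetical sketch of the greedy energy-delay-optimization phase. Phase 1
# (GroupNodes) is assumed done: chain nodes sharing one throughput bound are
# already merged into `groups`, shrinking N. Every group starts at sprint;
# the pass then tries rest (biggest energy win) before nominal, and rolls
# back to sprint when neither improves the energy-delay product.
def power_map(groups, measure_energy_delay):
    mapping = {g: "sprint" for g in groups}
    best = measure_energy_delay(mapping)
    for g in groups:
        for mode in ("rest", "nominal"):
            trial = {**mapping, g: mode}
            ed = measure_energy_delay(trial)
            if ed < best:
                mapping, best = trial, ed
                break                 # keep the first (greediest) improvement
    return mapping
```

Phase 3 (ConstrainPEModes) would then reconcile modes for logical nodes folded onto the same physical PE, which this sketch omits.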
# 4. UE-CGRA ARCHITECTURE
## Architectural Template

* The complete UE-CGRA is composed of a control unit, a DMA unit with read and write queues, and an array of UE-CGRA PEs interconnected with queues.
* Input Queues and Registers
* The input queues from each cardinal direction are elastic and have two entries to avoid creating unnecessary pipeline bubbles when all PEs are communicating on the same synchronous clock.
* Muxing and Operators
* Four five-input muxes select between the four input queues and the multi-purpose register.
* Bypassing enables any input queue to forward data to any output, allowing messages to route through PEs executing other operations.
* The compute operator supports the operations in a 32-bit datapath
## Discussion

* Impact of inter-PE latency on performance?
* Figure 7(a) quantifies this impact and shows how a two-cycle synchronization latency degrades throughput by a factor of three for DFGs with inter-iteration dependencies
* Performant dataflow architectures must reduce inter-PE synchronization latency as much as possible
* Impact of queue depth on performance?
* For irregular workloads, Figure 7(b) studies queue depth in a UE-CGRA running a synthetic irregular microbenchmark
* Deeper queues do not improve the throughput of irregular kernels with inter-iteration dependencies.
* Two-element queues are reasonable to support both regular and irregular kernels
* Which frequencies should be supported?
* For the sprint level, Figure 7(c) sweeps sprint frequencies for a UE-CGRA running synthetic irregular microbenchmarks
* Speedup is proportional to frequency until performance hits a throughput ceiling
* A reasonable set of voltages in TSMC 28 might therefore be **0.60 V, 0.90 V, 1.30 V**, before further VLSI considerations.
# 5. UE-CGRA VLSI
## Overview

* Ratiochronous crossings have phasic relationships that may require suppression of “unsafe” edges as shown in Figure 8(a).
* Several edges are not aligned and are “unsafe” for transmitting data.
* Figure 8(c) implements a traditional unsafe-edge detector for each crossing between three clock domains.
* Figure 8(d) shows how the empty signal from the input queues is used to implement two edge detectors that allow handshakes on unsafe edges as long as data has been **enqueued for longer than one local clock cycle**, which can prevent unnecessary stalling
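
The alignment intuition can be illustrated with a small sketch (our own, not the paper's detector circuit; it assumes phase-aligned clocks with integer-ratio periods):

```python
from math import lcm

# Conservative ratiochronous rule: a receiver clock edge with no coincident
# sender edge is treated as "unsafe" for a handshake. Periods are integer ns
# and both clocks are assumed phase-aligned at t = 0.
def unsafe_recv_edges(t_send_ns, t_recv_ns):
    hyper = lcm(t_send_ns, t_recv_ns)               # common hyperperiod
    send_edges = {k * t_send_ns for k in range(hyper // t_send_ns)}
    return [k * t_recv_ns for k in range(hyper // t_recv_ns)
            if k * t_recv_ns not in send_edges]

# 1:3 crossing (e.g., 3 ns nominal sender, 1 ns sprint receiver):
print(unsafe_recv_edges(3, 1))   # [1, 2] -- only the aligned edge at 0 is safe
print(unsafe_recv_edges(1, 3))   # []     -- every slow edge aligns with a fast edge
```

The empty-signal optimization of Figure 8(d) then relaxes this conservative rule: a handshake on an unsafe edge is allowed once the data has sat in the queue longer than one local clock cycle.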
## Hierarchical Clock-Network Gating
* Specifically, we cluster PEs (e.g., into 4×4 clusters) and gate each cluster’s portion of the global clock network with a 1-bit configuration register.
* Because the compiler is statically aware of all PE clocks, it can gate with perfect knowledge
* It can gate the entire sprinting clock if no PEs are sprinting
# 6. RESULTS
* In this section, we evaluate UE-CGRAs against inelastic CGRAs (IE-CGRA) and elastic CGRAs (E-CGRA) executing a set of irregular kernels that fit in an 8 × 8 CGRA for detailed simulation
## Benchmarks and Compiler

* Figure 9 shows the inner loop code for the five benchmarks used in our evaluation and their inter-iteration dependencies.
## PE Area and Energy

* In general, the E-CGRA PE and UE-CGRA PE have similar area, with **14% and 17%** overhead, respectively, over the IE-CGRA PE at a 750 MHz target (1.33 ns).

* Figure 11 breaks down area for each PE and also energy for a single E-CGRA PE and UE-CGRA PE across all operations.
* On average, the UE-CGRA PE consumes 21% more energy across all operations compared to an E-CGRA PE.
## CGRA Area, Cycle Time, and Power Breakdown

* Both UE-CGRAs have global clock power (i.e., not including the PE portion) about 4x that of the E-CGRA, which accounts for 19% of the total power of UE-CGRA POpt.
* UE-CGRA POpt has **higher** power than the E-CGRA, but UE-CGRA EOpt has **lower** power, showing how the compiler can target different power budgets.
## CGRA Performance and Energy
* Energy-Optimized Mapping (EOpt)
* The energy-optimized UE-CGRA prioritizes resting PEs while avoiding performance loss.
* Table II shows that an 8 × 8 UE-CGRA can improve energy efficiency by up to 2.32x over an 8 × 8 E-CGRA for our kernels.
* Performance-Optimized Mapping (POpt)
* The performance-optimized UE-CGRA sprints PEs in the recurrence loop while also resting less-critical PEs for energy efficiency.
* Energy Efficiency and Performance
* Statically resting and sprinting the entire E-CGRA can only achieve either high energy efficiency or high performance.
* On the other hand, the UE-CGRA can achieve **both** with carefully chosen power mappings.
* llist (linked-list search) in Figure 14(c) achieves 1.49× performance over the E-CGRA by accelerating critical PEs at the same energy cost.
## First-Order System-Level Results

* Table III projects the performance and energy efficiency of the processor-CGRA system relative to a single core.
* The UE-CGRA results show that true-dependency bottlenecks can be overcome
* The llist kernel improves 1.35x in performance despite its true dependency, and dither improves 1.80x.
* For all other kernels, there is at least one power-mapping approach (either EOpt or POpt) that has good energy efficiency or performs well
* fft exploits both ILP and fine-grain DVFS for 3.38× speedup or 1.53× energy
# 7. DISCUSSION
* Performance compared to an out-of-order core?
* Out-of-order cores and CGRAs (including UE-CGRAs) are each designed to extract ILP for performance.
* The mechanisms in the out-of-order core do not typically help to accelerate irregular loops with true dependencies.
* Area efficiency and unrolling for more parallelism?
* In practice, the CGRA fabric size is typically tuned such that the largest target workload has a high utilization, but this means that smaller kernels will underutilize CGRA resources (e.g., see Figure 14, our average is 65% utilized).
* The utilization challenge can be mitigated by **unrolling the outer loop**.
* Alternatively, multiple instances from different kernels can be launched onto the fabric in parallel