# Ultra-Elastic CGRAs for Irregular Loop Specialization

###### tags: `Accelerators`
###### paper origin: HPCA 2021
###### papers: [link](https://ieeexplore.ieee.org/document/9407079)

# 1. Introduction

* Architects face three challenges for efficient acceleration of irregular loops, which are characterized by:
    * irregular memory accesses with variable latencies and non-uniform access patterns
    * irregular control flow
    * performance bottlenecks due to inter-iteration loop dependencies
* ![](https://i.imgur.com/ou9RSIn.jpg)
    * Figure 1(a-d) shows a code example with a multi-cycle inter-iteration loop dependency.
    * Figure 1(e) shows the system-level view of the UE-CGRA platform.
* All PEs are connected with elastic buffers.
* Unutilized PEs are power-gated, and the remaining PEs are configured with different voltages and frequencies to execute the dataflow graph more efficiently.
* The authors make the following contributions:
    * the ultra-elastic CGRA compiler, architecture, and VLSI techniques that enable a CGRA to efficiently accelerate irregular loops with inter-iteration loop dependencies
    * an analytical model for rapid UE-CGRA design space exploration
    * the UE-CGRA compiler with a heuristic three-phase power-mapping pass
    * the first CGRA with per-PE fine-grain DVFS

# 2. UE-CGRA ANALYTICAL MODELING

## Discrete-Event Performance Model

* A simple discrete-event simulator models the performance of a dataflow graph (DFG) executing on both an elastic CGRA and an ultra-elastic CGRA.
* The simulator assumes that each node maps to a unique PE and that any PE can communicate with any other PE.
* ![](https://i.imgur.com/SZ4ITne.jpg)
    * Figure 2(a) illustrates a toy dataflow graph with six nodes, three of which are connected in a cycle.
    * Figure 2(b) shows nodes A1 and A2 resting at a lower voltage, running at one-third the nominal frequency, without impacting the kernel throughput.
    * Figure 2(c) shows how the resulting power slack can be used to sprint the critical DFG cycle, boosting throughput to one result every two cycles, a 1.5x performance improvement.

## First-Order Energy Model

* The discrete-event performance simulator estimates both throughput and latency, from which dynamic and static energy are calculated.
* ![](https://i.imgur.com/STqGk50.jpg)
    * Figure 3 sweeps voltage and frequency settings for each node across all nodes in the DFG.
    * The circled point combines sprinting and resting for 1.4x speedup and 1.2x higher energy efficiency.
    * Sprinting the nodes in the critical cycle increases energy, but resting the non-critical nodes reduces it.
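The first-order model can be made concrete with a short sketch. The following Python is an illustration of the reasoning above, not the authors' simulator: the rest/nominal/sprint voltage-frequency pairs, the effective capacitance `c_eff`, and the leakage constant `i_leak` are all assumed, illustrative values.

```python
# A minimal first-order sketch of the analytical model (assumed constants,
# not values from the paper): throughput is bounded by the slowest node and
# by the slowest recurrence cycle, and energy follows from per-node V/f.

# Assumed power modes: mode -> (voltage in V, frequency relative to nominal).
MODES = {"rest": (0.60, 1 / 3), "nominal": (0.90, 1.0), "sprint": (1.30, 1.5)}

def cycle_period(cycle, modes):
    """Nominal cycles for one token to circulate a recurrence cycle:
    each node in the cycle needs 1/f_rel nominal cycles per firing."""
    return sum(1.0 / MODES[modes[n]][1] for n in cycle)

def throughput(nodes, cycles, modes):
    """Steady-state results per nominal cycle: every node fires once per
    result (at most once per local cycle), and each recurrence cycle
    carries a single circulating token."""
    node_bound = min(MODES[modes[n]][1] for n in nodes)
    cycle_bound = min(1.0 / cycle_period(c, modes) for c in cycles)
    return min(node_bound, cycle_bound)

def energy_per_result(nodes, cycles, modes, c_eff=1.0, i_leak=0.05):
    """First-order energy: dynamic C*V^2 per firing for every node, plus
    leakage (proportional to voltage) over the time for one result."""
    t_result = 1.0 / throughput(nodes, cycles, modes)
    dynamic = sum(c_eff * MODES[modes[n]][0] ** 2 for n in nodes)
    static = sum(i_leak * MODES[modes[n]][0] for n in nodes) * t_result
    return dynamic + static

# Toy DFG like Figure 2: six nodes, three of them (B, C, D) forming a cycle.
nodes = ["A1", "A2", "B", "C", "D", "E"]
cycles = [["B", "C", "D"]]

base = {n: "nominal" for n in nodes}
rested = dict(base, A1="rest", A2="rest")
sprinted = dict(base, B="sprint", C="sprint", D="sprint")

print(throughput(nodes, cycles, base))      # 1/3: recurrence-limited
print(throughput(nodes, cycles, rested))    # still 1/3, as in Figure 2(b)
print(throughput(nodes, cycles, sprinted))  # 1/2: 1.5x, as in Figure 2(c)
print(energy_per_result(nodes, cycles, rested)
      < energy_per_result(nodes, cycles, base))  # True: resting saves energy
```

Under these assumptions, resting the off-cycle nodes to one-third frequency exactly matches the recurrence-limited throughput, so the energy savings come for free, mirroring Figure 2(b-c).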
# 3. UE-CGRA COMPILER

![](https://i.imgur.com/41NLsgx.jpg)

* Figure 4 overviews the compiler toolchain:
    * build the compiler on LLVM to transform simple C programs into logical dataflow graphs
    * conduct a simple mapping of nodes to physical PEs
    * place and route the design onto the UE-CGRA
    * conduct a heuristic power-mapping pass to configure the voltages and frequencies to optimize performance and energy efficiency

![](https://i.imgur.com/tJVuhD0.jpg)

* Complexity-Reduction Phase
    * Naively, selecting from M possible power modes (e.g., rest, nominal, sprint) for each of N logical DFG nodes has a worst-case time complexity of **O(M^N^)**, which grows quickly for larger DFGs.
    * The throughput of an entire chain is determined by its slowest PE, so all nodes in a chain can be grouped (i.e., GroupNodes()) to effectively reduce N.
* Energy-Delay-Optimization Phase
    * The compiler initializes each logical node n to the sprint voltage, such that power mode M(n) = V(sprint), and then tries to rest each group of nodes for an improved energy-delay product.
    * The algorithm greedily tries to rest first, for the greatest potential energy-efficiency benefit, before trying nominal and then rolling back to sprint (see the sketch after this list).
* Constraint Phase
    * Logical nodes in the DFG may fold onto the same physical PE (i.e., since size(N) ≥ size(T)) and therefore be limited to running at the same voltage and frequency.
    * The ConstrainPEModes() function implements a small energy-delay optimization search across the options, estimating each option internally with the MeasureEnergyDelay() function.
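As a rough illustration of the greedy energy-delay-optimization phase (a paraphrase, not the authors' implementation): the sketch below starts every node at sprint, tries to rest each group, falls back to nominal, and otherwise leaves the group at sprint, keeping any choice that improves the energy-delay product. It assumes the illustrative `nodes`, `cycles`, `throughput`, and `energy_per_result` definitions from the Section 2 sketch are in scope, and `groups` stands in for the output of GroupNodes().

```python
def measure_energy_delay(nodes, cycles, modes):
    """Stand-in for MeasureEnergyDelay(): energy-delay product under the
    first-order model sketched in Section 2 (assumed to be in scope)."""
    delay = 1.0 / throughput(nodes, cycles, modes)
    return energy_per_result(nodes, cycles, modes) * delay

def power_map(nodes, groups, cycles):
    """Greedy power mapping: all nodes start at sprint; each group then
    tries rest first (largest potential saving), then nominal."""
    modes = {n: "sprint" for n in nodes}
    best = measure_energy_delay(nodes, cycles, modes)
    for group in groups:
        for mode in ("rest", "nominal"):
            trial = dict(modes, **{n: mode for n in group})
            edp = measure_energy_delay(nodes, cycles, trial)
            if edp < best:
                best, modes = edp, trial    # keep the improvement
                break
        # if neither rest nor nominal improved EDP, the group rolls
        # back to (i.e., stays at) sprint
    return modes

# Hypothetical grouping for the toy DFG: inputs, the cycle, and the output.
groups = [["A1", "A2"], ["B", "C", "D"], ["E"]]
print(power_map(nodes, groups, cycles))
# -> A1, A2, E settle at nominal; the critical cycle B, C, D stays at sprint
```

With the toy constants, the pass keeps the recurrence cycle sprinting and relaxes everything else, which matches the qualitative behavior the paper describes.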
# 4. UE-CGRA ARCHITECTURE

## Architectural Template

![](https://i.imgur.com/qNd87XS.jpg)

* The complete UE-CGRA is composed of a control unit, a DMA unit with read and write queues, and an array of UE-CGRA PEs interconnected with queues.
* Input Queues and Registers
    * The input queues from each cardinal direction are elastic and have two entries to avoid creating unnecessary pipeline bubbles when all PEs are communicating on the same synchronous clock.
* Muxing and Operators
    * Four five-input muxes select between the four input queues and the multi-purpose register.
    * Bypassing enables any input queue to forward data to any output, allowing messages to route through PEs that are executing other operations.
    * The compute operator supports its operations on a 32-bit datapath.

## Discussion

![](https://i.imgur.com/ZHwYrlC.jpg)

* What is the impact of inter-PE latency on performance?
    * Figure 7(a) quantifies this impact and shows how a two-cycle synchronization latency degrades throughput by a factor of three for DFGs with inter-iteration dependencies.
    * Performant dataflow architectures must reduce inter-PE synchronization latency as much as possible.
* What is the impact of queue depth on performance?
    * For irregular workloads, Figure 7(b) studies queue depth in a UE-CGRA running a synthetic irregular microbenchmark.
    * Deeper queues do not improve the throughput of irregular kernels with inter-iteration dependencies.
    * Two-entry queues are therefore a reasonable choice to support both regular and irregular kernels.
* Which frequencies should be supported?
    * For the sprint level, Figure 7(c) sweeps sprint frequencies for a UE-CGRA running synthetic irregular microbenchmarks.
    * Speedup is proportional to frequency until performance hits a throughput ceiling.
    * A reasonable set of voltages in TSMC 28 nm might therefore be **0.60 V, 0.90 V, 1.30 V**, before further VLSI considerations.

# 5. UE-CGRA VLSI

## Overview

![](https://i.imgur.com/gcEtJhE.jpg)

* Ratiochronous crossings have phasic relationships that may require suppression of "unsafe" edges, as shown in Figure 8(a).
    * Several edges are not aligned and are "unsafe" for transmitting data.
* Figure 8(c) shows a traditional unsafe-edge detector for each crossing between three clock domains.
* Figure 8(d) shows how the empty signal from the input queues is used to implement two edge detectors that allow handshakes on unsafe edges as long as the data has been **enqueued for longer than one local clock cycle**, which prevents unnecessary stalling (a sketch of the idea follows this section).

## Hierarchical Clock-Network Gating

* Specifically, PEs are clustered (e.g., into 4×4 clusters), and each cluster's portion of the global clock network is gated with a 1-bit configuration register.
* Because the compiler is statically aware of all PE clocks, it can gate with perfect knowledge.
* For example, it can gate the entire sprint clock if no PEs are sprinting.
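The unsafe-edge idea can be illustrated with a small model of two ratiochronous clocks derived from a common source (an illustration of the concept, not the paper's circuit; the 3:2 ratio and the settling margin are assumed values). A consumer-clock edge is unsafe when a producer-clock edge launches data within the settling margin before it; per Figure 8(d), a handshake on such an edge can still proceed once the data has sat in the elastic input queue for more than one local cycle.

```python
from fractions import Fraction as F

def rising_edges(period, hyperperiod):
    """Rising-edge times of one clock across the shared hyperperiod."""
    return [period * k for k in range(int(hyperperiod / period))]

def unsafe_edges(t_prod, t_cons, hyperperiod, margin):
    """Consumer edges preceded by a producer launch within the settling
    margin; the pattern repeats, so compare modulo the hyperperiod."""
    return {tc for tc in t_cons for tp in t_prod
            if (tc - tp) % hyperperiod < margin}

# Assumed 3:2 ratiochronous pair: producer period 2, consumer period 3,
# hyperperiod 6 (their least common multiple), settling margin 1/2.
t_prod = rising_edges(F(2), F(6))   # [0, 2, 4]
t_cons = rising_edges(F(3), F(6))   # [0, 3]
print(unsafe_edges(t_prod, t_cons, F(6), F(1, 2)))
# -> {Fraction(0, 1)}: the coincident edge at t=0 is unsafe; t=3 is safe
```

A traditional detector in the style of Figure 8(c) would stall on every such unsafe edge; the queue-aware scheme of Figure 8(d) only stalls when the data was freshly enqueued.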
# 6. RESULTS

* In this section, UE-CGRAs are evaluated against inelastic CGRAs (IE-CGRAs) and elastic CGRAs (E-CGRAs) executing a set of irregular kernels that fit in an 8 × 8 CGRA for detailed simulation.

## Benchmarks and Compiler

![](https://i.imgur.com/ViVmVQq.jpg)

* Figure 9 shows the inner-loop code for the five benchmarks used in the evaluation and their inter-iteration dependencies.

## PE Area and Energy

![](https://i.imgur.com/z0xsAJE.jpg)

* In general, the E-CGRA PE and UE-CGRA PE have similar area, with **14% and 17%** overhead, respectively, over the IE-CGRA PE at a 750 MHz target (1.33 ns).

![](https://i.imgur.com/3fexa3o.jpg)

* Figure 11 breaks down the area of each PE, as well as the energy of a single E-CGRA PE and UE-CGRA PE across all operations.
* On average, the UE-CGRA PE consumes 21% more energy across all operations than an E-CGRA PE.

## CGRA Area, Cycle Time, and Power Breakdown

![](https://i.imgur.com/eGBz1RY.jpg)

* Both UE-CGRAs have global clock power (i.e., not including the PE portion) about 4x that of the E-CGRA, which accounts for 19% of the total power of UE-CGRA POpt.
* UE-CGRA POpt has **higher** power than the E-CGRA, but UE-CGRA EOpt has **lower** power, showing how the compiler can target different power budgets.

## CGRA Performance and Energy

* Energy-Optimized Mapping (EOpt)
    * The energy-optimized UE-CGRA prioritizes resting PEs while avoiding performance loss.
    * ![](https://i.imgur.com/iKQRXTc.jpg)
    * Table II shows that an 8 × 8 UE-CGRA can improve energy efficiency by up to 2.32x over an 8 × 8 E-CGRA for these kernels.
* Performance-Optimized Mapping (POpt)
    * The performance-optimized UE-CGRA sprints PEs in the recurrence loop while also resting less-critical PEs for energy efficiency.
* Energy Efficiency and Performance
    * ![](https://i.imgur.com/l7OBG7r.jpg)
    * Statically resting or sprinting the entire E-CGRA can achieve either high energy efficiency or high performance, but not both.
    * The UE-CGRA, on the other hand, can achieve **both** with carefully chosen power mappings.
    * ![](https://i.imgur.com/BTaUuwR.jpg)
    * llist (linked-list search) in Figure 14(c) achieves 1.49× the performance of the E-CGRA by accelerating critical PEs at the same energy cost.

## First-Order System-Level Results

![](https://i.imgur.com/KAhjqWd.jpg)

* Table III projects the performance and energy efficiency of the processor-CGRA system relative to a single core.
* The UE-CGRA results show that true-dependency bottlenecks can be overcome:
    * the llist kernel improves 1.35x in performance despite its true dependency, and dither improves 1.80x
    * for all other kernels, there is at least one power-mapping approach (either EOpt or POpt) that achieves either good energy efficiency or good performance
    * fft exploits both ILP and fine-grain DVFS for a 3.38× speedup or 1.53× better energy efficiency

# 7. DISCUSSION

* How does performance compare to an out-of-order core?
    * Out-of-order cores and CGRAs (including UE-CGRAs) are each designed to extract ILP for performance.
    * However, the mechanisms in an out-of-order core do not typically help to accelerate irregular loops with true dependencies.
* What about area efficiency and unrolling for more parallelism?
    * In practice, the CGRA fabric size is typically tuned so that the largest target workload has high utilization, but this means that smaller kernels will underutilize CGRA resources (e.g., see Figure 14; the average utilization here is 65%).
    * The utilization challenge can be mitigated by **unrolling the outer loop** (see the sketch after this list).
    * Alternatively, multiple instances from different kernels can be launched onto the fabric in parallel.
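As a toy illustration of the unrolling point (hypothetical code, not from the paper): unrolling the outer loop by two yields two independent inner-loop bodies, i.e., two copies of the inner-loop DFG that the mapper could place on otherwise-idle PEs.

```python
def scale_rows(a, factor):
    """Outer loop unrolled by two (assumes an even number of rows): the two
    statements in the body are independent, so each unrolled instance of
    the inner-loop DFG could occupy its own region of PEs."""
    for i in range(0, len(a) - 1, 2):
        for j in range(len(a[i])):
            a[i][j] *= factor        # inner-loop DFG instance 0
            a[i + 1][j] *= factor    # inner-loop DFG instance 1
    return a

print(scale_rows([[1, 2], [3, 4]], 2))  # [[2, 4], [6, 8]]
```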