# SNAFU: An Ultra-Low-Power, Energy-Minimal CGRA-Generation Framework and Architecture
###### tags: `Accelerators`
###### paper origin: ISCA-2021
###### paper: [link](http://www.cs.cmu.edu/~beckmann/publications/papers/2021.isca.snafu.pdf)
## Introduction
### Motivation
* Ultra-low-power (ULP) devices are becoming pervasive, enabling sensing applications.
* Energy-efficiency is paramount in these applications, as efficiency determines device lifetime in battery-powered deployments and performance in energy-harvesting deployments.
### Problem
* Existing programmable ULP devices are too inefficient
* Commercial off-the-shelf (COTS) ULP devices are general-purpose and highly programmable, but they pay a high energy tax for this flexibility.
* ASICs can minimize energy, but they are too inflexible
* ASICs' efficiency comes at a high upfront cost and with severely limited application scope.
### Solution
* Ultra-low-Power CGRAs are the answer
* SNAFU (++S++imple ++N++etwork of ++A++rbitrary ++F++unctional ++U++nits), a framework to generate ULP, energy-minimal coarse-grain reconfigurable arrays (CGRAs)
---
## Background
### CGRA architecture

* What is a CGRA?
* A CGRA comprises a set of processing elements (PEs) connected to each other via an on-chip network.
* These architectures are coarse in that PEs support higher-level operations, like multiplication, on **multi-bit data words**, as opposed to bit-level configurability in FPGAs.
* PEs can often be configured to perform different operations and the NoC can be configured to route values directly between PEs.
* Contrasting SNAFU with prior CGRAs
* SNAFU is a CGRA-generator, so fabric size is parameterizable.
* SNAFU minimizes PE energy by statically assigning operations to specific PEs, and minimizes switching activity by not sharing PEs between operations.
* Dynamic dataflow firing is essential to SNAFU's flexibility and ability to support arbitrary, heterogeneous PEs in a single fabric.
* **SNAFU is consistently biased towards minimizing energy, even at the expense of area and performance, while still achieving extremely low operating power and high energy-efficiency**
## Overview

* SNAFU is a framework for generating energy-minimal, ULP CGRAs and compiling applications to run efficiently on them.
### SNAFU is a flexible ULP CGRA generator
* SNAFU converts a high-level description of a CGRA into valid RTL and, ultimately, ULP hardware.
* SNAFU takes two inputs: a library of PEs and a high-level description of the CGRA topology.
* SNAFU lets designers customize the ULP CGRA via a "*bring your own functional unit (BYOFU)*" approach, defining a generic PE interface that makes it easy to add custom logic to a generated CGRA.
### Example of SNAFU in action

* This kernel multiplies values at address &a by 5 for the elements where the mask m is set, sums the result, and stores it to address &c.
* ①In the first timestep, the two memory PEs are enabled and issue loads. The rest of the fabric is idle because it has no valid input values.
* ②The load for a[0] completes, but the load for m[0] cannot due to a bank conflict. This causes a stall, which is handled transparently by SNAFU's scheduling logic and bufferless NoC. Meanwhile, the load of a[1] begins.
* ③As soon as the load for m[0] completes, the multiply operation can fire because both of its inputs have arrived. But m[0]==0, meaning the multiply is disabled, so a[0] passes through transparently. The load of a[1] completes, and loads for a[2] and m[1] begin.
* ④When the predicated multiply completes, its result is consumed by the fourth PE, which keeps a partial sum of the products. The preceding PEs continue executing in pipelined fashion, multiplying a[1] by 5 and loading a[3] and m[2].
* ⑤Finally, a value arrives at the fifth PE and is stored back to memory at c[0]. Execution continues in this fashion until all elements of a and m have been processed and the final result has been stored back to memory.
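The walkthrough above can be sketched as plain Python reference semantics. This models only the result of the dataflow graph, not the pipelined spatial execution, and it assumes that masked-off elements pass through to the sum, as described in step ③ (the function name `masked_mac` is illustrative):

```python
# Reference semantics of the example kernel: multiply a[i] by 5 where the
# mask m[i] is set, accumulate, and return the sum that would be stored to c[0].
def masked_mac(a, m):
    total = 0
    for ai, mi in zip(a, m):
        # Predicated multiply: when the mask is 0, the multiply is disabled
        # and the value passes through transparently (step 3 above).
        total += ai * 5 if mi else ai
    return total
```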
---
## Designing SNAFU to Maximize FLEXIBILITY

### Bring your own functional unit(BYOFU) interface
* If a custom FU implements SNAFU's interface, then SNAFU generates hardware to automatically handle configuring the FU, tracking FU and overall CGRA progress, and moderating its communication with other PEs.
* The μcore handles progress tracking, predicated execution, and communication. The standard FU interface connects the μcore to the custom FU logic.
* The μcfg handles configuration of both the μcore and FUs.
#### Communication
* The μcore handles communication between the PE and the NoC, decoupling the NoC from the FU.
* The input router handles incoming connections, notifying the internal μcore logic of the availability of valid data and predicates. The intermediate buffers hold output data produced by the FU.
* The NoC, which forwards data to dependent PEs, is entirely bufferless.
#### The FU interface
* The interface has four control signals (*op, ready, valid, done*) and several data signals.
* *op*: tells the FU that input operands are ready to be consumed.
* *ready*: indicates that the FU can consume new operands.
* *valid*: says that the FU has data ready to send over the network.
* *done*: says the FU has completed execution.
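A minimal sketch of this four-signal handshake, assuming a simple FU model with one result in flight (the class `FUModel` and its methods are illustrative names, not from the paper):

```python
# Toy model of the FU interface: op / ready / valid / done.
class FUModel:
    def __init__(self, fn):
        self.fn = fn           # the FU's operation (e.g., an ALU function)
        self.result = None

    @property
    def ready(self):
        # FU can consume new operands only when its output slot is free.
        return self.result is None

    @property
    def valid(self):
        # FU has data ready to send over the network.
        return self.result is not None

    def op(self, *operands):
        # The micro-core asserts `op` when input operands are ready.
        assert self.ready
        self.result = self.fn(*operands)
        return True            # `done`: the FU has completed this element

    def pop(self):
        # The output is drained toward dependent PEs, freeing the FU.
        r, self.result = self.result, None
        return r
```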
#### Progress tracking and fabric control
* The fabric has a top-level controller that interfaces with each μcore via three 1-bit signals.
* 1. enables the μcore to begin execution
* 2. resets the μcore
* 3. tells the controller when the PE has finished processing all input.
* The μcore keeps track of the FU's progress by monitoring the *done* signal, counting how many elements the FU has processed. When the number of completed elements matches the length of the computation, the μcore signals the controller that it is done.
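The counting behavior above can be sketched as follows (a minimal model; `ProgressTracker` and `on_done` are illustrative names):

```python
# Sketch of the micro-core's progress tracking: count `done` pulses from
# the FU and signal the top-level controller once the vector length is reached.
class ProgressTracker:
    def __init__(self, vector_length):
        self.remaining = vector_length

    def on_done(self):
        # Called each time the FU's `done` signal is asserted.
        self.remaining -= 1
        # True = tell the top-level controller this PE has finished all input.
        return self.remaining == 0
```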
#### Configuration services
* The μcfg handles PE configuration, setting up a PE's dataflow routes and providing custom FU configuration state.
* The μcfg module contains a configuration cache that can hold up to six different configurations.
* Applications with large DFGs split them into multiple sub-graphs. The CGRA executes the sub-graphs one at a time, switching efficiently between them via the configuration cache.
### SNAFU's PE standard library
* The library includes four types of PEs:
* Arithmetic PEs:
* basic ALU: bitwise operations, comparisons, additions, subtractions, fixed-point clip operations
* multiplier: 32-bit signed multiplication
* Memory PEs: generate addresses and issue loads and stores to global memory
* Scratchpad PEs: hold intermediate values produced by the CGRA.
### Compilation
* The compiler extracts the dataflow graph from vectorized C code. SNAFU asks the system designer to provide a mapping from each RISC-V vector ISA instruction to a PE type, including the mapping of an operation's inputs and outputs onto the FU's inputs and outputs.
* The compiler uses an integer linear program (ILP) formulation to schedule operations onto the PEs of a CGRA.
---
## Designing SNAFU to Minimize ENERGY
### Spatial vector-dataflow execution
* SNAFU's CGRA amortizes a single fabric configuration across many computations (vector), and routes intermediate values directly between operations (dataflow).
### Asynchronous dataflow firing without tag-token matching
* Statically assigning operations to PEs with a static schedule is the most energy-efficient approach, but it is only feasible when all operation latencies are known; dynamic strategies require expensive tag-matching hardware to associate operands with their operations.
* SNAFU uses a hybrid CGRA design with static PE assignment and dynamic scheduling.
* Each PE uses local, asynchronous dataflow firing to tolerate variable latency. SNAFU avoids tag-token matching by enforcing that values arrive in order.
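Local dataflow firing without tags can be sketched as below: a PE fires as soon as every input channel holds a value, and because values arrive in order on each channel, operands pair up by position alone (the class `DataflowPE` is an illustrative model, not the paper's RTL):

```python
from collections import deque

# Sketch of asynchronous dataflow firing without tag-token matching.
class DataflowPE:
    def __init__(self, fn, n_inputs):
        self.fn = fn
        # One in-order channel per input; no tags needed to match operands.
        self.inputs = [deque() for _ in range(n_inputs)]
        self.outputs = deque()

    def receive(self, port, value):
        self.inputs[port].append(value)
        self.try_fire()

    def try_fire(self):
        # Fire only when every input channel has a pending value.
        if all(q for q in self.inputs):
            operands = [q.popleft() for q in self.inputs]
            self.outputs.append(self.fn(*operands))
```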
### Statically routed, bufferless on-chip network
* SNAFU includes a statically-configured, bufferless, multi-hop on-chip network.
* Static circuit-switching eliminates expensive lookup tables and flow-control mechanisms.
* The network is bufferless, eliminating the NoC's primary energy sink.
### Minimizing buffers in the fabric
* SNAFU includes minimal in-fabric buffering at the producer PE, with none in the NoC.
* Buffering at the producer PE means each value is buffered exactly once and overwritten only when all dependent PEs have finished using it.
* SNAFU minimizes the number of buffers at each PE, using just four buffers per PE by default.
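Producer-side buffering with buffer reuse can be sketched as follows. This is a simplified model, assuming a fixed, statically configured consumer count per value; `OutputBuffer` and its methods are illustrative names:

```python
# Sketch of producer-side buffering: each value lives in exactly one slot
# at the producer and is freed only once all dependent PEs have read it.
class OutputBuffer:
    def __init__(self, n_slots=4, n_consumers=1):
        self.slots = [None] * n_slots
        self.pending = [0] * n_slots    # consumers yet to read each slot
        self.n_consumers = n_consumers  # set at configuration time

    def push(self, value):
        for i, p in enumerate(self.pending):
            if p == 0:                  # slot free: no readers outstanding
                self.slots[i] = value
                self.pending[i] = self.n_consumers
                return i
        return None                     # all slots busy -> producer stalls

    def consume(self, slot):
        value = self.slots[slot]
        self.pending[slot] -= 1         # slot is reusable once this hits 0
        return value
```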
---
## SNAFU-ARCH: A COMPLETE ULP SYSTEM w/ CGRA

* The RISC-V scalar core implements the E, M, I, and C extensions and issues control signals to the SNAFU fabric.
### Example of SNAFU-ARCH in action

* SNAFU-ARCH adds three instructions to the scalar core to interface with the CGRA fabric.
* The SNAFU fabric operates in three states: **idle**, **configuration**, and **execution**.
* During the **idle** phase the scalar core is running and the fabric is not.
* When the scalar core reaches a *vcfg* instruction, the fabric transitions to the **configuration** state.
* The configurator stalls until the scalar core either reaches a *vtfr* instruction or a *vfence* instruction.
* *vtfr* lets the scalar core pass a register value to the fabric configurator, which then passes that value to a specific PE.
* *vfence* indicates that configuration is done, so the scalar core stalls and the fabric transitions to execution. Execution proceeds until all PEs signal that they have completed their work.
* Finally, the scalar core resumes execution from the *vfence*, and the fabric transitions back into the **idle** state.
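The three-state control flow driven by *vcfg*, *vtfr*, and *vfence* can be sketched as a small state machine (the class `FabricFSM` and its method names are illustrative; the state names come from the description above):

```python
# Sketch of the fabric's idle -> configuration -> execution -> idle cycle.
class FabricFSM:
    def __init__(self):
        self.state = "idle"
        self.config_values = []

    def vcfg(self, cfg_id):
        # Scalar core selects a configuration; fabric begins configuring.
        self.cfg_id = cfg_id
        self.state = "configuration"

    def vtfr(self, value, pe):
        # Pass a scalar register value, via the configurator, to a specific PE.
        assert self.state == "configuration"
        self.config_values.append((pe, value))

    def vfence(self):
        # Configuration done: scalar core stalls, fabric executes.
        self.state = "execution"

    def all_pes_done(self):
        # Every PE signaled completion: fabric returns to idle and the
        # scalar core resumes after the vfence.
        self.state = "idle"
```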
---
## Evaluation
1. SNAFU-ARCH achieves high performance by exploiting instruction-level parallelism in each kernel, which SNAFU's asynchronous dataflow firing at each PE achieves naturally.

2. The improvement at larger input sizes comes from SNAFU-ARCH more effectively amortizing the overhead of (re)configuration.

3. These results make it clear that SNAFU-ARCH can effectively exploit instruction-level parallelism, and that there is an opportunity for the compiler to further improve efficiency.

4. Without scratchpad units, intermediate values are communicated through memory, which is quite expensive.
