# SNAFU: An Ultra-Low-Power, Energy-Minimal CGRA-Generation Framework and Architecture

###### tags: `Accelerators`
###### paper origin: ISCA-2021
###### paper: [link](http://www.cs.cmu.edu/~beckmann/publications/papers/2021.isca.snafu.pdf)

## Introduction

### Motivation
* Ultra-low-power (ULP) devices are becoming pervasive, enabling sensing applications.
* Energy efficiency is paramount in these applications: efficiency determines device lifetime in battery-powered deployments and performance in energy-harvesting deployments.

### Problem
* Existing programmable ULP devices are too inefficient.
    * Commercial off-the-shelf (COTS) ULP devices are general-purpose and highly programmable, but they pay a high energy tax for this flexibility.
* ASICs can minimize energy, but they are too inflexible.
    * ASICs' efficiency comes at high upfront cost and with severely limited application scope.

### Solution
* Ultra-low-power CGRAs are the answer.
    * SNAFU (++S++imple ++N++etwork of ++A++rbitrary ++F++unctional ++U++nits) is a framework to generate ULP, energy-minimal coarse-grain reconfigurable arrays (CGRAs).

---
## Background

### CGRA architecture
![](https://i.imgur.com/LaA87G6.png)
* What is a CGRA?
    * A CGRA comprises a set of PEs connected to each other via an on-chip network.
    * These architectures are coarse in that PEs support higher-level operations, like multiplication, on **multi-bit data words**, as opposed to the bit-level configurability of FPGAs.
    * PEs can often be configured to perform different operations, and the NoC can be configured to route values directly between PEs.
* Contrasting SNAFU with prior CGRAs
    * ![](https://i.imgur.com/9VCTGEk.png)
    * ![](https://i.imgur.com/mAW4xEY.png)
    * SNAFU is a CGRA generator, so fabric size is parameterizable.
    * SNAFU minimizes PE energy by statically assigning operations to specific PEs and minimizes switching by not sharing PEs between operations.
* Dynamic dataflow firing is essential to SNAFU's flexibility and to its ability to support arbitrary, heterogeneous PEs in a single fabric.
* **SNAFU is consistently biased towards minimizing energy, even at the expense of area and performance, while still achieving extremely low operating power and high energy-efficiency.**

## Overview
![](https://i.imgur.com/2m5ElEa.png)
* SNAFU is a framework for generating energy-minimal, ULP CGRAs and compiling applications to run efficiently on them.

### SNAFU is a flexible ULP CGRA generator
* SNAFU converts a high-level description of a CGRA into valid RTL and ultimately into ULP hardware.
* SNAFU takes two inputs: a library of PEs and a high-level description of the CGRA topology.
* SNAFU lets designers customize the ULP CGRA via a "*bring your own functional unit (BYOFU)*" approach, defining a generic PE interface that makes it easy to add custom logic to a generated CGRA.

### Example of SNAFU in action
![](https://i.imgur.com/6TtCPuH.png)
* This kernel multiplies the values at address &a by 5 for the elements where the mask m is set, sums the result, and stores it to address &c.
* ① In the first timestep, the two memory PEs are enabled and issue loads. The rest of the fabric is idle because it has no valid input values.
* ② The load for a[0] completes, but m[0] cannot due to a bank conflict. This causes a stall, which is handled transparently by SNAFU's scheduling logic and bufferless NoC. Meanwhile, the load of a[1] begins.
* ③ As soon as the load for m[0] completes, the multiply operation can fire because both of its inputs have arrived. But m[0]==0, meaning the multiply is disabled, so a[0] passes through transparently. The load of a[1] completes, and loads for a[2] and m[1] begin.
* ④ When the predicated multiply completes, its result is consumed by the fourth PE, which keeps a partial sum of the products. The preceding PEs continue executing in pipelined fashion, multiplying a[1]×5 and loading a[3] and m[2].
* ⑤ Finally, a value arrives at the fifth PE and is stored back to memory in c[0]. Execution continues in this fashion until all elements of a and m have been processed and a final result has been stored back to memory.

---
## Designing SNAFU to Maximize FLEXIBILITY
![](https://i.imgur.com/PaiFPnz.png)

### Bring your own functional unit (BYOFU) interface
* If a custom FU implements SNAFU's interface, then SNAFU generates hardware to automatically handle configuring the FU, tracking FU and overall CGRA progress, and moderating its communication with other PEs.
* The μcore handles progress tracking, predicated execution, and communication. The standard FU interface connects the μcore to the custom FU logic.
* The μcfg handles configuration of both the μcore and FUs.

#### Communication
* The μcore handles communication between the PE and the NoC, decoupling the NoC from the FU.
* The input router handles incoming connections, notifying the internal μcore logic of the availability of valid data and predicates. The intermediate buffers hold output data produced by the FU.
* The NoC, which forwards data to dependent PEs, is entirely bufferless.

#### The FU interface
* The interface has four control signals (*op*, *ready*, *valid*, *done*) and several data signals.
    * *op*: tells the FU that input operands are ready to be consumed.
    * *ready*: indicates that the FU can consume new operands.
    * *valid*: says that the FU has data ready to send over the network.
    * *done*: says that the FU has completed execution.

#### Progress tracking and fabric control
* The fabric has a top-level controller that interfaces with each μcore via three 1-bit signals:
    1. enables the μcore to begin execution;
    2. resets the μcore;
    3. tells the controller when the PE has finished processing all input.
* The μcore keeps track of the progress of the FU by monitoring the *done* signal, counting how many elements the FU has processed.
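
The *op*/*ready*/*valid*/*done* handshake and the μcore's element counting can be sketched as a toy software model. This is a hypothetical Python sketch of the protocol, not SNAFU's actual RTL; the names `AdderFU` and `MicroCore` are invented for illustration:

```python
class AdderFU:
    """Toy functional unit obeying the op/ready/valid/done handshake."""
    def __init__(self):
        self.result = None

    @property
    def ready(self):
        # The FU can consume new operands once its previous result is drained.
        return self.result is None

    def op(self, a, b):
        # 'op' fires only when operands have arrived and the FU is ready.
        assert self.ready
        self.result = a + b  # single-"cycle" add for simplicity

    @property
    def valid(self):
        # The FU has data ready to send over the network.
        return self.result is not None

    def drain(self):
        out, self.result = self.result, None
        return out


class MicroCore:
    """Toy μcore: feeds operands to the FU and counts completed elements."""
    def __init__(self, fu, length):
        self.fu, self.length, self.completed = fu, length, 0

    def step(self, a, b):
        if self.fu.ready:
            self.fu.op(a, b)
        out = self.fu.drain() if self.fu.valid else None
        if out is not None:
            self.completed += 1  # one 'done' pulse per processed element
        return out

    @property
    def done(self):
        # Reported to the top-level controller once all elements are processed.
        return self.completed == self.length


core = MicroCore(AdderFU(), length=3)
outs = [core.step(a, b) for a, b in [(1, 2), (3, 4), (5, 6)]]
# outs == [3, 7, 11]; core.done is now True
```

The point of the sketch is that the μcore never inspects what the FU computes; it only watches the handshake signals, which is what lets SNAFU wrap arbitrary custom FUs.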
When the number of completed elements matches the length of the computation, the μcore signals the controller that it is done.

#### Configuration services
* The μcfg handles PE configuration, setting up a PE's dataflow routes and providing custom FU configuration state.
* The μcfg module contains a configuration cache that can hold up to six different configurations.
* Applications with large DFGs split the DFG into multiple sub-graphs. The CGRA executes them one at a time, switching efficiently between them via the configuration cache.

### SNAFU's PE standard library
* The library includes four types of PEs:
    * Arithmetic PEs:
        * basic ALU: bitwise operations, comparisons, additions, subtractions, fixed-point clip operations
        * multiplier: 32-bit signed multiplication
    * Memory PEs: generate addresses and issue loads and stores to global memory.
    * Scratchpad PEs: hold intermediate values produced by the CGRA.

### Compilation
* The compiler extracts the dataflow graph from the vectorized C code. SNAFU asks the system designer to provide a mapping from RISC-V vector ISA instructions to PE types, including the mapping of an operation's inputs and outputs onto an FU's inputs and outputs.
* The compiler uses an integer linear program (ILP) formulation to schedule operations onto the PEs of a CGRA.

---
## Designing SNAFU to Minimize ENERGY

### Spatial vector-dataflow execution
* SNAFU's CGRA amortizes a single fabric configuration across many computations (vector) and routes intermediate values directly between operations (dataflow).

### Asynchronous dataflow firing without tag-token matching
* Statically assigning operations to PEs with a static schedule is most energy-efficient, but it is only feasible when all operation latencies are known; fully dynamic strategies require expensive tag-matching hardware to associate operands with their operation.
* SNAFU is a hybrid: a CGRA with static PE assignment and dynamic scheduling.
* Each PE uses local, asynchronous dataflow firing to tolerate variable latency.
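
Local, asynchronous dataflow firing can be illustrated with a small software model: each PE keeps a FIFO per input port and fires as soon as every FIFO has a value at its head. Because values arrive in order, the FIFO heads always belong to the same logical element, so no tags are needed. This is a hypothetical Python sketch, not the actual hardware; `DataflowPE` is an invented name:

```python
from collections import deque

class DataflowPE:
    """Toy PE with local, asynchronous dataflow firing.

    Operands sit in per-input FIFOs; in-order arrival guarantees the
    FIFO heads correspond to the same element, avoiding tag-token matching.
    """
    def __init__(self, fn, num_inputs):
        self.fn = fn
        self.queues = [deque() for _ in range(num_inputs)]

    def receive(self, port, value):
        self.queues[port].append(value)

    def try_fire(self):
        # Fire only when every input port has a value waiting.
        if all(self.queues):
            args = [q.popleft() for q in self.queues]
            return self.fn(*args)
        return None  # stall: some operand has not arrived yet

# Predicated multiply from the overview example: scale by 5 when the mask is set.
mul = DataflowPE(lambda a, m: a * 5 if m else a, num_inputs=2)
mul.receive(0, 7)                # a[0] arrives first...
assert mul.try_fire() is None    # ...but the PE stalls until m[0] arrives
mul.receive(1, 1)                # m[0] == 1: multiply enabled
# mul.try_fire() -> 35
```

The firing rule is purely local: a PE never consults a global schedule, which is what lets it tolerate variable-latency neighbors such as memory PEs hitting bank conflicts.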
* SNAFU avoids tag-token matching by enforcing that values arrive in order.

### Statically routed, bufferless on-chip network
* SNAFU includes a statically configured, bufferless, multi-hop on-chip network.
* Static circuit switching eliminates expensive lookup tables and flow-control mechanisms.
* The network is bufferless, eliminating the NoC's primary energy sink.

### Minimizing buffers in the fabric
* SNAFU includes minimal in-fabric buffering at the producer PE, with none in the NoC.
* Buffering at the producer PE means each value is buffered exactly once and overwritten only when all dependent PEs are finished using it.
* SNAFU minimizes the number of buffers at each PE, using just four buffers per PE by default.

---
## SNAFU-ARCH: A COMPLETE ULP SYSTEM w/ CGRA
![](https://i.imgur.com/H5qMdZU.png)
* The RISC-V scalar core implements the E, M, I, and C extensions and issues control signals to the SNAFU fabric.

### Example of SNAFU-ARCH in action
![](https://i.imgur.com/xZsz1J8.png)
* SNAFU-ARCH adds three instructions to the scalar core to interface with the CGRA fabric.
* The SNAFU fabric operates in three states: **idle**, **configuration**, and **execution**.
    * During the **idle** state, the scalar core is running and the fabric is not.
    * When the scalar core reaches a *vcfg* instruction, the fabric transitions to the **configuration** state. The configurator stalls until the scalar core reaches either a *vtfr* instruction or a *vfence* instruction.
        * *vtfr* lets the scalar core pass a register value to the fabric configurator, which then passes that value to a specific PE.
        * *vfence* indicates that configuration is done, so the scalar core stalls and the fabric transitions to **execution**. Execution proceeds until all PEs signal that they have completed their work.
    * Finally, the scalar core resumes execution from the *vfence*, and the fabric transitions back into the **idle** state.

---
## Evaluation
1. 
SNAFU-ARCH achieves high performance by exploiting the instruction-level parallelism in each kernel, which SNAFU's asynchronous dataflow firing at each PE captures naturally.
![](https://i.imgur.com/UFJWHkO.png)
2. The improvement at large input sizes comes from amortization: with larger inputs, SNAFU-ARCH can more effectively amortize the overhead of (re)configuration.
![](https://i.imgur.com/wXbX8qM.png)
3. These results make it clear that SNAFU-ARCH can effectively exploit instruction-level parallelism, and that there is an opportunity for the compiler to further improve efficiency.
![](https://i.imgur.com/vLg5WI9.png)
4. Without scratchpad units, intermediate values are communicated through memory, which is quite expensive.
![](https://i.imgur.com/r7ndJYA.png)
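
For reference, the masked multiply-accumulate kernel traced in the overview can be written as straight-line code capturing only its semantics (not how the fabric executes it). This is a hypothetical Python sketch; `masked_mac` is an invented name, and the pass-through behavior for unset mask bits follows the reading of the trace in which a[0] flows through the disabled multiplier into the partial sum:

```python
def masked_mac(a, m):
    """Reference semantics of the overview kernel: scale a[i] by 5 where
    the mask m[i] is set; where it is not, the disabled multiply passes
    a[i] through unchanged (per the trace). Sum the results."""
    total = 0
    for ai, mi in zip(a, m):
        total += ai * 5 if mi else ai
    return total

# masked_mac([1, 2, 3], [1, 0, 1]) == 5 + 2 + 15 == 22
```

On SNAFU-ARCH, this whole loop becomes one fabric configuration executed in pipelined dataflow fashion, which is why the amortization and scratchpad effects above dominate the energy results.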