# MANIC: A Vector-Dataflow Architecture for Ultra-Low-Power Embedded Systems
###### tags: `Accelerators`
###### paper origin: MICRO 52
###### papers: [link](https://dl.acm.org/doi/pdf/10.1145/3352460.3358277)
###### slides and video: `none`
# 1. INTRODUCTION
## Problem
* How can we perform sophisticated computations on simple, ultra-low-power systems? One design is to offload work by wirelessly transmitting data to a more powerful nearby computer for processing. Unfortunately, transmitting data takes much more energy per byte than sensing, storing, or computing on those data.
* The device must **process data locally at a low operating power** and with extremely **high energy-efficiency**.
* The device must be **programmable and general** to support a wide variety of applications.
* COTS **MCUs pay a high power, energy, and performance cost** for their generality and programmability.

* The authors find that **instruction and data supply consume 54.4%** of the average execution energy in their workloads.

## Solution
* The paper presents MANIC: **an efficient vector-dataflow architecture for ultra-low-power embedded systems.**
* MANIC is closest to the Ideal design, **achieving high energy-efficiency while remaining general-purpose and simple to program**.
* MANIC is simple to program because it exposes a standard vector ISA interface based on the **RISC-V vector extension**.
* MANIC achieves high energy-efficiency by **eliminating the two main costs of programmability through its vector-dataflow design**.
* First, **vector execution amortizes instruction supply energy** over a large number of operations.
* Second, **MANIC addresses the high cost of VRF accesses through its dataflow component by forwarding operands directly between vector operations**.
## Contributions
* They implement MANIC fully in RTL and use industry-grade CAD tools to evaluate its energy efficiency across a collection of programs appropriate to the deeply embedded domain.
* Using post-synthesis energy estimates, they show that MANIC is within 26.4% of the energy of an idealized design while remaining fully general and making few, unobtrusive changes to the ISA and software development stack.
# 2. Implementation
* There are two main goals of vector-dataflow execution:
* The first goal is to **provide general-purpose programmability**.
* The second goal is to do this while operating efficiently by **minimizing instruction and data supply overheads**.
* Vector-dataflow achieves this through three features:
* **vector execution**
* **dataflow instruction fusion**
* **register kill points**
## Vector Execution
* **Save instruction energy. (SIMD)**
## Vector Dataflow
* **Save vector register access energy.**
* Orange arrows represent control flow, blue arrows represent dataflow. MANIC relies on vector-dataflow execution, **avoiding register accesses by forwarding and renaming**.

## Vector Register Kill Points
* **Save data supply energy**.
* **A vector register is dead at a particular instruction if no subsequent instruction uses the value in that register. A dead value need not be written to the vector register file.**
* Histograms of kill distances for three different applications. Distances skew left, suggesting values are consumed for the last time shortly after being produced.
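
The kill-distance idea can be made concrete with a small sketch. This is not the authors' tooling: the `VInst` type, the `kill_distances` helper, and the five-instruction program are all hypothetical, and the sketch assumes that no value is live after the last instruction.

```python
from typing import List, NamedTuple, Optional, Tuple


class VInst(NamedTuple):
    """Hypothetical vector instruction: at most one destination, any sources."""
    op: str
    dst: Optional[str]        # None for stores, which produce no register value
    srcs: Tuple[str, ...]


def kill_distances(program: List[VInst]) -> List[Tuple[str, int]]:
    """Return (register, kill distance) for every value produced in `program`.

    The kill distance is the number of instructions between a value's
    producer and its last reader. A value that is never read gets 0.
    Assumption: nothing is live after the last instruction.
    """
    result = []
    for i, inst in enumerate(program):
        if inst.dst is None:
            continue
        last_use = i
        for j in range(i + 1, len(program)):
            later = program[j]
            if inst.dst in later.srcs:
                last_use = j
            if later.dst == inst.dst:
                break  # the register is redefined, so the old value is gone
        result.append((inst.dst, last_use - i))
    return result


# Illustrative program, not taken from the paper.
program = [
    VInst("vload", "v0", ()),
    VInst("vload", "v1", ()),
    VInst("vmul", "v2", ("v0", "v1")),
    VInst("vadd", "v3", ("v2", "v0")),
    VInst("vstore", None, ("v3",)),
]
distances = kill_distances(program)
total = sum(d for _, d in distances)
```

Here `v0` has a kill distance of 3 and the other three values have a distance of 1, so the sum is 6. Short distances like these are the case where a value can be forwarded and never written back.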

## Vfence
* **A new vfence instruction is added that handles both synchronization and memory consistency.**
* **vfence stalls the scalar core** until the vector unit completes execution with its current window of vector-dataflow operations.
* In practice, this often means inserting a vfence at the end of the kernel. The **programmer** is responsible for using it correctly. (**Fixes data-race and aliasing problems**)
## MANIC: ULTRA-LOW-POWER VECTOR-DATAFLOW PROCESSING
* They emphasize that these compiler-based features do not require programming changes, do not expose microarchitectural details, and are optional to the effective use of MANIC. MANIC implements the RISC-V V vector extension.
* The vector unit has a few simple additions to support vector-dataflow execution:
* **instruction windowing hardware**
* **renaming mechanism**
* They implement **16 vector registers**, which need only four bits to name. This **leaves one bit of the register name field unused, and MANIC uses that extra bit to convey kill annotations**.
* A block diagram of MANIC’s microarchitectural components (non-gray region). Control flow is denoted with orange, while blue denotes dataflow. Stateful components (e.g. register files) have dotted outlines.
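
The kill annotation in the spare register-name bit can be sketched as a simple bit-packing scheme. The notes do not say which bit carries the annotation, so placing it in the top bit of the 5-bit field is an assumption here, and `encode_reg` and `decode_reg` are hypothetical helper names.

```python
from typing import Tuple

NAME_BITS = 4                       # 16 vector registers need four bits
NAME_MASK = (1 << NAME_BITS) - 1    # 0b01111
KILL_BIT = 1 << NAME_BITS           # 0b10000; bit position is an assumption


def encode_reg(index: int, kill: bool = False) -> int:
    """Pack a register index and a kill flag into a 5-bit register field."""
    if not 0 <= index <= NAME_MASK:
        raise ValueError("only 16 vector registers are implemented")
    return index | (KILL_BIT if kill else 0)


def decode_reg(reg_field: int) -> Tuple[int, bool]:
    """Split a 5-bit register field into (register index, kill flag)."""
    if not 0 <= reg_field <= (NAME_MASK | KILL_BIT):
        raise ValueError("register field must fit in five bits")
    return reg_field & NAME_MASK, bool(reg_field & KILL_BIT)
```

For example, `encode_reg(3, kill=True)` gives `0b10011`, and decoding that value returns register 3 with the kill flag set. Because the annotation reuses an otherwise unused bit, code without annotations keeps its ordinary encoding.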

* MANIC adds four components to this base vector core to support vector-dataflow execution:
* **issue logic and a register renaming table**
* Issue logic is primarily responsible for creating a window of instructions to execute according to vector-dataflow.
* Identifying, preparing, and issuing for execution a window of dependent instructions over an entire vector of inputs.
* The issue logic identifies dataflow between instructions by comparing the names of their input and output operands.
* **If two instructions are dependent (the output of one is the input of another), MANIC should forward the output value directly from its producer to the input of the consumer, avoiding the register file.**
* **an instruction window buffer**
* **A key feature of the instruction window's control logic is its ability to select an operand's source or destination.**
* For input operands, the instruction window controls whether to fetch an operand from the vector register file or from MANIC's forwarding buffer.
* For output operands, the instruction window controls whether to write an output operand to the vector register file, to the forwarding buffer, or to both.
* **an xdata buffer**
* **Some instructions like vector loads and stores require extra information available from the scalar register file** when the instruction is decoded.
* Since not all vector instructions require values from the scalar register file, MANIC includes a separate buffer, called the xdata buffer, to hold this extra information.
* **a forwarding buffer**
* The forwarding buffer is a small, directly indexed buffer that **stores intermediate values** as MANIC's execution unit forwards them to dependent instructions in the instruction window.
* **By accessing the forwarding buffer instead of accessing the vector register file**, an instruction with one or more forwarded operands consumes less energy than one that executes without MANIC.
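
The way these four components cooperate can be sketched as a planning pass over one instruction window. This is a simplified model under stated assumptions, not MANIC's RTL: `Inst`, `Plan`, and `plan_window` are hypothetical names, kill annotations are modelled as a set of source registers per instruction, and the xdata buffer and forwarding-slot allocation are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Inst:
    op: str
    dst: Optional[str]                 # None for stores
    srcs: Tuple[str, ...]
    kills: Set[str] = field(default_factory=set)  # sources dead after this inst


@dataclass
class Plan:
    src_from: Dict[str, str]   # per source register: "vrf" or "fwd"
    dst_to: Set[str]           # subset of {"vrf", "fwd"}


def plan_window(window: List[Inst]) -> List[Plan]:
    """Choose operand sources and destinations for one window."""
    plans: List[Plan] = []
    producer: Dict[str, int] = {}   # rename table: register -> in-window producer
    for i, inst in enumerate(window):
        src_from = {}
        for reg in inst.srcs:
            if reg in producer:
                # Dependence found by comparing names: forward the value.
                src_from[reg] = "fwd"
                producer_plan = plans[producer[reg]]
                producer_plan.dst_to.add("fwd")
                if reg in inst.kills:
                    # The value dies inside the window: skip the VRF write.
                    producer_plan.dst_to.discard("vrf")
            else:
                src_from[reg] = "vrf"
        plans.append(Plan(src_from, {"vrf"} if inst.dst else set()))
        if inst.dst is not None:
            producer[inst.dst] = i
    return plans


# Illustrative window; v3 and v4 are produced outside of it.
window = [
    Inst("vload", "v0", ()),
    Inst("vmul", "v1", ("v0", "v3"), kills={"v0"}),
    Inst("vadd", "v2", ("v1", "v4"), kills={"v1"}),
    Inst("vstore", None, ("v2",)),
]
plans = plan_window(window)
vrf_writes = sum("vrf" in p.dst_to for p in plans)
```

In this window only `v2` is still written to the vector register file, because it is read by the store but never marked as killed. `v0` and `v1` live only in the forwarding buffer, which is where the data supply energy saving comes from.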
## Memory System
* **MANIC includes an instruction cache and a data cache**.
* This departs from the designs of many commercial microcontrollers in the ultra-low-power computing domain, which do not have dcaches and have extremely small icaches on the order of 64 bytes.
## Example
* MANIC’s issue logic constructs windows of instructions with dataflow. The rename table keeps track of registers and names, updating the instruction buffer when new opportunities for forwarding are identified.

* MANIC’s microarchitecture components execute a window of instructions using forwarding according to dataflow across an entire vector of input.

## Microarchitecture-Agnostic Dataflow Scheduling
* Code scheduling is microarchitecture-agnostic: minimizing the sum of kill distances is a good proxy for minimizing register writes for a specific window size.

* The two curves generally agree, suggesting that **minimizing sum kill distance eliminates register writes with similar efficacy as when window size is exposed explicitly to the compiler**.
* For the **FFT kernel, the instruction window is broken by stores and permutations, causing additional vector register file writes.** This is a limitation of optimizing only for sum kill distance that the authors plan to address in future work.
* There are three structural hazards that cause MANIC to stop buffering additional instructions, stall the scalar core, and start vector execution:
* The first hazard occurs when the **instruction buffer is full** and another vector instruction is waiting to be buffered.
* The second hazard occurs when **all slots in the forwarding buffer are allocated** and an incoming instruction requires a slot.
* Finally, the third hazard occurs when the **xdata buffer is full** and a decoded vector instruction requires a slot.
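
The three structural hazards above amount to a capacity check made before each vector instruction is buffered. The sketch below is a hypothetical model: `Slots` and `structural_hazard` are illustrative names, and the capacities are passed in as parameters rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slots:
    """Slot counts for the three buffers (used as occupancy or as capacity)."""
    insts: int
    fwd: int
    xdata: int


def structural_hazard(used: Slots, capacity: Slots,
                      needs_fwd_slot: bool,
                      needs_xdata_slot: bool) -> Optional[str]:
    """Return the hazard that blocks buffering the next instruction, if any.

    On a hazard, the window is closed: buffering stops, the scalar core
    stalls, and vector execution of the current window begins.
    """
    if used.insts >= capacity.insts:
        return "instruction buffer full"
    if needs_fwd_slot and used.fwd >= capacity.fwd:
        return "forwarding buffer full"
    if needs_xdata_slot and used.xdata >= capacity.xdata:
        return "xdata buffer full"
    return None
```

For instance, with `capacity = Slots(insts=4, fwd=2, xdata=2)` (made-up sizes) and `used = Slots(insts=3, fwd=2, xdata=0)`, an incoming instruction that needs a forwarding slot triggers the second hazard, while one that needs no forwarding slot can still be buffered.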
## METHODOLOGY

* They build a simulation infrastructure based on Verilator that allows them to run full applications on top of MANIC. This infrastructure includes a custom version of Spike, modifications to the assembler, a custom LibC, a custom bootloader, a cache and memory simulator, and RTL for both MANIC and its **five-stage pipelined scalar core**.
* They develop **a custom version of Spike** with support for vector instructions to verify RTL simulation.
* They **extend the GNU assembler** to generate the correct bit encodings for the RISC-V vector extension instruction set.
* They build **a custom LibC and bootloader** to minimize the overhead of startup and to support their architecture.
* They use a **cache and memory simulator** to model timing for loads and stores.
* Finally, they use a test harness built with **Verilator to run full-application, cycle-accurate RTL simulations**.
# 3. Result

* Full system energy of MANIC against various baselines across seven different applications. Bars (from left-to-right): scalar baseline, vector baseline, MANIC, and an idealized vector design with no instruction or data supply energy. **MANIC is within 26.4% of the ideal design and is overall 2.8× more energy efficient than the scalar baseline and 38.1% more energy efficient than the vector baseline.**

* Impact of MANIC’s optimizations on full system energy, comparing (from left-to-right): vector baseline, MANIC with forwarding disabled, MANIC without dataflow code scheduling, and full MANIC. Without forwarding, MANIC’s added components slightly increase energy by <5%. **Forwarding saves 15.5% system energy vs. the baseline, and kill annotations and dataflow code scheduling save a further 26.7%.**

* In the intermittent computing domain, **MANIC with hardware JIT-checkpointing is 9.6× more energy efficient than SONIC**, which maintains correctness in software alone.

* Instruction and cycle counts for seven benchmarks running on the scalar baseline, vector baseline, and MANIC. The vector baseline and MANIC effectively do not differ. **Vector execution means that both run 10.6× fewer instructions and 2.5× fewer cycles than the scalar baseline**.

* Power of the scalar baseline, vector baseline, and MANIC across seven benchmarks. MANIC uses 10.0% less power than the scalar baseline. The vector baseline, despite using less energy than the scalar baseline, actually uses 29.5% more power.

* MANIC’s sensitivity to its microarchitectural parameters: **16 is the best window size, larger vector lengths are generally better, and moderately sized caches are generally more energy efficient**.
# 4. Conclusion
* This paper described MANIC, an ultra-low-power embedded processor architecture that achieves high energy efficiency without sacrificing programmability or generality.
* The key to MANIC’s efficient operation is its vector-dataflow execution model, in which dependent instructions in a short window forward operands to one another according to dataflow.
* Vector operation amortizes control overhead. Dataflow execution avoids costly reads from the vector register file. Simple compiler and software support helps avoid further vector register file writes in a microarchitecture-agnostic way.
* MANIC’s microarchitecture implementation directly implements vector-dataflow with simple hardware additions, while still exposing a standard RISC-V ISA interface. **MANIC’s highly efficient implementation is on average 2.8× more energy efficient than a scalar core and is within 26.4% on average of an ideal design that eliminates all costs of programmability.** The results show that MANIC’s vector-dataflow model is realizable and approaches the limit of energy efficiency for an ultra-low-power embedded processor.
# 5. Discussion
* Very good.