# MANIC: A Vector-Dataflow Architecture for Ultra-Low-Power Embedded Systems

###### tags: `Accelerators`
###### paper origin: MICRO 52
###### papers: [link](https://dl.acm.org/doi/pdf/10.1145/3352460.3358277)
###### slides and video: `none`

# 1. INTRODUCTION

## Problem
* How can we perform sophisticated computations on simple, ultra-low-power systems? One design is to offload work by wirelessly transmitting data to a more powerful nearby computer for processing. Unfortunately, transmitting data takes much more energy per byte than sensing, storing, or computing on those data.
* The device must **process data locally at a low operating power** and with extremely **high energy-efficiency**.
* The device must be **programmable and general** to support a wide variety of applications.
* COTS **MCUs pay a high power, energy, and performance cost** for their generality and programmability.

![](https://i.imgur.com/u5dQz2Z.png)

* The authors find that **instruction and data supply consume 54.4%** of the average execution energy in their workloads.

![](https://i.imgur.com/zEwcG9G.png)

## Solution
* This work presents MANIC: **an efficient vector-dataflow architecture for ultra-low-power embedded systems.**
* MANIC is closest to the Ideal design, **achieving high energy-efficiency while remaining general-purpose and simple to program**.
* MANIC is simple to program because it exposes a standard vector ISA interface based on the **RISC-V vector extension**.
* MANIC achieves high energy-efficiency by **eliminating the two main costs of programmability through its vector-dataflow design**.
    * First, **vector execution amortizes instruction supply energy** over a large number of operations.
    * Second, **MANIC addresses the high cost of VRF accesses through its dataflow component by forwarding operands directly between vector operations**.
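
The two savings listed in the Solution section can be made concrete with a small counting model. The sketch below is an illustrative assumption, not the paper's methodology: it assumes a chain of dependent single-input operations, one register read and one register write per operation per element, and perfect forwarding of every intermediate value. The function `supply_counts` and its parameters are hypothetical names.

```python
from typing import Tuple

# Toy counting model (an assumption for illustration, not the paper's methodology).
# It counts instruction fetches and register-file accesses for a chain of
# dependent single-input operations applied to every element of an array.


def supply_counts(n_elems: int, chain_len: int, mode: str) -> Tuple[int, int]:
    """Return (instruction_fetches, register_file_accesses) for one kernel."""
    if mode == "scalar":
        # Every operation is fetched again for every element.
        fetches = n_elems * chain_len
        # One register read and one write per operation per element.
        rf_accesses = 2 * n_elems * chain_len
    elif mode == "vector":
        # One fetch per vector instruction covers all elements.
        fetches = chain_len
        # Each instruction still reads and writes the vector register file.
        rf_accesses = 2 * n_elems * chain_len
    elif mode == "vector-dataflow":
        fetches = chain_len
        # Assumes every intermediate is forwarded: only the first input is
        # read from, and the final result written to, the register file.
        rf_accesses = 2 * n_elems
    else:
        raise ValueError(f"unknown mode: {mode}")
    return fetches, rf_accesses


for mode in ("scalar", "vector", "vector-dataflow"):
    print(mode, supply_counts(n_elems=64, chain_len=4, mode=mode))
```

With 64 elements and a chain of four operations, this model drops from 256 instruction fetches (scalar) to 4 (vector), and from 512 register-file accesses (vector) to 128 (vector-dataflow). Real savings depend on window size, kill points, and how much of a kernel forms dependent chains.
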
## Contributions
* They implement MANIC fully in RTL and use industry-grade CAD tools to evaluate its energy efficiency across a collection of programs appropriate to the deeply embedded domain.
* Using post-synthesis energy estimates, they show that MANIC is within 26.4% of the energy of an idealized design while remaining fully general and making few, unobtrusive changes to the ISA and software development stack.

# 2. Implementation
* There are two main goals of vector-dataflow execution:
    * The first goal is to **provide general-purpose programmability**.
    * The second goal is to do this while operating efficiently by **minimizing instruction and data supply overheads**.
* Vector-dataflow achieves this through three features:
    * **vector execution**
    * **dataflow instruction fusion**
    * **register kill points**

## Vector Execution
* **Save instruction energy. (SIMD)**

## Vector Dataflow
* **Save vector register access energy.**
* Orange arrows represent control flow; blue arrows represent dataflow. MANIC relies on vector-dataflow execution, **avoiding register accesses by forwarding and renaming**.

![](https://i.imgur.com/9ANLO8u.png)

## Vector Register Kill Points
* **Save data supply energy**.
* **A vector register is dead at a particular instruction if no subsequent instruction uses the value in that register; a dead value need not be written to the vector register file.**
* Histograms of kill distances for three different applications. Distances skew left, suggesting values are consumed for the last time shortly after being produced.

![](https://i.imgur.com/QtdmTmp.png)

## Vfence
* **A new vfence instruction is added that handles both synchronization and memory consistency.**
* **vfence stalls the scalar core** until the vector unit completes execution of its current window of vector-dataflow operations.
* In practice, this often means inserting a vfence at the end of the kernel; the **programmer** is responsible for using it correctly.
  (**Fixes data-race and aliasing problems.**)

## MANIC: ULTRA-LOW-POWER VECTOR-DATAFLOW PROCESSING
* They emphasize that these compiler-based features do not require programming changes, do not expose microarchitectural details, and are optional for effective use of MANIC. MANIC implements the RISC-V V vector extension.
* The vector unit has a few simple additions to support vector-dataflow execution:
    * **instruction windowing hardware**
    * **renaming mechanism**
* They implement **16 vector registers**, which need only four bits to name. This **leaves a single bit in the register name unused, and MANIC uses the extra bit to convey kill annotations**.
* A block diagram of MANIC’s microarchitectural components (non-gray region). Control flow is denoted with orange, while blue denotes dataflow. Stateful components (e.g. register files) have dotted outlines.

![](https://i.imgur.com/VYu3Rcz.png)

* MANIC adds four components to this base vector core to support vector-dataflow execution:
    * **issue logic and a register renaming table**
        * Issue logic is primarily responsible for creating a window of instructions to execute according to vector-dataflow.
        * This means identifying, preparing, and issuing for execution a window of dependent instructions over an entire vector of inputs.
        * The issue logic identifies dataflow between instructions by comparing the names of their input and output operands.
        * **If two instructions are dependent (the output of one instruction is the input of another), MANIC should forward the output value directly from its producer to the input of the consumer, avoiding the register file.**
    * **an instruction window buffer**
        * **A key feature of the instruction window's control logic is its ability to select an operand's source or destination.**
        * For input operands, the instruction window controls whether to fetch an operand from the vector register file or from MANIC's forwarding buffer.
        * For output operands, the instruction window controls whether to write an output operand to the vector register file, to the forwarding buffer, or to both.
    * **an xdata buffer**
        * **Some instructions like vector loads and stores require extra information available from the scalar register file** when the instruction is decoded.
        * Since not all vector instructions require values from the scalar register file, MANIC includes a separate buffer, called the xdata buffer, to hold this extra information.
    * **a forwarding buffer**
        * The forwarding buffer is a small, directly indexed buffer that **stores intermediate values** as MANIC's execution unit forwards them to dependent instructions in the instruction window.
        * **By accessing the forwarding buffer instead of accessing the vector register file**, an instruction with one or more forwarded operands consumes less energy than one that executes without MANIC.

## Memory System
* **MANIC includes an instruction cache and a data cache**.
* This departs from the designs of many commercial microcontrollers in the ultra-low-power computing domain, which do not have dcaches and have extremely small icaches on the order of 64 bytes.

## Example
* MANIC’s issue logic constructs windows of instructions with dataflow. The rename table keeps track of registers and names, updating the instruction buffer when new opportunities for forwarding are identified.

![](https://i.imgur.com/i54oO4A.png)

* MANIC’s microarchitecture components execute a window of instructions using forwarding according to dataflow across an entire vector of input.

![](https://i.imgur.com/NdxwRRG.png)

## Microarchitecture-Agnostic Dataflow Scheduling
* Code scheduling is microarchitecturally agnostic: minimizing the sum of kill distances is a good proxy for minimizing register writes for a specific window size.
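
The scheduling objective can be sketched with a toy model. Everything here is an assumption for illustration rather than MANIC's actual compiler pass or issue logic: kill distance is taken as the number of instructions between a value's definition and its last use, windows are modeled as fixed, aligned groups of `window` instructions, and a value is counted as a register-file write when its last use falls in a later window than its producer. The names `kill_distances`, `vrf_writes`, and the two example schedules are hypothetical.

```python
from typing import List, Optional, Tuple

# (destination register, or None for a store; source registers)
Instr = Tuple[Optional[str], List[str]]


def kill_distances(program: List[Instr]) -> List[int]:
    """For each instruction, distance from its definition to the value's last use."""
    distances = []
    for i, (dest, _) in enumerate(program):
        last_use = i  # stays i (distance 0) if there is no destination or no reader
        if dest is not None:
            for j in range(i + 1, len(program)):
                next_dest, srcs = program[j]
                if dest in srcs:
                    last_use = j
                if next_dest == dest:
                    break  # register redefined; later reads see a new value
        distances.append(last_use - i)
    return distances


def vrf_writes(program: List[Instr], window: int) -> int:
    """Count values whose last use lies in a later window than their producer."""
    writes = 0
    for i, dist in enumerate(kill_distances(program)):
        if i // window != (i + dist) // window:
            writes += 1
    return writes


# Two schedules of the same computation: store(a*a + b*b).
schedule_a = [("v1", []), ("v2", []), ("v3", ["v1", "v1"]),
              ("v4", ["v2", "v2"]), ("v5", ["v3", "v4"]), (None, ["v5"])]
schedule_b = [("v1", []), ("v3", ["v1", "v1"]), ("v2", []),
              ("v4", ["v2", "v2"]), ("v5", ["v3", "v4"]), (None, ["v5"])]

for name, sched in (("A", schedule_a), ("B", schedule_b)):
    print(name, sum(kill_distances(sched)), vrf_writes(sched, window=2))
```

In this toy example, schedule B has the smaller sum of kill distances (7 versus 8) and also needs fewer register-file writes at a window size of 2 (2 versus 4), which is the kind of agreement the proxy relies on.
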
![](https://i.imgur.com/OcJcc3e.png)

* The two curves generally agree, suggesting that **minimizing sum kill distance eliminates register writes with similar efficacy as when window size is exposed explicitly to the compiler**.
* For the **FFT kernel, the instruction window is broken by stores and permutations, causing additional vector register file writes.** This is a limitation of optimizing only for sum kill distance that the authors plan to address in future work.
* There are three structural hazards that cause MANIC to stop buffering additional instructions, stall the scalar core, and start vector execution:
    * The first hazard occurs when the **instruction buffer is full** and another vector instruction is waiting to be buffered.
    * The second hazard occurs when **all slots in the forwarding buffer are allocated** and an incoming instruction requires a slot.
    * Finally, the third hazard occurs when the **xdata buffer is full** and a decoded vector instruction requires a slot.

## METHODOLOGY

![](https://i.imgur.com/aTDbF1b.png)

* They build a simulation infrastructure based on Verilator that allows them to run full applications on top of MANIC. This infrastructure includes a custom version of Spike, modifications to the assembler, a custom LibC, a custom bootloader, a cache and memory simulator, and RTL for both MANIC and its **five-stage pipelined scalar core**.
* They develop **a custom version of Spike** with support for vector instructions to verify RTL simulation.
* They **extend the GNU assembler** to generate the correct bit encodings for the RISC-V vector extension instruction set.
* They build **a custom LibC and bootloader** to minimize the overhead of startup and to support their architecture.
* They use a **cache and memory simulator** to model timing for loads and stores.
* Finally, they use a test harness built with **Verilator to run full-application, cycle-accurate RTL simulations**.

# 3. Result

![](https://i.imgur.com/Q7YF3FS.png)

* Full system energy of MANIC against various baselines across seven different applications. Bars (from left to right): scalar baseline, vector baseline, MANIC, and an idealized vector design with no instruction or data supply energy. **MANIC is within 26.4% of the ideal design and is overall 2.8× more energy efficient than the scalar baseline and 38.1% more energy efficient than the vector baseline.**

![](https://i.imgur.com/aPxtLHz.png)

* Impact of MANIC’s optimizations on full system energy, comparing (from left to right): vector baseline, MANIC with forwarding disabled, MANIC without dataflow code scheduling, and full MANIC. Without forwarding, MANIC’s added components increase energy slightly (by <5%). **Forwarding saves 15.5% system energy vs. the baseline, and kill annotations and dataflow code scheduling save a further 26.7%.**

![](https://i.imgur.com/lVC9Si0.png)

* In the intermittent computing domain, **MANIC with hardware JIT-checkpointing is 9.6× more energy efficient than SONIC**, which maintains correctness in software alone.

![](https://i.imgur.com/NA9lPhq.png)

* Instruction and cycle counts for seven benchmarks running on the scalar baseline, vector baseline, and MANIC. The vector baseline and MANIC effectively do not differ. **Vector execution means that both run 10.6× fewer instructions and 2.5× fewer cycles than the scalar baseline**.
* ![](https://i.imgur.com/IwvCb8N.png)
* Power of the scalar baseline, vector baseline, and MANIC across seven benchmarks. MANIC uses 10.0% less power than the scalar baseline, whereas the vector baseline, despite using less energy than the scalar baseline, actually uses 29.5% more power.

![](https://i.imgur.com/IQtIAhE.png)

* MANIC’s sensitivity to its microarchitectural parameters: **16 is the best window size, larger vector lengths are generally better, and moderately sized caches are generally more energy efficient**.
* ![](https://i.imgur.com/x20z6nd.png)

# 4. Conclusion
* This paper described MANIC, an ultra-low-power embedded processor architecture that achieves high energy efficiency without sacrificing programmability or generality.
* The key to MANIC’s efficient operation is its vector-dataflow execution model, in which dependent instructions in a short window forward operands to one another according to dataflow.
* Vector operation amortizes control overhead. Dataflow execution avoids costly reads from the vector register file. Simple compiler and software support helps avoid further vector register file writes in a microarchitecture-agnostic way.
* MANIC’s microarchitecture directly implements vector-dataflow with simple hardware additions, while still exposing a standard RISC-V ISA interface. **MANIC’s highly efficient implementation is on average 2.8× more energy efficient than a scalar core and is within 26.4% on average of an ideal design that eliminates all costs of programmability.** The results show that MANIC’s vector-dataflow model is realizable and approaches the limit of energy efficiency for an ultra-low-power embedded processor.

# 5. Discussion
* Very good.