# Stream-Dataflow Acceleration
###### tags: `Accelerators`
###### paper: [link](https://research.cs.wisc.edu/vertical/papers/2017/isca17-stream-dataflow.pdf)
###### no slides or videos found
##### paper origin: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)
# 1. INTRODUCTION
* Motivation
* We require hardware capable of executing **data-intensive algorithms** at high performance with much **lower power** than existing programmable architectures, while remaining broadly **applicable** and **adaptable**.
* Common characteristics
1. High computational intensity with long phases
2. Small instruction footprints with simple control flow
3. Straightforward memory access and re-use patterns
* The architecture is named "stream-dataflow" after the two primitives it combines (stream-based data movement and a dataflow computation substrate), and it exposes these basic abstractions:
* A dataflow graph for repeated, pipelined computations.
* Stream-based commands for facilitating efficient data-movement across components and to memory.
* A private (scratchpad) address space for efficient data reuse.

* Performance:
* Compared to a domain-specific machine-learning accelerator, the average power and area overhead is only about 2×.
* On the broader set of MachSuite workloads, compared to custom ASICs, the average overhead was 2× power and 8× area.
# 2. MOTIVATION AND OVERVIEW
## 2.1 Specialization in Existing Approaches
* The paper discusses specialization capabilities in three broad categories:
1. Reducing the per-instruction power and resource access costs,
2. Reducing the cost of memory addressing and communication, and
3. Reducing the cost of attaining high execution resource utilization


* Summary and Observations
1. Being able to specify "vectorized memory access" is extremely important, not just for parallelism and reducing memory accesses, but also for reducing address generation overhead
2. Though "vectorized instructions" do reduce instruction dispatch overhead, the separation of the work into fixed-length instructions requires inefficient operand communication through register files and requires high-power mechanisms to attain high utilization
3. Exposing a "spatial dataflow" substrate to software solves the above, but complicates and disrupts the ability to specify and take advantage of vectorized memory access
## 2.2 Opportunities for Stream-Dataflow
* **Vector architectures** expose a far more efficient **parallel memory interface**, while **spatial architectures** expose a far more efficient **parallel computation interface**

# 3. STREAM-DATAFLOW ARCHITECTURE
## 3.1 Abstractions
* Dataflow Graph (DFG)
* The DFG is an acyclic graph containing instructions and dependences
* DFG inputs and outputs are named ports with explicit vector widths
* Dataflow graphs can be switched through a configuration command
* Streams
* Streams are defined by a source architectural location, a destination, and an access pattern (sketched in the code after this list)
* Streams from DFG outputs to inputs support recurrence
* Streams generally execute concurrently
* Barriers and Concurrency
* Barrier instructions serialize the execution of certain types of commands
* The programmer or compiler is responsible for enforcing memory dependences
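A minimal sketch of how these abstractions might be represented in software. The type and field names below are my own, not the paper's ISA terminology; the access-pattern fields mirror the kind of linear/strided patterns the paper describes.

```cpp
#include <cstdint>

// Hypothetical encoding of a stream: a source, a destination, and an
// access pattern. Names are illustrative only.
enum class Location { Memory, Scratchpad, DfgInputPort, DfgOutputPort, ConstValue };

struct AccessPattern {
    uint64_t start_addr;    // first address (or constant value)
    uint64_t access_size;   // contiguous bytes accessed per "row"
    int64_t  stride;        // bytes between the start of consecutive rows
    uint64_t num_strides;   // number of rows; 1 gives a purely linear stream
};

struct Stream {
    Location      source;
    Location      destination;
    int           port;      // which named DFG vector port is read/written
    AccessPattern pattern;
};

// Barriers serialize classes of commands; memory ordering between streams
// is otherwise the programmer's/compiler's responsibility.
enum class Barrier { ScratchpadRead, ScratchpadWrite, AllComplete };
```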
## 3.2 Programming and Execution Model
* A stream-dataflow program consists of a set of configure, stream, and barrier commands that interact with and are ordered with respect to the instructions of an ordinary general-purpose program (a program sketch follows the list below).

* Performance
1. The DFG size should be as large as possible to maximize instruction parallelism.
2. Streams should be as “long” as possible to avoid instruction overheads on the control core.
3. Reused data should be pushed to the scratchpad to reduce bandwidth to memory
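A hedged sketch of what such a program could look like through a thin wrapper API. The function names (`sd_config`, `sd_mem_to_port`, `sd_port_to_mem`, `sd_barrier_all`), the port numbering, and the `add_dfg_config` object are placeholders I made up; the paper's actual wrapper and ISA mnemonics may differ.

```cpp
#include <cstddef>

// Placeholder wrapper API: in a real system each call would lower to a
// stream-dataflow command executed asynchronously by the stream engines.
// Here they are empty stubs so the sketch compiles.
void sd_config(const void* dfg_config) {}                        // load a DFG onto the CGRA
void sd_mem_to_port(const void* src, size_t bytes, int port) {}  // stream memory -> DFG input port
void sd_port_to_mem(int port, void* dst, size_t bytes) {}        // stream DFG output port -> memory
void sd_barrier_all() {}                                         // wait until all streams drain

// Stand-in for CGRA configuration bits produced offline by a DFG compiler (assumed).
const void* add_dfg_config = nullptr;

// Offload c[i] = a[i] + b[i] for n elements to a hypothetical "add" DFG
// that reads input ports 0 and 1 and writes output port 2.
void vector_add(const float* a, const float* b, float* c, size_t n) {
    sd_config(add_dfg_config);                         // configure once, reuse across calls
    sd_mem_to_port(a, n * sizeof(float), /*port=*/0);  // long streams amortize core overhead
    sd_mem_to_port(b, n * sizeof(float), /*port=*/1);
    sd_port_to_mem(/*port=*/2, c, n * sizeof(float));
    sd_barrier_all();                                  // results visible to the core after this
}
```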
## 3.3 Stream-Dataflow ISA

# 4. A STREAM-DATAFLOW MICROARCHITECTURE
* Two primary design principles:
* Avoid introducing large or power-hungry structures, especially multi-ported memories
* Take full advantage of the concurrency provided by the ISA
## 4.1 Overview
* At a high level, we combine a low power **control core** to generate **stream commands**, a set of **stream-engines** to efficiently interface with memories and move data, and a deeply-pipelined reconfigurable **dataflow substrate** for efficient parallel computation.
* **component**
* Control Core
* Generates **stream-dataflow commands** to the stream dispatcher
* Stream Dispatcher
* Manages the concurrent execution of the stream engines by **tracking stream resource dependences** and **issuing commands** to stream engines
* Stream Engines
* Carry out the actual data access and movement
* Three engines, one for each of:
1. Memory
2. Scratchpad
3. DFG recurrences
* Vector Ports
* **Interface** between the computations performed by the CGRA and the streams of incoming/outgoing data
* A set of vector ports not connected to the CGRA is used to buffer the streaming addresses of **indirect loads/stores**
* CGRA
* A coarse-grained reconfigurable architecture that enables pipelined computation of dataflow graphs

* **Stream Command Lifetime** (modeled in the sketch after this list)
1. The **control core** generates the command and sends it to the **stream dispatcher**.
2. The **stream dispatcher** issues the command to the appropriate **stream engines** once any associated resources are free.
3. The data transfer for each stream is carried out by the stream engine, which keeps track of the **running state** of the stream over its lifetime.
4. When the stream completes, the stream engine notifies the dispatcher that the corresponding resources are free, enabling the next stream command to be issued.
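A rough functional model of this lifecycle. The structure and names below are mine; the real dispatcher and stream engines are hardware, and this only captures the issue/complete handshake described above.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Simplified stream command: which engine runs it and which vector ports it occupies.
enum class Engine { Memory, Scratchpad, Recurrence };

struct StreamCmd {
    Engine engine;
    std::vector<int> ports;    // vector ports this stream will occupy
    uint64_t remaining_bytes;  // crude stand-in for the stream's running state
};

struct Dispatcher {
    std::deque<StreamCmd> pending;  // commands queued by the control core
    std::vector<bool> port_busy;    // one busy bit per vector port

    explicit Dispatcher(int num_ports) : port_busy(num_ports, false) {}

    // A command can issue only when all of its vector ports are free
    // (barrier handling is sketched under Sec. 4.2 below).
    bool can_issue(const StreamCmd& c) const {
        for (int p : c.ports)
            if (port_busy[p]) return false;
        return true;
    }

    // Issue step: reserve the command's ports and hand it to a stream engine.
    bool try_issue(StreamCmd& out) {
        if (pending.empty() || !can_issue(pending.front())) return false;
        out = pending.front();
        pending.pop_front();
        for (int p : out.ports) port_busy[p] = true;
        return true;
    }

    // Completion step: the stream engine notifies the dispatcher, freeing the ports.
    void complete(const StreamCmd& c) {
        for (int p : c.ports) port_busy[p] = false;
    }
};
```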
## 4.2 Stream Dispatch and Control Core
* Enforce **resource dependences** on streams and coordinate the execution of the stream engines by sending them commands
* Stream requests from the control core are **queued** until they can be processed by the command decoder. This unit consults **resource status checking logic** to determine whether a command can be issued; if so, the command is **dequeued** (a conceptual sketch follows below).
* **Barrier commands** block the core from issuing further stream commands until the barrier condition is resolved.
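Elaborating the `can_issue` check from the lifecycle sketch above, the resource-status logic might conceptually look like the following. Again, the names and structure are mine, not the paper's RTL; engine-queue capacity checks are omitted.

```cpp
#include <vector>

struct Command {
    bool is_barrier;
    std::vector<int> ports;  // vector ports the stream reads/writes (empty for barriers)
};

struct ResourceStatus {
    std::vector<bool> port_busy;    // one busy bit per vector port
    bool barrier_pending = false;   // an unresolved barrier blocks further issue
};

// Conceptual per-command check performed by the command decoder.
bool can_issue(const Command& c, const ResourceStatus& rs) {
    if (rs.barrier_pending) return false;   // barriers serialize later commands
    if (c.is_barrier) return true;          // the barrier itself can be accepted
    for (int p : c.ports)
        if (rs.port_busy[p]) return false;  // a required vector port is still in use
    return true;
}

// The decoder dequeues the head command only when can_issue() holds; otherwise
// the stream request stays queued, stalling further stream commands but not the
// control core's ordinary scalar instructions.
```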

## 4.3 Stream Engines
* Stream engines manage **concurrent access** to various resources by many **active streams**. They are critical to achieving high parallelism with low power overhead: by arbitrating stream access, they keep the associated resources fully utilized.
* Stream engines are initiated by receiving commands from the **stream dispatcher**. They then coordinate the **address generation** and **data transfer** over the lifetime of the stream, and finally **notify** the dispatcher when the corresponding vector ports are freed.
* The stream engines each have their own 512-bit wide bus to the input and output vector ports. The stream dispatcher ensures that concurrent streams have dedicated access to their vector ports.
* Indirect access is facilitated by vector ports which are not connected to the CGRA, which buffer addresses in flight.

* Each stream engine contains an address generation unit (AGU) that produces a stream's addresses from its access pattern (illustrated below)
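As an illustration of what the AGU computes, here is a software rendering of a two-dimensional strided pattern and a simple indirect pattern. The parameter names mirror the stream sketch in Sec. 3.1, not the paper's exact terminology; in hardware these addresses are generated in a pipelined fashion rather than materialized into a list.

```cpp
#include <cstdint>
#include <vector>

// Addresses for a 2-D strided stream: `num_strides` rows of `access_size`
// contiguous bytes, each row starting `stride` bytes after the previous one.
std::vector<uint64_t> linear_2d_addresses(uint64_t start, uint64_t access_size,
                                          int64_t stride, uint64_t num_strides) {
    std::vector<uint64_t> addrs;
    for (uint64_t row = 0; row < num_strides; ++row)
        for (uint64_t b = 0; b < access_size; ++b)
            addrs.push_back(start + row * stride + b);
    return addrs;
}

// Indirect streams: the addresses come from data already streamed into a
// non-CGRA vector port (e.g., an index array), enabling a[idx[i]]-style access.
std::vector<uint64_t> indirect_addresses(uint64_t base, const std::vector<uint32_t>& idx,
                                         uint64_t elem_bytes) {
    std::vector<uint64_t> addrs;
    for (uint32_t i : idx) addrs.push_back(base + uint64_t(i) * elem_bytes);
    return addrs;
}
```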
# 5. IMPLEMENTATION

* Hardware
* Implemented in Chisel
* Software Stack
* A simple wrapper API that is mapped down onto the RISC-V encoding of the stream-dataflow ISA
* A GCC cross-compiler for RISC-V, modified with **stream-dataflow ISA extensions**, plus a **DFG compiler**
* Simulator
* Cycle-level RISC-V-based simulator
# 7. EVALUATION
* What are the sources of its power and area overhead?
• CGRA network and control core.
* Can it match the speedup of a domain-specialized accelerator?
• Yes
* Is the stream-dataflow paradigm general?
• Yes; all of the DNN workloads and most of MachSuite are implementable using the stream-dataflow abstractions.
* What are its limitations in terms of generality?
• Code properties that are not suitable are arbitrary memory-indirection and aliasing, control-dependent loads, and bit-level manipulations.
* How does stream-dataflow compare to application-specific designs (ASICs)?
• Only 2× power and 8× area overhead.
## 7.1 Domain-Specific Accelerator Comparison

## 7.2 Stream-Dataflow Generality
## 7.3 Application-Specific Comparison

* Figure 13 shows the power savings (efficiency) over an out-of-order Sandy Bridge (OOO4) core as the baseline
* Figure 14 shows the energy efficiency comparison of Softbrain and the ASICs

# 9. DISCUSSION AND CONCLUSIONS
* Provides abstractions that balance the tradeoffs of **vector** and **spatial architectures**
* Attain the **specialization capabilities** of both on an important class of **data-processing workloads**
* Sufficiently **general** to express the execution of a variety of deep learning workloads and many workloads from MachSuite
* Developed an efficient microarchitecture; the evaluation suggests its **power** and **area** are only small factors larger than those of domain-specific and ASIC designs