# Plasticine: A Reconfigurable Architecture For Parallel Patterns

###### tags: `Accelerators`
###### paper origin: ISCA 2017
###### papers: [link](https://ieeexplore.ieee.org/document/8192487)

# 1. Introduction

* Motivation
    * Accelerators implement customized data and control paths to suit a domain of applications, avoiding the overheads that general-purpose processors pay for flexibility.
    * However, ASICs are expensive due to the high NRE costs of design.
    * Reconfigurable architectures like FPGAs offset the high NRE costs by providing flexible logic blocks, including support for arbitrary-precision computation.
    * Recent work has shown that parallel patterns can be used to generate optimized accelerators for FPGAs from high-level languages.
    * Goal: develop a reconfigurable architecture that supports parallel patterns and is highly efficient in terms of area, power, and performance.
* Plasticine
    * Each **Pattern Compute Unit (PCU)** consists of a reconfigurable pipeline with multiple stages of SIMD functional units.
    * **Pattern Memory Units (PMUs)** are composed of a banked scratchpad memory.
    * The compiler can map inner-loop computation to one PCU such that most operands are transferred directly between functional units without scratchpad accesses or inter-PCU communication.

# 2. PARALLEL PATTERNS

## Programming with Parallel Patterns

* The programming model of this paper is based on the parallel patterns Map, FlatMap, Fold, and HashReduce.

![](https://i.imgur.com/Wi20xMJ.jpg)

* Map
    * Map creates a single output element per index using the function f, where each execution of f is guaranteed to be independent.
    * The number of output elements from Map is the same as the size of the input iteration domain.
* FlatMap
    * FlatMap produces an arbitrary number of elements per index using function g, where function execution is also independent.
    * The produced elements are concatenated into a flat output.
    * Conditional data selection (e.g. WHERE in SQL) is a special case of FlatMap where g produces zero or one elements.

![](https://i.imgur.com/4VLmXlk.jpg)

* Fold
    * Fold first acts as a Map, producing one element per index.
    * It then reduces these elements using an associative combine function r.
* HashReduce
    * HashReduce generates a hash key and a value for every index using functions k and v, respectively.
    * Values with the same key are reduced into a single accumulator using an associative combine function r.
    * Histogram creation is a simple example of HashReduce:
        * the key function gives the histogram bin
        * the value function is defined to always be 1
        * the combine function r is integer addition
* Examples (both examples below are also sketched in plain Scala at the end of this subsection)

![](https://i.imgur.com/FuI9ixO.jpg)

* The outer Map creates an output matrix of size M × P.
* Fold's map function (f2) accesses one element of matrix a and one element of matrix b and multiplies them.
* Fold's combine function (r) in this case is summation.

![](https://i.imgur.com/lgDGd2R.jpg)

* The filter (a special case of FlatMap) keeps only the items that are less than cutoff.
* HashReduce's key function and value function generate a hash key and a value for every item.
* HashReduce's combine function r is summation.
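The two examples above can be mirrored with ordinary Scala collections. The sketch below is a semantic illustration only, not the paper's hardware DSL; the object name, matrix sizes, item list, cutoff, and bin width are all assumptions made for the example.

```scala
// Semantic sketch of the two examples above using plain Scala
// collections. Illustration only: the paper targets a hardware DSL,
// and all names/sizes here are assumed for the example.
object ParallelPatterns {
  def main(args: Array[String]): Unit = {
    // Example 1: untiled matrix multiply as Map + Fold.
    val (m, n, p) = (2, 3, 2)
    val a = Array.tabulate(m, n)((i, j) => (i + j).toDouble)
    val b = Array.tabulate(n, p)((i, j) => (i * j).toDouble)

    // Outer Map over the M x P output domain. Each element is a Fold:
    // map function f2 multiplies a(i)(k) * b(k)(j); combine function r
    // is summation.
    val c = Array.tabulate(m, p) { (i, j) =>
      (0 until n).map(k => a(i)(k) * b(k)(j)).sum
    }
    println(c.map(_.mkString(" ")).mkString("\n"))

    // Example 2: a filter is a FlatMap producing zero or one elements
    // per index.
    val items  = Seq(3, 9, 1, 7, 4)
    val cutoff = 5
    val kept   = items.flatMap(x => if (x < cutoff) Seq(x) else Seq.empty)
    println(kept)  // List(3, 1, 4)

    // HashReduce: key function k picks a bin, value function v is 1
    // per item, combine function r is integer addition.
    val hist = kept.groupBy(_ / 2).map { case (bin, vs) => bin -> vs.map(_ => 1).sum }
    println(hist)  // counts per bin: 0 -> 1, 1 -> 1, 2 -> 1 (order unspecified)
  }
}
```

Note that the reductions in Fold and HashReduce only require r to be associative, which is what allows hardware to compute partial reductions in parallel lanes before combining them.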
## Hardware Implementation Requirements

![](https://i.imgur.com/L0t3LwK.jpg)

* We can categorize access patterns as either statically predictable **linear** functions of the pattern indices or unpredictable, **random** accesses.
* We label accesses as either **streaming**, where no data reuse occurs across a statically determinable number of function executions, or **tiled**, where reuse may occur.
* Traditional FPGAs can also be configured to implement these patterns, but with much poorer energy efficiency.

# 3. THE PLASTICINE ARCHITECTURE

* Plasticine consists of Pattern Compute Units (PCUs) and Pattern Memory Units (PMUs), referred to collectively as **units**.
* Each Plasticine component is used to map specific parts of applications:
    * local address calculation is done in PMUs
    * DRAM address computation happens in the DRAM address management units
    * the remaining data computation happens in PCUs

## Pattern Compute Unit

![](https://i.imgur.com/fbmtXVq.jpg)

* The PCU datapath is organized as a multi-stage, reconfigurable SIMD pipeline.
* Each stage of each SIMD lane is composed of a functional unit (FU) and associated pipeline registers (PRs).
* PCUs use three kinds of inputs and outputs (IO):
    * **Scalar** IO is used to communicate single words of data, such as the results of Folds.
    * Each **vector** IO allows communicating one word per lane in the PCU.
    * **Control** IO is used to communicate control signals such as the start or end of execution of a PCU.

## Pattern Memory Unit

![](https://i.imgur.com/UZsBkeO.jpg)

* Each PMU contains a scratchpad memory coupled with a scalar datapath intended for address calculation.

![](https://i.imgur.com/FizobFz.jpg)

* Address calculation is performed on the PMU datapath, while the core computation is performed within the PCU.
* Address calculation involves simple scalar math, which requires simpler ALUs than the FUs in PCUs.
* Performing address calculation within the PCU instead would require routing the addresses from the PCU to the PMU, which occupies PCU stages and output links and can lead to PCU under-utilization.

## Interconnect

* Plasticine supports communication between PMUs, PCUs, and the remaining elements using three kinds of interconnect:
    * scalar: word-level
    * vector: multiple-word-level
    * control: bit-level

## Off-chip Memory Access

* Off-chip DRAM is accessed from Plasticine using 4 memory channels.
* Each DRAM channel is accessed using several address generators (AGs) on two sides of the chip.
* Multiple AGs connect to an address coalescing unit, which processes the memory requests.

## Control Flow

![](https://i.imgur.com/qUykP1J.jpg)

* Sequential execution
    * In a sequential parent controller, only one data-dependent child is active at any time.
    * This is commonly used when a program has loop-carried dependencies.
* Coarse-grained pipelines
    * The parent controller sends N tokens to the heads, where N is the number of data-dependent children in the critical path.
    * To allow producers and consumers to work on the same data across different iterations, each intermediate memory is M-buffered, where M is the distance between the corresponding producer and consumer on their data-dependency path.
* Credit (a software analogy is sketched after this list)
    * The number of credits is initialized to M.
    * Each producer decrements its credit count after producing all the data for the **current** iteration of the parent, stalling when no credits remain.
    * The consumer sends a credit back after consuming all the data for the **current** iteration of the parent.
* Streaming
    * A controller is enabled when all FIFOs it reads from are not empty and all FIFOs it writes to are not full.
    * FIFOs are local to the consumer controller, so the **enqueue** signal is routed from producer to consumer, and the **not full** signal is routed back from consumer to producer.
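As a software analogy for the credit protocol (our illustration, not the paper's hardware mechanism), the sketch below models the M-buffered intermediate memory as a bounded queue of depth M and the credit counter as a semaphore initialized to M; all names and sizes are assumed for the example.

```scala
// Software analogy (ours, not the paper's hardware) for credit-based
// producer/consumer synchronization in a coarse-grained pipeline.
// Requires Scala 2.12+ (SAM conversion of lambdas to Runnable).
import java.util.concurrent.{ArrayBlockingQueue, Semaphore}

object CreditFlow {
  def main(args: Array[String]): Unit = {
    val M       = 2                              // producer-consumer distance
    val credits = new Semaphore(M)               // credits initialized to M
    val buffer  = new ArrayBlockingQueue[Array[Int]](M) // M-buffered memory
    val iters   = 5

    val producer = new Thread(() => {
      for (iter <- 0 until iters) {
        credits.acquire()                        // stall with zero credits
        // "All the data for the current iteration of the parent":
        buffer.put(Array.fill(4)(iter))
      }
    })
    val consumer = new Thread(() => {
      for (_ <- 0 until iters) {
        val tile = buffer.take()
        println(s"consumed: ${tile.mkString(",")}")
        credits.release()                        // send one credit back
      }
    })
    producer.start(); consumer.start()
    producer.join(); consumer.join()
  }
}
```

In software the bounded queue alone would already provide back-pressure; the explicit semaphore is kept to mirror the hardware credit counter, which stalls the producer exactly when its credit count reaches zero.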
## Architecture Sizing

![](https://i.imgur.com/R0nWQel.jpg)

* Table 3 summarizes the architecture parameters under consideration, the possible values for each, and the final selected values.

![](https://i.imgur.com/nJIXy9q.jpg)

* Benchmark-normalized area overhead
    * Area~PCU~: for each proposed parameter value, the authors sweep the remaining parameter space to find the minimum possible PCU area.
    * Min~PCU~: the minimum possible PCU area for each benchmark.
    * **Overhead = Area~PCU~ / Min~PCU~ - 1** (e.g., a PCU that is 1.2× a benchmark's minimum area has an overhead of 0.2)
* How the parameters are chosen
    * The number of stages per physical PCU
        * All other parameters are left unrestricted.
        * The ideal number of stages per PCU is 5 or 6 for most benchmarks; 6 stages per PCU are selected.
    * The number of registers per FU
        * The number of stages is fixed at 6, leaving all other parameters unrestricted.
        * The ideal number of registers across most applications is between 4 and 6; 6 registers per FU are selected.
    * The number of scalar inputs and outputs
        * 6 scalar inputs and 5 scalar outputs are selected, as this minimizes area overhead across all benchmarks.
    * The number of vector inputs and outputs per PCU
        * Vector inputs are associated with input FIFOs, which represent a sizeable fraction of PCU area.
        * Vector inputs are therefore minimized as much as possible: 3 vector inputs and 3 vector outputs per PCU are chosen.
    * PCU and PMU counts
        * A 16 × 8 array of units with a 1:1 ratio of PMUs to PCUs is chosen.
        * Larger ratios improved unit utilization but were less energy efficient.

# 4. EVALUATION

## Benchmarks

![](https://i.imgur.com/27R1oRG.jpg)

* Table 4 summarizes the benchmarks used to evaluate the FPGA and Plasticine.

## Plasticine Design

![](https://i.imgur.com/3yk9QeW.jpg)

* Table 5 provides the component-wise area breakdown for Plasticine at an area footprint of 112.77 mm^2^.
* Total runtime for Plasticine begins after data is copied into the accelerator's main memory, and ends when execution on the accelerator is complete (i.e., before copying data back to the host).

## FPGA Design

* Each synthesized benchmark is run on an Altera 28 nm Stratix V FPGA.
* The FPGA has a 150 MHz fabric clock, a 400 MHz memory controller clock, and 48 GB of dedicated off-chip DDR3-800 DRAM with 6 channels and a peak bandwidth of 37.5 GB/s.
* Runtime starts after the copy of data from the host to the FPGA's dedicated DRAM completes, and finishes when FPGA execution completes.

## Plasticine versus FPGA

![](https://i.imgur.com/HepcQmg.jpg)

* The table shows that Plasticine achieves higher energy efficiency than the FPGA.

# 5. CONCLUSION

* Plasticine is a novel reconfigurable architecture for efficiently executing both sparse and dense applications composed of parallel patterns.
* For an area budget of 113 mm^2^, Plasticine provides up to 95× improvement in performance and up to 77× improvement in performance-per-Watt compared to an FPGA in a similar process technology.