# DSAGen & CGRA

## **DSAGen**

> DSAGen is a research infrastructure for studying programmable accelerators from the perspective of programming, ISAs, microarchitecture, and scaling.

DSAGen = Domain-Specific Accelerator Generator / Decoupled Spatial Architecture Generator.

* "Spatial" refers to designs with hardware/software interfaces that expose underlying hardware features. *(They can exploit high compute parallelism using direct communication between an array of relatively simple PEs.)*
* "Decoupled" refers to the separation of the memory access and computation pipelines. *(A decoupled architecture allows each computing component to exist and perform tasks independently, while also allowing the components to remain completely unaware and autonomous until instructed.)*

---

### **Overall framework**

![image](https://hackmd.io/_uploads/HJsVGRIF6.png)
*Fig. 1 shows the overall framework of DSAGen.*

1. The spatial accelerator is represented by an architecture description graph (ADG). (The ADG is used to represent the architecture of the entire spatial accelerator.)
2. The compiler first transforms each input kernel into a decoupled dataflow representation; several different versions of each kernel are created with different sets of transformations, each targeted to particular architecture features.
3. The hardware mapper distributes the program to hardware resources and evaluates an efficiency metric (e.g. performance × power × area) using a model. During design space exploration (DSE), the history of hardware mappings is used to modify the ADG to improve efficiency.
4. The best legal version of the compiled program and the modified ADG are chosen.

> Assume fig. (a) is the original target kernels (C + pragmas), fig. (b) is the decoupled dataflow representation after the compiler's transformations, and fig. (c) is the final result of mapping the dataflow onto the hardware.

![image](https://hackmd.io/_uploads/BkshM0dFp.png)![image](https://hackmd.io/_uploads/rk3aGAOKT.png)![image](https://hackmd.io/_uploads/rJmCfAuY6.png)

---

### **ADG**

![image](https://hackmd.io/_uploads/S1sfIGKFp.png)

An architecture description graph (ADG) consists of simple primitives like network switches, processing elements, memory, and synchronization. One benefit of the ADG abstraction is the ability to customize the datapath and memory features (and parameters) for a set of related programs.

![image](https://hackmd.io/_uploads/ByCXYJYFT.png)

* **Processing Elements (PEs)**: perform computations
    * Dynamic vs Static Scheduling:
        * Dynamic Scheduling:
            1. the operation is chosen dynamically based on data arrival.
            2. more power/area.
            3. needs flow control on the network.
        * Static Scheduling:
            1. the order of all operations and data arrivals is determined by the compiler.
            2. less power/area.
            3. less flexibility.
    * Dedicated vs Shared PE
        * Dedicated PE:
            1. one instr. per PE
            2. higher throughput
            3. lower power/area overhead
        * Shared PE:
            1. multiple instrs. per PE
            2. higher power/area overhead
* **Switches**: connect inputs and outputs with differing bitwidths. Switch complexity is a key determiner of the tradeoff between flexibility and efficiency.
* **Connections**: direct communication between hardware elements, forming the edges of the ADG.
* **Memories**: provide abstractions for shared memory and arbitrate access from concurrent coarse-grained memory patterns.
* **Delay Elements**: FIFOs used for pipeline balancing, parameterized by their depth.
    * Static-scheduled delay elements: offer a fixed delay.
    * Dynamic-scheduled delay elements: act as a buffer drained opportunistically.
* **Synchronization Elements**: the interface between dynamically scheduled elements (e.g. memory, dynamic PEs) and static elements (static PEs), implemented as FIFO buffers.
* **Controllers**: distribute work to the other components, and thus synchronize all other units for each phase of the algorithm.

[ADG file example](https://dsa-framework.readthedocs.io/en/latest/ADG/example.html)

---

### **Compiler overview**

#### **Programming interface**

The programming interface is C plus several pragmas (compiler directives) that provide additional information, such as the regions to be offloaded and memory alias freedom (e.g. #pragma dsa offload, #pragma dsa config).
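To make the interface concrete, here is a minimal vector-add kernel in the spirit of the `ss-vecadd` app run later in this note. Only the directive names `#pragma dsa config` and `#pragma dsa offload` come from the text above; their exact placement and any extra clauses are my assumption, not verified DSAGen syntax.

```c
/* Hypothetical sketch of the C + pragma programming interface.
 * The pragma names come from the text; placement/clauses are assumed. */
void vecadd(const int *a, const int *b, int *c, int n) {
  #pragma dsa config            /* region that reuses one accelerator configuration */
  {
    #pragma dsa offload         /* offload this loop region to the spatial fabric   */
    for (int i = 0; i < n; ++i) {
      c[i] = a[i] + b[i];       /* compute slice: mapped onto PEs                   */
    }                           /* memory slice: a[i], b[i], c[i] become streams    */
  }
}
```

A plain C compiler simply ignores the unknown pragmas, so the same source can still be built and tested on the host.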
#### **Compilation flow**

**1. Decoupling the Memory and Compute:** the compiler inspects code blocks marked with the offload pragma and slices out the memory operations.

**2. Data Dependence Transformation:** transforming control dependences into data dependences

![image](https://hackmd.io/_uploads/ryfssrYYa.png)

**3. Modular Compilation:** the compiler optimizes for the given ADG's hardware features. Before applying any hardware-dependent transformation, it first checks whether the underlying hardware has the corresponding feature to support it.

**4. Spatial Scheduling:**
* responsibilities of spatial scheduling:
    1. map instructions and memory streams onto hardware units
    2. route dependences onto the network
    3. match the timing of operand arrival (for static components)
* Spatial Scheduling Algorithm: a stochastic-search-based algorithm.

![image](https://hackmd.io/_uploads/r1ZvdLtY6.png)

**5. Code Generation:** the compiler goes through each candidate of each code transformation and chooses the one with the highest estimated performance.

#### **Generic Optimizations**

* **Producer-Consumer:** avoids the synchronization overhead introduced by waiting for the producer phase to finish, and also enables pipelining of the producer and consumer regions.
![image](https://hackmd.io/_uploads/HyszWPYYa.png)
* **Repetitive In-Place Update:** the compiler rewrites the update loop level by tiling it, so that the amount of data updated each time fits within the capacity of the synchronization buffers.
![image](https://hackmd.io/_uploads/rka3ZPFFa.png)

---

### **Design space exploration**

DSE performs automated codesign between the input programs and the hardware: it selects the best set of transformations for each program and a fine-grained selection of hardware features through iterative graph search.

#### **DSE algorithm**

The figure below shows the iterative approach to codesign, and fig. 9 shows the scheduling example in step 2(b).

![image](https://hackmd.io/_uploads/BJyhiQcKT.png)
![image](https://hackmd.io/_uploads/SkMAsm5tp.png)

#### **Performance evaluation**

* **Performance Modeling Approach:** estimate the performance of transformed code from the IPC, the memory bandwidth required to achieve fully pipelined execution, and the impact of dependences on the activity ratio.
> *IPC = #Insts × Activity Ratio*
* **Power/Area Modeling Approach:** utilize an analytical regression model for power/area evaluation.

---

### Hardware generation

The hardware generator produces RTL and formalizes the software/hardware interfaces.

#### Bitstream encoding

The spatial architecture is configured by loading the bitstream into the configuration registers. (e.g. a synchronization element's bitstream encodes its cycles of delay; a switch's bitstream encodes its routing information)

#### Configuration generation

To construct a set of configuration paths that minimizes configuration time => find the minimum spanning tree.
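The minimum-spanning-tree step can be pictured with a short sketch. This is a generic Prim's-algorithm illustration over a handful of ADG nodes with made-up hop costs, not the DSAGen path generator itself:

```c
#include <stdio.h>

#define N   6            /* number of ADG nodes (PEs/switches) -- made up  */
#define INF 1000000

int hop_cost[N][N];      /* configuration-network hop cost, INF = no link  */

/* Prim's algorithm: grow a minimum spanning tree from node 0, so every
 * node is reachable through one cheapest set of configuration paths.      */
void build_config_paths(void) {
    int in_tree[N] = {0}, best[N], parent[N];
    for (int i = 0; i < N; ++i) { best[i] = INF; parent[i] = -1; }
    best[0] = 0;
    for (int k = 0; k < N; ++k) {
        int u = -1;
        for (int i = 0; i < N; ++i)          /* pick the cheapest fringe node */
            if (!in_tree[i] && (u < 0 || best[i] < best[u])) u = i;
        in_tree[u] = 1;
        if (parent[u] >= 0)
            printf("config path: node %d -> node %d (cost %d)\n",
                   parent[u], u, best[u]);
        for (int v = 0; v < N; ++v)          /* relax edges out of u          */
            if (!in_tree[v] && hop_cost[u][v] < best[v]) {
                best[v] = hop_cost[u][v];
                parent[v] = u;
            }
    }
}

int main(void) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            hop_cost[i][j] = (i == j) ? 0 : INF;
    /* a small chain plus one extra link, purely for illustration */
    int links[][3] = {{0,1,1},{1,2,1},{2,3,2},{3,4,1},{4,5,1},{0,5,3}};
    for (int e = 0; e < 6; ++e) {
        hop_cost[links[e][0]][links[e][1]] = links[e][2];
        hop_cost[links[e][1]][links[e][0]] = links[e][2];
    }
    build_config_paths();
    return 0;
}
```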
---

### **Experiment result**

* The compiler achieves **89% of the performance** of manually tuned versions (80% in another section of the paper).
![image](https://hackmd.io/_uploads/ByPm18cFT.png)
* The design space explorer can save **42% power and area** over the initial hardware.
* The automated DSE generates hardware with a **mean of 1.3× perf²/mm²** compared with prior programmable accelerators across multiple sets of workloads.

#### Modularity

The baseline architecture is a 4x4 mesh of dedicated static PEs. It is extended with the three features below:
* four dedicated PEs are replaced with shared PEs.
* scheduling confers the ability to handle control-dependent data reuse.
* designs support vectorized indirect load/update.

![image](https://hackmd.io/_uploads/rkYjlL9Yp.png)

=> Across all workloads, the best design includes all features.

#### DSE result

Figure 14 shows how the area (left bar), power (right bar), and overall objective (color intensity) evolve during DSE. In these three DSE runs, the estimated performance improves while the explorer trims redundant resources.

![image](https://hackmd.io/_uploads/HJO_VI5F6.png)

=> Overall, the design space explorer saves a mean of 42% of the area and achieves a mean 12× objective improvement over the initial hardware.

#### Quality of the generated hardware

![image](https://hackmd.io/_uploads/r1qzXLqFT.png)

#### Schedule repair

Compares traditional scheduling (mapping the entire dataflow every iteration) against the schedule repair approach.

![image](https://hackmd.io/_uploads/S1YwXIcFT.png)

=> Overall, schedule repair leads to a 1.3× better objective for DSE.

#### Configuration path

The path generator introduces only a mean 1.4× overhead versus the ideal.

![image](https://hackmd.io/_uploads/B1lr3QUcY6.png)

---

### Conclusion

With DSAGen, an accelerator can be developed by composing simple spatial architecture primitives (the ADG) and can be generated through automated codesign. This work also suggests that the ISA does not need to be the hardware/software abstraction that designers rely on; a modular accelerator description can serve that purpose.

---

## **DSAGen build example (on Linux)**

Follow the steps below.

### Docker install

1. Uninstall old versions
```
$ for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
```
2. Set up Docker's apt repository
```
# Add Docker's official GPG key:
$ sudo apt-get update
$ sudo apt-get install ca-certificates curl
$ sudo install -m 0755 -d /etc/apt/keyrings
$ sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
$ sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
$ echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
$ sudo apt-get update
```
3. Install the latest Docker packages.
```
$ sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
4. Verify that the Docker Engine installation is successful by running the hello-world image.
```
$ sudo docker run hello-world
```

### zsh install

```
$ apt install zsh
```

### DSAGen install

```
$ git clone https://github.com/PolyArch/dsa-framework.git
$ cd dsa-framework
$ sudo docker build . -t polyarch/dsa-framework:latest
$ sudo docker run -tid -v /home/<your username>/dsa-share:/root/dsa-share --name=overgen polyarch/dsa-framework:latest /bin/bash
```

### DSAGen build (every time you want to build DSAGen)

```
# Skip this command if it's the first build
$ sudo docker start overgen

# Attach to the docker container
$ sudo docker attach overgen

# Switch from `bash` to `zsh`; DO NOT use zsh when starting the docker container
$ zsh

# Inside the docker container, enter the dsa-framework root folder
$ cd /root/dsa-framework

# Initialize all submodules; SKIP this step if you are using docker
$ ./scripts/init-submodules.sh

# Set up dsa-framework environment variables
$ source ./setup.sh

# Compile the entire dsa-framework
$ make all -j

# Please source chipyard/env.sh manually if this is a first-time build
$ source chipyard/env.sh
```

NOTE: If you just want to temporarily leave the container (detach, not close), press **<Ctrl-p><Ctrl-q>** to detach, **instead of typing exit**.

### Example

To verify that the repo was built successfully:
```
$ cd dsa-apps/sdk/compiled
$ ./run.sh ss-vecadd.out
```
After running the commands above:

![image](https://hackmd.io/_uploads/BJg3aawcp.png)

---

## Copy file from docker to host

```
# in local command window
$ sudo docker cp f58de14ec715:/root/dsa-framework/chipyard/sims/verilator/output/chipyard.TestHarness.MeshDSARocketConfig/ss-mv.log /home/annyen1113

# from server to local (in local command line)
scp -r annyen1113@140.113.193.228:/home/annyen1113/simulator-chipyard-MeshDSARocketConfig \Users\User\Desktop\nycu\dsagen\dsagen-test
```

---

## CGRA

The Coarse-Grained Reconfigurable Array (CGRA) is a class of spatial accelerators that are more flexible than ASICs and more efficient than FPGAs.

![image](https://hackmd.io/_uploads/Sk8x-ynYT.png)
*(The spatial domain is restricted by the number of processing elements, and the temporal domain is restricted by the number of configurations that can be stored on chip.)*

The CGRA compiler achieves the mapping by embracing the dataflow computing model.

![image](https://hackmd.io/_uploads/rkuZ2-2tT.png)

**Challenge:**
* Map the computations within the loop kernel onto the processing elements
* Route the data dependencies between processing elements

The micro-architecture of domain-specific spatial accelerators shares many similarities with CGRAs. However, their processing elements have limited and specific computation capability, and the interconnection network is designed to support specific dataflows rather than being fully reconfigurable.

---

### **CGRA Architecture**

#### **Basic CGRA architecture**

A CGRA consists of a 2D array of word-level processing elements and an on-chip network. A PE comprises a Functional Unit (FU), a Register File (RF), crossbar switches, and configuration memory. *(Each FU can have one or more ALUs (Arithmetic Logic Units) or other computation units. The on-chip data memory, usually Scratchpad Memory (SPM), feeds data to the whole PE array. Data transfer between the SPM and the off-chip memory takes place through Direct Memory Access (DMA).)*
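A minimal sketch of what one PE's per-cycle configuration might contain, based only on the components listed above (FU, RF, crossbar, configuration memory); the field names and widths are illustrative assumptions, not any specific CGRA's encoding.

```c
#include <stdint.h>

/* Illustrative only: one configuration-memory entry for a single PE.
 * Field names/widths are assumptions derived from the components above. */
typedef struct {
    uint8_t fu_opcode;      /* operation the FU performs this cycle (add, mul, ...) */
    uint8_t src0_sel;       /* crossbar select for operand 0 (neighbor port or RF)  */
    uint8_t src1_sel;       /* crossbar select for operand 1                        */
    uint8_t rf_read_idx;    /* register-file entry read this cycle                  */
    uint8_t rf_write_idx;   /* register-file entry written with the FU result       */
    uint8_t out_port_mask;  /* which output ports forward the result to neighbors   */
} pe_config_t;

/* The configuration memory of a PE holds one such entry per cycle of the
 * modulo-scheduled kernel, so its depth bounds the II it can support.     */
typedef struct {
    pe_config_t ctx[8];     /* e.g. up to 8 contexts stored on chip        */
} pe_config_mem_t;
```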
#### **Homogeneous and Heterogeneous CGRA**

* **Homogeneous CGRA:** all the PEs have the same functionality
* **Heterogeneous CGRA:** each PE can have different functionality

=> Most CGRAs provide heterogeneity in terms of memory access functionality, since the latency of data memory access is generally much longer than that of computation (e.g. an SPM restricts the number of parallel accesses through its limited number of ports).

#### **On-chip network**

The on-chip network connects the PEs to route data, and there are routing paths from the input ports to the output ports of each PE.

**Network types**
* Neighbor-to-Neighbor (N2N):
    * neighbors can be reached in one cycle
    * routing to distant PEs requires multiple cycles
    * limited interconnection, limited speedup, inefficient
    * does not scale well
* HyCUBE:
    * can send data to distant PEs in one cycle
    * increasing the number of hops per cycle reduces the maximum possible clock frequency
    * more efficient
    * does not scale well
* Plasticine:
    * tiled into small sub-CGRA blocks
    * scales well

#### **Memory hierarchy**

A CGRA contains two types of memory:
* data memory:
    * holds input, output, and intermediate data
    * mostly uses multi-bank scratchpad memory (SPM)
    * SPMs are fully software-controlled and more power-efficient than hardware-controlled caches
* configuration memory:
    * also known as context/instruction memory
    * holds the configuration directives for the FU, RF, and the switches
    * can be either centralized (global) or decentralized (local)

#### **Interface between CPU and CGRA**

The CPU is responsible for running the non-loop code, configuring the CGRA, and initiating the DMA data transfers from the main memory to the CGRA local memory.
* Tightly coupled:
    * the CGRA is part of the main CPU
    * cannot execute code in parallel on the CPU and the CGRA
    * less overhead in data transfer
* Loosely coupled:
    * the CGRA is attached as an independent accelerator
    * more flexible in the design phase
    * the CPU and the CGRA can execute code in parallel
    * more overhead in data transfer

---

### Compilation for CGRAs

The goal of compilation is to map the loop onto the CGRA so as to maximize throughput.

### Modulo Scheduling and Modulo Routing Resource Graph (MRRG)

* Modulo Scheduling
    * a software pipelining technique to exploit the instruction-level parallelism (ILP) among loop iterations
    * each node of the DFG represents an operation, and the edges represent the dependencies between the nodes
    ![image](https://hackmd.io/_uploads/S1hWZgaYa.png)
    * Mapping of the DFG onto the CGRA includes placement and routing:
        * placement: decide which PE will execute each operation
        * routing: make sure that the data can be routed to the dependent operations in a timely manner
    * the mapping has three parts:
        * prologue: executed only once at the start of the loop execution
        * steady-state kernel: repeated, and includes all the operations from one or more iterations
        * epilogue: executed only once at the end of the loop execution
    * The schedule length of the kernel is called the Initiation Interval (II), indicating the number of cycles between the initiation of consecutive loop iterations
    * The Minimum Initiation Interval (MII) is the maximum of the resource-minimal and the recurrence-minimal II
* Modulo Routing Resource Graph (MRRG)
    * represents the resources and the routing for a time-extended CGRA
    * nodes represent the ports of the register file, the on-chip network, the ALU inside a PE, etc., and the edges are the connections among the nodes
    ![image](https://hackmd.io/_uploads/S1jkKgTYa.png)
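The MII bound mentioned above is conventionally computed as follows (standard modulo-scheduling definitions, assuming a homogeneous array of #PEs identical processing elements; not specific to any one mapper):

$$
\mathrm{MII} = \max(\mathrm{ResMII},\ \mathrm{RecMII}),\qquad
\mathrm{ResMII} = \left\lceil \frac{\#\mathrm{ops}}{\#\mathrm{PEs}} \right\rceil,\qquad
\mathrm{RecMII} = \max_{\text{cycle } c}\ \left\lceil \frac{\mathrm{delay}(c)}{\mathrm{distance}(c)} \right\rceil
$$

where delay(c) sums the operation latencies around a recurrence cycle c of the DFG and distance(c) sums its loop-carried dependence distances.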
### CGRA Mapping Approaches

#### Heuristic Approaches

* Simulated Annealing
    * one of the most popular meta-heuristic methods; treats the architectural elements as black boxes
    * iteratively reduces resource overuse and tries to come up with a legal schedule
    * designed for CGRAs with rotating register files
    * has a long convergence time, especially for large dataflow graphs
* Edge-Centric Modulo Scheduling (EMS)

    The slots with dashed circles are failed attempts.
    ![image](https://hackmd.io/_uploads/SJ1BqV6Y6.png)
    * Node-Centric modulo scheduling:
        * the mapper places the consumer C by visiting all the empty slots
        * finds a solution more slowly
    * Edge-Centric modulo scheduling:
        * the scheduler tries to route the edge from P1 to C first
        * when failing to route the edge from P2 to C, routing resumes from slot (PE2, 1), not from P1
        * finds a solution faster and achieves better-quality mappings
        * can fall into local minima
* Schedule, Place, and Route (SPR)
    * combines a VLIW-style scheduler with FPGA placement and routing algorithms
    * uses Iterative Modulo Scheduling (IMS) and Simulated Annealing placement with a cooling schedule
* List Scheduling
    * the operation with the highest priority is mapped onto the next available PE in the PE list
    * cannot exploit inter-iteration parallelism
* Evolutionary Algorithm
    * limited to small loop kernels with few DFG nodes and limited reconfigurability
    * uses an initial solution from list scheduling as a starting point
    * iteratively improves the solution until it finds a solution with optimal latency
* Machine Learning
    * RLMap proposed a reinforcement-learning-based mapping approach
    * DFGNet proposed a CNN-based mapping approach, where the mapping problem is translated into an image-based classification problem
    * LISA uses a Graph Neural Network (GNN) to analyze the DFG and derive labels that describe how the DFG should be mapped

#### Mathematical Optimization Techniques

* Integer Linear Programming (ILP)
    * the formulation consists of all the requirements and constraints that must be satisfied by a valid schedule
    * highly portable
    * practical only for very simple loop kernels
* Boolean Satisfiability (SAT) solvers
    * used to automatically compile dataflow programs onto a CGRA-based dataflow processor (DPU) containing 16,000 processing elements
    * require custom algorithms

#### Graph Theory Inspired Techniques

* Subgraph Homeomorphism Based Techniques
    ![image](https://hackmd.io/_uploads/SJkTjP6K6.png)
    ![image](https://hackmd.io/_uploads/BJTZldTtp.png)
    *(Figure 1.11b: the DFG is homeomorphic to the subgraph of the MRRG shown in Figure 1.11c, and thus the subgraph represents a valid mapping.)*
    * employ a greedy algorithm for the transformation from the dependency graph to the hardware resource graph
    * may waste precious routing resources
* Graph Epimorphism Based Technique
    * based on graph homomorphism:
    ![image](https://hackmd.io/_uploads/S1SnWdTYT.png)
    * EPIMap can generate better scheduling results than EMS with a similar compilation time
* Graph Minor Based Technique
    * inspired by the tree search method
    * the number of routing nodes is reduced compared to the subgraph homeomorphism approach
* Compatibility Graph Based Technique
    * partitions the problem into a scheduling problem and an integrated placement & register allocation problem
* Graph Drawing Based Technique
    * starts from an initial drawing where all DFG nodes reside in the same group
    * the split process continues until each group contains only one node
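To make the simulated-annealing idea above concrete, here is a minimal placement sketch in C: it randomly perturbs the placement of DFG operations onto (PE, cycle) slots of a toy 2×2 array with II = 2, and accepts worse moves with a temperature-dependent probability. The DFG edges, grid size, cost weights, and cooling schedule are all made up for illustration, and unlike a real mapper it ignores routing-resource and operand-timing constraints.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy DFG and CGRA sizes -- all illustrative. */
#define N_OPS   6                 /* operations in the DFG                  */
#define N_EDGES 5                 /* data dependences                       */
#define ROWS    2
#define COLS    2
#define II      2                 /* modulo-scheduled contexts per PE       */
#define N_SLOTS (ROWS * COLS * II)

static const int edge_src[N_EDGES] = {0, 1, 2, 3, 4};
static const int edge_dst[N_EDGES] = {2, 2, 3, 4, 5};

/* Routing cost of an edge: Manhattan distance between the two PEs. */
static int route_cost(int slot_a, int slot_b) {
    int pe_a = slot_a / II, pe_b = slot_b / II;
    return abs(pe_a / COLS - pe_b / COLS) + abs(pe_a % COLS - pe_b % COLS);
}

/* Total cost: routing distance plus a heavy penalty for overusing a slot. */
static int cost(const int place[N_OPS]) {
    int c = 0, use[N_SLOTS] = {0};
    for (int e = 0; e < N_EDGES; ++e)
        c += route_cost(place[edge_src[e]], place[edge_dst[e]]);
    for (int i = 0; i < N_OPS; ++i) use[place[i]]++;
    for (int s = 0; s < N_SLOTS; ++s)
        if (use[s] > 1) c += 100 * (use[s] - 1);
    return c;
}

int main(void) {
    int place[N_OPS];
    for (int i = 0; i < N_OPS; ++i) place[i] = rand() % N_SLOTS;
    int cur = cost(place);

    /* Classic annealing loop: random moves, probabilistic acceptance,
     * geometric cooling schedule.                                        */
    for (double T = 10.0; T > 0.01; T *= 0.95) {
        for (int iter = 0; iter < 200; ++iter) {
            int op = rand() % N_OPS, old = place[op];
            place[op] = rand() % N_SLOTS;
            int next = cost(place);
            int delta = next - cur;
            if (delta <= 0 || exp(-delta / T) > (double)rand() / RAND_MAX)
                cur = next;               /* accept the move              */
            else
                place[op] = old;          /* reject: undo                 */
        }
    }
    printf("final placement cost %d (routing distance + conflict penalty)\n", cur);
    for (int i = 0; i < N_OPS; ++i)
        printf("op %d -> PE %d, cycle %d\n", i, place[i] / II, place[i] % II);
    return 0;
}
```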
### Other Compilation-related Issues

#### Challenges Related to Data Access

Even when CGRA mapping achieves high compute utilization under the assumption of ideal memory, memory limitations can cause overall performance degradation in a realistic setting.

* Memory Aware Compilation
    * An effective memory-aware strategy should:
        * place the data without under-utilizing the memory banks
        * consider the limited connections between the PE array and the memory banks
        * prevent memory access conflicts
        * maximize data reuse by avoiding data duplication
    * Situations where memory access conflicts happen:
        * naive data placement in multi-bank memory when supporting kernels with multiple accesses to the same data array
        * the number of data accesses per bank per cycle is higher than the number of memory ports of one memory bank
    * Memory access conflict solutions:
        * the compiler partitions the data across the memory banks
        * DAMQ (Dynamically Allocated, Multi-Queue buffer) uses a request queue and an arbiter
* Memory Address Generation
    * address generation consumes a substantial number of instructions
    * solution: offload the address generation to specialized address generation units

#### Nested Loop Mapping

The previous subsections consider only a single innermost loop.

* Mapping Approaches
    * Polyhedral Model Based
        * transforms two-dimensional nested loops into a new iteration domain with a parallelogram shape
    * Loop Flattening Based
        * converts imperfect loop nests into a single nested loop
        * can be executed in a single invocation
    * Systolic Mapping Based
        * uses an intermediate abstraction layer to guide the hierarchical mapping

#### Application-level Mapping

* Synchronous DataFlow (SDF)
    * An SDF graph has several actors, and data is encapsulated in an object called a token
    * Each actor consumes data tokens, produces tokens, or both in each invocation

![image](https://hackmd.io/_uploads/SyovZtaY6.png)
*(Each invocation of B consumes 10 tokens from A and produces 20 tokens.)*

#### Handling Loops with Control Flow

* Predication effectively replaces control-flow instructions with dataflow instructions (a small if-conversion example follows at the end of this note)
* Resource underutilization occurs due to the static allocation of duplicate resources which are unused at runtime

#### Scalable CGRA Mapping

* Most CGRA mapping approaches are not scalable
* Operation placement and routing become increasingly difficult in larger CGRAs
    * limited routing resources
    * complicated data dependencies in bigger kernels

---
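As a small illustration of the predication idea from "Handling Loops with Control Flow" above: the branch below is if-converted into a select, so both sides become ordinary dataflow operations and the control dependence turns into a data dependence (the kernel itself is a made-up example).

```c
/* Original loop body: the store is control-dependent on the branch. */
void relu_branch(const int *x, int *y, int n) {
    for (int i = 0; i < n; ++i) {
        if (x[i] > 0) y[i] = x[i];
        else          y[i] = 0;
    }
}

/* Predicated / if-converted form: the comparison result is just another
 * data value feeding a select, so every iteration executes the same
 * straight-line dataflow graph -- at the cost of computing both sides.  */
void relu_select(const int *x, int *y, int n) {
    for (int i = 0; i < n; ++i) {
        int p = (x[i] > 0);          /* predicate as data                 */
        y[i] = p ? x[i] : 0;         /* select replaces the branch        */
    }
}
```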