# gem5-SALAM: A System Architecture for LLVM-based Accelerator Modeling

###### tags: `Accelerators`

##### Link: [Paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9251937)
##### Writer: `kksweet8845`

## 1. Abstract

With the prevalence of hardware accelerators as an integral part of modern systems on chip (SoCs), the ability to quickly and accurately model accelerators within the system they operate in is critical. This paper presents gem5-SALAM as a novel system architecture for LLVM-based modeling and simulation of custom hardware accelerators integrated into the gem5 framework.

- Because of the **end of Dennard scaling** and the **slowdown in Moore's Law**, heterogeneous SoCs with many application/domain-specific accelerators have emerged as the major chip design paradigm to meet the never-ending demand for high-performance, power-efficient computing.

![](https://i.imgur.com/pA50EWf.png)

## 2. Background and Motivation

### 1. Idealized environment

One of the biggest challenges in designing and integrating new accelerators into modern SoCs is simulating and exploring design parameters.

- **Over-tuning** of design parameters based on **idealized** assumptions about data availability and other system overheads.

### 2. gem5 is open source

gem5 is open source, unlike the EDA tools.

### 3. gem5-Aladdin is not accurate and not fully supported

gem5-Aladdin is limited in how accurately it can simulate:

- Different datasets with different functional unit allocations

  ![](https://i.imgur.com/enYWDIO.png)
- Different cache line sizes with different functional units

  ![](https://i.imgur.com/Xudc8e5.png)

## 3. Contribution

1. **Accurate** modeling of datapath structure, area, and static leakage power based on analysis of algorithm-intrinsic characteristics exposed by LLVM.
2. **Cycle-accurate modeling** of dynamic power and timing through gem5-SALAM’s dynamic execute-in-execute LLVM-based runtime engine.
3. **Separation of datapath and memory infrastructure** to enable independent tuning and design space exploration.
4. **Flexible system integration** that directly exposes accelerator models to other system elements within gem5, to enable complex inter-accelerator communication and synchronization using pre-existing gem5 simulation constructs.
5. **General purpose C++/Python API** for accelerator modeling that decouples computation from system communication, and enables customization and specialization to match user modeling needs.

## 4. GEM5-SALAM

### Static setup and initialization

![](https://i.imgur.com/DupL7te.png)

1. IR Generation
    - Write the application code
    - LLVM performs optimizations (loop unrolling and vectorization)
    - Aids static elaboration of the control and dataflow graph (CDFG).
2. Static Elaboration
    - The IR and device-specific configuration files are used to extract the static CDFG, and the IR instructions are linked to virtual hardware functional units and registers.

### Dynamic LLVM Runtime Engine

![](https://i.imgur.com/JXp1tSA.png)

- Reservation Queue
- Compute Queue
- Memory Queue

1. Reservation Queue

    > A dynamic instruction map is generated at the granularity of basic blocks from the statically elaborated CDFG, and the contents of the first basic block of the application are imported. As each instruction (operation) is added to the queue, dynamic dependencies are generated by searching upward in the reservation queue as well as the in-flight compute and memory queues. Additionally, the execution of previous instances of the same instruction and all instructions that read from its destination register are checked to be in-flight or completed.
    > [name=Author]

2. Compute Queue

    > Compute instructions are all instructions that can be resolved by the simulator using only values stored in local registers. These instructions are transferred to the “Compute Queue” where their functional units are invoked. For simulation purposes, the computation is done immediately, but the commit of the result can be delayed by a number of operational cycles that is uniquely configurable for each function unit type. The dynamic energy of each active instruction is also calculated at this point to estimate the power of the compute data-path. Once an instruction (operation) is ready for commit, the instruction is removed from the queue, the hardware unit is released, and the Reservation Queue is signaled to resolve dependencies on the committed instruction.
    > [name=Author]

3. Memory Queue

    > Memory instructions will be transferred to the Read/Write queues shown in the bottom-right corner. These queues forward the memory requests to the connected communications interface, described in Sec. III-D, which is responsible for interfacing with gem5’s other system elements. The memory queues operate asynchronously from other elements of the runtime engine in order to handle memory requests that complete in between the compute cycles of the runtime engine. When a memory request is ready to commit, the request is removed from the queue and the “Reservation Queue” is signaled to resolve dependencies on the committed request.
    > [name=Author]

### Metrics Estimation

1. Power and Area
    - The power estimation model utilizes parameters defined within the **hardware profile** and the **device config**

    ![](https://i.imgur.com/iaRrX4e.png)
2. Performance and Occupancy
    - **Analyze the instructions in the IR** to estimate the cycles of each operation.
    - During the dynamic runtime simulation, gem5-SALAM logs which instructions are **scheduled or in-flight for each cycle**. This additional data point, combined with configurable hardware resources, allows for a fine-grained analysis and exploration tool for exploring occupancy levels within the system.

### gem5 Integration and Scalable Full System Simulation

1. Compute Unit and Communications Interface
    - Compute unit:
    - Communication Interface
        - Memory Mapped Registers (MMRs)
        - **CommInterface** model: all customized interfaces inherit from this class.
        - **User-configurable memory controller**
            - Enables parallel memory accesses

    ![](https://i.imgur.com/U4Q6INZ.png)
    ![](https://i.imgur.com/BUluGFS.png)
2. Multi-ACC simulation
    - Can have a pool of clusters
    - Shared DMA and scratchpad
    - Local crossbar and global crossbar
3. Control and Synchronization
    - Provides low-level control (MMRs)
    - Controlled by the host CPU via MMRs

### Simulation Setup and Configuration

1. Single Accelerator Configuration
    - Each accelerator must be configured individually

    ![](https://i.imgur.com/12bTi5o.png)
2. Accelerators Cluster Configuration

    ![](https://i.imgur.com/zvWZJgF.png)

## 5. Simulation Results and Validation

### Metrics Validation

The flow below is used for validation.

![](https://i.imgur.com/g9k52qU.png)
![](https://i.imgur.com/tQKWoNU.png)
![](https://i.imgur.com/Jw1ZKuI.png)

## 6. Simulation Time

![](https://i.imgur.com/sEe9vhB.png)

### Left

![](https://i.imgur.com/mQd8bYG.png)

### Middle

![](https://i.imgur.com/lXn6LDg.png)

### Right

![](https://i.imgur.com/iJZqX55.png)

## 7. Design Space Exploration

### Case Study: Matrix Multiply

- Higher consumption when functional units are over-duplicated

![](https://i.imgur.com/VNgkPnl.png)
![](https://i.imgur.com/TD2K2ae.png)

### Co-Design with gem5-SALAM

![](https://i.imgur.com/58iGiWK.png)
![](https://i.imgur.com/W90gcjD.png)
![](https://i.imgur.com/OgupzeX.png)
![](https://i.imgur.com/925QjRN.png)

### Multi Accelerator Design Space Exploration

![](https://i.imgur.com/MrfnMgw.png)

## 8. Conclusions

Good

## 9. Reference

1. https://breagen.github.io/MachSuite/
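
The reservation, compute, and memory queue flow described under "Dynamic LLVM Runtime Engine" in Section 4 can be sketched as a toy scheduler. This is only a minimal illustration, not gem5-SALAM's implementation: `Op`, `simulate`, `example_ops`, and every latency value are hypothetical names and numbers chosen for the example. It also assumes unlimited functional units and a fixed memory latency, whereas the real engine releases hardware units on commit and resolves memory requests through gem5's memory system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One IR-level operation (hypothetical model, not a gem5-SALAM class)."""
    name: str
    kind: str          # "compute" or "memory"
    latency: int       # assumed cycles between issue and commit
    deps: tuple = ()   # names of ops whose results this op reads


def simulate(ops, max_cycles=10_000):
    """Return {op name: commit cycle} for a toy three-queue model."""
    reservation = list(ops)                    # ops waiting on dependencies
    in_flight = {"compute": [], "memory": []}  # lists of (op, commit_cycle)
    committed = {}
    cycle = 0
    while reservation or in_flight["compute"] or in_flight["memory"]:
        cycle += 1
        if cycle > max_cycles:
            raise RuntimeError("no progress: unresolved dependency?")
        # Commit stage: retire ops whose latency has elapsed. Committing
        # is what lets dependent ops leave the reservation queue.
        for kind in in_flight:
            remaining = []
            for op, done in in_flight[kind]:
                if done <= cycle:
                    committed[op.name] = cycle
                else:
                    remaining.append((op, done))
            in_flight[kind] = remaining
        # Issue stage: move every op whose dependencies have all committed
        # into the compute or memory queue, according to its kind.
        waiting = []
        for op in reservation:
            if all(dep in committed for dep in op.deps):
                in_flight[op.kind].append((op, cycle + op.latency))
            else:
                waiting.append(op)
        reservation = waiting
    return committed


# Example: c = a * b + 1, with loads and a store (latencies are assumptions).
example_ops = [
    Op("load_a", "memory", 3),
    Op("load_b", "memory", 3),
    Op("mul", "compute", 2, deps=("load_a", "load_b")),
    Op("add", "compute", 1, deps=("mul",)),
    Op("store", "memory", 3, deps=("add",)),
]
print(simulate(example_ops))
```

With these assumed latencies, both loads commit at cycle 4, `mul` at cycle 6, `add` at cycle 7, and `store` at cycle 10. The two independent loads overlap in the memory queue, while the dependent compute operations wait in the reservation queue until their producers commit.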