# XCache: A Modular Architecture for Domain-Specific Caches
###### tags: `Accelerators`
## Abstract
With Dennard scaling ending, architects are turning to domain-specific accelerators (DSAs). The authors propose a cache for DSAs, which needs to support custom tags, data-structure walks, multiple refills, and preloading.
Three key ideas in the proposed X-Cache:
1. DSA-specific tags (meta-tags)
- Eliminate the overhead of walking and translating metadata into global addresses.
2. DSA-programmable walkers (X-Actions)
- A programmable microcode engine that can efficiently realize the data orchestration.
3. DSA-portable controller (X-Routers)
- A lightweight controller that minimizes controller occupancy.
---
- Performance gain
X-Cache outperforms address-based caches by 1.7× and does not degrade the performance of any DSA.
- Energy saving
Meta-tags save 26%-79% of energy.
## DSA applications
1. Non-affine data structures:
- Sparse data structures
- Indirect indexes
- Hash Tables
2. Dynamic Accesses:
- Indirectly addressed
- Irregular non-linear access
3. Walkers:
- Since these data structures are non-linear, a single miss can trigger multiple nested preloads.
4. Explicit orchestration:
- DSAs need explicit cache replacement and refills coordinated with the compute datapaths.
## Traditional Caches on DSA Applications
1. Metadata translation to address (waste of energy)
A traditional address-based cache requires explicit logic to transform row/column indices into a global address before every access.
2. Address tags (high load-to-use latency)
Conventional caches are tagged by block addresses, so the address corresponding to the row/column indices must be obtained first, which can cause high load-to-use latency.
3. Walker logic (high NRE (non-recurring engineering))
Hard-wired walker logic must be redesigned for each DSA, which is costly one-off engineering.
## X-Cache
1. DSA-specific tags: X-Cache permits any combination of metadata fields to serve as cache tags.
2. DSA-programmable walking (X-Routines): On a miss, X-Cache implicitly walks the data structure and finds the relevant data.
3. DSA-agnostic controller architecture: X-Cache’s controller implements the walkers and data orchestration as coroutines.
## Motivation
### Cache Walker
A cache walker is walking logic specific to the DSA; that is, the walker knows how to fetch the DSA's data from memory on a miss. Take the inner product as an example.
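As a rough illustration (a Python sketch with hypothetical names, not the paper's hardware), an inner-product-style walker over a CSR matrix might look like this: the cache is tagged by (row, col) meta-tags, and on a miss the walker, rather than the datapath, traverses the CSR metadata and refills the entry.
```python
# Hypothetical sketch: an inner-product-style walker over a CSR matrix.
# The cache is tagged by (row, col) meta-tags; on a miss the walker, not the
# datapath, traverses the CSR metadata and refills the entry.

def make_csr_walker(row_ptr, col_idx, vals, cache):
    """cache: dict mapping (row, col) meta-tags to values (a stand-in for the data RAM)."""
    def on_miss(row, col):
        # One miss triggers several dependent metadata reads (the "walk").
        for k in range(row_ptr[row], row_ptr[row + 1]):
            if col_idx[k] == col:
                cache[(row, col)] = vals[k]   # refill under the meta-tag
                return vals[k]
        cache[(row, col)] = 0.0               # absent element in a sparse matrix
        return 0.0
    return on_miss

# Usage: a tiny 2x3 CSR matrix [[5, 0, 3], [0, 7, 0]]
cache = {}
walker = make_csr_walker([0, 2, 3], [0, 2, 1], [5.0, 3.0, 7.0], cache)
print(walker(0, 2))   # 3.0, now cached under meta-tag (0, 2)
```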

### Why not Scratchpad?
>We first consider Scratchpad+DMA. There has been a long history of work on decoupled engines for shuttling dense tensors from DRAM onto on-chip SRAMs. This includes scatter/gather engines [10, 13, 17, 31], memory controllers [4], and tiled DMAs [1, 6, 28]. All designs introduce additional address spaces, either local [17, 28], carved from the physical [31] or virtual [4] addresses. Thus, DSAs that include unordered and dynamic accesses would need to implement address-translation from meta-tags to these local addresses [35, 37]. Further, they can only support access patterns driven by affine loops and strides. We focus on algorithms with indirect indexes, sparse tensors, and conditional accesses.
>
>Scratchpad-AE (programmable access engine): There have also been works on augmenting scratchpads in GPUs [21] and FPGAs [5, 6]. Scratchpad-AE targets DSAs, which can split into coarse-grain access-execute regions. The access patterns have to be regular, i.e., the order of accesses has to be known statically upfront and cannot be conditional. They are also heavyweight since the access engine is mapped to an FPGA kernel or GPU threads. They target tile-granularity reuse (e.g., dense GEMM) and cannot support fine-grain misses and refills. Since the access engine iterates over elements in a specific order, they cannot accommodate dynamic input-dependent reuse behaviour (e.g., indirect indexes).
>[name=Author]
#### Summary
- Scratchpad + DMA
- Affine loop, fixed stride
- Scratchpad AE (programmable access engine)
- The access pattern has to be regular
- Cannot be conditional
- Pattern should be known
- The authors want indirect accesses, sparse tensors, and conditional accesses
#### X-Cache vs. Others


1. Implicit vs. explicit: The walking and orchestration tasks are broken down into sub-tasks (e.g., miss refill, eviction). Each sub-task can be **implicitly triggered by memory accesses or explicitly invoked by the datapath** (see the sketch after this list).
2. Coupled vs. decoupled: A coupled design lacks domain awareness and refills only a **single data element**. Decoupled models have **domain awareness and preload multiple data elements**.
3. Dynamic vs. static: In DSAs with **static access patterns**, the data access order is fixed and the addresses can be calculated using affine functions. Emerging DSAs exhibit **dynamic, data-dependent access patterns**; the order of global addresses accessed is determined by the input data.
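To make the first distinction concrete, here is a minimal sketch (hypothetical Python, not the paper's interface) in which the same refill sub-task either fires implicitly on a tag miss or is invoked explicitly by the datapath as a preload.
```python
# Hypothetical sketch of implicit vs. explicit triggering of the same sub-task.
class TinyCache:
    def __init__(self, refill_fn):
        self.data = {}             # meta-tag -> value
        self.refill_fn = refill_fn

    def load(self, tag):
        if tag in self.data:       # hit
            return self.data[tag]
        # Implicit: the miss itself triggers the refill sub-task.
        self.data[tag] = self.refill_fn(tag)
        return self.data[tag]

    def preload(self, tag):
        # Explicit: the datapath invokes the same sub-task ahead of use.
        if tag not in self.data:
            self.data[tag] = self.refill_fn(tag)

backing = {("row", i): i * i for i in range(8)}
c = TinyCache(lambda tag: backing[tag])
c.preload(("row", 3))          # explicit, datapath-initiated
print(c.load(("row", 3)))      # hit: 9
print(c.load(("row", 5)))      # miss -> implicit refill: 25
```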
### Deconstructing Caches
Which cache design to use depends on the following factors.
#### 1. Address Tags vs. Meta-Tags
- **Address tags** require translating the metadata into an address first (and a virtual address into a physical one, which may need several page-table walks).
- **Meta-tags**, like a TLB, eliminate this translation path: the metadata fields themselves are matched against the tag, so the data can be accessed directly (a sketch follows below).
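A rough sketch of the difference (hypothetical Python names): with address tags, the row/column indices must first be flattened into a global address (and, in a real system, translated to a physical one) before the tag compare, whereas with meta-tags the indices themselves are the tag.
```python
# Hypothetical sketch contrasting the two lookup paths.

ELEM_BYTES, NUM_COLS, BASE_ADDR = 4, 1024, 0x8000_0000

def address_tag_lookup(tags, row, col):
    # Address-based cache: metadata must first be flattened into an address
    # (and, in a real system, translated virtual -> physical) before matching.
    addr = BASE_ADDR + (row * NUM_COLS + col) * ELEM_BYTES
    return addr in tags

def meta_tag_lookup(tags, row, col):
    # Meta-tagged cache: the (row, col) pair is the tag; no translation path.
    return (row, col) in tags

addr_tags = {BASE_ADDR + (2 * NUM_COLS + 7) * ELEM_BYTES}
meta_tags = {(2, 7)}
print(address_tag_lookup(addr_tags, 2, 7))  # True, after address computation
print(meta_tag_lookup(meta_tags, 2, 7))     # True, direct compare on metadata
```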

#### 2. Fixed or Programmable Walker
A fixed walker, as in an address-based cache, performs a single hard-wired address-matching procedure. A programmable walker can instead fetch data with different procedures, such as the inner-product and outer-product walkers discussed earlier, and can therefore handle the different access behaviours of different kinds of DSAs.
#### 3. Threads vs. Coroutines
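X-Cache implements walkers and orchestration as coroutines and aims for minimal controller occupancy (see the controller description below); the intuition is that a walker waiting on a memory response yields the engine instead of holding a hardware thread context. A minimal Python-generator analogy (hypothetical, greatly simplified):
```python
# Hypothetical coroutine-style walker: each memory access is a `yield`,
# so the controller can interleave many in-flight walks on one engine
# instead of dedicating a thread per outstanding miss.

def walker(tag):
    meta = yield ("read", f"meta[{tag}]")   # suspend until memory responds
    data = yield ("read", f"data[{meta}]")  # dependent access, suspend again
    return data

def controller(requests):
    # Round-robin over suspended walkers; no per-walk thread is held.
    live = {t: walker(t) for t in requests}
    for co in live.values():
        next(co)                            # start each walk, runs to first yield
    results = {}
    while live:
        for tag, co in list(live.items()):
            try:
                co.send(f"resp({tag})")     # a fake memory response resumes it
            except StopIteration as done:
                results[tag] = done.value
                del live[tag]
    return results

print(controller(["A", "B"]))
```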


## X-Cache Architecture

1. Meta-tags: The architect defines the metadata fields that the controller writes explicitly into the tag entry when a refill completes. These fields are used for tag matches and for determining hits.
2. Meta-events & meta-states: In X-Cache, states represent the status of blocks within the walker; pairing the current state with an incoming event sequences through the walking logic. The current state of an entry is therefore maintained alongside its meta-tags.
3. Routing table: Similar to cache-coherence protocols [32], X-Cache uses a table-based specification. The table encodes the walker logic’s states, events, and transitions; each cell is a pointer to a routine in the microcode RAM (a sketch follows after this list).
4. Routines and μ-code RAM: X-Cache compiles the procedures that implement the walking and orchestration down to a microcode binary and stores it in the routine μ-code RAM.
5. Actions: the individual operations performed when a routine executes.
6. Data RAMs: The data RAM is physically banked based on the number of words supplied to the compute datapaths. Logically, the data RAM is organized as fixed-granularity sectors.
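A rough sketch of the table-based dispatch described in items 2-4 (hypothetical state, event, and routine names, not the paper's encoding): each (meta-state, meta-event) cell points to a routine, and running the routine updates the entry's meta-state.
```python
# Hypothetical sketch of a (meta-state, meta-event) routing table.
# Each cell names a routine; the controller looks up the cell and runs it.

def do_refill(entry):
    entry["data"] = "filled"       # stand-in for issuing the refill walk
    return "VALID"

def do_evict(entry):
    entry["data"] = None
    return "INVALID"

def do_hit(entry):
    return "VALID"                 # serve data, state unchanged

ROUTING_TABLE = {
    ("INVALID", "LOAD"):  do_refill,
    ("VALID",   "LOAD"):  do_hit,
    ("VALID",   "EVICT"): do_evict,
}

def handle(entry, event):
    routine = ROUTING_TABLE[(entry["state"], event)]
    entry["state"] = routine(entry)     # routine returns the next meta-state

entry = {"state": "INVALID", "data": None}
handle(entry, "LOAD")    # miss -> refill routine -> VALID
handle(entry, "LOAD")    # hit
handle(entry, "EVICT")   # back to INVALID
print(entry)
```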
### Execution Model: Coroutines + Decoupled
#### Programmer's View

#### Execution Model

### DSA Walker Example
This section presents the details of the walkers for two DSA families: i) Widx [18], a DSA for relational-database hash indexes, and ii) SpArch [37], a DSA for outer-product sparse GEMM.
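As a sketch of the first family (hypothetical and greatly simplified, not Widx's actual microarchitecture): a hash-index walker hashes the key, then follows the bucket chain; each hop is another dependent memory access the walker performs on the DSA's behalf, and the result is installed under the key as its meta-tag.
```python
# Hypothetical sketch of a hash-index walker in the spirit of Widx:
# hash the key, then follow the bucket chain (dependent accesses) until
# the matching entry is found, and install it under the key meta-tag.

NUM_BUCKETS = 4

def hash_index_walker(buckets, cache, key):
    node = buckets[hash(key) % NUM_BUCKETS]   # head of the chain
    while node is not None:                   # each hop is one memory access
        k, value, nxt = node
        if k == key:
            cache[key] = value                # refill under the key meta-tag
            return value
        node = nxt
    return None                               # key not present

# Tiny chained hash table: bucket -> linked list of (key, value, next)
buckets = [None] * NUM_BUCKETS
def insert(key, value):
    b = hash(key) % NUM_BUCKETS
    buckets[b] = (key, value, buckets[b])

insert("order_17", 3.5)
insert("order_42", 9.9)
cache = {}
print(hash_index_walker(buckets, cache, "order_42"))  # 9.9, now cached
```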

## X-Cache Hierarchies
1. Multilevel X-Cache (MX)
2. X-Cache with Address cache (MXA)
3. X-Cache with streaming (MXS).

## X-Cache Setup

## Evaluation
The authors evaluate X-Cache with the following accelerators:
1. SpArch
2. GraphPulse
3. Widx
4. DASX
5. Gamma