# RecSSD: Near Data Processing for Solid State Drive Based Recommendation Inference

###### tags: `Accelerator`

root: [DLRM](https://hackmd.io/@accelerator/BJX_-wqk5)
paper: [link](https://arxiv.org/pdf/2102.00075.pdf)
key words: NDP, FTL

### Summary

RecSSD is the first NDP-based SSD system designed to accelerate embedding table operations. It is implemented on a real, open-source **Cosmos+ OpenSSD system** together with the **Micron UNVMe driver library**, requiring no hardware changes. RecSSD is evaluated with a diverse set of eight industry-representative recommendation models across various use cases (e.g., social media, e-commerce, entertainment), provided by **DeepRecInfra**.

### Motivation

* Large, fast DRAM-based memories levy high infrastructure costs. //high cost
* Conventional SSD-based storage solutions offer larger capacity, but have worse read latency and bandwidth, degrading inference performance. //low performance

### Method

#### Brief Summary

* Uses near data processing (NDP) by leveraging the existing compute capability of the SSD
* Offloads the entire embedding table operation to the SSD
* Implements embedding operations by moving computation into the SSD FTL
* Exploits the locality patterns of recommendation inference queries
* Modifies the UNVMe driver stack to include two additional commands [Section 4]

#### Architecture

* ![](https://i.imgur.com/wD39Ks2.png =400x)
* The FTL schedules and controls the Flash controllers, which are organized per channel; they provide specific commands to all the Flash DIMs (chips) on a channel and DMA capabilities across the multiple channels.
* ![](https://i.imgur.com/3VRgEik.png)

#### Mapping Embedding Operations to the FTL

* Data structures
    * To support NDP SLS, RecSSD adds two major system components: a **pending-SLS-request buffer** and a **specialized embedding cache**.
    * Each entry in the pending-SLS-request buffer contains five major elements:
        * Input Config: buffer space to store SLS configuration data passed from the host
        * Status: various data structures storing the reformatted input configuration and counters to track completion status
        * Pending Flash Page Requests: a queue of pending Flash page read requests to be submitted to the low-level page request queues
        * Pending Host Page Requests: a queue of pending result logical block requests to be serviced to the host upon completion
        * Result Scratchpad: buffer space for the result pages
* Initiating an embedding request
    1. **(step 1a)** When the FTL receives an *SLS Request* (a write-like NVMe command), it allocates an entry in the *Pending SLS Buffer* and triggers a DMA of the configuration data from the host via the NVMe host controller.
    2. **(step 1b)** When the FTL receives the corresponding SLS read-like NVMe command (asynchronous with steps 2–5), it searches for the associated SLS request entry and populates the *Pending Host Page Request Queue*.
    3. The FTL then processes the configuration data, initializes the completion status counters, and populates the custom data structures. `//Status`
        * This processing step (1) computes which *Flash Pages* must be accessed and (2) separates the input embeddings by Flash Page, so that the per-page processing computation can easily access its own input embeddings. `//Status`
        * **(step 2a)** During this scan of the input data, a fast path may also check for the availability of input embeddings in the *Embedding Cache*, avoiding Flash page read requests.
        * **(step 2b)** Otherwise, the requests are placed in the *Pending Flash Page Requests Queue*.
        * Upon completion of the configuration processing, the request entry is marked as configured.
        * **(step 3b)** If a page already exists in the page cache, it is processed directly.
        * **(step 3a)** Otherwise, *Pending Flash Page Requests* are pulled and fed into the *Low-Level Page Request Queues*.
* Issuing individual Flash requests
    * Requests are handled fairly in a **round-robin** fashion by incrementing the SLS request buffer pointer.
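The round-robin arbitration over pending SLS requests can be sketched as a minimal Python model (names such as `PendingSLSRequest` and `round_robin_issue` are illustrative, not taken from the RecSSD firmware):

```python
from collections import deque

class PendingSLSRequest:
    """Illustrative model of one pending-SLS-request buffer entry."""
    def __init__(self, req_id, flash_pages):
        self.req_id = req_id
        # Pending Flash page read requests computed from the input config.
        self.pending_flash_pages = deque(flash_pages)

def round_robin_issue(entries, low_level_queue, budget):
    """Issue up to `budget` Flash page requests into the low-level queue,
    advancing a pointer over the SLS request buffer (round robin) so that
    concurrent SLS requests share the Flash channels fairly."""
    ptr = 0
    issued = 0
    while issued < budget and any(e.pending_flash_pages for e in entries):
        entry = entries[ptr % len(entries)]
        if entry.pending_flash_pages:
            page = entry.pending_flash_pages.popleft()
            low_level_queue.append((entry.req_id, page))
            issued += 1
        ptr += 1  # increment the buffer pointer regardless, for fairness
    return issued

# Example: two concurrent SLS requests interleave their page reads.
q = deque()
reqs = [PendingSLSRequest(0, [10, 11, 12]), PendingSLSRequest(1, [20, 21])]
round_robin_issue(reqs, q, budget=4)
# q now alternates between the two requests: (0,10), (1,20), (0,11), (1,21)
```

Pulling at most one page per entry per pass keeps a large SLS request from starving smaller concurrent ones, which is the fairness property the round-robin pointer provides.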
* Returning individual Flash requests
    * **(step 4)** Upon completion of a Flash Page Read Request, the extraction and reduction computation is triggered for that page.
    * After **(step 5)**, the reformatted input configuration allows the page processing function to quickly index which embeddings need to be processed and to update the completion counters appropriately.
* Returning embedding requests
    * **(step 6)** If completed pages are ready and the NVMe host controller is available, the scheduler triggers the controller to DMA the result pages back to the host.

#### Exploiting temporal locality in embedding operations

* ![](https://i.imgur.com/q35rKA9.png =300x)
* Building on [Bandana](https://arxiv.org/pdf/1811.05922.pdf), which thoroughly investigated advanced caching techniques, RecSSD proposes orthogonal solutions.
* Multi-threading and Pipelining
    * Overlaps NDP SLS I/O operations with the rest of the neural network computation.
    * Uses a thread pool of SLS workers to fetch embeddings and feed post-SLS embeddings to neural network workers.
    * The SLS worker count is matched to the number of independent I/O queues available in the SSD driver stack; neural network workers are then matched to the available CPU resources.
* Host-side DRAM Caching
    * Reference: [Bandana](https://arxiv.org/pdf/1811.05922.pdf)
    * Motivation: relatively few embeddings are highly accessed. ![](https://i.imgur.com/qinBQCr.png =300x)
    * Method: a static partitioning technique, based on input data profiling, partitions the embedding tables such that frequently accessed embeddings are stored in host DRAM, while infrequently accessed embeddings are stored on the SSD.
* SSD-side DRAM Caching
    * Implements a **direct-mapped cache**.
    * Reasons:
        * The FTL runs on a relatively simple CPU with limited DRAM space.
        * The SSD FTL is designed without dynamic memory allocation and garbage collection.
* NDP SLS Interface

### Implementation

* Cosmos+ OpenSSD system: [GitHub1](https://github.com/Cosmos-OpenSSD/Cosmos-plus-OpenSSD), [Platform](http://www.openssd.io), [GitHub2](https://github.com/wilkening-mark/RecSSD-OpenSSDFirmware), [Tutorial](https://dl.acm.org/doi/fullHtml/10.1145/3385073)
    * A fully functional 2TB NVMe Flash SSD with a customizable Flash controller and FTL firmware
* Micron UNVMe driver library: [GitHub1](https://github.com/zenglg/unvme), [GitHub2](https://github.com/wilkening-mark/RecSSD-UNVMeDriver)

### Evaluation

* Compared to a baseline SSD, RecSSD provides a 4× improvement in embedding operation latency and a 2× improvement in end-to-end model latency on a real OpenSSD system ([GitHub](https://github.com/Cosmos-OpenSSD/Cosmos-plus-OpenSSD)). ![](https://i.imgur.com/BhjIdNy.png =300x)
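For reference, the embedding table operation that RecSSD offloads (and whose latency improves 4× above) is a pooled-sum lookup in the style of SparseLengthsSum (SLS): gather table rows by index, then sum each group element-wise. A minimal pure-Python sketch of the host-visible semantics (illustrative only, not the paper's implementation):

```python
def sparse_lengths_sum(table, indices, lengths):
    """Pooled-sum embedding lookup (SLS): for the i-th output row, gather
    lengths[i] rows of `table` at the next lengths[i] positions of `indices`
    and sum them element-wise. This is the computation RecSSD moves into
    the SSD FTL instead of returning raw pages to the host."""
    out, pos = [], 0
    for n in lengths:
        rows = [table[idx] for idx in indices[pos:pos + n]]
        pos += n
        out.append([sum(col) for col in zip(*rows)])
    return out

# Tiny example: a 4-row, 2-dimensional table and two pooled lookups.
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
result = sparse_lengths_sum(table, indices=[0, 2, 1, 3], lengths=[2, 2])
# result == [[6.0, 8.0], [10.0, 12.0]]
```

Because only the small pooled sums cross the NVMe interface instead of every gathered row, offloading this reduction is what cuts the embedding read traffic between SSD and host.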