# FlashEmbedding
###### tags: `Accelerator`
### ABSTRACT
* FlashEmbedding is a hardware/software co-design solution for storing embedding tables on SSDs for large-scale recommendation inference on memory-capacity-limited systems.
* FlashEmbedding achieves up to 17.44× lower latency in embedding lookups and 2.89× lower end-to-end latency than the baseline solution in a memory-capacity-limited system.
### 1. INTRODUCTION
* After careful analysis, we find that the performance degradation of SSD-based recommendation inference is mainly attributed to **two factors**: the long I/O stack and the tremendous read amplification of the block-interface SSD.
* To overcome these issues,
* First, we design **embedding vector SSD (EV-SSD)** to support embedding vector lookup requests by taking advantage of two-stage fine-grained reading and MMIO interface.
* Second, we design **embedding vector cache (EV-Cache)** to reduce accesses to EV-SSD by applying a bitmap filter and Robin Hood hashing-based LRU caches.
* Third, we use **pipelining** to further hide I/O latency by overlapping CPU computation with EV-SSD handling (a minimal sketch follows).
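A minimal sketch of this pipelining idea, assuming the batch of lookup indices is split into sub-batches so that CPU-side work on one sub-batch overlaps with the EV-SSD handling of the next. The function names (`ssd_lookup`, `cpu_aggregate`) are illustrative placeholders, not the paper's API.

```python
import queue
import threading

def pipeline_lookup(indices, sub_batch, ssd_lookup, cpu_aggregate):
    """Overlap EV-SSD handling (ssd_lookup) with CPU-side work (cpu_aggregate)
    by streaming sub-batches through a small queue."""
    q = queue.Queue(maxsize=2)

    def producer():
        for i in range(0, len(indices), sub_batch):
            q.put(ssd_lookup(indices[i:i + sub_batch]))  # I/O-bound stage
        q.put(None)                                      # end-of-stream marker

    t = threading.Thread(target=producer)
    t.start()
    partials = []
    while (vectors := q.get()) is not None:
        partials.append(cpu_aggregate(vectors))          # CPU stage runs while the next I/O is in flight
    t.join()
    return partials
```

Because blocking I/O releases Python's GIL, the two stages genuinely overlap when `ssd_lookup` is I/O-bound.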
### 2. BACKGROUND AND MOTIVATION
#### 2.1 Embedding
* Sparse features are transformed to a dense representation by looking up embedding tables in the embedding layer.
* An embedding table is a sort of lookup table in which each row is a vector of floating-point values that corresponds to a category of a sparse feature.
* The embedding vectors returned are aggregated into a single vector through simple pooling operations such as sum.
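A toy illustration of the lookup-and-pool pattern just described (table size and category IDs are made up):

```python
import numpy as np

num_categories, dim = 1000, 32                    # table rows x embedding dimension
table = np.random.rand(num_categories, dim).astype(np.float32)   # one embedding table

sparse_feature = [17, 42, 901]                    # multi-hot category IDs for one sample
vectors = table[sparse_feature]                   # embedding lookup: gather the rows
pooled = vectors.sum(axis=0)                      # sum pooling -> one dense 32-dim vector
```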
#### 2.2 SSD-based Recommendation Inference
* As the amount of memory decreases, the inference latency of the baseline SSD solution increases significantly, becoming 5× slower than the ideal performance.
* I/O stack costs are the primary limiter on SSD-based recommender system performance.
* In summary, we root-cause **two factors** as crucial limiters of the prior approach relying on SSD for storing embeddings: high I/O stack overhead and large SSD read amplification.
### 3. FLASHEMBEDDING DESIGN
#### 3.1 System Overview
* On the hardware side, we propose an embedding-semantic-aware SSD, referred to as **EV-SSD**, which provides fast access to embedding vectors in the SSD by reducing read amplification and I/O stack overhead.
* On the software side, we propose an embedding vector cache, referred to as **EV-Cache**, which efficiently caches embedding vectors with a small amount of memory to exploit the modest temporal locality within embedding tables and to reduce the long SSD read latency.
#### 3.2 Embedding Vector SSD
* EV-SSD extends conventional NVMe SSD with the following modules:
* MMIO Manager,
* serves as the EV lookup interface
* supports both a host-side memory interface and DMA-mode data transmission
* the former writes the EV Register directly to exchange small control parameters (e.g., the embedding table ID) with low latency
* the latter transfers data blocks in bulk, including the lookup EV indices and the qualified EVs
* Embedding Vector Translator (EV Translator),
* parses the EV index into an LBA (Logical Block Address)
* the output LBA is fed to the FTL; after FTL translation, the physical block read requests are assigned to the corresponding EV-FC
* Embedding Vector Flash Controller (EV-FC),
* Embedding Vector Merger (EV Merger).
* To address the read amplification, we adopt a **two-stage** fine-grained reading strategy (a sketch follows).
* At the device stage, the EV Translator translates EV lookup indices, and the EV Merger merges the scattered qualified EV lookup results returned by the EV-FCs.
* At the flash channel stage, EV-FC only fetches the desired EV data from flash channels.
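A rough sketch of this two-stage read path under simplifying assumptions (4KB flash pages, 32-dim float32 vectors, a toy in-memory "channel"); the class and function names are illustrative, not the paper's implementation.

```python
PAGE_SIZE = 4096
EV_BYTES = 32 * 4                      # 32-dim float32 vector = 128 bytes

class FlashChannel:
    """Toy flash channel backed by an in-memory byte array."""
    def __init__(self, num_pages):
        self.pages = bytearray(num_pages * PAGE_SIZE)

    def read_fine_grained(self, lba, offset, length):
        # Flash-channel stage (EV-FC): return only the requested bytes of the
        # page instead of shipping a whole 4KB page upward.
        start = lba * PAGE_SIZE + offset
        return bytes(self.pages[start:start + length])

def device_stage(ev_locations, channels):
    """Device stage: for each (channel_id, lba, offset) produced by the EV
    Translator, issue a fine-grained read via the EV-FC, then merge the
    scattered results back into request order (EV Merger)."""
    results = []
    for ch_id, lba, offset in ev_locations:
        results.append(channels[ch_id].read_fine_grained(lba, offset, EV_BYTES))
    return b"".join(results)
```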
* When a batch of EV lookups arrives, the EV Translator translates the EV request according to the following steps (a sketch of these steps follows the list):
1. fetches the metadata of the requested table to the SRAM for faster access
2. fetches one index from the Index Buffer each time
3. the extent ID of the current index is determined by comparing the index against the per-extent index ranges
4. according to the extent ID, the start LBA of the extent is determined
5. based on the index offset within this extent and the EV dimension, the final LBA of the EV is computed
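A sketch of these translation steps, assuming per-extent metadata of (first index, start LBA) pairs and 4KB logical blocks; this layout is an assumption for illustration, and a binary search stands in for step 3's comparison loop.

```python
from bisect import bisect_right

PAGE_SIZE = 4096

def translate(index, extent_first_index, extent_start_lba, ev_bytes):
    """Return (LBA, byte offset within the block) of the requested EV.
    extent_first_index: sorted first index of each extent (steps 1-3)
    extent_start_lba:   start LBA of each extent (step 4)"""
    extent_id = bisect_right(extent_first_index, index) - 1   # step 3: find the covering extent
    offset_in_extent = index - extent_first_index[extent_id]  # step 5: offset within the extent,
    byte_offset = offset_in_extent * ev_bytes                 #         scaled by the EV size
    lba = extent_start_lba[extent_id] + byte_offset // PAGE_SIZE
    return lba, byte_offset % PAGE_SIZE

# Example: two extents, 128-byte EVs (32 x float32)
print(translate(100, [0, 4096], [0, 128], ev_bytes=128))      # -> (3, 512)
```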
#### 3.3 Embedding Vector Cache
* EV-Cache is an in-memory embedding vector cache, one per SSD-resident embedding table, that exploits embedding vector reuse.
* EV-Cache employs a dedicated bitmap as a filter to prune infrequently accessed indices early, so only indices accessed often enough are admitted to the cache.
* EV-Cache uses an LRU (least recently used) vector cache as its building block, which uses a hashmap to store key-value pairs and a circular doubly linked list to record the access order of each element.
* the hashmap is based on open addressing with Robin Hood hashing to meet the performance requirements
* Robin Hood hashing reduces the maximum probe sequence length and the long tail in probe sequence length distribution, achieving high performance.
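A condensed sketch of the admission path through the bitmap filter and the LRU cache. For brevity the LRU below is an `OrderedDict` and the filter is a plain counter array; the paper's design instead uses a Robin-Hood-hashed hashmap plus a circular doubly linked list, and a 2-bit-per-index bitmap (see Section 4). The exact threshold semantics here are an assumption.

```python
from collections import OrderedDict

class EVCache:
    """Per-table cache: a counting filter in front of an LRU vector cache."""
    def __init__(self, capacity, num_indices, threshold=2):
        self.capacity = capacity
        self.threshold = threshold
        self.counts = bytearray(num_indices)        # stands in for the 2-bit bitmap
        self.lru = OrderedDict()                    # index -> embedding vector

    def lookup(self, index, fetch_from_ssd):
        if index in self.lru:                       # cache hit: refresh recency
            self.lru.move_to_end(index)
            return self.lru[index]
        vector = fetch_from_ssd(index)              # miss: fall back to EV-SSD
        if self.counts[index] < self.threshold:     # filter: index not hot enough yet
            self.counts[index] += 1
            return vector
        self.lru[index] = vector                    # hot index: admit to the cache
        if len(self.lru) > self.capacity:
            self.lru.popitem(last=False)            # evict the least recently used entry
        return vector
```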
#### 3.4 Putting It All Together
### 4. IMPLEMENTATION
* We implement our EV-SSD prototype on an FPGA
* adding logic to the FPGA chip to realize the functionality of the EV-SSD controllers
* employing four DDR4 banks to emulate four flash channels
* the IP cores of DMA controller and DDR4 controller are provided by Xilinx
* the linear mapping function is applied in the FTL design
* The memory required for FlashEmbedding is mainly from EV-Cache.
* setting EV-Cache size to 1% of its embedding table size is sufficient for most cases
* The bitmap filter only takes two bits of memory space for each index; the total size is negligible for modern memory systems.
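For reference, four 2-bit saturating counters can be packed into one byte, which is where the "two bits per index" cost comes from; this packing is an illustration, not the paper's code.

```python
class TwoBitCounters:
    """Four 2-bit saturating counters per byte (2 bits of state per index)."""
    def __init__(self, num_indices):
        self.data = bytearray((num_indices + 3) // 4)

    def get(self, i):
        return (self.data[i >> 2] >> ((i & 3) * 2)) & 0b11

    def increment(self, i):
        v = self.get(i)
        if v < 3:                                         # saturate at 3
            shift = (i & 3) * 2
            byte = self.data[i >> 2] & ~(0b11 << shift)   # clear the 2-bit slot
            self.data[i >> 2] = byte | ((v + 1) << shift) # write the incremented value
```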
### 5. EVALUATION
#### 5.1 Experimental Setup
* We evaluate FlashEmbedding on an AWS EC2 F1 instance with a Xilinx Virtex UltraScale+ XCVU9P card.
* We store the five largest embedding tables (around 4GB total) on SSD and the remaining 21 small tables in DRAM (around 70MB total).
* We set the default batch size to 128 and sweep batch size from 64 to 512 for sensitivity studies.
* The default embedding dimension is 32.
* For EV-Cache, we use a threshold of 2 for the bitmap filter and a cache size of 1% of the embedding table size (around 40MB in total) as the default configuration.
#### 5.2 Effectiveness of Proposed Techniques
* Compared to Baseline SSD, EV-SSD significantly reduces the normalized latency from 17.44× to 3.08×.
* The reason is two-fold.
* First, EV-SSD bypasses most parts of the I/O stack, allowing direct access to SSD.
* Second, EV-SSD supports fine-grained access, so read amplification is reduced remarkably.
* On top of that, the embedding cache (EV-Cache) reduces the number of SSD requests by exploiting the temporal reuse opportunity of embeddings in the dataset.
* FlashEmbedding and its variants incur much less read I/O traffic than Baseline SSD.
* EV-SSD’s lighter I/O stack and finer access granularity efficiently reduce the read amplification.
* EV-Cache efficiently caches the embeddings and serves embedding lookup requests directly whenever possible, reducing traffic to SSD.
* FlashEmbedding further reduces I/O with the pipeline because it can promptly update the bitmap status once embedding vectors have been accessed in earlier sub-batches of indices, leading to better utilization of EV-Cache.
#### 5.3 End-to-End Performance
* FlashEmbedding shows comparable performance to Ideal DRAM (less than 10% latency overhead) but with much lower memory consumption (approximately 110MB versus 30GB), opening up new possibilities to store and look up embedding tables on SSDs for large-scale recommendation on memory-capacity-limited systems.
* The end-to-end latency to infer a single query is 6 ms, much lower than common latency requirements of tens of milliseconds.
#### 5.4 Sensitivity Study
* As the batch size increases, the fraction of time spent on embedding table operations decreases, from 65.4% at batch size 64 to 58.5% at batch size 512.
* The performance of embedding lookup increases slightly as we increase the cache size from 0.1% to 5%.
* This is because the bitmap filters out less frequently accessed indices and only passes frequently accessed indices to the cache; the cache hit ratio is almost 100% for all cache sizes.
* Further increasing the cache size will increase the cache management overhead, leading to worse performance.
* Figure 14 shows how our proposed techniques scale with different SSD access latency.
* For EV-SSD, embedding lookup latency increases to 2.08× and 5.66× when SSD latency increases from 4 us to 10 us and 30 us, respectively.
* EV-Cache and FlashEmbedding mitigate performance degradation, indicating better scalability. This is mainly because EV-Cache reduces the number of SSD accesses by exploiting the data locality.