# RM-SSD
RM-SSD is the first **complete** solution that improves the performance of both embedding-dominated and MLP-dominated recommendation models while storing all embedding tables in SSDs.
* store table: embedding tables are stored as normal files on the SSD, through the file system
* lookup: *lseek* to the vector's offset in the table file
* operation: parallelized with OpenMP
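A minimal host-side sketch of this lookup path, assuming embedding tables stored as one file per table with 64-float vectors packed back-to-back; `pread` (which combines `lseek` and `read`) stands in for the lseek-based lookup, and OpenMP parallelizes the per-vector reads. The file name and sizes are illustrative, not the paper's code.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative parameters (assumptions, not taken from the paper).
constexpr std::size_t kDim = 64;                        // embedding vector dimension
constexpr std::size_t kVecBytes = kDim * sizeof(float); // bytes per vector

// Gather a batch of embedding vectors from a table stored as a normal file on the SSD.
std::vector<float> lookup(int fd, const std::vector<std::int64_t>& indices) {
    std::vector<float> out(indices.size() * kDim);
    #pragma omp parallel for                            // one independent read per iteration
    for (std::int64_t i = 0; i < (std::int64_t)indices.size(); ++i) {
        // pread = lseek + read in a single call; byte offset = row index * vector size
        pread(fd, &out[i * kDim], kVecBytes, indices[i] * kVecBytes);
    }
    return out;
}

int main() {
    int fd = open("table0.bin", O_RDONLY);              // hypothetical table file
    if (fd < 0) return 1;
    std::vector<float> vecs = lookup(fd, {3, 17, 42});  // gather three rows
    close(fd);
    return vecs.empty() ? 1 : 0;
}
```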
### Motivation
* Performance profiling
* Opportunities for offloading embedding lookup
* In-storage computing has been shown to be an effective approach for reducing I/O traffic
* Significant read amplification
* Irregular embedding access pattern
* Opportunities for offloading MLP operation
* FPGA is an order of magnitude faster than the CPU when running the same MLP layer
* The architecture of the recommendation system can naturally be remapped into pipeline stages for throughput improvement
* Reasons to offload the entire recommendation system into the SSD
* While a host and SSD co-designed pipeline mechanism is also a possibility, pipelining all stages in storage is more efficient:
1. whole-system synchronization usually incurs high pipeline overhead, which is much more severe in server environments
2. the extra data copying and reformatting would introduce additional latency overhead
3. optimizing the embedding (in the SSD) and the MLP (on the host side) separately would be hindered by the long inner concatenation between the embedding and MLP layers
* Challenges of in-storage computing
1. in-storage computing is more sensitive to resource consumption and energy efficiency than near-memory acceleration
2. the additional computing resources deployed in the SSD cannot be as powerful as near-memory accelerators due to monetary cost
3. high power consumption often leads to high temperature, which could be detrimental to SSD lifetime
### Method
* Embedding Lookup Engine: focuses on the access latency reduction for embedding vectors **(orange blocks)**
* MLP Acceleration Engine: aims at optimizing the MLP-dominated models **(green blocks)**

#### System Overview
1. When the Top MLP finishes, the status register in the RM Registers is set to ready.
2. At the host side, before reading the results from RM-SSD, the CPU first checks the result status register in the SSD via MMIO.
3. Once it becomes ready, the final inference results are transmitted to the host.
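A rough host-side sketch of this check-then-read flow; the register offsets, the mapped-BAR layout, and the plain copy standing in for the DMA transfer are assumptions for illustration, not RM-SSD's actual driver interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical register layout (assumption, not the real RM-SSD driver interface):
// a status register and a result buffer exposed through a BAR that the RM-SSD
// driver would map into user space (e.g., via mmap on the device node).
constexpr std::size_t kStatusOffset = 0x10;
constexpr std::size_t kResultOffset = 0x100;
constexpr std::uint32_t kStatusReady = 1;

// Steps 2-3 of the overview: poll the result status register over MMIO,
// then copy the final inference results back into a host buffer.
void wait_and_read_results(volatile std::uint8_t* bar, float* out, std::size_t n) {
    volatile std::uint32_t* status =
        reinterpret_cast<volatile std::uint32_t*>(bar + kStatusOffset);
    while (*status != kStatusReady) {
        // busy-wait until the Top MLP finishes and the device flips the register
    }
    // In the paper the results are then transmitted to the host; a plain copy
    // from the mapped region stands in for that transfer here.
    const volatile float* res =
        reinterpret_cast<const volatile float*>(bar + kResultOffset);
    for (std::size_t i = 0; i < n; ++i) out[i] = res[i];
}

int main() {
    // Fake "BAR" in ordinary memory so the sketch runs without hardware.
    alignas(4) static std::uint8_t fake_bar[0x200] = {};
    *reinterpret_cast<std::uint32_t*>(fake_bar + kStatusOffset) = kStatusReady;
    reinterpret_cast<float*>(fake_bar + kResultOffset)[0] = 0.42f;

    float out[4] = {};
    wait_and_read_results(fake_bar, out, 4);
    std::printf("first result = %f\n", out[0]);
    return 0;
}
```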

* ***Col*** is interpreted as the read offset within a page
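A small sketch of how such a (page, ***Col***) address for an embedding vector could be derived from its row index; the page size and the back-to-back vector layout are assumptions, not the paper's exact mapping.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative layout assumptions: vectors are packed back-to-back in the table,
// and a flash page holds a whole number of vectors.
constexpr std::uint32_t kPageSize = 4096;   // bytes per flash page (assumed)
constexpr std::uint32_t kVecBytes = 256;    // 64 floats per embedding vector (assumed)

struct VecAddr {
    std::uint32_t page;  // which page of the table to read
    std::uint32_t col;   // read offset (in bytes) inside that page
};

VecAddr locate(std::uint32_t row) {
    std::uint64_t byte_off = static_cast<std::uint64_t>(row) * kVecBytes;
    return { static_cast<std::uint32_t>(byte_off / kPageSize),
             static_cast<std::uint32_t>(byte_off % kPageSize) };
}

int main() {
    VecAddr a = locate(42);
    std::printf("page=%u col=%u\n", a.page, a.col);
    return 0;
}
```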
#### Embedding Lookup Engine
* similar to FlashEmbedding
#### MLP Acceleration Engine
1. the long inner concatenation between embedding and MLP layers can be eliminated by ***Intra-Layer Decomposition***
2. it is possible to fit all the layers on FPGA simultaneously so that the latency and throughput can be further improved by ***Inter-Layer Composition***
3. the unbalanced structure provides opportunities for balancing the time cost of each FC layer, so that no resources are wasted on unnecessary computing
4. the non-MLP embedding stage can also be balanced with the MLP layers through batch processing, maximizing throughput with the ***Kernel Search Algorithm***
##### Basic FC layer design
* implementation of Matrix Multiplication (MM):
* classic: the systolic array
* time cost is R · C · II, where R is the input dimension, C is the output dimension, and II is the initiation interval
* RM-SSD: an adder tree for the sum operation
* time cost is reduced to (R / kr) · (C / kc) · II, where kr and kc are the kernel block sizes along the row and column dimensions
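A cycle-counting sketch of this blocking (a software model, not the RTL): each "cycle" consumes one kr × kc tile of the weight matrix, with the kr partial products per output reduced by the adder tree; II is taken as 1 and the tile sizes in `main` are illustrative.

```cpp
#include <cstdio>
#include <vector>

// Software model of the blocked FC computation: each "cycle" multiplies a
// kr x kc tile of the weight matrix with kr inputs; the kr partial products
// per output column are summed by the adder tree.
int fc_blocked(const std::vector<std::vector<float>>& W,   // C x R weights
               const std::vector<float>& x,                // R inputs
               std::vector<float>& y,                      // C outputs
               int kr, int kc) {
    int R = (int)x.size(), C = (int)y.size(), cycles = 0;
    for (int c0 = 0; c0 < C; c0 += kc)
        for (int r0 = 0; r0 < R; r0 += kr, ++cycles)        // one tile per cycle
            for (int c = c0; c < c0 + kc && c < C; ++c)
                for (int r = r0; r < r0 + kr && r < R; ++r)
                    y[c] += W[c][r] * x[r];
    return cycles;  // equals ceil(R/kr) * ceil(C/kc), vs. R * C for the element-by-element schedule
}

int main() {
    int R = 512, C = 256, kr = 16, kc = 8;                  // illustrative sizes
    std::vector<std::vector<float>> W(C, std::vector<float>(R, 0.01f));
    std::vector<float> x(R, 1.0f), y(C, 0.0f);
    std::printf("cycles = %d (R*C = %d)\n", fc_blocked(W, x, y, kr, kc), R * C);
    return 0;
}
```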
##### Intra-layer decomposition
* conventional: waits until both the bottom MLP and the embedding layers finish, then feeds the concatenated vector into the top MLP layer
* suitable for the Python framework, but does not take full advantage of the FPGA hardware
* RM-SSD: the bottom MLP and the embedding layer can keep being processed in parallel (see the sketch below)
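One way to see why the concatenation can be dropped (a sketch of the underlying identity, not the paper's exact dataflow): split the first top-MLP weight matrix along its input columns, so the bottom-MLP part and the embedding part each produce a partial result independently and are merged by addition instead of concatenation.

```cpp
#include <cstdio>
#include <vector>
using Vec = std::vector<float>;
using Mat = std::vector<Vec>;                 // row-major: out_dim x in_dim

Vec matvec(const Mat& W, const Vec& x) {
    Vec y(W.size(), 0.0f);
    for (size_t i = 0; i < W.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j) y[i] += W[i][j] * x[j];
    return y;
}

int main() {
    // Toy dimensions (illustrative): bottom-MLP output of size 4, embedding part of size 3.
    Vec mlp_out = {1, 2, 3, 4}, emb_out = {5, 6, 7};
    Mat W = {{1, 0, 1, 0, 2, 0, 2}, {0, 1, 0, 1, 0, 3, 0}};   // 2 x (4+3) first top-MLP layer

    // Conventional: concatenate, then multiply.
    Vec cat(mlp_out);
    cat.insert(cat.end(), emb_out.begin(), emb_out.end());
    Vec y1 = matvec(W, cat);

    // Decomposed: split W column-wise and add the two partial results.
    Mat Wm(2), We(2);
    for (int i = 0; i < 2; ++i) {
        Wm[i] = Vec(W[i].begin(), W[i].begin() + 4);
        We[i] = Vec(W[i].begin() + 4, W[i].end());
    }
    Vec y2 = matvec(Wm, mlp_out), ye = matvec(We, emb_out);
    for (int i = 0; i < 2; ++i) y2[i] += ye[i];

    std::printf("y1 = (%g, %g), y2 = (%g, %g)\n", y1[0], y1[1], y2[0], y2[1]);
    return 0;
}
```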
##### Inter-layer composition
* conventional (left in the figure): not efficient for pipelining
* RM-SSD (right in the figure): compose the adjacent layers into a pair by exchanging the scanning direction
* in this way, the time consumption of the MLP can be reduced by half

* the time cost
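A back-of-the-envelope reading of the "reduced by half" claim, under the simplifying assumption that adjacent layers take roughly equal time T and that the reversed scanning direction lets the second layer of a pair start almost immediately; this is an illustration, not the paper's exact cost formula.

```latex
% Two adjacent FC layers executed back-to-back:
T_{\text{sequential}} \;=\; T_i + T_{i+1} \;\approx\; 2T
% The same two layers composed into a pair (the second layer scans in the
% opposite direction and overlaps with the first):
T_{\text{composed}} \;\approx\; \max(T_i,\, T_{i+1}) + \epsilon \;\approx\; T
```

Here ε stands for the short startup delay before the second layer of the pair can begin; applied to every pair, the MLP time drops to roughly half of the sequential schedule.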

##### Kernel search algorithm
* RM-SSD proposes a kernel search algorithm that efficiently decides the kernel size for each FC layer, ensuring the lowest resource utilization and the optimal throughput for both embedding-dominated and MLP-dominated models (a simplified sketch follows the list of rules below)
* the optimization objective is to meet the goal in equation (2)
* The following rules are applied to achieve this goal.
1. BRAM resource assessment
* Fitting all weights on the BRAM is the preferred option.
* However, if the weights exceed the available BRAM capacity, the off-chip DRAM will be used.
2. Kernel size for DRAM-employed layers
* to fully utilize the DRAM bandwidth:  must be satisfied
* to minimize the resource consumption: 
* the time cost for this layer: 
4. Batch size decision
5.  decision
* to avoid layer pipeline bubbles, additional constraints must be met
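A simplified sketch of what such a search could look like, reusing the per-layer time model (R / kr) · (C / kc) · II from the FC layer design and a total multiplier budget as a stand-in for the resource constraints; the cost model, budget, and greedy strategy are assumptions for illustration, not the paper's equation (2) or its exact rules.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// One FC layer: R inputs, C outputs, kernel block (kr, kc).
struct Layer { int R, C, kr, kc; };

// Assumed per-layer time model (II taken as 1): ceil(R/kr) * ceil(C/kc).
long long layer_time(const Layer& l) {
    return (long long)((l.R + l.kr - 1) / l.kr) * ((l.C + l.kc - 1) / l.kc);
}

// Greedy sketch: pipeline throughput is limited by the slowest layer, so keep
// enlarging the kernel of the current bottleneck until the multiplier budget
// (sum of kr * kc over all layers, a stand-in for DSP/BRAM limits) runs out.
void kernel_search(std::vector<Layer>& layers, int multiplier_budget) {
    while (true) {
        int used = 0;
        for (const Layer& l : layers) used += l.kr * l.kc;
        auto slow = std::max_element(layers.begin(), layers.end(),
            [](const Layer& a, const Layer& b) { return layer_time(a) < layer_time(b); });
        if (used + slow->kr * slow->kc > multiplier_budget) break;   // next doubling won't fit
        Layer try_r = *slow; try_r.kr = std::min(2 * try_r.kr, try_r.R);
        Layer try_c = *slow; try_c.kc = std::min(2 * try_c.kc, try_c.C);
        Layer best = layer_time(try_r) <= layer_time(try_c) ? try_r : try_c;
        if (layer_time(best) >= layer_time(*slow)) break;            // bottleneck fully unrolled
        *slow = best;
    }
}

int main() {
    // Toy layer shapes (illustrative only, not a real DLRM configuration).
    std::vector<Layer> mlp = {{13, 512, 1, 1}, {512, 256, 1, 1},
                              {256, 64, 1, 1}, {415, 512, 1, 1}, {512, 64, 1, 1}};
    kernel_search(mlp, 256);
    for (const Layer& l : mlp)
        std::printf("R=%3d C=%3d  kr=%3d kc=%3d  time=%lld\n",
                    l.R, l.C, l.kr, l.kc, layer_time(l));
    return 0;
}
```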
#### Software Integration
* The RM-SSD driver and user library are designed
* RM-SSD provides a C++ runtime library, which can be integrated with Python-based DL frameworks, e.g., PyTorch, Caffe2
1. ***RMcreatetable(TableSize)*** uses the block I/O driver and goes through the file system, creating tables as normal files
2. ***RMopentable(TableID, TablePath)*** is the open operation, residing on the MMIO path
3. ***RMsendinputs(fd, IndicesPerLookup, SparseIn, DenseIn)*** sends the sparse indices and dense inputs of a batch to the RM-SSD
4. ***RMreadoutputs()*** reads the final results in the form of the batch from the RM-SSD in DMA mode
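A usage sketch of this call sequence; only the call names and parameter lists above come from the paper, so the return types, argument values, paths, and the stub bodies below are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in stubs so the sketch compiles; the real RM-SSD user library would
// issue the block I/O, MMIO, and DMA operations described above.
int RMcreatetable(std::size_t TableSize) { (void)TableSize; return 0; }
int RMopentable(int TableID, const char* TablePath) { (void)TableID; (void)TablePath; return 3; }
int RMsendinputs(int fd, int IndicesPerLookup,
                 const std::vector<std::int64_t>& SparseIn,
                 const std::vector<float>& DenseIn) {
    (void)fd; (void)IndicesPerLookup; (void)SparseIn; (void)DenseIn; return 0;
}
std::vector<float> RMreadoutputs() { return std::vector<float>(8, 0.0f); }

int main() {
    // 1. Create the embedding table; it goes through the file system as a normal file.
    RMcreatetable(1u << 20);                                   // table size is illustrative

    // 2. Open it on the MMIO path so RM-SSD knows which table to serve.
    int fd = RMopentable(/*TableID=*/0, "/mnt/rmssd/table0");  // hypothetical path

    // 3. Push one batch of sparse indices and dense features into the device.
    std::vector<std::int64_t> sparse = {3, 17, 42, 5};
    std::vector<float> dense = {0.1f, 0.2f, 0.3f};
    RMsendinputs(fd, /*IndicesPerLookup=*/4, sparse, dense);

    // 4. Read the final inference results for the batch back in DMA mode.
    std::vector<float> outputs = RMreadoutputs();
    return outputs.empty() ? 1 : 0;
}
```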
### Implementation
* emulation platform:
* Amazon EC2 F1 instance
* Intel Xeon E5-2686 v4 @ 2.30GHz CPU (8 vCPU)
* Xilinx Virtex UltraScale+ XCVU9P card (an FPGA chip)
* 64GB (16GB × 4) off-chip DDR4 with 64-byte data width
* 122 GB DRAM (8 channels)
* PCIe gen3 × 16
* target model
* Facebook’s DLRM models
* the customized C++ *SparseLengthSum* operator
### Result
* RM-SSD adopts vector-grained access to embeddings, while RecSSD still adopts page-grained access