# The Architectural Implications of Facebook’s DNN-based Personalized Recommendation
#### paper: [link](https://arxiv.org/pdf/1906.03109.pdf)
###### tags: `DLRM`
## Introduction
* Personalized recommendation for content ranking is now largely accomplished using deep neural networks. However, these models have received comparatively little architectural research attention.
* This paper presents a set of real-world, production-scale DNNs for personalized recommendation.
* focus on three model classes (RMC1, RMC2, RMC3) because they account for a significant share of AI inference cycles in Facebook's data centers

* benchmarks of production-scale models:
    * application-level constraint (use case): high throughput under strict latency constraints
    * embedding tables (memory intensive): larger storage requirements and more irregular memory accesses
    * fully-connected layers (compute intensive): distinct compute (FLOPs) and memory (bytes read) requirements; the sketch below contrasts the arithmetic intensity of the two operator types
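
To make the contrast concrete, here is a back-of-the-envelope comparison of FLOPs versus bytes read for the two operator types. The layer and table sizes below are illustrative assumptions, not figures from the paper.

```python
# Rough arithmetic-intensity comparison (illustrative sizes, not from the paper).

def fc_cost(batch, in_dim, out_dim, bytes_per_elem=4):
    flops = 2 * batch * in_dim * out_dim                # multiply-accumulates
    bytes_read = bytes_per_elem * (batch * in_dim + in_dim * out_dim)
    return flops, bytes_read

def embedding_cost(batch, lookups_per_sample, emb_dim, bytes_per_elem=4):
    # Pooled lookup: gather `lookups_per_sample` table rows and sum them.
    flops = batch * lookups_per_sample * emb_dim        # additions for pooling
    bytes_read = bytes_per_elem * batch * lookups_per_sample * emb_dim
    return flops, bytes_read

for name, (flops, bytes_read) in {
    "FC 512x512, batch 16": fc_cost(16, 512, 512),
    "Embedding, 80 lookups x 64-dim, batch 16": embedding_cost(16, 80, 64),
}.items():
    print(f"{name}: {flops/1e6:.2f} MFLOPs, {bytes_read/1e6:.2f} MB read, "
          f"intensity {flops/bytes_read:.2f} FLOP/byte")
```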

* Future system design:
    * using only latency to benchmark inference performance is insufficient; latency-bounded throughput is a more representative metric (a measurement sketch follows this list)
    * inference latency varies across the generations of Intel servers (Haswell, Broadwell, Skylake) that co-exist in data centers
    * co-locating multiple recommendation models on a single machine affects throughput and resource utilization
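
A minimal sketch of how latency-bounded throughput could be measured: sweep the batch size, record tail latency, and report the highest throughput whose p99 latency still meets a service-level target. The stand-in MLP, the 10 ms SLA, and the timing loop are assumptions for illustration, not the paper's methodology.

```python
import time
import torch

def p99_latency_ms(model, batch, trials=100):
    """Time `trials` forward passes and return the 99th-percentile latency in ms."""
    times = []
    with torch.no_grad():
        for _ in range(trials):
            start = time.perf_counter()
            model(batch)
            times.append((time.perf_counter() - start) * 1e3)
    times.sort()
    return times[int(0.99 * len(times)) - 1]

def latency_bounded_throughput(model, make_batch, sla_ms=10.0,
                               batch_sizes=(1, 4, 16, 64, 256)):
    """Largest queries/sec whose p99 latency still meets the SLA (0 if none do)."""
    best_qps = 0.0
    for bs in batch_sizes:
        p99 = p99_latency_ms(model, make_batch(bs))
        if p99 <= sla_ms:
            best_qps = max(best_qps, bs / (p99 / 1e3))
    return best_qps

# Example with a stand-in MLP; a real study would use the recommendation model.
mlp = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU(),
                          torch.nn.Linear(256, 1))
print(latency_bounded_throughput(mlp, lambda bs: torch.randn(bs, 256)))
```
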
## Background
* Task
    * dense feature: age
    * sparse feature: user's preferences for a genre of content (multi-hot vector)
* Models
    * FC layers, embedding tables, concatenation, and non-linearities such as ReLU

* Embedding table
    * Large storage capacity
    * Low compute intensity
    * Irregular memory accesses
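
The structure above can be sketched in a few lines of PyTorch (a simplified stand-in with made-up sizes, not Facebook's production code): dense features pass through a bottom MLP, each multi-hot sparse feature becomes a pooled `nn.EmbeddingBag` lookup, and the concatenated results feed a top MLP.

```python
import torch
import torch.nn as nn

class TinyRecModel(nn.Module):
    """Simplified DLRM-style recommendation model (illustrative sizes)."""
    def __init__(self, num_dense=13, table_rows=1000, emb_dim=32, num_tables=8):
        super().__init__()
        # Bottom MLP over dense features (e.g., age).
        self.bottom_mlp = nn.Sequential(nn.Linear(num_dense, 64), nn.ReLU(),
                                        nn.Linear(64, emb_dim), nn.ReLU())
        # One embedding table per sparse feature; sum pooling turns a
        # multi-hot vector (e.g., preferred genres) into one dense vector.
        self.tables = nn.ModuleList(
            [nn.EmbeddingBag(table_rows, emb_dim, mode="sum")
             for _ in range(num_tables)])
        # Top MLP over the concatenation of dense and sparse representations.
        self.top_mlp = nn.Sequential(nn.Linear(emb_dim * (num_tables + 1), 64),
                                     nn.ReLU(), nn.Linear(64, 1))

    def forward(self, dense, sparse_ids, sparse_offsets):
        parts = [self.bottom_mlp(dense)]
        for table, ids, offsets in zip(self.tables, sparse_ids, sparse_offsets):
            parts.append(table(ids, offsets))   # irregular, memory-bound gather
        return torch.sigmoid(self.top_mlp(torch.cat(parts, dim=1)))

# One sample, 8 sparse features with 4 lookups each.
model = TinyRecModel()
dense = torch.randn(1, 13)
ids = [torch.randint(0, 1000, (4,)) for _ in range(8)]
offsets = [torch.tensor([0]) for _ in range(8)]
print(model(dense, ids, offsets).shape)   # torch.Size([1, 1])
```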

## At-scale personalization
* Production Recommendation Pipeline
    * lightweight filtering
    * heavyweight ranking

* Focus on RMC1, RMC2, and RMC3.

* Server Architectures: Haswell, Broadwell, and Skylake generations co-exist in the data center

## Performance
1. Inference latency of the three models, with unit batch size, on an Intel Broadwell server (Fig. 7, left).
2. Accelerating matrix-multiplication operators alone (e.g., BatchMatMul and FC) provides limited end-to-end benefit (a rough Amdahl's-law estimate follows this list).

3. For a small batch size of 16, inference latency is optimized when the recommendation models are run on the Broadwell architecture.

4. While Broadwell is optimal at low batch-sizes, Skylake has higher performance with larger batch-sizes. This is a result of Skylake’s wider-SIMD (AVX-512) support.
5. We must balance low latency and high throughput; thus, even for inference, hardware solutions must consider batching and should optimize end-to-end model performance.
6. While co-location improves the overall throughput of high-end server architectures, it can shift performance bottlenecks when running production-scale recommendation models, leading to lower resource utilization.

7. Skylake’s higher performance with high co-location is a result of implementing an exclusive L2/L3 cache hierarchy.

8. Recommendation models exhibit performance variability in production environments.
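
As a sanity check on point 2 above, a rough Amdahl's-law estimate shows why accelerating only the FC/BatchMatMul operators yields limited end-to-end gains once embedding-table lookups and other operators dominate. The 40% operator-time share below is an assumed, illustrative number; the paper's actual operator breakdowns are in its figures.

```python
def end_to_end_speedup(matmul_fraction, matmul_speedup):
    """Amdahl's law: only the matmul share of runtime gets accelerated."""
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / matmul_speedup)

# If FC/BatchMatMul were, say, 40% of inference time (illustrative assumption),
# even an infinitely fast matrix engine caps the end-to-end speedup at ~1.67x.
for accel in (2, 10, float("inf")):
    print(f"{accel}x matmul -> {end_to_end_speedup(0.40, accel):.2f}x end-to-end")
```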

## Open-source benchmarks
The open-source benchmark is configured by five parameters:
1. The number of embedding tables
2. Input and output dimensions of embedding tables
3. Number of sparse lookups per embedding table
4. Depth/width of MLP layers for dense features (Bottom-MLP)
5. Depth/width of MLP layers after combining dense and sparse features (Top-MLP)
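
The five knobs above could be collected into a small configuration object like the following hypothetical sketch; the field names and example values are mine, not the flags or settings of the released benchmark. Varying these parameters yields the different model classes (RMC1/RMC2/RMC3) studied in the paper.

```python
from dataclasses import dataclass

@dataclass
class RecBenchmarkConfig:
    """Hypothetical container for the five knobs listed above (names are illustrative)."""
    num_tables: int           # 1. number of embedding tables
    rows_per_table: int       # 2. input dimension (rows) of each embedding table
    emb_dim: int              # 2. output dimension (row width) of each table
    lookups_per_table: int    # 3. sparse lookups per embedding table per sample
    bottom_mlp: tuple         # 4. layer widths of the dense-feature (Bottom) MLP
    top_mlp: tuple            # 5. layer widths of the MLP after feature combination

# Example values are made up; the actual RMC1/RMC2/RMC3 settings are in the paper.
example = RecBenchmarkConfig(num_tables=8, rows_per_table=100_000, emb_dim=32,
                             lookups_per_table=80,
                             bottom_mlp=(256, 128, 32), top_mlp=(128, 64, 1))
print(example)
```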

* Table III summarizes the key microarchitectural performance bottlenecks for the different classes of recommendation models studied in this paper.

## Conclusion
* This paper provides a detailed performance analysis of production-scale recommendation models on the server-class systems deployed in data centers.
* This analysis lays the foundation for future full-stack hardware solutions targeting personalized recommendation.