# The Architectural Implications of Facebook’s DNN-based Personalized Recommendation
#### paper: [link](https://arxiv.org/pdf/1906.03109.pdf)
###### tags: `DLRM`
## Introduction
* Personalized recommendation for content ranking is now largely accomplished using deep neural networks. However, these models have received comparatively little architectural research attention.
* This paper presents a set of real-world, production-scale DNNs for personalized recommendation.
* focus on three model classes (RMC1, RMC2, RMC3) because they account for a significant share of AI inference cycles in Facebook's data centers

* benchmarks of production-scale models:
    * application-level constraint (use case): high throughput under strict latency constraints
    * embedding tables (memory intensive): larger storage requirements and more irregular memory accesses
    * fully-connected layers (compute intensive): distinct compute (FLOPs) and memory (bytes read) requirements; the sketch below contrasts the arithmetic intensity of the two operator types
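
To make the contrast concrete, here is a back-of-the-envelope comparison of FLOPs versus bytes read for the two operator types. The layer and table sizes below are illustrative assumptions, not figures from the paper.

```python
# Rough arithmetic-intensity comparison (illustrative sizes, not from the paper).

def fc_cost(batch, in_dim, out_dim, bytes_per_elem=4):
    flops = 2 * batch * in_dim * out_dim                # multiply-accumulates
    bytes_read = bytes_per_elem * (batch * in_dim + in_dim * out_dim)
    return flops, bytes_read

def embedding_cost(batch, lookups_per_sample, emb_dim, bytes_per_elem=4):
    # Pooled lookup: gather `lookups_per_sample` table rows and sum them.
    flops = batch * lookups_per_sample * emb_dim        # additions for pooling
    bytes_read = bytes_per_elem * batch * lookups_per_sample * emb_dim
    return flops, bytes_read

for name, (flops, bytes_read) in {
    "FC 512x512, batch 16": fc_cost(16, 512, 512),
    "Embedding, 80 lookups x 64-dim, batch 16": embedding_cost(16, 80, 64),
}.items():
    print(f"{name}: {flops/1e6:.2f} MFLOPs, {bytes_read/1e6:.2f} MB read, "
          f"intensity {flops/bytes_read:.2f} FLOP/byte")
```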

* Future system design:
    * using only latency to benchmark inference performance is insufficient; latency-bounded throughput is a more representative metric (a measurement sketch follows this list)
    * inference latency varies across the generations of Intel servers (Haswell, Broadwell, Skylake) that co-exist in data centers
    * co-locating multiple recommendation models on a single machine affects throughput and resource utilization
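
A minimal sketch of how latency-bounded throughput could be measured: sweep the batch size, record tail latency, and report the highest throughput whose p99 latency still meets a service-level target. The stand-in MLP, the 10 ms SLA, and the timing loop are assumptions for illustration, not the paper's methodology.

```python
import time
import torch

def p99_latency_ms(model, batch, trials=100):
    """Time `trials` forward passes and return the 99th-percentile latency in ms."""
    times = []
    with torch.no_grad():
        for _ in range(trials):
            start = time.perf_counter()
            model(batch)
            times.append((time.perf_counter() - start) * 1e3)
    times.sort()
    return times[int(0.99 * len(times)) - 1]

def latency_bounded_throughput(model, make_batch, sla_ms=10.0,
                               batch_sizes=(1, 4, 16, 64, 256)):
    """Largest queries/sec whose p99 latency still meets the SLA (0 if none do)."""
    best_qps = 0.0
    for bs in batch_sizes:
        p99 = p99_latency_ms(model, make_batch(bs))
        if p99 <= sla_ms:
            best_qps = max(best_qps, bs / (p99 / 1e3))
    return best_qps

# Example with a stand-in MLP; a real study would use the recommendation model.
mlp = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU(),
                          torch.nn.Linear(256, 1))
print(latency_bounded_throughput(mlp, lambda bs: torch.randn(bs, 256)))
```
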
## Background
* Task
    * dense feature: age
    * sparse feature: user's preferences for a genre of content (multi-hot vector)
* Models
    * FC layers, embedding tables, concatenation, and non-linearities such as ReLU

* Embedding table
    * Large storage capacity
    * Low compute intensity
    * Irregular memory accesses
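
The structure above can be sketched in a few lines of PyTorch (a simplified stand-in with made-up sizes, not Facebook's production code): dense features pass through a bottom MLP, each multi-hot sparse feature becomes a pooled `nn.EmbeddingBag` lookup, and the concatenated results feed a top MLP.

```python
import torch
import torch.nn as nn

class TinyRecModel(nn.Module):
    """Simplified DLRM-style recommendation model (illustrative sizes)."""
    def __init__(self, num_dense=13, table_rows=1000, emb_dim=32, num_tables=8):
        super().__init__()
        # Bottom MLP over dense features (e.g., age).
        self.bottom_mlp = nn.Sequential(nn.Linear(num_dense, 64), nn.ReLU(),
                                        nn.Linear(64, emb_dim), nn.ReLU())
        # One embedding table per sparse feature; sum pooling turns a
        # multi-hot vector (e.g., preferred genres) into one dense vector.
        self.tables = nn.ModuleList(
            [nn.EmbeddingBag(table_rows, emb_dim, mode="sum")
             for _ in range(num_tables)])
        # Top MLP over the concatenation of dense and sparse representations.
        self.top_mlp = nn.Sequential(nn.Linear(emb_dim * (num_tables + 1), 64),
                                     nn.ReLU(), nn.Linear(64, 1))

    def forward(self, dense, sparse_ids, sparse_offsets):
        parts = [self.bottom_mlp(dense)]
        for table, ids, offsets in zip(self.tables, sparse_ids, sparse_offsets):
            parts.append(table(ids, offsets))   # irregular, memory-bound gather
        return torch.sigmoid(self.top_mlp(torch.cat(parts, dim=1)))

# One sample, 8 sparse features with 4 lookups each.
model = TinyRecModel()
dense = torch.randn(1, 13)
ids = [torch.randint(0, 1000, (4,)) for _ in range(8)]
offsets = [torch.tensor([0]) for _ in range(8)]
print(model(dense, ids, offsets).shape)   # torch.Size([1, 1])
```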

## At-scale personalization
* Production Recommendation Pipeline
    * lightweight filtering
    * heavyweight ranking

* Focus on RMC1, RMC2, and RMC3.

* Server Architectures: Haswell, Broadwell, and Skylake generations co-exist in the data center

## Performance
1. Inference latency of the three models, with unit batch size, on an Intel Broadwell server (Fig. 7, left).
2. Accelerating matrix-multiplication operators alone (e.g., BatchMatMul and FC) provides limited end-to-end benefit (a rough Amdahl's-law estimate follows this list).

3. For a small batch size of 16, inference latency is optimized when the recommendation models are run on the Broadwell architecture.

4. While Broadwell is optimal at low batch-sizes, Skylake has higher performance with larger batch-sizes. This is a result of Skylake’s wider-SIMD (AVX-512) support.
5. We must balance low latency and high throughput; thus, even for inference, hardware solutions must consider batching and should optimize end-to-end model performance.
6. While co-location improves the overall throughput of high-end server architectures, it can shift performance bottlenecks when running production-scale recommendation models, leading to lower resource utilization.

7. Skylake’s higher performance with high co-location is a result of implementing an exclusive L2/L3 cache hierarchy.

8. Recommendation models exhibit performance variability in production environments.
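
As a sanity check on point 2 above, a rough Amdahl's-law estimate shows why accelerating only the FC/BatchMatMul operators yields limited end-to-end gains once embedding-table lookups and other operators dominate. The 40% operator-time share below is an assumed, illustrative number; the paper's actual operator breakdowns are in its figures.

```python
def end_to_end_speedup(matmul_fraction, matmul_speedup):
    """Amdahl's law: only the matmul share of runtime gets accelerated."""
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / matmul_speedup)

# If FC/BatchMatMul were, say, 40% of inference time (illustrative assumption),
# even an infinitely fast matrix engine caps the end-to-end speedup at ~1.67x.
for accel in (2, 10, float("inf")):
    print(f"{accel}x matmul -> {end_to_end_speedup(0.40, accel):.2f}x end-to-end")
```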

## Open-source benchmarks
The open-source benchmark is configured by five parameters:
1. The number of embedding tables
2. Input and output dimensions of embedding tables
3. Number of sparse lookups per embedding table
4. Depth/width of MLP layers for dense features (Bottom-MLP)
5. Depth/width of MLP layers after combining dense and sparse features (Top-MLP)
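
The five knobs above could be collected into a small configuration object like the following hypothetical sketch; the field names and example values are mine, not the flags or settings of the released benchmark. Varying these parameters yields the different model classes (RMC1/RMC2/RMC3) studied in the paper.

```python
from dataclasses import dataclass

@dataclass
class RecBenchmarkConfig:
    """Hypothetical container for the five knobs listed above (names are illustrative)."""
    num_tables: int           # 1. number of embedding tables
    rows_per_table: int       # 2. input dimension (rows) of each embedding table
    emb_dim: int              # 2. output dimension (row width) of each table
    lookups_per_table: int    # 3. sparse lookups per embedding table per sample
    bottom_mlp: tuple         # 4. layer widths of the dense-feature (Bottom) MLP
    top_mlp: tuple            # 5. layer widths of the MLP after feature combination

# Example values are made up; the actual RMC1/RMC2/RMC3 settings are in the paper.
example = RecBenchmarkConfig(num_tables=8, rows_per_table=100_000, emb_dim=32,
                             lookups_per_table=80,
                             bottom_mlp=(256, 128, 32), top_mlp=(128, 64, 1))
print(example)
```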

* Table III summarizes the key microarchitectural performance bottlenecks for the different classes of recommendation models studied in this paper.

## Conclusion
* This paper provides a detailed performance analysis of production-scale recommendation models on the server-class systems deployed in data centers.
* This analysis lays the foundation for future full-stack hardware solutions targeting personalized recommendation.