# DLRM
[github DLRM](https://github.com/facebookresearch/dlrm)
## Download
### Download on Taiwania 1 (conda)
```shell=
## load the miniconda module
module load miniconda3
## (the module includes cuda/conda)
## numpy(https://numpy.org/install/)
conda install -y numpy
## scikit-learn(https://pypi.org/project/scikit-learn/)
conda install -c conda-forge -y scikit-learn
## pytorch(https://pytorch.org/)
conda install -y pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch-nightly
## tensorboard(https://anaconda.org/conda-forge/tensorboard)
conda install -c conda-forge -y tensorboard
## future(https://anaconda.org/anaconda/future)
conda install -c anaconda -y future
## pydot(https://anaconda.org/rmg/pydot)
conda install -c rmg -y pydot
## tqdm(https://github.com/tqdm/tqdm)
conda install -c conda-forge -y tqdm
## onnx(https://github.com/onnx/onnx)
conda install -c conda-forge -y onnx
## torchviz(https://github.com/szagoruyko/pytorchviz)
pip install torchviz
## MLPerf Logging(https://github.com/mlcommons/logging)
git clone https://github.com/mlperf/logging.git mlperf-logging
pip install -e mlperf-logging
conda install -y cmake
## DLRM itself
git clone https://github.com/facebookresearch/dlrm.git
## download complete
```
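After the install, a quick sanity check (a minimal sketch; it only uses packages installed above) confirms the imports work and that PyTorch sees the GPU:
```python
# Quick import / CUDA sanity check for the environment installed above.
import numpy as np
import sklearn
import torch

print("numpy:", np.__version__)
print("scikit-learn:", sklearn.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```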
## Architecture (early model)
![](https://i.imgur.com/2XrJpXU.png)
![](https://i.imgur.com/DJSNBeC.png)
![](https://i.imgur.com/ry2KRer.png)
![](https://i.imgur.com/56DtUc2.png)
![](https://i.imgur.com/xEFyGN0.png)
![](https://i.imgur.com/gK3SdT4.png)
### embedding
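DLRM maps each sparse (categorical) feature through its own embedding table, and multi-hot lookups are pooled into a single vector; in the PyTorch code this is done with `nn.EmbeddingBag`. A minimal sketch with made-up sizes:
```python
# Minimal sketch of a DLRM-style pooled embedding lookup.
import torch
import torch.nn as nn

num_categories = 1000  # table rows (vocabulary size); assumed value
sparse_dim = 16        # embedding dim, cf. --arch-sparse-feature-size below

emb = nn.EmbeddingBag(num_categories, sparse_dim, mode="sum")

# Two samples: flattened indices plus offsets marking where each sample starts.
indices = torch.tensor([3, 17, 42, 7])  # sample 0 -> {3, 17, 42}, sample 1 -> {7}
offsets = torch.tensor([0, 3])

pooled = emb(indices, offsets)  # one dense vector per sample, shape (2, 16)
print(pooled.shape)
```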
### matrix factorization
![](https://i.imgur.com/4fo4tnP.png)
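Matrix factorization scores a user-item pair as the dot product of two learned embedding vectors; a minimal sketch (sizes are hypothetical):
```python
# Minimal matrix-factorization sketch: score ~ <user vector, item vector>.
import torch
import torch.nn as nn

n_users, n_items, k = 100, 200, 16  # assumed sizes
user_emb = nn.Embedding(n_users, k)
item_emb = nn.Embedding(n_items, k)

u = torch.tensor([0, 5])   # a small batch of user ids
i = torch.tensor([10, 3])  # matching item ids
score = (user_emb(u) * item_emb(i)).sum(dim=1)  # dot product per pair
print(score)
```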
### factorization machine
![](https://i.imgur.com/0EPLXAo.png)
![](https://i.imgur.com/c67bXVt.png)
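For reference, the standard second-order factorization machine prediction, where each feature $i$ has a latent vector $v_i$ and pairwise interactions are modeled by inner products:
$$
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
$$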
### multilayer perceptrons (MLP)
[Machine learning: neural networks (multilayer perceptron, MLP), with a detailed derivation of backpropagation](https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%A5%9E%E7%B6%93%E7%B6%B2%E8%B7%AF-%E5%A4%9A%E5%B1%A4%E6%84%9F%E7%9F%A5%E6%A9%9F-multilayer-perceptron-mlp-%E5%90%AB%E8%A9%B3%E7%B4%B0%E6%8E%A8%E5%B0%8E-ee4f3d5d1b41)
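As a concrete example, a bottom MLP with the "13-512-256-64-16" layout used in the commands below can be sketched like this (a minimal sketch, not the repo's `create_mlp`):
```python
# Minimal MLP matching the "13-512-256-64-16" bottom-MLP layout used below.
import torch
import torch.nn as nn

sizes = [13, 512, 256, 64, 16]
layers = []
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    layers += [nn.Linear(d_in, d_out), nn.ReLU()]
mlp = nn.Sequential(*layers)

x = torch.randn(2, 13)  # 2 samples of 13 dense features
print(mlp(x).shape)     # torch.Size([2, 16])
```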
## Main scripts
* dlrm_s_pytorch.py - DLRM implemented in PyTorch
```
optional arguments:
-h, --help show this help message and exit
--arch-sparse-feature-size ARCH_SPARSE_FEATURE_SIZE
--arch-embedding-size ARCH_EMBEDDING_SIZE
--arch-mlp-bot ARCH_MLP_BOT
--arch-mlp-top ARCH_MLP_TOP
--arch-interaction-op {dot,cat}
--arch-interaction-itself
--weighted-pooling WEIGHTED_POOLING
--md-flag
--md-threshold MD_THRESHOLD
--md-temperature MD_TEMPERATURE
--md-round-dims
--qr-flag
--qr-threshold QR_THRESHOLD
--qr-operation QR_OPERATION
--qr-collisions QR_COLLISIONS
--activation-function ACTIVATION_FUNCTION
--loss-function LOSS_FUNCTION
--loss-weights LOSS_WEIGHTS
--loss-threshold LOSS_THRESHOLD
--round-targets ROUND_TARGETS
--data-size DATA_SIZE
--num-batches NUM_BATCHES
--data-generation DATA_GENERATION
--rand-data-dist RAND_DATA_DIST
--rand-data-min RAND_DATA_MIN
--rand-data-max RAND_DATA_MAX
--rand-data-mu RAND_DATA_MU
--rand-data-sigma RAND_DATA_SIGMA
--data-trace-file DATA_TRACE_FILE
--data-set DATA_SET
--raw-data-file RAW_DATA_FILE
--processed-data-file PROCESSED_DATA_FILE
--data-randomize DATA_RANDOMIZE
--data-trace-enable-padding DATA_TRACE_ENABLE_PADDING
--max-ind-range MAX_IND_RANGE
--data-sub-sample-rate DATA_SUB_SAMPLE_RATE
--num-indices-per-lookup NUM_INDICES_PER_LOOKUP
--num-indices-per-lookup-fixed NUM_INDICES_PER_LOOKUP_FIXED
--num-workers NUM_WORKERS
--memory-map
--mini-batch-size MINI_BATCH_SIZE
--nepochs NEPOCHS
--learning-rate LEARNING_RATE
--print-precision PRINT_PRECISION
--numpy-rand-seed NUMPY_RAND_SEED
--sync-dense-params SYNC_DENSE_PARAMS
--optimizer OPTIMIZER
--dataset-multiprocessing
The Kaggle dataset can be multiprocessed in an environment with more than 7 CPU cores and more than 20 GB of memory. The Terabyte
dataset can be multiprocessed in an environment with more than 24 CPU cores and at least 1 TB of memory.
--inference-only
--quantize-mlp-with-bit QUANTIZE_MLP_WITH_BIT
--quantize-emb-with-bit QUANTIZE_EMB_WITH_BIT
--save-onnx
--use-gpu
--local_rank LOCAL_RANK
--dist-backend DIST_BACKEND
--print-freq PRINT_FREQ
--test-freq TEST_FREQ
--test-mini-batch-size TEST_MINI_BATCH_SIZE
--test-num-workers TEST_NUM_WORKERS
--print-time
--print-wall-time
--debug-mode
--enable-profiling
--plot-compute-graph
--tensor-board-filename TENSOR_BOARD_FILENAME
--save-model SAVE_MODEL
--load-model LOAD_MODEL
--mlperf-logging
--mlperf-acc-threshold MLPERF_ACC_THRESHOLD
--mlperf-auc-threshold MLPERF_AUC_THRESHOLD
--mlperf-bin-loader
--mlperf-bin-shuffle
--mlperf-grad-accum-iter MLPERF_GRAD_ACCUM_ITER
--lr-num-warmup-steps LR_NUM_WARMUP_STEPS
--lr-decay-start-step LR_DECAY_START_STEP
--lr-num-decay-steps LR_NUM_DECAY_STEPS
```
* dlrm_s_caffe2.py - DLRM implemented in Caffe2
```
optional arguments:
--arch-sparse-feature-size ARCH_SPARSE_FEATURE_SIZE
--arch-embedding-size ARCH_EMBEDDING_SIZE
--arch-mlp-bot ARCH_MLP_BOT
--arch-mlp-top ARCH_MLP_TOP
--arch-interaction-op ARCH_INTERACTION_OP
--arch-interaction-itself
--activation-function ACTIVATION_FUNCTION
--loss-function LOSS_FUNCTION
--loss-threshold LOSS_THRESHOLD
--round-targets ROUND_TARGETS
--weighted-pooling WEIGHTED_POOLING
--data-size DATA_SIZE
--num-batches NUM_BATCHES
--data-generation DATA_GENERATION
--rand-data-dist RAND_DATA_DIST
--rand-data-min RAND_DATA_MIN
--rand-data-max RAND_DATA_MAX
--rand-data-mu RAND_DATA_MU
--rand-data-sigma RAND_DATA_SIGMA
--data-trace-file DATA_TRACE_FILE
--data-set DATA_SET
--raw-data-file RAW_DATA_FILE
--processed-data-file PROCESSED_DATA_FILE
--data-randomize DATA_RANDOMIZE
--data-trace-enable-padding DATA_TRACE_ENABLE_PADDING
--max-ind-range MAX_IND_RANGE
--data-sub-sample-rate DATA_SUB_SAMPLE_RATE
--num-indices-per-lookup NUM_INDICES_PER_LOOKUP
--num-indices-per-lookup-fixed NUM_INDICES_PER_LOOKUP_FIXED
--num-workers NUM_WORKERS
--memory-map
--mini-batch-size MINI_BATCH_SIZE
--nepochs NEPOCHS
--learning-rate LEARNING_RATE
--print-precision PRINT_PRECISION
--numpy-rand-seed NUMPY_RAND_SEED
--sync-dense-params SYNC_DENSE_PARAMS
--caffe2-net-type CAFFE2_NET_TYPE
--optimizer OPTIMIZER
This is the optimizer for embedding tables.
--inference-only
--save-onnx
--save-proto-types-shapes
--use-gpu
--print-freq PRINT_FREQ
--test-freq TEST_FREQ
--test-mini-batch-size TEST_MINI_BATCH_SIZE
--test-num-workers TEST_NUM_WORKERS
--print-time
--debug-mode
--enable-profiling
--plot-compute-graph
--mlperf-logging
--mlperf-acc-threshold MLPERF_ACC_THRESHOLD
--mlperf-auc-threshold MLPERF_AUC_THRESHOLD
```
## dataset
### How to get the test dataset
[kaggle](https://www.kaggle.com/)
The Criteo Kaggle Display Advertising Challenge dataset link on GitHub is no longer available.
For the download procedure, see [Downloading the Criteo dataset from Kaggle](https://blog.csdn.net/songbinxu/article/details/116588580).
### Test dataset
* The earlier Criteo_dataset (Kaggle)
[Criteo_dataset](https://www.kaggle.com/mrkmakr/criteo-dataset/tasks)
* Since this dataset is supported by the original DLRM out of the box, it can be used directly without modifying the .py files
* bash command (my version)
```
python dlrm_s_pytorch.py \
--arch-sparse-feature-size=16 \
--arch-mlp-bot="13-512-256-64-16" \
--arch-mlp-top="512-256-1" \
--data-generation=dataset \
--data-set=kaggle \
--raw-data-file=/home/nckuhpclab04/dlrm/criteo/dac/train.txt \
--processed-data-file=/home/nckuhpclab04/dlrm/criteo/dac/kaggleAdDisplayChallenge_processed.npz \
--loss-function=bce \
--learning-rate=0.1 \
--mini-batch-size=128 \
--use-gpu
```
* bash command (from the benchmark)
```
python dlrm_s_pytorch.py \
--arch-sparse-feature-size=16 \
--arch-mlp-bot="13-512-256-64-16" \
--arch-mlp-top="512-256-1" \
--data-generation=dataset \
--data-set=kaggle \
--raw-data-file=./input/train.txt \
--processed-data-file=./input/kaggleAdDisplayChallenge_processed.npz \
--loss-function=bce \
--round-targets=True \
--learning-rate=0.1 \
--mini-batch-size=128 \
--print-freq=1024 \
--print-time \
--test-mini-batch-size=16384 \
--test-num-workers=16
```
### Main program (dlrm_s_pytorch.py)
Structure (functions)
1. dataset and dataloader
```
class CriteoDataset(Dataset):
    def __init__(
        self,
        dataset,
        max_ind_range,
        sub_sample_rate,
        randomize,
        split="train",
        raw_path="",
        pro_data="",
        memory_map=False,
        dataset_multiprocessing=False,
    ): ...
    def __getitem__(self, index): ...
    def _default_preprocess(self, X_int, X_cat, y): ...
    def __len__(self): ...
```
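The class above (defined in dlrm_data_pytorch.py in the repo) is typically wrapped in a plain `DataLoader`. A hedged usage sketch (the paths are placeholders, and the collate helper's exact name may differ between repo versions):
```python
# Sketch: wire CriteoDataset into a DataLoader.
# Paths are placeholders; collate_wrapper_criteo is the Criteo collate helper
# in dlrm_data_pytorch.py (check the exact name in your checkout).
from torch.utils.data import DataLoader
import dlrm_data_pytorch as dp

train_data = dp.CriteoDataset(
    dataset="kaggle",
    max_ind_range=-1,
    sub_sample_rate=0.0,
    randomize="total",
    split="train",
    raw_path="./input/train.txt",
    pro_data="./input/kaggleAdDisplayChallenge_processed.npz",
)

train_loader = DataLoader(
    train_data,
    batch_size=128,
    shuffle=False,
    collate_fn=dp.collate_wrapper_criteo,  # name may vary by version
)
```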
2. dataset transform
### DLRM Operators by Framework
![](https://i.imgur.com/f8DXWsq.png)
### PyTorch dataset and dataloader (batch training)
For large datasets, a better way to load the data is to divide the samples into smaller batches, so each optimization step uses only one batch.
#### Key terms
**epoch** = 1 forward and backward pass over ALL training samples
**batch_size** = number of training samples in one forward & backward pass
**number of iterations** = number of passes, each pass using "batch_size" samples
e.g. 100 samples, batch_size=20 --> 100/20 = 5 iterations for 1 epoch
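The same arithmetic falls out of a `DataLoader` directly; a toy sketch with 100 random samples:
```python
# 100 samples with batch_size=20 -> 5 iterations per epoch.
import torch
from torch.utils.data import TensorDataset, DataLoader

data = TensorDataset(torch.randn(100, 4), torch.randn(100, 1))
loader = DataLoader(data, batch_size=20)

print(len(loader))     # 5 iterations per epoch
for xb, yb in loader:  # one full pass over loader = one epoch
    pass
```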
## Terabyte dataset (to be tested)
benchmark + MLPerf standard + GPU + extra options (change the input paths accordingly)
```
python dlrm_s_pytorch.py \
--arch-sparse-feature-size=128 \
--arch-mlp-bot="13-512-256-128" \
--arch-mlp-top="1024-1024-512-256-1" \
--max-ind-range=40000000 \
--data-generation=dataset \
--data-set=terabyte \
--raw-data-file=./input/day \
--processed-data-file=./input/terabyte_processed.npz \
--loss-function=bce \
--round-targets=True \
--learning-rate=1.0 \
--mini-batch-size=2048 \
--print-freq=2048 \
--print-time \
--test-freq=102400 \
--test-mini-batch-size=16384 \
--test-num-workers=16 \
--memory-map \
--mlperf-logging \
--use-gpu \
--mlperf-auc-threshold=0.8025 \
--mlperf-bin-loader \
--mlperf-bin-shuffle 2>&1 | tee run_terabyte_mlperf_pt.log
```
extra options
```
["--test-freq=10240 --memory-map --data-sub-sample-rate=0.875"]
```
Single node, 8 GPUs, NCCL backend + benchmark + MLPerf + GPU + extra options (change the input paths accordingly)
```
python -m torch.distributed.launch --nproc_per_node=8 dlrm_s_pytorch.py \
--arch-sparse-feature-size=64 \
--arch-mlp-bot="13-512-256-64" \
--arch-mlp-top="512-512-256-1" \
--max-ind-range=10000000 \
--data-generation=dataset \
--data-set=terabyte \
--raw-data-file=./input/day \
--processed-data-file=./input/terabyte_processed.npz \
--loss-function=bce \
--round-targets=True \
--learning-rate=0.1 \
--mini-batch-size=2048 \
--print-freq=1024 \
--print-time \
--test-mini-batch-size=16384 \
--test-num-workers=16 \
--use-gpu \
--dist-backend=nccl \
--test-freq=10240 \
--memory-map \
--data-sub-sample-rate=0.875 \
--mlperf-logging \
--mlperf-auc-threshold=0.8025 \
--mlperf-bin-loader \
--mlperf-bin-shuffle 2>&1 | tee run_terabyte_pt_mlperf.log
```
:::warning
Two ways to run distributed training are provided: the python launcher and MPI.
The backend can be NCCL / Gloo / MPI.
:::
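For reference, the generic `torch.distributed` initialization pattern with the NCCL backend looks like this (a sketch of the standard API, not the repo's exact extended-distributed code; the environment variables are set by the launcher):
```python
# Generic torch.distributed init with the NCCL backend. RANK, WORLD_SIZE,
# MASTER_ADDR, MASTER_PORT (and LOCAL_RANK) are set by the launcher.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()} / world size {dist.get_world_size()}")
```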
Multi-node (not yet verified)
```
# for multiple nodes, user can add the related argument according to the launcher manual like:
--nnodes=2 --node_rank=0 --master_addr="192.168.1.1" --master_port=1234
```
```
python dlrm_s_pytorch.py \
--arch-sparse-feature-size=128 \
--arch-mlp-bot="13-512-256-128" \
--arch-mlp-top="1024-1024-512-256-1" \
--max-ind-range=40000000 \
--data-generation=dataset \
--data-set=terabyte \
--raw-data-file=./input/day \
--processed-data-file=./input/terabyte_processed.npz \
--loss-function=bce \
--round-targets=True \
--learning-rate=1.0 \
--mini-batch-size=2048 \
--print-freq=2048 \
--print-time \
--test-freq=102400 \
--test-mini-batch-size=16384 \
--test-num-workers=16 \
--memory-map \
--mlperf-logging \
--mlperf-auc-threshold=0.8025 \
--mlperf-bin-loader \
--mlperf-bin-shuffle $dlrm_extra_option
```
---
Some references and notes on DLRM distributed training:
[PyTorch distributed training](https://zhuanlan.zhihu.com/p/76638962)
###### tags: `DLRM`