# DLRM
[github DLRM](https://github.com/facebookresearch/dlrm)
## Download
### Download on Taiwania 1 (conda)
```shell=
## load the miniconda module
module load miniconda3
## (the module includes cuda/conda)
## numpy(https://numpy.org/install/)
conda install -y numpy
## scikit-learn(https://pypi.org/project/scikit-learn/)
conda install -c conda-forge -y scikit-learn
## pytorch(https://pytorch.org/)
conda install -y pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch-nightly
## tensorboard(https://anaconda.org/conda-forge/tensorboard)
conda install -c conda-forge -y tensorboard
## future(https://anaconda.org/anaconda/future)
conda install -c anaconda -y future
## pydot(https://anaconda.org/rmg/pydot)
conda install -c rmg -y pydot
## tqdm(https://github.com/tqdm/tqdm)
conda install -c conda-forge -y tqdm
## onnx(https://github.com/onnx/onnx)
conda install -c conda-forge -y onnx
## torchviz(https://github.com/szagoruyko/pytorchviz)
pip install torchviz
## MLPerf Logging(https://github.com/mlcommons/logging)
git clone https://github.com/mlperf/logging.git mlperf-logging
pip install -e mlperf-logging
conda install -y cmake
## DLRM itself
git clone https://github.com/facebookresearch/dlrm.git
## download complete
```
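After the install, a quick sanity check (a minimal sketch; it only uses packages installed above) confirms the imports work and that PyTorch sees the GPU:
```python
# Quick import / CUDA sanity check for the environment installed above.
import numpy as np
import sklearn
import torch

print("numpy:", np.__version__)
print("scikit-learn:", sklearn.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```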
## Architecture (early model)
![](https://i.imgur.com/2XrJpXU.png)
![](https://i.imgur.com/DJSNBeC.png)
![](https://i.imgur.com/ry2KRer.png)
![](https://i.imgur.com/56DtUc2.png)
![](https://i.imgur.com/xEFyGN0.png)
![](https://i.imgur.com/gK3SdT4.png)
### embedding
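DLRM maps each sparse (categorical) feature through its own embedding table, and multi-hot lookups are pooled into a single vector; in the PyTorch code this is done with `nn.EmbeddingBag`. A minimal sketch with made-up sizes:
```python
# Minimal sketch of a DLRM-style pooled embedding lookup.
import torch
import torch.nn as nn

num_categories = 1000  # table rows (vocabulary size); assumed value
sparse_dim = 16        # embedding dim, cf. --arch-sparse-feature-size below

emb = nn.EmbeddingBag(num_categories, sparse_dim, mode="sum")

# Two samples: flattened indices plus offsets marking where each sample starts.
indices = torch.tensor([3, 17, 42, 7])  # sample 0 -> {3, 17, 42}, sample 1 -> {7}
offsets = torch.tensor([0, 3])

pooled = emb(indices, offsets)  # one dense vector per sample, shape (2, 16)
print(pooled.shape)
```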
### matrix factorization
![](https://i.imgur.com/4fo4tnP.png)
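Matrix factorization scores a user-item pair as the dot product of two learned embedding vectors; a minimal sketch (sizes are hypothetical):
```python
# Minimal matrix-factorization sketch: score ~ <user vector, item vector>.
import torch
import torch.nn as nn

n_users, n_items, k = 100, 200, 16  # assumed sizes
user_emb = nn.Embedding(n_users, k)
item_emb = nn.Embedding(n_items, k)

u = torch.tensor([0, 5])   # a small batch of user ids
i = torch.tensor([10, 3])  # matching item ids
score = (user_emb(u) * item_emb(i)).sum(dim=1)  # dot product per pair
print(score)
```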
### factorization machine
![](https://i.imgur.com/0EPLXAo.png)
![](https://i.imgur.com/c67bXVt.png)
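For reference, the standard second-order factorization machine prediction, where each feature $i$ has a latent vector $v_i$ and pairwise interactions are modeled by inner products:
$$
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
$$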
### multilayer perceptrons (MLP)
[Machine learning: neural networks (multilayer perceptron, MLP), with a detailed derivation of backpropagation](https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%A5%9E%E7%B6%93%E7%B6%B2%E8%B7%AF-%E5%A4%9A%E5%B1%A4%E6%84%9F%E7%9F%A5%E6%A9%9F-multilayer-perceptron-mlp-%E5%90%AB%E8%A9%B3%E7%B4%B0%E6%8E%A8%E5%B0%8E-ee4f3d5d1b41)
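As a concrete example, a bottom MLP with the "13-512-256-64-16" layout used in the commands below can be sketched like this (a minimal sketch, not the repo's `create_mlp`):
```python
# Minimal MLP matching the "13-512-256-64-16" bottom-MLP layout used below.
import torch
import torch.nn as nn

sizes = [13, 512, 256, 64, 16]
layers = []
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    layers += [nn.Linear(d_in, d_out), nn.ReLU()]
mlp = nn.Sequential(*layers)

x = torch.randn(2, 13)  # 2 samples of 13 dense features
print(mlp(x).shape)     # torch.Size([2, 16])
```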
## Main scripts
* dlrm_s_pytorch.py - DLRM implemented in PyTorch
```
optional arguments:
-h, --help show this help message and exit
--arch-sparse-feature-size ARCH_SPARSE_FEATURE_SIZE
--arch-embedding-size ARCH_EMBEDDING_SIZE
--arch-mlp-bot ARCH_MLP_BOT
--arch-mlp-top ARCH_MLP_TOP
--arch-interaction-op {dot,cat}
--arch-interaction-itself
--weighted-pooling WEIGHTED_POOLING
--md-flag
--md-threshold MD_THRESHOLD
--md-temperature MD_TEMPERATURE
--md-round-dims
--qr-flag
--qr-threshold QR_THRESHOLD
--qr-operation QR_OPERATION
--qr-collisions QR_COLLISIONS
--activation-function ACTIVATION_FUNCTION
--loss-function LOSS_FUNCTION
--loss-weights LOSS_WEIGHTS
--loss-threshold LOSS_THRESHOLD
--round-targets ROUND_TARGETS
--data-size DATA_SIZE
--num-batches NUM_BATCHES
--data-generation DATA_GENERATION
--rand-data-dist RAND_DATA_DIST
--rand-data-min RAND_DATA_MIN
--rand-data-max RAND_DATA_MAX
--rand-data-mu RAND_DATA_MU
--rand-data-sigma RAND_DATA_SIGMA
--data-trace-file DATA_TRACE_FILE
--data-set DATA_SET
--raw-data-file RAW_DATA_FILE
--processed-data-file PROCESSED_DATA_FILE
--data-randomize DATA_RANDOMIZE
--data-trace-enable-padding DATA_TRACE_ENABLE_PADDING
--max-ind-range MAX_IND_RANGE
--data-sub-sample-rate DATA_SUB_SAMPLE_RATE
--num-indices-per-lookup NUM_INDICES_PER_LOOKUP
--num-indices-per-lookup-fixed NUM_INDICES_PER_LOOKUP_FIXED
--num-workers NUM_WORKERS
--memory-map
--mini-batch-size MINI_BATCH_SIZE
--nepochs NEPOCHS
--learning-rate LEARNING_RATE
--print-precision PRINT_PRECISION
--numpy-rand-seed NUMPY_RAND_SEED
--sync-dense-params SYNC_DENSE_PARAMS
--optimizer OPTIMIZER
--dataset-multiprocessing
The Kaggle dataset can be multiprocessed in an environment with more than 7 CPU cores and more than 20 GB of memory. The Terabyte
dataset can be multiprocessed in an environment with more than 24 CPU cores and at least 1 TB of memory.
--inference-only
--quantize-mlp-with-bit QUANTIZE_MLP_WITH_BIT
--quantize-emb-with-bit QUANTIZE_EMB_WITH_BIT
--save-onnx
--use-gpu
--local_rank LOCAL_RANK
--dist-backend DIST_BACKEND
--print-freq PRINT_FREQ
--test-freq TEST_FREQ
--test-mini-batch-size TEST_MINI_BATCH_SIZE
--test-num-workers TEST_NUM_WORKERS
--print-time
--print-wall-time
--debug-mode
--enable-profiling
--plot-compute-graph
--tensor-board-filename TENSOR_BOARD_FILENAME
--save-model SAVE_MODEL
--load-model LOAD_MODEL
--mlperf-logging
--mlperf-acc-threshold MLPERF_ACC_THRESHOLD
--mlperf-auc-threshold MLPERF_AUC_THRESHOLD
--mlperf-bin-loader
--mlperf-bin-shuffle
--mlperf-grad-accum-iter MLPERF_GRAD_ACCUM_ITER
--lr-num-warmup-steps LR_NUM_WARMUP_STEPS
--lr-decay-start-step LR_DECAY_START_STEP
--lr-num-decay-steps LR_NUM_DECAY_STEPS
```
* dlrm_s_caffe2.py - DLRM implemented in Caffe2
```
optional arguments:
--arch-sparse-feature-size ARCH_SPARSE_FEATURE_SIZE
--arch-embedding-size ARCH_EMBEDDING_SIZE
--arch-mlp-bot ARCH_MLP_BOT
--arch-mlp-top ARCH_MLP_TOP
--arch-interaction-op ARCH_INTERACTION_OP
--arch-interaction-itself
--activation-function ACTIVATION_FUNCTION
--loss-function LOSS_FUNCTION
--loss-threshold LOSS_THRESHOLD
--round-targets ROUND_TARGETS
--weighted-pooling WEIGHTED_POOLING
--data-size DATA_SIZE
--num-batches NUM_BATCHES
--data-generation DATA_GENERATION
--rand-data-dist RAND_DATA_DIST
--rand-data-min RAND_DATA_MIN
--rand-data-max RAND_DATA_MAX
--rand-data-mu RAND_DATA_MU
--rand-data-sigma RAND_DATA_SIGMA
--data-trace-file DATA_TRACE_FILE
--data-set DATA_SET
--raw-data-file RAW_DATA_FILE
--processed-data-file PROCESSED_DATA_FILE
--data-randomize DATA_RANDOMIZE
--data-trace-enable-padding DATA_TRACE_ENABLE_PADDING
--max-ind-range MAX_IND_RANGE
--data-sub-sample-rate DATA_SUB_SAMPLE_RATE
--num-indices-per-lookup NUM_INDICES_PER_LOOKUP
--num-indices-per-lookup-fixed NUM_INDICES_PER_LOOKUP_FIXED
--num-workers NUM_WORKERS
--memory-map
--mini-batch-size MINI_BATCH_SIZE
--nepochs NEPOCHS
--learning-rate LEARNING_RATE
--print-precision PRINT_PRECISION
--numpy-rand-seed NUMPY_RAND_SEED
--sync-dense-params SYNC_DENSE_PARAMS
--caffe2-net-type CAFFE2_NET_TYPE
--optimizer OPTIMIZER
This is the optimizer for embedding tables.
--inference-only
--save-onnx
--save-proto-types-shapes
--use-gpu
--print-freq PRINT_FREQ
--test-freq TEST_FREQ
--test-mini-batch-size TEST_MINI_BATCH_SIZE
--test-num-workers TEST_NUM_WORKERS
--print-time
--debug-mode
--enable-profiling
--plot-compute-graph
--mlperf-logging
--mlperf-acc-threshold MLPERF_ACC_THRESHOLD
--mlperf-auc-threshold MLPERF_AUC_THRESHOLD
```
## dataset
### How to get the test dataset
[kaggle](https://www.kaggle.com/)
The Criteo Kaggle Display Advertising Challenge dataset link on GitHub is no longer available.
For the download procedure, see [Downloading the Criteo dataset from Kaggle](https://blog.csdn.net/songbinxu/article/details/116588580).
### Test dataset
* The earlier Criteo_dataset (Kaggle)
[Criteo_dataset](https://www.kaggle.com/mrkmakr/criteo-dataset/tasks)
* Since this dataset is supported by the original DLRM out of the box, it can be used directly without modifying the .py files
* bash command (my version)
```
python dlrm_s_pytorch.py \
--arch-sparse-feature-size=16 \
--arch-mlp-bot="13-512-256-64-16" \
--arch-mlp-top="512-256-1" \
--data-generation=dataset \
--data-set=kaggle \
--raw-data-file=/home/nckuhpclab04/dlrm/criteo/dac/train.txt \
--processed-data-file=/home/nckuhpclab04/dlrm/criteo/dac/kaggleAdDisplayChallenge_processed.npz \
--loss-function=bce \
--learning-rate=0.1 \
--mini-batch-size=128 \
--use-gpu
```
* bash command (from the benchmark)
```
python dlrm_s_pytorch.py \
--arch-sparse-feature-size=16 \
--arch-mlp-bot="13-512-256-64-16" \
--arch-mlp-top="512-256-1" \
--data-generation=dataset \
--data-set=kaggle \
--raw-data-file=./input/train.txt \
--processed-data-file=./input/kaggleAdDisplayChallenge_processed.npz \
--loss-function=bce \
--round-targets=True \
--learning-rate=0.1 \
--mini-batch-size=128 \
--print-freq=1024 \
--print-time \
--test-mini-batch-size=16384 \
--test-num-workers=16
```
### Main program (dlrm_s_pytorch.py)
Structure (functions)
1. dataset and dataloader
```
class CriteoDataset(Dataset):
    def __init__(
        self,
        dataset,
        max_ind_range,
        sub_sample_rate,
        randomize,
        split="train",
        raw_path="",
        pro_data="",
        memory_map=False,
        dataset_multiprocessing=False,
    ): ...
    def __getitem__(self, index): ...
    def _default_preprocess(self, X_int, X_cat, y): ...
    def __len__(self): ...
```
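The class above (defined in dlrm_data_pytorch.py in the repo) is typically wrapped in a plain `DataLoader`. A hedged usage sketch (the paths are placeholders, and the collate helper's exact name may differ between repo versions):
```python
# Sketch: wire CriteoDataset into a DataLoader.
# Paths are placeholders; collate_wrapper_criteo is the Criteo collate helper
# in dlrm_data_pytorch.py (check the exact name in your checkout).
from torch.utils.data import DataLoader
import dlrm_data_pytorch as dp

train_data = dp.CriteoDataset(
    dataset="kaggle",
    max_ind_range=-1,
    sub_sample_rate=0.0,
    randomize="total",
    split="train",
    raw_path="./input/train.txt",
    pro_data="./input/kaggleAdDisplayChallenge_processed.npz",
)

train_loader = DataLoader(
    train_data,
    batch_size=128,
    shuffle=False,
    collate_fn=dp.collate_wrapper_criteo,  # name may vary by version
)
```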
2. dataset transform
### DLRM Operators by Framework
![](https://i.imgur.com/f8DXWsq.png)
### PyTorch dataset and dataloader (batch training)
For large datasets, a better way to load the data is to divide the samples into smaller batches, so each optimization step uses only one batch.
#### Key terms
**epoch** = 1 forward and backward pass over ALL training samples
**batch_size** = number of training samples in one forward & backward pass
**number of iterations** = number of passes, each pass using "batch_size" samples
e.g. 100 samples, batch_size=20 --> 100/20 = 5 iterations for 1 epoch
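The same arithmetic falls out of a `DataLoader` directly; a toy sketch with 100 random samples:
```python
# 100 samples with batch_size=20 -> 5 iterations per epoch.
import torch
from torch.utils.data import TensorDataset, DataLoader

data = TensorDataset(torch.randn(100, 4), torch.randn(100, 1))
loader = DataLoader(data, batch_size=20)

print(len(loader))     # 5 iterations per epoch
for xb, yb in loader:  # one full pass over loader = one epoch
    pass
```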
## Terabyte dataset (to be tested)
benchmark + MLPerf standard + GPU + extra options (change the input paths accordingly)
```
python dlrm_s_pytorch.py \
--arch-sparse-feature-size=128 \
--arch-mlp-bot="13-512-256-128" \
--arch-mlp-top="1024-1024-512-256-1" \
--max-ind-range=40000000 \
--data-generation=dataset \
--data-set=terabyte \
--raw-data-file=./input/day \
--processed-data-file=./input/terabyte_processed.npz \
--loss-function=bce \
--round-targets=True \
--learning-rate=1.0 \
--mini-batch-size=2048 \
--print-freq=2048 \
--print-time \
--test-freq=102400 \
--test-mini-batch-size=16384 \
--test-num-workers=16 \
--memory-map \
--mlperf-logging \
--use-gpu \
--mlperf-auc-threshold=0.8025 \
--mlperf-bin-loader \
--mlperf-bin-shuffle 2>&1 | tee run_terabyte_mlperf_pt.log
```
extra options
```
["--test-freq=10240 --memory-map --data-sub-sample-rate=0.875"]
```
Single node, 8 GPUs, NCCL backend + benchmark + MLPerf + GPU + extra options (change the input paths accordingly)
```
python -m torch.distributed.launch --nproc_per_node=8 dlrm_s_pytorch.py \
--arch-sparse-feature-size=64 \
--arch-mlp-bot="13-512-256-64" \
--arch-mlp-top="512-512-256-1" \
--max-ind-range=10000000 \
--data-generation=dataset \
--data-set=terabyte \
--raw-data-file=./input/day \
--processed-data-file=./input/terabyte_processed.npz \
--loss-function=bce \
--round-targets=True \
--learning-rate=0.1 \
--mini-batch-size=2048 \
--print-freq=1024 \
--print-time \
--test-mini-batch-size=16384 \
--test-num-workers=16 \
--use-gpu \
--dist-backend=nccl \
--test-freq=10240 \
--memory-map \
--data-sub-sample-rate=0.875 \
--mlperf-logging \
--mlperf-auc-threshold=0.8025 \
--mlperf-bin-loader \
--mlperf-bin-shuffle 2>&1 | tee run_terabyte_pt_mlperf.log
```
:::warning
Two ways to run distributed training are provided: the python launcher and MPI.
The backend can be NCCL / Gloo / MPI.
:::
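For reference, the generic `torch.distributed` initialization pattern with the NCCL backend looks like this (a sketch of the standard API, not the repo's exact extended-distributed code; the environment variables are set by the launcher):
```python
# Generic torch.distributed init with the NCCL backend. RANK, WORLD_SIZE,
# MASTER_ADDR, MASTER_PORT (and LOCAL_RANK) are set by the launcher.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()} / world size {dist.get_world_size()}")
```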
Multi-node (not yet verified)
```
# for multiple nodes, user can add the related argument according to the launcher manual like:
--nnodes=2 --node_rank=0 --master_addr="192.168.1.1" --master_port=1234
```
```
python dlrm_s_pytorch.py \
--arch-sparse-feature-size=128 \
--arch-mlp-bot="13-512-256-128" \
--arch-mlp-top="1024-1024-512-256-1" \
--max-ind-range=40000000 \
--data-generation=dataset \
--data-set=terabyte \
--raw-data-file=./input/day \
--processed-data-file=./input/terabyte_processed.npz \
--loss-function=bce \
--round-targets=True \
--learning-rate=1.0 \
--mini-batch-size=2048 \
--print-freq=2048 \
--print-time \
--test-freq=102400 \
--test-mini-batch-size=16384 \
--test-num-workers=16 \
--memory-map \
--mlperf-logging \
--mlperf-auc-threshold=0.8025 \
--mlperf-bin-loader \
--mlperf-bin-shuffle $dlrm_extra_option
```
---
Some references and notes on DLRM distributed training:
[PyTorch distributed training](https://zhuanlan.zhihu.com/p/76638962)
###### tags: `DLRM`