NVIDIA DALI Works Wonders: Accelerating Data Loading and Training
===

[TOC]
# What is NVIDIA DALI?
Deep learning usually requires a complex, multi-stage data pre-processing pipeline, and this kind of processing (especially for images) is mostly performed on the CPU. For example, **reading training data from disk**, **decoding images**, **cropping**, **random resizing**, **data augmentation (Data Augmentation)**, and **format conversion (NCHW or NHWC)** are all typically executed on the CPU, which limits training and inference performance.
**NVIDIA DALI, NVIDIA’s Data Loading Library, is a collection of highly optimized building blocks, and an execution engine, to accelerate the pre-processing of the input data for deep learning applications.** That is NVIDIA's official definition. In short, DALI is a library that accelerates data loading and pre-processing (especially for images) in deep learning, and its main goal is to use the GPU as fully as possible so that none of its capacity goes to waste.
A deep learning training job can roughly be divided into 3 steps:
1. **Store the training data (images, video, text, audio, or other formats) on the server's disk.**
2. **Load the data into memory with the CPU and run pre-processing such as decoding and data augmentation; this step depends heavily on CPU compute.**
3. **Copy the processed data into GPU memory and run the training itself; this part depends on GPU compute.**

In this flow, the CPU and GPU stages should take roughly the same time, so that the CPU always has processed data ready for the GPU to train or infer on. If the CPU is much slower than the GPU, the GPU constantly waits for the CPU and its compute is wasted, and vice versa.
DALI addresses this by moving part of the data pre-processing from the CPU to the GPU, which effectively improves performance and makes fuller use of the GPU's compute capacity.

---
# Main features of DALI
## Pipeline workflow
DALI provides a simple Python interface; a data pre-processing pipeline can be implemented through the following steps:
1. **Select Operators from this extensive [list of supported operators](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html)**
2. **Define the operation flow as a symbolic graph in an imperative way (as in most of the current deep learning frameworks)**
3. **Build an operation pipeline**
4. **Run graph on demand**
5. **Integrate with your target deep learning framework by dedicated plugin**
## How does DALI work internally?
DALI defines the data pre-processing pipeline as a dataflow graph. DALI has the following 3 types of operators:
* **CPU: accepts and produces data on CPU**
* **Mixed: accepts data from CPU and produces the output at the GPU side**
* **GPU: accepts and produces data on the GPU**
<br>
For performance reasons, DALI only supports moving data in the direction **CPU -> Mixed -> GPU**.
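As a rough illustration (not part of the original post), here is a minimal pipeline sketch that uses all three operator types, assuming the legacy ```nvidia.dali.ops``` API used throughout this article and an ```image_dir``` folder of JPEG files:
```python=
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline

class CpuMixedGpuPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, image_dir):
        super(CpuMixedGpuPipeline, self).__init__(batch_size, num_threads, device_id)
        # CPU operator: reads encoded JPEGs and labels on the host
        self.input = ops.FileReader(file_root = image_dir)
        # Mixed operator: accepts CPU buffers, decodes them, produces output on the GPU
        self.decode = ops.ImageDecoder(device = 'mixed', output_type = types.RGB)
        # GPU operator: accepts and produces data on the GPU
        self.resize = ops.Resize(device = 'gpu', resize_x = 224, resize_y = 224)

    def define_graph(self):
        jpegs, labels = self.input()      # CPU
        images = self.decode(jpegs)       # CPU -> Mixed (output lands on the GPU)
        images = self.resize(images)      # GPU -> GPU
        return images, labels
```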

---
# How to install DALI
## System requirements
1. Operating system: ==Linux x64 (Windows and macOS are not supported)==
2. CUDA 9.0 or later (run ```nvcc --version``` in a terminal to check your CUDA version)
3. One or more of the following deep learning frameworks:
  - MXNet 1.3 or later
  - PyTorch 0.4 or later
  - TensorFlow 1.7 or later
## Installation commands
* **For CUDA 9.0**
```
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/9.0 nvidia-dali
```
If your framework is TensorFlow, the DALI TensorFlow plugin needs to be installed separately.
```
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/9.0 nvidia-dali-tf-plugin
```
* **For CUDA 10.0**
```
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda100
```
If your framework is TensorFlow, the DALI TensorFlow plugin needs to be installed separately.
```
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-tf-plugin-cuda100
```
* **For CUDA 11.0**
```
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda110
```
If your framework is TensorFlow, the DALI TensorFlow plugin needs to be installed separately.
```
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-tf-plugin-cuda110
```
- **NVIDIA Installation Guide: [LINK](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/installation.html)**
---
<br>
# How to use DALI
## Defining the Pipeline
The core of the DALI Python API is the ```Pipeline``` class; you create your own DALI pipeline by subclassing it.
<br>
**Simple Pipeline**
```python=
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline

image_dir = "path/to/images"  # placeholder: directory that contains the training images

class SimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, seed = 21):
        super(SimplePipeline, self).__init__(batch_size, num_threads, device_id)
        self.input = ops.FileReader(file_root = image_dir)       # read files from disk
        self.decode = ops.HostDecoder(output_type = types.RGB)   # decode JPEGs to RGB on the CPU

    def define_graph(self):
        jpegs, labels = self.input()
        images = self.decode(jpegs)
        return (images, labels)
```
Only two methods need to be written:
* ```__init__```: choose the operators you need (they can be found in [```nvidia.dali.ops```](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html)). This simple example uses only 2 operators: ```FileReader``` reads data from disk, and ```HostDecoder``` decodes the images into RGB. The following arguments must also be passed to super: ```batch_size``` (the Pipeline handles batching for you), ```num_threads``` (how many CPU threads to use), ```device_id``` (which GPU to run on), and ```seed``` (for random number generation).
* ```define_graph```: connect the operators to define the computation. FileReader reads the JPEG images and their corresponding labels and passes the JPEGs to the decoder; the decoded images and the labels are returned as the Pipeline's outputs.
<br>
## Building the Pipeline
To use the ```SimplePipeline``` defined above, it first needs to be built.
```python=
pipe = SimplePipeline(batch_size, 1, 0)
pipe.build()
```
**Recommended num_threads: number of CPU cores / number of GPUs.**
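As a hypothetical sizing sketch (the GPU count below is an assumption, not a value from this post), the recommendation can be turned into code like this:
```python=
import multiprocessing

num_gpus = 1  # assumed number of GPUs used for training
num_threads = max(1, multiprocessing.cpu_count() // num_gpus)

pipe = SimplePipeline(batch_size = 64, num_threads = num_threads, device_id = 0)
pipe.build()
```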
<br>
## Running the pipeline
```python=
pipe_out = pipe.run()
images, labels = pipe_out
```
**The pipeline output is a list with 2 elements (the 2 outputs we specified in the define_graph method of the SimplePipeline class).**
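A minimal sketch of inspecting that output, assuming the CPU-side decoder from ```SimplePipeline``` above (both outputs are then ```TensorListCPU``` objects, and ```at(i)``` returns the i-th sample as a NumPy array):
```python=
images, labels = pipe_out
print(images.at(0).shape)    # first decoded image as a NumPy array, shape (H, W, 3)
print(labels.at(0))          # label of the first sample
```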
<br>
## Adding augmentation
**Random Shuffle**
```python=
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline

class ShuffledSimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(ShuffledSimplePipeline, self).__init__(batch_size, num_threads, device_id, seed = 12)
        # shuffle samples inside a buffer of 23000 images read from disk
        self.input = ops.FileReader(file_root = image_dir, random_shuffle = True, initial_fill = 23000)
        self.decode = ops.ImageDecoder(device = 'cpu', output_type = types.RGB)

    def define_graph(self):
        jpegs, labels = self.input()
        images = self.decode(jpegs)
        return (images, labels)
```
Two extra arguments are passed to ```FileReader``` here:
* ```random_shuffle```: enables shuffling of images in the reader. **Shuffling is performed using a buffer of images read from disk.** When the reader is asked to provide the next image, it randomly selects an image from the buffer, outputs it, and immediately replaces that spot in the buffer with a freshly read image.
* ```initial_fill```: sets the **capacity of the buffer**. The default value of this parameter (1000) is well suited for datasets containing thousands of examples.
<br>
**Random Rotate**
```python=
class RandomRotatedSimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(RandomRotatedSimplePipeline, self).__init__(batch_size, num_threads, device_id, seed = 12)
        self.input = ops.FileReader(file_root = image_dir, random_shuffle = True, initial_fill = 21)
        self.decode = ops.ImageDecoder(device = 'cpu', output_type = types.RGB)
        self.rotate = ops.Rotate()
        self.rng = ops.random.Uniform(range = (-10.0, 10.0))

    def define_graph(self):
        jpegs, labels = self.input()
        images = self.decode(jpegs)
        angle = self.rng()
        rotated_images = self.rotate(images, angle = angle)
        return (rotated_images, labels)
```
* ```dali.ops.random.Uniform```: random number generator
* ```dali.ops.Rotate```: its ```angle``` input lets us feed the operator a different rotation angle for every image (see the usage sketch below)
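A short usage sketch (assuming the pipeline above and a small batch), showing that every sample gets its own random angle; because ```Rotate``` by default enlarges the canvas to fit the rotated image, the output shapes can differ per sample:
```python=
pipe = RandomRotatedSimplePipeline(batch_size = 4, num_threads = 1, device_id = 0)
pipe.build()
rotated_images, labels = pipe.run()
for i in range(4):
    # each image was rotated by its own angle drawn from U(-10, 10) degrees
    print(rotated_images.at(i).shape)
```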
<br>
## Hybrid decoding
**Specifying “mixed” device parameter in ImageDecoder enables nvJPEG support. Other file formats are still decoded on the CPU.**
```python=
import nvidia.dali.types as types
import nvidia.dali.ops as ops
from nvidia.dali.pipeline import Pipeline

class HybridTrainPipe(Pipeline):
    def __init__(self,
                 batch_size,
                 num_threads,
                 device_id,
                 data_dir,
                 crop = 224,
                 dali_cpu = False):
        super(HybridTrainPipe, self).__init__(batch_size, num_threads,
                                              device_id, seed = 12 + device_id)
        self.input = ops.FileReader(file_root = data_dir, random_shuffle = True)
        # let the user decide which pipeline works best
        if dali_cpu:
            dali_device = 'cpu'
            self.decode = ops.HostDecoder(device = dali_device, output_type = types.RGB)
        else:
            dali_device = 'gpu'
            self.decode = ops.ImageDecoder(device = 'mixed', output_type = types.RGB)
        self.rrc = ops.RandomResizedCrop(device = dali_device, size = (crop, crop))

    def define_graph(self):
        self.jpegs, self.labels = self.input(name = "Reader")
        images = self.decode(self.jpegs)  # with 'mixed' decoding the images are on the GPU
        outputs = self.rrc(images)
        self.labels = self.labels.gpu()
        return [outputs, self.labels]
```
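A possible way to check that hybrid decoding really leaves the data on the GPU (a sketch, assuming ```data_dir``` points to your dataset): the outputs are ```TensorListGPU``` objects and must be copied back with ```as_cpu()``` before they can be inspected on the host.
```python=
pipe = HybridTrainPipe(batch_size = 32, num_threads = 4, device_id = 0, data_dir = data_dir)
pipe.build()
images, labels = pipe.run()
host_images = images.as_cpu()      # explicit GPU -> CPU copy, only needed for inspection
print(host_images.at(0).shape)     # (crop, crop, 3) after RandomResizedCrop
```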
# PyTorch Plugin API reference
## DALIClassificationIterator
DALI iterator for classification tasks for PyTorch. It returns 2 outputs (data and label) in the form of PyTorch’s Tensor.
```python=
nvidia.dali.plugin.pytorch.DALIClassificationIterator(pipelines, size)
```
## DALIGenericIterator
General DALI iterator for PyTorch. It can return any number of outputs from the DALI pipeline in the form of PyTorch’s Tensors.
```python=
nvidia.dali.plugin.pytorch.DALIGenericIterator(pipelines, output_map, size)
```
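A minimal usage sketch for ```DALIGenericIterator``` (the ```output_map``` names and the sample count are assumptions for illustration); each element it yields is a list with one dict per pipeline:
```python=
from nvidia.dali.plugin.pytorch import DALIGenericIterator

num_samples = 23000  # assumed dataset size, matching the dog/cat example below
train_loader = DALIGenericIterator([pipe], output_map = ["data", "label"], size = num_samples)

for batch in train_loader:
    images = batch[0]["data"]    # torch CUDA tensor, NHWC as produced by the pipeline
    labels = batch[0]["label"]
```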
<br>
---
# DALI example
## Example
``` python=
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
from torch.autograd import Variable
from tqdm import tqdm

import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin.pytorch import DALIClassificationIterator

# args (lr, batch_size, epochs), device and DOG_CAT_PATH are assumed to be defined
# elsewhere (e.g. argparse arguments, the torch device and the dataset path)

class HybridTrainPipe(Pipeline):
    def __init__(self,
                 batch_size,
                 num_threads,
                 device_id,
                 data_dir,
                 crop = 224,
                 dali_cpu = False):
        super(HybridTrainPipe, self).__init__(batch_size, num_threads,
                                              device_id, seed = 12 + device_id)
        self.input = ops.FileReader(file_root = data_dir, random_shuffle = True, initial_fill = 23000)
        # let the user decide which pipeline works best
        if dali_cpu:
            dali_device = 'cpu'
            self.decode = ops.HostDecoder(device = dali_device, output_type = types.RGB)
        else:
            dali_device = 'gpu'
            self.decode = ops.ImageDecoder(device = "mixed", output_type = types.RGB)
        self.rrc = ops.RandomResizedCrop(device = dali_device, size = (crop, crop))

    def define_graph(self):
        self.jpegs, self.labels = self.input(name = "Reader")
        images = self.decode(self.jpegs)
        images = self.rrc(images)
        self.labels = self.labels.gpu()
        return [images, self.labels]

def train(model, device, train_loader, criterion, optimizer):
    model.train()
    for batch_idx, data in tqdm(enumerate(train_loader)):
        # DALI iterators return a list of dicts, one per pipeline
        target = data[0]['label'].squeeze().cuda(6, non_blocking = True).long()
        data = data[0]["data"].cuda(6, non_blocking = True).type(torch.cuda.FloatTensor)
        data = data.permute(0, 3, 1, 2)          # NHWC -> NCHW for the model
        data_var = Variable(data)
        target_var = Variable(target)
        optimizer.zero_grad()
        output = model(data_var)
        loss = criterion(output, target_var)
        # compute gradient and do optimizer step
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()

def main():
    net = models.resnet50(num_classes = 2).to(device)
    # define loss function and optimizer
    criterion = nn.functional.cross_entropy
    optimizer = optim.SGD(net.parameters(), lr = args.lr, momentum = 0.9)
    pipe = HybridTrainPipe(batch_size = args.batch_size,
                           num_threads = 4,
                           device_id = 6,
                           data_dir = DOG_CAT_PATH)
    pipe.build()
    # DALI
    train_loader_dali = DALIClassificationIterator(pipe, size = 23000)
    for epoch in range(1, args.epochs + 1):
        train(net, device, train_loader_dali, criterion, optimizer)
        train_loader_dali.reset()   # the DALI iterator must be reset before the next epoch

if __name__ == "__main__":
    main()
```
<br>
## Other use cases
**Object Detection:**

**Video Pipeline:**

**Optical Flow:**

---
# Comparison
## Preprocessing time (tested on a dog/cat dataset)
> **With one Intel(R) Xeon(R) Silver 4110 CPU, one RTX 2080 Ti GPU, and the whole dataset placed on a RAM disk, DALI dramatically accelerates image preprocessing.**

| Time to iterate the training data (batch size = 64) | DOG/CAT dataset |
|--------------|:-----:|
| DALI | 3.49108 s |
| torchvision | 77.81846 s |
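
A rough sketch of how such a comparison could be measured (this is not the author's benchmarking script; it reuses ```train_loader_dali``` and ```DOG_CAT_PATH``` from the example above and simply times one pass over the data):
```python=
import time
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

def time_one_pass(loader, unpack):
    # iterate the loader once, only unpacking the batches, and return the wall-clock time
    start = time.time()
    for batch in loader:
        unpack(batch)
    return time.time() - start

# torchvision baseline: JPEG decoding and RandomResizedCrop on the CPU
dataset = datasets.ImageFolder(
    DOG_CAT_PATH,
    transforms.Compose([transforms.RandomResizedCrop(224), transforms.ToTensor()]))
torch_loader = DataLoader(dataset, batch_size = 64, shuffle = True, num_workers = 4)

torchvision_time = time_one_pass(torch_loader, lambda b: (b[0].cuda(), b[1].cuda()))
dali_time = time_one_pass(train_loader_dali, lambda b: (b[0]["data"], b[0]["label"]))
print("torchvision: %.2f s, DALI: %.2f s" % (torchvision_time, dali_time))
```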
<br>
## NVIDIA official data ([source](https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9925-fast-ai-data-pre-processing-with-nvidia-dali.pdf))
**Training on PyTorch**

**Training on TensorFlow**

---
# Additional resources
1. **[Fast data pipeline for deep learning training](http://on-demand.gputechconf.com/gtc/2018/presentation/s8906-fast-data-pipelines-for-deep-learning-training.pdf)**
2. **[Fast AI Data Preprocessing with NVIDIA DALI](https://developer.nvidia.com/blog/fast-ai-data-preprocessing-with-nvidia-dali/)**
3. **[Integration of DALI with TensorRT on Xavier](https://developer.nvidia.com/gtc/2019/video/S9818/video)**
4. **[NVIDIA DALI documentation](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html#)**
5. **[DALI Release Notes](https://docs.nvidia.com/deeplearning/dali/release-notes/index.html)**
6. **[NVIDIA Examples(Github)](https://github.com/NVIDIA/DALI/tree/master/docs/examples)**
7. **[Webinar: Efficient Data Loading using DALI (Youtube)](https://www.youtube.com/watch?v=cu-M8I3YUxM)**