NVIDIA DALI Works Wonders: Accelerating Data Loading and Training
===

[TOC]
# What is NVIDIA DALI?
Deep learning usually requires a complex, multi-stage data pre-processing pipeline, and this kind of processing (especially for images) is mostly performed on the CPU. For example, **reading training data from disk**, **decoding images**, **cropping**, **random resizing**, **data augmentation (Data Augmentation)**, and **format conversion (NCHW or NHWC)** are all typically executed on the CPU, which limits training and inference performance.
**NVIDIA DALI, NVIDIA’s Data Loading Library, is a collection of highly optimized building blocks, and an execution engine, to accelerate the pre-processing of the input data for deep learning applications.** That is NVIDIA's official definition. In short, DALI is a library that accelerates data loading and pre-processing (especially for images) in deep learning, and its main goal is to use the GPU as fully as possible so that none of its capacity goes to waste.
A deep learning training job can roughly be divided into 3 steps:
1. **Store the training data (images, video, text, audio, or other formats) on the server's disk.**
2. **Load the data into memory with the CPU and run pre-processing such as decoding and data augmentation; this step depends heavily on CPU compute.**
3. **Copy the processed data into GPU memory and run the training itself; this part depends on GPU compute.**

In this flow, the CPU and GPU stages should take roughly the same time, so that the CPU always has processed data ready for the GPU to train or infer on. If the CPU is much slower than the GPU, the GPU constantly waits for the CPU and its compute is wasted, and vice versa.
DALI addresses this by moving part of the data pre-processing from the CPU to the GPU, which effectively improves performance and makes fuller use of the GPU's compute capacity.

---
# Main features of DALI
## Pipeline workflow
DALI provides a simple Python interface; a data pre-processing pipeline can be implemented through the following steps:
1. **Select Operators from this extensive [list of supported operators](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html)**
2. **Define the operation flow as a symbolic graph in an imperative way (as in most of the current deep learning frameworks)**
3. **Build an operation pipeline**
4. **Run graph on demand**
5. **Integrate with your target deep learning framework by dedicated plugin**
## How does DALI work internally?
DALI defines the data pre-processing pipeline as a dataflow graph. DALI has the following 3 types of operators:
* **CPU: accepts and produces data on CPU**
* **Mixed: accepts data from CPU and produces the output at the GPU side**
* **GPU: accepts and produces data on the GPU**
<br>
For performance reasons, DALI only supports moving data in the direction **CPU -> Mixed -> GPU**.
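As a rough illustration (not part of the original post), here is a minimal pipeline sketch that uses all three operator types, assuming the legacy ```nvidia.dali.ops``` API used throughout this article and an ```image_dir``` folder of JPEG files:
```python=
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline

class CpuMixedGpuPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, image_dir):
        super(CpuMixedGpuPipeline, self).__init__(batch_size, num_threads, device_id)
        # CPU operator: reads encoded JPEGs and labels on the host
        self.input = ops.FileReader(file_root = image_dir)
        # Mixed operator: accepts CPU buffers, decodes them, produces output on the GPU
        self.decode = ops.ImageDecoder(device = 'mixed', output_type = types.RGB)
        # GPU operator: accepts and produces data on the GPU
        self.resize = ops.Resize(device = 'gpu', resize_x = 224, resize_y = 224)

    def define_graph(self):
        jpegs, labels = self.input()      # CPU
        images = self.decode(jpegs)       # CPU -> Mixed (output lands on the GPU)
        images = self.resize(images)      # GPU -> GPU
        return images, labels
```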

---
# How to install DALI
## System requirements
1. Operating system: ==Linux x64 (Windows and macOS are not supported)==
2. CUDA 9.0 or later (run ```nvcc --version``` in a terminal to check your CUDA version)
3. One or more of the following deep learning frameworks:
  - MXNet 1.3 or later
  - PyTorch 0.4 or later
  - TensorFlow 1.7 or later
## Installation commands
* **For CUDA 9.0**
```
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/9.0 nvidia-dali
```
If your framework is TensorFlow, the DALI TensorFlow plugin needs to be installed separately.
```
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/9.0 nvidia-dali-tf-plugin
```
* **For CUDA 10.0**
```
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda100
```
If your framework is TensorFlow, the DALI TensorFlow plugin needs to be installed separately.
```
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-tf-plugin-cuda100
```
* **For CUDA 11.0**
```
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda110
```
If your framework is TensorFlow, the DALI TensorFlow plugin needs to be installed separately.
```
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-tf-plugin-cuda110
```
- **NVIDIA Installation Guide: [LINK](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/installation.html)**
---
<br>
# How to use DALI
## Defining the Pipeline
The core of the DALI Python API is the ```Pipeline``` class; you create your own DALI pipeline by subclassing it.
<br>
**Simple Pipeline**
```python=
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline

image_dir = "path/to/images"  # placeholder: directory that contains the training images

class SimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, seed = 21):
        super(SimplePipeline, self).__init__(batch_size, num_threads, device_id)
        self.input = ops.FileReader(file_root = image_dir)       # read files from disk
        self.decode = ops.HostDecoder(output_type = types.RGB)   # decode JPEGs to RGB on the CPU

    def define_graph(self):
        jpegs, labels = self.input()
        images = self.decode(jpegs)
        return (images, labels)
```
Only two methods need to be written:
* ```__init__```: choose the operators you need (they can be found in [```nvidia.dali.ops```](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html)). This simple example uses only 2 operators: ```FileReader``` reads data from disk, and ```HostDecoder``` decodes the images into RGB. The following arguments must also be passed to super: ```batch_size``` (the Pipeline handles batching for you), ```num_threads``` (how many CPU threads to use), ```device_id``` (which GPU to run on), and ```seed``` (for random number generation).
* ```define_graph```: connect the operators to define the computation. FileReader reads the JPEG images and their corresponding labels and passes the JPEGs to the decoder; the decoded images and the labels are returned as the Pipeline's outputs.
<br>
## Building the Pipeline
To use the ```SimplePipeline``` defined above, it first needs to be built.
```python=
pipe = SimplePipeline(batch_size, 1, 0)
pipe.build()
```
**Recommended num_threads: number of CPU cores / number of GPUs.**
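As a hypothetical sizing sketch (the GPU count below is an assumption, not a value from this post), the recommendation can be turned into code like this:
```python=
import multiprocessing

num_gpus = 1  # assumed number of GPUs used for training
num_threads = max(1, multiprocessing.cpu_count() // num_gpus)

pipe = SimplePipeline(batch_size = 64, num_threads = num_threads, device_id = 0)
pipe.build()
```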
<br>
## Running the pipeline
```python=
pipe_out = pipe.run()
images, labels = pipe_out
```
**The pipeline output is a list with 2 elements (the 2 outputs we specified in the define_graph method of the SimplePipeline class).**
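A minimal sketch of inspecting that output, assuming the CPU-side decoder from ```SimplePipeline``` above (both outputs are then ```TensorListCPU``` objects, and ```at(i)``` returns the i-th sample as a NumPy array):
```python=
images, labels = pipe_out
print(images.at(0).shape)    # first decoded image as a NumPy array, shape (H, W, 3)
print(labels.at(0))          # label of the first sample
```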
<br>
## Adding augmentation
**Random Shuffle**
```python=
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline

class ShuffledSimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(ShuffledSimplePipeline, self).__init__(batch_size, num_threads, device_id, seed = 12)
        # shuffle samples inside a buffer of 23000 images read from disk
        self.input = ops.FileReader(file_root = image_dir, random_shuffle = True, initial_fill = 23000)
        self.decode = ops.ImageDecoder(device = 'cpu', output_type = types.RGB)

    def define_graph(self):
        jpegs, labels = self.input()
        images = self.decode(jpegs)
        return (images, labels)
```
Two extra arguments are passed to ```FileReader``` here:
* ```random_shuffle```: enables shuffling of images in the reader. **Shuffling is performed using a buffer of images read from disk.** When the reader is asked to provide the next image, it randomly selects an image from the buffer, outputs it, and immediately replaces that spot in the buffer with a freshly read image.
* ```initial_fill```: sets the **capacity of the buffer**. The default value of this parameter (1000) is well suited for datasets containing thousands of examples.
<br>
**Random Rotate**
```python=
class RandomRotatedSimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(RandomRotatedSimplePipeline, self).__init__(batch_size, num_threads, device_id, seed = 12)
        self.input = ops.FileReader(file_root = image_dir, random_shuffle = True, initial_fill = 21)
        self.decode = ops.ImageDecoder(device = 'cpu', output_type = types.RGB)
        self.rotate = ops.Rotate()
        self.rng = ops.random.Uniform(range = (-10.0, 10.0))

    def define_graph(self):
        jpegs, labels = self.input()
        images = self.decode(jpegs)
        angle = self.rng()
        rotated_images = self.rotate(images, angle = angle)
        return (rotated_images, labels)
```
* ```dali.ops.random.Uniform```: random number generator
* ```dali.ops.Rotate```: its ```angle``` input lets us feed the operator a different rotation angle for every image (see the usage sketch below)
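A short usage sketch (assuming the pipeline above and a small batch), showing that every sample gets its own random angle; because ```Rotate``` by default enlarges the canvas to fit the rotated image, the output shapes can differ per sample:
```python=
pipe = RandomRotatedSimplePipeline(batch_size = 4, num_threads = 1, device_id = 0)
pipe.build()
rotated_images, labels = pipe.run()
for i in range(4):
    # each image was rotated by its own angle drawn from U(-10, 10) degrees
    print(rotated_images.at(i).shape)
```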
<br>
## Hybrid decoding
**Specifying “mixed” device parameter in ImageDecoder enables nvJPEG support. Other file formats are still decoded on the CPU.**
```python=
import nvidia.dali.types as types
import nvidia.dali.ops as ops
from nvidia.dali.pipeline import Pipeline

class HybridTrainPipe(Pipeline):
    def __init__(self,
                 batch_size,
                 num_threads,
                 device_id,
                 data_dir,
                 crop = 224,
                 dali_cpu = False):
        super(HybridTrainPipe, self).__init__(batch_size, num_threads,
                                              device_id, seed = 12 + device_id)
        self.input = ops.FileReader(file_root = data_dir, random_shuffle = True)
        # let the user decide which pipeline works best
        if dali_cpu:
            dali_device = 'cpu'
            self.decode = ops.HostDecoder(device = dali_device, output_type = types.RGB)
        else:
            dali_device = 'gpu'
            self.decode = ops.ImageDecoder(device = 'mixed', output_type = types.RGB)
        self.rrc = ops.RandomResizedCrop(device = dali_device, size = (crop, crop))

    def define_graph(self):
        self.jpegs, self.labels = self.input(name = "Reader")
        images = self.decode(self.jpegs)  # with 'mixed' decoding the images are on the GPU
        outputs = self.rrc(images)
        self.labels = self.labels.gpu()
        return [outputs, self.labels]
```
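A possible way to check that hybrid decoding really leaves the data on the GPU (a sketch, assuming ```data_dir``` points to your dataset): the outputs are ```TensorListGPU``` objects and must be copied back with ```as_cpu()``` before they can be inspected on the host.
```python=
pipe = HybridTrainPipe(batch_size = 32, num_threads = 4, device_id = 0, data_dir = data_dir)
pipe.build()
images, labels = pipe.run()
host_images = images.as_cpu()      # explicit GPU -> CPU copy, only needed for inspection
print(host_images.at(0).shape)     # (crop, crop, 3) after RandomResizedCrop
```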
# PyTorch Plugin API reference
## DALIClassificationIterator
DALI iterator for classification tasks for PyTorch. It returns 2 outputs (data and label) in the form of PyTorch’s Tensor.
```python=
nvidia.dali.plugin.pytorch.DALIClassificationIterator(pipelines, size)
```
## DALIGenericIterator
General DALI iterator for PyTorch. It can return any number of outputs from the DALI pipeline in the form of PyTorch’s Tensors.
```python=
nvidia.dali.plugin.pytorch.DALIGenericIterator(pipelines, output_map, size)
```
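A minimal usage sketch for ```DALIGenericIterator``` (the ```output_map``` names and the sample count are assumptions for illustration); each element it yields is a list with one dict per pipeline:
```python=
from nvidia.dali.plugin.pytorch import DALIGenericIterator

num_samples = 23000  # assumed dataset size, matching the dog/cat example below
train_loader = DALIGenericIterator([pipe], output_map = ["data", "label"], size = num_samples)

for batch in train_loader:
    images = batch[0]["data"]    # torch CUDA tensor, NHWC as produced by the pipeline
    labels = batch[0]["label"]
```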
<br>
---
# DALI example
## Example
``` python=
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
from torch.autograd import Variable
from tqdm import tqdm

import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin.pytorch import DALIClassificationIterator

# args (lr, batch_size, epochs), device and DOG_CAT_PATH are assumed to be defined
# elsewhere (e.g. argparse arguments, the torch device and the dataset path)

class HybridTrainPipe(Pipeline):
    def __init__(self,
                 batch_size,
                 num_threads,
                 device_id,
                 data_dir,
                 crop = 224,
                 dali_cpu = False):
        super(HybridTrainPipe, self).__init__(batch_size, num_threads,
                                              device_id, seed = 12 + device_id)
        self.input = ops.FileReader(file_root = data_dir, random_shuffle = True, initial_fill = 23000)
        # let the user decide which pipeline works best
        if dali_cpu:
            dali_device = 'cpu'
            self.decode = ops.HostDecoder(device = dali_device, output_type = types.RGB)
        else:
            dali_device = 'gpu'
            self.decode = ops.ImageDecoder(device = "mixed", output_type = types.RGB)
        self.rrc = ops.RandomResizedCrop(device = dali_device, size = (crop, crop))

    def define_graph(self):
        self.jpegs, self.labels = self.input(name = "Reader")
        images = self.decode(self.jpegs)
        images = self.rrc(images)
        self.labels = self.labels.gpu()
        return [images, self.labels]

def train(model, device, train_loader, criterion, optimizer):
    model.train()
    for batch_idx, data in tqdm(enumerate(train_loader)):
        # DALI iterators return a list of dicts, one per pipeline
        target = data[0]['label'].squeeze().cuda(6, non_blocking = True).long()
        data = data[0]["data"].cuda(6, non_blocking = True).type(torch.cuda.FloatTensor)
        data = data.permute(0, 3, 1, 2)          # NHWC -> NCHW for the model
        data_var = Variable(data)
        target_var = Variable(target)
        optimizer.zero_grad()
        output = model(data_var)
        loss = criterion(output, target_var)
        # compute gradient and do optimizer step
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()

def main():
    net = models.resnet50(num_classes = 2).to(device)
    # define loss function and optimizer
    criterion = nn.functional.cross_entropy
    optimizer = optim.SGD(net.parameters(), lr = args.lr, momentum = 0.9)
    pipe = HybridTrainPipe(batch_size = args.batch_size,
                           num_threads = 4,
                           device_id = 6,
                           data_dir = DOG_CAT_PATH)
    pipe.build()
    # DALI
    train_loader_dali = DALIClassificationIterator(pipe, size = 23000)
    for epoch in range(1, args.epochs + 1):
        train(net, device, train_loader_dali, criterion, optimizer)
        train_loader_dali.reset()   # the DALI iterator must be reset before the next epoch

if __name__ == "__main__":
    main()
```
<br>
## Other use cases
**Object Detection:**

**Video Pipeline:**

**Optical Flow:**

---
# Comparison
## Preprocessing time (tested on a dog/cat dataset)
> **With one Intel(R) Xeon(R) Silver 4110 CPU, one RTX 2080 Ti GPU, and the whole dataset placed on a RAM disk, DALI dramatically accelerates image preprocessing.**

| Time to iterate the training data (batch size = 64) | DOG/CAT dataset |
|--------------|:-----:|
| DALI | 3.49108 s |
| torchvision | 77.81846 s |
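
A rough sketch of how such a comparison could be measured (this is not the author's benchmarking script; it reuses ```train_loader_dali``` and ```DOG_CAT_PATH``` from the example above and simply times one pass over the data):
```python=
import time
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

def time_one_pass(loader, unpack):
    # iterate the loader once, only unpacking the batches, and return the wall-clock time
    start = time.time()
    for batch in loader:
        unpack(batch)
    return time.time() - start

# torchvision baseline: JPEG decoding and RandomResizedCrop on the CPU
dataset = datasets.ImageFolder(
    DOG_CAT_PATH,
    transforms.Compose([transforms.RandomResizedCrop(224), transforms.ToTensor()]))
torch_loader = DataLoader(dataset, batch_size = 64, shuffle = True, num_workers = 4)

torchvision_time = time_one_pass(torch_loader, lambda b: (b[0].cuda(), b[1].cuda()))
dali_time = time_one_pass(train_loader_dali, lambda b: (b[0]["data"], b[0]["label"]))
print("torchvision: %.2f s, DALI: %.2f s" % (torchvision_time, dali_time))
```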
<br>
## NVIDIA official data ([source](https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9925-fast-ai-data-pre-processing-with-nvidia-dali.pdf))
**Training on PyTorch**

**Training on TensorFlow**

---
# Additional resources
1. **[Fast data pipeline for deep learning training](http://on-demand.gputechconf.com/gtc/2018/presentation/s8906-fast-data-pipelines-for-deep-learning-training.pdf)**
2. **[Fast AI Data Preprocessing with NVIDIA DALI](https://developer.nvidia.com/blog/fast-ai-data-preprocessing-with-nvidia-dali/)**
3. **[Integration of DALI with TensorRT on Xavier](https://developer.nvidia.com/gtc/2019/video/S9818/video)**
4. **[NVIDIA DALI documentation](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html#)**
5. **[DALI Release Notes](https://docs.nvidia.com/deeplearning/dali/release-notes/index.html)**
6. **[NVIDIA Examples(Github)](https://github.com/NVIDIA/DALI/tree/master/docs/examples)**
7. **[Webinar: Efficient Data Loading using DALI (Youtube)](https://www.youtube.com/watch?v=cu-M8I3YUxM)**