NVIDIA DALI Works Wonders: Faster Data Loading and Training
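What is NVIDIA DALI?
Deep learning usually requires a complex, multi-stage data preprocessing pipeline, and this processing (especially for images) is mostly computed on the CPU. For example, reading training data from disk, decoding images, cropping, randomly resizing, applying data augmentation, and converting layouts (NCHW or NHWC) are all typically executed on the CPU, which limits training and inference performance.
NVIDIA describes DALI (NVIDIA's Data Loading Library) as "a collection of highly optimized building blocks, and an execution engine, to accelerate the pre-processing of the input data for deep learning applications". Put simply, it is a library that speeds up data preprocessing (especially for images) and data loading for deep learning. Its central idea is to make full use of the GPU so that none of its capacity goes to waste.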
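Deep learning training can be roughly divided into three steps:
- Store the training data (images, video, text, audio, or other formats) on the server's disk.
- Load the data into memory through the CPU and run preprocessing such as decoding and data augmentation. This step depends heavily on CPU compute.
- Load the processed data into GPU memory for the rest of the training flow. This step depends on GPU compute.
Throughout this flow, the CPU and GPU processing latencies should not differ too much, so that data processed by the CPU reaches the GPU in time for training or inference. If the CPU is much slower than the GPU, the GPU frequently sits idle waiting for the CPU and its capacity is wasted, and vice versa.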
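To make the balance argument concrete, here is a minimal sketch of the arithmetic. The function name and the rates are made-up illustrations, not measurements: the end-to-end rate of a two-stage pipeline is bounded by the slower stage, and the faster stage idles for the rest of the time.

```python
def pipeline_throughput(cpu_images_per_s: float, gpu_images_per_s: float):
    """Return (end-to-end images/s, fraction of time the GPU sits idle)."""
    # The slower stage sets the pace for the whole pipeline.
    overall = min(cpu_images_per_s, gpu_images_per_s)
    # The GPU is busy only for the share of its capacity that is actually fed.
    gpu_idle = 1.0 - overall / gpu_images_per_s
    return overall, gpu_idle


# Hypothetical rates: CPU preprocessing delivers 800 img/s,
# while the GPU could train on 2000 img/s.
overall, gpu_idle = pipeline_throughput(800.0, 2000.0)
print(f"{overall:.0f} img/s end to end, GPU idle {gpu_idle:.0%} of the time")
```

With these assumed numbers the GPU would be starved for more than half of the time, which is the situation that moving preprocessing work onto the GPU is meant to relieve.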
DALI does one thing: it moves part of the data preprocessing from the CPU to the GPU, which improves performance and makes fuller use of the GPU's compute.
Main features of DALI
Pipeline workflow
DALI provides a simple Python interface. A data preprocessing pipeline is implemented in the following steps:
- Select Operators from this extensive list of supported operators
- Define the operation flow as a symbolic graph in an imperative way (as in most of the current deep learning frameworks)
- Build an operation pipeline
- Run graph on demand
- Integrate with your target deep learning framework by dedicated plugin
How does DALI work internally?
DALI defines the data preprocessing pipeline as a dataflow graph and has the following three types of operators:
- CPU: accepts and produces data on CPU
- Mixed: accepts data from CPU and produces the output at the GPU side
- GPU: accepts and produces data on the GPU
For performance reasons, DALI only supports transferring data in the direction CPU -> Mixed -> GPU (see the figure below):
[Figure: DALI data flow, CPU -> Mixed -> GPU]
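The ordering constraint can be pictured with a small, library-independent check. Everything in this sketch (the `OPERATOR_IO` table and the `is_valid_chain` function) is a hypothetical illustration and not part of the DALI API. It only encodes the three operator types listed above and verifies that each operator receives its data on the side where the previous one produced it.

```python
# (input side, output side) for each operator type, as listed above.
OPERATOR_IO = {
    "cpu": ("cpu", "cpu"),
    "mixed": ("cpu", "gpu"),
    "gpu": ("gpu", "gpu"),
}


def is_valid_chain(operator_types):
    """Check that every operator gets its input where the previous one left it."""
    location = "cpu"  # data read from disk starts in host memory
    for op_type in operator_types:
        expected_input, output = OPERATOR_IO[op_type]
        if location != expected_input:
            return False
        location = output
    return True


print(is_valid_chain(["cpu", "mixed", "gpu", "gpu"]))  # reader -> decoder -> two GPU ops
print(is_valid_chain(["mixed", "cpu"]))                # would move data back to the CPU
```

In this simplified model the first chain is accepted and the second is rejected, because a CPU operator cannot consume data that a Mixed operator has already placed on the GPU.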
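How to install DALI
System requirements
- Operating system: Linux x64 (Windows and macOS are not supported)
- CUDA 9.0 or later (run `nvcc --version` in a terminal to check the CUDA version)
- One or more of the following deep learning frameworks:
  - MXNet 1.3 or later
  - PyTorch 0.4 or later
  - TensorFlow 1.7 or later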
Installation commands
If the framework is TensorFlow, an additional plugin must be installed.
- NVIDIA Installation Guide: LINK
How to use DALI
Defining the Pipeline
The core of the DALI Python API is the `Pipeline` class. You create your own DALI pipeline by writing a subclass of `Pipeline`.
Simple Pipeline
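Only two methods need to be written:
- `__init__`: chooses the operators that are needed (they can be found in `nvidia.dali.ops`). This simple example uses only two operators: `FileReader` reads data from disk, and `HostDecoder` decodes images into RGB. The following parameters must also be passed to super: `batch_size` (the Pipeline handles batching for you), `num_threads` (how many threads to use), `device_id` (which GPU to run on), and `seed` (for random number generation).
- `define_graph`: connects the operators to define the computation steps. `FileReader` reads the JPEG images and their labels and passes the images to the decoder. The method finally returns the decoded images and the labels as the Pipeline's outputs.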
Building the Pipeline
To use the `SimplePipeline` defined above, it first has to be built.
Recommended `num_threads` setting: number of CPU cores / number of GPUs
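The rule of thumb above can be computed rather than hard-coded. This is a minimal sketch under two assumptions: the helper name `suggested_num_threads` is hypothetical, and the GPU count is supplied by the caller (with PyTorch it could come from `torch.cuda.device_count()`). Note that `os.cpu_count()` reports logical cores, which may be more than the number of physical cores.

```python
import os


def suggested_num_threads(num_gpus: int) -> int:
    """CPU cores divided by GPU count, never less than one thread."""
    cpu_cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cpu_cores // max(1, num_gpus))


# Example: threads per pipeline on a machine that drives a single GPU.
print(suggested_num_threads(num_gpus=1))
```

The result can then be passed as the `num_threads` argument when the pipeline is constructed, one pipeline per GPU.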
Running the pipeline
The pipeline output is a list of two elements (the two outputs specified in the `define_graph` method of the `SimplePipeline` class).
Adding augmentation
Random Shuffle
Two more parameters are passed to `FileReader` here:
- `random_shuffle`: enables shuffling of images in the reader. Shuffling is performed using a buffer of images read from disk. When the reader is asked to provide the next image, it randomly selects an image from the buffer, outputs it, and immediately replaces that spot in the buffer with a freshly read image.
- `initial_fill`: sets the capacity of the buffer. The default value of this parameter (1000) is well suited for datasets containing thousands of examples.
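The buffer mechanism described above can be simulated in a few lines of plain Python. This is a sketch for intuition only, not DALI's implementation: `buffered_shuffle` is a hypothetical name, and the final step (emitting the remaining buffer in random order once the source is exhausted) is an assumption, since the description above does not say what happens at the end of the data.

```python
import random


def buffered_shuffle(samples, initial_fill=1000, seed=None):
    """Yield samples in the order a buffer-based shuffling reader might emit them."""
    rng = random.Random(seed)
    source = iter(samples)

    # Fill the buffer with the first `initial_fill` samples read "from disk".
    buffer = []
    for sample in source:
        buffer.append(sample)
        if len(buffer) >= initial_fill:
            break

    # Emit a randomly chosen slot, then refill that slot with a fresh sample.
    for sample in source:
        slot = rng.randrange(len(buffer))
        yield buffer[slot]
        buffer[slot] = sample

    # Assumption: once the source is exhausted, drain the buffer in random order.
    rng.shuffle(buffer)
    yield from buffer


order = list(buffered_shuffle(range(10), initial_fill=3, seed=0))
print(order)
```

With a buffer of 3, the first sample emitted can only be one of the first three read, so a small buffer shuffles only locally. A buffer as large as the dataset allows any sample to come first, which is presumably why the training example in this article sets `initial_fill` to the same value as the iterator size.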
Random Rotate
Hybrid decoding
Specifying “mixed” device parameter in ImageDecoder enables nvJPEG support. Other file formats are still decoded on the CPU.
PyTorch Plugin API reference
DALIClassificationIterator
DALI iterator for classification tasks for PyTorch. It returns 2 outputs (data and label) in the form of PyTorch’s Tensor.
DALIGenericIterator
General DALI iterator for PyTorch. It can return any number of outputs from the DALI pipeline in the form of PyTorch’s Tensors.
DALI example
Example
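import argparse

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models
from tqdm import tqdm

import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin.pytorch import DALIClassificationIterator

DEVICE_ID = 6         # GPU index used throughout the original example
DATASET_SIZE = 23000  # value the original used for initial_fill and iterator size


class HybridTrainPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_dir,
                 crop=224, dali_cpu=False):
        super(HybridTrainPipe, self).__init__(batch_size, num_threads,
                                              device_id, seed=12 + device_id)
        self.input = ops.FileReader(file_root=data_dir, random_shuffle=True,
                                    initial_fill=DATASET_SIZE)
        if dali_cpu:
            # Decode and crop entirely on the CPU. The original used
            # ops.HostDecoder here; ImageDecoder with device='cpu' replaces it
            # in DALI releases that provide ImageDecoder.
            dali_device = 'cpu'
            self.decode = ops.ImageDecoder(device=dali_device, output_type=types.RGB)
        else:
            # "mixed": encoded bytes come from the CPU, decoded images land on the GPU.
            dali_device = 'gpu'
            self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
        self.rrc = ops.RandomResizedCrop(device=dali_device, size=(crop, crop))

    def define_graph(self):
        self.jpegs, self.labels = self.input(name="Reader")
        images = self.decode(self.jpegs)
        images = self.rrc(images)
        self.labels = self.labels.gpu()
        return [images, self.labels]


def train(model, device, train_loader, criterion, optimizer):
    model.train()
    for batch_idx, data in tqdm(enumerate(train_loader)):
        # The iterator yields one dict per pipeline; a single pipeline is used here.
        target = data[0]["label"].squeeze().to(device, non_blocking=True).long()
        images = data[0]["data"].to(device, non_blocking=True).float()
        # DALI returns NHWC batches; torchvision models expect NCHW.
        images = images.permute(0, 3, 1, 2)
        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    # Wait for all queued GPU work before returning.
    torch.cuda.synchronize()


def main():
    parser = argparse.ArgumentParser()
    # Replaces the undefined DOG_CAT_PATH constant of the original.
    parser.add_argument("--data-dir", required=True)
    parser.add_argument("--batch-size", type=int, default=64)
    # The defaults for lr and epochs are placeholders; the original values are not given.
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=1)
    args = parser.parse_args()

    device = torch.device("cuda", DEVICE_ID)
    net = models.resnet50(num_classes=2).to(device)
    criterion = nn.functional.cross_entropy
    optimizer = optim.SGD(net.parameters(), lr=args.lr, momentum=0.9)
    pipe = HybridTrainPipe(batch_size=args.batch_size,
                           num_threads=4,
                           device_id=DEVICE_ID,
                           data_dir=args.data_dir)
    pipe.build()
    train_loader_dali = DALIClassificationIterator(pipe, size=DATASET_SIZE)
    for epoch in range(1, args.epochs + 1):
        train(net, device, train_loader_dali, criterion, optimizer)
        # A DALI iterator must be reset before it can be iterated again.
        train_loader_dali.reset()


if __name__ == "__main__":
    main()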
Other use cases
Object Detection:

Video Pipeline:

Optical Flow:

Comparison
Preprocessing time (tested on the dog/cat dataset)
Test setup: one Intel® Xeon® Silver 4110 CPU and one RTX 2080 Ti GPU, with the whole dataset placed on a memory disk. In this setup DALI iterated over the training data roughly 22 times faster than torchvision (3.49 s versus 77.82 s).
| Cost of iterating over the training data (bs=64) | DOG/CAT dataset |
| --- | --- |
| DALI | 3.49108 s |
| torchvision | 77.81846 s |
Official NVIDIA data (source)
Training on PyTorch

Training on TensorFlow

Additional resources
- Fast data pipeline for deep learning training
- Fast AI Data Preprocessing with NVIDIA DALI
- Integration of DALI with TensorRT on Xavier
- NVIDIA DALI documentation
- DALI Release Notes
- NVIDIA Examples(Github)
- Webinar: Efficient Data Loading using DALI (Youtube)