# Basic Usage of DataLoader and Dataset in PyTorch

## The two dataset formats DataLoader supports

1. Map-style: key/value pairs, e.g. `{0: "張三", 1: "李四"}`
2. Iterable-style: e.g. lists and other iterables

### DataLoader

- In Python, anything you can loop over with a `for` loop is iterable:

```python=
data = [0, 1, 2, 3, 4]
for item in data:
    print(item, end=' ')
```

- The list above is iterable, so `iter()` turns it into an iterator that can be consumed with `next()`:

```python=
data_iter = iter(data)
item = next(data_iter, None)
while item is not None:
    print(item, end=' ')
    item = next(data_iter, None)
```

## Using DataLoader in PyTorch

```python=
from torch.utils.data import DataLoader

data = [i for i in range(100)]

# Create the DataLoader; the three arguments used here are:
# dataset: the dataset to draw samples from
# batch_size: how many samples go into each batch
# shuffle: whether to shuffle the data
dataloader = DataLoader(dataset=data, batch_size=6, shuffle=False)

for i, item in enumerate(dataloader):
    print(i, item)

0 tensor([0, 1, 2, 3, 4, 5])
1 tensor([ 6, 7, 8, 9, 10, 11])
2 tensor([12, 13, 14, 15, 16, 17])
3 tensor([18, 19, 20, 21, 22, 23])
4 tensor([24, 25, 26, 27, 28, 29])
5 tensor([30, 31, 32, 33, 34, 35])
6 tensor([36, 37, 38, 39, 40, 41])
7 tensor([42, 43, 44, 45, 46, 47])
8 tensor([48, 49, 50, 51, 52, 53])
9 tensor([54, 55, 56, 57, 58, 59])
10 tensor([60, 61, 62, 63, 64, 65])
11 tensor([66, 67, 68, 69, 70, 71])
12 tensor([72, 73, 74, 75, 76, 77])
13 tensor([78, 79, 80, 81, 82, 83])
14 tensor([84, 85, 86, 87, 88, 89])
15 tensor([90, 91, 92, 93, 94, 95])
16 tensor([96, 97, 98, 99])
```

### Using a custom iterable-style Dataset

```python=
# A custom iterable-style dataset
from torch.utils.data import IterableDataset

class MyDataset(IterableDataset):
    def __init__(self):
        print('init...')

    def __iter__(self):
        print("iter...")
        self.n = 1
        return self

    def __next__(self):
        print("next...")
        x = self.n
        self.n += 1
        if x >= 100:
            raise StopIteration
        return x

dataloader = DataLoader(MyDataset(), batch_size=5)
for i, item in enumerate(dataloader):
    print(i, item)

init...
iter...
next...
next...
next...
next...
next...
0 tensor([1, 2, 3, 4, 5])
next...
next...
next...
next...
next...
1 tensor([ 6, 7, 8, 9, 10])
next...
next...
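# A minimal, self-contained sketch of the same iterator protocol driven
# by hand, without a DataLoader. The class name HandDrivenIterable is
# made up for illustration; it is a plain Python class, not a PyTorch one.
class HandDrivenIterable:
    def __iter__(self):
        self.n = 1
        return self

    def __next__(self):
        x = self.n
        self.n += 1
        if x >= 100:
            raise StopIteration
        return x

it = iter(HandDrivenIterable())      # triggers __iter__ once
first, second = next(it), next(it)   # each next() triggers __next__ once
print(first, second)                 # 1 2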
```
- Notice the call pattern: `__init__` runs once when the dataset object is created, `__iter__` runs once when the iterator is obtained, and `__next__` runs once for every element fetched.

## Map-style DataLoader

```python=
# A map-style dataset
dataset = {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'}
dataloader = DataLoader(dataset, batch_size=2, shuffle=False)
for i, value in enumerate(dataloader):
    print(i, value)

0 ['a', 'b']
1 ['c', 'd']
2 ['e']
```

```python=
from torch.utils.data import Dataset

class CustomerDataset(Dataset):
    def __init__(self):
        super(CustomerDataset, self).__init__()
        self.data = ['張三', '李四', '王五', '趙六', '陳七']

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

dataloader = DataLoader(CustomerDataset(), batch_size=2, shuffle=True)
for i, value in enumerate(dataloader):
    print(i, value)

0 ['張三', '李四']
1 ['陳七', '趙六']
2 ['王五']
```

### How a map-style DataLoader works

1. It calls `len(dataset)` to get the dataset length, here 5.
2. It generates an index list from that length.
3. It calls `__getitem__` for each index, i.e. `__getitem__(0)`, `__getitem__(1)`, ...
4. It groups the returned samples into batches of `batch_size` and yields them.

## A practical example

- Data source: https://www.kaggle.com/datasets/shilou/crypko-data?select=faces
- Goal: design a Dataset that reads the image files automatically, applies some preprocessing, and is then loaded through a DataLoader.

### First, look at how to list the files

```python=
import os
import torchvision
import torchvision.transforms as transforms

file_path = "./faces"
fnames = [file_path + '/' + fname for fname in os.listdir(file_path)]
fnames

['./faces/0.jpg',
 './faces/1.jpg',
 './faces/10.jpg',
 './faces/100.jpg',
 './faces/1000.jpg',
 './faces/10000.jpg',
 './faces/10001.jpg',
 './faces/10002.jpg',
```

### Implementing a custom map-style Dataset

```python=
class CustomerDataset(Dataset):
    def __init__(self, file_path):
        super(CustomerDataset, self).__init__()
        # Build the file list from the directory passed in,
        # and keep it on the instance instead of relying on globals
        self.fnames = [file_path + '/' + fname for fname in os.listdir(file_path)]
        self.compose = [
            transforms.ToPILImage(),
            transforms.Resize((64, 64)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[.5, .5, .5], std=[.5, .5, .5])
        ]

    def __getitem__(self, index):
        img = torchvision.io.read_image(self.fnames[index])
        transform = transforms.Compose(self.compose)
        return transform(img)

    def __len__(self):
        return len(self.fnames)
```
```python=
dataset = CustomerDataset("./faces")
print(next(iter(dataset)).shape)

dataloader = DataLoader(dataset, batch_size=16)
for i, value in enumerate(dataloader):
    print(i, value.shape)

0 torch.Size([16, 3, 64, 64])
1 torch.Size([16, 3, 64, 64])
2 torch.Size([16, 3, 64, 64])
3 torch.Size([16, 3, 64, 64])
4 torch.Size([16, 3, 64, 64])
5 torch.Size([16, 3, 64, 64])
```

- The DataLoader was created successfully.

### Trying out the dataloader

```python=
next(iter(dataloader)).size()

torch.Size([16, 3, 64, 64])
```

```python=
# Visualize one batch from the dataloader
import matplotlib.pyplot as plt

grid_img = torchvision.utils.make_grid(next(iter(dataloader)), nrow=4)
plt.figure(figsize=(10, 10))
plt.imshow(grid_img.permute(1, 2, 0))
plt.show()
```

![](https://hackmd.io/_uploads/ByzJSwky6.png)

#### torchvision.utils.make_grid

- tensor: the input tensor, usually a 4-D tensor of shape (B, C, H, W), where B is the batch size, C the number of channels, and H and W the height and width of each image.
- nrow: the number of images per row; defaults to 8.
- normalize: whether to normalize; defaults to False. If set to True, pixel values are rescaled into the [0, 1] range. Since our transform normalized pixels to [-1, 1] and `plt.imshow` clips float values outside [0, 1], passing `normalize=True` gives a correctly scaled preview.