# Basic usage of DataLoader and Dataset in PyTorch
## The two dataset formats supported by DataLoader
1. Map style: key-value data, e.g. {0: "張三", 1: "李四"}
2. Iterable style: e.g. lists and iterators (both styles are sketched briefly right after this list)
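As a minimal, hedged sketch (not part of the original walkthrough), both kinds of objects can be handed straight to `DataLoader`: a dict is looked up by its integer keys, a list by position.
```python=
from torch.utils.data import DataLoader

# Map style: looked up by key via __getitem__ and sized via len()
map_style = {0: "張三", 1: "李四"}
# Iterable/sequence style: elements are consumed one after another
seq_style = [0, 1, 2, 3, 4]

print(list(DataLoader(map_style, batch_size=2)))  # [['張三', '李四']]
print(list(DataLoader(seq_style, batch_size=2)))  # [tensor([0, 1]), tensor([2, 3]), tensor([4])]
```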
### Iterables in Python
- In Python, any object you can traverse with a for loop is an iterable, for example:
```python=
data = [0, 1, 2, 3, 4]
for item in data:
    print(item, end=' ')
```
- The list above is an iterable; calling `iter()` on it returns an iterator that can be consumed with `next()`:
```python=
data_iter = iter(data)
item = next(data_iter, None)
while item is not None:
    print(item, end=' ')
    item = next(data_iter, None)
```
## Using DataLoader in PyTorch
```python=
from torch.utils.data import DataLoader

data = [i for i in range(100)]
# Create the DataLoader; here we pass three arguments:
# dataset: the dataset to load from
# batch_size: how many samples go into each batch
# shuffle: whether to shuffle the data each epoch
dataloader = DataLoader(dataset=data, batch_size=6, shuffle=False)
for i, item in enumerate(dataloader):
    print(i, item)
0 tensor([0, 1, 2, 3, 4, 5])
1 tensor([ 6, 7, 8, 9, 10, 11])
2 tensor([12, 13, 14, 15, 16, 17])
3 tensor([18, 19, 20, 21, 22, 23])
4 tensor([24, 25, 26, 27, 28, 29])
5 tensor([30, 31, 32, 33, 34, 35])
6 tensor([36, 37, 38, 39, 40, 41])
7 tensor([42, 43, 44, 45, 46, 47])
8 tensor([48, 49, 50, 51, 52, 53])
9 tensor([54, 55, 56, 57, 58, 59])
10 tensor([60, 61, 62, 63, 64, 65])
11 tensor([66, 67, 68, 69, 70, 71])
12 tensor([72, 73, 74, 75, 76, 77])
13 tensor([78, 79, 80, 81, 82, 83])
14 tensor([84, 85, 86, 87, 88, 89])
15 tensor([90, 91, 92, 93, 94, 95])
16 tensor([96, 97, 98, 99])
```
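Note that the last batch above only has 4 elements. A small follow-up sketch (assuming the `data` and imports above): the `drop_last` argument of `DataLoader` simply discards that incomplete final batch.
```python=
# drop_last=True throws away the final batch when it is smaller than batch_size
dataloader = DataLoader(dataset=data, batch_size=6, shuffle=False, drop_last=True)
print(len(dataloader))       # 16 full batches; the trailing batch of 4 items is dropped
print(list(dataloader)[-1])  # tensor([90, 91, 92, 93, 94, 95])
```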
### Using a custom iterable-style dataset
```python=
# A custom iterable-style dataset
from torch.utils.data import IterableDataset

class MyDataset(IterableDataset):
    def __init__(self):
        print('init...')
    def __iter__(self):
        print("iter...")
        self.n = 1
        return self
    def __next__(self):
        print("next...")
        x = self.n
        self.n += 1
        if x >= 100:
            raise StopIteration
        return x

dataloader = DataLoader(MyDataset(), batch_size=5)
for i, item in enumerate(dataloader):
    print(i, item)
init...
iter...
next...
next...
next...
next...
next...
0 tensor([1, 2, 3, 4, 5])
next...
next...
next...
next...
next...
1 tensor([ 6, 7, 8, 9, 10])
next...
next...
```
- You can see that the iterable object's `__init__` is called once at construction, `__iter__` is called once when the iterator is obtained, and `__next__` is then called once for every element fetched.
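One caveat worth adding (a hedged sketch, not covered above): with `num_workers > 0`, every worker process gets its own replica of the `IterableDataset` and calls `__iter__` independently, so the same data would be emitted once per worker. The usual fix is to give each worker its own slice via `torch.utils.data.get_worker_info()`; `RangeDataset` below is just an illustrative name.
```python=
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class RangeDataset(IterableDataset):
    """Yields the integers in [start, end); each worker only yields its own slice."""
    def __init__(self, start, end):
        self.start, self.end = start, end
    def __iter__(self):
        info = get_worker_info()
        if info is None:  # single-process loading: yield the whole range
            return iter(range(self.start, self.end))
        # split the range so the workers do not produce duplicate samples
        per_worker = (self.end - self.start + info.num_workers - 1) // info.num_workers
        lo = self.start + info.id * per_worker
        hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))

loader = DataLoader(RangeDataset(1, 11), batch_size=5, num_workers=2)
for batch in loader:
    print(batch)  # e.g. tensor([1, 2, 3, 4, 5]) and tensor([6, 7, 8, 9, 10])
```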
## Map-style DataLoader
```python=
# A map-style dataset: a dict already provides __getitem__ and len()
dataset = {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'}
dataloader = DataLoader(dataset, batch_size=2, shuffle=False)
for i, value in enumerate(dataloader):
    print(i, value)
0 ['a', 'b']
1 ['c', 'd']
2 ['e']
```
```python=
from torch.utils.data import Dataset

class CustomerDataset(Dataset):
    def __init__(self):
        super(CustomerDataset, self).__init__()
        self.data = ['張三', '李四', '王五', '趙六', '陳七']
    def __getitem__(self, index):
        return self.data[index]
    def __len__(self):
        return len(self.data)

dataloader = DataLoader(CustomerDataset(), batch_size=2, shuffle=True)
for i, value in enumerate(dataloader):
    print(i, value)
0 ['張三', '李四']
1 ['陳七', '趙六']
2 ['王五']
```
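By default the strings of a batch are simply grouped into a Python list. A hedged sketch of the `collate_fn` argument, in case you want control over how the samples in a batch are combined (here it just joins the names into one string):
```python=
# collate_fn receives the list of samples for one batch and decides how to combine them
dataloader = DataLoader(
    CustomerDataset(),
    batch_size=2,
    collate_fn=lambda batch: ' / '.join(batch)  # join the names instead of returning a list
)
for i, value in enumerate(dataloader):
    print(i, value)
# 0 張三 / 李四
# 1 王五 / 趙六
# 2 陳七
```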
### How a map-style DataLoader works
1. Call `len(dataset)` to get the dataset length, here 5
2. Build the index list (it is shuffled when shuffle=True)
3. Call `__getitem__` for each index, i.e. `__getitem__(0)`, `__getitem__(1)`, ...
4. Group the fetched samples into batches of batch_size and return them (see the sketch after this list)
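The four steps above can be reproduced by hand. A minimal sketch, assuming the `CustomerDataset` defined earlier and a recent PyTorch (1.11+) where `default_collate` is importable from `torch.utils.data`:
```python=
from torch.utils.data import default_collate

dataset = CustomerDataset()  # the 5-name dataset defined above
batch_size = 2

# 1. get the dataset length
n = len(dataset)                       # 5
# 2. build the index list (it would be permuted if shuffle=True)
indices = list(range(n))               # [0, 1, 2, 3, 4]
# 3. fetch every sample through __getitem__
samples = [dataset[i] for i in indices]
# 4. group the samples into chunks of batch_size and collate each chunk
batches = [default_collate(samples[i:i + batch_size]) for i in range(0, n, batch_size)]
print(batches)                         # [['張三', '李四'], ['王五', '趙六'], ['陳七']]
```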
## Hands-on example
- Data source: https://www.kaggle.com/datasets/shilou/crypko-data?select=faces
- Goal: design a Dataset that reads the data automatically, applies some preprocessing, and then loads it through a DataLoader
### First, take a look at how to handle the data
```python=
import os
import torchvision
import torchvision.transforms as transforms

file_path = "./faces"
fnames = [file_path + '/' + fname for fname in os.listdir(file_path)]
fnames
['./faces/0.jpg',
'./faces/1.jpg',
'./faces/10.jpg',
'./faces/100.jpg',
'./faces/1000.jpg',
'./faces/10000.jpg',
'./faces/10001.jpg',
'./faces/10002.jpg',
```
### Implementing a custom map-style dataset
```python=
class CustomerDataset(Dataset):
    def __init__(self, file_path):
        super(CustomerDataset, self).__init__()
        # collect every image path under the given directory
        self.fnames = [file_path + '/' + fname for fname in os.listdir(file_path)]
        self.transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize((64, 64)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[.5, .5, .5], std=[.5, .5, .5])
        ])
    def __getitem__(self, index):
        # read one image and apply the preprocessing pipeline
        img = torchvision.io.read_image(self.fnames[index])
        return self.transform(img)
    def __len__(self):
        return len(self.fnames)
```
```python=
dataset = CustomerDataset("./faces")
print(next(iter(dataset)).shape)
dataloader = DataLoader(dataset, batch_size=16)
for i, value in enumerate(dataloader):
print(i, value.shape)
0 torch.Size([16, 3, 64, 64])
1 torch.Size([16, 3, 64, 64])
2 torch.Size([16, 3, 64, 64])
3 torch.Size([16, 3, 64, 64])
4 torch.Size([16, 3, 64, 64])
5 torch.Size([16, 3, 64, 64])
```
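A quick sanity check (an added sketch): `ToTensor` scales the pixels to [0, 1] and `Normalize(mean=0.5, std=0.5)` then maps them to roughly [-1, 1], which can be verified on one batch.
```python=
batch = next(iter(dataloader))
# (x - 0.5) / 0.5 maps the [0, 1] pixel range to [-1, 1]
print(batch.min().item(), batch.max().item())  # roughly -1.0 and 1.0
```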
- The DataLoader was created successfully
### A quick use of the DataLoader
```python=
next(iter(dataloader)).size()
torch.Size([16, 3, 64, 64])
```
```python=
# Use the DataLoader: take one batch and display it as an image grid
import matplotlib.pyplot as plt
grid_img = torchvision.utils.make_grid(next(iter(dataloader)), nrow=4)
plt.figure(figsize=(10, 10))
plt.imshow(grid_img.permute(1, 2, 0))
plt.show()
```

#### torchvision.utils.make_grid
- tensor: the input tensor, usually a 4-D tensor of shape (B, C, H, W), where B is the batch size, C the number of channels, and H and W the height and width of each image
- nrow: number of images per row, default 8
- normalize: whether to normalize, default False; if True, the pixel values are rescaled into the [0, 1] range
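Since the batches above were normalized to [-1, 1], a short hedged sketch of `normalize=True`, which rescales the grid back into a displayable [0, 1] range:
```python=
# normalize=True rescales the grid values into [0, 1] before display
grid_img = torchvision.utils.make_grid(next(iter(dataloader)), nrow=4, normalize=True)
plt.figure(figsize=(10, 10))
plt.imshow(grid_img.permute(1, 2, 0))
plt.show()
```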