[PyTorch Deep Learning Notes] Representing Real-World Data with Tensors

# [PyTorch Deep Learning Notes] Representing Real-World Data with Tensors

[TOC]

Hi everyone, I'm LukeTseng. Thanks for opening these notes. They follow the book *Deep Learning with PyTorch*, supplemented with material found online. This series is where I start building my deep learning fundamentals, so if you spot a mistake, please point it out. Thank you!

This post covers Chapter 4 of *Deep Learning with PyTorch*, "Real-world data representation using tensors".

## Working with images

How is a color photo represented as numbers? With a tensor.

Take a photo that is 800×600 pixels (800 wide, 600 high):

* Each pixel has 3 color channels: RGB.
* The whole photo can be represented as a tensor of shape `(3, 600, 800)`.
    * First dimension: the 3 color channels.
    * Second and third dimensions: the image's height and width.

The book's examples use the imageio module, but I'm more used to PIL, so here is a PIL version.

I use a sunflower picture as the example (the image in this post is compressed; download the original file to reproduce the results exactly):

![Sunflower_from_Silesia2~1](https://hackmd.io/_uploads/H1lIJvsl-g.jpg)

Image Source: https://commons.wikimedia.org/wiki/File:Sunflower_from_Silesia2.jpg

Code (run in Jupyter Notebook):

```python=
import torch
from PIL import Image
from torchvision import transforms

img = Image.open('Sunflower_from_Silesia2.jpg')
img_tensor = transforms.PILToTensor()(img)
print(img_tensor.shape)
```

Output:

```
torch.Size([3, 1697, 2434])
```

That is 3 channels, a height of 1697 pixels, and a width of 2434 pixels.

With NumPy and PIL, image arrays are usually laid out as HWC: H is Height, W is Width, C is Channel. PyTorch tensors use CHW order instead, because that is the layout PyTorch's convolution layers expect.

`transforms.PILToTensor()` rearranges the dimensions for you, which is equivalent to calling `permute(2, 0, 1)` to go from HWC to CHW.

You can also print the tensor itself:

```python=
import torch
from PIL import Image
from torchvision import transforms

img = Image.open('Sunflower_from_Silesia2.jpg')
img_tensor = transforms.PILToTensor()(img)
print(img_tensor)
```

Output:

```
tensor([[[ 52,  56,  53,  ...,  40,  36,  37],
         [ 50,  53,  56,  ...,  41,  37,  38],
         [ 51,  51,  52,  ...,  40,  38,  38],
         ...,
         [ 49,  50,  48,  ...,  35,  35,  35],
         [ 48,  47,  50,  ...,  35,  37,  36],
         [ 51,  48,  49,  ...,  37,  40,  39]],

        [[104, 108, 105,  ...,  87,  91,  89],
         [105, 105, 108,  ...,  89,  89,  91],
         [106, 103, 103,  ...,  88,  89,  89],
         ...,
         [ 94,  94,  94,  ...,  81,  80,  80],
         [ 94,  93,  96,  ...,  81,  81,  80],
         [ 97,  94,  95,  ...,  81,  81,  80]],

        [[188, 194, 191,  ..., 177, 173, 172],
         [188, 191, 194,  ..., 174, 172, 171],
         [189, 189, 192,  ..., 173, 170, 170],
         ...,
         [179, 181, 180,  ..., 169, 171, 171],
         [182, 181, 182,  ..., 167, 170, 169],
         [183, 180, 181,  ..., 168, 171, 170]]],
       dtype=torch.uint8)
```

The values in the tensor are the color intensities of each pixel, ranging from 0 to 255 (`dtype=torch.uint8`). There are three blocks that open with a double bracket `[[`, one for each channel: R, G, and B.

`torch.uint8` is an 8-bit unsigned integer, which covers exactly 0 ~ 255.

### Normalization

To train a deep learning model on this data later, the values are usually divided by 255, turning them into floats in the range 0.0 to 1.0. This process is called normalization.

There are two ways to write it. The first is to divide by 255 directly, but remember to convert the tensor's data type to float first.

```python=
import torch
from PIL import Image
from torchvision import transforms

img = Image.open('Sunflower_from_Silesia2.jpg')
img_tensor = transforms.PILToTensor()(img)
img_normalized = img_tensor.float() / 255.0
print(img_normalized)
print(img_normalized.min(), img_normalized.max())
```

Output:

```
tensor([[[0.2039, 0.2196, 0.2078,  ..., 0.1569, 0.1412, 0.1451],
         [0.1961, 0.2078, 0.2196,  ..., 0.1608, 0.1451, 0.1490],
         [0.2000, 0.2000, 0.2039,  ..., 0.1569, 0.1490, 0.1490],
         ...,
         [0.1922, 0.1961, 0.1882,  ..., 0.1373, 0.1373, 0.1373],
         [0.1882, 0.1843, 0.1961,  ..., 0.1373, 0.1451, 0.1412],
         [0.2000, 0.1882, 0.1922,  ..., 0.1451, 0.1569, 0.1529]],

        [[0.4078, 0.4235, 0.4118,  ..., 0.3412, 0.3569, 0.3490],
         [0.4118, 0.4118, 0.4235,  ..., 0.3490, 0.3490, 0.3569],
         [0.4157, 0.4039, 0.4039,  ..., 0.3451, 0.3490, 0.3490],
         ...,
         [0.3686, 0.3686, 0.3686,  ..., 0.3176, 0.3137, 0.3137],
         [0.3686, 0.3647, 0.3765,  ..., 0.3176, 0.3176, 0.3137],
         [0.3804, 0.3686, 0.3725,  ..., 0.3176, 0.3176, 0.3137]],

        [[0.7373, 0.7608, 0.7490,  ..., 0.6941, 0.6784, 0.6745],
         [0.7373, 0.7490, 0.7608,  ..., 0.6824, 0.6745, 0.6706],
         [0.7412, 0.7412, 0.7529,  ..., 0.6784, 0.6667, 0.6667],
         ...,
         [0.7020, 0.7098, 0.7059,  ..., 0.6627, 0.6706, 0.6706],
         [0.7137, 0.7098, 0.7137,  ..., 0.6549, 0.6667, 0.6627],
         [0.7176, 0.7059, 0.7098,  ..., 0.6588, 0.6706, 0.6667]]])
tensor(0.) tensor(1.)
```

The second way is to use `ToTensor()` in place of `PILToTensor()`. It converts the values to float32 and scales them to `[0, 1]` automatically:

```python=
import torch
from PIL import Image
from torchvision import transforms

img = Image.open('Sunflower_from_Silesia2.jpg')
img_tensor = transforms.ToTensor()(img)
print(img_tensor)
print(img_tensor.dtype)
```

Output:

```
tensor([[[0.2039, 0.2196, 0.2078,  ..., 0.1569, 0.1412, 0.1451],
         [0.1961, 0.2078, 0.2196,  ..., 0.1608, 0.1451, 0.1490],
         [0.2000, 0.2000, 0.2039,  ..., 0.1569, 0.1490, 0.1490],
         ...,
         [0.1922, 0.1961, 0.1882,  ..., 0.1373, 0.1373, 0.1373],
         [0.1882, 0.1843, 0.1961,  ..., 0.1373, 0.1451, 0.1412],
         [0.2000, 0.1882, 0.1922,  ..., 0.1451, 0.1569, 0.1529]],

        [[0.4078, 0.4235, 0.4118,  ..., 0.3412, 0.3569, 0.3490],
         [0.4118, 0.4118, 0.4235,  ..., 0.3490, 0.3490, 0.3569],
         [0.4157, 0.4039, 0.4039,  ..., 0.3451, 0.3490, 0.3490],
         ...,
         [0.3686, 0.3686, 0.3686,  ..., 0.3176, 0.3137, 0.3137],
         [0.3686, 0.3647, 0.3765,  ..., 0.3176, 0.3176, 0.3137],
         [0.3804, 0.3686, 0.3725,  ..., 0.3176, 0.3176, 0.3137]],

        [[0.7373, 0.7608, 0.7490,  ..., 0.6941, 0.6784, 0.6745],
         [0.7373, 0.7490, 0.7608,  ..., 0.6824, 0.6745, 0.6706],
         [0.7412, 0.7412, 0.7529,  ..., 0.6784, 0.6667, 0.6667],
         ...,
         [0.7020, 0.7098, 0.7059,  ..., 0.6627, 0.6706, 0.6706],
         [0.7137, 0.7098, 0.7137,  ..., 0.6549, 0.6667, 0.6627],
         [0.7176, 0.7059, 0.7098,  ..., 0.6588, 0.6706, 0.6667]]])
torch.float32
```

There is also a third variant that is common in deep learning: normalizing to `[-1, 1]`. `Normalize` computes `(x - mean) / std` per channel, so with a mean and std of 0.5 the range `[0, 1]` maps to `[-1, 1]`:

```python=
import torch
from PIL import Image
from torchvision import transforms

img = Image.open('Sunflower_from_Silesia2.jpg')
transform = transforms.Compose([
    transforms.ToTensor(),  # scales to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])  # maps to [-1, 1]
])

img_tensor = transform(img)
print(img_tensor)
```

Output:

```
tensor([[[-0.5922, -0.5608, -0.5843,  ..., -0.6863, -0.7176, -0.7098],
         [-0.6078, -0.5843, -0.5608,  ..., -0.6784, -0.7098, -0.7020],
         [-0.6000, -0.6000, -0.5922,  ..., -0.6863, -0.7020, -0.7020],
         ...,
         [-0.6157, -0.6078, -0.6235,  ..., -0.7255, -0.7255, -0.7255],
         [-0.6235, -0.6314, -0.6078,  ..., -0.7255, -0.7098, -0.7176],
         [-0.6000, -0.6235, -0.6157,  ..., -0.7098, -0.6863, -0.6941]],

        [[-0.1843, -0.1529, -0.1765,  ..., -0.3176, -0.2863, -0.3020],
         [-0.1765, -0.1765, -0.1529,  ..., -0.3020, -0.3020, -0.2863],
         [-0.1686, -0.1922, -0.1922,  ..., -0.3098, -0.3020, -0.3020],
         ...,
         [-0.2627, -0.2627, -0.2627,  ..., -0.3647, -0.3725, -0.3725],
         [-0.2627, -0.2706, -0.2471,  ..., -0.3647, -0.3647, -0.3725],
         [-0.2392, -0.2627, -0.2549,  ..., -0.3647, -0.3647, -0.3725]],

        [[ 0.4745,  0.5216,  0.4980,  ...,  0.3882,  0.3569,  0.3490],
         [ 0.4745,  0.4980,  0.5216,  ...,  0.3647,  0.3490,  0.3412],
         [ 0.4824,  0.4824,  0.5059,  ...,  0.3569,  0.3333,  0.3333],
         ...,
         [ 0.4039,  0.4196,  0.4118,  ...,  0.3255,  0.3412,  0.3412],
         [ 0.4275,  0.4196,  0.4275,  ...,  0.3098,  0.3333,  0.3255],
         [ 0.4353,  0.4118,  0.4196,  ...,  0.3176,  0.3412,  0.3333]]])
```

## 3D images: Volumetric data

Medical scans (such as CT and MRI) produce 3D image data.

For example, one CT scan may have 100 to 500 slices, each a 512x512 image, so the tensor shape might be `(100, 512, 512)`.

Here is the authors' example (from [dlwpt-code/p1ch4/2_volumetric_ct.ipynb at master · deep-learning-with-pytorch/dlwpt-code](https://github.com/deep-learning-with-pytorch/dlwpt-code/blob/master/p1ch4/2_volumetric_ct.ipynb)):

```python=
import numpy as np
import torch
import imageio

torch.set_printoptions(edgeitems=2, threshold=50)

dir_path = "./data/p1ch4/volumetric-dicom/2-LUNG 3.0 B70f-04083"
vol_arr = imageio.volread(dir_path, 'DICOM')
print(vol_arr.shape)

vol = torch.from_numpy(vol_arr).float()
vol = torch.unsqueeze(vol, 0)
print(vol.shape)

%matplotlib inline
import matplotlib.pyplot as plt

plt.imshow(vol_arr[50])  # show the slice at index 50
plt.show()
```

Output:

![image](https://hackmd.io/_uploads/ByRpvV3x-g.png)

imageio can read `DICOM` files. PIL cannot do so on its own and needs pydicom alongside it.

DICOM, short for Digital Imaging and Communications in Medicine, is the standard for storing and transmitting medical images.

```python
torch.set_printoptions(edgeitems=2, threshold=50)
```

This makes tensor printing show only the first and last 2 elements along each dimension, and summarize any tensor with more than 50 elements by eliding the middle.

```python=
dir_path = "./data/p1ch4/volumetric-dicom/2-LUNG 3.0 B70f-04083"
vol_arr = imageio.volread(dir_path, 'DICOM')
print(vol_arr.shape)
```

This is the code that reads the DICOM files.

`imageio.volread(dir_path, 'DICOM')` reads all the DICOM files in the directory.

`vol_arr.shape` is `(99, 512, 512)`: 99 CT slices of 512x512 pixels each. This is the same idea as the image tensors earlier.

`vol = torch.unsqueeze(vol, 0)` adds a dimension at position 0, so the shape `(99, 512, 512)` becomes `(1, 99, 512, 512)`. The extra dimension is the channel dimension (a CT volume has a single intensity channel). A batch dimension would come on top of that, by stacking several such volumes.

## Tabular data

The most common case is two-dimensional tabular data such as .csv or .xlsx files. How is that represented as a tensor?

Like this (the example uses pandas to read the data):

```python=
import pandas as pd
import torch

data = pd.read_csv('data.csv')

# Split features and label
input_data = data.iloc[:, :-1]  # every column but the last is a feature
output_data = data.iloc[:, -1]  # the last column is the label

# Convert to tensors (torch.tensor with an explicit dtype instead of the legacy torch.Tensor constructor)
input_tensor = torch.tensor(input_data.to_numpy(), dtype=torch.float32)
output_tensor = torch.tensor(output_data.to_numpy())

print('Input format:', input_tensor.shape, input_tensor.dtype)
print('Output format:', output_tensor.shape, output_tensor.dtype)
print(f'input_tensor = {input_tensor}')
print(f'output_tensor = {output_tensor}')
```

The `data.csv` file used here contains:

```csv
area,rooms,age,distance_to_station,price
85.5,3,10,500,15000000
120.0,4,5,300,25000000
65.2,2,15,800,12000000
95.8,3,8,450,18000000
110.5,4,3,200,28000000
```

Output:

```
Input format: torch.Size([5, 4]) torch.float32
Output format: torch.Size([5]) torch.int64
input_tensor = tensor([[ 85.5000,   3.0000,  10.0000, 500.0000],
        [120.0000,   4.0000,   5.0000, 300.0000],
        [ 65.2000,   2.0000,  15.0000, 800.0000],
        [ 95.8000,   3.0000,   8.0000, 450.0000],
        [110.5000,   4.0000,   3.0000, 200.0000]])
output_tensor = tensor([15000000, 25000000, 12000000, 18000000, 28000000])
```

The data is split this way so the model can predict the house price as its output, while the first four columns are the factors that influence the price and therefore serve as input features.

These tensors can be wrapped in a `TensorDataset` to make later training easier. Before doing so, import `from torch.utils.data import TensorDataset, DataLoader`.

```python=
from torch.utils.data import TensorDataset, DataLoader

# Build the TensorDataset
dataset = TensorDataset(input_tensor, output_tensor)

# Build a DataLoader to read the data in batches
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch_idx, (features, labels) in enumerate(dataloader):
    print(f'\nBatch {batch_idx + 1}:')
    print(f'Feature shape: {features.shape}')
    print(f'Label shape: {labels.shape}')
    print(f'Feature sample:\n{features[:2]}')  # show the first 2 rows
    print(f'Label sample: {labels[:2]}')
```

Output (the row order varies between runs because `shuffle=True`):

```
Batch 1:
Feature shape: torch.Size([5, 4])
Label shape: torch.Size([5])
Feature sample:
tensor([[ 95.8000,   3.0000,   8.0000, 450.0000],
        [120.0000,   4.0000,   5.0000, 300.0000]])
Label sample: tensor([18000000, 25000000])
```

There is only one batch because the .csv has just 5 rows while `batch_size` is 32, so every row fits into a single batch.

## Handling categorical (non-numeric) data: One-hot encoding

Suppose a column records the weather:

* 1 = sunny
* 2 = cloudy
* 3 = rain
* 4 = snow

Using 1 2 3 4 directly has a problem: the model may assume 4 > 3 > 2 > 1, but weather types have no such order.

The fix is to use one-hot encoding for this kind of data.

Suppose the weather values are `[1, 2, 3, 4, 1]`.

After one-hot encoding:

```
1 → [1, 0, 0, 0]
2 → [0, 1, 0, 0]
3 → [0, 0, 1, 0]
4 → [0, 0, 0, 1]
1 → [1, 0, 0, 0]
```

In PyTorch, `torch.nn.functional.one_hot()` implements one-hot encoding.

Import it first: `import torch.nn.functional as F`

Example:

```python=
import torch
import torch.nn.functional as F

# Raw weather data: 1 = sunny, 2 = cloudy, 3 = rain, 4 = snow
weather_data = torch.tensor([1, 2, 3, 4, 1])

# Class indices are 0-based, so subtract 1
# (storing the data as 0~3 to begin with also works)
weather_indices = weather_data - 1  # becomes [0, 1, 2, 3, 0]

# One-hot encode; num_classes=4 means there are 4 categories
weather_onehot = F.one_hot(weather_indices, num_classes=4)

print("Original weather data:")
print(weather_data)
print("\nConverted indices:")
print(weather_indices)
print("\nOne-hot result:")
print(weather_onehot)
print("\nShape:", weather_onehot.shape)  # torch.Size([5, 4])
```

Output:

```
Original weather data:
tensor([1, 2, 3, 4, 1])

Converted indices:
tensor([0, 1, 2, 3, 0])

One-hot result:
tensor([[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [1, 0, 0, 0]])

Shape: torch.Size([5, 4])
```

Why is there another 1 after 1, 2, 3, 4? It shows that a repeated category maps to the same one-hot vector. Real data looks like this too: some weather types appear more often than others, and the encoding preserves that.

In the one-hot result above, the four values in each row stand for sunny, cloudy, rain, and snow. If the current entry is sunny, the first value is 1 and the rest are 0, and so on.

### One-hot encoding

One-hot encoding is a common method in ML for converting categorical data into numeric form.

The idea is to build a binary vector for each category in which exactly one position is 1 and every other position is 0, as in the output above.

Pros:

* Prevents categories from being misread as having a numeric order.
* Lets the model treat every category on an equal footing.
* Preserves the independence and unordered nature of the categories.

Cons:

* With many categories, the feature dimensionality grows sharply (the curse of dimensionality).
* Produces sparse matrices, which can be costly to compute with.

## Handling time series data
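A time series is data whose order matters, because observations are causally linked in time.

With tabular data, by contrast, row order matters much less and each row is treated as independent.

Real examples include stock price prediction and weather forecasting: today's price is influenced by yesterday's, and if it rains today it may well keep raining tomorrow.

~~I couldn't find a better dataset to replace the book's example, so I'll just use theirs XD.~~

Example code: https://github.com/deep-learning-with-pytorch/dlwpt-code/blob/master/p1ch4/4_time_series_bikes.ipynb

The dataset is the hourly rental count of the Washington, D.C. bike-sharing system (Capital Bikeshare) for 2011-2012, together with weather and seasonal information.

```python=
import numpy as np
import torch
torch.set_printoptions(edgeitems=2, threshold=50, linewidth=75)

bikes_numpy = np.loadtxt(
    "./data/p1ch4/bike-sharing-dataset/hour-fixed.csv",
    dtype=np.float32,
    delimiter=",",
    skiprows=1,
    converters={1: lambda x: float(x[8:10])})
bikes = torch.from_numpy(bikes_numpy)
bikes
```

In the code below, `skiprows=1` skips the header row, and `converters={1: lambda x: float(x[8:10])}` turns the date string in column 1 into a number.

```python=
bikes_numpy = np.loadtxt(
    "./data/p1ch4/bike-sharing-dataset/hour-fixed.csv",
    dtype=np.float32,
    delimiter=",",
    skiprows=1,
    converters={1: lambda x: float(x[8:10])})
```

Why convert the date string to a number, and what does the result look like? See these examples:

| Original date string | `x[8:10]` | `float(x[8:10])` |
| ------------ | ------- | -------------- |
| "2011-01-01" | "01" | 1.0 |
| "2011-01-15" | "15" | 15.0 |
| "2011-01-31" | "31" | 31.0 |
| "2012-12-05" | "05" | 5.0 |

The converted number is the day of the month (1-31).

The design simplifies the data: the network does not need the full date, only which day of the month it is.

It also keeps the representation compact, since a single number (1-31) is simpler than a full date string.

Finally, it preserves periodic information: days of the month repeat every month, which can help when learning patterns.

The output looks like this:

```
tensor([[1.0000e+00, 1.0000e+00,  ..., 1.3000e+01, 1.6000e+01],
        [2.0000e+00, 1.0000e+00,  ..., 3.2000e+01, 4.0000e+01],
        ...,
        [1.7378e+04, 3.1000e+01,  ..., 4.8000e+01, 6.1000e+01],
        [1.7379e+04, 3.1000e+01,  ..., 3.7000e+01, 4.9000e+01]])
```

Its shape is:

```
torch.Size([17520, 17])
```

That means:

* 17,520 rows: 17,520 hours (730 days × 24 hours)
* 17 columns: 17 features (variables)

The notebook then calls `bikes.stride()`. A stride is a step size.

The output here is `(17, 1)`, which means:

* Moving 1 step along the first dimension (rows) advances 17 positions in storage (each row holds 17 numbers).
* Moving 1 step along the second dimension (columns) advances 1 position in storage.

What is a stride? In a multi-dimensional tensor, it is how many elements must be skipped in the underlying one-dimensional memory to get from one element to the next element along the same dimension.

### Adding a time dimension

Next, add a time dimension.

The data is one long sequence (17,520 hours). It can be grouped by day so the network can learn patterns within a day (such as the 8 a.m. commute and the 6 p.m. trip home).

That means reshaping from `(17520, 17)` to `(730, 24, 17)`: 730 days, 24 hours, 17 features.

In code, use `.view()` to reshape the tensor.

```python=
daily_bikes = bikes.view(-1, 24, bikes.shape[1])
daily_bikes.shape, daily_bikes.stride()
```

This gives `(torch.Size([730, 24, 17]), (408, 17, 1))`. The new stride of 408 is 24 × 17: moving to the next day skips one full day of 24 rows with 17 values each.

The `-1` argument to `.view()` tells PyTorch to work out that dimension automatically. PyTorch computes $17{,}520 \div 24 = 730$.

The second argument, 24, fixes the second dimension at 24, and the third argument (`bikes.shape[1]`) fixes the third dimension at 17.

### Reordering dimensions

The layout usually expected for sequence data (for example by 1D convolutions) is `(N, C, L)`:

* N: number of samples = 730 days.
* C: channels = 17 features.
* L: sequence length = 24 hours.

The current order is `(N, L, C)`, so the second and third dimensions (indices 1 and 2) need to be transposed:

```python=
daily_bikes = daily_bikes.transpose(1, 2)
daily_bikes.shape, daily_bikes.stride()
```

Output:

```
(torch.Size([730, 17, 24]), (408, 1, 17))
```

### Preparing the training data

This brings back the problem of categorical data. The original dataset has a `weathersit` column for the weather situation. It is an ordinal variable with 4 levels:

* 1 = good weather
* 2 = mist
* 3 = light rain/snow
* 4 = heavy rain/snow

It can be treated as a categorical variable and one-hot encoded, or used directly as a "continuous" variable.

The authors use one-hot encoding here:

```python=
# Look at the first day only
first_day = bikes[:24].long()  # the first 24 hours
weather_onehot = torch.zeros(first_day.shape[0], 4)  # 24 hours × 4 weather types
# The weather situation (column 9), to be used as the index
first_day[:, 9]
```

Output:

```
tensor([1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2])
```

Next comes the one-hot encoding:

```python=
weather_onehot.scatter_(
    dim=1,  # scatter along the column dimension
    index=first_day[:,9].unsqueeze(1).long() - 1,  # weather index, shifted to 0-based
    value=1.0)  # write 1
```

Output:

```
tensor([[1., 0., 0., 0.],
        [1., 0., 0., 0.],
        ...,
        [0., 1., 0., 0.],
        [0., 1., 0., 0.]])
```

Then concatenate the result back onto the original data to get a tensor the network can process:

```python
torch.cat((bikes[:24], weather_onehot), 1)[:1]
```

Output:

```
tensor([[ 1.0000,  1.0000,  1.0000,  0.0000,  1.0000,  0.0000,  0.0000,
          6.0000,  0.0000,  1.0000,  0.2400,  0.2879,  0.8100,  0.0000,
          3.0000, 13.0000, 16.0000,  1.0000,  0.0000,  0.0000,  0.0000]])
```

The rest of the original example does essentially the same thing, with a normalization step added at the very end.

## Representing text with tensors

In recent years deep learning has brought revolutionary progress to NLP (Natural Language Processing). One family of networks, recurrent neural networks (RNNs), is used for text classification, analysis, generation, machine translation, and more.

The catch is that a network can only process numbers, that is, tensors. So how is text turned into numbers?

Text can be processed at two levels: the character level and the word level.

| Level | Unit processed | Pros | Cons |
| -------- | -------- | -------- | -------- |
| Character-level | One character at a time | Few distinct symbols (for English, 26 letters plus punctuation) | Each character carries little information |
| Word-level | One word at a time | Each word carries a lot of information | The vocabulary is very large (and unseen words must be handled) |

### Character-level one-hot encoding

The book uses Jane Austen's "Pride and Prejudice" as the example.

The example code for this section is at https://github.com/deep-learning-with-pytorch/dlwpt-code/blob/master/p1ch4/5_text_jane_austen.ipynb

```python=
# `text` holds the full text of the book, loaded earlier in the notebook
lines = text.split('\n')
line = lines[200]
line
```

Pick one line of text to encode, using the now-familiar one-hot encoding.

```python=
letter_t = torch.zeros(len(line), 128)
letter_t.shape
```

Each character becomes a vector of length 128 (ASCII has 128 characters).

The output is `torch.Size([70, 128])`: the line has 70 characters, and each character is represented by a 128-dimensional vector.

```python=
for i, letter in enumerate(line.lower().strip()):
    # ASCII code; non-ASCII characters (such as curly quotes) fall back to index 0
    letter_index = ord(letter) if ord(letter) < 128 else 0
    letter_t[i][letter_index] = 1  # write 1 at the matching position
```

This is the character-level one-hot encoding itself.

### Word-level one-hot encoding

```python=
def clean_words(input_str):
    punctuation = '.,;:"!?”“_-'
    word_list = input_str.lower().replace('\n',' ').split()
    word_list = [word.strip(punctuation) for word in word_list]
    return word_list

words_in_line = clean_words(line)
line, words_in_line
```

This is preprocessing: lowercase the text and strip punctuation from the ends of each token, so only the words themselves remain.

Output:

```
('“Impossible, Mr. Bennet, impossible, when I am not acquainted with him',
 ['impossible', 'mr', 'bennet', 'impossible', 'when', 'i', 'am', 'not',
  'acquainted', 'with', 'him'])
```

The second step is to build a dictionary that maps each word to an index:

```python=
word_list = sorted(set(clean_words(text)))
word2index_dict = {word: i for (i, word) in enumerate(word_list)}

len(word2index_dict), word2index_dict['impossible']
```

The first line, `word_list = sorted(set(clean_words(text)))`, uses a set to remove duplicate words.

The second line builds the dictionary.

The output is `(7261, 3394)`: the book has 7261 distinct words, and "impossible" has index 3394.

The last step is the word-level one-hot encoding:

```python=
word_t = torch.zeros(len(words_in_line), len(word2index_dict))
for i, word in enumerate(words_in_line):
    word_index = word2index_dict[word]
    word_t[i][word_index] = 1
    print('{:2} {:4} {}'.format(i, word_index, word))

print(word_t.shape)
```

## Text embeddings

One drawback of one-hot encoding was covered earlier: once the vocabulary gets large, the dimensionality blows up and training becomes painful.

There is another problem: it cannot express similarity between words. "apple" and "orange" are both fruit, but under one-hot encoding they are exactly as far apart as any other pair of words, so no semantic relationship is captured.

The technique that addresses this is word embeddings, also called word vectors.

The idea is to represent each word with a low-dimensional vector of floats (say 100 dimensions), such that semantically similar words lie close together in that space.

For example:

```
One-hot encoding:
apple  → [0, 0, ..., 1, 0, ..., 0]  (7261 dims)
orange → [0, 0, ..., 0, 1, ..., 0]  (7261 dims)

Word embedding:
apple  → [0.8, 0.2, -0.5, ..., 0.3]    (100 dims)
orange → [0.75, 0.25, -0.4, ..., 0.35] (100 dims)
dog    → [-0.1, 0.9, 0.6, ..., -0.2]   (100 dims)
```

One-hot entries are only ever 0 or 1, so the encoding cannot express degrees of similarity. Embeddings take continuous values, so they can.

Notice that the values for apple and orange are close to each other, while neither shows any resemblance to dog.

![image](https://hackmd.io/_uploads/HybHvSyWbx.png)

Image Source: *Deep Learning with PyTorch*, page 100.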
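The closeness of `apple` and `orange` in the toy vectors above can be checked numerically. Below is a minimal sketch, assuming made-up 4-dimensional vectors (the values are hypothetical and only loosely mirror the example; real embeddings are learned) and assuming cosine similarity as the measure of closeness, which these notes do not specify:

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-dimensional embeddings, written by hand for illustration only.
apple = torch.tensor([0.8, 0.2, -0.5, 0.3])
orange = torch.tensor([0.75, 0.25, -0.4, 0.35])
dog = torch.tensor([-0.1, 0.9, 0.6, -0.2])

# Cosine similarity: close to 1 when two vectors point in nearly the same direction.
sim_apple_orange = F.cosine_similarity(apple, orange, dim=0)
sim_apple_dog = F.cosine_similarity(apple, dog, dim=0)

# One-hot vectors of two different words never overlap, so their similarity is 0.
onehot = F.one_hot(torch.tensor([0, 1]), num_classes=4).float()
sim_onehot = F.cosine_similarity(onehot[0], onehot[1], dim=0)

print(f"apple vs orange: {sim_apple_orange.item():.3f}")
print(f"apple vs dog:    {sim_apple_dog.item():.3f}")
print(f"one-hot pair:    {sim_onehot.item():.3f}")
```

With these hand-picked values, the apple/orange pair scores far higher than apple/dog, while any two distinct one-hot vectors always score 0 no matter which words they stand for.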
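The book uses a two-dimensional space to illustrate how near or far words are from one another.

Arithmetic on these vectors can land close to another word's vector, which makes analogies like this possible:

```python=
import torch

apple_vector = torch.tensor([0.1, 0.1])
red_adjustment = torch.tensor([0.0, -0.1])
yellow_adjustment = torch.tensor([0.0, 0.5])

# Element-wise tensor addition (adding plain Python lists would concatenate them instead)
result = apple_vector + red_adjustment + yellow_adjustment
# result ≈ [0.1, 0.5], close to lemon's [0.2, 0.5]
```

The result of `apple_vector + red_adjustment + yellow_adjustment` can be read as an analogy for lemon.

## Summary

### Tensor representation of a color image

An RGB color photo can be represented as a three-dimensional tensor:

* 3 channels (RGB)
* a height × width pixel matrix

For an 800×600 image (800 wide, 600 high), the tensor shape is `(3, 600, 800)` (CHW layout).

Tensors use CHW because that is the layout PyTorch's convolution layers expect.

`transforms.PILToTensor()` converts a PIL Image into this layout, going from HWC → CHW automatically.

The tensor values are uint8 in the range 0~255 and represent each pixel's color intensity.

### Normalization

For training, pixel values are rescaled to a range better suited to neural networks:

1. Simple normalization to `[0, 1]`
    - Convert the tensor to float, then divide by 255.
2. Using `transforms.ToTensor()`
    - Does both automatically: HWC to CHW, and dividing by 255.
3. Normalization to `[-1, 1]`
    - Use `transforms.Normalize(mean=[0.5,...], std=[0.5,...])` after `ToTensor()`.

### Converting tabular data to tensors

After reading a CSV with pandas:

* The first n columns → features (input)
* The last column → label (output)

If the data has 5 rows and 4 features:

* Feature tensor shape → (5, 4)
* Label tensor shape → (5,)

These tensors can be packed with TensorDataset and DataLoader for batched training.

## References

*Deep Learning with PyTorch*, Chapter 4: Real-world data representation using tensors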