Course
湯濬澤
NTUST_M11015117
HackMD Link: https://hackmd.io/@RTon/H1YkYXjUj
NeRF uses a neural network to learn the mapping from a 3D position and viewing direction to color and density, representing a 3D scene as a radiance field. In theory, given only multi-view photos of an object, it can synthesize every in-between novel view, and the representation can even be converted into a point cloud.
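For reference, NeRF fits a function $F_\Theta(\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)$ and renders each pixel by the standard volume rendering integral along its camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$:

$$
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
$$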
Positional encoding is a key technique behind NeRF's quality. It maps the inputs through Fourier features (sinusoids at multiple frequencies), which lets the network pick up high-frequency information quickly, so NeRF reproduces high-frequency content such as text and fine geometric detail much better.
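A minimal sketch of this kind of frequency encoding (the function name and number of frequency bands are illustrative choices, not taken from the report):

```python
import math
import torch

def positional_encoding(x, num_freqs=10):
    # Map each input coordinate to sines/cosines at multiple frequencies,
    # so the MLP can fit high-frequency variation more easily.
    freqs = (2.0 ** torch.arange(num_freqs)) * math.pi   # (num_freqs,)
    xb = x.unsqueeze(-1) * freqs                          # (..., D, num_freqs)
    enc = torch.cat([xb.sin(), xb.cos()], dim=-1)         # (..., D, 2*num_freqs)
    return torch.cat([x, enc.flatten(-2)], dim=-1)        # (..., D*(1 + 2*num_freqs))
```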
Compared with traditional methods, NeRF does much better on scene realism and lighting. Because training learns the relationship between the radiance field and the viewing direction, it can reproduce view-dependent effects that would otherwise require ray tracing, which makes the renderings more realistic. Its drawbacks are that it is hard to train, training takes a long time, and it requires running Structure from Motion beforehand to obtain the camera poses.
DVGO greatly alleviates NeRF's long training time: it replaces NeRF's MLP-based volume density with a dense voxel grid, which speeds up the computation. Compared with other voxel-grid approaches, DVGO further uses post-activation to reduce the number of voxels needed while still keeping sharp surfaces. For color, it follows NeRF and uses an MLP.
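The gist of post-activation is to trilinearly interpolate the raw density grid first and only apply the density activation and alpha conversion afterwards, so a single voxel can model a sharp surface. A rough sketch under that reading (tensor shapes and the shift term are illustrative, not DVGO's exact code):

```python
import torch
import torch.nn.functional as F

def post_activated_alpha(density_grid, pts_norm, interval, act_shift=0.0):
    # density_grid: (1, 1, D, H, W) raw (pre-activation) density values
    # pts_norm:     (N, 3) query points normalized to [-1, 1]
    raw = F.grid_sample(density_grid,
                        pts_norm.view(1, 1, 1, -1, 3),
                        mode='bilinear',        # trilinear interpolation for 5D inputs
                        align_corners=True).view(-1)
    sigma = F.softplus(raw + act_shift)         # activate *after* interpolation
    alpha = 1.0 - torch.exp(-sigma * interval)  # alpha used in volume rendering
    return alpha
```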
For the implementation, training first computes the bounding box of the scene from the camera information:
```python
xyz_min_coarse, xyz_max_coarse = compute_bbox_by_cam_frustrm(args=args, cfg=cfg, **data_dict)
```
Then a DVGO model is created:
```python
model = dvgo.DirectVoxGO(
    xyz_min=xyz_min, xyz_max=xyz_max,
    num_voxels=num_voxels,
    mask_cache_path=coarse_ckpt_path,
    **model_kwargs
)
```
Rays are then sampled at random:
```python
sel_b = torch.randint(rgb_tr.shape[0], [cfg_train.N_rand])
sel_r = torch.randint(rgb_tr.shape[1], [cfg_train.N_rand])
sel_c = torch.randint(rgb_tr.shape[2], [cfg_train.N_rand])
target = rgb_tr[sel_b, sel_r, sel_c]
rays_o = rays_o_tr[sel_b, sel_r, sel_c]
rays_d = rays_d_tr[sel_b, sel_r, sel_c]
viewdirs = viewdirs_tr[sel_b, sel_r, sel_c]
```
The sampled rays are then fed into the model for training:
```python
# volume rendering
render_result = model(
    rays_o, rays_d, viewdirs,
    global_step=global_step, is_train=True,
    **render_kwargs
)

# gradient descent step
optimizer.zero_grad(set_to_none=True)
loss = cfg_train.weight_main * F.mse_loss(render_result['rgb_marched'], target)
...
loss.backward()
optimizer.step()
```
In the model's forward pass, points are first sampled along each ray; the densities are then converted to alphas with post-activation, and the accumulated transmittance is computed to obtain the per-sample weights:
```python
ray_pts, ray_id, step_id = self.sample_ray(
    rays_o=rays_o, rays_d=rays_d, **render_kwargs)
...
density = self.density(ray_pts)
alpha = self.activate_density(density, interval)
...
weights, alphainv_last = Alphas2Weights.apply(alpha, ray_id, N)
```
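Alphas2Weights is a custom autograd op in DVGO; what it computes is the usual accumulated-transmittance weighting. A per-ray reference sketch (assuming `alpha` holds the ordered samples of a single ray, which is an assumption for illustration):

```python
import torch

def alphas_to_weights_single_ray(alpha, eps=1e-10):
    # T_i = prod_{j<i} (1 - alpha_j);  w_i = alpha_i * T_i
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha + eps]), dim=0)[:-1]
    weights = alpha * trans
    # leftover transmittance after the last sample (goes to the background color)
    alphainv_last = trans[-1] * (1.0 - alpha[-1] + eps)
    return weights, alphainv_last
```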
Color is computed as follows:
```python
if self.rgbnet is None:
    # no view-dependent effect
    rgb = torch.sigmoid(k0)
else:
    # view-dependent color emission
    if self.rgbnet_direct:
        k0_view = k0
    else:
        k0_view = k0[:, 3:]
        k0_diffuse = k0[:, :3]
    viewdirs_emb = (viewdirs.unsqueeze(-1) * self.viewfreq).flatten(-2)
    viewdirs_emb = torch.cat([viewdirs, viewdirs_emb.sin(), viewdirs_emb.cos()], -1)
    viewdirs_emb = viewdirs_emb.flatten(0,-2)[ray_id]
    rgb_feat = torch.cat([k0_view, viewdirs_emb], -1)
    rgb_logit = self.rgbnet(rgb_feat)
    if self.rgbnet_direct:
        rgb = torch.sigmoid(rgb_logit)
    else:
        rgb = torch.sigmoid(rgb_logit + k0_diffuse)
```
Finally, ray marching (alpha compositing along each ray) produces the rendered color:
```python
rgb_marched = segment_coo(
    src=(weights.unsqueeze(-1) * rgb),
    index=ray_id,
    out=torch.zeros([N, 3]),
    reduce='sum')
rgb_marched += (alphainv_last.unsqueeze(-1) * render_kwargs['bg'])
ret_dict.update({
    'alphainv_last': alphainv_last,
    'weights': weights,
    'rgb_marched': rgb_marched,
    'raw_alpha': alpha,
    'raw_rgb': rgb,
    'ray_id': ray_id,
})
```
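`segment_coo` comes from torch_scatter; with `reduce='sum'` it simply scatter-adds each sample's contribution into its ray. For illustration, an equivalent in plain PyTorch (the `_ref` name is mine):

```python
# Equivalent to the segment_coo(..., reduce='sum') call above;
# segment_coo is just faster because ray_id is sorted.
rgb_marched_ref = torch.zeros([N, 3])
rgb_marched_ref.index_add_(0, ray_id, weights.unsqueeze(-1) * rgb)
rgb_marched_ref += alphainv_last.unsqueeze(-1) * render_kwargs['bg']
```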
Setting | PSNR | SSIM | LPIPS |
---|---|---|---|
Default | 35.1887 | 0.9742 | 0.0227 |
Setting 2 | 35.1811 | 0.9742 | 0.0223 |
The tweaked hyper-parameters (Setting 2) were:

```python
N_iters=10000,       # number of optimization steps
N_rand=12000,        # batch size (number of random rays per optimization step)
lrate_rgbnet=1e-3,   # lr of the mlp to predict view-dependent color
lrate_decay=100,     # lr decay by 0.1 after every lrate_decay*1000 steps
```
After experimenting with the sampling settings, the number of iterations, and other hyper-parameters, the results show that the differences are minor.
For self-supervised learning, training mainly uses the repo provided by the TAs (BYOL):
```python
byol = models.BYOL(
    resnet,
    image_size = 128,
    hidden_layer = 'avgpool'
)
```
The augmentations are the defaults that come with the repo: random color jitter (brightness/contrast), random grayscale, random horizontal flip, Gaussian blur, and random resized crop, followed by normalization:
```python
torch.nn.Sequential(
    RandomApply(
        T.ColorJitter(0.8, 0.8, 0.8, 0.2),
        p = 0.3
    ),
    T.RandomGrayscale(p=0.2),
    T.RandomHorizontalFlip(),
    RandomApply(
        T.GaussianBlur((3, 3), (1.0, 2.0)),
        p = 0.2
    ),
    T.RandomResizedCrop((image_size, image_size)),
    T.Normalize(
        mean=torch.tensor([0.485, 0.456, 0.406]),
        std=torch.tensor([0.229, 0.224, 0.225])),
)
```
Training runs for 160 epochs with CrossEntropyLoss and the Adam optimizer:
```python
# Initial learning rate = 0.001
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(byol.parameters(), lr=learning_rate)
scheduler = MultiStepLR(optimizer, milestones=[100], gamma=0.1)
```
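For context, a minimal sketch of how one pre-training epoch can be driven with this setup, assuming the TA-provided wrapper follows the byol-pytorch interface (calling the model on a batch returns the BYOL loss, and `update_moving_average()` refreshes the target network); `unlabeled_loader` and `num_epochs` are names I introduce for illustration:

```python
for epoch in range(num_epochs):
    for images, _ in unlabeled_loader:      # labels are ignored during SSL
        loss = byol(images.to(device))      # BYOL computes its own loss on two augmented views
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        byol.update_moving_average()        # EMA update of the target encoder
    scheduler.step()                        # MultiStepLR decay at epoch 100
```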
For the ResNet, the last fc layer is replaced first, and then the feature layers are frozen or not depending on the setting:
```python
utils.load_checkpoint(args.backbone, resnet)
resnet.fc = nn.Linear(resnet.fc.in_features, num_classes)

if freeze_feature:
    # children 0-7 of a torchvision ResNet are conv1/bn1/relu/maxpool and layer1-4,
    # so this freezes the whole backbone and leaves avgpool/fc trainable
    for layer_idx, layer in enumerate(resnet.children()):
        if layer_idx < 8:
            for param in layer.parameters():
                param.requires_grad = False

model = resnet.to(device)
```
Fine-tuning likewise runs for 160 epochs:
```python
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
if not freeze_feature:
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
else:
    optimizer = torch.optim.Adam(
        [p for p in model.parameters() if p.requires_grad], lr=learning_rate)
# optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=3.0517578125e-05, momentum=0.875)
scheduler = MultiStepLR(optimizer, milestones=[35, 100], gamma=0.1)
```
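A minimal sketch of the fine-tuning loop these pieces plug into, assuming a labeled DataLoader named `train_loader` and an epoch count `num_epochs` (both names are mine, not from the report):

```python
for epoch in range(num_epochs):
    model.train()
    for images, labels in train_loader:
        logits = model(images.to(device))            # ResNet-50 with the new fc head
        loss = criterion(logits, labels.to(device))  # cross entropy with label smoothing
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                 # lr decay at epochs 35 and 100
```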
As for the mapping between labels and class names, I store it as a dictionary in label_dict.json so that the mapping stays identical across runs:
```json
{
    "Couch": 0, "Helmet": 1, "Refrigerator": 2, ...
}
```
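A small sketch of how such a mapping can be built once and then reloaded across runs (`class_names` and the file path are illustrative assumptions):

```python
import json
import os

if os.path.exists('label_dict.json'):
    with open('label_dict.json') as f:
        label_dict = json.load(f)   # reuse the same class-name -> label mapping
else:
    label_dict = {name: idx for idx, name in enumerate(sorted(class_names))}
    with open('label_dict.json', 'w') as f:
        json.dump(label_dict, f, indent=2)
```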
Setting | Pre-training (Mini-ImageNet) | Fine-tuning (Office-Home dataset) | Validation accuracy (Office-Home dataset) |
---|---|---|---|
A | - | Train full model (backbone + classifier) | 0.2783 |
B | w/ label (TAs have provided this backbone) | Train full model (backbone + classifier) | 0.38 |
C | w/o label (Your SSL pre-trained backbone) | Train full model (backbone + classifier) | 0.4778 |
D | w/ label (TAs have provided this backbone) | Fix the backbone. Train classifier only | 0.3251 |
E | w/o label (Your SSL pre-trained backbone) | Fix the backbone. Train classifier only | 0.3695 |
From the results, the best setting is, unsurprisingly, pre-training the feature extractor with SSL and then fine-tuning all layers on the other dataset (Setting C). The worst is skipping SSL entirely and training the whole ResNet-50 directly (Setting A).