DLCV HW4

tags: Course

湯濬澤
NTUST_M11015117
HackMD Link: https://hackmd.io/@RTon/H1YkYXjUj

Problem 1 - 3D Novel View Synthesis

1. Explain

a. the NeRF idea in your own words

NeRF uses a neural network to learn the mapping from viewing position/direction to color and density, representing a 3D scene as a radiance field. In principle, given only a set of multi-view photos of an object, it can synthesize all of the in-between views, and the learned field can even be converted into a point cloud.
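In formula form, NeRF learns a function $F_\Theta(\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)$ mapping a 3D point and viewing direction to color and density, and renders a pixel with the standard volume-rendering quadrature over the samples along its camera ray:

$$
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
\qquad
T_i = \exp\Big(-\sum_{j<i} \sigma_j \delta_j\Big)
$$

where $\delta_i$ is the distance between adjacent samples along the ray.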

b. which part of NeRF do you think is the most important

Positional encoding is a key technique behind NeRF's quality. It expands the inputs into sinusoidal (Fourier) features at multiple frequencies, which lets the network pick up high-frequency information quickly, so NeRF performs much better on high-frequency content such as text and fine geometric detail.
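Concretely, every input coordinate $p$ is expanded into sinusoids at exponentially growing frequencies before being fed to the MLP:

$$
\gamma(p) = \big(\sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\big)
$$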

c. compare NeRF’s pros/cons w.r.t. other novel view synthesis work

Compared with traditional methods, NeRF renders scenes with much better realism and lighting: because training learns the relationship between the radiance field and the viewing direction, it captures view-dependent effects close to what ray tracing produces, so the results look more realistic. Its drawbacks are that it is hard to train, training takes a long time, and the camera poses must first be recovered with Structure from Motion.

2. Describe the implementation details of Direct Voxel Grid Optimization (DVGO) & explain DVGO’s method in your own ways.

DVGO greatly reduces NeRF's long training time. It replaces NeRF's MLP-queried volume density with a dense voxel grid, which speeds up the computation. Compared with other voxel-grid approaches, DVGO further uses post-activation, which keeps surfaces sharp while needing far fewer voxels. For color, it follows NeRF and uses an MLP.
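A rough sketch of the post-activation idea (the function and tensor names below are illustrative, not DVGO's actual code): the raw density is first trilinearly interpolated from the voxel grid, and only afterwards passed through softplus and converted to alpha, which lets even a coarse grid produce a sharp occupancy boundary inside a single voxel.

import torch
import torch.nn.functional as F

def post_activated_alpha(density_grid, query_pts, interval, shift=0.0):
    # density_grid: (1, 1, D, H, W) raw densities; query_pts: (M, 3) normalized to [-1, 1]
    # 1) interpolate the *raw* density (trilinear lookup via grid_sample)
    d = F.grid_sample(density_grid, query_pts.view(1, -1, 1, 1, 3),
                      mode='bilinear', align_corners=True).view(-1)
    # 2) post-activation: softplus and the alpha conversion happen after interpolation
    sigma = F.softplus(d + shift)
    return 1.0 - torch.exp(-sigma * interval)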

For the implementation, training starts by computing the scene bounding box from the camera information:

xyz_min_coarse, xyz_max_coarse = compute_bbox_by_cam_frustrm(args=args, cfg=cfg, **data_dict)

Then a DVGO model is created:

model = dvgo.DirectVoxGO(
	xyz_min=xyz_min, xyz_max=xyz_max,
	num_voxels=num_voxels,
	mask_cache_path=coarse_ckpt_path,
	**model_kwargs
)

Rays are sampled at random:

sel_b = torch.randint(rgb_tr.shape[0], [cfg_train.N_rand])
sel_r = torch.randint(rgb_tr.shape[1], [cfg_train.N_rand])
sel_c = torch.randint(rgb_tr.shape[2], [cfg_train.N_rand])
target = rgb_tr[sel_b, sel_r, sel_c]
rays_o = rays_o_tr[sel_b, sel_r, sel_c]
rays_d = rays_d_tr[sel_b, sel_r, sel_c]
viewdirs = viewdirs_tr[sel_b, sel_r, sel_c]

They are fed into the model for training:

# volume rendering
render_result = model(
	rays_o, rays_d, viewdirs,
	global_step=global_step, is_train=True,
	**render_kwargs
)

# gradient descent step
optimizer.zero_grad(set_to_none=True)
loss = cfg_train.weight_main * F.mse_loss(render_result['rgb_marched'], target)
...
loss.backward()
optimizer.step()

In the model's forward pass, points are first sampled along each ray; the densities are then converted into alpha values via post-activation, and finally the accumulated transmittance is computed to obtain the per-sample weights:

ray_pts, ray_id, step_id = self.sample_ray(
    rays_o=rays_o, rays_d=rays_d, **render_kwargs)
...
density = self.density(ray_pts)
alpha = self.activate_density(density, interval)
...
weights, alphainv_last = Alphas2Weights.apply(alpha, ray_id, N)
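Alphas2Weights implements the usual conversion from per-sample alpha to compositing weights via the accumulated transmittance (alphainv_last is the transmittance left after the last sample, used later to blend in the background):

$$
T_i = \prod_{j<i} (1 - \alpha_j), \qquad w_i = T_i\,\alpha_i
$$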

Next, the color is computed:

if self.rgbnet is None:
    # no view-dependent effect
    rgb = torch.sigmoid(k0)
else:
    # view-dependent color emission
    if self.rgbnet_direct:
        k0_view = k0
    else:
        k0_view = k0[:, 3:]
        k0_diffuse = k0[:, :3]
    viewdirs_emb = (viewdirs.unsqueeze(-1) * self.viewfreq).flatten(-2)
    viewdirs_emb = torch.cat([viewdirs, viewdirs_emb.sin(), viewdirs_emb.cos()], -1)
    viewdirs_emb = viewdirs_emb.flatten(0,-2)[ray_id]
    rgb_feat = torch.cat([k0_view, viewdirs_emb], -1)
    rgb_logit = self.rgbnet(rgb_feat)
    if self.rgbnet_direct:
        rgb = torch.sigmoid(rgb_logit)
    else:
        rgb = torch.sigmoid(rgb_logit + k0_diffuse)

Finally, ray marching accumulates the colors along each ray:

rgb_marched = segment_coo(
        src=(weights.unsqueeze(-1) * rgb),
        index=ray_id,
        out=torch.zeros([N, 3]),
        reduce='sum')
rgb_marched += (alphainv_last.unsqueeze(-1) * render_kwargs['bg'])
ret_dict.update({
    'alphainv_last': alphainv_last,
    'weights': weights,
    'rgb_marched': rgb_marched,
    'raw_alpha': alpha,
    'raw_rgb': rgb,
    'ray_id': ray_id,
})
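That is, the final pixel color is the weight-summed sample colors plus the leftover transmittance times the background color:

$$
\hat{C} = \sum_i w_i\,\mathbf{c}_i + \Big(1 - \sum_i w_i\Big)\,\mathbf{c}_{bg}
$$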

3. Performance

| Setting   | PSNR    | SSIM   | LPIPS  |
|-----------|---------|--------|--------|
| Default   | 35.1887 | 0.9742 | 0.0227 |
| Setting 2 | 35.1811 | 0.9742 | 0.0223 |

Setting 2

N_iters=10000,           # number of optimization steps
N_rand=12000,            # batch size (number of random rays per optimization step)
lrate_rgbnet=1e-3,       # lr of the mlp to predict view-dependent color
lrate_decay=100,         # lr decay by 0.1 after every lrate_decay*1000 steps

After experimenting with parameters such as the sampling settings and the number of iterations, the results show that the differences are small.

Problem 2 - Self-Supervised Pre-training for Image Classification

1. Describe the implementation details of your SSL method

Self-supervised pre-training is done mainly with the repo provided by the TAs (BYOL):

byol = models.BYOL(
    resnet,
    image_size = 128,
    hidden_layer = 'avgpool'
)

The augmentations are the defaults that come with the repo, e.g. random color jitter (contrast/brightness), random grayscale, random horizontal flip, and Gaussian blur:

torch.nn.Sequential(
    RandomApply(
        T.ColorJitter(0.8, 0.8, 0.8, 0.2),
        p = 0.3
    ),
    T.RandomGrayscale(p=0.2),
    T.RandomHorizontalFlip(),
    RandomApply(
        T.GaussianBlur((3, 3), (1.0, 2.0)),
        p = 0.2
    ),
    T.RandomResizedCrop((image_size, image_size)),
    T.Normalize(
        mean=torch.tensor([0.485, 0.456, 0.406]),
        std=torch.tensor([0.229, 0.224, 0.225])),
)

Training runs for 160 epochs with CrossEntropyLoss and the Adam optimizer:

# Initial learning rate = 0.001
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(byol.parameters(), lr=learning_rate)
scheduler = MultiStepLR(optimizer, milestones=[100], gamma=0.1)
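A minimal sketch of the pre-training loop, assuming the byol-pytorch-style API of the TA repo (the unlabeled Mini-ImageNet loader unlabeled_loader is a placeholder name):

for epoch in range(160):
    for images, _ in unlabeled_loader:   # labels are ignored during SSL
        images = images.to(device)
        loss = byol(images)              # BYOL computes its own similarity loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        byol.update_moving_average()     # EMA update of the target network
    scheduler.step()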

For the resnet, the final fc layer is replaced first, and the feature layers are frozen or not depending on the setting:

utils.load_checkpoint(args.backbone, resnet)
resnet.fc = nn.Linear(resnet.fc.in_features, num_classes)
if freeze_feature:
    for layer_idx, layer in enumerate(resnet.children()):
        if layer_idx < 8:
            for param in layer.parameters():
                param.requires_grad = False
        else:
            pass
model = resnet.to(device)

Fine-tuning likewise runs for 160 epochs:

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
if not freeze_feature:
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
else:
    optimizer = torch.optim.Adam(
        [p for p in model.parameters() if p.requires_grad], lr=learning_rate)
    # optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=3.0517578125e-05, momentum=0.875)
scheduler = MultiStepLR(optimizer, milestones=[35, 100], gamma=0.1)
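The fine-tuning loop itself is a standard supervised loop; a minimal sketch, assuming an (image, label) train_loader:

for epoch in range(160):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        loss = criterion(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()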

As for the mapping between labels and class names, I store it as a dictionary in label_dict.json so that the mapping stays the same on every run:

{
    "Couch": 0, "Helmet": 1, "Refrigerator": 2, ...
}
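A small sketch of how such a mapping can be built and saved (the sorting and the build_label_dict helper are illustrative assumptions, not necessarily how my script does it):

import json

def build_label_dict(class_names, path='label_dict.json'):
    # sort so the label assignment is deterministic across runs
    label_dict = {name: idx for idx, name in enumerate(sorted(class_names))}
    with open(path, 'w') as f:
        json.dump(label_dict, f, indent=4)
    return label_dict

def load_label_dict(path='label_dict.json'):
    with open(path) as f:
        return json.load(f)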

2. Please conduct image classification on the Office-Home dataset as the downstream task. Also, please complete the following table, which contains different image classification settings, and discuss/analyze the results.

| Setting | Pre-training (Mini-ImageNet) | Fine-tuning (Office-Home dataset) | Validation accuracy (Office-Home dataset) |
|---------|------------------------------|-----------------------------------|-------------------------------------------|
| A | - | Train full model (backbone + classifier) | 0.2783 |
| B | w/ label (TAs have provided this backbone) | Train full model (backbone + classifier) | 0.38 |
| C | w/o label (Your SSL pre-trained backbone) | Train full model (backbone + classifier) | 0.4778 |
| D | w/ label (TAs have provided this backbone) | Fix the backbone. Train classifier only | 0.3251 |
| E | w/o label (Your SSL pre-trained backbone) | Fix the backbone. Train classifier only | 0.3695 |

From the results, the best setting is, unsurprisingly, pre-training the feature extractor with SSL and then fine-tuning all layers on the other dataset (setting C). The worst is skipping SSL and training the entire ResNet50 directly (setting A).