DLCV HW4

tags: Course

湯濬澤
NTUST_M11015117
HackMD Link: https://hackmd.io/@RTon/H1YkYXjUj

Problem 1 - 3D Novel View Synthesis

1. Explain

a. the NeRF idea in your own words

NeRF uses a neural network to learn the mapping from viewing position/direction to color and density, representing a 3D scene as a radiance field. In principle, given only a set of multi-view photos of an object, it can synthesize all of the in-between views, and the learned field can even be converted into a point cloud.
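In formula form, NeRF learns a function $F_\Theta(\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)$ mapping a 3D point and viewing direction to color and density, and renders a pixel with the standard volume-rendering quadrature over the samples along its camera ray:

$$
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
\qquad
T_i = \exp\Big(-\sum_{j<i} \sigma_j \delta_j\Big)
$$

where $\delta_i$ is the distance between adjacent samples along the ray.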

b. which part of NeRF do you think is the most important

Positional encoding is a key technique behind NeRF's quality. It expands the inputs into sinusoidal (Fourier) features at multiple frequencies, which lets the network pick up high-frequency information quickly, so NeRF performs much better on high-frequency content such as text and fine geometric detail.
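Concretely, every input coordinate $p$ is expanded into sinusoids at exponentially growing frequencies before being fed to the MLP:

$$
\gamma(p) = \big(\sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\big)
$$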

c. compare NeRF’s pros/cons w.r.t. other novel view synthesis work

Compared with traditional methods, NeRF renders scenes with much better realism and lighting: because training learns the relationship between the radiance field and the viewing direction, it captures view-dependent effects close to what ray tracing produces, so the results look more realistic. Its drawbacks are that it is hard to train, training takes a long time, and the camera poses must first be recovered with Structure from Motion.

2. Describe the implementation details of Direct Voxel Grid Optimization (DVGO) & explain DVGO’s method in your own ways.

DVGO greatly reduces NeRF's long training time. It replaces NeRF's MLP-queried volume density with a dense voxel grid, which speeds up the computation. Compared with other voxel-grid approaches, DVGO further uses post-activation, which keeps surfaces sharp while needing far fewer voxels. For color, it follows NeRF and uses an MLP.
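A rough sketch of the post-activation idea (the function and tensor names below are illustrative, not DVGO's actual code): the raw density is first trilinearly interpolated from the voxel grid, and only afterwards passed through softplus and converted to alpha, which lets even a coarse grid produce a sharp occupancy boundary inside a single voxel.

import torch
import torch.nn.functional as F

def post_activated_alpha(density_grid, query_pts, interval, shift=0.0):
    # density_grid: (1, 1, D, H, W) raw densities; query_pts: (M, 3) normalized to [-1, 1]
    # 1) interpolate the *raw* density (trilinear lookup via grid_sample)
    d = F.grid_sample(density_grid, query_pts.view(1, -1, 1, 1, 3),
                      mode='bilinear', align_corners=True).view(-1)
    # 2) post-activation: softplus and the alpha conversion happen after interpolation
    sigma = F.softplus(d + shift)
    return 1.0 - torch.exp(-sigma * interval)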

For the implementation, training starts by computing the scene bounding box from the camera information:

xyz_min_coarse, xyz_max_coarse = compute_bbox_by_cam_frustrm(args=args, cfg=cfg, **data_dict)

Then a DVGO model is created:

model = dvgo.DirectVoxGO(
	xyz_min=xyz_min, xyz_max=xyz_max,
	num_voxels=num_voxels,
	mask_cache_path=coarse_ckpt_path,
	**model_kwargs
)

Rays are sampled at random:

sel_b = torch.randint(rgb_tr.shape[0], [cfg_train.N_rand])
sel_r = torch.randint(rgb_tr.shape[1], [cfg_train.N_rand])
sel_c = torch.randint(rgb_tr.shape[2], [cfg_train.N_rand])
target = rgb_tr[sel_b, sel_r, sel_c]
rays_o = rays_o_tr[sel_b, sel_r, sel_c]
rays_d = rays_d_tr[sel_b, sel_r, sel_c]
viewdirs = viewdirs_tr[sel_b, sel_r, sel_c]

They are fed into the model for training:

# volume rendering
render_result = model(
	rays_o, rays_d, viewdirs,
	global_step=global_step, is_train=True,
	**render_kwargs
)

# gradient descent step
optimizer.zero_grad(set_to_none=True)
loss = cfg_train.weight_main * F.mse_loss(render_result['rgb_marched'], target)
...
loss.backward()
optimizer.step()

In the model's forward pass, points are first sampled along each ray; the densities are then converted into alpha values via post-activation, and finally the accumulated transmittance is computed to obtain the per-sample weights:

ray_pts, ray_id, step_id = self.sample_ray(
    rays_o=rays_o, rays_d=rays_d, **render_kwargs)
...
density = self.density(ray_pts)
alpha = self.activate_density(density, interval)
...
weights, alphainv_last = Alphas2Weights.apply(alpha, ray_id, N)
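Alphas2Weights implements the usual conversion from per-sample alpha to compositing weights via the accumulated transmittance (alphainv_last is the transmittance left after the last sample, used later to blend in the background):

$$
T_i = \prod_{j<i} (1 - \alpha_j), \qquad w_i = T_i\,\alpha_i
$$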

Next, the color is computed:

if self.rgbnet is None:
    # no view-dependent effect
    rgb = torch.sigmoid(k0)
else:
    # view-dependent color emission
    if self.rgbnet_direct:
        k0_view = k0
    else:
        k0_view = k0[:, 3:]
        k0_diffuse = k0[:, :3]
    viewdirs_emb = (viewdirs.unsqueeze(-1) * self.viewfreq).flatten(-2)
    viewdirs_emb = torch.cat([viewdirs, viewdirs_emb.sin(), viewdirs_emb.cos()], -1)
    viewdirs_emb = viewdirs_emb.flatten(0,-2)[ray_id]
    rgb_feat = torch.cat([k0_view, viewdirs_emb], -1)
    rgb_logit = self.rgbnet(rgb_feat)
    if self.rgbnet_direct:
        rgb = torch.sigmoid(rgb_logit)
    else:
        rgb = torch.sigmoid(rgb_logit + k0_diffuse)

Finally, ray marching accumulates the colors along each ray:

rgb_marched = segment_coo(
        src=(weights.unsqueeze(-1) * rgb),
        index=ray_id,
        out=torch.zeros([N, 3]),
        reduce='sum')
rgb_marched += (alphainv_last.unsqueeze(-1) * render_kwargs['bg'])
ret_dict.update({
    'alphainv_last': alphainv_last,
    'weights': weights,
    'rgb_marched': rgb_marched,
    'raw_alpha': alpha,
    'raw_rgb': rgb,
    'ray_id': ray_id,
})
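That is, the final pixel color is the weight-summed sample colors plus the leftover transmittance times the background color:

$$
\hat{C} = \sum_i w_i\,\mathbf{c}_i + \Big(1 - \sum_i w_i\Big)\,\mathbf{c}_{bg}
$$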

3. Performance

| Setting   | PSNR    | SSIM   | LPIPS  |
|-----------|---------|--------|--------|
| Default   | 35.1887 | 0.9742 | 0.0227 |
| Setting 2 | 35.1811 | 0.9742 | 0.0223 |

Setting 2

N_iters=10000,           # number of optimization steps
N_rand=12000,            # batch size (number of random rays per optimization step)
lrate_rgbnet=1e-3,       # lr of the mlp to predict view-dependent color
lrate_decay=100,         # lr decay by 0.1 after every lrate_decay*1000 steps

After experimenting with parameters such as the sampling settings and the number of iterations, the results show that the differences are small.

Problem 2 - Self-Supervised Pre-training for Image Classification

1. Describe the implementation details of your SSL method

Self-supervised pre-training is done mainly with the repo provided by the TAs (BYOL):

byol = models.BYOL(
    resnet,
    image_size = 128,
    hidden_layer = 'avgpool'
)

The augmentations are the defaults that come with the repo, e.g. random color jitter (contrast/brightness), random grayscale, random horizontal flip, and Gaussian blur:

torch.nn.Sequential(
    RandomApply(
        T.ColorJitter(0.8, 0.8, 0.8, 0.2),
        p = 0.3
    ),
    T.RandomGrayscale(p=0.2),
    T.RandomHorizontalFlip(),
    RandomApply(
        T.GaussianBlur((3, 3), (1.0, 2.0)),
        p = 0.2
    ),
    T.RandomResizedCrop((image_size, image_size)),
    T.Normalize(
        mean=torch.tensor([0.485, 0.456, 0.406]),
        std=torch.tensor([0.229, 0.224, 0.225])),
)

Training runs for 160 epochs with CrossEntropyLoss and the Adam optimizer:

# Initial learning rate = 0.001
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(byol.parameters(), lr=learning_rate)
scheduler = MultiStepLR(optimizer, milestones=[100], gamma=0.1)
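A minimal sketch of the pre-training loop, assuming the byol-pytorch-style API of the TA repo (the unlabeled Mini-ImageNet loader unlabeled_loader is a placeholder name):

for epoch in range(160):
    for images, _ in unlabeled_loader:   # labels are ignored during SSL
        images = images.to(device)
        loss = byol(images)              # BYOL computes its own similarity loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        byol.update_moving_average()     # EMA update of the target network
    scheduler.step()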

For the resnet, the final fc layer is replaced first, and the feature layers are frozen or not depending on the setting:

utils.load_checkpoint(args.backbone, resnet)
resnet.fc = nn.Linear(resnet.fc.in_features, num_classes)
if freeze_feature:
    for layer_idx, layer in enumerate(resnet.children()):
        if layer_idx < 8:
            for param in layer.parameters():
                param.requires_grad = False
        else:
            pass
model = resnet.to(device)

Fine-tuning likewise runs for 160 epochs:

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
if not freeze_feature:
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
else:
    optimizer = torch.optim.Adam(
        [p for p in model.parameters() if p.requires_grad], lr=learning_rate)
    # optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=3.0517578125e-05, momentum=0.875)
scheduler = MultiStepLR(optimizer, milestones=[35, 100], gamma=0.1)
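The fine-tuning loop itself is a standard supervised loop; a minimal sketch, assuming an (image, label) train_loader:

for epoch in range(160):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        loss = criterion(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()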

As for the mapping between labels and class names, I store it as a dictionary in label_dict.json so that the mapping stays the same on every run:

{
    "Couch": 0, "Helmet": 1, "Refrigerator": 2, ...
}
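A small sketch of how such a mapping can be built and saved (the sorting and the build_label_dict helper are illustrative assumptions, not necessarily how my script does it):

import json

def build_label_dict(class_names, path='label_dict.json'):
    # sort so the label assignment is deterministic across runs
    label_dict = {name: idx for idx, name in enumerate(sorted(class_names))}
    with open(path, 'w') as f:
        json.dump(label_dict, f, indent=4)
    return label_dict

def load_label_dict(path='label_dict.json'):
    with open(path) as f:
        return json.load(f)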

2. Please conduct image classification on the Office-Home dataset as the downstream task. Also, please complete the following table, which contains different image classification settings, and discuss/analyze the results.

| Setting | Pre-training (Mini-ImageNet) | Fine-tuning (Office-Home dataset) | Validation accuracy (Office-Home dataset) |
|---------|------------------------------|-----------------------------------|-------------------------------------------|
| A | - | Train full model (backbone + classifier) | 0.2783 |
| B | w/ label (TAs have provided this backbone) | Train full model (backbone + classifier) | 0.38 |
| C | w/o label (Your SSL pre-trained backbone) | Train full model (backbone + classifier) | 0.4778 |
| D | w/ label (TAs have provided this backbone) | Fix the backbone. Train classifier only | 0.3251 |
| E | w/o label (Your SSL pre-trained backbone) | Fix the backbone. Train classifier only | 0.3695 |

From the results, the best setting is, unsurprisingly, pre-training the feature extractor with SSL and then fine-tuning all layers on the other dataset (setting C). The worst is skipping SSL and training the entire ResNet50 directly (setting A).