# Applied Computer Vision - 鄭文皇 (2022 Fall)
###### tags: `NYCU-2022-Fall`
## Class info.
[Course information](https://timetable.nycu.edu.tw/?r=main/crsoutline&Acy=111&Sem=1&CrsNo=535232&lang=zh-tw)
1. Learn the concepts and theories of Computer Vision (CV) and how they can be applied in practice to solve real-world problems.
2. Also cover the latest topics in current CV literature, such as self-supervised learning for CV applications.
Grading: homework 50%, midterm presentation 20%, final exam 30%
With invited speakers, adjusted to: homework 40%, midterm presentation 20%, final exam 30%, talk attendance 10%
<style>
.red{
color: red;
}
.blue{
color: #87ceeb;
}
</style>
## Date
### 9/12
Computer Vision:
feature engineering + model learning $\rightarrow$ deep learning
feature engineering: f = $f(I)$
model learning: y = $g(f,\theta)$
deep learning: y = $g(I,\theta)$
* Feature Detector
A subsystem of the visual system that detects the presence or absence of certain features in a visual scene.
Image data from the real world often display complex structure.
**In general, computer vision does not work. (except in certain cases)**
* Intra-class Variability
Images of the same class can look very different across photos.
### 9/19
* intensity: the brightness of a color, $\frac{R+G+B}{3}$
<table>
<tr>
<td>
<img src="https://i.imgur.com/S7fi26d.png" alt="drawing" width="400"/>
</td>
<td>
<img src="https://i.imgur.com/qZsEGsQ.png" alt="drawing" width="400"/>
</td>
</tr>
</table>
In comparison to global features, local features are more robust to occlusion and clutter.
* Properties of Ideal Local Feature
1. Repeatability
2. Distinctiveness / Informativeness (when the local structure changes, the feature should change as well)
3. Locality
4. Quantity
5. Accuracy
6. Efficiency
* [Sobel operator](https://zh.m.wikipedia.org/zh-tw/%E7%B4%A2%E8%B2%9D%E7%88%BE%E7%AE%97%E5%AD%90)
Before designing an edge detector:
1. Use derivatives (in the x and y directions) to locate points with high gradient
2. Smooth the image to reduce noise before taking derivatives
* Edge Detector in 1D & 2D
<table>
<tr>
<td>
<img src="https://i.imgur.com/QfolsiX.png" alt="drawing" width="500"/>
</td>
<td>
<img src="https://i.imgur.com/mvY53yA.png" alt="drawing" width="500"/>
</td>
</tr>
</table>
* [Convolution](https://iter01.com/480243.html)
Flip $g$, then slide it across $f$ by the shift $\tau$.
Continuous form: $(f*g)(n)=\int^{\infty}_{-\infty}f(\tau)g(n-\tau)d\tau$
Discrete form: $(f*g)(n)=\sum_{\tau=-\infty}^{\infty}f(\tau)g(n-\tau)$
<br>
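As a quick sanity check on the discrete formula, a minimal NumPy sketch (the signal and kernel are made-up toy values):
```python
import numpy as np

f = np.array([0., 1., 2., 3., 4.])   # toy signal
g = np.array([1., 0., -1.])          # toy kernel (a crude derivative filter)

# np.convolve flips g and slides it across f, exactly the discrete sum above
print(np.convolve(f, g, mode='full'))   # length len(f) + len(g) - 1
print(np.convolve(f, g, mode='same'))   # trimmed to len(f)
```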
* Canny Edge Detection [實作文章](https://medium.com/@pomelyu5199/canny-edge-detector-%E5%AF%A6%E4%BD%9C-opencv-f7d1a0a57d19)
A large $\sigma$ detects large-scale edges; a small $\sigma$ detects fine features.
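A minimal OpenCV sketch of that $\sigma$ trade-off (the input path and thresholds are placeholders):
```python
import cv2

img = cv2.imread('input.jpg', cv2.IMREAD_GRAYSCALE)   # placeholder image path

# sigma enters through the Gaussian smoothing applied before Canny:
# large sigma -> only large-scale edges survive; small sigma -> fine detail is kept
coarse = cv2.Canny(cv2.GaussianBlur(img, (0, 0), sigmaX=3.0), 50, 150)
fine = cv2.Canny(cv2.GaussianBlur(img, (0, 0), sigmaX=1.0), 50, 150)
```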
* Image Gradient
![](https://i.imgur.com/AloAd9o.png)
Magnitude: $\| \nabla f \|=\sqrt{(\frac{\partial f}{\partial x})^2 + (\frac{\partial f}{\partial y})^2}$
Direction: $\theta = \tan^{-1}(\frac{\partial f}{\partial y} / \frac{\partial f}{\partial x})$
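A minimal sketch of computing the gradient magnitude and direction from Sobel derivatives (the input path is a placeholder):
```python
import cv2
import numpy as np

img = cv2.imread('input.jpg', cv2.IMREAD_GRAYSCALE).astype(np.float32)

gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)   # df/dx
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)   # df/dy

magnitude = np.sqrt(gx ** 2 + gy ** 2)
direction = np.arctan2(gy, gx)                   # radians, matching theta above
```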
* Harris Corner Detector [實作文章](https://www.796t.com/p/1343014.html)
Invariant to large **rotation** and **translation**, but ==not invariant to image scale==; it does not tell us the scale of the corner.
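A minimal OpenCV Harris sketch (the image path, parameters, and threshold are common placeholder values, not tuned ones):
```python
import cv2
import numpy as np

gray = cv2.imread('input.jpg', cv2.IMREAD_GRAYSCALE).astype(np.float32)

# blockSize: neighbourhood size, ksize: Sobel aperture, k: Harris constant
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)
corners = response > 0.01 * response.max()   # boolean corner mask from the response map
```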
**Each zero crossing corresponds to an edge.**
<img src="https://i.imgur.com/xTI0KVQ.png" alt="drawing" width="500"/>
<br><br>
> Impulse response
<img src="https://i.imgur.com/TWWezna.png" alt="drawing" width="500"/>
<br><br>
[Laplace operator, Laplacian](https://zh.wikipedia.org/zh-tw/%E6%8B%89%E6%99%AE%E6%8B%89%E6%96%AF%E7%AE%97%E5%AD%90)
<table>
<tr>
<td>
<img src="https://i.imgur.com/IoHzxPT.png" alt="drawing" width="450"/>
</td>
<td>
<img src="https://i.imgur.com/rows5Hz.png" alt="drawing" width="450"/>
</td>
</tr>
</table>
* SIFT Algorithm
![](https://i.imgur.com/cGS2ocA.png)
![](https://i.imgur.com/0mXhV3H.png)
![](https://i.imgur.com/AFkWfrf.png)
The Gaussian on the right is twice that of the one on the left.
![](https://i.imgur.com/tjBo2gq.png)
### 9/26
:::info
**Appendix-SIFT**
![](https://i.imgur.com/NRwfNo1.png)
:::
* Keypoint Localization
![](https://i.imgur.com/F71YuBx.png)
![](https://i.imgur.com/AzRjEX2.png)
![](https://i.imgur.com/QHaS0DX.png)
* SIFT Descriptor
![](https://i.imgur.com/2GCYlTv.png)
[OpenCV SIFT](https://docs.opencv.org/3.4/da/df5/tutorial_py_sift_intro.html)
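A minimal usage sketch with OpenCV (assuming OpenCV ≥ 4.4, where SIFT lives in the main module; the image path is a placeholder):
```python
import cv2

gray = cv2.imread('input.jpg', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)   # one 128-D descriptor per keypoint
vis = cv2.drawKeypoints(gray, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
```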
* HoG (Histogram of Oriented Gradients)
[HoG](http://alex-phd.blogspot.com/2014/03/hog.html)
<table>
<tr>
<td>
<img src="https://i.imgur.com/8jI7586.png" alt="drawing" width="500"/>
</td>
<td>
<img src="https://i.imgur.com/SRqCNU2.png" alt="drawing" width="500"/>
</td>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/nUgn10j.png" alt="drawing" width="500"/>
</td>
<td>
<img src="https://i.imgur.com/aKZ1Ca7.png" alt="drawing" width="500"/>
</td>
</tr>
<tr>
<td>
<img src="https://i.imgur.com/rjAgLxH.png" alt="drawing" width="500"/>
</td>
</tr>
</table>
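A minimal HoG extraction sketch with scikit-image (parameters follow the common Dalal–Triggs setup; the input is a placeholder array, and a recent scikit-image with the `visualize` keyword is assumed):
```python
import numpy as np
from skimage.feature import hog

img = np.random.rand(128, 64)   # placeholder grayscale detection window

features, hog_image = hog(img,
                          orientations=9,            # 9 orientation bins per cell
                          pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2),    # blocks of 2x2 cells
                          block_norm='L2-Hys',
                          visualize=True)            # also return a visualisation image
```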
* LBP (Local Binary Patterns)
LBP is a non-parametric descriptor whose aim is to efficiently summarize the local structures of images.
![](https://i.imgur.com/T1589eR.png)
![](https://i.imgur.com/bqRLV0u.png)
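A small LBP sketch with scikit-image (placeholder image; 8 neighbours on a circle of radius 1):
```python
import numpy as np
from skimage.feature import local_binary_pattern

img = np.random.randint(0, 256, (64, 64)).astype(np.uint8)   # placeholder patch

# each pixel gets the 8-bit code of "neighbour >= centre" comparisons
lbp = local_binary_pattern(img, P=8, R=1, method='default')
hist, _ = np.histogram(lbp, bins=np.arange(257), density=True)   # 256-bin LBP histogram
```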
* Types of Object Detection
* Detection of specific categories
* Detection of specific instance
![](https://i.imgur.com/UEKJ7qM.png)
---
**Object Classification**
* [Image Classification Architectures review](https://medium.com/@14prakash/image-classification-architectures-review-d8b95075998f)
* ImageNet Dataset
* ImageNet with roughly 1000 images in each of 1000 categories.
* AlexNet
![](https://i.imgur.com/h7q82R7.png)
---
**Semantic Segmentation**
* Sliding Window
![](https://i.imgur.com/wsAJyhd.png)
* Downsampling & upsampling (to reduce the expensive cost of full-resolution convolutions)
![](https://i.imgur.com/A6FzDaX.png)
* [U-Net](https://ithelp.ithome.com.tw/articles/10240314)
---
**<span class="red">There is no universal agreement in the literature on the definitions of various vision subtasks</span>**
* Two Main Categories for Generic Object Detection
![](https://i.imgur.com/uXpPvPG.png)
* [Region Proposals](https://medium.com/curiosity-and-exploration/%E5%8F%96%E5%BE%97-region-proposals-selective-search-%E5%90%AB%E7%A8%8B%E5%BC%8F%E7%A2%BC-be0aa5767901)
* [R-CNN & Fast R-CNN](https://zhuanlan.zhihu.com/p/40986674)
* [[Paper] EDF-SSD: An Improved Feature Fused SSD for Object Detection](https://jackson1998.medium.com/paper-edf-ssd-an-improved-feature-fused-ssd-for-onjection-detection-213c4566745)
### 10/3
Convolution kernel size: height × width × depth
> Reduces the computation along the depth dimension. Not really a 1x1 convolution → it is a 1×1×C convolution
![](https://i.imgur.com/llKweEN.png)
* A Fire module is comprised of:
a squeeze convolution layer (which has only 1x1 filters), feeding into an expand layer that has a mix of 1x1 and 3x3 convolution filters.
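A minimal PyTorch sketch of such a Fire module (channel sizes are illustrative, not the exact SqueezeNet configuration):
```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)          # 1x1xC "squeeze"
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))                         # reduce channel depth first
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)   # mix of 1x1 and 3x3 filters

out = Fire(96, 16, 64, 64)(torch.randn(1, 96, 56, 56))   # -> shape (1, 128, 56, 56)
```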
:::info
Top-1 ImageNet accuracy: only one prediction is allowed, and it must match the label
Top-5 ImageNet accuracy: five predictions are allowed, and the label must be among them
:::
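A small sketch of how Top-1 / Top-5 accuracy are computed from logits (random tensors stand in for real predictions):
```python
import torch

def topk_accuracy(logits, labels, k):
    # logits: (N, num_classes), labels: (N,)
    topk = logits.topk(k, dim=1).indices                  # (N, k) predicted class ids
    correct = (topk == labels.unsqueeze(1)).any(dim=1)    # does the label appear among the top k?
    return correct.float().mean().item()

logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(topk_accuracy(logits, labels, 1), topk_accuracy(logits, labels, 5))
```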
![](https://i.imgur.com/ohuJBu0.png)
> Skip connections not only skip one layer
The advantage of adding this type of **skip connection** is that if any layer hurts the performance of the architecture, it can be skipped by regularization.
So this allows training very deep neural networks without the problems caused by vanishing/exploding gradients.
In conclusion, ResNets are among the most effective neural network architectures, as they help in **maintaining a low error rate much deeper in the network.**
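A minimal PyTorch sketch of an identity skip connection (a basic residual block; dimensions are illustrative):
```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """out = F(x) + x, so gradients can flow through the identity path."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)          # the skip connection

y = BasicBlock(64)(torch.randn(1, 64, 32, 32))
```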
* DenseNet
![](https://i.imgur.com/EJaIlOp.png)
* [Feature Pyramid Networks](https://ivan-eng-murmur.medium.com/%E7%89%A9%E4%BB%B6%E5%81%B5%E6%B8%AC-s8-feature-pyramid-networks-%E7%B0%A1%E4%BB%8B-99b676245b25)
![](https://i.imgur.com/eE1pm5S.png)
* [A Simple yet Effective Approach for Identifying Unexpected Road Obstacles](https://zhuanlan.zhihu.com/p/415220541)
* [Deep Learning for Generic
Object Detection: A Survey](https://www.796t.com/content/1545903385.html)
---
* Conventional two-stage solutions adopt the detect-then-segment approach → **<span class="red">Slow</span>**
* Focus on single-stage instance segmentation
![](https://i.imgur.com/RHemaxC.png)
* Local-mask-based Methods
* Contours with Explicit Encoding
* ExtremeNet (Four extreme points with one center point of objects)
The center point can also be derived from the four extreme directions. (In practice, there may be more than one extreme point along a given direction.)
* PolarMask: It utilizes rays at constant angle intervals from the center to describe the contour.
* FourierNet: a contour shape decoder using Fourier transform
* Compact Mask Encoding
:::info
**Contours with Explicit Encoding**
pros: fast at inference and easy to optimize.
cons: cannot depict the mask precisely and cannot describe objects that have holes in the center.
:::
* Global-mask-based Methods
* YOLACT: attempting real-time instance segmentation
![](https://i.imgur.com/u79wRiw.png)
* BlendMask
### 10/10
National Day holiday, no class.
### 10/17
* Challenge of Long-Tailed Visual Recognition
![](https://i.imgur.com/6MIP3q1.png)
* Loss function
* MSE:$$f^* = \rm{arg} \min_f \mathbb{E}_{x,y \sim p_{data}} \| y - f(x) \|^2$$
* MAE:$$f^* = \rm{arg} \min_f \mathbb{E}_{x,y \sim p_{data}} \| y - f(x) \|_1$$
* [Cross Entropy](https://zh.wikipedia.org/zh-tw/%E4%BA%A4%E5%8F%89%E7%86%B5): $$L = -\frac{1}{m} \sum_{i=1}^m y_i \cdot \ln(\hat{y}_i)$$
![](https://i.imgur.com/1bPlXyM.png)
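A minimal NumPy sketch of the three losses above on a single toy example:
```python
import numpy as np

y = np.array([1.0, 0.0, 0.0])        # one-hot target
y_hat = np.array([0.7, 0.2, 0.1])    # predicted probabilities

mse = np.mean((y - y_hat) ** 2)
mae = np.mean(np.abs(y - y_hat))
cross_entropy = -np.sum(y * np.log(y_hat))   # single-sample form of the CE formula above
print(mse, mae, cross_entropy)
```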
* Solutions in the Literature for Long-Tailed Visual Recognition
* Re-sampling:
* over-sampling (adding repetitive data) for the minority class
* under-sampling (removing data) for the majority class
* Re-weighting: $$L = -\sum^{\mathcal{C}}_{i=1} w_i y_i \log p_i$$
* Class-Balanced Loss
$$
\rm{CB}(\textbf{p}, y) = \frac{1}{E_{n_{y}}} \mathcal{L} (\textbf{p}, y) = \frac{1 - \beta}{1 - \beta^{n_y}} \mathcal{L}(\textbf{p}, y)
$$
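A small sketch of the class-balanced weight $\frac{1-\beta}{1-\beta^{n_y}}$ on made-up long-tailed class counts:
```python
import numpy as np

n_per_class = np.array([5000, 500, 50])   # illustrative head/medium/tail class counts
beta = 0.999

cb_weights = (1.0 - beta) / (1.0 - beta ** n_per_class)
cb_weights = cb_weights / cb_weights.sum() * len(n_per_class)   # optional normalisation
print(cb_weights)   # rare classes get the larger weights
```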
* re-balancing = re-sampling + re-weighting
![](https://i.imgur.com/rfCfgLI.png)
![](https://i.imgur.com/WhUeC18.png)
1. Feature extractor
2. classifier
![](https://i.imgur.com/6ycKe2S.png)
:::info
What is transfer learning?
Transfer learning is about **leveraging feature representations from a pre-trained model**, so you don't have to train a new model from scratch.
The pre-trained models are usually trained on massive datasets that are a standard benchmark in the computer vision frontier.
:::
* Bilateral-Branch Network
![](https://i.imgur.com/KBjJuMy.png)
![](https://i.imgur.com/PsFDarz.png)
---
[Mitigating Dataset Bias (BMVC 2020 Keynote)](https://www.youtube.com/watch?v=HAfB9qvGfMM)
* Dataset bias
<img src="https://i.imgur.com/TdZ8MAf.png" alt="drawing" width="500"/>
![](https://i.imgur.com/Yt1RuYw.png)
* Techniques that help deal with data bias
* Collect labelled data from target domain
* Better backbone CNNs
* Batch Normalization ([Li'17](https://arxiv.org/pdf/1603.04779.pdf), [Chang’19])
* Instance Normalization + Batch Normalization [Nam'19](https://proceedings.neurips.cc/paper/2018/file/018b59ce1fd616d874afad0f44ba338d-Paper.pdf)
* Data Augmentation, Mix Match [Berthelot'19](https://arxiv.org/pdf/1905.02249.pdf)
* Semi-supervised methods, such as Pseudo labeling [Zou’19](https://arxiv.org/pdf/1908.09822.pdf)
* Domain Adaptation (this talk)
* ==**Adversarial domain alignment**==
* Feature space
* Pixel space
![](https://i.imgur.com/68isLg1.png)
### 10/24
* Pixel-space alignment
![](https://i.imgur.com/eyK0Lsh.png)
* Few-shot domain translation
Lots of unlabeled target data, but only have 1-5 images of the target domain
![](https://i.imgur.com/hiNOFIZ.png)
* Disentangled features
![](https://i.imgur.com/1IrR44G.png)
![](https://i.imgur.com/UZdbch0.png)
![](https://i.imgur.com/2v2I57M.png)
* Weak Scene-level Alignment
![](https://i.imgur.com/kfVemRy.png)
![](https://i.imgur.com/ysbuiVR.png)
* Alignment that respects class boundaries
![](https://i.imgur.com/tiRXwYL.png)
* Category Shift
When categories aren't the same in source and target
![](https://i.imgur.com/1txrbNc.png)
![](https://i.imgur.com/Vx9AqJF.png)
---
![](https://i.imgur.com/Ky27di6.png)
* Recognition of Static Pose
* Recognition of Dynamic Pose
* Pose Model
![](https://i.imgur.com/avxJUjP.png)
* Inverse Kinematics
![](https://i.imgur.com/2sV5fKb.png)
* Exploiting Temporal Dependence
![](https://i.imgur.com/WWJiVDJ.png)
### 10/31
* Recurrent Neural Networks (RNN)
![](https://i.imgur.com/ZzB1b9S.png)
![](https://i.imgur.com/rhXx7IQ.png)
![](https://i.imgur.com/aFAnoXo.png)
* RNN cell
![](https://i.imgur.com/zWRW9vC.png)
:::info
**The Problem of RNN: Short-term Memory**
If a sequence is long enough, they’ll have a hard time carrying
information from earlier time steps to later ones.
**Long Short Term Memory (LSTM)** was created as the solution to short-term memory.
It has internal mechanisms called gates that can
regulate the flow of information.
[RNN Notes](https://hackmd.io/3PzYYuBBTNCgRymLI2fUuw?view)
:::
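A minimal PyTorch usage sketch of an LSTM over a toy sequence (all dimensions are illustrative):
```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=1, batch_first=True)
x = torch.randn(4, 10, 32)            # (batch, time steps, features)
outputs, (h_n, c_n) = lstm(x)         # outputs: (4, 10, 64); h_n, c_n: (1, 4, 64)
```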
* GRU (Gated Recurrent Unit)
![](https://i.imgur.com/KI0nbrg.png)
* Deep LSTM
![](https://i.imgur.com/DBOr9ss.png)
* Two-way LSTM
![](https://i.imgur.com/TAlAU9Z.png)
* Connectionist Temporal Classification (CTC)
![](https://i.imgur.com/vyKLsw3.png)
### 11/7
* Attention Model
![](https://i.imgur.com/uZb0cur.png)
$c$ is the context, and the $y_i$ are the “part of the data” we are looking at.
$$
m_i = \rm{tanh}(W_{cm}c + W_{ym}y_i)
$$
The network computes $m_1, \dots, m_n$ with a tanh layer; the weights $s_i$ are then obtained by applying a softmax over the $m_i$:
$$
softmax(x_1, ..., x_n) = (\frac{e^{x_i}}{\sum_j e^{x_j}})_i \\
z = \sum_i(s_iy_i)
$$
The output $z$ is the weighted arithmetic mean of all the $y_i$, where the weights represent the relevance of each $y_i$ according to the context $c$.
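A small NumPy sketch of this soft attention (random placeholder weights; the reduction of each $m_i$ to a scalar score is an assumption here, a learned projection is also common):
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_c, d_y, d_m, n = 8, 8, 16, 5
c = rng.normal(size=d_c)               # context vector
Y = rng.normal(size=(n, d_y))          # the y_i we attend over
W_cm = rng.normal(size=(d_m, d_c))
W_ym = rng.normal(size=(d_m, d_y))

m = np.tanh(W_cm @ c + Y @ W_ym.T)     # row i is m_i = tanh(W_cm c + W_ym y_i)
s = softmax(m.sum(axis=1))             # scalar relevance per y_i, then softmax -> s_i
z = (s[:, None] * Y).sum(axis=0)       # z = sum_i s_i y_i
```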
---
:::info
* CV Weekly
- Generate video from text
- DIFFUSIONDB: Dataset for Text-to-Image Generative Models
:::
![](https://i.imgur.com/5xiycMO.png)
* 3D data representation
<style>
.bf{
font-weight: bold;
}
</style>
<table>
<tr>
<td>
<p class = "bf">Point Cloud</p>
</td>
<td>
<p class = "bf">Mesh</p>
</td>
</tr>
<tr>
<td>
A point cloud is a set of data points in space, which
measures a large number of points on the external
surfaces of objects around them.
</td>
<td>
A mesh is a collection of vertices, edges and faces that defines
the shape of a polyhedral object. The faces usually consist of triangles
(triangle mesh), quadrilaterals, or other simple convex polygons.
</td>
</tr>
<tr>
<td>
<p class="bf">Voxel</p>
</td>
<td>
<p class="bf">Multi-View Images</p>
</td>
</tr>
<tr>
<td>
A voxel represents a value on a regular grid in three-dimensional
space.
</td>
<td>
Multi-view images are multiple looks of
the same target, e.g., at different viewing
angles, perspectives, and so forth.
</td>
</tr>
</table>
![](https://i.imgur.com/twtSUpj.png)
* Deep Learning on Multi-view Representation
![](https://i.imgur.com/Od4I514.png)
* Challenge
Handling the irregular geometric form: the representation must be consistent across different point orderings (permutations).
![](https://i.imgur.com/YPs8L54.png)
![](https://i.imgur.com/OMmRxrI.png)
**Permutation invariance: Symmetric function**
$$
f(x_1, x_2, ..., x_n) \equiv f(x_{\pi_1}, x_{\pi_2}, ... x_{\pi_n}), x_i \in \mathbb{R}^D
$$
Examples:
$$
f(x_1, x_2, ..., x_n) = \max\{x_1, x_2, ..., x_n\} \\
f(x_1, x_2, ..., x_n) = x_1 + x_2 + ... + x_n
$$
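A small sketch showing why max-pooling over per-point features is such a symmetric function (the core PointNet idea; the shared layer is a single random linear map here):
```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))      # unordered point cloud, n = 128, D = 3
W = rng.normal(size=(3, 64))            # shared per-point feature map (stand-in for an MLP)

def global_feature(pts):
    return np.maximum(pts @ W, 0.0).max(axis=0)   # element-wise max over all points

perm = rng.permutation(len(points))
assert np.allclose(global_feature(points), global_feature(points[perm]))   # order-invariant
```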
![](https://i.imgur.com/pGBnssR.png)
![](https://i.imgur.com/haVjm0e.png)
**Input Alignment by Transformer Network**
![](https://i.imgur.com/D2Sd900.png)
* PointNet Architecture
![](https://i.imgur.com/pDIOtw0.png)
---
* Recap RNN /LSTM
[RNN Notes](https://hackmd.io/3PzYYuBBTNCgRymLI2fUuw?view)
* Transformer network
[Transformer Notes](https://hackmd.io/ba1UQdFqRAGnhN9_eTZeog)
[PyTorch Transformer](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)
* [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://zhuanlan.zhihu.com/p/266311690)
* Vision Transformer
![](https://i.imgur.com/WD0qtbs.jpg)
### 11/14
**Course schedule adjustment**
| Dates | Topic |
| -------- | -------- |
| 11/21 | Invited Talks |
| 11/28 | Invited Talks |
| 12/05 | Midterm Presentation |
| 12/12 | Midterm Presentation |
| 12/19 | Invited Talks / Deep Generation Modeling |
| 12/26 | Final Examination |
* Homework 2: Transformer
* Homework 3: a 500-character reflection on the invited talks
---
* [Tokens-to-Token ViT: Training Vision Transformers from Scratch on Imagenet](https://zhuanlan.zhihu.com/p/359930253)
* [Mobile-Former: Bridging MobileNet and Transformer](https://zhuanlan.zhihu.com/p/412964831)
* [EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers](https://blog.51cto.com/shanglianlm/5550217)
* [End-to-End Object Detection with Transformers](https://allen108108.github.io/blog/2020/07/27/[%E8%AB%96%E6%96%87]%20End-to-End%20Object%20Detection%20with%20Transformers/)
### 11/21
#### Low-Light Image Enhancement
:::info
Peking University - Prof. 劉家瑛 (Jiaying Liu)
:::
* Research topic:
* Image Reconstruction
* Image/Video Coding
* Image Generation
* Video Analytics
* Low-Light Degradation
* Intensive noise
**Problem: High-level vision in low-light scenarios**
* Representative work
* Histogram equalization
* Dehazing method (invert $\rightarrow$ dehaze $\rightarrow$ invert again)
* Retinex Model (retinex decomposition ($S = R \cdot L$) / generate result ($S_{enhance} = R \cdot L^{\frac{1}{\gamma}}$))
* Learning-Based Model (LLNet/LLCNN...)
* Low-Light Datasets for High-Level Tasks (KAIST / Exclusively Dark)
* [Deep Retinex Decomposition for Low-Light Enhancement](https://zhuanlan.zhihu.com/p/87384811)
* Retinex Theory + Deep Learning
* Dataset: LOL (LOw-Light)
* [Benchmarking Low-Light Image Enhancement and Beyond](https://zhuanlan.zhihu.com/p/467789757)
* Paired datasets: LLNet
* Unpaired datasets: cannot support supervised model training
* VE-LOL: for evaluating both low-level and high-level vision
* **UG2 challenge**
* [HLA-Face: Joint High-Low Adaptation for Low Light Face Detection](https://blog.csdn.net/weixin_45709330/article/details/116375825)
* Gaps between normal light and low light (pixel-level appearance / object-level semantics)
* Consider joint low-level and high-level adaptation
* [Self-Aligned Concave Curve: Illumination Enhancement for Unsupervised Adaptation](https://arxiv.org/abs/2210.03792)
* Training strategy: asymmetric self-supervised alignment
#### Multi-View 3D Modeling of Non-Diffuse Objects with Complex Materials
:::info
Australian National University - Prof. Hongdong Li
:::
* Research topic:
* Computer Vision
* Robotic Vision
* Smart Car Project
* City Modeling
* Bionic Eyes Project
* [Multi-view 3D Reconstruction of a Texture-less Smooth Surface of Unknown Generic Reflectance](https://openaccess.thecvf.com/content/CVPR2021/papers/Cheng_Multi-View_3D_Reconstruction_of_a_Texture-Less_Smooth_Surface_of_Unknown_CVPR_2021_paper.pdf)
* Vision-based 3D Shape Reconstruction
* (Rigid object / scene) Structure from Motion
* Lambertian / Non-Lambertian
* Problem Setting: Traditional Photometric Stereo problem
* 3D computer vision $\leftrightarrow$ image inversion
* **The rendering equation**
* Solution: Minimizing a suitable objective (loss) function (augmented Lagrangian method relaxation)
image formation + surface regularization + relaxation penalty
* [Diffeomorphic Neural Surface Parameterization for 3D and Reflectance Recovery](https://dl.acm.org/doi/10.1145/3528233.3530741)
* Shape deformation
* Learning / training process: Inverse graphics rendering
* Recap
* Multi-view 3D reconstruction for object with unknown materials.
* Significantly outperforms SOTAs under unknown illuminations
* Achieves similar accuracy to darkroom methods but much more flexible
* Robust to complex shapes and specular materials
* Reconstructions can be easily plugged into rendering engines
* Limitations: <span class="red">piecewise-smooth object shape assumption with simple topology; needs a strong flashlight (for SNR); slow convergence</span>
#### Consistent, Empathetic and Prosocial Dialogues
:::info
Prof. Gunhee Kim
:::
* [ProsocialDialog: A Prosocial Backbone for Conversational Agents](https://arxiv.org/pdf/2205.12688.pdf)
* [Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling](https://arxiv.org/abs/2107.03451)
* Dataset: DailyDialog / PersonaChat / EmpatheticDialogues... (All of those are biased towards positivity)
* Classification models trained on GoEmotions
* Canary / Prost
* [Will I Sound Like Me? Improving Persona Consistency in Dialogues through Pragmatic Self-Consciousness](https://arxiv.org/abs/2004.05816)
* *Public self-consciousness* is the awareness of the self as a social object that can be observed and evaluated by others
* Bayesian Rational Speech Acts framework, which was originally applied to improving the informativeness of referring expressions.
* [Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes](https://arxiv.org/abs/2109.08828)
* Related work
* Empathetic dialogue modeling
* Emotion Cause (Pair) Extraction
* Rational Speech Acts (RSA) framework
### 11/28
#### [Context Autoencoder for Scalable Self-Supervised Representation Pretraining](https://arxiv.org/abs/2202.03026)
:::info
Baidu computer vision expert - Jingdong Wang (王井东)
:::
- Vision Foundation Models
- Big Data
- Big Parameter
- Big Task
- Big Algorithm
- Big Computation
- Representation Pretraining
- Goal: Learn an encoder mapping an image to a representation
- Pretraining Task $\rightarrow$ Downstream Task
- Scale up: sample scale (not for supervised; yes for semi-supervised / vision-language / self-supervised), concept scale (not for supervised / semi-supervised; yes for vision-language / self-supervised)
- Self-Supervised Representation Pretraining in Vision
- Contrastive pretraining
- Masked image modeling
- Other
- CAE: representation pretraining aims to learn an encoder, <span class="red">mapping an image to a representation that can be transferred to downstream task.</span>
- <span class="red">Regressor for masked image modeling $\rightarrow$ masked representation modeling:</span>
make predictions for masked patches from visible patches in the encoded representation space for solving the masked image modeling task.
- The encoder is <span class="red">dedicated for</span> representation pretraining, and representation pretraining is <span class="red">only by</span> the encoder.
- The task completion part (regressor and decoder) is <span class="red">separated</span> from the encoder.
<center>
<img src = "https://i.imgur.com/YtAg3d1.png">
<p>Figure 1: Context autoencoder</p>
</center>
- How does contrastive pretraining work?
- How can the representations of random crops from the same original image be similar?
- Speculation: the encoder extracts the representation of the <span class="red">part</span> of the object / the projector maps the part representation to the representation of the <span class="red">whole object</span>
- The projected representations then agree
- What representations are learned?
- Observation: What random crops share lies in the <span class="red">center</span> of the original image / the object in an ImageNet image lies in the <span class="red">center</span>
- Conjecture: Contrastive pretraining mainly <span class="red">learns the semantics of the center region</span>
**[Github repo.](https://github.com/lxtGH/CAE)**
<center>
<img src = "https://i.imgur.com/A9RqRId.png">
<p>Table 1: Pretraining quality evaluation</p>
</center>
#### Relational and Structural Vision with High-Order Feature Transforms
:::info
POSTECH - Minsu Cho
:::
**Match and transfer**
- Relational Self-Attention: What's Missing in Attention for Video Understanding
- [SPair-71k: A Large-scale Benchmark for Semantic Correspondence](http://cvlab.postech.ac.kr/research/SPair-71k/)
- [Convolutional Hough Matching Networks](https://arxiv.org/abs/2103.16831)
- [TransforMatcher: Match-to-Match Attention for Semantic Correspondence](https://arxiv.org/abs/2205.11634)
- Few-shot image segmentation
- [Hypercorrelation Squeeze for Few-Shot Segmentation](https://openaccess.thecvf.com/content/ICCV2021/papers/Min_Hypercorrelation_Squeeze_for_Few-Shot_Segmentation_ICCV_2021_paper.pdf)
- Structure of correspondence in space
- [Learning to Discover Reflection Symmetry via Polar Matching Convolution](https://arxiv.org/abs/2108.12952)
- Motion-aware video recognition
- [Learning Self-Similarity in Space and Time as Generalized Motion](https://arxiv.org/abs/2102.07092)
- Relational Self-Attention
- [Relational Self-Attention: What's Missing in Attention for Video Understanding](https://arxiv.org/abs/2111.01673)
- Summary
- Real-world vision systems need to leverage relational and structural patterns of images and videos for systematic understanding.
- High-order convolution or self-attention is effective for capturing relational structures by considering geometric patterns of correlation.
- Learning relational structures is crucial for minimally-supervised recognition and structural perception of images and videos.
#### AURORA - Empirical Bayes from Replicates
:::info
Stanford University - Dennis L. Sun
:::
- [Empirical Bayes mean estimation with nonparametric errors via order statistic regression on replicated data](https://arxiv.org/abs/1911.05970)
- Estimate some quantity $\mu_i$ from noisy observations $\textbf{Z} = \{Z_1, ... Z_N\}$.
- Empirical Bayes: First estimate $A$ using the data, then plug it into the prior.
- Prior: $G = \mathcal{N}(0, A)$
- Likelihood: $F(\cdot \ | \ \mu_i) = \mathcal{N}(\mu_i, \sigma^2)$
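For this Gaussian prior/likelihood pair, the Bayes estimate is a simple shrinkage rule; a hedged sketch of the standard closed form (the moment estimator for $A$ is one common choice, not necessarily the exact estimator used in the talk):
$$
\hat{\mu}_i = \mathbb{E}[\mu_i \mid Z_i] = \frac{A}{A + \sigma^2} Z_i,
\qquad
\hat{A} = \max\Big(0,\ \frac{1}{N}\sum_{i} Z_i^2 - \sigma^2\Big),
$$
since marginally $Z_i \sim \mathcal{N}(0, A + \sigma^2)$.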
### 12/21
**Deploying CV at Edge - From Recent Vision Transformer to Future Metaverse**
##### Computing and AI Technology Group, MediaTek Inc.
#### Part1 Overview
* NIPS
* Marching toward metaverse era
#### Part2 Deploying Vision transformer at edge
* Computer vision research evolves rapidly
* How to use it in our daily devices
:::info
鄭嘉珉, Senior Manager, MediaTek
:::
The talk focuses on experience sharing, especially on taking CV research to production at MediaTek.
#### NIPS
[Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding](https://arxiv.org/pdf/2205.11487.pdf)
Important topics:
* Adversarial robustness
* Federated learning
* Diffusion models
* NeRF (Neural Radiance Field)
* NeMF (Neural Motion Field)
* CCNeRF (Compressible-composable NeRF)
* GNN
#### Metaverse
##### Challenge
* High computing
* Low latency
* Low power
* Tiny form factor
* Display:
* immersive display experience
* Graphics
* Motion-to-Photon latency
* VR : under 20ms
* AR : under 5ms
* Concurrent multiple tasks
Growth rates: DNN compute demand > edge processing capability > Moore's law
---
:::info
姜政銘(Jimmy Chiang)
:::
* Edge AI in MTK
* Vision Transformer
* The AI talent MTK values
* Advice for those about to enter the workforce
Edge AI KSF:
Noise Reduction, Super Resolution
#### CAI Department
Tasks assigned to the AI-ALG (algorithm) team:
* AI CV
* AI NLP
* AI Network
* AI Methodology
* AI for 5G
* AI Architecture
AI-SW:
* Connects the stack: GPU → CUDA → PyTorch → Python code
* NeuroPilot SW connects to the GPU on the phone
AI-HW:
* How to design a high-efficiency APU under a limited cost budget
#### What does it take to run a trained model on a phone?
1. How to integrate NAS and quantization?
2. How to export to a format the platform supports?
3. What if the result is extremely slow?
#### Vision transformer
1. Patch embedding
* Operation
* Challenges in APU
* memory access is one of the bottlenecks in APU
* Patch-wise is like 'Sliding window' in convolution
* Patch size
2. Multihead Self-attention - Challenge
* global self-attention requires quadratic computing complexity
* The biggest challenge in the APU ⇒ over 95% of the latency cost in ViT
* Matrix multiplication
* Softmax
Summary:
* Global attention has better quality but suffers from costly MatMul and Softmax
* Cross-covariance attention is favorable for high-resolution, low-channel settings
#### Softmax Complexity
* Softmax: the naive formula does not work due to numerical stability (overflow)
* Most AI accelerators support float16 instead of float32 to get better PPA (Performance, Power, Area)
* What happens when using float16? Underflow
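A small NumPy sketch of the overflow issue and the usual max-subtraction fix (values are illustrative):
```python
import numpy as np

x = np.array([10.0, 50.0, 90.0], dtype=np.float16)   # toy logits

naive = np.exp(x) / np.exp(x).sum()     # exp(90) overflows to inf (even in float32)

shifted = np.exp(x - x.max())           # subtract the max so all exponents are <= 0
stable = shifted / shifted.sum()

# in float16, exponents of very negative shifted logits can still underflow to 0,
# which is the accuracy concern mentioned above
```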
#### Norm-Layer Challenge
* Overflow occurs after the Mul that computes the variance $\sigma^2$
* Underflow occurs in Rsqrt
#### MLP-GELU Challenge Overview
* GELU activation is widely used in Transformers
* It is impractical to implement the error function in an AI accelerator!
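The usual workaround is the tanh approximation of GELU, which avoids the error function; a small sketch comparing it with the exact form (this is the standard approximation from the GELU paper, not necessarily what the APU implements):
```python
import numpy as np
from math import erf

def gelu_exact(x):
    return np.array([0.5 * v * (1.0 + erf(v / np.sqrt(2.0))) for v in x])

def gelu_tanh(x):   # accelerator-friendly: only tanh, multiply and add
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.linspace(-4.0, 4.0, 9)
print(np.abs(gelu_exact(x) - gelu_tanh(x)).max())   # the approximation error is small
```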
#### What papers might not tell you, but matter in edge AI
* Low MACs/FLOPs do not imply high efficiency
* Accuracy reported in a paper does not guarantee accuracy on an edge device
* Papers report performance on mobile CPU and GPU
#### Career advice
* General principles
- Fundamentals
- Teamwork and communication
- Curiosity
- Independent thinking
- A learning mindset
- What kind of talent?
- Algorithms, hardware, software
- When applying, prepare your résumé and your slides
## Paper list
| Paper | Conference / Year |
| -------- | -------- |
| You Only Cut Once: Boosting Data Augmentation with a Single Cut | ICML / 2022 |
| Scaled-YOLOv4: Scaling Cross Stage Partial Network | CVPR / 2021 |
| MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation | CVPR / 2022 |
| Taming Transformers for High-Resolution Image Synthesis | CVPR / 2021 |
| BEiT: BERT Pre-Training of Image Transformers | ICLR / 2022 |
| GAN-Supervised Dense Visual Alignment | CVPR / 2022 |
| Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling | CVPR / 2022 |
| FMODetect: Robust Detection of Fast Moving Objects | ICCV / 2021 |
| Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ICCV / 2021 |
| Boosting Crowd Counting via Multifaceted Attention | CVPR / 2022 |
| Focal and Global Knowledge Distillation for Detectors | CVPR / 2022 |
| VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution | CVPR / 2022 |
| RefineFace: Refinement Neural Network for High Performance Face Detection | TPAMI / 2021 |
| Restormer: Efficient Transformer for High-Resolution Image Restoration | CVPR / 2022 (Oral) |
| Learning the Degradation Distribution for Blind Image Super-Resolution | CVPR / 2022 |
| Pose Recognition With Cascade Transformers | CVPR / 2021 |
| Deep Constrained Least Squares for Blind Image Super-Resolution | CVPR / 2022 |
| ACPL: Anti-curriculum Pseudo-labelling for Semi-supervised Medical Image Classification | CVPR / 2022 |
| CoMoGAN: Continuous Model-guided Image-to-Image Translation | CVPR / 2021 |
| TrackFormer: Multi-Object Tracking with Transformers | CVPR / 2022 |
| Contrastive Embedding for Generalized Zero-Shot Learning | CVPR / 2021 |
| Masked Autoencoders Are Scalable Vision Learners | CVPR / 2022 |
| Crafting Better Contrastive Views for Siamese Representation Learning | CVPR / 2022 |
| GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields | CVPR / 2021 |
| Scaling Vision Transformers | CVPR / 2022 |
| Image-Adaptive YOLO for Object Detection in Adverse Weather Conditions | AAAI / 2022 |
| EditGAN: High-Precision Semantic Image Editing | NeurIPS / 2021 |
## Final exam (Open anything)
1. Local Binary Patterns (15%)
    - How is it computed?
    - Given three image patches, compare their similarity to the original image
2. Compute the attention output $Z$; the formula and the $K, V, Q$ matrices are given (20%)
3. Given the paper [MetaFormer is Actually What You Need for Vision](https://openaccess.thecvf.com/content/CVPR2022/papers/Yu_MetaFormer_Is_Actually_What_You_Need_for_Vision_CVPR_2022_paper.pdf), explain how it differs from the original Transformer and how it improves performance (20%)
4. Given the paper
[Disentangling 3D Pose in A Dendritic CNN for Unconstrained 2D Face Alignment](https://openaccess.thecvf.com/content_cvpr_2018/papers/Kumar_Disentangling_3D_Pose_CVPR_2018_paper.pdf)
    - Compare [3D STN](https://arxiv.org/pdf/1707.05653.pdf) with this paper: what are the differences, and the pros and cons of each? (15%)
    - How do hard samples increase model robustness? Refer to [Hard Sample Mining](https://arxiv.org/pdf/1606.04232.pdf) (10%)
5. Course feedback (20%)
## Reference
[Request an electronic copy of the textbook](https://szeliski.org/Book/)