NYCU-2022-Fall
Homework 50%, midterm presentation 20%, final exam 30%
With invited speakers, adjusted to: homework 40%, midterm presentation 20%, final exam 30%, talk attendance 10%
Computer Vision:
feature engineering + model learning \(\rightarrow\) deep learning
feature engineering: \(f = f(I)\)
model learning: \(y = g(f, \theta)\)
deep learning: \(y = g(I, \theta)\)
Image data from the real world often display complex structure.
In general, computer vision does not work (except in certain cases).
In comparison to global features, local features are more robust to occlusion and clutter.
Before designing an edge detector
Flip \(g\), then slide it across \(f\) according to the shift \(\tau\).
Continuous form: \((f*g)(t)=\int_{-\infty}^{\infty}f(\tau)\,g(t-\tau)\,d\tau\)
Discrete form: \((f*g)[n]=\sum_{\tau=-\infty}^{\infty}f[\tau]\,g[n-\tau]\)
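As a concrete illustration, a minimal NumPy sketch of the discrete form (the inner loop makes the flip-and-shift explicit; `np.convolve` is the library equivalent):

```python
import numpy as np

def conv1d(f, g):
    """Discrete convolution: (f*g)[n] = sum_m f[m] * g[n-m]."""
    out = np.zeros(len(f) + len(g) - 1)
    for n in range(len(out)):
        for m in range(len(f)):
            if 0 <= n - m < len(g):        # g flipped and shifted by n
                out[n] += f[m] * g[n - m]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
print(conv1d(f, g))        # same result as np.convolve(f, g)
```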
Article on implementing Canny Edge Detection
Large \(\sigma\) detects large-scale edges; small \(\sigma\) detects fine features.
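A short OpenCV sketch of this scale effect (the file names and the 50/150 hysteresis thresholds are placeholder choices):

```python
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Large sigma suppresses fine texture, keeping only large-scale edges;
# small sigma preserves fine features.
coarse = cv2.Canny(cv2.GaussianBlur(img, (0, 0), sigmaX=3.0), 50, 150)
fine = cv2.Canny(cv2.GaussianBlur(img, (0, 0), sigmaX=0.8), 50, 150)
cv2.imwrite("edges_coarse.jpg", coarse)
cv2.imwrite("edges_fine.jpg", fine)
```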
Image Gradient
Magnitude: \(\|\nabla f\|=\sqrt{\left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2}\)
Direction: \(\theta = \tan^{-1}(\frac{\partial f}{\partial y} / \frac{\partial f}{\partial x})\)
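A small sketch of both formulas using Sobel derivatives (`np.arctan2` is used instead of a raw \(\tan^{-1}\) to avoid dividing by zero where \(\partial f/\partial x = 0\); the file name is a placeholder):

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float64)

gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)   # df/dx
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)   # df/dy

magnitude = np.sqrt(gx ** 2 + gy ** 2)
direction = np.arctan2(gy, gx)                   # gradient angle in radians
```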
Invariant to rotation and translation, but not invariant to image scale; it doesn’t tell us the scale of the corner.
Each zero-crossing corresponds to an edge.
Impulse response
(Figure: the Gaussian on the right is twice the one on the left.)
Appendix-SIFT
Keypoint Localization
SIFT Descriptor
OpenCV SIFT
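A minimal usage sketch (SIFT lives in the main OpenCV module since version 4.4; the file names are placeholders):

```python
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)   # N keypoints, 128-D descriptors

# Draw keypoints with their scale and orientation.
vis = cv2.drawKeypoints(img, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.jpg", vis)
```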
HoG (Histogram of Oriented Gradients)
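A minimal sketch of extracting a HoG descriptor with scikit-image (the random input is a stand-in for a grayscale detection window; the 9/8×8/2×2 settings follow the classic Dalal–Triggs pedestrian-detection configuration):

```python
import numpy as np
from skimage.feature import hog

img = np.random.rand(128, 64)        # stand-in for a grayscale detection window

# 9 orientation bins, 8x8-pixel cells, 2x2-cell blocks.
features = hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)
print(features.shape)                # one long gradient-orientation histogram
```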
LBP (Local Binary Patterns)
LBP is a non-parametric descriptor whose aim is to efficiently summarize the local structures of images.
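A brief sketch with scikit-image's implementation (the random input is a stand-in; P = 8 neighbors on a radius-1 circle is the standard setting):

```python
import numpy as np
from skimage.feature import local_binary_pattern

img = np.random.randint(0, 256, (128, 128)).astype(np.uint8)  # stand-in image

P, R = 8, 1                                   # 8 neighbors, radius 1
lbp = local_binary_pattern(img, P, R, method="uniform")

# The texture descriptor is the histogram of pattern codes
# ("uniform" method gives P + 2 distinct codes).
hist, _ = np.histogram(lbp, bins=np.arange(P + 3), density=True)
```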
Types of Object Detection
Object Classification
ImageNet Dataset
AlexNet
Semantic Segmentation
Sliding Window
Downsampling & upsampling (to reduce the expensive cost of convolution at full resolution)
There is no universal agreement in the literature on the definitions of various vision subtasks
Two Main Categories for Generic Object Detection
[Paper] EDF-SSD: An Improved Feature Fused SSD for Object Detection
Convolution kernel size: height × width × depth
Reduces the computation over the depth (channel) dimension. Not really a 1x1 convolution → it's a 1x1xC convolution
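A two-line PyTorch sketch of the point: the kernel is 1×1 spatially, but it still mixes all C input channels at each location, so it can cheaply shrink the depth:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)         # N x C x H x W feature map

# "1x1" convolution = 1 x 1 x 256 kernel per output channel:
# it mixes all 256 channels at each pixel and reduces depth to 64.
reduce = nn.Conv2d(256, 64, kernel_size=1)
print(reduce(x).shape)                  # torch.Size([1, 64, 28, 28])
```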
Top-1 ImageNet Accuracy: the model may give only 1 answer
Top-5 ImageNet Accuracy: the model may give 5 answers (correct if any of them matches)
Skip connections need not skip only one layer.
The advantage of adding this type of skip connection is that if any layer hurts the architecture's performance, it can be skipped by regularization.
So, this allows training a very deep neural network without the problems caused by vanishing/exploding gradients.
In conclusion, ResNets are among the most efficient neural network architectures, as they help maintain a low error rate much deeper in the network.
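A minimal PyTorch sketch of such a residual block (a simplified basic block, not the exact configuration from the ResNet paper): the addition lets the block fall back to an identity mapping when its convolutions do not help.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv layers plus a skip connection: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)        # the skip connection

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # shape is preserved
```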
Contours with Explicit Encoding
Pros: fast at inference and easy to optimize.
Cons: cannot depict the mask precisely and cannot describe objects that have holes in the center.
No class: National Day holiday
Loss function
Solutions in the Literature for Long-Tailed Visual Recognition
Class-Balanced Loss
\[ \mathrm{CB}(\mathbf{p}, y) = \frac{1}{E_{n_{y}}} \mathcal{L} (\mathbf{p}, y) = \frac{1 - \beta}{1 - \beta^{n_y}} \mathcal{L}(\mathbf{p}, y) \]
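A sketch of the formula as a weighted cross-entropy in PyTorch (renormalizing the weights to sum to the number of classes is a common convention, not part of the formula itself):

```python
import torch
import torch.nn.functional as F

def class_balanced_ce(logits, targets, samples_per_class, beta=0.9999):
    """CB loss: weight class y by (1 - beta) / (1 - beta**n_y)."""
    n = torch.as_tensor(samples_per_class, dtype=torch.float)
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, n))
    weights = weights / weights.sum() * len(n)    # normalize (convention)
    return F.cross_entropy(logits, targets, weight=weights)

logits = torch.randn(4, 3)                        # 4 samples, 3 classes
targets = torch.tensor([0, 0, 1, 2])
loss = class_balanced_ce(logits, targets, [1000, 100, 10])  # long-tailed counts
```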
What is transfer learning?
Transfer learning is about leveraging feature representations from a pre-trained model, so you don't have to train a new model from scratch.
The pre-trained models are usually trained on massive datasets that are a standard benchmark in the computer vision frontier.
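A minimal sketch with torchvision (≥ 0.13 for the `weights` API; the 10-class head is a hypothetical target task): freeze the pre-trained backbone and train only a new classifier head.

```python
import torch.nn as nn
from torchvision import models

# Backbone pre-trained on ImageNet; freeze its features.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for the new task; only it will be trained.
model.fc = nn.Linear(model.fc.in_features, 10)
```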
Mitigating Dataset Bias (BMVC 2020 Keynote)
Techniques that help deal with data bias
Adversarial domain alignment
Pixel-space alignment
Few-shot domain translation
Lots of unlabeled data, but only 1-5 images of the target domain
Disentangled features
Weak Scene-level Alignment
Alignment that respects class boundaries
Category Shift
When categories aren't the same in source and target
Recognition of Static Pose
Recognition of Dynamic Pose
Pose Model
The Problem of RNN: Short-term Memory
If a sequence is long enough, they’ll have a hard time carrying information from earlier time steps to later ones.
Long Short-Term Memory (LSTM) was created as the solution to short-term memory. It has internal mechanisms called gates that can regulate the flow of information.
\(c\) is the context, and the \(y_i\) are the “parts of the data” we are looking at.
\[ m_i = \tanh(W_{cm}c + W_{ym}y_i) \]
The network computes \(m_1, \dots, m_n\) with a tanh layer, then turns them into weights \(s_1, \dots, s_n\) with a softmax:
\[ \mathrm{softmax}(x_1, \dots, x_n) = \left(\frac{e^{x_i}}{\sum_j e^{x_j}}\right)_i, \qquad z = \sum_i s_i y_i \]
The output \(z\) is the weighted arithmetic mean of all the \(y_i\), where the weights represent the relevance of each variable according to the context \(c\).
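A NumPy sketch of the whole computation (the scoring vector `w` that turns each \(m_i\) into a scalar before the softmax is an assumption; the notes do not spell out that step):

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 8, 5
y = rng.normal(size=(n, D))            # the y_i, the "parts of the data"
c = rng.normal(size=D)                 # the context
W_cm, W_ym = rng.normal(size=(D, D)), rng.normal(size=(D, D))

m = np.tanh(c @ W_cm.T + y @ W_ym.T)   # m_i = tanh(W_cm c + W_ym y_i)

w = rng.normal(size=D)                 # assumed scoring vector: m_i -> scalar
scores = m @ w
s = np.exp(scores - scores.max())
s /= s.sum()                           # s = softmax over the scores

z = (s[:, None] * y).sum(axis=0)       # weighted mean of the y_i
```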
Point Cloud | Mesh |
---|---|
A point cloud is a set of data points in space, measuring a large number of points on the external surfaces of surrounding objects. | A mesh is a collection of vertices, edges, and faces that defines the shape of a polyhedral object. The faces usually consist of triangles (triangle mesh), quadrilaterals, or other simple convex polygons. |
Voxel | Multi-View Images |
---|---|
A voxel represents a value on a regular grid in three-dimensional space. | Multi-view images are multiple looks of the same target, e.g., at different viewing angles, perspectives, and so forth. |
To cope with the irregular geometric form, different point orderings must yield the same representation (consistency):
Permutation invariance: Symmetric function
\[ f(x_1, x_2, ..., x_n) \equiv f(x_{\pi_1}, x_{\pi_2}, ... x_{\pi_n}), x_i \in \mathbb{R}^D \]
Examples:
\[ f(x_1, x_2, ..., x_n) = \max\{x_1, x_2, ..., x_n\} \\ f(x_1, x_2, ..., x_n) = x_1 + x_2 + ... + x_n \]
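A quick NumPy check that an element-wise max over the point axis is such a symmetric function, so any reordering of the points gives the same feature:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))     # a toy point cloud: 100 points in R^3

def symmetric_feature(x):
    return x.max(axis=0)               # max over points: order cannot matter

perm = rng.permutation(len(points))
assert np.allclose(symmetric_feature(points), symmetric_feature(points[perm]))
```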
Input Alignment by Transformer Network
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Vision Transformer
Schedule Adjustments
Dates | Topic |
---|---|
11/21 | Invited Talks |
11/28 | Invited Talks |
12/05 | Midterm Presentation |
12/12 | Midterm Presentation |
12/19 | Invited Talks / Deep Generation Modeling |
12/26 | Final Examination |
Peking University - Prof. Jiaying Liu (劉家瑛)
Research topic:
Low-Light Degradation
Intensive noise
Problem: High-level vision in low-light scenarios
Representative work
Deep Retinex Decomposition for Low-Light Enhancement
Benchmarking Low-Light Image Enhancement and Beyond
HLA-Face: Joint High-Low Adaptation for Low Light Face Detection
Self-Aligned Concave Curve: Illumination Enhancement for Unsupervised Adaptation
Australian National University - Prof. Hongdong Li
Research topic:
Multi-view 3D Reconstruction of a Texture-less Smooth Surface of Unknown Generic Reflectance
Diffeomorphic Neural Surface Parameterization for 3D and Reflectance Recovery
Prof. Gunhee Kim
ProsocialDialog: A Prosocial Backbone for Conversational Agents
Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes
Baidu computer vision expert - Jingdong Wang (王井东)
Vision Foundation Models
Representation Pretraining
Self-Supervised Representation Pretraining in Vision
CAE: representation pretraining aims to learn an encoder, mapping an image to a representation that can be transferred to downstream tasks.
Figure 1: Context autoencoder
Table 1: Pretraining quality evaluation
POSTECH - Minsu Cho
Match and transfer
Relational Self-Attention: What's Missing in Attention for Video Understanding
SPair-73k: A Large-scale Benchmark for Semantic Correspondence
TransforMatcher: Match-to-Match Attention for Semantic Correspondence
Few-shot image segmentation
Structure of correspondence in space
Motion-aware video recognition
Relational Self-Attention
Summary
Stanford University - Dennis L. Sun
Estimate some quantity \(\mu_i\) from noisy observations \(\mathbf{Z} = \{Z_1, \dots, Z_N\}\).
Empirical Bayes: First estimate \(A\) using the data, then plug it into the prior.
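A toy NumPy sketch under an assumed normal-normal model (\(Z_i \sim N(\mu_i, 1)\), \(\mu_i \sim N(0, A)\); the talk's exact setup may differ): estimate \(A\) from the marginal variance of the data, then shrink each \(Z_i\) by the plug-in posterior mean.

```python
import numpy as np

rng = np.random.default_rng(0)
N, A_true = 1000, 2.0
mu = rng.normal(0.0, np.sqrt(A_true), N)   # mu_i ~ N(0, A)
Z = mu + rng.normal(0.0, 1.0, N)           # Z_i ~ N(mu_i, 1)

# Marginally Z_i ~ N(0, A + 1), so A can be estimated from the data
# and plugged into the prior (the empirical Bayes step).
A_hat = max(Z.var() - 1.0, 0.0)
mu_hat = (A_hat / (A_hat + 1.0)) * Z       # plug-in posterior mean

print(np.mean((Z - mu) ** 2))              # raw estimate: MSE ~ 1
print(np.mean((mu_hat - mu) ** 2))         # shrinkage estimate: lower MSE
```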
Deploying CV at Edge - From Recent Vision Transformer to Future Metaverse
鄭嘉珉, Senior Manager at MediaTek (MTK)
Focused more on experience sharing, especially on going from CV research to production at MTK
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Important topic:
DNN > edge processing > Moore's law (growth of DNN compute demand outpaces edge processing capability, which outpaces Moore's law)
姜政銘 (Jimmy Chiang)
Edge AI KSF:
Noise reduction, super-resolution
AI-ALG: the tasks assigned to the algorithms
AI-SW:
AI-HW:
Summary:
Paper | Conference / Year |
---|---|
You Only Cut Once: Boosting Data Augmentation with a Single Cut | ICML / 2022 |
Scaled-YOLOv4: Scaling Cross Stage Partial Network | CVPR / 2021 |
MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation | CVPR / 2022 |
Taming Transformers for High-Resolution Image Synthesis | CVPR / 2021 |
BEiT: BERT Pre-Training of Image Transformers | ICLR / 2022 |
GAN-Supervised Dense Visual Alignment | CVPR / 2022 |
Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling | CVPR / 2022 |
FMODetect: Robust Detection of Fast Moving Objects | ICCV / 2021 |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ICCV / 2021 |
Boosting Crowd Counting via Multifaceted Attention | CVPR / 2022 |
Focal and Global Knowledge Distillation for Detectors | CVPR / 2022 |
VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution | CVPR / 2022 |
RefineFace: Refinement Neural Network for High Performance Face Detection | TPAMI / 2021 |
Restormer: Efficient Transformer for High-Resolution Image Restoration | CVPR / 2022 (Oral) |
Learning the Degradation Distribution for Blind Image Super-Resolution | CVPR / 2022 |
Pose Recognition With Cascade Transformers | CVPR / 2021 |
Deep Constrained Least Squares for Blind Image Super-Resolution | CVPR / 2022 |
ACPL: Anti-curriculum Pseudo-labelling for Semi-supervised Medical Image Classification | CVPR / 2022 |
CoMoGAN: Continuous Model-guided Image-to-image Translation | CVPR / 2021 |
TrackFormer: Multi-Object Tracking with Transformers | CVPR / 2022 |
Contrastive Embedding for Generalized Zero-Shot Learning | CVPR / 2021 |
Masked Autoencoders Are Scalable Vision Learners | CVPR / 2022 |
Crafting Better Contrastive Views for Siamese Representation Learning | CVPR / 2022 |
GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields | CVPR / 2021 |
Scaling Vision Transformers | CVPR / 2022 |
Image-Adaptive YOLO for Object Detection in Adverse Weather Conditions | AAAI / 2022 |
EditGAN: High-Precision Semantic Image Editing | NeurIPS / 2021 |