
Applied Computer Vision - 鄭文皇 (2022 Fall)

tags: NYCU-2022-Fall

Course Information

  1. Learn the concepts and theories of Computer Vision (CV) and how they can be applied in practice to solve real-world problems.
  2. Also covers the latest topics in the current CV literature, such as self-supervised learning for CV applications.

Homework 50%, midterm presentation 20%, final exam 30%

With invited speakers added, adjusted to: homework 40%, midterm presentation 20%, final exam 30%, talk attendance 10%

Date

9/12

Computer Vision:

feature engineering + model learning \(\rightarrow\) deep learning

feature engineering: \(f = f(I)\)
model learning: \(y = g(f, \theta)\)
deep learning: \(y = g(I, \theta)\)

  • Feature Detector
    A subsystem of a visual system that detects the presence or absence of certain features in a visual scene.

Image data from the real world often display complex structure.

In general, computer vision does not work. (except in certain cases)

  • Intra-class Variability
    The same image class can appear very differently across photos.

9/19

  • intensity: brightness, \(\frac{R+G+B}{3}\)

In comparison to global features, local features are more robust to occlusion and clutter.

  • Properties of Ideal Local Feature
  1. Repeatability
  2. Distinctiveness / Informativeness (when the local structure changes, the feature should change too)
  3. Locality
  4. Quantity
  5. Accuracy
  6. Efficiency

Before designing an edge detector:

  1. Use derivatives (in the x and y directions) to locate points with high gradient.
  2. Smooth to reduce noise before taking derivatives.
  • Edge Detector in 1D & 2D

Flip \(g\), then slide it across \(f\) by shifting the offset \(\tau\).

Continuous form: \((f*g)(t)=\int^{\infty}_{-\infty}f(\tau)g(t-\tau)\,d\tau\)

Discrete form: \((f*g)(n)=\sum_{\tau=-\infty}^{\infty}f(\tau)g(n-\tau)\)
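
A minimal NumPy sketch of the discrete form above (the flip of \(g\) is what distinguishes convolution from correlation); `np.convolve` serves as a cross-check:

```python
import numpy as np

def conv1d(f, g):
    # (f*g)[n] = sum_tau f[tau] * g[n - tau], "full" output length
    out = np.zeros(len(f) + len(g) - 1)
    for n in range(len(out)):
        for tau in range(len(f)):
            if 0 <= n - tau < len(g):
                out[n] += f[tau] * g[n - tau]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.5, 0.5])           # simple averaging kernel
print(conv1d(f, g))                # [0.5 1.5 2.5 1.5]
print(np.convolve(f, g))           # same result
```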


  • Canny Edge Detection implementation article
    Large \(\sigma\) detects large-scale edges; small \(\sigma\) detects fine features.
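
A small OpenCV sketch of this scale effect (the image path is a placeholder): smoothing with a larger \(\sigma\) before `cv2.Canny` suppresses fine detail, leaving only coarse edges.

```python
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path

for sigma in (1.0, 3.0):
    # Pre-smooth: kernel size (0, 0) lets OpenCV derive it from sigma
    blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=sigma)
    edges = cv2.Canny(blurred, threshold1=50, threshold2=150)
    cv2.imwrite(f"edges_sigma_{sigma}.png", edges)
```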

  • Image Gradient

Magnitude: \(\| \nabla f \|=\sqrt{(\frac{\partial f}{\partial x})^2 + (\frac{\partial f}{\partial y})^2}\)

Direction: \(\theta = \tan^{-1}(\frac{\partial f}{\partial y} / \frac{\partial f}{\partial x})\)
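
A short OpenCV sketch of both formulas using Sobel derivatives (the image path is a placeholder); `np.arctan2` handles the quadrant of \(\theta\) correctly:

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float64)

gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)   # df/dx
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)   # df/dy

magnitude = np.sqrt(gx**2 + gy**2)
direction = np.arctan2(gy, gx)                   # theta in radians
```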

Invariant to large rotations and translations, but not invariant to image scale: it does not tell us the scale of the corner.

Each zero-crossing corresponds to an edge.




Impulse response




Laplace operator, Laplacian

  • SIFT Algorithm

The Gaussian on the right has twice the scale of the one on the left.
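
A minimal Difference-of-Gaussians sketch (σ = 1.6 is a common SIFT base-scale choice; the image path is a placeholder): subtracting two blurs whose scales differ by a factor of two approximates the scale-normalized Laplacian.

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

sigma = 1.6                                    # common SIFT base scale
g1 = cv2.GaussianBlur(img, (0, 0), sigma)
g2 = cv2.GaussianBlur(img, (0, 0), 2 * sigma)  # twice the scale of g1

# float images keep the negative DoG responses (uint8 would saturate them)
dog = g1 - g2
```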

9/26

Appendix-SIFT

  • Keypoint Localization


  • SIFT Descriptor

    OpenCV SIFT
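
A short usage sketch of OpenCV's SIFT (assumes opencv-python ≥ 4.4, where SIFT lives in the main module; the image path is a placeholder):

```python
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)      # each descriptor is 128-D

vis = cv2.drawKeypoints(img, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.png", vis)
```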

  • HoG (Histogram of Oriented Gradients)
    HoG

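
A minimal HoG sketch with scikit-image (cell and block sizes follow the classic Dalal-Triggs setup; the input is a random stand-in):

```python
import numpy as np
from skimage.feature import hog

img = np.random.rand(128, 64)      # stand-in for a grayscale detection window

# 9 orientation bins per 8x8 cell, block-normalized over 2x2 cells
features = hog(img, orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
print(features.shape)              # (3780,): flattened block histograms
```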
  • LBP (Local Binary Patterns)
    LBP is a non-parametric descriptor whose aim is to efficiently summarize the local structures of images.
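
A small scikit-image sketch (the patch is a random stand-in): compute an LBP code per pixel, then summarize the patch with the histogram of the codes, which is the descriptor actually compared between patches.

```python
import numpy as np
from skimage.feature import local_binary_pattern

patch = np.random.rand(64, 64)     # stand-in for a grayscale image patch
P, R = 8, 1                        # 8 neighbors on a circle of radius 1

lbp = local_binary_pattern(patch, P, R, method="uniform")

# "uniform" yields P + 2 distinct codes; their histogram is the descriptor
hist, _ = np.histogram(lbp, bins=np.arange(P + 3), density=True)
print(hist)
```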

  • Types of Object Detection

    • Detection of specific categories
    • Detection of specific instances


Object Classification


Semantic Segmentation

  • Sliding Window

  • Downsampling & upsampling (reduces the expensive convolution cost)

  • U-Net


There is no universal agreement in the literature on the definitions of various vision subtasks

10/3

Convolution kernel size: height × width × depth

Reduces computation along the depth (channel) dimension. Not really a 1×1 convolution → it's a 1×1×C convolution

  • A Fire module consists of:
    a squeeze convolution layer (which has only 1×1 filters), feeding into an expand layer that has a mix of 1×1 and 3×3 convolution filters.
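
A minimal PyTorch sketch of a Fire module matching that description (channel sizes follow SqueezeNet's fire2 layer):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Squeeze with 1x1 convs, then expand with parallel 1x1 and 3x3 convs."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

x = torch.randn(1, 96, 55, 55)
print(Fire(96, 16, 64, 64)(x).shape)   # torch.Size([1, 128, 55, 55])
```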

Top-1 ImageNet accuracy: the model's single best guess must be correct
Top-5 ImageNet accuracy: correct if the label is among the model's five best guesses

Skip connections can skip more than one layer.

The advantage of this type of skip connection is that if any layer hurts the performance of the architecture, it can be skipped by regularization. As a result, very deep neural networks can be trained without the problems caused by vanishing/exploding gradients. In conclusion, ResNets are among the most effective neural network architectures, as they maintain a low error rate much deeper into the network.
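
A minimal sketch of a basic residual block in PyTorch (same-channel case; downsampling blocks additionally need a projection on the shortcut):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """output = ReLU(F(x) + x): the identity shortcut skips two conv layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # "+ x" gives gradients a direct path back

x = torch.randn(2, 64, 32, 32)
print(ResidualBlock(64)(x).shape)    # torch.Size([2, 64, 32, 32])
```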

  • DenseNet


  • Conventional two-stage solutions adopt the detect-then-segment approach → Slow
  • Focus on single-stage instance segmentation

  • Local-mask-based Methods
    • Contours with Explicit Encoding
      • ExtremeNet (Four extreme points with one center point of objects)
        The center point can also be derived from the four directions. (In practice, there may be more than one extreme point per direction.)
      • PolarMask: It utilizes rays at constant angle intervals from the center to describe the contour.
      • FourierNet: a contour shape decoder using Fourier transform
    • Compact Mask Encoding

Contours with Explicit Encoding

pros: fast at inference and easy to optimize.
cons: cannot depict the mask precisely and cannot describe objects that have holes in the center.

  • Global-mask-based Methods
    • YOLACT: attempting real-time instance segmentation
    • BlendMask

10/10

National Day holiday (no class)

10/17

  • Challenge of Long-Tailed Visual Recognition

  • Loss function

    • MSE: \[f^* = \arg \min_f \mathbb{E}_{x,y \sim p_{\text{data}}} \| y - f(x) \|^2\]
    • MAE: \[f^* = \arg \min_f \mathbb{E}_{x,y \sim p_{\text{data}}} \| y - f(x) \|_1\]
    • Cross Entropy: \[L = -\frac{1}{m} \sum_{i=1}^m y_i \cdot \ln(\hat{y}_i)\]
  • Solutions in the Literature for Long-Tailed Visual Recognition

    • Re-sampling:
      • over-sampling (adding repetitive data) for the minority class
      • under-sampling (removing data) for the majority class
    • Re-weighting: \[L = -\sum^{\mathcal{C}}_{i=1} w_i y_i \log p_i\]
  • Class-Balanced Loss

\[ \text{CB}(\mathbf{p}, y) = \frac{1}{E_{n_y}} \mathcal{L}(\mathbf{p}, y) = \frac{1 - \beta}{1 - \beta^{n_y}} \mathcal{L}(\mathbf{p}, y) \]
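
A small sketch of the weighting term above (β = 0.999 is a typical value from the Class-Balanced Loss paper; normalizing the weights to sum to the number of classes is a common convention):

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    # w_i = (1 - beta) / (1 - beta^{n_i}), the inverse of the effective number
    n = np.asarray(samples_per_class, dtype=np.float64)
    weights = (1.0 - beta) / (1.0 - beta ** n)
    return weights * len(n) / weights.sum()   # normalize to sum to C

# Long-tailed toy distribution: head class 10,000 samples, tail class 10
print(class_balanced_weights([10000, 1000, 100, 10]))
```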

  • re-balancing = re-sampling + re-weighting

  1. Feature extractor
  2. classifier

What is transfer learning?

Transfer learning is about leveraging feature representations from a pre-trained model, so you don't have to train a new model from scratch.

The pre-trained models are usually trained on massive datasets that serve as standard benchmarks in computer vision.
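
A typical fine-tuning sketch with torchvision (assumes torchvision ≥ 0.13 for the `weights` API; the 10-class head is an arbitrary example):

```python
import torch.nn as nn
from torchvision import models

# Reuse an ImageNet-pretrained backbone as the feature extractor
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pretrained weights so only the new head is trained
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for the new task (here: 10 classes)
model.fc = nn.Linear(model.fc.in_features, 10)
```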

  • Bilateral-Branch Network


Mitigating Dataset Bias (BMVC 2020 Keynote)

  • Dataset bias

  • Techniques that help deal with data bias

    • Collect labelled data from target domain
    • Better backbone CNNs
    • Batch Normalization (Li'17, [Chang’19])
    • Instance Normalization + Batch Normalization Nam'19
    • Data Augmentation, Mix Match Berthelot'19
    • Semi-supervised methods, such as Pseudo labeling Zou’19
    • Domain Adaptation (this talk)
  • Adversarial domain alignment

    • Feature space
    • Pixel space

10/24

  • Pixel-space alignment

  • Few-shot domain translation
    Lots of unlabeled target data, but only 1-5 images of the target domain

  • Disentangled features


  • Weak Scene-level Alignment

  • Alignment that respects class boundaries

  • Category Shift
    When categories aren't the same in source and target


  • Recognition of Static Pose

  • Recognition of Dynamic Pose

  • Pose Model

  • Inverse Kinematics

  • Exploiting Temporal Dependence

10/31

  • Recurrent Neural Networks (RNN)

  • RNN cell

The Problem of RNNs: Short-term Memory

If a sequence is long enough, they'll have a hard time carrying information from earlier time steps to later ones.

Long Short-Term Memory (LSTM) was created as a solution to short-term memory. It has internal mechanisms called gates that regulate the flow of information.
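
A minimal PyTorch sketch (shapes are illustrative): the LSTM returns per-step hidden states plus the final hidden and cell states, the cell state being the gated long-term memory.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 50, 32)   # batch of 4 sequences, 50 steps, 32 features

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
out, (h_n, c_n) = lstm(x)

print(out.shape)   # (4, 50, 64): hidden state at every time step
print(h_n.shape)   # (1, 4, 64): final hidden state
print(c_n.shape)   # (1, 4, 64): final cell state (long-term memory)
```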

RNN Notes

  • GRU (Gated Recurrent Unit)

  • Deep LSTM

  • Two-way (bidirectional) LSTM

  • Connectionist Temporal Classification (CTC)

11/7

  • Attention Model

\(c\) is the context, and the \(y_i\) are the "parts of the data" we are looking at.

\[ m_i = \tanh(W_{cm}c + W_{ym}y_i) \]

The network computes \(m_1, \dots, m_n\) with a tanh layer, reduces each \(m_i\) to a scalar score, and turns the scores into weights \(s_i\) with a softmax:

\[ \mathrm{softmax}(x_1, \dots, x_n) = \left(\frac{e^{x_i}}{\sum_j e^{x_j}}\right)_i \\ z = \sum_i s_i y_i \]

The output \(z\) is the weighted arithmetic mean of all the \(y_i\), where the weights \(s_i\) represent the relevance of each variable according to the context \(c\).
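
A NumPy sketch of these equations (dimensions are arbitrary; the reduction of each vector \(m_i\) to a scalar score is done here with a plain sum, where in practice a learned vector is used):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                        # five parts y_i, each of dimension 8
y = rng.normal(size=(n, d))        # the "parts of the data"
c = rng.normal(size=d)             # the context
W_cm = rng.normal(size=(d, d))
W_ym = rng.normal(size=(d, d))

m = np.tanh(c @ W_cm + y @ W_ym)   # m_i = tanh(W_cm c + W_ym y_i)
scores = m.sum(axis=1)             # scalar score per part (simplified)
s = np.exp(scores) / np.exp(scores).sum()   # softmax weights s_i
z = (s[:, None] * y).sum(axis=0)   # z = sum_i s_i y_i
print(s, z)
```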


  • CV Weekly
    • Generate video from text
    • DIFFUSIONDB: Dataset for Text-to-Image Generative Models

  • 3D data representation

Point Cloud

Mesh

A point cloud is a set of data points in space, obtained by measuring a large number of points on the external surfaces of objects. A mesh is a collection of vertices, edges and faces that defines the shape of a polyhedral object. The faces usually consist of triangles (triangle mesh), quadrilaterals, or other simple convex polygons.

Voxel

Multi-View Images

A voxel represents a value on a regular grid in three-dimensional space. Multi-view images are multiple looks of the same target, e.g., at different viewing angles, perspectives, and so forth.

  • Deep Learning on Multi-view Representation

  • Challenge

Handling the irregular geometric form: different point orderings must represent the same shape (permutation consistency)

Permutation invariance: Symmetric function

\[ f(x_1, x_2, \dots, x_n) \equiv f(x_{\pi_1}, x_{\pi_2}, \dots, x_{\pi_n}), \quad x_i \in \mathbb{R}^D \]

Examples:

\[ f(x_1, x_2, ..., x_n) = \max\{x_1, x_2, ..., x_n\} \\ f(x_1, x_2, ..., x_n) = x_1 + x_2 + ... + x_n \]
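
A quick NumPy check that a max over the point axis is such a symmetric function, which is the core trick in PointNet (the per-point transform is a random stand-in for an MLP):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))   # toy point cloud: 100 points in R^3
W = rng.normal(size=(3, 16))         # shared per-point transform (MLP stand-in)

def global_feature(pts):
    h = np.maximum(pts @ W, 0.0)     # per-point features (ReLU)
    return h.max(axis=0)             # symmetric max-pool over points

shuffled = points[rng.permutation(len(points))]
print(np.allclose(global_feature(points), global_feature(shuffled)))  # True
```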

Input Alignment by Transformer Network

  • PointNet Architecture


  • Recap RNN /LSTM

RNN Notes

  • Transformer network

Transformer Notes

PyTorch Transformer

11/14

Course schedule adjustment

| Date  | Topic                                    |
| ----- | ---------------------------------------- |
| 11/21 | Invited Talks                            |
| 11/28 | Invited Talks                            |
| 12/05 | Midterm Presentation                     |
| 12/12 | Midterm Presentation                     |
| 12/19 | Invited Talks / Deep Generative Modeling |
| 12/26 | Final Examination                        |

  • Homework 2: Transformer
  • Homework 3: 500-word reflection on the invited talks

11/21

Low-Light Image Enhancement

Peking University - Prof. 劉家瑛

Multi-view 3D Modeling of Objects with Non-Diffuse, Complex Materials

Australian National University - Prof. Hongdong Li

  • Recap
    • Multi-view 3D reconstruction for object with unknown materials.
    • Significantly outperforms SOTAs under unknown illuminations
    • Achieves similar accuracy to darkroom methods but much more flexible
    • Robust to complex shapes and specular materials
    • Reconstructions can be easily plugged into rendering engines
    • Limitations: assumes piecewise-smooth object shape with simple topology; needs a strong flashlight (for SNR); slow convergence

Consistent, Empathetic and Prosocial Dialogues

Prof. Gunhee Kim

11/28

Context Autoencoder for Scalable Self-Supervised Representation Pretraining

Baidu computer vision expert - Jingdong Wang (王井东)

  • Vision Foundation Models

    • Big Data
    • Big Parameter
    • Big Task
    • Big Algorithm
    • Big Computation
  • Representation Pretraining

    • Goal: Learn an encoder mapping an image to a representation
    • Pretraining Task \(\rightarrow\) Downstream Task
    • Scale up: sample scale (supervised: no; semi-supervised / vision-language / self-supervised: yes) and concept scale (supervised / semi-supervised: no; vision-language / self-supervised: yes)
  • Self-Supervised Representation Pretraining in Vision

    • Contrastive pretraining
    • Masked image modeling
    • Other
  • CAE: representation pretraining aims to learn an encoder, mapping an image to a representation that can be transferred to downstream tasks.

    • Regressor for masked image modeling \(\rightarrow\) masked representation modeling:
      make predictions for masked patches from visible patches in the encoded representation space to solve the masked image modeling task.
    • The encoder is dedicated to representation pretraining; the representation is pretrained only through the encoder.
    • The task-completion part (regressor and decoder) is separated from the encoder.

Figure 1: Context autoencoder

  • How does contrastive pretraining work?
    • How can the representations of random crops from the same original image be similar?
      • Speculation: the encoder extracts the representation of a part of the object / the projector maps the part representation to the representation of the whole object
      • The projected representations then agree
    • What representations are learned?
      • Observation: what random crops have in common lies at the center of the original image / the object in an ImageNet image lies at the center
      • Conjecture: contrastive pretraining mainly learns the semantics of the center region

Github repo.

Table 1: Pretraining quality evaluation

Relational and Structural Vision with High-Order Feature Transforms

POSTECH - Minsu Cho

Match and transfer

AURORA - Empirical Bayes from Replicates

Stanford University - Dennis L. Sun

12/21

Deploying CV at Edge - From Recent Vision Transformer to Future Metaverse

Computing and AI Technology Group, MediaTek Inc.

Part 1: Overview

  • NIPS
  • Marching toward the metaverse era

Part 2: Deploying vision transformers at the edge

  • Computer vision research evolves rapidly
  • How do we use it in our daily devices?

鄭嘉珉, Senior Manager, MTK

Focused on experience sharing, especially going from CV research to production at MTK

NIPS

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Important topics:

  • Adversarial robustness
  • Federated learning
  • Diffusion models
  • NeRF (Neural Radiance Field)
  • NeMF (Neural Motion Field)
  • CCNeRF (Compressible-composable NeRF)
  • GNN

Metaverse

Challenges
  • High computing
  • Low latency
  • Low power
  • Tiny form factor
  • Display:
    • immersive display experience
  • Graphics
  • Motion-to-Photo latency
    • VR : under 20ms
    • AR : under 5ms
  • Concurrent multiple tasks

DNN compute demand grows faster than edge processing power, which grows faster than Moore's law (DNN > edge processors > Moore's law)


姜政銘(Jimmy Chiang)

  • Edge AI in MTK
  • Vision Transformer
  • The AI talent MTK values
  • Advice for those entering the workforce

Edge AI KSFs (key success factors):
Noise reduction, super resolution

CAI department

Tasks assigned to the AI-ALG (algorithm) team

  • AI CV
  • AI NLP
  • AI Network
  • AI Methodology
  • AI for 5G
  • AI Architecture

AI-SW:

  • Glue the stack together: GPU → CUDA → PyTorch → Python code
  • NeuroPilot SW interfaces with the GPU on the phone

AI-HW:

  • How to design a high-efficiency APU under a limited cost budget

What does it take to run a trained model on a phone?

  1. How to integrate NAS and quantization?
  2. How to convert to a format the platform supports?
  3. What if the result is super slow?

Vision transformer

  1. Patch embedding
    • Operation
    • Challenges in the APU
      • memory access is one of the bottlenecks in the APU
    • Patch-wise processing is like a 'sliding window' in convolution
      • Patch size
  2. Multi-head Self-attention - Challenges
    • global self-attention requires quadratic computational complexity
    • the biggest challenge in the APU => over 95% of the latency in ViT
      • Matrix multiplication
      • Softmax

Summary:

  • Global attention has better quality but suffers from the cost of MatMul and Softmax
  • Cross-covariance attention is favorable for high-resolution, fewer-channel settings

Softmax Complexity

  • Softmax: the naive formula doesn't work due to numerical stability (overflow)
  • Most AI accelerators support float16 instead of float32 to get better PPA (Performance, Power, Area)
  • What happens when using float16? Underflow
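
A NumPy sketch of the stability issue and the standard fix (subtract the maximum before exponentiating):

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    e = np.exp(x - x.max())   # arguments are <= 0, so exp cannot overflow
    return e / e.sum()

x = np.array([1000.0, 0.0], dtype=np.float32)
print(softmax_naive(x))    # [nan  0.]: exp(1000) overflows to inf
print(softmax_stable(x))   # [1. 0.]

# float16 trades overflow for underflow: small exponents flush to zero
print(np.exp(np.float16(-20)))   # 0.0 in float16
```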

Norm-Layer Challenge

  • Overflow occurs after the Mul that computes the variance \(\sigma^2\)
  • Underflow occurs in Rsqrt

MLP-GELU Challenge Overview

  • The GELU activation is widely used in Transformers
  • It's impractical to implement the error function in an AI accelerator!
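
A common workaround (used in the original GELU paper and in BERT) replaces erf with a tanh approximation built only from multiplies, adds, and a tanh, which accelerators do support; a sketch:

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # erf-free approximation: only mul/add and tanh are needed
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 9)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))   # < 1e-3
```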

What papers might not tell you, but matters in edge AI

  • Low MACs/FLOPs don't imply high efficiency
  • Accuracy in a paper does not guarantee accuracy on an edge device
  • Papers report performance on mobile CPUs and GPUs

Career advice

  • General principles
    • Solid fundamentals
    • Teamwork and communication
    • Curiosity
    • Independent thinking
    • A learning mindset
  • What kind of talent?
    • Algorithms, hardware, software
    • When applying, have your resume and presentation slides ready

Paper list

| Paper | Conference / Year |
| ----- | ----------------- |
| You Only Cut Once: Boosting Data Augmentation with a Single Cut | ICML / 2022 |
| Scaled-YOLOv4: Scaling Cross Stage Partial Network | CVPR / 2021 |
| MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation | CVPR / 2022 |
| Taming Transformers for High-Resolution Image Synthesis | CVPR / 2021 |
| BEiT: BERT Pre-Training of Image Transformers | ICLR / 2022 |
| GAN-Supervised Dense Visual Alignment | CVPR / 2022 |
| Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling | CVPR / 2022 |
| FMODetect: Robust Detection of Fast Moving Objects | ICCV / 2021 |
| Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ICCV / 2021 |
| Boosting Crowd Counting via Multifaceted Attention | CVPR / 2022 |
| Focal and Global Knowledge Distillation for Detectors | CVPR / 2022 |
| VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution | CVPR / 2022 |
| RefineFace: Refinement Neural Network for High Performance Face Detection | TPAMI / 2021 |
| Restormer: Efficient Transformer for High-Resolution Image Restoration | CVPR / 2022 (Oral) |
| Learning the Degradation Distribution for Blind Image Super-Resolution | CVPR / 2022 |
| Pose Recognition With Cascade Transformers | CVPR / 2021 |
| Deep Constrained Least Squares for Blind Image Super-Resolution | CVPR / 2022 |
| ACPL: Anti-curriculum Pseudo-labelling for Semi-supervised Medical Image Classification | CVPR / 2022 |
| CoMoGAN: Continuous Model-guided Image-to-Image Translation | CVPR / 2021 |
| TrackFormer: Multi-Object Tracking with Transformers | CVPR / 2022 |
| Contrastive Embedding for Generalized Zero-Shot Learning | CVPR / 2021 |
| Masked Autoencoders Are Scalable Vision Learners | CVPR / 2022 |
| Crafting Better Contrastive Views for Siamese Representation Learning | CVPR / 2022 |
| GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields | CVPR / 2021 |
| Scaling Vision Transformers | CVPR / 2022 |
| Image-Adaptive YOLO for Object Detection in Adverse Weather Conditions | AAAI / 2022 |
| EditGAN: High-Precision Semantic Image Editing | NeurIPS / 2021 |

Final exam (open anything)

  1. Local Binary Patterns (15%)
    • How is it computed?
    • Given three image patches, compare their similarity to the original image
  2. How to compute attention (\(Z\)); the formula and the \(K, V, Q\) matrices are given (20%)
  3. Given the paper MetaFormer is Actually What You Need for Vision, explain how it differs from the original transformer and how it improves performance (20%)
  4. Given the paper Disentangling 3D Pose in A Dendritic CNN for Unconstrained 2D Face Alignment:
    • Differences between a 3D STN and this paper, and the pros and cons of each (15%)
    • How do hard samples increase model robustness? Refer to Hard Sample Mining (10%)
  5. Course feedback (20%)

Reference

Textbook e-copy request
