# Applied Computer Vision - 鄭文皇 (2022 Fall)

###### tags: `NYCU-2022-Fall`

## Class info.

[Course information](https://timetable.nycu.edu.tw/?r=main/crsoutline&Acy=111&Sem=1&CrsNo=535232&lang=zh-tw)

1. Learn the concepts and theories of Computer Vision (CV) and how they can be applied in practice to solve real-world problems.
2. Also cover the latest topics in the current CV literature, such as self-supervised learning for CV applications.

Grading: homework 50%, midterm presentation 20%, final exam 30%.
Since speakers were invited, this became: homework 40%, midterm presentation 20%, final exam 30%, talk attendance 10%.

<style>
.red{
    color: red;
}
.blue{
    color: #87ceeb;
}
</style>

## Date

### 9/12

Computer Vision: feature engineering + model learning $\rightarrow$ deep learning

feature engineering: $\textbf{f} = f(I)$
model learning: $y = g(\textbf{f},\theta)$
deep learning: $y = g(I,\theta)$

* Feature Detector
A subsystem of a visual system that detects the presence or absence of certain features in a visual scene.

Image data from the real world often display complex structure.

**In general, computer vision does not work. (except in certain cases)**

* Intra-class Variability
Images of the same class can look very different from photo to photo.

### 9/19

* intensity: brightness of a color, $\frac{R+G+B}{3}$

<table>
    <tr>
    <td> <img src="https://i.imgur.com/S7fi26d.png" alt="drawing" width="400"/> </td>
    <td> <img src="https://i.imgur.com/qZsEGsQ.png" alt="drawing" width="400"/> </td>
    </tr>
</table>

In comparison to global features, local features are more robust to occlusion and clutter.

* Properties of an Ideal Local Feature
1. Repeatability
2. Distinctiveness / Informativeness (when the local structure changes, the feature should change too)
3. Locality
4. Quantity
5. Accuracy
6. Efficiency

* [Sobel operator](https://zh.m.wikipedia.org/zh-tw/%E7%B4%A2%E8%B2%9D%E7%88%BE%E7%AE%97%E5%AD%90)
Before designing an edge detector:
1. Use derivatives (in the x and y directions) to locate positions with high gradient.
2. Smooth the image to reduce noise before taking derivatives.

* Edge Detector in 1D & 2D
<table>
    <tr>
    <td> <img src="https://i.imgur.com/QfolsiX.png" alt="drawing" width="500"/> </td>
    <td> <img src="https://i.imgur.com/mvY53yA.png" alt="drawing" width="500"/> </td>
    </tr>
</table>

* [Convolution](https://iter01.com/480243.html)
Flip $g$, then slide it across $f$ by varying the shift $\tau$.
Continuous form: $(f*g)(n)=\int^{\infty}_{-\infty}f(\tau)g(n-\tau)d\tau$
Discrete form: $(f*g)(n)=\sum_{\tau=-\infty}^{\infty}f(\tau)g(n-\tau)$
<br>

* Canny Edge Detection [implementation write-up](https://medium.com/@pomelyu5199/canny-edge-detector-%E5%AF%A6%E4%BD%9C-opencv-f7d1a0a57d19)
A large $\sigma$ detects large-scale edges; a small $\sigma$ detects fine features.

* Image Gradient
![](https://i.imgur.com/AloAd9o.png)
Magnitude: $\| \nabla f \|=\sqrt{(\frac{\partial f}{\partial x})^2 + (\frac{\partial f}{\partial y})^2}$
Direction: $\theta = \tan^{-1}(\frac{\partial f}{\partial y} / \frac{\partial f}{\partial x})$

* Harris Corner Detector [implementation write-up](https://www.796t.com/p/1343014.html)
Invariant to large **rotation** and **translation**.
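A minimal Harris-corner sketch using OpenCV (`cv2.cornerHarris` is the real API; the file name and threshold values are illustrative assumptions):

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg")                       # assumed input image path
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# blockSize: neighborhood size; ksize: Sobel aperture; k: Harris free parameter.
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# Mark pixels whose corner response exceeds 1% of the maximum response.
img[response > 0.01 * response.max()] = [0, 0, 255]
cv2.imwrite("corners.jpg", img)
```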
However, Harris is ==not invariant to image scale==: it does not tell us the scale of the corner.

**Each zero crossing corresponds to an edge.**

<img src="https://i.imgur.com/xTI0KVQ.png" alt="drawing" width="500"/> <br><br>

> Impulse response

<img src="https://i.imgur.com/TWWezna.png" alt="drawing" width="500"/> <br><br>

[Laplace operator, Laplacian](https://zh.wikipedia.org/zh-tw/%E6%8B%89%E6%99%AE%E6%8B%89%E6%96%AF%E7%AE%97%E5%AD%90)

<table>
    <tr>
    <td> <img src="https://i.imgur.com/IoHzxPT.png" alt="drawing" width="450"/> </td>
    <td> <img src="https://i.imgur.com/rows5Hz.png" alt="drawing" width="450"/> </td>
    </tr>
</table>

* SIFT Algorithm
![](https://i.imgur.com/cGS2ocA.png)
![](https://i.imgur.com/0mXhV3H.png)
![](https://i.imgur.com/AFkWfrf.png)
The Gaussian on the right has twice the $\sigma$ of the one on the left.
![](https://i.imgur.com/tjBo2gq.png)

### 9/26

:::info
**Appendix-SIFT**
![](https://i.imgur.com/NRwfNo1.png)
:::

* Keypoint Localization
![](https://i.imgur.com/F71YuBx.png)
![](https://i.imgur.com/AzRjEX2.png)
![](https://i.imgur.com/QHaS0DX.png)

* SIFT Descriptor
![](https://i.imgur.com/2GCYlTv.png)
[OpenCV SIFT](https://docs.opencv.org/3.4/da/df5/tutorial_py_sift_intro.html)

* HoG (Histogram of Oriented Gradients) [HoG](http://alex-phd.blogspot.com/2014/03/hog.html)
<table>
    <tr>
    <td> <img src="https://i.imgur.com/8jI7586.png" alt="drawing" width="500"/> </td>
    <td> <img src="https://i.imgur.com/SRqCNU2.png" alt="drawing" width="500"/> </td>
    </tr>
    <tr>
    <td> <img src="https://i.imgur.com/nUgn10j.png" alt="drawing" width="500"/> </td>
    <td> <img src="https://i.imgur.com/aKZ1Ca7.png" alt="drawing" width="500"/> </td>
    </tr>
    <tr>
    <td> <img src="https://i.imgur.com/rjAgLxH.png" alt="drawing" width="500"/> </td>
    </tr>
</table>

* LBP (Local Binary Patterns)
LBP is a non-parametric descriptor that efficiently summarizes the local structures of images.
![](https://i.imgur.com/T1589eR.png)
![](https://i.imgur.com/bqRLV0u.png)

* Types of Object Detection
    * Detection of specific categories
    * Detection of specific instances

![](https://i.imgur.com/UEKJ7qM.png)

---

**Object Classification**

* [Image Classification Architectures review](https://medium.com/@14prakash/image-classification-architectures-review-d8b95075998f)
* ImageNet Dataset
    * ImageNet with roughly 1000 images in each of 1000 categories.
* AlexNet
![](https://i.imgur.com/h7q82R7.png)

---

**Semantic Segmentation**

* Sliding Window
![](https://i.imgur.com/wsAJyhd.png)
* Downsampling & upsampling (avoids the cost of running convolutions at full resolution)
![](https://i.imgur.com/A6FzDaX.png)
* [U-Net](https://ithelp.ithome.com.tw/articles/10240314)

---

**<span class="red">There is no universal agreement in the literature on the definitions of various vision subtasks</span>**

* Two Main Categories for Generic Object Detection
![](https://i.imgur.com/uXpPvPG.png)
* [Region Proposals](https://medium.com/curiosity-and-exploration/%E5%8F%96%E5%BE%97-region-proposals-selective-search-%E5%90%AB%E7%A8%8B%E5%BC%8F%E7%A2%BC-be0aa5767901)
* [R-CNN & Fast R-CNN](https://zhuanlan.zhihu.com/p/40986674)
* [[Paper] EDF-SSD: An Improved Feature Fused SSD for Object Detection](https://jackson1998.medium.com/paper-edf-ssd-an-improved-feature-fused-ssd-for-onjection-detection-213c4566745)

### 10/3

Kernel size for a convolution: height × width × depth

> Reduces computation along the depth dimension. Not really a 1x1 convolution → it is a 1x1xC convolution.

![](https://i.imgur.com/llKweEN.png)

* A Fire module comprises a squeeze convolution layer (which has only 1x1 filters) feeding into an expand layer that has a mix of 1x1 and 3x3 convolution filters (a minimal sketch follows).
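A minimal PyTorch sketch of a Fire module as described above (the channel sizes are illustrative assumptions, not the exact SqueezeNet configuration):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        # Squeeze layer: 1x1 convolutions reduce the channel count (and compute).
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        # Expand layer: a mix of 1x1 and 3x3 convolutions, concatenated on channels.
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Example: 96 input channels squeezed to 16, expanded back to 64 + 64 = 128.
y = Fire(96, 16, 64, 64)(torch.randn(1, 96, 56, 56))
print(y.shape)  # torch.Size([1, 128, 56, 56])
```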
:::info
Top-1 ImageNet accuracy: the model may give only 1 answer.
Top-5 ImageNet accuracy: the model may give 5 answers (correct if any of them matches).
:::

![](https://i.imgur.com/ohuJBu0.png)

> Skip connections need not skip only one layer

The advantage of adding this type of **skip connection** is that if any layer hurts the performance of the architecture, it can be skipped by regularization. This makes it possible to train very deep neural networks without the problems caused by vanishing/exploding gradients. In conclusion, ResNets are among the most effective neural network architectures, as they help in **maintaining a low error rate much deeper in the network.**

* DenseNet
![](https://i.imgur.com/EJaIlOp.png)
* [Feature Pyramid Networks](https://ivan-eng-murmur.medium.com/%E7%89%A9%E4%BB%B6%E5%81%B5%E6%B8%AC-s8-feature-pyramid-networks-%E7%B0%A1%E4%BB%8B-99b676245b25)
![](https://i.imgur.com/eE1pm5S.png)
* [A Simple yet Effective Approach for Identifying Unexpected Road Obstacles](https://zhuanlan.zhihu.com/p/415220541)
* [Deep Learning for Generic Object Detection: A Survey](https://www.796t.com/content/1545903385.html)

---

* Conventional two-stage solutions adopt the detect-then-segment approach → **<span class="red">Slow</span>**
* Focus on single-stage instance segmentation

![](https://i.imgur.com/RHemaxC.png)

* Local-mask-based Methods
    * Contours with Explicit Encoding
        * ExtremeNet (four extreme points plus one center point per object)
        The four directions together determine the center point. (In practice, there may be more than one extreme point in a given direction.)
        * PolarMask: utilizes rays at constant angle intervals from the center to describe the contour.
        * FourierNet: a contour shape decoder using the Fourier transform
    * Compact Mask Encoding

:::info
**Contours with Explicit Encoding**
pros: fast at inference and easy to optimize.
cons: cannot depict the mask precisely, and cannot describe objects that have holes in the center.
:::

* Global-mask-based Methods
    * YOLACT: attempting real-time instance segmentation
    ![](https://i.imgur.com/u79wRiw.png)
    * BlendMask

### 10/10

National Day holiday, no class.

### 10/17

* Challenge of Long-Tailed Visual Recognition
![](https://i.imgur.com/6MIP3q1.png)
* Loss functions
    * MSE: $$f^* = \arg \min_f \mathbb{E}_{x,y \sim p_{data}} \| y - f(x) \|^2$$
    * MAE: $$f^* = \arg \min_f \mathbb{E}_{x,y \sim p_{data}} \| y - f(x) \|_1$$
    * [Cross Entropy](https://zh.wikipedia.org/zh-tw/%E4%BA%A4%E5%8F%89%E7%86%B5): $$L = -\frac{1}{m} \sum_{i=1}^m y_i \cdot \ln(\hat{y}_i)$$

![](https://i.imgur.com/1bPlXyM.png)

* Solutions in the Literature for Long-Tailed Visual Recognition
    * Re-sampling:
        * over-sampling (adding repetitive data) for the minority classes
        * under-sampling (removing data) for the majority classes
    * Re-weighting: $$L = -\sum^{\mathcal{C}}_{i=1} w_i y_i \log p_i$$
    * Class-Balanced Loss $$ \mathrm{CB}(\textbf{p}, y) = \frac{1}{E_{n_{y}}} \mathcal{L} (\textbf{p}, y) = \frac{1 - \beta}{1 - \beta^{n_y}} \mathcal{L}(\textbf{p}, y) $$
    * re-balancing = re-sampling + re-weighting

![](https://i.imgur.com/rfCfgLI.png)
![](https://i.imgur.com/WhUeC18.png)

1. Feature extractor
2. Classifier

![](https://i.imgur.com/6ycKe2S.png)

:::info
What is transfer learning?
Transfer learning is about **leveraging feature representations from a pre-trained model**, so you don't have to train a new model from scratch. The pre-trained models are usually trained on massive datasets that are a standard benchmark in the computer vision frontier.
:::
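A minimal transfer-learning sketch in PyTorch (the torchvision ≥ 0.13 weights API is real; the choice of ResNet-18 and the 10-class head are illustrative assumptions):

```python
import torch.nn as nn
from torchvision import models

# Load a backbone pre-trained on ImageNet and freeze its feature representations.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False

# Replace the classifier head; only this layer is trained on the new task.
model.fc = nn.Linear(model.fc.in_features, 10)  # 10 = assumed number of target classes
```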
* Bilateral-Branch Network
![](https://i.imgur.com/KBjJuMy.png)
![](https://i.imgur.com/PsFDarz.png)

---

[Mitigating Dataset Bias (BMVC 2020 Keynote)](https://www.youtube.com/watch?v=HAfB9qvGfMM)

* Dataset bias
<img src="https://i.imgur.com/TdZ8MAf.png" alt="drawing" width="500"/>

![](https://i.imgur.com/Yt1RuYw.png)

* Techniques that help deal with data bias
    * Collect labelled data from the target domain
    * Better backbone CNNs
    * Batch Normalization ([Li'17](https://arxiv.org/pdf/1603.04779.pdf), [Chang'19])
    * Instance Normalization + Batch Normalization [Nam'19](https://proceedings.neurips.cc/paper/2018/file/018b59ce1fd616d874afad0f44ba338d-Paper.pdf)
    * Data Augmentation, MixMatch [Berthelot'19](https://arxiv.org/pdf/1905.02249.pdf)
    * Semi-supervised methods, such as pseudo-labeling [Zou'19](https://arxiv.org/pdf/1908.09822.pdf)
    * Domain Adaptation (this talk)
* ==**Adversarial domain alignment**==
    * Feature space
    * Pixel space

![](https://i.imgur.com/68isLg1.png)

### 10/24

* Pixel-space alignment
![](https://i.imgur.com/eyK0Lsh.png)
* Few-shot domain translation
Lots of unlabeled target data, but only 1-5 images of the target domain are available
![](https://i.imgur.com/hiNOFIZ.png)
* Disentangled features
![](https://i.imgur.com/1IrR44G.png)
![](https://i.imgur.com/UZdbch0.png)
![](https://i.imgur.com/2v2I57M.png)
* Weak Scene-level Alignment
![](https://i.imgur.com/kfVemRy.png)
![](https://i.imgur.com/ysbuiVR.png)
* Alignment that respects class boundaries
![](https://i.imgur.com/tiRXwYL.png)
* Category Shift
When the categories are not the same in the source and target domains
![](https://i.imgur.com/1txrbNc.png)
![](https://i.imgur.com/Vx9AqJF.png)

---

![](https://i.imgur.com/Ky27di6.png)

* Recognition of Static Pose
* Recognition of Dynamic Pose
* Pose Model
![](https://i.imgur.com/avxJUjP.png)
* Inverse Kinematics
![](https://i.imgur.com/2sV5fKb.png)
* Exploiting Temporal Dependence
![](https://i.imgur.com/WWJiVDJ.png)

### 10/31

* Recurrent Neural Networks (RNN)
![](https://i.imgur.com/ZzB1b9S.png)
![](https://i.imgur.com/rhXx7IQ.png)
![](https://i.imgur.com/aFAnoXo.png)
* RNN cell
![](https://i.imgur.com/zWRW9vC.png)

:::info
**The Problem of RNN: Short-term Memory**
If a sequence is long enough, RNNs have a hard time carrying information from earlier time steps to later ones.
**Long Short-Term Memory (LSTM)** was created as the solution to short-term memory. It has internal mechanisms called gates that can regulate the flow of information.
[RNN Notes](https://hackmd.io/3PzYYuBBTNCgRymLI2fUuw?view)
:::

* GRU (Gated Recurrent Unit)
![](https://i.imgur.com/KI0nbrg.png)
* Deep LSTM
![](https://i.imgur.com/DBOr9ss.png)
* Two-way LSTM
![](https://i.imgur.com/TAlAU9Z.png)
* Connectionist Temporal Classification (CTC)
![](https://i.imgur.com/vyKLsw3.png)

### 11/7

* Attention Model
![](https://i.imgur.com/uZb0cur.png)

$c$ is the context, and the $y_i$ are the "parts of the data" we are looking at.
$$
m_i = \tanh(W_{cm}c + W_{ym}y_i)
$$
The network computes the relevance scores $m_1, \dots, m_n$ with a tanh layer, then normalizes them into weights $s = \mathrm{softmax}(m_1, \dots, m_n)$, where
$$
\mathrm{softmax}(x_1, ..., x_n) = \left(\frac{e^{x_i}}{\sum_j e^{x_j}}\right)_i \\
z = \sum_i s_i y_i
$$
The output $z$ is the weighted arithmetic mean of all the $y_i$, where the weights represent the relevance of each $y_i$ given the context $c$.
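A minimal NumPy sketch of this tanh-based attention (the dimensions and random weights are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
d, n = 8, 5                   # feature dimension, number of parts y_i
c = rng.normal(size=d)        # context vector
Y = rng.normal(size=(n, d))   # the parts y_1..y_n, one per row
W_cm = rng.normal(size=d)     # maps the context to a scalar contribution
W_ym = rng.normal(size=d)     # maps each part to a scalar contribution

m = np.tanh(c @ W_cm + Y @ W_ym)  # m_i = tanh(W_cm·c + W_ym·y_i), shape (n,)
s = softmax(m)                    # attention weights
z = s @ Y                         # z = sum_i s_i * y_i, the weighted mean
```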
---

:::info
* CV Weekly
    - Generating video from text
    - DIFFUSIONDB: Dataset for Text-to-Image Generative Models
:::

![](https://i.imgur.com/5xiycMO.png)

* 3D data representation

<style>
.bf{
    font-weight: bold;
}
</style>
<table>
    <tr>
    <td> <p class="bf">Point Cloud</p> </td>
    <td> <p class="bf">Mesh</p> </td>
    </tr>
    <tr>
    <td> A point cloud is a set of data points in space, obtained by measuring a large number of points on the external surfaces of surrounding objects. </td>
    <td> A mesh is a collection of vertices, edges, and faces that defines the shape of a polyhedral object. The faces usually consist of triangles (triangle mesh), quadrilaterals, or other simple convex polygons. </td>
    </tr>
    <tr>
    <td> <p class="bf">Voxel</p> </td>
    <td> <p class="bf">Multi-View Images</p> </td>
    </tr>
    <tr>
    <td> A voxel represents a value on a regular grid in three-dimensional space. </td>
    <td> Multi-view images are multiple looks at the same target, e.g., from different viewing angles, perspectives, and so forth. </td>
    </tr>
</table>

![](https://i.imgur.com/twtSUpj.png)

* Deep Learning on Multi-view Representation
![](https://i.imgur.com/Od4I514.png)
* Challenge: point clouds have an irregular geometric form, and different orderings of the points must map to a consistent representation

![](https://i.imgur.com/YPs8L54.png)
![](https://i.imgur.com/OMmRxrI.png)

**Permutation invariance: Symmetric function**
$$
f(x_1, x_2, ..., x_n) \equiv f(x_{\pi_1}, x_{\pi_2}, ... x_{\pi_n}), x_i \in \mathbb{R}^D
$$
Examples:
$$
f(x_1, x_2, ..., x_n) = \max\{x_1, x_2, ..., x_n\} \\
f(x_1, x_2, ..., x_n) = x_1 + x_2 + ... + x_n
$$
![](https://i.imgur.com/pGBnssR.png)
![](https://i.imgur.com/haVjm0e.png)

**Input Alignment by Transformer Network**
![](https://i.imgur.com/D2Sd900.png)

* PointNet Architecture
![](https://i.imgur.com/pDIOtw0.png)

---

* Recap: RNN / LSTM [RNN Notes](https://hackmd.io/3PzYYuBBTNCgRymLI2fUuw?view)
* Transformer network
[Transformer Notes](https://hackmd.io/ba1UQdFqRAGnhN9_eTZeog)
[PyTorch Transformer](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)
* [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://zhuanlan.zhihu.com/p/266311690)
* Vision Transformer
![](https://i.imgur.com/WD0qtbs.jpg)

### 11/14

**Schedule adjustment**

| Dates | Topic |
| -------- | -------- |
| 11/21 | Invited Talks |
| 11/28 | Invited Talks |
| 12/05 | Midterm Presentation |
| 12/12 | Midterm Presentation |
| 12/19 | Invited Talks / Deep Generative Modeling |
| 12/26 | Final Examination |

* Homework 2: Transformer
* Homework 3: a 500-character reflection on the invited talks

---

* [Tokens-to-Token ViT: Training Vision Transformers from Scratch on Imagenet](https://zhuanlan.zhihu.com/p/359930253)
* [Mobile-Former: Bridging MobileNet and Transformer](https://zhuanlan.zhihu.com/p/412964831)
* [EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers](https://blog.51cto.com/shanglianlm/5550217)
* [End-to-End Object Detection with Transformers](https://allen108108.github.io/blog/2020/07/27/[%E8%AB%96%E6%96%87]%20End-to-End%20Object%20Detection%20with%20Transformers/)

### 11/21

#### Low-Light Image Enhancement

:::info
Peking University - Prof. 劉家瑛 (Jiaying Liu)
:::

* Research topics:
    * Image Reconstruction
    * Image/Video Coding
    * Image Generation
    * Video Analytics
* Low-Light Degradation
    * Intensive noise

**Problem: High-level vision in low-light scenarios**

* Representative work
    * Histogram equalization
    * Dehazing method (invert $\rightarrow$ dehaze $\rightarrow$ invert again)
    * Retinex Model (Retinex decomposition $S = R \cdot L$ / enhanced result $S_{enhance} = R \cdot L^{\frac{1}{\gamma}}$); a minimal sketch follows this list
    * Learning-Based Models (LLNet/LLCNN...)
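A minimal NumPy sketch of the Retinex-style enhancement above: given a decomposition $S = R \cdot L$, brighten by gamma-adjusting only the illumination. The Gaussian-blur illumination estimate and $\gamma = 2.2$ are crude illustrative assumptions, not the methods from the talk:

```python
import cv2
import numpy as np

def retinex_enhance(img, gamma=2.2, eps=1e-6):
    """img: float grayscale in [0, 1]. Returns R * L^(1/gamma)."""
    S = img.astype(np.float64) + eps
    # Crude illumination estimate: a heavy Gaussian blur of the image.
    L = cv2.GaussianBlur(S, (0, 0), sigmaX=15) + eps
    R = S / L                                  # reflectance, from S = R * L
    return np.clip(R * L ** (1.0 / gamma), 0.0, 1.0)
```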
* Low-Light Datasets for High-Level Tasks (KAIST / Exclusively Dark)
* [Deep Retinex Decomposition for Low-Light Enhancement](https://zhuanlan.zhihu.com/p/87384811)
    * Retinex Theory + Deep Learning
    * Dataset: LOL (LOw-Light)
* [Benchmarking Low-Light Image Enhancement and Beyond](https://zhuanlan.zhihu.com/p/467789757)
    * Paired datasets: LLNet
    * Unpaired datasets: cannot support model training
    * VE-LOL: evaluation of both low-level and high-level vision
    * **UG2 challenge**
* [HLA-Face: Joint High-Low Adaptation for Low Light Face Detection](https://blog.csdn.net/weixin_45709330/article/details/116375825)
    * Gaps between normal light and low light (pixel-level appearance / object-level semantics)
    * Considers joint low-level and high-level adaptation
* [Self-Aligned Concave Curve: Illumination Enhancement for Unsupervised Adaptation](https://arxiv.org/abs/2210.03792)
    * Training strategy: asymmetric self-supervised alignment

#### Multi-View 3D Modeling of Non-Diffuse Objects with Complex Materials

:::info
Australian National University - Prof. Hongdong Li
:::

* Research topics:
    * Computer Vision
    * Robotic Vision
    * Smart Car Project
    * City Modeling
    * Bionic Eyes Project
* [Multi-view 3D Reconstruction of a Texture-less Smooth Surface of Unknown Generic Reflectance](https://openaccess.thecvf.com/content/CVPR2021/papers/Cheng_Multi-View_3D_Reconstruction_of_a_Texture-Less_Smooth_Surface_of_Unknown_CVPR_2021_paper.pdf)
    * Vision-based 3D Shape Reconstruction
        * (Rigid Object / Scene) Structure from Motion
        * Lambertian / Non-Lambertian
    * Problem Setting: traditional Photometric Stereo problem
    * 3D computer vision $\leftrightarrow$ image inversion
    * **The rendering equation**
    * Solution: minimizing a suitable objective (loss) function (augmented Lagrangian method relaxation): image formation + surface regularization + relaxation penalty
* [Diffeomorphic Neural Surface Parameterization for 3D and Reflectance Recovery](https://dl.acm.org/doi/10.1145/3528233.3530741)
    * Shape deformation
    * Learning / training process: inverse graphics rendering
* Recap
    * Multi-view 3D reconstruction for objects with unknown materials.
    * Significantly outperforms SOTAs under unknown illuminations
    * Achieves similar accuracy to darkroom methods but is much more flexible
    * Robust to complex shapes and specular materials
    * Reconstructions can be easily plugged into rendering engines
    * Limitations: <span class="red">piecewise-smooth object shape assumption with simple topology; needs a strong flashlight (SNR); slow convergence</span>

#### Consistent, Empathetic and Prosocial Dialogues

:::info
Prof. Gunhee Kim
:::

* [ProsocialDialog: A Prosocial Backbone for Conversational Agents](https://arxiv.org/pdf/2205.12688.pdf)
    * [Anticipating safety issues in E2E Conversational AI: Framework and Tooling](https://arxiv.org/abs/2107.03451)
    * Datasets: DailyDialog / PersonaChat / EmpatheticDialogues... (all of them are biased towards positivity)
    * Classification models trained on GoEmotions
    * Canary / Prost
* [Will I Sound Like Me? Improving Persona Consistency in Dialogues through Pragmatic Self-Consciousness](https://arxiv.org/abs/2004.05816)
    * *Public self-consciousness* is the awareness of the self as a social object that can be observed and evaluated by others
    * Bayesian Rational Speech Acts framework, which was originally applied to improving the informativeness of referring expressions.
* [Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes](https://arxiv.org/abs/2109.08828)
    * Related work
        * Empathetic dialogue modeling
        * Emotion Cause (Pair) Extraction
        * Rational Speech Acts (RSA) framework

### 11/28

#### [Context Autoencoder for Scalable Self-Supervised Representation Pretraining](https://arxiv.org/abs/2202.03026)

:::info
Baidu computer vision expert - 王井东 (Jingdong Wang)
:::

- Vision Foundation Models
    - Big Data
    - Big Parameters
    - Big Tasks
    - Big Algorithms
    - Big Computation
- Representation Pretraining
    - Goal: learn an encoder mapping an image to a representation
    - Pretraining Task $\rightarrow$ Downstream Task
    - Scaling up the number of samples rules out supervised pretraining (semi-supervised / vision-language / self-supervised still work); scaling up the number of concepts additionally rules out semi-supervised pretraining (vision-language / self-supervised still work)
- Self-Supervised Representation Pretraining in Vision
    - Contrastive pretraining
    - Masked image modeling
    - Others
- CAE: representation pretraining aims to learn an encoder, <span class="red">mapping an image to a representation that can be transferred to downstream tasks.</span>
    - <span class="red">Regressor for masked image modeling $\rightarrow$ masked representation modeling:</span> make predictions for the masked patches from the visible patches in the encoded representation space, in order to solve the masked image modeling task.
    - The encoder is <span class="red">dedicated to</span> representation pretraining, and representation pretraining is done <span class="red">only by</span> the encoder.
    - The task-completion part (regressor and decoder) is <span class="red">separated</span> from the encoder.

<center>
<img src = "https://i.imgur.com/YtAg3d1.png">
<p>Figure 1: Context autoencoder</p>
</center>

- How does contrastive pretraining work?
    - How can the representations of random crops from the same original image be similar?
    - Speculation: the encoder extracts the representation of a <span class="red">part</span> of the object / the projector maps the part representation to the representation of the <span class="red">whole object</span>
    - The projected representations then agree
- What representations are learned?
    - Observation: what random crops have in common lies in the <span class="red">center</span> of the original image / the object in an ImageNet image lies in the <span class="red">center</span>
    - Conjecture: contrastive pretraining mainly <span class="red">learns the semantics of the center region</span>

**[GitHub repo.](https://github.com/lxtGH/CAE)**

<center>
<img src = "https://i.imgur.com/A9RqRId.png">
<p>Table 1: Pretraining quality evaluation</p>
</center>

#### Relational and Structural Vision with High-Order Feature Transforms

:::info
POSTECH - Minsu Cho
:::

**Match and transfer**

- Relational Self-Attention: What's Missing in Attention for Video Understanding
- [SPair-71k: A Large-scale Benchmark for Semantic Correspondence](http://cvlab.postech.ac.kr/research/SPair-71k/)
- [Convolutional Hough Matching Networks](https://arxiv.org/abs/2103.16831)
- [TransforMatcher: Match-to-Match Attention for Semantic Correspondence](https://arxiv.org/abs/2205.11634)
- Few-shot image segmentation
    - [Hypercorrelation Squeeze for Few-Shot Segmentation](https://openaccess.thecvf.com/content/ICCV2021/papers/Min_Hypercorrelation_Squeeze_for_Few-Shot_Segmentation_ICCV_2021_paper.pdf)
- Structure of correspondence in space
    - [Learning to Discover Reflection Symmetry via Polar Matching Convolution](https://arxiv.org/abs/2108.12952)
- Motion-aware video recognition
    - [Learning Self-Similarity in Space and Time as Generalized Motion](https://arxiv.org/abs/2102.07092)
- Relational Self-Attention
    - [Relational Self-Attention: What's Missing in Attention for Video Understanding](https://arxiv.org/abs/2111.01673)
- Summary
    - Real-world vision systems need to leverage relational and structural patterns of images and videos for systematic understanding.
    - High-order convolution or self-attention is effective for capturing relational structures by considering geometric patterns of correlation.
    - Learning relational structures is crucial for minimally-supervised recognition and structural perception of images and videos.

#### AURORA - Empirical Bayes from Replicates

:::info
Stanford University - Dennis L. Sun
:::

- [Empirical Bayes mean estimation with nonparametric errors via order statistic regression on replicated data](https://arxiv.org/abs/1911.05970)
- Estimate some quantity $\mu_i$ from noisy observations $\textbf{Z} = \{Z_1, ... Z_N\}$.
- Empirical Bayes: first estimate $A$ from the data, then plug it into the prior.
    - Prior: $G = \mathcal{N}(0, A)$
    - Likelihood: $F(\cdot \ | \ \mu_i) = \mathcal{N}(\mu_i, \sigma^2)$
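A minimal sketch of normal-normal empirical Bayes shrinkage under the model above (the moment-based estimate of $A$ and the simulated data are illustrative assumptions, not the AURORA method itself):

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2, A_true = 1000, 1.0, 4.0
mu = rng.normal(0.0, np.sqrt(A_true), N)   # true means, mu_i ~ N(0, A)
Z = rng.normal(mu, np.sqrt(sigma2))        # noisy observations, Z_i ~ N(mu_i, sigma^2)

# Empirical Bayes: estimate the prior variance A from the data itself
# (method of moments), since Var(Z_i) = A + sigma^2 under the model.
A_hat = max(Z.var() - sigma2, 0.0)

# Posterior mean shrinks each observation toward the prior mean 0.
mu_hat = (A_hat / (A_hat + sigma2)) * Z
print("shrinkage factor:", A_hat / (A_hat + sigma2))
```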
### 12/21

**Deploying CV at Edge - From Recent Vision Transformer to Future Metaverse**

##### Computing and AI Technology Group, MediaTek Inc.

#### Part 1: Overview

* NeurIPS
* Marching toward the metaverse era

#### Part 2: Deploying Vision Transformers at the edge

* Computer vision research evolves rapidly
* How do we use it in our daily devices?

:::info
鄭嘉珉, Senior Manager, MTK
:::

Focuses more on experience sharing, especially on going from CV research to production at MTK.

#### NeurIPS

[Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding](https://arxiv.org/pdf/2205.11487.pdf)

Important topics:
* Adversarial robustness
* Federated learning
* Diffusion models
* NeRF (Neural Radiance Fields)
* NeMF (Neural Motion Fields)
* CCNeRF (Compressible-composable NeRF)
* GNNs

#### Metaverse

##### Challenges

* High computing
* Low latency
* Low power
* Tiny form factor
* Display:
    * immersive display experience
* Graphics
* Motion-to-photon latency
    * VR: under 20 ms
    * AR: under 5 ms
* Concurrent multiple DNN tasks

DNN compute demand grows faster than edge processing power, which in turn grows faster than Moore's law.

---

:::info
姜政銘 (Jimmy Chiang)
:::

* Edge AI in MTK
* Vision Transformer
* The AI talent MTK values
* Advice for those about to enter the workforce

Edge AI key success factors: noise reduction, super-resolution

#### The CAI department

Tasks given to the AI-ALG (algorithm) team:
* AI CV
* AI NLP
* AI Network
* AI Methodology
* AI for 5G
* AI Architecture

AI-SW:
* Bridging the stack: gpu → cuda → pytorch → python code
* NeuroPilot SW bridges to the GPU on the phone

AI-HW:
* How to design a high-efficiency APU within a limited cost budget

#### What does it take to run a trained model on a phone?

1. How to integrate NAS and quantization?
2. How to export to a format the platform supports?
3. What if the result is extremely slow?

#### Vision Transformer

1. Patch embedding
    * Operation
    * Challenges in the APU
        * memory access is one of the bottlenecks in the APU
        * patch-wise processing is like the 'sliding window' in convolution
        * patch size
2. Multi-head self-attention
    * Challenges
        * global self-attention requires quadratic computational complexity
        * the biggest challenge in the APU → over 95% of the latency in ViT
        * matrix multiplication
        * softmax

Summary:
* Global attention has better quality but suffers from MatMul and Softmax
* Cross-covariance attention is favorable for high resolutions and fewer channels

#### Softmax Complexity

* Softmax: the naive formula does not work due to numerical stability (overflow)
* Most AI accelerators support float16 instead of float32, since the smaller data format gets better PPA (Performance, Power, Area)
* What happens when using float16? UNDERFLOW

#### Norm-Layer Challenge

* Overflow occurs after Mul, which computes the variance $\sigma$
* Underflow occurs in Rsqrt

#### MLP-GELU Challenge Overview

* The GELU activation is widely used in Transformers
* It is impractical to implement the error function in an AI accelerator!

#### What papers might not tell you, but matters in edge AI

* Low MACs/FLOPs do not imply high efficiency
* Accuracy in a paper does not guarantee accuracy on an edge device
* Papers report performance on mobile CPUs and GPUs
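Returning to the softmax stability issue above: a minimal NumPy sketch of the standard max-subtraction trick that avoids overflow (the float16 underflow of tiny exponentials noted in the talk is a separate problem); the input values are illustrative:

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)               # overflows: exp(x) > float32 max once x > ~88
    return e / e.sum()

def softmax_stable(x):
    e = np.exp(x - x.max())     # largest exponent becomes exp(0) = 1: no overflow
    return e / e.sum()

x = np.array([80.0, 90.0, 100.0], dtype=np.float32)
print(softmax_naive(x))         # inf/inf terms -> nan entries
print(softmax_stable(x))        # well-defined probabilities
```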
#### Career advice

* General principles
    - Fundamentals
    - Teamwork and communication
    - Curiosity
    - Independent thinking
    - A learning mindset
    - What kind of talent? People in algorithms, hardware, and software
    - Send out your resume and prepare your slides

## Paper list

| Paper | Conference / Year |
| -------- | -------- |
| You Only Cut Once: Boosting Data Augmentation with a Single Cut | ICML / 2022 |
| Scaled-YOLOv4: Scaling Cross Stage Partial Network | CVPR / 2021 |
| MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation | CVPR / 2022 |
| Taming Transformers for High-Resolution Image Synthesis | CVPR / 2021 |
| BEiT: BERT Pre-Training of Image Transformers | ICLR / 2022 |
| GAN-Supervised Dense Visual Alignment | CVPR / 2022 |
| Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling | CVPR / 2022 |
| FMODetect: Robust Detection of Fast Moving Objects | ICCV / 2021 |
| Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ICCV / 2021 |
| Boosting Crowd Counting via Multifaceted Attention | CVPR / 2022 |
| Focal and Global Knowledge Distillation for Detectors | CVPR / 2022 |
| VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution | CVPR / 2022 |
| RefineFace: Refinement Neural Network for High Performance Face Detection | TPAMI / 2021 |
| Restormer: Efficient Transformer for High-Resolution Image Restoration | CVPR / 2022 (Oral) |
| Learning the Degradation Distribution for Blind Image Super-Resolution | CVPR / 2022 |
| Pose Recognition With Cascade Transformers | CVPR / 2021 |
| Deep Constrained Least Squares for Blind Image Super-Resolution | CVPR / 2022 |
| ACPL: Anti-curriculum Pseudo-labelling for Semi-supervised Medical Image Classification | CVPR / 2022 |
| CoMoGAN: Continuous Model-guided Image-to-image Translation | CVPR / 2021 |
| TrackFormer: Multi-Object Tracking with Transformers | CVPR / 2022 |
| Contrastive Embedding for Generalized Zero-Shot Learning | CVPR / 2021 |
| Masked Autoencoders Are Scalable Vision Learners | CVPR / 2022 |
| Crafting Better Contrastive Views for Siamese Representation Learning | CVPR / 2022 |
| GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields | CVPR / 2021 |
| Scaling Vision Transformers | CVPR / 2022 |
| Image-Adaptive YOLO for Object Detection in Adverse Weather Conditions | AAAI / 2022 |
| EditGAN: High-Precision Semantic Image Editing | NeurIPS / 2021 |

## Final exam (Open anything)

1. Local Binary Patterns (15%)
    - How is it computed?
    - Given three image patches, compare their similarity to the original image.
2. How to compute the attention output $Z$; the formula and the $K, V, Q$ matrices are given in the problem. (20%) A worked sketch follows at the end of this note.
3. Given the paper [MetaFormer is Actually What You Need for Vision](https://openaccess.thecvf.com/content/CVPR2022/papers/Yu_MetaFormer_Is_Actually_What_You_Need_for_Vision_CVPR_2022_paper.pdf): how does it differ from the original Transformer, and how does it improve efficiency? (20%)
4. Given the paper [Disentangling 3D Pose in A Dendritic CNN for Unconstrained 2D Face Alignment](https://openaccess.thecvf.com/content_cvpr_2018/papers/Kumar_Disentangling_3D_Pose_CVPR_2018_paper.pdf)
    - How does [3D STN](https://arxiv.org/pdf/1707.05653.pdf) differ from this paper, and what are the pros and cons of each? (15%)
    - How does hard-sample mining increase a model's robustness? Refer to [Hard Sample Mining](https://arxiv.org/pdf/1606.04232.pdf). (10%)
5. Course feedback (20%)

## Reference

[Request an electronic copy of the textbook](https://szeliski.org/Book/)
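For exam question 2, a minimal NumPy sketch of scaled dot-product attention, assuming the standard Transformer formula $Z = \mathrm{softmax}(QK^\top/\sqrt{d_k})V$ (the matrices here are illustrative, not the exam's actual values):

```python
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # row-wise softmax
    return w @ V                                      # Z: weighted sum of values

Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
print(attention(Q, K, V))
```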