Po-Chuan Chen
# 應用電腦視覺 (Applied Computer Vision) - 鄭文皇 (2022 Fall)

###### tags: `NYCU-2022-Fall`

## Class info.

[Course information](https://timetable.nycu.edu.tw/?r=main/crsoutline&Acy=111&Sem=1&CrsNo=535232&lang=zh-tw)

1. Learn the concepts and theories of Computer Vision (CV) and how they can be applied in practice to solve real-world problems.
2. Also cover the latest topics in current CV literature, such as self-supervised learning for CV applications.

Grading: homework 50%, midterm presentation 20%, final exam 30%.
With invited speakers, adjusted to: homework 40%, midterm presentation 20%, final exam 30%, talk attendance 10%.

<style>
.red{ color: red; }
.blue{ color: #87ceeb; }
</style>

## Date

### 9/12

Computer Vision: feature engineering + model learning $\rightarrow$ deep learning
- feature engineering: $f = f(I)$
- model learning: $y = g(f,\theta)$
- deep learning: $y = g(I,\theta)$

* Feature Detector
  A subsystem of the visual system that detects the presence or absence of certain features in the visual scene.

Image data from the real world often display complex structure.
**In general, computer vision does not work. (except in certain cases)**

* Intra-class Variability
  The same image class can look very different from photo to photo.

### 9/19

* intensity: color brightness, $\frac{R+G+B}{3}$

<table> <tr> <td> <img src="https://i.imgur.com/S7fi26d.png" alt="drawing" width="400"/> </td> <td> <img src="https://i.imgur.com/qZsEGsQ.png" alt="drawing" width="400"/> </td> </tr> </table>

In comparison to global features, local features are more robust to occlusion and clutter.

* Properties of an Ideal Local Feature
  1. Repeatability
  2. Distinctiveness / Informativeness (when the local structure changes, the feature should change as well)
  3. Locality
  4. Quantity
  5. Accuracy
  6. Efficiency

* [Sobel operator](https://zh.m.wikipedia.org/zh-tw/%E7%B4%A2%E8%B2%9D%E7%88%BE%E7%AE%97%E5%AD%90)
  Before designing an edge detector:
  1. Use derivatives (in the x and y directions) to define a location with high gradient.
  2. Smooth to reduce noise prior to taking the derivative.

* Edge Detector in 1D & 2D

<table> <tr> <td> <img src="https://i.imgur.com/QfolsiX.png" alt="drawing" width="500"/> </td> <td> <img src="https://i.imgur.com/mvY53yA.png" alt="drawing" width="500"/> </td> </tr> </table>

* [Convolution](https://iter01.com/480243.html)
  Flip $g$, then slide it across $f$ by the shift $\tau$.
  Continuous form: $(f*g)(n)=\int^{\infty}_{-\infty}f(\tau)g(n-\tau)d\tau$
  Discrete form: $(f*g)(n)=\sum_{\tau=-\infty}^{\infty}f(\tau)g(n-\tau)$
<br>

* Canny Edge Detection
  [Implementation write-up](https://medium.com/@pomelyu5199/canny-edge-detector-%E5%AF%A6%E4%BD%9C-opencv-f7d1a0a57d19)
  A large $\sigma$ detects large-scale edges; a small $\sigma$ detects fine features.

* Image Gradient
  ![](https://i.imgur.com/AloAd9o.png)
  Magnitude: $\| \nabla f \|=\sqrt{(\frac{\partial f}{\partial x})^2 + (\frac{\partial f}{\partial y})^2}$
  Direction: $\theta = \tan^{-1}(\frac{\partial f}{\partial y} / \frac{\partial f}{\partial x})$
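As a quick illustration of the gradient formulas above, here is a minimal NumPy/SciPy sketch (my own illustration, not course-provided code) that convolves an image with Sobel kernels and computes the gradient magnitude and direction:

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel kernels for horizontal (x) and vertical (y) derivatives.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def image_gradient(img):
    """Return gradient magnitude and direction (radians) of a 2-D grayscale image."""
    # convolve2d flips the kernel (true convolution, as in the formula above);
    # this only flips the sign of gx/gy relative to correlation, the magnitude is unaffected.
    gx = convolve2d(img, SOBEL_X, mode="same", boundary="symm")
    gy = convolve2d(img, SOBEL_Y, mode="same", boundary="symm")
    magnitude = np.sqrt(gx**2 + gy**2)
    direction = np.arctan2(gy, gx)   # theta = atan((df/dy) / (df/dx))
    return magnitude, direction

# Toy example: a vertical step edge gives a strong horizontal gradient.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
mag, ang = image_gradient(img)
print(mag.max(), np.rad2deg(ang[4, 4]))
```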
* Harris Corner Detector
  [Implementation write-up](https://www.796t.com/p/1343014.html)
  Invariant to large **rotation** and **translation**, but ==not invariant to image scale== — it does not tell us the scale of the corner.

**Each zero crossing corresponds to an edge.**

<img src="https://i.imgur.com/xTI0KVQ.png" alt="drawing" width="500"/>
<br><br>

> Impulse response

<img src="https://i.imgur.com/TWWezna.png" alt="drawing" width="500"/>
<br><br>

[Laplace operator, Laplacian](https://zh.wikipedia.org/zh-tw/%E6%8B%89%E6%99%AE%E6%8B%89%E6%96%AF%E7%AE%97%E5%AD%90)

<table> <tr> <td> <img src="https://i.imgur.com/IoHzxPT.png" alt="drawing" width="450"/> </td> <td> <img src="https://i.imgur.com/rows5Hz.png" alt="drawing" width="450"/> </td> </tr> </table>

* SIFT Algorithm
  ![](https://i.imgur.com/cGS2ocA.png)
  ![](https://i.imgur.com/0mXhV3H.png)
  ![](https://i.imgur.com/AFkWfrf.png)
  The Gaussian on the right is twice that of the one on the left.
  ![](https://i.imgur.com/tjBo2gq.png)

### 9/26

:::info
**Appendix-SIFT**
![](https://i.imgur.com/NRwfNo1.png)
:::

* Keypoint Localization
  ![](https://i.imgur.com/F71YuBx.png)
  ![](https://i.imgur.com/AzRjEX2.png)
  ![](https://i.imgur.com/QHaS0DX.png)

* SIFT Descriptor
  ![](https://i.imgur.com/2GCYlTv.png)
  [OpenCV SIFT](https://docs.opencv.org/3.4/da/df5/tutorial_py_sift_intro.html)

* HoG (Histogram of Oriented Gradients)
  [HoG](http://alex-phd.blogspot.com/2014/03/hog.html)

<table> <tr> <td> <img src="https://i.imgur.com/8jI7586.png" alt="drawing" width="500"/> </td> <td> <img src="https://i.imgur.com/SRqCNU2.png" alt="drawing" width="500"/> </td> </tr> <tr> <td> <img src="https://i.imgur.com/nUgn10j.png" alt="drawing" width="500"/> </td> <td> <img src="https://i.imgur.com/aKZ1Ca7.png" alt="drawing" width="500"/> </td> </tr> <tr> <td> <img src="https://i.imgur.com/rjAgLxH.png" alt="drawing" width="500"/> </td> </tr> </table>

* LBP (Local Binary Patterns)
  LBP is a non-parametric descriptor whose aim is to efficiently summarize the local structures of images.
  ![](https://i.imgur.com/T1589eR.png)
  ![](https://i.imgur.com/bqRLV0u.png)

* Types of Object Detection
  * Detection of specific categories
  * Detection of specific instances

![](https://i.imgur.com/UEKJ7qM.png)

---

**Object Classification**

* [Image Classification Architectures review](https://medium.com/@14prakash/image-classification-architectures-review-d8b95075998f)
* ImageNet Dataset
  * ImageNet with roughly 1000 images in each of 1000 categories.
* AlexNet
  ![](https://i.imgur.com/h7q82R7.png)

---

**Semantic Segmentation**

* Sliding Window
  ![](https://i.imgur.com/wsAJyhd.png)
* Downsampling & upsampling (avoids the cost of full-resolution convolutions)
  ![](https://i.imgur.com/A6FzDaX.png)
* [U-Net](https://ithelp.ithome.com.tw/articles/10240314)

---

**<span class="red">There is no universal agreement in the literature on the definitions of various vision subtasks</span>**

* Two Main Categories for Generic Object Detection
  ![](https://i.imgur.com/uXpPvPG.png)
* [Region Proposals](https://medium.com/curiosity-and-exploration/%E5%8F%96%E5%BE%97-region-proposals-selective-search-%E5%90%AB%E7%A8%8B%E5%BC%8F%E7%A2%BC-be0aa5767901)
* [R-CNN & Fast R-CNN](https://zhuanlan.zhihu.com/p/40986674)
* [[Paper] EDF-SSD: An Improved Feature Fused SSD for Object Detection](https://jackson1998.medium.com/paper-edf-ssd-an-improved-feature-fused-ssd-for-onjection-detection-213c4566745)

### 10/3

Convolution kernel size: height × width × depth

> Reduces computation along the depth (channel) dimension. Not really a 1×1 convolution → it's a 1×1×C convolution.

![](https://i.imgur.com/llKweEN.png)

* A Fire module comprises a squeeze convolution layer (which has only 1x1 filters), feeding into an expand layer that has a mix of 1x1 and 3x3 convolution filters.
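A minimal PyTorch sketch of a SqueezeNet-style Fire module as described above (the layer sizes are illustrative assumptions, not values from the lecture):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Squeeze (1x1) -> expand (1x1 and 3x3 in parallel), outputs concatenated."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))                 # reduce channel depth first
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Example: 96 input channels squeezed to 16, expanded back to 64 + 64 = 128.
y = Fire(96, 16, 64, 64)(torch.randn(1, 96, 56, 56))
print(y.shape)  # torch.Size([1, 128, 56, 56])
```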
:::info
Top-1 ImageNet accuracy: only one guess is allowed.
Top-5 ImageNet accuracy: five guesses are allowed.
:::

![](https://i.imgur.com/ohuJBu0.png)

> Skip connections do not only skip one layer

The advantage of adding this type of **skip connection** is that if any layer hurts the performance of the architecture, it can be skipped by regularization. This allows training very deep neural networks without the problems caused by vanishing/exploding gradients. In conclusion, ResNets are one of the most efficient neural network architectures, as they help in **maintaining a low error rate much deeper in the network.**

* DenseNet
  ![](https://i.imgur.com/EJaIlOp.png)
* [Feature Pyramid Networks](https://ivan-eng-murmur.medium.com/%E7%89%A9%E4%BB%B6%E5%81%B5%E6%B8%AC-s8-feature-pyramid-networks-%E7%B0%A1%E4%BB%8B-99b676245b25)
  ![](https://i.imgur.com/eE1pm5S.png)
* [A Simple yet Effective Approach for Identifying Unexpected Road Obstacles](https://zhuanlan.zhihu.com/p/415220541)
* [Deep Learning for Generic Object Detection: A Survey](https://www.796t.com/content/1545903385.html)

---

* Conventional two-stage solutions adopt the detect-then-segment approach → **<span class="red">Slow</span>**
* Focus on single-stage instance segmentation

![](https://i.imgur.com/RHemaxC.png)

* Local-mask-based Methods
  * Contours with Explicit Encoding
    * ExtremeNet (four extreme points plus one center point per object)
      The center point can also be recovered from the four directions. (In practice, there may be more than one extreme point along a direction.)
    * PolarMask: uses rays at constant angle intervals from the center to describe the contour.
    * FourierNet: a contour shape decoder using the Fourier transform
    * Compact Mask Encoding

:::info
**Contours with Explicit Encoding**
Pros: fast at inference and easy to optimize.
Cons: cannot depict the mask precisely and cannot describe objects that have holes in the center.
:::

* Global-mask-based Methods
  * YOLACT: attempting real-time instance segmentation
    ![](https://i.imgur.com/u79wRiw.png)
  * BlendMask

### 10/10

National Day holiday, no class.

### 10/17

* Challenge of Long-Tailed Visual Recognition
  ![](https://i.imgur.com/6MIP3q1.png)
* Loss functions
  * MSE: $$f^* = \arg \min_f \mathbb{E}_{x,y \sim p_{data}} \| y - f(x) \|^2$$
  * MAE: $$f^* = \arg \min_f \mathbb{E}_{x,y \sim p_{data}} \| y - f(x) \|_1$$
  * [Cross Entropy](https://zh.wikipedia.org/zh-tw/%E4%BA%A4%E5%8F%89%E7%86%B5): $$L = -\frac{1}{m} \sum_{i=1}^m y_i \cdot \ln(\hat{y}_i)$$

![](https://i.imgur.com/1bPlXyM.png)

* Solutions in the Literature for Long-Tailed Visual Recognition
  * Re-sampling:
    * over-sampling (adding repetitive data) for the minority classes
    * under-sampling (removing data) for the majority classes
  * Re-weighting: $$L = -\sum^{\mathcal{C}}_{i=1} w_i y_i \log p_i$$
  * Class-Balanced Loss $$ \mathrm{CB}(\mathbf{p}, y) = \frac{1}{E_{n_{y}}} \mathcal{L} (\mathbf{p}, y) = \frac{1 - \beta}{1 - \beta^{n_y}} \mathcal{L}(\mathbf{p}, y) $$
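A minimal sketch (my own illustration, not course code) of how the class-balanced weight $(1-\beta)/(1-\beta^{n_y})$ can be plugged into a weighted cross-entropy:

```python
import torch
import torch.nn.functional as F

def class_balanced_weights(samples_per_class, beta=0.9999):
    """Effective-number re-weighting: w_c proportional to (1 - beta) / (1 - beta^{n_c})."""
    n = torch.as_tensor(samples_per_class, dtype=torch.float)
    effective_num = 1.0 - torch.pow(torch.tensor(beta), n)
    weights = (1.0 - beta) / effective_num
    # normalize so the weights sum to the number of classes (a common convention)
    return weights * len(samples_per_class) / weights.sum()

# Toy long-tailed setup: head class has 1000 samples, tail class has 10.
w = class_balanced_weights([1000, 100, 10])
print(w)  # the tail class gets the largest weight

# Plug into a weighted cross-entropy loss.
logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
loss = F.cross_entropy(logits, labels, weight=w)
```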
* re-balancing = re-sampling + re-weighting

![](https://i.imgur.com/rfCfgLI.png)
![](https://i.imgur.com/WhUeC18.png)

1. Feature extractor
2. Classifier

![](https://i.imgur.com/6ycKe2S.png)

:::info
What is transfer learning?
Transfer learning is about **leveraging feature representations from a pre-trained model**, so you don't have to train a new model from scratch. The pre-trained models are usually trained on massive datasets that are a standard benchmark in the computer vision field.
:::

* Bilateral-Branch Network
  ![](https://i.imgur.com/KBjJuMy.png)
  ![](https://i.imgur.com/PsFDarz.png)

---

[Mitigating Dataset Bias (BMVC 2020 Keynote)](https://www.youtube.com/watch?v=HAfB9qvGfMM)

* Dataset bias
  <img src="https://i.imgur.com/TdZ8MAf.png" alt="drawing" width="500"/>
  ![](https://i.imgur.com/Yt1RuYw.png)
* Techniques that help deal with data bias
  * Collect labelled data from the target domain
  * Better backbone CNNs
  * Batch Normalization ([Li'17](https://arxiv.org/pdf/1603.04779.pdf), [Chang'19])
  * Instance Normalization + Batch Normalization [Nam'19](https://proceedings.neurips.cc/paper/2018/file/018b59ce1fd616d874afad0f44ba338d-Paper.pdf)
  * Data Augmentation, MixMatch [Berthelot'19](https://arxiv.org/pdf/1905.02249.pdf)
  * Semi-supervised methods, such as pseudo-labeling [Zou'19](https://arxiv.org/pdf/1908.09822.pdf)
  * Domain Adaptation (this talk)
* ==**Adversarial domain alignment**==
  * Feature space
  * Pixel space

![](https://i.imgur.com/68isLg1.png)

### 10/24

* Pixel-space alignment
  ![](https://i.imgur.com/eyK0Lsh.png)
* Few-shot domain translation
  Lots of unlabeled target data, but only 1-5 images of the target domain
  ![](https://i.imgur.com/hiNOFIZ.png)
* Disentangled features
  ![](https://i.imgur.com/1IrR44G.png)
  ![](https://i.imgur.com/UZdbch0.png)
  ![](https://i.imgur.com/2v2I57M.png)
* Weak Scene-level Alignment
  ![](https://i.imgur.com/kfVemRy.png)
  ![](https://i.imgur.com/ysbuiVR.png)
* Alignment that respects class boundaries
  ![](https://i.imgur.com/tiRXwYL.png)
* Category Shift
  When categories aren't the same in the source and target domains
  ![](https://i.imgur.com/1txrbNc.png)
  ![](https://i.imgur.com/Vx9AqJF.png)

---

![](https://i.imgur.com/Ky27di6.png)

* Recognition of Static Pose
* Recognition of Dynamic Pose
* Pose Model
  ![](https://i.imgur.com/avxJUjP.png)
* Inverse Kinematics
  ![](https://i.imgur.com/2sV5fKb.png)
* Exploiting Temporal Dependence
  ![](https://i.imgur.com/WWJiVDJ.png)

### 10/31

* Recurrent Neural Networks (RNN)
  ![](https://i.imgur.com/ZzB1b9S.png)
  ![](https://i.imgur.com/rhXx7IQ.png)
  ![](https://i.imgur.com/aFAnoXo.png)
* RNN cell
  ![](https://i.imgur.com/zWRW9vC.png)

:::info
**The Problem of RNN: Short-term Memory**
If a sequence is long enough, RNNs have a hard time carrying information from earlier time steps to later ones.
**Long Short-Term Memory (LSTM)** was created as a solution to short-term memory. It has internal mechanisms called gates that regulate the flow of information.
[RNN Notes](https://hackmd.io/3PzYYuBBTNCgRymLI2fUuw?view)
:::

* GRU (Gated Recurrent Unit)
  ![](https://i.imgur.com/KI0nbrg.png)
* Deep LSTM
  ![](https://i.imgur.com/DBOr9ss.png)
* Two-way LSTM
  ![](https://i.imgur.com/TAlAU9Z.png)
* Connectionist Temporal Classification (CTC)
  ![](https://i.imgur.com/vyKLsw3.png)

### 11/7

* Attention Model
  ![](https://i.imgur.com/uZb0cur.png)
  $c$ is the context, and the $y_i$ are the "parts of the data" we are looking at.
  $$ m_i = \tanh(W_{cm}c + W_{ym}y_i) $$
  The network computes $m_1, \dots, m_n$ with a tanh layer.
  $$ \mathrm{softmax}(x_1, ..., x_n) = \left(\frac{e^{x_i}}{\sum_j e^{x_j}}\right)_i \\ z = \sum_i s_i y_i $$
  The output $z$ is the weighted arithmetic mean of all the $y_i$, where the weights represent the relevance of each $y_i$ according to the context $c$.
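A minimal NumPy sketch of this soft-attention read-out (my own illustration; the scoring vector `w` that turns each $m_i$ into a scalar before the softmax is an assumption, since the slide only gives the tanh layer and the softmax):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
d_c, d_y, d_m, n = 4, 6, 5, 3        # context dim, item dim, hidden dim, number of items

W_cm = rng.standard_normal((d_m, d_c))
W_ym = rng.standard_normal((d_m, d_y))
w    = rng.standard_normal(d_m)      # assumed scoring vector: s_i comes from w . m_i

c = rng.standard_normal(d_c)         # context
Y = rng.standard_normal((n, d_y))    # the y_i we attend over

M = np.tanh(W_cm @ c + Y @ W_ym.T)   # m_i = tanh(W_cm c + W_ym y_i), shape (n, d_m)
s = softmax(M @ w)                   # attention weights, sum to 1
z = s @ Y                            # z = sum_i s_i y_i (weighted mean of the y_i)
print(s, z.shape)
```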
---

:::info
* CV Weekly
  - Generate video from text
  - DIFFUSIONDB: Dataset for Text-to-Image Generative Models
:::

![](https://i.imgur.com/5xiycMO.png)

* 3D data representation

<style> .bf{ font-weight: bold; } </style>

<table> <tr> <td> <p class="bf">Point Cloud</p> </td> <td> <p class="bf">Mesh</p> </td> </tr> <tr> <td> A point cloud is a set of data points in space, which measures a large number of points on the external surfaces of objects around them. </td> <td> A mesh is a collection of vertices, edges and faces that defines the shape of a polyhedral object. The faces usually consist of triangles (triangle mesh), quadrilaterals, or other simple convex polygons. </td> </tr> <tr> <td> <p class="bf">Voxel</p> </td> <td> <p class="bf">Multi-View Images</p> </td> </tr> <tr> <td> A voxel represents a value on a regular grid in three-dimensional space. </td> <td> Multi-view images are multiple looks of the same target, e.g., at different viewing angles, perspectives, and so forth. </td> </tr> </table>

![](https://i.imgur.com/twtSUpj.png)

* Deep Learning on Multi-view Representation
  ![](https://i.imgur.com/Od4I514.png)
* Challenge: handling the irregular geometric form of point clouds — different orderings of the points must map to a consistent representation.
  ![](https://i.imgur.com/YPs8L54.png)
  ![](https://i.imgur.com/OMmRxrI.png)

**Permutation invariance: Symmetric function** (a minimal sketch follows after the PointNet architecture figure below)
$$ f(x_1, x_2, ..., x_n) \equiv f(x_{\pi_1}, x_{\pi_2}, ... x_{\pi_n}), x_i \in \mathbb{R}^D $$
Examples:
$$ f(x_1, x_2, ..., x_n) = \max\{x_1, x_2, ..., x_n\} \\ f(x_1, x_2, ..., x_n) = x_1 + x_2 + ... + x_n $$

![](https://i.imgur.com/pGBnssR.png)
![](https://i.imgur.com/haVjm0e.png)

**Input Alignment by Transformer Network**
![](https://i.imgur.com/D2Sd900.png)

* PointNet Architecture
  ![](https://i.imgur.com/pDIOtw0.png)
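To make the symmetric-function idea concrete, here is a tiny PyTorch sketch (my own illustration, not the actual PointNet code) of a shared per-point MLP followed by a max-pool, which makes the global feature invariant to the ordering of the input points:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + max-pool: f(x_1..x_n) = head(max_i MLP(x_i))."""
    def __init__(self, in_dim=3, feat_dim=64, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(          # applied to every point independently
            nn.Linear(in_dim, 32), nn.ReLU(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, points):                    # points: (batch, n_points, 3)
        per_point = self.point_mlp(points)        # (batch, n_points, feat_dim)
        global_feat = per_point.max(dim=1).values # symmetric max over the point axis
        return self.head(global_feat)

net = TinyPointNet()
pts = torch.randn(2, 128, 3)
perm = torch.randperm(128)
# Shuffling the points does not change the output.
print(torch.allclose(net(pts), net(pts[:, perm, :]), atol=1e-6))  # True
```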
---

* Recap RNN / LSTM
  [RNN Notes](https://hackmd.io/3PzYYuBBTNCgRymLI2fUuw?view)
* Transformer network
  [Transformer Notes](https://hackmd.io/ba1UQdFqRAGnhN9_eTZeog)
  [PyTorch Transformer](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)
* [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://zhuanlan.zhihu.com/p/266311690)
* Vision Transformer
  ![](https://i.imgur.com/WD0qtbs.jpg)

### 11/14

**Schedule adjustments**

| Dates | Topic |
| -------- | -------- |
| 11/21 | Invited Talks |
| 11/28 | Invited Talks |
| 12/05 | Midterm Presentation |
| 12/12 | Midterm Presentation |
| 12/19 | Invited Talks / Deep Generative Modeling |
| 12/26 | Final Examination |

* Homework 2: Transformer
* Homework 3: a 500-character reflection on the invited talks

---

* [Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet](https://zhuanlan.zhihu.com/p/359930253)
* [Mobile-Former: Bridging MobileNet and Transformer](https://zhuanlan.zhihu.com/p/412964831)
* [EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers](https://blog.51cto.com/shanglianlm/5550217)
* [End-to-End Object Detection with Transformers](https://allen108108.github.io/blog/2020/07/27/[%E8%AB%96%E6%96%87]%20End-to-End%20Object%20Detection%20with%20Transformers/)

### 11/21

#### Low-Light Image Enhancement (暗光影像增強計算)

:::info
Peking University – Prof. 劉家瑛 (Jiaying Liu)
:::

* Research topics:
  * Image Reconstruction
  * Image/Video Coding
  * Image Generation
  * Video Analytics
* Low-Light Degradation
  * Intensive noise

**Problem: High-level vision in low-light scenarios**

* Representative work
  * Histogram equalization
  * Dehazing method (invert $\rightarrow$ dehaze $\rightarrow$ invert again)
  * Retinex Model (Retinex decomposition $S = R \cdot L$ / generate the result as $S_{enhance} = R \cdot L^{\frac{1}{\gamma}}$)
  * Learning-Based Models (LLNet / LLCNN ...)
* Low-Light Datasets for High-Level Tasks (KAIST / Exclusively Dark)
* [Deep Retinex Decomposition for Low-Light Enhancement](https://zhuanlan.zhihu.com/p/87384811)
  * Retinex theory + deep learning
  * Dataset: LOL (LOw-Light)
* [Benchmarking Low-Light Image Enhancement and Beyond](https://zhuanlan.zhihu.com/p/467789757)
  * Paired datasets: LLNet
  * Unpaired datasets: cannot support model training
  * VE-LOL: evaluation of low-/high-level vision
  * **UG2 challenge**
* [HLA-Face: Joint High-Low Adaptation for Low Light Face Detection](https://blog.csdn.net/weixin_45709330/article/details/116375825)
  * Gaps between normal light and low light (pixel-level appearance / object-level semantics)
  * Consider joint low-level and high-level adaptation
* [Self-Aligned Concave Curve: Illumination Enhancement for Unsupervised Adaptation](https://arxiv.org/abs/2210.03792)
  * Training strategy: asymmetric self-supervised alignment

#### Multi-View 3D Modeling of Non-Diffuse Objects with Complex Materials (非漫射複雜材質物體的多視角三維視覺建模)

:::info
Australian National University – Prof. Hongdong Li
:::

* Research topics:
  * Computer Vision
  * Robotic Vision
  * Smart Car Project
  * City Modeling
  * Bionic Eyes Project
* [Multi-view 3D Reconstruction of a Texture-less Smooth Surface of Unknown Generic Reflectance](https://openaccess.thecvf.com/content/CVPR2021/papers/Cheng_Multi-View_3D_Reconstruction_of_a_Texture-Less_Smooth_Surface_of_Unknown_CVPR_2021_paper.pdf)
  * Vision-based 3D shape reconstruction
    * (Rigid object / scene) structure from motion
    * Lambertian / Non-Lambertian
  * Problem setting: the traditional photometric stereo problem
  * 3D computer vision $\leftrightarrow$ image inversion
  * **The rendering equation**
  * Solution: minimizing a suitable objective (loss) function (augmented Lagrangian method relaxation): image formation + surface regularization + relaxation penalty
* [Diffeomorphic Neural Surface Parameterization for 3D and Reflectance Recovery](https://dl.acm.org/doi/10.1145/3528233.3530741)
  * Shape deformation
  * Learning / training process: inverse graphics rendering
* Recap
  * Multi-view 3D reconstruction for objects with unknown materials.
  * Significantly outperforms SOTAs under unknown illuminations
  * Achieves similar accuracy to darkroom methods but is much more flexible
  * Robust to complex shapes and specular materials
  * Reconstructions can be easily plugged into rendering engines
  * Limitations: <span class="red">piecewise-smooth object shape assumption with simple topology; needs a strong flashlight (SNR); slow convergence</span>

#### Consistent, Empathetic and Prosocial Dialogues

:::info
Prof. Gunhee Kim
:::

* [ProsocialDialog: A Prosocial Backbone for Conversational Agents](https://arxiv.org/pdf/2205.12688.pdf)
  * [Anticipating safety issues in E2E Conversational AI: Framework and Tooling](https://arxiv.org/abs/2107.03451)
  * Datasets: DailyDialog / PersonaChat / EmpatheticDialogues... (all of them are biased towards positivity)
  * Classification models trained on GoEmotions
  * Canary / Prost
* [Will I Sound Like Me? Improving Persona Consistency in Dialogues through Pragmatic Self-Consciousness](https://arxiv.org/abs/2004.05816)
  * *Public self-consciousness* is the awareness of the self as a social object that can be observed and evaluated by others
  * Bayesian Rational Speech Acts framework, which was originally applied to improving the informativeness of referring expressions
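A minimal NumPy sketch of the vanilla Rational Speech Acts recursion mentioned above (literal listener, pragmatic speaker, pragmatic listener) on a toy reference game; the lexicon, prior, and rationality parameter are illustrative assumptions, not from the talk:

```python
import numpy as np

# Toy reference game: 2 utterances ("glasses", "hat") x 3 referents.
# lexicon[u, m] = 1 if utterance u is literally true of referent m.
lexicon = np.array([[1, 1, 0],    # "glasses" is true of referents 0 and 1
                    [0, 1, 1]],   # "hat" is true of referents 1 and 2
                   dtype=float)
prior = np.full(3, 1 / 3)         # uniform prior over referents
alpha = 1.0                       # speaker rationality

def normalize(p, axis):
    return p / p.sum(axis=axis, keepdims=True)

L0 = normalize(lexicon * prior, axis=1)   # literal listener   P(m | u)
S1 = normalize(L0 ** alpha, axis=0)       # pragmatic speaker  P(u | m)
L1 = normalize(S1 * prior, axis=1)        # pragmatic listener P(m | u)

print(L1[0])  # hearing "glasses", the pragmatic listener now favors referent 0
```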
* [Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes](https://arxiv.org/abs/2109.08828)
  * Related work
    * Empathetic dialogue modeling
    * Emotion Cause (Pair) Extraction
    * Rational Speech Acts (RSA) framework

### 11/28

#### [Context Autoencoder for Scalable Self-Supervised Representation Pretraining](https://arxiv.org/abs/2202.03026)

:::info
Baidu computer vision expert - Jingdong Wang (王井东)
:::

- Vision Foundation Models
  - Big Data
  - Big Parameter
  - Big Task
  - Big Algorithm
  - Big Computation
- Representation Pretraining
  - Goal: learn an encoder mapping an image to a representation
  - Pretraining task $\rightarrow$ downstream task
  - Scale up: sample scale (supervised: no; semi-supervised / vision-language / self-supervised: yes), concept scale (supervised / semi-supervised: no; vision-language / self-supervised: yes)
- Self-Supervised Representation Pretraining in Vision
  - Contrastive pretraining
  - Masked image modeling
  - Other
- CAE: representation pretraining aims to learn an encoder, <span class="red">mapping an image to a representation that can be transferred to downstream tasks.</span>
  - <span class="red">Regressor for masked image modeling $\rightarrow$ masked representation modeling:</span> make predictions for the masked patches from the visible patches in the encoded representation space for solving the masked image modeling task.
  - The encoder is <span class="red">dedicated to</span> representation pretraining, and representation pretraining is done <span class="red">only by</span> the encoder.
  - The task-completion part (regressor and decoder) is <span class="red">separated</span> from the encoder. (A toy sketch of this data flow follows below.)

<center>
<img src="https://i.imgur.com/YtAg3d1.png">
<p>Figure 1: Context autoencoder</p>
</center>
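A toy sketch of the masked-representation-modeling data flow described above (my own illustration, with small MLPs standing in for the ViT encoder, latent regressor, and decoder; the layer sizes, the mean-pooled regression, and the loss weighting are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_vis, n_mask = 64, 40, 9                 # latent size, visible/masked patch counts

encoder   = nn.Sequential(nn.Linear(192, dim), nn.ReLU(), nn.Linear(dim, dim))
regressor = nn.Linear(dim, dim)                # predicts masked *representations* from visible ones
decoder   = nn.Linear(dim, 192)                # maps predicted representations back to targets

visible = torch.randn(n_vis, 192)              # flattened visible patches
masked  = torch.randn(n_mask, 192)             # flattened masked patches (targets)

z_vis = encoder(visible)                       # encode the visible patches only
# Latent regression: predict representations of the masked patches from the visible ones.
# (Here a crude mean-pooled summary; CAE itself uses cross-attention with mask queries.)
z_pred = regressor(z_vis.mean(dim=0, keepdim=True)).expand(n_mask, dim)

with torch.no_grad():
    z_target = encoder(masked)                 # alignment target: encoder's view of masked patches

loss_align = F.mse_loss(z_pred, z_target)              # masked representation modeling loss
loss_recon = F.mse_loss(decoder(z_pred), masked)        # task completion (reconstruction) loss
loss = loss_recon + loss_align                          # keeps pretraining "only by the encoder"
loss.backward()
print(float(loss))
```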
- How does contrastive pretraining work?
  - How can the representations of random crops from the same original image be similar?
  - Speculation: the encoder extracts the representation of a <span class="red">part</span> of the object / the projector maps the part representation to the representation of the <span class="red">whole object</span>
  - The projected representations then agree
- What representations are learned?
  - Observation: what the random crops have in common lies in the <span class="red">center</span> of the original image / the object in an ImageNet image lies in the <span class="red">center</span>
  - Conjecture: contrastive pretraining mainly <span class="red">learns the semantics of the center region</span>

**[GitHub repo.](https://github.com/lxtGH/CAE)**

<center>
<img src="https://i.imgur.com/A9RqRId.png">
<p>Table 1: Pretraining quality evaluation</p>
</center>

#### Relational and Structural Vision with High-Order Feature Transforms

:::info
POSTECH - Minsu Cho
:::

**Match and transfer**

- Relational Self-Attention: What's Missing in Attention for Video Understanding
- [SPair-71k: A Large-scale Benchmark for Semantic Correspondence](http://cvlab.postech.ac.kr/research/SPair-71k/)
- [Convolutional Hough Matching Networks](https://arxiv.org/abs/2103.16831)
- [TransforMatcher: Match-to-Match Attention for Semantic Correspondence](https://arxiv.org/abs/2205.11634)
- Few-shot image segmentation
  - [Hypercorrelation Squeeze for Few-Shot Segmentation](https://openaccess.thecvf.com/content/ICCV2021/papers/Min_Hypercorrelation_Squeeze_for_Few-Shot_Segmentation_ICCV_2021_paper.pdf)
- Structure of correspondence in space
  - [Learning to Discover Reflection Symmetry via Polar Matching Convolution](https://arxiv.org/abs/2108.12952)
- Motion-aware video recognition
  - [Learning Self-Similarity in Space and Time as Generalized Motion](https://arxiv.org/abs/2102.07092)
- Relational Self-Attention
  - [Relational Self-Attention: What's Missing in Attention for Video Understanding](https://arxiv.org/abs/2111.01673)
- Summary
  - Real-world vision systems need to leverage relational and structural patterns of images and videos for systematic understanding.
  - High-order convolution or self-attention is effective for capturing relational structures by considering geometric patterns of correlation.
  - Learning relational structures is crucial for minimally-supervised recognition and structural perception of images and videos.

#### AURORA - Empirical Bayes from Replicates

:::info
Stanford University - Dennis L. Sun
:::

- [Empirical Bayes mean estimation with nonparametric errors via order statistic regression on replicated data](https://arxiv.org/abs/1911.05970)
  - Estimate some quantity $\mu_i$ from noisy observations $\textbf{Z} = \{Z_1, ... Z_N\}$.
  - Empirical Bayes: first estimate $A$ from the data, then plug it into the prior (a small sketch follows below).
  - Prior: $G = \mathcal{N}(0, A)$
  - Likelihood: $F(\cdot \ | \ \mu_i) = \mathcal{N}(\mu_i, \sigma^2)$
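A minimal NumPy sketch of this normal-normal empirical Bayes shrinkage (my own illustration, not the AURORA method itself): with prior $\mathcal{N}(0, A)$ and likelihood $\mathcal{N}(\mu_i, \sigma^2)$, the posterior mean of $\mu_i$ is $\frac{A}{A+\sigma^2} Z_i$, and $A$ is estimated from the marginal variance of the data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, A_true, sigma = 2000, 1.0, 2.0

mu = rng.normal(0.0, np.sqrt(A_true), size=N)    # latent means, prior N(0, A)
Z = mu + rng.normal(0.0, sigma, size=N)          # noisy observations, N(mu_i, sigma^2)

# Empirical Bayes: estimate A from the marginal variance Var(Z) = A + sigma^2.
A_hat = max(Z.var() - sigma**2, 0.0)
shrink = A_hat / (A_hat + sigma**2)
mu_hat = shrink * Z                              # posterior-mean estimate for each mu_i

print(f"A_hat = {A_hat:.2f}, shrinkage factor = {shrink:.2f}")
print("MSE of raw Z       :", np.mean((Z - mu) ** 2))       # ~ sigma^2 = 4
print("MSE of shrunk means:", np.mean((mu_hat - mu) ** 2))   # noticeably smaller
```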
### 12/21

**Deploying CV at Edge - From Recent Vision Transformer to Future Metaverse**

##### Computing and AI Technology Group, MediaTek Inc.

#### Part 1: Overview

* NIPS
* Marching toward the metaverse era

#### Part 2: Deploying Vision Transformers at the edge

* Computer vision research evolves rapidly
* How to use it in our daily devices

:::info
鄭嘉珉 (Senior Manager, MediaTek)
:::

Focus more on experience sharing, especially going from CV research to production at MediaTek.

#### NIPS

[Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding](https://arxiv.org/pdf/2205.11487.pdf)

Important topics:
* Adversarial robustness
* Federated learning
* Diffusion models
* NeRF (Neural Radiance Field)
* NeMF (Neural Motion Field)
* CCNeRF (Compressible-composable NeRF)
* GNN

#### Metaverse

##### Challenges

* High computing
* Low latency
* Low power
* Tiny form factor
* Display:
  * immersive display experience
* Graphics
* Motion-to-photon latency
  * VR: under 20 ms
  * AR: under 5 ms
* Concurrent multiple tasks

DNN > Edge Process > Moore's law

---

:::info
姜政銘 (Jimmy Chiang)
:::

* Edge AI at MediaTek
* Vision Transformer
* The AI talent MediaTek values
* Advice for those about to enter the workforce

Edge AI KSFs: noise reduction, super resolution

#### CAI Division

AI-ALG (tasks assigned to the algorithm team):
* AI CV
* AI NLP
* AI Network
* AI Methodology
* AI for 5G
* AI Architecture

AI-SW:
* Bridging GPU → CUDA → PyTorch → Python code
* NeuroPilot SW bridges to the GPU on the phone

AI-HW:
* How to design a high-efficiency APU under a limited cost budget

#### What does it take to run a trained model on a phone?

1. How to integrate NAS and quantization?
2. How to convert to a format the platform supports?
3. What if the result is extremely slow?

#### Vision Transformer

1. Patch embedding
   * Operation
   * Challenges in the APU
     * Memory access is one of the bottlenecks in the APU
     * Patch-wise processing is like a 'sliding window' in convolution
     * Patch size
2. Multi-head self-attention
   - Challenges
     * Global self-attention requires quadratic computing complexity
     * The biggest challenge in the APU ⇒ over 95% of the latency cost in ViT
       * Matrix multiplication
       * Softmax

Summary:
* Global attention has better quality but suffers from MatMul and Softmax
* Cross-covariance attention is favorable for high resolution and fewer channels

#### Softmax Complexity

* Softmax: the naive formula doesn't work due to numerical stability (overflow)
* Most AI accelerators support float16 instead of float32, since the smaller data format gets better PPA (Performance, Power, Area)
* What happens when using float16? UNDERFLOW
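A small NumPy sketch (my own illustration) of why the naive softmax overflows and how the standard max-subtraction trick avoids it, exaggerated here with float16 as discussed above:

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    e = np.exp(x - x.max())     # shifting by the max never overflows: exponents are <= 0
    return e / e.sum()

logits = np.array([12.0, 11.0, -5.0], dtype=np.float16)   # float16 overflows past exp(~11.1)

print(softmax_naive(logits))    # exp(12) overflows float16 to inf -> result contains nan
print(softmax_stable(logits))   # ~[0.731, 0.269, 0.000], finite and well-behaved
```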
#### Norm-Layer Challenge

* Overflow occurs after the multiplication that computes the variance $\sigma$
* Underflow occurs in Rsqrt

#### MLP-GELU Challenge Overview

* The GELU activation is widely used in Transformers
* It is impractical to implement the error function in an AI accelerator!

#### What papers might not tell you, but what matters in edge AI

* Low MACs/FLOPs do not imply high efficiency
* Accuracy in the paper does not guarantee accuracy on the edge device
* Papers report performance on mobile CPUs and GPUs

#### Career advice

* General principles
  - Fundamentals
  - Teamwork and communication
  - Curiosity
  - Independent thinking
  - Learning mindset
  - Talent? Algorithms, hardware, and software
  - When submitting your resume, have your slides ready

## Paper list

| Paper | Conference / Year |
| -------- | -------- |
| You Only Cut Once: Boosting Data Augmentation with a Single Cut | ICML / 2022 |
| Scaled-YOLOv4: Scaling Cross Stage Partial Network | CVPR / 2021 |
| MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation | CVPR / 2022 |
| Taming Transformers for High-Resolution Image Synthesis | CVPR / 2021 |
| BEiT: BERT Pre-Training of Image Transformers | ICLR / 2022 |
| GAN-Supervised Dense Visual Alignment | CVPR / 2022 |
| Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling | CVPR / 2022 |
| FMODetect: Robust Detection of Fast Moving Objects | ICCV / 2021 |
| Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | ICCV / 2021 |
| Boosting Crowd Counting via Multifaceted Attention* | CVPR / 2022 |
| Focal and Global Knowledge Distillation for Detectors | CVPR / 2022 |
| VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution | CVPR / 2022 |
| RefineFace: Refinement Neural Network for High Performance Face Detection | TPAMI / 2021 |
| Restormer: Efficient Transformer for High-Resolution Image Restoration | CVPR / 2022 (Oral) |
| Learning the Degradation Distribution for Blind Image Super-Resolution | CVPR / 2022 |
| Pose Recognition With Cascade Transformers | CVPR / 2021 |
| Deep Constrained Least Squares for Blind Image Super-Resolution | CVPR / 2022 |
| ACPL: Anti-curriculum Pseudo-labelling for Semi-supervised Medical Image Classification | CVPR / 2022 |
| CoMoGAN: Continuous Model-guided Image-to-Image Translation | CVPR / 2021 |
| TrackFormer: Multi-Object Tracking with Transformers | CVPR / 2022 |
| Contrastive Embedding for Generalized Zero-Shot Learning | CVPR / 2021 |
| Masked Autoencoders Are Scalable Vision Learners | CVPR / 2022 |
| Crafting Better Contrastive Views for Siamese Representation Learning | CVPR / 2022 |
| GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields | CVPR / 2021 |
| Scaling Vision Transformers | CVPR / 2022 |
| Image-Adaptive YOLO for Object Detection in Adverse Weather Conditions | AAAI / 2022 |
| EditGAN: High-Precision Semantic Image Editing | NeurIPS / 2021 |
## Final exam (open anything)

1. Local Binary Patterns (15%)
   - How is it computed?
   - Given three image patches, compare their similarity to the original patch.
2. Compute the attention output $Z$; the formula and the $K, V, Q$ matrices are given (20%).
3. Given the paper [MetaFormer is Actually What You Need for Vision](https://openaccess.thecvf.com/content/CVPR2022/papers/Yu_MetaFormer_Is_Actually_What_You_Need_for_Vision_CVPR_2022_paper.pdf), explain how it differs from the original Transformer and how it improves performance (20%).
4. Given the paper [Disentangling 3D Pose in A Dendritic CNN for Unconstrained 2D Face Alignment](https://openaccess.thecvf.com/content_cvpr_2018/papers/Kumar_Disentangling_3D_Pose_CVPR_2018_paper.pdf):
   - Compare [3D STN](https://arxiv.org/pdf/1707.05653.pdf) with this paper; discuss their respective pros and cons (15%).
   - Explain how hard samples increase model robustness, with reference to [Hard Sample Mining](https://arxiv.org/pdf/1606.04232.pdf) (10%).
5. Course feedback (20%)

## Reference

[Request an electronic copy of the textbook](https://szeliski.org/Book/)
