# Computer Vision

## Exam scope

CV: Camera model through GMM
DL: LeNet, AlexNet, VGG16

## 20230907

### Title

CvDl-Introduction, LeNet

### Note

A very important part of deep learning is convolution, which is tied to image processing. Many deep learning methods come from machine learning.

signal processing -> image processing -> computer vision -> deep learning. The most important piece is convolution from signal processing (used as a filter).

Knowledge Base:

![image.png](https://hackmd.io/_uploads/HJmm_IIQ6.png)

(R+G+B)/3 = Gray value

computer vision vs. graphics:

* vision: a real scene is converted by sensors into a 3D model.
* graphics: a 3D model is converted by a synthetic camera into a 3D virtual view (animation).

### GPT:

![image.png](https://hackmd.io/_uploads/H1GI1PIXa.png)

* pre-training: needs big data; the goal is to cope with users who supply insufficient data. It can also be used for customization.
* transformer: has attention and geometric information.

### LeNet-5:

Important:

* 5 layers: 2 convolution layers (feature extraction) + 3 fully connected layers (classification)
* VGG16: 13 convolution layers + 3 fully connected layers.
* VGG19: 16 convolution layers + 3 fully connected layers.

convolution layers: each layer extracts feature points and then projects them to a higher dimension (activation fn.).
fully connected layers: responsible for classification.

![123.PNG](https://hackmd.io/_uploads/HyYIFPIX6.png)

---

Other:

![image.png](https://hackmd.io/_uploads/HJ-j0OIm6.png)

Deep learning networks usually down-sample no more than 5 times (3-5 times; occasionally 6, but 5 works best).

faster R-CNN: R = region
RNN: R = recurrent
AutoEncoder = encoder + decoder

The purpose of normalization is to limit the output (probability) to the range [0, 1].

![image.png](https://hackmd.io/_uploads/rJOY7tLQ6.png)

* convolution + BN (batch normalization) + activation fn. = 1 layer of a CNN. (CNN is one kind of deep learning model.)

> convolution: a filter that extracts features.
>
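> activation function: projects low-dimensional data into a higher dimension so that the feature space becomes sparse, which makes it easier to classify with a linear discriminant.

SVM (support vector machine): a machine learning method that applies a kernel fn. (activation fn.).
The purpose of deep learning is to fix the weaknesses of machine learning.

> So if a paper says it combines deep learning with an SVM, that paper is rubbish. (Deep learning is a strengthened form of machine learning and already applies activation functions, so it does not need an SVM.)

So what deep learning actually does is this: it has many layers, each layer extracts feature points and projects them into a higher dimension, layer after layer, until we end up with a very sparse (high-dimensional) feature map that is easy to classify (in a neural network).

* flatten: the 2D input has to be fed into a neural network (1D) for classification, so the feature map is flattened.

<big>On the exam:</big> transformer is better than convolutional neural network (CNN).

* convolutional neural network vs. transformer:
>> transformer has attention (important features) and geometric information.

## 20230914

---

### homogeneous coordinate

![image.png](https://hackmd.io/_uploads/BkZ9uUPXp.png)

The key to the formula that converts between 2D and 3D is w, a constant scaling factor that determines how much the image is enlarged or shrunk.

![image.png](https://hackmd.io/_uploads/HkCARUvm6.png)

Optical axis: the 3D optical axis does not pass through the 2D image plane exactly at its center, because of optical and mechanical errors; calibration is required first.

### The CNN pipeline

![image.png](https://hackmd.io/_uploads/rJawNDP7a.png)

In short: an image of 600 x 400 x 3 cells goes through convolution and max pooling, is flattened (matrix to vector, i.e. Flatten), and is finally fed into an MLP (a multi-layer architecture).

Based on the image colors, an activation function (a space transformation) is applied for the convolution.

![image.png](https://hackmd.io/_uploads/Sy5zYPv7T.png)

The filter is made smaller rather than larger mainly because a larger filter increases the number of parameters; the field of view can still be enlarged.

![image.png](https://hackmd.io/_uploads/SkjxYduX6.png)

<big>On the exam:</big> What is the difference between the correlation matrix and the covariance matrix?

ans: The correlation matrix measures the linear correlation between variables, and its values are standardized, which makes correlations between different variables easier to compare. The covariance matrix measures the joint variability of variables, its values carry the units of the original data, and it is typically used in broader statistical analysis and modeling.

## 20230921

### Title

Camera Model + Geometric Transformation

### Outline

* What is Computer Vision?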
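* Definition of Camera Model
    * Pinhole Camera Model
* Reconstruct Object from 2D to 3D
    * Weak Perspective Model
* Application of Camera Model

### Pinhole Camera Model

Affine Transform: does not distort
Perspective Transform: distorts ← Pinhole

### Reconstruct Object from 2D to 3D

3D -> Each has 3 parameters, total 9 parameters.

#### Extrinsic Parameters:

Rotation, 6 degrees of freedom (why? 3 for rotation + 3 for translation)
The extrinsic parameters transform the world coordinate system into the camera coordinate system. (affine transformation)

#### Intrinsic Parameters:

3D to 2D. Defined by others (projection matrix)
The intrinsic parameters transform the camera coordinate system into the image coordinate system.
Image center won't be the same. Define u0, v0 as the principal point.
These coefficients differ from camera to camera.

#### Homogeneous Coordinates

Removing s' turns the problem from linear into non-linear.
A point is a vector from (0, 0).
Where is the start point? It has to be defined.
What is the physical meaning of the equation?
A sense of 3D: without a sense of 3D you ... simply crash.

#### Distortion Parameters

2 kinds of distortion:
Radial distortion
Tangential distortion

### Weak Perspective Model

> I worked on this all the time when I first came back to Taiwan. I don't know what people use now; I haven't done it for a long time.

> He taught me, and now I teach you, a lot of things. We had a very simple assignment: can you draw lots of arrows in space? You just draw the arrow, and the two barbs of the arrowhead are at thirty degrees, thirty on the left and thirty on the right. Back then I wondered why the teacher set this problem; it is hard. Given any point, draw the arrow: it looks simple. Later I realized the teacher wanted us to use unit vectors; done with vectors it becomes easy.

> I am not someone who plays with math, I just use it. We are not mathematicians; we wait for them to produce better results and then use those.

Key point, 25:55: On the exam I may ask, for example, what the physical meaning of a vector is. You have taken linear algebra, right? I will give you four options, multiple-answer, and you pick two. So the physical meaning of a vector includes <big>rotation and scaling</big>.

---

## 20231012

### Major Issue:

<big>On the exam:</big>

![image.png](https://hackmd.io/_uploads/SkWugwIXp.png)

![image.png](https://hackmd.io/_uploads/HyF1-v8Qp.png)

---

1. Camera calibration
    - Intrinsic parameters
    - Extrinsic parameters
2. Optimization based on the homogeneous matrix
3. Purpose of camera modeling and camera calibration

### Camera calibration

#### Distortion parameters

![image.png](https://hackmd.io/_uploads/HykrKj8Xp.png)

Y: 0~1

![image.png](https://hackmd.io/_uploads/S1xl9sUmp.png)

![image.png](https://hackmd.io/_uploads/rk3msiU76.png)

The arrow points to the learning rate.

![image.png](https://hackmd.io/_uploads/Bys42oL7p.png)

If the learning rate is too small, it takes too long to approach the global minimum.
If the learning rate is too large, it may overshoot.

![image.png](https://hackmd.io/_uploads/Bk1n2j8Q6.png)

Local receptive field: captures local features, one filter at a time; typically 3x3.
Shared weights: sharing weights reduces the number of nodes in the intermediate layers.
Sub-sampling: subsamples the pixels to make the image smaller -> fewer parameters for the network to process the image.

![image.png](https://hackmd.io/_uploads/rkJLX28QT.png)

Filter parameters: it is commonly said that a 3x3 filter has 9 filter parameters, but that is wrong; you must also multiply by the number of channels. For an RGB image the filter has 3x3x3 = 27 parameters.

![image.png](https://hackmd.io/_uploads/HJj54h8Qp.png)

Convolving the image with one filter produces one feature map; the number of filters is chosen by the user. Convolution loses boundary information: in the figure below the outermost ring of values is missing (this is for a 3x3 filter; a 5x5 filter loses the outer two rings).

![image.png](https://hackmd.io/_uploads/BJqUL28Qp.png)

To keep the outermost ring, use padding. Padding comes in two kinds: full padding and zero padding.

![image.png](https://hackmd.io/_uploads/rydgthIma.png)

Here the result of the first convolution stage is a 14x14x6 feature map. Convolving it again with 16 filters (5x5x6) gives a 10x10x16 feature map. It is 10x10 because a 5x5 filter removes 2 from each side of 14.

![image.png](https://hackmd.io/_uploads/S1NwTh8Q6.png)

Convolution works as described above; subsampling directly shrinks the input. Here 2x2 subsampling halves both the width and the height.

<big>On the exam:</big> Cv_DeepLearning_20231012_Dl_02_RecogDoc_LeNet_Ieee1998, around 38:00-40:00: newer methods for doing convolution

1. Convolve across the entire feature map, instead of first working on a few channels as in the original.
2. At the end, use a fully convolutional network instead of fully connected layers, because fully connected layers fail on images of different sizes; to handle every size you need a fully convolutional network.

Activation function: ReLU is better than the sigmoid function. For intermediate results we do not want values pushed close to 0 or 1; we want them to better represent the features of the original values. At the end, a softmax layer normalizes the result to 0~1.

![image.png](https://hackmd.io/_uploads/rJmFfT8XT.png)

---

## 20231019

![1699412696808.jpg](https://hackmd.io/_uploads/SyExju_Qp.jpg)

- Why sigmoid is bad: the vanishing gradient problem, so ReLU is mostly used now.

## 20231026

part1-Sift:

### System flow

![3.png](https://hackmd.io/_uploads/r1kZsOL7a.png =95%x)

1. Scale-Space Extrema Detection
    - Extract the important features of the input image
    - Use: DoG, edge detection, corner detection
2. Keypoint Localization
    - Delete the features that aren't important.
3. Orientation Assignment for each Keypoint
    - Measure the direction of each small patch
    - Rotate to 0 degrees (for comparison)
4. Keypoint Descriptor
    - How can we represent the patch?
    - Each patch has only one descriptor

### Detection of Scale-Space Extrema

<big>He said this will be on the exam (but it is not in past exams):</big>

- High pass filter: positive area = negative area, total area = 0.
- Low pass filter: positive only, total area = 1.

#### Difference of Gaussian (DoG) filter

- Advantages of DoG:
    - Only one scan is needed to get the result.
    - Subtract the filters first, then the image only has to be processed once.

#### Non-maxima/minima suppression

- Non-maxima suppression, step by step:
    - Extract the keypoints, detect each edge (pixel)
    - Consider the neighbor pixels, including the layers above and below (9+8+9)
    - If a neighbor's magnitude is larger than yours, remove yourself
    - That is how the local maximum is found
    - Afterwards only isolated points remain, no connected runs

### Keypoint Localization

- Delete the unimportant local maxima.

#### Keypoint localization in sub-pixel accuracy

- How do we compute the sub-pixel accuracy ($\hat{x}$)?
    - let $x$ = 10, magnitudes = 1, 6, 5
    ![Untitled2.png](https://hackmd.io/_uploads/rJ36qdIQT.png)
    - First find the curve, i.e. solve for $f(x)$
    - Taylor expansion (4/7 has the derivation) ... solve. Taking 1, 6, 5 as the magnitudes at $x$ = 9, 10, 11: $f' \approx (5-1)/2 = 2$, $f'' \approx 1 - 2 \cdot 6 + 5 = -6$, so $\hat{x} = -f'/f'' = 1/3$
    - The maximum finally comes out as $6 + \tfrac{1}{2} \cdot 2 \cdot \tfrac{1}{3} = 6\tfrac{1}{3}$.
    - Note: if the offset $\hat{x}$ is > 1/2, the result is wrong! Shift to the neighboring pixel.

#### Remove low contrast keypoints

- If the magnitude < some threshold, delete it.

![Untitled1.png](https://hackmd.io/_uploads/Hk1uc_IQa.png =80%x)

#### Eliminating edge responses

- Keep the **corner points**.
    - Because a corner is unique.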
- Remove points that are not corners.
- What is the physical meaning of a corner?
    1. Introduce the Hessian matrix
    ![Untitled.png](https://hackmd.io/_uploads/SkKucdUQT.png =80%x)
    2. Find the eigenvalues
        - Physical meaning: variability (error)
    3. Finally, look it up in this figure:
    ![Untitled.png](https://hackmd.io/_uploads/r1yg9uLQT.png =40%x)

---

Each corner gets a patch whose center is the corner point.
For each patch, take a small region and find its major rotation angle.

#### Each corner has a patch; find the rotation angle of each patch

The rotation angle is used for comparison (patches are compared after rotating to 0°).

### Orientation Assignment for each Keypoint

P.44
Every pixel has a magnitude; small magnitudes are deleted to keep the patch.
After thresholding, some edges become 0 -> keep the higher magnitude values.

#### Find the patches with higher magnitude, and use a Gaussian to emphasize the center point

Since the corner is the center point, the center pixel is the most important one.
Apply a mask with Gaussian-distributed weights so that the weight at the center is high.
The magnitude of each pixel is multiplied by the corresponding weight to get the **weighted gradient magnitude**.

![image.png](https://hackmd.io/_uploads/BkXeFoUma.png)

The weighted gradient magnitude becomes the Y-axis of the distribution.

#### Sum the weighted gradient magnitudes that fall in the same bin

Within the patch, each pixel has its own rotation angle.
Accumulate the rotation angles into 36 bins (noted in class as 0°~180° with 5° = 1 bin; Lowe's original SIFT covers 0°~360° with 10° per bin).
**Add up the magnitudes that fall in the same bin and plot the histogram.**

![image.png](https://hackmd.io/_uploads/rkDFts87a.png)

Choose the highest one: the peak in the graph is the rotation angle of this patch.
The bin with the highest total magnitude gives the patch's rotation angle.
Each patch probably has **1 or 2** rotation angles.

![image.png](https://hackmd.io/_uploads/rJO39i8Qa.png)

Any other local peak that is within **80% of the highest peak** is also used to create a keypoint with that orientation.
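
The orientation assignment above can be sketched in plain Python. This is only an illustration, not the lecture's or Lowe's reference implementation: the function names, the 7 x 7 test patch, and `sigma` are made up for the example, and it assumes 36 bins over 0°~360° (10° per bin).

```python
import math

def orientation_histogram(patch, num_bins=36, sigma=1.5):
    """Gaussian-weighted gradient-orientation histogram of a square patch.

    patch: list of rows of gray values; the keypoint (corner) is assumed
    to sit at the patch center. num_bins and sigma are illustrative.
    """
    h, w = len(patch), len(patch[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    bin_width = 360.0 / num_bins
    hist = [0.0] * num_bins
    # Skip the outer ring so central differences stay inside the patch.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            # Gaussian mask: pixels near the center count more.
            dist2 = (x - cx) ** 2 + (y - cy) ** 2
            weight = math.exp(-dist2 / (2 * sigma ** 2))
            # Accumulate the weighted gradient magnitude into the angle's bin.
            hist[int(angle // bin_width) % num_bins] += weight * magnitude
    return hist

def dominant_orientations(hist, peak_ratio=0.8):
    """Bin-center angles of local peaks that reach peak_ratio of the highest bin."""
    n = len(hist)
    top = max(hist)
    if top == 0:
        return []
    bin_width = 360.0 / n
    angles = []
    for i, value in enumerate(hist):
        is_local_peak = value > hist[(i - 1) % n] and value >= hist[(i + 1) % n]
        if is_local_peak and value >= peak_ratio * top:
            angles.append((i + 0.5) * bin_width)
    return angles

# A patch that only brightens from left to right: every gradient points at 0 degrees.
ramp = [[float(x) for x in range(7)] for _ in range(7)]
print(dominant_orientations(orientation_histogram(ramp)))  # [5.0], the center of bin 0
```

A patch with two strong edge directions would return two angles, which matches the "1 or 2 rotation angles" remark.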
### Keypoint Descriptor

1 sub-region has 8 bins (rotation angle)
1 patch has 4x4 sub-regions

![image.png](https://hackmd.io/_uploads/r1cs3i8QT.png)

So the descriptor has 4 x 4 x 8 = 128 bins.
The feature vector $v_j$ (128 dimensions) is normalized to unit length (this reduces the effect of contrast changes).

$$
v_{j} \to \frac{v_{j}}{\lVert v_{j} \rVert}
$$

### SIFT application

First rotate the patch to 0°.
Match left to right, then right to left; if the two agree, draw the connecting line.
It is fast and can run in real time.
Can stitch images together smoothly.
Can also be applied to AR.

---

part4-Detect_BkgdSubt_Gmm:

### The important part is not background subtraction but modeling: the Gaussian mixture model

#### Two kinds of model: parametric (numerical) vs. non-parametric modeling

The non-parametric model gives better results because it is closer to the real situation.
Drawback: it eats memory.

#### The most important theorem in AI is Bayes' theorem:

Posterior_Prob = Likelihood_Prob * Prior_Prob / k

Each curve is one Gaussian model; together they form a Gaussian mixture model (GMM).

![image.png](https://hackmd.io/_uploads/ry84O9I7a.png)

The figure above has 3 clusters, k = 3.
d is the dimension of the Gaussian (e.g. RGB, 3 channels, is 3).
w is the weight: this Gaussian's share of the distribution (proportion of samples).

![image.png](https://hackmd.io/_uploads/r15QF9LQ6.png)

### Exam question:

**cos is a low-pass filter, sin is a high-pass filter**
Anything should be modeled by a low-pass filter & a high-pass filter (cos & sin, that is the Fourier transform).

## 20231102

:::info
**part2-Detect_BkgdSubt_Gmm:**
Nothing after slide P.24 is on the exam; explained at [53:26](https://youtu.be/koQJnMUjHNQ?t=3206)
:::

<big>He said this will be on the exam (but it is not in past exams):</big>

1. What is Bayes' theorem? (24:01)
Ans: Posterior Prob. = Likelihood Prob. * Prior Prob. / k
![image.png](https://hackmd.io/_uploads/rkV08ODQa.png =400x)
2. Given a formula, decide whether it is a posterior or a likelihood probability. Example: the Gaussian model (P.16) is a similarity measure, so it is a likelihood probability.
![image.png](https://hackmd.io/_uploads/SkPmw_P7p.png =400x)
3. What is the difference between the average value and the median value? (P.18-19)
Ans (the lecturer's example in class): given 60 pixels, the median is the better choice when picking between average and median, because the average is sensitive to noise.
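
A small sketch of why the median is preferred in question 3. The numbers are made up for illustration (one pixel observed 60 times, background gray level 100, three samples hit by noise at 255), and `background_value` is a hypothetical helper, not something from the slides.

```python
import statistics

def background_value(samples, method="median"):
    """Estimate one pixel's background gray level from its samples."""
    if method == "mean":
        return statistics.mean(samples)
    return statistics.median(samples)

# 60 samples of the same pixel: 57 clean values and 3 noisy ones.
samples = [100] * 57 + [255] * 3

print(background_value(samples, "mean"))    # 107.75, pulled up by the 3 outliers
print(background_value(samples, "median"))  # 100.0, unaffected by them
```

The mean moves with every outlier, while the median stays at the background level as long as fewer than half of the samples are noisy.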
---

### Lecture notes

Steps for deciding whether an image contains people, vehicles, or other objects: detection → tracking → recognition

Detection can be done in the following two ways:

- **background subtraction**: one kind of detection. Its drawback is that the model takes more memory.
- **faster rcnn:** because it uses bounding boxes it does not know the object's outline, whereas the former does.

P.13

![image.png](https://hackmd.io/_uploads/Sye-S_wQp.png =500x)

#### Background subtraction

The gray values fluctuate too much to model directly, so → accumulate the gray values by count to get a histogram → apply normalization → reach better detection sensitivity

![image.png](https://hackmd.io/_uploads/ByyLS_P76.png =500x)

---

**MAP: Posterior Prob. = Likelihood Prob. \* Prior Prob. / k**

P.17

- The non-parametric model gives better results than the parametric model, because it does not compute a mean value.

P.21

![image.png](https://hackmd.io/_uploads/Syl8IuPQa.png =550x)

Pre-processing is needed to remove noise and lighting effects. If noise (white dots) is still present after background subtraction, use suppression of false detection:

1. **Pixel displacement probability**: take one block as the target and compare it with its surroundings; if the surrounding background pixels score higher, remove the target itself. → may lead to **overkill**. Explained at [47:53](https://youtu.be/koQJnMUjHNQ?t=2873)
2. **Component displacement probability**: used to recover from the overkill. Multiply the pixels within a block and use that value to decide whether to keep or delete the block. If the product is **small**, the block under consideration contains foreground pixels, so it can be deleted. If the product is **large**, the block contains many background pixels, so it is kept. Explained at [48:53](https://youtu.be/koQJnMUjHNQ?t=2933)

---

part3-CvAi_SimilarityMeasure_Match_Bayes:

##### P.39 30:00 Can a transformer separate the likelihood and the prior and then multiply them back together as the final result?

#### Purpose of the prior probability, by example

In volleyball recognition, locations where a volleyball should never appear can be ruled out first to avoid or reduce false detections. Here the prior is the location, so multiplying by the prior probability improves the result.

Recog_Obj_DeepCnn_AlexNet_ImgNet_Nips2012:

## AlexNet

### Architecture

- 5 convolution layers (feature extraction)
- 3 fully connected layers (NN, classification)

### Improvements over LeNet

- ReLUs $\to$ **most important**
- Training on multiple GPUs

### Drawbacks

- Uses overlapping pooling

![image.png](https://hackmd.io/_uploads/By2tpxtQa.png)

## DL Parameters

### The 5 main parameters that affect training results

#### Choosing a proper loss function

- **Continuous** or **regression**: Mean Square Error (MSE)
- **Discrete**: Cross Entropy (CE)

#### Mini-batch size

- Finishing one batch is one **iteration**
- Going through the whole dataset is one **epoch**

#### Activation function

#### Learning rate

- The learning rate is higher in the first epochs and then gets smaller and smaller
- The most commonly used is **Adam**

#### Momentum

- Pushes the value (?) toward the global minimum (optimization)

### Inference: addressing overfitting

![image.png](https://hackmd.io/_uploads/BJ8VHWKQa.png =70%x)

#### Early stop

- Stop training as soon as signs of overfitting appear, e.g. the validation loss starts to rise at some point

#### Drop out

- Each neuron has a certain **probability** of going **idle**: it does not update its parameters in the current iteration, but all cells take part in inference.

![image.png](https://hackmd.io/_uploads/rJg6vWFm6.png =70%x)

#### Transfer learning (Pre training)

#### Regularization (Weight Decay)

- Adjusts the size of w (convergence speed) so that it does not converge too fast; run more epochs

![image.png](https://hackmd.io/_uploads/H1A4IZt7p.png =70%x)

#### Network Structure

- Too many parameters (too many layers), so reduce layers

#### Data Augmentation

- Increases the amount of data
- horizontal reflections

![image.png](https://hackmd.io/_uploads/B1AtFZKmT.png =50%x)

- generating image translations

![image.png](https://hackmd.io/_uploads/HJAhKZKX6.png =30%x)

## VGG

- Replaces large filters (noted in class as AlexNet's 7 x 7; AlexNet's own first two layers use 11 x 11 and 5 x 5) with 3 x 3 filters
- This reduces the number of parameters
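
The parameter saving can be checked with simple arithmetic. A minimal sketch, assuming stride-1 convolutions without bias and 64 input and output channels (the channel count and helper names are made up for the example): one 7 x 7 layer is compared with a stack of three 3 x 3 layers, which covers the same 7 x 7 receptive field.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights in one conv layer: kernel x kernel x in_channels per filter."""
    return kernel * kernel * in_channels * out_channels

def receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)

channels = 64  # illustrative choice

one_7x7 = conv_params(7, channels, channels)
three_3x3 = 3 * conv_params(3, channels, channels)

print(receptive_field(7, 1), receptive_field(3, 3))  # 7 7
print(one_7x7)    # 200704
print(three_3x3)  # 110592

# One 3x3 filter on an RGB image has 3 x 3 x 3 = 27 parameters, not 9.
print(conv_params(3, 3, 1))  # 27
```

With equal channel counts C the ratio is 27C² to 49C², so the 3 x 3 stack needs roughly 55% of the weights for the same field of view.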