# Person Re-identification via Contrastive Learning
###### tags: `person reid`
## An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)
* Unknown
- layer norm:
- multi-head self-attention:
    - multi-head attention projects Q, K, V through h different linear transformations, then concatenates the h attention results
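A minimal PyTorch sketch (mine, not the paper's code) of multi-head self-attention: h linear projections of Q, K, V, scaled dot-product attention per head, then concatenation of the head outputs.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """h projections of Q, K, V, scaled dot-product attention per head,
    then concatenation of the h attention results."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)   # Q, K, V projections in one matmul
        self.proj = nn.Linear(dim, dim)      # output projection after concat

    def forward(self, x):                    # x: (batch, seq_len, dim)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # inner product of Q and K
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)  # concat the heads
        return self.proj(out)
```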
* Core concepts
    - Replaces the CNN entirely with the self-attention mechanism.
    - Since the Transformer was originally designed for NLP, the image has to be serialized: it is split into patches, and the patches arranged in a fixed order become the sequential input.
* Tips
    - Transformers work well when trained on large datasets.
* Method
1. Convert the image into sequential data
    - split the image into patches and reshape each patch into a vector, giving the so-called flattened patch
    - concatenating the reshaped vectors of all $N$ patches yields an $N \times (P^2 \cdot C)$ matrix, the counterpart of the word-embedding sequence fed into a Transformer in NLP
2. Avoid the influence of patch size (see the end-to-end sketch after step 5)
    - apply a Linear Projection to the flattened patches
    - it maps the flattened patch vectors of varying length to fixed-length vectors (denoted as $D$-dimensional vectors)
3. Position embedding
    - the Transformer model itself carries no position information
    - in the paper's figure, the purple boxes numbered 0-9 are the position embeddings of the individual positions
    - position information is injected by adding the position embedding (purple box) to the patch embedding (pink box)
4. Learnable embedding
    - the starred pink box is not produced from any patch
    - denoted $x_\text{class}$; its output from the encoder is taken as the representation of the whole image
    - why add it? to prevent the overall representation from being biased toward any single patch's embedding
5. Transformer encoder
    - stacks blocks that alternate multi-head self-attention and an MLP, each preceded by layer norm and wrapped in a residual connection
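An end-to-end sketch of steps 1-5 (ViT-Base-like values; names are mine, not the paper's code): flatten the patches, project to $D$ dims, prepend the learnable class token, add position embeddings, and run pre-norm encoder blocks.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=768, heads=12, depth=2):
        super().__init__()
        n = (img // patch) ** 2                        # number of patches N
        self.patch = patch
        self.proj = nn.Linear(patch * patch * 3, dim)  # step 2: Linear Projection
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # step 4: x_class
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # step 3: positions
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                'norm1': nn.LayerNorm(dim),
                'attn': nn.MultiheadAttention(dim, heads, batch_first=True),
                'norm2': nn.LayerNorm(dim),
                'mlp': nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                     nn.Linear(dim * 4, dim)),
            }) for _ in range(depth)])                 # step 5: encoder blocks

    def forward(self, x):                              # x: (B, 3, H, W)
        B = x.size(0)
        # step 1: split into P x P patches and flatten each to a vector
        x = nn.functional.unfold(x, self.patch, stride=self.patch)  # (B, 3*P*P, N)
        x = self.proj(x.transpose(1, 2))               # (B, N, D)
        x = torch.cat([self.cls.expand(B, -1, -1), x], dim=1) + self.pos
        for blk in self.blocks:                        # pre-norm MSA + MLP, residuals
            h = blk['norm1'](x)
            x = x + blk['attn'](h, h, h, need_weights=False)[0]
            x = x + blk['mlp'](blk['norm2'](x))
        return x[:, 0]                                 # class token = image representation
```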
## TransReID: Transformer-based Object Re-Identification (builds on the paper above)
- paper: https://arxiv.org/pdf/2102.04378.pdf
- code: https://github.com/heshuting555/TransReID
- comments: 2021 (Alibaba, Zhejiang University)
* Unknown
- Vision Transformer (ViT)
- Side Information Embedding (SIE) module
    - handles non-visual information (side information) such as viewpoint and camera style
- JPM: Jigsaw Patch Module
    - small patches are shuffled and regrouped into larger patches inside the JPM
    - 1) regrouping the patches makes the model robust to perturbations; 2) the newly built patches still contain global information
- What a Transformer layer does (difference from a CNN)
    - models with self-attention
    - multi-head attention
        - projects Q, K, V through h different linear transformations, then concatenates the different attention results
    - self-attention
        - (Q, K, V):
        - 1. match each Q against every K (attention) via an inner product
    - multi-head attention
        - adds one more projection layer: from a single matrix, Q gains more forms of expression, and the different matrices may capture global or local features
    - m: the shift step of the JPM (the first m patches are moved to the end)
    - k: the number of groups the shuffled patches are regrouped into
    - N: the number of fixed-sized patches
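A sketch of my reading of the JPM rearrangement, using the m/k/N notation above; a random permutation stands in for the paper's deterministic patch-shuffle, so treat it as illustrative only.

```python
import torch

def jigsaw_regroup(tokens, m=5, k=4):
    """Keep the [cls] token, shift the N patch tokens by m (first m moved
    to the end), shuffle, and regroup into k larger "patches"; each group
    still mixes patches from all over the image. tokens: (B, N + 1, D)."""
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]
    patches = torch.roll(patches, shifts=-m, dims=1)              # shift step m
    perm = torch.randperm(patches.size(1), device=patches.device)
    patches = patches[:, perm]                                    # patch shuffle
    groups = patches.chunk(k, dim=1)                              # k regrouped groups
    return [torch.cat([cls_tok, g], dim=1) for g in groups]
```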
* Supplement
    - the first work to study ReID with a pure Transformer
* Related Work
- Side Information
    - Images captured across a camera network vary greatly in pose, orientation, illumination, resolution, etc., due to different camera setups and object viewpoints. Viewpoint/orientation-invariant feature learning [7, 60] is equally important for person and vehicle ReID.
* Methodology
- Transformer-based strong baseline
    - two main stages (feature extraction and supervised learning)
- Overlapping Patches
    - patches are generated with a sliding window whose stride S is smaller than the patch size P, so adjacent patches overlap and local neighboring structures are preserved (see the sketch after this list)
- Position Embeddings
    - the model pre-trained on ImageNet is loaded into the network
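A quick check of the overlapping-patch arithmetic, assuming a sliding-window split; 256x128 is the usual person-ReID input size, while P and S below are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn as nn

# With stride S < patch size P, adjacent patches share P - S pixels.
H, W, P, S = 256, 128, 16, 12
img = torch.randn(1, 3, H, W)
patches = nn.functional.unfold(img, kernel_size=P, stride=S)  # (1, 3*P*P, N)
n_h, n_w = (H - P) // S + 1, (W - P) // S + 1
assert patches.shape[-1] == n_h * n_w                         # N = 21 * 10 = 210
```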
---
## Re-Identification with Consistent Attentive Siamese Networks
- paper: https://openaccess.thecvf.com/content_CVPR_2019/papers/Zheng_Re-Identification_With_Consistent_Attentive_Siamese_Networks_CVPR_2019_paper.pdf
- code: https://github.com/salehe-e/siamese_person_re_id
- comments: CVPR19
* Motivation
    - 1) The IDE model only attends to part of the image. 2) When the features of two images are compared, nothing guarantees that the attended regions are consistent (e.g., for the same ID, the query image attends to the upper body while the gallery image attends to the lower body, causing a mismatch).
* Introduction
    - Cutting the image into patches and feeding them to ViT as a sequence fully exploits the relations between the patches, so it should be effective.
---
## Fully Unsupervised Person Re-identification via Selective Contrastive Learning
- paper: https://arxiv.org/pdf/2010.07608.pdf
- code: none
---
## Joint Generative and Contrastive Learning for Unsupervised Person Re-identification
- paper: https://arxiv.org/abs/2012.09071
- code: https://github.com/chenhao2345/GCL
- Comments: CVPR 2021
- uses contrastive learning to tackle the viewpoint problem
* Unknown
- Human Mesh Recovery (HMR), paper [21]:
    - reconstructing the human mesh from a single RGB image is called HMR
    - Two kinds of data are used: 2D images with keypoint annotations, and 3D meshes carrying pose and shape information; the two unpaired datasets are trained on jointly. Given an input image, the model first predicts 3D mesh parameters and camera parameters, then projects the 3D keypoints back onto the 2D plane and checks whether they fit the 2D keypoints of the original image. This 2D-to-3D-to-2D scheme makes good use of data without 3D annotations.
    - it outputs a 3D mesh rather than the 3D skeleton of earlier methods, which makes the result more useful downstream
- cycle consistency
    - we want to map from domain X to domain Y
    - mapping G realizes X -> Y, i.e., G(X) has the same distribution as Y
    - inverse mapping F: Y -> X
    - cycle consistency loss: F(G(X)) ≈ X
    - and vice versa, G(F(Y)) ≈ Y
- memory bank
    - renewed at the beginning of every epoch
- adversarial losses
    - push the distribution of the generated data toward the real data distribution
- cycle consistency losses
    - keep the generators G and F from contradicting each other, i.e., generated data can still be mapped back, roughly X -> Y -> X (sketched in code after this list)
- how are positive features found in the memory?
- how are structure and identity combined in G?
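A minimal sketch of the cycle-consistency terms above, using an L1 reconstruction loss as in CycleGAN; G and F stand for any two mapping networks.

```python
import torch.nn.functional as nnf

def cycle_consistency_loss(G, F, x, y):
    """F(G(x)) should reconstruct x, and G(F(y)) should reconstruct y.
    G: X -> Y, F: Y -> X; x, y are batches from the two domains."""
    return nnf.l1_loss(F(G(x)), x) + nnf.l1_loss(G(F(y)), y)
```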
* Introduction
- we use a GAN as a novel view generator for contrastive learning, which does not require a labeled source dataset.
- we aim at enhancing view diversity for contrastive learning via generation under the fully unsupervised setting
- We estimate 3D meshes from unlabeled training images, then rotate these 3D meshes to simulate new structures.
- contrastive module
    - reduces intra-class variation between original and generated images
- generative module
    - a 3D mesh based novel-view generator
- Components
    - a joint generative and contrastive learning framework
    - estimates body shape and pose from a single RGB image
    - s_ori: the 2D projection of a 3D mesh, used as the original structure
* Proposed Method
    - first, the View Generator applies cycle-consistency to both mesh-guided structure features and identity features (CNN) to generate a person in new viewpoints
    - the original and generated views are then exploited as positive pairs in the View Contrast Module
* View Generator
    - combines the identity feature from E_id with a mesh-guided structure feature to synthesize the person in a new viewpoint
* View Contrast (Contrastive Module)
    - the View Generator alone only extracts features from the original viewpoint and lacks information from the rotated viewpoints,
    - so the contrastive module enhances the view-invariance of the representations
    - First
        - find positive and negative images
        - all instance representations are stored in a memory bank (see the sketch below)
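A hedged stand-in for the view-contrast idea, not the paper's exact loss: an InfoNCE-style objective where the generated view is the positive and memory-bank entries act as negatives (assumed here to exclude the query's own slot).

```python
import torch
import torch.nn.functional as nnf

def view_contrast_loss(f_orig, f_gen, memory, tau=0.07):
    """f_orig, f_gen: (B, D) L2-normalized features of the original and
    generated views of the same instances; memory: (M, D) bank entries."""
    pos = (f_orig * f_gen).sum(dim=1, keepdim=True) / tau  # (B, 1) positive pair
    neg = f_orig @ memory.t() / tau                        # (B, M) negatives
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(len(f_orig), dtype=torch.long, device=f_orig.device)
    return nnf.cross_entropy(logits, labels)               # positive sits at index 0
```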
* Joint Training
    - enhances the quality of the representations built by the shared identity encoder E_id
* Summary
    - the generative module provides online data augmentation, which enhances positive-view diversity for the contrastive module
    - the contrastive module, in turn, learns view-invariant representations by matching original and generated views, which refines the generation quality (learning view-invariant features, thereby improving the generator)
    - E_id extracts the view-invariant features
---
## CUPR: Contrastive Unsupervised Learning for Person Re-identification
- paper: https://www.scitepress.org/Papers/2021/102399/102399.pdf
- comment: VISAPP21-6B: Oral Presentations
---
## Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID
- paper: https://arxiv.org/pdf/2006.02713.pdf
- code: https://github.com/yxgeee/SpCL
* Unknown
    - Self-paced learning
        - the "easy-to-hard" training scheme is at the core of self-paced learning
        - curriculum learning (CL) feeds the model samples ordered by difficulty according to some prior; the biggest difference in self-paced learning (SPL) is that this ordering prior is embedded in the model itself, is dynamic, and can be optimized during training
    - non-parametric
        - if f is an outlier sample in the target domain, then z+ is the outlier sample in the target domain that matches f (??? not explained in the paper)
* Abstract
    - domain-adaptive object Re-ID aims to transfer knowledge learned from a labeled source domain to an unlabeled target domain
    - recent pseudo-label-based methods suffer from the domain gap and unsatisfactory clustering
    - a hybrid memory is proposed that dynamically generates source-domain class-level, target-domain cluster-level, and un-clustered instance-level supervisory signals for feature learning
* Introduction
    - (1) supervised pre-training on the source domain, (2) unsupervised fine-tuning on the target domain
    - two major limitations hinder previous methods:
        - 1. source-domain images were either not considered or only used for pre-training
        - 2. clustering outliers are discarded, although they can be valuable hard samples for the target domain
    - the learning process is initialized by filling the hybrid memory with the most reliable target-domain clusters
    - incorporating more un-clustered instances into new clusters mitigates the effect of noisy pseudo labels and boosts feature learning
* Method
- Hybrid Memory
    - w: source-domain class centroids
    - v: target-domain instance features, initialized by encoding directly with fθ
    - c: target-domain cluster centroids
    - (the unified contrastive loss over w, c, v is sketched at the end of this section)
- Memory initialization
    - source-domain class centroids: extract source-sample features with the trained model and average them per class
    - target-domain instances: extract features directly with the trained model
    - target-domain cluster centroids: run DBSCAN on the initialized instance features, then compute each cluster's centroid
- Memory update
    - entries are continually refreshed with the freshly encoded features (a momentum-style moving average)
    - based on the reliability criterion, only reliable clusters are kept; ambiguous clusters are broken up and their samples returned to the un-clustered instances
- Self-paced Learning with Reliable Clusters
    - how to effectively find reliable clusters
    - reliability is measured by two criteria:
        - Independence of clusters
        - Compactness of clusters
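A sketch of both the memory initialization described above and the unified contrastive form over w, c, v (my reading of SpCL; eps, min_samples, and the temperature are illustrative, not the paper's exact settings).

```python
import torch
import torch.nn.functional as nnf
from sklearn.cluster import DBSCAN

def init_hybrid_memory(src_feats, src_labels, tgt_feats, eps=0.6):
    # source-domain class centroids w: mean feature per labeled class
    w = torch.stack([src_feats[src_labels == cls].mean(0)
                     for cls in src_labels.unique()])
    # target-domain pseudo labels from DBSCAN (-1 marks un-clustered outliers)
    pseudo = torch.from_numpy(
        DBSCAN(eps=eps, min_samples=4, metric='cosine')
        .fit_predict(tgt_feats.numpy()))
    # target-domain cluster centroids c: mean feature per pseudo cluster
    c = torch.stack([tgt_feats[pseudo == k].mean(0)
                     for k in pseudo.unique() if k != -1])
    return w, c, tgt_feats, pseudo          # v = the instance features

def unified_contrastive_loss(f, pos_idx, w, c, v_out, tau=0.05):
    """The query f is contrasted against every source class centroid,
    target cluster centroid, and un-clustered outlier instance; pos_idx is
    the index of f's own class/cluster/instance in that concatenation."""
    z = torch.cat([w, c, v_out], dim=0)     # all L2-normalized, shapes (*, D)
    logits = (z @ f) / tau                  # f: (D,)
    return nnf.cross_entropy(logits.unsqueeze(0), torch.tensor([pos_idx]))
```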
---
## Mask-guided Contrastive Attention Model for Person Re-Identification
- paper: https://openaccess.thecvf.com/content_cvpr_2018/CameraReady/1085.pdf
- code:
- comment: 2018 IEEE/CVF
---