# Person Re-identification via Contrastive Learning
###### tags: `person reid`
## An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)
* Unknown
- layer norm:
- multi-head self-attention:
    - multi-head attention projects Q, K, V through h different linear transformations, then concatenates the h attention results
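A minimal PyTorch sketch (mine, not the paper's code) of multi-head self-attention: h linear projections of Q, K, V, scaled dot-product attention per head, then concatenation of the head outputs.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """h projections of Q, K, V, scaled dot-product attention per head,
    then concatenation of the h attention results."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)   # Q, K, V projections in one matmul
        self.proj = nn.Linear(dim, dim)      # output projection after concat

    def forward(self, x):                    # x: (batch, seq_len, dim)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # inner product of Q and K
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)  # concat the heads
        return self.proj(out)
```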
* Core concepts
    - Replaces the CNN entirely with the self-attention mechanism.
    - Since the Transformer was originally designed for NLP, the image has to be serialized: it is split into patches, and the patches arranged in a fixed order become the sequential input.
* Tips
    - Transformers work well when trained on large datasets.
* Method
1. Convert the image into sequential data
    - split the image into patches and reshape each patch into a vector, giving the so-called flattened patch
    - concatenating the reshaped vectors of all $N$ patches yields an $N \times (P^2 \cdot C)$ matrix, the counterpart of the word-embedding sequence fed into a Transformer in NLP
2. Avoid the influence of patch size (see the end-to-end sketch after step 5)
    - apply a Linear Projection to the flattened patches
    - it maps the flattened patch vectors of varying length to fixed-length vectors (denoted as $D$-dimensional vectors)
3. Position embedding
    - the Transformer model itself carries no position information
    - in the paper's figure, the purple boxes numbered 0-9 are the position embeddings of the individual positions
    - position information is injected by adding the position embedding (purple box) to the patch embedding (pink box)
4. Learnable embedding
    - the starred pink box is not produced from any patch
    - denoted $x_\text{class}$; its output from the encoder is taken as the representation of the whole image
    - why add it? to prevent the overall representation from being biased toward any single patch's embedding
5. Transformer encoder
    - stacks blocks that alternate multi-head self-attention and an MLP, each preceded by layer norm and wrapped in a residual connection
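An end-to-end sketch of steps 1-5 (ViT-Base-like values; names are mine, not the paper's code): flatten the patches, project to $D$ dims, prepend the learnable class token, add position embeddings, and run pre-norm encoder blocks.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=768, heads=12, depth=2):
        super().__init__()
        n = (img // patch) ** 2                        # number of patches N
        self.patch = patch
        self.proj = nn.Linear(patch * patch * 3, dim)  # step 2: Linear Projection
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # step 4: x_class
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # step 3: positions
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                'norm1': nn.LayerNorm(dim),
                'attn': nn.MultiheadAttention(dim, heads, batch_first=True),
                'norm2': nn.LayerNorm(dim),
                'mlp': nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                     nn.Linear(dim * 4, dim)),
            }) for _ in range(depth)])                 # step 5: encoder blocks

    def forward(self, x):                              # x: (B, 3, H, W)
        B = x.size(0)
        # step 1: split into P x P patches and flatten each to a vector
        x = nn.functional.unfold(x, self.patch, stride=self.patch)  # (B, 3*P*P, N)
        x = self.proj(x.transpose(1, 2))               # (B, N, D)
        x = torch.cat([self.cls.expand(B, -1, -1), x], dim=1) + self.pos
        for blk in self.blocks:                        # pre-norm MSA + MLP, residuals
            h = blk['norm1'](x)
            x = x + blk['attn'](h, h, h, need_weights=False)[0]
            x = x + blk['mlp'](blk['norm2'](x))
        return x[:, 0]                                 # class token = image representation
```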
## TransReID: Transformer-based Object Re-Identification (builds on the paper above)
- paper: https://arxiv.org/pdf/2102.04378.pdf
- code: https://github.com/heshuting555/TransReID
- comments: 2021 (Alibaba, Zhejiang University)
* Unknown
- Vision Transformer (ViT)
- Side Information Embedding (SIE) module
    - handles non-visual information (side information) such as viewpoint and camera style
- JPM: Jigsaw Patch Module
    - small patches are shuffled and regrouped into larger patches inside the JPM
    - 1) regrouping the patches makes the model robust to perturbations; 2) the newly built patches still contain global information
- What a Transformer layer does (difference from a CNN)
    - models with self-attention
    - multi-head attention
        - projects Q, K, V through h different linear transformations, then concatenates the different attention results
    - self-attention
        - (Q, K, V):
        - 1. match each Q against every K (attention) via an inner product
    - multi-head attention
        - adds one more projection layer: from a single matrix, Q gains more forms of expression, and the different matrices may capture global or local features
    - m: the shift step of the JPM (the first m patches are moved to the end)
    - k: the number of groups the shuffled patches are regrouped into
    - N: the number of fixed-sized patches
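A sketch of my reading of the JPM rearrangement, using the m/k/N notation above; a random permutation stands in for the paper's deterministic patch-shuffle, so treat it as illustrative only.

```python
import torch

def jigsaw_regroup(tokens, m=5, k=4):
    """Keep the [cls] token, shift the N patch tokens by m (first m moved
    to the end), shuffle, and regroup into k larger "patches"; each group
    still mixes patches from all over the image. tokens: (B, N + 1, D)."""
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]
    patches = torch.roll(patches, shifts=-m, dims=1)              # shift step m
    perm = torch.randperm(patches.size(1), device=patches.device)
    patches = patches[:, perm]                                    # patch shuffle
    groups = patches.chunk(k, dim=1)                              # k regrouped groups
    return [torch.cat([cls_tok, g], dim=1) for g in groups]
```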
* Supplement
    - the first work to study ReID with a pure Transformer
* Related Work
- Side Information
    - Images captured across a camera network vary greatly in pose, orientation, illumination, resolution, etc., due to different camera setups and object viewpoints. Viewpoint/orientation-invariant feature learning [7, 60] is equally important for person and vehicle ReID.
* Methodology
- Transformer-based strong baseline
    - two main stages (feature extraction and supervised learning)
- Overlapping Patches
    - patches are generated with a sliding window whose stride S is smaller than the patch size P, so adjacent patches overlap and local neighboring structures are preserved (see the sketch after this list)
- Position Embeddings
    - the model pre-trained on ImageNet is loaded into the network
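A quick check of the overlapping-patch arithmetic, assuming a sliding-window split; 256x128 is the usual person-ReID input size, while P and S below are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn as nn

# With stride S < patch size P, adjacent patches share P - S pixels.
H, W, P, S = 256, 128, 16, 12
img = torch.randn(1, 3, H, W)
patches = nn.functional.unfold(img, kernel_size=P, stride=S)  # (1, 3*P*P, N)
n_h, n_w = (H - P) // S + 1, (W - P) // S + 1
assert patches.shape[-1] == n_h * n_w                         # N = 21 * 10 = 210
```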
---
## Re-Identification with Consistent Attentive Siamese Networks
- paper: https://openaccess.thecvf.com/content_CVPR_2019/papers/Zheng_Re-Identification_With_Consistent_Attentive_Siamese_Networks_CVPR_2019_paper.pdf
- code: https://github.com/salehe-e/siamese_person_re_id
- comments: CVPR19
* Motivation
    - 1) The IDE model only attends to part of the image. 2) When the features of two images are compared, nothing guarantees that the attended regions are consistent (e.g., for the same ID, the query image attends to the upper body while the gallery image attends to the lower body, causing a mismatch).
* Introduction
    - Cutting the image into patches and feeding them to ViT as a sequence fully exploits the relations between the patches, so it should be effective.
---
## Fully Unsupervised Person Re-identification via Selective Contrastive Learning
- paper: https://arxiv.org/pdf/2010.07608.pdf
- code: none
---
## Joint Generative and Contrastive Learning for Unsupervised Person Re-identification
- paper: https://arxiv.org/abs/2012.09071
- code: https://github.com/chenhao2345/GCL
- Comments: CVPR 2021
- uses contrastive learning to tackle the viewpoint problem
* Unknown
- Human Mesh Recovery (HMR), paper [21]:
    - reconstructing the human mesh from a single RGB image is called HMR
    - Two kinds of data are used: 2D images with keypoint annotations, and 3D meshes carrying pose and shape information; the two unpaired datasets are trained on jointly. Given an input image, the model first predicts 3D mesh parameters and camera parameters, then projects the 3D keypoints back onto the 2D plane and checks whether they fit the 2D keypoints of the original image. This 2D-to-3D-to-2D scheme makes good use of data without 3D annotations.
    - it outputs a 3D mesh rather than the 3D skeleton of earlier methods, which makes the result more useful downstream
- cycle consistency
    - we want to map from domain X to domain Y
    - mapping G realizes X -> Y, i.e., G(X) has the same distribution as Y
    - inverse mapping F: Y -> X
    - cycle consistency loss: F(G(X)) ≈ X
    - and vice versa, G(F(Y)) ≈ Y
- memory bank
    - renewed at the beginning of every epoch
- adversarial losses
    - push the distribution of the generated data toward the real data distribution
- cycle consistency losses
    - keep the generators G and F from contradicting each other, i.e., generated data can still be mapped back, roughly X -> Y -> X (sketched in code after this list)
- how are positive features found in the memory?
- how are structure and identity combined in G?
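A minimal sketch of the cycle-consistency terms above, using an L1 reconstruction loss as in CycleGAN; G and F stand for any two mapping networks.

```python
import torch.nn.functional as nnf

def cycle_consistency_loss(G, F, x, y):
    """F(G(x)) should reconstruct x, and G(F(y)) should reconstruct y.
    G: X -> Y, F: Y -> X; x, y are batches from the two domains."""
    return nnf.l1_loss(F(G(x)), x) + nnf.l1_loss(G(F(y)), y)
```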
* Introduction
- we use a GAN as a novel view generator for contrastive learning, which does not require a labeled source dataset.
- we aim at enhancing view diversity for contrastive learning via generation under the fully unsupervised setting
- We estimate 3D meshes from unlabeled training images, then rotate these 3D meshes to simulate new structures.
- contrastive module
    - reduces intra-class variation between original and generated images
- generative module
    - a 3D mesh based novel-view generator
- Components
    - a joint generative and contrastive learning framework
    - estimates body shape and pose from a single RGB image
    - s_ori: the 2D projection of a 3D mesh, used as the original structure
* Proposed Method
    - first, the View Generator applies cycle-consistency to both mesh-guided structure features and identity features (CNN) to generate a person in new viewpoints
    - the original and generated views are then exploited as positive pairs in the View Contrast Module
* View Generator
    - combines the identity feature from E_id with a mesh-guided structure feature to synthesize the person in a new viewpoint
* View Contrast (Contrastive Module)
    - the View Generator alone only extracts features from the original viewpoint and lacks information from the rotated viewpoints,
    - so the contrastive module enhances the view-invariance of the representations
    - First
        - find positive and negative images
        - all instance representations are stored in a memory bank (see the sketch below)
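A hedged stand-in for the view-contrast idea, not the paper's exact loss: an InfoNCE-style objective where the generated view is the positive and memory-bank entries act as negatives (assumed here to exclude the query's own slot).

```python
import torch
import torch.nn.functional as nnf

def view_contrast_loss(f_orig, f_gen, memory, tau=0.07):
    """f_orig, f_gen: (B, D) L2-normalized features of the original and
    generated views of the same instances; memory: (M, D) bank entries."""
    pos = (f_orig * f_gen).sum(dim=1, keepdim=True) / tau  # (B, 1) positive pair
    neg = f_orig @ memory.t() / tau                        # (B, M) negatives
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(len(f_orig), dtype=torch.long, device=f_orig.device)
    return nnf.cross_entropy(logits, labels)               # positive sits at index 0
```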
* Joint Training
    - enhances the quality of the representations built by the shared identity encoder E_id
* Summary
    - the generative module provides online data augmentation, which enhances positive-view diversity for the contrastive module
    - the contrastive module, in turn, learns view-invariant representations by matching original and generated views, which refines the generation quality (learning view-invariant features, thereby improving the generator)
    - E_id extracts the view-invariant features
---
## CUPR: Contrastive Unsupervised Learning for Person Re-identification
- paper: https://www.scitepress.org/Papers/2021/102399/102399.pdf
- comment: VISAPP21-6B: Oral Presentations
---
## Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID
- paper: https://arxiv.org/pdf/2006.02713.pdf
- code: https://github.com/yxgeee/SpCL
* Unknown
    - Self-paced learning
        - the "easy-to-hard" training scheme is at the core of self-paced learning
        - curriculum learning (CL) feeds the model samples ordered by difficulty according to some prior; the biggest difference in self-paced learning (SPL) is that this ordering prior is embedded in the model itself, is dynamic, and can be optimized during training
    - non-parametric
        - if f is an outlier sample in the target domain, then z+ is the outlier sample in the target domain that matches f (??? not explained in the paper)
* Abstract
    - domain-adaptive object Re-ID aims to transfer knowledge learned from a labeled source domain to an unlabeled target domain
    - recent pseudo-label-based methods suffer from the domain gap and unsatisfactory clustering
    - a hybrid memory is proposed that dynamically generates source-domain class-level, target-domain cluster-level, and un-clustered instance-level supervisory signals for feature learning
* Introduction
    - (1) supervised pre-training on the source domain, (2) unsupervised fine-tuning on the target domain
    - two major limitations hinder previous methods:
        - 1. source-domain images were either not considered or only used for pre-training
        - 2. clustering outliers are discarded, although they can be valuable hard samples for the target domain
    - the learning process is initialized by filling the hybrid memory with the most reliable target-domain clusters
    - incorporating more un-clustered instances into new clusters mitigates the effect of noisy pseudo labels and boosts feature learning
* Method
- Hybrid Memory
    - w: source-domain class centroids
    - v: target-domain instance features, initialized by encoding directly with fθ
    - c: target-domain cluster centroids
    - (the unified contrastive loss over w, c, v is sketched at the end of this section)
- Memory initialization
    - source-domain class centroids: extract source-sample features with the trained model and average them per class
    - target-domain instances: extract features directly with the trained model
    - target-domain cluster centroids: run DBSCAN on the initialized instance features, then compute each cluster's centroid
- Memory update
    - entries are continually refreshed with the freshly encoded features (a momentum-style moving average)
    - based on the reliability criterion, only reliable clusters are kept; ambiguous clusters are broken up and their samples returned to the un-clustered instances
- Self-paced Learning with Reliable Clusters
    - how to effectively find reliable clusters
    - reliability is measured by two criteria:
        - Independence of clusters
        - Compactness of clusters
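A sketch of both the memory initialization described above and the unified contrastive form over w, c, v (my reading of SpCL; eps, min_samples, and the temperature are illustrative, not the paper's exact settings).

```python
import torch
import torch.nn.functional as nnf
from sklearn.cluster import DBSCAN

def init_hybrid_memory(src_feats, src_labels, tgt_feats, eps=0.6):
    # source-domain class centroids w: mean feature per labeled class
    w = torch.stack([src_feats[src_labels == cls].mean(0)
                     for cls in src_labels.unique()])
    # target-domain pseudo labels from DBSCAN (-1 marks un-clustered outliers)
    pseudo = torch.from_numpy(
        DBSCAN(eps=eps, min_samples=4, metric='cosine')
        .fit_predict(tgt_feats.numpy()))
    # target-domain cluster centroids c: mean feature per pseudo cluster
    c = torch.stack([tgt_feats[pseudo == k].mean(0)
                     for k in pseudo.unique() if k != -1])
    return w, c, tgt_feats, pseudo          # v = the instance features

def unified_contrastive_loss(f, pos_idx, w, c, v_out, tau=0.05):
    """The query f is contrasted against every source class centroid,
    target cluster centroid, and un-clustered outlier instance; pos_idx is
    the index of f's own class/cluster/instance in that concatenation."""
    z = torch.cat([w, c, v_out], dim=0)     # all L2-normalized, shapes (*, D)
    logits = (z @ f) / tau                  # f: (D,)
    return nnf.cross_entropy(logits.unsqueeze(0), torch.tensor([pos_idx]))
```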
---
## Mask-guided Contrastive Attention Model for Person Re-Identification
- paper: https://openaccess.thecvf.com/content_cvpr_2018/CameraReady/1085.pdf
- code:
- comment: 2018 IEEE/CVF
---