---
tags: Paper
---
# ViT & T2T-ViT & HIPT Comparison
<!-- ## Terms
WSI: Whole-Slide Imaging (a single slide can be as large as 150,000 × 150,000 pixels)
HIPT: Hierarchical Image Pyramid Transformer
MIL: Multiple Instance Learning
-->
## ViT

<br>

Split the image into patches (each of size d~1~ x d~2~ x d~3~)
$\rightarrow$ vectorize each patch into a d~1~d~2~d~3~ x 1 column vector
$\rightarrow$ apply an FC (linear) transformation with trainable parameters, and add the positional embedding matrix
$\rightarrow$ feed these n vectors, together with the CLS vector, into the transformer, which outputs n+1 vectors
$\rightarrow$ pass the CLS feature vector through a softmax classifier to get the final result
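
A minimal PyTorch sketch of this pipeline. The hyper-parameters (patch size 16, 4 layers, 10 classes) and the use of the stock `nn.TransformerEncoder` are illustrative assumptions, not the original ViT configuration:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Sketch: patchify -> flatten -> linear projection + positional embedding
    -> [CLS] + n patch tokens -> transformer -> softmax head on the CLS output."""
    def __init__(self, img=224, patch=16, dim=192, depth=4, heads=3, n_classes=10):
        super().__init__()
        self.patch = patch
        n = (img // patch) ** 2                                 # n patches per image
        self.proj = nn.Linear(3 * patch * patch, dim)           # FC projection of each flattened patch
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))         # learnable CLS token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))     # learnable positional embeddings
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                                       # x: (B, 3, H, W)
        B, C, H, W = x.shape
        p = self.patch
        # split into non-overlapping p x p patches and flatten each to a C*p*p vector
        x = x.unfold(2, p, p).unfold(3, p, p)                   # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        tokens = self.proj(x)                                   # (B, n, dim)
        tokens = torch.cat([self.cls.expand(B, -1, -1), tokens], dim=1) + self.pos
        out = self.encoder(tokens)                              # (B, n + 1, dim)
        return self.head(out[:, 0]).softmax(dim=-1)             # classify from the CLS feature vector

probs = TinyViT()(torch.randn(2, 3, 224, 224))                  # (2, 10) class probabilities
```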

---
## T2T-ViT


---
## Hierarchical Image Pyramid Transformer (HIPT)

Can be written as:

:::spoiler
**Notation**
* x~R~: image with resolution R x R
* ViT~A~-l: ViT working on an A x A image resolution with token shape [l x l]
* {x~l~^(i)^}~i=1~^M^ ∈ R^M×dl^: token sequence extracted from x~R~
    * sequence length = M
    * token size = [l x l]
    * embedding dimension = d~l~
:::
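For example, the patch-level model in the HIPT paper is written as ViT~256~-16: a ViT operating on 256 x 256 images with 16 x 16 tokens.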
<br>
* Recursively subdivide the whole slide into non-overlapping tokens
* 3-level aggregation (a forward-pass sketch appears after the WSI properties below)
* Hierarchical pre-training (based on [student-teacher knowledge distillation](https://chtseng.wordpress.com/2020/05/12/%E7%9F%A5%E8%AD%98%E8%92%B8%E9%A4%BE-knowledgedistillation/)), carried out stage by stage (see the distillation sketch after this list):
    1. pre-train the patch-level ViT first
    2. then pre-train the region-level ViT
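
A minimal sketch of the student-teacher self-distillation used for this stage-wise pre-training (the HIPT repo builds on DINO). The temperatures, EMA momentum, and `train_step` wrapper below are illustrative assumptions, and the centering update is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    and the student distribution (DINO-style self-distillation)."""
    t = F.softmax((teacher_out - center) / t_t, dim=-1).detach()
    s = F.log_softmax(student_out / t_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    """Teacher weights are an exponential moving average of the student's."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1 - m)

def train_step(student, teacher, center, view_a, view_b, opt):
    """One illustrative step: `student` is the ViT being pre-trained
    (patch-level first, then region-level); `view_a` / `view_b` are two
    augmented crops of the same patch or region."""
    with torch.no_grad():                            # no gradients through the teacher
        t_a, t_b = teacher(view_a), teacher(view_b)
    loss = dino_loss(student(view_a), t_b, center) + dino_loss(student(view_b), t_a, center)
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(teacher, student)                     # teacher follows the student by EMA
    return loss.item()
```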
Based on two properties of WSI images:
1. They have a fixed scale
2. There exists a hierarchical structure of visual tokens at varying image resolutions
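
A minimal sketch of the 3-level bottom-up aggregation that exploits this hierarchy, assuming the 256 / 4096 pixel levels described in the HIPT paper and hypothetical `vit_patch` / `vit_region` / `vit_slide` models that each return CLS embeddings:

```python
import torch

def hipt_style_forward(wsi, vit_patch, vit_region, vit_slide):
    """3-level bottom-up aggregation sketch.

    wsi: (3, H, W) slide tensor with H, W multiples of 4096.
    vit_patch / vit_region / vit_slide: hypothetical models mapping a batch
    of inputs to CLS embeddings of dimension d.
    """
    # Split the slide into non-overlapping 4096 x 4096 regions
    crops = wsi.unfold(1, 4096, 4096).unfold(2, 4096, 4096)
    crops = crops.permute(1, 2, 0, 3, 4).reshape(-1, 3, 4096, 4096)

    region_embeddings = []
    for region in crops:
        # Level 1: each region is a 16 x 16 grid of 256 x 256 patches
        patches = region.unfold(1, 256, 256).unfold(2, 256, 256)           # (3, 16, 16, 256, 256)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, 256, 256)  # (256, 3, 256, 256)
        patch_cls = vit_patch(patches)                                     # (256, d) patch-level CLS tokens
        # Level 2: patch CLS tokens become the input tokens of the region-level ViT
        region_embeddings.append(vit_region(patch_cls.unsqueeze(0)))       # (1, d)

    # Level 3: region CLS tokens become the input tokens of the slide-level ViT
    region_tokens = torch.cat(region_embeddings, dim=0).unsqueeze(0)       # (1, n_regions, d)
    return vit_slide(region_tokens)       # slide-level embedding for downstream classification
```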
<!--
Like hierarchical attention networks in long document modeling, in which **word embeddings** within sentences are aggregated to form **sentence-level embeddings** and subsequently aggregated into **document-level embeddings**. -->
[HIPT source code](https://github.com/mahmoodlab/HIPT)