---
title: "[Paper Review] STAR-Transformer - A Spatio-temporal Cross Attention Transformer for Human Action Recognition"
catalog: true
date: 2023-08-15
author: Frances Kao
categories:
- paper review
- transformer
- HAR
---

[![hackmd-github-sync-badge](https://hackmd.io/i9DwNzkbR5-e2nPl5HrjoQ/badge)](https://hackmd.io/i9DwNzkbR5-e2nPl5HrjoQ)

[STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition](https://arxiv.org/abs/2210.07503)

* Journal reference: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
* Authors: Dasom Ahn, Sangwon Kim, Hyunsu Hong, Byoung Chul Ko*
* github: none

---

# Introduction

![](https://hackmd.io/_uploads/SkntFtzth.png)

To solve the problem of fusing features from multi-modal data, the paper first **converts the features of the two modalities into recognizable vector features** that serve as the input to the STAR-transformer. Second, a **cross-attention module** is designed on top of the ViT architecture to avoid the heavy computation that the original multi-head attention incurs when applied to temporal video recognition.

## Problems facing current HAR approaches

* video-based: easily affected by external factors such as camera angle, subject size, and cluttered backgrounds
* skeleton-based: requires an additional pre-trained model to obtain the human skeleton, and the result is affected by skeleton-detection accuracy and the degree of skeleton overlap
* cross-modal (video + skeleton): combining the two modalities is an ambiguous process and requires a separate sub-model for cross-modal learning (i.e., a learning process that extracts and integrates useful information from both modalities)
* Vision Transformer (ViT): already applicable to image classification, but HAR also has to consider temporal relations, which leads to a **high computation cost**
* Methods that couple cross-modal information with a Transformer have not yet been developed.

### Main contributions of this work

* multi-feature representation method
  * input video -> global grid tokens
  * skeleton sequence -> joint map tokens
  * the tokens above are merged -> multi-class tokens
* Spatio-TemporAl cRoss transformer (STAR-transformer)
  * a cross-attention module is designed to learn cross-modal features, replacing the original multi-head attention of ViT
  * encoder
    * FAttn (full spatio-temporal attention)
    * ZAttn (zigzag spatio-temporal attention)
  * decoder
    * FAttn (full spatio-temporal attention)
    * BAttn (binary spatio-temporal attention)
* the first work that feeds spatial/temporal cross-modal data directly into a ViT without training an additional sub-model

---

# Method

![](https://hackmd.io/_uploads/HyYLbhGKh.png)

See also: [ViT](https://zhuanlan.zhihu.com/p/340149804)

## Cross-Modal Learning

The cross-modal part carries three kinds of information, namely global, local, and merged information, so there are also three classes of tokens. Feature extraction uses the pre-trained ResNet-MC18 (mixed convolution 18) [43]; global and local features are taken from its feature maps at different layers and are then further extracted and combined by the Transformer.

### Global grid token (GG-token) $\mathbb{T}^{t}_{g}$

* The **last layer** of ResNet18 is extracted as the global feature, giving an $h \times w$ feature map, which is flattened into a vector of length $hw$ (i.e., $P$); its entries are the elements $g^{t}_{1,...,P}$ of $\mathbb{T}^{t}_{g}$:

$\mathbb{T}^{t}_{g} = \{ g^{t}_{1}, ..., g^{t}_{P} \}$

### Joint map token (JM-token) $\mathbb{T}^{t}_{j}$

* A middle layer of ResNet18 is extracted as the local feature to capture finer details. Each element $j^{t}_{n}$ comes from concatenating the local feature map $F$ with the $n$-th joint heat map $h^{t}_{n}$. After extracting the local feature map $F$, the $n$-th joint heat map $h_{n}$ is the response of the $n$-th joint on a temporary feature map, Gaussian-blurred (scaled at $\sigma$) to a size of $h' \times w'$:

$\mathbb{T}^{t}_{j} = \{ j^{t}_{1}, ..., j^{t}_{N} \}$, where $N$ is the number of joint heat maps

![](https://hackmd.io/_uploads/S1rvvvLc3.png)

### Multi-class token $Z$

Vanilla ViT learns global relations mainly through a single class token (figure (a) below). This work instead adds a class token for each modality: on top of the overall class token $CLS_{total}$, the $\mathbb{T}_{g}$ and $\mathbb{T}_{j}$ tokens are appended, and only $\mathbb{T}_{j}$ keeps a positional encoding (this differs from the usual ViT positional encoding; the authors argue that positional information matters especially for the JM-tokens). The formulas are:

> $\mathbb{T}_{g} = CLS_{global} \oplus \mathbb{T}_{g}$
> $\mathbb{T}_{j} = CLS_{joint} \oplus (\mathbb{T}_{j} + pos)$
> $Z = \mathbb{T}_{g} \oplus \mathbb{T}_{j} \oplus CLS_{total}$

![](https://hackmd.io/_uploads/r1Duw2fth.png)
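To make the token assembly concrete, here is a minimal sketch of how $Z$ could be built from the GG-tokens, JM-tokens, and class tokens described above, assuming PyTorch-style `(batch, tokens, dim)` tensors; the function name `build_multiclass_tokens` and the tensor shapes are my own assumptions rather than the authors' code:

```python
import torch

def build_multiclass_tokens(gg_tokens, jm_tokens, pos, cls_global, cls_joint, cls_total):
    """Sketch of Z = T_g ⊕ T_j ⊕ CLS_total (shapes are assumptions, not from the paper).

    gg_tokens:  (B, P, d) flattened last-layer feature map (global grid tokens)
    jm_tokens:  (B, N, d) per-joint features from the mid-layer map (joint map tokens)
    pos:        (1, N, d) positional encoding, applied to the JM-tokens only
    cls_*:      (1, 1, d) learnable class tokens
    """
    B = gg_tokens.size(0)
    # T_g = CLS_global ⊕ T_g  (no positional encoding on the global grid tokens)
    t_g = torch.cat([cls_global.expand(B, -1, -1), gg_tokens], dim=1)
    # T_j = CLS_joint ⊕ (T_j + pos)  (only the joint map tokens keep positions)
    t_j = torch.cat([cls_joint.expand(B, -1, -1), jm_tokens + pos], dim=1)
    # Z = T_g ⊕ T_j ⊕ CLS_total
    return torch.cat([t_g, t_j, cls_total.expand(B, -1, -1)], dim=1)
```

Under this reading, $Z$ contains $P + N + 3$ tokens per frame, with the three class tokens later read out for classification.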
## Spatio-temporal cross attention

![](https://hackmd.io/_uploads/BkTyMuIqn.png)

### FAttn (full spatio-temporal attention)

The input contains information over all spatial and temporal dimensions, so the complexity is $T^{2}S^{2}$. The two other attention mechanisms split the temporal part $T$ into two halves, one assigned to the Q vectors and one to the K and V vectors, so their complexity is only a quarter of FAttn's.

$FAttn(Q, K, V) = \displaystyle\sum^{T}_{t} \displaystyle\sum^{S}_{s} Softmax{\dfrac{Q_{s,t} \cdot K_{s,t}}{\sqrt{d_{h}}}} V_{s,t}$

### ZAttn (zigzag spatio-temporal attention)

Q and K/V apply a decoupling of interleaved-frame information: the temporal information $T$ is split into odd and even frames, $a'$ and $a''$ are computed in a crossed fashion, and the two are then concatenated. First, $a'$ takes the odd frames of all TS tokens as the Q vectors and the even frames as the K and V vectors; $a''$ does the opposite, taking the odd frames as K and V and the even frames as Q. The goal is to capture subtle changes in the features.

$a' = \displaystyle\sum^{T/2}_{t} \displaystyle\sum^{S}_{s} Softmax{\dfrac{Q'_{s,t} \cdot K'_{s,t}}{\sqrt{d_{h}}}} V'_{s,t}$

$a'' = \displaystyle\sum^{T/2}_{t} \displaystyle\sum^{S}_{s} Softmax{\dfrac{Q''_{s,t} \cdot K''_{s,t}}{\sqrt{d_{h}}}} V''_{s,t}$

$ZAttn(Q, K, V) = a' \oplus a''$

Note: "zigzag" refers to a jagged, back-and-forth line; here it means the information is split apart in a zigzag pattern.

### BAttn (binary spatio-temporal attention)

BAttn likewise splits the original TS information, this time into a front half and a back half, assigned to the Q vectors and to the K and V vectors respectively.

$BAttn(Q, K, V) = b' \oplus b''$

## Encoder & Decoder

The proposed Transformer encoder-decoder structure is based on the standard Transformer, rather than the encoder-only design that ViT uses for images.

$\bar{Z}_{l} = LN\{FSTA(z_{l-1})+z_{l-1}\}$

$z'_{l}, z''_{l} = Decoupling(\bar{Z}_{l})$

$\hat{Z}_{l} = LN\{(STA(z'_{l})+z'_{l}) \oplus (STA(z''_{l})+z''_{l})\}$

$z_{l} = LN\{MLP(\bar{z}_{l}) + \bar{z}_{l}\}$

After passing through the STAR Transformer, the multi-class tokens are averaged and fed into an MLP to predict the action label; that is, the multiple class features are fused into a single class.
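As a concrete illustration of the zigzag decoupling, the sketch below implements a single-head ZAttn-style attention over tokens arranged as `(batch, T, S, dim)`; the layout, the function name `zattn`, and the use of plain scaled dot-product attention are assumptions for illustration, not the authors' implementation:

```python
import torch

def zattn(q, k, v):
    """Zigzag spatio-temporal attention sketch (T is assumed to be even).

    q, k, v: (B, T, S, d) tokens.
    a'  attends with odd-frame queries over even-frame keys/values;
    a'' does the opposite; the two results are concatenated.
    """
    B, T, S, d = q.shape
    odd, even = slice(0, T, 2), slice(1, T, 2)

    def attend(query, key, value):
        # Flatten the (T/2, S) token grid so attention spans space and time jointly.
        qf, kf, vf = (x.reshape(B, -1, d) for x in (query, key, value))
        scores = torch.softmax(qf @ kf.transpose(1, 2) / d ** 0.5, dim=-1)
        return (scores @ vf).reshape(B, T // 2, S, d)

    a1 = attend(q[:, odd], k[:, even], v[:, even])   # a'
    a2 = attend(q[:, even], k[:, odd], v[:, odd])    # a''
    return torch.cat([a1, a2], dim=1)                # ZAttn = a' ⊕ a''
```

Because each of the two calls attends over only $T/2$ frames, the score matrices are roughly a quarter the size of full attention over all $T$ frames, matching the complexity argument above; a BAttn-style variant would split the frames into first/second halves instead of odd/even.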
---

# Results

## Datasets

* [Penn-Action](http://dreamdragon.github.io/PennAction/)
* [NTU-RGB+D 60, NTU-RGB+D 120](https://arxiv.org/abs/1604.02808)
* Evaluation protocols
  * XSub: split by subject (40 subjects in total; the training and testing sets each contain 20 different subjects)
  * XView: split by camera view (3 views in total; the training and testing sets use camera 1 and cameras 2 & 3 respectively)
  * XSet: split by setup/scene

## Implementation details

* backbone: ResNet18 (pretrained on Kinetics-400)
* batch size: 4
* epochs: 300
* optimizer: SGD
* learning rate: 2e-4
* momentum: 0.9
* GPU/CPU: NVIDIA Tesla V100 GPU x4
* 16 fixed frames

### Quantitative analysis: Penn-Action

* Without using a pre-training model, STAR-Transformer achieves better performance when RGB + skeleton data are used together. -> In practice, though, the results are not much higher than the other SOTA methods.

![](https://hackmd.io/_uploads/HkXrSqzKh.png)

### Quantitative analysis: NTU RGB+D

![](https://hackmd.io/_uploads/BJxC8cGKh.png)

* Notes on the results
  * VPN [14] is slightly higher on NTU60 but slightly lower on NTU120: its performance drops once the number of classes grows or the scene changes, so it is not discussed further (same configuration).
  * PoseC3D [17] is much higher: it relies on a pre-trained PoseConv3D model, so its skeleton detection and action recognition are not end-to-end, i.e., the two cannot be integrated and trained in a single model, which makes the overall pipeline less efficient than expected.
  * KA-AGTN [31] is slightly lower on NTU60 but slightly higher on NTU120: it extracts features only from spatio-temporal skeleton information, and its performance is on par with this work's RGB + skeleton features.
  * In these experiments, multi-modal data consistently outperforms single-modal data.
  * The experiments show that STAR-Transformer's performance is not limited to particular subjects, views, or setups.

### Qualitative analysis (Ablation study)

| Ablation Study | Result |
|-|-|
| multi-expression learning | ![](https://hackmd.io/_uploads/S1_DSiGtn.png) |
| number of transformer layers | ![](https://hackmd.io/_uploads/H1LUHoMKn.png) |
| cross attention modules | ![](https://hackmd.io/_uploads/rkiMUjGKh.png) |
| spatio-temporal cross attention | ![](https://hackmd.io/_uploads/HkKo8jfth.png) |

* spatio-temporal cross attention: this experiment measures, over time, which information each attention mechanism attends to. The three mechanisms focus on different parts: FAttn does not consider the full temporal range but focuses on the last three frames; ZAttn focuses on the middle and late frames; BAttn covers the early, middle, and late frames. The cross-attention-modules experiment shows the same thing: combinations of mechanisms that attend to different positions yield different accuracies. For example, using only FAttn (which focuses on the last three frames) as both encoder and decoder can miss information and gives a relatively lower accuracy (96.1%).
* The authors state that action recognition cannot focus on only some frames; features from all frames must be learned and considered evenly. -> Open questions: within this work, what is the difference between using and not using the attention mechanism? What would an experiment look like that attends to all frames uniformly without an attention mechanism?

---

# Future works

* Design a more efficient model, e.g., one that can be trained with a smaller amount of data
* Modify the model into an end-to-end model that performs skeleton feature extraction and action recognition simultaneously

---

## Additional questions

* What exactly are the three classes?
  * They are trainable tokens representing the global feature, the local feature, and the multi-class feature respectively; attention training makes them represent the input, so they are not action-class labels.
* How is the final class decided?
  * After the transformer, multi-class feature tokens are produced; averaging these tokens and feeding the result into an MLP gives the final action class. The paper does not describe the MLP structure, since it is not the focus of the paper.
* What are the 16 fixed frames?
  * Presumably each sample is fixed to 16 frames (T = 16), but the paper does not explain how the dataset is processed into a fixed number of frames (e.g., keep the first frames or the last?). The advantage of this approach is that it reduces the computation of the STAR-transformer model and therefore speeds up training; it can also reduce noise and redundant information in the action sequence and thus improve accuracy. However, it also has drawbacks: some important motion information may be lost, because only one frame is taken from each time segment. When implementing the STAR-transformer model, an appropriate method should therefore be chosen according to the application scenario and requirements. (A minimal sampling sketch follows after this list.)
* Only the joint map tokens receive a position embedding. The paper does not explain why positional information is more important for the joint map than for the global grid; the reason may be that the task targets human-skeleton motion. There is no experiment showing how the results would differ if the global grid tokens also had a position embedding, or if neither had one.
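For the 16-fixed-frames question above, one plausible reading (not confirmed by the paper) is uniform segment-based sampling: split each clip into 16 equal segments and keep one frame per segment. A minimal sketch, with the function name and padding rule as assumptions:

```python
import numpy as np

def sample_fixed_frames(num_frames, target=16):
    """Pick `target` frame indices spread evenly over a clip of `num_frames` frames.

    This is only one plausible interpretation of "16 fixed frames" (T = 16);
    the paper does not specify its sampling strategy.
    """
    if num_frames <= target:
        # Short clips: keep every frame and repeat the last one to pad up to `target`.
        return list(range(num_frames)) + [num_frames - 1] * (target - num_frames)
    # Otherwise take the centre frame of each of `target` equal-length segments.
    bounds = np.linspace(0, num_frames, target + 1)
    return [int((bounds[i] + bounds[i + 1]) // 2) for i in range(target)]

# Example: a 100-frame clip is reduced to 16 representative frame indices.
print(sample_fixed_frames(100))  # [3, 9, 15, 21, 28, 34, ...]
```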
