---
title: "[Paper Review] Something-Else Compositional Action Recognition with Spatial-Temporal Interaction Networks"
catalog: true
date: 2023-09-30
author: Frances Kao
categories:
- paper review
- HAR
- Object-relation HAR
---

# Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks

[![hackmd-github-sync-badge](https://hackmd.io/wavzvcNCTGO4g0cQbiwCsA/badge)](https://hackmd.io/wavzvcNCTGO4g0cQbiwCsA)

- Journal reference: CVPR 2020
- Authors: Materzynska, Joanna and Xiao, Tete and Herzig, Roei and Xu, Huijuan and Wang, Xiaolong and Darrell, Trevor
- GitHub: https://github.com/joaanna/something_else

## Introduction

RGB-based HAR models do not treat the objects in the input as independent entities, even though those objects are in fact **part of the action and should be modeled as independent entities and reasoned about jointly**. The authors therefore propose a method that explicitly **captures the composition between an action and its objects**. The premise is that existing HAR methods are already built on top of spatial information, so paying closer attention to spatial features should further improve action recognition.

### Spatial-Temporal Interaction Network (STIN)

An action recognition model that takes the geometric relations between the subject and the object(s) of an action into account. STIN reasons over sparse candidate graphs derived from detection and tracking results.

1. Input: the positions and shapes of the objects and the subject.
2. Perform spatial interaction reasoning on the input.
3. Perform temporal interaction reasoning over the boxes along each trajectory, i.e., encode both the transformation of the objects and the relations between subject and objects.
4. Combine the subject and object trajectories for action recognition.
5. Actions without two-object interaction (e.g., pouring water or tearing paper) are recognized with a plain spatio-temporal scene representation instead (the baseline).

### Something-Else task

Extending the [Something-Something V2 dataset](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwiB2sfziNmBAxVLslYBHSHTDDMQFnoECA8QAQ&url=https%3A%2F%2Fpaperswithcode.com%2Fdataset%2Fsomething-something-v2&usg=AOvVaw0G1EXIeKID1UAip5dZ_cPj&opi=89978449), the authors add:

- annotations: bounding boxes
- new train-test splits

### Contribution

1. STIN (Spatial-Temporal Interaction Network), which explicitly models the changing geometric configuration between a person and the objects involved.
2. Two task datasets for testing model generalization, including bounding-box annotations for the videos.
3. Performance that surpasses other appearance-based HAR models.

## Related Work

> However, a recent study in [77] indicates that most of the current models trained with the above-mentioned datasets are not focusing on temporal reasoning but the appearance of the frames: Reversing the order of the video frames at test time will lead to almost the same classification result

> Misra et al. [45] propose a method to compose classifiers of known visual concepts and apply this model to recognize objects with unseen combinations of concepts.

> Visual Interaction Network [69], which models the physical interactions between objects in a simulated environment.

## Method

- detector and tracker >> object-graph representations (bbox)
- spatial-temporal reasoning

### Object-centric Representation

Run object detection on the video (the detector finds hands and the objects they interact with). When training the detector, all interacting objects are treated as one class, so there are only two classes: the subject (hand) and the object (any interacting object). After N objects are detected, multi-object tracking links the boxes across frames.

- Bounding box coordinates (location & shape): the box parameters (x, y, h, w) are fed into an MLP to obtain a d-dimensional feature, i.e., the object's position and shape are used to extract the object and its motion information.
- Object identity embedding: a learnable d-dimensional embedding represents **the category of each box**:
  - subject embedding
  - object embedding
  - null embedding: boxes irrelevant to the action (dummy boxes)
  - All three identity embeddings are initialized from (mutually independent) multivariate normal distributions and combined with the box features above as the model input (a minimal sketch follows below).
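To make this concrete, here is a minimal PyTorch sketch of the representation. The class and attribute names (`BoxEmbedding`, `coord_mlp`) and the choice to merge by addition are my own illustration, not the authors' code; the paper only states that the identity embedding is combined with the box feature.

```python
import torch
import torch.nn as nn

class BoxEmbedding(nn.Module):
    """Hypothetical sketch: per-box feature = MLP(x, y, h, w) + identity embedding."""
    SUBJECT, OBJECT, NULL = 0, 1, 2  # the three identity categories from the paper

    def __init__(self, d=512):
        super().__init__()
        # MLP lifting the 4 box coordinates to a d-dimensional feature
        self.coord_mlp = nn.Sequential(nn.Linear(4, d), nn.ReLU(), nn.Linear(d, d))
        # learnable identity embeddings, initialized from a normal distribution
        self.identity = nn.Embedding(3, d)
        nn.init.normal_(self.identity.weight)

    def forward(self, boxes, identities):
        # boxes: (T, N, 4) as (x, y, h, w); identities: (T, N) integer labels
        # "combined" is realized here as a sum; concatenation would also fit the text
        return self.coord_mlp(boxes) + self.identity(identities)
```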
### Spatial-temporal interaction reasoning

Suppose the video has $T$ frames with $N$ objects per frame, and let $x^t_n$ denote the feature of object $n$ in frame $t$:

$$X = (x^1_1, \dots, x^1_N, x^2_1, \dots, x^2_N, \dots, x^T_1, \dots, x^T_N)$$

![](https://hackmd.io/_uploads/HytdYsIgT.png)

#### Spatial interaction module

$$f(x^t_n) = \mathrm{ReLU}\left(W_f^T \left[x^t_n, \dfrac{1}{N-1} \sum_{j \neq n} x^t_j\right]\right)$$

- $[\cdot,\cdot]$ denotes concatenating $x^t_n$ with the aggregated features of the other $N-1$ objects
- $W_f$ is a learnable weight

![](https://hackmd.io/_uploads/SyrIKiIga.png)

#### Temporal interaction module

$$p(X) = W_p^T\, h\left(\{g(x^1_n, \dots, x^T_n)\}_{n=1}^{N}\right)$$

- $g(x^1_n, \dots, x^T_n)$ encodes one object's features along its temporal trajectory
- $h$ aggregates the information across trajectories; it can be
  - a simple averaging function, or
  - a [non-local block [67]](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwjyqIu2kNmBAxUI_mEKHR7PC50QFnoECAgQAQ&url=https%3A%2F%2Fopenaccess.thecvf.com%2Fcontent_cvpr_2018%2Fpapers%2FWang_Non-Local_Neural_Networks_CVPR_2018_paper.pdf&usg=AOvVaw3lpouVx5j7REXCf-OFWmzS&opi=89978449): encode pairwise trajectory features, then average them
- $W_p$ is the final classifier (trained with cross-entropy)

Both modules are sketched in code at the end of this section.

#### Combining video appearance representation

- Feed the $T$ frames into a 3D ConvNet ([ResNet-50 I3D [68]](https://arxiv.org/abs/1806.01810)) to extract a spatio-temporal representation.
- Average-pool that representation into a d-dimensional feature.
- This video appearance representation is then combined with the object representation $h(\{g(x^1_n, \dots, x^T_n)\}_{n=1}^{N})$ and fed into the classifier.

> Given a long clip of video (around 5 seconds), we sample 32 video frames from it with the same temporal duration between every two frames. [68]
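As promised above, the two interaction formulas map almost line-for-line onto code. Below is a minimal PyTorch sketch using the simple-averaging variant of $h$; the class name, the time-averaging simplification inside $g$ (the paper encodes the whole sequence $(x^1_n, \dots, x^T_n)$), and all dimensions are my own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class STINHead(nn.Module):
    """Sketch of spatial + temporal interaction reasoning over features X: (T, N, d)."""

    def __init__(self, d=512, num_classes=174):
        super().__init__()
        self.W_f = nn.Linear(2 * d, d)        # spatial: acts on [x_n, mean of others]
        self.g = nn.Linear(d, d)              # per-trajectory temporal encoder (simplified)
        self.W_p = nn.Linear(d, num_classes)  # final classifier, trained with CE loss

    def forward(self, X):
        T, N, d = X.shape
        # f(x_n^t) = ReLU(W_f^T [x_n^t, (1/(N-1)) * sum_{j != n} x_j^t])
        others = (X.sum(dim=1, keepdim=True) - X) / (N - 1)
        f = torch.relu(self.W_f(torch.cat([X, others], dim=-1)))  # (T, N, d)
        # g over each trajectory: average over time then project (a simplification)
        g = torch.relu(self.g(f.mean(dim=0)))                     # (N, d)
        # h: the simple-averaging variant over the N trajectories
        h = g.mean(dim=0)                                         # (d,)
        return self.W_p(h)                                        # class logits

logits = STINHead()(torch.randn(32, 4, 512))  # 32 frames, 4 boxes per frame
```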
## The Something-Else Task

- Derived from the Something-Something V2 dataset [20]: 174 human-object interaction categories, each composed of an action (verb) applied to arbitrary objects (nouns).
- Its original split does not account for the fact that the same person, or the same action-object combination, can appear in both training and test data. This biases the data, so a model that performs well on this dataset may lack generality.
- The authors add **bounding-box annotations** and **new training splits**. The Something-Else task requires recognizing an action performed on unseen objects, i.e., objects that never co-occur with that action during training. The method is therefore trained on "something" but tested on its ability to generalize to "something else". Every action category in the dataset is described as a phrase composed of the same verb and different nouns. The authors reorganize the dataset for compositional action recognition and model the dynamics of each action's inter-object geometric configuration over time.

![](https://hackmd.io/_uploads/ry5h5pIlp.png)

## Experiments

### Implementation Details

- Detector
  - detects the subject and the interacting objects in the video
  - model: Faster R-CNN [50, 71] with a Feature Pyramid Network (FPN) [42] and a ResNet-101 [24] backbone, pre-trained on the COCO [43] dataset
  - then finetuned on the authors' Something-Else annotations (with bounding boxes)
  - finetuning uses only the two classes, hand and interacting object; the number of interacting objects is at most 4
- Tracker (multi-object tracking)
  - extracts the correspondence between objects across frames
  - method:
    - [**Kalman filter** [34]](https://zh.wikipedia.org/zh-tw/%E5%8D%A1%E5%B0%94%E6%9B%BC%E6%BB%A4%E6%B3%A2): predicts each object's likely location in the current frame from the previous tracking results, capturing the object's motion trajectory
    - [**Hungarian algorithm (Kuhn-Munkres)** [39]](https://zh.wikipedia.org/zh-tw/%E5%8C%88%E7%89%99%E5%88%A9%E7%AE%97%E6%B3%95): matches these predictions with the per-frame detections, i.e., decides which boxes in different frames belong to the same object (a minimal sketch of this matching step appears at the end of this post)

### Setup

- training setting
  - 2-layer MLP (hidden d=512)
  - 50 epochs
  - lr=0.01
  - optim=SGD (weight decay=0.0001, momentum=0.9, lr decay=0.1 at epochs 35 & 45)
- Visualization

![](https://hackmd.io/_uploads/B1w3elPxa.png)

### Experiments

1. Original Something-Something split: evaluate the model on the original Something-Something dataset, which contains 174 common human-object interaction categories, using the standard train-test split the dataset provides.
2. Compositional action recognition: a new compositional train-test split of Something-Something, where the task focuses on recognizing actions performed with unseen objects.
3. Few-shot compositional action recognition: train the model with only few-shot training data.
   > The results of #2 and #3 show that accuracy holds up even when detector-generated bounding boxes are used as input to the downstream model.
4. One-object training (generalization): to analyze how the model generalizes across object categories, train only on videos involving the object category "box" and evaluate on all remaining videos.
5. Category analysis: inspect the five best- and five worst-performing categories. The authors find the model does better on actions that directly move an object (e.g., take sth, push sth, ...) and worse on actions where the object itself changes (e.g., poking, tearing).

| #1 | #2 | #3 | #5 |
| -- | -- | -- | -- |
| ![](https://hackmd.io/_uploads/rypkimYg6.png) | ![](https://hackmd.io/_uploads/SygS3QKea.png) | ![](https://hackmd.io/_uploads/ry2I3XYxp.png) | ![](https://hackmd.io/_uploads/SJscnmtxa.png) |

## Conclusion

- Inaccurate object detection and tracking degrade the downstream task's performance.
- Although spatio-temporal interactions are modeled, object appearance is not considered at all, which may limit generalization across scenes.
- Large amounts of annotated data are required for training, limiting scalability in practice.
- The model is not end-to-end, so real-time recognition is not possible.

![](https://hackmd.io/_uploads/ByD527Fg6.png)
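As referenced in the Implementation Details above, the tracker pairs Kalman-predicted boxes with current-frame detections. Below is a minimal sketch of just that assignment step, assuming SciPy's `linear_sum_assignment` and boxes in (x1, y1, x2, y2) format; the Kalman prediction itself is omitted, and the function names and threshold are my own illustration, not the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def match(predicted, detected, min_iou=0.3):
    """Match Kalman-predicted boxes to current-frame detections (Kuhn-Munkres)."""
    if not predicted or not detected:
        return []
    cost = np.array([[-iou(p, d) for d in detected] for p in predicted])
    rows, cols = linear_sum_assignment(cost)  # minimizes cost, i.e. maximizes total IoU
    # keep only sufficiently overlapping pairs; unmatched boxes start or end tracks
    return [(r, c) for r, c in zip(rows, cols) if -cost[r, c] >= min_iou]
```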
