# [HOI-M³: Capture Multiple Humans and Objects Interaction within Contextual Environment](https://arxiv.org/pdf/2404.00299)
## Issue
Core challenges the paper addresses:
- **Data scarcity**: Existing Human-Object Interaction (HOI) datasets focus mainly on single-person, single-object interactions and lack real-world data with multiple humans and multiple objects
- **Difficulty of multiple HOI capture**: Accurately capturing the motion of multiple humans and objects within a contextual environment is highly challenging, especially under severe occlusion
- **Technical challenges**: Capture requires dome-like dense cameras and object-mounted IMUs, along with a complex preprocessing and joint optimization pipeline
- **Research gap**: Data-driven methods for multiple human-object motion capture and synthesis are missing
## Problem Formulation
### Notation
1. **$T \in \mathbb{R}^3, R \in SO(3)$** ---- 3D translation and rotation of an object
2. **$R^{IMU}_t$** ---- Rotation provided by the IMU at frame $t$
3. **$R^{off}_t$** ---- Calibration offset rotation
4. **$V^j_t(R^{IMU}_t, R^{off}_t, T_t)$** ---- 3D location of object mesh $j$ at frame $t$
5. **$O(c_j)$** ---- Mesh template of category $c_j$
6. **$x = [x_1, x_2, ..., x_N]$** ---- Multiple HOI representation, with $x_i \in \mathbb{R}^{88}$
7. **$\theta_i \in \mathbb{R}^{24 \times 3}, \beta_i \in \mathbb{R}^{10}$** ---- Human pose and shape parameters
8. **$T^h_i, R^h_i, T^o_i, R^o_i$** ---- Global transformations of the human and the object
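The dimensions above suggest one plausible packing of the 88-dimensional per-person state $x_i$: 72 pose + 10 shape + 3 translation + 3 rotation. A minimal sketch under that assumption (the exact layout is not spelled out here, and the function name is illustrative):

```python
import numpy as np

# Hypothetical packing of one per-person HOI state x_i in R^88,
# assuming 72 (theta) + 10 (beta) + 3 (T^h) + 3 (R^h) = 88.
def pack_state(theta, beta, t_h, r_h):
    """Concatenate SMPL pose/shape and global transform into x_i."""
    x = np.concatenate([theta.reshape(-1), beta, t_h, r_h])
    assert x.shape == (88,), "dimension bookkeeping check"
    return x

x_i = pack_state(np.zeros((24, 3)), np.zeros(10), np.zeros(3), np.zeros(3))
```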
### Objective
Main goals:
1. **Build the first large-scale dataset of multiple humans and multiple objects**: provide high-quality 3D tracking and rich modalities
2. **Develop a robust capture pipeline**: integrate dense RGB and object-mounted IMU inputs for accurate tracking
3. **Propose novel downstream tasks**: monocular multiple HOI capture and unstructured multiple HOI generation
4. **Establish a benchmark**: provide a data foundation and evaluation standard for future HOI research
### Model Architecture
**Dataset Architecture**:
- **Capture System**:42 Z CAM cinema cameras (4K@60fps) + object-mounted IMUs
- **Environment**: 5 daily scenarios (Bedroom, Dining Room, Living Room, Fitness Room, Office)
- **Scale**:199 sequences, 181M frames, 90 objects, 31 human subjects
**Pipeline Components**:
1. **Data Annotation**:SAM-based segmentation + professional human annotation
2. **Synchronization & Calibration**:RGB-IMU alignment via controlled jump detection
3. **Human Motion Capture**:ViTPose + cross-view matching + SMPL fitting
4. **Inertial-aided Multi-object Tracking**:IMU-guided optimization with multiple constraints
### Algorithm
**Data Capture Pipeline**:
1. **Pre-scanning**: Scan 90 everyday objects with Polycam
2. **Recording**: Record interactions with the 42-camera system and embedded IMUs
3. **Synchronization**: Synchronize RGB and IMU signals via a controlled jump
4. **Annotation**: Segmentation by SAM plus human annotators
**Human Motion Capture**:
1. **2D Detection**: Detect 2D human keypoints with ViTPose
2. **Cross-view Matching**: Build an affinity matrix for multi-view correspondence
3. **3D Reconstruction**: Triangulate 3D keypoint trajectories
4. **SMPL Fitting**: Fit the parametric model with the EasyMocap toolkit
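Step 3 above can be sketched with the standard direct linear transform (DLT) for multi-view triangulation, assuming known 3×4 projection matrices per view; this is a generic sketch, not the authors' implementation:

```python
import numpy as np

# DLT triangulation: recover a 3D point from its 2D keypoint
# observations in several calibrated views.
def triangulate(P_list, uv_list):
    """P_list: 3x4 projection matrices; uv_list: matching (u, v) pixels."""
    rows = []
    for P, (u, v) in zip(P_list, uv_list):
        rows.append(u * P[2] - P[0])  # each view contributes two linear
        rows.append(v * P[2] - P[1])  # constraints on the homogeneous point
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)      # null-space vector minimizes ||A X||
    X = Vt[-1]
    return X[:3] / X[3]              # de-homogenize
```

In the full pipeline this would be run per joint and per frame to obtain the 3D keypoint trajectories that SMPL fitting consumes.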
**Multi-object Tracking**:
1. **IMU Integration**: Use the IMU to provide rotation priors
2. **Joint Optimization**: Jointly optimize the translation and the rotation offset
3. **Multi-constraint Enforcement**: Mask, offscreen, collision, and smoothness constraints
### Key Formulations
**Object Tracking Optimization**:
$$V^j_t(R^{IMU}_t, R^{off}_t, T_t) = R^{off}_t R^{IMU}_t O(c_j) + T_t$$
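The vertex formulation above is a rigid transform of the category template: rotate by the calibrated IMU rotation, then translate. A minimal numpy sketch (names illustrative, row-vector convention):

```python
import numpy as np

# V^j_t = R_off @ R_imu @ O(c_j) + T, applied to template vertices of shape (V, 3).
def object_vertices(O_cj, R_imu, R_off, T):
    """Place the object template in the world at frame t."""
    return O_cj @ (R_off @ R_imu).T + T  # right-multiply transposed rotation
```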
**Optimization Objective**:
$$R^{off}_t, T_t = \arg\min_{R,T} (\lambda_{mask}E_{mask} + \lambda_{offscreen}E_{offscreen} + \lambda_{collision}E_{collision} + \lambda_{smt}E_{smt})$$
**Human-Object Mask Constraint**:
$$E_{homask} = \sum_{v=1}^{42} \left\|I^{homask}_v - DR\left(V^j_t(R^{IMU}_t, R^{off}_t, T_t)\right)\right\|_2^2$$
**Smoothness Constraint**:
$$E_{smt} = \max(0, \|(R^{off}_t R^{IMU}_t)^{-1}R^{off}_{t+1}R^{IMU}_{t+1}\|_2 - \|(R^{IMU}_t)^{-1}R^{IMU}_{t+1}\|_2)$$
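The smoothness term penalizes the optimized rotation only when it changes between frames faster than the raw IMU rotation does. A sketch that reads $\|\cdot\|$ as the geodesic angle of the relative rotation (that reading is an assumption; the paper's exact norm is not spelled out here):

```python
import numpy as np

def rot_angle(R):
    """Geodesic angle of a rotation matrix, from its trace."""
    c = (np.trace(R) - 1.0) / 2.0
    return np.arccos(np.clip(c, -1.0, 1.0))

# E_smt = max(0, angle of full relative rotation - angle of IMU relative rotation)
def E_smt(R_off_t, R_imu_t, R_off_t1, R_imu_t1):
    full = rot_angle((R_off_t @ R_imu_t).T @ (R_off_t1 @ R_imu_t1))
    imu = rot_angle(R_imu_t.T @ R_imu_t1)
    return max(0.0, full - imu)
```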
**Multiple HOI Diffusion Model**:
$$q(x_{1:N}|x_0) := \prod_{n=1}^N q(x_n|x_{n-1})$$
$$q(x_n|x_{n-1}) := \mathcal{N}(x_n; \sqrt{1-\beta_n}x_{n-1}, \beta_n I)$$
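The forward chain above can be sketched directly: each step scales the sample by $\sqrt{1-\beta_n}$ and adds Gaussian noise of variance $\beta_n$. The linear beta schedule and the 88-dim state shape below are illustrative choices:

```python
import numpy as np

# Forward diffusion q(x_n | x_{n-1}) applied step by step.
def diffuse(x0, betas, seed=0):
    """Run the forward chain and return the final noisy sample x_N."""
    rng = np.random.default_rng(seed)
    x = x0
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

betas = np.linspace(1e-4, 0.02, 1000)  # toy schedule, not the paper's
x_N = diffuse(np.zeros(88), betas)     # one 88-dim HOI state
```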
### Losses
**Multi-object Tracking Losses**:
1. **Mask loss** ---- $E_{mask}$:Human-object mask consistency
2. **Offscreen loss** ---- $E_{offscreen}$:Prevent objects moving out of camera views
3. **Collision loss** ---- $E_{collision}$:Avoid human-object interpenetration
4. **Smoothness loss** ---- $E_{smt}$:Maintain motion smoothness
**Monocular Capture Losses**:
$$L_{sum} = \lambda_{theta}L_{theta} + \lambda_{beta}L_{beta} + \lambda_{object}L_{object} + \lambda_{3D}L_{3D} + \lambda_{2D}L_{2D} + \lambda_{hm}L_{hm} + \lambda_{depth}L_{depth}$$
**Generation Loss**:
$$L = \mathbb{E}_{x_0,n}\left[\|\hat{x}_\theta(x_n, n) - x_0\|_1\right]$$
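The generation loss is an L1 penalty between the network's clean-sample prediction and the ground truth. A minimal sketch, where `x0_pred` stands in for $\hat{x}_\theta(x_n, n)$ (the batching convention is an assumption):

```python
import numpy as np

# L1 reconstruction loss: per-sample L1 norm, averaged over the batch.
def generation_loss(x0_pred, x0):
    """x0_pred, x0: arrays of shape (batch, dim)."""
    return np.abs(x0_pred - x0).sum(axis=-1).mean()
```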
### Training
**Dataset Collection**:
- **Recording Time**:20+ hours across 5 scenarios
- **Subjects**:31 humans (20 males, 11 females)
- **Objects**:90 pre-scanned everyday objects
- **Cameras**:42 synchronized 4K@60fps cameras
**Capture Pipeline Training**:
- SMPL parameter fitting via EasyMocap
- Multi-view triangulation for 3D keypoint reconstruction
- Joint optimization with weighted constraint terms
**Baseline Method Training**:
- **Monocular Capture**:ResNet-34 backbone, 512×512 input
- **Generation Model**:Transformer-based diffusion model
- **Evaluation**:PCK, PCKabs, Chamfer distance, FID metrics
## Experiments
**Dataset Statistics**:
- **Scale**:181M frames, 199 sequences, 90 objects, 31 subjects
- **Diversity**: 5 daily scenarios, varied human body shapes and object scales
- **Quality**:4K resolution, 60fps, synchronized RGB-IMU data
**Evaluation Protocols**:
- **Human Pose**:PCK (15cm threshold), PCKabs (absolute coordinates), 3DPCK
- **Object Pose**:Chamfer distance, vertex-to-vertex (v2v) error
- **Generation**:FID (quality), Penetration rate (physical plausibility)
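The PCK metric with the 15 cm threshold quoted above can be sketched as the fraction of predicted joints within the threshold of ground truth (a generic sketch; alignment and normalization details of the paper's protocol are not reproduced here):

```python
import numpy as np

# Percentage of Correct Keypoints: share of joints whose 3D error
# falls below the distance threshold (0.15 m here).
def pck(pred, gt, threshold=0.15):
    """pred, gt: (J, 3) joint positions in meters."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return (dists < threshold).mean()
```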
**Comparison Methods**:
- **Monocular Capture**: PHOSA, CHORE, and other SOTA single-HOI methods
- **Human Pose Estimation**: Mainstream methods such as HMR, SPIN, HybrIK, and PARE
## Results
**Dataset contributions**:
- **First multiple human-object dataset**: provides accurate 3D tracking labels
- **Largest scale**: surpasses existing datasets in recording time and frame count
- **Rich modalities**: RGB + IMU + segmentation + pre-scanned geometry
**Monocular Multiple HOI Capture**:
- **Human Pose**: PCKrel 68.5%, PCKabs 5.9% (clearly outperforming SOTA methods)
- **Object Pose**: Chamfer distance 235.0, V2V 297.8
- **Advantage**: a perspective camera model avoids the depth inaccuracy of weak projection
**Multiple HOI Generation**:
- **FID Score**: joint evaluation 36.906±0.087
- **Penetration Rate**: 9.265% (good physical plausibility)
- **Capability**: generates semantically corresponding motions given object inputs
**Key findings**:
- Multiple HOI scenarios are markedly more challenging than single HOI
- IMU-aided tracking substantially improves object tracking accuracy
- Generated interactions exhibit good semantic correspondence
## Key Contributions
1. **First large-scale multiple human-object interaction dataset**: HOI-M³ contains 181M frames with accurate 3D tracking
2. **Robust capture pipeline**: a joint optimization method integrating dense RGB and object-mounted IMUs
3. **Novel downstream tasks**: two new tasks, monocular multiple HOI capture and unstructured generation
4. **Strong baselines**: companion baseline methods and comprehensive evaluation for both tasks
5. **Community resource**: the dataset, code, and pre-trained models will be released to foster future research
## Limitations
**Hardware cost**:
- Currently limited to indoor settings; extending to outdoor environments poses non-trivial challenges
- Requires an expensive multi-camera setup and specialized IMU equipment
**Data collection**:
- Covers only 5 common scenes, limiting scene diversity
- Human-resource intensive, constraining further scaling of the dataset
**Environment**:
- Fixed illumination conditions and little background variation
- Limited generalization to other environments
**Method**:
- Relies on pre-scanned object templates, with poor adaptability to novel objects
- IMU calibration errors can affect tracking precision
- Multiple HOI generation is still at an early stage; generation quality needs improvement
**Evaluation challenges**:
- No established metrics for multiple HOI evaluation
- Physical plausibility assessment remains relatively simple
- Long-term temporal consistency is under-evaluated