# [HOI-M³: Capture Multiple Humans and Objects Interaction within Contextual Environment](https://arxiv.org/pdf/2404.00299)

## Issue

Core challenges the paper addresses:

- **Data scarcity**: existing Human-Object Interaction (HOI) datasets focus on a single human interacting with a single object; real-world data covering multiple humans and multiple objects is lacking.
- **Difficulty of multiple HOI capture**: accurately capturing the motion of multiple humans and objects within a contextual environment is highly challenging, especially under severe occlusion.
- **Technical demands**: capture requires dome-like dense cameras and object-mounted IMUs, together with a complex preprocessing and joint optimization pipeline.
- **Research gap**: data-driven methods for multiple human-object motion capture and synthesis are missing.

## Problem Formulation

### Notation

1. **$T \in \mathbb{R}^3, R \in SO(3)$** ---- 3D translation and rotation of an object
2. **$R^{IMU}_t$** ---- Rotation provided by the IMU
3. **$R^{off}_t$** ---- Calibration offset rotation
4. **$V^j_t(R^{IMU}_t, R^{off}_t, T_t)$** ---- 3D vertex locations of object $j$'s mesh at frame $t$
5. **$O(c_j)$** ---- Mesh template of category $c_j$
6. **$x = [x_1, x_2, ..., x_N]$** ---- Multiple HOI representation, $x_i \in \mathbb{R}^{88}$
7. **$\theta_i \in \mathbb{R}^{24 \times 3}, \beta_i \in \mathbb{R}^{10}$** ---- Human pose and shape parameters
8. **$T^h_i, R^h_i, T^o_i, R^o_i$** ---- Global transformations of humans and objects

### Objective

Main goals:

1. **Build the first large-scale dataset of multiple humans and multiple objects**: provide high-quality 3D tracking and rich modalities
2. **Develop a robust capture pipeline**: fuse dense RGB and object-mounted IMU inputs for accurate tracking
3. **Propose novel downstream tasks**: monocular multiple HOI capture and unstructured multiple HOI generation
4. **Establish a benchmark**: supply the data foundation and evaluation standards for future HOI research

### Model Architecture

**Dataset Architecture**:

- **Capture System**: 42 Z CAM cinema cameras (4K@60fps) + object-mounted IMUs
- **Environment**: 5 daily scenarios (Bedroom, Dining Room, Living Room, Fitness Room, Office)
- **Scale**: 199 sequences, 181M frames, 90 objects, 31 human subjects

**Pipeline Components**:

1. **Data Annotation**: SAM-based segmentation + professional human annotation
2. **Synchronization & Calibration**: RGB-IMU alignment via controlled jump detection
3. **Human Motion Capture**: ViTPose + cross-view matching + SMPL fitting
4. **Inertial-aided Multi-object Tracking**: IMU-guided optimization with multiple constraints

### Algorithm

**Data Capture Pipeline**:

1. **Pre-scanning**: scan 90 everyday objects with Polycam
2. **Recording**: record interactions with the 42-camera system + embedded IMUs
3. **Synchronization**: align RGB and IMU signals via a controlled jump
4. **Annotation**: segmentation by SAM + human annotators

**Human Motion Capture**:

1. **2D Detection**: ViTPose detects 2D human keypoints
2. **Cross-view Matching**: build an affinity matrix for multi-view correspondence
3. **3D Reconstruction**: triangulate 3D keypoint trajectories
4. **SMPL Fitting**: fit the parametric model with the EasyMocap toolkit

**Multi-object Tracking**:

1. **IMU Integration**: use IMUs to provide rotation priors
2. **Joint Optimization**: jointly optimize translation and rotation offset
3. **Multi-constraint Enforcement**: mask, offscreen, collision, and smoothness constraints

### Key Formulations

**Object Tracking Optimization**:

$$V^j_t(R^{IMU}_t, R^{off}_t, T_t) = R^{off}_t R^{IMU}_t O(c_j) + T_t$$

**Optimization Objective**:

$$R^{off}_t, T_t = \arg\min_{R^{off}_t, T_t} \left(\lambda_{mask}E_{mask} + \lambda_{offscreen}E_{offscreen} + \lambda_{collision}E_{collision} + \lambda_{smt}E_{smt}\right)$$

**Human-Object Mask Constraint**:

$$E_{homask} = \sum_{v=1}^{42} \left\|I^{homask}_v - DR(O(c_j), R^{IMU}_t, T_t)\right\|_2^2$$

**Smoothness Constraint**:

$$E_{smt} = \max\left(0, \left\|(R^{off}_t R^{IMU}_t)^{-1}R^{off}_{t+1}R^{IMU}_{t+1}\right\|_2 - \left\|(R^{IMU}_t)^{-1}R^{IMU}_{t+1}\right\|_2\right)$$
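To make the formulations above concrete, here is a minimal sketch of the per-object fitting loop, assuming PyTorch. `render_masks` is a hypothetical stand-in for the differentiable renderer $DR$, the loss weights are illustrative, the offscreen and collision terms are omitted for brevity, and the matrix norm in $E_{smt}$ is read here as deviation from identity; this is a sketch of the idea, not the paper's implementation.

```python
import torch


def rodrigues(a: torch.Tensor) -> torch.Tensor:
    """Axis-angle vector (3,) -> rotation matrix (3, 3), differentiably."""
    theta = a.norm() + 1e-8
    k = a / theta
    zero = torch.zeros((), dtype=a.dtype, device=a.device)
    K = torch.stack([
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])
    I = torch.eye(3, dtype=a.dtype, device=a.device)
    return I + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)


def track_object(R_imu, template, gt_masks, render_masks, n_iters=200):
    """Fit per-frame R^off_t (axis-angle) and T_t for one object.

    R_imu:        (F, 3, 3)    IMU rotations R^IMU_t per frame
    template:     (V, 3)       pre-scanned mesh vertices O(c_j)
    gt_masks:     (F, C, H, W) annotated human-object masks I^homask_v
    render_masks: callable (V, 3) -> (C, H, W); hypothetical stand-in for
                  the differentiable renderer DR over all C calibrated views
    """
    F = R_imu.shape[0]
    off = torch.zeros(F, 3, requires_grad=True)    # axis-angle for R^off_t
    trans = torch.zeros(F, 3, requires_grad=True)  # T_t
    opt = torch.optim.Adam([off, trans], lr=1e-2)
    lam_mask, lam_smt = 1.0, 0.1                   # illustrative weights
    eye = torch.eye(3)

    for _ in range(n_iters):
        opt.zero_grad()
        R_off = torch.stack([rodrigues(off[t]) for t in range(F)])  # (F, 3, 3)
        R = R_off @ R_imu                                           # composed rotation
        verts = template @ R.transpose(1, 2) + trans[:, None, :]    # V^j_t, (F, V, 3)

        # E_mask: squared silhouette residual, summed over views and frames.
        E_mask = sum(((gt_masks[t] - render_masks(verts[t])) ** 2).sum()
                     for t in range(F))

        # E_smt: the composed frame-to-frame rotation should not exceed the
        # raw IMU one. The matrix norm is taken as Frobenius distance from
        # identity (an assumed reading of ||R_t^{-1} R_{t+1}||_2 above).
        d_full = torch.linalg.matrix_norm(R[:-1].transpose(1, 2) @ R[1:] - eye)
        d_imu = torch.linalg.matrix_norm(R_imu[:-1].transpose(1, 2) @ R_imu[1:] - eye)
        E_smt = torch.clamp(d_full - d_imu, min=0.0).sum()

        loss = lam_mask * E_mask + lam_smt * E_smt  # offscreen/collision omitted
        loss.backward()
        opt.step()

    return off.detach(), trans.detach()
```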
**Multiple HOI Diffusion Model**:

$$q(x_{1:N}|x_0) := \prod_{n=1}^N q(x_n|x_{n-1})$$

$$q(x_n|x_{n-1}) := \mathcal{N}(x_n; \sqrt{1-\beta_n}x_{n-1}, \beta_n I)$$

### Losses

**Multi-object Tracking Losses**:

1. **Mask loss** ---- $E_{mask}$: human-object mask consistency
2. **Offscreen loss** ---- $E_{offscreen}$: prevents objects from moving out of the camera views
3. **Collision loss** ---- $E_{collision}$: avoids human-object interpenetration
4. **Smoothness loss** ---- $E_{smt}$: maintains motion smoothness

**Monocular Capture Losses**:

$$L_{sum} = \lambda_{\theta}L_{\theta} + \lambda_{\beta}L_{\beta} + \lambda_{object}L_{object} + \lambda_{3D}L_{3D} + \lambda_{2D}L_{2D} + \lambda_{hm}L_{hm} + \lambda_{depth}L_{depth}$$

**Generation Loss**:

$$L = \mathbb{E}_{x_0,n}\left[\|\hat{x}_\theta(x_n, n) - x_0\|_1\right]$$
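Below is a minimal sketch of this training objective, assuming the standard DDPM closed form $q(x_n|x_0) = \mathcal{N}(\sqrt{\bar\alpha_n}x_0, (1-\bar\alpha_n)I)$ implied by the per-step Gaussians above. The `HOIDenoiser` MLP, the linear $\beta_n$ schedule, and the toy batch of $x_i \in \mathbb{R}^{88}$ vectors are illustrative placeholders; the paper's model is transformer-based.

```python
import torch
import torch.nn as nn


class HOIDenoiser(nn.Module):
    """Placeholder x0-predicting denoiser (the paper uses a transformer)."""
    def __init__(self, dim=88, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, dim))

    def forward(self, x_n, n):
        # Condition on the normalized diffusion step by simple concatenation.
        step = n[:, None].float() / 1000.0
        return self.net(torch.cat([x_n, step], dim=-1))


N_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, N_STEPS)     # beta_n schedule (assumed linear)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta_i)


def training_loss(model, x0):
    """One step of the L1 objective: E_{x0,n} || x_hat(x_n, n) - x0 ||_1."""
    b = x0.shape[0]
    n = torch.randint(0, N_STEPS, (b,))
    a = alphas_bar[n][:, None]
    # Sample x_n directly from q(x_n | x_0) in closed form.
    x_n = a.sqrt() * x0 + (1.0 - a).sqrt() * torch.randn_like(x0)
    return (model(x_n, n) - x0).abs().mean()


model = HOIDenoiser()
x0 = torch.randn(16, 88)  # toy batch of per-person HOI vectors x_i in R^88
loss = training_loss(model, x0)
loss.backward()
```

Sampling would invert the chain step by step from Gaussian noise; only the training loss is sketched here.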
### Training

**Dataset Collection**:

- **Recording Time**: 20+ hours across 5 scenarios
- **Subjects**: 31 humans (20 males, 11 females)
- **Objects**: 90 pre-scanned everyday objects
- **Cameras**: 42 synchronized 4K@60fps cameras

**Capture Pipeline Training**:

- SMPL parameter fitting with EasyMocap
- Multi-view triangulation for 3D keypoint reconstruction
- Joint optimization with weighted constraint terms

**Baseline Method Training**:

- **Monocular Capture**: ResNet-34 backbone, 512×512 input
- **Generation Model**: Transformer-based diffusion model
- **Evaluation**: PCK, PCKabs, Chamfer distance, FID metrics

## Experiments

**Dataset Statistics**:

- **Scale**: 181M frames, 199 sequences, 90 objects, 31 subjects
- **Diversity**: 5 daily scenarios, varied human body shapes and object scales
- **Quality**: 4K resolution, 60fps, synchronized RGB-IMU data

**Evaluation Protocols**:

- **Human Pose**: PCK (15cm threshold), PCKabs (absolute coordinates), 3DPCK
- **Object Pose**: Chamfer distance, vertex-to-vertex (v2v) error
- **Generation**: FID (quality), penetration rate (physical plausibility)

**Comparison Methods**:

- **Monocular Capture**: PHOSA, CHORE, and other SOTA single-HOI methods
- **Human Pose Estimation**: HMR, SPIN, HybrIK, PARE, and other mainstream methods

## Results

**Dataset Contributions**:

- **First multiple human-object dataset**: provides accurate 3D tracking labels
- **Largest scale**: surpasses existing datasets in recording time and frame count
- **Rich modalities**: RGB + IMU + segmentation + pre-scanned geometry

**Monocular Multiple HOI Capture**:

- **Human Pose**: PCKrel 68.5%, PCKabs 5.9% (clearly outperforming SOTA methods)
- **Object Pose**: Chamfer distance 235.0, V2V 297.8
- **Advantage**: a perspective camera model avoids the depth inaccuracy of weak-perspective projection

**Multiple HOI Generation**:

- **FID Score**: joint evaluation 36.906±0.087
- **Penetration Rate**: 9.265% (good physical plausibility)
- **Capability**: generates semantically corresponding motions given object inputs

**Key Findings**:

- Multiple HOI scenarios are markedly more challenging than single HOI
- IMU-aided tracking substantially improves object tracking accuracy
- Generated interactions exhibit good semantic correspondence

## Key Contributions

1. **First large-scale multiple human-object interaction dataset**: HOI-M³ contains 181M frames with accurate 3D tracking
2. **Robust capture pipeline**: a joint optimization method fusing dense RGB and object-mounted IMUs
3. **Novel downstream tasks**: monocular multiple HOI capture and unstructured multiple HOI generation
4. **Strong baselines**: companion baseline methods and comprehensive evaluation for both tasks
5. **Community resource**: the dataset, code, and pre-trained models will be released to support future research

## Limitations

**Hardware cost**:

- Currently limited to indoor settings; extending to outdoor environments poses non-trivial challenges
- Requires an expensive multi-camera setup and specialized IMU equipment

**Data collection**:

- Covers only 5 common scenes, limiting scene diversity
- Human-resource intensive, which constrains further scaling of the dataset

**Environment**:

- Fixed illumination conditions and little background variation
- Limited generalization ability to other environments

**Method**:

- Relies on pre-scanned object templates, with poor adaptability to novel objects
- IMU calibration errors can degrade tracking precision
- Multiple HOI generation is still at an early stage; generation quality leaves room for improvement

**Evaluation**:

- No established metrics for multiple HOI evaluation
- Physical plausibility assessment remains relatively simple
- Insufficient evaluation of long-term temporal consistency