# [HOI-M³: Capture Multiple Humans and Objects Interaction within Contextual Environment](https://arxiv.org/pdf/2404.00299)
## Issue
Core challenges the paper addresses:
- **Data scarcity**: Existing Human-Object Interaction (HOI) datasets focus mainly on single-person, single-object interactions and lack real-world data with multiple humans and multiple objects
- **Difficulty of multiple HOI capture**: Accurately capturing the motion of multiple humans and objects within a contextual environment is highly challenging, especially under severe occlusion
- **Technical challenges**: Capture requires dome-like dense cameras and object-mounted IMUs, along with a complex preprocessing and joint optimization pipeline
- **Research gap**: Data-driven methods for multiple human-object motion capture and synthesis are missing
## Problem Formulation
### Notation
1. **$T \in \mathbb{R}^3, R \in SO(3)$** ---- 3D translation and rotation of an object
2. **$R^{IMU}_t$** ---- Rotation provided by the IMU at frame $t$
3. **$R^{off}_t$** ---- Calibration offset rotation
4. **$V^j_t(R^{IMU}_t, R^{off}_t, T_t)$** ---- 3D location of object mesh $j$ at frame $t$
5. **$O(c_j)$** ---- Mesh template of category $c_j$
6. **$x = [x_1, x_2, ..., x_N]$** ---- Multiple HOI representation, with $x_i \in \mathbb{R}^{88}$
7. **$\theta_i \in \mathbb{R}^{24 \times 3}, \beta_i \in \mathbb{R}^{10}$** ---- Human pose and shape parameters
8. **$T^h_i, R^h_i, T^o_i, R^o_i$** ---- Global transformations of the human and the object
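The dimensions above suggest one plausible packing of the 88-dimensional per-person state $x_i$: 72 pose + 10 shape + 3 translation + 3 rotation. A minimal sketch under that assumption (the exact layout is not spelled out here, and the function name is illustrative):

```python
import numpy as np

# Hypothetical packing of one per-person HOI state x_i in R^88,
# assuming 72 (theta) + 10 (beta) + 3 (T^h) + 3 (R^h) = 88.
def pack_state(theta, beta, t_h, r_h):
    """Concatenate SMPL pose/shape and global transform into x_i."""
    x = np.concatenate([theta.reshape(-1), beta, t_h, r_h])
    assert x.shape == (88,), "dimension bookkeeping check"
    return x

x_i = pack_state(np.zeros((24, 3)), np.zeros(10), np.zeros(3), np.zeros(3))
```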
### Objective
Main goals:
1. **Build the first large-scale dataset of multiple humans and multiple objects**: provide high-quality 3D tracking and rich modalities
2. **Develop a robust capture pipeline**: integrate dense RGB and object-mounted IMU inputs for accurate tracking
3. **Propose novel downstream tasks**: monocular multiple HOI capture and unstructured multiple HOI generation
4. **Establish a benchmark**: provide a data foundation and evaluation standard for future HOI research
### Model Architecture
**Dataset Architecture**:
- **Capture System**:42 Z CAM cinema cameras (4K@60fps) + object-mounted IMUs
- **Environment**: 5 daily scenarios (Bedroom, Dining Room, Living Room, Fitness Room, Office)
- **Scale**:199 sequences, 181M frames, 90 objects, 31 human subjects
**Pipeline Components**:
1. **Data Annotation**:SAM-based segmentation + professional human annotation
2. **Synchronization & Calibration**:RGB-IMU alignment via controlled jump detection
3. **Human Motion Capture**:ViTPose + cross-view matching + SMPL fitting
4. **Inertial-aided Multi-object Tracking**:IMU-guided optimization with multiple constraints
### Algorithm
**Data Capture Pipeline**:
1. **Pre-scanning**: Scan 90 everyday objects with Polycam
2. **Recording**: Record interactions with the 42-camera system and embedded IMUs
3. **Synchronization**: Synchronize RGB and IMU signals via a controlled jump
4. **Annotation**: Segmentation by SAM plus human annotators
**Human Motion Capture**:
1. **2D Detection**: Detect 2D human keypoints with ViTPose
2. **Cross-view Matching**: Build an affinity matrix for multi-view correspondence
3. **3D Reconstruction**: Triangulate 3D keypoint trajectories
4. **SMPL Fitting**: Fit the parametric model with the EasyMocap toolkit
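Step 3 above can be sketched with the standard direct linear transform (DLT) for multi-view triangulation, assuming known 3×4 projection matrices per view; this is a generic sketch, not the authors' implementation:

```python
import numpy as np

# DLT triangulation: recover a 3D point from its 2D keypoint
# observations in several calibrated views.
def triangulate(P_list, uv_list):
    """P_list: 3x4 projection matrices; uv_list: matching (u, v) pixels."""
    rows = []
    for P, (u, v) in zip(P_list, uv_list):
        rows.append(u * P[2] - P[0])  # each view contributes two linear
        rows.append(v * P[2] - P[1])  # constraints on the homogeneous point
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)      # null-space vector minimizes ||A X||
    X = Vt[-1]
    return X[:3] / X[3]              # de-homogenize
```

In the full pipeline this would be run per joint and per frame to obtain the 3D keypoint trajectories that SMPL fitting consumes.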
**Multi-object Tracking**:
1. **IMU Integration**: Use the IMU to provide rotation priors
2. **Joint Optimization**: Jointly optimize the translation and the rotation offset
3. **Multi-constraint Enforcement**: Mask, offscreen, collision, and smoothness constraints
### Key Formulations
**Object Tracking Optimization**:
$$V^j_t(R^{IMU}_t, R^{off}_t, T_t) = R^{off}_t R^{IMU}_t O(c_j) + T_t$$
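The vertex formulation above is a rigid transform of the category template: rotate by the calibrated IMU rotation, then translate. A minimal numpy sketch (names illustrative, row-vector convention):

```python
import numpy as np

# V^j_t = R_off @ R_imu @ O(c_j) + T, applied to template vertices of shape (V, 3).
def object_vertices(O_cj, R_imu, R_off, T):
    """Place the object template in the world at frame t."""
    return O_cj @ (R_off @ R_imu).T + T  # right-multiply transposed rotation
```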
**Optimization Objective**:
$$R^{off}_t, T_t = \arg\min_{R,T} (\lambda_{mask}E_{mask} + \lambda_{offscreen}E_{offscreen} + \lambda_{collision}E_{collision} + \lambda_{smt}E_{smt})$$
**Human-Object Mask Constraint**:
$$E_{homask} = \sum_{v=1}^{42} \left\|I^{homask}_v - DR\left(V^j_t(R^{IMU}_t, R^{off}_t, T_t)\right)\right\|_2^2$$
**Smoothness Constraint**:
$$E_{smt} = \max(0, \|(R^{off}_t R^{IMU}_t)^{-1}R^{off}_{t+1}R^{IMU}_{t+1}\|_2 - \|(R^{IMU}_t)^{-1}R^{IMU}_{t+1}\|_2)$$
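The smoothness term penalizes the optimized rotation only when it changes between frames faster than the raw IMU rotation does. A sketch that reads $\|\cdot\|$ as the geodesic angle of the relative rotation (that reading is an assumption; the paper's exact norm is not spelled out here):

```python
import numpy as np

def rot_angle(R):
    """Geodesic angle of a rotation matrix, from its trace."""
    c = (np.trace(R) - 1.0) / 2.0
    return np.arccos(np.clip(c, -1.0, 1.0))

# E_smt = max(0, angle of full relative rotation - angle of IMU relative rotation)
def E_smt(R_off_t, R_imu_t, R_off_t1, R_imu_t1):
    full = rot_angle((R_off_t @ R_imu_t).T @ (R_off_t1 @ R_imu_t1))
    imu = rot_angle(R_imu_t.T @ R_imu_t1)
    return max(0.0, full - imu)
```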
**Multiple HOI Diffusion Model**:
$$q(x_{1:N}|x_0) := \prod_{n=1}^N q(x_n|x_{n-1})$$
$$q(x_n|x_{n-1}) := \mathcal{N}(x_n; \sqrt{1-\beta_n}x_{n-1}, \beta_n I)$$
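The forward chain above can be sketched directly: each step scales the sample by $\sqrt{1-\beta_n}$ and adds Gaussian noise of variance $\beta_n$. The linear beta schedule and the 88-dim state shape below are illustrative choices:

```python
import numpy as np

# Forward diffusion q(x_n | x_{n-1}) applied step by step.
def diffuse(x0, betas, seed=0):
    """Run the forward chain and return the final noisy sample x_N."""
    rng = np.random.default_rng(seed)
    x = x0
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

betas = np.linspace(1e-4, 0.02, 1000)  # toy schedule, not the paper's
x_N = diffuse(np.zeros(88), betas)     # one 88-dim HOI state
```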
### Losses
**Multi-object Tracking Losses**:
1. **Mask loss** ---- $E_{mask}$:Human-object mask consistency
2. **Offscreen loss** ---- $E_{offscreen}$:Prevent objects moving out of camera views
3. **Collision loss** ---- $E_{collision}$:Avoid human-object interpenetration
4. **Smoothness loss** ---- $E_{smt}$:Maintain motion smoothness
**Monocular Capture Losses**:
$$L_{sum} = \lambda_{theta}L_{theta} + \lambda_{beta}L_{beta} + \lambda_{object}L_{object} + \lambda_{3D}L_{3D} + \lambda_{2D}L_{2D} + \lambda_{hm}L_{hm} + \lambda_{depth}L_{depth}$$
**Generation Loss**:
$$L = \mathbb{E}_{x_0,n}\left[\|\hat{x}_\theta(x_n, n) - x_0\|_1\right]$$
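The generation loss is an L1 penalty between the network's clean-sample prediction and the ground truth. A minimal sketch, where `x0_pred` stands in for $\hat{x}_\theta(x_n, n)$ (the batching convention is an assumption):

```python
import numpy as np

# L1 reconstruction loss: per-sample L1 norm, averaged over the batch.
def generation_loss(x0_pred, x0):
    """x0_pred, x0: arrays of shape (batch, dim)."""
    return np.abs(x0_pred - x0).sum(axis=-1).mean()
```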
### Training
**Dataset Collection**:
- **Recording Time**:20+ hours across 5 scenarios
- **Subjects**:31 humans (20 males, 11 females)
- **Objects**:90 pre-scanned everyday objects
- **Cameras**:42 synchronized 4K@60fps cameras
**Capture Pipeline Training**:
- SMPL parameter fitting via EasyMocap
- Multi-view triangulation for 3D keypoint reconstruction
- Joint optimization with weighted constraint terms
**Baseline Method Training**:
- **Monocular Capture**:ResNet-34 backbone, 512×512 input
- **Generation Model**:Transformer-based diffusion model
- **Evaluation**:PCK, PCKabs, Chamfer distance, FID metrics
## Experiments
**Dataset Statistics**:
- **Scale**:181M frames, 199 sequences, 90 objects, 31 subjects
- **Diversity**: 5 daily scenarios, varied human body shapes and object scales
- **Quality**:4K resolution, 60fps, synchronized RGB-IMU data
**Evaluation Protocols**:
- **Human Pose**:PCK (15cm threshold), PCKabs (absolute coordinates), 3DPCK
- **Object Pose**:Chamfer distance, vertex-to-vertex (v2v) error
- **Generation**:FID (quality), Penetration rate (physical plausibility)
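The PCK metric with the 15 cm threshold quoted above can be sketched as the fraction of predicted joints within the threshold of ground truth (a generic sketch; alignment and normalization details of the paper's protocol are not reproduced here):

```python
import numpy as np

# Percentage of Correct Keypoints: share of joints whose 3D error
# falls below the distance threshold (0.15 m here).
def pck(pred, gt, threshold=0.15):
    """pred, gt: (J, 3) joint positions in meters."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return (dists < threshold).mean()
```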
**Comparison Methods**:
- **Monocular Capture**: PHOSA, CHORE, and other SOTA single-HOI methods
- **Human Pose Estimation**: Mainstream methods such as HMR, SPIN, HybrIK, and PARE
## Results
**Dataset contributions**:
- **First multiple human-object dataset**: provides accurate 3D tracking labels
- **Largest scale**: surpasses existing datasets in recording time and frame count
- **Rich modalities**: RGB + IMU + segmentation + pre-scanned geometry
**Monocular Multiple HOI Capture**:
- **Human Pose**: PCKrel 68.5%, PCKabs 5.9% (clearly outperforming SOTA methods)
- **Object Pose**: Chamfer distance 235.0, V2V 297.8
- **Advantage**: a perspective camera model avoids the depth inaccuracy of weak projection
**Multiple HOI Generation**:
- **FID Score**: joint evaluation 36.906±0.087
- **Penetration Rate**: 9.265% (good physical plausibility)
- **Capability**: generates semantically corresponding motions given object inputs
**Key findings**:
- Multiple HOI scenarios are markedly more challenging than single HOI
- IMU-aided tracking substantially improves object tracking accuracy
- Generated interactions exhibit good semantic correspondence
## Key Contributions
1. **First large-scale multiple human-object interaction dataset**: HOI-M³ contains 181M frames with accurate 3D tracking
2. **Robust capture pipeline**: a joint optimization method integrating dense RGB and object-mounted IMUs
3. **Novel downstream tasks**: two new tasks, monocular multiple HOI capture and unstructured generation
4. **Strong baselines**: companion baseline methods and comprehensive evaluation for both tasks
5. **Community resource**: the dataset, code, and pre-trained models will be released to foster future research
## Limitations
**Hardware cost**:
- Currently limited to indoor settings; extending to outdoor environments poses non-trivial challenges
- Requires an expensive multi-camera setup and specialized IMU equipment
**Data collection**:
- Covers only 5 common scenes, limiting scene diversity
- Human-resource intensive, constraining further scaling of the dataset
**Environment**:
- Fixed illumination conditions and little background variation
- Limited generalization to other environments
**Method**:
- Relies on pre-scanned object templates, with poor adaptability to novel objects
- IMU calibration errors can affect tracking precision
- Multiple HOI generation is still at an early stage; generation quality needs improvement
**Evaluation challenges**:
- No established metrics for multiple HOI evaluation
- Physical plausibility assessment remains relatively simple
- Long-term temporal consistency is under-evaluated