4D Humans Paper
===

## Table of Contents

[TOC]

## Abstract

* HMR 2.0 (Human Mesh Recovery) is a fully transformer-based approach for 3D human pose and shape reconstruction from a single image.
* 4D Humans builds on HMR 2.0 to reconstruct and track humans in video.
* The 3D reconstructions from HMR 2.0 are used as input to the tracking system.
* Working with 3D reconstructions helps to handle multiple people and to maintain identities through occlusion events.
* Other approaches to human mesh and motion recovery from video often operate in scenarios where tracking is simple, e.g., videos with a single person or minimal occlusion. 4D Humans aims to handle the general case.

## Reconstructing People

### Body Model

&rarr; The SMPL model is a low-dimensional parametric model of the human body.

&rarr; Given input parameters for **pose** (θ ∈ ℝ<sup>24×3×3</sup>) and **shape** (β ∈ ℝ<sup>10</sup>), it outputs a **mesh** (M ∈ ℝ<sup>3×N</sup>) with N = 6890 vertices.

&rarr; The **body joints** (X ∈ ℝ<sup>3×K</sup>) are defined as a linear combination of the vertices and can be computed as <em>X = MW</em> with fixed weights (W ∈ ℝ<sup>N×K</sup>).

&rarr; Note that the pose parameters θ include the body pose parameters θ<sub>b</sub> ∈ ℝ<sup>23×3×3</sup> and the global orientation θ<sub>g</sub> ∈ ℝ<sup>3×3</sup>.

<br>

### Camera

&rarr; We use a perspective camera model with fixed focal length and intrinsics K.

&rarr; Each camera π = (R, t) consists of a global orientation R ∈ ℝ<sup>3×3</sup> and a translation t ∈ ℝ<sup>3</sup>.

&rarr; Points in SMPL space can be projected to the image as

<center>x = π(X) = Π(K(RX + t))</center>

where Π is the perspective projection with camera intrinsics K.

<br>

### HMR

&rarr; The goal of HMR is to learn a predictor <em>f(I)</em> that, given a single image I, reconstructs the person in the image by predicting their 3D pose and shape parameters:

<center>Θ = [θ, β, π] = f(I)</center>
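To make the body-model and camera equations above concrete, here is a minimal NumPy sketch. The vertices, regression weights, rotation, translation, and intrinsics are all placeholder values (not a real SMPL forward pass); only the shapes and the formulas <em>X = MW</em> and x = Π(K(RX + t)) follow the notes.

```python
import numpy as np

# Toy stand-in for the SMPL output: in the real model, the mesh M (3 x N)
# comes from the pose/shape parameters; here we just use random vertices.
N, K = 6890, 24                      # number of vertices / joints
M = np.random.rand(3, N)             # mesh vertices (3 x N), placeholder values
W = np.random.rand(N, K)             # fixed joint-regression weights (N x K), placeholder
W /= W.sum(axis=0, keepdims=True)    # each joint becomes a weighted average of vertices

# Body joints as a linear combination of vertices: X = M W  (3 x K)
X = M @ W

# Camera: global orientation R, translation t, fixed intrinsics K_intr (placeholders)
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])        # push the body in front of the camera
f = 5000.0                           # focal length (placeholder)
cx, cy = 128.0, 96.0                 # principal point for a 256 x 192 crop
K_intr = np.array([[f, 0.0, cx],
                   [0.0, f, cy],
                   [0.0, 0.0, 1.0]])

# x = pi(X) = Pi(K(RX + t)): rotate/translate, apply intrinsics, divide by depth
X_cam = K_intr @ (R @ X + t[:, None])   # (3 x K)
x_img = X_cam[:2] / X_cam[2:3]          # perspective division -> 2D keypoints (2 x K)
print(x_img.shape)                      # (2, 24)
```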
## Architecture

![HMR 2.0 architecture](image.png)

&rarr; The HMR 2.0 transformer has two parts:

1. **ViT** (Vision Transformer) to extract image tokens (the backbone comes from 2D keypoint localization).
2. **Transformer Decoder**: a standard transformer decoder with multi-head self-attention. It processes a single (zero) input token by cross-attending to the output image tokens and ends with a linear readout of Θ.

&rarr; The original **PHALP** tracker uses pose features taken from the last layer of the HMR network, which ties it to one specific HMR model. To create a more generic version of PHALP, the pose is instead represented directly by the SMPL pose parameters, and the PHALP cost function is modified to use this new pose distance. A vanilla transformer model is trained by masking random pose tokens.

&rarr; **4D Humans** uses a sampling-based, parameter-free appearance head and a new pose predictor. Instead of tracking bounding boxes, reconstruction and tracking are combined into a single system; better pose reconstruction results in better tracking. For appearance, visible points on the mesh are textured by projecting them onto the input image and sampling the colors of the corresponding pixels.

&rarr; The original HMR used a ResNet backbone to extract image features; HMR 2.0 uses a ViT backbone.

### ***ViT-16***

&rarr; It is used for 2D keypoint localization.

&rarr; It has 50 transformer layers.

&rarr; Input: a 256×192 image. Output: 16×12 image tokens of dimension 1280.

### ***Transformer Decoder (standard transformer decoder architecture)***

&rarr; It has 6 layers. Each layer contains:

* multi-head self-attention,
* multi-head cross-attention,
* feed-forward blocks with layer normalization (1024 hidden dimensions).

&rarr; The attention layers have 2048 hidden dimensions with 8 (64-dim) heads.

&rarr; It operates on a single learnable 2048-dimensional SMPL query token as input and cross-attends to the 16×12 image tokens.

&rarr; The output of the transformer decoder is [pose θ, shape β, camera π].

## Losses

&rarr; We train the predictor f with a combination of 2D losses, 3D losses, and a discriminator (as in HMR). A sketch of how these terms combine appears at the end of these notes.

&rarr; The model is trained on a mixture of datasets. If a sample only has 2D keypoint ground truth, the error is computed from the 2D keypoints; if both 2D and 3D ground truth are available, both losses are used. (<sup>∗</sup> denotes the ground truth.)

&rarr; Pose and shape error:

<center>L<sub>smpl</sub> = ||θ − θ<sup>∗</sup>||<sup>2</sup><sub>2</sub> + ||β − β<sup>∗</sup>||<sup>2</sup><sub>2</sub></center>

<br>

&rarr; When we have accurate ground-truth 3D keypoints (X<sup>∗</sup>):

<center>L<sub>kp3D</sub> = ||X − X<sup>∗</sup>||<sub>1</sub></center>

<br>

&rarr; When we have 2D keypoint annotations (x<sup>∗</sup>), we supervise the projections of the predicted 3D keypoints π(X):

<center>L<sub>kp2D</sub> = ||π(X) − x<sup>∗</sup>||<sub>1</sub></center>

&rarr; To check whether the model has predicted a valid pose, we use an adversarial prior as in HMR. We train a discriminator D<sub>k</sub> for each factor of the body model; the generator loss can be expressed as:

<center>L<sub>adv</sub> = &Sigma;<sub>k</sub> (D<sub>k</sub>(θ<sub>b</sub>, β) − 1)<sup>2</sup></center>

## Future Scope

&rarr; The SMPL model imposes some limitations, e.g., children are not well represented by the adult body model.

&rarr; Extending 4D mesh detection and tracking to arbitrary moving objects.

To be discussed more.
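As referenced in the Losses section, here is a minimal sketch of how the loss terms could be combined for one training batch. It assumes PyTorch tensors, per-sample flags that indicate which ground truth is available, and placeholder loss weights; the function name and shapes are hypothetical, not the paper's implementation.

```python
import torch

def hmr_losses(theta, beta, X, x_proj,
               theta_gt, beta_gt, X_gt, x_gt,
               has_smpl, has_3d, has_2d,
               disc_scores=None,
               w_smpl=1.0, w_3d=1.0, w_2d=1.0, w_adv=1.0):
    """Combine the loss terms from the notes (shapes and weights are placeholders).

    theta:  (B, 24, 3, 3) predicted pose rotations
    beta:   (B, 10)       predicted shape
    X:      (B, K, 3)     predicted 3D keypoints
    x_proj: (B, K, 2)     projected 2D keypoints pi(X)
    has_*:  (B,)          per-sample flags (1.0 if that ground truth exists, else 0.0)
    disc_scores: optional (B, num_factors) discriminator outputs D_k(theta_b, beta)
    """
    # L_smpl = ||theta - theta*||_2^2 + ||beta - beta*||_2^2 (only where SMPL GT exists)
    l_smpl = ((theta - theta_gt).flatten(1).pow(2).sum(-1)
              + (beta - beta_gt).pow(2).sum(-1)) * has_smpl

    # L_kp3D = ||X - X*||_1 (only where 3D keypoint GT exists)
    l_3d = (X - X_gt).abs().flatten(1).sum(-1) * has_3d

    # L_kp2D = ||pi(X) - x*||_1 (only where 2D keypoint GT exists)
    l_2d = (x_proj - x_gt).abs().flatten(1).sum(-1) * has_2d

    total = w_smpl * l_smpl.mean() + w_3d * l_3d.mean() + w_2d * l_2d.mean()

    # L_adv = sum_k (D_k(theta_b, beta) - 1)^2 (generator side of the adversarial prior)
    if disc_scores is not None:
        total = total + w_adv * (disc_scores - 1.0).pow(2).sum(-1).mean()
    return total
```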