---
title: 'HMR 2.0 along with code'
disqus: hackmd
---

HMR 2.0 along with code
===

## Index
[TOC]

Setup HMR 2
---
(4D-Humans/hmr2/models/hmr2.py)

* Create the backbone feature extractor using a ViT (Vision Transformer).
```python=
# Create backbone feature extractor
self.backbone = create_backbone(cfg)
if cfg.MODEL.BACKBONE.get('PRETRAINED_WEIGHTS', None):
    log.info(f'Loading backbone weights from {cfg.MODEL.BACKBONE.PRETRAINED_WEIGHTS}')
    self.backbone.load_state_dict(torch.load(cfg.MODEL.BACKBONE.PRETRAINED_WEIGHTS, map_location='cpu')['state_dict'])
```
* Create the SMPL head, which receives the image tokens (conditioning_feats) from the ViT backbone as input.
```python=
# Create SMPL head
self.smpl_head = build_smpl_head(cfg)
```
* Create the discriminator.
```python=
# Create discriminator
if self.cfg.LOSS_WEIGHTS.ADVERSARIAL > 0:
    self.discriminator = Discriminator()
```
* Define the loss functions.
```python=
# Define loss functions
self.keypoint_3d_loss = Keypoint3DLoss(loss_type='l1')
self.keypoint_2d_loss = Keypoint2DLoss(loss_type='l1')
self.smpl_parameter_loss = ParameterLoss()
```

ViT (backbone feature extractor) (HMR 2)
---
* This is used to extract image tokens.
* We use a ViT-H/16, the "Huge" variant with a 16 × 16 input patch size.
* Input parameters:
    * patch_size = 16
    * embed_dim = 1280
    * depth = 32 (number of blocks)
    * drop_path_rate = 0.55
* There are two types of embedding extractors, and we use one of them:
    * Hybrid Embed:
        * The HybridEmbed class uses a pretrained CNN backbone to extract feature maps from images. These feature maps are then flattened and projected into the embedding space.
    * Patch Embed:
        * The PatchEmbed class converts image inputs into patch embeddings. This is achieved by dividing the input image into patches and projecting them into a lower-dimensional embedding space using a convolutional layer (a minimal sketch of this step follows the Block definition below).
* Create the positional embedding.
```python=
# since the pretraining model has class token
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
```
* Create the stochastic-depth (drop path) decay rates and `depth` block layers.
```python=
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]  # stochastic depth decay rule
self.blocks = nn.ModuleList([
    Block(
        dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias,
        qk_scale=qk_scale, drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i],
        norm_layer=norm_layer,
    )
    for i in range(depth)])
```
* The last layer is a normalization layer.
```python=
self.last_norm = norm_layer(embed_dim) if last_norm else nn.Identity()
```
* Each block consists of two main components:
    1. Multi-head self-attention mechanism
    2. Feedforward neural network (MLP)
```python=
class Block(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None,
                 drop=0., attn_drop=0., drop_path=0., act_layer=nn.GELU,
                 norm_layer=nn.LayerNorm, attn_head_dim=None):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(
            dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale,
            attn_drop=attn_drop, proj_drop=drop, attn_head_dim=attn_head_dim
        )
        # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

    def forward(self, x):
        x = x + self.drop_path(self.attn(self.norm1(x)))
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x
```
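* To make the Patch Embed step concrete, here is a minimal, self-contained sketch (not the repo's PatchEmbed class) of how a convolution with kernel = stride = patch size turns an image into the token sequence the blocks consume. The 256 × 192 crop size is an illustrative assumption.
```python=
import torch
import torch.nn as nn

# Sketch of the patch-embedding idea: a strided conv splits the image into
# 16x16 patches and projects each one to embed_dim. Crop size is illustrative.
patch_size, embed_dim = 16, 1280
proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

img = torch.randn(1, 3, 256, 192)          # (B, C, H, W) person crop
feat = proj(img)                           # (1, 1280, 16, 12) patch grid
tokens = feat.flatten(2).transpose(1, 2)   # (1, 192, 1280) image tokens
print(tokens.shape)
```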
SMPL Head (Transformer Decoder) (HMR 2)
---
* We use a standard transformer decoder with multi-head self-attention.
* It processes a single (zero) input token by cross-attending to the output image tokens and ends with a linear readout of Θ.
* Initialization values:
    * joint_rep_type
    * joint_rep_dim
    * npose is the number of pose parameters we predict. With the 6D rotation representation, joint_rep_dim = 6, and with 23 body joints plus the global orientation this gives npose = 6 × 24 = 144 (a sketch of the 6D-to-rotation-matrix conversion is given at the end of this section).
    * input_is_mean_shape (boolean)
    * transformer_args
```python=
transformer_args = dict(
    num_tokens=1,
    token_dim=(npose + 10 + 3) if self.input_is_mean_shape else 1,
    dim=1024,
)
```
* Initialize body pose (θ), shape (β) and camera (π) with the SMPL mean parameters.
```python=
# SMPL.MEAN_PARAMS refers to data/smpl_mean_params.npz
mean_params = np.load(cfg.SMPL.MEAN_PARAMS)
init_body_pose = torch.from_numpy(mean_params['pose'].astype(np.float32)).unsqueeze(0)
init_betas = torch.from_numpy(mean_params['shape'].astype('float32')).unsqueeze(0)
init_cam = torch.from_numpy(mean_params['cam'].astype(np.float32)).unsqueeze(0)
```
* Initialize the transformer with transformer_args as its parameters.
```python=
self.transformer = TransformerDecoder(**transformer_args)
```
* In TransformerDecoder we use **DropTokenDropout** to prevent overfitting.
```python=
self.pos_embedding = nn.Parameter(torch.randn(1, num_tokens, dim))
if emb_dropout_type == "drop":
    self.dropout = DropTokenDropout(emb_dropout)
```
* We use the TransformerCrossAttn class.
```python=
self.transformer = TransformerCrossAttn(
    dim,
    depth,
    heads,
    dim_head,
    mlp_dim,
    dropout,
    norm=norm,
    norm_cond_dim=norm_cond_dim,
    context_dim=context_dim,
)
```
* **depth** layers are created for the transformer block.
* Each layer contains:
    * **Self-Attention (sa)** - lets the model attend to different parts of the same input sequence.
    * **Cross-Attention (ca)** - lets the model attend to the output image tokens, so it can incorporate information from the image context during decoding.
    * **FeedForward** - the last part of the layer.
```python=
class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout),
        )
```
* All of the above components are wrapped in a PreNorm layer and appended to self.layers.
```python=
self.layers = nn.ModuleList([])
for _ in range(depth):
    sa = Attention(dim, heads=heads, dim_head=dim_head, dropout=dropout)
    ca = CrossAttention(
        dim, context_dim=context_dim, heads=heads, dim_head=dim_head, dropout=dropout
    )
    ff = FeedForward(dim, mlp_dim, dropout=dropout)
    self.layers.append(
        nn.ModuleList(
            [
                PreNorm(dim, sa, norm=norm, norm_cond_dim=norm_cond_dim),
                PreNorm(dim, ca, norm=norm, norm_cond_dim=norm_cond_dim),
                PreNorm(dim, ff, norm=norm, norm_cond_dim=norm_cond_dim),
            ]
        )
    )
```
* While running the model (in the forward function):
    * We pass the initial mean parameters as the input token and the ViT output as context to the transformer decoder.
```python=
def forward(self, x, **kwargs):
    batch_size = x.shape[0]
    # vit pretrained backbone is channel-first. Change to token-first
    x = einops.rearrange(x, 'b c h w -> b (h w) c')

    init_body_pose = self.init_body_pose.expand(batch_size, -1)
    init_betas = self.init_betas.expand(batch_size, -1)
    init_cam = self.init_cam.expand(batch_size, -1)

    pred_body_pose = init_body_pose
    pred_betas = init_betas
    pred_cam = init_cam
    pred_body_pose_list = []
    pred_betas_list = []
    pred_cam_list = []
    for i in range(self.cfg.MODEL.SMPL_HEAD.get('IEF_ITERS', 1)):
        # Input token to transformer is zero token
        if self.input_is_mean_shape:
            token = torch.cat([pred_body_pose, pred_betas, pred_cam], dim=1)[:, None, :]
        else:
            token = torch.zeros(batch_size, 1, 1).to(x.device)

        # Pass through transformer
        token_out = self.transformer(token, context=x)
        token_out = token_out.squeeze(1)  # (B, C)

        # Readout from token_out
        pred_body_pose = self.decpose(token_out) + pred_body_pose
        pred_betas = self.decshape(token_out) + pred_betas
        pred_cam = self.deccam(token_out) + pred_cam
        pred_body_pose_list.append(pred_body_pose)
        pred_betas_list.append(pred_betas)
        pred_cam_list.append(pred_cam)
```
* The transformer decoder is run for IEF_ITERS (iterative error feedback) iterations.
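* The pose readout above is in the 6D rotation representation (joint_rep_dim = 6, so pred_body_pose has 6 × 24 = 144 values). For reference, here is a generic 6D-to-rotation-matrix conversion in the style of Zhou et al.; this is a hedged sketch and not necessarily the exact utility used in the repo.
```python=
import torch

# Generic 6D -> rotation-matrix conversion (Gram-Schmidt on two 3D vectors);
# a sketch of what joint_rep_dim = 6 implies, not necessarily the repo's code.
def rot6d_to_rotmat(x: torch.Tensor) -> torch.Tensor:
    """x: (..., 6) -> (..., 3, 3)"""
    a1, a2 = x[..., 0:3], x[..., 3:6]
    b1 = torch.nn.functional.normalize(a1, dim=-1)
    b2 = torch.nn.functional.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack((b1, b2, b3), dim=-1)  # b1, b2, b3 become the matrix columns

# npose = 6 * (23 body joints + 1 global orientation) = 144
pose_6d = torch.randn(2, 24, 6)
rotmats = rot6d_to_rotmat(pose_6d)  # (2, 24, 3, 3)
```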
Loss Functions (HMR 2)
---
1. When the image has accurate ground-truth 3D keypoint annotations $X^*$, we compute an L1 loss between them and the predicted 3D keypoints $X$:
$$
L_{kp3D} = \| X - X^* \|_1
$$
```python=
self.keypoint_3d_loss = Keypoint3DLoss(loss_type='l1')

def forward(self, pred_keypoints_3d: torch.Tensor, gt_keypoints_3d: torch.Tensor, pelvis_id: int = 39):
    batch_size = pred_keypoints_3d.shape[0]
    gt_keypoints_3d = gt_keypoints_3d.clone()
    pred_keypoints_3d = pred_keypoints_3d - pred_keypoints_3d[:, pelvis_id, :].unsqueeze(dim=1)
    gt_keypoints_3d[:, :, :-1] = gt_keypoints_3d[:, :, :-1] - gt_keypoints_3d[:, pelvis_id, :-1].unsqueeze(dim=1)
    conf = gt_keypoints_3d[:, :, -1].unsqueeze(-1).clone()
    gt_keypoints_3d = gt_keypoints_3d[:, :, :-1]
    loss = (conf * self.loss_fn(pred_keypoints_3d, gt_keypoints_3d)).sum(dim=(1, 2))
    return loss.sum()
```
2. When the image has accurate ground-truth 2D keypoint annotations $x^*$, we compute an L1 loss between them and the projected 2D keypoints $\pi(X)$; keypoints with zero confidence are ignored (a small worked example follows at the end of this section):
$$
L_{kp2D} = \| \pi(X) - x^* \|_1
$$
```python=
self.keypoint_2d_loss = Keypoint2DLoss(loss_type='l1')

def forward(self, pred_keypoints_2d: torch.Tensor, gt_keypoints_2d: torch.Tensor) -> torch.Tensor:
    conf = gt_keypoints_2d[:, :, -1].unsqueeze(-1).clone()
    batch_size = conf.shape[0]
    loss = (conf * self.loss_fn(pred_keypoints_2d, gt_keypoints_2d[:, :, :-1])).sum(dim=(1, 2))
    return loss.sum()
```
3. If we have ground-truth SMPL parameters, we supervise the predicted pose (θ) and shape (β) with an MSE loss:
$$
L_{smpl} = \| \theta - \theta^* \|_2^2 + \| \beta - \beta^* \|_2^2
$$
```python=
self.smpl_parameter_loss = ParameterLoss()

def forward(self, pred_param: torch.Tensor, gt_param: torch.Tensor, has_param: torch.Tensor):
    batch_size = pred_param.shape[0]
    num_dims = len(pred_param.shape)
    mask_dimension = [batch_size] + [1] * (num_dims - 1)
    has_param = has_param.type(pred_param.type()).view(*mask_dimension)
    loss_param = (has_param * self.loss_fn(pred_param, gt_param))
    return loss_param.sum()
```
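Both keypoint losses above rely on a confidence value stored in the last channel of the ground-truth keypoints. Here is a tiny worked example (made-up numbers, not repo code) of how that masking behaves for the 2D case:
```python=
import torch

# Joints with confidence 0 contribute nothing to the loss.
pred = torch.tensor([[[0.10, 0.20], [0.50, 0.50]]])   # (B=1, K=2, 2) predictions
gt = torch.tensor([[[0.00, 0.00, 1.0],                # visible joint, conf = 1
                    [9.00, 9.00, 0.0]]])              # missing annotation, conf = 0
conf = gt[:, :, -1].unsqueeze(-1)
loss = (conf * torch.nn.functional.l1_loss(pred, gt[:, :, :-1], reduction='none')).sum()
print(loss)  # tensor(0.3000): only the visible joint contributes
```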
Discriminator (HMR 2)
---
* The discriminator, together with the adversarial prior from HMR, encourages the model to predict valid 3D poses.
* We train a discriminator $D_k$ for each factor of the body model, and the generator loss can be expressed as:
$$
L_{adv} = \sum_k (D_k(\theta_b, \beta) - 1)^2
$$
where $\theta_b$ are the body pose parameters and $\beta$ are the shape parameters.
```python=
if self.cfg.LOSS_WEIGHTS.ADVERSARIAL > 0:
    disc_out = self.discriminator(pred_smpl_params['body_pose'].reshape(batch_size, -1),
                                  pred_smpl_params['betas'].reshape(batch_size, -1))
    loss_adv = ((disc_out - 1.0) ** 2).sum() / batch_size
    loss = loss + self.cfg.LOSS_WEIGHTS.ADVERSARIAL * loss_adv
```
* ***Training the discriminator*** (a toy numeric illustration follows at the end of this section):
    * disc_fake_out is the discriminator output for the predicted body pose and shape.
    * disc_real_out is the discriminator output for the ground-truth body pose and shape.
    * The loss is the adversarial weight from the config file multiplied by loss_fake + loss_real:
$$
L_{disc} = \sum_k \underbrace{(D_k(\theta_b, \beta) - 0)^2}_{a} + \underbrace{(D_k(\theta_b^*, \beta^*) - 1)^2}_{b}
$$
where $^*$ denotes ground truth, term a is loss_fake and term b is loss_real.
    * An optimizer step updates the discriminator weights during training.
```python=
def training_step_discriminator(self, batch: Dict,
                                body_pose: torch.Tensor,
                                betas: torch.Tensor,
                                optimizer: torch.optim.Optimizer) -> torch.Tensor:
    batch_size = body_pose.shape[0]
    gt_body_pose = batch['body_pose']
    gt_betas = batch['betas']
    gt_rotmat = aa_to_rotmat(gt_body_pose.view(-1, 3)).view(batch_size, -1, 3, 3)
    disc_fake_out = self.discriminator(body_pose.detach(), betas.detach())
    loss_fake = ((disc_fake_out - 0.0) ** 2).sum() / batch_size
    disc_real_out = self.discriminator(gt_rotmat, gt_betas)
    loss_real = ((disc_real_out - 1.0) ** 2).sum() / batch_size
    loss_disc = loss_fake + loss_real
    loss = self.cfg.LOSS_WEIGHTS.ADVERSARIAL * loss_disc
    optimizer.zero_grad()
    self.manual_backward(loss)
    optimizer.step()
    return loss_disc.detach()
```
* We build the discriminator network with branches for the individual joint poses, all pose joints together, and the shape.
```python=
# poses_alone
self.D_conv1 = nn.Conv2d(9, 32, kernel_size=1)
nn.init.xavier_uniform_(self.D_conv1.weight)
nn.init.zeros_(self.D_conv1.bias)
self.relu = nn.ReLU(inplace=True)
self.D_conv2 = nn.Conv2d(32, 32, kernel_size=1)
nn.init.xavier_uniform_(self.D_conv2.weight)
nn.init.zeros_(self.D_conv2.bias)
pose_out = []
for i in range(self.num_joints):
    pose_out_temp = nn.Linear(32, 1)
    nn.init.xavier_uniform_(pose_out_temp.weight)
    nn.init.zeros_(pose_out_temp.bias)
    pose_out.append(pose_out_temp)
self.pose_out = nn.ModuleList(pose_out)

# betas
self.betas_fc1 = nn.Linear(10, 10)
nn.init.xavier_uniform_(self.betas_fc1.weight)
nn.init.zeros_(self.betas_fc1.bias)
self.betas_fc2 = nn.Linear(10, 5)
nn.init.xavier_uniform_(self.betas_fc2.weight)
nn.init.zeros_(self.betas_fc2.bias)
self.betas_out = nn.Linear(5, 1)
nn.init.xavier_uniform_(self.betas_out.weight)
nn.init.zeros_(self.betas_out.bias)

# poses_joint
self.D_alljoints_fc1 = nn.Linear(32*self.num_joints, 1024)
nn.init.xavier_uniform_(self.D_alljoints_fc1.weight)
nn.init.zeros_(self.D_alljoints_fc1.bias)
self.D_alljoints_fc2 = nn.Linear(1024, 1024)
nn.init.xavier_uniform_(self.D_alljoints_fc2.weight)
nn.init.zeros_(self.D_alljoints_fc2.bias)
self.D_alljoints_out = nn.Linear(1024, 1)
nn.init.xavier_uniform_(self.D_alljoints_out.weight)
nn.init.zeros_(self.D_alljoints_out.bias)
```
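* To make the least-squares targets concrete, here is a toy illustration with made-up discriminator outputs; it mirrors the loss_fake / loss_real / loss_adv terms in the code but is not taken from the repo.
```python=
import torch

# Toy LSGAN targets: the discriminator is pushed towards 1 on real (MoCap)
# parameters and 0 on generated ones; the generator is rewarded for scores
# close to 1. All values below are made up.
disc_fake_out = torch.tensor([0.3, 0.1])   # D_k(theta_hat, beta_hat) per factor
disc_real_out = torch.tensor([0.9, 0.8])   # D_k(theta*, beta*) per factor

loss_fake = ((disc_fake_out - 0.0) ** 2).sum()   # discriminator: fake -> 0
loss_real = ((disc_real_out - 1.0) ** 2).sum()   # discriminator: real -> 1
loss_disc = loss_fake + loss_real

loss_adv = ((disc_fake_out - 1.0) ** 2).sum()    # generator: fool D towards 1
```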
Training (HMR 2)
---
(Look into hmr2.py/training_step())

* We use TensorBoard for logging.
* **batch** contains the input images from the image datasets and **mocap_batch** contains the ground-truth MoCap data.
```python=
def training_step(self, joint_batch: Dict, batch_idx: int) -> Dict:
    batch = joint_batch['img']
    mocap_batch = joint_batch['mocap']
```
* Set up AdamW optimizers for the model parameters (via self.get_parameters()) and for the discriminator (LR = 1e-5, weight_decay = 1e-4).
```python=
def configure_optimizers(self) -> Tuple[torch.optim.Optimizer, torch.optim.Optimizer]:
    param_groups = [{'params': filter(lambda p: p.requires_grad, self.get_parameters()), 'lr': self.cfg.TRAIN.LR}]

    optimizer = torch.optim.AdamW(params=param_groups,
                                  # lr=self.cfg.TRAIN.LR,
                                  weight_decay=self.cfg.TRAIN.WEIGHT_DECAY)
    optimizer_disc = torch.optim.AdamW(params=self.discriminator.parameters(),
                                       lr=self.cfg.TRAIN.LR,
                                       weight_decay=self.cfg.TRAIN.WEIGHT_DECAY)
    return optimizer, optimizer_disc
```
* Pass the training data through the model and compute the loss from its output.
```python=
batch_size = batch['img'].shape[0]
output = self.forward_step(batch, train=True)
pred_smpl_params = output['pred_smpl_params']
loss = self.compute_loss(batch, output, train=True)
```
* Pass the predicted body_pose and shape (betas) to the discriminator and add the adversarial (generator) loss.
```python=
if self.cfg.LOSS_WEIGHTS.ADVERSARIAL > 0:
    disc_out = self.discriminator(pred_smpl_params['body_pose'].reshape(batch_size, -1),
                                  pred_smpl_params['betas'].reshape(batch_size, -1))
    loss_adv = ((disc_out - 1.0) ** 2).sum() / batch_size
    loss = loss + self.cfg.LOSS_WEIGHTS.ADVERSARIAL * loss_adv
```
* Run the discriminator training step and store loss_adv and loss_disc in the outputs.
```python=
optimizer.step()
if self.cfg.LOSS_WEIGHTS.ADVERSARIAL > 0:
    loss_disc = self.training_step_discriminator(mocap_batch,
                                                 pred_smpl_params['body_pose'].reshape(batch_size, -1),
                                                 pred_smpl_params['betas'].reshape(batch_size, -1),
                                                 optimizer_disc)
    output['losses']['loss_gen'] = loss_adv
    output['losses']['loss_disc'] = loss_disc
```

Pose Transformer v2 (used while running on videos)
---
(refer to https://arxiv.org/pdf/2304.01199)

* We use the lart_transformer encoder.
```python=
self.encoder = lart_transformer(
    opt         = self.cfg,
    phalp_cfg   = self.phalp_cfg,
    dim         = self.cfg.in_feat,
    depth       = self.cfg.transformer.depth,
    heads       = self.cfg.transformer.heads,
    mlp_dim     = self.cfg.transformer.mlp_dim,
    dim_head    = self.cfg.transformer.dim_head,
    dropout     = self.cfg.transformer.dropout,
    emb_dropout = self.cfg.transformer.emb_dropout,
    droppath    = self.cfg.transformer.droppath,
)
```
* In the predict_next function:
    * The inputs are en_pose and en_time.
    * We reconstruct the input data so that pose_shape_ has shape (number of persons, frame_length, 1, 229); has_detection_ and mask_detection_ follow the same layout.
```python=
# set number of people to one
n_p = 1
pose_shape_     = torch.zeros(en_pose.shape[0], self.cfg.frame_length, n_p, 229)
has_detection_  = torch.zeros(en_pose.shape[0], self.cfg.frame_length, n_p, 1)
mask_detection_ = torch.zeros(en_pose.shape[0], self.cfg.frame_length, n_p, 1)

# loop through each person and construct the input data
t_end = []
for p_ in range(en_time.shape[0]):
    t_min = en_time[p_, 0].min()
    # loop through time
    for t_ in range(en_time.shape[1]):
        # get the time from start.
        t = min(en_time[p_, t_] - t_min, self.cfg.frame_length - 1)
        # get the pose
        pose_shape_[p_, t, 0, :] = en_pose[p_, t_, :]
        # get the mask
        has_detection_[p_, t, 0, :] = 1
        t_end.append(t.item())

input_data = {
    "pose_shape"     : (pose_shape_ - self.mean_[:, :, None, :]) / (self.std_[:, :, None, :] + 1e-10),
    "has_detection"  : has_detection_,
    "mask_detection" : mask_detection_,
}
```
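* The time loop above shifts each person's detections so that their earliest frame lands in slot 0 and clamps anything beyond the window to the last slot. A tiny worked example (frame_length = 4 and the frame ids are illustrative):
```python=
import torch

# Illustration of the time re-indexing in predict_next (values made up):
# absolute frame ids are shifted by the person's first frame and clamped
# to the window length, so late detections fall into the last slot.
frame_length = 4
en_time = torch.tensor([[10, 11, 13, 20]])   # absolute frame ids for one person
t_min = en_time[0].min()
slots = [min(int(t - t_min), frame_length - 1) for t in en_time[0]]
print(slots)  # [0, 1, 3, 3]: frame 20 is clamped into the last slot
```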
* We then pass input_data to the encoder (the LART transformer) and decode its output with readout_pose.
```python=
# single forward pass
output, _ = self.encoder(input_data, self.cfg.mask_type_test)
decoded_output = self.readout_pose(output[:, self.cfg.max_people:, :])
```

### LART transformer

* Here we implement a mask token.
* Positional embeddings:
```python=
self.pos_embedding          = nn.Parameter(positionalencoding1d(self.dim, 10000))
self.pos_embedding_learned1 = nn.Parameter(torch.randn(1, self.cfg.frame_length, self.dim))
self.pos_embedding_learned2 = nn.Parameter(torch.randn(1, self.cfg.frame_length, self.dim))
self.register_buffer('pe', self.pos_embedding)
```
* We use:
    * self.pose_shape_encoder - encodes the pose and shape features; used by default.
    * self.smpl_head - SMPL head for predicting SMPL parameters.
    * self.loca_head - location head for predicting the 3D location of the person.
    * self.action_head_ava - action head for predicting the action class with AVA dataset labels (not used here).
* We add the mask at random indices.
```python=
def bert_mask(self, data, mask_type):
    if mask_type == "random":
        has_detection = data['has_detection'] == 1
        mask_detection = data['mask_detection']
        for i in range(data['has_detection'].shape[0]):
            indexes = has_detection[i].nonzero()
            indexes_mask = indexes[torch.randperm(indexes.shape[0])[:int(indexes.shape[0]*self.cfg.mask_ratio)]]
            mask_detection[i, indexes_mask[:, 0], indexes_mask[:, 1], indexes_mask[:, 2]] = 1.0
```
* We have three transformers in LART:
    * self.transformer - not used.
    * self.transformer1 - applied after adding pos_embedding_learned1 to the masked input.
    * self.transformer2 - applied after adding pos_embedding_learned2 to the output of self.transformer1.
```python=
# lart_transformer - def forward()
# prepare the input data and masking
data, has_detection, mask_detection = self.bert_mask(data, mask_type)

# encode the input pose tokens
pose_ = data['pose_shape'].float()
pose_en = self.pose_shape_encoder(pose_)
x = pose_en

# mask the input tokens
x[mask_detection[:, :, :, 0] == 1] = self.mask_token

x = x + self.pos_embedding_learned1
x = self.transformer1(x, [has_detection, mask_detection])
x = x.transpose(1, 2)
x = self.conv_en(x)
x = self.conv_de(x)
x = x.transpose(1, 2)
x = x.contiguous()
x = x + self.pos_embedding_learned2
has_detection = has_detection*0 + 1
mask_detection = mask_detection*0
x = self.transformer2(x, [has_detection, mask_detection])

x = torch.concat([self.class_token.repeat(BS, self.cfg.max_people, 1), x], dim=1)

return x, 0
```
* Open question: why are two transformers and two positional embeddings used?

Notations (HMR 2)
---
### Body Model
* **θ** is the SMPL pose (θ ∈ $\mathbb{R}^{24\times3\times3}$).
* **β** are the shape parameters (β ∈ $\mathbb{R}^{10}$).
* θ includes:
    * $\theta_{b} \in \mathbb{R}^{23\times3\times3}$ - body pose parameters
    * $\theta_{g} \in \mathbb{R}^{3\times3}$ - global orientation
* Using θ and β we get the **mesh** M ∈ $\mathbb{R}^{3\times N}$ with N = 6890 vertices.
* Body joints **X** ∈ $\mathbb{R}^{3\times k}$ are defined as a linear combination of the vertices and can be computed as X = MW with fixed weights W ∈ $\mathbb{R}^{N\times k}$.

### Camera
* We use a perspective camera model with fixed focal length and intrinsics K.
* Each camera π = (R, t) consists of a global orientation R ∈ $\mathbb{R}^{3\times3}$ and translation t ∈ $\mathbb{R}^{3}$.
* Points in SMPL space (e.g., joints X) can be projected to the image as x = π(X) = Π(K(RX + t)), where Π is a perspective projection with camera intrinsics K (a minimal sketch of this projection is given at the end of this note).

### HMR
* Θ = [θ, β, π] = f(I), where
    * f is the model,
    * I is a single image,
    * f(I) is the prediction.
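To tie the camera notation together, here is a minimal sketch of x = Π(K(RX + t)) with a simple intrinsics matrix (fixed focal length and principal point). The focal length, principal point and number of joints are illustrative assumptions, not HMR 2.0's settings.
```python=
import torch

# Sketch of the projection x = Pi(K(RX + t)) from the Camera notation.
def perspective_project(X, R, t, focal, cx, cy):
    """X: (B, K, 3) joints -> (B, K, 2) pixel coordinates."""
    X_cam = torch.einsum('bij,bkj->bki', R, X) + t[:, None, :]  # R X + t
    x = X_cam[..., :2] / X_cam[..., 2:3]                        # perspective division (Pi)
    return x * focal + torch.tensor([cx, cy])                   # apply intrinsics K

B, num_joints = 1, 24
X = torch.randn(B, num_joints, 3) + torch.tensor([0.0, 0.0, 5.0])  # joints in front of the camera
R = torch.eye(3).expand(B, 3, 3)                                   # global orientation
t = torch.zeros(B, 3)                                              # translation
x = perspective_project(X, R, t, focal=1000.0, cx=128.0, cy=128.0)
```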