---
title: 'HMR 2.0 along with code'
disqus: hackmd
---
HMR 2.0 along with code
===
## Index
[TOC]
Setup HMR 2
---
(4D-Humans/hmr2/models/hmr2.py)
* Create the backbone feature extractor using a ViT (Vision Transformer).
```python=
# Create backbone feature extractor
self.backbone = create_backbone(cfg)
if cfg.MODEL.BACKBONE.get('PRETRAINED_WEIGHTS', None):
log.info(f'Loading backbone weights from {cfg.MODEL.BACKBONE.PRETRAINED_WEIGHTS}')
self.backbone.load_state_dict(torch.load(cfg.MODEL.BACKBONE.PRETRAINED_WEIGHTS, map_location='cpu')['state_dict'])
```
* Create the SMPL head, which takes the image tokens (conditioning_feats) from the ViT backbone as input.
```python=
# Create SMPL head
self.smpl_head = build_smpl_head(cfg)
```
* Create discriminator
```python=
# Create discriminator
if self.cfg.LOSS_WEIGHTS.ADVERSARIAL > 0:
self.discriminator = Discriminator()
```
* Define loss functions
```python=
# Define loss functions
self.keypoint_3d_loss = Keypoint3DLoss(loss_type='l1')
self.keypoint_2d_loss = Keypoint2DLoss(loss_type='l1')
self.smpl_parameter_loss = ParameterLoss()
```
ViT (backbone feature extractor) (HMR 2)
---
* This is used to extract image tokens.
* We use ViT-H/16, the “Huge” variant with a 16 × 16 input patch size.
* Input parameters:
* patch_size = 16
* embed_dim = 1280
* depth = 32 (number of blocks)
* drop_path_rate = 0.55
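* For reference, the number of image tokens follows directly from the patch size; assuming 256 × 192 input crops for the ViTPose-style backbone (an assumption — check the backbone's `img_size`), the token grid works out to:
$$
\text{num\_patches} = \frac{256}{16} \times \frac{192}{16} = 16 \times 12 = 192
$$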
* There are two types of embedding extractors, of which only one is used:
    * Hybrid Embed: the HybridEmbed class uses a pretrained CNN backbone to extract feature maps from the image; these are then flattened and projected into the embedding space.
    * Patch Embed: the PatchEmbed class converts the input image into patch embeddings by dividing it into patches and projecting each patch into the embedding space with a convolutional layer (a minimal sketch follows below).
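* As a rough illustration of the patch-embedding idea (a minimal sketch only, not the repo's PatchEmbed class):
```python=
import torch
import torch.nn as nn

class SimplePatchEmbed(nn.Module):
    """Minimal sketch: split the image into 16x16 patches and project each one to embed_dim."""
    def __init__(self, in_chans=3, embed_dim=1280, patch_size=16):
        super().__init__()
        # A conv with kernel_size = stride = patch_size is equivalent to flattening each
        # patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, embed_dim)

tokens = SimplePatchEmbed()(torch.randn(1, 3, 256, 192))
print(tokens.shape)  # torch.Size([1, 192, 1280])
```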
* Create Positional Embedding
```python=
# since the pretraining model has class token
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
```
* Build the stochastic-depth (drop path) schedule and create `depth` transformer blocks.
```python=
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
self.blocks = nn.ModuleList([
Block(
dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer,
)
for i in range(depth)])
```
* The last layer is a normalization layer.
```python=
self.last_norm = norm_layer(embed_dim) if last_norm else nn.Identity()
```
* Each block typically consists of two main components:
1. Multi-head self-attention mechanism
2. Feedforward neural network (MLP).
```python=
class Block(nn.Module):
def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None,
drop=0., attn_drop=0., drop_path=0., act_layer=nn.GELU,
norm_layer=nn.LayerNorm, attn_head_dim=None
):
super().__init__()
self.norm1 = norm_layer(dim)
self.attn = Attention(
dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale,
attn_drop=attn_drop, proj_drop=drop, attn_head_dim=attn_head_dim
)
# NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
self.norm2 = norm_layer(dim)
mlp_hidden_dim = int(dim * mlp_ratio)
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
def forward(self, x):
x = x + self.drop_path(self.attn(self.norm1(x)))
x = x + self.drop_path(self.mlp(self.norm2(x)))
return x
```
SMPL Head (Transformer Decoder) (HMR 2)
---
* We use a standard transformer decoder with multi-head self-attention.
* It processes a single (zero) input token by cross-attending to the image tokens output by the ViT, and ends with a linear readout of Θ.
* Initial values
* joint_rep_type
* joint_rep_dim
    * npose is the number of pose parameters: joint_rep_dim × (23 body joints + 1 global orientation), e.g. 6 × 24 = 144 for the 6D rotation representation
* input_is_mean_shape (boolean)
* transformer_args
```python=
transformer_args = dict(
num_tokens=1,
token_dim=(npose + 10 + 3) if self.input_is_mean_shape else 1,
dim=1024,
)
```
* Initialize body pose (θ), shape (β) and camera (π) with the SMPL mean parameters
```python=
# SMPL.MEAN_PARAMS refer data/smpl_mean_params.npz
mean_params = np.load(cfg.SMPL.MEAN_PARAMS)
init_body_pose = torch.from_numpy(mean_params['pose'].astype(np.float32)).unsqueeze(0)
init_betas = torch.from_numpy(mean_params['shape'].astype('float32')).unsqueeze(0)
init_cam = torch.from_numpy(mean_params['cam'].astype(np.float32)).unsqueeze(0)
```
* Initialize the transformer decoder, passing transformer_args to it
```python=
self.transformer = TransformerDecoder(**transformer_args)
```
* In TransformerDecoder we use **DropTokenDropout** to prevent overfitting.
```python=
self.pos_embedding = nn.Parameter(torch.randn(1, num_tokens, dim))
if emb_dropout_type == "drop":
self.dropout = DropTokenDropout(emb_dropout)
```
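* The idea is to drop whole tokens (rather than individual features) during training; a hypothetical sketch of such a module (the repo's DropTokenDropout may differ in detail):
```python=
import torch
import torch.nn as nn

class TokenDropoutSketch(nn.Module):
    """Hypothetical sketch: randomly zero out entire tokens during training."""
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, x):                 # x: (B, num_tokens, dim)
        if self.training and self.p > 0:
            keep = (torch.rand(x.shape[0], x.shape[1], 1, device=x.device) > self.p).float()
            x = x * keep                  # zero out whole tokens, keep the feature dimension
        return x
```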
* We use the TransformerCrossAttn class.
```python=
self.transformer = TransformerCrossAttn(
dim,
depth,
heads,
dim_head,
mlp_dim,
dropout,
norm=norm,
norm_cond_dim=norm_cond_dim,
context_dim=context_dim,
)
```
* **depth** layers are created for the transformer block
* each layer contains:
    * **Self-Attention (sa)** - lets the model attend to different parts of the same input sequence.
    * **Cross-Attention (ca)** - lets the model attend to the ViT image tokens, incorporating image context during decoding (a single-head sketch follows the FeedForward code below).
    * **FeedForward** - the last part of each layer
```python=
class FeedForward(nn.Module):
def __init__(self, dim, hidden_dim, dropout=0.0):
super().__init__()
self.net = nn.Sequential(
nn.Linear(dim, hidden_dim),
nn.GELU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim, dim),
nn.Dropout(dropout),
)
```
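* For intuition, a single-head sketch of the cross-attention computation (queries from the decoder token, keys/values from the ViT image tokens); the repo's CrossAttention class additionally supports multiple heads and dropout, as shown below:
```python=
import torch
import torch.nn as nn

class CrossAttentionSketch(nn.Module):
    """Minimal single-head sketch: the decoder token attends to the image tokens (context)."""
    def __init__(self, dim, context_dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)            # queries from the decoder token
        self.to_k = nn.Linear(context_dim, dim, bias=False)    # keys from the ViT tokens
        self.to_v = nn.Linear(context_dim, dim, bias=False)    # values from the ViT tokens
        self.scale = dim ** -0.5

    def forward(self, x, context):    # x: (B, 1, dim), context: (B, num_patches, context_dim)
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)   # (B, 1, num_patches)
        return attn @ v                                                  # (B, 1, dim)
```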
* Each of these components is wrapped in a PreNorm layer and appended to self.layers.
```python=
self.layers = nn.ModuleList([])
for _ in range(depth):
sa = Attention(dim, heads=heads, dim_head=dim_head, dropout=dropout)
ca = CrossAttention(
dim, context_dim=context_dim, heads=heads, dim_head=dim_head, dropout=dropout
)
ff = FeedForward(dim, mlp_dim, dropout=dropout)
self.layers.append(
nn.ModuleList(
[
PreNorm(dim, sa, norm=norm, norm_cond_dim=norm_cond_dim),
PreNorm(dim, ca, norm=norm, norm_cond_dim=norm_cond_dim),
PreNorm(dim, ff, norm=norm, norm_cond_dim=norm_cond_dim),
]
)
)
```
* While running this model (in the forward function):
    * we pass the initial mean parameters (or a zero token) as the input token, and the ViT output as the context, to the transformer decoder.
```python=
def forward(self, x, **kwargs):
batch_size = x.shape[0]
# vit pretrained backbone is channel-first. Change to token-first
x = einops.rearrange(x, 'b c h w -> b (h w) c')
init_body_pose = self.init_body_pose.expand(batch_size, -1)
init_betas = self.init_betas.expand(batch_size, -1)
init_cam = self.init_cam.expand(batch_size, -1)
pred_body_pose = init_body_pose
pred_betas = init_betas
pred_cam = init_cam
pred_body_pose_list = []
pred_betas_list = []
pred_cam_list = []
for i in range(self.cfg.MODEL.SMPL_HEAD.get('IEF_ITERS', 1)):
# Input token to transformer is zero token
if self.input_is_mean_shape:
token = torch.cat([pred_body_pose, pred_betas, pred_cam], dim=1)[:,None,:]
else:
token = torch.zeros(batch_size, 1, 1).to(x.device)
# Pass through transformer
token_out = self.transformer(token, context=x)
token_out = token_out.squeeze(1) # (B, C)
# Readout from token_out
pred_body_pose = self.decpose(token_out) + pred_body_pose
pred_betas = self.decshape(token_out) + pred_betas
pred_cam = self.deccam(token_out) + pred_cam
pred_body_pose_list.append(pred_body_pose)
pred_betas_list.append(pred_betas)
pred_cam_list.append(pred_cam)
```
* The transformer decoder is run for IEF_ITERS iterations of iterative error feedback.
Loss Functions (HMR 2)
---
1. When the image has accurate ground-truth 3D keypoint annotations X*, we compute an L1 loss between them and the predicted 3D keypoints X:
$$
L_{kp3D}=|| X-X^*||_{1}
$$
```python=
self.keypoint_3d_loss = Keypoint3DLoss(loss_type='l1')
def forward(self, pred_keypoints_3d: torch.Tensor, gt_keypoints_3d: torch.Tensor, pelvis_id: int = 39):
batch_size = pred_keypoints_3d.shape[0]
gt_keypoints_3d = gt_keypoints_3d.clone()
pred_keypoints_3d = pred_keypoints_3d - pred_keypoints_3d[:, pelvis_id, :].unsqueeze(dim=1)
gt_keypoints_3d[:, :, :-1] = gt_keypoints_3d[:, :, :-1] - gt_keypoints_3d[:, pelvis_id, :-1].unsqueeze(dim=1)
conf = gt_keypoints_3d[:, :, -1].unsqueeze(-1).clone()
gt_keypoints_3d = gt_keypoints_3d[:, :, :-1]
loss = (conf * self.loss_fn(pred_keypoints_3d, gt_keypoints_3d)).sum(dim=(1,2))
return loss.sum()
```
2. When the image has accurate ground-truth 2D keypoint annotations x*, we compute an L1 loss between them and the projected 2D keypoints π(X):
$$
L_{kp2D}=|| π(X)-x^*||_{1}
$$
```python=
self.keypoint_2d_loss = Keypoint2DLoss(loss_type='l1')
def forward(self, pred_keypoints_2d: torch.Tensor, gt_keypoints_2d: torch.Tensor) -> torch.Tensor:
conf = gt_keypoints_2d[:, :, -1].unsqueeze(-1).clone()
batch_size = conf.shape[0]
loss = (conf * self.loss_fn(pred_keypoints_2d, gt_keypoints_2d[:, :, :-1])).sum(dim=(1,2))
return loss.sum()
```
3. If we have ground-truth SMPL parameters, we supervise the predicted pose (θ) and shape (β) with an MSE loss:
$$
L_{smpl}=|| \theta-\theta^*||^2_{2} + || \beta-\beta^*||^2_{2}
$$
```python=
self.smpl_parameter_loss = ParameterLoss()
def forward(self, pred_param: torch.Tensor, gt_param: torch.Tensor, has_param: torch.Tensor):
batch_size = pred_param.shape[0]
num_dims = len(pred_param.shape)
mask_dimension = [batch_size] + [1] * (num_dims-1)
has_param = has_param.type(pred_param.type()).view(*mask_dimension)
loss_param = (has_param * self.loss_fn(pred_param, gt_param))
return loss_param.sum()
```
Discriminator (HMR 2)
---
* To encourage the model to predict valid 3D poses, we use the adversarial prior from HMR.
* We train a discriminator $\ D_{k}$ for each factor of the body model, and the generator loss can be expressed as:
$$
L_{adv}=\sum_k (D_k(\theta_b,\beta)-1)^2
$$
where $\ \theta_b$ are the body pose parameters and β are the shape parameters.
```python=
if self.cfg.LOSS_WEIGHTS.ADVERSARIAL > 0:
disc_out = self.discriminator(pred_smpl_params['body_pose'].reshape(batch_size, -1), pred_smpl_params['betas'].reshape(batch_size, -1))
loss_adv = ((disc_out - 1.0) ** 2).sum() / batch_size
loss = loss + self.cfg.LOSS_WEIGHTS.ADVERSARIAL * loss_adv
```
* ***Training the discriminator***
    * disc_fake_out is the discriminator output for the predicted body pose and shape
    * disc_real_out is the discriminator output for the ground-truth body pose and shape
    * the loss is loss_fake + loss_real, scaled by the adversarial weight from the config file:
$$
L_{disc}=\sum_k \underbrace{\big(D_k(\theta_b,\beta)-0\big)^2}_{\text{loss\_fake}}+\underbrace{\big(D_k(\theta_b^*,\beta^*)-1\big)^2}_{\text{loss\_real}}
$$
where * denotes ground-truth values.
* The discriminator's optimizer then updates its weights (manual backward and step):
```python=
def training_step_discriminator(self, batch: Dict,
body_pose: torch.Tensor,
betas: torch.Tensor,
optimizer: torch.optim.Optimizer) -> torch.Tensor:
batch_size = body_pose.shape[0]
gt_body_pose = batch['body_pose']
gt_betas = batch['betas']
gt_rotmat = aa_to_rotmat(gt_body_pose.view(-1,3)).view(batch_size, -1, 3, 3)
disc_fake_out = self.discriminator(body_pose.detach(), betas.detach())
loss_fake = ((disc_fake_out - 0.0) ** 2).sum() / batch_size
disc_real_out = self.discriminator(gt_rotmat, gt_betas)
loss_real = ((disc_real_out - 1.0) ** 2).sum() / batch_size
loss_disc = loss_fake + loss_real
loss = self.cfg.LOSS_WEIGHTS.ADVERSARIAL * loss_disc
optimizer.zero_grad()
self.manual_backward(loss)
optimizer.step()
return loss_disc.detach()
```
* We build the discriminator network with branches for per-joint pose, all joints combined, and shape (betas); a sketch of the forward pass follows the code below.
```python=
# poses_alone
self.D_conv1 = nn.Conv2d(9, 32, kernel_size=1)
nn.init.xavier_uniform_(self.D_conv1.weight)
nn.init.zeros_(self.D_conv1.bias)
self.relu = nn.ReLU(inplace=True)
self.D_conv2 = nn.Conv2d(32, 32, kernel_size=1)
nn.init.xavier_uniform_(self.D_conv2.weight)
nn.init.zeros_(self.D_conv2.bias)
pose_out = []
for i in range(self.num_joints):
pose_out_temp = nn.Linear(32, 1)
nn.init.xavier_uniform_(pose_out_temp.weight)
nn.init.zeros_(pose_out_temp.bias)
pose_out.append(pose_out_temp)
self.pose_out = nn.ModuleList(pose_out)
# betas
self.betas_fc1 = nn.Linear(10, 10)
nn.init.xavier_uniform_(self.betas_fc1.weight)
nn.init.zeros_(self.betas_fc1.bias)
self.betas_fc2 = nn.Linear(10, 5)
nn.init.xavier_uniform_(self.betas_fc2.weight)
nn.init.zeros_(self.betas_fc2.bias)
self.betas_out = nn.Linear(5, 1)
nn.init.xavier_uniform_(self.betas_out.weight)
nn.init.zeros_(self.betas_out.bias)
# poses_joint
self.D_alljoints_fc1 = nn.Linear(32*self.num_joints, 1024)
nn.init.xavier_uniform_(self.D_alljoints_fc1.weight)
nn.init.zeros_(self.D_alljoints_fc1.bias)
self.D_alljoints_fc2 = nn.Linear(1024, 1024)
nn.init.xavier_uniform_(self.D_alljoints_fc2.weight)
nn.init.zeros_(self.D_alljoints_fc2.bias)
self.D_alljoints_out = nn.Linear(1024, 1)
nn.init.xavier_uniform_(self.D_alljoints_out.weight)
nn.init.zeros_(self.D_alljoints_out.bias)
```
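* The layers above form three branches: per-joint pose discriminators, an all-joints discriminator and a betas discriminator. A hedged sketch of how a forward pass could combine them, assuming self.num_joints == 23 and that the body-pose rotations arrive flattened (consult the repo's Discriminator.forward for the exact version):
```python=
import torch

def discriminator_forward_sketch(self, poses, betas):
    """Sketch: poses with 23*9 rotation values per sample, betas (B, 10) -> (B, 25) scores."""
    batch_size = poses.shape[0]
    # Treat the 9 rotation-matrix entries of each joint as channels: (B, 9, num_joints, 1)
    poses = poses.reshape(batch_size, self.num_joints, 9).permute(0, 2, 1).unsqueeze(-1)
    x = self.relu(self.D_conv1(poses))                                   # (B, 32, 23, 1)
    x = self.relu(self.D_conv2(x))                                       # (B, 32, 23, 1)
    # Per-joint discriminators (one linear head per joint)
    joint_scores = torch.cat(
        [self.pose_out[i](x[:, :, i, 0]) for i in range(self.num_joints)], dim=1)  # (B, 23)
    # All-joints discriminator on the flattened per-joint features
    all_feat = x.reshape(batch_size, -1)                                 # (B, 32 * 23)
    all_score = self.D_alljoints_out(
        self.relu(self.D_alljoints_fc2(self.relu(self.D_alljoints_fc1(all_feat)))))  # (B, 1)
    # Shape (betas) discriminator
    b = self.relu(self.betas_fc2(self.relu(self.betas_fc1(betas))))
    betas_score = self.betas_out(b)                                      # (B, 1)
    return torch.cat([joint_scores, betas_score, all_score], dim=1)      # (B, 25)
```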
Training (HMR 2)
---
(Look into hmr2.py/training_step())
* We use TensorBoard for logging.
* **batch** contains the input images (and their annotations) from the image datasets, and **mocap_batch** contains the ground-truth mocap data.
```python=
def training_step(self, joint_batch: Dict, batch_idx: int) -> Dict:
batch = joint_batch['img']
mocap_batch = joint_batch['mocap']
```
* Set up AdamW optimizers for the trainable model parameters and for the discriminator (LR = 1e-5 and weight_decay = 1e-4 from the config).
```python=
def configure_optimizers(self) -> Tuple[torch.optim.Optimizer, torch.optim.Optimizer]:
param_groups = [{'params': filter(lambda p: p.requires_grad, self.get_parameters()), 'lr': self.cfg.TRAIN.LR}]
optimizer = torch.optim.AdamW(params=param_groups,
# lr=self.cfg.TRAIN.LR,
weight_decay=self.cfg.TRAIN.WEIGHT_DECAY)
optimizer_disc = torch.optim.AdamW(params=self.discriminator.parameters(),
lr=self.cfg.TRAIN.LR,
weight_decay=self.cfg.TRAIN.WEIGHT_DECAY)
return optimizer, optimizer_disc
```
* Pass the training batch through the model and compute the loss from its output.
```python=
batch_size = batch['img'].shape[0]
output = self.forward_step(batch, train=True)
pred_smpl_params = output['pred_smpl_params']
loss = self.compute_loss(batch, output, train=True)
```
* Pass the predicted body pose and shape (betas) to the discriminator and compute the adversarial (generator) loss:
```python=
if self.cfg.LOSS_WEIGHTS.ADVERSARIAL > 0:
disc_out = self.discriminator(pred_smpl_params['body_pose'].reshape(batch_size, -1), pred_smpl_params['betas'].reshape(batch_size, -1))
loss_adv = ((disc_out - 1.0) ** 2).sum() / batch_size
loss = loss + self.cfg.LOSS_WEIGHTS.ADVERSARIAL * loss_adv
```
* After stepping the main optimizer, run the discriminator training step and record loss_gen and loss_disc:
```python=
optimizer.step()
if self.cfg.LOSS_WEIGHTS.ADVERSARIAL > 0:
loss_disc = self.training_step_discriminator(mocap_batch, pred_smpl_params['body_pose'].reshape(batch_size, -1), pred_smpl_params['betas'].reshape(batch_size, -1), optimizer_disc)
output['losses']['loss_gen'] = loss_adv
output['losses']['loss_disc'] = loss_disc
```
Pose Transformer v2 (used while running on videos)
---
(refer https://arxiv.org/pdf/2304.01199)
* We use the lart_transformer encoder:
```python=
self.encoder = lart_transformer(
opt = self.cfg,
phalp_cfg = self.phalp_cfg,
dim = self.cfg.in_feat,
depth = self.cfg.transformer.depth,
heads = self.cfg.transformer.heads,
mlp_dim = self.cfg.transformer.mlp_dim,
dim_head = self.cfg.transformer.dim_head,
dropout = self.cfg.transformer.dropout,
emb_dropout = self.cfg.transformer.emb_dropout,
droppath = self.cfg.transformer.droppath,
)
```
* In the predict_next function:
    * the inputs are en_pose and en_time
    * we rearrange the input so that pose_shape_ has shape (num_persons, frame_length, 1, 229); has_detection_ and mask_detection_ follow the same layout
```python=
# set number of people to one
n_p = 1
pose_shape_ = torch.zeros(en_pose.shape[0], self.cfg.frame_length, n_p, 229)
has_detection_ = torch.zeros(en_pose.shape[0], self.cfg.frame_length, n_p, 1)
mask_detection_ = torch.zeros(en_pose.shape[0], self.cfg.frame_length, n_p, 1)
# loop through each person and construct the input data
t_end = []
for p_ in range(en_time.shape[0]):
t_min = en_time[p_, 0].min()
# loop through time
for t_ in range(en_time.shape[1]):
# get the time from start.
t = min(en_time[p_, t_] - t_min, self.cfg.frame_length - 1)
# get the pose
pose_shape_[p_, t, 0, :] = en_pose[p_, t_, :]
# get the mask
has_detection_[p_, t, 0, :] = 1
t_end.append(t.item())
input_data = {
"pose_shape" : (pose_shape_ - self.mean_[:, :, None, :]) / (self.std_[:, :, None, :] + 1e-10),
"has_detection" : has_detection_,
"mask_detection" : mask_detection_
}
```
* Now we pass input_data to the encoder (the LART transformer) and then decode its output with readout_pose.
```python=
# single forward pass
output, _ = self.encoder(input_data, self.cfg.mask_type_test)
decoded_output = self.readout_pose(output[:, self.cfg.max_people:, :])
```
### LART transformer
* Here we define a mask token that replaces masked-out input tokens.
* Positional embeddings:
```python=
self.pos_embedding = nn.Parameter(positionalencoding1d(self.dim, 10000))
self.pos_embedding_learned1 = nn.Parameter(torch.randn(1, self.cfg.frame_length, self.dim))
self.pos_embedding_learned2 = nn.Parameter(torch.randn(1, self.cfg.frame_length, self.dim))
self.register_buffer('pe', self.pos_embedding)
```
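* positionalencoding1d(self.dim, 10000) presumably builds the standard fixed sinusoidal table; a minimal sketch of that computation (an assumption about the helper, not its exact code):
```python=
import math
import torch

def sinusoidal_encoding_1d(dim, length):
    """Standard fixed sinusoidal positional-encoding table of shape (length, dim); dim must be even."""
    pe = torch.zeros(length, dim)
    position = torch.arange(0, length, dtype=torch.float).unsqueeze(1)                 # (length, 1)
    div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe[:, 0::2] = torch.sin(position * div_term)   # even feature dims
    pe[:, 1::2] = torch.cos(position * div_term)   # odd feature dims
    return pe
```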
* We use:
    * self.pose_shape_encoder - encodes the pose/shape features (used by default)
    * self.smpl_head - SMPL head for predicting SMPL parameters
    * self.loca_head - location head for predicting the 3D location of the person
    * self.action_head_ava - action head for predicting the action class with AVA dataset labels (not used here)
* We mask detections at random indices:
```python=
def bert_mask(self, data, mask_type):
if(mask_type=="random"):
has_detection = data['has_detection']==1
mask_detection = data['mask_detection']
for i in range(data['has_detection'].shape[0]):
indexes = has_detection[i].nonzero()
indexes_mask = indexes[torch.randperm(indexes.shape[0])[:int(indexes.shape[0]*self.cfg.mask_ratio)]]
mask_detection[i, indexes_mask[:, 0], indexes_mask[:, 1], indexes_mask[:, 2]] = 1.0
```
* We have 3 transformers in LART:
    * self.transformer - not used
    * self.transformer1 - applied after the mask token and pos_embedding_learned1 are added to the input
    * self.transformer2 - applied after pos_embedding_learned2 is added to the (conv-processed) output of self.transformer1
```python=
#lart_transformer - def forward()
# prepare the input data and masking
data, has_detection, mask_detection = self.bert_mask(data, mask_type)
# encode the input pose tokens
pose_ = data['pose_shape'].float()
pose_en = self.pose_shape_encoder(pose_)
x = pose_en
# mask the input tokens
x[mask_detection[:, :, :, 0]==1] = self.mask_token
x = x + self.pos_embedding_learned1
x = self.transformer1(x, [has_detection, mask_detection])
x = x.transpose(1, 2)
x = self.conv_en(x)
x = self.conv_de(x)
x = x.transpose(1, 2)
x = x.contiguous()
x = x + self.pos_embedding_learned2
has_detection = has_detection*0 + 1
mask_detection = mask_detection*0
x = self.transformer2(x, [has_detection, mask_detection])
x = torch.concat([self.class_token.repeat(BS, self.cfg.max_people, 1), x], dim=1)
return x,0
```
* Open question: why are two transformers and two positional embeddings used?
Notations (HMR 2)
---
### Body Model
* **θ** = SMPL pose ( θ∈$\ \mathbb R^{24x3x3}$ )
* **β** are the shape parameters (β∈$\ \mathbb R^{10}$)
* θ include
* $\ \theta_{b} ∈ \mathbb R^{23x3x3}$ body pose parameters
* $\ \theta_{g} ∈ \mathbb R^{3x3}$ global orientation
* Using θ and β we get **mesh** M ∈$\ \mathbb R^{3xN}$ with N = 6890 vertices
* Body joints **X** ∈ $\ \mathbb R^{3xk}$ are defined as a linear combination of the vertices and can be computed as X = M*W with fixed weights W ∈ $\ \mathbb R^{Nxk}$.
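* A worked illustration of X = M·W with hypothetical tensor shapes (the real regressor weights ship with the SMPL model files):
```python=
import torch

num_joints = 24                                # k joints (24 for the basic SMPL joint set)
vertices = torch.randn(2, 6890, 3)             # mesh M for a batch of 2, stored as (B, N, 3)
W = torch.rand(num_joints, 6890)               # fixed regressor weights, one row per joint
W = W / W.sum(dim=1, keepdim=True)             # each joint is a weighted combination of vertices
joints = torch.einsum('kn,bnc->bkc', W, vertices)   # X: (B, k, 3)
print(joints.shape)  # torch.Size([2, 24, 3])
```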
### Camera
* Perspective camera model where focal length and intrinsics K are fixed.
* Each camera π = (R, t) consists of a global orientation R ∈ $\ \mathbb R^{3x3}$ and translation t ∈ $\ \mathbb R^{3}$ .
* Points in SMPL space (e.g., joints X) can be projected to the image as x = π(X) = Π(K(RX+t)), where Π is a perspective projection with camera intrinsics K.
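* A minimal sketch of x = π(X) = Π(K(RX + t)), with hypothetical values for K, R and t:
```python=
import torch

def project_points(X, R, t, K):
    """Perspective projection of 3D points X (B, k, 3) to 2D image coordinates (B, k, 2)."""
    X_cam = torch.einsum('ij,bkj->bki', R, X) + t      # rigid transform: RX + t
    X_img = torch.einsum('ij,bkj->bki', K, X_cam)      # apply intrinsics K
    return X_img[..., :2] / X_img[..., 2:3]            # perspective divide (the projection Π)

focal = 5000.0                                          # hypothetical fixed focal length
K = torch.tensor([[focal, 0., 128.], [0., focal, 128.], [0., 0., 1.]])
R = torch.eye(3)                                        # camera rotation
t = torch.tensor([0., 0., 2.5])                         # translation placing the body in front of the camera
x2d = project_points(torch.randn(1, 24, 3) * 0.2, R, t, K)
print(x2d.shape)  # torch.Size([1, 24, 2])
```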
### HMR
* Θ = [θ, β, π] = f(I), where
    * f is the model (the predictor)
    * I is a single input image
    * f(I) = Θ is the prediction