---
title: 'Blog post'
disqus: hackmd
---
Group 68 - Reproducibility Project Blog Post
===
## Students
Quinn Begelinger, 4962338, q.begelinger@student.tudelft.nl
Nanami Hashimoto, 4779495, N.hashimoto@student.tudelft.nl
Roel Webb, 4643674, R.c.webb@student.tudelft.nl
## Table of Contents
[TOC]
## Sources
GitHub: https://github.com/nanaminh/Direct3DKinematicEstimation
## Introduction <!--- Quinn -->
In this blog, we explain how a neural network can be used to estimate joint angles of the human body from a single camera image. Bittner et al. published an article titled "Towards Single Camera Human 3D-Kinematics" (Sensors, 2023), in which they present their method for achieving this task. In their method, joint angles are directly estimated by a convolutional network, converted to a pose through a skeletal-model kinematics layer, and finally these poses are fed through a sequential network to smoothen the estimates.
This blog will serve as a guide for reproducing this paper, as that is no simple task. We will list and explain all the necessary steps, as well as the problems that we ran into and how to avoid them. Afterwards, we will compare the results of this reproduction with those from the paper and discuss whether you can expect to achieve similar results. But first, to aid your understanding of what needs to be done, we will go over the technical details in a summary of the paper.
The project is a part of "CS4240 Deep Learning" course at TU Delft.
## Paper
For our own understanding and that of this blog's readers, we have broken down the 'Network Structure' subchapter of Bittner et al.'s paper into a comprehensible summary.
The entire model consists of three parts: a CNN (Convolutional Neural Network), a sequential network and a skeletal-model layer.
Step one: Videos of the subjects are used as input for the CNN, which processes individual video frames. A standard, pre-trained ResNeXt-50 model is used for this, presumably because of its good performance at estimating 2D poses. The CNN outputs a predicted 2D pose; through this step, a rough initial prediction of the joint angles and scaling parameters is obtained.
Step two: To fine-tune these parameters, a sequential network is used. It 'lifts' the estimated pose from 2D to 3D and at the same time incorporates temporal information, meaning that it also takes the pose data from previous frames of the video into account.
Step three: The skeletal model. This layer has no trained parameters, so it can be seen as a function that takes *the estimated joint angles* and *scaling parameters* from the other two layers as **input** and gives *3D marker and joint positions* as **output**. The skeletal model contains default orientations and transformations that link joints and body parts.

*Figure 1: The model architecture simplified*
The skeletal model works as follows: in order to locate all of the joint and marker positions, we obtain transformation matrices that transform the coordinate system from one body part to a connected body part, linking every body part together. We need all the transformation matrices required to start at the root node (always the pelvis) and reach every leaf node (hands, feet and the other ends of the body). A hypothetical example:
R~pelvis->hand~ = R~pelvis->shoulder~ * R~shoulder->arm~ * R~arm->hand~
The skeletal model contains all the information needed to do this. Each joint in the model connects a 'parent' bone (closer to the root) with a 'child' bone (closer to the leaf). To obtain the transformation matrix from a parent bone to its child bone, we multiply three component matrices:
1. The transformation matrix from the parent to the joint
2. The rotation matrix of the joint itself
3. The transformation matrix of the joint to the child.
The transformation matrices from the parent/child to the joint are default matrices given inside the skeletal model and only need to be scaled, using the body scales predicted by the Convolutional and Sequential networks.
The matrix that describes the rotation of the joint itself is referred to as R~motion~, and is constructed from the joint angles θ^1^, θ^2^, θ^3^ predicted by the Convolutional and Sequential networks.
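To make this composition concrete, here is a minimal sketch (our own illustration, not code from the repository) of how the scaled parent-to-joint offset, the joint rotation R~motion~ built from the three predicted angles, and the scaled joint-to-child offset could be chained into a single parent-to-child transform using homogeneous 4x4 matrices:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def homogeneous(rotation=np.eye(3), translation=np.zeros(3)):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def parent_to_child(parent_offset, child_offset, body_scale, joint_angles):
    """Compose the three component matrices described above.

    parent_offset / child_offset: default joint offsets from the skeletal model
    body_scale:   per-bone scale predicted by the networks
    joint_angles: predicted (theta1, theta2, theta3) for this joint, in radians
    """
    # 1. parent -> joint: default offset, scaled by the predicted body scale
    T_parent_joint = homogeneous(translation=body_scale * parent_offset)
    # 2. rotation of the joint itself (R_motion), built from the three angles
    R_motion = Rotation.from_euler("XYZ", joint_angles).as_matrix()
    T_motion = homogeneous(rotation=R_motion)
    # 3. joint -> child: default offset, also scaled
    T_joint_child = homogeneous(translation=body_scale * child_offset)
    return T_parent_joint @ T_motion @ T_joint_child

# Chaining such transforms from the root (pelvis) towards a leaf, e.g. the hand:
# T_pelvis_hand = T_pelvis_shoulder @ T_shoulder_arm @ T_arm_hand
```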
With all of the parent->child transformations determined, all the bones can be located in 3D, and the joint and marker locations can be obtained from the skeletal model.
These positions are included in the loss function, so that the constraints of the skeletal model are indirectly taken into account when training the neural networks.
The final loss function used to train the weights of the Convolutional and Sequential networks is:
L=λ~1~L~joint~+λ~2~L~marker~+λ~3~L~body~+λ~4~L~angle~.
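As a rough illustration of this weighted sum (our own sketch, using mean squared error as a stand-in for the individual terms; the exact terms and λ values are defined in the paper and the repository):

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four loss terms: joints, markers, body scales, angles.

    `pred` and `target` are dictionaries holding the predicted and ground truth
    joint positions, marker positions, body scales and joint angles.
    """
    l_joint = F.mse_loss(pred["joints"], target["joints"])
    l_marker = F.mse_loss(pred["markers"], target["markers"])
    l_body = F.mse_loss(pred["body_scales"], target["body_scales"])
    l_angle = F.mse_loss(pred["angles"], target["angles"])
    terms = (l_joint, l_marker, l_body, l_angle)
    return sum(lam * term for lam, term in zip(lambdas, terms))
```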
## Manual<!--- Roel -->
This manual is not a step-by-step instruction (that can be found after the manual); rather, it connects the theoretical part of the paper with the practical steps for reproduction.
### Ground truth generation
The model proposed in the paper takes videos as input and directly outputs the following: joint angles, the scales of the individual bones, a rotation matrix from the pelvis to the ground, and the corresponding marker positions. However, there is no ground truth dataset containing a mapping from participant videos recorded with MMC (Markerless Motion Capture) to this output on which the model could be trained. Therefore, a dataset containing custom ground truth output corresponding to the participant videos was calculated.
To calculate the ground truth output, marker position data recorded with OMC (Optical Motion Capture) is used in combination with recorded 3D human meshes. The marker position data used in the paper was recorded in the BioMotionLab and is publicly available. The human meshes are contained in the large AMASS dataset; specifically, the meshes captured in the BioMotionLab are used.
First, a general (musculo)skeletal model is created in OpenSim to fit the marker position data converted to the 3D human mesh representations. To do so, virtual markers are defined on the vertices of the 3D mesh, and the individual body segments of the (musculo)skeletal model are scaled using the distances between these virtual markers, resulting in the ground truth bone scales. With the ground truth (musculo)skeletal model calculated, the only step left is to perform inverse kinematics to find the joint angles.
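For readers unfamiliar with OpenSim, the sketch below shows roughly how such an inverse kinematics step can be driven through the OpenSim Python API. The file names are hypothetical placeholders, and the repository's *generate_opensim_gt.py* wraps these steps in its own classes:

```python
import opensim as osim

# Load the scaled (musculo)skeletal model obtained in the previous step.
model = osim.Model("scaled_model.osim")  # hypothetical file name

# Configure the inverse kinematics tool with the (virtual) marker trajectories.
ik_tool = osim.InverseKinematicsTool()
ik_tool.setModel(model)
ik_tool.setMarkerDataFileName("virtual_markers.trc")   # hypothetical file name
ik_tool.setOutputMotionFileName("joint_angles.mot")    # hypothetical file name
ik_tool.run()  # writes the ground truth joint angles per frame
```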
To perform the ground truth generation described above, all the necessary data needs to be downloaded and placed in the right directories so that it is properly accessible. The list of necessary data includes the following:
- F_PGX_Subject_X_L.avi: contains participant videos in .avi format from one of the two cameras (both F_PG1.. and F_PG2.. are needed for multiple participants, replacing Subject_X with Subject_5, etc.)
- Camera Parameters.tar: contains the camera parameters
- F_Subjects_1_45.tar: contains the marker locations in .mat format captured with OMC
- smplh.tar.xz: contains the Skinned Multi-Person Linear model including hands (SMPL+H)
- dmpls.tar.xz: contains a similar model to SMPL+H with added realistic modeling of dynamic soft-tissue deformations
- BMLmovi.tar.bz2: contains the AMASS 3D human mesh data in .npz file format
### Preprocessing <!-- Nanami and Roel -->
The input of the neural network has a fixed shape of 256 x 256 pixels. However, the recorded videos are of a different, much larger size, so for every frame a bounding box is generated around the participant, which is then scaled to 256 x 256 pixels. To detect the participant in every frame, a pre-trained Faster R-CNN with a ResNet-50 backbone is used in combination with the PASCAL VOC dataset, which contains images of persons with per-person bounding boxes. To improve generalization, the resulting input is augmented with scaling, rotation, translation and noise to simulate occlusions.
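To illustrate this step, here is a rough sketch of how a single frame could be cropped with torchvision's off-the-shelf detector; this is our own simplification and omits the padding and augmentation details of *prepare_dataset.py*:

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor, resized_crop

# Pre-trained Faster R-CNN with a ResNet-50 backbone (COCO class 1 = person).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

def crop_person(frame_rgb):
    """Detect the most confident person in a frame and return a 256x256 crop."""
    image = to_tensor(frame_rgb)                      # HxWx3 uint8 -> 3xHxW float
    with torch.no_grad():
        detections = detector([image])[0]
    person = (detections["labels"] == 1) & (detections["scores"] > 0.5)
    x1, y1, x2, y2 = detections["boxes"][person][0]   # highest-scoring person box
    return resized_crop(image, top=int(y1), left=int(x1),
                        height=int(y2 - y1), width=int(x2 - x1), size=[256, 256])
```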
Besides the cropping, the preprocessing code also converts the .avi videos into HDF5 (.hdf5) files and generates a movement index file (.mat). Once the correct input data has been created, the OpenSim label dataset used for training is generated.
### Training procedure <!-- Nanami -->
The training was conducted using the PyTorch library and the pre-trained ResNeXt and Faster R-CNN networks from the torchvision library. This process is executed with the `python run_training.py` command. In this script, the preprocessed dataset and the PASCAL file are loaded from their folders, and the path for storing the trained model is defined as *../checkpoints*. The preprocessed dataset is split into a training and a validation dataset by the *train_opensim_temporal4.py* script. The number of epochs and the number of training steps are both initially set to 1.
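Since we did not manage to run this script to completion, the following is only a generic, hypothetical PyTorch skeleton of the kind of split-and-train loop it implements; the dataset, model and loss function are placeholders, not the repository's code:

```python
import torch
from torch.utils.data import DataLoader, random_split

def train(model, dataset, loss_fn, epochs=1, batch_size=8, lr=1e-4, device="cuda"):
    """Split the preprocessed dataset, then run a standard training loop."""
    n_val = int(0.1 * len(dataset))                      # hypothetical 90/10 split
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        for frames, targets in train_loader:
            optimizer.zero_grad()
            predictions = model(frames.to(device))
            loss = loss_fn(predictions, targets)         # placeholder loss
            loss.backward()
            optimizer.step()
        # store a checkpoint per epoch, analogous to the ../checkpoints path above
        torch.save(model.state_dict(), f"../checkpoints/epoch_{epoch}.pt")
```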
### Inference <!-- Nanami -->
After the training is done, inference can be performed by running `python run_inference.py`. This step is supposed to provide the prediction accuracy as well as an evaluation of the time it takes to run inference for different batch sizes.
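As we could not reach this step ourselves, the snippet below is only a generic illustration of how inference time can be measured for different batch sizes in PyTorch, not the repository's evaluation code:

```python
import time
import torch

@torch.no_grad()
def time_inference(model, input_shape=(3, 256, 256), batch_sizes=(1, 4, 16), device="cuda"):
    """Report the average forward-pass time for a few batch sizes using dummy input."""
    model.to(device).eval()
    for batch_size in batch_sizes:
        dummy = torch.randn(batch_size, *input_shape, device=device)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(10):                      # average over a few runs
            model(dummy)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / 10
        print(f"batch size {batch_size}: {elapsed * 1000:.1f} ms per forward pass")
```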
## Steps for reproduction <!-- Roel -->
### Dependencies
Firstly, it is assumed that Python and the popular package managers pip and conda are installed.
Before downloading any data or running any code, the repository Direct3DKinematicEstimation needs to be cloned from https://github.com/bittnerma/Direct3DKinematicEstimation.
When the files of the repository are locally available, a new conda environment (called d3ke) needs to be created by running: `conda env create -f environment_setup.yml`
After the environment d3ke is created, activate it with: `conda activate d3ke`
The repository has two other dependencies: PyTorch for training the neural network, and OpenSim for the (musculo)skeletal model mentioned in the manual.
To run the code from the repository, PyTorch version 1.11.0 and OpenSim 4.3 are used.
It is important to install PyTorch with CUDA to speed up training, preferably running the training on powerful hardware or with a cloud computing service. These dependencies need to be installed from https://pytorch.org/get-started/previous-versions/ and https://simtk.org/frs/?group_id=91 respectively.
Once the dependencies are installed, run the following commands in C:\OpenSim 4.3\sdk\Python:
`python setup_win_python38.py`
`python -m pip install .`
The last steps for the dependencies are to add C:\OpenSim 4.3\bin to the PATH environment variable and to copy all *.obj files from \resources\opensim\geometry to C:\OpenSim 4.3\geometry.
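A quick way to verify that the OpenSim bindings are picked up correctly inside the d3ke environment (our own check, not part of the repository):

```python
# Run inside the activated d3ke environment.
import opensim

# GetVersionAndDate() is part of the OpenSim scripting API; it should report
# OpenSim 4.3 if the setup above succeeded.
print(opensim.GetVersionAndDate())
```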
Now that all dependencies are installed, the last preparation step is to download all the necessary datasets mentioned in the manual and to create a new folder named _dataset_full in the cloned repository.
The following table lists the links to the datasets and the destinations where the files need to be unpacked within the Direct3DKinematicEstimation folder.
| Dataset | Destination path | Link |
| -------- | -------- | -------- |
| F_PGX_Subject_X_L.avi | resources\opensim\videos | https://www.biomotionlab.ca/movi/ |
| Camera Parameters.tar | resources\opensim\videos | https://www.biomotionlab.ca/movi/|
| F_Subjects_1_45.tar | resources\V3D\F | https://www.biomotionlab.ca/movi/|
| smplh.tar.xz | ms_model_estimation\resources | https://mano.is.tue.mpg.de/download.php|
| dmpls.tar.xz | ms_model_estimation\resources | https://smpl.is.tue.mpg.de/download.php|
| BMLmovi.tar.bz2 | resources\amass | https://amass.is.tue.mpg.de/download.php|
| VOCtrainval_11-May-2012.tar* | resources | http://host.robots.ox.ac.uk/pascal/VOC/voc2012/#devkit|
*For VOCtrainval_11-May-2012.tar unpack the compressed file and move the subfolder VOC2012 into resources, not the folder VOCtrainval_11-May-2012.
### GT generation
Now we can finally start creating the ground truth by running `python generate_opensim_gt.py`. However, we found the script would only run after changing line 40 from `npzPathList=gtGenerator.traverse_npz_files("BMLmovi")` to `npzPathList=gtGenerator.traverse_npz_files("BMLmovi/BMLmovi")`.
After the ground truth is generated in \resources\opensim\BMLmovi\BMLmovi\BMLmovi, copy the results to \resources\opensim\BMLmovi\BMLmovi.
### Dataset preparation
To prepare the video data by cropping the videos and performing data augmentation, run: `python prepare_dataset.py --BMLMoviDir resources\opensim\videos --PascalDir resources\VOC2012`
### Unable to train and evaluate
Since we were not able to run the training and inference scripts successfully, we do not have detailed instructions for them.
## Reproduction results
### GT generation <!--- Roel -->
We were able to generate ground truth results in Direct3DKinematicEstimation\resources\opensim\BMLmovi\BMLmovi\BMLmovi for participants 11 up to 15, containing the inverse kinematics results (joint angles), the virtual marker locations and the bone scales.

### Dataset preparation <!--- Roel -->
We were not able to fully run the preparation script, but we could create the bounding boxes for multiple participants in _dataset and _dataset_full. The error we received is discussed in the **Pitfalls** chapter.

### Training <!--- Nanami -->
When running the *run_training.py* script, the following value error was raised. It states that the input array cannot be broadcast because its first dimension is 0, meaning the array is empty.

After some investigation, we found that this results from the fact that the indices of the array, which are determined by the "start frame", "frame", "globalDatasetIdx" and "endFrame" variables in *BMLOpenSimTemporalDataSet.py*, are out of bounds with respect to the input .npy file.
We think there are two possible causes of this problem. The first is that we tried to train on 6 subjects, although the script was originally created for all subjects. This might have caused a discrepancy with the hard-coded indices. However, it should in principle be possible to train a model on a different number of subjects, so this may not be the cause of the problem.
Another possibility is that, even if training can be performed on a different number of subjects, something went wrong in the previous steps. When we generated the ground truth data, the process had to be killed after running for a few days because of our limited computational resources. This may have resulted in a partially missing dataset, which caused the index error.
### Inference <!--- Nanami -->
When we ran *run_inference.py* initially, we got the following attribute error indicating that the 'Training' object has no attribute 'run_windowed_predictions'. This error is due to the fact that there is no function called 'run_windowed_predictions' in the 'Training' class in *train_os_spatialTemporal_infer.py*. By changing the indicated line to 'run_windowed_**temporal**_predictions', the error was resolved.

However, since we did not complete the training step, inference could not be performed.
## Pitfalls <!---Nanami and Roel mostly -->
We encountered many issues while reproducing the steps mentioned above. Here we present those issues and the solutions to the problems we solved.
We first have to mention that the GitHub repository had quite some bugs and was not ready to be used out of the box. When we accessed the repository it was roughly 4 months old and the README stated: 'This repository is currently under construction'. This chapter is not meant as an attack on or harsh critique of the authors; rather, the following points are brought up to show our process of attempting to reproduce the results and the setbacks that made it difficult to produce good results. Lastly, the commits from March until April indicate that some bugs still needed to be resolved while we were working on this project.
### Readme typo
The first issue we encountered was a typo in the first command to create the conda environment, resulting in a file not being found. Luckily, this issue was fixed relatively quickly once we noticed that the command was trying to call a file that did exist, but had a slightly different name than in the command.
### Cannot import OpenSim error
When we tried to run the "generate_opensim_gt.py" script initially, OpenSim could not be found. To fix this issue, the path to the folder C:\OpenSim 4.3\bin needed to be added to the environment variables.
### Obj. files cannot be found warning
While running the "generate_opensim_gt.py" script, we encountered "Couldn't find file '... .obj'." warnings as shown below, even though these object files were stored under the /OpenSim 4.x/Geometry folder as mentioned by the original author.
In the end, the process managed to generate the ground truth files despite these warnings, but we thought this should be noted here.

### Preparation script cannot open file
After running the preparation script (which creates the bounding boxes around the participants), an error was raised indicating that a file could not be found, stopping the program. The needed file could be found in a subfolder, but was not available where it was supposed to be.

### Training instructions missing
After finishing the ground truth generation and the preprocessing, we noticed there was no explanation of how to run the training script: the README only contained a single header stating 'Training' without anything else.
### Download models
In the README, under the header _Evaluation_, there is a sub-header saying _Download models_; however, it is never explained what this means.
### Inconsistent folder names
In *prepare_dataset.py*, the prepared dataset is created under a new folder named "_dataset". However, *run_training.py* loads this dataset from a folder named "_dataset_full". Therefore, the folder name had to be changed in order to run the training script.
## Discussion and Recommendation <!-- Every one each one recommendation -->
Unfortunately, the reproduction we achieved is not complete. Despite solving many of the errors encountered in the reproduction process, we were not able to run the inference that would have given us results to compare with the paper. While this blog can get you very far, the rest of the reproduction process is left as a recommendation to the reader. Additionally, we recommend that the authors of the paper work on the clarity and completeness of their repository, to make the reproduction process smoother and hopefully prevent the issues that we ran into.
We had initially planned to try several variations on the original research, which we also disclose here as recommendations to whoever succeeds in the complete reproduction.
First of all, the videos used for ground truth generation and training were recorded in a single lab location, with only two types of outfits worn (one for men and one for women).
To improve the generalization of the model, ground truth data could be generated from recordings in different locations, with different surroundings and different outfits. Additionally, more camera angles could be recorded.
During the simultaneous recording of videos for MMC and marker tracking for OMC, many markers are visible in the MMC videos. However, the objective of the research was to enhance the accuracy of pose estimation for plain MMC, which does not use markers. It is possible that the proposed model relied on detecting and tracking the markers in the MMC videos, leading to a decrease in performance when markers are not present.
Therefore, we propose removing the markers from the MMC videos used for ground truth generation and training. An easy solution could be to post-process the MMC video frames to remove the markers, or to have the markers emit a light frequency that cannot be detected by the MMC cameras but is still visible to the OMC system.
We also suggest recording data using wearable motion sensors, similar to what is used for translating human motions into 3D animations. This would greatly reduce the time it costs to generate the ground truths, as all of the necessary values could be recorded immediately: the body scales, the joint angles and the joint positions.
Finally, we have another recommendation for anyone trying to reproduce this paper. We spent multiple days running the pre-processing and training processes only a single time. Powerful hardware is essential for this reproduction, and we therefore recommend using cloud computing through a service such as Google Cloud. The main drawback of this approach is that installing OpenSim on a Linux operating system is very difficult, and all the cloud computing virtual machines that we tried run on Linux. Besides that, running these virtual machines in the cloud is costly. Regardless, if a way can be found to install OpenSim and if enough funds are available, this approach is highly recommended.
## References
Bittner M, Yang W-T, Zhang X, Seth A, van Gemert J, van der Helm FCT. Towards Single Camera Human 3D-Kinematics. Sensors. 2023; 23(1):341. https://doi.org/10.3390/s23010341
#### Who did what
Nanami - GT generation, training, inference, reporting
Quinn - GT generation, Google Colab, reporting
Roel - GT generation, Preprocessing, training, reporting