<!-- # Thribhuvan's Hacks
1. Human4d
2. Stage 1 - https://hackmd.io/@anakin513/S1AToK8vR
3. Stage 2 - https://hackmd.io/@anakin513/rJ2oA3PwA
4. Stage 3 -
3.1 - HMR 2.0 BREAKDOWN - https://hackmd.io/@anakin513/SyrO_1-dC
3.2 - HMAR BREAKDOWN - https://hackmd.io/@anakin513/ryaFLG3P0 -->
PHALP - 4D Humans WorkFlow
===
## Step By Step
[TOC]
STAGE 1 - Setting Up Frames (batch offline)
---
1. We extract all the INPUT frames from the video (input) and OUTPUT (store) all the frame ids/numbers in the variable **list_of_frames**, along with the ground-truth bounding boxes, if passed, in the variable **additional_data** (refer io.py/get_frames_from_source()).
>>> Ground truth boxes are labelled video frame data (optional).
2. We make all the required folders/directories to store the final results (PHALP.py/default_setup()). Below are the final results:
* results/demo_(video_name).pkl:
* This contains all the `frame ids` of the input video as `keys`.
<details>
<summary>each frame_id will have the following data</summary>
<br>

* time: (int) 0,1,2...
* shot: (int) scene-change/shot number (e.g. if frames 10 and 15 are scene-change frame ids, then frames 0-10 get shot 0 and frames 11-15 get shot 1)
* frame_path
* tracked_ids: list of ids tracked till the last frame
* tracked_bbox: bbox of tracked people (number of people detected,4)
* tid: id of detected people in the current frame
* bbox: bbox of detected people (number of people detected,4)
* tracked_time: time since each tracked person was last detected; list of length (number of tracked people)
* appe: appearance embedding of each detected person (number of detected people,4096)
* loca: location embedding of each detected person (number of detected people,99)
* pose: pose embedding of each detected person (number of detected people,229)
* center: center of each detected person bbox (number of detected people,2)
* scale: scale of each detected person bbox
* size: size of image (number of detected people,2)
* img_path
* img_name
* class_name
* conf: confidence value (between 0 and 1) that the detected bbox contains a human
* annotations: this will be the ground truth data, if passed
* smpl: this contains the SMPL results of HMR 2.0 for each detected person:
* global orient (1,3,3)
* body_pose (23,3,3)
* betas (10)
* camera: (number of detected people,3)
* camera_bbox: (number of detected people,3)
* 3d_joints: (number of detected people,45,3)
* 2d_joints: (number of detected people,90)
</details>
>>> Note: Please describe a representation (formats) of all final results
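As a quick way to inspect the formats described above, here is a minimal sketch that loads the results pickle and prints what is stored for one frame. It assumes the file was written with joblib (plain pickle can be swapped in if that assumption does not hold); the path follows the naming scheme above and is a placeholder.

```python
import joblib  # assumption: the .pkl was written with joblib; use pickle otherwise

# Hypothetical path following the scheme above: results/demo_<video_name>.pkl
results = joblib.load("results/demo_video.pkl")

# The top-level object is a dict keyed by frame id.
first_frame_id = sorted(results.keys())[0]
frame_data = results[first_frame_id]

# Inspect the per-frame fields (time, shot, tracked_ids, bbox, appe, loca, pose, smpl, ...)
for key, value in frame_data.items():
    shape = getattr(value, "shape", None)
    print(key, shape if shape is not None else type(value))
```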
3. Using the PySceneDetect algorithm, we extract the frame ids wherever there is a **scene change** and store them in the `list_of_shots` variable (PHALP.py/get_list_of_shots). (Example: https://www.scenedetect.com/cli/)
* We detect **scene changes** in videos because when one scene ends and another begins, **the people or objects we were following might disappear or change**. So, we reset our tracking to start fresh and keep track accurately in each new scene.
> (a) How are the key-frames (list_of_shots) selected? Provide details (link a separate hack with details).
Refer https://hackmd.io/@anakin513/S1AToK8vR to know more about PySceneDetect.
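To make step 3 concrete, here is a minimal sketch of getting scene-change frame ids with PySceneDetect's Python API (the content-based detector, analogous to the CLI example linked above); the video path is a placeholder and the exact detector settings used by PHALP may differ.

```python
from scenedetect import detect, ContentDetector

# Detect scene boundaries in the input video (content-based detector).
scene_list = detect("input_video.mp4", ContentDetector())

# Each entry is a (start, end) pair of FrameTimecodes; the start frame of every
# scene after the first is a scene-change frame id, analogous to list_of_shots.
list_of_shots = [start.get_frames() for start, _ in scene_list[1:]]
print(list_of_shots)
```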
4. **Now we iterate through each frame**
5. We extract the image resolution (height, width, left, top) of the current frame (using image/frame properties).
* We will check whether the current frame id/number is present in the list_of_shots that we get from PySceneDetect.
>(b) explain details of the check of list_of_shot
6. This step initializes (sets up) STAGE 5.
* If rendering is enabled (render.enable=True) then we reset the renderer (using visualizer.py/reset_render). We initialize pyrender.OffscreenRenderer with the image/frame resolution (py_renderer.py).
> explain the timing flow of input frame processing pipeline and the rendering pipeline
STAGE 2 - Extracting human bbox & mask in a frame
---
Stage 2 - https://hackmd.io/@anakin513/rJ2oA3PwA
* `INPUT` - Frame(Image) (Sequential)
* `OUTPUT` -
* `pred_bbox`: (number of detected person,4): coordinates of `rectangular boundary box` containing each person.
* `pred_masks`: (number of detected person, image height, image width): `Segmented Person Image`
* `pred_scores`: confidence score that the detected object is a person.
* `pred_classes`: the class the detected object belongs to (we configure Detectron2 to detect only humans, so this always corresponds to the person class).
* Refer Code - phalp.py/get_detections()
* Detectron2 (https://github.com/facebookresearch/detectron2)
* There are 2 `models` in Detectron2 used in PHALP:
1. `maskrcnn`
2. `vitdet`(used by default)
> What is ViTDet and how does its performance compare to Mask R-CNN? Provide details (link a separate hack with details).
* If we have the ground truth (additional_data) then
1. We extract the `ground truth boxes` and its `ground truth track ids`
2. We make `bbox_array` (shape=number of people,4(x1,y1,x2,y2)), `scores_array`(value is 1 for all the people since its ground truth) and `class_array` (value is 0).
3. Instances is initialized with the image dimensions (img_height, img_width).
4. self.detector_x.predict_with_bbox(image, inst) is called to get outputs_x. ***What this does is detect/extract humans within the ground-truth bboxes***.
5. Filtering is done to extract instances labeled as people (instances_people) based on pred_classes.
> Describe prediction bbox, scores and especially the representation of masks and also what categories objects detected and extracted and how ?
> This describes an internal 2d graphical model extracted from frame (image) space.
* If we don't have the ground truth, then we pass the image to Detectron2.
* We pass the image to the detector and extract the predicted bbox, scores and masks of all objects whose scores are greater than 0.8 (self.cfg.phalp.low_th_c); since the detector is configured to output only the person class, this keeps only confident human detections (see the Detectron2 sketch at the end of this stage).
* Using the run_additional_models() function we get extra_data, which is a `list of indices of the detected humans` (example: extra_data=[0, 1, 2, 3, 4] means there are 5 humans in the image).
* Now we go through the get_human_features() function.
* We generate `masked_image_list` (shape: (number of people, 4, 256, 256)), a stacked tensor containing the cropped image (box) of each human detected in the frame.
example of masked_image

Example:
All detected object boxes:-

Boxes containing humans and boxes corner coordinates:-

<img src="https://hackmd.io/_uploads/rJ4dUp8ZC.png" alt="image" width="300" height="auto">
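As a rough illustration of what get_detections() does when no ground truth is passed, the sketch below runs a stock Detectron2 predictor and keeps only confident person detections. The model zoo config and the 0.8 threshold mirror the description above, but the exact model and settings used by PHALP (ViTDet by default) differ; this is only a sketch.

```python
import numpy as np
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Illustrative config: a stock Mask R-CNN from the model zoo (PHALP uses ViTDet by default).
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
predictor = DefaultPredictor(cfg)

# Stand-in for a video frame (BGR uint8, as cv2 would load it).
image = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)
instances = predictor(image)["instances"]

# Keep only detections that are people (COCO person class, index 0 in Detectron2's
# contiguous ids) with score > 0.8 (cf. low_th_c).
keep = (instances.pred_classes == 0) & (instances.scores > 0.8)
people = instances[keep]

pred_bbox   = people.pred_boxes.tensor          # (num_people, 4)
pred_masks  = people.pred_masks                 # (num_people, H, W)
pred_scores = people.scores                     # (num_people,)
```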
STAGE 3 - Extracting Human Features (Pose, Location and Appearance) Using HMR 2.0
---
3.1 - HMR 2.0 BREAKDOWN - https://hackmd.io/@anakin513/SyrO_1-dC
3.2 - HMAR BREAKDOWN - https://hackmd.io/@anakin513/ryaFLG3P0
* BS (batch size) is the number of detected people.
13. We pass the detected human images (Stage 2 results) to HMR 2.0.
* From HMR 2.0 we get the
* `pred_smpl_params`
* `body_pose` [batch_size,23,3,3]
* `global_orient` [batch_size,1,3,3]
* `betas` [batch_size,10]
* `pred_cam` [batch_size,3] : translation of the person for a camera capturing the local bounding box.
* `pred_cam_t` [batch_size,3] : translation of the person for a camera capturing the entire image.
* `pred_cam[:,0]` corresponds to s, the scaling factor of the `weak-perspective projection`, which approximates `f/Z`; the depth of the human is recovered as `Z = f/s`.
* `pred_keypoints_3d` [batch_size,44,3]
* `pred_vertices` [batch_size,6890,3]
* `pred_keypoints_2d` [batch_size,44,2]
For more details about the HMR 2.0 model, refer to the HMR 2.0 breakdown: https://hackmd.io/@anakin513/SyrO_1-dC
14. We generate uv_image (appearance feature) by projecting the mesh onto a UV map.
* We get
* `uv_image` [batch_size,4,256,256]
* `uv_vector` (scaled down uv_image) [batch_size,4,256,256]
## For Tracking Features
* Using the above results we generate the tracking features.
15. For `appearance`, we pass the uv_vector (texture image) to `autoencoder_hmar` to get the `appearance_embedding` [batch_size,4096] which is used while tracking.
16. For `pose`,
1. if pose_distance type is `joints`
* the `pose_embedding` will be `pred_joints` [batch_size,45,3] which is basically `pred_keypoints_3d`.
* Note: the 2nd axis of `pred_keypoints_3d` is 44, but `pred_joints` has 45 (44 + 1 appended row of zeros).
2. if pose_distance type is `smpl` (used by default)
* the `pose_embedding` will be concatenation of
* global_orient(1x3x3 -> 9)
* body_pose (23x3x3 -> 207)
* shape (10 -> 10)
* location (3 -> 3)
* `pose_embedding` shape will be [batch_size,229]
17. For `location`, the `location_embedding` will be of shape [batch_size,99].
* location_embedding consists of
* pred_joints_2d (45x2 -> 90)
* 3 x (pred_cam (3))
18. Using the above results, we create a detection object with its respective detection_data for each detected person (see the sketch below for how the pose and location embeddings are assembled).
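To make the dimensions in steps 16-17 explicit, here is a minimal sketch (not the PHALP code itself) that assembles the 229-dim pose embedding and the 99-dim location embedding from the HMR 2.0 outputs for one batch of people; the tensors are dummies with the shapes listed above, and using pred_cam_t as the "location" term is an assumption.

```python
import torch

bs = 3  # batch size = number of detected people (dummy value)

# Dummy HMR 2.0 outputs with the shapes listed in Stage 3.
global_orient  = torch.randn(bs, 1, 3, 3)
body_pose      = torch.randn(bs, 23, 3, 3)
betas          = torch.randn(bs, 10)
pred_cam_t     = torch.randn(bs, 3)      # person translation ("location" above, assumed)
pred_joints_2d = torch.randn(bs, 45, 2)
pred_cam       = torch.randn(bs, 3)

# pose_distance type "smpl": 9 + 207 + 10 + 3 = 229 dims.
pose_embedding = torch.cat([
    global_orient.reshape(bs, -1),   # 1x3x3  -> 9
    body_pose.reshape(bs, -1),       # 23x3x3 -> 207
    betas,                           # 10
    pred_cam_t,                      # 3
], dim=1)
assert pose_embedding.shape == (bs, 229)

# Location embedding: 45x2 joints (90) + 3 copies of pred_cam (9) = 99 dims.
location_embedding = torch.cat([
    pred_joints_2d.reshape(bs, -1),  # 90
    pred_cam, pred_cam, pred_cam,    # 3 x 3 = 9
], dim=1)
assert location_embedding.shape == (bs, 99)
```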
STAGE 4 - Tracking and updating
---
-> For the initial frame
18. Now we run `tracker.predict()`.
* This propagates the track state distributions one time step forward. This function should be called once every time step, before `tracker.update`.
* For each track we increment age and time_since_update by 1.
* On the initial run (i.e. the first frame) the track list is empty, so this step is effectively skipped and we go straight to tracker.update().
19. We run `tracker.update` with inputs (detections (the detection objects created at the end of Stage 3), t_ (frame number 0,1,2...), frame_name, self.cfg.phalp.shot (whether to check for scene changes in the algorithm)).
* Since there are no tracks yet, we will have zero matches.
* For each detection we initiate a track.
* track_data will create a history. The history will contain cfg.phalp.track_history (default 7) copies of the same data (example: if we have a location, then track_data['history'] will repeat the same detected location 7 times).
* If fewer than track_history frames have been processed, duplicate entries occur (first frame (1,1,1,1,1,1,1), 2nd frame (1,1,1,1,1,1,2), 3rd frame (1,1,1,1,1,2,3), 4th frame (1,1,1,1,2,3,4), ...); see the history sketch below.
* We add the pose, location and appearance values from the detection data to track_data['prediction']['pose'], track_data['prediction']['loca'] and track_data['prediction']['appe'].
* We add the initiated track to the list of all tracks.
* Each track has 3 states (Confirmed, Tentative and Deleted), so the initial track state will be Tentative. If we have ground truth, the track state will be Confirmed.
* We update the prediction values into the samples variable (shape: (number of tracks, number of time steps, 4 (appe_feature, loca_feature, pose_feature, uv_map))) in nn_matching (refer nn_matching.py/partial_fit).
* Then we save the results and go to the next frame (go to STAGE 2). Later, this samples data is used to create the cost matrix.
-> After the initial frame we will have some tracks
20. We increment age and time_since_update by 1 for each track (`tracker.predict()`) and then follow Stage 4.1 below.
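The history padding described in step 19 can be illustrated with a small sketch (an illustration of the behaviour, not the PHALP implementation): a fixed-length buffer seeded with copies of the first observation that then shifts as new frames arrive.

```python
from collections import deque

TRACK_HISTORY = 7  # cfg.phalp.track_history

def init_history(first_value):
    # Seed the history with 7 copies of the first detection's data.
    return deque([first_value] * TRACK_HISTORY, maxlen=TRACK_HISTORY)

history = init_history(1)          # frame 1 -> (1,1,1,1,1,1,1)
for frame in (2, 3, 4):
    history.append(frame)          # deque with maxlen drops the oldest entry

print(list(history))               # [1, 1, 1, 1, 2, 3, 4]
```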
STAGE 4.1 - Creating Cost Matrix
---
21. We compute the cost_matrix, which is of shape (number of tracks, number of detections).
* 
22. We iterate through each track and compute the nearest-neighbour (Euclidean) distance between the track's predicted pose for the next frame (the result of pose_transformer_v2, stored in the self.samples variable) and each of the detections. This is a nearest-neighbour distance metric that, for each track, returns the smallest distance between the detections and any sample observed so far for that track (sketched below).
23. In the example below, 1, 2, 3, 5, 6 are track ids and distance_a is the nearest-neighbour distance of each track with respect to all the detections.

STAGE 4.2 - Matching detections and tracks
---
* 
24. Using `linear_assignment` from scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html) we run the `Hungarian algorithm` on the cost_matrix and obtain the matched track-detection pairs (the indices in the image below are the result of linear_assignment; a sketch follows at the end of this stage).
* 
25. We then update the matched tracks with latest detection data.
* 
26. We set an undetected track's state to Deleted if its previous state was Tentative or its time_since_update > max_age (initialized value).
27. For unmatched detections we do step 19, i.e. create a new track.
* 
28. Now, for each matched and unmatched track, we predict the location and pose for the next frame.
Note - If we have not yet gone through track_history (the number of frames we keep history for) frames, we use duplicates (first frame (1,1,1,1,1,1,1), 2nd frame (1,1,1,1,1,1,2), 3rd frame (1,1,1,1,1,2,3), 4th frame (1,1,1,1,2,3,4), ...).
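A short sketch of the matching in steps 24-27, using scipy's linear_sum_assignment on the cost matrix and a hypothetical gating threshold (max_distance) to separate matches from unmatched tracks and detections:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(cost_matrix, max_distance=10.0):
    # Hungarian algorithm: minimum-cost assignment of tracks to detections.
    row_ind, col_ind = linear_sum_assignment(cost_matrix)

    matches, unmatched_tracks, unmatched_dets = [], [], []
    for t, d in zip(row_ind, col_ind):
        # Gate: reject assignments whose cost exceeds the threshold (illustrative value).
        if cost_matrix[t, d] > max_distance:
            unmatched_tracks.append(t)
            unmatched_dets.append(d)
        else:
            matches.append((t, d))

    # Tracks/detections that never entered the assignment are unmatched as well.
    unmatched_tracks += [t for t in range(cost_matrix.shape[0]) if t not in row_ind]
    unmatched_dets   += [d for d in range(cost_matrix.shape[1]) if d not in col_ind]
    return matches, unmatched_tracks, unmatched_dets

cost = np.random.rand(3, 4)     # 3 tracks, 4 detections
print(match(cost))
```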
STAGE 4.3 - Predicting Pose for the next frame
---
29. We pass 3 things, i.e. the history (last 7 (cfg.phalp.track_history) frames) of:
* pose features
* pose data (xy,scale,time)
* time
* Refer https://hackmd.io/@anakin513/Hkwj4p9BR
30. We do a single forward pass over the input data.
* The encoder is basically lart_transformer.
* We mask the data using the bert_mask function. If the mask type is random, this function randomly masks poses (setting their mask to one). If the mask type is 0, we mask the poses wherever there is no detection (see the masking sketch below).
* We encode the input pose_shape data.
* We then add pos_embedding_learned1 to x and pass it to transformer1.
* We pass x through a convolution layer and then to transformer2, along with has_detection and mask_detection, and get the output.
31. We decode the output from step 30 and get pose_vector and pred_cam.
32. We return the pose_camera at the time step we need to predict.

STAGE 4.4 - Predicting Location for the next frame
---
33. For location, we use ridge regression to predict the location parameters (x, y, nearness), as sketched below.
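A minimal sketch of the idea in step 33: fit a ridge regression on the recent history of location parameters (x, y, nearness) against time and extrapolate to the next time step. The exact features and regularisation used by PHALP may differ; this is only an illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# History of the last 7 frames for one track: time -> (x, y, nearness).
times = np.arange(7).reshape(-1, 1)                    # shape (7, 1)
loca_history = np.random.rand(7, 3)                    # dummy (x, y, nearness) per frame

model = Ridge(alpha=1.0)
model.fit(times, loca_history)                         # one linear fit per output dimension

next_time = np.array([[7]])
predicted_location = model.predict(next_time)          # shape (1, 3)
print(predicted_location)
```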
STAGE 4.5 - Predicting Appearance for the next frame
---
34. The single-frame appearance representation for person i at time step t, $A^i_t$, is taken from the HMAR model by combining the UV image of that person
$T^i_t$ ∈ $R^{3×256×256}$ and the corresponding visibility map $V^i_t$ ∈ $R^{1×256×256}$ at time step t:
$$A^i_t = [T^i_t, V^i_t] ∈ R^{4×256×256}$$
The visibility mask $V^i_t$ ∈ [0, 1] indicates whether a pixel in the UV image is visible or not, based on the estimated mask from Mask R-CNN.
35. After every new detection we create a single per-tracklet appearance representation (see the sketch at the end of this stage):


Example of predicted Appearance

(brightness increased for better visualization)
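To illustrate the representation in step 34, the sketch below stacks a texture image and its visibility map into the 4x256x256 single-frame appearance, and shows one plausible per-tracklet aggregation (a visibility-weighted running average). The aggregation rule here is an assumption for illustration, not necessarily the exact PHALP update.

```python
import torch

# Single-frame appearance: texture T (3x256x256) + visibility V (1x256x256) -> A (4x256x256).
T = torch.rand(3, 256, 256)                    # UV texture of the person
V = (torch.rand(1, 256, 256) > 0.5).float()    # visibility map in {0, 1}
A = torch.cat([T, V], dim=0)                   # shape (4, 256, 256)

def update_tracklet_appearance(track_T, track_V, new_T, new_V):
    """Visibility-weighted running average of the texture (illustrative rule, assumed)."""
    total_V = track_V + new_V
    merged_T = (track_T * track_V + new_T * new_V) / total_V.clamp(min=1.0)
    merged_V = (total_V > 0).float()
    return merged_T, merged_V

track_T, track_V = update_tracklet_appearance(T, V, torch.rand(3, 256, 256), V)
```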
STAGE 4.6 - Updating Predicted Values
---
34. We add the predicted pose and location to the prediction key of track_data.
35. Using the metric.partial_fit function we add the predicted values to self.samples in nn_matching.py, which will be used to create the cost_matrix for the next frame.
36. We iterate through each track and save all the results.
STAGE 5 - Saving the data and Rendering
---
37. We skip this step if the track is not in the Confirmed state.
38. We start rendering when the current frame number is greater than 5 (n_init).
39. We iterate through each frame and its data:
* extract the mesh from the vertices of all the tracks and add them to the scene
* create and run an offscreen renderer on the scene with a black background, using the camera parameters, to get the rendered image
* then we composite the original image with the rendered image (see the sketch below)
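A minimal sketch of the offscreen rendering and compositing described in this stage, using pyrender and trimesh directly. The vertices, faces, camera intrinsics and frame are placeholders; PHALP's own py_renderer wraps these steps differently.

```python
import numpy as np
import trimesh
import pyrender

H, W = 720, 1280                                   # frame resolution (placeholder)

# Build the scene with a black background and add one mesh per track.
scene = pyrender.Scene(bg_color=[0, 0, 0, 0])
vertices = np.random.rand(6890, 3)                 # placeholder for SMPL vertices
faces = np.array([[0, 1, 2]])                      # placeholder faces
mesh = pyrender.Mesh.from_trimesh(trimesh.Trimesh(vertices, faces))
scene.add(mesh)

# Camera with the frame's intrinsics (placeholder focal length / principal point).
camera = pyrender.IntrinsicsCamera(fx=5000, fy=5000, cx=W / 2, cy=H / 2)
scene.add(camera, pose=np.eye(4))
scene.add(pyrender.DirectionalLight(intensity=3.0), pose=np.eye(4))

# Offscreen render, then composite the rendered people over the original frame.
renderer = pyrender.OffscreenRenderer(viewport_width=W, viewport_height=H)
color, depth = renderer.render(scene)              # color is RGB uint8, depth in metres
renderer.delete()

frame = np.zeros((H, W, 3), dtype=np.uint8)        # stand-in for the original frame
mask = (depth > 0)[..., None]                      # pixels covered by the rendered mesh
composite = np.where(mask, color[..., :3], frame)  # note: convert RGB/BGR as needed
```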