<!-- # Thribhuvan's Hacks
1. Human4d
2. Stage 1 - https://hackmd.io/@anakin513/S1AToK8vR
3. Stage 2 - https://hackmd.io/@anakin513/rJ2oA3PwA
4. Stage 3 -
   3.1 - HMR 2.0 BREAKDOWN - https://hackmd.io/@anakin513/SyrO_1-dC
   3.2 - HMAR BREAKDOWN - https://hackmd.io/@anakin513/ryaFLG3P0
-->

PHALP - 4D Humans Workflow
===

## Step By Step

[TOC]

Probabilistic Graphical Models (tutorials)

STAGE 1 - Setting Up Frames (batch offline)
---

1. We extract all the INPUT frames from the input video and OUTPUT (store) all the frame ids/numbers under the variable **list_of_frames**, and also the ground truth bounding boxes, if passed, under the variable **additional_data** (refer io.py/get_frames_from_source()).
>>> Ground truth boxes are labelled video frame data (optional).
2. We make all the required folders/directories to store the final results (PHALP.py/default_setup()). Below are the final results:
* results/demo_(video_name).pkl :
    * This contains all the `list of frame ids` in the input video as `key`
    <details>
    <summary>each frame_id will have the following data</summary>
    <br>

    * time: (int) 0, 1, 2, ...
    * shot: (int) scene change/shot number (say 10 and 15 are scene-change frame ids; then frames 0-10 get shot 0 and frames 11-15 get shot 1)
    * frame_path
    * tracked_ids: list of ids tracked till the last frame
    * tracked_bbox: bbox of tracked people (number of tracked people, 4)
    * tid: id of detected people in the current frame
    * bbox: bbox of detected people (number of detected people, 4)
    * tracked_time: time since a specific person was tracked; list of length (number of tracked people)
    * appe: appearance embedding of each detected person (number of detected people, 4096)
    * loca: location embedding of each detected person (number of detected people, 99)
    * pose: pose embedding of each detected person (number of detected people, 229)
    * center: center of each detected person's bbox (number of detected people, 2)
    * scale: scale of each detected person's bbox
    * size: size of the image (number of detected people, 2)
    * img_path
    * img_name
    * class_name
    * conf: confidence value (between 0 and 1) of whether the detected bbox contains a human or not
    * annotations: the ground truth data, if passed
    * smpl: the SMPL results of HMR 2.0 for each detected person:
        * global_orient (1, 3, 3)
        * body_pose (23, 3, 3)
        * betas (10)
    * camera: (number of detected people, 3)
    * camera_bbox: (number of detected people, 3)
    * 3d_joints: (number of detected people, 45, 3)
    * 2d_joints: (number of detected people, 90)
    </details>
>>> Note: Please describe a representation (formats) of all final results
3. Using the PySceneDetect algorithm, we extract the frame ids wherever there is a **scene change** and store them in the list_of_shots variable (PHALP.py/get_list_of_shots) (example: https://www.scenedetect.com/cli/).
    * We detect **scene changes** in videos because when one scene ends and another begins, **the people or objects we were following might disappear or change**. So we reset our tracking to start fresh and keep tracking accurately in each new scene.
> (a) How are the key frames (list_of_shots) selected? Provide details (link a separate hack with details).

    Refer to https://hackmd.io/@anakin513/S1AToK8vR to learn more about PySceneDetect.
4. **Now we iterate through each frame.**
5. We extract the image resolution (height, width, left, top) of the current frame (using image/frame properties).
    * We check whether the current frame id/number is present in the list_of_shots that we got from PySceneDetect (see the sketch below).
> (b) Explain details of the check against list_of_shots.
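To make the shot handling concrete, here is a minimal sketch (not the exact PHALP code) of how scene boundaries could be obtained with PySceneDetect's high-level API and then checked in the per-frame loop; the video path, `list_of_frames`, and the tracker reset are placeholders:

```python
from scenedetect import detect, ContentDetector

# Detect scene boundaries once, up front (batch / offline).
# detect() returns a list of (start, end) FrameTimecode pairs, one per scene.
scene_list = detect("video.mp4", ContentDetector())   # video path is illustrative
# The start frame of every scene after the first one is a shot boundary.
list_of_shots = [start.get_frames() for start, _ in scene_list][1:]

shot = 0
for t, frame_path in enumerate(list_of_frames):        # list_of_frames from step 1
    if t in list_of_shots:
        shot += 1   # new shot id for the `shot` field in the .pkl results
        # The tracker is also told a shot change happened, so identities
        # are not carried across the cut (Stage 4 performs the actual reset).
```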
6. This step initializes (sets up) STAGE 5.
    * If rendering is enabled (render.enable=True) then we reset the renderer (using visualizer.py/reset_render). We initialize pyrender.OffscreenRenderer with the image/frame resolution (py_renderer.py).
> Explain the timing flow of the input-frame processing pipeline and the rendering pipeline.

STAGE 2 - Extracting human bbox & mask in a frame
---
Stage 2 - https://hackmd.io/@anakin513/rJ2oA3PwA

* `INPUT` - Frame (image) (sequential)
* `OUTPUT` -
    * `pred_bbox`: (number of detected persons, 4): coordinates of the `rectangular bounding box` containing each person.
    * `pred_masks`: (number of detected persons, image height, image width): `segmented person mask`
    * `pred_scores`: score for whether the detected object is a person or not
    * `pred_classes`: the class the detected object belongs to (we configure Detectron2 to detect only humans, so this is always the person class by default).
* Refer Code - phalp.py/get_detections()
* Detectron2 (https://github.com/facebookresearch/detectron2)
* There are 2 `models` in Detectron2 used in PHALP:
    1. `maskrcnn`
    2. `vitdet` (used by default)
> What is ViTDet and how does its performance compare to Mask R-CNN? Provide details (link a separate hack with details).
* If we have the ground truth (additional_data) then:
    1. We extract the `ground truth boxes` and their `ground truth track ids`.
    2. We make `bbox_array` (shape = (number of people, 4) with (x1, y1, x2, y2)), `scores_array` (value is 1 for all the people since it is ground truth) and `class_array` (value is 0).
    3. Instances is initialized with the image dimensions (img_height, img_width).
    4. self.detector_x.predict_with_bbox(image, inst) is called to get outputs_x. ***What this does is detect/extract humans within the ground truth bbox.***
    5. Filtering is done to extract instances labeled as people (instances_people) based on pred_classes.
> Describe the predicted bbox, scores and especially the representation of masks, and also which object categories are detected and extracted, and how?
> This describes an internal 2D graphical model extracted from frame (image) space.
* If we don't have the ground truth then we pass the image to Detectron2.
    * We pass the image to the detector and keep the predicted bbox, scores and masks of all detections whose scores are greater than 0.8 (self.cfg.phalp.low_th_c) and that belong to the person class, which makes sure the kept objects are humans (a minimal sketch of this filtering is given at the end of this stage).
* Using the run_additional_models() function we try to get extra_data, which is a `list of human indices` (example: extra_data = [0, 1, 2, 3, 4] means there are 5 humans in the image).
* Now we go through the get_human_features() function.
* We generate `masked_image_list` (shape: (number of people, 4, 256, 256)), a tensor stack of the cropped (boxed) image of each human detected in the frame.

    Example of a masked_image:
    ![image](https://hackmd.io/_uploads/HyOEFHMKA.png)

    Example: all detected object boxes:
    ![image](https://hackmd.io/_uploads/H1mXznIW0.png)

    Boxes containing humans and box corner coordinates:
    ![image](https://hackmd.io/_uploads/HJkY-3I-A.png)

    <img src="https://hackmd.io/_uploads/rJ4dUp8ZC.png" alt="image" width="300" height="auto">
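As a rough illustration of the no-ground-truth path, the sketch below shows how such a filtered set of person detections could be produced with Detectron2's stock Mask R-CNN predictor. PHALP's default detector is ViTDet and is set up through its own config, so treat the config lines as an assumption; the 0.8 threshold mirrors `cfg.phalp.low_th_c`.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Build a stock COCO Mask R-CNN predictor (illustrative; PHALP's default is ViTDet).
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
predictor = DefaultPredictor(cfg)

image = cv2.imread("frame_000001.jpg")                 # hypothetical frame path
instances = predictor(image)["instances"].to("cpu")

# Keep confident, person-class detections only (0.8 mirrors cfg.phalp.low_th_c;
# "person" is class id 0 in the COCO label space used by these models).
keep = (instances.scores > 0.8) & (instances.pred_classes == 0)
pred_bbox   = instances.pred_boxes.tensor[keep].numpy()   # (N, 4) as x1, y1, x2, y2
pred_masks  = instances.pred_masks[keep].numpy()          # (N, H, W) boolean masks
pred_scores = instances.scores[keep].numpy()              # (N,)
```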
STAGE 3 - Extracting Human Features (Pose, Location and Appearance) Using HMR 2.0
---
3.1 - HMR 2.0 BREAKDOWN - https://hackmd.io/@anakin513/SyrO_1-dC
3.2 - HMAR BREAKDOWN - https://hackmd.io/@anakin513/ryaFLG3P0

* BS (batch size) will be the number of people.

13. We pass the detected human images (Stage 2 results) to HMR 2.0.
    * From HMR 2.0 we get:
        * `pred_smpl_params`
            * `body_pose` [batch_size, 23, 3, 3]
            * `global_orient` [batch_size, 1, 3, 3]
            * `betas` [batch_size, 10]
        * `pred_cam` [batch_size, 3]: translation of the person for a camera capturing the local bounding box.
        * `pred_cam_t` [batch_size, 3]: translation of the person for a camera capturing the entire image.
            * `pred_cam[:, 0]` corresponds to s, the scaling factor of the `weak perspective projection`, which approximates `f/Z`. Z is the depth of the human: `Z = f/s`.
        * `pred_keypoints_3d` [batch_size, 44, 3]
        * `pred_vertices` [batch_size, 6890, 3]
        * `pred_keypoints_2d` [batch_size, 44, 2]

    For more details about the HMR 2.0 model, refer to the HMR 2.0 breakdown linked above (https://hackmd.io/@anakin513/SyrO_1-dC).
14. We generate the uv_image (appearance feature) by projecting the mesh to a UV map.
    * We get:
        * `uv_image` [batch_size, 4, 256, 256]
        * `uv_vector` (scaled-down uv_image) [batch_size, 4, 256, 256]

## For Tracking Features
* Using the above results we generate the tracking features (a minimal sketch of the pose and location embeddings is given at the end of this stage).
15. For `appearance`, we pass the uv_vector (texture image) to `autoencoder_hmar` to get the `appearance_embedding` [batch_size, 4096], which is used while tracking.
16. For `pose`:
    1. If the pose_distance type is `joints`:
        * the `pose_embedding` will be `pred_joints` [batch_size, 45, 3], which is basically `pred_keypoints_3d`.
        * Note: the 2nd axis of `pred_keypoints_3d` is 44 but `pred_joints` has 45 (44 + 1 array of zeros).
    2. If the pose_distance type is `smpl` (used by default):
        * the `pose_embedding` will be the concatenation of:
            * global_orient (1x3x3 -> 9)
            * body_pose (23x3x3 -> 207)
            * shape (10 -> 10)
            * location (3 -> 3)
        * the `pose_embedding` shape will be [batch_size, 229].
17. For `location`, the `location_embedding` will be of shape [batch_size, 99].
    * location_embedding consists of:
        * pred_joints_2d (45x2 -> 90)
        * 3 x (pred_cam (3))
18. From the above results, we create detection objects with their respective detection_data for each detected person.
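A minimal sketch of how the default (`smpl`) pose embedding and the location embedding could be assembled from the HMR 2.0 outputs above; the function and argument names are illustrative, not the actual PHALP API:

```python
import numpy as np

def build_pose_embedding(global_orient, body_pose, betas, location):
    """Flatten the SMPL parameters plus the 3D location into the 229-d pose embedding
    (default `smpl` pose_distance type). Shapes are per person, without the batch axis."""
    return np.concatenate([
        global_orient.reshape(-1),   # (1, 3, 3)  ->   9
        body_pose.reshape(-1),       # (23, 3, 3) -> 207
        betas.reshape(-1),           # (10,)      ->  10
        location.reshape(-1),        # (3,)       ->   3
    ])                               # total: 229

def build_location_embedding(pred_joints_2d, pred_cam):
    """45 projected joints (x, y) plus the 3-d camera repeated 3 times -> 99-d."""
    return np.concatenate([
        pred_joints_2d.reshape(-1),        # (45, 2) -> 90
        np.tile(pred_cam.reshape(-1), 3),  # 3 x 3   ->  9
    ])                                     # total: 99

# Example with dummy inputs for a single detected person:
pose_emb = build_pose_embedding(np.zeros((1, 3, 3)), np.zeros((23, 3, 3)),
                                np.zeros(10), np.zeros(3))
loca_emb = build_location_embedding(np.zeros((45, 2)), np.zeros(3))
assert pose_emb.shape == (229,) and loca_emb.shape == (99,)
```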
STAGE 4 - Tracking and updating
---
-> For the initial frame

19. Now we run `tracker.predict()`.
    * This propagates track state distributions one time step forward. This function should be called once every time step, before `tracker.update`.
    * For each track we increment age and time_since_update by 1.
    * On the initial run (i.e. for the first frame) the list of tracks is empty, so this step does nothing and we go straight to tracker.update().
20. We run `tracker.update` with inputs: detections (refer step 18), t_ (frame number 0, 1, 2, ...), frame_name, and self.cfg.phalp.shot (whether or not to check for scene changes in the algorithm).
    * Since there are no tracks yet, we have zero matches.
    * For each detection we initiate a track:
        * track_data creates a history. The history contains cfg.phalp.track_history (default 7) entries; for a new track the same data is repeated (example: the detected location is copied 7 times into track_data['history']).
        * If fewer than track_history frames have been processed, duplicate entries occur (for the first frame (1,1,1,1,1,1,1), 2nd frame (1,1,1,1,1,1,2), 3rd frame (1,1,1,1,1,2,3), 4th frame (1,1,1,1,2,3,4), ...).
        * We add the pose, location and appearance values from the detection data to track_data['prediction']['pose'], track_data['prediction']['loca'] and track_data['prediction']['appe'].
    * We add the initiated track to the list of all tracks.
    * Each track has 3 states (Confirmed, Tentative and Deleted). The initial track state is Tentative; if we have ground truth, the track state is Confirmed.
    * We update the prediction values into the samples variable (shape: (number of tracks, number of time steps, 4 (appe_feature, loca_feature, pose_feature, uv_map))) in nn_matching (refer nn_matching.py/partial_fit).
    * Then we save the results and go to the next frame (go to STAGE 2). Using this samples variable we later create the cost matrix.

-> After the initial frame we will have some tracks

21. We increment age and time_since_update by 1 for each track (`tracker.predict()`) and then follow from Stage 4.1 below.

STAGE 4.1 - Creating Cost Matrix
---
22. We compute the cost_matrix, which is of shape (number of tracks, number of detections).
    * ![image](https://hackmd.io/_uploads/rkyE8M7v0.png)
23. We iterate through each track and compute a nearest-neighbour distance metric (Euclidean) between the track's predicted pose for the next frame (result of pose_transformer_v2, stored in the self.samples variable) and the detections, i.e. a nearest-neighbour distance metric that, for each track, returns the closest distance to any of the detections.
24. In the image below, 1, 2, 3, 5, 6 are track ids and distance_a is the nearest-neighbour distance of each track with respect to all the detections.
    ![image](https://hackmd.io/_uploads/SJ_UIf7PR.png)

STAGE 4.2 - Matching detections and tracks
---
* ![image](https://hackmd.io/_uploads/S1opTG7PR.png)
25. Using `linear_assignment` from scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html) we run the `Hungarian algorithm` on the cost_matrix and get the matched tracks and detections (the indices in the image below are the result of linear_assignment; a minimal usage sketch follows below).
    * ![image](https://hackmd.io/_uploads/BkAvAz7vC.png)
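For illustration, a minimal sketch of this matching step using `scipy.optimize.linear_sum_assignment`; the gating threshold `max_cost` and the list handling are assumptions, not the exact PHALP bookkeeping:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(cost_matrix: np.ndarray, max_cost: float):
    """Hungarian matching between tracks (rows) and detections (columns)."""
    row_ind, col_ind = linear_sum_assignment(cost_matrix)

    matches, unmatched_tracks, unmatched_detections = [], [], []
    for r, c in zip(row_ind, col_ind):
        if cost_matrix[r, c] > max_cost:       # gate out pairs that are too costly
            unmatched_tracks.append(int(r))
            unmatched_detections.append(int(c))
        else:
            matches.append((int(r), int(c)))
    # Rows/columns the assignment never touched are also unmatched.
    unmatched_tracks += [r for r in range(cost_matrix.shape[0]) if r not in row_ind]
    unmatched_detections += [c for c in range(cost_matrix.shape[1]) if c not in col_ind]
    return matches, unmatched_tracks, unmatched_detections

# Example: 3 tracks, 2 detections
cost = np.array([[0.1, 0.9],
                 [0.8, 0.2],
                 [0.7, 0.6]])
print(match(cost, max_cost=0.5))   # -> ([(0, 0), (1, 1)], [2], [])
```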
26. We then update the matched tracks with the latest detection data.
    * ![image](https://hackmd.io/_uploads/ByVrymmDC.png)
27. We set the state of unmatched tracks to Deleted if their previous state is Tentative or their time_since_update > max_age (an initialization value).
28. For unmatched detections we do step 20, i.e. create a new track.
    * ![image](https://hackmd.io/_uploads/ryksHm7DC.png)
29. Now for each matched and unmatched track we predict the location and pose for the next frame.
    Note - If fewer than track_history (the number of frames we keep history for) frames have been processed, duplicates are used (for the first frame (1,1,1,1,1,1,1), 2nd frame (1,1,1,1,1,1,2), 3rd frame (1,1,1,1,1,2,3), 4th frame (1,1,1,1,2,3,4), ...).

STAGE 4.3 - Predicting Pose for the next frame
---
30. We pass 3 things, i.e. the history (last 7 (cfg.phalp.track_history) frames) of:
    * pose features
    * pose data (xy, scale, time)
    * time
    * Refer https://hackmd.io/@anakin513/Hkwj4p9BR
31. We do a single forward pass over the input data.
    * The encoder is basically lart_transformer.
    * We mask the data using the bert_mask function. If the mask type is random, this function masks poses at random (setting their mask value to one); if the mask type is zero, poses are masked wherever there is no detection.
    * We encode the input pose_shape data.
    * We add pos_embedding_learned1 to x and pass it to transformer1.
    * We pass x to a convolution layer and then to transformer2 along with has_detection and mask_detection, and get the output.
32. We decode the output from step 31 and get pose_vector and pred_cam.
33. We return the predicted pose and camera at the time step we need to predict.
    ![image](https://hackmd.io/_uploads/Sk4DY47vR.png)

STAGE 4.4 - Predicting Location for the next frame
---
34. For location we use ridge regression to predict the location parameters (xy, nearness).

STAGE 4.5 - Predicting Appearance for the next frame
---
35. The single-frame appearance representation of person i at time step t, $A^i_t$, is taken from the HMAR model by combining the UV image of that person, $T^i_t \in \mathbb{R}^{3\times256\times256}$, and the corresponding visibility map, $V^i_t \in \mathbb{R}^{1\times256\times256}$, at time step t:
    $$A^i_t = [T^i_t, V^i_t] \in \mathbb{R}^{4\times256\times256}$$
    The visibility mask $V^i_t \in [0, 1]$ indicates whether a pixel in the UV image is visible or not, based on the estimated mask from Mask R-CNN.
36. After every new detection we create a single per-tracklet appearance representation:
    ![image](https://hackmd.io/_uploads/SkxvOCmwA.png)
    ![image](https://hackmd.io/_uploads/By-7I07DR.png)
    Example of a predicted appearance (brightness increased for better visualization):
    ![image](https://hackmd.io/_uploads/B1mgi0mDC.png)

STAGE 4.6 - Updating Predicted Values
---
37. We add the predicted pose and location to the prediction key of track_data.
38. Using the metric.partial_fit function we add the predicted values to self.samples in nn_matching.py, which will be used to create the cost matrix for the next frame.
39. We iterate through each track and save all the results.

STAGE 5 - Saving the data and Rendering
---
40. We skip this step if the track is not in the Confirmed state.
41. We start rendering when the current frame number is greater than 5 (n_init).
42. We iterate through each frame and its data:
    * extract the mesh from the vertices of every track and add it to the scene,
    * create and run an offscreen renderer on the scene with a black background and the camera parameters to get a rendered image,
    * composite the original image with the rendered image we get (a minimal rendering sketch follows below).
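To make the rendering/compositing step concrete, here is a minimal sketch using pyrender and trimesh for a single track in a single frame; the SMPL faces, the focal length, and the assumption that the vertices are already in the camera frame are illustrative, not the exact visualizer.py / py_renderer.py code:

```python
import numpy as np
import trimesh
import pyrender

def render_overlay(frame, vertices, faces, focal_length):
    """Render one person's mesh over the original frame (a sketch, not PHALP's renderer).

    frame:    (H, W, 3) uint8 original image
    vertices: (6890, 3) SMPL vertices in the camera frame (e.g. pred_vertices + pred_cam_t)
    faces:    (F, 3) SMPL face indices
    """
    H, W = frame.shape[:2]

    # Convert from the computer-vision camera convention (+z forward) to
    # pyrender/OpenGL (-z forward) by rotating the mesh 180 degrees about x.
    mesh = trimesh.Trimesh(vertices, faces, process=False)
    mesh.apply_transform(trimesh.transformations.rotation_matrix(np.pi, [1, 0, 0]))

    scene = pyrender.Scene(bg_color=[0, 0, 0, 0], ambient_light=[0.4, 0.4, 0.4])
    scene.add(pyrender.Mesh.from_trimesh(mesh))
    scene.add(pyrender.IntrinsicsCamera(fx=focal_length, fy=focal_length, cx=W / 2, cy=H / 2),
              pose=np.eye(4))
    scene.add(pyrender.DirectionalLight(intensity=3.0), pose=np.eye(4))

    renderer = pyrender.OffscreenRenderer(viewport_width=W, viewport_height=H)
    color, depth = renderer.render(scene, flags=pyrender.RenderFlags.RGBA)
    renderer.delete()

    # Composite: wherever the mesh was rendered (depth > 0), overwrite the frame pixels.
    out = frame.copy()
    mask = depth > 0
    out[mask] = color[..., :3][mask]
    return out
```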