* Our GitHub code: https://github.com/ana-baltaretu/dl-action-segmentation
* Reproduced paper: https://arxiv.org/abs/2304.06403
* Author's code: https://github.com/elenabbbuenob/TSA-ActionSeg
* Dataset Breakfast: https://serre-lab.clps.brown.edu/resource/breakfast-actions-dataset/
* Dataset 50Salads: https://cvip.computing.dundee.ac.uk/datasets/foodpreparation/50salads/; https://drive.google.com/file/d/1mzcN9pz1tKygklQOiWI7iEvcJ1vJfU3R/view

## Introduction and Background

This blog post presents the results of a reproduction study of the paper "*Leveraging triplet loss for unsupervised action segmentation*" by E. Bueno-Benito, B. Tura, and M. Dimiccoli. Additionally, it outlines the testing of the model on new data, a hyperparameter check, and the evaluation of a new algorithm variant. This reproducibility project was conducted as part of the *CS4240 Deep Learning* course 2023/24 at TU Delft.

In short: the paper proposes a novel **fully unsupervised** framework for the task of **action segmentation**. The model learns action representations tailored for action segmentation directly from a single input video, without the need for any training data.

**Why do we need this paper?** Real-world videos are typically lengthy, untrimmed, and contain a sequence of various actions. From making breakfast to complex surveillance footage, the ability to automatically segment videos into meaningful parts is crucial.

**Why unsupervised?** The field of action segmentation has relied heavily on supervised learning techniques. These techniques require large, precisely annotated datasets. This dependency limits the models' applicability to new, unannotated videos and demands substantial computational resources.

**Approach from this paper** The authors present a fully unsupervised framework that learns from the video itself. The method uses a shallow neural network architecture combined with a triplet loss mechanism that exploits temporal and semantic similarities to segment actions. Among its advantages: the model is evaluated on widely recognized benchmark datasets, it trains quickly, and it performs well on a large variety of videos without any prior training on labeled datasets.

## Experiments

### 1) Reproduction

The first task focused on reproducing the results obtained by the paper on the **Breakfast dataset** (dense trajectory features, 64D). The metrics for the Breakfast dataset (see Table 1) were obtained by running the model with the [default configuration](https://github.com/elenabbbuenob/TSA-ActionSeg/blob/main/configs/breakfast_action.yml) for the full Breakfast Action dataset provided by the authors:
* *learning rate* of 0.051
* *distance* of 0.032
* *batch size* of 128.

To get similar results, we recommend following the setup guide from our [README file](https://github.com/ana-baltaretu/dl-action-segmentation/blob/main/README.md) and downloading the following files: pre-computed dense trajectories [dense_traj_all_s1.tar.gz](https://www.dropbox.com/s/mfz8pe7jerlj54g/dense_traj_all_s1.tar.gz), [dense_traj_all_s2.tar.gz](https://www.dropbox.com/s/l25u0jbmi62jwjp/dense_traj_all_s2.tar.gz), [dense_traj_all_s3.tar.gz](https://www.dropbox.com/s/mfz8pe7jerlj54g/dense_traj_all_s3.tar.gz), [dense_traj_all_s4.tar.gz](https://www.dropbox.com/s/mfz8pe7jerlj54g/dense_traj_all_s4.tar.gz) (~220GB in total), together with the ground truth and mappings from [BF_gt_mapping](https://drive.google.com/file/d/1RO8lrvLy4bVaxZ7C62R0jVQtclXibLXU/edit).
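For reference, the sketch below illustrates how such a run could be configured. This is our own illustration only: the dictionary key names are hypothetical and the authoritative values live in the authors' `configs/breakfast_action.yml`.

```python
# Illustrative only: loading the default configuration and overriding the
# hyperparameters we used for the full Breakfast run. The key names below are
# our own guesses, not necessarily those used in breakfast_action.yml.
import yaml

with open("configs/breakfast_action.yml") as f:
    cfg = yaml.safe_load(f)

# Hypothetical key names, shown only to document the values we ran with.
cfg.update({"learning_rate": 0.051, "distance": 0.032, "batch_size": 128})
print(cfg)
```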
<figure id='res_reproduction'> <img src="https://hackmd.io/_uploads/r1oi7zdeR.png" alt="results_triplet_loss" width="450"> <figcaption>Table 1. Comparison of the results on the Breakfast Action dataset reported by the paper and obtained by us in a run on the full features dataset. The values marked with N/A were not specified by the paper.</figcaption> </figure>

As can be seen in Table 1, the results are very similar to the ones from the paper, with a **~2%** difference for the K-means and Spectral clustering methods. The only noticeable difference is for the FINCH clustering, which is about **~15%** (see the row highlighted in red)! By inspecting the values in the table, it seems that the value reported by the paper for FINCH is closer to the value we obtained for TW-FINCH, but the paper does not mention using the names interchangeably.

### 2) Triplet Loss Variations

The triplet loss function is the core component of the proposed method because it is the criterion being optimised when learning the feature space transformation. The authors use the KL-divergence as the default distance metric in the triplet loss function:

<img src="https://hackmd.io/_uploads/r1s_MqHJC.png" alt='triplet_loss' width="350"/>

where $f_{ts}$ denotes the temporal-semantic similarity distribution and $i$ is the anchor index chosen among the set of frames. The indices $i^+$ and $i^-$ are selected from the set of indices of the highest (positive) and lowest (negative) similarity frames, respectively, when compared to $i$.

Nevertheless, it is possible to compute the triplet loss with other distance metrics besides the KL-divergence. The only requirement is of a mathematical nature, imposed by the fact that $f_{ts}$ is a (discrete) probability distribution; any other distance used should be applicable to distributions as well. The paper does not explain why the KL-divergence in particular was selected, nor how it compares to alternative distance metrics. We expected this choice to have an impact on the model's performance, given the importance of the triplet loss function in the optimisation process. Therefore, we decided to investigate whether other metrics for the triplet loss function yield a different (preferably better) performance.

Numerous distance metrics for probability distributions have been proposed in the literature, making the choice difficult. A few examples and their properties are available in [1]. We decided to move forward with two alternatives: the Hellinger Distance and the Cross-Entropy Loss.

**Hellinger Distance** measures the similarity between two probability distributions, $P_1$ and $P_2$, by comparing their square roots:

<img src="https://hackmd.io/_uploads/rkzWEeLJR.png" alt='hellinger_dist' width="270"/>

where in this case $P_1$ and $P_2$ are discrete. We selected it because it has two desirable properties that the KL-divergence lacks: it is bounded (to $[0, 1]$) and symmetrical [2].

**Cross-Entropy Loss** between two (discrete) probability distributions $p$ and $q$ is defined as

<img src="https://hackmd.io/_uploads/rJ3ddxLJC.svg" alt='cross_entropy_loss' width="230"/>

It is a commonly used loss function in Machine Learning tasks, which made it an appealing choice for us. Unlike the Hellinger Distance, it is neither symmetrical nor bounded above. However, it is said to offer smoother gradients than other loss functions, which should in turn lead to better convergence during training.
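To make the comparison concrete, below is a minimal sketch (our own, not taken from the authors' code) of how the three distance metrics can be computed in PyTorch for discrete probability distributions such as $f_{ts}$; in our experiments these distances take the place of the KL term inside the triplet loss.

```python
# Minimal sketch (ours, not the authors' implementation) of the three
# distance metrics we compared, applied to discrete probability distributions
# p and q (1-D tensors that sum to 1). eps avoids log(0).
import torch

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)); asymmetric, unbounded
    return torch.sum(p * torch.log((p + eps) / (q + eps)))

def hellinger_distance(p, q):
    # H(p, q) = (1 / sqrt(2)) * || sqrt(p) - sqrt(q) ||_2; symmetric, in [0, 1]
    return torch.norm(torch.sqrt(p) - torch.sqrt(q)) / (2 ** 0.5)

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_x p(x) * log(q(x)); asymmetric, unbounded above
    return -torch.sum(p * torch.log(q + eps))

# Tiny usage example with two arbitrary 4-bin distributions:
p = torch.tensor([0.4, 0.3, 0.2, 0.1])
q = torch.tensor([0.25, 0.25, 0.25, 0.25])
print(kl_divergence(p, q), hellinger_distance(p, q), cross_entropy(p, q))
```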
We repeated the model evaluation from the paper on the Breakfast Action dataset with all three distance metrics for the triplet loss function. We record the results of four clustering methods (K-means, Spectral, FINCH, TW-FINCH) on three metrics (Mean over Frames - MoF, Intersection over Union - IoU, F1-score).

Due to time and hardware constraints, we adjusted some of the experimental parameters. We ran 10 epochs (instead of 45) on only 2 actions (cereals and coffee) out of the 10 available in the dataset. Lastly, we used the reduced version of the features, with only 6 decimal places instead of 18. These are readily available on the website of the dataset. The results are shown in the [table](#res_triplet) below.

Contrary to our expectation, the choice of distance metric has almost no impact on the model's performance. The largest difference between the worst and best performing distance metrics is only 0.9 (F1-score with TW-FINCH). Even where performance differences exist, they seem to be inconsistent: the Spectral and TW-FINCH clustering methods work marginally better with KL-divergence, while FINCH yields somewhat better results with the Hellinger Distance and Cross-Entropy.

<figure id='res_triplet'> <img src="https://hackmd.io/_uploads/Sk_quvrk0.png" alt="results_triplet_loss"> <figcaption>Table 2. Results on the Breakfast Action dataset for different distance metrics in the triplet loss function. KL stands for KL-divergence, HD for Hellinger Distance, and CE for Cross-Entropy Loss. We underline the results that significantly outperform the alternatives (i.e. at least 0.2 higher than the second best result).</figcaption> </figure>

### 3) Temporal similarity distribution (**f~t~**)

The **goal** of this section is to help us **understand how the temporal similarity distribution (f~t~) influences the performance of the model**. To achieve this goal, 2 types of experiments were run on a subset of the Breakfast dataset:
- (E1) adjusting the user-picked variable **L** (`positive_window` in the code)
- (E2) removing parts of the weighted sum between f~t~ and f~s~

**3.1) Variable definition**

$f_{ts}$ = weighted sum of the temporal similarity (f~t~) and the semantic similarity (f~s~), defined as $$f_{ts} = \alpha \cdot f_t + (1 - \alpha) \cdot f_s$$

$\alpha$ = weight, a learnable parameter of the model

$d$ = temporal distance, i.e. how close frames are to each other in the video timeline. This is relevant in a continuous video (without jumps / cuts) because frames that are close together are likely part of the same action.

$w(\cdot)$ = weight function based on the slope $\beta$ and the temporal distance $d$, defined as $w(d) = -1 + 2 \cdot \exp(\frac{-d}{\beta})$. In the paper it is used to define the slope $\beta$ by centering the weight function around the middle of the positive window length ($\frac{L}{2}$), i.e. by forcing $w(\frac{L}{2}) = 0$.

$\beta$ = controls the slope of the weight function, i.e. how much importance the surrounding frames get, and is defined as $\beta = \frac{-L}{2\cdot \ln(\frac{1}{2})}$

$L$ = length of the positive window; see below the shape of the $\beta$ slope for different window lengths. A smaller window results in a steeper slope, so frames closer to the "current" frame get more importance.
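As a small illustration (our own sketch, not the authors' implementation), $\beta$ and $w(d)$ can be computed as follows; evaluating $w$ for different window lengths $L$ produces curves like those shown in Table 3 below.

```python
# Minimal sketch of the temporal weight function described above (our own
# illustration). beta is chosen so that w(L/2) = 0, i.e. the weight changes
# sign at the middle of the positive window.
import math

def beta(L):
    return -L / (2 * math.log(0.5))          # equals L / (2 * ln 2)

def w(d, L):
    return -1 + 2 * math.exp(-d / beta(L))   # w(0) = 1, w(L/2) = 0

# Sample the weight at a few temporal distances for each window length:
for L in (1, 5, 10, 25, 100):
    print(L, [round(w(d, L), 2) for d in range(0, L + 1, max(1, L // 5))])
```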
|Window length $L$| L = 1 | L = 5 | L = 10 | L = 25 | L = 100 |
|---| -------- | -------- | -------- | --- |--- |
|Slope $\beta$| ![window_1](https://hackmd.io/_uploads/rJ2AZjcJ0.png) | ![window_5](https://hackmd.io/_uploads/Sk20boqyR.png) | ![window_10](https://hackmd.io/_uploads/HknC-jcyR.png)| ![window_25](https://hackmd.io/_uploads/BJhCZo9k0.png) | ![window_100](https://hackmd.io/_uploads/S1r9rJxeC.png)|

<figure id='time_window_length'> <figcaption>Table 3. Comparison of how the slope looks for different time window lengths. Notice how the slope becomes less steep as the window length increases. </figcaption> </figure>

In the authors' version of the code the window length (`positive_window`) is calculated from the average number of actions (`Nc`) in the dataset, since it is defined as `positive_window = len(features) // Nc`. We believe that this can lead to varying window lengths for videos from the same category. Another potential issue is the "gaps" or downtime in the videos (i.e. no relevant action happening), because they influence the overall length of the video.

**3.2) Assumptions**

What we think should happen:
- A1. A larger time window leads to smoother transitions (frame assignments do not jitter/flicker at segment edges), since it gathers more information from neighbouring frames.
- A2. A smaller time window is better for more "precise" actions (quicker and more varied actions).
- A3. Removing either the temporal similarity (f~t~) or the semantic similarity (f~s~) would have a significant impact on the results.

**3.3) Setup & Results**

All of the tests in this section were run only on the (reduced) cereals videos numbered 0 to 34 (inclusive).

------ **Experiment E1** ------

In this experiment we changed the `positive_window` variable (L) from [this line of code](https://github.com/elenabbbuenob/TSA-ActionSeg/blob/main/tsa.py#L139). In the original code the window length is calculated by dividing the total number of frames by a user-defined number of actions (`Nc`). This leads to a window that varies between videos (between 100-160 frames), which we use as a baseline when comparing against our results. We expected more consistent results across videos if the user selects the `positive_window` variable directly instead of `Nc`.

![image](https://hackmd.io/_uploads/B1yqJydl0.png)
<figure id='results_time_window'> <figcaption>Table 4. Results of running the experiment on different window sizes.</figcaption> </figure>

We can also visualize these results (on the `P03_cam01_P03_cereals` video); each line in the graphs below means the following:
- the 1st line in each image is the ground truth (GT),
- the 2nd line shows the initial features generated for each method (IDT),
- the remaining lines correspond to the different values selected for the time window.

| Kmeans | Spectral|
| -------- | -------- |
| ![__final_kmeans_result](https://hackmd.io/_uploads/SJLCZydgC.png) | ![__final_spectral_result](https://hackmd.io/_uploads/rJ81M1_eR.png) |
| **Finch** | **TW-Finch** |
| ![__final_finch_result](https://hackmd.io/_uploads/B13ezJOxA.png) | ![__final_twfinch_result](https://hackmd.io/_uploads/Bkj1zkdeA.png) |

From the data and graphs above we can draw the following conclusions:
- (C1) There is a large difference between a window length of 10 frames and one of 50 frames. It seems that if the number of frames per action is (very) poorly approximated, the results are significantly worse (off by around 10-15%).
- (C2) The method is largely resilient to small changes in the window length (there is only a small difference between L=100 and L=120).
- (C3) The TW-FINCH model does not seem to be affected by changes in the window length, but it is unclear why.

![image](https://hackmd.io/_uploads/rJTv-Jug0.png)
<figure id='highlighted_conclusions_tw'> <figcaption>Table 5. Highlighted conclusions over the results from Table 4.</figcaption> </figure>

------ **Experiment E2** ------

For this experiment we wanted to see how much the temporal similarity influences the final results. We tested this by "removing" it from the calculation of f~ts~ (which is computed [in this line of code](https://github.com/elenabbbuenob/TSA-ActionSeg/blob/main/model/mlp.py#L37)). We do *not actually* remove the line, as deleting it would lead to errors. Instead, we set the values of the respective temporal features (`mt`) to a very small number (`1e-20`), such that they do not influence the result. More specifically, for the results in Table 6 and Table 7:
- the `Fs` line makes use of **only the semantic features**; we added a line [between 156-157](https://github.com/elenabbbuenob/TSA-ActionSeg/blob/main/tsa.py#L156-L157) which makes the temporal features insignificant: `temporal = torch.full(temporal.shape, 1e-20)`
- the `Ft` line makes use of **only the temporal features**, with a line added [between 142-143](https://github.com/elenabbbuenob/TSA-ActionSeg/blob/main/tsa.py#L142-L143) which makes the semantic features insignificant: `semantic = torch.full(semantic.shape, 1e-20)`

![image](https://hackmd.io/_uploads/rklHTedlR.png)
<figure id='Fts_fs_ft'> <figcaption>Table 6. Removed parts of the weighted sum.</figcaption> </figure>

| Removing f~s~ from the weighted sum, equivalent to the Ft line from Table 6 | Removing f~t~ from the weighted sum, equivalent to the Fs line from Table 6 |
| -------- | -------- |
| ![report2](https://hackmd.io/_uploads/rkaYGZdgR.png) | ![report1](https://hackmd.io/_uploads/HkZcfZuxC.png) |
| No significant difference | Very poor results |

<figure id='Fts_fs_ft_vis'> <figcaption>Table 7. Visualization of the classification of frames when removing parts of the weighted sum. The image shows results on the video `P03_cam01_P03_cereals` for the FINCH clustering method.</figcaption> </figure>

**3.4) Concluding remarks**

Assumption (A1) was not correct; in reality, a "properly" selected time window leads to results with minimal jitter. If the time window is too large, the model no longer looks only at neighbouring sections when deciding which cluster to assign the current frame to. Because of this, and because of the random sampling, we believe there is some amount of jitter in the classification of frames (specifically for the L=250 time window).

Assumption (A2) was inconclusive since we did not end up testing the model on medical data, but some insight on this topic can be gathered from Section 5, on the granularity of the labels.

Assumption (A3) was surprising because we expected both the semantic similarity and the temporal similarity to significantly impact the results. Remarkably, it seems that the temporal similarity can properly classify frames on its own. On the other hand, removing the temporal similarity massively impacted the performance of the models, meaning that the semantic similarity alone cannot solve this task.
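To make the E2 ablation concrete, the sketch below (our own illustration, not a copy of the authors' `tsa.py`) shows how the weighted sum $f_{ts}$ can be formed and how one of its terms is effectively disabled by overwriting the corresponding tensor with a negligible constant, mirroring the `torch.full(..., 1e-20)` lines we inserted.

```python
# Sketch of the E2 ablation (our own illustration): f_ts is the weighted sum
# of the temporal and semantic similarity distributions; "removing" one term
# is done by overwriting its tensor with a negligible constant. In the real
# model alpha is a learnable parameter; here it is passed as a plain float.
import torch

def combine(temporal, semantic, alpha, drop=None):
    if drop == "temporal":                    # keep only f_s (the "Fs" row)
        temporal = torch.full(temporal.shape, 1e-20)
    elif drop == "semantic":                  # keep only f_t (the "Ft" row)
        semantic = torch.full(semantic.shape, 1e-20)
    return alpha * temporal + (1 - alpha) * semantic

# Toy usage with random per-frame similarity distributions:
ft = torch.rand(5, 5)
fs = torch.rand(5, 5)
print(combine(ft, fs, alpha=0.5, drop="semantic"))
```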
### 4) Semantic similarity distribution (f~S~)

In this section, we explain the results of our experiments varying the filtering parameter of the exponential function used in the semantic similarity distribution (f~S~). All experiments were run on a subset of the original data.

**4.1) Variable definition**

To define f~S~, we assume that the set of most similar frames (in the original feature space) of an anchor $i$ is very likely to be part of the same action. The similarity of an anchor $i$ to all other frames is defined element-wise via a pairwise similarity, normalised to the total unit weight, $f_S = w_{i,j} / W$ with $W = \sum_{i,j \in E} w_{i,j}$ and $w_{i,j} = \exp(-(1 - d(x_{i}, x_{j}))/h)$, where $E$ is the set of pairwise relations, $d(\cdot, \cdot)$ is the cosine distance and $h$ is the filtering parameter of the exponential function.

In the authors' version of the code the filtering parameter *h* is set to 0.01, and no further comments or remarks are made about this parameter.

**4.2) Assumptions based on the formula**

The parameter *h* serves as a filtering parameter for the exponential function used to calculate the pairwise similarity between frames. It controls the sensitivity of the exponential function to differences in the cosine distance between frames: a smaller value of *h* makes the exponential function more sensitive, while a larger value makes it less sensitive.
1. For small *h*, the exponential function sharply boosts the pairwise similarity as the cosine distance decreases, favoring significantly higher similarities for frames with small distances over those with larger distances.
2. In contrast, larger *h* values lead to a slower rise in pairwise similarity with decreasing cosine distance, resulting in less pronounced differences between frames with small and large distances.

**4.3) Setup & Results**

We ran the experiments only on the (reduced) cereals and coffee videos. These two actions were chosen based on how differently they perform in the original reproduction run and on how much more the coffee videos were segmented. We performed a grid search across a range of potential values for the parameter *h* and systematically evaluated the performance of the model, aiming to identify an optimal range or specific value for *h*.

<figure id='res_FilterParam'> <img src="https://hackmd.io/_uploads/SyZGJ8deR.png" alt="res_FilterParam"> <figcaption>Table 8. Results of running the experiment with different filtering parameters h. </figcaption> </figure>

**4.4) Conclusion**

Given that varying the *h* parameter in the range [0.0001, 1], as shown in Table 8, yielded no noticeable differences in the outcomes, several possibilities are implied:
* Insensitivity to *h*: the semantic similarity distribution may be relatively insensitive to changes in the *h* parameter within the tested range. This suggests that the exponential function's sensitivity to differences in cosine distance, as controlled by *h*, might not significantly affect the pairwise similarity calculations in this particular context.
* Dominance of other factors: other factors or parameters in the model may exert a stronger influence on the outcomes than *h*. These factors could overshadow any potential effects of changing *h*, leading to consistent results despite variations in its value.

Finally, our results from the second experiment of Section 3 reveal that the influence of the temporal similarity distribution outweighs that of the semantic similarity distribution.
Therefore, making adjustments related to the latter may not significantly impact performance. However, investigating both distributions was crucial for comprehensively understanding the overall potential and effectiveness of the method described in the paper.

### 5) Experiments on a new dataset

Finally, to complete our reproducibility study, we wanted to test how the method behaves on a dataset the original authors did not use. We were also curious whether the model still performs as well as it did on the **Breakfast** and **INRIA YouTube Instructional Videos** datasets, and how it compares with the models proposed by 2 other papers.

**5.1) The 50Salads dataset**

As we wanted to be able to compare our results with other papers, we had to pick a dataset that is commonly used for the action segmentation task. For this reason (and to keep the food theme of our experiments), we chose the **50Salads** dataset, introduced by the [University of Dundee](https://cvip.computing.dundee.ac.uk/datasets/foodpreparation/50salads/).

The **50Salads** dataset comprises 50 videos (4.5 hours in total), featuring 25 participants, each preparing two salads. These videos are longer on average (around 10k frames each) and have a wider range of action categories compared to those in the Breakfast dataset. What is great about this dataset is that the videos are annotated at 3 levels of granularity: *"mid"*, *"high"* and *"eval"*. This means that we don't just have 3 datasets to experiment on, but we can also investigate how different granularities of labels influence the results! The classification of actions varies based on these levels, as does the number of labels: 17 for mid, 11 for eval and 4 for high granularity. For example, the *"high"* labels only cover very high-level actions such as *cut and mix ingredients*, *prepare dressing* and *serve salad*. In contrast, the *"mid"* labels differentiate between fine-grained actions such as *add vinegar*, *cut tomato*, *mix dressing*, *peel cucumber*.

**5.2) Experiments Setup**

We performed experiments on the different action granularities to evaluate the model and compare how they affect the performance of action segmentation. The choice of parameters such as the learning rate, distance and Nc was based on small experiments run with different values, as well as on intuition from how the authors set the parameters for the other datasets. Based on our experiments, we noticed that higher learning rates (>0.4) worked better for the *Mid* and *Eval* granularities, which have more labels. Moreover, with more labels, Nc also had to be increased. All runs used a *batch size* of 128 and 15 *epochs*.

To get similar results, we recommend downloading the following files: RGB features [fs_features.tar.gz](https://drive.google.com/file/d/17o0WfF970cVnazrRuOWE92-OiYHEXTT3/view), with the ground truth and mappings from [fs_gt_mapping](https://drive.google.com/file/d/1mzcN9pz1tKygklQOiWI7iEvcJ1vJfU3R/view). Moreover, when we specify that an experiment was run on a subset of the data, we recommend using the following files: *rgb-20-1, rgb-20-2, rgb-24-1, rgb-24-2, rgb-26-1, rgb-26-2*. These files were picked from the dataset because they are shorter and require less time to finish one experiment.

**5.3) Results and Discussion**

Looking at the results on a subset of the data, we observed that the model performs best on the *High* granularity level. This was expected since this level has the fewest labels to distinguish between.
As the number of labels increases, the model appears to struggle more to distinguish between the fine-grained actions. Additionally, we acknowledge that properly setting parameters such as Nc, learning rate (lr) and distance (dist) is very important. Thus, we believe that further refining the choice of parameters based on these results could enhance the model's performance. Lastly, our decision to run the experiments on a subset of the data impacted the results. This can be seen in the columns Mid and Mid* of Table 9. Although we cannot definitively say whether the values obtained from the subset are higher or lower than those from the entire dataset, a difference between them is expected.

<figure id='res_50Salads'> <img src="https://hackmd.io/_uploads/BJKBeRmgR.png" alt="results_50Salads"> <figcaption>Table 9. Results on the 50Salads dataset for granularity levels High (Nc: 5, dist: 0.1, lr: 0.1), Mid (Nc: 10, dist: 0.8, lr: 0.4) and Eval (Nc: 10, dist: 0.6, lr: 0.6). The results marked with * were obtained from a run on a subset of the dataset. </figcaption> </figure>

To further evaluate the model's performance, we compared our results with 2 other unsupervised methods **[3, 4]**. Since our experiments returned MoF values for each clustering method, we selected the highest overall value for comparison with the results presented in the other papers.

<figure id='res_50Salads_comparison'> <img src="https://hackmd.io/_uploads/rJuFkrugR.png" alt="results_50Salads"> <figcaption>Table 10. Comparison of the results of the presented paper and papers [3, 4]. The results marked with * were obtained from a run on a subset of the dataset. </figcaption> </figure>

When we compared the results from the 3 methods, we noticed a very large difference between our experiments and those presented in **[3]**. On the other hand, our results are somewhat closer to the ones from **[4]**, although there is still a *~6%* difference for *Mid* and a *~13%* difference for *Eval*. It seems that the *Eval* granularity should perform better than *Mid*, as shown in **[3, 4]**, but for us it is the other way around. It is important to highlight that we did not have a full run on the *Eval* granularity, so the average MoF we got from a few videos might not represent the actual value for the entire dataset.

Given the differences noticed between the results of the three methods, we decided to compare the results for the **Breakfast** and **YouTube Instructional** datasets as well. This was done to check whether the variances found for the **50Salads** dataset were justifiable. We found a 22-24% difference between our experiments and **[3]**. Interestingly, this same margin was evident in our findings for the 50Salads dataset. Conversely, **[4]** appears to perform slightly worse on Breakfast and YouTube Instructional but better on 50Salads. This suggests that the parameter values chosen (lr, Nc, dist) might not have been optimal for the 50Salads dataset. For this reason, we decided to run extra experiments on the *Eval* granularity on the subset of the data.

**5.4) Conclusion**

Overall, our findings show the effectiveness of the method presented in the paper when applied to a new dataset. Our experiments show better results compared to the unsupervised approach from paper **[3]**. Additionally, the results emphasize the significance of appropriately initializing parameters for optimal performance.

---

## References

[1] Nobel, A. (2020, October). "Distances and Divergences for Probability Distributions" [Course notes].
The University of North Carolina at Chapel Hill. https://nobel.web.unc.edu/wp-content/uploads/sites/13591/2020/11/Distance-Divergence.pdf

[2] Ding, R., and Mullhaupt, A. (2023). "Empirical Squared Hellinger Distance Estimator and Generalizations to a Family of $\alpha$-Divergence Estimators". Entropy 2023, 25(4), p. 612, doi: https://doi.org/10.3390/e25040612

[3] Kukleva, A., Kuehne, H., Sener, F., and Gall, J. (2019). "Unsupervised Learning of Action Classes With Continuous Temporal Embedding". https://arxiv.org/abs/1904.04189

[4] Sarfraz, S., Murray, N., Sharma, V., Diba, A., Van Gool, L., and Stiefelhagen, R. (2021). "Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11225-11234). https://doi.org/10.48550/arXiv.2103.11264

---

## Workload Distribution

| Student name | Experiment |
| ------------------| ------------ |
| Alexandra Marcu | New Dataset (Section 5) |
| Ana Baltaretu | Temporal similarity distribution (Section 3) |
| Andrei Tociu | Triplet Loss Variations (Section 2)|
| Cristian Cutitei | Variation on exponential filter parameter (Section 4)|