# Rebuttal

## Common Response to All Reviewers

We thank the reviewers for their positive comments on our work. We are pleased they found *"the idea new"* (R-n8aa) and *"reasonable"* (R-wLu7), the paper *"well written"* (R-VyHZ, R-6MvG), the *"technical aspects explained clearly"* (R-6MvG), and the results *"competitive and compelling"* (R-VyHZ, R-n8aa, R-wLu7, R-6MvG).

***

* We note a confusion about the difference between our proposed approach (Drop-DTW) and a naive baseline that we call drop-**then**-DTW in our paper. Here, we first make the distinction clear. Drop-**then**-DTW addresses the problem of aligning sequences with outliers in two steps: **1)** it identifies and drops the outliers using a drop-cost threshold on the matching costs, and then **2)** it aligns the remaining sequence elements with *standard* DTW. Importantly, step **1)** is done **independently** from step **2)**, which results in an approximate greedy solution to the problem of optimal sequence matching with outliers (defined in Equation 2). Critically, if an important element is erroneously dropped in step **1)**, it is impossible to recover it in step **2)**. Moreover, the outlier rejection in step **1)** is order agnostic and results in drops broadly scattered over the entire sequence, which makes drop-**then**-DTW inapplicable for precise step localization.

  In contrast, Drop-DTW is a unified framework that solves for the optimal temporal alignment while **simultaneously** detecting outliers (as noted by Reviewer R-6MvG). The drop cost in Drop-DTW specifies the cost of dropping a frame within this optimization process. Specifically, as defined in Algorithm 1 (lines 8-10), Drop-DTW takes into account **all** potential matches with their original costs, and enables element dropping by adding another dimension to the dynamic programming table. In other words, Drop-DTW efficiently considers *all* possible sub-sequence alignments and finds the **exact optimal solution** to the problem of sequence matching with outliers (defined in Equation 2), as illustrated by the sketch below. **We will incorporate the above clarification in the revised manuscript.**
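To make the unified formulation concrete, here is a minimal sketch of the dynamic program, in the simplified setting where only elements of one sequence (e.g., video frames) can be dropped; the function name and this one-sided simplification are ours, not the paper's exact Algorithm 1:

```python
import numpy as np

def drop_dtw_one_sided(C, drop_costs):
    """Align K ordered steps to N frames, where each frame is either
    matched to a step or dropped at a per-frame cost. Illustrative
    sketch only; Algorithm 1 in the paper is the reference version.

    C          -- (K, N) matching-cost matrix
    drop_costs -- (N,) cost of dropping each frame
    """
    K, N = C.shape
    D = np.full((K + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for j in range(1, N + 1):              # frames dropped before the first step
        D[0, j] = D[0, j - 1] + drop_costs[j - 1]
    for i in range(1, K + 1):
        for j in range(1, N + 1):
            match = min(D[i, j - 1], D[i - 1, j - 1]) + C[i - 1, j - 1]
            drop = D[i, j - 1] + drop_costs[j - 1]
            D[i, j] = min(match, drop)     # drop competes with all match hypotheses
    return D[K, N]
```

Note how the drop branch competes with the match branches *inside* the same minimization: no element is ever discarded before the alignment is computed, which is precisely what pre-thresholding cannot do.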
***

* Multiple reviewers suggested a number of existing approaches that may resemble Drop-DTW. While all of these approaches are relevant, none of them (or their combination) possesses the capabilities of Drop-DTW. Here, we briefly highlight the key differences between these methods and Drop-DTW. The LCSS algorithm (mentioned by R-VyHZ) is similar to drop-**then**-DTW, i.e., it drops the outliers greedily and independently from the alignment procedure, which yields poor results, as shown in Table 1 of our paper. The Needleman-Wunsch and Smith-Waterman algorithms (mentioned by R-wLu7) also use the idea of a drop cost. However, they implement a domain-specific, restricted form of matches and drops, i.e., only one-to-one matches and individual drops are allowed. As a consequence, they yield sub-optimal results in our experiments (see results and detailed discussion in the response to R-VyHZ). In contrast, it is important for our applications that each item can be matched with multiple items, and that both individual items and pairs of items can be dropped. Our Drop-DTW implements such possibilities and provides a more general matching algorithm. Finally, Canonical Time Warping (CTW) and its deep counterpart (mentioned by R-VyHZ, R-wLu7) can handle noisy *representations* by learning a robust feature transformation; however, they cannot handle outliers in sequences (i.e., they do *not* support dropping). In contrast to Drop-DTW, CTW does not modify the sequence matching algorithm to drop the outliers and uses classical DTW.

  In summary, Drop-DTW is more general and powerful than existing sequence alignment algorithms, as we show below through algorithm analysis and additional experiments in response to individual reviewer comments. **We will create a separate subsection in our related work that discusses the aforementioned works.**

***

## Reviewer 1 - n8aa

>* ...The main technical contribution of the paper is a DTW algorithm that can drop outliers. The key modification is that the cost matrix is threshold by a fixed scalar, then only low-cost points are considered in DTW....

As mentioned in the common response to all reviewers, the Drop-DTW algorithm does **not** threshold the cost matrix by a fixed scalar; instead, it considers the original cost matrix in all of its computations. The decision to drop an element is not taken greedily, based on whether the drop cost is larger than the matching cost. Instead, the elements to be dropped are determined by solving the dynamic program in Algorithm 1. The drop-**then**-DTW sketch below makes the contrast with greedy thresholding explicit.
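For contrast with the Drop-DTW sketch in the common response, here is a minimal sketch of the greedy baseline, using the drop criterion we state in the response to R-wLu7 (drop an element if its drop cost is lower than all of its matching costs); the column-wise dropping and the plain DTW recursion below are our simplifications:

```python
import numpy as np

def drop_then_dtw(C, drop_costs):
    """Naive two-step baseline: 1) greedily drop frames whose drop cost
    is below all of their matching costs, then 2) run standard DTW on
    the surviving frames. Illustrative sketch only.
    """
    K, N = C.shape
    keep = [j for j in range(N) if C[:, j].min() <= drop_costs[j]]  # step 1 (greedy, order agnostic)
    C_kept = C[:, keep]
    D = np.full((K + 1, len(keep) + 1), np.inf)                     # step 2: standard DTW
    D[0, 0] = 0.0
    for i in range(1, K + 1):
        for j in range(1, len(keep) + 1):
            D[i, j] = C_kept[i - 1, j - 1] + min(D[i - 1, j - 1], D[i, j - 1], D[i - 1, j])
    return D[K, len(keep)], keep
```

Because step 1) commits to the drops before any alignment is computed, a frame dropped in error can never be recovered in step 2), whatever the alignment costs turn out to be.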
>* The idea is new, but there is a concurrent paper aiming at the same purpose worth some discussion [1].

Note that this work appeared on the CVPR'21 website **after** our submission, and it was therefore impossible for us to include it in our paper at submission time. While both works support dropping outliers, we propose a different and more generic approach. Differently from Drop-DTW, [1] approaches multi-step localization in 2 independent steps: **1)** it converts videos and narrations into shorter sequences of prototypes and uses their proposed DTW variant to learn a representation of step prototypes; then **2)** it uses the distance between each frame and the step prototypes to segment a video. In contrast, we use Drop-DTW to directly align video content to step descriptions and automatically find an optimal segmentation of the video using dynamic programming. In short, segmentation is a direct by-product of our DTW-based learned alignment, whereas in [1] alignment is used to learn a representation of the key steps and segmentation is obtained in a separate step. Notably, our approach yields superior performance on the same dataset (i.e., CrossTask: Ours \{Recall: **48.9**, Acc: **71.3**\} vs. [1] \{Recall: 35.46, Acc: 40.99\}), although a direct comparison is not entirely fair, as they start from narrations on the language side.

>* I think the auxiliary clustering loss is also important, which makes me think the significance of the proposed Drop-DTW is not strong.

In the instructional video step localization experiment, training with only a sequence alignment loss of any form (i.e., Drop-DTW, softDTW, D3TW or OTAM) leads to a degenerate solution. This is not specific to our method but rather to the problem setup, i.e., aligning long (~200 elements) video sequences to short (~10 items) step sequences allows for degenerate solutions, as discussed in lines 217-222. However, when the training is regularized by using $L_{clust}$ as an additional loss, Drop-DTW shows a clear advantage over the other sequence alignment methods under **all** the metrics (see Table 3). This suggests that the contribution of Drop-DTW is significant. Importantly, note that for other experiments on real datasets that do not suffer from the same imbalance (e.g., Exp 4.2 and 4.3), Drop-DTW again shows superior performance without reliance on the regularization loss. To clarify this point, we will move the clustering loss definition to the main manuscript and include the corresponding discussion.

>* As shown in Algorithm 1, when using a fixed value of drop cost (i.e. d=0.3), the (non-differentiable) Drop-DTW loss is equivalent to first thresholding the cost matrix then applying DTW. But in Table 1, there is a clear gap between the ‘Drop-DTW’ column and ‘drop then DTW’ column. Could you point out the reason?

As clarified in the response to all reviewers, Drop-DTW solves the alignment and dropping problem in a **unified** framework that is fundamentally different from the naive approach of first thresholding the cost matrix and then aligning the remaining elements. In the latter case, neither the dropped elements nor the alignment is optimal, which yields worse results, as shown in Table 1 of our paper. Regarding the differentiability of Drop-DTW, the algorithm is always differentiable with respect to the model's parameters; however, it is not differentiable with respect to the hyperparameters, such as the percentile (or the value) of the drop cost.

>* The clarity of the paper can be improved. Table 1 is not clear, Figure 3 can have more legends on the matrix. Some notation should be consistent.

We thank the reviewer for pointing out these clarification points, which we will address as suggested in the revised manuscript.

## Reviewer 2 - VyHZ

>* The approach relies heavily on [Hadji et al].

We only use the SmoothMin approximation of the min operator proposed in \[Hadji et al\] (i.e., ref [1] in our paper) to enable differentiability. This is not a key component of our contribution, and other differentiable min operators, such as SoftMin \[Cuturi et al\] (i.e., ref [4] in our paper), or even the standard (hard) min operator, can be used as well. The table below displays the performance of Drop-DTW on step localization when trained with different min approximations:

| CrossTask | Recall | Acc | IoU |
| --- | --- | --- | --- |
| Drop-DTW + SoftMin [4] | 46.1 | 70.8 | 32.9 |
| Drop-DTW + SmoothMin [1] (in paper) | 48.9 | 71.3 | 34.2 |
| Drop-DTW + Hard min | 48.3 | 71.2 | 34.3 |
| _[Hadji et al]_ | _43.1_ | _70.2_ | _30.5_ |

It is clear from this table that Drop-DTW, with several different min operators (rows 1-3), provides superior results to [Hadji et al] (row 4). This highlights the significance of our contribution and the lack of reliance on [Hadji et al].
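For reference, here are sketches of the two differentiable relaxations compared in the table (the hard min is simply `np.min`). The soft-min follows the standard log-sum-exp form of soft-DTW [4]; the weighted-average form shown for SmoothMin is our assumption and should be checked against the exact definition in [1]:

```python
import numpy as np

def soft_min(x, gamma=1.0):
    # Soft-min of soft-DTW [4]: -gamma * log(sum(exp(-x / gamma))),
    # computed with the min subtracted for numerical stability.
    x = np.asarray(x, dtype=float)
    m = x.min()
    return m - gamma * np.log(np.sum(np.exp(-(x - m) / gamma)))

def smooth_min(x, gamma=1.0):
    # Weighted-average relaxation (the form we assume for SmoothMin [1]):
    # sum(x * softmax(-x / gamma)).
    x = np.asarray(x, dtype=float)
    w = np.exp(-(x - x.min()) / gamma)
    return float(np.sum(x * w) / np.sum(w))
```

Under these forms, the log-sum-exp soft-min can dip below the true minimum, while the weighted average always stays within the range of its inputs; both converge to the hard min as `gamma` goes to 0.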
>* The core technical novelty is the extension of DTW to allow outlier dropping. However, this is not really novel... For example, subsequence variant of the DTW algorithm which allows partial matching also exists (e.g., see Meinard Muller. Information Retrieval for Music and Motion. Springer Verlag, 2007. or description of LCSS in https://www.cs.unm.edu/~mueen/DTW.pdf).

We thank the reviewer for pointing out these additional references, which we will include and discuss in our revised manuscript. Here, we clarify the distinctions with each reference:

* Meinard Muller discusses an approach that allows for dropping outliers **only** around endpoints. This approach is *exactly* the one used most recently in [6] and is referred to as OTAM in our paper. In comparison, our method is more general, as we support dropping **interspersed** outliers, including those around endpoints, thereby automatically supporting sub-sequence matching within the same unified algorithm. We discuss these distinctions in lines 72-74. Note that we include a direct comparison to [6] (denoted as OTAM) in Tables 2 and 3, as well as Figure 4, with Drop-DTW showing significant improvements.

* While LCSS allows skipping some elements during matching, there are 3 important distinctions from Drop-DTW: **1)** the notion of drop cost is lacking; **2)** the decision on whether to drop or keep a match is based on thresholding the similarity matrix using an "if statement", which is not differentiable; **3)** thresholding is a greedy operation that leads to a smaller space of possible alignments, i.e., if a match is thresholded, it cannot be recovered even if it leads to a better global solution. In contrast, Drop-DTW is fully differentiable and considers all the possible match pairs in the alignment computation, which allows Drop-DTW to find the optimal alignment, given the drop costs. In fact, LCSS is closely related to what we call drop-*then*-DTW, and a comparison to such an approach is provided in Table 1 of our paper, where we show that our Drop-DTW algorithm outperforms the naive approach of thresholding first and then aligning with DTW (i.e., drop-*then*-DTW). In addition, we also performed a direct comparison between LCSS and Drop-DTW for step localization inference, which shows that Drop-DTW is indeed superior (i.e., CrossTask: Ours \{Acc: **71.3**, IoU: **34.2**\} vs. LCSS \{Acc: 66.1, IoU: 23.7\}).

* Generalized Canonical Time Warping (GCTW) alternates between solving the temporal alignment using standard DTW and computing linear projections of data from different modalities using Canonical Correlation Analysis (CCA). Concretely, this means that the goal of CCA is to bring the representations of the two modalities into a common embedding space to facilitate the alignment. In contrast, we use a neural network to nonlinearly embed the two modalities, and our Drop-DTW functions as a loss to find a good representation that supports optimal alignment while dropping outliers in a unified framework. Also, GCTW uses CCA to project the representation to a lower-dimensional feature space, thereby dealing with noise in the *feature* space to some extent, but it is incapable of dropping portions of the inputs that coincide with outliers in the input sequences. In contrast, by design, our method handles noise at the semantic level (i.e., dropping portions of the video that do not have a match in the other modality, such as step descriptions).

>* For Experiments in Figure 3, how do you explain higher performance of Drop-DTW at 0 noise?

As mentioned on lines 168-169, for this experiment we match TMNIST-part (which contains partial trajectory executions) to TMNIST-full (which contains the full trajectories). Therefore, even under 0% noise, an optimal alignment should be able to drop outliers around endpoints (i.e., perform subsequence matching). At 0% noise, Drop-DTW shows better results because it drops the start and/or end elements of TMNIST-full sequences that are not present in the corresponding TMNIST-part sequence. In contrast, standard DTW does not have this capability. Please see the supplementary video for a visualization (from 00:35 to 00:50).

>* For Experiments in Figure 3, do you assume that amount of noise is know for each operating point? Alternatively, how is the drop cost set? This doesn't seem to be discussed.

The noise level is unknown in all experiments in Figure 3. As stated on line 175, the drop cost is set to 0.3, based on the validation data. The drop cost is fixed here, as we rely on features pre-trained on MNIST digit recognition (i.e., no further training is involved in this experiment). Notably, for all other experiments, the drop cost is set based on the percentile criterion defined in Equation 4.
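To make the percentile criterion concrete, here is a minimal sketch (Equation 4 in the paper defines the exact form; this per-matrix version is our simplification):

```python
import numpy as np

def percentile_drop_cost(C, p=0.3):
    """Drop cost as the p-th percentile of the matching costs in C
    (cf. Equation 4 in the paper; this per-matrix form is a sketch).
    Because the percentile tracks the cost distribution, the drop cost
    stays between the min and max matching cost as training evolves.
    """
    return float(np.percentile(C, 100.0 * p))

# e.g., with p = 0.3, 30% of all matching costs fall below the drop
# cost, so the percentile encodes the expected amount of outliers.
```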
## Reviewer 3 - 6MvG

>* ...Could the (exact) min not be retained and a subgradient computed for the optimal assignment?...

We thank the reviewer for the interesting suggestion. We implemented training with the hard min and show the results below, in comparison to the SmoothMin used in the paper:

| CrossTask | Recall | Acc | IoU |
| --- | --- | --- | --- |
| SmoothMin [1] (in paper) | 48.9 | 71.3 | 34.2 |
| Hard min | 48.3 | 71.2 | 34.3 |

The results show that it is possible to use the original min operator for training and to obtain performance comparable to the differentiable min approximation.
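Indeed, automatic differentiation already yields a valid subgradient through an exact min: the gradient flows through the attained minimum. A toy check of this behavior (the variable names and values are ours):

```python
import torch

# Three hypothetical DP-cell candidates; the hard min selects one of them.
a = torch.tensor(1.5, requires_grad=True)
b = torch.tensor(0.7, requires_grad=True)
c = torch.tensor(2.1, requires_grad=True)

out = torch.min(torch.stack([a, b, c]))  # exact (hard) min over candidates
out.backward()

# The subgradient is routed entirely to the argmin element:
print(a.grad, b.grad, c.grad)  # tensor(0.) tensor(1.) tensor(0.)
```

Building the Drop-DTW table with this hard min therefore remains trainable end to end, matching the comparable numbers reported above.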
>* My other bigger concern relates to the sensitivity of the trade-off between the "drop costs" and matching costs. It seems that if these are not carefully set then the model could learn that all elements are outliers and produce a trivial solution.

As long as we set the drop cost as a percentile of all the matching costs, we ensure that the drop cost lies somewhere between the min and max matching cost. Subsequently, this ensures that some (but not all) elements will be dropped. Here, we also provide an ablation on the effect of varying the percentile choice on the final results (on CrossTask):

| Drop-DTW percentile | Recall | Acc | IoU |
| --- | --- | --- | --- |
| p = 0.1 | 47.5 | 72.4 | 34.9 |
| p = 0.3 (in paper) | **48.9** | **71.3** | **34.2** |
| p = 0.5 | 46.2 | 69.5 | 31.0 |
| p = 0.7 | 45.3 | 68.4 | 31.0 |
| p = 0.9 | 44.6 | 67.7 | 30.0 |

The results demonstrate that Drop-DTW is rather robust to the choice of p. Importantly, as mentioned on lines 135-136, we argue that our percentile-based definition is advantageous, as it allows for an intuitive way to define the drop cost based on the expected amount of outliers present in the sequences.

## Reviewer 4 - wLu7

>* drop cost itself is very common in the field of bioinformatics, while the paper does not mention it.

We thank the reviewer for highlighting these related works. Here, we comment on the significant distinctions between our work and the suggested references. These clarifications will be included in the revised manuscript.

* While Needleman-Wunsch introduces a drop cost and allows dropping some elements from matching, it is much less general than Drop-DTW in terms of possible alignments. First, it does not allow multiple elements from one sequence to match the same element in the other sequence (i.e., the algorithm is restricted to choosing a diagonal path in cases of a match and dedicates the alternative paths to insertions or deletions). This has a *profound* negative effect on the step localization application, where the step sequence is typically an order of magnitude shorter than the video sequence. Concretely, in the video step localization task, Drop-DTW allows multiple frames to be matched to the same step, thus enabling step segmentation, whereas this is not possible with Needleman-Wunsch. To support these claims, we provide comparisons to Needleman-Wunsch here (on CrossTask):

  | | Recall | Acc | IoU |
  | --- | --- | --- | --- |
  | Needleman-Wunsch | 43.8 | 68.4 | 29.4 |
  | Drop-DTW | **48.9** | **71.3** | **34.2** |

  Clearly, training with Needleman-Wunsch degrades the performance on all the metrics, compared to training with Drop-DTW.

* Smith-Waterman is a variant of the Needleman-Wunsch algorithm, which further allows for sub-sequence matching (i.e., it further relaxes the matching endpoint constraints). In contrast, our method relaxes the matching endpoint constraints using the same self-contained algorithm, without requiring any additional considerations. In addition, Smith-Waterman relies on resetting negative matching costs to 0 to promote local matching, which makes the method less applicable for feature learning, as it is not clear how to implement a differentiable version of this operation.

>* The drop costs are set to various values in the experiments...

When training representations with Drop-DTW, the matching costs produced by the model evolve during training. To have the drop cost change accordingly, we define it as a percentile of the matching costs. Here, we provide results of setting the drop cost to various percentile values. In addition, we also present an alternative drop cost definition, where the drop cost is a function (parametrized by a 1-layer MLP) that takes the sequence x and the average step representation z as input, and produces a vector of drop costs as output (one drop cost for each element in x):

| Drop-DTW drop cost | Recall | Acc | IoU |
| --- | --- | --- | --- |
| p = 0.1 | 47.5 | 72.4 | 34.9 |
| p = 0.3 (in paper) | 48.9 | 71.3 | 34.2 |
| p = 0.5 | 46.2 | 69.5 | 31.0 |
| p = 0.7 | 45.3 | 68.4 | 31.0 |
| p = 0.9 | 44.6 | 67.7 | 30.0 |
| learned drop-cost | **49.7** | **74.1** | **36.9** |

The results demonstrate the robustness of the proposed Drop-DTW algorithm to various choices of the percentile. Interestingly, this simple instantiation of a learned drop cost yields the best overall performance, which demonstrates the adaptability of the Drop-DTW algorithm and its potential for future work. In the TMNIST experiment, we use a fixed pre-trained model (i.e., no training is involved), which allows us to find a constant threshold (using cross-validation) that reasonably separates positive and negative matches.
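For concreteness, here is a sketch of the learned drop-cost head described above (the layer sizes and the concatenation scheme are our assumptions, not the exact architecture behind the numbers in the table):

```python
import torch
import torch.nn as nn

class LearnedDropCost(nn.Module):
    """Sketch of the learned drop cost: a 1-layer MLP maps each frame
    feature, concatenated with the average step representation z, to a
    scalar drop cost (one per frame). Illustrative assumptions only.
    """
    def __init__(self, dim):
        super().__init__()
        self.head = nn.Linear(2 * dim, 1)

    def forward(self, x, z):
        # x: (N, dim) frame features; z: (dim,) average step representation.
        z_tiled = z.unsqueeze(0).expand(x.size(0), -1)                  # (N, dim)
        return self.head(torch.cat([x, z_tiled], dim=-1)).squeeze(-1)  # (N,) drop costs

# Hypothetical usage: drop_costs = LearnedDropCost(512)(frame_feats, step_feats.mean(0))
```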
>* please clarify the criteria for dropping in "drop then DTW" setting.

We drop an element from a sequence if its drop cost is lower than any alternative matching cost (this is exactly the criterion implemented in the drop-**then**-DTW sketch in our response to R-n8aa).

>* Is it possible to achieve similar performance to drop-DTW, if we carefully set the drop criteria?

The same drop cost used for Drop-DTW is used for drop-**then**-DTW in all our experiments. As shown in the table below, the best overall results with drop-**then**-DTW (at p=0.3) were already reported in the paper. We will include the following ablation (on CrossTask), which demonstrates that Drop-DTW outperforms drop-**then**-DTW for all choices of the drop percentile:

| drop-**then**-DTW percentile | Acc | IoU |
| --- | --- | --- |
| p = 0.1 | 67.1 | 14.7 |
| p = 0.3 (in paper) | 66.2 | 26.2 |
| p = 0.5 | 59.2 | 24.0 |
| p = 0.7 | 33.6 | 21.1 |
| p = 0.9 | 20.1 | 18.2 |
| _Drop-DTW (p=0.3)_ (in paper) | _71.3_ | _34.2_ |

>* does Table 1 show that "attention-based pooling" without time series information ($L_{clust}$ in the table) is better than the vanilla Drop-DTW? Why?

In the instructional video step localization experiment, training with only a sequence alignment loss of any form (i.e., Drop-DTW, softDTW, D3TW or OTAM) leads to a degenerate solution. This is not specific to our method but rather to the instructional video problem setup, i.e., aligning long (~200 elements) video sequences to short (~10 items) step sequences allows for degenerate solutions. However, when the training is regularized by using $L_{clust}$ as an additional loss, Drop-DTW shows a clear advantage over the other sequence alignment methods under **all** the metrics (see Table 3). This suggests that the contribution of Drop-DTW is significant. Importantly, note that for other experiments on real datasets that do not suffer from the same imbalance (e.g., Exp 4.2 and 4.3), Drop-DTW again shows superior performance without reliance on the regularization loss. To clarify this point, we will move the clustering loss definition to the main manuscript and include the corresponding discussion.

>* If the drop cost is determined by the percentile of the match costs (depending on the inputs), strictly speaking, it may not be called "fully differentiable".

The Drop-DTW algorithm is fully differentiable with respect to the model's parameters; however, it is not differentiable with respect to the hyperparameters. In this instantiation, the drop cost can be seen as a hyperparameter and, therefore, the method is differentiable. However, a learned drop cost can also be considered, as discussed above. In this case, the method becomes fully differentiable, including the drop cost.

>* I could not find any info on licenses and computational resources in the supplementary, contrary to the checklists... equation 6 should be...

Thanks for highlighting these issues and typos. All our experiments were run using a single V100 GPU. Because we start from pre-extracted feature representations, our training/inference procedure completes in less than 2 hours. Equation 6 is indeed missing minus signs and will be fixed in the revised manuscript. As for the licenses, we are using publicly available datasets with one of the following license types: Creative Commons Attribution-Share Alike 3.0 (MNIST), MIT (YouCook2), BSD-3 (CrossTask), a dataset-specific license (COIN), or no license (PennAction, AVE).

>* This regularization seems to be related to canonical time warping [4] and deep canonical time warping [5].

In Canonical Time Warping, the goal of the CCA step is to bring the representations of the two modalities into a common embedding space to facilitate the alignment via DTW. In our case, this projection corresponds to using a neural network, with a DTW variant as a loss function, to learn a strong representation via alignment (an alignment that supports dropping **interspersed** outliers, unlike standard DTW). Despite these differences, one can indeed draw parallels between CCA and our $L_{clust}$, since the goal of $L_{clust}$ is also to promote more coherent representations between sequences.
