# Rapid (Deep) Video Alignment

###### tags: `Bajaj - CVC Lab`

Paper: Learning by Aligning Videos in Time
https://arxiv.org/pdf/2103.17260.pdf

----------

# Questions

### Q1. Network design

I do not quite understand why they design the encoder like this (probably adapted from a previous paper). Is there any particular reason for the output to have shape {frame number, 128}, where each frame is represented by a 1-D array, instead of something like {frame number, 64, 64}? My current assumption is that this kind of output is simply easier to use with DTW.

* A1: the vectorization of an image (conversion from 64x64 to a single 128-dimensional vector) is just a convenient 1-D array representation, which additionally allows treating the frame sequence as a sequence (stream) of 1-D vectors.

### Q2. Input/output relationship

For the input, I believe the temporal difference between consecutive frames is constant. However, based on the paper, it seems the network changes the temporal position of frames for better alignment. The output of the encoder is shaped {frame number, 128}, so where is the temporal information stored?

* A2: Possibly encoded in the frame index, which is an ordered sequence (mapped to a frame time using the frame rate).

### Q3. Why does I*(x) do the work? (α < 0 for the loss?)

### Q4. It seems like they never reuse the alignment

# Summary

In general, the paper illustrates an interesting way to learn video representations by aligning separate videos. There are previous works that used deep learning to attack this problem, but they relied on supervised learning, which needs labeled data. Some researchers have used artificial image cues as labels to train networks for specific tasks such as colorization and object counting, while others let their networks predict future frames in a video and backpropagate using the real frames as labels. Labeled data is always hard to get, so unsupervised learning, which does not require labels, is more practical.

> The most important recent work on the temporal video alignment problem introduced soft-DTW, which shows how to use the traditional time-alignment method DTW on video alignment problems. However, simply training on soft-DTW in a purely unsupervised way will likely give a trivial result. This paper introduces a way to regularize such an unsupervised model so that it produces non-trivial results.

## Algorithm

### Notation

We have video $X = \{x_1, x_2, \ldots, x_n\}$ with $n$ frames and video $Y = \{y_1, y_2, \ldots, y_m\}$ with $m$ frames. The embedding of $X$ is $F(X) = \{f(x_1), f(x_2), \ldots, f(x_n)\}$, and the embedding of $Y$ is $F(Y) = \{f(y_1), f(y_2), \ldots, f(y_m)\}$.

### Step 1: Embedding

The input is just two arbitrary videos of different lengths, and a video is just a sequence of 2-D frames. To learn representations of the videos, we use an encoder network to transform the inputs into embeddings of shapes (n, 128) and (m, 128).

*(see Q1 above)*

### Step 2: DTW

--------------------------

## Question: DTW

#### 1.

In the paper on the gradient of soft-DTW (citation 12 of the current paper), the authors write: "consider in what follows multivariate discrete time series of varying length taking values in $Ω ⊂ R^p$. A time series can be thus represented as a matrix of p lines and varying number of columns." I don't know how to picture this.
Is "p" here means num of pixel so R^p represent a frame in the time series? * Ans: #### 2. ![](https://i.imgur.com/mVogQkh.png) I don't understand the mathematical meaning of cost matrix. I also don't understand what does alignment matrix even mean in this paper where they didn't mentioned it. * Ans ------------------------- To align two embedded videos F(x) and F(y) we need to find out *alignment that minimize frame to frame difference*. **Dynamic Time Wrapping(DTW)** is a tool to find that out by computing the minimum frame by frame difference between F(x) and F(y). DTW is just a 2D alignment matrix with size (m, n) that records the alignment with the least frame to frame difference between 2 videos. All cells in DTW has value {0,1} and if DTW[i,j] == 1, frame i of X and frame j of Y will be set aligned. In order to find out the alignment with least frame to frame difference, we will first compute a distance matrix D where D[i,j] == square(abs(F(xi)-F(yj))). Using D we can further create another matrix R where each cell R[i,j] is the sum of D[i,j] and the nearby minimal difference. ![](https://i.imgur.com/TJEHafB.png) The sum of values on shortest path from (0,0) to (m,n) of R is the current alignment cost between two embedded videos while the index will be the alignment. Calculating values in DTW requires to use of function min. min() is not a differentiable function so it is impossible to apply gradient descent on the DTW. So instead of DTW, the paper adopt soft-DTW metioned above which replaces min() with a function that is differentiable and can produce a similar result. ### Step 3 Temporal Regularization After applying DTW to 2 embedded videos, we can directly apply backtracking to it to minimize alignment cost. However, simply minimizing the alignment cost will provide a trivial result where the network will just try to minimize all the entries in DTW. We need to have some other factors that drag the network in another direction to compensate for this. For this purpose, the paper introduced Inverse Difference Moment (IDM) as regularization. Formula for IDM: ![](https://i.imgur.com/WxiCDoV.png) In this case, i and j are both indexes for one embedded video, and W(i, j) will become bigger when i and j are close. S(i,j) will be bigger if frame i and frame j are close (closer frames are similar to each other). To maximize I(x), the neural network will try embed close similar frames at a close time frames in the embeding space. Since deep learning is used to minimize cost, I(x) need to be further transformed to I'(x):![](https://i.imgur.com/3kZ5i0N.png) where D(x) is a self distance matrix that a similar frame will have a smaller value. Although frames f(xi) and f(xj) that are far away from each other will have small weights in I(x), they are still attracting each other and may lead to trivial answer where the network just simply centralize all the frames temporally. To solve this problem, the algorithm will ignore all those entries in D(x) where i and j have significant temporal difference by transforming I(x) into the following (I*(x)): ![](https://i.imgur.com/OGeWeNr.png) Here, σ is a threshold of difference between i and j while λ is a marginal value (both of them are hyper paramter that need to be tuned while training). I*(X) will decrease when (i-j <= σ) && (Dx(i,j)->0) which will let the network to embed close and similar frame in close time frames. 
### Step 4: Loss

Loss function:

![](https://i.imgur.com/yrFlcLl.png)

Minimizing the soft-DTW term drives the alignment cost down, while (I'(x) + I'(y)) prevents the network from producing the trivial result, since both I'(x) and I'(y) make the loss bigger when all frames are embedded too close to each other.

### Datasets

The researchers mainly used three datasets:

1. Pouring: videos of a human hand interacting with objects; 70 videos for training and 14 for validation.
2. Penn Action: videos of humans playing sports; 40 to 134 training videos and 42 to 116 validation videos per action, over 13 actions.
3. IKEA ASM: videos of humans assembling furniture; all videos of the Kallax Drawer Shelf are used, 61 for training and 29 for validation.

All videos are labeled, and the labels are used to evaluate the correctness of the model.

### Evaluation Metrics

After training, the network is frozen and an SVM classifier or linear regressor is trained on top of it to evaluate the learned representation. The main evaluation metrics are:

1. Phase Classification: the accuracy of an SVM classifying the phase of each frame.
2. Phase Progression: measures how well a linear regressor on the frozen embeddings predicts how far an action has progressed.
3. Kendall's Tau: measures how well two videos are aligned in time when frames are matched by nearest neighbors in the embedding space.
4. Average Precision: fine-grained frame-retrieval accuracy, computed as the ratio of retrieved frames with the same phase label as the query frame.

### Additional NOTES

* The goal of supervised learning is to learn a mapping that links an input to output objects, using examples of such pairs. This task is noticeably more difficult when the output objects have structure, i.e. when they are not vectors **(Bakir et al., 2007)**.

> Here each output object is a time series, namely a family of observations indexed by time. While it is tempting to treat time as yet another feature, and handle time series of vectors as the concatenation of all these vectors, several practical issues arise with this simplistic approach: time-indexed phenomena can often be stretched in some areas along the time axis (a word uttered at a slightly slower pace than usual) with no impact on their characteristics; varying sampling conditions may mean they have different lengths; time series may not be synchronized.

### The DTW paradigm

* Generative models for time series are usually built with such invariances in mind. These properties are typically handled through latent variables and/or Markovian assumptions **(Lütkepohl, 2005, Part I, §18)**.
* A simpler approach, motivated by geometry, lies in the direct definition of a discrepancy between time series that encodes these invariances, such as the Dynamic Time Warping (DTW) score (Sakoe & Chiba, 1971; 1978).
* DTW computes the best possible alignment between two time series **(the optimal alignment itself can also be of interest, see e.g. Garreau et al. 2014)** of respective lengths n and m by first computing the n × m pairwise distance matrix between their points and then solving a dynamic program (DP) using Bellman's recursion with a quadratic (nm) cost.

#### The DTW geometry

Because it efficiently encodes a useful class of invariances, DTW has often been used in a discriminative framework (with a k-NN or SVM classifier) to predict a real or class label output, and engineered to run faster in that context **(Yi et al., 1998)**.
> Recent works by Petitjean et al. (2011) and Petitjean & Gançarski (2012) have, however, shown that DTW can be used for more innovative tasks, such as time series averaging using the DTW discrepancy (see Schultz & Jain 2017 for a gentle introduction to these ideas). More generally, the idea of synthesizing time series centroids can be regarded as a first attempt to output entire time series using DTW as a fitting loss. From a computational perspective, these approaches are, however, hampered by the fact that DTW is not differentiable and is unstable when used in an optimization pipeline.

## SOFT-DTW

In parallel to these developments, several authors have considered smoothed modifications of Bellman's recursion to define smoothed DP distances (Bahl & Jelinek, 1975; Ristad & Yianilos, 1998) or kernels (Saigo et al., 2004; Cuturi et al., 2007). When applied to the DTW discrepancy, that regularization results in a soft-DTW score, which considers the soft minimum of the distribution of all costs spanned by all possible alignments between two time series.

* Despite considering all alignments and not just the optimal one, soft-DTW can be computed with a minor modification of Bellman's recursion, in which all (min, +) operations are replaced with (+, ×). As a result, both DTW and soft-DTW have quadratic time and linear space complexity with respect to the sequences' lengths. Because soft-DTW can be used with kernel machines, one typically observes an increase in performance when using soft-DTW over DTW for classification (Cuturi, 2011). Source code is available at https://github.com/mblondel/soft-dtw.

### Algorithm

Assume we have time series $x = \{x_1, x_2, \ldots, x_n\}$ and $y = \{y_1, y_2, \ldots, y_m\}$, together with a predefined cost function $δ$ that takes one element from each series and returns a value representing their difference. From $δ$ we build the cost matrix $Δ(x, y)$ of size $n \times m$, whose cell $(i, j)$ is $δ(x_i, y_j)$.

We also have a set $\mathcal{A}$ of alignment matrices, each of size $n \times m$ with binary entries {0, 1}: A[i, j] = 1 means the alignment path passes through cell (i, j). (The *expected* alignment matrix that soft-DTW produces later does have entries in [0, 1], which can be read as the probability of stepping through that cell.) Each alignment matrix represents a unique path from cell (1, 1) to cell (n, m) using only →, ↓, and ↘ moves. For example, one alignment matrix in $\mathcal{A}$ for n = 3, m = 4 looks like:

![](https://i.imgur.com/ko3zjdV.png)

#### Question: I am not sure whether I understand this alignment matrix correctly.

⟨A, Δ(x, y)⟩ gives the cost of alignment A, and the following two formulas aggregate the costs of all possible alignment matrices in different ways:

![](https://i.imgur.com/MiWF94d.png)

The two equations above can be unified into one using a newly defined operator:

![](https://i.imgur.com/bZob3wU.png)

So DTW and GAK are combined into:

![](https://i.imgur.com/GfIjq3d.png)

When γ = 0, dtw_0(x, y) == DTW(x, y); when γ > 0, dtw_γ(x, y) = −γ log GAK. This is what is called soft-DTW.

#### Question: I don't quite get why they put −γ log there.

Intuitively, the log is trying to cancel the effect of e while γ is trying to cancel the other γ under a_i, but e^(−a_i/γ) sits inside a ∑, so the math does not work out like that. (Self-answer: this should not matter in Algorithm 1 as long as all the a_i are taken into consideration.)

In both cases, dtw_γ can be computed using the following algorithm:

![](https://i.imgur.com/YboixhB.png)

#### (Where does the alignment matrix come into play here? The sketch below checks the recursion against a brute-force pass over all alignment matrices.)
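To make Algorithm 1 and the alignment-matrix question concrete, here is a small NumPy sketch (my own, not the paper's implementation; the toy data and function names are mine). It runs the forward recursion with the soft-min operator and, for tiny inputs, compares the result with a brute-force soft-min over the costs ⟨A, Δ(x, y)⟩ of every alignment matrix A. The two match, which is where the alignment matrices come into play: the recursion implicitly aggregates over all of them.

```python
import numpy as np

def soft_min(values, gamma):
    """min_gamma: hard min when gamma == 0, else -gamma * log(sum(exp(-a/gamma)))."""
    values = np.asarray(values, dtype=float)
    if gamma == 0.0:
        return values.min()
    z = -values / gamma
    zmax = z.max()                      # subtract the max for numerical stability
    return -gamma * (zmax + np.log(np.exp(z - zmax).sum()))

def soft_dtw(x, y, gamma):
    """Forward recursion (Algorithm 1): r[i, j] = delta(x_i, y_j) + min_gamma(...)."""
    n, m = len(x), len(y)
    delta = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost matrix (n, m)
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = delta[i - 1, j - 1] + soft_min(
                [r[i - 1, j - 1], r[i - 1, j], r[i, j - 1]], gamma)
    return r[n, m]

def brute_force(x, y, gamma):
    """Enumerate every alignment path (every alignment matrix A), collect <A, Delta>,
    then take min_gamma over all of them.  Only feasible for tiny n, m."""
    n, m = len(x), len(y)
    delta = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    costs = []
    def walk(i, j, acc):
        acc = acc + delta[i, j]
        if i == n - 1 and j == m - 1:
            costs.append(acc)
            return
        for di, dj in [(1, 0), (0, 1), (1, 1)]:               # down, right, diagonal moves
            if i + di < n and j + dj < m:
                walk(i + di, j + dj, acc)
    walk(0, 0, 0.0)
    return soft_min(costs, gamma)

rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 2)), rng.normal(size=(5, 2))
for gamma in [0.0, 0.1, 1.0]:
    print(gamma, soft_dtw(x, y, gamma), brute_force(x, y, gamma))
```

For γ = 0 the recursion reduces to classical DTW; for γ > 0 it returns the soft minimum over all alignment costs, which is what makes it differentiable.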
### Differentiation of soft-DTW

When γ = 0, a change in the input x is only cleanly reflected in dtw_0 when there is a single optimal alignment matrix; otherwise the effect may not show up, since dtw_0 only takes the value of one of

![](https://i.imgur.com/T9633YL.png)

When there is a unique alignment matrix A* with minimum alignment cost, dtw_0 is the minimum over a finite set of linear functions of Δ and is therefore locally differentiable with respect to the cost matrix Δ, with gradient A*.

#### Question: I don't understand why dtw_0 is locally differentiable with respect to Δ. Isn't the min function non-differentiable?

Since dtw_0 is locally differentiable with respect to Δ, we can use the chain rule to see how it is differentiable with respect to the input x:

![](https://i.imgur.com/NI7sdKM.png)

#### Question: I don't understand what the transpose is doing there. Sorry for my bad math :(

However, the gradient of dtw_0 with respect to x is not continuous, which hampers the performance of gradient descent.

**When γ > 0**, soft-DTW satisfies

![](https://i.imgur.com/0jCGqL2.png)

where E[A] is the average alignment matrix under the Gibbs distribution

![](https://i.imgur.com/y9QRXh7.png)

When γ > 0 there is no min operator inside the equation, so the gradient can be written explicitly. However, computing the average alignment matrix with a traditional forward-style algorithm costs roughly O(n²m²) time, so the paper introduces a backward computation that reduces this to O(nm).

### Algorithmic differentiation

Using Algorithm 1:

![](https://i.imgur.com/a2UYSOX.png)

we can build the accumulated-cost matrix R whose cell (i, j) stores

$$r_{i,j} = \delta(x_i, y_j) + {\min}^{\gamma}\{r_{i-1,j-1},\; r_{i-1,j},\; r_{i,j-1}\}.$$

Since $r_{i,j}$ is used in the computation of $r_{i+1,j+1}$, $r_{i+1,j}$ and $r_{i,j+1}$, the chain rule gives a simple relation:

![](https://i.imgur.com/ensOfwB.png)

$e_{i,j}$ is introduced just to simplify notation. We know that:

![](https://i.imgur.com/qKucARP.png)

Differentiating $r_{i+1,j}$ with respect to $r_{i,j}$ we get:

![](https://i.imgur.com/xTOt5yc.png)

which can easily be transformed into:

![](https://i.imgur.com/SXFRN3x.png)

Applying the same reasoning to the other two successors gives:

![](https://i.imgur.com/6zpNsEY.png)

Now we can compute, for any cell of the cost matrix, its gradient with respect to the cells that influence it; in other words, we can compute a gradient matrix $E = [e_{i,j}]$.

~~#### Question: I don't understand this part:~~

![](https://i.imgur.com/PMNFjck.png)

In conclusion, here is the algorithm:

![](https://i.imgur.com/DVG0esU.png)

#### Learning with the soft-DTW loss

Goal: use soft-DTW to compute Fréchet means (1948) of time series. We have N time series $y_1, \ldots, y_N$, each possibly of a different length $m_i$ but all of the same dimension p. Using soft-DTW as the discrepancy, we solve the following problem:

![](https://i.imgur.com/MR6AzJV.png)

#### (I guess they are using gradient descent here to find the x that minimizes this objective function? But they never describe it here. I believe that, for backpropagation to work, the gradient matrix $E = [e_{i,j}]$ has to act as the signal that updates x, right? See the sketch below.)

(NOTE: in this equation we assume x has a fixed length n. Although the $y_i$ and x may have different numbers of columns (frames), they all have the same dimension.) Since we are summing discrepancies over series of different lengths, each term is divided by $m_i$ to normalize for length. The $λ_i$ are just normalized weights with $\sum_i λ_i = 1$.
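Here is a hedged NumPy sketch of the backward recursion and of the barycenter question above. It follows my own reading of the algorithm in the screenshots; the variable names, the toy data, the learning rate, and the iteration count are mine, not the paper's code, and it assumes γ > 0 and the squared Euclidean cost. The backward pass fills the gradient matrix E = [e_{i,j}], and the chain rule turns E into a gradient with respect to x.

```python
import numpy as np

def soft_dtw_grad(x, y, gamma):
    """Soft-DTW value, the expected alignment matrix E, and d(dtw_gamma)/dx
    for the squared Euclidean cost.  Sketch of the backward recursion; gamma > 0."""
    n, m = len(x), len(y)
    delta = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)       # (n, m) cost matrix

    # --- forward pass (Algorithm 1), kept in a padded (n+2) x (m+2) table ---
    r = np.full((n + 2, m + 2), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = np.array([r[i - 1, j - 1], r[i - 1, j], r[i, j - 1]])
            rmin = prev.min()                                     # stabilized soft-min
            r[i, j] = delta[i - 1, j - 1] + rmin - gamma * np.log(
                np.exp(-(prev - rmin) / gamma).sum())

    # --- backward pass: e[i, j] = d r[n, m] / d r[i, j] ---
    d = np.zeros((n + 2, m + 2)); d[1:n + 1, 1:m + 1] = delta
    e = np.zeros((n + 2, m + 2)); e[n + 1, m + 1] = 1.0
    r[:, m + 1] = -np.inf; r[n + 1, :] = -np.inf; r[n + 1, m + 1] = r[n, m]
    for j in range(m, 0, -1):
        for i in range(n, 0, -1):
            a = np.exp((r[i + 1, j]     - r[i, j] - d[i + 1, j])     / gamma)
            b = np.exp((r[i, j + 1]     - r[i, j] - d[i, j + 1])     / gamma)
            c = np.exp((r[i + 1, j + 1] - r[i, j] - d[i + 1, j + 1]) / gamma)
            e[i, j] = a * e[i + 1, j] + b * e[i, j + 1] + c * e[i + 1, j + 1]
    E = e[1:n + 1, 1:m + 1]                                       # soft alignment, entries in [0, 1]

    # chain rule for delta(x_i, y_j) = ||x_i - y_j||^2:
    # d dtw / d x_i = sum_j E[i, j] * 2 * (x_i - y_j)
    grad_x = 2.0 * (E.sum(1)[:, None] * x - E @ y)
    return r[n, m], E, grad_x
```

As a usage example, one simple way to attack the barycenter objective of this section is plain gradient descent on x with the weighted, length-normalized soft-DTW sum (step size and iteration count below are arbitrary):

```python
rng = np.random.default_rng(0)
ys = [rng.normal(size=(m, 2)) for m in (5, 6, 7)]   # toy series, same dimension p = 2
lam = np.ones(len(ys)) / len(ys)                    # normalized weights, sum to 1
x = rng.normal(size=(4, 2))                         # barycenter candidate with fixed length
for step in range(100):
    grad = sum(l / len(y) * soft_dtw_grad(x, y, gamma=1.0)[2] for l, y in zip(lam, ys))
    x -= 0.1 * grad                                 # gradient descent step
```

This is only my guess at the optimization loop (the section never spells it out); the main point is that E is what carries the gradient back to x.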
#### Non-convexity

Although the map x → ⟨A, Δ(x, y)⟩ shares the convexity/concavity of δ, min and min_γ only preserve concavity: if the original functions are concave, their (soft) minimum stays concave, but if they are convex, taking the minimum does not keep them convex. Since the paper uses the squared Euclidean distance as δ (which is convex), the objective in Equation 4 is non-convex, so the solutions found in the experiments are only local optima rather than guaranteed global ones.

#### Smoothing soft-DTW

#### Question

The paper suggests that dtw_γ converges to the sum of all costs as γ → ∞. But when γ → ∞, shouldn't dtw_γ become −∞ as well? This part is supposed to help with the problem that soft-DTW isn't convex.

# Kalman / sparse Gaussian process

### Experiments

![](https://i.imgur.com/q7G52pW.png)
![](https://i.imgur.com/1FDxOTB.png)
![](https://i.imgur.com/XF2Tdbz.png)

# VAEs