# Self-supervised Learning using Video Data

### Understanding Uncertainty

Using SRNN as an example, at each time step we generate a prediction of $x_{t+1}$ by drawing a sample from the prior $p_\Theta(z_t|z_{t-1}, d_t)$, where $d_t$ is computed deterministically from $x_t$. We can use a measure of $z_t$ to quantify uncertainty, such as the entropy $H(p_\Theta(z_t|z_{t-1}, d_t))$ or the variance. Assuming we use entropy, we obtain a series of uncertainty maps $H_t \in R^{w \times h}$. Such a map lets us sample low-entropy regions at low resolution and high-entropy regions at high resolution, forming the foundation of online compression for tracking. We can use toy datasets to visualize the uncertainty map and derive metrics to measure it.

### The application of computational compression

Earlier we used $x_t$ to denote the coordinates of object $x$. For real video tracking, the above pipeline also works with visual features indexed at $x_t$, i.e. the input is $g(x_t) = [CNN(I_t[x_t]), x_t]$. The model is:

* Input: $I_t$, the image at time $t$; $x_t$, the location predicted at the previous time step.
* Feature: $g(x_t) = [CNN(I_t[x_t]), x_t]$, a concatenation of the feature extracted at the location and the location itself.
* Deterministic recursion: $d_t = f_d(d_{t-1}, g(x_t))$
* Stochastic transition: $z_t \sim f_z(z_{t-1}, d_t)$
* Prediction: $x_{t+1} \sim p(f_x(z_t, d_t))$

Assume the CNN has a parameter $r$ denoting resolution. If we make $r \in R^{w \times h}$ a function of $H_t$, the input to our model becomes $g(x_t) = [CNN(I_t[x_t], r(H_t[x_{t-1}])), x_t]$. Tracking with compression thus adds a loss term that minimizes the energy of $r$ (e.g. $||r||$), alongside the usual loss terms for tracking accuracy. This minimization can be achieved by sampling at a lower temporal and/or spatial resolution.

### The application of self-supervised learning

If the model works reasonably well, we can use a generative model to predict the next frame, $\hat I_{t+1} = w(I_t, \hat x_t)$; minimizing the reconstruction error across time frames gives us self-supervised learning. Two additional ideas:

* Minimizing reconstruction loss at the pixel level is difficult. Instead, we can measure the loss at a lower resolution to smooth out the errors.
* The uncertainty map also tells us which regions matter more. We can therefore add a regularization term borrowed from the information bottleneck: $L_{IB} = \alpha I(I_t, H_t \odot I_t) - I(I_{t+1}, H_t \odot I_t)$. The gradient of this term can help tune the model as if it were learning to attend.

The ability to generate future frames is useful for counterfactual reasoning, i.e. answering "what if" questions.

### The application of reasoning about still images

"Analysis by synthesis" is a principled approach to image understanding. However, it requires sophisticated signals such as object depth, as in the "dead leaves" model where objects are painted layer by layer, from front to back. Until LIDAR cameras are ubiquitous and the predominant image collections are depth-enabled, this approach will remain largely inaccessible beyond toy datasets.

Akin to the idea of counterfactual analysis, instead of asking the "what if" question, we can ask "what was." That is, the ability to train a video understanding model forward in time also gives us the ability to run it backward. Once this is done, given an image $I_t$ taken at time $t$, we can generate a set of images $\{\hat I_{t-\tau}; \tau > 0\}$. If an object is visible in $\hat I_{t-\tau}$ but no longer in $I_t$, then we know it is occluded, by what, and where.
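To make the pieces above concrete before turning to a project proposal, a few sketches follow. First, a minimal sketch of one SRNN-style step that produces both a prediction and an entropy-based uncertainty measure; the module names, dimensions, and the diagonal-Gaussian prior head are illustrative assumptions, not prescribed by the text.

```python
import math
import torch
import torch.nn as nn

class SRNNStep(nn.Module):
    """One step of an SRNN-style tracker: deterministic recursion d_t,
    a stochastic prior over z_t, and a prediction head for x_{t+1}.
    Sizes and the Gaussian parameterization are illustrative assumptions."""

    def __init__(self, feat_dim=64, d_dim=128, z_dim=32):
        super().__init__()
        self.f_d = nn.GRUCell(feat_dim, d_dim)             # d_t = f_d(d_{t-1}, g(x_t))
        self.prior = nn.Linear(d_dim + z_dim, 2 * z_dim)   # parameters of p(z_t | z_{t-1}, d_t)
        self.f_x = nn.Linear(d_dim + z_dim, 2)             # predicts coordinates x_{t+1}

    def forward(self, g_xt, d_prev, z_prev):
        d_t = self.f_d(g_xt, d_prev)
        mu, log_sigma = self.prior(torch.cat([z_prev, d_t], dim=-1)).chunk(2, dim=-1)
        z_t = mu + log_sigma.exp() * torch.randn_like(mu)  # sample from the prior
        # Entropy of a diagonal Gaussian: sum_i (log sigma_i + 0.5 * log(2*pi*e)).
        # In the text H_t is a spatial map; if z_t were shaped (B, z_dim, w, h),
        # summing over the channel dimension only would yield that map.
        entropy = (log_sigma + 0.5 * math.log(2 * math.pi * math.e)).sum(dim=-1)
        x_next = self.f_x(torch.cat([z_t, d_t], dim=-1))
        return x_next, d_t, z_t, entropy
```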
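Second, a sketch of how the entropy map could drive the resolution parameter $r$ and how the compression term enters the loss; `resolution_from_entropy`, the $[0.1, 1.0]$ range, and the energy weight are hypothetical choices.

```python
import torch
import torch.nn.functional as F

def resolution_from_entropy(H_t, r_min=0.1, r_max=1.0):
    """Map the entropy map H_t (w x h) to a per-location resolution in [r_min, r_max]:
    low-entropy regions are sampled coarsely, high-entropy regions finely."""
    H_norm = (H_t - H_t.min()) / (H_t.max() - H_t.min() + 1e-8)
    return r_min + (r_max - r_min) * H_norm

def tracking_with_compression_loss(pred_x, true_x, r, energy_weight=1e-3):
    """Usual tracking loss plus an energy penalty ||r|| that rewards coarse sampling."""
    return F.mse_loss(pred_x, true_x) + energy_weight * r.norm()
```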
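Finally, a sketch of the self-supervised reconstruction loss measured at reduced resolution, as suggested above; the downsampling factor is an assumed hyperparameter, and frames are assumed to be batched as (N, C, H, W) tensors.

```python
import torch.nn.functional as F

def low_res_reconstruction_loss(pred_frame, true_frame, scale=0.25):
    """Compare the predicted next frame against the real one at a lower
    resolution, smoothing out pixel-level errors."""
    pred_small = F.interpolate(pred_frame, scale_factor=scale, mode="bilinear", align_corners=False)
    true_small = F.interpolate(true_frame, scale_factor=scale, mode="bilinear", align_corners=False)
    return F.mse_loss(pred_small, true_small)
```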
## A proposal for a starting project