# C3D project related papers

###### tags: `lab-paper-reading`

* [C3D: Generic Features for Video Analysis](https://research.fb.com/blog/2014/12/c3d-generic-features-for-video-analysis/)

## 3D Convolutional Neural Networks for Human Action Recognition (IEEE Transactions on Pattern Analysis and Machine Intelligence'13, Washington State Univ., Dr. Shuiwang Ji) [[Link](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6165309)]

This paper presented pioneering work in applying 3D convolutional neural networks to video analysis tasks. The authors applied 3D convolution to extract inter-frame information for human action recognition. The cross-frame spatial and temporal information considered consists of gray, gradient-x, gradient-y, optflow-x, and optflow-y, placed in five separate channels. For the 3D convolution, the authors specified a kernel of size 7x7x3, i.e., 7x7 in the spatial dimensions (within a 2D frame) and 3 in the temporal dimension (across 3 different frames). For the overall network architecture, the authors followed the general CNN design principle of increasing the number of feature maps in later layers. In addition, to improve accuracy, they augmented the model with auxiliary outputs computed as high-level motion features; the loss function is a weighted sum of the losses induced by the true action classes and the auxiliary outputs. In the performance benchmark, they also combined the outputs of a variety of model architectures, such as RNN, TISR (temporally integrated spatial response), and SPM (spatial pyramid matching), to compare accuracy.

## MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (CVPR'17, Google, Andrew Howard)

> [Review: MobileNetV1 — Depthwise Separable Convolution (Light Weight Model)](https://towardsdatascience.com/review-mobilenetv1-depthwise-separable-convolution-light-weight-model-a382df364b69)

![](https://i.imgur.com/wyHU6S8.png =500x)

This paper introduced MobileNet, an efficient CNN targeting low inference latency for mobile applications. Its most significant contribution is the depthwise separable convolution operator, which factorizes a standard convolution layer into two steps: a depthwise convolution and a pointwise convolution. In the depthwise convolution, one filter is applied to each of the M input feature maps as a 2D convolution; in the pointwise convolution, the M intermediate feature maps are combined by a 1x1 convolution into N output feature maps. This reduces the computation of a convolution layer from DK * DK * M * N * DF * DF to DK * DK * M * DF * DF + M * N * DF * DF, i.e., 1/N + 1/DK^2 of the original cost. In addition, the 1x1 pointwise convolution avoids the memory-reordering overhead normally required before calling GEMM (general matrix multiply) routines, which simplifies an optimized implementation. In the overall MobileNet model, batch normalization and ReLU are applied after each convolution (after both the depthwise and the pointwise convolution). The model cuts multiply-add operations by over 4,000 million with only about 1% accuracy loss on ImageNet, and reduces the parameter count from 29.3 million to 4.2 million.
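
Below is a minimal PyTorch sketch of the depthwise separable block described above (my own illustration, not the paper's code; the channel sizes are arbitrary, not MobileNet's exact configuration):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution (one filter per input channel, via groups=M),
    then a 1x1 pointwise convolution that mixes the M maps into N maps.
    BatchNorm + ReLU follow each of the two convolutions, as in MobileNet."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))     # DK*DK*M*DF*DF mult-adds
        return self.relu(self.bn2(self.pointwise(x)))  # M*N*DF*DF mult-adds

x = torch.randn(1, 32, 56, 56)            # M = 32, DF = 56
block = DepthwiseSeparableConv(32, 64)    # N = 64, DK = 3
print(block(x).shape)                     # torch.Size([1, 64, 56, 56])
```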
Aside from the depthwise separable convolution, the authors introduced a width multiplier and a resolution multiplier to explore the trade-off between model size, computation cost, and accuracy. When the width multiplier is set below 25%, the accuracy of the model drops significantly. The experiments also show the robustness of the MobileNet architecture design: it still performs image classification with good accuracy on fine-grained image datasets, and, like a regular CNN, it can be applied to large-scale geolocalization, distillation of face features from a pre-trained network, and object detection applications.

## Learning Structured Sparsity in Deep Neural Networks (NeurIPS'16, Duke Univ., Dr. Yiran Chen and Dr. Hai Li) [[Link](http://papers.nips.cc/paper/6504-learning-structured-sparsity-in-deep-neural-networks.pdf)]

This paper proposed Structured Sparsity Learning (SSL) to address the irregular memory access that limits the speedup of CNNs under non-structured regularization methods such as sparsity regularization, connection pruning, and low-rank approximation. The authors explored structured pruning of the filter size, number of filters, filter shape, and depth of DNNs. They define group-wise penalty terms over vectors of the weight matrix and apply group Lasso regularization so that weights are pruned or retained in whole groups rather than individually. The resulting regular sparsity pattern improves cache locality and thus DNN inference speed. In the experiments, SSL is tested on LeNet and an MLP on MNIST (per-layer speedups of up to 7x-10x from pruning filters/channels and filter shapes), CifarNet on CIFAR-10 (about 3x speedup), and AlexNet on ImageNet (up to 5x speedup). The work also learns more efficient depth structures of ResNet, e.g., using a ResNet-18 to reach accuracy similar to ResNet-32.

> The group lasso [1] regulariser is a well-known method to achieve structured sparsity in machine learning and statistics. The idea is to create non-overlapping groups of covariates, and recover regression weights in which only a sparse set of these covariate groups have non-zero components. There are several reasons why this might be a good idea. Say, for example, that we have a set of sensors and each of these sensors generates five measurements. We don't want to maintain an unnecessary number of sensors. If we try normal LASSO regression, then we will get sparse components. However, these sparse components might not correspond to a sparse set of sensors, since they each generate five measurements. If we instead use group LASSO with measurements grouped by which sensor they were measured by, then we will get a sparse set of sensors. An extension of the group lasso regulariser is the sparse group lasso regulariser [2], which imposes both group-wise sparsity and coefficient-wise sparsity. This is done by combining the group lasso penalty with the traditional lasso penalty. In this library, I have implemented an efficient sparse group lasso solver being fully scikit-learn API compliant.
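
As a rough illustration of the group Lasso idea (my own sketch, not the SSL paper's formulation or code), the snippet below adds a penalty that sums the L2 norms of whole convolution filters, so entire filters are driven to zero together:

```python
import torch
import torch.nn as nn

def group_lasso_penalty(conv: nn.Conv2d) -> torch.Tensor:
    """Sum of L2 norms over groups, where each group is one output filter
    (shape in_ch x kH x kW); the penalty zeroes out whole filters rather
    than individual weights, giving a structured sparsity pattern."""
    w = conv.weight                      # (out_ch, in_ch, kH, kW)
    groups = w.view(w.size(0), -1)       # one row per filter group
    return groups.norm(p=2, dim=1).sum()

conv = nn.Conv2d(16, 32, kernel_size=3)
x = torch.randn(8, 16, 28, 28)
task_loss = conv(x).pow(2).mean()        # placeholder for the real task loss
lam = 1e-3                               # regularization strength (hyperparameter)
loss = task_loss + lam * group_lasso_penalty(conv)
loss.backward()
```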
![](https://i.imgur.com/QyoGIhq.png =500x)
![](https://i.imgur.com/VHgUB4p.png =500x)
![](https://i.imgur.com/rU1P2uK.png =500x)

> https://stats.biopapyrus.jp/sparse-modeling/group-lasso.html

## CondConv: Conditionally Parameterized Convolutions for Efficient Inference (NeurIPS'19, Google Brain, Quoc Le's group) [[Link](https://arxiv.org/pdf/1904.04971.pdf)]

This paper presented CondConv, a conditionally parameterized convolution operator that aims to improve CNN inference accuracy with only a small increase in computation cost. It addresses the common practice of improving CNN accuracy by scaling up the model, which is computationally expensive. Previous conditional-computation works require learning (training) discrete routing decisions for individual examples. Unlike these approaches, CondConv does not require discrete routing, so the model can be optimized with plain gradient descent. At inference time, the CondConv operation can be seen as combining the learned kernels of several experts through example-dependent weights. These weights come from a per-example routing function consisting of three steps: global average pooling, a fully-connected layer, and a sigmoid activation. In this way, the routing function adapts local operations using global context. The authors describe two ways of running CondConv: either compute the combined kernel for each example in a batch and convolve with it, or use the mixture-of-experts formulation directly, performing a batch convolution on each expert branch and summing the weighted outputs. Which one is more efficient depends on the training batch; when the CondConv layer has a small number of experts, the latter approach is more efficient for training. Finally, the authors evaluated CondConv on several state-of-the-art CNN architectures. It achieves 78.3% top-1 accuracy with only 413M multiply-adds, an overhead of just 24M multiply-adds for a 1.1% accuracy improvement over the baseline.

## Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-offs by Selective Execution (AAAI'18, U. Michigan, Dr. Jia Deng's group) [[Link](https://arxiv.org/pdf/1701.00299.pdf)]

This paper proposed the Dynamic Deep Neural Network (D^2NN), which uses selective execution in a neural decision graph. D^2NN extends a DNN with controller modules that select the execution path. The controller modules are trained by integrating the backpropagation phase with reinforcement learning (since the control decisions are non-differentiable). D^2NN can be viewed as an input-dependent execution method, but unlike previous methods in this category, it learns control decisions that directly target model efficiency. D^2NN has four variant architectures: high-low, cascade, chain, and hierarchy. (1) The high-low capacity D^2NN is motivated by saving computation by routing easy examples through a low-capacity sub-network; the Q-learning reward is defined as a linear combination of accuracy and efficiency. (2) The cascade D^2NN is inspired by the standard cascade design in computer vision: it rejects negative examples early using simple features, and each control node in the DAG decides whether to execute the next cascade stage. It achieves a nearly optimal trade-off, reducing computation significantly with only a small loss of accuracy. (3) The chain D^2NN targets the scenario where some configurations of a DNN model (e.g. the number of layers, filter sizes) cannot be fully decided in advance; it can also simulate shortcuts between any two layers via identity functions. Although the DAG can contain an exponential number of paths, the results show that comparable accuracy can be reached with some computation savings. (4) The hierarchical D^2NN first classifies images into coarse categories and then into fine categories; it matches the accuracy of the full network with about half the computation cost.

## Deformable Convolutional Networks (AAAI'18, Microsoft Research Asia, Dr. Jifeng Dai's group) [[Link](https://arxiv.org/pdf/1703.06211.pdf)]

> [[author info](https://scholar.google.com.hk/citations?user=SH_-B_AAAAAJ&hl=zh-CN)]
> the author of R-FCN
> also the team that developed ResNet and Mask R-CNN with Kaiming He

In this paper, the authors proposed the Deformable Convolutional Network (DCN), which consists of deformable convolution layers and deformable RoI pooling layers. The idea is to augment the spatial sampling locations in convolution and RoI pooling with additional offsets and to learn those offsets from the target task. The receptive field and sampling locations become more adaptive, which enhances localization capability, especially for non-rigid objects. In the deformable convolution, the authors use a grid R to define the receptive field size and dilation (the grid corresponds to one unit of computation in convolution: a kernel multiplies a patch of the input of the same size). In the implementation, sampling at an arbitrary (fractional) location is realized via bilinear interpolation: the value at a sampling point is the sum, over all integral spatial locations in the feature map, of the bilinear interpolation kernel between the sampling point and that location, multiplied by the feature value there. In the experiments, the authors measured the capability of DCN as a feature extractor for object detection algorithms and for semantic segmentation. DCN brings accuracy improvements when its deformable convolution is applied in 3-6 layers. The authors suggest that DCN adds only a small overhead in model parameters and computation while improving accuracy.

## Low-Rank Approximations for Conditional Feedforward Computation in Deep Neural Networks (arXiv 2014, U. Tennessee, Dr. Itamar Arel's group) [[Link](https://arxiv.org/pdf/1312.4461.pdf)]

> [[review rebuttal](https://openreview.net/forum?id=6rEnMF1okeiBO)]

This paper proposed conditional feedforward computation in fully-connected neural networks. The authors proposed an activation-estimation network that exploits the ReLU non-linearity, which keeps only positive values, to predict the signs of the matrix-multiplication outputs beforehand. The paper was submitted to ICLR 2014 but rejected. According to the anonymous reviews I found, the paper was criticized for misinterpreting "conditional computation" (as proposed by Yoshua Bengio). It led me to look into terms such as SVD (singular value decomposition), low-rank approximation, and the concept of "conditional computation" proposed by Yoshua Bengio and other scholars.

## TSM: Temporal Shift Module for Efficient Video Understanding (ICCV'19, MIT, Dr. Song Han's group) [[Link](https://arxiv.org/pdf/1811.08383.pdf)]

![](https://i.imgur.com/CeDRPdN.png =500x)
![](https://i.imgur.com/L5CGYTS.png =500x)

This paper proposed the Temporal Shift Module (TSM) for efficient inference in video recognition tasks on mobile devices.
The goal of applying TSM is to achieve temporal modeling for video analysis at the computation cost of 2D convolution. For offline video inference, they apply bi-directional TSM for higher throughput: both past and future frames can be mixed into the current frame for temporal modeling. For real-time video inference, they apply uni-directional TSM, since only past frames can be mixed into the current frame; this allows low-latency inference while preserving accuracy. The authors also identified two sources of performance degradation in a naive temporal shift: large data movement and accuracy loss. They therefore adopted two techniques: partial shift (shift only a fraction of the channels, reducing data movement) and residual shift (place the shift inside a residual branch so the original activations are preserved). In the experiments, residual TSM performs better than in-place TSM: the output feature map is the sum of the input feature map and the result of shifting it and applying the convolution. The accuracy of the overall model depends on the proportion of shifted channels, and performance peaks at 1/4 (1/8 for each direction). In summary, the TSM module is flexible in that it can be inserted directly into the residual blocks of a 2D CNN; each insertion enlarges the temporal receptive field by 2, as if a convolution were run along the temporal dimension. The design also allows easy deployment to hardware: existing optimized 2D CNN libraries such as cuDNN can be used directly.

## Convolutional Two-Stream Network Fusion for Video Action Recognition (CVPR'16, University of Oxford, Dr. Andrew Zisserman's Group) [[Link](https://arxiv.org/pdf/1604.06573.pdf)]

> the team that proposed the VGG network

This paper proposed a CNN architecture for video recognition that fuses spatial and temporal information (feature maps). The model processes spatial and temporal knowledge separately in two streams; the authors define possible fusion functions for combining the two kinds of information and experiment with the most effective point in the network to merge the two streams. For spatial fusion, they discuss sum fusion, max fusion, concatenation fusion, conv fusion, and bilinear fusion (the matrix outer product of the two features); for temporal fusion, they define 3D pooling (over neighboring frames) and 3D conv + pooling (for spatio-temporal information). Additionally, they define late fusion as averaging the prediction-layer outputs. Combining these components, the authors report a fused DNN architecture with over 80% accuracy on video recognition on the UCF101 and HMDB51 datasets.

> Note the notation defined for the temporal network here [color=red]

## Learning Temporal Pose Estimation from Sparsely-Labeled Videos (NeurIPS'19, Facebook AI, U. Penn, Du Tran and Jianbo Shi) [[Link](https://research.fb.com/wp-content/uploads/2019/11/Learning-Temporal-Pose-Estimation-from-Sparsely-Labeled-Videos.pdf?)]

* PoseWarper architecture

![](https://i.imgur.com/Wn3FrNi.png)
![](https://i.imgur.com/OZitlCh.png)

* `Pose annotation propagation`: During training, we force our model to warp the pose heatmap fB from an unlabeled frame B such that it matches the ground-truth pose heatmap in a labeled frame A. Afterwards, we can reverse the application direction of our network.
This then allows us to propagate pose information from manually annotated frames to unlabeled frames.
* `Spatiotemporal Pose Aggregation at Inference Time`: Instead of using our model to propagate pose annotations on training videos, we can also use our deformable warping mechanism to aggregate pose information from nearby frames during inference in order to improve the accuracy of pose detection.

## Dynamic Kernel Distillation for Efficient Pose Estimation in Videos (ICCV'19, U. Singapore, Dr. Jiashi Feng's Group)

This paper proposed a Dynamic Kernel Distillation (DKD) approach for efficient pose estimation in videos.

## D3D: Distilled 3D Networks for Video Action Recognition (WACV'20, U. Mich, Dr. Jia Deng's Group) [[Link](https://arxiv.org/pdf/1812.08249.pdf)]

* uses optical flow for the temporal-stream prediction (to model motion)
* **optical flow**: invariant to texture and color, making it difficult to overfit to video datasets
* action recognition performance is not well correlated with optical flow accuracy, except *near motion boundaries* and *in areas of small displacement*
* better or cheaper motion representations can be used in place of optical flow
* **distillation**: a way of transferring knowledge from a teacher network to a (typically smaller) student network by training the student to reconstruct the teacher's output
* we can distill the temporal stream into the spatial stream and still benefit from its motion representation (no need to explicitly compute motion vectors at inference time)

:::info
**Are 3D CNNs capable of learning sufficient motion representations on their own?**
* train a spatial-stream 3D CNN to produce optical flow
* C3D has a limitation in its training procedure for producing accurate optical flow: we demonstrate that **3D CNNs do not learn sufficiently accurate optical flow when trained on action recognition**, and that **they can learn much more accurate optical flow when trained explicitly to do so**
:::

#### Optical Flow Decoder
* to evaluate the motion representations in the hidden features, we constrain the decoder such that it is unable to learn motion patterns beyond what is already learned by the 3D CNN
* operates on a single frame at a time (no temporal convolution)
* mimics the optical flow prediction network from PWC-Net [31]
* the output of the decoder is the motion representation introduced by Im2Flow [10], which consists of three channels that encode optical flow: (mag, sin θ, cos θ)
    * mag: magnitude
    * θ: angle
    * the flow vector at each pixel
* uses TV-L1 optical flow [42] as the motion representation [10, 37, 24]
* evaluation: learned optical flow is measured by endpoint error (EPE)
* 1) we freeze the 3D CNN and train only the decoder. This setting tests what motion representations the 3D CNN learns naturally when trained on action recognition.
* 2) we fine-tune the decoder and the 3D CNN end-to-end. This setting tests what motion representations a 3D CNN can learn when optimized specifically for this purpose.
* 2 > 1: the end-to-end setting learns noticeably better optical flow, i.e., there is room to improve the motion representations the spatial stream learns on its own

#### Distill 3D Network
* goal: incorporate motion representations from the temporal stream into the spatial stream
* using distillation, that is, by optimizing the spatial stream to behave similarly to the temporal stream. Our approach uses the learned temporal stream from the *typical two-stream pipeline* (TV-L1) as a teacher network, and the spatial stream as a student network.
During training, we distill the knowledge from the teacher network into the student network.
* intuition: ==*"use the spatial stream in a 2D CNN to capture the temporal stream through distillation"*==
* ![](https://i.imgur.com/Y83IbMZ.png =500x)
* λ = 1 works well in most cases
* Training
    * step 1: train the temporal stream on TV-L1 flow as the teacher network
    * step 2: train the spatial stream (as the student network) using the distillation procedure

#### Result
* better than the baseline but not the fine-tuned one
* the best accuracy comes from the ensemble of D3D and S3D-G
* In action recognition, some actions depend more on appearance and others more on motion. A 3D CNN with RGB input tends to be optimized to focus on appearance, while a 3D CNN with optical flow input tends to be optimized to focus on motion. This is why the above conclusion holds: "when RGB is used as the input to a 3D CNN, the motion information it captures is indeed far less than with optical flow as input. Fundamentally the 3D CNN is not to blame, the data is: a 3D CNN is fully capable of extracting spatio-temporal information from video, but if you keep feeding it RGB, the network will overfit toward appearance." [[ref](https://blog.csdn.net/zzmshuai/article/details/90903936)]

## ConvNet Architecture Search for Spatiotemporal Feature Learning (Facebook AI, Du Tran, Columbia, Dr. Shih-Fu Chang) [[Link](https://arxiv.org/pdf/1708.05038.pdf)]

#### Observations
* Using 4 frames of input and a depth-18 network (SR18) achieves good baseline performance and fast training on UCF101.
* For video classification, sampling one frame out of every 2-4 (for videos at 25-30 fps) and using clip lengths between 0.25s and 0.75s yields good accuracy.
* An input resolution of 128 (crop 112) is ideal for both computational complexity and accuracy of video classification given the GPU memory constraint.
* Using 3D convolutions across all layers seems to improve video classification performance.
* A network depth of 18 layers gives a good trade-off between accuracy, computational complexity, and memory for video classification.

#### Contribution
* proposed **Res3D**
* architecture search based on various block deployments

## Graph Distillation for Action Detection with Privileged Modalities (Stanford, Dr. Li Fei-Fei)

* multi-domain, multi-modality (e.g. pose estimation, optical flow, etc.) learning

## Forecasting Human Dynamics from Static Images (CVPR'17, U. Mich, Dr. Jia Deng) [[Link](https://arxiv.org/pdf/1704.03432.pdf)]

* We hypothesize that the global pose features encoded in the *low-resolution feature maps* are sufficient to drive the future predictions.

#### Network Architecture
![](https://i.imgur.com/IgN25zt.png =400x)
* 2D Pose Sequence Generator: hourglass network
    * encoder: turns input images into a "belief" about the current pose
    * decoder: decodes the "belief" about the current pose to generate pose heatmaps
    * [Training] pre-train the hourglass network by leveraging large human pose datasets that provide 2D body joint annotations
        * apply a *Mean Squared Error (MSE)* loss between the predicted and ground-truth heatmaps
* 3D Skeleton Converter: reconstructs the 3D skeleton from the 2D heatmaps output by the decoder of the hourglass network
    * [Training] exploit ground-truth 3D human poses from motion capture (MoCap) data
        * 1) we randomly sample a 3D pose and camera parameters (i.e. focal length, rotation, and translation)
        * 2) we then project the 3D keypoints to 2D coordinates using the sampled camera parameters, followed by constructing the corresponding heatmaps
        * ==> provides a training set that is diverse in both human poses and camera viewpoints
    * We apply an MSE loss to each output of ∆, T, and f, with equal weighting to compute the total loss.

![](https://i.imgur.com/Qu485bf.png)

## Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks (MIT, Google Research, Dr. William T. Freeman) [[Link](https://arxiv.org/pdf/1607.02586.pdf)]

* uses a probabilistic model to **sample and synthesize** many possible future frames from a single input image
* models the conditional motion distribution of future frames
* Future frame synthesis involves low- and high-level image and motion understanding.
* Cross Convolutional Network: encodes image and **motion information** as feature maps and **convolutional kernels**, respectively
* key components
    * We use a **conditional variational autoencoder** to model the complex conditional distribution of future frames.
    * Our network finds an intrinsic representation of intensity changes between two images, also known as the *difference image* or *Eulerian motion*.
    * Motion is modeled with a set of image-dependent convolution kernels operating over an image pyramid. The layer **convolves image-dependent kernels with feature maps from an observed frame** to synthesize a probable future frame (because the motion may differ from image to image).

#### related works
* [**Motion Priors**] Fleet et al. [2000] found that a local motion field can be represented by **a linear combination of a small number of bases**.
* [**Motion Prediction**] Walker et al. [2016] introduced a ==**variational autoencoder**== to model **pixel-wise correlations** in the motion field.
    * we aim to predict Eulerian motions and to synthesize future RGB frames, while they focus on predicting the (Lagrangian) motion field. Different from their method, our model **further learns feature maps and motion kernels jointly without supervision** via the newly proposed cross convolutional network.
* [**Image and Video Synthesis**] Srivastava et al. [2015] designed an LSTM network that synthesizes future frames in a sequence from a set of observed frames.
    * we build an image generation model that does not require a reference video at test time
* [**Parametric Texture Synthesis**] Early work in parametric texture synthesis developed a set of **hand-crafted features** that could be used to synthesize textures [Portilla and Simoncelli, 2000].
* Variational autoencoders [Kingma and Welling, 2014, Yan et al., 2015] have been **used to model and sample from natural image distributions**.
    * Our proposed algorithm is based on the variational autoencoder, but unlike this previous work, we **also model temporal consistency**.

#### Cross Convolutional Network
![](https://i.imgur.com/SVq1gMT.png =500x)

---

## Understanding image representations by measuring their equivariance and equivalence (U. Oxford, Dr. Andrea Vedaldi) [[Link](https://arxiv.org/pdf/1411.5908.pdf)]

#### 3 notable properties of representations
* **equivariance**: the transformation of the input image can be transferred to the representation output
* **invariance**: a special case of equivariance obtained when the transferred mapping is the simplest possible transformation (the identity)
* **equivalence**: two heterogeneous representations are equivalent if there exists a map between them and the mapping is invertible

#### Examples of equivariance
* equivariant HOG transformation
    * `φ`: HOG feature extractor
    * `φ(x)`: HxW vector field of D-dimensional feature vectors or cells
    * `g`: image flipping around the vertical axis
    * `φ(x)` and `φ(gx)` are related by a well-defined *permutation* of the feature components
    * `Mg`: the *permutation* swaps the HOG cells in the horizontal direction and, within each HOG cell, swaps the components corresponding to symmetric orientations of the gradient
    * we have `φ(gx) = Mg φ(x)`
* translation equivariance in convolutional representations
    * *HOG*, *densely-computed SIFT (DSIFT)*, and *convolutional networks* are examples of convolutional representations in the sense that **they are obtained from local and translation-invariant operators**
    * up to sampling effects, each convolutional representation is **equivariant** to translations of the input image
    * My understanding: the feature maps and the input image are **equivariant** with each other under translation

#### learning the equivariant transformation `Mg`
* to find the equivariant transformation `Mg` for a representation `φ(x)`, or an equivalence map between two representations
* **affine transformation** [[background](https://www.zhihu.com/question/20666664)]
* given data x sampled from a set of natural images, learning amounts to optimising the regularised reconstruction error
* ![](https://i.imgur.com/frBJzmK.png =300x)
    * R: regulariser; l: regression loss
    * regularisation
    * loss

#### equivariance in CNNs: transformation layers

#### equivalence in CNNs: stitching layers
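
To make the translation-equivariance bullets above concrete, here is a small PyTorch check (my own sketch, not from the paper) that a conv layer's feature maps shift together with the input, i.e. `φ(gx) = Mg φ(x)` with `Mg` being the same translation applied in feature space:

```python
import torch
import torch.nn as nn

# phi: a single conv layer standing in for a convolutional representation
phi = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 3, 32, 32)
shift = 5                                             # g: translate 5 pixels to the right

gx = torch.roll(x, shifts=shift, dims=-1)             # g x
phi_gx = phi(gx)                                      # phi(gx)
Mg_phi_x = torch.roll(phi(x), shifts=shift, dims=-1)  # Mg phi(x): same shift in feature space

# identical away from the (zero-padded / wrapped) border columns, so compare the interior
print(torch.allclose(phi_gx[..., shift + 1:-1],
                     Mg_phi_x[..., shift + 1:-1], atol=1e-5))  # True
```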
{"metaMigratedAt":"2023-06-15T03:41:26.932Z","metaMigratedFrom":"Content","title":"C3D project related papers","breaks":true,"contributors":"[{\"id\":\"3482c6f5-6f88-499f-915c-d70caf199302\",\"add\":37579,\"del\":8879}]"}