# 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks

## Reason/Topic

The challenges in 3D-videos for high-level perception tasks:

1. 3D data requires heterogeneous representations <=> compatibility in a larger system
2. the performance of 3D CNNs is worse than or on par with 2D CNNs
3. limited number of open-source libraries for fast and large-scale 3D data

For these challenges, they adopt a sparse tensor and propose the generalized sparse convolution. The reasons they chose the sparse tensor:

1. expressiveness and generalizability for high-dimensional spaces (standard libraries support homogeneous data representations)
2. closely resembles the standard convolution, which has proven successful in 2D and 3D
3. efficiency and speed; reduction of memory and computation

Based on the generalized sparse convolution (which they use to overcome the exponential parameter increase in high-dimensional convolutions => custom kernels with non-(hyper)-cubic shapes), and because no one had tried 4D convolutional perception algorithms before, they propose a 4D spatio-temporal perception method that processes 3D-videos with high-dimensional convolutions. To enforce consistency, they propose high-dimensional conditional random fields in a 7D trilateral space (space-time-color) and use variational inference to convert the conditional random field into differentiable recurrent layers. As the experiments show, these 4D spatio-temporal convolutional neural networks outperform 2D/2D-3D hybrid methods and 3D CNNs, and are also robust to noise in 3D-videos.

## Sparse Tensor and Convolution

### Sparse tensor

- N-dimensional extension of a sparse matrix
- in COO format, which is efficient for neighborhood queries
  - **Coordinate (COO)**: preserves the positions/indices and the values, e.g. (row, col, val)
  - **Compressed Sparse Row (CSR)**: preserves the number of elements in each row, and the column offset and value of each element, e.g.
(offset, val), (number of elements)
- augmentation of batch indices, $b_i$

![](https://i.imgur.com/v3HiQxx.png)

### Generalized Sparse Convolution

- aims at generic input and output coordinates and at arbitrary kernel shapes
- decomposes the spatial weights into $K^D$ matrices $W_i$ of size $N^{out} \times N^{in}$
- $C^{in}$ and $C^{out}$ are not necessarily the same
- $N^D$: the set of offsets that defines the shape of a kernel
- $u$: the current center

![](https://i.imgur.com/EggShz6.png)

## Minkowski Engine

- named after Minkowski space, the space-time continuum in physics
- open-source auto-differentiation library
- homogeneous representation and convolutions
- convolution along the temporal axis, which has proven more effective in sequence modeling

### Sparse Tensor Quantization

- Quantization: the process of reducing the number of bits that represent a number.

The goal is to convert the inputs into unique coordinates and associated features, and to resolve label collisions.

* Variables
  * **k**: the keys computed from the sparse tensor coordinates *C* with a hash function
  * **i**: the sequence from $0$ to $N-1$
  * **l**: the target labels
* Algorithm
  1. *SortByKey*( val = (i, l), key = k ): rearrange the values so that values with the same key end up in a consecutive range
  2. *UniqueByKey*( val = (i), key = (k, l) ): preserve the first index among those that share the same **k** and **l**
  3. *ReduceByKey*( val = (l, i), key = (k), f ): collapse values that share the same key via *f*
* Functions
  - *SortByKey*: returns pair-wise values and keys, sorted in key order.
  - *UniqueByKey*: returns pair-wise values and keys, preserving only the first value in each consecutive range of values that share the same key
  - *ReduceByKey*: returns pair-wise values and keys, preserving one value per consecutive range of same-key values, computed by a custom function
  - *f* => (IGNORE_LABEL, i)
    - IGNORE_LABEL: ignore voxels with more than one unique target label (the key here is the pair of the tensor-hash key and the target label)

### Sparse Convolution

1. Generate $C^{out}$ given $C^{in}$
2. Build the kernel map that identifies how inputs affect outputs: $M = \{(I_i, O_i)\}_i,\ \forall i \in N^D$
3. Compute the convolution

### Max Pooling

1. Find the number of inputs per output coordinate and the indices of those inputs
2. Reduce the input features that map to the same coordinate with *MaxPoolKernel*, a custom CUDA kernel

### Global / Average Pooling, Sum Pooling

Uses the *cuSparse* library for sparse matrix-matrix and matrix-vector multiplication.

- *cuSparse*: a sparse BLAS library that belongs to the CUDA toolkit
- BLAS: Basic Linear Algebra Subprograms

[More details about BLAS library](https://chenhh.gitbooks.io/parallel_processing/blas/)

Similar to max pooling, there is an input-to-output kernel map $M$. However, dividing the pooled features by the number of inputs mapped to each output would remove density information, so they choose not to divide, and name the result **sum pooling**.

### Non-spatial Functions

- *ReLU*: applied directly to the features $F$.
- *Batch normalization*: 1D batch normalization applied directly on $F$, since each row of $F$ represents a feature.

## Minkowski CNN

Treats the time dimension as an extra spatial dimension.

Problems:

1. the computational cost and the increased number of parameters: later experiments show these do not really improve performance
2.
conventional cross-entropy loss: no incentive to make predictions consistent across space and time

### Tesseract Kernel and Hybrid Kernel

To deal with the first problem, which leads to over-parametrization, overfitting, high computational cost, and memory consumption, they use a hybrid kernel (non-hypercubic, non-permutohedral): a combination of a cross-shaped kernel and a conventional cubic kernel.

- Cubic kernel: captures the spatial geometry accurately
- Cross-shaped kernel: connects the same point in space across time

### Residual Minkowski Networks

The generalized sparse convolution allows strides and kernel shapes to be defined arbitrarily.

- 7x7 2D convolution -> 5x5x5x1 generalized sparse convolution
- U-shaped network: multiple strided sparse convolutions and strided sparse transpose convolutions, with skip connections that connect layers with the same stride size

## Trilateral Stationary-CRF

Deals with the second problem of the Minkowski CNN: the cross-entropy loss has no pair-wise terms.

CRF: conditional random field. ([info](https://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf))

<sub>A CRF is a type of discriminative undirected probabilistic graphical model.</sub>

- bilateral space: 2D space and 3D color
- trilateral space: 3D space, 1D time, and 3D chromatic space, as in 3D-videos

Constraint: only the stationarity condition is applied, with the generalized sparse convolution in 7D space.

### Definition

![](https://i.imgur.com/CjFVN0j.png)

* $P(X)$: the conditional distribution modeled by the CRF
* $x_i$: a CRF node in 7D space
* unary potential and pairwise potential (with neighbor $x_j$, under the stationarity condition)
* $Z$: the partition function
* $X$: the set of all nodes

**Camera extrinsics** are used to define the spatial coordinates of nodes in the world coordinate system, which allows stationary points to have consistent coordinates throughout time.
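As a toy illustration of the definition above (not the paper's 7D setup: just two nodes, two labels, and made-up potentials), the Gibbs form $P(X) = \frac{1}{Z} \exp(-E(X))$ with unary and pairwise terms can be computed by brute force:

```python
import itertools
import math

# Toy CRF: 2 nodes, 2 labels. All potential values here are illustrative.
unary = {0: [0.2, 1.0],   # unary potential of node 0 for labels {0, 1}
         1: [0.9, 0.3]}   # unary potential of node 1

def pairwise(xi, xj):
    # Stationary pairwise potential: same penalty for any disagreeing
    # neighbor pair, regardless of position (the stationarity condition).
    return 0.0 if xi == xj else 1.5

def energy(assignment):
    e = sum(unary[i][label] for i, label in enumerate(assignment))
    e += pairwise(assignment[0], assignment[1])  # the single neighbor pair
    return e

# Partition function Z sums exp(-E) over all joint label assignments.
states = list(itertools.product([0, 1], repeat=2))
Z = sum(math.exp(-energy(s)) for s in states)
P = {s: math.exp(-energy(s)) / Z for s in states}

assert abs(sum(P.values()) - 1.0) < 1e-9  # Z normalizes P(X)
```

Brute force is only feasible because the label space here is tiny; in the paper the exponential cost of $Z$ is exactly why the next section resorts to approximate inference.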
### Variational Inference

> In CRFs, exact inference is intractable. ([helpful intro](https://medium.com/@jonathan_hui/machine-learning-variational-inference-273d8e6480bb))

Major approximation methods:

1. Sampling
2. Variational inference

To minimize the divergence between the optimal $P(X)$ and an approximated distribution $Q(X)$, they use the *mean-field approximation*.

### Learning

The weighted sum in the fixed-point equation is equivalent to a generalized sparse convolution in 7D space, thanks to the stationarity constraint.

## Experiments

1. data augmentation and quantization: random scaling, rotation around the gravity axis, spatial translation, spatial elastic distortion, and chromatic translation and jitter
2. momentum SGD with the Poly scheduler
3. metrics: mean Intersection over Union (mIoU) and mean Accuracy (mAcc)
4. in batch mode

### Datasets

Specifically, elastic distortion, Gaussian noise, and chromatic shift in color are used for the noisy 4D Synthia experiments.

### Results

4D networks are more robust to noise. Also, applying the TS-CRF adds few parameters. Comparatively, 4D Minkowski Networks with TS-CRF perform better than the alternatives.

## Unclear words for me

1. LIDAR scans
2. PointNet variants [22, 23]
3. continuous convolutions [11, 15]
4. surface convolutions [20, 29]
5. an octree convolution [24]
6. shallow model - CRF
7. COO format
8. pixel -> voxel
9. Thrust library: a C++ template library for CUDA
10. BLAS library
11. cuSparse library
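To tie the earlier sections together, here is a minimal pure-Python sketch of the generalized sparse convolution: 1D coordinates, scalar channels, an arbitrary output coordinate set $C^{out}$, and a kernel given as offsets $i \in N^D$ with weights $W_i$. All names are illustrative; this is not the Minkowski Engine API.

```python
# Sparse input tensor in COO form: coordinate -> feature value (C_in = keys).
f_in = {0: 1.0, 1: 2.0, 4: 3.0}
# Kernel N^D as offset -> weight W_i (a 1D cross/cubic kernel of size 3).
offsets = {-1: 0.5, 0: 1.0, 1: 0.5}
# Output coordinates C_out can be chosen freely, independent of C_in.
c_out = [0, 1, 2, 4]

def gen_sparse_conv(f_in, offsets, c_out):
    # For each output center u, accumulate W_i * f_in[u + i] over the
    # offsets i in N^D that land on an existing input coordinate
    # (those (input, output) pairs form the kernel map M).
    f_out = {}
    for u in c_out:
        acc = 0.0
        for i, w in offsets.items():
            if u + i in f_in:
                acc += w * f_in[u + i]
        f_out[u] = acc
    return f_out

out = gen_sparse_conv(f_in, offsets, c_out)
# out == {0: 2.0, 1: 2.5, 2: 1.0, 4: 3.0}
```

Only coordinates that actually hold features contribute, which is where the memory and compute savings over dense convolution come from; the real engine does the same accumulation with $N^{out} \times N^{in}$ weight matrices per offset instead of scalars.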