---
disqus: ierosodin
---

# L9-ConvolutionalNetworks

> Organization contact [name= [ierosodin](ierosodin@gmail.com)]

###### tags: `deep learning` `study notes`

==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==

* http://www.deeplearningbook.org/contents/convnets.html
* CNN
    * **Learning a hierarchy of feature extractors**
        * Each layer of the hierarchy extracts features from the output of the previous layer
    * **Use convolution in place of matrix multiplication in at least one of the layers of the neural network.**
        * Everything else stays the same
            * Maximum likelihood
            * Negative log-likelihood
            * Back-propagation
    * 1-D: regularly sampled time-series data
    * 2-D: images
    * 3-D: volume data, video
* Operation in 2D
    * Convolution: $S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n)K(i - m, j - n)$
    * **Cross-correlation (more convenient, because the kernel is not flipped)**
        * $S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i + m, j + n)K(m, n)$
* **Motivation**
    * **Sparse and local connectivity**
        * The kernel is usually much smaller than the input
        * Less memory and computation required
    * **Parameter (weight) sharing**
        * Use the same set of parameters for more than one function in a model
    * **Equivariant representation**
        * Automatically generalizes across spatial translations of the input
        * Convolution commutes with translation: $f(g(x)) = g(f(x))$
* Efficiency of Convolution
    * Comparison for the textbook's edge-detection example ($320 \times 280$ input image, two-element kernel, $319 \times 280$ output):

      | Compare | Convolution | Dense matrix | Sparse matrix |
      | :---: | :---: | :---: | :---: |
      | Stored floats | $2$ | $> 8 \times 10^9$ | $2 \times 319 \times 280 = 178{,}640$ |
      | Float muls or adds | $319 \times 280 \times 3 = 267{,}960$ | $> 16 \times 10^9$ | $267{,}960$ |

* For color images, the input/output of convolution are 3-D tensors and the kernel is a 4-D tensor
    * $Z_{i, j, k} = \sum_{l, m, n} V_{l, j + m - 1, k + n - 1}K_{i, l, m, n}$
* Pooling
    * In a typical convolutional layer, the convolution outputs first pass through a nonlinearity; that stage is sometimes called the detector stage, and pooling follows it.
    * **Pooling helps make the representation approximately invariant to small translations and rotations of the input.**
    * Functions
        * Maximum within a rectangular neighborhood (max pooling)
        * Average
        * L2 norm
        * Weighted average based on the distance from the central pixel
        * LSE (log-sum-exp): $y = \log \sum_i e^{x_i}$
* **Zero-padding Controls Output Size**
    * No zero-padding
        * Valid convolution
        * The output shrinks by $k - 1$ in each dimension
    * Just enough zero-padding to keep the size
        * Same convolution
    * Padding with $k - 1$ zeros on each side
        * Full convolution
        * Enough zeros are added that every input pixel is visited $k$ times, so the output grows
    * Input size to output size:
        * $\text{out} = \left\lfloor \frac{\text{in} + 2 \times \text{padding} - \text{dilation} \times (\text{kernel size} - 1) - 1}{\text{stride}} \right\rfloor + 1$
* Connection Methods
    * Local connections
        * Like convolution, but with no parameter sharing
        * $Z_{i, j, k} = \sum_{l, m, n}[V_{l, j + m - 1, k + n - 1}w_{i, j, k, l, m, n}]$
        * The weight tensor is 6-D
    * Tiled convolution
        * Cycles between $t$ groups of shared parameters (ordinary convolution has only one group)
        * $Z_{i, j, k} = \sum_{l, m, n}[V_{l, j + m - 1, k + n - 1}w_{i, j \% t, k \% t, l, m, n}]$
    * Partial connectivity between channels
        * Each output channel $i$ is a function of only a subset of the input channels
* Efficient Convolution Algorithm
    * A $d$-dimensional kernel is called separable if it can be expressed as the outer product of $d$ vectors.
    * Convolution with a separable $d$-dimensional kernel can be decomposed into $d$ one-dimensional convolutions, one with each vector.
    * This composed approach is significantly faster than performing one $d$-dimensional convolution with the outer-product kernel.
    * Complexity per output element: $O(w^d) \rightarrow O(w \cdot d)$ (see the sketch below)
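Below is a minimal NumPy/SciPy sketch (not from the original notes; the image size and the binomial smoothing kernel are just illustrative) checking that convolution with a separable $5 \times 5$ kernel equals the composition of two 1-D convolutions:

```python
import numpy as np
from scipy.signal import convolve2d

# A separable 5x5 kernel: the outer product of two 1-D binomial vectors.
v = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
v /= v.sum()
kernel_2d = np.outer(v, v)

rng = np.random.default_rng(0)
image = rng.standard_normal((128, 128))

# Naive approach: one 2-D convolution with the full kernel,
# O(w^2) work per output element.
out_2d = convolve2d(image, kernel_2d, mode="valid")

# Composed approach: a 1-D convolution along rows, then one along columns,
# O(2w) work per output element. Convolution is associative, and the full
# convolution of the row vector with the column vector is the outer-product kernel.
out_sep = convolve2d(convolve2d(image, v[None, :], mode="valid"),
                     v[:, None], mode="valid")

print(np.allclose(out_2d, out_sep))  # True, up to floating-point error
```

For a $k \times k$ separable kernel this replaces $k^2$ multiply-adds per output element with $2k$.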
* Initialization of Kernels
    * Randomly initialized
    * Designed by hand
    * Learned with an unsupervised criterion
        * For example, Coates et al. (2011) apply k-means clustering to small image patches, then use each learned centroid as a convolution kernel (sketched below).
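A rough sketch of that idea, assuming grayscale training images in an `(N, H, W)` NumPy array and scikit-learn's `KMeans`. The helper `kmeans_kernels` and its parameters are hypothetical, and Coates et al.'s actual pipeline also whitens the patches before clustering:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_kernels(images, patch_size=6, n_kernels=64, n_patches=10000, seed=0):
    """Cluster random image patches; return the centroids as convolution kernels."""
    rng = np.random.default_rng(seed)
    _, h, w = images.shape
    # Sample random patch locations from random images.
    idx = rng.integers(0, len(images), size=n_patches)
    ys = rng.integers(0, h - patch_size + 1, size=n_patches)
    xs = rng.integers(0, w - patch_size + 1, size=n_patches)
    patches = np.stack([
        images[i, y:y + patch_size, x:x + patch_size].ravel()
        for i, y, x in zip(idx, ys, xs)
    ]).astype(np.float64)
    # Normalize each patch to zero mean and unit variance before clustering.
    patches -= patches.mean(axis=1, keepdims=True)
    patches /= patches.std(axis=1, keepdims=True) + 1e-8
    km = KMeans(n_clusters=n_kernels, n_init=10, random_state=seed).fit(patches)
    # Each centroid is reshaped into one patch_size x patch_size kernel.
    return km.cluster_centers_.reshape(n_kernels, patch_size, patch_size)
```

The returned array can then be used directly as the weights of a convolution layer, instead of (or as a starting point for) gradient-based training.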