---
disqus: ierosodin
---
# L9-ConvolutionalNetworks
> Organization contact [name=[ierosodin](mailto:ierosodin@gmail.com)]
###### tags: `deep learning` `study notes`
==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==
* http://www.deeplearningbook.org/contents/convnets.html
* CNN
* **Learning a Hierarchy of Feature Extractors**
        * Each layer of the hierarchy extracts features from the output of the previous layer
* **Use convolution in place of matrix multiplication in at least one of the layers of neural networks.**
* Everything else stays the same
* Maximum likelihood
* Negative log-likelihood
* Back-propagation
* 1-D: regularly sampled time-series data
* 2-D: images
* 3-D: volume data, video
* Operation in 2D
* $S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n)K(i - m, j - n)$
    * **Cross-correlation (more convenient; most libraries implement this and still call it "convolution", since it avoids flipping the kernel — see the sketch below)**
        * $S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i + m, j + n)K(m, n)$
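
A minimal NumPy sketch contrasting the two operations (helper names are made up): true convolution slides a flipped kernel, which is why libraries usually skip the flip and implement cross-correlation.

```python
import numpy as np

def cross_correlate2d(I, K):
    """S(i, j) = sum over m, n of I(i + m, j + n) * K(m, n), valid region only."""
    kh, kw = K.shape
    out = np.empty((I.shape[0] - kh + 1, I.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return out

def convolve2d(I, K):
    """True convolution = cross-correlation with the kernel flipped on both axes."""
    return cross_correlate2d(I, np.flip(K))

I = np.arange(25.0).reshape(5, 5)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate2d(I, K))  # differs from convolve2d(I, K) unless K is symmetric
```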
* **Motivation**
* **Sparse and local connectivity**
* Kernel is usually much smaller than input
* Less memory and computation required
* **Parameter (weight) sharing**
* Use the same set of parameters for more than one function in a model
    * **Equivariant representation**
        * Automatically generalizes across spatial translations of the input
        * $f(g(x)) = g(f(x))$, where $f$ is convolution and $g$ is any translation of the input
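
A quick numerical check of this equivariance, assuming circular shifts and wrap-around boundaries (SciPy's `correlate2d` with `boundary='wrap'`) so the identity holds exactly at the edges:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
img = rng.random((8, 8))
K = rng.random((3, 3))

g = lambda x: np.roll(x, 1, axis=0)                            # translate down one pixel
f = lambda x: correlate2d(x, K, mode='same', boundary='wrap')  # convolution stage

# f(g(x)) == g(f(x)): shift-then-convolve equals convolve-then-shift
assert np.allclose(f(g(img)), g(f(img)))
```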
* Efficiency of Convolution
    * Example from the textbook: detecting edges in a $320 \times 280$ image with a two-element kernel, producing a $319 \times 280$ output (the counts are reproduced in code below)

      | Compare | Convolution | Dense matrix | Sparse matrix |
      | :---: | :---: | :---: | :---: |
      | Stored floats | $2$ | $> 8 \times 10^9$ | $2 \times 319 \times 280 = 178{,}640$ |
      | Float muls or adds | $319 \times 280 \times 3 = 267{,}960$ | $> 16 \times 10^9$ | $267{,}960$ |
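
The counts follow directly from the example's dimensions:

```python
in_h, in_w = 320, 280          # input image
out_h, out_w = 319, 280        # output of the 2-element (valid) kernel

dense = (in_h * in_w) * (out_h * out_w)  # one weight per input-output pair
sparse = 2 * out_h * out_w               # two nonzero weights per output pixel
ops = out_h * out_w * 3                  # two muls + one add per output pixel

print(f"{dense:.2e}")   # 8.00e+09 -> the "> 8e9" dense entries
print(f"{sparse:,}")    # 178,640 stored floats in the sparse matrix
print(f"{ops:,}")       # 267,960 ops for convolution (with only 2 stored floats)
```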
* For color images, input/output of convolution are 3-D tensors and kernels are 4-D tensors
    * $Z_{i, j, k} = \sum_{l, m, n} V_{l, j + m - 1, k + n - 1} K_{i, l, m, n}$
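
A direct loop implementation of this formula (zero-based indices; the function name is made up):

```python
import numpy as np

def conv_channels(V, K):
    """Z[i, j, k] = sum over l, m, n of V[l, j + m, k + n] * K[i, l, m, n].
    V: input (in_channels, H, W); K: kernel (out_channels, in_channels, kh, kw)."""
    C, H, W = V.shape
    O, C2, kh, kw = K.shape
    assert C == C2, "kernel and input channel counts must match"
    Z = np.zeros((O, H - kh + 1, W - kw + 1))
    for i in range(O):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Z[i, j, k] = np.sum(V[:, j:j + kh, k:k + kw] * K[i])
    return Z

V = np.random.rand(3, 8, 8)       # e.g. an RGB image
K = np.random.rand(4, 3, 3, 3)    # 4 output channels, 3x3 kernels
print(conv_channels(V, K).shape)  # (4, 6, 6)
```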
* Pooling
    * The nonlinearity applied before pooling is sometimes called the detector stage.
    * **Pooling helps make the representation approximately invariant to small translations of the input; pooling over separately parameterized feature maps can also give invariance to transformations such as rotation.**
* Functions
* Maximum within a rectangular neighborhood
* Average
* L2 norm
        * Weighted average based on the distance from the central pixel
        * LSE (Log-Sum-Exp): $y = \log\sum_i e^{x_i}$
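
For example, max pooling over non-overlapping $2 \times 2$ windows, as a minimal sketch:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Maximum over each `size` x `size` window, moving by `stride`."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.arange(16.0).reshape(4, 4)
print(max_pool2d(x))  # [[ 5.  7.] [13. 15.]]
```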
* **Zero-padding Controls the Output Size**
* No zero-padding
* Valid convolution
* Shrinking size
* Just enough zero-padding
* Same convolution
* Keep the size
    * Padding with $k - 1$ zeros (on each end)
        * Full convolution
        * Every input pixel is visited $k$ times; the output grows to $\text{in} + k - 1$.
* Input Size to Output Size
    * $\text{out} = \left\lfloor \frac{\text{in} + 2 \times \text{padding} - \text{dilation} \times (\text{kernel size} - 1) - 1}{\text{stride}} \right\rfloor + 1$
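
The same formula in code, using floor division as in common library conventions (e.g. PyTorch's `Conv2d` docs); the valid/same/full cases above fall out as special cases:

```python
def conv_output_size(in_size, kernel_size, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    return (in_size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

print(conv_output_size(32, 3))                       # valid convolution: 30
print(conv_output_size(32, 3, padding=1))            # same convolution: 32
print(conv_output_size(32, 3, padding=2))            # full convolution: 34
print(conv_output_size(32, 3, stride=2, padding=1))  # strided: 16
```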
* Connection Methods
* Local Connections
        * Like convolution, but with no parameter sharing: each output position has its own weights
* $Z_{i, j, k} = \sum_{l, m, n}[V_{l, j + m - 1, k + n - 1}w_{i, j, k, l, m, n}]$
* Kernel is 6-D
    * Tiled Convolutions
        * Cycle through $t$ groups of shared parameters as we move across space (see the sketch after this list)
        * Ordinary convolution is the special case $t = 1$: a single group of shared parameters
        * $Z_{i, j, k} = \sum_{l, m, n}[V_{l, j + m - 1, k + n - 1}w_{i, j\%t, k\%t, l, m, n}]$
* Partial Connectivity Between Channels
* Each output channel i is a function of only a subset of input channels.
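
A rough sketch of tiled convolution as defined above (names and array layout are made up; $t = 1$ recovers ordinary convolution):

```python
import numpy as np

def tiled_conv2d(V, W, t):
    """Output position (j, k) uses kernel group (j % t, k % t).
    V: (in_ch, H, W); W: (out_ch, t, t, in_ch, kh, kw), matching the formula above."""
    C, H, Wd = V.shape
    O, t1, t2, C2, kh, kw = W.shape
    assert C == C2 and t1 == t2 == t
    Z = np.zeros((O, H - kh + 1, Wd - kw + 1))
    for i in range(O):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Z[i, j, k] = np.sum(V[:, j:j + kh, k:k + kw] * W[i, j % t, k % t])
    return Z

V = np.random.rand(3, 8, 8)
W = np.random.rand(4, 2, 2, 3, 3, 3)  # t = 2: four kernel groups cycled over space
print(tiled_conv2d(V, W, 2).shape)    # (4, 6, 6)
```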
* Efficient Convolution Algorithm
    * A $d$-dimensional kernel is called separable if it can be expressed as the outer product of $d$ vectors.
    * Convolution with a separable kernel can be decomposed into $d$ one-dimensional convolutions, one per vector.
    * This composed approach is significantly faster than performing a single $d$-dimensional convolution with the full outer-product kernel.
    * Complexity: $O(w^d) \rightarrow O(w \times d)$ for a kernel of width $w$
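
A check of this equivalence with SciPy, using a $3 \times 3$ kernel built as the outer product of two vectors:

```python
import numpy as np
from scipy.signal import convolve2d

v = np.array([1.0, 2.0, 1.0])
K = np.outer(v, v)                       # separable 3x3 kernel (w = 3, d = 2)

img = np.random.rand(64, 64)

full = convolve2d(img, K, mode='valid')  # one 2-D convolution: O(w^2) per output

rows = convolve2d(img, v[np.newaxis, :], mode='valid')  # 1-D pass along rows
sep = convolve2d(rows, v[:, np.newaxis], mode='valid')  # 1-D pass along columns

assert np.allclose(full, sep)            # same result, O(2w) work per output
```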
* Initialization of Kernels
* Randomly initialized
* Designed by hand
* Learned with an unsupervised criterion
* For example, Coates et al. (2011) apply k-means clustering to small image patches, then use each learned centroid as a convolution kernel.
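
A loose sketch of that idea with scikit-learn's k-means (the random "dataset" and patch size are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
images = rng.random((100, 32, 32))  # stand-in for an unlabeled image set
patch = 6

# Collect and contrast-normalize random patches
patches = []
for img in images:
    for _ in range(10):
        y, x = rng.integers(0, 32 - patch, size=2)
        p = img[y:y + patch, x:x + patch].ravel()
        patches.append((p - p.mean()) / (p.std() + 1e-8))
patches = np.stack(patches)

# Each learned centroid becomes one convolution kernel
km = KMeans(n_clusters=16, n_init=10).fit(patches)
kernels = km.cluster_centers_.reshape(16, patch, patch)
print(kernels.shape)  # (16, 6, 6)
```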