# Problem definition
The basic problem we are trying to solve is called **semantic image segmentation**. Depending on the context it can also be called *land-cover classification* or *pixel labelling*.
Normal image segmentation involves grouping pixels that belong to the same object. The word *semantic* in *semantic segmentation* refers to making sense of what the segments mean: the pixels are classified into known classes. For example, distinguishing the foreground from the background in a photograph, as in the example below.

The following is taken from *Marmanis et al., 2017*.
> Semantic segmentation in urban areas poses the additional challenge that many man-made object categories are composed of a large number of different materials, and that objects in cities (such as buildings or trees) are small and interact with each other through occlusions, cast shadows, inter-reflections, etc.
I find this interesting because the problem we face is of the same nature. Our goal is to detect drains in the images, but in the RGB images we might see shadows and vegetation, and the drains might look very different from one image to the next.
# The technology
## Convolutional neural nets
The main building block here is a [convolutional neural net][1] (CNN). This is a type of neural net that naturally processes images.
A convolutional layer takes as input an array (image) and computes its convolution with a kernel, a smaller matrix containing weights. The convolution is computed by sliding a window over the original array and, at each position, taking the dot product between the image patch and the kernel.
The weights in the kernel are parameters to be learned. Thus this process can be thought of as learning different filters for different features, and the convolution tests the extent to which those features are present in different parts of the image.
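To make the sliding-window idea concrete, here is a minimal sketch in PyTorch (the image and kernel values are made up for illustration; note that deep-learning libraries actually slide the kernel without flipping it, i.e. cross-correlation, but still call it convolution):

```python
import torch
import torch.nn.functional as F

# A toy 5x5 single-channel "image"; conv2d expects (batch, channels, H, W).
image = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)

# A 3x3 kernel of weights; in a real CNN these would be learned.
kernel = torch.tensor([[[[0., 1., 0.],
                         [1., -4., 1.],
                         [0., 1., 0.]]]])

# Slide the window over the image, taking a dot product at each position.
out = F.conv2d(image, kernel)
print(out.shape)  # torch.Size([1, 1, 3, 3]): one response per window position
```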
Because the convolution output is still localised (it records where in the image each feature occurs), CNNs also include *pooling* layers, which again work with a sliding window but take the average or the maximum of the patch being viewed. Combined with the convolution layers, pooling makes CNNs approximately translation invariant, meaning that they detect features irrespective of their position in the image. Pooling layers are specified rather than learned.
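Pooling in the same toy style (the array values are again made up):

```python
import torch
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
print(F.max_pool2d(x, kernel_size=2))  # max of each 2x2 window -> 2x2 output
print(F.avg_pool2d(x, kernel_size=2))  # average of each 2x2 window
```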
Compared to fully connected NNs, CNNs have far fewer weights: only the kernel weights and the biases (the $+b$ term in an affine function) need to be learned, resulting in much faster training.
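To put the claim about fewer weights in numbers, a quick comparison (the layer sizes are arbitrary, chosen only for illustration):

```python
import torch.nn as nn

# Dense layer mapping a flattened 32x32 image to 1024 hidden units:
fc = nn.Linear(32 * 32, 1024)           # 1024*1024 weights + 1024 biases

# Conv layer with 16 learned 3x3 kernels applied to the same image:
conv = nn.Conv2d(1, 16, kernel_size=3)  # 16*1*3*3 weights + 16 biases

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(fc), n_params(conv))     # 1049600 vs 160
```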
Common activation functions that are used are:
- Sigmoid: $(1 + e^{-x})^{-1}$.
- ReLU: $\max(0, x)$.
From what I could understand, ReLU is preferred because its gradient does not saturate for large positive inputs, which leads to faster training.
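Both are one-liners to check numerically:

```python
import torch

x = torch.tensor([-2.0, 0.0, 3.0])
sigmoid = 1 / (1 + torch.exp(-x))  # same as torch.sigmoid(x)
relu = torch.clamp(x, min=0)       # same as torch.relu(x), i.e. max(0, x)
print(sigmoid)  # tensor([0.1192, 0.5000, 0.9526])
print(relu)     # tensor([0., 0., 3.])
```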
## Extensions for image segmentation
CNNs were originally designed to classify whole images (is this the photo of a horse, a motorcycle, an elephant, a flower? Choose the most likely) or to detect objects in whole images (does this image contain x, yes or no). Each convolution step basically summarises a patch of many pixels in the image.
However, this is not what we need for image segmentation. In this case we need to assign a label to each pixel, not to the image as a whole.
An extension for image segmentation is the **fully convolutional network** (FCN). Another approach, which seems to be less used, is to add deconvolution layers that up-size the learned representations. Here we focus on FCNs.
An FCN basically adds a new layer for pixelwise prediction at the end of a CNN. To be honest, I don't fully understand how they do this. Normally the last layers of a CNN are dense, but these can be reinterpreted as convolutional layers. Citing the authors of the original paper here, *Shelhamer et al., 2017*,
> Fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions. Doing so casts these nets into fully convolutional networks that take input of any size and make spatial output maps.
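A small sketch of what the quote means (the layer sizes here are made up): a dense head applied to a flattened 7×7×512 feature map is equivalent to a convolution whose 7×7 kernel covers the entire input, and the convolutional version then also accepts larger images, producing a spatial map of predictions instead of a single vector.

```python
import torch
import torch.nn as nn

features = torch.randn(1, 512, 7, 7)  # feature map from a CNN backbone

# Dense classification head: flatten, then map to 10 class scores.
fc = nn.Linear(512 * 7 * 7, 10)

# The same head viewed as a convolution covering the whole 7x7 input.
conv = nn.Conv2d(512, 10, kernel_size=7)
conv.weight.data = fc.weight.data.view(10, 512, 7, 7)
conv.bias.data = fc.bias.data

# Identical outputs on a 7x7 input...
print(torch.allclose(fc(features.flatten(1)),
                     conv(features).flatten(1), atol=1e-4))  # True

# ...but the conv version also takes bigger inputs and returns a spatial map.
print(conv(torch.randn(1, 512, 14, 14)).shape)  # torch.Size([1, 10, 8, 8])
```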


I think a useful resource is the [U-Net][2] network. It seems to be a refinement of the FCN, and according to Wikipedia "its architecture was modified and extended to work with fewer training images and to yield more precise segmentations". It was originally proposed and trained for biomedical image segmentation. A PyTorch resource can be found [here][3].
U-Net is a type of architecture known as an encoder-decoder. The encoder layers aim to find a lower-dimensional hidden representation of the inputs that preserves as much of the original information as possible. The decoder then takes this representation and upscales it to match the original dimensions while also performing the task we set out to do, image segmentation in our case. Another network with this type of architecture, but which seems to be more lightweight, is SegNet; an implementation is [here][7].
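As a rough illustration of the encoder-decoder idea, here is a toy network (not the real U-Net, which additionally has skip connections between matching encoder and decoder levels; all sizes are made up):

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        # Encoder: convolutions plus downsampling compress the image
        # into a lower-resolution hidden representation.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # H/2 x W/2
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # H/4 x W/4
        )
        # Decoder: transposed convolutions upscale back to the input
        # size, ending with one score map per class.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),  # H/2 x W/2
            nn.ConvTranspose2d(16, n_classes, 2, stride=2),      # H x W
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

logits = TinyEncoderDecoder()(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 2, 64, 64]): per-pixel class scores
```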
Google has a model called [Deeplab-v3+][6].
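torchvision ships a pretrained DeepLabV3 (the v3 variant, not v3+), which could give us a quick baseline:

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(pretrained=True).eval()  # trained on a COCO subset
with torch.no_grad():
    out = model(torch.randn(1, 3, 256, 256))["out"]
print(out.shape)  # torch.Size([1, 21, 256, 256]): 21 Pascal VOC classes
```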
## Applications to aerial or satellite imagery
- *Marmanis et al.*, 2017, Semantic segmentation of aerial images with an ensemble of CNNs.
    - They employ both image and DEM data as input, with two structurally identical networks, one per input, and a "late fusion" approach, meaning that the predictions are merged shortly before the final layer that outputs the class probabilities.
    - Apparently, a small number of neurons concentrates most of the activation, so Local Response Normalization (LRN, a re-scaling of the activations so that spikes are damped) is used, and it seems to be crucial (a sketch is given after this list). LRN is also used by Krizhevsky et al., 2012 in the foundational paper on deep CNNs.
    - Finally, they use an ensemble of CNNs to avoid local minima, but this seems to be overkill for us.
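PyTorch has Local Response Normalization built in; a quick sketch of what it does (the hyper-parameters below are PyTorch's defaults, not necessarily what Marmanis et al. used):

```python
import torch
import torch.nn as nn

# Normalise each activation by a sum over neighbouring channels, so a few
# very active neurons are damped relative to their neighbours.
lrn = nn.LocalResponseNorm(size=5)  # defaults: alpha=1e-4, beta=0.75, k=1.0

activations = torch.randn(1, 16, 8, 8)  # (batch, channels, H, W)
print(lrn(activations).shape)           # same shape, rescaled values
```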
## Other techniques
Since the number of training examples might be an issue, we might consider unsupervised or semi-supervised models.
- *Kim et al.*, 2020, Unsupervised Learning of Image Segmentation Based on Differentiable Feature Clustering.
    - The approach is not based on FCN but it could be compatible.
    - Because the learning is unsupervised, both the representations (kernel weights) and the pixel labels need to be found.
    - They employ three heuristic criteria: (a) pixels of similar features should be assigned the same label, (b) spatially continuous pixels should be assigned the same label, and (c) the number of unique labels should be large (a code sketch of (a) and (b) follows this entry).
    - TBH: this approach sounds complicated but the good thing is that [the code][4] is provided and it's in PyTorch.
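Here is a heavily simplified sketch of how criteria (a) and (b) can be turned into a differentiable loss. This is my paraphrase of the paper's idea, not their actual code (the weight `mu` and all sizes are made up; see [the code][4] for the real implementation):

```python
import torch
import torch.nn.functional as F

def unsupervised_seg_loss(response, mu=5.0):
    """response: (1, n_labels, H, W) per-pixel scores from a small CNN."""
    # (a) Pseudo-labels: each pixel's current argmax label; training the
    # network against its own argmax pulls similar features to one label.
    target = response.argmax(dim=1)                 # (1, H, W), no gradient
    similarity = F.cross_entropy(response, target)

    # (b) Spatial continuity: penalise score differences between
    # vertically and horizontally adjacent pixels.
    dh = (response[:, :, 1:, :] - response[:, :, :-1, :]).abs().mean()
    dw = (response[:, :, :, 1:] - response[:, :, :, :-1]).abs().mean()
    return similarity + mu * (dh + dw)

loss = unsupervised_seg_loss(torch.randn(1, 32, 64, 64, requires_grad=True))
loss.backward()  # gradients flow back to whatever produced `response`
```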
- *Van Gansbeke et al.*, 2021, Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals.
    - Instead of using an end-to-end pipeline, this method works in two steps, the first deciding on regions of the image where pixels are likely to belong together.
    - "Mid-level visual groups, like objects, transfer well across datasets, since they do not depend on any pre-defined ground-truth classes"
    - In a second step, the pixels are embedded using a contrastive loss (as far as I understand, a loss that pulls embeddings of matching pixels together and pushes non-matching ones apart; a generic sketch follows this list).
    - I found this model because it is ranked as the best performing on a benchmark dataset.
    - [The code][5] is provided and it's in PyTorch.
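For reference, a minimal sketch of a contrastive (InfoNCE-style) loss, which is the general idea behind the second step; this is a generic textbook version, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Pull `anchor` towards `positive`, push it away from `negatives`.
    anchor, positive: (d,) embeddings; negatives: (n, d) embeddings."""
    anchor = F.normalize(anchor, dim=0)
    pos_sim = anchor @ F.normalize(positive, dim=0)   # similarity, scalar
    neg_sim = F.normalize(negatives, dim=1) @ anchor  # similarities, (n,)
    logits = torch.cat([pos_sim[None], neg_sim]) / temperature
    # Cross-entropy with the positive pair as the "correct class":
    return F.cross_entropy(logits[None], torch.tensor([0]))

loss = info_nce(torch.randn(64), torch.randn(64), torch.randn(10, 64))
```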
[1]: https://deepai.org/machine-learning-glossary-and-terms/convolutional-neural-network
[2]: https://en.wikipedia.org/wiki/U-Net
[3]: https://pytorch.org/hub/mateuszbuda_brain-segmentation-pytorch_unet/
[4]: https://paperswithcode.com/paper/unsupervised-learning-of-image-segmentation#code
[5]: https://paperswithcode.com/paper/unsupervised-semantic-segmentation-by
[6]: https://ai.googleblog.com/2018/03/semantic-image-segmentation-with.
[7]: https://github.com/trypag/pytorch-unet-segnet