# Report #2
Danila Romanov & Mark Zakharov
## Police U-Turn
~~Due to us forgetting to apply to the Kaggle competition before the deadline~~
For reasons beyond our control, we were forced to change the topic of our project.
We chose another, somewhat similar competition - [iWildCam 2020](https://www.kaggle.com/c/iwildcam-2020-fgvc7)
## New Task Description
The competition's main goal is to train a CV model that can predict the animal species in a camera trap image.
The community generally agrees that the 2D image classification problem is more or less solved - top ImageNet models classify images better than a specially instructed human.
But here is the catch - camera traps are unmanned and placed in the wild:
- they are sometimes deployed automatically (dropped from drones, planes, or helicopters, making it impossible to adjust their position);
- they are unmaintained (a camera may fall to the ground, get obscured by leaves, branches, etc.);
- they are cheap (and therefore unprepared for special filming conditions such as night, unstable illumination, fast-moving objects, unusual perspectives, or bad weather);
- they stay in place for long periods (so the background changes over time, making median background subtraction impossible).
Cameras are also subject to malfunctions caused by environmental conditions, so a portion of corrupted images will be present in the training set.
From a practical point of view, we have to deal with an enormous dataset - 84GB of archived training data, 25GB of archived test data, plus permission to use additional data from the 2017 and 2018 challenges. Such datasets barely fit on hard drives, not to mention RAM. Pipelining this amount of data and making it all work is therefore a very challenging task from an ML DevOps perspective.
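As a rough illustration of the kind of pipeline this requires - a sketch using TensorFlow's tf.data API, where the paths and labels are placeholders (in practice they would be parsed from the competition's annotation JSON):

```python
import tensorflow as tf

# Placeholder path/label lists; the real ones would come from the
# competition's annotation JSON (parsing omitted here).
paths = ["train/example.jpg"]
labels = [0]

def load_image(path, label):
    # Decode and resize one image at a time, so the full 80+ GB of
    # training data never has to fit in RAM.
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(img, (224, 224)), label

ds = (tf.data.Dataset.from_tensor_slices((paths, labels))
      .shuffle(10_000)
      .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE))
```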
The authors also provide satellite images of the regions where the cameras are placed. These may be used as a prior for adjusting animal class distributions across regions - one should not expect to see a polar bear in a desert or a giraffe in the mountains.
All of this sets the task apart from ordinary image classification problems and makes it more interesting for our team from both a research and a practical standpoint.
P.S. It has been 9 days since the competition started, but according to the leaderboard, only one person has surpassed the baseline.
## Starting approach
We will be using a recently released architecture - EfficientNet - throughout our experiments.
This new architecture represents a significant advance in image classification: it shows better Top-1 accuracy than any other model while having significantly fewer parameters.
The authors also present a whole family of models (B0 to B7), letting end users smoothly trade off size and execution speed against accuracy.

Speaking about implementation, our first attempt will be quite simple (a minimal sketch follows the list):
1. Use pre-trained EfficientNet weights to extract embeddings from both the camera trap image and the satellite image. Each image is transformed into a 7x6xEMBED tensor, where the EMBED dimension depends on the particular EfficientNet version (~1500-2500).
2. To get an embedding vector, apply GlobalMaxPooling2D, transforming the 7x6xEMBED tensor into a 1xEMBED vector.
3. Condition the trap embedding on the satellite embedding by simply concatenating the two.
4. Add a fully-connected layer mapping the concatenated embedding vector to N classes.
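Below is a minimal Keras sketch of this pipeline. It assumes the EfficientNetB0 application from tf.keras and a single frozen backbone shared by both inputs; the input resolution and the class count are placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 100  # placeholder; set to the actual number of species

# Steps 1-2: a frozen pre-trained backbone shared by both inputs.
backbone = tf.keras.applications.EfficientNetB0(include_top=False,
                                                weights="imagenet")
backbone.trainable = False

trap_in = layers.Input(shape=(224, 224, 3), name="camera_trap")
sat_in = layers.Input(shape=(224, 224, 3), name="satellite")

# GlobalMaxPooling2D collapses each HxWxEMBED feature map into a 1xEMBED vector.
trap_emb = layers.GlobalMaxPooling2D()(backbone(trap_in))
sat_emb = layers.GlobalMaxPooling2D()(backbone(sat_in))

# Step 3: condition the trap embedding on the satellite embedding.
joint = layers.Concatenate()([trap_emb, sat_emb])

# Step 4: one fully-connected head mapping the joint embedding to N classes.
out = layers.Dense(NUM_CLASSES, activation="softmax")(joint)

model = Model(inputs=[trap_in, sat_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```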
We can already see some questions arising:
1. Why are both the camera shot and the satellite image represented by vectors of the same (and quite large) size, given that the former carries much more relevant information than the latter?
2. Would GlobalMaxPooling not lose too much information from the 7x6xEMBED tensor?
3. Given that the majority of image regions contain little to no relevant information, and that we are concerned about both performance and accuracy, should we not use a visual attention mechanism? If yes, how?
## EfficientNet description
The paper's main observation is that scaling a convolutional network (the standard way of achieving SOTA results in computer vision) along a single dimension (width, depth, or image resolution) is effective only up to a point: past some threshold, performance plateaus and does not increase any further. For instance, ResNet-1000 has accuracy comparable to ResNet-101 despite having about 10 times more layers.
What the authors propose is to first use neural architecture search to find a base architecture, and then expand it by increasing the network's depth by $\alpha^{\phi}$, the network's width by $\beta^{\phi}$, and the image resolution by $\gamma^{\phi}$, where $\alpha$, $\beta$, and $\gamma$ are found via hyperparameter optimization and $\phi$ is a user-defined parameter that controls how much the network is expanded.
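For reference, the paper's compound scaling rule constrains the coefficients so that the total FLOPS grow by roughly $2^{\phi}$:

$$
d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi},
\qquad \text{s.t.} \quad \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
$$

where $d$, $w$, and $r$ are the depth, width, and resolution multipliers. FLOPS scale linearly with depth but quadratically with width and resolution, hence the squares in the constraint.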
The main building block of the base network is MBConv, the mobile inverted bottleneck convolution. The main concept behind MBConv is to factorize a standard convolution into a depthwise (spatial) convolution and a pointwise (1x1, channel-mixing) convolution, saving parameters while computing essentially the same thing and hence achieving similar performance. Apart from that, it looks much like a ResNet block with its residual connections.
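As a toy illustration of the parameter savings from this factorization (a sketch assuming Keras layers; the channel counts and spatial size are arbitrary):

```python
from tensorflow import keras
from tensorflow.keras import layers

C_IN, C_OUT = 64, 128  # arbitrary channel counts for illustration

# Standard 3x3 convolution: 3*3*C_IN*C_OUT + C_OUT parameters.
standard = keras.Sequential([
    keras.Input(shape=(56, 56, C_IN)),
    layers.Conv2D(C_OUT, 3, padding="same"),
])

# Depthwise-separable factorization: a 3x3 depthwise (spatial) convolution
# followed by a 1x1 pointwise (channel-mixing) convolution.
separable = keras.Sequential([
    keras.Input(shape=(56, 56, C_IN)),
    layers.DepthwiseConv2D(3, padding="same"),
    layers.Conv2D(C_OUT, 1),
])

print(standard.count_params())   # 73,856
print(separable.count_params())  # 8,960 - roughly 8x fewer
```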
On top of that, squeeze-and-excitation blocks are used. For each layer, they rescale the channels of the feature map by weights produced by a small gating sub-network; that is, the weights are a non-linear function of the layer's own outputs rather than fixed constants. Empirical results show that the deeper the layer, the sparser the weight distribution it produces, meaning that in deeper layers only a portion of the channels is used to decide the result of the classification task. Back in 2017, squeeze-and-excitation blocks dramatically increased the performance of convolutional networks while introducing only a small fraction of new parameters and negligible extra computation.
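A minimal sketch of such a block in Keras (the reduction ratio of 16 is the usual default from the squeeze-and-excitation paper; everything else here is illustrative):

```python
from tensorflow.keras import layers

def se_block(x, reduction=16):
    """Reweight the channels of feature map x (squeeze-and-excitation)."""
    channels = x.shape[-1]
    # Squeeze: global average pooling yields one descriptor per channel.
    s = layers.GlobalAveragePooling2D()(x)
    # Excitation: a small bottleneck gating network outputs weights in (0, 1).
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    # Scale: multiply each channel of x by its gate (broadcast over H and W).
    return layers.Multiply()([x, layers.Reshape((1, 1, channels))(s)])
```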