# GA2: From the ground up
We know the CIFAR-100 images are 32x32 RGB. We want to use convolutions, so we start off by adding a convolutional layer to the network.
Added: Conv2D - 32 feature maps, kernel = 3x3 + ReLU
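Roughly what this could look like in code, assuming Keras (the framework isn't stated in these notes) and a simple Flatten + Dense softmax head as a stand-in for the real classifier head; because the head is an assumption, the parameter counts printed by `model.summary()` won't necessarily match the numbers quoted below.

```python
import tensorflow as tf

# Minimal sketch of the starting point: one 3x3 convolution with 32 feature
# maps and a ReLU. The Flatten + Dense(100) softmax head is an assumption,
# not something taken from these notes.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                           input_shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="softmax"),  # 100 CIFAR-100 classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # prints the trainable parameter count
```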

We notice pretty slow convergence of the training accuracy to 1.0 in the graph. The reason is probably that the model is not complex enough; maybe we also fail to capture the truly fine details of the features.
Let's first try making the kernel size smaller (2x2). Maybe we can capture more detail this way.

That really did not help. We are probably just lowering the effective resolution rather than capturing more detail. Let's keep the kernels at 3x3 and instead double the number of feature maps to 64.
This doubles the parameter count (1.3m vs 600k).

Looks pretty good. Let's try something different and make the network twice as deep: two convolution layers with 32 feature maps, each followed by a ReLU.
We still have around 600k parameters, so no doubling this time.
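A sketch of this "twice as deep" variant, with the same assumed head as before.

```python
import tensorflow as tf

# Two 3x3 convolutions with 32 feature maps, each followed by a ReLU
# (classifier head still an assumption).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                           input_shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="softmax"),
])
```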

This is clearly better: we saturate to 1.0 more quickly and with only half the number of parameters. Let's make the network a bit wider now and use 64 feature maps in the last convolution instead of 32.
Now the number of trainable parameters jumps from 600k to 1.3m again.

This is way better than the earlier model with 1.3m parameters. Let's work on depth again and go to three conv layers with 32 feature maps each. Once more, we only have 600k parameters!

Very similar, even slightly better, and all of that with only 600k parameters! Let's widen the last added layer a bit so we have plenty of feature maps to work with: two conv layers with 32 feature maps and one with 64.
This makes the parameter count jump to 1.3m again.

We now converge way faster to 1.0 but don't really see an increase in validation accuracy. Let's try to reduce the number of parameters a bit by adding a max-pool layer: conv32 conv32 maxpool conv64.
We now only have 300k parameters!
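A sketch of this conv32 conv32 maxpool conv64 variant; the 2x2 max-pool halves the spatial resolution before the flatten, which is what cuts the parameter count (head still assumed).

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                           input_shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),   # halves the spatial resolution
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="softmax"),
])
```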

Although we saturate much more slowly to 1.0, we have only 400k parameters and a validation score that is around 5% better! This is the way to go. Let's make the network a bit deeper again: conv32 conv32 maxpool conv64 conv64.
Again only around 300k parameters.

Looks pretty similar to the previous one but with slower convergence. Let's make the last layer a bit wider: conv32 conv32 maxpool conv64 conv128.
We now have around 700k parameters.

We saturate to 1.0 much faster and the validation accuracy increased by around 3%. Let's add another maxpool and, again, a wider layer:
conv32 conv32 maxpool conv64 conv128 maxpool conv256
Again, we have around 700k parameters. I expect slower convergence once more, but a better validation score.
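A sketch of this two-block version, again with the assumed head.

```python
import tensorflow as tf

# conv32 conv32 maxpool conv64 conv128 maxpool conv256
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                           input_shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.Conv2D(128, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(256, (3, 3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="softmax"),
])
```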

Indeed, way slower convergence but a higher validation score. We make the network a bit deeper again; this will mean more parameters but faster convergence. conv32 conv32 maxpool conv32 conv128 maxpool conv256 conv256
Exactly as expected: 1.3m parameters

Hmmm, the convergence is not really great. Make it deeper again? conv32 conv32 maxpool conv32 conv128 maxpool conv256 conv256 conv256
A decent chunk of extra parameters now: 1.9m

Nah, this ain't it. Let's just try widening the last convolution layer: conv32 conv32 maxpool conv32 conv128 maxpool conv256 conv512
Now 2.3m parameters, which is a lot more, but hopefully it does something.

Not really an improvement, unfortunately. Let's keep the two convolutions of 256 at the end but add another maxpool after them.
Now we only have 1m parameters, which is pretty decent.

Not that much better in performance, but with only 1m parameters we can afford to add another layer to the network (conv512). We also notice that we no longer fully converge to 1.0 at the end; I think we are jumping around the minimum and should lower the learning rate.
Now we have a cool 2.3m parameters :o

Looks pretty nice, but we should most certainly fine-tune everything. Let's first add batch normalization.
Same number of parameters, but with batch normalization added before every ReLU that follows a convolution. This should help stabilize training, because the loss curve is currently quite jumpy.
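A sketch of the Conv → BatchNorm → ReLU pattern that now replaces every plain Conv + ReLU; the `conv_bn_relu` helper and the `model_stem` name are illustrative, and the layer widths follow the architecture described above.

```python
import tensorflow as tf

def conv_bn_relu(filters):
    """One block: linear 3x3 conv, batch norm, then the ReLU."""
    return [
        tf.keras.layers.Conv2D(filters, (3, 3)),  # no activation here
        tf.keras.layers.BatchNormalization(),     # normalize pre-activations
        tf.keras.layers.Activation("relu"),       # ReLU after the batch norm
    ]

# e.g. the first block of the network, built out of these pieces; the
# remaining blocks follow the architecture described above.
model_stem = tf.keras.Sequential(
    [tf.keras.Input(shape=(32, 32, 3))]
    + conv_bn_relu(32)
    + conv_bn_relu(32)
    + [tf.keras.layers.MaxPooling2D((2, 2))]
)
```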

If we ignore the last epochs :p we notice that training is a LOT more stable. We also reach 1.0 quickly, and the validation accuracy is very stable and stays about the same.

Looking at the image above, our loss curve is pretty steep. We should try to make our loss and accuracy curves look more like the red one. For this, we decrease the learning rate from 0.001 to 0.0008.
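Changing the learning rate is just a re-compile with a new optimizer instance. The optimizer isn't named in these notes, so Adam is an assumption here, and `model` refers to the network sketched earlier.

```python
import tensorflow as tf

# Re-compile with a lower learning rate (0.001 -> 0.0008). Adam is an
# assumption; `model` is the network built above.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0008),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```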

Yikes, not great. Let's make the learning rate a bit higher instead? (0.0012)

Not too bad, and the validation accuracy also increased! Let's make it even higher (0.0014).

Alright, the validation accuracy is even higher now. Let's increase it some more! (0.0016)

A bit too high now. Let's lower it again to 0.0015.

Owww yee, that's it: a consistent 68% validation accuracy.

With a maxpool at the end, we get the same result with fewer parameters. Let's keep this.
We should regularize now to fight the overfitting.
Let's add some dropout to the layer with the most parameters (the last 512 conv, rate 0.4).
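Roughly what that looks like: a `Dropout(0.4)` placed after the widest block (the last 512-filter conv), shown here with the Conv → BN → ReLU pattern from before.

```python
import tensorflow as tf

# Dropout after the 512-filter block; a rate of 0.4 zeroes 40% of the
# block's activations during training.
wide_block = [
    tf.keras.layers.Conv2D(512, (3, 3)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dropout(0.4),
]
```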

Not great; let's apply it to the second-to-last layer instead.

Again, not amazing. Let's go one layer higher again, but only with a dropout of 0.2 because that layer has far fewer parameters.

That is pretty good! We could probably go even higher. Let's do 0.3.

Not really much improvement. Even higher? 0.5

This is too high: we get lower validation accuracy and we are still severely overfitting. Let's keep the 0.3 and add 0.2 in the other layers.

Not great; let's increase the 0.3 to 0.4 and the 0.2 to 0.3.

Still no real improvement in validation scores.
Maybe add some more dropout to the first layer?

Very similar! Let's try the opposite: higher dropout at the beginning? 0.3 0.3 0.2 0.2

Maybe that is a bit too high now. Let's do 0.2 0.3 0.3 0.3.

Not too bad; let's do 0.2 0.3 0.4 0.4.

I like this: a consistent 67% validation accuracy again and less overfitting on the training data. Let's try adding some L2 regularization to the last convolutional layer (0.01).
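In Keras, L2 weight decay is attached per layer through `kernel_regularizer`; a sketch for the last convolution with the factor 0.01 tried here (assuming the last conv is still the 512-filter one).

```python
import tensorflow as tf

# L2 regularization on the kernel of the last convolutional layer.
last_conv = tf.keras.layers.Conv2D(
    512, (3, 3),
    kernel_regularizer=tf.keras.regularizers.l2(0.01),
)
```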

Let's do more! 0.02.

Maybe a bit too much. Let's do 0.015 and increase the learning rate to 0.02.

Not too good; let's try 0.005 and set the learning rate back to what it was.

Pretty decent; let's add the same L2 to the two layers before it and increase the learning rate a bit to 0.0017.

Hmm, not that stable anymore. Let's decrease the learning rate again, because training jumps around a lot, and also reduce the regularization in those layers a bit.

Not too bad. We will give it a few more epochs and decrease the learning rate a bit more (80 epochs, lr 0.0012). We will also slightly decrease the L2 regularization of the third-to-last layer.
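The training tweak itself, hedged as before: Adam is assumed, and `x_train`/`y_train`/`x_val`/`y_val` are stand-ins for the CIFAR-100 splits used throughout.

```python
import tensorflow as tf

# Lower learning rate and a longer budget: lr 0.0012, 80 epochs.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0012),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
history = model.fit(x_train, y_train,
                    epochs=80,
                    validation_data=(x_val, y_val))
```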

This stagnates more than the model with the previous L2 regularization and a learning rate of 0.0017. Let's try that model again and throw 85 epochs at it.

Trash.
Let's do something crazy and add another 512 layer to it.
Now 4.5m parameters though :o

Still pretty bad and unstable. Let's add a maxpool layer between the last two layers.
Again, 4.5m parameters :/

Terrible. Finally, let's try widening the last layer a bit (this gives only 2.5m parameters, which is more acceptable).

Nah, not good.
We will try this model with data augmentation next.
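A possible augmentation setup for that next experiment; the notes don't say which transforms will be used, so the shifts, flips, batch size, and epoch count below are placeholders, and `model`, `x_train`, `y_train`, `x_val`, `y_val` are the assumed names from earlier sketches.

```python
import tensorflow as tf

# Illustrative augmentation: small random shifts plus horizontal flips.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
)

# Train on augmented batches instead of the raw training set.
history = model.fit(
    datagen.flow(x_train, y_train, batch_size=64),
    epochs=80,
    validation_data=(x_val, y_val),
)
```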
