# Stage 3 weird stuff

## Reg 0.01 on swish layer

![](https://i.imgur.com/K1jqmFh.png)
![](https://i.imgur.com/ZsV38GQ.png)

train mae = 27.308105
validation mae = 29.559326

Strangely capped at 400 (see the model sketch at the end of this section for the layer setup).

## No reg on swish layer

![](https://i.imgur.com/Ww82koZ.png)
![](https://i.imgur.com/OcYmOyb.png)

train mae = 27.329678
validation mae = 29.81079

Able to reach the peaks (no longer capped at 400), but overfits more in general.

## No dense layer in the beginning

![](https://i.imgur.com/mSloXXP.png)
![](https://i.imgur.com/O2ft4wI.png)

train mae = 27.710047
validation mae = 30.06289

No longer able to go sub 30. Perhaps it can't build complex combinations of features as well anymore?

## Dense layer of 32 in the beginning

![](https://i.imgur.com/m4B5lQb.png)
![](https://i.imgur.com/EqT9NdB.png)

train mae = 27.703184
validation mae = 29.692074

No performance increase, and neither more nor less overfitting. Perhaps the model simply can't do anything with the extra combinations: there are no more sensible feature combinations left to find.

## Window size increased to one week

![](https://i.imgur.com/zKo4z0j.png)
![](https://i.imgur.com/zsGrsW9.png)

train mae = 27.52961
validation mae = 29.551836

Similar performance to one day, but overfits a bit less. Maybe it is able to remember on which days pollution is generally higher? (See the windowing sketch at the end of this section.)

## Window size increased to two weeks

![](https://i.imgur.com/wTpNkPq.png)
![](https://i.imgur.com/W2YSZQT.png)

train mae = 27.841536
validation mae = 29.77808

Learns a lot slower. We think the large window size pollutes the memory with too much noisy data, which is why it is not really able to learn much more.

## Window size of 3 days

![](https://i.imgur.com/axah72i.png)
![](https://i.imgur.com/sIG3auE.png)

train mae = 27.845932
validation mae = 29.91356

Funnily enough, 3 days is really not good. One week: fine. One day: fine. Three days: bad. Why?

## Adding sin and cos

![](https://i.imgur.com/wM4Maio.png)
![](https://i.imgur.com/j7IXv3O.png)

train mae = 27.645061
validation mae = 29.514944

On its own, no improvement was noticed. When we widened the dense layer before the GRU, the model improved. We think it was able to combine the larger trends with the sin and cos, improving the model a bit. These combinations made the data effectively less noisy for the model, so it overfitted less. (See the encoding sketch at the end of this section.)
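For reference, a minimal sketch (in Keras) of what the regularized swish setup discussed above could look like. Only the swish activation, the 0.01 L2 factor, and the dense layer before the GRU come from the experiments; the layer widths, the ReLU activation, the GRU size, and the input shape are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(window_size, n_features, swish_reg=0.01):
    """swish_reg=0.01 matches the 'Reg 0.01' run; swish_reg=None the 'No reg' run."""
    # L2 penalty on the swish layer's weights, or none at all
    reg = regularizers.l2(swish_reg) if swish_reg else None
    return tf.keras.Sequential([
        tf.keras.Input(shape=(window_size, n_features)),
        layers.Dense(32, activation="relu"),  # the dense layer "in the beginning" (width assumed)
        layers.GRU(32),                       # recurrent core (width assumed)
        layers.Dense(16, activation="swish", kernel_regularizer=reg),
        layers.Dense(1),                      # single pollution value out
    ])

model = build_model(window_size=24, n_features=8)  # one day of hourly data (assumed)
model.compile(optimizer="adam", loss="mae", metrics=["mae"])
```

One plausible reading of the results: the L2 penalty shrinks the swish layer's weights enough that extreme outputs become unreachable (the cap at 400), while removing it frees the peaks but lets the weights grow, which would fit the extra overfitting.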
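The window-size experiments only change how many past timesteps each training sample contains. Below is a sketch of how those windows could be built, assuming hourly data (so 24 steps = one day, 72 = three days, 168 = one week, 336 = two weeks) and using `tf.keras.utils.timeseries_dataset_from_array`; the array shapes are placeholders.

```python
import numpy as np
import tensorflow as tf

def make_windows(features, targets, window_size, batch_size=32):
    # Each sample holds `window_size` consecutive timesteps; its label is the
    # value right after the window (the alignment pattern from the
    # timeseries_dataset_from_array docs).
    return tf.keras.utils.timeseries_dataset_from_array(
        data=features[:-window_size],
        targets=targets[window_size:],
        sequence_length=window_size,
        batch_size=batch_size,
    )

features = np.random.rand(10_000, 8).astype("float32")  # placeholder hourly features
targets = np.random.rand(10_000).astype("float32")      # placeholder pollution values
ds_one_day = make_windows(features, targets, window_size=24)
ds_one_week = make_windows(features, targets, window_size=168)
```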
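Finally, a sketch of the sin/cos encoding, assuming hourly data with a pandas `DatetimeIndex`; the column names and periods are illustrative. Mapping the hour (and day of year) onto a circle keeps 23:00 and 00:00 close together instead of a full cycle apart, which may be the smooth signal the wider dense layer combined with the larger trends.

```python
import numpy as np
import pandas as pd

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    # Hour of day on a 24-hour circle
    df["hour_sin"] = np.sin(2 * np.pi * df.index.hour / 24)
    df["hour_cos"] = np.cos(2 * np.pi * df.index.hour / 24)
    # Day of year on a yearly circle (365.25 to absorb leap years)
    df["day_sin"] = np.sin(2 * np.pi * df.index.dayofyear / 365.25)
    df["day_cos"] = np.cos(2 * np.pi * df.index.dayofyear / 365.25)
    return df

# Example usage on placeholder hourly data
idx = pd.date_range("2014-01-01", periods=48, freq="h")
df = add_time_features(pd.DataFrame({"pollution": np.random.rand(48)}, index=idx))
```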