# Loss capacity project

We'll use this document as a journal.

###### Week 16May
------------

![](https://i.imgur.com/HPLUYTm.png)

![](https://i.imgur.com/2C0ys8h.png)

###### Week 9May
------------

Here are (still) preliminary results. The hyperparameters are not optimal yet.

![](https://i.imgur.com/zOZ03EI.png)

**Question**: It seems that for some representations the probe is harder to train than for others. E.g. in the figures from last week, two linear functions produce two different curves under limited training (noisy_uniform vs random), although both are linear (invertible?) transformations. With proper training, the probe would reach the same accuracy (approx. 100%) even in the linear case, so the num-params vs. accuracy curves would look the same. But the learned functions could still differ: one could be more complex than the other. *Should we therefore also take some notion of probe complexity into account?* Maybe the spectral norm? Or the Lipschitz constant of the learned probe function. What is nice about the spectral norm and the Lipschitz constant is that they are both 1.0 for permutation matrices. Would the other functions learned by the probe have a norm / constant higher than 1? If so, we could say that representations whose probes have a spectral norm closer to one are preferred. (A code sketch of such a bound is at the end of this document.)

-------------

###### Week 2May

### Datasets

For now I used the datasets from this repo: https://github.com/bethgelab/InDomainGeneralizationBenchmark

### Task: plot curves of simple synthetic models

We plot the R-squared scores obtained by the representations of different models. We measure the capacity as the difference between the total number of parameters of the probe and the number of parameters of the linear probe:

$$\text{Capacity} = \text{params}_{\text{probe}} - \text{params}_{\text{linear probe}}$$

First try, on dSprites, with the following models (minimal code sketches for the synthetic representations and the capacity measure are at the end of this document):

- Noisy labels: $h = y + N(0, \epsilon)$
- Noisy uniform mix: $h = M y$, where $M_{ij} = 1/K + N(0, \epsilon)$
- Random linear mix: $h = M y$, with $M$ random
- Raw data: $h = x$
- Simple conv net: untrained and randomly initialized (scratch), or supervised on predicting the factors
- ResNet18: untrained and randomly initialized, pretrained on ImageNet, or supervised on predicting the factors

dsprites
![](https://i.imgur.com/EkRZ5RB.png)

mpi3d
![](https://i.imgur.com/f1VuVj5.png)

This is just a single run, with the probe having 2 hidden layers of various hidden sizes. The first point in each curve represents the performance of the linear probe.

#### Observations

The behavior of raw_data vs. pretrained resnet18 is what we are interested in. The resnet18 is better under linear probing, but not when more capacity is used. How do we interpret this? Is the raw data 'more disentangled' than the representation of a pretrained resnet18? For simple cases like dSprites, with simple factors (shape, scale, orientation, x-position, and y-position), this might be the case.

It might also be interesting to think about a notion of compactness of the representation.
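
For reference, here is a minimal sketch of the three synthetic representations listed above, assuming NumPy and a factor matrix `y` of shape `[N, K]`; the noise scale `eps` and the seed are placeholders, not the values used in the plots.

```python
import numpy as np

rng = np.random.default_rng(0)  # placeholder seed

def noisy_labels(y, eps=0.01):
    # h = y + N(0, eps): the factors themselves, plus small Gaussian noise.
    return y + rng.normal(0.0, eps, size=y.shape)

def noisy_uniform_mix(y, eps=0.01):
    # h = M y with M_ij = 1/K + N(0, eps): an almost-uniform mix of the factors.
    K = y.shape[1]
    M = 1.0 / K + rng.normal(0.0, eps, size=(K, K))
    return y @ M.T

def random_linear_mix(y):
    # h = M y with M a random (generically invertible) Gaussian matrix.
    K = y.shape[1]
    M = rng.normal(size=(K, K))
    return y @ M.T
```

One possible (unverified) explanation for the noisy_uniform vs random gap under limited training: with entries $1/K$ plus small noise, $M$ is close to a rank-one matrix and therefore badly conditioned, so the probe has to undo a nearly singular mix, whereas a generic random $M$ is usually much better conditioned.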
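
And a sketch of the capacity measure defined above, assuming PyTorch probes; the layer sizes below are placeholders.

```python
import torch.nn as nn

def n_params(model: nn.Module) -> int:
    # Total number of trainable parameters.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Placeholder sizes: representation dim, number of factors, hidden width.
rep_dim, n_factors, hidden = 10, 5, 64

linear_probe = nn.Linear(rep_dim, n_factors)
probe = nn.Sequential(
    nn.Linear(rep_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, hidden), nn.ReLU(),
    nn.Linear(hidden, n_factors),
)

# Capacity = params(probe) - params(linear probe); 0 for the linear probe itself.
capacity = n_params(probe) - n_params(linear_probe)
print(capacity)
```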
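
Finally, regarding the Week 9May question on probe complexity: a rough sketch of one candidate measure, assuming a PyTorch MLP probe with 1-Lipschitz activations (ReLU, tanh, ...). The product of the layers' weight spectral norms upper-bounds the Lipschitz constant of the probe, and for a probe that is just a permutation matrix this bound is exactly 1.

```python
import torch
import torch.nn as nn

def probe_lipschitz_upper_bound(probe: nn.Module) -> float:
    # Product of per-layer spectral norms (largest singular values).
    # Upper-bounds the Lipschitz constant when activations are 1-Lipschitz.
    bound = 1.0
    for module in probe.modules():
        if isinstance(module, nn.Linear):
            bound *= torch.linalg.matrix_norm(module.weight, ord=2).item()
    return bound

# Sanity check: a "probe" that is a pure permutation matrix has bound 1.0.
perm = nn.Linear(5, 5, bias=False)
with torch.no_grad():
    perm.weight.copy_(torch.eye(5)[torch.randperm(5)])
print(probe_lipschitz_upper_bound(perm))  # -> 1.0
```

Note that this is only an upper bound (the layer-wise product can be loose) and that biases do not affect it; whether it tracks the "preferred representation" intuition above would have to be checked empirically.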