Take two different, overlapping patches x1 and x2 from a single image. Since they come from the same image and are close to each other (they overlap), they are related. We use a neural network fθ to compute a representation for each patch, say z1 and z2. Since the patches are related, z1 and z2 should also be related. In other words, z1 should be able to predict z2.
Let's say we take random patches from another image, x3, x4 … x10. We calculate z3 = fθ(x3) … z10 = fθ(x10). Since z1 is related to z2 but not to zi for i > 2, it should be able to pick out z2 from a set of vectors zi for i > 1.
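Here is a toy numpy sketch of that "pick out the positive" idea. Everything in it is invented for illustration (z2 is constructed to be close to z1, the rest are random); it only shows that a dot product is enough to identify the related vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Purely illustrative stand-ins for f_theta outputs:
# z2 is made similar to z1 (overlapping patches from the same image),
# z3 ... z10 are unrelated (patches from another image).
z1 = rng.standard_normal(64)
z2 = z1 + 0.1 * rng.standard_normal(64)   # related to z1
others = rng.standard_normal((8, 64))     # z3 ... z10

candidates = np.vstack([z2, others])      # the set of zi for i > 1
scores = candidates @ z1                  # dot-product similarity with z1

print(scores.argmax())  # 0 -> z1 picks out z2 from the candidates
```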
It is a two-step process:
Step 1: The encoded vector at time step t, zt, defines a context ct. This is done by passing zt to an autoregressive model gar. Sometimes not just the vector at the last time step but all the vectors from time 0 to time t are used to define the context.
(z0, z1, …, zt) → [ gar ] → ct
gar could be a GRU, LSTM or CNN.
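As a rough sketch of step 1 (not the implementation linked later in the post; the sizes here are invented), a GRU can play the role of gar and map the sequence (z0, …, zt) to a single context vector ct:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

code_size = 64      # dimensionality of each z_t (assumed)
context_size = 128  # dimensionality of the context c_t (assumed)

# g_ar: summarise the sequence (z_0, ..., z_t) into one context vector c_t.
z_sequence = keras.Input(shape=(None, code_size))            # (z_0, ..., z_t)
c_t = layers.GRU(context_size, return_sequences=False)(z_sequence)
g_ar = keras.Model(z_sequence, c_t, name="g_ar")

# Example: a batch of 2 sequences, each with t + 1 = 5 encoded patches.
dummy_z = np.random.randn(2, 5, code_size).astype("float32")
print(g_ar(dummy_z).shape)   # (2, 128) -> one context vector per sequence
```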
Step 2: The context vector at time t, ct, can predict the encoded vectors k steps ahead, zt+k, where k > 0. This is done by a simple linear transformation of ct. We use a separate linear transformation Wk to predict zt+k.
In other words, W1 is used to predict zt+1, W2 is used to predict zt+2, W3 is used to predict zt+3, and so on.
ct → [ W1 ] → zt+1
ct → [ W2 ] → zt+2
ct → [ W3 ] → zt+3
…
ct → [ Wk ] → zt+k
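Each Wk is just a matrix applied to ct. A minimal numpy sketch with invented sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

context_size, code_size, K = 128, 64, 3   # sizes assumed for illustration

c_t = rng.standard_normal(context_size)   # context vector at time t
W = [rng.standard_normal((code_size, context_size)) for _ in range(K)]

# One prediction per future step: z_hat_{t+k} = W_k c_t, for k = 1..K.
z_hat = [W[k] @ c_t for k in range(K)]
print([z.shape for z in z_hat])   # [(64,), (64,), (64,)]
```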
Comparing the prediction with the actual encoding zt+k is simple: a dot product. If zt+k came from the same image, the dot product should have a high value; if it came from a different image, it should have a low value. This can be turned into a loss by passing the dot product through a sigmoid function and then computing the binary cross-entropy loss.
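A toy numpy sketch of this scoring and loss, with made-up vectors (the positive is constructed to be close to the prediction, the negative is random):

```python
import numpy as np

rng = np.random.default_rng(0)
code_size = 64

z_hat = rng.standard_normal(code_size)                 # prediction of z_{t+k} from c_t
z_pos = z_hat + 0.1 * rng.standard_normal(code_size)   # encoding from the same image
z_neg = rng.standard_normal(code_size)                 # encoding from a different image

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(p, label):
    p = np.clip(p, 1e-7, 1 - 1e-7)   # numerical safety
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

p_pos = sigmoid(z_hat @ z_pos)   # should be close to 1
p_neg = sigmoid(z_hat @ z_neg)   # should be close to 0

loss = bce(p_pos, 1.0) + bce(p_neg, 0.0)
print(p_pos, p_neg, loss)
```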
The Keras implementation is lifted straight from here.
It is a simple CNN:
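The linked code is not reproduced here; as a stand-in, here is a rough sketch of what such a patch encoder could look like in Keras (the layer sizes are my assumptions, not the linked architecture):

```python
from tensorflow import keras
from tensorflow.keras import layers

code_size = 64   # size of each patch embedding z (assumed)

def build_encoder(patch_size=64, channels=3):
    """A small CNN that maps an image patch to a code vector z (illustrative only)."""
    inputs = keras.Input(shape=(patch_size, patch_size, channels))
    x = inputs
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)   # collapse spatial dimensions
    z = layers.Dense(code_size)(x)           # final code vector
    return keras.Model(inputs, z, name="encoder")

encoder = build_encoder()
encoder.summary()
```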
It is a simple GRU. I have modified the docstring to make it clearer:
Note that k is the number of time steps ahead in the future at which the encoded vectors are predicted. We need to define a Dense (or Linear, if you are coming from PyTorch) layer for each value of k. This is implemented here.
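In Keras terms this is just a list of Dense layers, one per prediction offset; a hedged sketch with assumed sizes (not the linked code):

```python
from tensorflow import keras
from tensorflow.keras import layers

code_size, context_size, K = 64, 128, 3   # assumed sizes; K = number of steps ahead

# One linear map W_k per prediction offset k = 1..K.
prediction_layers = [
    layers.Dense(code_size, use_bias=False, name=f"W_{k}") for k in range(1, K + 1)
]

c_t = keras.Input(shape=(context_size,))
z_hats = [W_k(c_t) for W_k in prediction_layers]   # predictions of z_{t+1} ... z_{t+K}
predictor = keras.Model(c_t, z_hats, name="predictions")
predictor.summary()
```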
All of the above components (the encoder, the autoregressive network and the prediction layers) are bundled into a single model here.
This paper made a number of improvements to the original CPC implementation.
We revisit CPC in terms of its architecture and training methodology, and arrive at a new implementation with a dramatically-improved ability to linearly separate image classes.
Here is how they made these improvements:
Increasing model capacity: the original CPC model used only the first 3 stacks of a ResNet-101 ... we converted the third residual stack of ResNet-101 to use 46 blocks with 4096-dimensional feature maps and 512-dimensional bottleneck layers.
Replacing batch normalization with layer normalization: We hypothesize that batch normalization allows these models to find a trivial solution to CPC: it introduces a dependency between patches (through the batch statistics) that can be exploited to bypass the constraints on the receptive field. Nevertheless we find that we can reclaim much of batch normalization’s training efficiency using layer normalization.
Predicting not just from top to bottom but from all directions: we repeatedly predict the patch using context from below, the right and the left, resulting in up to four times as many prediction tasks.
Augmenting image patches better: The original CPC model spatially jitters individual patches independently. We further this logic by adopting the ‘color dropping’ method of [14], which randomly drops two of the three color channels in each patch, and find it to deliver systematic gains (+3% accuracy). We therefore continued by adding a fixed, generic augmentation scheme using the primitives from Cubuk et al. [10] (e.g. shearing, rotation, etc), as well as random elastic deformations and color transforms [11] (+4.5% accuracy). A small sketch of the color-dropping idea follows below.
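For concreteness, here is a minimal numpy sketch of the color-dropping idea (randomly zeroing two of the three color channels of a patch). It is only an illustration of the described augmentation, not the authors' pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def color_drop(patch):
    """Randomly keep one of the three color channels and zero the other two.

    `patch` has shape (H, W, 3). Illustrative sketch of the 'color dropping'
    augmentation described above, not the authors' implementation.
    """
    keep = rng.integers(0, 3)        # channel to keep
    out = np.zeros_like(patch)
    out[..., keep] = patch[..., keep]
    return out

patch = rng.random((64, 64, 3)).astype("float32")
augmented = color_drop(patch)
print(augmented.sum(axis=(0, 1)))    # two of the three channel sums are zero
```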
There is also some material on data-efficiency but I am going to skip it.