# Contrastive Predictive Coding

## Combined notes for

- [Representation Learning with Contrastive Predictive Coding](https://arxiv.org/abs/1807.03748)
- [Data-Efficient Image Recognition with Contrastive Predictive Coding](https://arxiv.org/abs/1905.09272)

## Representation Learning with Contrastive Predictive Coding

### The basic idea

Take two different overlapping patches, x<sub>1</sub> and x<sub>2</sub>, from a single image. Since they come from the same image and are close to each other (they overlap), they are related. We use a neural network (f<sub>θ</sub>) to compute a representation for each of these patches, say z<sub>1</sub> and z<sub>2</sub>. Since the patches are related, z<sub>1</sub> and z<sub>2</sub> should also be related. In other words, z<sub>1</sub> should be able to _predict_ z<sub>2</sub>.

### But what do we mean by a vector z<sub>1</sub> predicting another vector z<sub>2</sub>?

Let's say we take random patches x<sub>3</sub>, x<sub>4</sub>, ..., x<sub>10</sub> from other images and compute z<sub>3</sub> = f<sub>θ</sub>(x<sub>3</sub>), ..., z<sub>10</sub> = f<sub>θ</sub>(x<sub>10</sub>). Since z<sub>1</sub> is related to z<sub>2</sub> but not to z<sub>i</sub> for i > 2, it should be able to pick out z<sub>2</sub> from the set of vectors z<sub>i</sub> for i > 1.

### But what do we mean by a vector picking out another particular vector from a set of vectors?

It is a two-step process.

Step 1: A vector at time step t, v<sub>t</sub>, defines a _context_ c<sub>t</sub>. This can be done by passing v<sub>t</sub> to an autoregressive model g<sub>ar</sub>. Sometimes not just the vector at the last time step but all vectors from time 0 to time t are used to define the context:

(v<sub>0</sub>, v<sub>1</sub>, ..., v<sub>t</sub>) → [ g<sub>ar</sub> ] → c<sub>t</sub>

g<sub>ar</sub> could be a GRU, an LSTM, or a CNN.

Step 2: The context vector at time t, c<sub>t</sub>, can predict encoded vectors k steps ahead in time, z<sub>t+k</sub> for k > 0. This is done by a simple linear transformation of c<sub>t</sub>, with a separate linear transformation W<sub>k</sub> for each step. In other words, W<sub>1</sub> is used to predict z<sub>t+1</sub>, W<sub>2</sub> is used to predict z<sub>t+2</sub>, W<sub>3</sub> is used to predict z<sub>t+3</sub>, and so on.

c<sub>t</sub> → [ W<sub>1</sub> ] → z<sub>t+1</sub>
c<sub>t</sub> → [ W<sub>2</sub> ] → z<sub>t+2</sub>
c<sub>t</sub> → [ W<sub>3</sub> ] → z<sub>t+3</sub>
...
c<sub>t</sub> → [ W<sub>k</sub> ] → z<sub>t+k</sub>

### How do we measure the accuracy of prediction?

Simple: a dot product. If z<sub>t+k</sub> came from the same image, the dot product should be high; if it came from a different image, it should be low. This can be turned into a loss by passing the dot product through a sigmoid function and then computing the binary cross-entropy loss.

### Code

The Keras implementation is lifted straight from [here](https://github.com/davidtellez/contrastive-predictive-coding/blob/master/train_model.py).
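The snippets below only cover the model, not the data pipeline. As a rough, hypothetical sketch of the overlapping-patch idea from "The basic idea" above, here is how patches could be cut out with numpy; `patch_size` and `stride` are illustrative, with `stride < patch_size` producing the overlap:

```python
import numpy as np


def extract_overlapping_patches(image, patch_size=64, stride=32):
    ''' Cut an image of shape (H, W, C) into a grid of overlapping patches.
        stride < patch_size makes neighbouring patches overlap. '''
    h, w, _ = image.shape
    rows = []
    for top in range(0, h - patch_size + 1, stride):
        row = [image[top:top + patch_size, left:left + patch_size]
               for left in range(0, w - patch_size + 1, stride)]
        rows.append(np.stack(row))
    return np.stack(rows)  # shape: (n_rows, n_cols, patch_size, patch_size, C)
```

With the defaults above, a 256×256 image yields a 7×7 grid of 64×64 patches overlapping by 32 pixels, matching the vision setup described in the first paper.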
#### f<sub>θ</sub> (image patch x<sub>t</sub> → encoded vector z<sub>t</sub>)

It is a [simple CNN](https://github.com/davidtellez/contrastive-predictive-coding/blob/master/train_model.py#L14-L36):

```python
import keras
from keras import backend as K


def network_encoder(x, code_size):

    ''' Define the network mapping images to embeddings '''

    # Four strided convolution blocks downsample the patch
    x = keras.layers.Conv2D(filters=64, kernel_size=3, strides=2, activation='linear')(x)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.LeakyReLU()(x)
    x = keras.layers.Conv2D(filters=64, kernel_size=3, strides=2, activation='linear')(x)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.LeakyReLU()(x)
    x = keras.layers.Conv2D(filters=64, kernel_size=3, strides=2, activation='linear')(x)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.LeakyReLU()(x)
    x = keras.layers.Conv2D(filters=64, kernel_size=3, strides=2, activation='linear')(x)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.LeakyReLU()(x)

    # Project the feature map down to a code_size-dimensional embedding z_t
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(units=256, activation='linear')(x)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.LeakyReLU()(x)
    x = keras.layers.Dense(units=code_size, activation='linear', name='encoder_embedding')(x)

    return x
```

#### g<sub>ar</sub>

It is a [simple GRU](https://github.com/davidtellez/contrastive-predictive-coding/blob/master/train_model.py#L39-L47). I have modified the docstring to make it clearer:

```python
def network_autoregressive(x):

    ''' x is an iterable of encoded vectors z_1, z_2, ..., z_t '''

    # x = keras.layers.GRU(units=256, return_sequences=True)(x)
    # x = keras.layers.BatchNormalization()(x)

    # return_sequences=False: only the final hidden state, the context c_t, is returned
    x = keras.layers.GRU(units=256, return_sequences=False, name='ar_context')(x)

    return x
```

#### Implementation of W<sub>k</sub> transformations

`predict_terms` is the number of future time steps k for which the encoded vectors z<sub>t+k</sub> are predicted. We need to define a `Dense` (or `Linear`, if you are coming from PyTorch) layer for each value of k. This is implemented [here](https://github.com/davidtellez/contrastive-predictive-coding/blob/master/train_model.py#L50-L63):

```python
def network_prediction(context, code_size, predict_terms):

    ''' Predict the embeddings `predict_terms` steps into the future from the context vector '''

    # One linear layer W_k per future step k
    outputs = []
    for i in range(predict_terms):
        outputs.append(keras.layers.Dense(units=code_size, activation='linear', name='z_t_{i}'.format(i=i))(context))

    # Stack the predictions into a (batch, predict_terms, code_size) tensor
    if len(outputs) == 1:
        output = keras.layers.Lambda(lambda x: K.expand_dims(x, axis=1))(outputs[0])
    else:
        output = keras.layers.Lambda(lambda x: K.stack(x, axis=1))(outputs)

    return output
```

#### Taking the sigmoid of the dot product

[Implementation here](https://github.com/davidtellez/contrastive-predictive-coding/blob/master/train_model.py#L66-L86):

```python
class CPCLayer(keras.layers.Layer):

    ''' Computes dot product between true and predicted embedding vectors '''

    def __init__(self, **kwargs):
        super(CPCLayer, self).__init__(**kwargs)

    def call(self, inputs):

        # Compute dot product among vectors
        preds, y_encoded = inputs
        dot_product = K.mean(y_encoded * preds, axis=-1)
        dot_product = K.mean(dot_product, axis=-1, keepdims=True)  # average along the temporal dimension

        # Keras loss functions take probabilities
        dot_product_probs = K.sigmoid(dot_product)

        return dot_product_probs

    def compute_output_shape(self, input_shape):
        return (input_shape[0][0], 1)
```

All of the above functions are bundled into a single model [here](https://github.com/davidtellez/contrastive-predictive-coding/blob/master/train_model.py#L89).
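For orientation, here is a condensed, hypothetical sketch of that wiring, built from the functions above. The input shape and the `terms`/`predict_terms` defaults are illustrative; the linked `network_cpc` remains the authoritative version:

```python
def network_cpc_sketch(image_shape=(64, 64, 3), terms=4, predict_terms=4, code_size=128):

    ''' Hypothetical sketch: compose encoder, GRU, W_k predictions and CPC scoring '''

    # Encoder f_θ, shared across all patches
    encoder_input = keras.layers.Input(image_shape)
    encoder_output = network_encoder(encoder_input, code_size)
    encoder_model = keras.models.Model(encoder_input, encoder_output, name='encoder')

    # Context: encode the sequence of past patches, then summarize it with g_ar
    x_input = keras.layers.Input((terms,) + image_shape)
    x_encoded = keras.layers.TimeDistributed(encoder_model)(x_input)
    context = network_autoregressive(x_encoded)
    preds = network_prediction(context, code_size, predict_terms)

    # Targets: encode the candidate future patches with the same encoder
    y_input = keras.layers.Input((predict_terms,) + image_shape)
    y_encoded = keras.layers.TimeDistributed(encoder_model)(y_input)

    # Score predictions against targets via sigmoid(dot product)
    dot_product_probs = CPCLayer()([preds, y_encoded])

    model = keras.models.Model(inputs=[x_input, y_input], outputs=dot_product_probs)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['binary_accuracy'])
    return model
```

Each training example pairs a sequence of `terms` patches (the context) with `predict_terms` candidate future patches, labelled 1 if they truly follow the context and 0 if they were drawn from elsewhere.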
---

## Data-Efficient Image Recognition with Contrastive Predictive Coding

This paper basically makes improvements to the previous implementation of CPC.

> We revisit CPC in terms of its architecture and training methodology, and arrive at a new implementation with a dramatically-improved ability to linearly separate image classes.

Here is how they made these improvements:

1. Increasing model capacity: `the original CPC model used only the first 3 stacks of a ResNet-101 ... we converted the third residual stack of ResNet-101 to use 46 blocks with 4096-dimensional feature maps and 512-dimensional bottleneck layers`.
2. Replacing batch normalization with layer normalization: `We hypothesize that batch normalization allows these models to find a trivial solution to CPC: it introduces a dependency between patches (through the batch statistics) that can be exploited to bypass the constraints on the receptive field. Nevertheless we find that we can reclaim much of batch normalization’s training efficiency using layer normalization`.
3. Predicting not just from top to bottom but from all directions: `we repeatedly predict the patch using context from below, the right and the left, resulting in up to four times as many prediction tasks.`
4. Augmenting image patches better (a sketch of the 'color dropping' part follows this list): `The original CPC model spatially jitters individual patches independently. We further this logic by adopting the ‘color dropping’ method of [14], which randomly drops two of the three color channels in each patch, and find it to deliver systematic gains (+3% accuracy). We therefore continued by adding a fixed, generic augmentation scheme using the primitives from Cubuk et al. [10] (e.g. shearing, rotation, etc), as well as random elastic deformations and color transforms [11] (+4.5% accuracy).`

There is also some material on data-efficiency, but I am going to skip it.
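Of these, 'color dropping' is simple enough to sketch. Here is a minimal, hypothetical numpy version of what the quote describes; zero-filling the two dropped channels (rather than, say, using the channel mean) is an assumption of this sketch:

```python
import numpy as np


def color_drop(patch, rng=np.random):
    ''' Randomly drop two of the three color channels of a (H, W, 3) patch.
        Zero-filling the dropped channels is an assumption of this sketch. '''
    kept = rng.randint(3)          # index of the single channel to keep
    out = np.zeros_like(patch)
    out[..., kept] = patch[..., kept]
    return out
```

###### tags: `self-supervised-learning`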