Paper: https://arxiv.org/abs/2111.11418
Key idea: abstract the overall architecture (which the paper calls MetaFormer) from high-performing models like Transformers, MLP-Mixers, etc. It is this general structure, not the specific token mixer, that gives good performance. To prove this, they replace the attention / spatial-MLP token mixer with simple pooling (PoolFormer) and still get competitive results.
The main thing to understand is how pooling works:
class Pooling(nn.Module):
    """
    Implementation of pooling for PoolFormer
    --pool_size: pooling size
    """
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2, count_include_pad=False)

    def forward(self, x):
        # subtracting the input here is compensated by the residual connection in the surrounding block
        return self.pool(x) - x
Main claim: patches are what lead to improved performance, at least to a certain extent
Stem implementation: a convolution with kernel_size == stride == patch_size embeds non-overlapping patches directly:
self.stem = nn.Sequential(
    nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size),
    activation(),
    nn.BatchNorm2d(dim)
)
Blocks implementation
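A hedged sketch of what such a block typically looks like, assuming the ConvMixer-style design suggested by the stem above (this is illustrative, not the paper's code): a depthwise convolution mixes spatial locations (with a residual), then a pointwise 1x1 convolution mixes channels, each followed by activation and BatchNorm.
import torch.nn as nn

class Residual(nn.Module):
    # adds a skip connection around the wrapped module
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return self.fn(x) + x

def block(dim, kernel_size=9, activation=nn.GELU):
    return nn.Sequential(
        Residual(nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),  # depthwise: spatial mixing
            activation(),
            nn.BatchNorm2d(dim),
        )),
        nn.Conv2d(dim, dim, kernel_size=1),  # pointwise: channel mixing
        activation(),
        nn.BatchNorm2d(dim),
    )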
Mostly the same people as behind the ViT paper.
Adequate (84.15% top-1 on ImageNet with Mixer-L/16) but not SOTA. Benefits much more from scaling up.
Common part with ViT: divide an image into an N×N grid of patches, flatten each patch, and apply a linear transform.
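A minimal sketch of that patchify + linear projection step (shapes are illustrative, not the papers' code):
import torch
import torch.nn as nn

patch, dim = 16, 768
img = torch.randn(1, 3, 224, 224)
proj = nn.Linear(3 * patch * patch, dim)

# (1, 3, 224, 224) -> (1, 3, 14, 14, 16, 16): non-overlapping 16x16 patches
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
# flatten each patch to a vector -> (1, 196, 768), then project to the model dimension
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)
tokens = proj(patches)   # (1, 196, dim)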
The Mlp module is a simple Linear -> Activation -> Dropout -> Linear -> Dropout stack, implemented here as:
class Mlp(nn.Module):
    """ MLP as used in Vision Transformer, MLP-Mixer and related networks
    """
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        # Linear -> Activation -> Dropout -> Linear -> Dropout
        return self.drop(self.fc2(self.drop(self.act(self.fc1(x)))))
Another self-supervised image representation model
Key idea: during pre-training, we randomly mask some proportion of image patches and feed the corrupted input to the Transformer. The model learns to recover the visual tokens of the original image, instead of the raw pixels of the masked patches.
Image patches are flattened into vectors and linearly projected.
In our experiments, we split each 224 × 224 image into a 14 × 14 grid of image patches, where each patch is 16 × 16.
An image is broken down into patches. Each patch has two representations: a patch embedding that is fed through the Transformer encoder, and a discrete visual token from a fixed vocabulary (produced by an image tokenizer).
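A hedged pseudocode-style sketch of that pre-training objective (function and argument names here are made up for illustration, not the paper's API):
import torch
import torch.nn.functional as F

def masked_image_modeling_loss(patches, tokenizer, encoder, mask):
    # patches: (B, N, D) flattened image patches; mask: (B, N) bool, True where a patch is masked
    with torch.no_grad():
        target_ids = tokenizer(patches)        # discrete visual tokens of the *original* image, (B, N)
    logits = encoder(patches, mask)            # the Transformer sees the corrupted input, returns (B, N, vocab)
    # the model must recover the visual tokens at the masked positions (not the raw pixels)
    return F.cross_entropy(logits[mask], target_ids[mask])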
80.1% top-1 on ImageNet with linear evaluation of a ViT-Base model.
Self-supervised without a contrastive loss. A distillation setup where the teacher model is an exponential moving average (EMA) of the student model.
Code
Data Augmentation
There are two kinds of crops: global and local. The teacher only sees global crops; the student sees both global and local crops.
The data-augmentation class defines these transforms: random horizontal flip, color jitter and grayscale.
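A simplified sketch of how the teacher/student pair is trained (the teacher output is centered and sharpened before being used as a target, as described in the DINO paper; the names and numeric values below are illustrative, not the official implementation):
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    # the teacher's weights are an exponential moving average of the student's
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps.detach(), alpha=1 - m)

def dino_style_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    # teacher: global crops only; its centered, sharpened (low temperature) output is the target
    targets = F.softmax((teacher_out - center) / t_t, dim=-1).detach()
    # student: global + local crops; cross-entropy against the teacher targets
    return -(targets * F.log_softmax(student_out / t_s, dim=-1)).sum(dim=-1).mean()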
Trained on ImageNet only (1.2 million images)
Competitive performance on ImageNet (84.4%)
In addition to the CLS token, a distillation token is added => it is responsible for predicting the output of a CNN teacher (RegNetY-16GF)
To make it work on a smaller amount of data, three techniques were used:
data augmentation: repeated augmentation, auto-augment, rand-augment, random erasing, mixup, cutmix
optimization
regularization: trained at (224, 224), then fine-tuned at (384, 384). (How? The patch size stays fixed, so the number of patches grows, and the learned positional embeddings are interpolated to the new grid; see the sketch below.)
bicubic interpolation is used so that the L2 norm of the interpolated positional embeddings stays approximately equal to that of the original ones (bilinear interpolation would shrink it)
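A sketch of that interpolation step, going from the 14×14 grid (224/16) to the 24×24 grid (384/16); resize_pos_embed is a hypothetical helper, not DeiT's code:
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    # pos_embed: (1, old_grid*old_grid, dim) patch position embeddings (CLS/distillation tokens excluded)
    dim = pos_embed.shape[-1]
    pe = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)   # (1, dim, 14, 14)
    # bicubic (not bilinear) interpolation approximately preserves the L2 norm of each embedding
    pe = F.interpolate(pe, size=(new_grid, new_grid), mode='bicubic', align_corners=False)
    return pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)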
Prelude: Two ways to look at the same problem
Imagine a factory where the widget-maker makes a stream of widgets, and the widget-tester removes the faulty ones. You don’t know what tolerance the widget tester is set to, and wish to infer it.
Way 1 to create n widgets (a sketch in code follows this list):
Sample a widget from the distribution of widgets and test it.
If it passes the test, keep it and create n-1 more widgets recursively.
If it fails, discard it and create n widgets recursively.
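A plain-Python sketch of Way 1 (the notes' own examples use WebPPL; the quality distribution and tolerance below are made-up numbers):
import random

def make_widget():
    return random.gauss(1.0, 0.3)                  # a widget's "quality"

def make_good_widgets(n, tolerance=0.8):
    # recursively create n widgets that pass the tester
    if n == 0:
        return []
    w = make_widget()
    if w > tolerance:                              # passes: keep it, need n-1 more
        return [w] + make_good_widgets(n - 1, tolerance)
    return make_good_widgets(n, tolerance)         # fails: discard it, still need n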
Let's take a look at the marbles problem from earlier:
Bag1 → Unknown distribution → Sample1 → Color1
Bag2 → Unknown distribution → Sample2 → Color2
...
BagN → Unknown distribution → SampleN → ColorN
Here we know deterministically that Color1 came from Bag1 and so on. What if we remove this information?
[Bag1, Bag2, ..., BagN] → Sample a bag → Sample a color
It's the same idea of updating prior beliefs, just applied to neural networks.
Key idea
The parameters of a neural network come from one or more Gaussian priors. Given some data, we can update these priors to obtain networks that fit the data better.
var dm = 10 //size of hidden layer
var makeFn = function(M1,M2,B1){
  return function(x){
    // one hidden layer: output = M2 . sigmoid(M1 . x + B1)
    return T.toScalars(
      T.dot(M2, T.sigmoid(T.add(T.dot(M1, Vector([x])), B1))))[0]
  }
}
Humans choose the least complex hypothesis that fits the data well.
How is complexity measured? How is fitness measured?
If fitness is semantically measured and complexity is syntactically measured (eg description length of the hypothesis in some representation language, or a count of the number of free parameters used to specify the hypothesis), the two are incommensurable.
In Bayesian models both complexity and fitness are measured semantically. Complexity is measured by flexibility: the ability to generate a more diverse set of observations.
Key Idea: The Law of Conservation of Belief
Since all probabilities must add up to 1, a complex model spreads its probability over a larger number of possibilities, whereas a simple model puts high probability on a smaller set of events. Hence, when the observed data are compatible with both, the simpler model assigns them higher probability and wins the posterior comparison (a Bayesian Occam's razor).
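A tiny worked example (made-up outcome spaces) of why the flexible hypothesis loses: both can generate the data, but the flexible one spreads its belief thinner, so its likelihood is lower.
simple_outcomes  = ['a', 'b']                       # simple hypothesis: 2 equally likely observations
complex_outcomes = ['a', 'b', 'c', 'd', 'e', 'f']   # flexible hypothesis: 6 equally likely observations
data = ['a', 'b', 'a']                              # consistent with both hypotheses

def likelihood(data, outcomes):
    p = 1.0
    for d in data:
        p *= 1.0 / len(outcomes) if d in outcomes else 0.0
    return p

# with equal priors, the posterior odds equal the likelihood ratio: (1/2)^3 / (1/6)^3 = 27 to 1
print(likelihood(data, simple_outcomes) / likelihood(data, complex_outcomes))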
Key idea
We learn generalized concepts naturally:
poodle, Dalmatian, Labrador → dog
sedan, coupe, convertible, wagon → car
How do we build models that can learn these abstract concepts?
Example 1: Bags with colored balls
Each bag can learn its own categorical distribution. That explains previously observed data well but fails to generalize (e.g. to bags with few or no observations).
Key idea
How can we create complex hypotheses and representation spaces? Simply by using stochastic recursion: whether the recursion ends at a given point is not deterministic but probabilistic.
Here is how we can generate an unbounded space of mathematical expressions (the recursive sampler is sketched right after the snippet):
var randomConstant = function() {
  return uniformDraw(_.range(10))
}
var randomCombination = function(f,g) {
  var op = uniformDraw(['+','-','*','/'])
  return '(' + f + op + g + ')'
}
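The recursive part, sketched here in plain Python for illustration: whether the expression keeps growing is itself decided by a coin flip, so where the recursion ends is probabilistic.
import random

def random_arithmetic_expression():
    if random.random() < 0.5:          # with probability 0.5, combine two random sub-expressions
        op = random.choice(['+', '-', '*', '/'])
        return '(' + random_arithmetic_expression() + op + random_arithmetic_expression() + ')'
    return str(random.randrange(10))   # otherwise stop with a random constant

print(random_arithmetic_expression()) # e.g. '((3+7)*2)'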
Learning and the rate of learning
Let's say you see a series of heads when a coin is tossed. Your beliefs about the bias of the coin depend on two things:
How likely is it, a priori, that the coin is biased?
How much data have you seen?
One can measure the rate of learning: how quickly the learner's inferred belief approaches the fact that the coin really is biased.
var fairnessPosterior = function(observedData) {
return Infer({method: 'enumerate'}, function() {
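In plain Python, the same posterior can be computed by direct enumeration (assuming, for illustration, a prior of 0.999 that the coin is fair and a trick-coin weight of 0.95); it also shows the rate of learning, since belief in fairness collapses only after enough all-heads data:
def fairness_posterior(heads, total, p_fair=0.999, trick_weight=0.95):
    lik_fair  = 0.5 ** total
    lik_trick = trick_weight ** heads * (1 - trick_weight) ** (total - heads)
    return p_fair * lik_fair / (p_fair * lik_fair + (1 - p_fair) * lik_trick)

for n in [1, 5, 10, 15, 20]:
    # probability the coin is fair after seeing n heads in n tosses
    print(n, round(fairness_posterior(n, n), 3))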
Markov Chain Monte Carlo (MCMC)
The idea is to find a Markov chain whose stationary distribution is the same as the conditional distribution we want to estimate. E.g., we want to estimate a (conditioned) geometric distribution:
var p = .7
var geometric = function(p){
  return ((flip(p) == true) ? 1 : (1 + geometric(p)))
}
var post = Infer({method: 'MCMC', samples: 25000, lag: 10, model: function(){
  var x = geometric(p)
  condition(x > 2)   // condition on the sampled value, e.g. on it being greater than 2
  return x
}})
viz(post)
Bayesian cognitive model and Bayesian data analysis are the same thing under the hood; they just have different contexts.
If the generative model is a hypothesis about a person’s model of the world, then we have a Bayesian cognitive model – the main topic of this book. If the generative model is instead the scientist’s model of how the data are generated, then we have Bayesian data analysis.
Two competing hypotheses
Consider an example of a spinning coin (as opposed to a flipping coin).
Scientists (doing data analysis) might model the probability of heads as uniform on [0, 1].
People might believe it is the same as flipping an unbiased coin, i.e. exactly 0.5.
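As a worked illustration with made-up data, we can compare how well the two hypotheses predict some observed spins via their marginal likelihoods:
from math import comb

heads, tails = 9, 1   # hypothetical observed spins

# "weight is uniform on [0,1]": integral_0^1 p^h (1-p)^t dp = 1 / ((h+t+1) * C(h+t, h))
lik_uniform = 1.0 / ((heads + tails + 1) * comb(heads + tails, heads))
# "weight is exactly 0.5": 0.5^(h+t)
lik_half = 0.5 ** (heads + tails)

# likelihood ratio of roughly 9.3 in favour of the uniform-weight hypothesis for this data
print(lik_uniform, lik_half, lik_uniform / lik_half)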
Two forms of dependence are explored in detail:
a) Screening off: the graphical model looks like • ← • → • or • → • → •. It is called screening off because once the variable in the middle is observed, the variables at the two ends become independent.
Screening off is a purely statistical phenomenon. For example, consider the causal chain model, where A directly causes C, which in turn directly causes B. Here, when we observe C – the event that mediates an indirect causal relation between A and B – A and B are still causally dependent in our model of the world: it is just our beliefs about the states of A and B that become uncorrelated. There is also an analogous causal phenomenon. If we can actually manipulate or intervene on the causal system, and set the value of C to some known value, then A and B become both statistically and causally independent (by intervening on C, we break the causal link between A and C).
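A quick Monte Carlo check of screening off in the chain A → C → B (plain Python, illustrative probabilities):
import random

def sample():
    A = random.random() < 0.5
    C = random.random() < (0.9 if A else 0.1)   # C depends on A
    B = random.random() < (0.8 if C else 0.2)   # B depends only on C
    return A, B, C

draws = [sample() for _ in range(200000)]

def p_b_given(pred):
    rel = [b for (a, b, c) in draws if pred(a, c)]
    return sum(rel) / len(rel)

# marginally, A is informative about B: these two estimates differ a lot
print(p_b_given(lambda a, c: a), p_b_given(lambda a, c: not a))
# once C is observed, A adds nothing: these two are approximately equal
print(p_b_given(lambda a, c: c and a), p_b_given(lambda a, c: c and not a))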
b) Explaining away: the graphical model looks like • → • ← •. If the bottom (common-effect) variable is observed, the previously independent variables (the two roots at the top) become dependent.
The most typical pattern of explaining away we see in causal reasoning is a kind of anti-correlation: the probabilities of two possible causes for the same effect increase when the effect is observed, but they are conditionally anti-correlated, so that observing additional evidence in favor of one cause should lower our degree of belief in the other cause. (This pattern is where the term explaining away comes from.)
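And a matching Monte Carlo check of explaining away, with two independent causes of one effect (again illustrative numbers):
import random

def sample():
    A = random.random() < 0.3                          # cause 1
    B = random.random() < 0.3                          # cause 2
    E = random.random() < (0.9 if (A or B) else 0.05)  # common effect
    return A, B, E

draws = [sample() for _ in range(300000)]

def p_a_given(pred):
    rel = [a for (a, b, e) in draws if pred(a, b, e)]
    return sum(rel) / len(rel)

print(p_a_given(lambda a, b, e: True))        # prior belief in A
print(p_a_given(lambda a, b, e: e))           # observing the effect raises it
print(p_a_given(lambda a, b, e: e and b))     # also observing B "explains it away": belief in A drops back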
Causal Dependence
expression A depends on expression B if it is ever necessary to evaluate B in order to evaluate A
What about an expression like:
A = C ? B + 2 : 5
Does A depend on B? Only in certain contexts: B needs to be evaluated only when C is true.
Much of cognition can be understood in terms of conditional inference. In its most basic form, causal attribution is conditional inference: given some observed effects, what were the likely causes? Predictions are conditional inferences in the opposite direction: given that I have observed some cause, what are its likely effects?
Inference can be done in various ways. The most basic way is rejection sampling:
var model = function () {
  var A = flip()
  var B = flip()
  var C = flip()
  var D = A + B + C
  condition(D >= 2)
  return A
}
viz(Infer({method: 'rejection', samples: 1000}, model))