vinsis

@vinsis

Joined on May 26, 2019

  • Notes on `torch_scatter`: `import torch`, `from torch_scatter import scatter`. We will use a graph with `num_nodes = 5`, `num_edges = 6`, `num_edge_types = 3`, and an `edge_index` tensor built with `torch.LongTensor([...])`; a usage sketch follows below.
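A minimal sketch of how `scatter` aggregates per-edge values onto nodes for a graph of this size; the `edge_index` values and the feature dimension below are illustrative, since the preview cuts off the note's actual tensor.

```python
import torch
from torch_scatter import scatter

num_nodes, num_edges, num_edge_types = 5, 6, 3

# Illustrative connectivity: row 0 = source nodes, row 1 = target nodes
edge_index = torch.LongTensor([[0, 1, 1, 2, 3, 4],
                               [1, 2, 3, 4, 4, 0]])
edge_feat = torch.randn(num_edges, 8)  # one feature vector per edge

# Sum the features of all edges arriving at each target node
node_agg = scatter(edge_feat, edge_index[1], dim=0, dim_size=num_nodes, reduce='sum')
print(node_agg.shape)  # torch.Size([5, 8])
```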
  • Paper: https://arxiv.org/abs/2111.11418 (MetaFormer / PoolFormer). Key idea: abstract the overall network architecture from high-performing models like Transformers, MLP-Mixers etc.; it is this general architecture, rather than the specific token mixer, that gives the good performance. To prove this statement they replace attention, MLP mixing etc. with simple pooling. The main thing to understand is how pooling works: a `Pooling(nn.Module)` class ("Implementation of pooling for PoolFormer", with `pool_size` as the pooling size), sketched below.
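A minimal sketch of the pooling token mixer along the lines of the paper: average pooling with the input subtracted, so the mixer only contributes the difference from its input (details may differ from the official code).

```python
import torch.nn as nn

class Pooling(nn.Module):
    """Pooling token mixer for PoolFormer (sketch). pool_size: pooling window size."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2, count_include_pad=False)

    def forward(self, x):
        # Subtracting x means the module models only the residual mixing
        return self.pool(x) - x
```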
  • Main claim: patches are what lead to improved performance, at least to a certain extent. Stem implementation: `self.stem = nn.Sequential(nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size), activation(), nn.BatchNorm2d(dim))`. Blocks implementation: see the sketch below.
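The preview cuts off before the blocks, so here is a hedged sketch of a ConvMixer-style block consistent with the stem above: a depthwise convolution (spatial mixing) inside a residual, followed by a pointwise convolution (channel mixing), each with an activation and BatchNorm. The names `dim`, `kernel_size` and `activation` follow the stem snippet; the real implementation may differ.

```python
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def make_block(dim, kernel_size, activation=nn.GELU):
    return nn.Sequential(
        # Depthwise conv mixes spatial locations within each channel
        Residual(nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
            activation(),
            nn.BatchNorm2d(dim),
        )),
        # Pointwise conv mixes channels at each location
        nn.Conv2d(dim, dim, kernel_size=1),
        activation(),
        nn.BatchNorm2d(dim),
    )
```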
  • Mostly the same people as the ViT paper. Adequate (84.15% top-1 on ImageNet with Mixer-L/16) but not SOTA; benefits much more from scaling up. Common part with ViT: divide an image into NxN patches, unroll each patch and apply a linear transform. Simple Linear -> Activation -> Dropout -> Linear -> Dropout MLP layers are implemented in a `Mlp` class ("MLP as used in Vision Transformer, MLP-Mixer and related networks") whose `__init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.)` is completed in the sketch below.
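A completed sketch of the `Mlp` class as described (Linear -> Activation -> Dropout -> Linear -> Dropout); the actual library implementation may differ in small details.

```python
import torch.nn as nn

class Mlp(nn.Module):
    """MLP as used in Vision Transformer, MLP-Mixer and related networks."""
    def __init__(self, in_features, hidden_features=None, out_features=None,
                 act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        # Linear -> Activation -> Dropout -> Linear -> Dropout
        x = self.drop(self.act(self.fc1(x)))
        return self.drop(self.fc2(x))
```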
  • Another self-supervised image representation model. Key idea: during pre-training, some proportion of image patches is randomly masked and the corrupted input is fed to a Transformer. The model learns to recover the visual tokens of the original image, instead of the raw pixels of the masked patches. Image patches are flattened into vectors and linearly projected. "In our experiments, we split each 224 × 224 image into a 14 × 14 grid of image patches, where each patch is 16 × 16." An image is broken down into patches, and each patch has two representations: a patch representation via the transformer encoder, and a token from a fixed vocabulary.
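A small sketch of the patching arithmetic described above (a 224 × 224 image gives a 14 × 14 grid of 16 × 16 patches), plus an illustrative random mask; the masking ratio and the blockwise masking of the real method are not shown here.

```python
import torch

img = torch.randn(3, 224, 224)                                   # C, H, W
patch = 16

# Cut into a 14 x 14 grid of 16 x 16 patches, then flatten each patch into a vector
patches = img.unfold(1, patch, patch).unfold(2, patch, patch)     # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(14 * 14, -1)     # (196, 768)

# Randomly mask some proportion of the patches (ratio here is illustrative)
mask = torch.rand(patches.shape[0]) < 0.4
print(patches.shape, mask.sum().item(), "patches masked")
```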
  • 80.1% top-1 with linear evaluation of a ViT-Base backbone. Self-supervised, without a contrastive loss. A distillation setup where the teacher model is an EMA of the student model (see the sketch below). Data augmentation: there are two kinds of crops, global (used by the teacher) and local; the student model uses both global and local crops. The augmentation class applies random horizontal flip, color jitter and grayscale
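A minimal sketch of the EMA teacher update described above; the momentum value is illustrative and the real schedule varies over training.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.996):
    # Teacher parameters follow an exponential moving average of the student's
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1 - momentum)
```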
  • Trained on ImageNet only (1.2 million images). Competitive performance on ImageNet (84.4%). In addition to the CLS token, a distillation token was added, responsible for predicting the output of a CNN model (RegNetY-16GF). To make it work on a smaller amount of data, three kinds of techniques were used: data augmentation (repeated augmentation, auto-augment, rand-augment, random erasing, mixup, cutmix), optimization, and regularization: the model is trained at (224, 224) and then finetuned at (384, 384) (Question: HOW??? one standard answer is sketched below); it was ensured that the L2 norm of the enlarged patches was the same as the L2 norm of the regular patches.
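On the "(224, 224) then finetuned at (384, 384)" question: a common recipe (my reading of the usual ViT/DeiT practice, not a claim taken from the note) is to keep the patch size at 16, so the patch grid grows from 14 × 14 to 24 × 24, and to interpolate the positional embeddings to the new grid with bicubic interpolation, which roughly preserves their norm. A hedged sketch:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """pos_embed: (1, 1 + old_grid**2, dim), with the CLS position first."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    # Bicubic interpolation approximately preserves the L2 norm of each embedding
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode='bicubic', align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

pos = torch.randn(1, 1 + 14 * 14, 768)
print(resize_pos_embed(pos).shape)   # torch.Size([1, 577, 768])
```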
  • Pretrained on 300+ million images. SOTA on ImageNet (88.55% top-1). A CLS token is added to the patch tokens and is responsible for predicting the true label.
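A minimal sketch of prepending a learned CLS token to the patch tokens; the batch size and dimensions are illustrative.

```python
import torch
import torch.nn as nn

batch, num_patches, dim = 8, 196, 768
cls_token = nn.Parameter(torch.zeros(1, 1, dim))      # learned CLS embedding
patch_tokens = torch.randn(batch, num_patches, dim)   # output of the patch projection

tokens = torch.cat([cls_token.expand(batch, -1, -1), patch_tokens], dim=1)
print(tokens.shape)  # torch.Size([8, 197, 768]); the CLS slot feeds the classifier head
```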
  • Prelude: two ways to look at the same problem. Imagine a factory where the widget-maker makes a stream of widgets, and the widget-tester removes the faulty ones. You don’t know what tolerance the widget-tester is set to, and wish to infer it. Way 1 to create n widgets: sample a widget from a distribution of widgets and condition on the widget passing the test; if it passes the test, create n-1 widgets recursively, and if it fails, create n widgets recursively.
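A Python sketch of "Way 1" (the note's own examples are in WebPPL): keep a widget only if it passes the tester, otherwise try again. The widget distribution and the form of the test are illustrative assumptions.

```python
import random

def make_widget():
    return random.gauss(0.5, 0.2)          # widget "quality"; distribution is made up

def make_widgets(n, tolerance):
    # Way 1: condition each widget on passing the test before counting it
    if n == 0:
        return []
    widget = make_widget()
    if widget > tolerance:                 # passes the tester
        return [widget] + make_widgets(n - 1, tolerance)
    return make_widgets(n, tolerance)      # failed: try for n widgets again

print(make_widgets(3, tolerance=0.4))
```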
  • Let's take a look at the marbles problem from earlier:
    Bag1 → Unknown distribution → Sample1 → Color1
    Bag2 → Unknown distribution → Sample2 → Color2
    ...
    BagN → Unknown distribution → SampleN → ColorN
    Here we know deterministically that Color1 came from Bag1, and so on. What if we remove this information?
    [Bag1, Bag2, ..., BagN] → Sample bag1 → Sample Color1
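A small Python sketch of the second setup, where the bag is itself a latent random choice before a color is drawn; the bags and their color proportions are illustrative.

```python
import random

bags = {
    'bag1': ['red'] * 8 + ['blue'] * 2,    # illustrative color proportions
    'bag2': ['red'] * 2 + ['blue'] * 8,
}

def sample_color():
    bag = random.choice(list(bags))        # bag identity is now unobserved
    return bag, random.choice(bags[bag])

print([sample_color() for _ in range(5)])
```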
  • It's the same idea of updating prior beliefs, just applied to neural networks. Key idea: the parameters of a neural network come from one or more Gaussian distributions. Given some data, we can update the priors to come up with neural nets that fit the data better. The WebPPL snippet starts with `var dm = 10 // size of hidden layer` and a `makeFn(M1, M2, B1)` that returns a function of `x` via `T.toScalars(...)`.
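A Python sketch of the same idea (the note's code is in WebPPL): sample the weights of a tiny one-hidden-layer network from Gaussian priors, giving a random function; conditioning on observed (x, y) pairs would then update these priors. Shapes and prior scales are illustrative.

```python
import numpy as np

dm = 10                                        # size of hidden layer, as in the note

def sample_net(rng):
    # Draw all parameters from Gaussian priors
    M1 = rng.normal(0.0, 1.0, size=(dm, 1))    # input -> hidden weights
    B1 = rng.normal(0.0, 1.0, size=(dm, 1))    # hidden biases
    M2 = rng.normal(0.0, 1.0, size=(1, dm))    # hidden -> output weights
    return lambda x: (M2 @ np.tanh(M1 * x + B1)).item()

rng = np.random.default_rng(0)
net = sample_net(rng)
print(net(0.5))   # one function drawn from the prior over networks
```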
  • Humans choose the least complex hypothesis that fits the data well. How is complexity measured? How is fitness measured? If fitness is measured semantically and complexity is measured syntactically (e.g. the description length of the hypothesis in some representation language, or a count of the number of free parameters used to specify it), the two are incommensurable. In Bayesian models both complexity and fitness are measured semantically. Complexity is measured by flexibility: the ability to generate a more diverse set of observations. Key Idea: The Law of Conservation of Belief. Since all probabilities must add up to 1, a complex model spreads its probability over a larger number of possibilities whereas a simple model puts high probability on a smaller set of events. Hence (see the toy numbers below):
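A toy numeric illustration of that point (the numbers are made up): if a simple model can produce only 2 outcomes and a complex model can produce 10, and both can produce the observed outcome, the simple model assigns it more probability.

```python
# Each model must spread a total belief of 1 over everything it can generate
p_obs_given_simple = 1 / 2      # simple model: 2 possible outcomes
p_obs_given_complex = 1 / 10    # complex model: 10 possible outcomes

# With equal priors, the posterior odds equal the likelihood ratio
print(p_obs_given_simple / p_obs_given_complex)   # 5.0 in favor of the simple model
```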
  • Key idea: we learn generalized concepts naturally: poodle, Dalmatian, Labrador → dog; sedan, coupe, convertible, wagon → car. How do we build models that can learn these abstract concepts? Example 1: bags with colored balls. Each bag can learn its own categorical distribution; this explains previously observed data well but fails to generalize (see the sketch below).
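A hedged Python sketch contrasting the two modelling choices: giving every bag its own independent distribution versus letting bags share a latent prototype, so that a new bag inherits what was learned from the others. Priors and concentration values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
colors = ['red', 'blue', 'green']

# Independent bags: each bag fits its own categorical; nothing transfers to a new bag
def new_bag_independent():
    return rng.dirichlet(np.ones(len(colors)))

# Hierarchical sketch: bags are drawn around a shared prototype distribution,
# so observing several bags constrains what a brand-new bag should look like
prototype = rng.dirichlet(np.ones(len(colors)))
def new_bag_hierarchical(concentration=5.0):
    return rng.dirichlet(concentration * prototype)

print(new_bag_independent().round(2), new_bag_hierarchical().round(2))
```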
  • Key idea: how can we create complex hypotheses and representation spaces? Simply by using stochastic recursion: when the recursion ends is not deterministic but probabilistic. Here is how we can create infinitely many mathematical expressions: `var randomConstant = function() { return uniformDraw(_.range(10)) }` together with a `randomCombination(f, g)` helper (a Python sketch of the same idea is below).
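A Python sketch of the same construction (the note's code is WebPPL): generation stops at a random constant with some probability, otherwise it recursively combines two sub-expressions. The operator set and stopping probability are illustrative.

```python
import random

def random_constant():
    return str(random.randrange(10))

def random_expression(p_stop=0.6):
    # Whether the recursion ends here is itself a random choice
    if random.random() < p_stop:
        return random_constant()
    op = random.choice(['+', '-', '*'])
    return f"({random_expression(p_stop)} {op} {random_expression(p_stop)})"

print(random_expression())   # e.g. "((3 * 7) + 2)"
```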
  • Learning and the rate of learning. Let's say you see a series of heads when a coin is tossed. Your beliefs about the bias of the coin depend on two things: how likely it is to see a biased coin (the prior), and how much data you have seen. One can measure the rate of learning: how quickly the inferred belief of a learner comes close to the actual fact that the coin is biased. The note's `fairnessPosterior(observedData)` runs `Infer({method: 'enumerate'}, function() { ... })`; a Python sketch of the idea is below.
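A Python sketch of a fairness posterior (the note's code is WebPPL). It compares a "fair coin" hypothesis against a "trick coin" hypothesis after a run of heads; the prior on fairness and the trick coin's weight are assumptions for illustration.

```python
def fairness_posterior(num_heads, prior_fair=0.999, p_heads_trick=0.95):
    # Posterior probability that the coin is fair after num_heads heads in a row
    joint_fair = prior_fair * 0.5 ** num_heads
    joint_trick = (1 - prior_fair) * p_heads_trick ** num_heads
    return joint_fair / (joint_fair + joint_trick)

for n in (1, 5, 10, 15):
    print(n, round(fairness_posterior(n), 3))
# How fast belief in fairness collapses depends on both the prior and the data
```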
  • Markov Chain Monte Carlo (MCMC). The idea is to find a Markov chain whose stationary distribution is the same as the conditional distribution we want to estimate. E.g. we want to estimate a geometric distribution: `var p = .7`, `var geometric = function(p){ return ((flip(p) == true) ? 1 : (1 + geometric(p))) }`, and `var post = Infer({method: 'MCMC', samples: 25000, lag: 10, model: function(){ ... }})` (a Python sketch of a Metropolis sampler for the same target is below).
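A Python sketch of the same idea using a hand-rolled Metropolis sampler over the positive integers (the note itself relies on WebPPL's built-in MCMC); the samples/lag settings mirror the snippet, while the random-walk proposal is an illustrative choice.

```python
import random
from collections import Counter

p = 0.7

def geometric_pmf(k):
    # P(K = k): first success on trial k, k = 1, 2, ...
    return (1 - p) ** (k - 1) * p

def mcmc(samples=25000, lag=10):
    k, draws = 1, []
    for step in range(samples * lag):
        proposal = max(1, k + random.choice([-1, 1]))      # symmetric random walk
        if random.random() < min(1.0, geometric_pmf(proposal) / geometric_pmf(k)):
            k = proposal                                   # accept, otherwise stay put
        if step % lag == 0:
            draws.append(k)
    return draws

print(Counter(mcmc()).most_common(5))   # should be close to the geometric pmf
```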
  • A Bayesian cognitive model and Bayesian data analysis are the same thing under the hood; they just have different contexts. If the generative model is a hypothesis about a person’s model of the world, then we have a Bayesian cognitive model – the main topic of this book. If the generative model is instead the scientist’s model of how the data are generated, then we have Bayesian data analysis. Two competing hypotheses: consider the example of a spinning coin (as opposed to a flipping coin). Scientists believe the probability of landing heads up is uniform on [0, 1]; people might believe it is the same as for flipping an unbiased coin, i.e. 0.5 (a toy comparison of the two hypotheses is sketched below).
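A toy comparison of how the two hypotheses score some observed spins; the data (15 heads out of 20) and the Bayes-factor framing are illustrative, not taken from the note.

```python
from math import comb

def p_data_uniform_bias(k, n):
    # Marginal likelihood when p ~ Uniform[0, 1]: integrates to 1 / (n + 1) for any k
    return 1 / (n + 1)

def p_data_fair_coin(k, n):
    # Likelihood under the fixed hypothesis p = 0.5
    return comb(n, k) * 0.5 ** n

k, n = 15, 20
print(round(p_data_uniform_bias(k, n) / p_data_fair_coin(k, n), 2))
# > 1 means the skewed data favor the "uniform bias" hypothesis over the fair coin
```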
  • Two forms of dependence are explored in detail:
    a) Screening off: the graphical model looks like • ← • → • or • → • → •. It is called so because if the variable(s) in the middle node are observed, the corner variables become independent. Screening off is a purely statistical phenomenon. For example, consider the causal chain model, where A directly causes C, which in turn directly causes B. Here, when we observe C – the event that mediates an indirect causal relation between A and B – A and B are still causally dependent in our model of the world: it is just our beliefs about the states of A and B that become uncorrelated. There is also an analogous causal phenomenon. If we can actually manipulate or intervene on the causal system, and set the value of C to some known value, then A and B become both statistically and causally independent (by intervening on C, we break the causal link between A and C).
    b) Explaining away: the graphical model looks like • → • ← •. If the bottom variable is observed, previously independent variables (the two roots at the top) become dependent. The most typical pattern of explaining away we see in causal reasoning is a kind of anti-correlation: the probabilities of two possible causes for the same effect increase when the effect is observed, but they are conditionally anti-correlated, so that observing additional evidence in favor of one cause should lower our degree of belief in the other cause. (This pattern is where the term explaining away comes from.) A small simulation is sketched below.
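A small simulation of explaining away for two independent causes of a common effect (effect = A or B); the prior probabilities are illustrative.

```python
import random

def sample():
    a = random.random() < 0.3          # cause A
    b = random.random() < 0.3          # cause B, independent of A a priori
    return a, b, (a or b)              # common effect

draws = [sample() for _ in range(200_000)]

given_effect = [(a, b) for a, b, e in draws if e]
p_a = sum(a for a, _ in given_effect) / len(given_effect)
with_b = [a for a, b in given_effect if b]
p_a_given_b = sum(with_b) / len(with_b)

print(round(p_a, 3), round(p_a_given_b, 3))
# P(A | effect) ≈ 0.59 but P(A | effect, B) ≈ 0.3: learning B explains A away
```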
  • Causal Dependence: expression A depends on expression B if it is ever necessary to evaluate B in order to evaluate A. What about an expression like `A = C ? B + 2 : 5`? Does A depend on B? The answer: only in certain contexts, namely only when C is true (see the snippet below).
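A tiny Python rendering of that expression; `B` is wrapped as a function so that it is only evaluated on the branch where `C` holds.

```python
def A(B, C):
    # B() is evaluated only when C is true,
    # so A causally depends on B only in that context.
    return B() + 2 if C else 5

print(A(lambda: 10, True), A(lambda: 10, False))   # 12 5
```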
  • Much of cognition can be understood in terms of conditional inference. In its most basic form, causal attribution is conditional inference: given some observed effects, what were the likely causes? Predictions are conditional inferences in the opposite direction: given that I have observed some cause, what are its likely effects? Inference can be done in various ways; the most basic way is rejection sampling: `var model = function () { var A = flip(); var B = flip(); var C = flip(); var D = A + B + C; condition(D >= 2); ... }` (a Python version is sketched below).
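A Python version of the same rejection-sampling query (the note's code is WebPPL); returning A as the queried quantity is my choice for illustration.

```python
import random

def rejection_sample():
    # Sample until the condition D >= 2 holds, then return the quantity of interest
    while True:
        A, B, C = (random.random() < 0.5 for _ in range(3))
        if A + B + C >= 2:
            return A

draws = [rejection_sample() for _ in range(100_000)]
print(sum(draws) / len(draws))   # ≈ 0.75 = P(A | A + B + C >= 2)
```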