fhuszar

@fhuszar

Joined on Sep 14, 2020

  • In previous notes I wrote about the approximation power of simple neural networks, such as multilayer perceptrons with the ReLU activation function. These networks are able to represent piecewise linear functions with a large number of linear pieces, which in turn can approximate a broad class of continuous functions provided that the number of pieces is large enough. But piecewise linear functions feel like very clunky function approximators. Although they can approximate a smooth function, they can also in theory zig-zag between datapoints in a wild fashion, presumably leading to really poor generalisation performance. As neural networks are overparametrised, meaning they have more parameters than would be strictly needed to fit the training data, minimising the training loss is an underdetermined problem with multiple solutions. The different possible solutions may have wildly different properties. There is no guarantee that a minimum of the training loss will be a 'nice' solution that generalises to unseen test data. Thankfully, even simple neural networks like ReLU MLPs have built-in inductive biases which make it more likely that 'nicer' kinds of solutions to learning problems are found. These biases or tendencies towards nice solutions are not fully understood; uncovering and characterising them is an active area of research. There are different notions of 'nice' that have been put forward, and researchers have been able to produce more or less strong evidence of the presence of these kinds of inductive biases. In this note I mention three ways to describe what a 'nice' solution means.
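To make the 'piecewise linear' picture concrete, here is a minimal PyTorch sketch (my own illustration, not part of the original note) that evaluates a small, randomly initialised ReLU MLP on a dense 1D grid and counts its linear pieces via changes in slope.

```python
import torch

torch.manual_seed(0)

# A small ReLU MLP from R to R; any such network computes a piecewise linear function.
mlp = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

x = torch.linspace(-3, 3, 2001).unsqueeze(1)  # dense grid of inputs
with torch.no_grad():
    y = mlp(x).squeeze(1)

# Within a linear piece the slope is constant, so counting slope changes
# along the grid gives (approximately) the number of linear pieces.
slopes = (y[1:] - y[:-1]) / (x[1:, 0] - x[:-1, 0])
num_pieces = (slopes.diff().abs() > 1e-4).sum().item() + 1
print(f"approximate number of linear pieces on [-3, 3]: {num_pieces}")
```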
  • Deep learning has seen tremendous growth over the last decade and a half, resulting in a sweeping change in artificial intelligence and the way computers are used to solve real-world problems. I wanted to illustrate this sweeping change through four timelines: applications; methodology; responsible use and safety; and theory of deep learning. Timeline 1: Breakthroughs in Applications Of course, one of the reasons deep learning is so important is because of the several applications it has enabled. With it, computers were able to tackle problems that evaded previous approaches. In particular, deep learning excels in situations where natural images, video, or language data has to be processed, understood, manipulated, or acted on. Here is a timeline of some significant headliner breakthroughs.
  • Extension of the $abc$-conjecture to positive rational numbers Let’s represent each natural number as the sequence of prime multiplicities $\alpha(n, p)$. For example, the number $36 = 2^23^2$ is represented by the sequence $2, 2, 0, 0, 0, \ldots$. Similarly, $15$ would be represented as $0, 1, 1, 0, 0, 0, \ldots$. We can describe the notion of relative primeness in terms of orthogonality of these sequences. Let’s define a scalar product between two numbers as $\langle a, b \rangle = \sum_{p\in\text{primes}} \alpha(a, p) \alpha(b, p)$, where $\alpha(n,p)$ is the multiplicity of prime $p$ in the number $n$. Using this definition, we can note that two natural numbers will have a scalar product of $0$ exactly when they are coprime / relatively prime. With this inner product thinking in mind, we can extend our consideration to positive rational numbers. Each positive rational number can be written as $\frac{k}{l}$ where $k$ and $l$ are relatively prime. We can map this rational number to the sequence $\alpha(k, p) - \alpha(l, p): p \in \text{primes}$. So, for example, $\frac{2}{3}$ would become $1, -1, 0, 0, 0, \ldots$ and $\frac{36}{25}$ would be represented as $2, 2, -2, 0, 0, \ldots$. We can similarly extend our notion of scalar product to rational numbers $\frac{a}{b}$ and $\frac{c}{d}$ as follows:
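As a sanity check of these definitions, here is a small Python sketch (my own illustration, not part of the note) that computes the prime-multiplicity representation of a natural number and the scalar product, and verifies that the product is zero exactly when the two numbers are coprime.

```python
from math import gcd

def alpha(n: int) -> dict:
    """Prime multiplicities of n, e.g. alpha(36) == {2: 2, 3: 2}."""
    factors, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def scalar_product(a: int, b: int) -> int:
    """<a, b> = sum over primes p of alpha(a, p) * alpha(b, p)."""
    fa, fb = alpha(a), alpha(b)
    return sum(m * fb.get(p, 0) for p, m in fa.items())

print(scalar_product(36, 15))   # both share the prime 3: 2 * 1 = 2
print(scalar_product(36, 35))   # coprime, so 0

# the scalar product is zero exactly when the two numbers are coprime
assert all((scalar_product(a, b) == 0) == (gcd(a, b) == 1)
           for a in range(2, 50) for b in range(2, 50))
```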
  • Start with Audience I usually start by asking the audience to volunteer definitions and thoughts on AI. I ask them to tell me what AI is. I wait a few minutes, and usually a few participants will start saying ideas. It's important to remind them that there is no single right answer, and that they can say something moderately silly. If there are no volunteers, or very few, I sometimes prompt the audience (has anyone thought of robots?). I write each thought on my iPad (or a whiteboard) and then I spend time discussing each, saying why they are good thoughts, and perhaps highlighting ways in which they don't fully capture AI. Humans vs Machines figure I then say 'Here is how I usually think about AI' and draw a Venn diagram. On the left, I draw the set of tasks humans can accomplish (explaining that this includes everything from filing your taxes, to solving maths problems or doing the dishes) and on the right, overlapping, I draw the set of tasks machines/computation/computers can accomplish. I then talk about how there are examples everywhere. If I have time, I ask participants to volunteer ideas of things only humans can do, or things only machines can do, etc.
  • An Intro Guide for High-School Students Preparing for AI Olympiads PyTorch is a powerful and versatile library for numerical computing and machine learning. It’s built on Python, a language known for its simplicity and vast ecosystem. While AI can be implemented in other programming languages, the overwhelming majority of open-source tools and frameworks—like TensorFlow, Scikit-learn, and PyTorch—are written in Python. This makes Python the de facto standard for artificial intelligence and machine learning. This note introduces three key aspects of PyTorch: its capabilities for numerical programming, its use of automatic differentiation, and its features for building neural networks and differentiable programs. 1. Numerical Programming with PyTorch PyTorch as a Numerical Programming Tool PyTorch is, first and foremost, a numerical programming package, akin to a "calculator app" for Python. It provides tools to perform mathematical operations efficiently and is especially optimized for large-scale computations like those in machine learning. Another example of a numerical programming library is NumPy, but PyTorch goes beyond it by offering features like GPU support and automatic differentiation, which make it more suitable for deep learning. Tensors and Their Operations
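To make the 'calculator app' analogy concrete, here is a minimal PyTorch snippet (an illustrative sketch of my own, not taken from the guide) showing tensors and a few basic operations, including moving computation to a GPU when one is available.

```python
import torch

# Tensors are n-dimensional arrays, the basic objects PyTorch computes with.
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.ones(2, 2)

print(a + b)        # elementwise addition
print(a @ b)        # matrix multiplication
print(a.mean())     # reduction over all elements

# Unlike NumPy, the same code can run on a GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(a.to(device).sum())
```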
  • Let's consider a fully observed MDP and $K$ agents, each trained to optimise a private reward function $r_k$. For simplicity, let's say that the reward is a function of the terminal state of the MDP. Example: The MDP can be an online chat with a human, where actions are the chatbot's responses, and the state is the current transcript of the chat so far. The $K$ agents are different chatbot LLMs. Each version might optimise for a slightly different outcome. It is possible that one or multiple agents have a sinister goal, i.e. that their private reward function rewards misleading or deceiving the user. Our goal is to design a mechanism, a game, whereby the $K$ agents collaboratively control the MDP, in such a way that they reveal information about their private reward functions along the way. We seek an incentive compatible mechanism: a mechanism in which the rational behaviour of each agent is to honestly share the private information they have. I.e. in this case, if the agent wants to optimise for reward function $r$, its optimal behaviour involves sharing information relating to $r$ honestly. Related: Vickrey–Clarke–Groves mechanism The VCG mechanism is a truthful mechanism that involves bidding on a set of items or outcomes by a set of agents.
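To make the VCG idea concrete in its simplest form, here is a tiny Python sketch (my own illustration, not part of the note) of the single-item special case, where VCG coincides with the second-price (Vickrey) auction: truthful bidding is a dominant strategy because each agent's payment does not depend on its own bid.

```python
def vcg_single_item(bids: dict) -> tuple:
    """Single-item VCG, i.e. a second-price / Vickrey auction.

    bids: mapping from agent name to reported value.
    Returns (winner, payment): the highest bidder wins and pays the
    externality it imposes on the others, i.e. the second-highest bid.
    """
    winner = max(bids, key=bids.get)
    others = [v for agent, v in bids.items() if agent != winner]
    payment = max(others) if others else 0.0
    return winner, payment

# The winner's payment depends only on the other agents' bids,
# so reporting the true value is a dominant strategy.
print(vcg_single_item({"agent_1": 3.0, "agent_2": 5.0, "agent_3": 4.0}))
# -> ('agent_2', 4.0)
```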
  • Module 1: October-November: basics of DL [DL] pytorch intro [DL] numerical computing, tensor computing [DL] optimization, optimizers [DL] basic MLP, Weights&Biases [maths] LinAlg 1: basics, dot product, matrix multiplication, cosine similarity [maths] ML 1: Supervised learning: cross-validation, overfitting, hyperparameter tuning [maths] ML 2: squared loss, cross entropy First Kaggle contest
  • Problems, May 22 Jancsi and Juliska's dad baked gingerbread cookies, which he divided equally between the two children. The children immediately sat down to eat. Juliska ate a quarter of her cookies, while Jancsi ate a third of his. How many cookies did Dad bake in total, if Jancsi ate exactly 3 more cookies than Juliska? Szofi computed the product $123456789\times 8$ and was surprised to find that the result is almost $987654321$, except that two digits appear in a different order. What is the sum of these two digits? What is the smallest three-digit number that gives the same remainder when divided by 3 as when divided by 2? In the expression $1\square2\square3\square4$, each $\square$ can be replaced by either a $+$ or a $\times$ operation. What is the largest number we can obtain this way? Hanna has $1\times1$ Lego bricks in two colours, orange and blue. Using these she built a strip of length 10, like the one shown in the picture. How many different patterns could Hanna's construction show in total? [figure: a strip of 10 coloured bricks] Leó also wanted to build patterns like Hanna's, but he had different Lego bricks: black ones of size $2\times1$ and white ones of size $1\times1$. From these he built a $10\times1$ strip, like the example below. Who could build more different patterns with their bricks, Leó or Hanna? How many different patterns could Leó build? [figure: a strip of black and white bricks] Leó then built a wall from the same kind of bricks, 10 units wide, 1 unit deep and 3 units high, as in the picture. How many different patterns could the whole wall show? (Leó did not make sure that bricks in rows above each other interlock.)
  • Lesson 1: pytorch, a deep learning framework A deep learning framework provides three areas of functionality: numerical array computing: a convenient API to manipulate tensors and arrays: perform operations between arrays: elementwise, indexing, matrix products, Einstein sum notation. common API over different computing architectures: run the same code on CPU or GPU. Demo of backend selection on colab, speed comparison, API tips for moving tensors to/from cuda. EASY EXERCISE: simple-ish pytorch array indexing. OR DIFFICULT EXERCISE: implement a math equation using einsum. automatic differentiation
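As a hint at what the backend-selection and einsum demos might look like, here is a minimal PyTorch sketch (my own, with made-up tensor names; not the actual course demo).

```python
import torch

# Backend selection: the same code runs on CPU or GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

A = torch.randn(64, 32, device=device)
B = torch.randn(32, 16, device=device)

# The same matrix product written three equivalent ways.
C1 = A @ B
C2 = torch.matmul(A, B)
C3 = torch.einsum("ij,jk->ik", A, B)   # Einstein sum notation
print(torch.allclose(C1, C3))

# Moving tensors between devices.
C_cpu = C1.to("cpu")
```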
  • What is the last digit of the number obtained as $x = \underbrace{9\cdot9\cdot9\cdot9\cdot9\cdot9\cdot9\cdot9\cdot9}_{9\text{ times}}$? What is the last digit of the number obtained as $x = \underbrace{9\cdot9\cdots9\cdot9}_{100\text{ times}}$? What is the last digit of $\underbrace{7\cdot7\cdots7\cdot7}_{100\text{ times}}$? Juli was bored in music class, so she multiplied together the first 10 positive integers, i.e. she computed $1\cdot2\cdot3\cdot4\cdot5\cdot6\cdot7\cdot8\cdot9\cdot10$. The last few digits of the number were $0$. Can you tell how many zero digits there were at the end of the number? Juli was even more bored in grammar class, so she continued multiplying and computed $1\cdot 2\cdot 3\cdots 23 \cdot 24 \cdot 25$. Can you tell how many zeros this number ends in? Gyuri, the Calvinist Martian, has blue, red and green socks in his sock drawer, infinitely many of each. How many socks does Gyuri have to pull out in the dark to be sure he has a matching pair? The catch is that Martians have 7 legs. How many socks does he have to pull out to be sure he has a sock of the same colour for each of his legs? Today is 'odd socks day' at Gyuri's school, so he has to dress so that adjacent legs never wear the same colour of sock (his legs are arranged in a circle). How many socks does he have to pull out to be sure he can manage this?
  • A Research Question: Non-identifiable behaviour of meta-learned Bayesian inference. Let's say we train a transformer, or an ARLM in general, on exchangeable data (a mixture of iids). This, essentially, meta-trains the model to perform implicit Bayesian inference. To say that the model performs implicit Bayesian inference means that, roughly, "when evaluated on data consistent with the pretraining mixture, its predictions/completions are indistinguishable from (or very close to) what exact Bayesian inference would predict". However, how should we expect a model which performs implicit Bayesian inference in a Bayesian model to behave when confronted with observations with respect to which the Bayesian model is misspecified, i.e. on prompts that are out-of-distribution? Continuing to perform (implicit) Bayesian inference in a misspecified model is, as will be discussed below, suboptimal. It's conceivable that the model does even better than Bayesian inference on OOD data. Or that it matches Bayesian behaviour in the OOD setting. Or that it does absolute garbage in the OOD setting. A little bit more formally So, we train an ARLM on a mixture distribution of the form
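The mixture the note goes on to define is presumably the standard de Finetti / mixture-of-i.i.d. form; as a hedged reconstruction (my notation, not necessarily the note's exact formula), something like

$$
p(x_1, \ldots, x_N) = \int \prod_{n=1}^{N} p(x_n \mid \theta)\, \mathrm{d}\pi(\theta),
$$

where $\pi$ is a prior over component distributions, and the autoregressive model is trained on sequences sampled from this mixture.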
  • Most generalisation bounds assume data is sampled i.i.d. from some distribution. This prevents these bounds from saying anything meaningful about situations like domain adaptation or domain generalisation, where multiple related datasets are available from slightly different domains/contexts, or transfer learning, where some information about a dataset captured by a learning algorithm is transferred over to a related task, or meta-learning, where we are learning to learn from data. Relaxing the i.i.d. assumption to exchangeability might allow a mathematically tractable way to extend generalisation bounds to a broader set of situations. For example, different domains may be modelled as i.i.d. copies of an underlying exchangeable process. Relaxing i.i.d. to exchangeable The first question to investigate is whether it is possible to obtain generalisation bounds for exchangeable, rather than i.i.d., data generating processes, or how to even define them. We assume there exists an infinitely exchangeable process $Z_1, Z_2, \ldots$, and our training dataset is a finite marginal $\mathcal{D} = [Z_1, \ldots, Z_N]$. Based on this, we can define the empirical risk in the usual fashion:
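The 'usual fashion' presumably means the standard empirical risk; as a sketch in this note's notation (my reconstruction of the obvious definition):

$$
\hat{R}(h, \mathcal{D}) = \frac{1}{N}\sum_{n=1}^{N} \ell(h, Z_n),
$$

for a loss $\ell$ and hypothesis $h$; the interesting question is then what the right notion of population risk is when the $Z_n$ are exchangeable rather than i.i.d.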
  • Causal de Finetti world ICM generative model A multivariate exchangeable sequence $\{X_{i, n}\}_{i\in\{1,\ldots,D\}, n\in\mathbb{N}}$ is a collection of random variables that is infinitely exchangeable in the variable $n$. An ICM (independent causal mechanism) generative model with directed acyclic graph $\mathcal{G}$ over node set $V = \{1,\ldots,D\}$ as its causal structure is defined by the following joint distribution over such variables: $$ P(\{X_{i, n}\}_{i\in\{1,\ldots,D\}, n\in\{1,\ldots,N\}}) = \int \cdots \int \prod_{n=1}^{N}\prod_{i=1}^{D} P(X_{i,n}\vert X_{\text{Pa}(i), n}, \theta_i) \, d\pi(\theta_1)\cdots d\pi(\theta_D), $$
  • A sequence $X_1, \ldots, X_N$ of random variables is finitely exchangeable if for all permutations $\pi$ $$ P(X_1=x_1,\ldots,X_N=x_N) = P(X_1 = x_{\pi(1)},\ldots,X_N = x_{\pi(N)}). $$ An infinite collection of RVs $X_1, X_2, \ldots$ is infinitely exchangeable if, for any $N$, $X_1, \ldots, X_N$ is exchangeable. Extending finite exch. sequences Any finite subset of an exchangeable sequence is finitely exchangeable. A natural question to ask is whether a finite exchangeable sequence of length $N$ can be extended to an infinitely exchangeable sequence.
  • Probabilistic Representation Learning Ferenc Huszár Computer Lab, Cambridge University slides: https://hackmd.io/@fhuszar/rJWPUmCCo On the advantages of using slides Thursday
  • Consider a probabilistic model represented by its likelihood $p(y\vert\theta)$. Maximizing this likelihood to find the most likely parameter $\theta$ is equivalent to minimizing the negative log-likelihood, i.e. the loss function $\mathcal{L}(\theta, \mathcal{D}) = -\sum_{y\in\mathcal{D}} \log p(y\vert \theta)$. Many common loss functions have interpretations as the negative log-likelihood of a probabilistic model; for example, the squared loss is the negative log-likelihood under a Normal distribution. To tackle this optimization we use gradient descent (GD) over the parameter $\theta$, which in each step finds the steepest descent direction in the local neighbourhood of the current value of $\theta$ in the parameter space. If we change how we parametrise our model, we also change what the steepest descent direction means. Now, if $\theta_1$ and $\theta_2$ are two parameter sets describing the same model, updating our model in the steepest descent direction around $\theta_1$ might result in a different model than if we update in the steepest descent direction around $\theta_2$. This is why you saw those differences in Assignment 1. Notice how the loss function can be thought of as a composition of two mappings: $$ \theta \mapsto p(y\vert \theta) \mapsto \mathcal{L}(\theta, \mathcal{D}) $$
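To see this parametrisation dependence concretely, here is a tiny PyTorch sketch (my own illustration with a toy loss, not from the lecture) comparing a single gradient-descent step on the same model written in two parametrisations, $\sigma$ and $\rho = \log\sigma$: starting from the same model, the two steps land on different models.

```python
import torch

def loss_from_sigma(sigma):
    # Toy loss defined directly on the model (here the "model" is just a scale sigma).
    return (sigma - 2.0) ** 2

eta = 0.1

# Parametrisation 1: optimise sigma directly.
sigma = torch.tensor(1.0, requires_grad=True)
loss_from_sigma(sigma).backward()
sigma_after_1 = (sigma - eta * sigma.grad).item()

# Parametrisation 2: optimise rho = log(sigma), describing the same model.
rho = torch.tensor(0.0, requires_grad=True)   # exp(0) = 1, the same starting model
loss_from_sigma(torch.exp(rho)).backward()
sigma_after_2 = torch.exp(rho - eta * rho.grad).item()

# Same starting model, same loss, but one GD step gives different models.
print(sigma_after_1, sigma_after_2)   # roughly 1.2 vs 1.22
```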
  • Stochastic Gradient Descent In previous lectures we described gradient descent as follows: $$ \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla_\mathbf{w}\hat{L}(\mathbf{w}_t, \mathcal{D}), $$ where $\hat{L}$ is the empirical loss function evaluated on the training data $\mathcal{D}$. For models trained on large datasets this is both computationally prohibitive and unnecessary: Calculating $\hat{L}$ requires us to sum up the loss on all datapoints. There might be too many datapoints in our training set to loop through for each gradient update of our parameters.
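As a minimal sketch of the contrast with full-batch GD (my own illustration on made-up linear-regression data, not the lecture's code), each SGD update uses the gradient computed on a small random minibatch instead of the whole training set:

```python
import torch

torch.manual_seed(0)

# Synthetic linear-regression data (made up for illustration).
X = torch.randn(10_000, 5)
y = X @ torch.tensor([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * torch.randn(10_000)

w = torch.zeros(5, requires_grad=True)
eta, batch_size = 0.1, 64

for step in range(500):
    idx = torch.randint(0, X.shape[0], (batch_size,))   # random minibatch
    loss = ((X[idx] @ w - y[idx]) ** 2).mean()          # loss on the minibatch only
    loss.backward()
    with torch.no_grad():
        w -= eta * w.grad                                # SGD update
        w.grad.zero_()

print(w)   # close to the true weights, without ever computing a full-batch gradient
```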
  • These are my notes on the following paper: https://arxiv.org/abs/2104.04874 which analyses the inductive bias of SGD relative to full-batch GD using Taylor expansions. In order to make this paper easier to understand, I present the key ideas using modern machine learning notation. The expected effect of a GD step on the generalisation gap The generalisation gap, for the purposes of this paper, is the difference between a model's test loss and training loss. In this work we look at how this generalisation gap changes in a single step of gradient descent on the training data, starting from a fixed parameter $\theta$. Since the training and test data are random, the change in generalisation gap is also random; we will be interested in the average of this change, averaged over realisations of the training and test data. Let $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$ be a random training and test set, drawn i.i.d. from the same underlying distribution over some data $x$. Let $\mathcal{L}(\theta, \mathcal{D}) = \frac{1}{\vert\mathcal{D}\vert}\sum_{x\in \mathcal{D}}\ell(x, \theta)$ be the empirical loss evaluated on a set of data $\mathcal{D}$. Let $g(\theta, \mathcal{D}) = \nabla_\theta \mathcal{L}(\theta, \mathcal{D})$ be the gradient evaluated on a set of data $\mathcal{D}$.
  • These are some general instructions on presenting a paper at a reading group, or reading seminar. I am writing these notes with our reading seminar-style master's level course in mind: R252 Theory of Deep Learning. Preparations: Reading and Understanding the paper It's not just the one paper: Even though the topic of a reading group presentation is usually a single paper, you should expect to read several more papers, and related content. Sometimes you will need to read up on the background, say an area of maths you don't know or forgot about - Wikipedia might suffice here - or perhaps the paper builds strongly on a previous one, in which case you might spend quite a bit of time reading the original old paper before starting the actual paper. So, part of reading a paper for a reading group presentation actually involves quite a bit of looking up and sorting through possibly relevant information. Find video lectures, blog posts, and follow-up papers: Since ML is a very quickly evolving field, it may be that the paper you're reading and presenting already has follow-ups, or sometimes rebuttals. It's a good idea to look at papers citing the work you're reading using Google Scholar's "Cited by" functionality, to identify other influential papers that may build on it. It's also a very good idea to look for blog posts and video lectures about the paper, to get a feel for how others presented the content, and what they focussed on as they read it. Reading papers: There's a bunch of good advice out there on how to read papers. I like Keshav's writeup How to Read a Paper, which suggests a 'progressive', multi-pass approach. Andrew Ng also has a video with advice on finding and reading ML papers; here is a writeup about this advice. Try to make the paper yours: The best reading group presentations are those that don't merely follow the narrative structure of the paper, but which add some extra interpretation, or a new dimension to it. As I prepare for presenting a paper, I always think about ways in which I can make the presentation mine, adding something that's not obvious from reading the paper. How much of the maths to understand depends on who you're presenting to, what the purpose of the presentation is, and the role the maths plays in the paper. My rule of thumb: for a reading group presentation on theory it's useful to understand the "moral" of the maths and proofs, that is, what the purpose of the mathematical statements is, and why, fundamentally, they are true. Don't worry if you don't understand every technical step of a proof to the point you could reproduce it, at least in my course. Focus on proofs only if you think the proof itself is insightful, if it tells you something about why the claims are true. Identify critical assumptions that make the proof work. Mathematics always deals with a model of the 'real world', and it's important to understand the aspects in which the mathematical model differs from the real world, and whether you think it's reasonable to ignore these other aspects. The presentation
  • Why I would be very excited to study computer science So you've been doing competitive programming, maybe maths competitions too. You did well, so it's natural for you to go study computer science at a top university. But are you aware of the huge world of computer science that is about to open up for you? This is why, if I were you, I would be super excited to start a Computer Science degree in 2024. Computer science is so much more than programming. But even if we stay within the realm of programming, there are tons and tons of interesting things to learn about the various programming paradigms. There is imperative, procedural, declarative, functional, and object-oriented programming. There are more programming languages than you could ever learn, and many different ways of thinking about problem solving. You may know how to implement an algorithm in C++, but wait until you have to find out how to implement it in Haskell. And the theory of programming languages leads to a rich landscape full of theory that provides solid computational foundations for mathematical reasoning: type theory, category theory, formal verification. Some programming languages allow you to write software with mathematical guarantees that it does what you wanted it to do. Others allow you to express proofs of mathematical theorems as code, making it possible for mathematical proofs to be algorithmically verified by a computer. In ten years' time perhaps all of today's research mathematics will be "implemented" on top of theorem provers. And perhaps soon enough, computers will be better at finding proofs than human mathematicians. Computer science is also much more than finding fast algorithms. Theoretical computer science studies which problems admit fast algorithms in the first place. What kinds of computation can you even devise an algorithm for? And does it matter what kind of computer you design the algorithm for? What happens if you have a completely new type of computing paradigm, like quantum computers? Some quantum algorithms solve problems exponentially faster than classical algorithms. But is this true for all kinds of problems? What are the limits of quantum computing? How do you even program a quantum computer? And we want to tackle problems that are so complex and messy that no one can possibly hope to devise an algorithm to solve them: folding proteins, playing Go, conversing in human language, generating natural-looking images, or building mathematical theory. In machine learning we don't program what algorithm or function the computer should use; instead we only program general learning algorithms that allow the computer to learn from examples and find a function that works well enough. Artificial Intelligence is advancing crazy fast. The AI models we take for granted today: nothing like them existed just five years ago. Who knows what this field is going to look like 5 or 10 years in the future?