# An opinionated guide to ML
> "It is important to view knowledge as sort of a semantic tree — make sure you understand the fundamental principles, i.e. the trunk and big branches, before you get into the leaves or there is nothing for them to hang on to." - Elon Musk
### Simple > accurate
Many people overestimate the importance of learning what's correct and accurate, when it's often better to learn something slightly wrong but simple to understand. Imagine teaching a kid that gravity at the poles is stronger than at the equator due to mass distribution and rotation, before teaching them about the effect of mass on gravity. That's what most technical guides to machine learning (ML) feel like: overly detailed info-dumps that don't help you form intuitions about how things work.
This guide does the exact opposite. Wherever possible, I have deliberately chosen to oversimplify so as to help you reach an intuitive understanding.
### The relevant parts are not deep, because they are quite new
> Researchers hate this one weird trick!
If you pick up any old ML textbook you'll probably find stuff on LSTMs, RNNs, and whatnot. Next time, flip the book over and check if it's published before 2017. If it was, you can throw it away. That's because...
... the Transformer machine learning architecture has won, and it was only discovered in [2017](https://arxiv.org/abs/1706.03762).
ChatGPT? Transformer. Stable Diffusion? Transformer. LLaMA? Transformer.
And what about "emergent properties", like chain-of-thought and few-shot learning?
Wasn't even a thing until [2020](https://arxiv.org/abs/2005.14165).
With the power of hindsight we can ignore all the hard work that has gone into other branches of ML research and only focus on a handful of high-impact concepts.
The fastest way to get up to speed with the state-of-the-art is by grokking the following concepts (and their contexts) in this order:
### Perceptron > Multilayer Perceptron > Word Embeddings > ReLU > Adam Optimizer > Transformer > Emergent Capabilities
```
(Optional)
If you want to experiment with code, do this setup:
- Install VSCode
- Install the VSCode Remote-SSH extension <- works with Jupyter notebooks!
- Install Copilot
- Install Miniconda (use Miniforge if you're on OSX)
- Install Jupyter (not Jupyterlab)
- Install Pytorch (not tensorflow nor JAX)
If you must use a GPU, only Nvidia is compatible.
If you must use Linux, use Ubuntu 22.04.
If you must use a cloud provider, just pick any one (but use Ubuntu 22.04, https://gist.github.com/amir-saniyan/b3d8e06145a8569c0d0e030af6d60bea)
```
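If you did the setup above, here's a quick sanity check (a minimal snippet, assuming the PyTorch install described above) to confirm that PyTorch imports and whether an Nvidia GPU is visible:
```
import torch
# Print the installed PyTorch version
print(torch.__version__)
# True only if an Nvidia GPU and a CUDA-enabled build of PyTorch are available
print(torch.cuda.is_available())
```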
### Perceptron

#### TLDR: A perceptron is like a neuron. It has inputs (x, y, z), weights (w_0, w_1, w_2) and an output. It learns by using a loss function and backpropagation to update the weights.
output = w_0 * x + w_1 * y + w_2 * z
At the start, we use randomized weights and the output is also random.
We then use a loss function to calculate how far the output is from the correct answer, and backpropagation to work out how much each weight contributed to that error. An algorithm like gradient descent then uses those gradients to update the weights.
###### Here's some Python code I generated using ChatGPT to "implement a perceptron", but you can just ask ChatGPT yourself for other models / examples. Teach a man to fish and all that.
```
import torch
# Define the input tensor and target tensor
x = torch.tensor([2.0, 3.0, 1.0])
y_true = torch.tensor([1.0])
# Define the perceptron model with a single neuron
model = torch.nn.Linear(3, 1)
# Define the loss function (mean squared error)
loss_fn = torch.nn.MSELoss()
# Define the optimizer (stochastic gradient descent)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Perform forward pass
y_pred = model(x)
loss = loss_fn(y_pred, y_true)
# Perform backward pass
optimizer.zero_grad()
loss.backward()
# Update the model parameters
optimizer.step()
```
### Multilayer Perceptron (MLP)

#### TLDR: Many perceptrons connected in layers. It's the most basic neural network. It's a universal approximator and it can technically learn anything. Earlier weights learn slowest.
You have many inputs, you use random weights, you get some output. To train it, we use a loss function to check if the generated output is close to the expected output. If not, we backpropagate the error and update the weights. Repeat until your loss function plateaus (there's a code sketch after the list below).
Some interesting insights:
- Multilayer perceptrons are [universal approximators](https://www.cs.cmu.edu/~epxing/Class/10715/reading/Kornick_et_al.pdf), meaning this architecture is capable of learning any type of pattern / function. Limited by the number of weights and layers of course.
- Earlier layers learn more slowly, because they are further from the output, so the gradient that reaches them (and hence the weight update) is smaller
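Here's a minimal training-loop sketch in PyTorch that follows that exact recipe (the data is a toy problem I made up, and it uses the ReLU activation covered a couple of sections below):
```
import torch

# A tiny MLP: 3 inputs -> 8 hidden units -> 1 output
model = torch.nn.Sequential(
    torch.nn.Linear(3, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1),
)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Toy data: 16 random 3-dimensional inputs; the target is the sum of each input's features
x = torch.randn(16, 3)
y_true = x.sum(dim=1, keepdim=True)

for step in range(1000):
    y_pred = model(x)               # forward pass
    loss = loss_fn(y_pred, y_true)  # how far off are we?
    optimizer.zero_grad()
    loss.backward()                 # backpropagate the error
    optimizer.step()                # update the weights

print(loss.item())                  # should be close to 0 by now
```
Swap in real data and a bigger model, and that loop is still roughly how everything below gets trained.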
### Word Embeddings


#### TLDR: Word embeddings convert words to numbers which can be used as inputs. This is a separate step that happens before training the model. The dot product between 2 word embeddings gives you their semantic similarity.
As you can see in the MLP diagram, inputs have to be numbers, otherwise you can't multiply them with weights. If you want to have the MLP learn English, it needs to process words into numbers. The way we do this is using word embeddings.
You could do what's shown in the first diagram, which is called a one-hot encoding: just a diagonal matrix of "1"s. Or you can build a smaller one yourself, using different features to describe the words. In practice, people use pre-built word embeddings produced by algorithms like Word2Vec, where each column (feature) contains values that don't always correspond to human-understandable concepts.
Using word embeddings, we can measure the similarity between two words by taking the dot product of their embedding vectors.
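Here's a toy sketch of that idea (the vectors are hand-made for illustration, not taken from a real Word2Vec model): similar words get vectors pointing in similar directions, so their dot product is larger.
```
import torch

# Hand-made 4-dimensional "embeddings" (real ones have hundreds of dimensions)
cat = torch.tensor([0.9, 0.8, 0.1, 0.0])
dog = torch.tensor([0.8, 0.9, 0.2, 0.0])
car = torch.tensor([0.0, 0.1, 0.9, 0.8])

# Dot product: higher = more semantically similar
print(torch.dot(cat, dog))  # ~1.46, cat and dog are close
print(torch.dot(cat, car))  # ~0.17, cat and car are not
```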
### ReLU

#### TLDR: ReLU is a non-linear activation function with an easy-to-calculate gradient. This massively speeds up backpropagation.
Previously, in the perceptron, we used a weighted sum to determine the output, but in practice we add another step called an "activation function" that processes the output.
The reason is that a weighted sum on its own is just a linear combination, and stacking linear combinations can't learn complex functions and patterns.
Researchers used to use sigmoids and all sorts of curvy functions, but during the backpropagation phase you need to take their derivatives, which is comparatively expensive. Linear-looking things like ReLU are very easy to differentiate. What's surprising is that ReLU looks lossy (it throws away everything negative), but it turns out not to matter.
This made deeper neural networks with many more layers feasible. State of the art uses variants like GLU, but the key ideas are still relevant (see the sketch after this list):
- keep differentiation simple and fast
- make some parts 0 so the function is non-linear.
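To make that concrete, here's ReLU written out by hand (a sketch; `torch.relu` is the built-in version): the function is just max(0, x), and its gradient is always 0 or 1, which is about as cheap as differentiation gets.
```
import torch

def relu(x):
    # ReLU(x) = max(0, x): negative inputs become 0, positive inputs pass through unchanged
    return torch.clamp(x, min=0)

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
relu(x).sum().backward()

print(relu(x))  # tensor([0.0000, 0.0000, 0.5000, 2.0000], grad_fn=...)
print(x.grad)   # tensor([0., 0., 1., 1.]) -- the gradient is always just 0 or 1
```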
### Adam optimizer

#### TLDR: One-size-fits-all optimizer that Just Works. Stop wasting time fiddling with hyperparameters.
The choice of optimizer has a big effect on how the weights get updated. Picking a poor one leads to lots of zigzagging during weight updates, which slows down training.
As a result, researchers used to spend a lot of time fiddling with different optimization algorithms, because the right choice often gave significant improvements.
This all changed when the [Adam optimizer](https://arxiv.org/abs/1412.6980) was shown to be pretty much always as good as other customized optimizers, and was effective on almost everything. Nowadays everyone uses the Adam optimizer or its variants.
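In PyTorch, switching to Adam is a one-line change from the perceptron example earlier (a sketch; the learning rate shown is just PyTorch's default):
```
import torch

model = torch.nn.Linear(3, 1)

# Before: SGD, where the learning rate usually needs hand-tuning
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# After: Adam with its default settings, which Just Works most of the time
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```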
### Transformer

{%youtube g2BRIuln4uc %}
^Watch the video above.
Some insights:
- self-attention allows learning relationships between words (there's a code sketch after this list)
- transformers can be trained in parallel easily, whereas the previous architecture of recurrent neural networks could not
- you can get large amounts of data and train transformers using self-supervised training:
1. Take lots of text from the internet
2. Blank out some words
3. Have the model try to fill in the blanks
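Here's a minimal sketch of the scaled dot-product self-attention at the core of a Transformer (toy sizes, a single attention head, no masking or multi-head machinery), mostly to show that "attention" is just matrix multiplications and a softmax:
```
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 16           # 5 "words", 16-dimensional embeddings
x = torch.randn(seq_len, d_model)  # pretend these are word embeddings

# Learned projections to queries, keys and values
W_q = torch.nn.Linear(d_model, d_model, bias=False)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)

# Each word scores every other word: how relevant are you to me?
scores = Q @ K.T / d_model ** 0.5   # (seq_len, seq_len)
weights = F.softmax(scores, dim=-1) # each row sums to 1
output = weights @ V                # each word becomes a blend of the words it attends to

print(weights.shape, output.shape)  # torch.Size([5, 5]) torch.Size([5, 16])
```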
#### (Un?)Fortunately, nothing that important has been discovered since the Transformer architecture.
We can just take a look at [LLaMA](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/), a model released by Meta in February 2023. If you look at section 2.2, LLaMA has only 3 changes from the original 2017 Transformer paper:
- Pre-normalization
- SwiGLU activation function (sketched in code below)
- Rotary Embeddings
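For flavour, here's roughly what a SwiGLU feed-forward block looks like (a sketch in the spirit of LLaMA's FFN; the layer names and sizes are my own illustration, not the paper's exact configuration):
```
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    # Gated feed-forward block: a "gate" path (with the Swish/SiLU activation)
    # multiplies an "up" path elementwise, then gets projected back down.
    def __init__(self, d_model=16, d_hidden=32):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # silu(x) = x * sigmoid(x); the gate decides how much of each hidden feature passes
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(4, 16)
print(SwiGLU()(x).shape)  # torch.Size([4, 16])
```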
### Emergent Capabilities

~~TLDR: As of April 2023 nobody knows anything for sure other than large models display these emergent capabilities.~~
<!--
~~Yet they do.~~
~~Why?~~
~~No one really knows.~~
~~All we know is that, when models get larger and get trained on more data, they suddenly do much much better on reasoning tasks, exhibiting some form of "understanding".~~
~~In the diagram, we can see that large models suddenly show improved performance beyond a certain size. Note that the pink dotted line is random performance: for example Persian QA has 4 options so randomly picking an answer gives you 25% chance of getting it right. As you can see in the graphs, at some point during scaling, models go from no better than chance to significantly better than chance.~~
-->
Update May 1st: A very strong correlation between training on code and emergent ability on reasoning tasks has been found.
This started out as just a [hypothesis](https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1), but [more recent research](https://yaofu.notion.site/Towards-Complex-Reasoning-the-Polaris-of-Large-Language-Models-c2b4a51355b44764975f88e6a42d4e75) shows that models trained on code and subsequently trained / fine-tuned on text display strong reasoning capabilities. [replit-code-v1-3b](https://huggingface.co/replit/replit-code-v1-3b) was trained mainly on code, and a fine-tuned version of it [beat SOTA on the HumanEval benchmark](https://blog.replit.com/replit-developer-day-recap#newmodel) despite being much smaller than other models.

### What now?
We seem to have stumbled upon Artificial General Intelligence by complete accident using the scaling hypothesis applied to large language models. [Sam Altman himself thinks](https://www.wired.com/story/openai-ceo-sam-altman-the-age-of-giant-ai-models-is-already-over/) that we have reached the limit of the scaling hypothesis.
I guess we'll just have to wait and see?
Meanwhile, while waiting to [turn into a paperclip](https://en.wikipedia.org/wiki/Instrumental_convergence#Paperclip_maximizer), you might wanna read up on [superintelligence](https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html) and [AI alignment](https://www.youtube.com/watch?v=YicCAgjsky8&t=182s) and why [AI might kill us all](https://www.youtube.com/watch?v=gA1sNLL6yg4).