# Introduction to PyTorch:
## An Intro Guide for High-School Students Preparing for AI Olympiads
PyTorch is a powerful and versatile library for numerical computing and machine learning. It is used from **Python**, a language known for its simplicity and vast ecosystem. While AI can be implemented in other programming languages, the most widely used open-source tools and frameworks (TensorFlow, Scikit-learn, and PyTorch among them) are driven from Python, even though their performance-critical cores are written in languages like C++. This makes Python the de facto standard for artificial intelligence and machine learning.
This note introduces three key aspects of PyTorch: its capabilities for numerical programming, its use of automatic differentiation, and its features for building neural networks and differentiable programs.
---
## 1. Numerical Programming with PyTorch
### PyTorch as a Numerical Programming Tool
PyTorch is, first and foremost, a **numerical programming package**, akin to a "calculator app" for Python. It provides tools to perform mathematical operations efficiently and is especially optimized for large-scale computations like those in machine learning. Another example of a numerical programming library is **NumPy**, but PyTorch goes beyond it by offering features like GPU support and **automatic differentiation**, which make it more suitable for deep learning.
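As a quick illustration of the overlap with NumPy, here is a minimal sketch of moving data between the two libraries (the values are arbitrary):
```python
import numpy as np
import torch

a_np = np.array([1.0, 2.0, 3.0])
a = torch.from_numpy(a_np)  # tensor sharing memory with the NumPy array
b = torch.exp(a) + 1.0      # the same style of array operations as NumPy
b_np = b.numpy()            # convert back to a NumPy array
```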
### Tensors and Their Operations
The primary data structure in PyTorch is the **tensor**, a generalization of scalars, vectors, and matrices. Tensors can have arbitrary dimensions and shapes (see the code sketch after this list):
- **Scalars**: A single number, shape `[]`.
- **Vectors**: A 1D array, shape `[n]`.
- **Matrices**: A 2D array, shape `[m, n]`.
- **Higher dimensions**: For example, a 3D tensor might have shape `[batch_size, height, width]`.
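A minimal sketch of these shapes in code (all values here are arbitrary):
```python
import torch

scalar = torch.tensor(3.14)             # shape []
vector = torch.tensor([1.0, 2.0, 3.0])  # shape [3]
matrix = torch.zeros(2, 3)              # shape [2, 3]
batch = torch.zeros(16, 28, 28)         # shape [16, 28, 28], e.g. [batch_size, height, width]
print(scalar.shape, vector.shape, matrix.shape, batch.shape)
```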
PyTorch supports a wide range of operations, from elementwise functions (addition, multiplication, exponentiation) to **matrix multiplication** using the `@` operator.
Example:
```python
import torch
x = torch.tensor([[1, 2], [3, 4]])
y = torch.tensor([[5, 6], [7, 8]])
z = x @ y # Matrix multiplication
```
### Vectorized Code for Efficiency
One of the key principles in PyTorch is **vectorization**: performing operations directly on entire tensors instead of iterating through their elements. This approach is much faster because PyTorch uses highly optimized C libraries for tensor operations. For example:
```python
import torch

a, b, c = torch.randn(1000), torch.randn(1000), torch.empty(1000)

# Inefficient Python loop
for i in range(len(a)):
    c[i] = a[i] + b[i]

# Efficient vectorized PyTorch operation
c = a + b
```
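To see the gap for yourself, one way is to time both versions; the tensor size below is an arbitrary choice, and exact timings vary by machine:
```python
import time
import torch

a, b = torch.randn(100_000), torch.randn(100_000)
c = torch.empty(100_000)

start = time.perf_counter()
for i in range(len(a)):  # Python-level loop: one interpreter step per element
    c[i] = a[i] + b[i]
print(f"loop:       {time.perf_counter() - start:.4f} s")

start = time.perf_counter()
c = a + b                # single call into PyTorch's optimized C++ kernels
print(f"vectorized: {time.perf_counter() - start:.4f} s")
```
On a typical machine the vectorized version is thousands of times faster.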
### CPU, GPU, and TPU
PyTorch supports computations on multiple platforms:
- **CPU**: A general-purpose processor with a small number of cores (4–32) optimized for diverse tasks.
- **GPU (Graphics Processing Unit)**: A highly parallel processor with thousands of cores (e.g., NVIDIA GPUs can have 2,000–10,000 cores). GPUs are optimized for repeated, parallel computations, such as matrix multiplication or elementwise operations. They are designed with memory layouts that ensure data is physically close to the computing cores, minimizing delays.
- **TPU (Tensor Processing Unit)**: A specialized processor for machine learning, developed by Google, optimized for matrix and tensor computations.
GPUs excel at operations like:
1. **Matrix multiplication**: Repeated multiplications and additions in parallel.
2. **Elementwise matrix operations**: Applying a function (e.g., exponentiation) to each element of a matrix.
To use a GPU in PyTorch, you move tensors to the GPU with `.cuda()`:
```python
x = x.cuda() # Move tensor to GPU
y = y.cuda()
z = x @ y # Perform computation on GPU
```
To bring tensors back to the CPU, use `.cpu()`:
```python
z = z.cpu()
```
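Equivalently, the `.to()` method moves a tensor to a named device; a common device-agnostic pattern is sketched below:
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(3, 3).to(device)  # runs on the GPU when one is available
```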
### Experimenting in Google Colab
In Google Colab, you can request a GPU backend by selecting **Runtime > Change runtime type > GPU**. This allows you to compare the speed of computations on the CPU and GPU. For example, try multiplying two large matrices and observe the difference.
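A minimal sketch of that experiment is below (the matrix size is arbitrary; GPU operations run asynchronously, so we call `torch.cuda.synchronize()` before reading the clock):
```python
import time
import torch

a, b = torch.randn(4096, 4096), torch.randn(4096, 4096)

start = time.perf_counter()
c = a @ b                     # matrix multiplication on the CPU
print(f"CPU: {time.perf_counter() - start:.3f} s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()  # wait for the transfers to finish
    start = time.perf_counter()
    c_gpu = a_gpu @ b_gpu
    torch.cuda.synchronize()  # wait for the multiplication to finish
    print(f"GPU: {time.perf_counter() - start:.3f} s")
```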
#### Common Mistakes with GPU
To fully benefit from the GPU:
1. Move all tensors involved in a computation to the GPU; mixing CPU and GPU tensors in one operation raises an error.
2. Avoid frequent transfers between CPU and GPU memory, as these are a common bottleneck.
3. Structure computations as large, batched operations so that GPU cores are not left idle.
---
## 2. Automatic Differentiation in PyTorch
### Computational Graphs and Gradients
PyTorch builds a **computational graph** in the background whenever you perform tensor operations. This graph tracks the sequence of operations and is essential for **automatic differentiation**, a method for computing gradients (derivatives). Gradients are critical for training machine learning models, as they indicate how to adjust model parameters to reduce error.
PyTorch uses **reverse-mode automatic differentiation**, commonly referred to as backpropagation, to calculate gradients efficiently. Backpropagation uses the chain rule of calculus and is akin to dynamic programming, where intermediate results are reused.
For more details, see [this note on automatic differentiation](https://hackmd.io/@fhuszar/SyHTInWeu).
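As a small sanity check of what backpropagation computes, here is a sketch comparing an autograd gradient with the chain rule worked out by hand (the input value 1.5 is arbitrary; `requires_grad` is explained in the next subsection):
```python
import torch

# f(x) = sin(x^2), so by the chain rule f'(x) = 2x * cos(x^2)
x = torch.tensor(1.5, requires_grad=True)
f = torch.sin(x ** 2)
f.backward()

manual = 2 * x.detach() * torch.cos(x.detach() ** 2)
print(x.grad, manual)  # the two values agree (both are approximately -1.885)
```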
### Using Gradients in PyTorch
1. **Enabling Gradients**: To enable gradient tracking, set `requires_grad=True` when creating a tensor:
```python
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
```
2. **Backpropagation**: Call `.backward()` to compute gradients:
```python
y = x ** 2
y.sum().backward()  # computes d(sum(y))/dx, which is 2 * x elementwise
print(x.grad)       # tensor([2., 4., 6.])
```
3. **Disabling Gradients**: Wrap inference code in the `torch.no_grad()` context manager to save memory and computation:
```python
with torch.no_grad():
    y = model(x)  # 'model' can be any PyTorch model, e.g. the MLP from Section 3
```
4. **Graph Management**: After `.backward()`, the computational graph is freed by default (pass `retain_graph=True` if you need to backpropagate through it again). Separately, `.detach()` returns a tensor that shares the same data but is cut off from the graph, so its history is no longer tracked:
```python
detached_x = x.detach()
```
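Putting these pieces together, here is a minimal sketch of gradient descent on a one-parameter function (the learning rate and step count are arbitrary choices):
```python
import torch

# Minimize f(w) = (w - 3)^2, whose minimum is at w = 3
w = torch.tensor(0.0, requires_grad=True)
for step in range(100):
    loss = (w - 3) ** 2
    loss.backward()        # d(loss)/dw = 2 * (w - 3)
    with torch.no_grad():
        w -= 0.1 * w.grad  # update without tracking the update itself
        w.grad.zero_()     # gradients accumulate, so reset them each step
print(w.item())            # close to 3.0
```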
---
## 3. Building Neural Networks with PyTorch
### Neural Networks and Differentiable Programming
PyTorch simplifies the creation of neural networks, which are composed of layers performing differentiable computations. These layers are combined to process inputs, transform data, and generate outputs.
Here’s a simple example of a multi-layer perceptron (MLP):
```python
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(10, 50),  # input size 10, output size 50
    nn.ReLU(),          # activation function
    nn.Linear(50, 1)    # output size 1
)
```
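A quick way to check the model's input and output shapes is a forward pass on a random batch (the batch size 32 is arbitrary):
```python
import torch

x = torch.randn(32, 10)  # a batch of 32 inputs, each of size 10
out = model(x)
print(out.shape)         # torch.Size([32, 1])
```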
### Loss Functions and Optimization
1. **Loss Functions**: Quantify the difference between predictions and true values (e.g., mean squared error or cross-entropy loss).
2. **Optimization Algorithms**: Use gradients to adjust model parameters, e.g., stochastic gradient descent (SGD) or Adam.
Example:
```python
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training loop (simplified; assumes a dataloader yielding (data, target) batches)
for data, target in dataloader:
    optimizer.zero_grad()                # clear gradients left over from the previous step
    predictions = model(data)            # forward pass
    loss = loss_fn(predictions, target)  # measure the error
    loss.backward()                      # backpropagate to compute gradients
    optimizer.step()                     # update the parameters
```
### Additional Utilities
PyTorch provides tools to streamline workflows:
- **Dataset and DataLoader classes**: Simplify data handling and batching (a minimal sketch follows this list).
- **Transformations**: Facilitate data preprocessing (e.g., normalization, augmentation).
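A minimal sketch with random stand-in data (all sizes are arbitrary):
```python
import torch
from torch.utils.data import TensorDataset, DataLoader

inputs, targets = torch.randn(100, 10), torch.randn(100, 1)
dataset = TensorDataset(inputs, targets)  # pairs each input with its target
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

for data, target in dataloader:  # usable in the training loop above
    print(data.shape, target.shape)  # torch.Size([32, 10]) torch.Size([32, 1])
    break
```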
For more advanced details, explore the [Elements of Differentiable Programming](https://arxiv.org/abs/2403.14606).
---
PyTorch is a flexible and efficient framework for numerical programming, automatic differentiation, and building neural networks. With these tools, you can create powerful machine learning models and explore the exciting world of AI.