# Introduction to PyTorch:
## An Intro Guide for High-School Students Preparing for AI Olympiads
PyTorch is a powerful and versatile library for numerical computing and machine learning. It is used from **Python**, a language known for its simplicity and vast ecosystem. While AI can be implemented in other programming languages, the most widely used open-source tools and frameworks (TensorFlow, Scikit-learn, and PyTorch among them) are driven from Python, even though their performance-critical cores are written in languages like C++. This makes Python the de facto standard for artificial intelligence and machine learning.
This note introduces three key aspects of PyTorch: its capabilities for numerical programming, its use of automatic differentiation, and its features for building neural networks and differentiable programs.
---
## 1. Numerical Programming with PyTorch
### PyTorch as a Numerical Programming Tool
PyTorch is, first and foremost, a **numerical programming package**, akin to a "calculator app" for Python. It provides tools to perform mathematical operations efficiently and is especially optimized for large-scale computations like those in machine learning. Another example of a numerical programming library is **NumPy**, but PyTorch goes beyond it by offering features like GPU support and **automatic differentiation**, which make it more suitable for deep learning.
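As a quick illustration of the overlap with NumPy, here is a minimal sketch of moving data between the two libraries (the values are arbitrary):
```python
import numpy as np
import torch

a_np = np.array([1.0, 2.0, 3.0])
a = torch.from_numpy(a_np)  # tensor sharing memory with the NumPy array
b = torch.exp(a) + 1.0      # the same style of array operations as NumPy
b_np = b.numpy()            # convert back to a NumPy array
```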
### Tensors and Their Operations
The primary data structure in PyTorch is the **tensor**, a generalization of scalars, vectors, and matrices. Tensors can have arbitrary dimensions and shapes (see the code sketch after this list):
- **Scalars**: A single number, shape `[]`.
- **Vectors**: A 1D array, shape `[n]`.
- **Matrices**: A 2D array, shape `[m, n]`.
- **Higher dimensions**: For example, a 3D tensor might have shape `[batch_size, height, width]`.
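A minimal sketch of these shapes in code (all values here are arbitrary):
```python
import torch

scalar = torch.tensor(3.14)             # shape []
vector = torch.tensor([1.0, 2.0, 3.0])  # shape [3]
matrix = torch.zeros(2, 3)              # shape [2, 3]
batch = torch.zeros(16, 28, 28)         # shape [16, 28, 28], e.g. [batch_size, height, width]
print(scalar.shape, vector.shape, matrix.shape, batch.shape)
```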
PyTorch supports a wide range of operations, from elementwise functions (addition, multiplication, exponentiation) to **matrix multiplication** using the `@` operator.
Example:
```python
import torch
x = torch.tensor([[1, 2], [3, 4]])
y = torch.tensor([[5, 6], [7, 8]])
z = x @ y # Matrix multiplication
```
### Vectorized Code for Efficiency
One of the key principles in PyTorch is **vectorization**: performing operations directly on entire tensors instead of iterating through their elements. This approach is much faster because PyTorch uses highly optimized C libraries for tensor operations. For example:
```python
import torch

a, b, c = torch.randn(1000), torch.randn(1000), torch.empty(1000)

# Inefficient Python loop
for i in range(len(a)):
    c[i] = a[i] + b[i]

# Efficient vectorized PyTorch operation
c = a + b
```
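To see the gap for yourself, one way is to time both versions; the tensor size below is an arbitrary choice, and exact timings vary by machine:
```python
import time
import torch

a, b = torch.randn(100_000), torch.randn(100_000)
c = torch.empty(100_000)

start = time.perf_counter()
for i in range(len(a)):  # Python-level loop: one interpreter step per element
    c[i] = a[i] + b[i]
print(f"loop:       {time.perf_counter() - start:.4f} s")

start = time.perf_counter()
c = a + b                # single call into PyTorch's optimized C++ kernels
print(f"vectorized: {time.perf_counter() - start:.4f} s")
```
On a typical machine the vectorized version is thousands of times faster.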
### CPU, GPU, and TPU
PyTorch supports computations on multiple platforms:
- **CPU**: A general-purpose processor with a small number of cores (4–32) optimized for diverse tasks.
- **GPU (Graphics Processing Unit)**: A highly parallel processor with thousands of cores (e.g., NVIDIA GPUs can have 2,000–10,000 cores). GPUs are optimized for repeated, parallel computations, such as matrix multiplication or elementwise operations. They are designed with memory layouts that ensure data is physically close to the computing cores, minimizing delays.
- **TPU (Tensor Processing Unit)**: A specialized processor for machine learning, developed by Google, optimized for matrix and tensor computations.
GPUs excel at operations like:
1. **Matrix multiplication**: Repeated multiplications and additions in parallel.
2. **Elementwise matrix operations**: Applying a function (e.g., exponentiation) to each element of a matrix.
To use a GPU in PyTorch, you move tensors to the GPU with `.cuda()`:
```python
x = x.cuda() # Move tensor to GPU
y = y.cuda()
z = x @ y # Perform computation on GPU
```
To bring tensors back to the CPU, use `.cpu()`:
```python
z = z.cpu()
```
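Equivalently, the `.to()` method moves a tensor to a named device; a common device-agnostic pattern is sketched below:
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(3, 3).to(device)  # runs on the GPU when one is available
```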
### Experimenting in Google Colab
In Google Colab, you can request a GPU backend by selecting **Runtime > Change runtime type > GPU**. This allows you to compare the speed of computations on the CPU and GPU. For example, try multiplying two large matrices and observe the difference.
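A minimal sketch of that experiment is below (the matrix size is arbitrary; GPU operations run asynchronously, so we call `torch.cuda.synchronize()` before reading the clock):
```python
import time
import torch

a, b = torch.randn(4096, 4096), torch.randn(4096, 4096)

start = time.perf_counter()
c = a @ b                     # matrix multiplication on the CPU
print(f"CPU: {time.perf_counter() - start:.3f} s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()  # wait for the transfers to finish
    start = time.perf_counter()
    c_gpu = a_gpu @ b_gpu
    torch.cuda.synchronize()  # wait for the multiplication to finish
    print(f"GPU: {time.perf_counter() - start:.3f} s")
```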
#### Common Mistakes with GPU
To fully benefit from the GPU:
1. Move all tensors involved in a computation to the GPU; mixing CPU and GPU tensors in one operation raises an error.
2. Avoid frequent transfers between CPU and GPU memory, as these are a common bottleneck.
3. Structure computations as large, batched operations so that GPU cores are not left idle.
---
## 2. Automatic Differentiation in PyTorch
### Computational Graphs and Gradients
PyTorch builds a **computational graph** in the background whenever you perform tensor operations. This graph tracks the sequence of operations and is essential for **automatic differentiation**, a method for computing gradients (derivatives). Gradients are critical for training machine learning models, as they indicate how to adjust model parameters to reduce error.
PyTorch uses **reverse-mode automatic differentiation**, commonly referred to as backpropagation, to calculate gradients efficiently. Backpropagation uses the chain rule of calculus and is akin to dynamic programming, where intermediate results are reused.
For more details, see [this note on automatic differentiation](https://hackmd.io/@fhuszar/SyHTInWeu).
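As a small sanity check of what backpropagation computes, here is a sketch comparing an autograd gradient with the chain rule worked out by hand (the input value 1.5 is arbitrary; `requires_grad` is explained in the next subsection):
```python
import torch

# f(x) = sin(x^2), so by the chain rule f'(x) = 2x * cos(x^2)
x = torch.tensor(1.5, requires_grad=True)
f = torch.sin(x ** 2)
f.backward()

manual = 2 * x.detach() * torch.cos(x.detach() ** 2)
print(x.grad, manual)  # the two values agree (both are approximately -1.885)
```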
### Using Gradients in PyTorch
1. **Enabling Gradients**: To enable gradient tracking, set `requires_grad=True` when creating a tensor:
```python
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
```
2. **Backpropagation**: Call `.backward()` to compute gradients:
```python
y = x ** 2
y.sum().backward()  # computes d(sum(y))/dx, which is 2 * x elementwise
print(x.grad)       # tensor([2., 4., 6.])
```
3. **Disabling Gradients**: Wrap inference code in the `torch.no_grad()` context manager to save memory and computation:
```python
with torch.no_grad():
    y = model(x)  # 'model' can be any PyTorch model, e.g. the MLP from Section 3
```
4. **Graph Management**: After `.backward()`, the computational graph is freed by default (pass `retain_graph=True` if you need to backpropagate through it again). Separately, `.detach()` returns a tensor that shares the same data but is cut off from the graph, so its history is no longer tracked:
```python
detached_x = x.detach()
```
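Putting these pieces together, here is a minimal sketch of gradient descent on a one-parameter function (the learning rate and step count are arbitrary choices):
```python
import torch

# Minimize f(w) = (w - 3)^2, whose minimum is at w = 3
w = torch.tensor(0.0, requires_grad=True)
for step in range(100):
    loss = (w - 3) ** 2
    loss.backward()        # d(loss)/dw = 2 * (w - 3)
    with torch.no_grad():
        w -= 0.1 * w.grad  # update without tracking the update itself
        w.grad.zero_()     # gradients accumulate, so reset them each step
print(w.item())            # close to 3.0
```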
---
## 3. Building Neural Networks with PyTorch
### Neural Networks and Differentiable Programming
PyTorch simplifies the creation of neural networks, which are composed of layers performing differentiable computations. These layers are combined to process inputs, transform data, and generate outputs.
Here’s a simple example of a multi-layer perceptron (MLP):
```python
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(10, 50),  # input size 10, output size 50
    nn.ReLU(),          # activation function
    nn.Linear(50, 1)    # output size 1
)
```
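A quick way to check the model's input and output shapes is a forward pass on a random batch (the batch size 32 is arbitrary):
```python
import torch

x = torch.randn(32, 10)  # a batch of 32 inputs, each of size 10
out = model(x)
print(out.shape)         # torch.Size([32, 1])
```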
### Loss Functions and Optimization
1. **Loss Functions**: Quantify the difference between predictions and true values (e.g., mean squared error or cross-entropy loss).
2. **Optimization Algorithms**: Use gradients to adjust model parameters, e.g., stochastic gradient descent (SGD) or Adam.
Example:
```python
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training loop (simplified; assumes a dataloader yielding (data, target) batches)
for data, target in dataloader:
    optimizer.zero_grad()                # clear gradients left over from the previous step
    predictions = model(data)            # forward pass
    loss = loss_fn(predictions, target)  # measure the error
    loss.backward()                      # backpropagate to compute gradients
    optimizer.step()                     # update the parameters
```
### Additional Utilities
PyTorch provides tools to streamline workflows:
- **Dataset and DataLoader classes**: Simplify data handling and batching (a minimal sketch follows this list).
- **Transformations**: Facilitate data preprocessing (e.g., normalization, augmentation).
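A minimal sketch with random stand-in data (all sizes are arbitrary):
```python
import torch
from torch.utils.data import TensorDataset, DataLoader

inputs, targets = torch.randn(100, 10), torch.randn(100, 1)
dataset = TensorDataset(inputs, targets)  # pairs each input with its target
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

for data, target in dataloader:  # usable in the training loop above
    print(data.shape, target.shape)  # torch.Size([32, 10]) torch.Size([32, 1])
    break
```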
For more advanced details, explore the [Elements of Differentiable Programming](https://arxiv.org/abs/2403.14606).
---
PyTorch is a flexible and efficient framework for numerical programming, automatic differentiation, and building neural networks. With these tools, you can create powerful machine learning models and explore the exciting world of AI.