# How to Do Machine Learning with PyTorch
><i class="fa fa-file-text"></i> [PyTorch Tutorial](https://docs.pytorch.org/tutorials/)
:::success
:bulb: **Typical PyTorch training order [guide]**
1. Set Up
- Import libraries
- `torch, random, numpy, torchvision, torch.utils.data`
2. Data Transforms
- Train transforms: heavy augmentations
- `Resize→RandomFlip→Rotation→ColorJitter→ToTensor→Normalize`
- Eval transforms (eval & test): minimal preprocessing
- `Resize→ToTensor→Normalize`
3. Datasets & Dataloaders
- Load the training ds ("base_train") w/o transforms
- Create [80/20] train/validation set w/ `random_split`
- Apply train transforms to the train ds and eval transforms to the val ds
- Load the test ds separately and apply eval transforms
- Create dataloaders for all 3 ds
- `batch_size, shuffle, num_workers `
4. Define Model
- Check/set device
- Identify suitable model type
- Define `nn.Module` and set hyper-parameters
- `dropout_rate`
- def `__init__, forward`
- Include layers, activations, pooling, batch-norm, skip connections, etc.
5. Loss Function
- Choose suitable loss function
6. Optimizer & Scheduler
- Choose suitable optimizer
- `weight_decay`
- Choose suitable scheduler
7. Validation / Metrics Function
- Write validation function that returns average loss + accuracy
8. (Optional) Hyper-Parameter Search
- One-time tuning via grid, random, or Bayesian search
- Loop over hyper-params
- `lr, wd, dropout, etc.`
- Call training routine
- Get best configs for the model
9. Training Loop
- Consider single model, traditional ensemble, or snapshot ensemble training
- For each epoch:
- Train: `model.train()` → zero gradients → forward pass → loss backward → optimizer step
- Validate: call validation function on val ds
- Early stop & checkpoint: compare val_loss to the best so far; save the state dict if it improved; break if no improvement for several epochs
- Scheduler step: get val loss from validate function and scheduler step
- Record epoch progress: epoch, train/val loss, train/val accuracy
10. Test Evaluation
- Post training results
- Run `evaluate(best_model, test_loader)`
- Get final accuracy
11. Load and save model
- Save your final model weights w/ `torch.save(model.state_dict(), path)`
- Reload w/ `model.load_state_dict(torch.load(path))` and `model.eval()`
12. (Optional) Visuals & Reports
- Simple & clean analysis graphs and reports to summarize results
- Class visuals, different progress tracking, Confusion matrix and classification report, Per-class matrix, Grad-CAM / Saliency Maps, etc.
13. (Optional) Add-ons & other features
- Other ML concepts/algorithms can help with training accuracy and overall performance; however, they can depend heavily on available resources
- e.g. model ensembles (traditional or snapshot)
:::
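The training order above can be sketched as a minimal end-to-end skeleton. Everything below — the random tensor dataset, the tiny model, and the hyper-parameter values — is a placeholder assumption, not a recommended configuration:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

# Placeholder data: 100 samples, 10 features, 2 classes (assumption)
X, y = torch.randn(100, 10), torch.randint(0, 2, (100,))
base = TensorDataset(X, y)
train_ds, val_ds = random_split(base, [80, 20])            # 80/20 split
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=16)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)

def validate(model, loader):
    """Return (average loss, accuracy) over a dataloader."""
    model.eval()
    loss_sum, correct, n = 0.0, 0, 0
    with torch.no_grad():
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            out = model(xb)
            loss_sum += criterion(out, yb).item() * len(xb)
            correct += (out.argmax(1) == yb).sum().item()
            n += len(xb)
    return loss_sum / n, correct / n

best_val = float("inf")
for epoch in range(3):                                     # short demo run
    model.train()
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()                              # zero grads
        loss = criterion(model(xb), yb)                    # forward + loss
        loss.backward()                                    # backward
        optimizer.step()                                   # update
    val_loss, val_acc = validate(model, val_loader)
    scheduler.step(val_loss)                               # plateau scheduler needs the metric
    if val_loss < best_val:                                # checkpoint on improvement
        best_val = val_loss
        torch.save(model.state_dict(), "best_model.pt")
```

Swap in real datasets, transforms, and a proper model; the control flow (split → train → validate → scheduler step → checkpoint) stays the same.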
## Datasets & Dataloaders
:::info
:information_source: **info**
- Dataset code should be decoupled from model training code for better readability and modularity
- Two PyTorch data primitives: `torch.utils.data.DataLoader` & `torch.utils.data.Dataset`
- `Dataset` stores the samples and their corresponding labels
- `DataLoader` wraps an iterable around the Dataset to enable easy access to the samples
- PyTorch domain libraries provide a number of pre-loaded datasets that subclass `torch.utils.data.Dataset `
- Or you can download your own from the internet and apply your data directory
:::
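A minimal custom `Dataset` only needs `__init__`, `__len__`, and `__getitem__`; `DataLoader` then handles batching. The toy dataset below is purely illustrative:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    """Toy dataset: sample i is the pair (i, i**2)."""
    def __init__(self, n):
        self.x = torch.arange(n, dtype=torch.float32)
    def __len__(self):
        return len(self.x)
    def __getitem__(self, idx):
        return self.x[idx], self.x[idx] ** 2

ds = SquaresDataset(8)
loader = DataLoader(ds, batch_size=4, shuffle=False, num_workers=0)
first_x, first_y = next(iter(loader))   # batched tensors of shape (4,)
```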
## Transforms
:::info
:information_source: **info**
- Transforms are used to manipulate data so that it is suitable for training
- TorchVision datasets have two parameters: `transform` & `target_transform`
- `transform` modifies features
- `target_transform` modifies labels
- e.g. `RandomHorizontalFlip`, `ColorJitter`, `ToTensor()`,`Normalize(mean, std)`
- Usually, images are converted to tensors while labels can stay as integers (depending on the model)
:::
## Neural Networks
:::info
:information_source: **info**
- Neural networks consist of layers/modules that perform operations on data
- The `torch.nn` namespace provides all the building blocks you need to build your own neural network
- Every module in PyTorch subclasses `nn.Module`
- NNs are a nested structure of layers/modules, which makes building and managing complex architectures easy
:::
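A minimal `nn.Module` subclass — the architecture (`TinyNet`, its layer sizes, and the dropout rate) is an illustrative assumption:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self, dropout_rate=0.2):      # hyper-parameter set in __init__
        super().__init__()
        self.fc1 = nn.Linear(10, 32)
        self.drop = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(32, 2)
    def forward(self, x):                      # forward defines the computation
        x = torch.relu(self.fc1(x))
        x = self.drop(x)
        return self.fc2(x)

net = TinyNet()
out = net(torch.randn(4, 10))                  # batch of 4 -> logits of shape (4, 2)
```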
## Autograd
:::info
:information_source: **info**
- Autograd is PyTorch's automatic differentiation engine; it typically runs during the training loop
- *Back propagation* is a common algorithm used for training models
- ... parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter
- PyTorch's built-in engine `torch.autograd` supports automatic computation of gradient for any computational graph
``` python
loss.backward() # accumulate ∂loss/∂param
optimizer.step() # update params
```
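A tiny worked example of autograd on a scalar function:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x            # y = x^2 + 2x, so dy/dx = 2x + 2
y.backward()                  # autograd computes the gradient
grad = x.grad.item()          # 2*3 + 2 = 8
```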
:::
## Optimization
:::info
:information_source: **info**
- Training a model is an iterative process...
- Make a prediction → calculate the loss → collect the derivatives of the error with respect to the parameters → optimize the parameters using gradient descent
:::
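That loop can be seen in miniature by letting gradient descent minimize a simple quadratic (the function and learning rate here are illustrative):

```python
import torch

w = torch.tensor(5.0, requires_grad=True)     # start far from the minimum
opt = torch.optim.SGD([w], lr=0.1)

for _ in range(100):
    opt.zero_grad()
    loss = (w - 2.0) ** 2                     # loss is minimized at w = 2
    loss.backward()                           # collect dloss/dw
    opt.step()                                # move w against the gradient

final_w = w.item()                            # converges close to 2.0
```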
## Save & Load Model
:::info
:information_source: **info**
- PyTorch models store the learned parameters in an internal state dictionary, called `state_dict`
- `torch.save`
- To load model weights, you need to create an instance of the same model first, and then load the parameters
- `load_state_dict()`
:::
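A minimal save/reload round trip, using a throwaway `nn.Linear` and a hypothetical file path `model.pt`:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
torch.save(model.state_dict(), "model.pt")     # save learned parameters only

reloaded = nn.Linear(4, 2)                     # must build the same architecture first
reloaded.load_state_dict(torch.load("model.pt"))
reloaded.eval()                                # set dropout/batch-norm to inference mode

same = torch.equal(model.weight, reloaded.weight)
```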
## List of model types
| Model Type | Key Characteristics | Popular Use Cases |
| ------------------------------------------ | ----------------------------------------------------------- | ---------------------------------------------------- |
| **Feedforward Neural Networks** | Basic neural network for general tasks | General-purpose classification, regression |
| **Convolutional Neural Networks (CNNs)** | Uses convolutional layers for image data | Image classification, object detection, segmentation |
| **Recurrent Neural Networks (RNNs)** | Handles sequential data | Time series, language modeling, speech recognition |
| **Transformers** | Uses attention mechanisms for sequential data | NLP tasks (translation, summarization) |
| **Generative Adversarial Networks (GANs)** | Consists of generator and discriminator | Image generation, style transfer, data augmentation |
| **Autoencoders** | Learns compressed representations of data | Data compression, anomaly detection |
| **Reinforcement Learning (RL)** | Learns from interaction with an environment | Robotics, game playing, autonomous vehicles |
| **Siamese Networks** | Compares pairs of inputs for similarity | Face verification, image similarity |
| **Attention Mechanisms** | Focuses on important parts of the input | NLP tasks, machine translation |
| **Capsule Networks (CapsNets)** | Focuses on spatial relationships between features | Image classification, object detection |
| **Neural Style Transfer** | Combines the content of one image with the style of another | Artistic image generation |
:::success
:bulb: **Summary of the CNN Process:**
*SIMPLE PATTERN: [Conv→ReLU→Pool→Flatten→FC]*
Input Image: Start with an image (e.g., 28x28 pixels for grayscale).
Convolutional Layer: Apply convolution filters to detect features like edges and textures. The filter/kernel is multiplied element-wise with a small patch of the image, then summed into a single number, producing a feature map.
Activation (ReLU): Apply an activation function (typically ReLU) to introduce non-linearity, allowing for more complex learning. Positive values are passed through unchanged while negatives are set to zero.
>[!Tip]Formula
`ReLU(x)=max(0,x)`
Pooling: Use max pooling to downsample the feature maps and reduce dimensionality. A window is slid over the feature map, selecting the max value in each region. This shrinks the feature map while still keeping the most important values.
Flattening: Flatten the feature maps into a 1D vector. The Fully Connected Layer only accepts and expects a 1D input.
Fully Connected Layer: Pass the flattened vector through fully connected layers. Every neuron in this layer is connected to every neuron in the previous layer. The network computes a weighted sum by multiplying the vector by the layer's weights and adding a bias, then applies an activation function (like ReLU) and outputs the result.
Output Layer: Finally, output a classification or regression result, depending on the task. An output could be a vector listing the probabilities corresponding to each class.
>**Important Notes**
> Conv → BatchNorm → ReLU → Pool → Conv → BatchNorm → ReLU → Pool → Flatten → FC1 → ReLU → FC2
> - Pool halves the dimensions
> - The last layer of FC needs to match the number of classes
> - `nn.BatchNorm2d()` improves speed and performance: normalizes each feature-map (channel) to zero mean and unit variance and scales and shifts the normalized output with learnable parameters
> - `nn.Dropout()` randomly disables (set to 0) a fraction of neurons during training to prevent overfitting. It makes the model less reliant on any single feature and helps generalize better.
:::
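The block pattern in the note above can be checked by tracing tensor shapes through a tiny, made-up CNN (the channel counts and the 28×28 input are assumptions):

```python
import torch
import torch.nn as nn

# Conv -> BatchNorm -> ReLU -> Pool -> Flatten -> FC, on a 1x28x28 input (MNIST-like)
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # (1, 28, 28) -> (8, 28, 28)
    nn.BatchNorm2d(8),                           # normalize each channel
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pool halves dims: (8, 14, 14)
    nn.Flatten(),                                # -> 8 * 14 * 14 = 1568
    nn.Linear(8 * 14 * 14, 10),                  # last FC matches 10 classes
)
logits = cnn(torch.randn(2, 1, 28, 28))          # batch of 2 -> (2, 10)
```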
## List of Loss Functions
| Loss Function | Use Case | Target Format |
| ------------------- | --------------------------------- | -------------------- |
| `CrossEntropyLoss` | Multiclass classification | Integer class labels |
| `BCEWithLogitsLoss` | Binary/multi-label classification | Float 0/1 vector |
| `NLLLoss` | Classification with log-softmax | Integer class labels |
| `MSELoss` | Regression | Float value |
| `L1Loss` | Regression (less sensitive) | Float value |
| `KLDivLoss` | Distribution comparison | Log-prob vs prob |
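Quick usage sketch for two entries in the table — note `CrossEntropyLoss` expects raw logits with integer labels, while `MSELoss` expects float values:

```python
import torch
import torch.nn as nn

# Multiclass: raw scores for 3 classes, integer class label
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])
ce = nn.CrossEntropyLoss()(logits, target)    # softmax + log happen inside the loss

# Regression: float prediction vs float target
pred_val = torch.tensor([1.0])
target_val = torch.tensor([1.5])
mse = nn.MSELoss()(pred_val, target_val)      # (1.0 - 1.5)^2 = 0.25
```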
## List of Optimizers
| Optimizer | When to use | Notes |
| --------- | ------------------------------------ | ------------------------------------- |
| `Adam` | Default for most deep learning tasks | Fast, adaptive, great out of the box |
| `SGD` | Classic choice | Good with momentum, but slower |
| `RMSprop` | RNNs, time series | Handles non-stationary gradients well |
| `AdamW` | Transformers / weight decay | Adam + better weight regularization |
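Constructing the optimizers in the table is uniform — pass the model's parameters plus hyper-parameters (the values below are illustrative, not recommendations):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)

adam = torch.optim.Adam(model.parameters(), lr=1e-3)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)  # decoupled decay
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

current_lr = sgd.param_groups[0]["lr"]        # hyper-params live in param_groups
```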
## List of LR Schedulers
| Scheduler | Pros | When to use |
| ------------------- | ------------------------------------------- | ----------------------------------------------------- |
| `StepLR` | Simple, predictable drops | You know roughly when to decay (e.g. every 10 epochs) |
| `MultiStepLR` | Multiple custom decay points | You have prior on milestones (e.g. \[30, 60, 90]) |
| `ExponentialLR` | Smooth exponential decay | You want continuous decay |
| `CosineAnnealingLR` | Smooth “cosine” decay (use `CosineAnnealingWarmRestarts` for warm restarts) | Modern “no-a priori” recipes |
| `ReduceLROnPlateau` | Decay triggered by validation metric stalls | You don’t know good epochs in advance |
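A short sketch of how a scheduler interacts with the optimizer — here `StepLR` halving the LR every 2 epochs (step the optimizer before the scheduler):

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=2, gamma=0.5)  # halve every 2 epochs

lrs = []
for epoch in range(4):
    # ... one epoch of training would go here ...
    opt.step()                             # optimizer step first
    sched.step()                           # then scheduler step, once per epoch
    lrs.append(opt.param_groups[0]["lr"])  # lrs: 0.1, 0.05, 0.05, 0.025
```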
## List of key architectural choices & hyperparameters
| **Attribute** | **Impact on Model** | **Typical Values / Notes** |
| ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------- |
| **# Convolutional Layers** | More layers → higher capacity & larger receptive field; risk of overfitting and longer training time. | 2–5 “blocks” for small datasets; up to 50+ for deep nets |
| **Filter Count (Channels)** | More channels per layer → more features learned, but more params and compute. | 32→64→128 or 64→128→256 in successive blocks |
| **Kernel Size** | Larger kernels capture more context per layer; increases parameters and compute. | 3×3 (standard), 5×5 or 7×7 for wider receptive field |
| **Stride / Pooling** | Controls spatial downsampling. Stride>1 or pooling reduces resolution, lowers memory but may lose detail. | Pool 2×2 / stride=2 after each conv block |
| **Activation Function** | Non-linearities let the net learn complex mappings. Different types affect gradient flow (e.g. ReLU vs. LeakyReLU vs. ELU). | ReLU (default), LeakyReLU(0.1) for “dying ReLU” issue |
| **Dropout Rate & Placement** | Randomly zeros activations to reduce overfitting. Higher rate → stronger regularization but slower convergence. | 0.1–0.3 after conv blocks; 0.3–0.5 before FC layers |
| **Weight Decay (L2 Reg)** | Penalizes large weights → smoother, more generalizable models; too high → underfitting. | 1e-5, 1e-4 (common), 1e-3 for stronger reg |
| **Label Smoothing** | Softens targets → reduces overconfidence, can improve calibration, especially with many classes. | ε = 0.05–0.2 |
| **BatchNorm Momentum** | Controls running‐stat updates. Low momentum → stable but slow adaptation; high → fast but noisy. | 0.01, 0.05, 0.1 (default), 0.2 |
| **Batch Size** | Larger batches → smoother gradient estimates but requires more memory and can converge to sharp minima; small batches → noisier updates but better generalization sometimes. | 16, 32, 64, 128 |
| **Learning Rate** | Biggest lever on convergence speed and stability. Too high → divergence; too low → slow training. | 1e-4 – 1e-2 for Adam; 1e-3 – 1e-1 for SGD |
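A random-search sketch over a few of the attributes above; the search space, the scoring stub, and the budget of 5 trials are all placeholder assumptions:

```python
import random
import torch
import torch.nn as nn

random.seed(0)

# Hypothetical search space drawn from the table's typical ranges
space = {
    "lr": [1e-4, 1e-3, 1e-2],
    "weight_decay": [1e-5, 1e-4, 1e-3],
    "dropout": [0.1, 0.3, 0.5],
}

def train_and_score(cfg):
    """Stand-in for a real training run: build the model, return a placeholder score."""
    model = nn.Sequential(nn.Linear(8, 16), nn.Dropout(cfg["dropout"]), nn.Linear(16, 2))
    opt = torch.optim.AdamW(model.parameters(), lr=cfg["lr"],
                            weight_decay=cfg["weight_decay"])
    return random.random()    # replace with real validation accuracy

best_cfg, best_score = None, -1.0
for _ in range(5):            # 5 random draws from the space
    cfg = {k: random.choice(v) for k, v in space.items()}
    score = train_and_score(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score
```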
###### *Lists organized by ChatGPT*