# How to: Machine Learning with PyTorch

><i class="fa fa-file-text"></i> [PyTorch Tutorial](https://docs.pytorch.org/tutorials/)

:::success
:bulb: **Typical PyTorch training order [guide]**
1. Set Up
    - Import libraries
        - `torch, random, numpy, torchvision, torch.utils.data`
2. Data Transforms
    - Train transforms: heavy augmentations
        - `Resize→RandomFlip→Rotation→ColorJitter→ToTensor→Normalize`
    - Eval transforms (eval & test): minimal preprocessing
        - `Resize→ToTensor→Normalize`
3. Datasets & Dataloaders
    - Load the train "base_train" ds w/o transformations
    - Create an [80/20] train/validation split w/ `random_split`
    - Apply train transforms to the train ds and eval transforms to the val ds
    - Load the test ds separately and apply eval transforms
    - Create dataloaders for all 3 ds
        - `batch_size, shuffle, num_workers`
4. Define Model
    - Check/set device
    - Identify a suitable model type
    - Define the `nn.Module` and set hyper-parameters
        - `dropout_rate`
    - def `__init__, forward`
    - Include layers, activations, pooling, batch-norm, skip connections, etc.
5. Loss Function
    - Choose a suitable loss function
6. Optimizer & Scheduler
    - Choose a suitable optimizer
        - `weight_decay`
    - Choose a suitable scheduler
7. Validation / Metrics Function
    - Write a validation function that returns average loss + accuracy
8. (Optional) Hyper-Parameter Search
    - One-time tuning: grid/random/Bayesian search
    - Loop over hyper-params
        - `lr, wd, dropout, etc.`
    - Call the training routine
    - Get the best configs for the model
9. Training Loop
    - Consider single-model, traditional ensemble, or snapshot ensemble training
    - For each epoch:
        - Train: `model.train()` → optimizer zero grad → forward pass → loss backward → optimizer step
        - Validate: call the validation function on the val ds
        - Early stop & checkpoint: compare val_loss to the best so far; save the state dict if improved; break if no improvement
        - Scheduler step: pass the val loss from the validation function to the scheduler
        - Record epoch progress: epoch, train/val loss, train/val accuracy
10. Test Evaluation
    - Post-training results
    - Run `evaluate(best_model, test_loader)`
    - Get the final accuracy
11. Save & Load Model
    - Save your final model weights w/ `torch.save(model.state_dict(), path)`
    - Reload w/ `model.load_state_dict(torch.load(path))` and `model.eval()`
12. (Optional) Visuals & Reports
    - Simple & clean analysis graphs and reports to understand model behavior
    - Class visuals, progress tracking, confusion matrix and classification report, per-class metrics, Grad-CAM / saliency maps, etc.
13. (Optional) Add-ons & Other Features
    - Other ML concepts/algorithms can help with training accuracy and ML performance; however, these can heavily depend on resources
        - Model ensembles, ...
:::

## Datasets & Dataloaders
:::info
:information_source: **info**
- Dataset code should be decoupled from model training code for better readability and modularity
- Two PyTorch data primitives: `torch.utils.data.DataLoader` & `torch.utils.data.Dataset`
    - `Dataset` stores the samples and their corresponding labels
    - `DataLoader` wraps an iterable around the `Dataset` to enable easy access to the samples
- PyTorch domain libraries provide a number of pre-loaded datasets that subclass `torch.utils.data.Dataset`
- Or you can download your own dataset and point to its data directory
:::

## Transforms
:::info
:information_source: **info**
- Transforms are used to manipulate data so that it is suitable for training
- TorchVision datasets take two parameters: `transform` & `target_transform`
    - `transform` modifies features
    - `target_transform` modifies labels
    - e.g. `RandomHorizontalFlip`, `ColorJitter`, `ToTensor()`, `Normalize(mean, std)`
- Usually, images are turned into tensors and labels can be left as integers (all depending on the model)
:::

## Neural Networks
:::info
:information_source: **info**
- Neural networks are composed of layers/modules that perform operations on data
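For instance, a minimal module might look like the sketch below (the layer sizes are hypothetical, assuming 1×28×28 grayscale inputs and 10 classes):

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    def __init__(self, dropout_rate: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),           # 1x28x28 -> 784
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(128, 10),     # 10 output classes
        )

    def forward(self, x):
        return self.net(x)

model = TinyNet()
logits = model(torch.randn(4, 1, 28, 28))  # batch of 4 fake images
print(logits.shape)  # torch.Size([4, 10])
```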
- The `torch.nn` namespace provides all the building blocks you need to build your own neural network
- Every module in PyTorch subclasses `nn.Module`
- NNs consist of layers in a nested structure that allows for building/managing complex architectures easily
:::

## Autograd
:::info
:information_source: **info**
- Autograd is PyTorch's automatic differentiation engine; it usually runs during the training loop
- *Backpropagation* is a common algorithm used for training models
    - Parameters (model weights) are adjusted according to the gradient of the loss function with respect to each parameter
- PyTorch's built-in engine `torch.autograd` supports automatic computation of gradients for any computational graph
```python
loss.backward()   # accumulate ∂loss/∂param
optimizer.step()  # update params
```
:::

## Optimization
:::info
:information_source: **info**
- Training a model is an iterative process:
    - Guess output → calculate loss → collect the derivatives of the error with respect to the parameters → optimize these parameters using gradient descent
:::

## Save & Load Model
:::info
:information_source: **info**
- PyTorch models store the learned parameters in an internal state dictionary called the `state_dict`
    - `torch.save`
- To load model weights, you need to create an instance of the same model first, and then load the parameters
    - `load_state_dict()`
:::

## List of model types

| Model Type | Key Characteristics | Popular Use Cases |
| --- | --- | --- |
| **Feedforward Neural Networks** | Basic neural network for general tasks | General-purpose classification, regression |
| **Convolutional Neural Networks (CNNs)** | Uses convolutional layers for image data | Image classification, object detection, segmentation |
| **Recurrent Neural Networks (RNNs)** | Handles sequential data | Time series, language modeling, speech recognition |
| **Transformers** | Uses attention mechanisms for sequential data | NLP tasks (translation, summarization) |
| **Generative Adversarial Networks (GANs)** | Consists of generator and discriminator | Image generation, style transfer, data augmentation |
| **Autoencoders** | Learns compressed representations of data | Data compression, anomaly detection |
| **Reinforcement Learning (RL)** | Learns from interaction with an environment | Robotics, game playing, autonomous vehicles |
| **Siamese Networks** | Compares pairs of inputs for similarity | Face verification, image similarity |
| **Attention Mechanisms** | Focuses on important parts of the input | NLP tasks, machine translation |
| **Capsule Networks (CapsNets)** | Focuses on spatial relationships between features | Image classification, object detection |
| **Neural Style Transfer** | Combines the content of one image with the style of another | Artistic image generation |

:::success
:bulb: **Summary of the CNN Process:**
*SIMPLE PATTERN: [Conv→ReLU→Pool→Flatten→FC]*

Input Image: Start with an image (e.g., 28x28 pixels for grayscale).

Convolutional Layer: Apply convolution filters to detect features like edges and textures. The filter/kernel is multiplied element-wise with a small patch of the image, then summed into a single number (building the feature map).

Activation (ReLU): Apply an activation function (typically ReLU) to introduce non-linearity, allowing for more complex learning. Positive values are output directly while negatives are turned to zeros.
> [!Tip]
> Formula: `ReLU(x) = max(0, x)`

Pooling: Use max pooling to downsample the feature maps and reduce dimensionality. A window is slid over the feature map, selecting the max value in each region. This shrinks the feature map while still keeping the important values.

Flattening: Flatten the feature maps into a 1D vector. The fully connected layer only accepts and expects a 1D input.

Fully Connected Layer: Pass the flattened vector through fully connected layers.
All neurons in this layer are connected to all the neurons of the previous layer. The network computes a weighted sum by multiplying the vector by its weights and adding a bias. An activation function (like ReLU) is applied and outputs the result.

Output Layer: Finally, output a classification or regression result, depending on the task. An output could be a vector listing probabilities corresponding to each class.

>**Important Notes**
> Conv → BatchNorm → ReLU → Pool → Conv → BatchNorm → ReLU → Pool → Flatten → FC1 → ReLU → FC2
> - Pool halves the spatial dimensions
> - The last FC layer needs to match the number of classes
> - `nn.BatchNorm2d()` improves speed and performance: it normalizes each feature map (channel) to zero mean and unit variance, then scales and shifts the normalized output with learnable parameters
> - `nn.Dropout()` randomly disables (sets to 0) a fraction of neurons during training to prevent overfitting. It makes the model less reliant on any single feature and helps it generalize better.
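The pattern and notes above can be sketched as a small `nn.Module` (channel counts and layer sizes here are illustrative, assuming 1×28×28 inputs and 10 classes):

```python
import torch
from torch import nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 1x28x28 -> 32x28x28
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32x14x14 (pool halves dims)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # -> 64x14x14
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 64x7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # -> 64*7*7 = 3136
            nn.Linear(64 * 7 * 7, 128),                   # FC1
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes),                  # FC2 matches class count
        )

    def forward(self, x):
        return self.classifier(self.features(x))

out = SmallCNN()(torch.randn(2, 1, 28, 28))
print(out.shape)  # torch.Size([2, 10])
```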
:::

## List of Loss Functions

| Loss Function | Use Case | Target Format |
| --- | --- | --- |
| `CrossEntropyLoss` | Multiclass classification | Integer class labels |
| `BCEWithLogitsLoss` | Binary/multi-label classification | Float 0/1 vector |
| `NLLLoss` | Classification with log-softmax | Integer class labels |
| `MSELoss` | Regression | Float value |
| `L1Loss` | Regression (less sensitive to outliers) | Float value |
| `KLDivLoss` | Distribution comparison | Log-prob vs prob |

## List of Optimizers

| Optimizer | When to use | Notes |
| --- | --- | --- |
| `Adam` | Default for most deep learning tasks | Fast, adaptive, great out of the box |
| `SGD` | Classic choice | Good with momentum, but slower |
| `RMSprop` | RNNs, time series | Handles non-stationary gradients well |
| `AdamW` | Transformers / weight decay | Adam + better weight regularization |

## List of LR Schedulers

| Scheduler | Pros | When to use |
| --- | --- | --- |
| `StepLR` | Simple, predictable drops | You know roughly when to decay (e.g. every 10 epochs) |
| `MultiStepLR` | Multiple custom decay points | You have a prior on milestones (e.g. \[30, 60, 90]) |
| `ExponentialLR` | Smooth exponential decay | You want continuous decay |
| `CosineAnnealingLR` | Smooth "cosine" decay (use `CosineAnnealingWarmRestarts` for warm restarts) | Modern "no-prior" recipes |
| `ReduceLROnPlateau` | Decay triggered by validation metric stalls | You don't know good epochs in advance |

## List of key architectural choices & hyperparameters

| **Attribute** | **Impact on Model** | **Typical Values / Notes** |
| --- | --- | --- |
| **# Convolutional Layers** | More layers → higher capacity & larger receptive field; risk of overfitting and longer training time. | 2–5 "blocks" for small datasets; up to 50+ for deep nets |
| **Filter Count (Channels)** | More channels per layer → more features learned, but more params and compute. | 32→64→128 or 64→128→256 in successive blocks |
| **Kernel Size** | Larger kernels capture more context per layer; increases parameters and compute. | 3×3 (standard), 5×5 or 7×7 for wider receptive field |
| **Stride / Pooling** | Controls spatial downsampling. Stride > 1 or pooling reduces resolution, lowers memory but may lose detail. | Pool 2×2 / stride=2 after each conv block |
| **Activation Function** | Non-linearities let the net learn complex mappings. Different types affect gradient flow (e.g. ReLU vs. LeakyReLU vs. ELU). | ReLU (default), LeakyReLU(0.1) for the "dying ReLU" issue |
| **Dropout Rate & Placement** | Randomly zeros activations to reduce overfitting. Higher rate → stronger regularization but slower convergence. | 0.1–0.3 after conv blocks; 0.3–0.5 before FC layers |
| **Weight Decay (L2 Reg)** | Penalizes large weights → smoother, more generalizable models; too high → underfitting. | 1e-5, 1e-4 (common), 1e-3 for stronger reg |
| **Label Smoothing** | Softens targets → reduces overconfidence, can improve calibration, especially with many classes. | ε = 0.05–0.2 |
| **BatchNorm Momentum** | Controls running-stat updates. Low momentum → stable but slow adaptation; high → fast but noisy. | 0.01, 0.05, 0.1 (default), 0.2 |
| **Batch Size** | Larger batches → smoother gradient estimates but require more memory and can converge to sharp minima; small batches → noisier updates but sometimes better generalization. | 16, 32, 64, 128 |
| **Learning Rate** | Biggest lever on convergence speed and stability. Too high → divergence; too low → slow training. | 1e-4 – 1e-2 for Adam; 1e-3 – 1e-1 for SGD |

###### *Lists organized by ChatGPT*
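Putting a few rows from these tables together, here is a hedged sketch of how a loss, optimizer, and scheduler might be wired up (the model is a placeholder and every hyperparameter value is illustrative, not prescriptive):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder model

# Loss: multiclass classification with integer labels, plus label smoothing
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Optimizer: AdamW = Adam with decoupled weight decay (L2-style regularization)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Scheduler: drop the LR when the monitored metric stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3
)

# One illustrative step with fake data
x, y = torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

val_loss = loss.item()      # normally this comes from the *validation* set
scheduler.step(val_loss)    # ReduceLROnPlateau takes the metric as input
```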