# 18 Accelerated SGD & ResNets
# Recap
- **SGD**: SGD updates the model's parameters (weights) by stepping each weight in the direction of the negative gradient (slope) of the loss with respect to that weight.
- **SGD with momentum**: Smooths updates by blending each gradient with an exponential moving average of previous gradients.
- **RMSProp** (Root Mean Square Propagation): Adapts the learning rate per parameter
- RMSProp scales each parameter's step by a moving average of its squared gradients, helping with non-stationary objectives
- Benefits:
- Works well for recurrent neural networks (RNNs)
- Handles exploding/vanishing gradient magnitudes better than plain SGD
- **Adam**: Combines momentum + adaptive learning rates + bias correction (update rules sketched below)
- Adam combines the ideas of momentum (from SGD with momentum) and RMSProp, plus a bias correction for the zero-initialized averages, to adaptively adjust the step for each parameter.
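To make the recap concrete, here is one common formulation of the three update rules (standard textbook versions, not copied from the lesson notebook), where $g_t$ is the gradient at step $t$, $\eta$ the learning rate, and $\epsilon$ a small constant for numerical stability:

$$
\begin{aligned}
\text{Momentum:}\quad & v_t = \beta v_{t-1} + (1-\beta) g_t, \qquad w_t = w_{t-1} - \eta\, v_t \\[4pt]
\text{RMSProp:}\quad & s_t = \alpha s_{t-1} + (1-\alpha) g_t^2, \qquad w_t = w_{t-1} - \frac{\eta\, g_t}{\sqrt{s_t} + \epsilon} \\[4pt]
\text{Adam:}\quad & v_t = \beta_1 v_{t-1} + (1-\beta_1) g_t, \quad s_t = \beta_2 s_{t-1} + (1-\beta_2) g_t^2, \\
& \hat{v}_t = \frac{v_t}{1-\beta_1^t}, \quad \hat{s}_t = \frac{s_t}{1-\beta_2^t}, \qquad w_t = w_{t-1} - \frac{\eta\, \hat{v}_t}{\sqrt{\hat{s}_t} + \epsilon}
\end{aligned}
$$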

# **Gradient Descent in Excel**
- Started with `graddesc.xlsx` in Excel to manually implement **Stochastic Gradient Descent (SGD)** for linear regression (target: slope = 2, intercept = 30).
- Used **finite differencing** (tiny perturbation) to estimate gradients and compared them with **analytic derivatives**.
- Updated weights using: $w_{\text{new}} = w_{\text{old}} - \text{learning rate} \times \text{gradient}$
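A minimal Python sketch of the same procedure, assuming synthetic data for the spreadsheet's targets of slope = 2 and intercept = 30 (the data generation, learning rate, and step count are illustrative, and a full-batch gradient is used for simplicity):

```python
import numpy as np

np.random.seed(0)
x = np.random.uniform(-10, 10, 100)
y = 2 * x + 30 + np.random.randn(100)          # target: slope 2, intercept 30, plus noise

def mse(slope, intercept):
    return ((slope * x + intercept - y) ** 2).mean()

slope, intercept, lr, eps = 1.0, 1.0, 0.01, 1e-6
for step in range(1000):
    # finite differencing: nudge each weight by a tiny amount to estimate its gradient
    g_slope = (mse(slope + eps, intercept) - mse(slope, intercept)) / eps
    g_inter = (mse(slope, intercept + eps) - mse(slope, intercept)) / eps
    # w_new = w_old - learning rate * gradient
    slope -= lr * g_slope
    intercept -= lr * g_inter

print(slope, intercept)                        # converges towards roughly 2 and 30
```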
# **SGD Enhancements in Excel**
- Implemented **Momentum** by blending current gradients with previous updates using β (e.g., 0.9).
- Implemented **RMSProp**:
- Tracks moving average of squared gradients.
- Normalizes gradients using this variance to adapt step size.
- Implemented **Adam**:
- Combines momentum and RMSProp.
- Achieves much faster convergence toward target weights.
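A hedged Python sketch of the same three enhancements written as standalone update functions (hyperparameter values are common defaults, not necessarily the ones used in the spreadsheet):

```python
import numpy as np

def momentum_step(w, grad, state, lr=0.01, beta=0.9):
    # blend the current gradient with the running average of previous gradients
    state["v"] = beta * state.get("v", 0.0) + (1 - beta) * grad
    return w - lr * state["v"]

def rmsprop_step(w, grad, state, lr=0.01, alpha=0.99, eps=1e-8):
    # track a moving average of squared gradients and normalise the step by it
    state["s"] = alpha * state.get("s", 0.0) + (1 - alpha) * grad ** 2
    return w - lr * grad / (np.sqrt(state["s"]) + eps)

def adam_step(w, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # momentum + RMSProp, plus bias correction for the zero-initialised averages
    t = state["t"] = state.get("t", 0) + 1
    state["v"] = beta1 * state.get("v", 0.0) + (1 - beta1) * grad
    state["s"] = beta2 * state.get("s", 0.0) + (1 - beta2) * grad ** 2
    v_hat = state["v"] / (1 - beta1 ** t)
    s_hat = state["s"] / (1 - beta2 ** t)
    return w - lr * v_hat / (np.sqrt(s_hat) + eps)
```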
# Learning rate annealing
## Experimenting with Learning Rate Schedulers
- **Definition**:
- A **Scheduler** in machine learning is a **tool** that adjusts the **learning rate** during training, usually based on the **epoch**, **iteration**, or **validation performance**.
- A **Learner** is an abstraction or object that:
- Wraps a model (like a neural network or classifier)
- Handles training, validation, evaluation, and inference
- Connects together the model, data, loss function, optimizer, and metrics
- Demonstrated a simple **annealing scheduler** in Excel using average gradient squared to **automatically reduce learning rate**.
- Transitioned to **PyTorch** and explored:
- `torch.optim.lr_scheduler` module.
- How optimizers and schedulers manage **parameter groups** and **internal state**.
- Created **custom learning rate recorder callback** to plot learning rates during training.
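A minimal plain-PyTorch sketch of that exploration (not the course's callback; the model and scheduler choice here are placeholders): a scheduler mutates the `lr` stored in the optimizer's `param_groups`, and a simple "recorder" just reads that value each epoch.

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)                          # placeholder model
opt   = optim.SGD(model.parameters(), lr=0.1)
sched = optim.lr_scheduler.ExponentialLR(opt, gamma=0.9)

lrs = []                                          # a tiny learning-rate recorder
for epoch in range(10):
    # ... training for one epoch would go here (forward, backward, opt.step(), opt.zero_grad())
    lrs.append(opt.param_groups[0]["lr"])         # read the current lr from the parameter group
    sched.step()                                  # the scheduler anneals the lr in place

print(lrs)                                        # 0.1, 0.09, 0.081, ... ready to plot
```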
# PyTorch learning rate schedulers
- **Warm-up then decay (the one-cycle idea)**: Instead of using a fixed or monotonically decaying learning rate, you:
- Start with a small LR
- Gradually increase to a maximum LR
- Then gradually decrease to a very small LR
- Demonstrated:
- **CosineAnnealingLR**
- **OneCycleLR** (from Leslie Smith's paper)
**By ChatGPT**:
| Feature | OneCycleLR | CosineAnnealingLR |
| ----------------- | --------------------------- | ------------------------- |
| LR Pattern | Up → Down (triangular) | Smooth cosine decay |
| Super-convergence | Yes | No |
| Ideal for | Fast training in few epochs | Stable long-term training |
| Momentum support | Yes (inverse pattern) | No |
| Complexity | Moderate | Simple |
- Showed **momentum and learning rate curves** during One-Cycle training.
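A sketch of how those curves can be reproduced with plain PyTorch (the dummy model and hyperparameters are assumptions): `OneCycleLR` raises the learning rate to `max_lr` and back down while cycling momentum in the opposite direction, whereas `CosineAnnealingLR` only decays.

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)                                    # placeholder model
opt = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

epochs, steps_per_epoch = 5, 100
sched = optim.lr_scheduler.OneCycleLR(opt, max_lr=0.1,
                                      epochs=epochs, steps_per_epoch=steps_per_epoch)
# Alternative (decay only, no warm-up, no momentum cycling):
# sched = optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs * steps_per_epoch)

lrs, moms = [], []
for step in range(epochs * steps_per_epoch):
    # ... one training batch would go here
    lrs.append(opt.param_groups[0]["lr"])
    moms.append(opt.param_groups[0]["momentum"])            # cycled inversely to the lr
    sched.step()                                            # OneCycleLR steps per *batch*
```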
- **What is T-Fixup?**
T-Fixup is an initialization scheme designed to train Transformer models without Layer Normalization or learning-rate warm-up, two techniques typically required for stable Transformer training.
>It was introduced in the ICML 2020 paper "Improving Transformer Optimization Through Better Initialization"
>by Huang et al., building on the Fixup initialization of Zhang et al. (2019).
# Working with PyTorch optimizers
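A small, hedged sketch of the two pieces of a PyTorch optimizer referred to in this lesson: `param_groups` holds the hyperparameters applied to each group of parameters, and `state` holds per-parameter buffers (e.g., Adam's moving averages), which only appear after the first step. The tiny model and data are placeholders.

```python
import torch
from torch import nn, optim

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))   # placeholder model
opt = optim.Adam(model.parameters(), lr=1e-3)

# param_groups: per-group hyperparameters (lr, betas, weight_decay, ...)
for group in opt.param_groups:
    print(group["lr"], group["betas"], len(group["params"]))

# state: per-parameter buffers, created lazily after the first optimizer step
x, y = torch.randn(16, 4), torch.randn(16, 1)
nn.functional.mse_loss(model(x), y).backward()
opt.step()
print(list(opt.state.values())[0].keys())   # e.g. dict_keys(['step', 'exp_avg', 'exp_avg_sq'])
```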
# Neural network architecture improvements
- ResNets
- Deeper and wider networks (more layers, more channels, larger kernel sizes)
## What is ResNet?
**ResNet** (short for **Residual Network**) is a **deep convolutional neural network architecture** introduced by **Microsoft Research** in the paper:
> "Deep Residual Learning for Image Recognition"
>
> *(Kaiming He et al., 2015, CVPR)*
It was **revolutionary** because it solved the **"degradation problem"**: adding more layers to a plain deep network makes its accuracy worse, even on the training set.

The key idea is using a **skip connection** to allow deeper networks to train successfully. A skip connection (also called a **shortcut connection**) is a direct path that skips one or more layers and **adds the input** directly to the output of a later layer.
## Residual Block Structure (ResNet Block)
- Input → Conv (3×3, stride=1, padding=1)
- → BatchNorm → ReLU
- → Conv (3×3) → BatchNorm
- Add input (**skip connection**)
- → Final ReLU
- **Key idea**: Add input (identity) to output element-wise
- **Requires matching dimensions between input and output**
| Parameter | Meaning |
| -------------- | ----------------------------------------------------------------------- |
| **Conv (3×3)** | A **3×3 filter/kernel** slides over the image to extract local patterns |
| **stride=1** | The filter moves **1 pixel at a time** (no skipping) |
| **padding=1** | **1-pixel border** of zeros is added around the input to preserve size |
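A minimal PyTorch sketch of the block just described (not the course's implementation; the 1×1 projection shortcut is one standard way to handle the shape-mismatch issue raised in the next subsection):

```python
import torch
from torch import nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        # identity shortcut when shapes match; 1x1 conv projection otherwise
        self.shortcut = (nn.Identity() if c_in == c_out and stride == 1
                         else nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False))

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))    # Conv -> BatchNorm -> ReLU
        out = self.bn2(self.conv2(out))          # Conv -> BatchNorm
        out = out + self.shortcut(x)             # add the (possibly projected) input
        return F.relu(out)                       # final ReLU

x = torch.randn(1, 64, 28, 28)
print(ResBlock(64, 64)(x).shape)                 # torch.Size([1, 64, 28, 28])
```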
## Challenges in Other Tasks
- Input and output of a block might not have the same shape (e.g., when the network downsamples or changes channel count in image classification), so the shortcut needs a projection such as a 1×1 conv
- Input signal may still get lost in middle layers


## Why ResNet is Important
- Enabled successful training of very deep networks (100+ layers)
- Won the ImageNet 2015 competition
- Forms the backbone of many modern CNNs (used in classification, detection, segmentation)
## Example Use Cases
- Image classification
- Object detection (used in Faster R-CNN, Mask R-CNN)
- Medical imaging
- Transfer learning (ResNet50 pretrained on ImageNet)
## Improving CNN Model Architectures
- Original model: 4 Conv layers, up to 64 channels.
- Modifications:
- Deeper model (up to 128 channels) → Accuracy: **91.7%**
- **ResNet-style skip connections** → Accuracy: **92.2%**
- Custom ResNet outperformed `ResNet18d` from `timm` (**92.0%**)
- Key insight: **Common-sense architectural tweaks** beat pre-built models for Fashion-MNIST.
- Pooling ⇒ shrinks the image (reduces the spatial size of the feature map)
- Test time augmentation
## **Reducing Parameters & FLOPs**
- Used **Global Average Pooling** for flexibility with input sizes.

- Calculated **parameters and FLOPs** for each layer.
- Replaced early ResBlock with a simple Conv layer → **lower compute**, **same accuracy** (92.7%).
- **FLOPs** here counts the number of **floating-point operations** (multiplications and additions) a layer performs, a measure of its compute cost.
## What does FLOPs mean?
- **FLOPs** (lowercase *s*, floating-point operations) measures the computational cost of a model or layer: how many floating-point arithmetic operations (like additions or multiplications with decimal numbers) it takes to run it. This is the quantity computed per layer above.
- **FLOPS** (Floating Point Operations per Second) measures hardware throughput, i.e. how many such operations a system or processor can perform every second, and comes up in deep learning, scientific computing, and graphics/simulations.
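A hedged sketch of the two ideas above: a global-average-pooling head that lets the network accept any input size, and a per-layer parameter count (the architecture is a toy stand-in, not the course model; FLOPs would additionally depend on the spatial size each layer processes):

```python
import torch
from torch import nn

# Global average pooling averages each channel over all spatial positions,
# so the classifier head works for any input resolution.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                 # -> (batch, 64, 1, 1) regardless of input size
    nn.Flatten(),
    nn.Linear(64, 10),
)

# Parameter count per layer
for name, module in model.named_children():
    n = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n:,} params")

print(model(torch.randn(1, 1, 28, 28)).shape)    # works for 28x28 ...
print(model(torch.randn(1, 1, 56, 56)).shape)    # ... and for 56x56, thanks to global pooling
```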
# Data Augmentation techniques
## Data Augmentation
- **Data augmentation** is a technique used to artificially increase the size and diversity of a training dataset by applying transformations to existing data samples.
- It’s most commonly used in image, text, and audio tasks to help models generalize better and prevent overfitting.
- Achieved **93.8%** accuracy in **20 epochs**.
- Implemented:
- `RandomCrop` + padding
- `RandomHorizontalFlip`
- `RandErase` (random block replaced with noise)
## Random erasing
- It improves model robustness by randomly removing a rectangular region from an image and filling it with:
- Random values
- A constant value (e.g., 0)
- Mean pixel values
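A sketch of an equivalent pipeline using stock torchvision transforms (the course applies its own batch-level versions, so the transform names and parameter values here are assumptions):

```python
from torchvision import transforms

train_tfms = transforms.Compose([
    transforms.RandomCrop(28, padding=1),        # pad by 1 pixel, then crop back to 28x28
    transforms.RandomHorizontalFlip(),           # flip left-right with p=0.5
    transforms.ToTensor(),                       # RandomErasing operates on tensors, so it comes last
    transforms.RandomErasing(p=0.5, value=0),    # random block replaced with a constant value
])
```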

## Test-Time Augmentation (TTA) & Ensembling
- Applied horizontal flip at inference, then averaged predictions.
→ Boosted accuracy to **94.2%**
- Tried **ensembling** two 25-epoch models → ~**94%**, but didn’t beat best.
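A minimal sketch of the horizontal-flip TTA described above (the function name is made up for illustration):

```python
import torch

@torch.no_grad()
def tta_predict(model, xb):
    # Test-time augmentation: average predictions over the original batch
    # and its horizontally flipped version (flip along the width dimension).
    preds = model(xb)
    preds_flipped = model(torch.flip(xb, dims=[-1]))
    return (preds + preds_flipped) / 2
```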
## Advanced Data Augmentation Ideas
- **Random Copying**: replace a patch of the image with another patch copied from the same image (see the sketch below).

- Trained with deeper and wider ResNet, reached **94.6%** accuracy in **50 epochs**.
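One possible implementation of the random-copy idea (patch size and placement strategy are assumptions, not the course's exact code):

```python
import torch

def random_copy_(x, pct=0.2):
    # Copy a random patch of the image (C, H, W) onto another random location, in place.
    # Unlike random erasing, the pasted pixels keep the image's own statistics.
    _, h, w = x.shape
    ph, pw = int(h * pct), int(w * pct)
    sy, sx = torch.randint(0, h - ph, (1,)).item(), torch.randint(0, w - pw, (1,)).item()
    dy, dx = torch.randint(0, h - ph, (1,)).item(), torch.randint(0, w - pw, (1,)).item()
    x[:, dy:dy + ph, dx:dx + pw] = x[:, sy:sy + ph, sx:sx + pw].clone()
    return x
```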
# Homework
- Build your own:
- Cosine annealing scheduler
- 1-Cycle scheduler using PyTorch API
- Try to **beat Jeremy’s results** (5, 20, or 50 epoch Fashion-MNIST) using:
- Custom models
- Data augmentation
- Thoughtful experimentation
- Share progress on [forums.fast.ai](https://forums.fast.ai/)
# Key Takeaways by ChatGPT and [NoteGPT](https://notegpt.io/youtube-transcript-generator)
## Introduction to SGD
The lesson begins with an introduction to stochastic gradient descent (SGD) and its importance in training models, illustrated in the context of linear regression implemented in an Excel spreadsheet.
## Implementing Linear Regression with SGD
Jeremy demonstrates how to implement linear regression with SGD: computing predictions from an intercept and slope, measuring the error with mean squared error, and estimating gradients by finite differencing.
## Momentum and Advanced Optimizers
The use of momentum is introduced to accelerate learning in SGD. Different optimizers are discussed, including RMSProp and Adam, detailing how they modify gradients to improve convergence speed.
## Learning Rate Adjustment
The lesson emphasizes the significance of adjusting learning rates dynamically via techniques like cosine annealing and one-cycle learning rates, enhancing model training efficiency and performance.
## Data Augmentation Techniques
Data augmentation is covered extensively, showcasing methods such as random crops, horizontal flips, and random erasing techniques to enhance model robustness against overfitting.
## Ensembling Models for Better Accuracy
Finally, the concept of ensembling is introduced, showing how combining predictions from multiple models can yield higher accuracy, illustrating the importance of diverse training methodologies.
# Jupyter notebooks
- [https://github.com/fastai/course22p2/blob/master/nbs/12_accel_sgd.ipynb](https://github.com/fastai/course22p2/blob/master/nbs/12_accel_sgd.ipynb)
- [https://github.com/fastai/course22p2/blob/master/nbs/13_resnet.ipynb](https://github.com/fastai/course22p2/blob/master/nbs/13_resnet.ipynb)
- [https://github.com/fastai/course22p2/blob/master/nbs/14_augment.ipynb](https://github.com/fastai/course22p2/blob/master/nbs/14_augment.ipynb)
# Other resources
- [台大 NTU 李宏毅 2021 機器學習筆記](https://chsiang.notion.site/ntuml2021notes?v=87bcbe3e176e422aa85cfb86900c5fd6)
- [【機器學習2021】批次 (batch) 與動量 (momentum)](https://www.youtube.com/watch?v=zzbr1h9sF54&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=5)
- [【機器學習2021】卷積神經網路 (Convolutional Neural Networks, CNN)](https://www.youtube.com/watch?v=OP5HcXJg2Aw)
- [【機器學習2021】機器學習任務攻略](https://www.youtube.com/watch?v=WeHM2xpYQpw)
- [Residual Networks (ResNet)](https://www.youtube.com/watch?v=w1UsKanMatM)
- [ResNet Visualization](https://tensorspace.org/html/playground/resnet50.html)
- [ResNet (actually) explained in under 10 minutes](https://www.youtube.com/watch?v=o_3mboe1jYI)
- [What is the average pooling in deep learning?](https://www.youtube.com/watch?v=iIaocj4z4J4)