# 17 Initialization/normalization

# Keywords

![Screenshot 2025-06-02 at 12.12.29 PM](https://hackmd.io/_uploads/BJIY6Sifgl.jpg)

| Term | Description | Analogy |
| --- | --- | --- |
| Neuron | Unit that processes input and outputs a result | Brain cell |
| Layer | Collection of neurons at the same level | Layer of brain tissue |
| Channel | Depth dimension for features (e.g., RGB = 3 channels) | Color filter / Feature map |

## What is a **Neuron**?

A **neuron** is the **basic computation unit** in a neural network.

### It performs:

1. **Linear combination:** $z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$, where:
   - $x_i$: inputs
   - $w_i$: weights
   - $b$: bias term
2. **Passes the result to an activation function** (optional in some cases).

![Screenshot 2025-06-04 at 7.26.35 AM](https://hackmd.io/_uploads/r1RkRs6Glx.jpg)

![Screenshot 2025-06-04 at 7.31.10 AM](https://hackmd.io/_uploads/Syfq0iaGel.jpg)

Network parameter 𝜃: all the weights and biases in the **neurons**.

## What is an **Activation Function**?

An **activation function** is a **non-linear transformation** applied to the neuron's output.

### Its purpose:

- Adds **non-linearity** to the model
- Enables the network to learn complex patterns (not just straight lines)
- Determines whether the neuron "fires" or not

### Common Activation Functions

| Function | Formula | Characteristics |
| --- | --- | --- |
| **ReLU (Rectified Linear Unit)** | `f(x) = max(0, x)` | Simple and fast, used in hidden layers |
| **Sigmoid** | `f(x) = 1 / (1 + e^(-x))` | Smooth output between 0 and 1, used in binary classification |
| **Tanh** | `f(x) = (e^x - e^(-x)) / (e^x + e^(-x))` | Output between -1 and 1, zero-centered |
| **Softmax** | `f(xᵢ) = e^(xᵢ) / Σe^(xⱼ)` | Used in the output layer for multi-class classification |
| **Leaky ReLU** | `f(x) = x if x > 0 else 0.01x` | Fixes the "dying ReLU" problem by allowing a small gradient when x < 0 |

## **Channel vs. Layer**

| Feature | **Channel** | **Layer** |
| --- | --- | --- |
| **Definition** | A dimension representing different types of data or features in an input or output tensor | A complete building block in a neural network (e.g., convolution, activation, pooling) |
| **Purpose** | Holds multiple features or filters at one point in the network | Performs a transformation on data (like convolution, ReLU, etc.) |
| **Where it's used** | Inside inputs/outputs of layers (e.g., in image tensors or feature maps) | In the model architecture (e.g., Conv2D layer, Dense layer) |
| **Analogy** | Like different “TV channels” showing different content from the same place | Like "floors" in a building, each doing a different job |
| **Typical Quantity** | 3 (RGB), 64, 128, ... depending on the number of filters | Layers are stacked (e.g., 10 layers in a CNN model) |
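To make the neuron and the channel/layer distinction concrete, here is a minimal PyTorch sketch (PyTorch is assumed here, as in the course notebooks; all shapes and channel counts are illustrative only): a single linear neuron followed by ReLU, and one convolutional layer mapping 3 input channels (RGB) to 16 output channels.

```python
import torch
import torch.nn as nn

# A single "neuron": z = w1*x1 + ... + wn*xn + b, then an activation.
x = torch.randn(5)                      # 5 inputs
w, b = torch.randn(5), torch.randn(1)   # weights and bias
z = w @ x + b                           # linear combination
a = torch.relu(z)                       # activation ("does the neuron fire?")

# Channels vs. layers: one Conv2d *layer* maps 3 input *channels* (RGB)
# to 16 output channels (feature maps). Numbers are illustrative.
img = torch.randn(1, 3, 28, 28)         # (batch, channels, height, width)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
out = conv(img)
print(out.shape)                        # torch.Size([1, 16, 28, 28])
```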
## **Types of Layers**

| Layer Type | Purpose | When Used |
| --- | --- | --- |
| **Convolutional Layer** | Extract local features | Beginning of CNN |
| **Pooling Layer** | Downsample feature maps | After convolution layers |
| **Fully Connected Layer** | Final decision/classification | End of CNN |
| **Normalization Layer** | Stabilize/accelerate training | After convolution or dense layers |
| **Dropout Layer** | Prevent overfitting | During training |

![Screenshot 2025-06-04 at 7.35.04 AM](https://hackmd.io/_uploads/By8o12azee.jpg)

![Screenshot 2025-06-04 at 7.35.15 AM](https://hackmd.io/_uploads/BJr3J3TGel.jpg)

# **Lesson Theme**

- Balanced theory and practice
- Emphasized the importance of:
  - Proper **initialization**
  - Effective **normalization**
  - Smart **optimization strategies** for building robust deep learning models

## Callback class and TrainLearner subclass

- The addition of **getattr** in the Callback class and the new TrainLearner subclass both streamline training code, reducing boilerplate and improving code clarity, which is vital for rapid experimentation in research and production.
- HooksCallback and ActivationStats
  - Enhancements to the miniai library callbacks simplify training workflows.
  - The introduction of HooksCallback and ActivationStats improves activation monitoring.
  - Practical goal: exceed 90% accuracy on Fashion-MNIST without architecture changes.

**Initialization** refers to how we set the initial values of model parameters (weights and biases) before training begins.

## Why is Initialization Important?

- Training a model involves updating weights step-by-step using gradients.
- If weights are poorly initialized, the model may:
  - Train very slowly 🚶‍♂️
  - Get stuck in bad local minima or plateaus
  - Suffer from vanishing or exploding gradients
- Good initialization therefore gives the model a better starting point.

## Xavier/Kaiming initialization

- The instructor emphasizes learning-rate selection using a learning rate finder, and the importance of normalized weights and inputs for training stability. Weight-initialization techniques such as **Xavier (Glorot)** and **Kaiming (He)** are introduced to maintain zero mean and unit variance throughout the network layers.
- **Xavier (Glorot)** initialization: Maintains the variance of activations across layers; suitable for sigmoid and tanh activations.
- **Kaiming (He)** initialization: Designed for ReLU activations; helps preserve the variance of activations in deep networks.

## Variance, standard deviation, and covariance

## General ReLU activation function

- Standard ReLU zeroes out negative inputs, skewing mean activations.
- Modifying ReLU into a **“general ReLU”**, which incorporates a constant shift and a leaky slope, allows **negative outputs**, enabling the network to maintain more balanced activation distributions, thereby improving training stability and final accuracy (see the sketch after the figure below).
- Issue with standard **ReLU**: causes **positive-biased outputs**.
- Introduced **General ReLU**: allows controlled negative outputs to stabilize learning.
- **Leaky ReLU**: allows a small gradient for negative values.

![image](https://hackmd.io/_uploads/rJqih5sMxe.png)

https://en.wikipedia.org/wiki/Rectifier_%28neural_networks%29
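The sketch below is a minimal illustration of the "general ReLU" idea: a leaky negative slope plus a constant subtraction so that activations can go below zero and stay roughly zero-mean. The class name, parameter names (`leak`, `sub`, `maxv`), and defaults are assumptions for illustration and may not match the course's miniai implementation exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralReLU(nn.Module):
    """ReLU variant with a leaky negative slope, a constant shift,
    and an optional maximum value (names/defaults are illustrative)."""
    def __init__(self, leak=0.1, sub=0.4, maxv=None):
        super().__init__()
        self.leak, self.sub, self.maxv = leak, sub, maxv

    def forward(self, x):
        # Leaky slope keeps a small gradient for negative inputs.
        x = F.leaky_relu(x, self.leak)
        # Subtracting a constant shifts the mean activation back toward zero.
        x = x - self.sub
        # Optionally clamp very large activations.
        if self.maxv is not None:
            x = x.clamp_max(self.maxv)
        return x

# Quick check: the mean output is closer to zero than with plain ReLU.
x = torch.randn(10_000)
print(torch.relu(x).mean().item(), GeneralReLU()(x).mean().item())
```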
### Layer-wise Sequential Unit Variance (LSUV)

- LSUV initialization offers practical layer-wise normalization: iteratively adjusting weights based on batch activation statistics helps ensure the network maintains normalized outputs layer by layer, reducing the problems caused by poor initialization and accelerating convergence.
- It works in two steps per layer:
  - Initialize weights orthogonally (good for preserving variance)
  - Pass a mini-batch forward and rescale the layer’s weights so that the output of that layer has unit variance (≈1.0)
- Example use case, in a convolutional neural network for image classification:
  - You initialize each layer with LSUV,
  - Then start training with SGD or Adam,
  - The model will often reach good performance more quickly than with naive initialization.

### Compared To:

| Initialization | Preserves variance? | Data-aware? | Notes |
| --- | --- | --- | --- |
| Xavier | Partially | No | Good for tanh |
| Kaiming | Better for ReLU | No | Good for ReLU-based nets |
| **LSUV** | ✅ Yes (empirically) | ✅ Yes | Rescales using actual data |

![image](https://hackmd.io/_uploads/HyDwhciGlx.png)

A **batch** is a subset of the training dataset used to compute one update to the model’s weights during training.

**Normalization** is a data preprocessing technique used to rescale features so they have similar ranges or statistical properties (like mean and standard deviation). **This makes training more stable and faster.**

## Why is Normalization Important?

- Improves model performance
- Especially for gradient-based models (e.g., neural networks)
- Prevents features with large values from dominating
- Helps gradients converge more smoothly

## Layer Normalization and Batch Normalization

- The instructor explains the underlying mechanisms of BatchNorm, including the use of exponential moving averages for the means and variances, and their importance during inference. The concept of buffers in deep learning frameworks is highlighted as a key feature for maintaining these statistics effectively.
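To illustrate how buffers hold the running statistics that BatchNorm uses at inference time, here is a simplified 1-D batch-norm sketch. It assumes PyTorch and is deliberately stripped down; a real implementation such as `nn.BatchNorm1d` handles more cases and shapes.

```python
import torch
import torch.nn as nn

class SimpleBatchNorm1d(nn.Module):
    """Toy batch norm: buffers store running statistics via an exponential
    moving average; they are saved with the model but not trained."""
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        super().__init__()
        self.momentum, self.eps = momentum, eps
        # Learnable scale and shift (parameters).
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        # Running statistics (buffers), used at inference time.
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):                      # x: (batch, features)
        if self.training:
            mean, var = x.mean(0), x.var(0, unbiased=False)
            with torch.no_grad():              # EMA update of the buffers
                self.running_mean.lerp_(mean, self.momentum)
                self.running_var.lerp_(var, self.momentum)
        else:                                  # inference: use stored stats
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = SimpleBatchNorm1d(4)
out = bn(torch.randn(32, 4))                   # training-mode forward pass
```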
### Normalization Techniques

Explained and compared:

- **BatchNorm**: Normalizes across the batch dimension
- **LayerNorm**: Normalizes across the feature dimension, per sample
- **InstanceNorm**: Normalizes per image or instance
- **GroupNorm**: Hybrid approach for small batch sizes

![image 1](https://hackmd.io/_uploads/rJseRBjGle.png)

| Scenario | Recommended Norm |
| --- | --- |
| Large-batch training (CNN) | BatchNorm |
| NLP / Transformers | LayerNorm |
| Image style transfer / GANs | InstanceNorm |
| Small batch sizes (e.g., medical images) | GroupNorm |

## Types of Training

| Type | Description |
| --- | --- |
| **Batch Gradient Descent** | Uses the **entire dataset** to compute each gradient |
| **Stochastic Gradient Descent (SGD)** | Uses **1 sample** at a time for each update |
| **Mini-Batch Gradient Descent** | Uses a **small batch (e.g., 32 or 64 samples)** per update |

## Accelerated SGD, RMSProp, and Adam optimizers

![Screenshot_2025-06-02_at_11.23.36_AM](https://hackmd.io/_uploads/HJCqtSofxe.jpg)

![Screenshot_2025-06-02_at_11.25.02_AM](https://hackmd.io/_uploads/S1koYBifxx.jpg)

![Screenshot_2025-06-02_at_11.27.28_AM](https://hackmd.io/_uploads/r11vYHsfgl.jpg)

## Optimization Algorithms

- **SGD with momentum**: Smooths updates
- **RMSProp**: Adapts the learning rate per parameter
- **Adam**: Combines momentum + adaptive learning rates + bias correction (a toy sketch of these update rules appears at the end of these notes)

### Experimenting with batch sizes and learning rates

| Optimizer | Adaptive LR | Momentum | When to Use |
| --- | --- | --- | --- |
| SGD | ❌ | Optional | Simple models, large datasets |
| SGD + Momentum | ❌ | ✅ | Deep networks, smoother training |
| RMSProp | ✅ | ❌ | RNNs, noisy gradients, reinforcement learning |
| Adam | ✅ | ✅ | Default for most deep learning tasks |

# Jupyter notebooks

- [https://github.com/fastai/course22p2/blob/master/nbs/09_learner.ipynb](https://github.com/fastai/course22p2/blob/master/nbs/09_learner.ipynb)
- [https://github.com/fastai/course22p2/blob/master/nbs/10_activations.ipynb](https://github.com/fastai/course22p2/blob/master/nbs/10_activations.ipynb)
- [https://github.com/fastai/course22p2/blob/master/nbs/11_initializing.ipynb](https://github.com/fastai/course22p2/blob/master/nbs/11_initializing.ipynb)
- [https://github.com/fastai/course22p2/blob/master/nbs/12_accel_sgd.ipynb](https://github.com/fastai/course22p2/blob/master/nbs/12_accel_sgd.ipynb)

# Other resources

- [https://hackmd.io/@shaoeChen/B1CoXxvmm/https%3A%2F%2Fhackmd.io%2Fs%2FHJU9aUY7Q](https://hackmd.io/@shaoeChen/B1CoXxvmm/https%3A%2F%2Fhackmd.io%2Fs%2FHJU9aUY7Q)
- ML Lecture 6: Brief Introduction of Deep Learning
- ML Lecture 7: Backpropagation
- [https://speech.ee.ntu.edu.tw/~tlkagk/courses_ML17_2.html](https://speech.ee.ntu.edu.tw/~tlkagk/courses_ML17_2.html)
- [https://www.youtube.com/watch?v=yKKNr-QKz2Q](https://www.youtube.com/watch?v=yKKNr-QKz2Q)
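## Appendix: optimizer update rules (sketch)

To make the differences between SGD with momentum, RMSProp, and Adam concrete, here is a small stand-alone sketch of their update rules for a single parameter tensor. PyTorch is assumed for the tensor math, the hyperparameter values are illustrative, and in practice you would use the built-in `torch.optim.SGD`, `torch.optim.RMSprop`, or `torch.optim.Adam` classes; the sketch only shows how momentum, the squared-gradient average, and bias correction combine.

```python
import torch

def sgd_momentum(p, grad, state, lr=0.01, beta=0.9):
    # Momentum: an exponential moving average of past gradients smooths the updates.
    state["v"] = beta * state.get("v", torch.zeros_like(p)) + grad
    return p - lr * state["v"]

def rmsprop(p, grad, state, lr=0.01, alpha=0.99, eps=1e-8):
    # RMSProp: divide by a running average of squared gradients,
    # giving each parameter its own effective learning rate.
    state["sq"] = alpha * state.get("sq", torch.zeros_like(p)) + (1 - alpha) * grad**2
    return p - lr * grad / (state["sq"].sqrt() + eps)

def adam(p, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: momentum + adaptive learning rates + bias correction.
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", torch.zeros_like(p)) + (1 - beta1) * grad
    state["v"] = beta2 * state.get("v", torch.zeros_like(p)) + (1 - beta2) * grad**2
    m_hat = state["m"] / (1 - beta1**t)          # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2**t)          # bias-corrected second moment
    return p - lr * m_hat / (v_hat.sqrt() + eps)

# Toy usage: one update step on a random parameter/gradient pair.
p, grad, state = torch.randn(3), torch.randn(3), {}
p = adam(p, grad, state)
```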