# **Preclass Session 5: Optimization**
[toc]
## **TEFPA Pipeline**
### **Definition**
```mermaid
graph LR
Task --> Experience --> Function_Space --> Performance_measure --> Algorithm
```
- **Task ($\mathcal{T}$)**: defines what the input/output pair $(x, y)$ is.
- **Experience (${\mathcal E}$)**: the training dataset $\bf D$ serves as the experience ${\mathcal E}$ used to train the model.
- In supervised learning: $\bf D = (X, y)$, where each $\bf x^t$ is one sample.
- In unsupervised learning: $\bf D = (X)$, where each $\bf x^t$ is one sample.
- **Function Space ($\mathcal{F}$)**: a set of parameterized (parametric) functions $f_\theta = ({\boldsymbol \phi}_\alpha, p_w)$; the collection of functions obtained by varying the shared parameter $\theta$ is the function space $\mathcal{F}$.
- **Performance ($\mathcal{P}$)**: call $\hat{\theta}$ the optimized parameters and $\theta^k$ the current parameters at iteration $k$.
- To determine whether $\theta^k$ is optimal, we use the performance measure $\mathcal{P}$ to evaluate the parameterized function $f_{\theta^k}$.
- $\mathcal{P}: \theta^k \xrightarrow{(f_{\theta^k},\, \mathbf{D})} \text{a score} \in \mathbb{R}$. If $\mathcal{P}$ is a loss function $\rightarrow$ $\hat{\theta} = \arg \min_\theta \mathcal{P}(\theta, \mathbf{X}, \mathbf{y})$, with $(\mathbf{X}, \mathbf{y})$ held constant, taken from ${\mathcal E}$.
- **Algorithm (${\mathcal A}$)**: once the performance measure is available, apply a search/optimization algorithm ${\mathcal A}$ to update $\theta^k$ to $\theta^{k+1}$ (over iterations) and finally reach the optimal $\hat{\theta}$; a minimal sketch of the whole loop follows this list.
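
To make the pipeline concrete, here is a minimal sketch of one full TEFPA loop in plain NumPy. The 1-D linear model, the mean-squared-error loss, and the toy data are assumptions chosen only for illustration; the point is how $\mathcal{A}$ repeatedly uses $\mathcal{P}$ to move $\theta^k$ towards $\hat{\theta}$.

```python
import numpy as np

# Experience E: a toy dataset D = (X, y); the linear ground truth is an assumption
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2 * X + 1 + 0.1 * rng.normal(size=100)

# Function space F: all lines f_theta(x) = w*x + b, parameterized by theta = (w, b)
def f(theta, X):
    w, b = theta
    return w * X + b

# Performance measure P: mean squared error (a loss, so lower is better)
def P(theta, X, y):
    return np.mean((f(theta, X) - y) ** 2)

# Algorithm A: plain gradient descent, theta^{k+1} = theta^k - lr * grad P(theta^k)
theta = np.array([0.0, 0.0])   # initial theta^0
lr = 0.1
for k in range(200):
    err = f(theta, X) - y
    grad = np.array([np.mean(2 * err * X), np.mean(2 * err)])
    theta = theta - lr * grad

print(theta, P(theta, X, y))   # theta approaches the optimal theta-hat
```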
### **Example: Dog and Cat Classification**
- **Task ($\mathcal{T}$)**: the input $\bf x$ is an image containing a dog or a cat $\rightarrow$ the output $\bf y$ is the prediction of whether it is a cat or a dog.
- **Experience (${\mathcal E}$)**: the self-scraped dataset $\bf D$ from the Session 2 Lab (after splitting): $\bf D = \{(x^t, y^t)\}_{t=1}^{2984} = (X, y)$
- **Function Space ($\mathcal{F}$)**: all functions (models) generated during the training process below.
- **Performance ($\mathcal{P}$)**: we evaluate the performance of the function $f_{\theta^k}$ with parameter set $\theta^k$ using the loss function **Binary Cross-Entropy (BCE)** $\rightarrow$ optimal $\hat{\theta} = \arg \min_\theta \text{BCE}(\theta, \bf{X}, y)$
- **Algorithm (${\mathcal A}$)**: we use the **Adam** optimizer (or SGD) to update the parameter set $\theta$ based on the loss value (performance). For instance, the initial loss is **0.4822** with a randomly initialized $\theta$, and after 100 epochs (iterations) we reach a loss of **0.0386** with the optimal $\hat{\theta}$; a sketch of how this maps to Keras code follows this list.
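
A hedged Keras sketch of how these pieces could map to code. The small CNN architecture, the 128x128 image size, and the `train_ds` dataset object are assumptions; only the BCE loss and the Adam optimizer come from the example above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Function space F: a small CNN (assumed architecture); theta = all kernel weights and biases
model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),        # assumed image size
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),    # probability of one class (e.g. dog)
])

# Performance P: binary cross-entropy; Algorithm A: Adam
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Experience E: train_ds is assumed to yield (image, label) batches built from D
# history = model.fit(train_ds, epochs=100)
```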
## **Metrics**
Note: $\bf X \in \mathbb{R}^{N \times d}$ $\rightarrow$ $\bf N$ samples and $\bf d$ features
### **Q1: If Logistic Regression is considered, what is the $\theta^k$?**
- $\theta^k$ is the set of $\bf (w, b)$ in which:
- $\bf w$ is a weight vector $\bf {w} = (w_1, w_2,..., w_d)$, where $d$ is the number of features (dimensions) of $\bf X$
- $\bf b$ (bias) is a scalar added after performing the dot product $\bf w^T x$; see the sketch below.
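
A minimal NumPy sketch of what $\theta^k = (\mathbf{w}, b)$ looks like for logistic regression; the toy sizes and random data are assumptions.

```python
import numpy as np

N, d = 8, 4                          # assumed toy sizes: N samples, d features
X = np.random.randn(N, d)            # X in R^{N x d}

w = np.zeros(d)                      # weight vector: one weight per feature
b = 0.0                              # bias: a single scalar

z = X @ w + b                        # shape (N,): w^T x + b for every sample
y_hat = 1.0 / (1.0 + np.exp(-z))     # sigmoid -> one probability per sample
```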
### **Q2: If Softmax Regression is considered, What is the $\theta^k$?**
- $\theta^k$ is the set of $\bf (W, b)$ in which:
- $\bf W$ is a matrix $\in \mathbb{R}^{c \times d}$, where $c$ is the number of classes (labels). Equivalently, $\bf W = [w_1,...,w_t,...,w_c]^T$, where each $\bf w_t$ is a row vector with $d$ dimensions.
- Bias $\bf b \in \mathbb{R}^c$ is a **vector** with $c$ dimensions, added after performing the product $\bf Wx$; see the sketch below.
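
And the corresponding sketch for softmax regression, where $\theta^k = (\mathbf{W}, \mathbf{b})$; the toy sizes are assumptions.

```python
import numpy as np

N, d, c = 8, 4, 3                    # assumed toy sizes: N samples, d features, c classes
X = np.random.randn(N, d)

W = np.zeros((c, d))                 # weight matrix: one row of d weights per class
b = np.zeros(c)                      # bias vector: one bias per class

Z = X @ W.T + b                      # logits, shape (N, c)
y_hat = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)   # softmax probabilities
```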
## **Tensorflow advanced**
### **Initialize weights for layers**
The weights are initialized randomly at the start and then gradually optimized during training.
Most current models use `relu` activations together with the matching weight-initialization scheme, `He initialization`:
```python
from tensorflow.keras.layers import Dense

# works for any layer with a kernel (Dense, Conv2D, ...); Dense is shown as an example
model.add(Dense(128, activation="relu", kernel_initializer="he_normal"))
```
### **Change the learning rate of optimizer**
```python
from tensorflow.keras.optimizers import Adam, AdamW
adam = Adam(learning_rate=0.0003)
model.compile(optimizer=adam, ...)
```
### **Early Stopping**
The EarlyStopping callback stops training when the monitored loss or metric does not improve for a given number of epochs (`patience`).
```python
from tensorflow.keras.callbacks import EarlyStopping

# stop if val_loss has not improved for 15 epochs, and keep the best weights seen
cb_early_stopping = EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True)
history = model.fit(..., callbacks=[cb_early_stopping])
```