# Global Optimization
contributed by <`kylekylehaha`>
###### tags:`Data Science`
## Outline
- Global optimization
- Random Search / Grid Search
- Evolution Algorithms / Particle Swarm
- Gradient Descent-based Algorithm
- Hyper-parameter optimization
- Surrogate-based optimization
---
In global optimization problems, there is a target function $f$ and a set of parameters $\theta = \{\theta_1,\theta_2,...,\theta_n\}$.
We want to find the parameter set $\theta^*$ such that $f(\theta^*)$ is minimized: $\theta^* = \arg\min_{\theta}f(\theta)$
Various algorithms for such problems:
- Random Search / Grid Search
- Evolution Algorithms
- Swarm Intelligence Algorithms
- Gradient Descent
---
**If we know the function, the task could be easier.**
E.g., for a convex (bowl-shaped) function, finding the global optimum is not difficult.
If we knew the exact form of the function, we would not need global optimization at all; we could simply differentiate it.
- Most of the time, we don't know the exact form of the function.
- Given an input $x$, we know only the output $f(x)$.
- So we cannot directly find the point where the first derivative is 0.
**Moreover, the variables are continuous.**
If the variables are discrete:
- Brute-force, branch-and-bound
- Time-consuming, but works: simply try every point.
However, if the variables are continuous
- Brute-force or B&B no longer works.

---
**General optimization algorithms**
- Simple search algorithms
- Random search & grid search
- Evolution and swarm intelligence
- Genetic algorithm, evolution strategy, particle swarm
- Gradient-based algorithm
- Gradient descent, adagrad, adam
**How to measure?**
- Efficiency
- The time to obtain a solution, e.g., the time to converge
- Solution quality
---
## Simple Search Algorithms
### Grid Search
Treat each variable as a set of discrete points, sacrificing a little solution quality.
- Discretize the search space and search with brute-force.
- Each variable is one axis, so ten variables form a ten-dimensional grid.
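The idea can be sketched in Python; the toy 2-D objective, the grid bounds, and the 0.1 step are illustrative assumptions:

```python
import itertools

def grid_search(f, axes):
    """Exhaustively evaluate f on the Cartesian product of discretized axes."""
    best_theta, best_val = None, float("inf")
    for theta in itertools.product(*axes):
        val = f(theta)
        if val < best_val:
            best_theta, best_val = theta, val
    return best_theta, best_val

# Two variables, each discretized to [-2.0, 2.0] with step 0.1 (41 points),
# so the grid has 41 x 41 points in total.
axes = [[x / 10 for x in range(-20, 21)]] * 2
theta, val = grid_search(lambda t: (t[0] - 1) ** 2 + (t[1] + 0.5) ** 2, axes)
```

Note how the cost grows: with ten variables the same loop runs over a ten-dimensional grid.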
---
### Random Search
- $r$ is the radius for search.
- Random search includes 2 steps:
1. Find a point $\theta^*$ in the search space and record it as the current optimum. (Initially, any point can be taken as $\theta^*$.)
2. Uniformly at random choose a point $\theta$ within distance $r$ of $\theta^*$. If $f(\theta) \lt f(\theta^*)$, set $\theta^*$ to $\theta$. Repeat this step.
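The two steps can be sketched as follows. The search bounds, the toy objective, and sampling each coordinate within ±r (a box rather than an exact Euclidean ball) are simplifying assumptions:

```python
import random

def random_search(f, dim, radius, n_iters, seed=0):
    """Step 1: pick an arbitrary starting point as the incumbent optimum.
    Step 2: sample a nearby point (each coordinate within +/- radius);
    keep it if it is better. Repeat."""
    rng = random.Random(seed)
    best = [rng.uniform(-5, 5) for _ in range(dim)]
    best_val = f(best)
    for _ in range(n_iters):
        cand = [x + rng.uniform(-radius, radius) for x in best]
        val = f(cand)
        if val < best_val:
            best, best_val = cand, val
    return best, best_val

# Usage: minimize a toy sphere function in 2-D.
best, best_val = random_search(lambda t: sum(x * x for x in t),
                               dim=2, radius=0.5, n_iters=2000)
```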
---
## Evolution Algorithm
Idea: once we find a good set of parameters, there is likely another good set nearby. Starting from a good solution, we step toward better and better ones.
Evolutionary algorithms are able to solve optimization problems.
- Genetic algorithm (GA)
- Evolution strategy (ES)
**Main idea**
- Start with some random candidate solutions.
- Recombine current candidate solutions to generate new candidates.
  - Crossover: find new candidates based on current ones.
- Slightly change the generated candidates.
  - Mutation
  - Adds some randomness to the search process.
  - Prevents converging to a local minimum too early.
- Keep good candidate solutions.
- Repeat steps 2-4.
This can be viewed as a **systematic** search of the whole space: recombination and small changes generate good candidates.
---
### Genetic Algorithm (GA)
In GA, **a gene is a string of binary numbers.**
Four steps in GA
- Initialize the population
- Randomly select individuals for crossover
- Mutate
- Select those individuals with good objective values as the population of next iteration.

---
**Initialization**
Randomly generate a bunch of binary vectors (the population). Each parameter is encoded as 0/1 bits.
Each binary vector represents a candidate solution.
Each vector is called a chromosome.

---
**Crossover**
Randomly select two chromosomes from population.
Randomly choose a crossover point.
Offspring are created by exchanging the parents' genes with each other up to the crossover point.
Idea: if A1 is good, one of its genes may be performing well; by swapping out the other, weaker genes, the result may be even better.

---
**Mutation**
After offspring is created, randomly change some bits of their genes.
This is somewhat like an activation function in spirit: it prevents two crossovers from simply turning back into the parents.
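The four GA steps (initialize, crossover, mutate, select) can be put together in a minimal sketch. The population size, mutation rate, and the OneMax-style objective are illustrative assumptions:

```python
import random

def ga_minimize(f, n_bits, pop_size=20, generations=50, p_mut=0.05, seed=0):
    """Minimal binary GA: initialize -> crossover -> mutation -> selection."""
    rng = random.Random(seed)
    # Initialization: random binary vectors (chromosomes).
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        for _ in range(pop_size):
            # Crossover: exchange the parents' genes up to a random point.
            p1, p2 = rng.sample(pop, 2)
            cut = rng.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            offspring.append(child)
        # Selection: keep the chromosomes with the best objective values.
        pop = sorted(pop + offspring, key=f)[:pop_size]
    return pop[0]

# Usage: maximize the number of 1-bits (OneMax), phrased as minimization.
best = ga_minimize(lambda c: -sum(c), n_bits=16)
```

Selection here is elitist: the best chromosome always survives, so the best objective value never gets worse.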
---
### Evolution strategy (ES)
Evolution strategy (ES) is similar to the genetic algorithm (GA).
==But instead of binary, genes in ES are real numbers.==
Two types of genes in ES:
- One record the candidate solution
- ==One record the **mutation strength**==
The highlighted parts are where ES differs from GA.
---
**Crossover**
Randomly select both **candidate solution/mutation strength** from parents.
---
**Mutation**
Add a random noise into solution.
The noise is sampled from a normal distribution with $\mu = 0$ and $\sigma =$ mutation strength.
That is, the smaller the mutation strength, the smaller $\sigma$: the distribution is narrower, and samples are more likely to be near $\mu$, i.e., 0.
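A sketch of this mutation rule; representing a chromosome as two parallel lists (solution genes and mutation-strength genes) is an assumption for illustration:

```python
import random

def es_mutate(solution, strengths, rng):
    """ES mutation: add noise drawn from N(0, sigma) to each solution gene,
    where sigma is that gene's mutation strength. A smaller strength means
    a narrower distribution, so offspring stay closer to the parent."""
    return [x + rng.gauss(0.0, s) for x, s in zip(solution, strengths)]

rng = random.Random(0)
# A 2-gene candidate with small mutation strengths on both genes.
child = es_mutate([1.0, -2.0], [0.1, 0.1], rng)
```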

---
In conclusion, the four steps of evolutionary algorithms:
Initialize -> Crossover -> Mutation -> Selection (repeat from Crossover)
---
### Differential Evolution (DE)
Similar to ES, but the mutation/selection operations are different.
The **differential** in DE means "difference" instead of "differentiation".
---
**Initialization**
Given an upper bound and a lower bound of each candidate solution
$$
\theta_{j,i,G}
$$
,where
- j: j-th dimension
- i: i-th candidate solution
- G: G-th iteration
$\theta^L_j \le \theta_{j,i,1} \le \theta^U_j$
Randomly generate a fixed number of candidate solutions $\theta_{j,i,1}$ uniformly on the intervals $[\theta^L_j, \theta^U_j]$
---
**Mutation**
In the G-th generation, for each $i$ we randomly select three candidate solutions $\theta_{p,G}, \theta_{q,G}, \theta_{r,G}$ (p, q, r are indices) to generate a mutant vector $v_{i,G+1}$:
$$
v_{i,G+1}=\theta_{p,G}+F(\theta_{q,G}-\theta_{r,G})
$$
, where F is a constant in [0, 2]
---
**Recombination**
Given a probability parameter CR,
we recombine different dimensions (randomly assigned by $I_{rand}$) of $v$ and $\theta$ to generate a new vector $u$:
$$
u_{j,i,G+1}=
\begin{cases}
v_{j,i,G+1}& if \ rand_{j,i} \le CR\ or\ j = I_{rand}\\
\theta_{j,i,G}& otherwise
\end{cases}
$$
, where $rand_{j,i} \sim uniform[0, 1]$, and $I_{rand}$ is a random integer in $\{1, 2, ..., D\}$ used to decide which dimension must be taken from $v$.
$I_{rand}$ ensures that $u_{i,G+1} \ne \theta_{i,G}$
---
**Selection**
The parameters of the next generation $\theta_{i,G+1}$ are the better one among $\theta_{i,G}$ and $u_{i,G+1}$:
$$
\theta_{i,G+1}=
\begin{cases}
u_{i,G+1}& if \ f(u_{i,G+1}) \le f(\theta_{i, G}) \\
\theta_{i,G}& otherwise
\end{cases}
$$
, where f is the objective function. Note that we are considering a minimization problem here.
Repeat the above steps until the limit on the number of iterations is reached.
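The whole DE loop (initialization, mutation, recombination, selection) can be sketched as follows; the population size, the values of F and CR, and the sphere objective are illustrative choices:

```python
import random

def de_minimize(f, bounds, pop_size=15, F=0.8, CR=0.9, generations=60, seed=0):
    """Minimal differential evolution: v = p + F*(q - r), recombination with
    rate CR, then greedy selection between theta and u."""
    rng = random.Random(seed)
    D = len(bounds)
    # Initialization: uniform samples within the given bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        nxt = []
        for i, theta in enumerate(pop):
            # Mutation: difference of two random members, added to a third.
            p, q, r = rng.sample([x for j, x in enumerate(pop) if j != i], 3)
            v = [p[j] + F * (q[j] - r[j]) for j in range(D)]
            # Recombination: take v's component with prob CR; I_rand forces
            # at least one dimension from v so that u != theta.
            I_rand = rng.randrange(D)
            u = [v[j] if (rng.random() <= CR or j == I_rand) else theta[j]
                 for j in range(D)]
            # Selection: keep the better of theta and u.
            nxt.append(u if f(u) <= f(theta) else theta)
        pop = nxt
    return min(pop, key=f)

# Usage: minimize a 2-D sphere function on [-5, 5]^2.
best = de_minimize(lambda t: sum(x * x for x in t), [(-5, 5)] * 2)
```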
---
## Particle Swarm Optimization
Think of each particle as one sparrow in a flock or one ant in a colony: particles communicate with each other to find the best position.

**Update**
Velocity:
$$
v_{j,t}=rand(0,c_1) \times (\theta^*_{t-1}-\theta_{j,t-1})+rand(0, c_2) \times (\theta^*_{j,t-1}-\theta_{j,t-1})+w \times v_{j,t-1}
$$
Position: $\theta_{j,t}=\theta_{j,t-1}+v_{j,t}$
, where rand(a, b) is a variable which is uniformly random in [a, b]
- c~1~ decides how fast to approach $\theta^*_{t-1}$, the best position of all particles.
- c~2~ decides how fast to approach $\theta^*_{j,t-1}$, the best position of particle $j$.
- w decides how much of the last velocity is retained.
- Usually, c1, c2, w are 2, 2, and 0.7, respectively.
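A sketch of the update rules with the usual constants c1 = 2, c2 = 2, w = 0.7; the search bounds and the toy objective are assumptions:

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=100, c1=2.0, c2=2.0, w=0.7, seed=0):
    """Minimal PSO: each particle is pulled toward the swarm's best position
    and its own best position, keeping part of its last velocity."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]          # best position of each particle j
    gbest = min(pbest, key=f)[:]         # best position of all particles
    for _ in range(iters):
        for j in range(n_particles):
            for d in range(dim):
                vel[j][d] = (w * vel[j][d]
                             + rng.uniform(0, c1) * (gbest[d] - pos[j][d])
                             + rng.uniform(0, c2) * (pbest[j][d] - pos[j][d]))
                pos[j][d] += vel[j][d]
            if f(pos[j]) < f(pbest[j]):
                pbest[j] = pos[j][:]
                if f(pbest[j]) < f(gbest):
                    gbest = pbest[j][:]
    return gbest

# Usage: minimize a 2-D sphere function.
best = pso_minimize(lambda t: sum(x * x for x in t), dim=2)
```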
---
## Gradient Descent-based algorithms
**Momentum**
Gradient Descent (GD) with momentum $m_t$: the hope is to accelerate through and escape local minima.
$$
m_{t+1}=\beta_1 \cdot m_t + \alpha \cdot g_t \\
\theta_{t+1} = \theta_t - m_{t+1}
$$

Advantages of momentum:
1. It can escape from local minimum.
2. It usually reaches optimum faster.
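The move rule on a toy 1-D objective (f(x) = x², so g = 2x; the step size and β₁ = 0.9 are illustrative):

```python
def gd_momentum(grad, theta0, alpha=0.1, beta1=0.9, steps=100):
    """GD with momentum: m accumulates past gradients, letting the iterate
    coast through flat regions and shallow local minima."""
    theta, m = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        m = beta1 * m + alpha * g      # m_{t+1} = beta1 * m_t + alpha * g_t
        theta = theta - m              # theta_{t+1} = theta_t - m_{t+1}
    return theta

# Usage: minimize f(x) = x^2, whose gradient is 2x.
x_star = gd_momentum(lambda x: 2 * x, theta0=5.0)
```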
---
### AdaGrad (Adaptive Gradient Descent)
Let the learning rate move fast at first, then shrink it as the optimum is approached.
AdaGrad keeps a variable V to compute the squared sum of previous gradients.
For t-th move, $V_{t+1}=\sum_{i=0}^tg^2_i$, i.e., $V_{t+1}=V_t+g^2_t$
The move rule changes to:
$$
\theta_{t+1}=\theta_t-\alpha \cdot \frac{g_t}{\sqrt{V_{t+1}+\epsilon}}
$$
, where $\epsilon$ prevents division by zero.
As t increases, V~t~ also grows, meaning we should take smaller steps; that is why V~t~ is in the denominator.
Challenges:
- V~t~ will become very large after many moves because V~t~ is non-decreasing.
- We will move very little or even stop.
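A sketch of the rule on the same toy 1-D objective; the learning rate α = 0.5 is an arbitrary illustrative choice:

```python
def adagrad(grad, theta0, alpha=0.5, eps=1e-8, steps=200):
    """AdaGrad: V accumulates the squared gradients, so the effective step
    alpha / sqrt(V) starts large and shrinks as the sum grows."""
    theta, V = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        V = V + g * g                                  # V_{t+1} = V_t + g_t^2
        theta = theta - alpha * g / (V + eps) ** 0.5   # move rule
    return theta

# Usage: minimize f(x) = x^2, whose gradient is 2x.
x_star = adagrad(lambda x: 2 * x, theta0=5.0)
```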
---
### ADAM (Adaptive Moment Estimation)
Combines Momentum and AdaGrad.
For every move, it needs to keep 3 variables:
1. $\theta_t$: position in the search space.
2. $m_t$: called 1-st moment, recording the speed and direction.
3. $V_t$: called 2-nd moment, recording the size of gradients.
**Move rule:**
$$
\theta_t=\theta_{t-1}-\alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+ \epsilon} \\
m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t \\
v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2
$$
==Note that the estimates are heavily biased toward $v_0$ in early iterations:==
(Iteration 1)$v_1=\beta_2 \cdot v_0 + (1-\beta_2) \cdot g^2_1$
(Iteration 2)$v_2=\beta_2 \cdot v_1 + (1-\beta_2) \cdot g^2_2=\beta_2 \cdot (\beta_2 \cdot v_0 + (1-\beta_2) \cdot g^2_1)+(1-\beta_2) \cdot g^2_2$
(Iteration 3)...
Therefore, a correction is needed:
$$
\hat{m}_t = \frac{m_t}{1-\beta^t_1} \\
\hat{v}_t = \frac{v_t}{1-\beta^t_2}
$$
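The move rule with bias correction, sketched on a 1-D toy problem (α and the β values follow common defaults; the objective is illustrative):

```python
def adam(grad, theta0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Adam: 1st moment m (speed/direction), 2nd moment v (gradient size),
    with bias correction so early steps are not biased toward m_0 = v_0 = 0."""
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)       # bias-corrected 1st moment
        v_hat = v / (1 - beta2 ** t)       # bias-corrected 2nd moment
        theta = theta - alpha * m_hat / (v_hat ** 0.5 + eps)
    return theta

# Usage: minimize f(x) = x^2, whose gradient is 2x.
x_star = adam(lambda x: 2 * x, theta0=5.0)
```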

---
## Hyper-parameter Optimization
Hyper-parameters that are input to machine learning algorithms affect the final performance (test accuracy) significantly.
As opposed to the parameters inside a model, hyper-parameters are things like the depth of a decision tree or the number of weak classifiers in AdaBoost. In other words, hyper-parameters control the model rather than being part of the model itself.
We could apply global optimization directly to hyper-parameters, but there are problems:
- We need to wait until the training of the NN is done.
- It takes a lot of time.
- ==Performance of an NN is expensive-to-evaluate==
As a result, there are special techniques:
- Bayesian optimization
- Surrogate-based optimization
- Reinforcement learning
---
## Optimization for Expensive Function
For example: we may want to know whether a drug benefits people, but recruiting subjects is costly, so we cannot freely test many different formulas.
We can first build a small model that closely resembles the objective function. We can then test extensively on the model, and only after finding a good candidate do we evaluate it on the real objective function.
---
### Surrogate-based Optimization
> Create a function to approximate the expensive function
**Surrogate-based methodology**
Given:
- A set of parameters $\theta=\{\theta_1, \theta_2, ... , \theta_n\}$ to search over
- A function $f$ that takes $\theta$ as input and outputs a real value
- Lower **bound** and upper **bound** of parameters
- $Lower_i \le \theta_i \le Upper_i$
- Sometimes, special constraints may exist
- $(\theta_i)^2+(\theta_j)^2 \le d$
Output:
- A set of parameters $\theta^*=\{\theta^*_1, \theta^*_2,...,\theta^*_n\}$ such that $f(\theta^*)$ is minimized and the constraints are satisfied.
The **objective function** is the function we want to optimize.
The objective function is expensive in resources such as time. So, we create a cheap function, the **surrogate function**, to approximate the objective function.
We run an optimization algorithm on the surrogate function to find a **promising point**, like the optimum. Feed the promising point to the objective function.


- Observed points: points whose objective-function values are already known; they are used to build the surrogate function.

---
### Surrogate Function
We use a **regression** function to build the **surrogate function**. Three kinds of regression are introduced:
- Quadratic regression
- RBF
- SVR
---
#### Quadratic regression
Simple regression
$$
S(x)=x^TQx+w^Tx+b
$$
Find $argmin_{Q,w,b} \sum_{\theta \in O}||S(\theta)-f(\theta)||$ with the method of least squares.
> (1) Build surrogate function
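A 1-D sketch of the least-squares fit, using NumPy; the observed points and objective are made up for illustration. Once Q, w, b are fitted, the surrogate's minimum is available in closed form:

```python
import numpy as np

def fit_quadratic(xs, ys):
    """Fit S(x) = q*x^2 + w*x + b minimizing sum ||S(x_i) - f(x_i)||^2
    via least squares on the design matrix [x^2, x, 1]."""
    A = np.stack([xs ** 2, xs, np.ones_like(xs)], axis=1)
    (q, w, b), *_ = np.linalg.lstsq(A, ys, rcond=None)
    return q, w, b

# Observed points of a toy "expensive" objective f(x) = (x - 2)^2.
xs = np.array([-1.0, 0.0, 1.0, 3.0, 4.0])
ys = (xs - 2.0) ** 2
q, w, b = fit_quadratic(xs, ys)
promising = -w / (2 * q)   # closed-form minimizer of the surrogate
```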
---
Because this S can take only the following three shapes, the optimum can be obtained directly by a closed-form formula; that optimum is the promising point.

> (2) Select next promising point
---
Advantage:
- It can rapidly find the minimal point by formula, instead of optimization algorithms.
Disadvantage:
- It performs badly when there are many observed points or outliers.
- Its generality is not good.

Note that the minimum of the surrogate can differ from the actual minimum of the objective function.
---
#### RBF Regression (Radial basis function)
- Radial function $\phi:[0, \infty) \to \mathbb{R}$
- $a$ is called the center of the radial function $\phi(||x-a||)$
RBF regression:
$$
S(x)=\sum^n_{i=1}\lambda_i \phi(||x-\theta_i||)
$$
- n is the number of observed points.
- Each observed point has its own RBF (a "bump").
- $\lambda_i$ scales each RBF up or down.
- The final regression is the sum of all scaled bumps, forming a continuous function.

---
How to get $\lambda_i$?
> (1) Build surrogate function

At each observed point, the surrogate takes exactly the same value as the objective function (the red parts at the start).
Each RBF is nonzero only near its observed point.
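One common way (assumed here) to obtain the $\lambda_i$ is to require $S(\theta_i) = f(\theta_i)$ at every observed point and solve the resulting linear system $\Phi\lambda = f$; the Gaussian radial function and the observed points below are illustrative assumptions:

```python
import numpy as np

def fit_rbf(thetas, fvals, gamma=1.0):
    """Solve Phi @ lam = f so the surrogate matches f exactly at every
    observed point (Gaussian radial function phi(d) = exp(-gamma * d^2))."""
    d = np.abs(thetas[:, None] - thetas[None, :])   # pairwise distances (1-D)
    Phi = np.exp(-gamma * d ** 2)
    lam = np.linalg.solve(Phi, fvals)
    def S(x):
        # Sum of all scaled bumps centered at the observed points.
        return np.exp(-gamma * np.abs(x - thetas) ** 2) @ lam
    return S

# Observed points of a toy objective f(theta) = theta^2.
thetas = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
S = fit_rbf(thetas, thetas ** 2)
```

By construction, S interpolates the observed values exactly; away from the observed points it decays back toward 0.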

---
How to select $\hat{\theta}$?
> Select next promising point
1. Find the minimal point $\theta^*$ among the **observed points O**
2. Use random search to generate **sample points** centered on $\theta^*$
3. Select the point among the **sample points** with minimum $S(\cdot)$ as the selected point $\hat{\theta}$
---
Advantage:
- It can fit continuous functions.
Disadvantage:
- The RBF function only fits areas near the observed points.
- Its efficiency is lower than quadratic regression's, because the latter can be solved by a closed-form formula.
---