# Global Optimization

contributed by <`kylekylehaha`>

###### tags:`Data Science`

## Outline
- Global optimization
    - Random Search / Grid Search
    - Evolution Algorithms / Particle Swarm
    - Gradient Descent-based Algorithms
- Hyper-parameter optimization
- Surrogate-based optimization

---

In a global optimization problem, there is a target function $f$ and a set of parameters $\theta = \{\theta_1, \theta_2, ..., \theta_n\}$. We want to find the parameter set $\theta^*$ such that $f(\theta^*)$ is minimized:

$\arg\min_{\theta} f(\theta)$

Various algorithms for such problems:
- Random Search / Grid Search
- Evolution Algorithms
- Swarm Intelligence Algorithms
- Gradient Descent

---

**If we knew the function, the task would be easier.**

E.g., for a convex (bowl-shaped) function, finding the global optimum is not difficult. If we knew the exact form of the function, we would not need global optimization at all; we could simply differentiate it and find the point where the first derivative is 0.

- Most of the time, we don't know the exact form of the function.
- Given an input $x$, we know only the output $f(x)$.

**Moreover, the variables are continuous.**

If the variables are discrete:
- Brute-force or branch-and-bound works: just try every point.
- Time-consuming, but it works.

However, if the variables are continuous:
- Brute-force and branch-and-bound no longer work.

![](https://i.imgur.com/qHC1yBb.png)

---

**General optimization algorithms**
- Simple search algorithms
    - Random search & grid search
- Evolution and swarm intelligence
    - Genetic algorithm, evolution strategy, particle swarm
- Gradient-based algorithms
    - Gradient descent, AdaGrad, Adam

**How to measure them?**
- Efficiency
    - The time to obtain a solution, e.g., to converge
- Solution quality

---

## Simple search algorithms

### Grid Search

Treat each variable as a set of discrete points, sacrificing some solution quality.

- Discretize the search space and search with brute-force.
- Each variable is one axis, so ten variables means a ten-dimensional grid.

![](https://i.imgur.com/I6Dil5g.png)

---

### Random Search

- $r$ is the search radius.
- Random search consists of two steps:
    1. Find a point $\theta^*$ in the search space.
We record $\theta^*$ as the current optimum. (Initially, any point can serve as $\theta^*$.)
    2. Uniformly at random, choose a point $\theta$ with distance at most $r$ from $\theta^*$. If $f(\theta) \lt f(\theta^*)$, then set $\theta^*$ to $\theta$. Repeat this step.

---

## Evolution Algorithms

Idea: once we find a good set of parameters, there may be another good set nearby. Starting from one good solution, we gradually move toward better ones.

Evolutionary algorithms are able to solve optimization problems.
- Genetic algorithm (GA)
- Evolution strategy (ES)

**Main idea**
1. Start with some random candidate solutions.
2. Recombine current candidate solutions to generate new candidates.
    - Crossover
    - Find other candidates based on current candidates
3. Slightly change the generated candidates.
    - Mutation
    - Add some randomness into the search process
    - Prevent converging at a local minimum too early
4. Keep good candidate solutions.
5. Repeat steps 2–4.

This can be seen as a **systematic** search of the whole space: by recombining and changing candidates, good new candidates are generated.

---

### Genetic Algorithm (GA)

In GA, **a gene is a string of binary numbers.**

Four steps in GA:
1. Initialize the population.
2. Randomly select individuals for crossover.
3. Mutate.
4. Select the individuals with good objective values as the population of the next iteration.

![](https://i.imgur.com/5Od0YuZ.png)

---

**Initialization**

Randomly generate a bunch of binary vectors (the population). Each parameter is encoded as 0s and 1s. Each binary vector represents a candidate solution and is called a chromosome.

![](https://i.imgur.com/LBfoQlQ.png)

---

**Crossover**

Randomly select two chromosomes from the population and randomly choose a crossover point. Offspring are created by exchanging the parents' genes up to the crossover point.

Idea: if A1 is good, perhaps one of its genes performs well; changing its other, weaker genes may give an even better result.

![](https://i.imgur.com/3Pzv2qV.png)

---

**Mutation**

After offspring are created, randomly change some bits of their genes. (Somewhat like the role of an activation function, this prevents two crossovers from simply turning back into the parents.)

---

### Evolution Strategy (ES)

Evolution strategy (ES) is similar to genetic algorithm (GA).
==But in ES, genes are real numbers instead of binary strings.==

Two types of genes in ES:
- One records the candidate solution
- ==One records the **mutation strength**==

The highlighted parts are where ES differs from GA.

---

**Crossover**

Randomly select both the **candidate solution** and the **mutation strength** from the parents.

---

**Mutation**

Add random noise to the solution. The value is sampled from a normal distribution with $\mu = 0$ and $\sigma$ = mutation strength.

That is, the smaller the mutation strength, the smaller $\sigma$: the distribution is narrower, so samples are more likely to be near $\mu$, i.e., 0.

![](https://i.imgur.com/D1IcnY4.png)

---

In conclusion, the four steps for evolutionary algorithms:

Initialize -> Crossover -> Mutation -> Selection (repeat from Crossover)

---

### Differential Evolution (DE)

Similar to ES, but the mutation/selection operations are different. The **differential** in DE means "difference", not "differentiation".

---

**Initialization**

Given an upper bound and a lower bound for each dimension of a candidate solution

$$
\theta_{j,i,G}
$$

, where
- $j$: $j$-th dimension
- $i$: $i$-th candidate solution
- $G$: $G$-th iteration

$\theta^L_j \le \theta_{j,i,1} \le \theta^U_j$

Randomly generate a fixed number of candidate solutions $\theta_{j,i,1}$ uniformly on the intervals $[\theta^L_j, \theta^U_j]$.

---

**Mutation**

In the $G$-th generation, we randomly select three candidate solutions $\theta_{p,G}, \theta_{q,G}, \theta_{r,G}$ to generate a vector $v_{i,G+1}$ for each $i$ ($p$, $q$, $r$ are indices):

$$
v_{i,G+1}=\theta_{p,G}+F(\theta_{q,G}-\theta_{r,G})
$$

, where $F$ is a constant in $[0, 2]$.

---

**Recombination**

Given a probability parameter $CR$, we recombine different dimensions (randomly assigned by $I_{rand}$) of $v$ and $\theta$ to generate a new vector $u$:

$$
u_{j,i,G+1}=
\begin{cases}
v_{j,i,G+1}& if \ rand_{j,i} \le CR\ or\ j = I_{rand}\\
\theta_{j,i,G}& otherwise
\end{cases}
$$

, where $rand_{j,i} \sim uniform[0, 1]$ and $I_{rand}$ is a random integer in $\{1, 2,..., D\}$ that decides which dimension must come from $v$.

$I_{rand}$ ensures that $u_{i,G+1} \ne \theta_{i,G}$.

---

**Selection**

The parameter of the next generation, $\theta_{i,G+1}$,
is the better one of $\theta_{i,G}$ and $u_{i,G+1}$:

$$
\theta_{i,G+1}=
\begin{cases}
u_{i,G+1}& if \ f(u_{i,G+1}) \le f(\theta_{i, G}) \\
\theta_{i,G}& otherwise
\end{cases}
$$

, where $f$ is the objective function. (Note that we are considering a minimization problem here.)

Repeat the above steps until the limit on the number of iterations is reached.

---

## Particle Swarm Optimization

Think of the particles as a flock of sparrows or a colony of ants: they communicate with each other to find the best position.

![](https://i.imgur.com/xQnjJfD.png)

**Update**

Velocity: $v_{j,t}=rand(0,c_1) \times (\theta^*_{t-1}-\theta_{j,t-1})+rand(0, c_2) \times (\theta^*_{j,t-1}-\theta_{j,t-1})+w \times v_{j,t-1}$

Position: $\theta_{j,t}=\theta_{j,t-1}+v_{j,t}$

, where $rand(a, b)$ is a uniform random variable in $[a, b]$.

- $c_1$ decides how fast to approach $\theta^*_{t-1}$, the best position found by all particles.
- $c_2$ decides how fast to approach $\theta^*_{j,t-1}$, the best position found by particle $j$.
- $w$ decides how much of the last velocity is retained.
- Usually, $c_1$, $c_2$, and $w$ are 2, 2, and 0.7, respectively.

---

## Gradient Descent-based Algorithms

**Momentum**

Gradient descent (GD) with momentum $m_t$: the hope is to accelerate and break out of local minima.

$$
m_{t+1}=\beta_1 \cdot m_t + \alpha \cdot g_t \\
\theta_{t+1} = \theta_t - m_{t+1}
$$

![](https://i.imgur.com/e5RvbAb.png)

Advantages of momentum:
1. It can escape from local minima.
2. It usually reaches the optimum faster.

---

### AdaGrad (Adaptive Gradient Descent)

Start with a large effective learning rate, and shrink it as we get close to the optimum.

AdaGrad keeps a variable $V$ that accumulates the squared sum of previous gradients. For the $t$-th move, $V_{t+1}=\sum_{i=0}^tg^2_i$, i.e., $V_{t+1}=V_t+g^2_t$.

The move rule changes to:

$$
\theta_{t+1}=\theta_t-\alpha \cdot \frac{g_t}{\sqrt{V_{t+1}+\epsilon}}
$$

, where $\epsilon$ keeps the denominator from being zero.

As $t$ increases, $V_t$ grows, which means we should take smaller steps; this is why $V_t$ is in the denominator.

Challenges:
- $V_t$ becomes very large after many moves, because $V_t$ is monotonically increasing.
- We will move very little or even stop.
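The AdaGrad move rule above can be sketched in a few lines. This is a minimal 1-D sketch; the objective function, starting point, and learning rate are illustrative assumptions, not from the notes:

```python
import math

def adagrad(grad, theta0, alpha=0.5, epsilon=1e-8, steps=300):
    """Minimal 1-D AdaGrad sketch: accumulate squared gradients in V
    so the effective step size shrinks over time."""
    theta, V = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        V += g * g                                   # V_{t+1} = V_t + g_t^2
        theta -= alpha * g / math.sqrt(V + epsilon)  # move rule
    return theta

# Hypothetical objective f(theta) = (theta - 3)^2, gradient 2*(theta - 3);
# the run should end close to the minimum at theta = 3.
theta_star = adagrad(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```

Note how the accumulated `V` never shrinks: later steps are always smaller, which is exactly the challenge listed above.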
---

### Adam (Adaptive Moment Estimation)

Combines Momentum and AdaGrad. For every move, it keeps 3 variables:
1. $\theta_t$: position in the search space.
2. $m_t$: the 1st moment, recording the speed and direction.
3. $v_t$: the 2nd moment, recording the size of the gradients.

**Move rule:**

$$
\theta_t=\theta_{t-1}-\alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+ \epsilon} \\
m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t \\
v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2
$$

==Note that the estimates are heavily biased toward $v_0$ in early iterations:==

(Iteration 1) $v_1=\beta_2 \cdot v_0 + (1-\beta_2) \cdot g^2_1$
(Iteration 2) $v_2=\beta_2 \cdot v_1 + (1-\beta_2) \cdot g^2_2=\beta_2 \cdot (\beta_2 \cdot v_0 + (1-\beta_2) \cdot g^2_1)+(1-\beta_2) \cdot g^2_2$
(Iteration 3) ...

Therefore, a bias correction is applied:

$$
\hat{m}_t = \frac{m_t}{1-\beta^t_1} \\
\hat{v}_t = \frac{v_t}{1-\beta^t_2}
$$

![](https://i.imgur.com/kcUNcRP.png)

---

## Hyper-parameter Optimization

Hyper-parameters that are input to machine learning algorithms affect the final performance (test accuracy) significantly.

To distinguish them from the parameters inside a model: hyper-parameters are values such as the depth of a decision tree or the number of weak classifiers in AdaBoost. That is, hyper-parameters control the model rather than being part of the model itself.

Although we can apply global optimization directly to hyper-parameters, there are some problems:
- We need to wait until the training of the NN is done.
    - This takes a lot of time.
- ==The performance of a NN is expensive-to-evaluate.==

As a result, there are special techniques:
- Bayesian optimization
- Surrogate-based optimization
- Reinforcement learning

---

## Optimization for Expensive Functions

For example, to know whether a drug is beneficial, recruiting human subjects is costly, so we cannot endlessly test different formulas.

Instead, we can first build a small model that closely resembles the objective function. We can then test freely on the model, and only after finding a good candidate do we evaluate it on the real objective function.

---

### Surrogate-based Optimization

> Create a function to approximate the expensive function

**Surrogate-based methodology**

Given:
- A set of parameters to find, $\theta=\{\theta_1, \theta_2, ...
, \theta_n\}$
- A function $f$ that takes $\theta$ as input and outputs a real value
- A lower **bound** and upper **bound** for each parameter
    - $Lower_i \le \theta_i \le Upper_i$
- Sometimes, special constraints may exist
    - $(\theta_i)^2+(\theta_j)^2 \le d$

Output:
- A set of parameters $\theta^*=\{\theta^*_1, \theta^*_2,...,\theta^*_n\}$ such that $f(\theta^*)$ is minimized and the constraints are satisfied.

The **objective function** is the function we want to optimize. Since the objective function is expensive in resources such as time, we create a cheap function, the **surrogate function**, to approximate it. We run an optimization algorithm on the surrogate function to find a **promising point**, e.g., its optimum, and then feed the promising point to the objective function.

![](https://i.imgur.com/9Dwq0gh.png)

![](https://i.imgur.com/yj5pfu0.png)

- Observed points: points whose values on the objective function are already known; they are used to build the surrogate function.

![](https://i.imgur.com/kOgiXB0.png)

---

### Surrogate Function

We use a **regression** function to build the **surrogate function**. Three regressions are introduced:
- Quadratic regression
- RBF
- SVR

---

#### Quadratic regression

Simple regression:

$$
S(x)=x^TQx+w^Tx+b
$$

Find $argmin_{Q,w,b} \sum_{\theta \in O}||S(\theta)-f(\theta)||$ with the method of least squares.

> (1) Build the surrogate function

---

Since this $S$ can only take the following three shapes, the optimum, i.e., the promising point, can be found directly by a closed-form formula.

![](https://i.imgur.com/0elrJG8.png)

> (2) Select the next promising point

---

Advantages:
- It can rapidly find the minimal point by formula, instead of running an optimization algorithm.

Disadvantages:
- It performs poorly when there are many observed points with outliers.
- Its generality is not good.
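The two steps above (fit $S$ by least squares, then take its closed-form minimum as the promising point) can be sketched in 1-D. The objective function and the observed points here are illustrative assumptions:

```python
import numpy as np

# Hypothetical expensive objective we can only sample at a few points:
def f(x):
    return (x - 2.0) ** 2 + 1.0

# Observed points O (already evaluated on the expensive function)
obs_x = np.array([-1.0, 0.0, 1.0, 3.0, 4.0])
obs_y = f(obs_x)

# (1) Build the surrogate S(x) = q*x^2 + w*x + b by least squares
A = np.column_stack([obs_x ** 2, obs_x, np.ones_like(obs_x)])
q, w, b = np.linalg.lstsq(A, obs_y, rcond=None)[0]

# (2) Promising point: closed-form minimum of the parabola (valid when q > 0)
x_hat = -w / (2.0 * q)
```

In higher dimensions the same idea applies with the minimizer $x^* = -\frac{1}{2}Q^{-1}w$ when $Q$ is positive definite.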
![](https://i.imgur.com/H6qKfUG.png)

Note the gap between the minimum of the surrogate and the actual minimum of the objective function.

---

#### RBF Regression (Radial Basis Function)

- Radial function $\phi:[0, \infty) \to R$
- In $\phi(||x-a||)$, $a$ is called the center of the radial function.

RBF regression:

$$
S(x)=\sum^n_{i=1}\lambda_i \phi(||x-\theta_i||)
$$

- $n$ is the number of observed points.
- Each observed point has its own RBF (a "bump").
- $\lambda_i$ scales the height of each bump.
- The final regression is the sum of all scaled bumps, forming a continuous function.

![](https://i.imgur.com/z6YQPHU.png)

---

How to get $\lambda_i$?

> (1) Build the surrogate function

![](https://i.imgur.com/0k0BJGZ.png)

At each observed point, the surrogate function equals the observed value (the red parts at the start). The surrogate is nonzero only near the observed points.

![](https://i.imgur.com/SDarWpe.png)

---

How to select $\hat{\theta}$?

> (2) Select the next promising point

1. Find the minimal point $\theta^*$ among the **observed points** $O$.
2. Use random search to generate **sample points** centered on $\theta^*$.
3. Select the point among the **sample points** with the minimum $S(\cdot)$ as the selected point $\hat{\theta}$.

---

Advantages:
- It can fit continuous functions.

Disadvantages:
- The RBF function only fits the area near the observed points.
- Its efficiency is lower than quadratic regression (which solves for the optimum with a closed-form formula).

---
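The two RBF steps above can be sketched in 1-D. The $\lambda_i$ come from the interpolation conditions $S(\theta_j) = f(\theta_j)$, a linear system; the objective function, the Gaussian radial function, its width, and the search radius are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                                  # hypothetical expensive objective
    return np.sin(3 * x) + x ** 2

def phi(r):                                # Gaussian radial function (assumed width 0.5)
    return np.exp(-(r / 0.5) ** 2)

obs_x = np.linspace(-2, 2, 9)              # observed points theta_i
obs_y = f(obs_x)

# (1) Build the surrogate: interpolation conditions S(theta_j) = f(theta_j)
# give the linear system Phi @ lam = y.
Phi = phi(np.abs(obs_x[:, None] - obs_x[None, :]))
lam = np.linalg.solve(Phi, obs_y)

def S(x):                                  # surrogate: sum of scaled bumps
    return phi(np.abs(x[:, None] - obs_x[None, :])) @ lam

# (2) Select the next promising point: random search around the best
# observed point, then keep the sample with minimum S.
best = obs_x[np.argmin(obs_y)]
samples = best + rng.uniform(-0.5, 0.5, size=200)
theta_hat = samples[np.argmin(S(samples))]
```

The Gaussian kernel matrix on distinct centers is positive definite, so the solve is well posed; `theta_hat` would then be evaluated on the real objective and added to the observed points.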