
Global Optimization

contributed by <kylekylehaha>

tags: Data Science

Outline

  • Global optimization
    • Random Search / Grid Search
    • Evolution Algorithms / Particle Swarm
    • Gradient Descent-based Algorithm
  • Hyper-parameter optimization
    • Surrogate-based optimization

In global optimization problems, there is a target function $f$ and a set of parameters $\theta = \{\theta_1, \theta_2, ..., \theta_n\}$.

We want to find the parameter set $\theta$ such that $f(\theta)$ is minimized:

$$\arg\min_{\theta} f(\theta)$$

Various algorithms for such problems:

  • Random Search / Grid Search
  • Evolution Algorithms
  • Swarm Intelligence Algorithms
  • Gradient Descent

If we know the function, the task could be easier.
E.g., for a convex (bowl-shaped) function, finding the global optimum is not difficult.

If we knew exactly what the function looks like, we would not need global optimization at all; we could simply differentiate it directly.

  • Most of the time, we don't know the exact form of the function.
  • Given an input $x$, we know only the output $f(x)$.
  • Finding the point where the first derivative equals 0 is therefore not an option.

Moreover, it matters whether the variables are continuous or discrete.
If the variables are discrete:

  • Brute-force, branch-and-bound
  • Time-consuming, but works.
  • Simply try every point once.

However, if the variables are continuous:

  • Brute-force or B&B no longer works.



General optimization algorithms

  • Simple search algorithms
    • Random search & grid search
  • Evolution and swarm intelligence
    • Genetic algorithm, evolution strategy, particle swarm
  • Gradient-based algorithm
    • Gradient descent, adagrad, adam

How to measure?

  • Efficiency
    • The time to obtain a solution, e.g., time to converge
  • Solution quality

Simple search algorithms

Treat each variable as a set of discrete points, sacrificing a bit of solution quality.

  • Discretize the search space and search with brute-force.
  • Each variable is one axis, so ten variables means a ten-dimensional search space.

  • $r$ is the radius for search.
  • Random search includes 2 steps:
  1. Find a point $\theta$ in the search space and record $\theta$ as the optimum point. (We can start with an arbitrary point as $\theta$.)
  2. Uniformly randomly choose a point $\theta'$ with a distance of at most $r$ from $\theta$. If $f(\theta') < f(\theta)$, then set $\theta'$ as the new $\theta$. Repeat this step.
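A minimal sketch of this random-search loop, assuming a generic black-box objective f (the quadratic below is only a stand-in) and a search radius r; the box-shaped perturbation is a simplification of "distance at most r":

```python
import numpy as np

def random_search(f, theta0, r=0.5, n_iters=1000, seed=0):
    """Keep a best point theta; repeatedly sample near it and move if better."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    best = f(theta)
    for _ in range(n_iters):
        # Perturb each coordinate by at most r (a box instead of an exact ball).
        candidate = theta + rng.uniform(-r, r, size=theta.shape)
        value = f(candidate)
        if value < best:          # f(theta') < f(theta): accept the new point
            theta, best = candidate, value
    return theta, best

# Stand-in objective; in practice f is an unknown black box.
f = lambda x: np.sum((x - 3.0) ** 2)
print(random_search(f, theta0=[0.0, 0.0]))
```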

Evolution Algorithm

Idea: once we have found a decent set of parameters, there is probably another decent set nearby. Starting from a good solution, we move step by step toward better solutions.

Evolutionary algorithms are able to solve optimization problems.

  • Genetic algorithm (GA)
  • Evolution strategy (ES)

Main idea

  • Start with some random candidate solutions.
  • Recombine current candidate solutions to generate new candidates.
    • Crossover
    • Find other candidates based on current candidates
  • Slightly change the generated candidates.
    • Mutation
    • Adds some randomness into the search process
    • Prevents converging to a local minimum too early
  • Keep good candidate solutions.
  • Repeat steps 2-4.

This can be viewed as a systematic way of searching the whole space: by recombining and changing candidates, good new candidates are generated.


Genetic Algorithm (GA)

In GA, genes are binary numbers, so each candidate is a string of bits.

Four steps in GA

  • Initialize the population
  • Randomly select individuals for crossover
  • Mutate
  • Select those individuals with good objective values as the population of the next iteration.



Initialization
Randomly generate a bunch of binary vectors (the population). Each parameter is encoded as 0s and 1s.

Each binary vector represents a candidate solution.

Each vector is called a chromosome.



Crossover
Randomly select two chromosomes from population.

Randomly choose a crossover point.

Offspring is created by exchanging the genes of the parents among themselves until the crossover point is reached.

Idea: if chromosome A1 performs well, perhaps one of its genes is particularly good; by changing the other, less good genes, we might get an even better result.



Mutation
After offspring are created, randomly change some bits of their genes.

It feels a bit like an activation function: it prevents two crossovers from simply turning back into the parents.
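A minimal sketch of these four GA steps on binary chromosomes. The OneMax fitness (count of 1-bits, maximized) and all parameter values are illustrative assumptions, not from the notes:

```python
import random

def ga(pop_size=20, n_bits=16, n_gens=50, p_mut=0.02, seed=0):
    random.seed(seed)
    fitness = lambda c: sum(c)                        # toy objective: number of 1-bits
    # Initialization: random binary chromosomes.
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(n_gens):
        offspring = []
        for _ in range(pop_size):
            p1, p2 = random.sample(pop, 2)            # randomly select two parents
            cut = random.randrange(1, n_bits)         # crossover point
            child = p1[:cut] + p2[cut:]               # single-point crossover
            # Mutation: flip each bit with a small probability.
            child = [b ^ 1 if random.random() < p_mut else b for b in child]
            offspring.append(child)
        # Selection: keep the individuals with the best objective values.
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)

print(ga())
```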


Evolution strategy (ES)

Evolution strategy (ES) is similar to the genetic algorithm (GA).

However, instead of binary values, the genes in ES are real numbers.

Two types of genes in ES:

  • One records the candidate solution
  • One records the mutation strength

The highlighted parts are where ES differs from GA.


Crossover
Randomly select both the candidate solution and the mutation strength from the parents.


Mutation
Add random noise to the solution.

The noise is sampled from a normal distribution with $\mu = 0$ and $\sigma$ = mutation strength.

In other words, the smaller the mutation strength, the smaller $\sigma$, the narrower the distribution, and the more likely we are to sample values close to $\mu$, i.e., 0.
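A small sketch of ES-style mutation: each individual carries a candidate solution and a mutation strength, and noise is drawn from a normal distribution with mean 0 and standard deviation equal to the mutation strength. The log-normal self-adaptation of sigma is a common choice assumed here, not taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def es_mutate(solution, sigma, tau=0.1):
    """Mutate an ES individual (candidate solution, mutation strength)."""
    # Common choice (assumption): also perturb the mutation strength itself.
    new_sigma = sigma * np.exp(tau * rng.normal(size=sigma.shape))
    # Add Gaussian noise with mean 0 and std equal to the mutation strength.
    new_solution = solution + new_sigma * rng.normal(size=solution.shape)
    return new_solution, new_sigma

theta = np.array([1.0, 2.0])    # candidate-solution genes
sigma = np.array([0.5, 0.5])    # mutation-strength genes
print(es_mutate(theta, sigma))
```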



In conclusion, the four steps of the evolutionary algorithms are:
Initialize -> Crossover -> Mutation -> Selection (then repeat from Crossover).
(To be turned into a flowchart later.)


Differential Evolution (DE)

Similar to ES, but the mutation/selection operations are different.

The differential in DE means "difference" instead of "differentiation".


Initialization
Given an upper bound and a lower bound of each candidate-solution component $\theta_{j,i,G}$, where

  • j: j-th dimension
  • i: i-th candidate solution
  • G: G-th iteration

$$\theta_j^L \le \theta_{j,i,1} \le \theta_j^U$$

Randomly generate a fixed number of candidate solutions $\theta_{j,i,1}$ uniformly on the intervals $[\theta_j^L, \theta_j^U]$.


Mutation
In the G-th generation, we randomly select three candidate solutions $\theta_{p,G}$, $\theta_{q,G}$, $\theta_{r,G}$ to generate a vector $v_{i,G+1}$ for each $i$ (p, q, r are indices):

$$v_{i,G+1} = \theta_{p,G} + F\,(\theta_{q,G} - \theta_{r,G})$$

, where F is a constant in [0, 2].


Recombination
Given a probability parameter CR.

We recombine different dimensions (randomly assigned by $I_{rand}$) of $v$ and $\theta$ to generate a new vector $u$:

$$u_{j,i,G+1} = \begin{cases} v_{j,i,G+1} & \text{if } rand_{j,i} \le CR \text{ or } j = I_{rand} \\ \theta_{j,i,G} & \text{otherwise} \end{cases}$$

, where $rand_{j,i} \sim \text{uniform}[0, 1]$ and $I_{rand}$ is a random integer in $\{1, 2, \dots, D\}$ that decides which dimension is always taken from $v$.

$I_{rand}$ ensures that $v_{i,G+1} \ne \theta_{i,G}$.


Selection
The parameter vector of the next generation, $\theta_{i,G+1}$, is the better one among $\theta_{i,G}$ and $u_{i,G+1}$:

$$\theta_{i,G+1} = \begin{cases} u_{i,G+1} & \text{if } f(u_{i,G+1}) \le f(\theta_{i,G}) \\ \theta_{i,G} & \text{otherwise} \end{cases}$$

, where f is the objective function. P.S., we are considering a minimization problem here.

Repeat the above steps until the minimum is found or the limit on the number of iterations is reached.
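Putting the four DE steps together, a minimal sketch; the values F = 0.8 and CR = 0.9, the population size, and the sphere test function are illustrative assumptions:

```python
import numpy as np

def differential_evolution(f, lower, upper, pop_size=20, F=0.8, CR=0.9,
                           n_gens=100, seed=0):
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    D = lower.size
    # Initialization: uniform samples inside [lower, upper].
    pop = rng.uniform(lower, upper, size=(pop_size, D))
    cost = np.array([f(x) for x in pop])
    for _ in range(n_gens):
        for i in range(pop_size):
            # Mutation: v = theta_p + F * (theta_q - theta_r).
            p, q, r = rng.choice([j for j in range(pop_size) if j != i], 3, replace=False)
            v = pop[p] + F * (pop[q] - pop[r])
            # Recombination: mix dimensions of v and the current candidate.
            j_rand = rng.integers(D)
            mask = rng.random(D) <= CR
            mask[j_rand] = True                  # ensures u differs from theta_i
            u = np.where(mask, v, pop[i])
            # Selection: keep the better of u and theta_i.
            fu = f(u)
            if fu <= cost[i]:
                pop[i], cost[i] = u, fu
    best = np.argmin(cost)
    return pop[best], cost[best]

# Toy usage: minimize the sphere function on [-5, 5]^3.
print(differential_evolution(lambda x: np.sum(x**2), [-5]*3, [5]*3))
```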


Particle Swarm Optimization

Think of each particle as a member of a flock of sparrows or a colony of ants; the particles communicate with each other to find the best position.

Update
Velocity:

$$v_{j,t} = rand(0, c_1) \times (\theta^*_{t-1} - \theta_{j,t-1}) + rand(0, c_2) \times (\theta^*_{j,t-1} - \theta_{j,t-1}) + w \times v_{j,t-1}$$

, where $rand(a, b)$ is a uniform random variable in $[a, b]$.

Position:

$$\theta_{j,t} = \theta_{j,t-1} + v_{j,t}$$

  • $c_1$ decides how fast to approach $\theta^*_{t-1}$, the best position found by all particles.
  • $c_2$ decides how fast to approach $\theta^*_{j,t-1}$, the best position found by particle $j$.
  • $w$ decides how much of the last velocity is retained.
  • Usually, $c_1$, $c_2$, and $w$ are 2, 2, and 0.7, respectively.
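A minimal sketch of this PSO update with c1 = c2 = 2 and w = 0.7 as stated above; the swarm size, iteration count, and the sphere test function are illustrative assumptions:

```python
import numpy as np

def pso(f, lower, upper, n_particles=30, n_iters=200, c1=2.0, c2=2.0, w=0.7, seed=0):
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    D = lower.size
    pos = rng.uniform(lower, upper, size=(n_particles, D))
    vel = np.zeros((n_particles, D))
    pbest, pbest_val = pos.copy(), np.array([f(x) for x in pos])  # best of each particle
    gbest = pbest[np.argmin(pbest_val)].copy()                    # best of all particles
    for _ in range(n_iters):
        r1 = rng.uniform(0, c1, size=(n_particles, D))
        r2 = rng.uniform(0, c2, size=(n_particles, D))
        # Velocity: pulled toward the global best and each particle's own best.
        vel = w * vel + r1 * (gbest - pos) + r2 * (pbest - pos)
        pos = pos + vel
        vals = np.array([f(x) for x in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest, pbest_val.min()

# Toy usage: minimize the sphere function on [-5, 5]^2.
print(pso(lambda x: np.sum(x**2), [-5]*2, [5]*2))
```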

Gradient Descent-based algorithms

Momentum
Gradient Descent (GD) with momentum

$m_t$: intended to help accelerate past local minima.

$$m_{t+1} = \beta_1 m_t + \alpha g_t$$
$$\theta_{t+1} = \theta_t - m_{t+1}$$

Advantages of momentum:

  1. It can escape from local minima.
  2. It usually reaches the optimum faster.
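A minimal sketch of gradient descent with momentum following the update rule above, assuming we can evaluate the gradient g_t (here, the gradient of a simple quadratic); alpha and beta_1 are illustrative values:

```python
import numpy as np

def gd_momentum(grad, theta0, alpha=0.1, beta1=0.9, n_iters=100):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    for _ in range(n_iters):
        g = grad(theta)
        m = beta1 * m + alpha * g      # m_{t+1} = beta1 * m_t + alpha * g_t
        theta = theta - m              # theta_{t+1} = theta_t - m_{t+1}
    return theta

# Toy example: minimize f(x) = ||x - 3||^2, whose gradient is 2(x - 3).
print(gd_momentum(lambda x: 2 * (x - 3.0), theta0=[0.0, 0.0]))
```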

AdaGrad (Adaptive Gradient Descent)

Let the learning rate move quickly at first, then shrink it as we get close to the optimum.

AdaGrad keeps a variable V that accumulates the sum of squared previous gradients.

For the t-th move,

$$V_{t+1} = \sum_{i=0}^{t} g_i^2, \quad \text{i.e.,} \quad V_{t+1} = V_t + g_t^2$$

The move rule changes to:

$$\theta_{t+1} = \theta_t - \frac{\alpha\, g_t}{\sqrt{V_{t+1}} + \epsilon}$$

, where $\epsilon$ avoids a zero denominator.

As time goes on, $t$ grows and so does $V_t$, which means we should take smaller and smaller steps; that is why $V_t$ is in the denominator.

Challenges:

  • $V_t$ becomes very large after many moves because it keeps increasing.
  • We will move very little or even stop.
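A minimal sketch of the AdaGrad rule above on the same toy quadratic; the learning rate is an illustrative value:

```python
import numpy as np

def adagrad(grad, theta0, alpha=0.5, eps=1e-8, n_iters=200):
    theta = np.asarray(theta0, dtype=float)
    V = np.zeros_like(theta)
    for _ in range(n_iters):
        g = grad(theta)
        V = V + g ** 2                                  # V_{t+1} = V_t + g_t^2
        theta = theta - alpha * g / (np.sqrt(V) + eps)  # smaller steps over time
    return theta

# Toy example: minimize f(x) = ||x - 3||^2, gradient 2(x - 3).
print(adagrad(lambda x: 2 * (x - 3.0), theta0=[0.0, 0.0]))
```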

ADAM (Adaptive Moment Estimation)

Combines Momentum and AdaGrad.

For every move, it needs to keep 3 variables:

  1. $\theta_t$: position in the search space.
  2. $m_t$: called the 1st moment, recording the speed and direction.
  3. $V_t$: called the 2nd moment, recording the size of the gradients.

Move rule:

$$\theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

Note that the estimates are heavily biased toward $v_0$ at the beginning:

(Iteration 1)

$$v_1 = \beta_2 v_0 + (1 - \beta_2) g_1^2$$

(Iteration 2)

$$v_2 = \beta_2 v_1 + (1 - \beta_2) g_2^2 = \beta_2 (\beta_2 v_0 + (1 - \beta_2) g_1^2) + (1 - \beta_2) g_2^2$$

(Iteration 3)
...

Therefore, we apply a bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
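A minimal sketch of Adam with the bias correction above; the defaults beta1 = 0.9, beta2 = 0.999, and the toy quadratic are assumptions for illustration:

```python
import numpy as np

def adam(grad, theta0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=200):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)    # 1st moment: speed and direction
    v = np.zeros_like(theta)    # 2nd moment: size of the gradients
    for t in range(1, n_iters + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy example: minimize f(x) = ||x - 3||^2, gradient 2(x - 3).
print(adam(lambda x: 2 * (x - 3.0), theta0=[0.0, 0.0]))
```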


Hyper-parameter Optimization

Hyper-parameters that are input to machine learning algorithms affect the final performance (test accuracy) significantly.

To distinguish them from the parameters inside the model, hyper-parameters refer to things such as the depth of a decision tree or the number of weak classifiers in AdaBoost. In other words, hyper-parameters control the model, rather than being the model itself.

Although we could apply global optimization directly to hyper-parameters, there are some issues:

  • We need to wait until the training of the NN is done.
    • This takes a lot of time
    • The performance of the NN is expensive to evaluate

As a result, there are specialized techniques:

  • Bayesian optimization
  • Surrogate-based optimization
  • Reinforcement learning ??

Optimization for Expensive Function

For example: we want to know whether a drug benefits people, but recruiting subjects is expensive, so we cannot freely test many different formulations.

We can first build a small model that closely approximates the objective function. We can then test extensively on this model, and only the promising candidates we find are evaluated on the real objective function.


Surrogate-based Optimization

Create a function to approximate the expensive function

Surrogate-based methodology
Given:

  • A set of parameters $\theta = \{\theta_1, \theta_2, ..., \theta_n\}$ to find
  • A function $f$ that takes $\theta$ as input and outputs a real value
  • Lower and upper bounds of the parameters
    • $Lower_i \le \theta_i \le Upper_i$
  • Sometimes, special constraints may exist
    • e.g., $(\theta_i)^2 + (\theta_j)^2 \le d$

Output:

  • A set of parameters $\theta = \{\theta_1, \theta_2, ..., \theta_n\}$ such that $f(\theta)$ is minimized and the constraints are satisfied.

The objective function is the function we want to optimize.

The objective function is expensive in terms of resources, such as time. So we create a cheap function, the surrogate function, to approximate the objective function.

We run an optimization algorithm on the surrogate function to find a promising point, such as its optimum, and then feed that promising point to the objective function, as sketched below.

  • Observed points: points whose values on the objective function are already known; they are used to build the surrogate function.
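A minimal sketch of this surrogate loop in one dimension, assuming a quadratic surrogate fitted with np.polyfit and random sampling as the cheap inner optimizer; the toy "expensive" function is only a stand-in:

```python
import numpy as np

def surrogate_optimize(expensive_f, lower, upper, n_init=5, n_iters=10, seed=0):
    """Generic surrogate-based loop (1-D illustration with a quadratic surrogate)."""
    rng = np.random.default_rng(seed)
    # Observed points: inputs whose expensive objective values are known.
    X = rng.uniform(lower, upper, size=n_init)
    y = np.array([expensive_f(x) for x in X])
    for _ in range(n_iters):
        # Fit a cheap surrogate S(x) = a*x^2 + b*x + c to the observed points.
        a, b, c = np.polyfit(X, y, deg=2)
        # Find a promising point by optimizing the surrogate (cheap to evaluate).
        candidates = rng.uniform(lower, upper, size=1000)
        s_vals = a * candidates**2 + b * candidates + c
        x_new = candidates[np.argmin(s_vals)]
        # Evaluate the expensive objective only at the promising point.
        X = np.append(X, x_new)
        y = np.append(y, expensive_f(x_new))
    return X[np.argmin(y)], y.min()

# Toy expensive function (in practice: training a NN and measuring accuracy).
print(surrogate_optimize(lambda x: (x - 2.0)**2 + np.sin(5 * x), lower=-5, upper=5))
```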


Surrogate Function

We use regression functions to build the surrogate function. Three kinds of regression are introduced:

  • Quadratic regression
  • RBF
  • SVR

Quadratic regression

A simple regression:

$$S(x) = x^T Q x + w^T x + b$$

Find

$$\arg\min_{Q,w,b} \sum_{\theta \in O} \|S(\theta) - F(\theta)\|$$

with the method of least squares, where O denotes the set of observed points.

(1) Build the surrogate function

Since this S can only take three possible shapes, its optimum, i.e., the promising point, can be obtained directly from a closed-form formula.

(2) Select next promising point


Advantage:

  • It can rapidly find the minimal point by formula, instead of running an optimization algorithm.

Disadvantage:

  • It performs poorly when the observed points contain outliers.
  • Its generality is not good.

We can see that the minimum on the surrogate can deviate from the actual minimum of the objective function.
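A 1-D sketch of the least-squares fit and the closed-form promising point; the sample points and noise are made up for illustration:

```python
import numpy as np

# Observed points (theta, F(theta)) on the expensive objective.
theta = np.array([-2.0, -1.0, 0.0, 1.5, 3.0])
F = (theta - 1.0) ** 2 + 0.1 * np.random.default_rng(0).normal(size=theta.size)

# Least-squares fit of S(x) = Q*x^2 + w*x + b.
A = np.column_stack([theta**2, theta, np.ones_like(theta)])
(Q, w, b), *_ = np.linalg.lstsq(A, F, rcond=None)

# Closed-form promising point: dS/dx = 2*Q*x + w = 0  =>  x = -w / (2*Q).
x_star = -w / (2 * Q)
print(Q, w, b, x_star)
```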


RBF Regression (Radial basis function)

  • A radial function is a function $\phi: [0, \infty) \rightarrow \mathbb{R}$.
  • $a$ is called the center of the radial function $\phi(\|x - a\|)$.

RBF regression:

$$S(x) = \sum_{i=1}^{n} \lambda_i\, \phi(\|x - \theta_i\|)$$

  • $n$ is the number of observed points.
  • Each observed point has its own RBF (a bump).
  • $\lambda_i$ scales the height of the bump.
  • The final regression is the sum of all the scaled bumps, which forms a continuous function.


How to get $\lambda_i$?

(1) Build the surrogate function

The surrogate function takes the same values as the objective function at the observed points (the red part at the beginning).

The function is non-zero only near the observed points.


How to select $\hat{\theta}$?

Select the next promising point:

  1. Find the minimal point $\theta^*$ among the observed points O.
  2. Use random search to generate sample points centered on $\theta^*$.
  3. Select the point among the sample points with minimum $S(\cdot)$ as the selected point $\hat{\theta}$.

Advantage:

  • It can fit continuous functions.

Disadvantage:

  • The RBF function only fits the area near the observed points.
  • Its efficiency is lower than that of quadratic regression (which can be solved directly by formula).
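A 1-D sketch of RBF regression with a Gaussian radial function (the choice of phi and its width gamma are assumptions): the weights lambda_i are obtained by requiring S(theta_i) = F(theta_i) at every observed point, i.e., by solving a linear system.

```python
import numpy as np

def rbf_fit(theta, F, gamma=1.0):
    """Solve Phi @ lam = F so that S matches F exactly at the observed points."""
    # Phi[i, j] = phi(||theta_i - theta_j||) with a Gaussian radial function.
    dists = np.abs(theta[:, None] - theta[None, :])
    Phi = np.exp(-gamma * dists**2)
    return np.linalg.solve(Phi, F)

def rbf_predict(x, theta, lam, gamma=1.0):
    """Surrogate S(x) = sum_i lam_i * phi(||x - theta_i||)."""
    return np.exp(-gamma * np.abs(x - theta)**2) @ lam

# Observed points on the expensive objective (made up for illustration).
theta = np.array([-2.0, -0.5, 0.0, 1.0, 2.5])
F = np.sin(theta) + 0.5 * theta
lam = rbf_fit(theta, F)
print(rbf_predict(0.7, theta, lam))
```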