contributed by <kylekylehaha>
Data Science
In global optimization problems, there is a target function $f(\mathbf{x})$ and a set of parameters $\mathbf{x}$.
We want to find the parameter set $\mathbf{x}^*$ such that $f(\mathbf{x}^*)$ is minimized.
There are various algorithms for such problems:
If we know the function, the task could be easier.
E.g., for a convex (bowl-shaped) function, finding the global optimum is not difficult.
If we know what the function looks like, we do not need global optimization at all; we can find the optimum directly by differentiation.
Moreover, the variables are continuous.
If the variables are discrete
However, if the variables are continuous
General optimization algorithms
How to measure?
Treat each variable as a set of discrete points, sacrificing a little solution quality.
Idea: once we find a good set of parameters, there is probably another good set near it. Starting from a good solution, we step toward better and better solutions.
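A minimal sketch of this idea as a simple local search; the neighbourhood scheme, step size, and iteration count are illustrative assumptions, not part of the original notes.

```python
import random

def local_search(f, x0, step=0.1, iterations=1000):
    # Start from a decent solution and repeatedly move to a better neighbor.
    x, fx = list(x0), f(x0)
    for _ in range(iterations):
        neighbor = [xi + random.uniform(-step, step) for xi in x]
        fn = f(neighbor)
        if fn < fx:          # keep the neighbor only if it improves the objective
            x, fx = neighbor, fn
    return x
```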
Evolutionary algorithms are able to solve optimization problems.
Main idea
This can be viewed as a systematic search of the entire space: good candidates are generated through recombination and change.
In GA, genes are encoded as binary numbers, so each candidate is a string of bits.
Four steps in GA
Initialization
Randomly generate a set of binary vectors (the population). Each parameter is encoded as 0s and 1s.
Each binary vector represents a candidate solution.
Each vector is called a chromosome.
Crossover
Randomly select two chromosomes from the population.
Randomly choose a crossover point.
Offspring are created by having the parents exchange their genes with each other up to the crossover point.
Idea: if A1 is a good chromosome, perhaps one of its genes performs well; by changing the other, weaker genes, the result might be even better.
Mutation
After the offspring are created, randomly flip some bits of their genes.
This feels a bit like an activation function: it prevents two crossovers from simply turning the offspring back into the parents.
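A minimal Python sketch of the three operations above; the population size, single-point crossover scheme, and mutation rate are illustrative assumptions.

```python
import random

def initialize(pop_size, n_bits):
    # Each chromosome is a random binary vector (a candidate solution).
    return [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

def crossover(parent_a, parent_b):
    # Exchange genes up to a randomly chosen crossover point.
    point = random.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(chromosome, rate=0.01):
    # Randomly flip some bits so offspring do not simply revert to the parents.
    return [1 - bit if random.random() < rate else bit for bit in chromosome]
```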
Evolution strategy (ES) is similar to the genetic algorithm (GA).
However, in ES the genes are real numbers instead of binary values.
Two types of genes in ES: the candidate solution values and the mutation strengths.
The highlighted parts are where ES differs from GA.
Crossover
Randomly select both the candidate solution values and the mutation strengths from the parents.
Mutation
Add random noise to the solution.
The noise is sampled from a normal distribution with mean $\mu = 0$ and standard deviation $\sigma$ equal to the mutation strength.
In other words, the smaller the mutation strength, the smaller $\sigma$ is and the narrower the distribution, so the sampled value is more likely to be close to $\mu$, i.e., 0.
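A minimal sketch of this ES-style mutation, assuming each candidate is a real-valued vector paired with per-gene mutation strengths (the pairing is an assumption for illustration).

```python
import random

def es_mutate(solution, strengths):
    # Add Gaussian noise with mean 0 and standard deviation equal to the
    # per-gene mutation strength; a smaller strength gives a narrower distribution.
    return [x + random.gauss(0.0, s) for x, s in zip(solution, strengths)]
```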
In conclusion, the four steps of an evolutionary algorithm are:
Initialize -> Crossover -> Mutation -> Selection (then repeat from Crossover).
(To be turned into a flowchart later.)
Differential evolution (DE) is similar to ES, but the mutation/selection operations are different.
The "differential" in DE means "difference", not "differentiation".
Initialization
Given an upper bound $x_j^{U}$ and a lower bound $x_j^{L}$ for each dimension $j = 1, \dots, D$ of a candidate solution $\mathbf{x}_i$, where $x_j^{L} \le x_{j,i} \le x_j^{U}$.
Randomly generate a fixed number of candidate solutions uniformly on the intervals $[x_j^{L}, x_j^{U}]$.
Mutation
In the $G$-th generation, for each $i$ we randomly select three candidate solutions $\mathbf{x}_{p,G}$, $\mathbf{x}_{q,G}$, $\mathbf{x}_{r,G}$ (where $p$, $q$, $r$ are distinct indices) to generate a mutant vector:
$\mathbf{v}_{i,G+1} = \mathbf{x}_{p,G} + F \cdot (\mathbf{x}_{q,G} - \mathbf{x}_{r,G})$, where $F$ is a constant in $[0, 2]$.
Recombination
Given a probability parameter CR
We recombine different dimensions (randomly assigned by $\text{rand}_{j,i}$) of $\mathbf{x}_{i,G}$ and $\mathbf{v}_{i,G+1}$ to generate a new vector $\mathbf{u}_{i,G+1}$:
$$u_{j,i,G+1} = \begin{cases} v_{j,i,G+1}, & \text{if } \text{rand}_{j,i} \le CR \text{ or } j = I_{\text{rand}} \\ x_{j,i,G}, & \text{otherwise} \end{cases}$$
where $\text{rand}_{j,i} \sim \text{uniform}[0, 1]$ and $I_{\text{rand}}$ is a random integer in $\{1, 2, \dots, D\}$, used to decide which dimension is taken from the mutant vector.
$I_{\text{rand}}$ ensures that $\mathbf{u}_{i,G+1}$ differs from $\mathbf{x}_{i,G}$ in at least one dimension.
Selection
The candidate of the next generation, $\mathbf{x}_{i,G+1}$, is the better one of $\mathbf{u}_{i,G+1}$ and $\mathbf{x}_{i,G}$:
$$\mathbf{x}_{i,G+1} = \begin{cases} \mathbf{u}_{i,G+1}, & \text{if } f(\mathbf{u}_{i,G+1}) \le f(\mathbf{x}_{i,G}) \\ \mathbf{x}_{i,G}, & \text{otherwise} \end{cases}$$
where $f$ is the objective function. (Note that we are considering a minimization problem here.)
Repeat the above steps until the minimum is found or the limit on the number of iterations is reached.
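A minimal sketch that puts the four DE steps together; the values of NP, F, CR, and the number of generations are illustrative assumptions.

```python
import random

def differential_evolution(f, lower, upper, NP=20, F=0.8, CR=0.9, generations=100):
    D = len(lower)
    # Initialization: NP candidates drawn uniformly from [lower_j, upper_j].
    pop = [[random.uniform(lower[j], upper[j]) for j in range(D)] for _ in range(NP)]
    for _ in range(generations):
        for i in range(NP):
            # Mutation: pick three distinct candidates p, q, r (all different from i).
            p, q, r = random.sample([k for k in range(NP) if k != i], 3)
            v = [pop[p][j] + F * (pop[q][j] - pop[r][j]) for j in range(D)]
            # Recombination: mix x_i and v dimension by dimension; I_rand guarantees
            # at least one dimension comes from the mutant vector.
            I_rand = random.randrange(D)
            u = [v[j] if (random.random() <= CR or j == I_rand) else pop[i][j]
                 for j in range(D)]
            # Selection: keep the better of u and x_i (minimization).
            if f(u) <= f(pop[i]):
                pop[i] = u
    return min(pop, key=f)
```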
Particle swarm optimization (PSO) treats the particles as a flock of sparrows or a colony of ants: they communicate with each other to find the best position.
Here rand(a, b) denotes a uniform random variable on $[a, b]$.
Update
Velocity: $v_i \leftarrow w \cdot v_i + c_1 \cdot \text{rand}(0, 1) \cdot (p_i - x_i) + c_2 \cdot \text{rand}(0, 1) \cdot (g - x_i)$, where $p_i$ is the particle's own best position and $g$ is the best position found by the swarm.
Position: $x_i \leftarrow x_i + v_i$.
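A minimal sketch of one PSO update, assuming the common inertia-weight form; the coefficients w, c1, and c2 are illustrative values.

```python
import random

def pso_step(x, v, personal_best, global_best, w=0.7, c1=1.5, c2=1.5):
    # Velocity: inertia plus attraction toward the particle's own best
    # position and the best position found by the whole swarm.
    v = [w * vj
         + c1 * random.uniform(0, 1) * (pb - xj)
         + c2 * random.uniform(0, 1) * (gb - xj)
         for xj, vj, pb, gb in zip(x, v, personal_best, global_best)]
    # Position: move by the new velocity.
    x = [xj + vj for xj, vj in zip(x, v)]
    return x, v
```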
Momentum
Gradient Descent (GD) with momentum: the hope is to speed up the updates and break out of local minima.
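A minimal sketch of one momentum update, assuming the common formulation with a decay coefficient beta and learning rate eta (both illustrative values); theta and grad may be scalars or NumPy arrays.

```python
def momentum_step(theta, velocity, grad, eta=0.01, beta=0.9):
    # Accumulate an exponentially decaying average of past gradients,
    # which can carry the update through shallow local minima.
    velocity = beta * velocity - eta * grad
    return theta + velocity, velocity
```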
Advantages of momentum:
At the beginning, let the learning rate be larger so we move faster; as we get close to the optimum, make the learning rate smaller.
AdaGrad keeps a variable $V$ that accumulates the squared sum of the previous gradients.
For the $t$-th move, $V_t = V_{t-1} + g_t^2$, i.e., $V_t = \sum_{i=1}^{t} g_i^2$.
The move rule changes to:
$\theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{V_t} + \epsilon}\, g_t$, where $\epsilon$ prevents the denominator from being zero.
As time goes on ($t$ grows), $V_t$ grows as well, meaning we should take smaller steps; this is why $V_t$ appears in the denominator.
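A minimal sketch of one AdaGrad step following the rule above; eta and eps are illustrative values.

```python
import numpy as np

def adagrad_step(theta, V, grad, eta=0.01, eps=1e-8):
    # V accumulates the squared gradients; a larger V means smaller steps.
    V = V + grad ** 2
    theta = theta - eta / (np.sqrt(V) + eps) * grad
    return theta, V
```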
Challenges:
Combine Momentum and AdaGrad
For every move, it needs to keep three variables: the momentum term $m_t$, the squared-gradient term $V_t$, and the step count $t$.
Move rule:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, $\quad V_t = \beta_2 V_{t-1} + (1 - \beta_2) g_t^2$
$\theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{V_t} + \epsilon}\, m_t$
Note that $m_t$ (and likewise $V_t$) is heavily biased toward 0 in the early iterations, since it starts from 0:
$m_1 = (1 - \beta_1) g_1$ (Iteration 1)
$m_2 = \beta_1 (1 - \beta_1) g_1 + (1 - \beta_1) g_2$ (Iteration 2)
$m_3 = \beta_1^2 (1 - \beta_1) g_1 + \beta_1 (1 - \beta_1) g_2 + (1 - \beta_1) g_3$ (Iteration 3) …
Therefore, a correction is needed:
$\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}, \qquad \hat{V}_t = \dfrac{V_t}{1 - \beta_2^t}$
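A minimal sketch of one step of this combined method (Adam-style), including the bias correction; the hyper-parameter values shown are common defaults and are assumptions here. The argument t is the 1-based step count.

```python
import numpy as np

def adam_step(theta, m, V, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum-style first moment and AdaGrad-style second moment.
    m = beta1 * m + (1 - beta1) * grad
    V = beta2 * V + (1 - beta2) * grad ** 2
    # Bias correction: m and V start at 0 and are biased toward 0 early on.
    m_hat = m / (1 - beta1 ** t)
    V_hat = V / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(V_hat) + eps)
    return theta, m, V
```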
Hyper-parameters that are input to machine learning algorithms affect the final performance (test accuracy) significantly.
To distinguish them from a model's internal parameters: hyper-parameters are things such as the depth of a decision tree or the number of weak classifiers in AdaBoost. In other words, hyper-parameters control the model rather than being part of the model itself.
Although we could apply global optimization directly to hyper-parameters, there are some problems:
As a result, there are special techniques.
For example: we want to know whether a drug benefits people, but recruiting people is expensive, so we cannot test different formulas without limit.
We can first build a small model that behaves much like the objective function. We can then test extensively on this model and, once we find a good candidate, evaluate it on the real objective function.
Create a function to approximate the expensive function
Surrogate-based methodology
Given:
Output:
The objective function is the function we want to optimize.
The objective function is expensive in terms of resources such as time, so we create a cheap function, the surrogate function, to approximate it.
We run an optimization algorithm on the surrogate function to find a promising point, such as its optimum, and then feed the promising point to the objective function.
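A minimal sketch of this surrogate-based loop; fit_surrogate and optimize_surrogate are hypothetical helpers standing in for whichever regression and optimizer are used.

```python
def surrogate_optimize(expensive_f, initial_points, n_rounds,
                       fit_surrogate, optimize_surrogate):
    # Evaluate the expensive objective only at a few points.
    X = list(initial_points)
    y = [expensive_f(x) for x in X]
    for _ in range(n_rounds):
        surrogate = fit_surrogate(X, y)          # cheap approximation of f
        x_next = optimize_surrogate(surrogate)   # promising point on the surrogate
        y_next = expensive_f(x_next)             # one expensive evaluation
        X.append(x_next)
        y.append(y_next)
    best = min(range(len(X)), key=lambda i: y[i])
    return X[best], y[best]
```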
We use regression functions to build the surrogate function. Three kinds of regression are introduced:
Simple regression
Find the regression coefficients with the method of least squares.
(1) Build the surrogate function
Because this surrogate $S$ takes only the following three forms, the optimum, i.e., the promising point, can be obtained directly from a closed-form solution.
(2) Select next promising point
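A minimal sketch of both steps for a simple regression surrogate; treating S as a one-variable quadratic is an assumption made for illustration, since the three exact forms are not listed here.

```python
import numpy as np

def quadratic_surrogate(xs, ys):
    # (1) Least-squares fit of S(x) = b0 + b1*x + b2*x^2 to the observed points
    #     (needs at least three observations).
    b2, b1, b0 = np.polyfit(xs, ys, deg=2)
    # (2) Closed-form minimum of the quadratic (valid when b2 > 0):
    #     this is the promising point to evaluate on the real objective.
    x_promising = -b1 / (2 * b2)
    return (lambda x: b0 + b1 * x + b2 * x ** 2), x_promising
```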
Advantage:
Disadvantage:
We can see that there is a gap between the minimum of the surrogate and the actual minimum of the objective function.
RBF regression:
How to obtain the weights of the basis functions?
(1) Build surrogate function
At the observed points, the surrogate function takes exactly the observed values (the red part at the beginning).
Each basis function is non-zero only near its observed point.
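A minimal sketch of building a one-dimensional RBF surrogate by interpolation; the Gaussian basis and its width are assumptions, and the weights come from forcing the surrogate to match the observed values exactly.

```python
import numpy as np

def rbf_surrogate(xs, ys, width=1.0):
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    # Gaussian basis: non-zero only near its center (an observed point).
    phi = lambda r: np.exp(-(r / width) ** 2)
    # Interpolation condition: S(x_i) = y_i at every observed point.
    A = phi(np.abs(xs[:, None] - xs[None, :]))
    w = np.linalg.solve(A, ys)
    # The surrogate is a weighted sum of basis functions centered at the data.
    return lambda x: phi(np.abs(x - xs)) @ w
```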
How to select the next promising point?
Select next promising point
Advantage:
Disadvantage: