# Global Optimization

contributed by <`kylekylehaha`>

###### tags:`Data Science`

## Outline
- Global optimization
    - Random Search / Grid Search
    - Evolution Algorithms / Particle Swarm
    - Gradient Descent-based Algorithms
- Hyper-parameter optimization
- Surrogate-based optimization

---

In a global optimization problem, there is a target function $f$ and a set of parameters $\theta = \{\theta_1, \theta_2, ..., \theta_n\}$. We want to find the parameter set $\theta^*$ such that $f(\theta^*)$ is minimized:

$\arg\min_{\theta} f(\theta)$

Various algorithms for such problems:
- Random Search / Grid Search
- Evolution Algorithms
- Swarm Intelligence Algorithms
- Gradient Descent

---

**If we knew the function, the task would be easier.**

E.g., for a convex (bowl-shaped) function, finding the global optimum is not difficult. If we knew the exact form of the function, we would not need global optimization at all; we could simply differentiate it and find the point where the first derivative is 0.

- Most of the time, we don't know the exact form of the function.
- Given an input $x$, we know only the output $f(x)$.

**Moreover, the variables are continuous.**

If the variables are discrete:
- Brute-force or branch-and-bound works: just try every point.
- Time-consuming, but it works.

However, if the variables are continuous:
- Brute-force and branch-and-bound no longer work.

![](https://i.imgur.com/qHC1yBb.png)

---

**General optimization algorithms**
- Simple search algorithms
    - Random search & grid search
- Evolution and swarm intelligence
    - Genetic algorithm, evolution strategy, particle swarm
- Gradient-based algorithms
    - Gradient descent, AdaGrad, Adam

**How to measure them?**
- Efficiency
    - The time to obtain a solution, e.g., to converge
- Solution quality

---

## Simple search algorithms

### Grid Search

Treat each variable as a set of discrete points, sacrificing some solution quality.

- Discretize the search space and search with brute-force.
- Each variable is one axis, so ten variables means a ten-dimensional grid.

![](https://i.imgur.com/I6Dil5g.png)

---

### Random Search

- $r$ is the search radius.
- Random search consists of two steps:
    1. Find a point $\theta^*$ in the search space.
We record $\theta^*$ as the current optimum. (Initially, any point can serve as $\theta^*$.)
    2. Uniformly at random, choose a point $\theta$ with distance at most $r$ from $\theta^*$. If $f(\theta) \lt f(\theta^*)$, then set $\theta^*$ to $\theta$. Repeat this step.

---

## Evolution Algorithms

Idea: once we find a good set of parameters, there may be another good set nearby. Starting from one good solution, we gradually move toward better ones.

Evolutionary algorithms are able to solve optimization problems.
- Genetic algorithm (GA)
- Evolution strategy (ES)

**Main idea**
1. Start with some random candidate solutions.
2. Recombine current candidate solutions to generate new candidates.
    - Crossover
    - Find other candidates based on current candidates
3. Slightly change the generated candidates.
    - Mutation
    - Add some randomness into the search process
    - Prevent converging at a local minimum too early
4. Keep good candidate solutions.
5. Repeat steps 2–4.

This can be seen as a **systematic** search of the whole space: by recombining and changing candidates, good new candidates are generated.

---

### Genetic Algorithm (GA)

In GA, **a gene is a string of binary numbers.**

Four steps in GA:
1. Initialize the population.
2. Randomly select individuals for crossover.
3. Mutate.
4. Select the individuals with good objective values as the population of the next iteration.

![](https://i.imgur.com/5Od0YuZ.png)

---

**Initialization**

Randomly generate a bunch of binary vectors (the population). Each parameter is encoded as 0s and 1s. Each binary vector represents a candidate solution and is called a chromosome.

![](https://i.imgur.com/LBfoQlQ.png)

---

**Crossover**

Randomly select two chromosomes from the population and randomly choose a crossover point. Offspring are created by exchanging the parents' genes up to the crossover point.

Idea: if A1 is good, perhaps one of its genes performs well; changing its other, weaker genes may give an even better result.

![](https://i.imgur.com/3Pzv2qV.png)

---

**Mutation**

After offspring are created, randomly change some bits of their genes. (Somewhat like the role of an activation function, this prevents two crossovers from simply turning back into the parents.)

---

### Evolution Strategy (ES)

Evolution strategy (ES) is similar to genetic algorithm (GA).
==But in ES, genes are real numbers instead of binary strings.==

Two types of genes in ES:
- One records the candidate solution
- ==One records the **mutation strength**==

The highlighted parts are where ES differs from GA.

---

**Crossover**

Randomly select both the **candidate solution** and the **mutation strength** from the parents.

---

**Mutation**

Add random noise to the solution. The value is sampled from a normal distribution with $\mu = 0$ and $\sigma$ = mutation strength.

That is, the smaller the mutation strength, the smaller $\sigma$: the distribution is narrower, so samples are more likely to be near $\mu$, i.e., 0.

![](https://i.imgur.com/D1IcnY4.png)

---

In conclusion, the four steps for evolutionary algorithms:

Initialize -> Crossover -> Mutation -> Selection (repeat from Crossover)

---

### Differential Evolution (DE)

Similar to ES, but the mutation/selection operations are different. The **differential** in DE means "difference", not "differentiation".

---

**Initialization**

Given an upper bound and a lower bound for each dimension of a candidate solution

$$
\theta_{j,i,G}
$$

, where
- $j$: $j$-th dimension
- $i$: $i$-th candidate solution
- $G$: $G$-th iteration

$\theta^L_j \le \theta_{j,i,1} \le \theta^U_j$

Randomly generate a fixed number of candidate solutions $\theta_{j,i,1}$ uniformly on the intervals $[\theta^L_j, \theta^U_j]$.

---

**Mutation**

In the $G$-th generation, we randomly select three candidate solutions $\theta_{p,G}, \theta_{q,G}, \theta_{r,G}$ to generate a vector $v_{i,G+1}$ for each $i$ ($p$, $q$, $r$ are indices):

$$
v_{i,G+1}=\theta_{p,G}+F(\theta_{q,G}-\theta_{r,G})
$$

, where $F$ is a constant in $[0, 2]$.

---

**Recombination**

Given a probability parameter $CR$, we recombine different dimensions (randomly assigned by $I_{rand}$) of $v$ and $\theta$ to generate a new vector $u$:

$$
u_{j,i,G+1}=
\begin{cases}
v_{j,i,G+1}& if \ rand_{j,i} \le CR\ or\ j = I_{rand}\\
\theta_{j,i,G}& otherwise
\end{cases}
$$

, where $rand_{j,i} \sim uniform[0, 1]$ and $I_{rand}$ is a random integer in $\{1, 2,..., D\}$ that decides which dimension must come from $v$.

$I_{rand}$ ensures that $u_{i,G+1} \ne \theta_{i,G}$.

---

**Selection**

The parameter of the next generation, $\theta_{i,G+1}$,
is the better one of $\theta_{i,G}$ and $u_{i,G+1}$:

$$
\theta_{i,G+1}=
\begin{cases}
u_{i,G+1}& if \ f(u_{i,G+1}) \le f(\theta_{i, G}) \\
\theta_{i,G}& otherwise
\end{cases}
$$

, where $f$ is the objective function. (Note that we are considering a minimization problem here.)

Repeat the above steps until the limit on the number of iterations is reached.

---

## Particle Swarm Optimization

Think of the particles as a flock of sparrows or a colony of ants: they communicate with each other to find the best position.

![](https://i.imgur.com/xQnjJfD.png)

**Update**

Velocity: $v_{j,t}=rand(0,c_1) \times (\theta^*_{t-1}-\theta_{j,t-1})+rand(0, c_2) \times (\theta^*_{j,t-1}-\theta_{j,t-1})+w \times v_{j,t-1}$

Position: $\theta_{j,t}=\theta_{j,t-1}+v_{j,t}$

, where $rand(a, b)$ is a uniform random variable in $[a, b]$.

- $c_1$ decides how fast to approach $\theta^*_{t-1}$, the best position found by all particles.
- $c_2$ decides how fast to approach $\theta^*_{j,t-1}$, the best position found by particle $j$.
- $w$ decides how much of the last velocity is retained.
- Usually, $c_1$, $c_2$, and $w$ are 2, 2, and 0.7, respectively.

---

## Gradient Descent-based Algorithms

**Momentum**

Gradient descent (GD) with momentum $m_t$: the hope is to accelerate and break out of local minima.

$$
m_{t+1}=\beta_1 \cdot m_t + \alpha \cdot g_t \\
\theta_{t+1} = \theta_t - m_{t+1}
$$

![](https://i.imgur.com/e5RvbAb.png)

Advantages of momentum:
1. It can escape from local minima.
2. It usually reaches the optimum faster.

---

### AdaGrad (Adaptive Gradient Descent)

Start with a large effective learning rate, and shrink it as we get close to the optimum.

AdaGrad keeps a variable $V$ that accumulates the squared sum of previous gradients. For the $t$-th move, $V_{t+1}=\sum_{i=0}^tg^2_i$, i.e., $V_{t+1}=V_t+g^2_t$.

The move rule changes to:

$$
\theta_{t+1}=\theta_t-\alpha \cdot \frac{g_t}{\sqrt{V_{t+1}+\epsilon}}
$$

, where $\epsilon$ keeps the denominator from being zero.

As $t$ increases, $V_t$ grows, which means we should take smaller steps; this is why $V_t$ is in the denominator.

Challenges:
- $V_t$ becomes very large after many moves, because $V_t$ is monotonically increasing.
- We will move very little or even stop.
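The AdaGrad move rule above can be sketched in a few lines. This is a minimal 1-D sketch; the objective function, starting point, and learning rate are illustrative assumptions, not from the notes:

```python
import math

def adagrad(grad, theta0, alpha=0.5, epsilon=1e-8, steps=300):
    """Minimal 1-D AdaGrad sketch: accumulate squared gradients in V
    so the effective step size shrinks over time."""
    theta, V = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        V += g * g                                   # V_{t+1} = V_t + g_t^2
        theta -= alpha * g / math.sqrt(V + epsilon)  # move rule
    return theta

# Hypothetical objective f(theta) = (theta - 3)^2, gradient 2*(theta - 3);
# the run should end close to the minimum at theta = 3.
theta_star = adagrad(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```

Note how the accumulated `V` never shrinks: later steps are always smaller, which is exactly the challenge listed above.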
---

### Adam (Adaptive Moment Estimation)

Combines Momentum and AdaGrad. For every move, it keeps 3 variables:
1. $\theta_t$: position in the search space.
2. $m_t$: the 1st moment, recording the speed and direction.
3. $v_t$: the 2nd moment, recording the size of the gradients.

**Move rule:**

$$
\theta_t=\theta_{t-1}-\alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+ \epsilon} \\
m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t \\
v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2
$$

==Note that the estimates are heavily biased toward $v_0$ in early iterations:==

(Iteration 1) $v_1=\beta_2 \cdot v_0 + (1-\beta_2) \cdot g^2_1$
(Iteration 2) $v_2=\beta_2 \cdot v_1 + (1-\beta_2) \cdot g^2_2=\beta_2 \cdot (\beta_2 \cdot v_0 + (1-\beta_2) \cdot g^2_1)+(1-\beta_2) \cdot g^2_2$
(Iteration 3) ...

Therefore, a bias correction is applied:

$$
\hat{m}_t = \frac{m_t}{1-\beta^t_1} \\
\hat{v}_t = \frac{v_t}{1-\beta^t_2}
$$

![](https://i.imgur.com/kcUNcRP.png)

---

## Hyper-parameter Optimization

Hyper-parameters that are input to machine learning algorithms affect the final performance (test accuracy) significantly.

To distinguish them from the parameters inside a model: hyper-parameters are values such as the depth of a decision tree or the number of weak classifiers in AdaBoost. That is, hyper-parameters control the model rather than being part of the model itself.

Although we can apply global optimization directly to hyper-parameters, there are some problems:
- We need to wait until the training of the NN is done.
    - This takes a lot of time.
- ==The performance of a NN is expensive-to-evaluate.==

As a result, there are special techniques:
- Bayesian optimization
- Surrogate-based optimization
- Reinforcement learning

---

## Optimization for Expensive Functions

For example, to know whether a drug is beneficial, recruiting human subjects is costly, so we cannot endlessly test different formulas.

Instead, we can first build a small model that closely resembles the objective function. We can then test freely on the model, and only after finding a good candidate do we evaluate it on the real objective function.

---

### Surrogate-based Optimization

> Create a function to approximate the expensive function

**Surrogate-based methodology**

Given:
- A set of parameters to find, $\theta=\{\theta_1, \theta_2, ...
, \theta_n\}$
- A function $f$ that takes $\theta$ as input and outputs a real value
- A lower **bound** and upper **bound** for each parameter
    - $Lower_i \le \theta_i \le Upper_i$
- Sometimes, special constraints may exist
    - $(\theta_i)^2+(\theta_j)^2 \le d$

Output:
- A set of parameters $\theta^*=\{\theta^*_1, \theta^*_2,...,\theta^*_n\}$ such that $f(\theta^*)$ is minimized and the constraints are satisfied.

The **objective function** is the function we want to optimize. Since the objective function is expensive in resources such as time, we create a cheap function, the **surrogate function**, to approximate it. We run an optimization algorithm on the surrogate function to find a **promising point**, e.g., its optimum, and then feed the promising point to the objective function.

![](https://i.imgur.com/9Dwq0gh.png)

![](https://i.imgur.com/yj5pfu0.png)

- Observed points: points whose values on the objective function are already known; they are used to build the surrogate function.

![](https://i.imgur.com/kOgiXB0.png)

---

### Surrogate Function

We use a **regression** function to build the **surrogate function**. Three regressions are introduced:
- Quadratic regression
- RBF
- SVR

---

#### Quadratic regression

Simple regression:

$$
S(x)=x^TQx+w^Tx+b
$$

Find $argmin_{Q,w,b} \sum_{\theta \in O}||S(\theta)-f(\theta)||$ with the method of least squares.

> (1) Build the surrogate function

---

Since this $S$ can only take the following three shapes, the optimum, i.e., the promising point, can be found directly by a closed-form formula.

![](https://i.imgur.com/0elrJG8.png)

> (2) Select the next promising point

---

Advantages:
- It can rapidly find the minimal point by formula, instead of running an optimization algorithm.

Disadvantages:
- It performs poorly when there are many observed points with outliers.
- Its generality is not good.
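The two steps above (fit $S$ by least squares, then take its closed-form minimum as the promising point) can be sketched in 1-D. The objective function and the observed points here are illustrative assumptions:

```python
import numpy as np

# Hypothetical expensive objective we can only sample at a few points:
def f(x):
    return (x - 2.0) ** 2 + 1.0

# Observed points O (already evaluated on the expensive function)
obs_x = np.array([-1.0, 0.0, 1.0, 3.0, 4.0])
obs_y = f(obs_x)

# (1) Build the surrogate S(x) = q*x^2 + w*x + b by least squares
A = np.column_stack([obs_x ** 2, obs_x, np.ones_like(obs_x)])
q, w, b = np.linalg.lstsq(A, obs_y, rcond=None)[0]

# (2) Promising point: closed-form minimum of the parabola (valid when q > 0)
x_hat = -w / (2.0 * q)
```

In higher dimensions the same idea applies with the minimizer $x^* = -\frac{1}{2}Q^{-1}w$ when $Q$ is positive definite.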
![](https://i.imgur.com/H6qKfUG.png)

Note the gap between the minimum of the surrogate and the actual minimum of the objective function.

---

#### RBF Regression (Radial Basis Function)

- Radial function $\phi:[0, \infty) \to R$
- In $\phi(||x-a||)$, $a$ is called the center of the radial function.

RBF regression:

$$
S(x)=\sum^n_{i=1}\lambda_i \phi(||x-\theta_i||)
$$

- $n$ is the number of observed points.
- Each observed point has its own RBF (a "bump").
- $\lambda_i$ scales the height of each bump.
- The final regression is the sum of all scaled bumps, forming a continuous function.

![](https://i.imgur.com/z6YQPHU.png)

---

How to get $\lambda_i$?

> (1) Build the surrogate function

![](https://i.imgur.com/0k0BJGZ.png)

At each observed point, the surrogate function equals the observed value (the red parts at the start). The surrogate is nonzero only near the observed points.

![](https://i.imgur.com/SDarWpe.png)

---

How to select $\hat{\theta}$?

> (2) Select the next promising point

1. Find the minimal point $\theta^*$ among the **observed points** $O$.
2. Use random search to generate **sample points** centered on $\theta^*$.
3. Select the point among the **sample points** with the minimum $S(\cdot)$ as the selected point $\hat{\theta}$.

---

Advantages:
- It can fit continuous functions.

Disadvantages:
- The RBF function only fits the area near the observed points.
- Its efficiency is lower than quadratic regression (which solves for the optimum with a closed-form formula).

---
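The two RBF steps above can be sketched in 1-D. The $\lambda_i$ come from the interpolation conditions $S(\theta_j) = f(\theta_j)$, a linear system; the objective function, the Gaussian radial function, its width, and the search radius are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                                  # hypothetical expensive objective
    return np.sin(3 * x) + x ** 2

def phi(r):                                # Gaussian radial function (assumed width 0.5)
    return np.exp(-(r / 0.5) ** 2)

obs_x = np.linspace(-2, 2, 9)              # observed points theta_i
obs_y = f(obs_x)

# (1) Build the surrogate: interpolation conditions S(theta_j) = f(theta_j)
# give the linear system Phi @ lam = y.
Phi = phi(np.abs(obs_x[:, None] - obs_x[None, :]))
lam = np.linalg.solve(Phi, obs_y)

def S(x):                                  # surrogate: sum of scaled bumps
    return phi(np.abs(x[:, None] - obs_x[None, :])) @ lam

# (2) Select the next promising point: random search around the best
# observed point, then keep the sample with minimum S.
best = obs_x[np.argmin(obs_y)]
samples = best + rng.uniform(-0.5, 0.5, size=200)
theta_hat = samples[np.argmin(S(samples))]
```

The Gaussian kernel matrix on distinct centers is positive definite, so the solve is well posed; `theta_hat` would then be evaluated on the real objective and added to the observed points.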