---
tags: Master
---
# Intelligent Optimization
[TOC]
## Introduction
**Exam:**
- `45-60 mins`
- questions: `true/false`, `multiple choice`, `number input`
- allowed material: `pen and paper`, `calculator`
👁️‍🗨️ once the exam has been completed, submit the answers but `do not disconnect` from the video-conference until the time is over
**Theory:** from the [LIONbook](https://intelligent-optimization.org/LIONbook/)
`Optimization for learning` vs `Learning for Optimization` (more like the focus of this course?)
## Ch.24 - Greedy and Local Search
[255-267]
🎯`Chapter context:` search for global optima of functions on **discrete variables** as input.
**Greedy and Local Search** are simple and effective ways to identify improving solutions based on `step-by-step` `short-sighted changes`.
🔥Greedy constructions and local search **can be combined**: build an initial complete solution with a greedy construction and improve it with local search (see the sketch at the end of this chapter).
| | Constructive Greedy 🧱 | Perturbative Local Search 🔄 |
| ------------------------------- | -------------------------------------------- | -------------------------------------------- |
| **Short-sighted local changes** | create an admissible solution from scratch | improve a complete solution by trial and error |
| **Optimal solution** | mostly no (`TSP`, ...), few yes (`MST`, ...) | generally no (stops at local optima) |
**Traveling Salesman Problem (**`TSP`**):** find the shortest closed path that visits every node exactly once.
- NP-complete
- 👎 the `greedy approach` sucks: non-optimal solutions that tend to be not much better than random
- `local search` can start from the greedy solution and get better solutions in less time, but still...
**Minimum Spanning Tree (**`MST`**):** connect all nodes with a tree with minimum total edge cost.
- polynomial
- 👍 the `greedy approach` guarantees an optimal solution
**Local Search:** repeatedly perturb (modify) current solution in search of better configurations.
- **Perturbation:** move from one configuration to neighboring one (usually better)
- stops at local minima/maxima
- :warning:`Critical issue:` selecting a **neighborhood structure** appropriate to a given problem

- **Big Valley property:** in many real-world problems local minima tend to be clustered and close to the global minimum
	- **"kicking"** a configuration out of a local minimum and repeating local search is more likely to yield a better minimum than restarting local search from a random configuration
- :warning: but some "big valleys" can be just local attractors
- exploited by `Variable Neighborhood Search` (VNS) and `Iterated Local Search` (ILS)
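To make the combination concrete, here is a minimal Python sketch (my own illustration, not from the book): a nearest-neighbor greedy construction for a random TSP instance, improved by 2-opt perturbative local search.

```python
import math, random

def tour_length(tour, pts):
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def greedy_tour(pts):
    """Constructive greedy: always move to the nearest unvisited city."""
    unvisited = set(range(1, len(pts)))
    tour = [0]
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda j: math.dist(pts[last], pts[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def two_opt(tour, pts):
    """Perturbative local search: reverse a segment whenever it shortens the tour."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                cand = tour[:i] + tour[i:j][::-1] + tour[j:]
                if tour_length(cand, pts) < tour_length(tour, pts):
                    tour, improved = cand, True
    return tour

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(15)]
t = greedy_tour(pts)
print(tour_length(t, pts), "->", tour_length(two_opt(t, pts), pts))
```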


## Ch.25 - Stochastic Global Optimization
[269-285]
🎯`Chapter context:` search for global optima of functions on **continuous variables** as input.
**Throwing random points** onto the input space can be a robust `brute-force` method for the continuous optimization case.
**Stochastic Global Optimization (**`SGO`**):** a suite of methods relying on the **random generation of sample points** in the search space, ranging from `global random search` to methods based on `probabilistic assumptions`.
- :heavy_check_mark: More popular than `deterministic global optimization` because _simpler_, insensitive to the structure of the objective function and feasibility region, and easily modifiable to guarantee _theoretical convergence_.
	- (deterministic requires stronger assumptions)
- 👎 practical efficiency: can be very slow (RSO to the rescue)
- **Global Optimization:** hard problem, usually combines `robust global schemes` with `fast local search schemes`
- `Eg:`**Multistart Local Search:** multiple runs of local search from interesting points (?)
- 👁️‍🗨️ "global schemes" try to narrow down the search area for the "local schemes"
- :warning: without strong function assumptions like `Lipschitz continuity` or "big valley", convergence cannot be ensured for sparse input sequences
### Lipschitz continuity
**Lipschitz continuity:** assumption that real-world problems have `limited variability`
$$
|f(x_1) - f(x_2)| \le K ||x_1 - x_2|| \quad ,K\ge0
$$

- for every evaluated point $x$, the **cone** $f(x) - K||x - x'||$ defines a lower bound that the function cannot cross, because it cannot change too fast

- it makes sense to place the next sample at the point of `minimum of the underestimate`
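A minimal 1-D sketch of this sampling rule (Shubert/Piyavskii-style; the test function and the constant `K` below are illustrative assumptions, not taken from the book):

```python
import numpy as np

def lipschitz_minimize(f, K, a, b, n_iter=20):
    """1-D sketch: sample where the piecewise-cone lower bound is lowest."""
    xs, ys = [a, b], [f(a), f(b)]
    grid = np.linspace(a, b, 1001)
    for _ in range(n_iter):
        # each evaluated point contributes a cone f(x_i) - K|x - x_i|
        cones = np.array(ys)[:, None] - K * np.abs(grid[None, :] - np.array(xs)[:, None])
        lower = cones.max(axis=0)                  # tightest lower bound from all cones
        x_next = float(grid[np.argmin(lower)])     # minimum of the underestimate
        xs.append(x_next)
        ys.append(f(x_next))
    i = int(np.argmin(ys))
    return xs[i], ys[i]

f = lambda x: np.sin(3 * x) + 0.5 * x              # toy function, Lipschitz with K = 3.5
print(lipschitz_minimize(f, K=3.5, a=0.0, b=4.0))
```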
### Pure random search
**Pure random search (**`PRS`**):** repeated `generation of random points` to be evaluated in the admissible region (usually from same distribution)
- simple but a good reference to benchmark against, easy to study properties
- 👁️‍🗨️ assumes non-pathological cases (`smooth boundaries`, ...)
**Convergence theorem:** if, for all possible histories of stochastic sample generation, the probability distributions $p_j(x)$ guarantee that any fixed ball is hit with probabilities that sum up to infinity, then an **arbitrary $\delta$-close approximation of the optimum will be found infinitely often**.
- _"hitting a ball"_: $p_j\big(B(\textbf{x}^*,\epsilon)\big)$ is the probability of sampling, with probability distribution $j$, a point inside a hyper-sphere $B$ with center $\textbf{x}^*$ and radius $\epsilon$
- `Uniform distribution` is an immediate way to satisfy the convergence requirements
- Take-home 🏠: **convergence is not difficult to prove**, just mix some uniform probability with your favorite sampling strategy that learns from history
	- but may not mean much in practice, as the convergence rate can be infeasibly slow
🔥**Curse of Dimensionality:** affects the convergence rate of PRS
- Goal: hit the ball $B(\textbf{x}^*,\epsilon)=\{\textbf{x}\in A: || \textbf{x}-\textbf{x}^*|| \le \epsilon\}$ with **i.i.d.** samples $\textbf{x}_j$ with distribution $p(\textbf{x})$
- $p(B)= \dfrac{\text{vol}(B)}{\text{vol}(A)}$
- but in a $d$-cube $A$ of side length 1, $p(B)$ is proportional to $\epsilon^d$
- in higher dimensions it is **exponentially more difficult to hit the ball**
- `Bonus intuition:` considering the optimal point $\textbf{x}^*$ as an array of $d$ coordinates, a random array $\textbf{x}_j$ is close only if all its randomly generated coordinates are close.
**Moral:** approximating the global optimum of `generic functions` is infeasible with many dimensions.
**Hope:** for `functions with special forms`, approximating the global minimum can be feasible. Are they common?
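A tiny back-of-the-envelope illustration of the curse (the numbers are illustrative only):

```python
# Probability of hitting B(x*, eps) with one uniform sample in the unit d-cube
# scales like eps**d, so the expected number of samples grows exponentially with d.
eps = 0.1
for d in (1, 2, 5, 10, 20):
    p_hit = eps ** d    # ignoring the constant volume factor of the ball
    print(f"d={d:2d}  p(hit) ~ {p_hit:.1e}  expected samples ~ {1 / p_hit:.1e}")
```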

### Simulated Annealing (Vanilla)
🎯**Goal:** find better solutions around previously found local optima, given problems with a particular structure (e.g. `Big Valley`).
**How?** allow `stochastic moves` in the neighborhood and `allow worsening moves`
- (allowing worsening moves with a deterministic approach can lead to infinite looping)
**Simulated Annealing:** metallurgy analogy, slow cooling gives more chances of finding states with lower internal energy
- 👍 preferable where exact algorithms fail (e.g.: gradient descent, branch and bound, ...)
- 👎 bad if the exact optimum is needed, either local or global

- based on the theory of `Markov processes`, generate new samples based on the latest points
- :warning: memory-less, does not learn from experience
- `Temperature`: parameter $T$ which determines the **probability that worsening moves are accepted**
- (always accept a selected neighbor if is improving one)
	- $T\rightarrow \infty$ : random walk
	- $T\rightarrow 0$ : standard local search
- :warning: $T_{\text{start}}$, $T_{\text{end}}$, and the `cooling rate` must be set according to the problem in order to avoid very poor results
	- $T_{\text{end}}$ can be used as an end condition, but it does not have to be
- 🔥 **neighbor selection is random** (or something based on it...)
- Works with both discrete and continuous problems
	- `Discrete`: apply a limited set of local changes to the current solution
	- `Continuous`: sample with a probability distribution centered on the current point (e.g.: with Gaussian distribution)
- 👎 asymptotic convergence is guaranteed, but it is usually too slow (can be even slower than enumeration)
- _repeated local search_ and even _pure random search_ can have better asymptotic results
- :x: vanilla SA is pretty bad, see Ch.30 for the smarter "online-learning" version, which sacrifices theoretical guarantees using heuristics for better results in many practical cases
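A minimal sketch of vanilla SA with a geometric cooling schedule (parameter values and the toy function are my own assumptions):

```python
import math, random

def simulated_annealing(f, neighbor, x0, t_start=1.0, t_end=1e-3,
                        cooling=0.95, steps_per_t=100):
    """Vanilla SA: always accept improving moves, accept worsening moves
    with probability exp(-delta / T); T follows a geometric cooling schedule."""
    x, fx = x0, f(x0)
    best, fbest = x, fx
    t = t_start
    while t > t_end:
        for _ in range(steps_per_t):
            y = neighbor(x)                     # random neighbor (stochastic move)
            fy = f(y)
            delta = fy - fx
            if delta <= 0 or random.random() < math.exp(-delta / t):
                x, fx = y, fy
                if fx < fbest:
                    best, fbest = x, fx
        t *= cooling                            # slow "cooling" of the temperature
    return best, fbest

# continuous toy usage: Gaussian perturbation around the current point
f = lambda x: (x - 1.0) ** 2 + 0.3 * math.sin(8 * x)
print(simulated_annealing(f, lambda x: x + random.gauss(0, 0.5), x0=5.0))
```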
### Statistical Inference
Collect useful statistics to **influence the sampling**.
**Branch and Probability Bound:** generalizes branch and bound to continuous variables when no error-proof bound is available
- :warning: feasible only in low dimensional space (`d<10`), number of tree nodes tends to explode
- Iterative:
1. split (branch) the set of admissible input values into subsets, obtaining a tree organization
- usually `hyper-rectangles`
2. calculate the potential value of the individual subsets for further investigation
- statistics by sampling the subspace
3. select the most interesting subsets for additional splits.
**Random Multistart:** repeat local search from random initial points; but how many samples are needed in order to find all possible local optima?

- 👁️‍🗨️ the size and form of the **attraction basins** depend on the specific local search scheme (which may for example filter out very small sub-basins)
- _statistical inference_ can be used to estimate the number of trials with a **probabilistic guarantee** that all `local minimizers` have been found
- (one of them should be the global optimum, or a close approximation)
#### Inertial Shaker
Candidate points are generated in an `adaptive search box` and a moving average of the steps filters out evaluation noise and high-frequency oscillations.
- 👍 simple practical choice to go beyond Pure Random Search
👁️‍🗨️ A more complex and less scalable version is the _Reactive Affine Shaker_.

- `Inertial Shaker:` at each step, apply a random double shot in a box around the current input point; move only if it is successful, otherwise retry with another double shot
	- `trend` direction is adapted at each successful step

- `Double shot:` apply a random displacement (one component per dimension) and keep it if it improves the result
	- (the second shot, in the opposite direction, is tried only if the first shot does not improve the solution; both are canceled if neither improves)
- if double shot fails $\implies$ shrink search box
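A heavily simplified sketch of the double-shot / adaptive-box idea (the expansion/shrink factors, the inertia weight and the trend update below are my own assumptions, not the book's algorithm):

```python
import random

def inertial_shaker(f, x0, box0=1.0, n_iter=1000, expand=1.2, shrink=0.5, inertia=0.7):
    """Random "double shots" inside an adaptive box, plus a moving-average trend."""
    x = list(x0)
    fx = f(x)
    box = [box0] * len(x)
    trend = [0.0] * len(x)                  # moving average of recent successful steps
    for _ in range(n_iter):
        delta = [random.uniform(-b, b) + inertia * t for b, t in zip(box, trend)]
        moved = False
        for sign in (+1, -1):               # double shot: try delta, then -delta
            y = [xi + sign * di for xi, di in zip(x, delta)]
            fy = f(y)
            if fy < fx:
                step = [sign * di for di in delta]
                trend = [inertia * t + (1 - inertia) * s for t, s in zip(trend, step)]
                x, fx = y, fy
                box = [b * expand for b in box]    # widen the search box on success
                moved = True
                break
        if not moved:
            box = [b * shrink for b in box]        # shrink the box if both shots fail
    return x, fx

print(inertial_shaker(lambda v: (v[0] - 2) ** 2 + (v[1] + 1) ** 2, [5.0, 5.0]))
```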

## Ch.26 - Derivative-Based Optimization
[287-291] [293 (bisection)] [295-296 (multi-dimension)] [298-299 (gradient descent)]
🎯`Chapter context:` show some methods to **optimize smooth functions** (differentiable) of continuous variables by using gradients to help with the search direction.
...derivatives review...
### One-dimensional Optimization
`Roots:` points where a function is zero; finding the roots of the derivative (possibly a black box) yields candidate optima.
- found by **solving local models** (_Taylor series_, _Newton's method_, _bisection_, ...)
**Newton's Method** for root finding: based on tangents
- (note: not among the pages assigned by the professor)
- 👍 fast convergence
- need to start sufficiently close to the target root

**Bisection method** for root finding: `binary search`...
- safer alternative to Newton's method for convergence
- 👍 convergence in a logarithmic number of steps
- nice, but not easily extended to more than one dimension
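A minimal bisection sketch for root finding, applied to the derivative of a toy function (the example function is illustrative):

```python
def bisection(g, lo, hi, tol=1e-8):
    """Find a root of g in [lo, hi], assuming g(lo) and g(hi) have opposite signs."""
    glo = g(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        gmid = g(mid)
        if glo * gmid <= 0:         # the sign change (root) is in the left half
            hi = mid
        else:                       # the sign change is in the right half
            lo, glo = mid, gmid
    return 0.5 * (lo + hi)

# to optimize f, look for a root of its derivative
df = lambda x: 2 * (x - 3)          # derivative of f(x) = (x - 3)**2
print(bisection(df, 0.0, 10.0))     # ~ 3.0
```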

👁️‍🗨️ If derivatives are not available or they are too costly to calculate, one can approximate them with the `secant method`.
### Multi-dimensional Optimization
Finding the minimum of the quadratic model by `matrix inversion` is often neither the most efficient nor the most robust approach when the linear system becomes very large, a frequent case in machine learning.
**Ill-conditioning:** solving a linear system can be hampered by numerical instability, due to finite-precision computation.

#### Gradient Descent
- discrete case analogy: `local search` (naive, use local information to make the next best choice)
- needs a careful choice of the step size $\epsilon$
- :warning: sensitive to `ill conditioning`: tends to zig-zag if the data is stretched (elongated level curves)
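A minimal gradient-descent sketch on an ill-conditioned quadratic, showing the zig-zag / slow-convergence behavior (step size and matrix are illustrative choices):

```python
import numpy as np

def gradient_descent(grad, x0, eps=0.035, n_iter=200):
    """Plain gradient descent with a fixed step size eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - eps * grad(x)              # step against the gradient
    return x

# ill-conditioned quadratic 0.5 * x^T A x: curvatures 1 and 50
A = np.diag([1.0, 50.0])
grad = lambda x: A @ x
# the iterates oscillate (zig-zag) along the stiff axis and crawl along the flat one
print(gradient_descent(grad, [5.0, 5.0], eps=0.035))
```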

## Ch.27 - RSO: Intro
[313-320]
❗`Motto:` "learn on the job": generate internal explicit models from the search experience on a given instance of a black-box problem.
**Reactive Search Optimization (**`RSO`**):** use `machine learning` and `statistics` for automated **online self-tuning** and **dynamic adaptation**.
- **Online learning:** implicitly accumulate knowledge of a black-box optimization problem as the function gets evaluated on newly generated input points (`self-tuning`)
- need appropriate data structures for `saving` and `retrieving` past configurations (usually, RSO overhead is small)
The following chapters (up to Ch.32) will further present RSO. To summarize, **RSO can adapt in an online manner** at least meta-parameters like:
- `prohibition periods` (stay away from these areas) $\longrightarrow \text{Ch.27}$
- `neighborhood` (if no improving move, get another neighborhood) $\longrightarrow \text{Ch.28}$
- `iterations` of local search (kick the system to a new attractor) $\longrightarrow \text{Ch.29}$
- dynamic `modifications of the objective function` (push up local minima) $\longrightarrow \text{Ch.31}$
- amount of `randomness` (use noise to jump out of an attractor) $\longrightarrow \text{Ch.30?}, \text{Ch.31}$
### Tabu-RSO
RSO variant based on prohibitions.
**Prohibitions:** stay away from certain areas in order to find new solutions, optimized online
- encourage `diversification`
- at a given time $t$, the set of moves is split between _admissible moves_ and _tabu_ moves (prohibitions)
- in general, the `inverses` to recent moves are prohibited for a time dependent period $T^{(t)}$

- bigger $T$ forces more searching steps before allowing a reversing move (increases the minimum loop size)
- tends to form a `lasso` around a local optimum
🎯**Goal:** for a local configuration, `find the minimal prohibition` value needed to escape the attraction basin in the area.
- 👁️‍🗨️ determine $T$ in a **reactive way**, where its value changes according to the local structure of the problem

Initially $T=1$; it increases if trapped in an attraction basin, decreases if new regions are visited.
(note: this still seems too naive; it seems easy to just keep looping around the same attraction basins)
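A simplified sketch of the prohibition idea on bit-string solutions (the specific reaction rule for $T$ below is a rough stand-in for the book's reactive scheme, not its actual algorithm):

```python
def reactive_tabu_search(f, x0, n_iter=1000, t_init=1, t_max=50):
    """Flipping bit i prohibits flipping it again for T steps; T grows when
    configurations are revisited, and relaxes when new regions are explored."""
    x = list(x0)
    best, fbest = x[:], f(x)
    T = t_init
    last_flip = [-10**9] * len(x)          # iteration at which each bit was last flipped
    seen = {}                              # configuration -> last iteration it was visited
    for step in range(n_iter):
        admissible = [i for i in range(len(x)) if step - last_flip[i] > T]
        if not admissible:                 # everything prohibited: allow all moves
            admissible = list(range(len(x)))

        def flipped_value(i):              # objective after flipping bit i (then restore)
            x[i] ^= 1
            v = f(x)
            x[i] ^= 1
            return v

        i = min(admissible, key=flipped_value)   # best admissible move, even if worsening
        x[i] ^= 1
        fx = f(x)
        last_flip[i] = step
        if fx < fbest:
            best, fbest = x[:], fx
        key = tuple(x)
        if key in seen and step - seen[key] < 2 * T:
            T = min(t_max, T + 1)          # repetition detected: increase the prohibition
        else:
            T = max(t_init, T - 1)         # new region: relax the prohibition
        seen[key] = step
    return best, fbest

# toy usage: minimize the number of zeros in the bit string
print(reactive_tabu_search(lambda b: b.count(0), [0] * 12)[1])
```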

## Ch.28 - RSO: Adapting Neighborhoods and Selection
[327-332]
**Variable Neighborhood Search (**`VNS`**):** [meta-heuristic](http://www.scholarpedia.org/article/Metaheuristic_Optimization) which adapts the neighborhood at every step, both in descent to local minima and in escape from the valleys.
- Set of `neighborhoods defined a priori`; during the search, use the most appropriate ones
- `Considerations:` **reactive nature**, many possible implementations (does not have to use online learning)
	- How to set up the `neighborhoods`? How many? Are they disjoint?
		- help from problem knowledge or experimentation
	- How to set up `neighborhood selection`? order of consideration? transitioning?
		- many possibilities: from `randomly cycling` neighborhoods, to `online learning` (e.g.: minimal diversification strategy)
	- How to `evaluate a neighborhood`? Use a **local search** heuristic (choosing only an improving neighbor)
- `highest descent` heuristic
- `first descent` heuristic

**Reduced-VNS:** randomly extract one neighbor from the current neighborhood, move only if improving
- neighborhoods are usually ordered according to the `strength of the perturbation`
- `Variable Neighborhood Descent` (VND): deterministic alternative

- if neighbor is not improving $\implies$ pass to larger neighborhood and repeat
- else $\implies$ turn back to default neighborhood
**Skewed-VNS:** accept also `worsening moves` if they lead the search trajectory sufficiently far
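A minimal Reduced-VNS-style sketch (the neighborhoods, the toy function and the parameters are illustrative assumptions):

```python
import random

def variable_neighborhood_search(f, x0, neighborhoods, n_iter=1000):
    """Sample one random neighbor from the current neighborhood; move and reset
    to the first neighborhood on improvement, otherwise try a stronger one."""
    x, fx = x0, f(x0)
    k = 0                                        # index of the current neighborhood
    for _ in range(n_iter):
        y = neighborhoods[k](x)                  # random neighbor from neighborhood k
        fy = f(y)
        if fy < fx:
            x, fx = y, fy
            k = 0                                # improvement: back to the first neighborhood
        else:
            k = min(k + 1, len(neighborhoods) - 1)   # no luck: larger perturbation
    return x, fx

# toy usage: neighborhoods of increasing strength (Gaussian moves of growing width)
f = lambda x: x * x + 10 * abs(round(x) - x)     # made-up function with many local dips
moves = [lambda x, s=s: x + random.gauss(0, s) for s in (0.1, 1.0, 5.0)]
print(variable_neighborhood_search(f, 20.0, moves))
```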


## Ch.29 - RSO: Iterated Local Search
[333-337]
**Repeated Local Search:** memory-less, repeat local search from different starting points, keep best solution.
**Iterated Local Search (**`ILS`**):** another local search extension: restart local search from a perturbation of the previous local optimum.
- assumes a `clustered distribution` of local minima $\implies$ easier to find good local minima near a local minimum than near a random point
- `keeps memory` of the found local minima
- 🦶 based on `kicks`: perturbations have to be of the right size though
	- too weak $\implies$ fall back to the same local minimum
	- too strong $\implies$ memory-less random sampling
	- 👁️‍🗨️ should adjust in a `reactive` manner (increase the kick only if necessary)
(note: how does it compare to VNS?)
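A minimal ILS skeleton; `local_search` and `kick` are assumed to be user-supplied routines (illustrative, not the book's code):

```python
def iterated_local_search(f, x0, local_search, kick, n_iter=100):
    """Descend, "kick" the local optimum, descend again; accept the new local
    optimum only if it improves (simplest acceptance rule)."""
    x, fx = local_search(f, x0)
    for _ in range(n_iter):
        y, fy = local_search(f, kick(x))   # restart from a perturbed copy of the optimum
        if fy < fx:
            x, fx = y, fy                  # keep memory of the best local optimum found
    return x, fx
```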



## Ch.30 - RSO: Online learning in Simulated Annealing
[339-342]

**Gist:** vanilla "Markov-process" SA is dumb, consider smarter extensions that use memory.
- 👎 with the new mechanisms, the mathematical guarantees are no longer valid (e.g.: global convergence)
- 👍 usually much better in practice
POSSIBLE IMPROVEMENTS:
- **Phase transition:** use 2 cooling rates, $\alpha$ and $\beta$, use the faster rate during a _phase transition_, that is, when the _specific heat_ is maximal
	- 💡`Idea:` adjust temperature scaling and cooling by collecting statistics (estimate the distribution of the values of $f$)
		- max temp scale $=\sigma_f$ (standard deviation)
		- min temp scale $=$ minimum change (!?)
	- `specific heat` $=\sigma_f^2/T$, the variance of the energy divided by the temperature
- when variance is very high at a given temperature $\implies$ a phase transition is occurring there
- **Re-heating:** cooling can lead to `freezing` at local minimum (infinitely more difficult to find better values), use **non-monotonic cooling schedules** to avoid that
	- :warning: same old story: reheating must be strong enough not to fall back into the current minimizer, but not so strong that the search becomes a random walk
- can use online learning for runtime adjusting
Reactive SA solutions vs **SequenceHeuristic**: in the latter, accept a worsening solution if a given number of trials fail (evidence of entrapment)
- both perform well, probably thanks to their **acceptance of bad perturbations** (not some magic in the annealing itself)
**Adaptive Simulated Annealing (**`ASA`**):** very fast simulated re-annealing, automatic runtime adjustment of parameters (cooling schedule, step selection, ...)
- continuous space
- generate random `steps of adaptive size` (adjusted so that about half of the moves are accepted)
- adjust step size and temperature to concentrate sampling in promising areas

## Ch.31 - RSO: Dynamic Landscapes and Noise Levels
[343-348]
How to adaptively **modify the objective function**, depending on the search history and current state, in order to steer the search trajectory away from the current local optimizers?

**GSAT:** algorithm for SAT, used as reference for possible extensions that modify the objective function

- `Breakout method:` increase the weight of violated constraints until the search escapes from the local minimum
- `Clause-weighting method:` weight clauses to determine which variable to flip
	- improves GSAT for problems characterized by strong asymmetries
	- can give more priority to satisfying the "most difficult" clauses
	- `Problem:` clause weights can increase too much if the clauses never get satisfied
		- can `reduce` the clause weight when a clause is satisfied, or
		- can use a `weight decay` scheme (consider the recency of moves)
	- many other versions...
**👁️‍🗨️Moral:** weights can be used to make local minima disappear, but be aware of possible side effects (new spurious local minima, global minimum shadowing, ...)
### Guided Local Search
**GLS:** similar to approaches for GSAT, based on **dynamic penalties** to distribute the search effort over the regions of search space
- augmented cost function:
$$
h(X)=f(X)+\lambda \sum_ip_iI_i(X)
$$
- $I_i(X)$ "indicates" if feature $i$ is present in solution $X$
- $p_i$ is the penalty value for feature $i$
- $\lambda$ is the diversification factor, reactive implementation is suggested
- when local minimum is found, increase by one the penalties which maximize $I_i(X')\dfrac{c_i}{1+p_i}$
- $c_i$ is the cost of feature $i$
- 👁️‍🗨️ _features_ and _costs_ often come directly from the objective function, they are not adaptive
	- 👍 $c_i$ promotes the penalization of more costly features
	- 👍 $p_i$ promotes the penalization of features that have not been penalized many times
- can be seen as a special case of Tabu search (?!)
- usually combined with **fast local search (**`FLS`**):** evaluate only a subset of neighbors where you expect improving moves, avoid brute-force evaluation
	- 👍 much faster
	- 👍 FLS does not impact the dynamics and does not change the search trajectory (not sure)
	- can adapt the neighborhood based on the penalized features, analogous to `tabu-RSO`
- :warning: still sensitive to the problems related to modifying the objective function, so be careful
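A hedged sketch of the GLS penalty update; the feature/cost representation, $\lambda$, and the `local_search` routine are assumptions for illustration, not the book's code:

```python
def augmented_cost(f, x, features, penalties, lam):
    """h(x) = f(x) + lam * sum_i p_i * I_i(x), with I_i given by boolean feature tests."""
    return f(x) + lam * sum(p for feat, p in zip(features, penalties) if feat(x))

def guided_local_search(f, x0, features, costs, local_search, lam=0.3, n_rounds=50):
    penalties = [0] * len(features)
    x, best, fbest = x0, x0, f(x0)
    for _ in range(n_rounds):
        # descend on the augmented cost h around the current point
        x = local_search(lambda y: augmented_cost(f, y, features, penalties, lam), x)
        if f(x) < fbest:
            best, fbest = x, f(x)
        # at the local minimum of h, penalize the features with maximal utility
        utils = [feat(x) * c / (1 + p) for feat, c, p in zip(features, costs, penalties)]
        top = max(utils)
        penalties = [p + 1 if u == top and top > 0 else p
                     for p, u in zip(penalties, utils)]
    return best, fbest
```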
### Adapting Noise
Adapting the level of randomness (like in Simulated Annealing) is another mechanism to **trade-off intensification and exploration**.
- good calibration avoids searches which are either too greedy or too much like random walks

## Ch.33 - Linear and Quadratic Programming
[357-364]
**Moral:** check if problem is a simple one before considering general-purpose optimization methods.
**Linear Programming/Optimization:** solution of problems with a `linear objective function` and `linear constraints`.
- polynomial time (if $\textbf{x}$ does not have to be integers)
- **Canonical form:**
$$
\text{maximize} \quad \textbf{c}^T\textbf{x} \\
\text{subject to} \quad A\textbf{x} \le \textbf{b} \\
\text{and} \quad \textbf{x} \ge 0
$$
- where:
- $\textbf{x}$ represents a vector of $n$ _variables to be determined_
- $\textbf{c}$ and $\textbf{b}$ are vectors of _coefficients_
- $A$ is a matrix $m \times n$ of coefficients, one row $\forall \ m$ constraints
**Convex polytope:** each row $A_i\textbf{x} \le b_i$ describes a half-space in $n$-dimensions, the intersection of all the $m$ constrained spaces results in a convex region
🔥`Convex feasible region` + `Linear objective function`
- $\implies$ **local optimum == global optimum**
- $\implies$ **an optimal solution lies on a vertex**

Example: the image shows linear optimization with 2 variables (2D) and 5 constraints. The linear cost function is maximized at the rightmost vertex.
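A small example of solving such a 2-variable LP with SciPy (my own illustration; the numbers are made up, and `linprog` minimizes, so the sign of $\textbf{c}$ is flipped to maximize):

```python
from scipy.optimize import linprog
import numpy as np

c = np.array([3.0, 2.0])                 # maximize 3*x1 + 2*x2
A = np.array([[1.0, 1.0],                # A x <= b
              [2.0, 1.0]])
b = np.array([4.0, 6.0])

res = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)                   # optimal vertex (2, 2) and objective value 10
```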
**Simplex algorithm** for linear programming
- 👍 global optimum guaranteed (due to the linear programming assumptions)
- 👎 exponential worst-case complexity (but it is rarely exponential in practice)

1. select a `starting vertex` (can be selected from a simplified version of the original problem)
2. **local search** steps: move only to a neighboring vertex with a higher objective value
3. end when `local maximum` is reached (all its neighbors have smaller objective value)
**Integer Linear Programming (**`ILP`**):** vector $\textbf{x}$ can have only integers
- more difficult than the continuous case (`NP-hard` just to tell if it has a feasible solution)

- **Relaxation approach:** relax ILP problem to LP problem and solve, then round $\textbf{x}$ to closest integers
	- 👍 better complexity
	- 👎 no optimality guarantee
	- 👎 no feasibility guarantee
**Quadratic Programming:** optimize a `quadratic function` subject to `linear constraints`.

## Ch.34 - Branch and Bound, Dynamic Programming
[365-372]
### Branch and Bound
Try to detect avoidable enumerations when exploring the tree.

- nodes are partial solutions, leaves are complete solutions
- parts of the tree can be `pruned` when it is known that they cannot lead to optimal solutions
- `bounds:` conditions that are updated during the search, not respecting them leads to pruning
- `branching:` assigning a value to a variable in a partial solution (non-leaf node) and recursively expanding the subtree
- `backtracking:` when branching leads to a dead end, return to previous steps and try unexplored alternatives
- 👎 still exponential, like enumeration
### Dynamic Programming
Avoid repeated calculations by remembering partial results.
`When?` problem solution is combination of solutions from its sub-problems, and sub-problems can be shared.
**Viterbi's algorithm:** ...
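A minimal illustration of the memoization idea (my own, not from the book): overlapping subproblems are computed once and cached, turning an exponential recursion into a linear one.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # each subproblem fib(k) is solved once and then reused
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(80))   # instantaneous thanks to the cached shared subproblems
```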

## Ch.35 - Satisfiability
[373-380] [389-391]
**Context:** `SAT` and `MAX-SAT` are good _"toy problems"_ to apply the principles we have seen so far.
Solving SAT is a prototypical form of the more general **constraint programming (**`CP`**)**.
**SAT:** tell if boolean formula of boolean variables is satisfiable (all clauses are true).
**MAX-SAT:** given a formula in `conjunctive normal form` (CNF), find the assignment of the $n$ variables that satisfies the maximum number of the $m$ clauses.
- `MAX W-SAT`: version where clauses have a weight
### Resolution and Splitting for SAT
Can adapt the `branch and bound` method for SAT: resolution.
**Resolution rule:** produces a new clause (`resolvent`) implied by two clauses containing `complementary literals`.
- **Resolvent:** the clause produced by the resolution rule; it contains all the literals of the two parent clauses **except the complementary pair**
- 🔥 an empty resolvent can be derived $\iff$ the original formula is `unsatisfiable`
- if it is not a tautology, it is added to the clause set for further resolution inferences
- 👁️‍🗨️ **it does not alter the set of solutions**
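A tiny worked example (my own, not from the book): resolving $(x_1 \lor x_2)$ with $(\neg x_1 \lor x_3)$ on the complementary pair $x_1 / \neg x_1$ gives the resolvent $(x_2 \lor x_3)$; resolving the two unit clauses $(x_1)$ and $(\neg x_1)$ yields the empty resolvent, proving unsatisfiability.
$$
\frac{(x_1 \lor x_2) \qquad (\neg x_1 \lor x_3)}{(x_2 \lor x_3)}
\qquad\qquad
\frac{(x_1) \qquad (\neg x_1)}{\square}
$$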

- **DP (**`Davis-Putnam`**) algorithm:** based on resolution rules, removes one variable at a time by using all possible resolvents on that variable.
	- 👎 memory explosion
**Splitting rule:** at each node, pick a variable and split the problem into two sub-problems, one for each truth assignment; neither sub-problem contains the selected variable
- if a clause does not contain the selected variable, it goes into both sub-problems
- if the variable is set to _true_, clauses containing the positive literal are satisfied and dropped, and the negated literal is removed from the remaining clauses (and vice versa)

- can use `backtracking`
- **DPLL algorithm:** limit memory explosion by replacing the resolution rule with the splitting rule.
	- 👍 better memory usage than DP, but still exponential

### Linear Programming for MAX W-SAT
- can use `branch and bound` with `ILP`
- `linear programming relaxation:` gives an upper bound on the number of satisfied clauses
	- 👎 poor bounds in practice
### Continuous Approaches
**Interior point algorithm:** ...
In some techniques the `MAX-SAT` (or `SAT`) problem is transformed into an **unconstrained optimization problem** on the real space and solved by using existing global optimization techniques.
### Approximation Algorithms
Approximate best MAX-SAT solution.
- `greedy construction` strategies
- **GreedyJohnson:** iteratively assign the literal that occurs in most clauses and delete satisfied clauses
- $1/2$ guaranteed performance
### Local Search for SAT
`Perturbative local search` is very effective for MAX-SAT.
Different variations of local search with **randomness** techniques have been proposed for SAT and MAX-SAT.

## Ch.36 - Strategic Modeling (note: title uncertain)
[403-412]
**Context:** data-driven models for which experimental data is not available at the beginning, and needs to be acquired in a strategic manner.
π―**Goal:** collect data by minimizing cost (experiments).
- called `design of experiments` in science
- called `active learning` or `query learning` in ML
If running the real model is too costly $\implies$ can build a `surrogate` (approximated) model based on the executed experiments.
### Design of Experiments (DOE)
Systematic and principled manner to generate examples to characterize, predict, and then improve the behavior of any system or process.
**Goal:** get a surrogate model from experiments.

- `rule of thumb:` the number of examples has to be larger than the number of parameters
**Full factorial design:** generate sample points on a **regular grid**, based on the number of levels of each `factor` (variable)
- 👍 simple
- 👎 grid-like nature (may miss relevant effects)
- 👎 very expensive (exploding number of samples wrt. number of factors)
- `reduced factorial designs` alleviate this

- Figure: full-factorial DOE with 5 levels for each factor, in three dimensions
**Randomized design:** pseudo Monte Carlo sampling
- 👍 simple
- 👍 can get interesting points that grids would miss
- 👍 good in the absence of detailed and trusted information about the physical system under analysis
- 👎 can leave large regions of the design space unexplored (needs a large sample)
- `stratified Monte Carlo sampling` can solve this by subdividing in equal bins and sampling for each one (reduces the needed sample size)

**Latin hypercube sampling (**`LHS`**):** its motivation is to ensure that, for a given number of samples $k$, and $k$ equally-spaced bins, for all one-dimensional projections of the samples there will be one and only one sample in each bin.
- `Latin square:` if there is only one sample in each row and each column (like sudoku)
- 👍 good representation with fewer samples
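A minimal LHS sketch (the bin construction below is a standard recipe; seed and sizes are illustrative):

```python
import numpy as np

def latin_hypercube(k, d, seed=0):
    """k samples in [0, 1]^d: exactly one sample per bin in every 1-D projection."""
    rng = np.random.default_rng(seed)
    samples = np.empty((k, d))
    for j in range(d):
        perm = rng.permutation(k)                    # bin assigned to each sample
        samples[:, j] = (perm + rng.random(k)) / k   # uniform point inside its bin
    return samples

pts = latin_hypercube(5, 2)
print(np.sort(np.floor(pts * 5), axis=0))   # every bin index 0..4 appears once per column
```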

**Surrogate model (**`emulator`**):** a simplified model of the outcome is used when an outcome of interest cannot be easily directly measured.
- usually constructed based on the results of a `simulator` (seen as `black-box`)
- also called `Response Surface Methodology` (RSM)
- `iterated RSM` in closer areas can be used to ensure that an optimal point for the surrogate model corresponds to the one in the real system

### Active or Query Learning
At each iteration, active learning methods **select the inputs maximizing an informativeness measure**, which estimates how much an input would improve the current model if it was added to the current training set.
- **Uncertainty sampling (**`US`**)** principle: input with highest `predictive uncertainty` as the most informative training example for the current model
- can estimate confidence on predictions with **Gaussian Process Regression**
- **Query-by-committee strategy:** use a committee of models trained on the current set of examples
	- they have competing hypotheses
- the input considered most informative is the one on which they `disagree the most`
- **Expected model change** principle: most informative input is the one that would change the model the most (has the greatest influence on model parameters)
- **Expected error reduction:** select the input that is expected to yield the greatest reduction in the `generalization error` of the current model
- computationally expensive
- **Variance reduction** approach: query the input that minimizes the model variance
- assumption that the bias error is negligible

## Ch.39 - Genetics, Evolution and Nature-Inspired Analogies
[459-465]
**Context:** analogies from nature can be `inspiring` but also `misleading` when they are translated into procedures for problem solving and optimization.
- e.g.: **Population-based algorithms:** using multiple interacting search streams
- Similar ideas: ensembles, pools, agents, committees, multiple threads, mixture of experts, `genetic algorithms`, particle swarms, `evolution strategies`
- :warning: analogies with nature are not always justified and more effective schemes can be developed, do not need to imitate nature for the sake of imitating it
**Goal:** let's concentrate on the relationships between _Genetic Algorithms_ and _Intelligent Search_ strategies.
### Genetic algorithms and Evolution Strategies
**Idea:** `natural selection`, each individual is a candidate solution described by its **genetic content**.
- **Mutation:** randomly change bit values (gene $==$ position?)
- usually very rare
	- used to explore the neighborhoods (`perturbative search`)
	- :warning: Careful **rate** choice. Too small and we are stuck near the `initial solution`, too big and we get `random restart`.
- **Recombination:** uniform `cross-over`, parents have a $1/2$ chance of passing each bit
	- applied to random couples, creates two offspring
- **Cross-over:** dubious analogy, does it work?
- similar parents can lead to `premature convergence`
- `fitness function:` defines the suitability for environment
- fitter individuals produce a larger offspring
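A tiny GA sketch on the OneMax toy problem (truncation selection is used here as a simplification of fitness-proportional selection; all parameters are illustrative):

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=30, generations=100, mutation_rate=0.01):
    """Uniform crossover + rare bit-flip mutation + simple truncation selection."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(children) < pop_size:
            p1, p2 = random.sample(parents, 2)
            child = [random.choice(pair) for pair in zip(p1, p2)]     # uniform crossover
            child = [b ^ 1 if random.random() < mutation_rate else b for b in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# toy usage: OneMax, i.e. maximize the number of 1s
print(genetic_algorithm(sum, n_bits=20))
```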

There are at least 3 **critical issues** when adopting biological mechanisms for solving optimization problems:
1. not easy to demonstrate that they are `effective`
2. not easy to demonstrate that they are `efficient`
3. is the biology-based approach intrinsically better? (e.g. individual experience is not passed on in genes)
`Hybridized Genetic Algorithms:` based on **Baldwin effect** idea (?)
- **Memetic algorithms:** extension of GA, add learning during individuals life (local search)
	- 🔥 How to implement `individual learning` during lifetime?
		- **Lamarckian evolution:** replace the initial genotype with the better solution identified by local search
		- (Baldwinian alternative) local search does not modify the genotype; the `fitness function` is evaluated after learning, rewarding the "learning potential"
👁️‍🗨️ More efficient algorithms are based on a simpler _"one parent for one descendant"_ recombination. We don't have to imitate nature.

## Ch.40 - Multi-Objective Optimization
[467-473]
**Context:** most real-world problems cannot be cast into the simple and mathematically crystal-clear form of minimizing a single function $f(x)$.
- **Multi-Objective Optimization Problems (**`MOOP`**)** $\longrightarrow$ most problems have **more than one objective** to reach, to be maximized
- but have to **trade-off** between `conflicting objectives`
**MORSO (**`Multi-objective Reactive Search Optimization`**):** solution methods for multi-objective optimization characterized by **incremental and learning paths**.
### Pareto Optimality
Defining desirable objectives without spelling out the `trade-offs` and the `relative importance` of the different objectives can lead to ill-defined problems, as **objective functions can be conflicting**. (typically no single objective vector attains the best value for every individual objective)
- **Pareto Optimal:** a point $\textbf{x}$ such that there is no other point that dominates it in $f$
- $\textbf{x}\prec\textbf{x}'$ means that the objective vector $\textbf{x}$ `dominates` $\textbf{x}'$ (has no worse values than it, and at least one value that is better)
- **Pareto frontier:** set of Pareto optimal points, typically the whole set is `unknown`
- 🔥 Pareto-optimal solutions cannot be improved in any of the objectives without degrading at least one of the other objectives
- 👁️‍🗨️ designer can make trade-offs within this set, rather than considering the full range of every parameter
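A minimal sketch of the dominance check and of filtering a Pareto front (assuming minimization; the points are made up):

```python
def dominates(a, b):
    """True if objective vector a dominates b: no worse in every objective
    and strictly better in at least one (minimization convention)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 5), (2, 3), (3, 3), (4, 1), (5, 4)]
print(pareto_front(pts))        # [(1, 5), (2, 3), (4, 1)]
```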

Multi-objective approaches based on Pareto frontier:
- **Linear Scalarization (**weighted-sum**):** obtain and optimize a `single objective` function by using linear combination
	- 👎 if the Pareto front is not convex $\implies$ some Pareto-optimal solutions cannot be identified

- **Tchebycheff approach:** ? uses single-objective local search?
	- 👎 the aggregation function is `not smooth` for a continuous MOOP
	- used in **MOEA/D:** multiobjective evolutionary algorithm based on decomposition
		- decomposes a multi-objective optimization problem into a number of scalar optimization subproblems (using a set of different weight vectors, one for each scalarized problem) and optimizes them simultaneously
- **Pareto local search (**`PLS`**):** natural extension of single objective local search methods
- used in **MOMAD:** memetic algorithm based on `decomposition`, is a `population-based` iterative method and thus an `anytime algorithm`

## TODO
- `stochastic global optimization` examples?
- `reactive` vs `adaptive` vs `online learning`?
- `Inertial shaker` vs `reactive affine shaker`?
- vanilla `simulated annealing` vs online learning based one?
- why vanilla `simulated annealing` has guaranteed global convergence?
- `iterated local search` vs `tabu-RSO`?
- "strategic oscillations": `re-heating SA` vs `tabu-RSO` vs `VNS`?
- `random multi-start` vs `repeated local search`?
- `GSAT` vs `WalkSAT`?
🎉 END 🎉