---
tags: Master
---
# Intelligent Optimization
[TOC]
## Introduction
**Exam:**
- `45-60 mins`
- questions: `true/false`, `multiple choice`, `number input`
- allowed material: `pen and paper`, `calculator`
👁️‍🗨️ once the exam has been completed, submit the answers but `do not disconnect` from the video-conference until the time is over
**Theory:** from the [LIONbook](https://intelligent-optimization.org/LIONbook/)
`Optimization for learning` vs `Learning for Optimization` (more like the focus of this course?)
## Ch.24 - Greedy and Local Search
[255-267]
🎯`Chapter context:` search for global optima of functions on **discrete variables** as input.
**Greedy and Local Search** are simple and effective ways to identify improving solutions based on `step-by-step` `short-sighted changes`.
🔥Greedy constructions and local search **can be combined**: build an initial complete solution with a greedy construction and improve it with local search (see the sketch at the end of this chapter).
| | Constructive Greedy 🧱 | Perturbative Local Search 🔄 |
| ------------------------------- | -------------------------------------------- | -------------------------------------------- |
| **Short-sighted local changes** | create an admissible solution from scratch | improve a complete solution by trial and error |
| **Optimal solution** | mostly no (`TSP`, ...), few yes (`MST`, ...) | generally no (stops at local optima) |
**Traveling Salesman Problem (**`TSP`**):** find the shortest closed path that visits every node exactly once.
- NP-complete
- 👎 the `greedy approach` sucks: non-optimal solutions that tend to be not much better than random
- `local search` can start from the greedy solution and get better solutions in less time, but still...
**Minimum Spanning Tree (**`MST`**):** connect all nodes with a tree with minimum total edge cost.
- polynomial
- 👍 the `greedy approach` guarantees an optimal solution
**Local Search:** repeatedly perturb (modify) current solution in search of better configurations.
- **Perturbation:** move from one configuration to neighboring one (usually better)
- stops at local minima/maxima
- :warning:`Critical issue:` selecting a **neighborhood structure** appropriate to a given problem

- **Big Valley property:** in many real-world problems local minima tend to be clustered and close to the global minimum
	- **"kicking"** a configuration out of a local minimum and repeating local search is more likely to yield a better minimum than restarting local search from a random configuration
- :warning: but some "big valleys" can be just local attractors
- exploited by `Variable Neighborhood Search` (VNS) and `Iterated Local Search` (ILS)
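To make the combination concrete, here is a minimal Python sketch (my own illustration, not from the book): a nearest-neighbor greedy construction for a random TSP instance, improved by 2-opt perturbative local search.

```python
import math, random

def tour_length(tour, pts):
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def greedy_tour(pts):
    """Constructive greedy: always move to the nearest unvisited city."""
    unvisited = set(range(1, len(pts)))
    tour = [0]
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda j: math.dist(pts[last], pts[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def two_opt(tour, pts):
    """Perturbative local search: reverse a segment whenever it shortens the tour."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                cand = tour[:i] + tour[i:j][::-1] + tour[j:]
                if tour_length(cand, pts) < tour_length(tour, pts):
                    tour, improved = cand, True
    return tour

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(15)]
t = greedy_tour(pts)
print(tour_length(t, pts), "->", tour_length(two_opt(t, pts), pts))
```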


## Ch.25 - Stochastic Global Optimization
[269-285]
🎯`Chapter context:` search for global optima of functions on **continuous variables** as input.
**Throwing random points** onto the input space can be a robust `brute-force` method for the continuous optimization case.
**Stochastic Global Optimization (**`SGO`**):** a suite of methods relying on the **random generation of sample points** in the search space, ranging from `global random search` to methods based on `probabilistic assumptions`.
- :heavy_check_mark: More popular than `deterministic global optimization` because _simpler_, insensitive to the structure of the objective function and feasibility region, and easily modifiable to guarantee _theoretical convergence_.
	- (deterministic requires stronger assumptions)
- 👎 practical efficiency: can be very slow (RSO to the rescue)
- **Global Optimization:** hard problem, usually combines `robust global schemes` with `fast local search schemes`
- `Eg:`**Multistart Local Search:** multiple runs of local search from interesting points (?)
- 👁️‍🗨️ "global schemes" try to narrow down the search area for the "local schemes"
- :warning: without strong function assumptions like `Lipschitz continuity` or "big valley", convergence cannot be ensured for sparse input sequences
### Lipschitz continuity
**Lipschitz continuity:** assumption that real-world problems have `limited variability`
$$
|f(x_1) - f(x_2)| \le K ||x_1 - x_2|| \quad ,K\ge0
$$

- for every evaluated point $x$, the **cone** $f(x) - K||x - x'||$ defines a lower bound that the function cannot cross, because it cannot change too fast

- it makes sense to place the next sample at the point of `minimum of the underestimate`
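A minimal 1-D sketch of this sampling rule (Shubert/Piyavskii-style; the test function and the constant `K` below are illustrative assumptions, not taken from the book):

```python
import numpy as np

def lipschitz_minimize(f, K, a, b, n_iter=20):
    """1-D sketch: sample where the piecewise-cone lower bound is lowest."""
    xs, ys = [a, b], [f(a), f(b)]
    grid = np.linspace(a, b, 1001)
    for _ in range(n_iter):
        # each evaluated point contributes a cone f(x_i) - K|x - x_i|
        cones = np.array(ys)[:, None] - K * np.abs(grid[None, :] - np.array(xs)[:, None])
        lower = cones.max(axis=0)                  # tightest lower bound from all cones
        x_next = float(grid[np.argmin(lower)])     # minimum of the underestimate
        xs.append(x_next)
        ys.append(f(x_next))
    i = int(np.argmin(ys))
    return xs[i], ys[i]

f = lambda x: np.sin(3 * x) + 0.5 * x              # toy function, Lipschitz with K = 3.5
print(lipschitz_minimize(f, K=3.5, a=0.0, b=4.0))
```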
### Pure random search
**Pure random search (**`PRS`**):** repeated `generation of random points` to be evaluated in the admissible region (usually from same distribution)
- simple but a good reference to benchmark against, easy to study properties
- 👁️‍🗨️ assumes non-pathological cases (`smooth boundaries`, ...)
**Convergence theorem:** if, for all possible histories of stochastic sample generation, the probability distributions $p_j(x)$ guarantee that any fixed ball is hit with probabilities that sum up to infinity, then an **arbitrary $\delta$-close approximation of the optimum will be found infinitely often**.
- _"hitting a ball"_: $p_j\big(B(\textbf{x}^*,\epsilon)\big)$ is the probability of sampling, with probability distribution $j$, a point inside a hyper-sphere $B$ with center $\textbf{x}^*$ and radius $\epsilon$
- `Uniform distribution` is an immediate way to satisfy the convergence requirements
- Take-home 🏠: **convergence is not difficult to prove**, just mix some uniform probability with your favorite sampling strategy that learns from history
	- but may not mean much in practice, as the convergence rate can be infeasibly slow
🔥**Curse of Dimensionality:** affects the convergence rate of PRS
- Goal: hit the ball $B(\textbf{x}^*,\epsilon)=\{\textbf{x}\in A: || \textbf{x}-\textbf{x}^*|| \le \epsilon\}$ with **i.i.d.** samples $\textbf{x}_j$ with distribution $p(\textbf{x})$
- $p(B)= \dfrac{\text{vol}(B)}{\text{vol}(A)}$
- but in a $d$-cube $A$ of side length 1, $p(B)$ is proportional to $\epsilon^d$
- in higher dimensions it is **exponentially more difficult to hit the ball**
- `Bonus intuition:` considering the optimal point $\textbf{x}^*$ as an array of $d$ coordinates, a random array $\textbf{x}_j$ is close only if all its randomly generated coordinates are close.
**Moral:** approximating the global optimum of `generic functions` is infeasible with many dimensions.
**Hope:** for `functions with special forms`, approximating the global minimum can be feasible. Are they common?
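A tiny back-of-the-envelope illustration of the curse (the numbers are illustrative only):

```python
# Probability of hitting B(x*, eps) with one uniform sample in the unit d-cube
# scales like eps**d, so the expected number of samples grows exponentially with d.
eps = 0.1
for d in (1, 2, 5, 10, 20):
    p_hit = eps ** d    # ignoring the constant volume factor of the ball
    print(f"d={d:2d}  p(hit) ~ {p_hit:.1e}  expected samples ~ {1 / p_hit:.1e}")
```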

### Simulated Annealing (Vanilla)
🎯**Goal:** find better solutions around previously found local optima, given problems with a particular structure (e.g. `Big Valley`).
**How?** allow `stochastic moves` in the neighborhood and `allow worsening moves`
- (allowing worsening moves with a deterministic approach can lead to infinite looping)
**Simulated Annealing:** metallurgy analogy, slow cooling gives more chances of finding states with lower internal energy
- 👍 preferable where exact algorithms fail (e.g.: gradient descent, branch and bound, ...)
- 👎 bad if the exact optimum is needed, either local or global

- based on the theory of `Markov processes`, generate new samples based on the latest points
- :warning: memory-less, does not learn from experience
- `Temperature`: parameter $T$ which determines the **probability that worsening moves are accepted**
- (always accept a selected neighbor if is improving one)
	- $T\rightarrow \infty$ : random walk
	- $T\rightarrow 0$ : standard local search
- :warning: $T_{\text{start}}$, $T_{\text{end}}$, and the `cooling rate` must be set according to the problem in order to avoid very poor results
	- $T_{\text{end}}$ can be used as an end condition, but it does not have to be
- 🔥 **neighbor selection is random** (or something based on it...)
- Works with both discrete and continuous problems
	- `Discrete`: apply a limited set of local changes to the current solution
	- `Continuous`: sample with a probability distribution centered on the current point (e.g.: with Gaussian distribution)
- 👎 asymptotic convergence is guaranteed, but it is usually too slow (can be even slower than enumeration)
- _repeated local search_ and even _pure random search_ can have better asymptotic results
- :x: vanilla SA is pretty bad, see Ch.30 for the smarter "online-learning" version, which sacrifices theoretical guarantees using heuristics for better results in many practical cases
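A minimal sketch of vanilla SA with a geometric cooling schedule (parameter values and the toy function are my own assumptions):

```python
import math, random

def simulated_annealing(f, neighbor, x0, t_start=1.0, t_end=1e-3,
                        cooling=0.95, steps_per_t=100):
    """Vanilla SA: always accept improving moves, accept worsening moves
    with probability exp(-delta / T); T follows a geometric cooling schedule."""
    x, fx = x0, f(x0)
    best, fbest = x, fx
    t = t_start
    while t > t_end:
        for _ in range(steps_per_t):
            y = neighbor(x)                     # random neighbor (stochastic move)
            fy = f(y)
            delta = fy - fx
            if delta <= 0 or random.random() < math.exp(-delta / t):
                x, fx = y, fy
                if fx < fbest:
                    best, fbest = x, fx
        t *= cooling                            # slow "cooling" of the temperature
    return best, fbest

# continuous toy usage: Gaussian perturbation around the current point
f = lambda x: (x - 1.0) ** 2 + 0.3 * math.sin(8 * x)
print(simulated_annealing(f, lambda x: x + random.gauss(0, 0.5), x0=5.0))
```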
### Statistical Inference
Collect useful statistics to **influence the sampling**.
**Branch and Probability Bound:** generalizes branch and bound to continuous variables when no error-proof bound is available
- :warning: feasible only in low dimensional space (`d<10`), number of tree nodes tends to explode
- Iterative:
1. split (branch) the set of admissible input values into subsets, obtaining a tree organization
- usually `hyper-rectangles`
2. calculate the potential value of the individual subsets for further investigation
- statistics by sampling the subspace
3. select the most interesting subsets for additional splits.
**Random Multistart:** repeat local search from random initial points; but how many samples are needed in order to find all possible local optima?

- 👁️‍🗨️ the size and form of the **attraction basins** depend on the specific local search scheme (which may for example filter out very small sub-basins)
- _statistical inference_ can be used to estimate the number of trials with a **probabilistic guarantee** that all `local minimizers` have been found
- (one of them should be the global optimum, or a close approximation)
#### Inertial Shaker
Candidate points are generated in an `adaptive search box` and a moving average of the steps filters out evaluation noise and high-frequency oscillations.
- 👍 simple practical choice to go beyond Pure Random Search
👁️‍🗨️ A more complex and less scalable version is the _Reactive Affine Shaker_.

- `Inertial Shaker:` at each step, apply a random double shot in a box around the current input point; move only if it is successful, otherwise retry with another double shot
	- `trend` direction is adapted at each successful step

- `Double shot:` apply a random displacement (one component per dimension) and keep it if it improves the result
	- (the second shot, in the opposite direction, is tried only if the first shot does not improve the solution; both are canceled if neither improves)
- if double shot fails $\implies$ shrink search box
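A heavily simplified sketch of the double-shot / adaptive-box idea (the expansion/shrink factors, the inertia weight and the trend update below are my own assumptions, not the book's algorithm):

```python
import random

def inertial_shaker(f, x0, box0=1.0, n_iter=1000, expand=1.2, shrink=0.5, inertia=0.7):
    """Random "double shots" inside an adaptive box, plus a moving-average trend."""
    x = list(x0)
    fx = f(x)
    box = [box0] * len(x)
    trend = [0.0] * len(x)                  # moving average of recent successful steps
    for _ in range(n_iter):
        delta = [random.uniform(-b, b) + inertia * t for b, t in zip(box, trend)]
        moved = False
        for sign in (+1, -1):               # double shot: try delta, then -delta
            y = [xi + sign * di for xi, di in zip(x, delta)]
            fy = f(y)
            if fy < fx:
                step = [sign * di for di in delta]
                trend = [inertia * t + (1 - inertia) * s for t, s in zip(trend, step)]
                x, fx = y, fy
                box = [b * expand for b in box]    # widen the search box on success
                moved = True
                break
        if not moved:
            box = [b * shrink for b in box]        # shrink the box if both shots fail
    return x, fx

print(inertial_shaker(lambda v: (v[0] - 2) ** 2 + (v[1] + 1) ** 2, [5.0, 5.0]))
```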

## Ch.26 - Derivative-Based Optimization
[287-291] [293 (bisection)] [295-296 (multi-dimension)] [298-299 (gradient descent)]
🎯`Chapter context:` show some methods to **optimize smooth functions** (differentiable) of continuous variables by using gradients to help with the search direction.
...derivatives review...
### One-dimensional Optimization
`Roots:` points where a function is zero; finding the roots of the derivative (possibly a black box) yields candidate optima.
- found by **solving local models** (_Taylor series_, _Newton's method_, _bisection_, ...)
**Newton's Method** for root finding: based on tangents
- (note: not among the pages assigned by the professor)
- 👍 fast convergence
- need to start sufficiently close to the target root

**Bisection method** for root finding: `binary search`...
- safer alternative to Newton's method for convergence
- 👍 convergence in a logarithmic number of steps
- nice, but not easily extended to more than one dimension
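A minimal bisection sketch for root finding, applied to the derivative of a toy function (the example function is illustrative):

```python
def bisection(g, lo, hi, tol=1e-8):
    """Find a root of g in [lo, hi], assuming g(lo) and g(hi) have opposite signs."""
    glo = g(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        gmid = g(mid)
        if glo * gmid <= 0:         # the sign change (root) is in the left half
            hi = mid
        else:                       # the sign change is in the right half
            lo, glo = mid, gmid
    return 0.5 * (lo + hi)

# to optimize f, look for a root of its derivative
df = lambda x: 2 * (x - 3)          # derivative of f(x) = (x - 3)**2
print(bisection(df, 0.0, 10.0))     # ~ 3.0
```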

👁️‍🗨️ If derivatives are not available or they are too costly to calculate, one can approximate them with the `secant method`.
### Multi-dimensional Optimization
Finding the minimum of the quadratic model by `matrix inversion` is often neither the most efficient nor the most robust approach when the linear system becomes very large, a frequent case in machine learning.
**Ill-conditioning:** solving a linear system can be hampered by numerical instability, due to finite-precision computation.

#### Gradient Descent
- discrete case analogy: `local search` (naive, use local information to make the next best choice)
- needs a careful choice of the step size $\epsilon$
- :warning: sensitive to `ill conditioning`: tends to zig-zag if the data is stretched (elongated level curves)
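A minimal gradient-descent sketch on an ill-conditioned quadratic, showing the zig-zag / slow-convergence behavior (step size and matrix are illustrative choices):

```python
import numpy as np

def gradient_descent(grad, x0, eps=0.035, n_iter=200):
    """Plain gradient descent with a fixed step size eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - eps * grad(x)              # step against the gradient
    return x

# ill-conditioned quadratic 0.5 * x^T A x: curvatures 1 and 50
A = np.diag([1.0, 50.0])
grad = lambda x: A @ x
# the iterates oscillate (zig-zag) along the stiff axis and crawl along the flat one
print(gradient_descent(grad, [5.0, 5.0], eps=0.035))
```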

## Ch.27 - RSO: Intro
[313-320]
❗`Motto:` "learn on the job": generate internal explicit models from the search experience on a given instance of a black-box problem.
**Reactive Search Optimization (**`RSO`**):** use `machine learning` and `statistics` for automated **online self-tuning** and **dynamic adaptation**.
- **Online learning:** implicitly accumulate knowledge of a black-box optimization problem as the function gets evaluated on newly generated input points (`self-tuning`)
- need appropriate data structures for `saving` and `retrieving` past configurations (usually, RSO overhead is small)
The following chapters (up to Ch.32) will further present RSO. To summarize, **RSO can adapt in an online manner** at least meta-parameters like:
- `prohibition periods` (stay away from these areas) $\longrightarrow \text{Ch.27}$
- `neighborhood` (if no improving move, get another neighborhood) $\longrightarrow \text{Ch.28}$
- `iterations` of local search (kick the system to a new attractor) $\longrightarrow \text{Ch.29}$
- dynamic `modifications of the objective function` (push up local minima) $\longrightarrow \text{Ch.31}$
- amount of `randomness` (use noise to jump out of an attractor) $\longrightarrow \text{Ch.30?}, \text{Ch.31}$
### Tabu-RSO
RSO variant based on prohibitions.
**Prohibitions:** stay away from certain areas in order to find new solutions, optimized online
- encourage `diversification`
- at a given time $t$, the set of moves is split between _admissible moves_ and _tabu_ moves (prohibitions)
- in general, the `inverses` to recent moves are prohibited for a time dependent period $T^{(t)}$

- bigger $T$ forces more searching steps before allowing a reversing move (increases the minimum loop size)
- tends to form a `lasso` around a local optimum
🎯**Goal:** for a local configuration, `find the minimal prohibition` value needed to escape the attraction basin in the area.
- 👁️‍🗨️ determine $T$ in a **reactive way**, where its value changes according to the local structure of the problem

Initially $T=1$; it increases if trapped in an attraction basin, decreases if new regions are visited.
(note: this still seems too naive; it seems easy to just keep looping around the same attraction basins)
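A simplified sketch of the prohibition idea on bit-string solutions (the specific reaction rule for $T$ below is a rough stand-in for the book's reactive scheme, not its actual algorithm):

```python
def reactive_tabu_search(f, x0, n_iter=1000, t_init=1, t_max=50):
    """Flipping bit i prohibits flipping it again for T steps; T grows when
    configurations are revisited, and relaxes when new regions are explored."""
    x = list(x0)
    best, fbest = x[:], f(x)
    T = t_init
    last_flip = [-10**9] * len(x)          # iteration at which each bit was last flipped
    seen = {}                              # configuration -> last iteration it was visited
    for step in range(n_iter):
        admissible = [i for i in range(len(x)) if step - last_flip[i] > T]
        if not admissible:                 # everything prohibited: allow all moves
            admissible = list(range(len(x)))

        def flipped_value(i):              # objective after flipping bit i (then restore)
            x[i] ^= 1
            v = f(x)
            x[i] ^= 1
            return v

        i = min(admissible, key=flipped_value)   # best admissible move, even if worsening
        x[i] ^= 1
        fx = f(x)
        last_flip[i] = step
        if fx < fbest:
            best, fbest = x[:], fx
        key = tuple(x)
        if key in seen and step - seen[key] < 2 * T:
            T = min(t_max, T + 1)          # repetition detected: increase the prohibition
        else:
            T = max(t_init, T - 1)         # new region: relax the prohibition
        seen[key] = step
    return best, fbest

# toy usage: minimize the number of zeros in the bit string
print(reactive_tabu_search(lambda b: b.count(0), [0] * 12)[1])
```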

## Ch.28 - RSO: Adapting Neighborhoods and Selection
[327-332]
**Variable Neighborhood Search (**`VNS`**):** [meta-heuristic](http://www.scholarpedia.org/article/Metaheuristic_Optimization) which adapts the neighborhood at every step, both in descent to local minima and in escape from the valleys.
- Set of `neighborhoods defined a priori`; during the search, use the most appropriate ones
- `Considerations:` **reactive nature**, many possible implementations (does not have to use online learning)
	- How to set up the `neighborhoods`? How many? Are they disjoint?
		- help from problem knowledge or experimentation
	- How to set up `neighborhood selection`? order of consideration? transitioning?
		- many possibilities: from `randomly cycling` neighborhoods, to `online learning` (e.g.: minimal diversification strategy)
	- How to `evaluate a neighborhood`? Use a **local search** heuristic (choosing only an improving neighbor)
- `highest descent` heuristic
- `first descent` heuristic

**Reduced-VNS:** randomly extract one neighbor from the current neighborhood, move only if improving
- neighborhoods are usually ordered according to the `strength of the perturbation`
- `Variable Neighborhood Descent` (VND): deterministic alternative

- if neighbor is not improving $\implies$ pass to larger neighborhood and repeat
- else $\implies$ turn back to default neighborhood
**Skewed-VNS:** accept also `worsening moves` if they lead the search trajectory sufficiently far
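A minimal Reduced-VNS-style sketch (the neighborhoods, the toy function and the parameters are illustrative assumptions):

```python
import random

def variable_neighborhood_search(f, x0, neighborhoods, n_iter=1000):
    """Sample one random neighbor from the current neighborhood; move and reset
    to the first neighborhood on improvement, otherwise try a stronger one."""
    x, fx = x0, f(x0)
    k = 0                                        # index of the current neighborhood
    for _ in range(n_iter):
        y = neighborhoods[k](x)                  # random neighbor from neighborhood k
        fy = f(y)
        if fy < fx:
            x, fx = y, fy
            k = 0                                # improvement: back to the first neighborhood
        else:
            k = min(k + 1, len(neighborhoods) - 1)   # no luck: larger perturbation
    return x, fx

# toy usage: neighborhoods of increasing strength (Gaussian moves of growing width)
f = lambda x: x * x + 10 * abs(round(x) - x)     # made-up function with many local dips
moves = [lambda x, s=s: x + random.gauss(0, s) for s in (0.1, 1.0, 5.0)]
print(variable_neighborhood_search(f, 20.0, moves))
```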


## Ch.29 - RSO: Iterated Local Search
[333-337]
**Repeated Local Search:** memory-less, repeat local search from different starting points, keep best solution.
**Iterated Local Search (**`ILS`**):** another local search extension: restart local search from a perturbation of the previous local optimum.
- assumes a `clustered distribution` of local minima $\implies$ easier to find good local minima near a local minimum than near a random point
- `keeps memory` of the found local minima
- 🦶 based on `kicks`: perturbations have to be of the right size though
	- too weak $\implies$ fall back to the same local minimum
	- too strong $\implies$ memory-less random sampling
	- 👁️‍🗨️ should adjust in a `reactive` manner (increase the kick only if necessary)
(note: how does it compare to VNS?)
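A minimal ILS skeleton; `local_search` and `kick` are assumed to be user-supplied routines (illustrative, not the book's code):

```python
def iterated_local_search(f, x0, local_search, kick, n_iter=100):
    """Descend, "kick" the local optimum, descend again; accept the new local
    optimum only if it improves (simplest acceptance rule)."""
    x, fx = local_search(f, x0)
    for _ in range(n_iter):
        y, fy = local_search(f, kick(x))   # restart from a perturbed copy of the optimum
        if fy < fx:
            x, fx = y, fy                  # keep memory of the best local optimum found
    return x, fx
```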



## Ch.30 - RSO: Online learning in Simulated Annealing
[339-342]

**Gist:** vanilla "Markov-process" SA is dumb, consider smarter extensions that use memory.
- 👎 with the new mechanisms, the mathematical guarantees are no longer valid (e.g.: global convergence)
- 👍 usually much better in practice
POSSIBLE IMPROVEMENTS:
- **Phase transition:** use 2 cooling rates, $\alpha$ and $\beta$, use the faster rate during a _phase transition_, that is, when the _specific heat_ is maximal
	- 💡`Idea:` adjust temperature scaling and cooling by collecting statistics (estimate the distribution of the values of $f$)
		- max temp scale $=\sigma_f$ (standard deviation)
		- min temp scale $=$ minimum change (!?)
	- `specific heat` $=\sigma_f^2/T$, the variance of the energy divided by the temperature
- when variance is very high at a given temperature $\implies$ a phase transition is occurring there
- **Re-heating:** cooling can lead to `freezing` at local minimum (infinitely more difficult to find better values), use **non-monotonic cooling schedules** to avoid that
	- :warning: same old story: reheating must be strong enough not to fall back into the current minimizer, but not so strong that the search becomes a random walk
- can use online learning for runtime adjusting
Reactive SA solutions vs **SequenceHeuristic**: in the latter, accept a worsening solution if a given number of trials fail (evidence of entrapment)
- both perform well, probably thanks to their **acceptance of bad perturbations** (not some magic in the annealing itself)
**Adaptive Simulated Annealing (**`ASA`**):** very fast simulated re-annealing, automatic runtime adjustment of parameters (cooling schedule, step selection, ...)
- continuous space
- generate random `steps of adaptive size` (adjusted so that about half of the moves are accepted)
- adjust step size and temperature to concentrate sampling in promising areas

## Ch.31 - RSO: Dynamic Landscapes and Noise Levels
[343-348]
How to adaptively **modify the objective function**, depending on the search history and current state, in order to steer the search trajectory away from the current local optimizers?

**GSAT:** algorithm for SAT, used as reference for possible extensions that modify the objective function

- `Breakout method:` increase the weight of violated constraints until the search escapes from the local minimum
- `Clause-weighting method:` weight clauses to determine which variable to flip
	- improves GSAT for problems characterized by strong asymmetries
	- can give more priority to satisfying the "most difficult" clauses
	- `Problem:` clause weights can increase too much if the clauses never get satisfied
		- can `reduce` the clause weight when a clause is satisfied, or
		- can use a `weight decay` scheme (consider the recency of moves)
	- many other versions...
**👁️‍🗨️Moral:** weights can be used to make local minima disappear, but be aware of possible side effects (new spurious local minima, global minimum shadowing, ...)
### Guided Local Search
**GLS:** similar to approaches for GSAT, based on **dynamic penalties** to distribute the search effort over the regions of search space
- augmented cost function:
$$
h(X)=f(X)+\lambda \sum_ip_iI_i(X)
$$
- $I_i(X)$ "indicates" if feature $i$ is present in solution $X$
- $p_i$ is the penalty value for feature $i$
- $\lambda$ is the diversification factor, reactive implementation is suggested
- when local minimum is found, increase by one the penalties which maximize $I_i(X')\dfrac{c_i}{1+p_i}$
- $c_i$ is the cost of feature $i$
- 👁️‍🗨️ _features_ and _costs_ often come directly from the objective function, they are not adaptive
	- 👍 $c_i$ promotes the penalization of more costly features
	- 👍 $p_i$ promotes the penalization of features that have not been penalized many times
- can be seen as a special case of Tabu search (?!)
- usually combined with **fast local search (**`FLS`**):** evaluate only a subset of neighbors where you expect improving moves, avoid brute-force evaluation
	- 👍 much faster
	- 👍 FLS does not impact the dynamics and does not change the search trajectory (not sure)
	- can adapt the neighborhood based on the penalized features, analogous to `tabu-RSO`
- :warning: still sensitive to the problems related to modifying the objective function, so be careful
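A hedged sketch of the GLS penalty update; the feature/cost representation, $\lambda$, and the `local_search` routine are assumptions for illustration, not the book's code:

```python
def augmented_cost(f, x, features, penalties, lam):
    """h(x) = f(x) + lam * sum_i p_i * I_i(x), with I_i given by boolean feature tests."""
    return f(x) + lam * sum(p for feat, p in zip(features, penalties) if feat(x))

def guided_local_search(f, x0, features, costs, local_search, lam=0.3, n_rounds=50):
    penalties = [0] * len(features)
    x, best, fbest = x0, x0, f(x0)
    for _ in range(n_rounds):
        # descend on the augmented cost h around the current point
        x = local_search(lambda y: augmented_cost(f, y, features, penalties, lam), x)
        if f(x) < fbest:
            best, fbest = x, f(x)
        # at the local minimum of h, penalize the features with maximal utility
        utils = [feat(x) * c / (1 + p) for feat, c, p in zip(features, costs, penalties)]
        top = max(utils)
        penalties = [p + 1 if u == top and top > 0 else p
                     for p, u in zip(penalties, utils)]
    return best, fbest
```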
### Adapting Noise
Adapting the level of randomness (like in Simulated Annealing) is another mechanism to **trade-off intensification and exploration**.
- good calibration avoids searches which are either too greedy or too much like random walks

## Ch.33 - Linear and Quadratic Programming
[357-364]
**Moral:** check if problem is a simple one before considering general-purpose optimization methods.
**Linear Programming/Optimization:** solution of problems with a `linear objective function` and `linear constraints`.
- polynomial time (if $\textbf{x}$ does not have to be integers)
- **Canonical form:**
$$
\text{maximize} \quad \textbf{c}^T\textbf{x} \\
\text{subject to} \quad A\textbf{x} \le \textbf{b} \\
\text{and} \quad \textbf{x} \ge 0
$$
- where:
- $\textbf{x}$ represents a vector of $n$ _variables to be determined_
- $\textbf{c}$ and $\textbf{b}$ are vectors of _coefficients_
- $A$ is a matrix $m \times n$ of coefficients, one row $\forall \ m$ constraints
**Convex polytope:** each row $A_i\textbf{x} \le b_i$ describes a half-space in $n$-dimensions, the intersection of all the $m$ constrained spaces results in a convex region
🔥`Convex feasible region` + `Linear objective function`
- $\implies$ **local optimum == global optimum**
- $\implies$ **an optimal solution lies on a vertex**

Example: the image shows linear optimization with 2 variables (2D) and 5 constraints. The linear cost function is maximized at the rightmost vertex.
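A small example of solving such a 2-variable LP with SciPy (my own illustration; the numbers are made up, and `linprog` minimizes, so the sign of $\textbf{c}$ is flipped to maximize):

```python
from scipy.optimize import linprog
import numpy as np

c = np.array([3.0, 2.0])                 # maximize 3*x1 + 2*x2
A = np.array([[1.0, 1.0],                # A x <= b
              [2.0, 1.0]])
b = np.array([4.0, 6.0])

res = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)                   # optimal vertex (2, 2) and objective value 10
```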
**Simplex algorithm** for linear programming
- 👍 global optimum guaranteed (due to the linear programming assumptions)
- 👎 exponential worst-case complexity (but it is rarely exponential in practice)

1. select a `starting vertex` (can be selected from a simplified version of the original problem)
2. **local search** steps: move only to a neighboring vertex with a higher objective value
3. end when `local maximum` is reached (all its neighbors have smaller objective value)
**Integer Linear Programming (**`ILP`**):** vector $\textbf{x}$ can have only integers
- more difficult than the continuous case (`NP-hard` just to tell if it has a feasible solution)

- **Relaxation approach:** relax ILP problem to LP problem and solve, then round $\textbf{x}$ to closest integers
	- 👍 better complexity
	- 👎 no optimality guarantee
	- 👎 no feasibility guarantee
**Quadratic Programming:** optimize a `quadratic function` subject to `linear constraints`.

## Ch.34 - Branch and Bound, Dynamic Programming
[365-372]
### Branch and Bound
Try to detect avoidable enumerations when exploring the tree.

- nodes are partial solutions, leaves are complete solutions
- parts of the tree can be `pruned` when it is known that they cannot lead to optimal solutions
- `bounds:` conditions that are updated during the search, not respecting them leads to pruning
- `branching:` assigning a value to a variable in a partial solution (non-leaf node) and recursively expanding the subtree
- `backtracking:` when branching leads to a dead end, return to previous steps and try unexplored alternatives
- 👎 still exponential, like enumeration
### Dynamic Programming
Avoid repeated calculations by remembering partial results.
`When?` problem solution is combination of solutions from its sub-problems, and sub-problems can be shared.
**Viterbi's algorithm:** ...
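A minimal illustration of the memoization idea (my own, not from the book): overlapping subproblems are computed once and cached, turning an exponential recursion into a linear one.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # each subproblem fib(k) is solved once and then reused
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(80))   # instantaneous thanks to the cached shared subproblems
```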

## Ch.35 - Satisfiability
[373-380] [389-391]
**Context:** `SAT` and `MAX-SAT` are good _"toy problems"_ to apply the principles we have seen so far.
Solving SAT is a prototypical form of the more general **constraint programming (**`CP`**)**.
**SAT:** tell if boolean formula of boolean variables is satisfiable (all clauses are true).
**MAX-SAT:** given a formula in `conjunctive normal form` (CNF), find the assignment of the $n$ variables that satisfies the maximum number of the $m$ clauses.
- `MAX W-SAT`: version where clauses have a weight
### Resolution and Splitting for SAT
Can adapt the `branch and bound` method for SAT: resolution.
**Resolution rule:** produces a new clause (`resolvent`) implied by two clauses containing `complementary literals`.
- **Resolvent:** the clause produced by the resolution rule; it contains all the literals of the two parent clauses **except the complementary pair**
- 🔥 an empty resolvent can be derived $\iff$ the original formula is `unsatisfiable`
- if it is not a tautology, it is added to the clause set for further resolution inferences
- 👁️‍🗨️ **it does not alter the set of solutions**
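A tiny worked example (my own, not from the book): resolving $(x_1 \lor x_2)$ with $(\neg x_1 \lor x_3)$ on the complementary pair $x_1 / \neg x_1$ gives the resolvent $(x_2 \lor x_3)$; resolving the two unit clauses $(x_1)$ and $(\neg x_1)$ yields the empty resolvent, proving unsatisfiability.
$$
\frac{(x_1 \lor x_2) \qquad (\neg x_1 \lor x_3)}{(x_2 \lor x_3)}
\qquad\qquad
\frac{(x_1) \qquad (\neg x_1)}{\square}
$$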

- **DP (**`Davis-Putnam`**) algorithm:** based on resolution rules, removes one variable at a time by using all possible resolvents on that variable.
	- 👎 memory explosion
**Splitting rule:** at each node, pick a variable and split the problem into two sub-problems, one for each truth assignment; neither sub-problem contains the selected variable
- if a clause does not contain the selected variable, it goes into both sub-problems
- if the variable is set to _true_, clauses containing the positive literal are satisfied and dropped, and the negated literal is removed from the remaining clauses (and vice versa)

- can use `backtracking`
- **DPLL algorithm:** limit memory explosion by replacing the resolution rule with the splitting rule.
	- 👍 better memory usage than DP, but still exponential

### Linear Programming for MAX W-SAT
- can use `branch and bound` with `ILP`
- `linear programming relaxation:` gives an upper bound on the number of satisfied clauses
	- 👎 poor bounds in practice
### Continuous Approaches
**Interior point algorithm:** ...
In some techniques the `MAX-SAT` (or `SAT`) problem is transformed into an **unconstrained optimization problem** on the real space and solved by using existing global optimization techniques.
### Approximation Algorithms
Approximate best MAX-SAT solution.
- `greedy construction` strategies
- **GreedyJohnson:** iteratively assign the literal that occurs in most clauses and delete satisfied clauses
- $1/2$ guaranteed performance
### Local Search for SAT
`Perturbative local search` is very effective for MAX-SAT.
Different variations of local search with **randomness** techniques have been proposed for SAT and MAX-SAT.

## Ch.36 - Strategic Modeling (note: title uncertain)
[403-412]
**Context:** data-driven models for which experimental data is not available at the beginning, and needs to be acquired in a strategic manner.
π―**Goal:** collect data by minimizing cost (experiments).
- called `design of experiments` in science
- called `active learning` or `query learning` in ML
If running the real model is too costly $\implies$ can build a `surrogate` (approximated) model based on the executed experiments.
### Design of Experiments (DOE)
Systematic and principled manner to generate examples to characterize, predict, and then improve the behavior of any system or process.
**Goal:** get a surrogate model from experiments.

- `rule of thumb:` the number of examples has to be larger than the number of parameters
**Full factorial design:** generate sample points on a **regular grid**, based on the number of levels of each `factor` (variable)
- 👍 simple
- 👎 grid-like nature (may miss relevant effects)
- 👎 very expensive (exploding number of samples wrt. number of factors)
- `reduced factorial designs` alleviate this

- Figure: full-factorial DOE with 5 levels for each factor, in three dimensions
**Randomized design:** pseudo Monte Carlo sampling
- 👍 simple
- 👍 can get interesting points that grids would miss
- 👍 good in the absence of detailed and trusted information about the physical system under analysis
- 👎 can leave large regions of the design space unexplored (needs a large sample)
- `stratified Monte Carlo sampling` can solve this by subdividing in equal bins and sampling for each one (reduces the needed sample size)

**Latin hypercube sampling (**`LHS`**):** its motivation is to ensure that, for a given number of samples $k$, and $k$ equally-spaced bins, for all one-dimensional projections of the samples there will be one and only one sample in each bin.
- `Latin square:` if there is only one sample in each row and each column (like sudoku)
- 👍 good representation with fewer samples
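A minimal LHS sketch (the bin construction below is a standard recipe; seed and sizes are illustrative):

```python
import numpy as np

def latin_hypercube(k, d, seed=0):
    """k samples in [0, 1]^d: exactly one sample per bin in every 1-D projection."""
    rng = np.random.default_rng(seed)
    samples = np.empty((k, d))
    for j in range(d):
        perm = rng.permutation(k)                    # bin assigned to each sample
        samples[:, j] = (perm + rng.random(k)) / k   # uniform point inside its bin
    return samples

pts = latin_hypercube(5, 2)
print(np.sort(np.floor(pts * 5), axis=0))   # every bin index 0..4 appears once per column
```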

**Surrogate model (**`emulator`**):** a simplified model of the outcome is used when an outcome of interest cannot be easily directly measured.
- usually constructed based on the results of a `simulator` (seen as `black-box`)
- also called `Response Surface Methodology` (RSM)
- `iterated RSM` in closer areas can be used to ensure that an optimal point for the surrogate model corresponds to the one in the real system

### Active or Query Learning
At each iteration, active learning methods **select the inputs maximizing an informativeness measure**, which estimates how much an input would improve the current model if it was added to the current training set.
- **Uncertainty sampling (**`US`**)** principle: input with highest `predictive uncertainty` as the most informative training example for the current model
- can estimate confidence on predictions with **Gaussian Process Regression**
- **Query-by-committee strategy:** use a committee of models trained on the current set of examples
	- they have competing hypotheses
- the input considered most informative is the one on which they `disagree the most`
- **Expected model change** principle: most informative input is the one that would change the model the most (has the greatest influence on model parameters)
- **Expected error reduction:** select the input that is expected to yield the greatest reduction in the `generalization error` of the current model
- computationally expensive
- **Variance reduction** approach: query the input that minimizes the model variance
- assumption that the bias error is negligible

## Ch.39 - Genetics, Evolution and Nature-Inspired Analogies
[459-465]
**Context:** analogies from nature can be `inspiring` but also `misleading` when they are translated into procedures for problem solving and optimization.
- e.g.: **Population-based algorithms:** using multiple interacting search streams
- Similar ideas: ensembles, pools, agents, committees, multiple threads, mixture of experts, `genetic algorithms`, particle swarms, `evolution strategies`
- :warning: analogies with nature are not always justified and more effective schemes can be developed, do not need to imitate nature for the sake of imitating it
**Goal:** let's concentrate on the relationships between _Genetic Algorithms_ and _Intelligent Search_ strategies.
### Genetic algorithms and Evolution Strategies
**Idea:** `natural selection`, each individual is a candidate solution described by its **genetic content**.
- **Mutation:** randomly change bit values (gene $==$ position?)
- usually very rare
	- used to explore the neighborhoods (`perturbative search`)
	- :warning: Careful **rate** choice. Too small and we are stuck near the `initial solution`, too big and we get `random restart`.
- **Recombination:** uniform `cross-over`, parents have a $1/2$ chance of passing each bit
	- applied to random couples, creates two offspring
- **Cross-over:** dubious analogy, does it work?
- similar parents can lead to `premature convergence`
- `fitness function:` defines the suitability for environment
- fitter individuals produce a larger offspring
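A tiny GA sketch on the OneMax toy problem (truncation selection is used here as a simplification of fitness-proportional selection; all parameters are illustrative):

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=30, generations=100, mutation_rate=0.01):
    """Uniform crossover + rare bit-flip mutation + simple truncation selection."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(children) < pop_size:
            p1, p2 = random.sample(parents, 2)
            child = [random.choice(pair) for pair in zip(p1, p2)]     # uniform crossover
            child = [b ^ 1 if random.random() < mutation_rate else b for b in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# toy usage: OneMax, i.e. maximize the number of 1s
print(genetic_algorithm(sum, n_bits=20))
```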

There are at least 3 **critical issues** when adopting biological mechanisms for solving optimization problems:
1. not easy to demonstrate that they are `effective`
2. not easy to demonstrate that they are `efficient`
3. is the biology-based approach intrinsically better? (e.g. individual experience is not passed on in genes)
`Hybridized Genetic Algorithms:` based on **Baldwin effect** idea (?)
- **Memetic algorithms:** extension of GA, add learning during individuals life (local search)
	- 🔥 How to implement `individual learning` during lifetime?
		- **Lamarckian evolution:** replace the initial genotype with the better solution identified by local search
		- (Baldwinian alternative) local search does not modify the genotype; the `fitness function` is evaluated after learning, rewarding the "learning potential"
👁️‍🗨️ More efficient algorithms are based on a simpler _"one parent for one descendant"_ recombination. We don't have to imitate nature.

## Ch.40 - Multi-Objective Optimization
[467-473]
**Context:** most real-world problems cannot be cast into the simple and mathematically crystal-clear form of minimizing a single function $f(x)$.
- **Multi-Objective Optimization Problems (**`MOOP`**)** $\longrightarrow$ most problems have **more than one objective** to reach, to be maximized
- but have to **trade-off** between `conflicting objectives`
**MORSO (**`Multi-objective Reactive Search Optimization`**):** solution methods for multi-objective optimization characterized by **incremental and learning paths**.
### Pareto Optimality
Defining desirable objectives without spelling out the `trade-offs` and the `relative importance` of the different objectives can lead to ill-defined problems, as **objective functions can be conflicting**. (typically no single objective vector attains the best value for every individual objective)
- **Pareto Optimal:** a point $\textbf{x}$ such that there is no other point that dominates it in $f$
- $\textbf{x}\prec\textbf{x}'$ means that the objective vector $\textbf{x}$ `dominates` $\textbf{x}'$ (has no worse values than it, and at least one value that is better)
- **Pareto frontier:** set of Pareto optimal points, typically the whole set is `unknown`
- 🔥 Pareto-optimal solutions cannot be improved in any of the objectives without degrading at least one of the other objectives
- 👁️‍🗨️ designer can make trade-offs within this set, rather than considering the full range of every parameter
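A minimal sketch of the dominance check and of filtering a Pareto front (assuming minimization; the points are made up):

```python
def dominates(a, b):
    """True if objective vector a dominates b: no worse in every objective
    and strictly better in at least one (minimization convention)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 5), (2, 3), (3, 3), (4, 1), (5, 4)]
print(pareto_front(pts))        # [(1, 5), (2, 3), (4, 1)]
```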

Multi-objective approaches based on Pareto frontier:
- **Linear Scalarization (**weighted-sum**):** obtain and optimize a `single objective` function by using linear combination
	- 👎 if the Pareto front is not convex $\implies$ some Pareto-optimal solutions cannot be identified

- **Tchebycheff approach:** ? uses single-objective local search?
	- 👎 the aggregation function is `not smooth` for a continuous MOOP
	- used in **MOEA/D:** multiobjective evolutionary algorithm based on decomposition
		- decomposes a multi-objective optimization problem into a number of scalar optimization subproblems (using a set of different weight vectors, one for each scalarized problem) and optimizes them simultaneously
- **Pareto local search (**`PLS`**):** natural extension of single objective local search methods
- used in **MOMAD:** memetic algorithm based on `decomposition`, is a `population-based` iterative method and thus an `anytime algorithm`

## TODO
- `stochastic global optimization` examples?
- `reactive` vs `adaptive` vs `online learning`?
- `Inertial shaker` vs `reactive affine shaker`?
- vanilla `simulated annealing` vs online learning based one?
- why vanilla `simulated annealing` has guaranteed global convergence?
- `iterated local search` vs `tabu-RSO`?
- "strategic oscillations": `re-heating SA` vs `tabu-RSO` vs `VNS`?
- `random multi-start` vs `repeated local search`?
- `GSAT` vs `WalkSAT`?
🎉 END 🎉