# WS22/23 - Applied Numerical Optimization
> * What is the **definition of optimality**? (unconstrained/ constrained problems)
(1) Unconstrained problem: No constraints on the feasible solutions. The goal is to find the solution that max. or min. a given objective function.
(2) Constrained problem: There are limitations on the feasible solutions and the goal is to find the best solution that satisfies all the constraints and optimizes the objective function.
> * Which components does an optimization problem consist of? How can such a problem in general be formulated mathematically?
(1) **Objective function, decision variable and constraints**
(2) Min./Max. $f(x)$, subject to $g_i(x) \le 0$ ($i = 1, 2, ..., m$) and $h_j(x) = 0$ ($j = 1, 2, ..., n$)
> * Which conditions do the solution of a constrained optimization problem have to fulfill in general?
(1) Feasibility: The solution must satisfy all of the constraints, i.e., it must lie within the feasible region defined by the constraints.
(2) Complementary slackness: The solution must satisfy the complementary slackness conditions, which state that for each inequality constraint, either **the constraint is active ($g_i(x^{*})=0$)** (i.e., binding) or the corresponding Lagrange multiplier is zero **($\lambda_i=0$)**; equivalently, $\lambda_i g_i(x^{*}) = 0$.
(3) Dual feasibility: The solution must satisfy the dual feasibility conditions, which state that the Lagrange multipliers of the inequality constraints must be non-negative; the multipliers of the equality constraints are unrestricted in sign.

## Convex problem:
* Definition: Under which conditions is an optimization problem called convex?
An optimization problem is convex if the feasible set $\Omega$ is convex and the objective function $f$ is convex on $\Omega$.
* How can we check the convexity of a smooth unconstrained optimization problem?
(1) the feasible set: equality constraints are linear (affine), and the functions defining the inequality constraints are convex, so the region they form is convex.
(2) the objective function: the Hessian must be (at least) positive semi-definite everywhere on the feasible set.
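As a minimal sketch (the helper name `is_psd` and the tolerance are my own choices), positive semi-definiteness of the Hessian can be checked numerically via its eigenvalues:

```python
import numpy as np

def is_psd(H, tol=1e-10):
    """Check positive semi-definiteness via eigenvalues of the symmetric part."""
    Hs = (H + H.T) / 2
    return bool(np.all(np.linalg.eigvalsh(Hs) >= -tol))

# f(x, y) = x^2 + x*y + y^2 has constant Hessian [[2, 1], [1, 2]] -> convex
print(is_psd(np.array([[2.0, 1.0], [1.0, 2.0]])))    # True
# g(x, y) = x^2 - y^2 has Hessian [[2, 0], [0, -2]] -> not convex
print(is_psd(np.array([[2.0, 0.0], [0.0, -2.0]])))   # False
```

For a non-quadratic function this test must hold at every point of the feasible set, not just one.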
## Directional Derivative:

## Necessary and sufficient condition:
> * Explain the difference between the necessary and sufficient optimality conditions in case of unconstrained optimization.
(1) Necessary optimality conditions: the 1st-order necessary condition (stationarity: **the gradient must be zero**, $\nabla f(x^{*}) = 0$) and the 2nd-order necessary condition (**the Hessian must be positive semi-definite**).
(2) Sufficient optimality conditions: the 2nd-order sufficient condition (**a zero gradient together with a positive definite Hessian** guarantees a strict local minimum; a negative definite Hessian correspondingly indicates a local maximum).



### 1st order necessary optimality condition:

### 2nd order necessary optimality condition:

### 2nd order sufficient optimality condition:

> * Which **conceptual solution strategies for unconstrained optimization problems** did you learn in the lecture? Describe the basic ideas of the methods.
(1) **Gradient Descent**: The gradient descent method is a **first-order optimization algorithm** that iteratively updates the decision variables in the direction of the negative gradient of the objective function. **The gradient points in the direction of the steepest increase in the function, and the negative gradient points in the direction of the steepest decrease.** This method is typically used for smooth, differentiable functions.
(2) **Newton's Method**: Newton's method is a **second-order optimization algorithm** that uses the second derivative (the Hessian matrix) of the objective function to iteratively improve the solution. Newton's method takes advantage of the local curvature of the function to converge more rapidly to the optimal solution than gradient descent.
(3) **Conjugate Gradient**: The conjugate gradient method is a **first-order optimization algorithm** that uses a set of conjugate search directions to efficiently minimize the objective function. The method is well suited for **minimizing quadratic functions with symmetric, positive definite Hessians** (equivalently, solving linear systems $Ax = b$ with such matrices $A$).
(4) **Quasi-Newton Methods**: Quasi-Newton methods **approximate the Hessian matrix using low-rank updates built from gradient information**, as in the **BFGS or DFP** algorithms. These methods are cheaper per iteration than Newton's method, but still exploit second-order information about the objective function to converge more rapidly than gradient descent.
(5) **Stochastic Gradient Descent**: Stochastic gradient descent is a variation of gradient descent that updates the decision variables using a randomly selected subset of the data, rather than using the full data set. This method is well suited for large-scale optimization problems with a large number of data points.
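The first strategy above can be sketched in a few lines (the fixed step size `alpha` and the test quadratic are illustrative choices, not from the lecture):

```python
def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Fixed-step gradient descent: x_{k+1} = x_k - alpha * grad(x_k)."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if max(abs(gi) for gi in g) < tol:   # stop when the gradient vanishes
            break
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# minimize f(x, y) = (x - 1)^2 + 2*(y + 2)^2; minimum at (1, -2)
grad = lambda v: [2 * (v[0] - 1), 4 * (v[1] + 2)]
x_star = gradient_descent(grad, [0.0, 0.0])
```

In practice the fixed step `alpha` is replaced by a line search (see below).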
## Direct Method: (directly minimize the objective function)
* These methods iterate on the objective itself and aim to converge to a point satisfying the optimality conditions.
### Line Search Approach:
#### 1. Steepest descent: (direction of the highest rate of change)
* Method: select a descent direction and a step length
* Algorithm:



**How to determine step length?**
##### (1) Armijo Condition:

Disadvantages: the constant c must be chosen properly; any sufficiently small step length satisfies the Armijo condition but leads to very slow convergence.
##### (2) Wolfe Conditions: **(exclude too-small steps)**

* Advantage: a larger step length must be taken when the current slope is steep.
>* Which line-search strategies do you know? Describe the basic ideas.
Line-search strategies are methods used in optimization algorithms to determine the step size, or step length, for each iteration of the algorithm. The following are common line-search strategies:
(1) **Armijo rule**:
(a) a sufficient decrease condition that requires the objective function to decrease by a sufficient amount in the direction of the search.
(b) **The step size is gradually reduced** until the Armijo condition is satisfied.
(2) **Wolfe conditions**:
(a) a pair of conditions (sufficient decrease plus curvature) that balance the desire for a large step size with the requirement that the objective function decrease sufficiently.
(b) The first Wolfe condition is the sufficient-decrease (Armijo) condition on the objective function along the search direction.
(c) The second Wolfe condition (the curvature condition) **requires the slope at the new point to have flattened sufficiently, which excludes step sizes that are too small**.
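A backtracking line search that enforces the Armijo condition can be sketched as follows (the constants `c` and `rho` are typical textbook defaults, assumed here):

```python
def backtracking(f, grad, x, p, alpha0=1.0, c=1e-4, rho=0.5):
    """Shrink alpha until f(x + alpha*p) <= f(x) + c*alpha*grad(x)^T p (Armijo)."""
    fx = f(x)
    slope = sum(g * pi for g, pi in zip(grad(x), p))  # directional derivative, < 0 for descent
    alpha = alpha0
    while f([xi + alpha * pi for xi, pi in zip(x, p)]) > fx + c * alpha * slope:
        alpha *= rho    # backtrack: reduce the step geometrically
    return alpha

f = lambda v: v[0] ** 2
grad = lambda v: [2 * v[0]]
alpha = backtracking(f, grad, [3.0], [-6.0])  # steepest-descent direction at x = 3
```

Backtracking alone cannot exclude too-small steps; that is what the second Wolfe (curvature) condition adds.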
> * Which classes of methods for the determination of the search direction did you learn in the lecture? Explain the particular advantages and disadvantages.
The search direction is the direction in which the optimization algorithm moves in the feasible space to find the optimal solution.
(1) **Gradient descent**: The search direction is taken in the direction of **the negative gradient of the objective function**, which points towards the steepest decrease of the function.
Advantages: the gradient descent method is simple to implement and easy to understand.
Disadvantages: slow to converge and may get stuck in local minima.
(2) **Newton's method**: The search direction is calculated by solving a system of linear equations based on the **Hessian matrix** and the gradient of the objective function.
Advantages: Newton's method converges faster than gradient descent and is more likely to find the global optimum.
Disadvantages: computationally expensive to compute and invert the Hessian matrix.
(3) **Conjugate gradient**: This method uses a combination of the gradient and a history of previous search directions to determine the search direction. The search direction is taken in the direction of the conjugate of the gradient and the previous search directions. The conjugate gradient method is faster to converge than gradient descent and requires less memory to store than Newton's method, but it may be sensitive to the choice of initial search direction.
(4) **Quasi-Newton methods**: These methods use an approximation of the Hessian matrix to determine the search direction, rather than the exact Hessian matrix used in Newton's method.
Advantages: faster to compute than Newton's method.
Disadvantages: the approximation of the Hessian matrix may not always be accurate, leading to slow convergence or getting stuck in local minima.
> * Describe the steepest descent method in comparison to the Newton method.
The steepest descent method only uses the first-order information (the gradient) of the objective function, while the Newton method uses the second-order information (the gradient and the Hessian matrix) of the objective function.
#### 2. Conjugated gradient:
#### 3. Newton's Method:
* $H^{(k)}p^{(k)}=-g^{(k)}$
* Advantage: faster convergence (locally quadratic convergence)
* Disadvantage: requires computing and inverting (or factorizing) the Hessian, which is computationally expensive


#### 4. Comparison:

> * Visualize and describe one step of the Newton method. Explain the Newton-line search algorithm.
(1) **Given an initial guess**
(2) **Compute the gradient and the Hessian matrix**
(3) **Determine the search direction**: The Newton direction minimizes the local quadratic model of the objective and is found by solving the linear system $Hd = -g$, where $H$ is the Hessian matrix, $g$ is the gradient, and $d$ is the search direction.
(4) **Update the guess**: The current guess is updated along the search direction by a certain step size. The step size determines the magnitude of the update and can be found by a line search algorithm, such as the Newton-line search algorithm.
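The steps above can be sketched for one iteration (for an exactly quadratic objective, a single full step already reaches the minimizer, so no line search is shown):

```python
import numpy as np

def newton_step(grad, hess, x):
    """Solve H p = -g for the Newton direction, then take a full step."""
    g = grad(x)
    H = hess(x)
    p = np.linalg.solve(H, -g)   # never invert H explicitly
    return x + p

# quadratic f(x) = 0.5 x^T A x - b^T x, whose minimizer solves A x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
x1 = newton_step(grad, hess, np.zeros(2))  # one step reaches the minimizer
```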
> * Why is the Newton method often not directly applied?
(1) Computational complexity (Hessian)
(2) Inaccurate gradient and Hessian approximation: The gradient and Hessian approximation can be inaccurate, especially for non-linear and high-dimensional problems. This can result in slow convergence or even divergence of the algorithm.
(3) Sensitivity to initial guess: The Newton method can be sensitive to the initial guess and may converge to a local minimum instead of the global minimum.
### Inexact Newton Method:
The Newton direction is a descent direction only when the Hessian is positive definite.
#### 1. Newton-CG Method:

#### 2. Modified Newton Method:
* It still requires the Hessian for modification.
* Replace the Hessian by a sufficiently positive definite matrix $B_k$, obtained from the Hessian by adding a multiple of the identity or by a modified Cholesky factorization.
* Ideas:
(1) Approximate the Hessian with a metric
Advantages: easy to calculate/update, close to the Hessian, and positive definite
(2) Approximate the inverse of the Hessian
(3) Approximately solve the system of equations

#### 3. Quasi-Newton:
* It can work Hessian-free and attains superlinear convergence.
* Replace the Hessian by $B_k$ and work out an update rule for it in each step.
* $B_k$ is symmetric positive definite.
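A sketch of the BFGS update for $B_k$ (assuming the curvature condition $y^\top s > 0$ holds, which is what preserves positive definiteness):

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation B_k.

    s = x_{k+1} - x_k, y = grad_{k+1} - grad_k. The update keeps B symmetric
    and, when y^T s > 0, positive definite; the secant condition
    B_{k+1} s = y holds by construction."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# For f with constant Hessian A we have y = A s, so the update can be checked
# against the secant condition directly.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.eye(2)          # initial approximation
s = np.array([1.0, -1.0])
y = A @ s
B_new = bfgs_update(B, s, y)
```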

#### 4. Gauss-Newton Method:
* If $J^{k}$ is full-rank, $p^{k}$ is always a descent direction. If $J^{k}$ is singular, the descent direction $p^{k}$ is not reliable; Quasi-Newton is therefore more efficient.

> * Which practical Newton- methods do you know?
(1) Quasi-Newton methods: These methods **approximate the Hessian matrix using a low-rank approximation**, which is updated at each iteration. This results in a faster computation of the search direction, but the approximation may be inaccurate.
(2) Newton-CG method: This method combines the Newton method with the Conjugate Gradient method to handle large-scale optimization problems.
(3) Trust-region Newton method: This method uses a trust-region approach to handle the challenge of inaccurate gradient and Hessian approximation. The method updates the guess within a trust region and adjusts the size of the region based on the accuracy of the approximation.
### Trust Region Approach


> * Explain the concept of the trust-region method.
The trust-region method adjusts the size of the trust region at each iteration based on the quality of the model approximation (measured by the ratio of actual to predicted reduction). If the approximation is good, the trust region is expanded, allowing the algorithm to explore a larger area of the search space. If the approximation is poor, the trust region is shrunk, guiding the optimization process towards a better approximation.
#### Dogleg method
* The Dogleg method is characterized by the use of a dogleg step, a piecewise-linear combination of two steps: the Cauchy step and the full Newton step.
(1) The Cauchy step minimizes the quadratic model along the steepest-descent direction; it dominates when the trust region is small.
(2) The Newton step minimizes the quadratic model exactly and gives rapid progress when it lies inside the trust region.
(3) When the Newton step exceeds the trust region, the dogleg path interpolates between the Cauchy step and the Newton step, and the update is taken where this path crosses the trust-region boundary; the trust-region size is then updated as usual.
## Indirect Method (solve the problem via its optimality conditions)
1. set up optimality conditions
2. try to solve the system of equations (or equations and inequalities) via iterative descent
## KKT

* Active set vs. inactive set

> * Set up the necessary optimality conditions for a constrained optimization problem and explain how the equations are set up?
The KKT conditions:
(1) **Stationarity**: The gradient of the Lagrangian vanishes at the optimum: $\nabla f(x^{*}) + \sum_i \lambda_i \nabla g_i(x^{*}) = 0$.
(2) **Primal feasibility** conditions: These ensure that all the constraints are satisfied at the optimal solution. For a general constraint of the form $g(x) \le 0$, the feasibility condition is given by $g(x^{*}) \le 0$.
(3) **Dual feasibility** conditions: These ensure that the Lagrange multipliers (also known as shadow prices or dual variables) associated with the inequality constraints are non-negative: $\lambda \ge 0$, where $\lambda$ is the corresponding Lagrange multiplier.
(4) **Complementary slackness** conditions: These link the primal and dual solutions and ensure that a Lagrange multiplier can be positive only if the corresponding constraint is active (i.e., binding). For a constraint of the form $g(x) \le 0$, the complementary slackness condition is given by $\lambda g(x^{*}) = 0$.
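A small illustration of checking these conditions at a candidate point (the toy problem $\min (x-2)^2$ s.t. $x \le 1$ and the helper name are my own):

```python
def kkt_check(x, lam, tol=1e-8):
    """Verify the KKT conditions for min (x - 2)^2 s.t. g(x) = x - 1 <= 0."""
    stationarity = abs(2 * (x - 2) + lam * 1.0) < tol   # grad f + lam * grad g = 0
    primal = x - 1 <= tol                               # g(x) <= 0
    dual = lam >= -tol                                  # lam >= 0
    slack = abs(lam * (x - 1)) < tol                    # lam * g(x) = 0
    return stationarity and primal and dual and slack

print(kkt_check(1.0, 2.0))   # the candidate (x*, lam*) = (1, 2) satisfies KKT
print(kkt_check(2.0, 0.0))   # the unconstrained minimizer violates feasibility
```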
## Linear Programming (LP)
* Definition: An LP problem consists of a linear objective function and linear constraints.
> * How is the standard form of a linear program formulated?

> * Set up the KKT conditions for a linear program in standard form.

### Simplex Method
The Simplex method is based on the property of linear programs that if an optimal solution exists, at least one vertex (basic feasible solution) of the feasible polytope is optimal. The Simplex method uses this property to find an optimal solution by iteratively improving the current basic feasible solution, moving from vertex to vertex, until an optimal one is reached.
* Feasible Point
* Iteration Sequence
* Dual Linear Program
* Strong Duality
> * Explain the main features of the Simplex method. On which property of linear programs is the Simplex method based on?
The Simplex method is a popular algorithm for solving linear programs in standard form. It involves iteratively moving from one basic feasible solution to another by pivoting a non-basic variable into the basis and a basic variable out of the basis, until an optimal solution is found.

(1) Iterative process: The Simplex method is an iterative process that starts with a basic feasible solution and improves it by pivoting variables in and out of the basis.
(2) Pivoting rules: The Simplex method uses a set of rules to determine which variable should be pivoted in and out of the basis. The most common rule is **Dantzig's rule, which (for a minimization problem) selects the variable with the most negative reduced cost as the next pivot variable.**
(3) Termination: The Simplex method **terminates when all the variables have non-negative reduced costs**, indicating that an optimal solution has been found.
(4) Unboundedness and Infeasibility: The Simplex method can also detect when the linear program is unbounded or infeasible, and it terminates in these cases.
### Interior Point Method
1. Primal-dual method
2. Adapted Newton Method
> * Outline the primal-dual algorithm for solving a linear program in standard form.
At each iteration of the algorithm, the primal and dual variables are updated in such a way that the duality gap, which is the difference between the optimal value of the primal problem and the optimal value of the dual problem, is reduced. This is done by solving a system of linear equations that involves both the primal and dual variables, and by using the solutions to update the variables in an appropriate manner.

> * What is the principle of the interior-point method?
It approaches the optimal solution by moving from an initial feasible point in the interior of the feasible region to the optimal solution located at one of the vertices of the feasible region.
## NLP
> * Which class of methods to solve general NLP have you learned in the lecture? Describe the basic ideas.
Three solution strategies
(1) Elimination of variables (to convert to an unconstrained problem): the so-called reduced-space formulation, which simplifies the problem by eliminating $m$ variables using the equality constraints.
(2) Approximation as series of unconstrained problems
(3) Approximation as series of simpler constrained problems
### Quadratic Penalty Method (QPM)
replace constraints by adding a quadratic penalty to the objective (the smaller $\mu$, the larger the penalty).
### Augmented Lagrangian Method (ALM)
improve QPM to avoid ill-conditioning by estimating the Lagrange multipliers
Advantages: resolves the ill-conditioned Hessian; the violation of the constraints is small for large $\mu$ (converges for larger $\mu$)
### Log-Barrier Method (LBM)
use logarithmic barrier to enforce strict satisfaction of inequalities.
> * Explain penalty-methods for solving NLPs.
Adding a penalty term to the objective function that becomes larger as the constraints are violated. The idea is to force the solution towards feasibility by increasing the cost of violating the constraints.
(1) Initialization: A starting point is chosen and a value for the penalty parameter is selected.
(2) Penalty function evaluation: The objective function is modified by adding the penalty term, which is a function of the constraint violations and the current value of the penalty parameter.
(3) Optimization: The modified objective function is optimized using a gradient-based or a Newton-type method to find the next iterate.
(4) Update of the penalty parameter: The value of the penalty parameter is updated, usually by increasing it, to increase the cost of violating the constraints.
(5) Termination: The algorithm terminates when a satisfactory solution is found, or when a stopping criterion is met.
優點: simple to implement and can be effective for solving NLP problems with moderate to large size and moderate nonlinearity.
Disadvantages: (1) the algorithm may converge slowly if the penalty parameter is not set appropriately. (2) The solution may not be globally optimal, as penalty methods only guarantee convergence to a locally optimal solution. To overcome this limitation, barrier methods, a generalization of penalty methods, can be used. Barrier methods add a term to the objective function that increases as the solution approaches the boundary defined by the constraints, thus ensuring that the solution stays inside the feasible region.
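A small worked example of the quadratic penalty idea (the toy problem and the convention $Q(x;\mu) = f(x) + \frac{1}{2\mu} c(x)^2$, matching the "smaller $\mu$, larger penalty" note above, are illustrative):

```python
def penalty_minimizer(mu):
    """Closed-form minimizer of Q(x; mu) = (x + 1)^2 + (1 / (2 mu)) (x - 1)^2,
    the quadratic penalty function for min (x + 1)^2 subject to x = 1.

    Setting dQ/dx = 0 gives x = (1 - 2 mu) / (1 + 2 mu)."""
    return (1.0 - 2.0 * mu) / (1.0 + 2.0 * mu)

# As mu decreases, the penalty dominates and the minimizer approaches the
# feasible point x = 1, but each subproblem becomes more ill-conditioned.
for mu in (1.0, 0.1, 0.01, 0.001):
    print(mu, penalty_minimizer(mu))
```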
> * Explain the general framework of the SQP algorithm.
It is a gradient-based method that uses a sequence of quadratic approximations of the objective function and the constraints to find an optimal solution.

(1) Initialization: A starting point is chosen and a quadratic approximation of the objective function and the constraints is formed.
(2) Linearization: The nonlinear constraints are linearized using the current iterate and the quadratic approximation.
(3) Solution of a QP subproblem: The linearized problem is solved using a standard quadratic programming (QP) solver, and the solution is used to update the current iterate.
(4) Update of the approximation: The approximation is updated using the new iterate and the gradient of the objective function and the constraints.
(5) Termination: The algorithm terminates when a satisfactory solution is found, or when a stopping criterion is met.
## Integer Problem
* ILP
* Mixed-Integer Linear Programs (MILP)
* Restrictions, Relaxations and Approximations
> * What is a mixed-integer optimization problem? How can such a problem be formulated mathematically?

It is a type of optimization problem where some of the variables are restricted to be integer values. Such problems arise in many real-world applications, such as scheduling, resource allocation, and design optimization, where some variables must take integer values because they represent discrete decisions, such as the selection of a particular product, the allocation of a specific resource, or the scheduling of a task. Mathematically: $\min_{x, y} f(x, y)$ subject to $g(x, y) \le 0$, with $x \in \mathbb{R}^{n}$ and $y \in \mathbb{Z}^{m}$.
## Branch & Bound
> * Explain the Branch and Bound method for the solution of a mixed-integer problem.
The method is based on the idea of dividing the feasible region into smaller sub-regions and solving a sequence of relaxation problems that provide lower and upper bounds on the optimal value.
(1) Initialization: The continuous relaxation is solved; its optimal value serves as a lower bound on the optimal value (for minimization), and any integer-feasible solution found provides an upper bound.
(2) Branching: The feasible region is divided into two or more sub-regions by fixing one or more variables to integer values, thus creating new constraints.
(3) Relaxation: A relaxation problem is solved for each sub-region to obtain a new lower bound and a feasible solution.
(4) Pruning: If a sub-region is found to be infeasible or if its lower bound is greater than the current upper bound, it is discarded.
(5) Update of the upper bound: The upper bound is updated by taking the minimum of the objective function values of the feasible solutions found so far.
(6) Termination: The algorithm terminates when all sub-regions have been explored, or when the lower and upper bounds are equal to within a specified tolerance, which indicates that an optimal solution has been found.
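The steps above can be sketched for a toy one-dimensional convex problem (the helper `relax_min`, which stands in for an LP-relaxation solver, is my own simplification):

```python
def branch_and_bound(f, relax_min, lo, hi):
    """Minimize a convex 1-D function f over the integers in [lo, hi].

    relax_min(a, b) returns the continuous minimizer clipped to [a, b];
    f at that point is a lower bound for the subinterval (relaxation),
    and integer-feasible evaluations give upper bounds used for pruning."""
    best_val, best_x = float("inf"), None
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        if a > b:
            continue
        xr = relax_min(a, b)              # solve the relaxation on [a, b]
        if f(xr) >= best_val:
            continue                      # prune: bound no better than incumbent
        if xr == int(xr):
            best_val, best_x = f(xr), int(xr)   # integer-feasible: new upper bound
        else:
            k = int(xr)                   # branch on the fractional value
            stack += [(a, k), (k + 1, b)]
    return best_x

f = lambda x: (x - 2.6) ** 2
relax_min = lambda a, b: min(max(2.6, a), b)   # clip unconstrained minimizer
print(branch_and_bound(f, relax_min, 0, 10))   # 3
```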
## Dynamic Optimization Problem
> * Set up a general optimal control problem.


> * Which concepts did you learn to solve a dynamic optimization problem?
It solves sequential decision problems by breaking complex problems down into smaller subproblems. It often involves discrete-time models, e.g. obtained with the forward Euler method, where the decisions and states are updated at discrete time steps. The state transition refers to the way that the state of the system changes over time based on the current state and the decision taken.
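A minimal forward-Euler state-transition sketch (the integrator dynamics $\dot{x} = u$ are an illustrative choice):

```python
def euler_simulate(f, x0, u, dt):
    """Forward-Euler state transition: x_{k+1} = x_k + dt * f(x_k, u_k)."""
    xs = [x0]
    for uk in u:
        xs.append(xs[-1] + dt * f(xs[-1], uk))
    return xs

# integrator dynamics dx/dt = u with constant control u = 1 over t in [0, 1]
xs = euler_simulate(lambda x, u: u, 0.0, [1.0] * 10, 0.1)
```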
### Direct Methods
#### Simultaneous Method
* Full-discretization method
* Explicit Euler. Disadvantage: low consistency order; accurate only with small steps, hence not fast enough
* Implicit Euler scheme. Advantages: high consistency order, stability
* Collocation Method:
> * Describe the procedure of the simultaneous method. Which problem results from the full-discretization and which properties does this problem have?


In this method, both the states and the controls are fully discretized on a time grid.
(1) Discretization: the state and control trajectories are replaced by their values at a finite number of grid points; the differential equations become algebraic equality constraints (e.g., via an implicit Euler or collocation scheme).
(2) Formulation: the discretized problem is a finite-dimensional NLP with constraints at each grid point.
(3) Solution: the NLP is solved using standard optimization methods.
The full discretization results in a problem with a very large number of variables and constraints, but the problem is sparse and structured, which solvers can exploit.

#### Single Shooting Method (Sequential Method)
* The control variables u(t) are approximated by a series of (typically orthonormal) basis functions $\sigma_{i,k}(t)$ and corresponding coefficients $p_{i,k}$

>* What is the switching structure of a process?
A switching structure refers to a point in time where the optimal control strategy or decision changes. This can occur in response to changes in the **system's environment, objectives, or constraints**. It determines the optimal control policies and decisions over time.

For example, a too simple switching structure may not capture the complexity of the system, leading to sub-optimal solutions. On the other hand, a complex switching structure with too many decision points can result in excessive computational time and increased complexity.
> * What is the basic idea of the single shooting method to solve dynamic optimization problems?
In single shooting, only the control variables are parameterized by a finite set of unknown coefficients; for given coefficients, the states are obtained by integrating the dynamic model forward from the initial conditions "in one shot". The coefficients are then optimized so that the resulting trajectory satisfies the boundary/terminal conditions and minimizes the objective, i.e., the whole trajectory is simulated and optimized as a whole in each iteration.
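A toy single-shooting sketch (one constant control coefficient and a crude parameter sweep stand in for a proper basis-function parameterization and NLP solver):

```python
def shoot(u_const, x0=0.0, dt=0.1, n=10):
    """Simulate dx/dt = u with a constant control, return the terminal state."""
    x = x0
    for _ in range(n):
        x += dt * u_const
    return x

def objective(u_const, target=1.0):
    """Terminal cost (x(T) - target)^2 as a function of the control parameter."""
    return (shoot(u_const) - target) ** 2

# optimize the single control coefficient by a crude grid search over [-2, 2]
best = min((objective(u), u) for u in [i / 10 for i in range(-20, 21)])
```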
> * Explain the different methods to determine the gradient of the objective function in sequential methods.
(1) Forward finite difference:

(2) Forward sensitivities:
Advantage: if implemented efficiently, faster than finite differences
Disadvantage: hard to implement efficiently
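The first variant can be sketched directly (step size `h` is an assumed typical value; the cost is one extra function evaluation per variable, with $O(h)$ truncation error):

```python
def fd_gradient(f, x, h=1e-6):
    """Forward finite-difference approximation of grad f at x."""
    fx = f(x)
    g = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h                   # perturb one coordinate at a time
        g.append((f(xp) - fx) / h)
    return g

f = lambda v: v[0] ** 2 + 3.0 * v[1]
g = fd_gradient(f, [2.0, 0.0])       # exact gradient is [4, 3]
```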

#### Comparison:

* Explain the term shooting method.
The shooting method is a numerical approach to solving boundary value problems by converting them into initial value problems and iteratively adjusting the initial conditions until the solution satisfies the boundary conditions. It is commonly used in engineering and physics to solve differential equations that describe dynamic systems.
* What is the difference between single and multiple shooting?
Single shooting is a method in which the entire boundary value problem is solved in one shot, while multiple shooting is a method in which the boundary value problem is divided into a number of subproblems and solved separately.
* When would you use a data-driven model? Can these models be extrapolated? Explain hybrid modeling.
A data-driven model is used when there is a large amount of data available, and when the underlying relationship between the inputs and outputs is not well understood. Hybrid modeling is an approach that combines the strengths of both first-principle models and data-driven models. Data-driven models can be extrapolated, but the accuracy of extrapolation depends on the quality of the data used to build the model and the similarity between the training and extrapolation regions.
### Indirect Methods (analogous to Lagrange multipliers)
#### Adjoint Equations
* Optimize-then-discretize
## Uncertainty
* What problems can arise if uncertainty is neglected in the solution of optimization problems?
If uncertainty is neglected, the nominal solution may turn out to be infeasible or severely suboptimal once the uncertain parameters take values different from their nominal ones. Compared to neglecting uncertainties, both stochastic approaches and robust approaches result in more reliable, if more conservative, solutions.
* Name two basic notions of how uncertainty can be addressed.
(1) Stochastic approaches (Probabilistic modeling): This approach involves modeling uncertainty as a probability distribution, allowing the optimization problem to be solved under various scenarios of uncertainty.
(2) Robust approaches: This approach involves designing the optimization problem to be resilient to uncertainty, by considering the worst-case scenarios or worst-case performance measures.
* What is the essential feature of a two-stage stochastic program?
In the first stage, decision variables are selected based on uncertain information, and in the second stage, the outcome of the uncertain information is revealed and the decision variables are used to determine the final objective function value.
(1) The first stage determines the first-stage ("here-and-now") decisions before the uncertainty is realized.
(2) The second stage describes the second-stage (recourse) decisions for a specific uncertainty scenario.
## Semi Infinite Program
* What are the basic properties of a semi-infinite program (SIP)?
SIP has finitely many variables ($x$) and infinitely many constraints (indexed by $y$). SIPs are useful for worst-case optimization ("robust optimization").
* How does the discretized version of an SIP relate to the original problem?
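A minimal sketch of SIP discretization (the constraint $g(x, y) = x \sin y - 1 \le 0$ over $y \in [0, \pi]$ is my own toy example); note that the discretized problem is a relaxation of the original SIP, since the constraint is only enforced on a finite grid of $y$ values:

```python
import math

def sip_feasible(x, ys):
    """Check the discretized SIP constraint g(x, y) = x*sin(y) - 1 <= 0
    only at the finite grid ys (a relaxation of 'for all y in [0, pi]')."""
    return all(x * math.sin(y) - 1 <= 0 for y in ys)

ys = [i * math.pi / 100 for i in range(101)]   # discretization of [0, pi]
# Any x feasible for the original SIP is feasible for the discretized problem,
# so the discretized optimum bounds the SIP optimum from the optimistic side.
print(sip_feasible(1.0, ys), sip_feasible(1.5, ys))
```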

<style>
h2{color:#382FB5;}
h3{color:#5353E8;}
h4{color:#5e79ff;}
h5{color:#05b8ff;}
</style>