Search Direction

# Search Direction line search 할 때 어디로 line 그어서 어느 방향으로 이동할 지를 결정해서 one dimensional problem을 반복적으로 푸는 것이다. 오늘은 이 search direction을 어떻게 정하는지 알아 보자. Matines & Andrew 의 text book에서는 아래의 5가지 방법을 소개한다. 1) Steepest Descent 2) Conjugate Gradient 3) Newton's Method 4) Quasi-Newton Methods 5) Limited-Memory Quasi-Newton Methods ## Steepest Descent Gradient가 빠르게 증가하는 반대 방향으로 search direction을 정하는 것이다. \begin{equation} p = - \nabla f \end{equation} 또는 normalized를 이용하기도 함 \begin{equation} p = - \frac{\nabla f}{||f||} \end{equation} curvature가 방향에 따라 일정하면 빠르게 수렴한다. 근데 실제 문제에서 curvature가 방향에 따라 일정한 경우가 거의 없으므로 대부분의 경우에서 steepest descent는 비효율적이다. ## Conjugate Gradient 처음 정한 search direction의 orthogonal 한 방향으로 search direction을 정하는 것. \begin{equation} p^{(k)}=-\nabla f^{(k)} + \beta p^{(k-1)} \end{equation} where \begin{equation} \beta^{k} = \frac{||\nabla f^{(k)}||^{2}}{||\nabla f^{(k-1)}||^{2}} \end{equation} 이렇게 하면 어떤 특정한 문제를 풀 때에는 Steepest Descent보다 빠르게 수렴한다. ## Newton's Method Curvature information에 영향을 많이 받는데 그러면 이걸 아예 문제 풀이에 활용하는 방법은 없을까? 해서 나온 것이 Newton's method이다. Taylor expansion에서 2nd order term까지 활용해 보자. 함수 $f(x)$ 이 있는데, $x$에서 작은 $s$만큼 떨어진 곳의 함수 값 f(x+s)는 Taylor approximation에 의해서 아래와 같다. \begin{equation} f(x+s) = f(x)+\nabla f^{T}(x)s+\frac{1}{2}s^{T}H(x)s + \cdots \end{equation} 이것의 derivative가 '0' 가 되는 지점이 극점이다. \begin{equation} \frac{d[f(x+s)]}{ds} = \nabla f + Hs = 0 \end{equation} 정리하면 \begin{equation} Hs = -\nabla f \\ \end{equation} 따라서 small step $s$는 아래와 같다. \begin{equation} s=-H^{-1} \nabla f \end{equation} line search의 step size $\alpha$, search direction $p$일 때 다음 step $s=\alpha p$ 인 것을 떠올려 보자! 여기 $s=-H^{-1} \nabla f$ 에서 direction이 gradient descent의 $p=-\nabla f$ 인 상태에서 step size가 inverse Hessian 인 것과 같다고 볼 수 있다. 여기에! 뭔 문제가 3가지 있다. - Hessian이 positive definite가 아닐 수 있다. 그러면 이 방법으로 찾아지는 search direction이 descent direction이 아닐 수도 있어서 mininum을 찾을 수 없을 수도 있다. 이걸 해결하기 위해서 Hessian을 modify해서 positive definite이 되게 하는 방법이 있다. 그러면 descent direction이면서 approximate curvature를 찾아낼 수 있다. - 예측한 새로운 guess가 더 안 좋을 수 있다. 그럼 backtracking을 해서 다시 찾아볼 수 있다. - Hessian을 계산하는게 어렵거나 비싸다. 이걸 해결하기 위해서 quasi-Newton methods가 나옴 ## Quasi-Newton Methods 앞에서 언급한 것과 같이 Newton's method에서 Hessian을 계산하는게 어렵기 때문에 quasi-Newton methods로 Hessian을 approximation 한다. 이 방법 중에 가장 많이 쓰이는 것이 Broyden–Fletcher–Goldfarb–Shanno (BFGS) 이다. BFGS는 Davidon–Fletcher–Powell (DFP)에서 출발하였다. 일단 둘 다, gradient를 구하고 난 다음에 Secant rule을 이용해서 Hessian을 approximate한다. 따라서 approximated Hessian $B$는 다음과 같다. \begin{equation} B^{(k+1)} = \frac{\nabla f^{(k+1)} - \nabla f^{(k)}}{x^{(k+1)}-x^{(k)}} \end{equation} DFP의 경우에 무한으로 존재하는 Hessian 중에서 이전 iteration의 Hessian $B^{(k)}$과 현재 Hessian $B^{(k+1)}$의 차이가 가장 적은 Hessian을 찾아서 이것을 approximated Hessian으로 쓰는 방법이다. 따라서 minimum 값을 구하기 위한 optimization 문제를 풀어줘야 한다. 그것을 적용하는 과정에서 Newton's method의 step $p = -H^{-1} \nabla f \approx -B^{-1} \nabla f$를 근사화하는데 여기서 approximated Hessian의 inverse 값 $B^{-1}$을 $V$로 두고 바로 $V$를 구하는 방법이 BFGS 이다. 따라서 Newton's method의 step $p \approx -V\nabla f$가 되고 approximated inverse Hessian $V$ 를 구하기 위해서 아래와 같은 연산이 수행된다. \begin{aligned} \underset {V^{(k+1)}}{\text{minimize}} &\quad ||V^{(k+1)}-V^{(k)}|| \\ \text{subject to} & \quad V^{(k+1)}(\nabla f^{(k+1)}-\nabla f^{(k)}) = x^{(k+1)} - x^{(k)}\\ & \quad V^{(k+1)} = {V^{(k+1)}}^{T} \end{aligned} $V$는 positive semi-definite matrix를 이어야하므로 initial 값으로 identity matrix를 이용한다. ## Limited-Memory Quasi-Newton Methods Quasi-Newton Methods에서 Hessian matrix를 계속 memory에 저장하고 있어야 하는데 변수의 수가 너무 많아서 Hessian matrix가 너무 크게 되고 memory 용량이 적어서 다 저장하지 못하는 경우에 Limited-memory quasi-Newton methods가 쓰인다. Limited-memory quasi-Newton methods의 대표적인 방법이 Limited-memory BFGS (L-BFGS) 이다. 우리는 Hessian 값 자체가 필요한 것이 아니라 Newton's method의 step 으로 $p=-H^{-1} \nabla f$ 값이 필요한 것이다. 따라서 L-BFGS 방법은 approximated inverse matrix $V$와 $\nabla f$의 product 값을 한번에 구해서 matrix vector product하는 계산도 줄이고 Hessian matrix 저장하는 공간도 줄인다.