補充：large coefficients for high-order polynomials

# 補充：large coefficients for high-order polynomials ## 背景本篇內容為筆記「[4.8 Model Selection Procedures](https://hackmd.io/@pipibear/BJ5s6ZIO0)」內容的延伸。在那篇筆記裡，我們在討論的是當在選擇要用哪個 model 時，我們希望能找到 error 最小的那個 model，而 error 的定義方式有許多種，其中一種叫做 Bayesian Model Selection。在介紹這種方法時，我們舉了一個 regression model 的例子，討論 given 一組 dataset 的情況下，要選 degree 多大的 polynomial 去 fit 這些 data points 會比較好。前面幾篇筆記裡都講過，隨著 polynomial 的 degree 越大，雖然 bias 會變小，但是 variance 卻會因為 fit 到 noise 的關係變大，整體的 error 可能不減反增。由於這樣的特性，我們希望 regression model 在考量 fitting data points 之餘，盡可能的讓 degree 可以越小越好，也就是可以的話 model 不要太過複雜。因此，為了限制 model 的 degree，我們額外加一個 penalty term 到我們的 error function，這個 penalty 由++係數的平方和，再乘上一個 weight++ 定義，理由是因為當 polynomial degree 越大時，係數的值通常就會越大，所以去施加這樣的 penalty，就等同去限制 polynomial degree。當時我們沒有多講為什麼當 polynomial degree 越大時，係數的值通常就會越大，而這篇筆記就是要針對這個問題去簡短說明。 > 不過這部分我也沒有太過深入去理解，所以斟酌參考即可，有興趣可以參考下方參考資料的連結。 --- ## Polynomial interpolation 首先，找出一個 function 來 fit dataset 中的 points 這件事稱作 polynomial interpolation，說明如下圖： ![image](https://hackmd.io/_uploads/BkisTwidR.png) > 我們利用 $n$ 次多項式（共 $n+1$ 個係數 $a_0,...,a_n$），來 fit $m+1$ 個 data points。如果我們把這 $m+1$ 個 data points 代入 $p(x)$，就會形成 $m+1$ 條方程式： \begin{equation} \begin{split} p(x_0) &= a_0 + a_1x_0 + a_2x_0^2 + ... + a_nx_0^n \\ p(x_1) &= a_1 + a_1x_1 + a_2x_1^2 + ... + a_nx_1^n \\ &. \\ &. \\ &. \\ p(x_m) &= a_1 + a_1x_m + a_2x_m^2 + ... + a_nx_m^n \\ \end{split} \end{equation} ### matrix form：形成 Vandermonde matrix 如果寫成矩陣的形式，我們會發現這 $m+1$ 條方程式變成一個 Vandermonde matrix 乘上一個係數向量，等於一個 output $\vec{y} = [y_0,...,y_m]^T$ 的向量： ![image](https://hackmd.io/_uploads/S1JXxdsdA.png) 接著根據 Vandermonde 矩陣的特性，在 $m=n$，且 $x_i$ 之間彼此不相同的條件下，會發現我們可以取 $V^{-1}$ 來求 $a$： ![image](https://hackmd.io/_uploads/r13qx_i_R.png) ### 衡量 uncertainty 的方式：condition number 但是儘管理論上這個 mapping 是 bijective 的，所以我們應該要可以求出 unique $\vec{a}$；但在實際情況裡，$\vec{a}$ 到底是多少仍然存在著誤差，這個誤差源自於如電腦計算時產生的 rounding error： ![image](https://hackmd.io/_uploads/ByzbzdsOC.png) 知道了這件事以後，我們很直覺的就會去想：這個 noise 會讓我們求出來的 $p(x)$ 係數有多大程度的變動？這個 uncertainty 的程度其實是可以被量化的，我們用一個稱作 condition number 的東西來表示我們的 uncertainty 程度，並且以 ==$cond \ A$== 表示 matrix $A$ 的 condition number，這個++數值越高，代表越容易受 noise 影響++。詳細如下圖： ![image](https://hackmd.io/_uploads/r11EGujO0.png) > 其實 norm 的選擇有很多種，下方參考資料「How (Un)stable Are Vandermonde Systems」舉了三種來說明。 >> 那篇內容非常詳細，包含爆多的數學，我只想知道大概就沒看完了，有興趣可以看看。省略所有數學細節以後我們可以得到結論： ![image](https://hackmd.io/_uploads/Byg34douR.png) ### Runge's Phenomenon 講到這邊我其實也沒有找到一個很簡單的方式去用數學式來證明為什麼 high-order polynoimal tends to have larger coefficients，只能從這個結論大概想像：當我們的係數越容易受到個別的每個 data point 影響，整個 function 的波動就越大。這整個現象其實有一個名稱叫做 Runge's Phenomenon，而它背後用的數學一部分是我們前面講的用 Vandermonde matrix 來構建我們的式子這件事。下圖我只是很簡單的截取了 wiki 上的圖，然後講結論，整個東西的推導未來有機會有能力再寫吧！ ![image](https://hackmd.io/_uploads/B1X0Pds_R.png) ## 解決 Runge's Phenomenon 的方式 ### Curve fitting 前面一整段下來，我們的結論時當 polynomial 具有 high degree 時，其實去求解它的 coefficients 通常是個 ill-conditioned problem，所以我們就有另一種替代方案叫做 curve fitting。兩者之間的差異還有 curve fitting 背後的想法在於： Interpolation 應該要被用在當我們很確定我們的 data 非常正確的時候，在這種情況下，我們才會希望我們的 function 可以盡可能地去 fit 每個點。但是如果是一個我們對 data 會怎樣呈現有一點想法的情況，我們只希望找出 data points 的整體趨勢大概是什麼樣的，不需要我們的 function 去 exactly 穿過每個 data points，那就比較適合用 curve fitting。原文說明比較清楚： :::info Curve fitting is used when you have some idea of the expected functionality of your data, but you don't need your function to exactly pass through all the data points. This is typical when the data may contain measurement errors or other imprecisions or when you wish to extract the general trend of the data. ::: > 出自參考資料中的 "Least squares approximation question" 舉例來說，假設我們猜測我們的 data 的 general trend 應該是個 exponential function，那 curve fitting 就是直接把 data points 代入這個 exponential function，再去尋找最適合的 parameters。講到這裡，聽起來好像有點熟悉？其實前面的小節裡，我們在尋找最適合的 parameters 值這件事，就是在做 curve fitting。我們先決定了 function 應該要是什麼樣子，接著看看 parameter space 裡的值，哪個代入 data points 後，總體的誤差會最小，至於定義 error 的方式，就是用像我們提過的 least squares approximation。 ![image](https://hackmd.io/_uploads/B1CjHtiu0.png) ### spline interpolation 另一種解決 Runge's Phenomenon 的方式叫做 spline interpolation，想法其實很簡單，就是與其我們用一個會產生很大的 variance 的 high-order polynomial 來 fit 我們的 data，我們就改用++許多不同的 low-degree polynomials 來連結 data points++。 > spline: a chain of several polynomials of a lower degree. ![image](https://hackmd.io/_uploads/B11voYjuC.png) 舉例來說，假設我們用 quadratic functions 來連結我們的 data points，那麼我們就會先將 data 劃分成不同的 intervals，每一個 interval 間都定義一個不同的 quadratic function，這樣一來，每一段的 polynomial 都會是 smooth 且 continuous 的，如下圖： ![image](https://hackmd.io/_uploads/B197YtidC.png) 至於如何去構造這樣的 spline，我沒有多去了解，或許以後有用到的時候再補充到筆記吧！ # 參考資料 - [How (Un)stable Are Vandermonde Systems](https://www.cs.purdue.edu/homes/wxg/selected_works/section_01/118.pdf) - StackExchange: - [Why are there large coefficents for higher-order polynomial](https://stats.stackexchange.com/questions/233414/why-are-there-large-coefficents-for-higher-order-polynomial) - [Least squares approximation question](https://scicomp.stackexchange.com/questions/11153/least-squares-approximation-question/11158#11158) - wiki: - [Vandermonde matrix](https://en.wikipedia.org/wiki/Vandermonde_matrix) - [Runge's phenomenon](https://en.wikipedia.org/wiki/Runge%27s_phenomenon) - [Polynomial interpolation](https://en.wikipedia.org/wiki/Polynomial_interpolation)