4.4 The Bayes' Estimator

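# 4.4 The Bayes' Estimator

Continuing the discussion from the start of this chapter: we assume our sample is drawn from a distribution that follows some specific model, the model is defined by one or more parameters, and our goal is to find the true values of those parameters.

## prior density

In this section we first introduce the term prior density. Sometimes, before looking at the sample, we (or experts in the application domain) have some prior information about the value of the unknown parameter $\theta$, for example the range $\theta$ is likely to fall in, or the kind of distribution it tends to follow.

> For example:
>
> In the coin-toss setting, our prior information is the belief that the probability of heads is very likely to be close to $\frac{1}{2}$.

Such prior information is usually very useful (especially when the sample is small). Even though it does not tell us directly what the value of $\theta$ is (and if it did, we would not need to draw a sample at all), we can still use it in further calculations. So once we have some information, we model the uncertainty in it as follows:

:::warning
Treat $\theta$ as a random variable and define its pdf as the prior density $p(\theta)$.
:::

That sounds a bit abstract, so here is an example.

### Example

Suppose the prior information tells us that, although we do not know what our parameter $\theta$ is, the possible values of $\theta$ are approximately normally distributed, and we are $90\%$ confident that $\theta$ lies between $5$ and $9$, symmetric around $7$.

> For more on the normal distribution, see the note "[A.3.5 Normal(Gaussian) Distribution](https://hackmd.io/@pipibear/rkq0ZkoEC)".

Drawn as a plot, it looks roughly like this:

![image](https://hackmd.io/_uploads/SkwjWBZvA.png)

> Put simply, we imagine that the distribution of $\theta$ should look like the one in the figure.

If we first z-normalize $\theta$ and look up in the table how many $\sigma$ away from $0$ we must go to cover the central $90\%$, we find $\pm1.64$. The half-width $9 - 7 = 2$ therefore corresponds to $1.64\sigma$, so $\sigma = \frac{2}{1.64}$, and shifting back we get:

\begin{equation}
p(\theta) \sim N(7,(\frac{2}{1.64})^2)
\end{equation}

The steps are as follows:

![image](https://hackmd.io/_uploads/SkUQ7BWDC.png)

:::info
A prior density ==$p(\theta)$== like this tells us ++which values $\theta$ may take before we look at the sample++. We combine it with ++what the sample data tells us++ (called the likelihood density ==$p(X|\theta)$==) and apply Bayes' rule to obtain the posterior density ==$p(\theta|X)$==, which tells us ++which values $\theta$ may take after we have looked at the sample++.
:::

> For Bayesians, the value of the parameter $\theta$ is treated as random, so we need the prior and the data to compute its posterior distribution.
>
> This will become clearer after reading the explanation below.

$\rightarrow$ The material that follows is closely related to, and overlaps with, the note "[補充：Bayesian Estimation](https://hackmd.io/@pipibear/Sy4XW138A)", so the two can be read side by side.

> That note is probably the more basic and detailed one; if something here is not explained much, it may help to read that one first.

## posterior density

If we write the posterior density discussed above as a formula, it is the expression below: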
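![image](https://hackmd.io/_uploads/H1pbgBSwA.png)

> The first equality is just Bayes' rule.

From the expression above we see that $p(\theta|X)$ is proportional to the numerator, which gives a commonly used formula:

:::success
\begin{equation}
p(\theta|X) \propto p(X|\theta)p(\theta)
\end{equation}
:::

> The $p(X)$ in the denominator is the marginal probability of $X$. It does not depend on $\theta$, so it can be treated as a constant.

With the definition in hand, we can look at a picture of how the posterior pdf is actually formed. The example in the figure below explains it clearly:

![image](https://hackmd.io/_uploads/rJ_vebaDC.png)

> First we have a random variable $W$.
>
> At the start, our prior information lets us draw the distribution of $W$ (not the actual distribution of $W$, of course, only the distribution implied by the information we currently have).
>
> Next, at the part of the figure labeled $2$: once we actually collect observational data, the experiment tells us what the pdf should look like according to that result.
>
> At the same time, for each possible value of the parameter, $\mu = \mu_i$ (in this example the unknown parameter is $\mu$, with $n$ possible values), we draw the corresponding pdf, giving $n$ different pdfs, and check how similar each one is to the pdf of the observational data.
>
> Intuitively, a pdf that looks more like the one obtained from the experimental data is more probable, so the more similar it is, the higher the weight we give it (this step is computing the likelihood).
>
> Finally, we multiply each of the $n$ pdfs by its weight and add them up, which gives the posterior pdf shown at point 4.

## predict new data point

With all of the above, when we get a new data point $x$ we can predict its density. We want the density of $x$ after having seen the sample $X$, that is, $p(x|X)$:

![image](https://hackmd.io/_uploads/BJ8olHSwA.png)

From this expression:

\begin{equation}
p(x|X) = \int p(x|\theta)p(\theta|X) \,d\theta
\end{equation}

we see that what it does is average the predictions made under every possible value of $\theta$. What does that mean? A simple example makes it clear:

![image](https://hackmd.io/_uploads/SkIR2xavR.png)

So we are summing the density of $x$ over all possible $\theta$, weighted by $p(\theta|X)$ (how probable we consider each $\theta$ after seeing the sample).

In addition, the $p(x|\theta_i)$ determined by each $\theta_i$ is itself a prediction of how probable $x$ is. This idea connects back to the note "[2.6 Regression](https://hackmd.io/@pipibear/rJhgG8NNA)", where a function $g()$ assigns a weight to the input $x$ (or one weight per input when there are multiple inputs) in order to map the input to the predicted output. So if we use the regression notation and let $y$ denote the prediction for $x$ given the parameter $\theta$, that is:

\begin{equation}
y = g(x|\theta)
\end{equation}

then we get another expression:

![IMG_28D8B8EB28EB-1](https://hackmd.io/_uploads/SyyZJxpDA.jpg)

At this point, though, the integral looks too hard to compute. If we could commit to one specific value of $\theta$, we would not have to work this hard. Following that idea, we first introduce the MAP (maximum a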
posteriori) estimate.

### MAP (maximum a posteriori) estimate

Here is the summary from my notes:

![image](https://hackmd.io/_uploads/rJOBKWav0.png)

> When the posterior pdf is known, finding the MAP estimate of $x$ means finding the $x$ at which the posterior pdf is largest.

Our situation is this: we are not sure what value $\theta$ should take, so we want to pick the one $\theta$ that is most probable given the observation $X$. Written as a formula:

\begin{equation}
\theta_{MAP} = \arg \max_{\theta} p(\theta|X)
\end{equation}

Back in the original setting, the purpose of taking the MAP estimate of $\theta$ is to simplify the complicated integral. What actually happens once we take $\theta_{MAP}$ is shown in the figure below:

![image](https://hackmd.io/_uploads/BJOzJfawR.png)

> $p(x|X) = p(x|\theta_{MAP})$ means we can draw the single specific pdf defined by $\theta_{MAP}$, and reading that pdf at the point $x$ gives us $p(x|X)$.
>
> This is why we say that $\theta_{MAP}$ reduces the whole density to a single point.

#### Relationship between the MAP estimate and the ML estimate

If we have no prior information telling us which $\theta$ is preferable, that is equivalent to our prior density being ++flat++.

> The English word flat is used deliberately, because some books use flat to mean ++uniform++.
>
> The correspondence is intuitive: flat means the pdf is a horizontal line, so the density is the same (a constant) whatever the value of $\theta$. We call such a pdf uniform, and we say that this prior is a ++noninformative prior++.
>> I have written about noninformative priors in another note, though I forget which one; the meaning is essentially what is described above.

What effect does a noninformative prior have? See the explanation in my notes:

![image](https://hackmd.io/_uploads/ByuMLzpP0.png)

> If you have forgotten what MLE is, see the note "[4.2 Maximum Likelihood Estimation](https://hackmd.io/@pipibear/rk9DvfgSC)".

The key conclusion, written out and highlighted:

:::warning
\begin{equation}
p(\theta) = c \quad \rightarrow \quad\theta_{MAP} \equiv \theta_{ML}
\end{equation}
:::

### Bayes estimator

Besides $\theta_{MAP}$ and $\theta_{ML}$ seen above, we define one more, $\theta_{Bayes}$, called the Bayes estimator:

:::info
\begin{equation}
\theta_{Bayes} = E[\theta|X] = \int \theta p(\theta|X) \,d\theta
\end{equation}
:::

> We take the expected value because, for a random variable, the best estimate (in the squared-error sense) is its mean.

Suppose the variable we want to predict is $\theta$ with $E[\theta] = \mu$, and let a constant $c$ be our estimate of $\theta$. Then the expected squared error is:

\begin{equation}
\begin{split}
E[(\theta - c)^2] &= E[(\theta - \mu + \mu - c)^2] \\
&= E[(\theta - \mu)^2] + 2(\mu - c)E[\theta - \mu] + (\mu - c)^2 \\
&= E[(\theta - \mu)^2] + (\mu - c)^2
\end{split}
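\end{equation}

> This follows from the linearity of expectation: the cross term $2(\mu - c)E[\theta - \mu]$ vanishes because $E[\theta - \mu] = 0$, and $(\mu - c)^2$ is a constant, so taking its expected value leaves it unchanged.

From this expression we see that $E[(\theta - c)^2]$ is minimized by choosing the estimate $c = \mu$.

Suppose our distribution is a normal distribution $N(\mu,\sigma^2)$. Because the mode of a normal distribution is also $\mu$, we have:

> For a short introduction to the mode, see example 2 in the last subsection of the note "[補充：Bayesian Estimation](https://hackmd.io/@pipibear/Sy4XW138A)".
>> In the present discussion we are talking about the distribution of $\theta$, with $\theta \sim N(\mu,\sigma^2)$, so the mode of $\theta$ is the $\theta$, among all possible values, that makes $p(\theta|X)$ largest.
>>
>> Because the normal distribution is bell-shaped, it is easy to see that the point of highest density is $\mu$, right in the middle, so the mode of $\theta$ is $\mu$.

\begin{equation}
\theta_{Bayes} = \theta_{MAP}
\end{equation}

> By definition, $\theta_{MAP}$ is the value of $\theta$ that maximizes the posterior pdf / pmf, which is the same as taking the mode of $\theta$.

Restated with the key point highlighted:

:::warning
\begin{equation}
\text{if } p(\theta|X) \text{ is normal} \quad \rightarrow \quad \theta_{Bayes} = \theta_{MAP}
\end{equation}
:::

An example:

![image](https://hackmd.io/_uploads/Hy5LZzCP0.png)

For the figure below, just look at the final result. Partway through I found the integral too hard to compute, and the textbook also gives only the result, so I did not work it out.

> The textbook does mention that Chapter 16 covers some approximation methods for computing the full integral, so I will come back to this when I get there.

![image](https://hackmd.io/_uploads/ryovfGAvC.png)

> Working it out properly shows that the posterior pdf is also normal. By the definition of $\theta_{MAP}$, we want the $\theta$ at the highest point of the posterior pdf, which for a normal distribution is the mean, as mentioned earlier.
>
> So $\theta_{Bayes}$, which computes the mean, equals $\theta_{MAP}$.

Finally, from the closing expression for $\theta_{Bayes}$, we see that it behaves differently depending on whether $N$ and $\sigma_0^2$ are large or small. See the figure below:

![image](https://hackmd.io/_uploads/SJ4EQGAwC.png)

Before closing this section, note that:

:::warning
Both the MAP and the Bayes' estimator ++reduce the whole density to a single point++, and therefore ++lose information++.

$\rightarrow$ Unless the posterior is unimodal and forms a narrow peak around these points.
:::

> unimodal: having a unique mode.
>
> See the figure below:
> ![image](https://hackmd.io/_uploads/BkFc4fAPA.png)

---

# References

- CMU stat course material (pdf)
    - [Chapter 3: Basics of Bayesian Statistics](https://www.stat.cmu.edu/~brian/463-663/week09/Chapter%2003.pdf)
- Introduction to Probability, Statistics and Random Processes
    - [9.1.2 Maximum A Posteriori (MAP)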
Estimation](https://www.probabilitycourse.com/chapter9/9_1_2_MAP_estimation.php)
- Stanford teaching notes:
    - [lecture 16: inference](https://web.stanford.edu/~rjohari/teaching/notes/226_lecture16_inference.pdf)
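
# Supplement: numerical check of $\theta_{ML}$, $\theta_{MAP}$ and $\theta_{Bayes}$

As a companion to the normal-prior example in Section 4.4, here is a minimal sketch that evaluates $p(\theta|X) \propto p(X|\theta)p(\theta)$ on a grid of $\theta$ values and reads off the three estimates. The prior $N(7,(\frac{2}{1.64})^2)$ is the one from the notes. Everything else is an assumption made for illustration: the five data points, the known likelihood variance `SIGMA2`, the grid range, and all function and variable names. The closed form used for comparison is the standard result for a normal prior with a normal likelihood of known variance, $\theta_{Bayes} = w\,m + (1-w)\,\mu_0$ with $w = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}$, where $m$ is the sample mean and $\mu_0$ the prior mean (these symbols are introduced here, not taken from the text above).

```python
import math


def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def grid_posterior(sample, sigma2, mu0, sigma0_2, grid):
    """Normalised posterior weights on `grid`, using p(theta|X) ∝ p(X|theta) p(theta)."""
    log_post = [
        sum(log_normal_pdf(x, theta, sigma2) for x in sample)  # log-likelihood
        + log_normal_pdf(theta, mu0, sigma0_2)                  # log-prior
        for theta in grid
    ]
    peak = max(log_post)  # subtract the peak before exponentiating, for stability
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)  # plays the role of the constant p(X)
    return [w / total for w in weights]


# Hypothetical data and settings (assumptions, not from the notes).
sample = [6.1, 7.4, 5.8, 6.9, 6.3]
SIGMA2 = 1.0                    # assumed known variance of each observation
MU0 = 7.0                       # prior mean from the example in the notes
SIGMA0_2 = (2.0 / 1.64) ** 2    # prior variance from the example in the notes
grid = [3.0 + 0.001 * i for i in range(8001)]  # theta values from 3 to 11

posterior = grid_posterior(sample, SIGMA2, MU0, SIGMA0_2, grid)

# ML estimate of a normal mean: the sample mean (what a flat prior would give).
theta_ml = sum(sample) / len(sample)

# MAP estimate: the grid point where the posterior is largest.
theta_map_grid = grid[max(range(len(grid)), key=lambda i: posterior[i])]

# Bayes' estimate: the posterior mean, approximated by a weighted sum over the grid.
theta_bayes_grid = sum(t * w for t, w in zip(grid, posterior))

# Closed form for comparison (normal prior, normal likelihood, known variance).
n = len(sample)
w_data = (n / SIGMA2) / (n / SIGMA2 + 1.0 / SIGMA0_2)
theta_bayes_closed = w_data * theta_ml + (1.0 - w_data) * MU0

print("theta_ML            :", theta_ml)
print("theta_MAP (grid)    :", theta_map_grid)
print("theta_Bayes (grid)  :", theta_bayes_grid)
print("theta_Bayes (closed):", theta_bayes_closed)
```

Because the posterior here is normal, the grid MAP and the grid posterior mean should agree up to the grid resolution, and both should sit between the sample mean and the prior mean. Increasing the number of data points, or increasing `SIGMA0_2`, pushes `w_data` toward $1$, so the estimate moves toward the sample mean; a small sample or a small `SIGMA0_2` pulls it toward the prior mean.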