ベイズ推定を用いた回帰

# ベイズ推定を用いた回帰この記事は[古川研究室 Workout_calendar](https://qiita.com/flab5420/private/aee38a24be700d146817) 22日目の記事です．本記事は古川研究室の学生が学習の一環として書いたものです．内容が曖昧であったり表現が多少異なったりする場合があります． # 概要第1回目と第2回目の記事は最尤推定とMAP推定の回帰に関しての記事でした．第3回目である本記事では，ベイズ推定を使用した回帰に関してまとめたいと思います．この記事で行ったことは具体的に以下になります． - (理論)ベイズ推定によるモデルのパラメータ$\mathbf{w}$の事後分布の推定 - (実装)導出した事後分布を用いた回帰 - (実装)逐次学習（パラメータ更新の部分） - (実装)新規入力$x^*$に対する予測分布のプロットちなみに，ベイズ推定については[第1回目の記事](https://hackmd.io/dg0K0jvnStK8ctrRKSg3dw?both)で簡単に触れているのでよろしければご覧ください ## 課題設定 **与えられているもの** 入力$\mathbf x=(x_1,...,x_N), x\in\mathbf R$と出力$\mathbf y=(y_1,...,y_N), y\in\mathbf R$のデータセット **仮定** ・$\mathbf x$と$\mathbf y$のデータセットは，独立同分布(i.i.d.)に従うとします．・$y_i,f,\epsilon$を次のように仮定します（$\beta$はガウスノイズの精度でスカラーです）． \begin{eqnarray} y_i&=&f(x_i;\mathbf w)+\epsilon_i \end{eqnarray} \begin{eqnarray} f(x_i;\mathbf w)=\mathbf{w}^T\boldsymbol{\phi}(x_i) \end{eqnarray} \begin{eqnarray} \epsilon_i&\sim& N(0,\beta^{-1}) \end{eqnarray} $\;\;\;$ここで，基底関数$\boldsymbol{\phi}(x)$とパラメータ$\mathbf{w}$は$D$個あるので次のようになります． \begin{eqnarray} \boldsymbol{\phi}(x_i)&=&(\phi_1(x_i),...,\phi_{D}(x_i))^T \end{eqnarray} \begin{eqnarray} \mathbf{w}&=&(w_1,...,w_{D})^T \end{eqnarray} ・事前分布は，あまり大きな$\mathbf{w}$を取らないと仮定します（ハイパーパラメータ$\alpha$はガウス分布の精度でスカラーです）． $$p(\mathbf　w|\alpha)=N(\mathbf　w|\mathbf 0,\alpha^{-1}\mathbf I)$$ **ベイズ推定のタスク** ベイズの定理からパラメータ$\mathbf w$の事後分布を推定します． \begin{eqnarray} p(\mathbf {w}|\mathbf x, \mathbf y, \alpha, \beta) &=&\frac{p(\mathbf y|\mathbf x,\mathbf　w,\beta)p(\mathbf w|\alpha)}{p(\mathbf y|\mathbf x, \alpha, \beta)}\\\ \end{eqnarray} また，推定して得られた$\mathbf w$の事後分布から，新規入力$x^*$に対する出力$y^*$の予測分布が得られます． ## 解析解の導出 1. ベイズの定理を用いて事後分布を推定する 2. 新規入力$x^*$に対する出力$y^*$の予測分布を求める ## 1. パラメータの事後分布を推定するはじめに，観測データ$(x_i,y_i)$が与えられているとすると，仮定から尤度$p\left(y_i|x_i,\mathbf{w},\beta\right)$は， \begin{eqnarray} p\left(y_i|x_i,\mathbf{w},\beta\right) &=&N\left(y_i|f(x_i),\beta^{^-1}\right)\\ &=&\sqrt\frac{\beta}{{2\pi}}\exp\left(-\frac{\beta}{2}(y_i-f(x_i))^{2}\right)\\ &=&\sqrt\frac{\beta}{{2\pi}}\exp\left(-\frac{\beta}{2}\left(y_i-\mathbf{w}^{T}\boldsymbol{\phi}\left(x_i\right)\right)^{2}\right) \end{eqnarray} となります．ここで，計算を簡単にするために尤度に対数をとると， $$\ln p\left(y_i|x_i,\mathbf{w},\beta\right)=\frac{1}{2}\ln \frac{\beta}{2\pi}-\frac{\beta}{2}\left(y_i-\mathbf{w}^{T}\boldsymbol{\phi}(x_i)\right)^{2}$$ です．従って，データ全体の対数尤度は， \begin{eqnarray} \ln p(\mathbf y|\mathbf x,\mathbf{w},\beta) &=&\sum_{i=1}^{N}\ln p(y_i|x_i,\mathbf{w},\beta)\\ &=&\sum_{i=1}^{N} \left \{ \frac{1}{2}\ln \frac{\beta}{2\pi}-\frac{\beta}{2}(y_i-\mathbf{w}^{T}\boldsymbol{\phi}(x_i))^{2} \right\} \\ &=&-\frac{\beta}{2}\sum_{i=1}^{N}(y_i-\mathbf{w}^{T}\boldsymbol{\phi}(x_i))^{2}+\frac{N}{2}\ln \beta-\frac{N}{2}\ln {2\pi}\\ &=&-\frac{\beta}{2}\|\mathbf{y}-\Phi\mathbf{w}\|^{2}+\frac{N}{2}\ln \beta-\frac{N}{2}\ln {2\pi}\\ &=&-\frac{\beta}{2}\|\mathbf{y}-\Phi\mathbf{w}\|^{2}+const \end{eqnarray} となります．ここで，$N$はデータ数，$D$はパラメータの個数，$const$は$\mathbf{w}$に関して定数であることを表し，$\mathbf{y},\mathbf{w},\Phi$はそれぞれ， \begin{eqnarray} \mathbf{y}&=&\left(y_1,...,y_N\right)^{T}\\ \mathbf{w}&=&(w_1,...,w_D)^{T}\\ \Phi &=& \left( \begin{array}{cccc} \phi_{1}(x_1) & \cdots & \phi_{D}(x_1)\\ \vdots & \ddots & \vdots\\ \phi_{1}(x_N) & \cdots & \phi_{D}(x_N)\\ \end{array} \right)\\ \end{eqnarray} となっています． #### パラメータの事前分布今，パラメータ$\mathbf w$はあまり大きな値を取らないと仮定すると．ガウス分布を使って事前分布$p(\mathbf{w}|\alpha)$を次のように定義できます． $$p(\mathbf{w}|\alpha)=N(\mathbf{w}|\mathbf 0,\alpha^{-1}\mathbf I)$$ ここで，$\alpha$はガウス分布の精度を表しており，ハイパーパラメータです．更に，対数事前分布$\ln p(\mathbf{w}|\alpha)$は， \begin{eqnarray} \ln p(\mathbf{w}|\alpha) &=&\ln \left\{\sqrt{\frac{\alpha}{2\pi}}\exp\left(-\frac{\alpha}{2}\|\mathbf{w}\|^2\right)\right\} &=&-\frac{\alpha}{2}\|\mathbf{w}\|^2+const \end{eqnarray} となります．ここで，パラメータ$\mathbf{w}$に関して定数なものは$const$にまとめています． #### パラメータの事後分布ベイズの定理から，パラメータ$\mathbf{w}$の事後分布$p(\mathbf{w}|\mathbf x, \mathbf y,\alpha,\beta)$は \begin{eqnarray} \ln p(\mathbf{w}|\mathbf x, \mathbf y,\alpha,\beta) &=& \ln p(\mathbf y|\mathbf x,\mathbf{w},\beta)+\ln p(\mathbf {w}|\alpha)-\ln p(\mathbf y|\mathbf x, \alpha,\beta)\\ &=& -\frac{\beta}{2}\|\Phi\mathbf{w}-\mathbf{y}\|^{2}-\frac{\alpha}{2}\|\mathbf{w}\|^{2}+const\\ &=&-\frac{\beta}{2}(\Phi\mathbf{w}-\mathbf{y})^{T}(\Phi\mathbf{w}-\mathbf{y})-\frac{\alpha}{2}\mathbf{w}^{T}\mathbf{w}+const\\ &=&-\frac{\beta}{2}\{\mathbf{w}^T\Phi^{T}\Phi\mathbf{w}-\mathbf{w}^T\Phi^T\mathbf{y}-\mathbf{y}^T\Phi\mathbf{w}+\mathbf{y}^T\mathbf{y}\}+\frac{\alpha}{2}\mathbf{w}^T\mathbf{w}+const\\ &=&-\frac{\beta}{2}\{\mathbf{w}^T\Phi^{T}\Phi\mathbf{w}-2\mathbf{y}^T\Phi\mathbf{w}+\mathbf{y}^T\mathbf{y}\}+\frac{\alpha}{2}\mathbf{w}^T\mathbf{w}+const\\ &=&-\frac{1}{2}(\beta\mathbf{w}^T\Phi^{T}\Phi\mathbf{w}+\alpha\mathbf{w}^T\mathbf{w})+(\beta\mathbf{y}^T\Phi)\mathbf{w}+const\\ &=&-\frac{1}{2}\mathbf{w}^T(\beta\Phi^{T}\Phi+\alpha \mathbf{I})\mathbf{w}+(\beta\mathbf{y}^T\Phi)\mathbf{w}+const \end{eqnarray} となっています．$const$については，パラメータ$\mathbf{w}$に無関係な定数です．ここで，尤度（関数）と事前分布の関係が共役な関係になるように事前分布を選ぶと事後分布が事前分布と同じ分布になるので，事後分布は多変量ガウス分布$N(\mathbf{w}|\mathbf{m,S})$になります．ここで，事後分布に対数を取ると対数事後分布$\ln p(\mathbf{w}|\mathbf x, \mathbf y,\alpha,\beta)$は， \begin{eqnarray} \ln p(\mathbf{w}|\mathbf x,\mathbf y,\alpha,\beta) &=&\frac{1}{\sqrt{(2\pi)|S|}}\exp\left(-\frac{1}{2}(\mathbf{w}-\mathbf{m})^TS^{-1}(\mathbf{w}-\mathbf{m})\right)\\ &=&-\frac{1}{2}(\mathbf{w}-\mathbf{m})^TS^{-1}(\mathbf{w}-\mathbf{m})+const\\ &=&-\frac{1}{2}\mathbf{w}^TS^{-1}\mathbf{w}+\mathbf{m}^{T}\mathbf{S}^{-1}\mathbf{w}+const \end{eqnarray} となって，先ほど導出した対数尤度の式 $$\ln p(\mathbf{w}|X,\beta,\alpha)=-\frac{1}{2}\mathbf{w}^T(\beta\Phi^{T}\Phi+\alpha \mathbf{I})\mathbf{w}+(\beta\mathbf{y}^T\Phi)\mathbf{w}+const$$ と見比べると，「$\mathbf{w}^{T}〇\mathbf{w}$」と「$〇\mathbf{w}$」の項の係数は対応した関係になっているので次のようになります． $$\mathbf{S}^{-1}=(\beta\Phi^{T}\Phi+\alpha \mathbf{I})$$ $$\mathbf{m}^{T}\mathbf{S}^{-1}=\beta\mathbf{y}^T\Phi$$ これを$\mathbf{m}$と$\mathbf{S}$について解きます．まず，$\mathbf{S}$は次のようになります． $$\mathbf{S}=(\beta\Phi^{T}\Phi+\alpha \mathbf{I})^{-1}$$ 次に，$\mathbf{m}$をもとめるために式変形すると \begin{eqnarray} \mathbf{m}^{T}\mathbf{S}^{-1}&=&\beta\mathbf{y}^T\Phi\\ \mathbf{m}^T&=&\beta\mathbf{y}^T\Phi S\\ \mathbf{m}^T&=&\beta\mathbf{y}^T\Phi (\beta\Phi^{T}\Phi+\alpha \mathbf{I})^{-1}\\ \mathbf m^T&=&\mathbf y^T\Phi(\Phi^T\Phi+\lambda \mathbf I)^{-1} \end{eqnarray} 従って，$\mathbf{m}$は \begin{eqnarray} \mathbf{m} &=&(y^T\Phi(\Phi^T\Phi+\lambda \mathbf I)^{-1})^T\\ &=&((\Phi^{T}\Phi+\lambda \mathbf{I})^{-1})^T(\mathbf y^T\Phi)^T\\ &=&(\Phi^{T}\Phi+\lambda \mathbf{I})^{-1}\Phi^{T}\mathbf{y} \end{eqnarray} となります．ここで，$(\Phi^{T}\Phi+\lambda \mathbf{I})^{-1}$は実対称行列です．また，$\lambda=\frac{\alpha}{\beta}$です．まとめると，事後分布が多変量ガウス分布$N(\mathbf{w}|\mathbf{m},\mathbf{S})$のとき， \begin{eqnarray} \mathbf{S}&=&(\beta\Phi^{T}\Phi+\alpha \mathbf{I})^{-1}\\ \mathbf{m}&=&(\Phi^{T}\Phi+\lambda \mathbf{I})^{-1}\Phi^{T}\mathbf{y} \end{eqnarray} となることが分かりました．ここで，多変量ガウス分布の平均$\mathbf{m}$はMAP推定を行ったときの[結果](https://qiita.com/tsnry7913/items/b9f15953191e5e08d5af#1-2-%E7%9B%AE%E7%9A%84%E9%96%A2%E6%95%B0%E3%82%92%E5%81%8F%E5%BE%AE%E5%88%86%E3%81%97%E3%81%A6%E6%A5%B5%E5%80%A4%E3%82%92%E6%B1%82%E3%82%81%E3%82%8B)と一致していることが分かります． ## 2. 新規入力に対する予測分布を求める推定された事後分布から，新規入力$x^*$に対する出力$y^*$の予測分布$p(y^{*}|x^{*},\mathbf x, \mathbf y,\alpha,\beta)$は次のようになります． \begin{eqnarray} p(y^{*}|x^{*},\mathbf x,\mathbf y,\alpha,\beta) &=&\int p(y^{*}|x^{*},\mathbf{w},\beta)p(\mathbf{w}|\mathbf x, \mathbf y,\alpha,\beta)d\mathbf{w}\\ &=&\int N(y^{*}|\mathbf{w}^{T}\mathbf{\phi}(x^{*}),\beta^{-1})N(\mathbf{w}|\mathbf{m,S})d\mathbf{w} \end{eqnarray} この予測分布の式は，ガウス分布同士の畳み込み積分になっています．ここで，ガウス分布同士の畳み込み積分を求める公式は，パターン認識と機械学習[1]の式2-116を使うと，次のように求めることができます（この部分はいずれ証明したい）． \begin{eqnarray} p(y^{*}|x^{*},\mathbf x, \mathbf y,\alpha,\beta)=N(y^{*}|\mathbf{m}^T\mathbf{\phi}(x^*),\beta^{-1}+\mathbf{\phi}(x^*)^T\mathbf{S}\mathbf{\phi}(x^*)) \end{eqnarray} ## おまけ　事後分布の更新について実際の環境では，例えばPCのメモリの都合などで全てのデータを一度に計算できなかったり，逐一結果を確認したい場合があります．そこで，データを分割して事後分布を逐次的に更新できるようにしたいです．ここで，尤度に対して共役な事前分布を選ぶことで事後分布を事前分布と同じ形状を持った分布にすることができることを利用し，次の時刻$t+1$での事後分布の$\mathbf{m}_{t+1},\mathbf{S}_{t+1}$は，現在の時刻$t$での$\mathbf{m}_{t},\mathbf{S}_{t}$の結果を使って次のように逐次的に更新できるようになります(この証明はいずれ取り組みたいです). \begin{eqnarray} \mathbf{S}_{t+1}&=&\mathbf{S}_{t}^{-1}+\beta\Phi^T\Phi\\ \mathbf{m}_{t+1}&=&\mathbf{S}_{t+1}(\mathbf{S}_{t}^{-1}\mathbf{m}_{N}+\beta\Phi^{T}\mathbf y) \end{eqnarray} # ベイズ推定を用いた回帰の実装 ## 実験１．ベイズ推定を用いた線形回帰はじめに，ベイズ推定を用いた最も単純な線形回帰を実装します．実装したプログラムは以下となっており，ハイパーパラメータは事前分布の精度$\alpha=1$，ガウスノイズの精度$\beta=0.5$と仮定しています． ``` # ベイズ推定を用いた線形回帰 %matplotlib inline import numpy as np import matplotlib.pyplot as plt import matplotlib.animation as anim import copy def phi(X): # X : [[x0],[x1],...] # return : [[phi_1(x),...,phi_n(x)] for x in X] Phi = np.array([[1,x[0]] for x in X]) # 9th poliminal return Phi.T def Phi(X): return phi(X).T def maltivariate_gaussian_distribution(x, mu, sigma): # x(weight): D dimension maltivariate # mu: D dimension # sigma: D×D dimension exp = np.exp(-(1/2)*np.dot(np.transpose(x-mu), np.dot(np.linalg.inv(sigma),(x-mu)))) return (1/(np.sqrt(2*np.pi)**(mu.shape[0]/2)))*(1/np.linalg.det(sigma)**(1/2))*exp # モデルをデータにfitする def update(n, ax, sample): global S, m, posterior, prior, w0, w1 ax[0].cla() ax[1].cla() ax[2].cla() # データを全て使用して学習するとき #a = sample[:,0] + epsilon #b = sample[:,1] + epsilon # データを少しずつ増やして学習するとき（今までのデータ＋新規データで使用するとき） #a = np.array(sample[:n+1,0]) + epsilon #b = np.array(sample[:n+1,1]) + epsilon # 新規データだけで逐次的に学習するとき（今までのデータは使用しない） a = np.array([[i] for i in sample[n,0]]) + epsilon b = np.array([[i] for i in sample[n,1]]) + epsilon # 事後分布 if n == 0: temp = beta*np.dot(Phi(a).T, Phi(a)) S = np.linalg.inv(temp + alpha*np.eye(len(temp))) m = np.dot(beta*np.dot(S, Phi(a).T), b) S0 = copy.deepcopy(S) m0 = copy.deepcopy(m) S = np.linalg.inv(np.linalg.inv(S0) + beta*np.dot(Phi(a).T, Phi(a))) m = np.dot(S, (np.dot(np.linalg.inv(S0), m0) + beta*np.dot(Phi(a).T, b))) # 事後分布から重みパラメータを確率的に選択 multi_normal = np.random.multivariate_normal(m.ravel(), S, size=5) w0_pred = multi_normal[:, 0] w1_pred = multi_normal[:, 1] # 描画 cordinate = [] X, Y=np.meshgrid(np.linspace(-scope,scope,N),np.linspace(-scope,scope,N)) for i in range(len(X)): temp = [] for j in range(len(Y)): temp.append([[X[i,j]], [Y[i,j]]]) cordinate.append(temp) cordinate = np.array(cordinate) posterior = np.array(posterior) posterior = np.array([maltivariate_gaussian_distribution(cordinate[i,j], m, S) for i in range(N) for j in range(N)]) posterior = posterior.reshape(N,N) ax[0].pcolormesh(X, Y, posterior, cmap=None) ax[0].scatter(w0_pred, w1_pred, marker='x', c='r', label='sample',s=75) ax[1].plot(np.linspace(-scope,scope,N), w0 + w1 * np.linspace(-scope,scope,N), label='true line',color='b', zorder=1) ax[1].scatter(a, b, label='train data', color='b',marker='*',s=75,zorder=3) for k, (wp0, wp1) in enumerate(zip(w0_pred, w1_pred)): ax[1].plot(np.linspace(-scope,scope,N), wp0 + wp1 * np.linspace(-scope,scope,N), label='predicted line' if k==0 else "", c='r', zorder=2) ax[0].set_title('prior/posterior({}epoch)'.format(n+1)) ax[1].set_title('data space({}epoch)'.format(n+1)) ax[0].set_xlabel('w0') ax[0].set_ylabel('w1') ax[1].set_xlabel('x') ax[1].set_ylabel('y') ax[1].set_xlim(-scope,scope) ax[1].set_ylim(-scope,scope) # 予測分布を求める # 2次元座標を与えて各点の確率を出力する predictive_distribution = np.zeros((N,N)) for i, x_i in enumerate(np.linspace(-scope,scope,N)): # x座標 for j, y_j in enumerate(np.linspace(-scope,scope,N)): # y座標 x_i = np.array([[x_i]]) y_j = np.array([[y_j]]) pd_m = np.dot(m.T, phi(x_i)) pd_S = beta**(-1) + np.dot(np.dot(phi(x_i).T, S), phi(x_i)) predictive_distribution[j][i] = maltivariate_gaussian_distribution(y_j, pd_m, pd_S).ravel()[0] ax[2].pcolor(np.linspace(-scope,scope,N), np.linspace(-scope,scope,N), predictive_distribution) ax[2].scatter(x_test,y_test,label='test data',color='r',marker='*',s=75) ax[2].set_title('predictive distribution({}epoch)'.format(n+1)) ax[2].set_xlabel('x') ax[2].set_ylabel('y') ax[2].set_xlim(-scope,scope) ax[2].set_ylim(-scope,scope) fig.tight_layout() ax[0].legend(loc='upper right') ax[1].legend(loc='upper right') ax[2].legend(loc='upper right') if __name__ == '__main__': np.random.seed(0) N = 100 # 学習データの個数 N_test = 25 # テストデータの個数 beta = 0.5 # 既知であると仮定 alpha = 1 # ハイパーパラメータ epsilon = 0.000001 # ランク落ちのための値 epochs = 50 # 試行回数 scope = 3.0 # データの定義域 -scope から scope まで w0 = -0.3 # 直線の切片パラメータ w1 = 0.5 # 直線の傾きパラメータ prior = [] posterior = [] x = np.linspace(-scope,scope,N) y = (w0 + w1 * x) + np.random.normal(0, beta, N) x_test = np.linspace(-scope,scope,N_test) y_test = np.sin(x_test) + np.random.normal(0, 0.5, len(x_test)) x = np.array([[i] for i in x]) y = np.array([[i] for i in y]) # データをシャッフルする sample = np.random.permutation([[i,j] for i,j in zip(x,y)]) fig, ax = plt.subplots(1,3,figsize=(12,4)) ani = anim.FuncAnimation(fig, update, fargs = (ax, sample), interval = 400, frames = 30) ani.save("bayesian_linear_regression.gif", writer = 'imagemagick') print('finished') ``` ### 実験１-１　学習データを1試行ごとに増やしたときの線形回帰下のGIF画像に，1-1の学習結果をまとめています． ![](https://i.imgur.com/75G1OUW.gif) はじめに，学習結果の図についての簡単な説明をします．左図は，パラメータ$\mathbf{w}$に関する2次元ガウス分布の事後分布を表しています．赤**×**は，この分布からランダムでサンプリングされた$\mathbf{w}(w_0, w_1)$を示しています．中央の図は，横軸が入力$x$，縦軸が出力$y$のデータ空間を表しています．青の直線は真の関数，青点は観測データ，赤の直線は，左図の分布からランダムでサンプリングされた$\mathbf{w}(w_0, w_1)$によって引かれる直線を示しています．右図は，新規入力$x^*$に対する出力$y^*$の予測分布を表しており，出力値に対して，自信があるときは確率が高く，自信があまりない（不確実性が高い）ときは確率が低く表示されます．また，汎化性があるかどうかを確認するために，テストデータを配置しています． ※　左図と右図のカラーマップは，黄色に近いほど確率が高く，また，黒色に近いほど確率が低いことを表しています． ### データを1点学習したときの学習結果（GIF画像の1epoch目） ![](https://i.imgur.com/4ufBlNL.gif) 中央の図から，$\mathbf{w}$の事後分布からランダムにサンプリングした$\mathbf{w}$で引いた赤の直線が，学習したデータ近傍を通る直線になっていることが分かります．更に，予測分布に関しても，学習した場所から離れるにつれて，不確実性が高くなっていることが分かります． ### データを3点学習したときの学習結果（GIF画像の3epoch目） ![](https://i.imgur.com/jX26adP.gif) 中央の図から，データを3点学習しただけで，ある程度同じ指向性を持った直線が何度も引けていることが分かります．また，予測分布が少し直線のような形状になっていることが分かります． ### データを25点学習したときの学習結果（GIF画像の25epoch目） ![](https://i.imgur.com/rznyE6G.gif) 十分データを学習した結果，$\mathbf{w}$の事後分布がほとんど収束したことが分かります．そのため，中央の図から，何度$\mathbf{w}$をサンプリングしても似たような直線になっていることが分かります．更に，予測分布も，はっきりと直線のような外形になっていることが確認できました． ### 実験１-２　新規データのみ学習したときの線形回帰下のGIF画像に，1-2の学習結果をまとめています． ![](https://i.imgur.com/jYYFhtq.gif) 新規データだけを学習したときの結果が上のGIF画像です．1-1の結果と比べ，$\mathbf{w}$の事後分布が収束するのに時間がかかっていますが，推定自体は上手くできていることが分かります． ### 実験１-３　大量のデータを一度に学習したときの線形回帰 ![](https://i.imgur.com/pQIAiyQ.gif) 新規データを一度に100点学習させたときの結果が上図です．大量のデータを一度に学習させることで，1-1の結果と比べて$\mathbf{w}$の事後分布が早く収束することが分かりました． ## 実験２　ベイズ推定を用いたsin関数の回帰次に，sin関数に対して多項式を用いて回帰を行います． ``` # ベイズ推定を用いたsin関数の回帰 %matplotlib inline import numpy as np import matplotlib.pyplot as plt import matplotlib.animation as anim import copy def phi(X): # X : [[x0],[x1],...] # return : [[phi_1(x),...,phi_n(x)] for x in X] #Phi = np.array([[1,x[0],x[0]**2,x[0]**3] for x in X]) # 3 poliminal Phi = np.array([[1,x[0],x[0]**2,x[0]**3,x[0]**4,x[0]**5,x[0]**6,x[0]**7,x[0]**8,x[0]**9] for x in X]) # poliminal return Phi.T def Phi(X): return phi(X).T def maltivariate_gaussian_distribution(x, mu, sigma): # x(weight): D dimension maltivariate # mu: D dimension # sigma: D×D dimension exp = np.exp(-(1/2)*np.dot(np.transpose(x-mu), np.dot(np.linalg.inv(sigma),(x-mu)))) return (1/(np.sqrt(2*np.pi)**(mu.shape[0]/2)))*(1/np.linalg.det(sigma)**(1/2))*exp # モデルをデータにfitする def update(n, ax, sample): global S, m ax[0].cla() ax[1].cla() # データを全て使用して学習するとき #a = sample[:,0] + epsilon #b = sample[:,1] + epsilon # データを少しずつ増やして学習するとき（今までのデータ＋新規データで使用するとき） a = np.array(sample[:n+1,0]) + epsilon b = np.array(sample[:n+1,1]) + epsilon # 新規データだけで逐次的に学習するとき（今までのデータは使用しない） #a = np.array([[i] for i in sample[n,0]]) + epsilon #b = np.array([[i] for i in sample[n,1]]) + epsilon # 事後分布 if n == 0: temp = beta*np.dot(Phi(a).T, Phi(a)) S = np.linalg.inv(temp + alpha*np.eye(len(temp))) m = np.dot(beta*np.dot(S, Phi(a).T), b) S0 = copy.deepcopy(S) m0 = copy.deepcopy(m) S = np.linalg.inv(np.linalg.inv(S0) + beta*np.dot(Phi(a).T, Phi(a))) m = np.dot(S, (np.dot(np.linalg.inv(S0), m0) + beta*np.dot(Phi(a).T, b))) # 事後分布から重みパラメータを確率的に選択する multi_norm = np.random.multivariate_normal(m.ravel(), S, size=5) # 描画 ax[0].plot(np.linspace(-scope,scope,N), np.sin(np.linspace(-scope,scope,N)), label='true curve',color='b', zorder=1) ax[0].scatter(a, b, label='train data', color='b',marker='*',s=75,zorder=3) r = np.linspace(-scope,scope,N) for w in multi_norm: ax[0].plot(r, [np.dot(w, [t**i for i in range(len(w))]) for t in r], c='r', zorder=2) ax[0].set_title('data space({}epoch)'.format(n+1)) ax[0].set_xlabel('x') ax[0].set_ylabel('y') ax[0].set_xlim(-scope,scope) ax[0].set_ylim(-2,2) # 予測分布を求める # 2次元座標を与えて各点の確率を出力する predictive_distribution = np.zeros((N,N)) for i, x_i in enumerate(np.linspace(-scope,scope,N)): # x座標 for j, y_j in enumerate(np.linspace(-scope,scope,N)): # y座標 x_i = np.array([[x_i]]) y_j = np.array([[y_j]]) pd_m = np.dot(m.T, phi(x_i)) pd_S = beta**(-1) + np.dot(np.dot(phi(x_i).T, S), phi(x_i)) predictive_distribution[j][i] = maltivariate_gaussian_distribution(y_j, pd_m, pd_S).ravel()[0] ax[1].pcolor(np.linspace(-scope,scope,N), np.linspace(-scope,scope,N), predictive_distribution) ax[1].scatter(x_test,y_test,label='test data',color='r',marker='*',s=75) ax[1].set_title('predictive distribution({}epoch)'.format(n+1)) ax[1].set_xlabel('x') ax[1].set_ylabel('y') ax[1].set_xlim(-scope,scope) ax[1].set_ylim(-2,2) fig.tight_layout() ax[0].legend(loc='upper right') ax[1].legend(loc='upper right') if __name__ == '__main__': np.random.seed(100) N = 100 # 学習データの個数 N_test = 25 # テストデータの個数 beta = 1 # 既知であると仮定 alpha = 1 # ハイパーパラメータ epsilon = 0.000001 # ランク落ちのための値 epochs = 50 # 試行回数 scope = 4.0 # データの定義域 -scope から scope まで x = np.linspace(-scope,scope,N) y = np.sin(x) + np.random.normal(0, 0.5, N) x_test = np.linspace(-scope,scope,N_test) y_test = np.sin(x_test) + np.random.normal(0, 0.5, len(x_test)) x = np.array([[i] for i in x]) y = np.array([[i] for i in y]) # データをシャッフルする sample = np.random.permutation([[i,j] for i,j in zip(x,y)]) fig, ax = plt.subplots(1,2,figsize=(14,6)) ani = anim.FuncAnimation(fig, update, fargs = (ax, sample), interval = 400, frames = 5) ani.save("bayesian_regression_poliminal.gif", writer = 'imagemagick') print('finished') ``` ### 実験２-１　9次多項式による回帰 ![](https://i.imgur.com/9rdvQ1z.gif) はじめに左図がデータ空間を，右図が予測分布を表しています．見方については，実験1と同じです．表現力が高い9次多項式を用いると，初めの方の学習では発散したような曲線になるものの，ある程度データを学習させればsin関数に上手く曲線フィッテイングできることが分かりました． ### 実験２-２　3次多項式による回帰 ![](https://i.imgur.com/INdMYI1.gif) $x$が-1から1の範囲におけるsin関数に対して，3次多項式を使うことで上手く曲線フィッテングできることが分かります． ### 実験２-３　2次多項式による回帰 ![](https://i.imgur.com/IZO78Kx.gif) 2次多項式の表現力がsin関数に対して足りていないため，上手く曲線フィッテングができていないことが分かります．予測分布の広がり（分散）も非常に大きくなっています． # 感想ベイズ推定の数式を導き，スクラッチで実装したことでベイズ推定や確率分布についての理解が深まりました． # 参考文献 [1] [パターン認識と機械学習（上）　C.M. ビショップ](https://www.amazon.co.jp/%E3%83%91%E3%82%BF%E3%83%BC%E3%83%B3%E8%AA%8D%E8%AD%98%E3%81%A8%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92-%E4%B8%8A-C-M-%E3%83%93%E3%82%B7%E3%83%A7%E3%83%83%E3%83%97/dp/4621061224) [2] [ベイズ推定による機械学習（入門）　須山敦志（著）](https://www.amazon.co.jp/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%82%B9%E3%82%BF%E3%83%BC%E3%83%88%E3%82%A2%E3%83%83%E3%83%97%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA-%E3%83%99%E3%82%A4%E3%82%BA%E6%8E%A8%E8%AB%96%E3%81%AB%E3%82%88%E3%82%8B%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E5%85%A5%E9%96%80-KS%E6%83%85%E5%A0%B1%E7%A7%91%E5%AD%A6%E5%B0%82%E9%96%80%E6%9B%B8-%E9%A0%88%E5%B1%B1-%E6%95%A6%E5%BF%97/dp/4061538322/ref=asc_df_4061538322/?tag=jpgo-22&linkCode=df0&hvadid=295719984664&hvpos=1o1&hvnetw=g&hvrand=1954500955316738342&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=1009721&hvtargid=pla-525481402928&psc=1&th=1&psc=1) [3] [ガウス過程と機械学習　持橋大地 (著), 大羽成征 (著)](https://www.amazon.co.jp/dp/B07QMMJJV8/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1)