Logistic Regression

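# Soft Binary Classification

$$ f(\mathbf{x})=P(+1|\mathbf{x})\in[0,1] $$

In other words, the underlying rule we are looking for outputs +1 with some probability for a given $\mathbf{x}$. The data we actually observe, however, never contain those probabilities: each label is simply a hard +1 or -1.

# Logistic Function/Logistic Regression

$$ \theta(s)=\frac{e^{s}}{1+e^{s}}=\frac{1}{1+e^{-s}} $$

This is the logistic function. It maps the inner-product score $s=\mathbf{w}^{T}\mathbf{x}$ to a value between 0 and 1, so we can use it to approximate the underlying rule:

$$ h(\mathbf{x})=\theta(s)=\theta(\mathbf{w}^{T}\mathbf{x}) $$

## Property

The logistic function has a useful symmetry:

$$ 1-\theta(s)=\theta(-s) $$

# Error measure

Consider a data set, where $\color{blue}{\circ}$ stands for the label +1 and $\color{red}{\times}$ for the label -1:

$$ \mathcal{D}=\left\{(\mathbf{x}_{1},\color{blue}{\circ}),(\mathbf{x}_{2},\color{red}{\times}),...,(\mathbf{x}_{N},\color{red}{\times})\right\} $$

If an underlying rule $f$ really exists, the probability that it generates this data set is:

$$ \begin{aligned} &P(\mathbf{x}_{1})P(\color{blue}{\circ}|\mathbf{x}_{1})\times\\ &P(\mathbf{x}_{2})P(\color{red}{\times}|\mathbf{x}_{2})\times\\ &\cdots\\ &P(\mathbf{x}_{N})P(\color{red}{\times}|\mathbf{x}_{N}) \end{aligned} $$

Replacing the true probabilities with the approximation $h$, so that $P(\color{blue}{\circ}|\mathbf{x})$ becomes $h(\mathbf{x})$ and $P(\color{red}{\times}|\mathbf{x})$ becomes $1-h(\mathbf{x})$, this reads:

$$ \begin{aligned} &P(\mathbf{x}_{1})h(\mathbf{x}_{1})\times\\ &P(\mathbf{x}_{2})(1-h(\mathbf{x}_{2}))\times\\ &\cdots\\ &P(\mathbf{x}_{N})(1-h(\mathbf{x}_{N})) \end{aligned} $$

We firmly believe that some rule really did generate this data, which means the probability of $f$ generating it should be large. So, among all candidate $h$, we pick as $g$ the one for which this product of probabilities is largest.

$$ \mathrm{likelihood}(h)=P(\mathbf{x}_{1})h(\mathbf{x}_{1})\times P(\mathbf{x}_{2})(1-h(\mathbf{x}_{2}))\times \cdots \times P(\mathbf{x}_{N})(1-h(\mathbf{x}_{N})) $$

By the property of the logistic function, $1-h(\mathbf{x})=h(-\mathbf{x})$. The factors $P(\mathbf{x}_{n})$ do not depend on $h$, so the expression simplifies to:

$$ \mathrm{likelihood}(h)\propto\prod_{n=1}^{N}h(y_{n}\mathbf{x}_{n}) $$

We want the $h$ that maximizes it:

$$ \max_{h}\ \mathrm{likelihood}(h)\propto\prod_{n=1}^{N}h(y_{n}\mathbf{x}_{n}) $$

Or, written in terms of the weight vector $\mathbf{w}$:

$$ \max_{\mathbf{w}}\ \mathrm{likelihood}(\mathbf{w})\propto\prod_{n=1}^{N}\theta(y_{n}\mathbf{w}^{T}\mathbf{x}_{n}) $$

# Cross-Entropy Error

To turn the "maximize a product" above into the usual error-measure form of "minimize a sum", we take the logarithm (which is monotonic and turns the product into a sum), negate it, and scale by $\frac{1}{N}$. The second line then uses $-\ln\theta(s)=\ln\left(1+e^{-s}\right)$:

$$ \begin{aligned} &\min_{\mathbf{w}}\frac{1}{N}\sum_{n=1}^{N}-\ln\theta(y_{n}\mathbf{w}^{T}\mathbf{x}_{n})\\ \Rightarrow &\min_{\mathbf{w}}\frac{1}{N}\sum_{n=1}^{N}\ln\left(1+e^{-y_{n}\mathbf{w}^{T}\mathbf{x}_{n}}\right) \end{aligned} $$

This gives us the pointwise error we wanted:

$$ err(\mathbf{w},\mathbf{x},y)=\ln\left(1+e^{-y\mathbf{w}^{T}\mathbf{x}}\right) $$

This is called the cross-entropy error.

# Gradient

Differentiating $E_{in}$ with the chain rule, and using $\frac{d}{ds}\ln\left(1+e^{s}\right)=\theta(s)$, gives:

$$ \nabla E_{in}(\mathbf{w})=\frac{1}{N}\sum_{n=1}^{N}\theta(-y_{n}\mathbf{w}^{T}\mathbf{x}_{n})(-y_{n}\mathbf{x}_{n}) $$

However, $\nabla E_{in}(\mathbf{w})=\mathbf{0}$ has no closed-form solution, so we instead reduce the error iteratively, in the spirit of PLA's step-by-step corrections. At each iteration we take a step of fixed length $\eta$ in the unit direction $\mathbf{v}$ that makes the error smallest:

$$ \min_{\left\|\mathbf{v}\right\|=1}E_{in}(\mathbf{w}_{t}+\eta\mathbf{v}) $$

We cannot solve this directly either, so we bring in a tool from calculus, the linear approximation, which holds for sufficiently small $\eta$:

$$ E_{in}(\mathbf{w}_{t}+\eta\mathbf{v})\approx E_{in}(\mathbf{w}_{t})+\eta\mathbf{v}^{T}\nabla E_{in}(\mathbf{w}_{t}) $$

From this expression it is clear that the direction giving the smallest error is the one opposite to the current gradient:

$$ \mathbf{v}=-\frac{\nabla E_{in}(\mathbf{w}_{t})}{\left\|\nabla E_{in}(\mathbf{w}_{t})\right\|} $$

A direction alone is not enough. Although we said above that each step has a fixed length, once we have found the best direction for reducing the error at the current point, it makes sense to let the step length depend on the size of the gradient: a larger step when the gradient is large, a smaller one when it is small. The original update rule is:

$$ \mathbf{w}_{t+1}=\mathbf{w}_{t}-\color{red}{\eta}\frac{\nabla E_{in}(\mathbf{w}_{t})}{\left\|\nabla E_{in}(\mathbf{w}_{t})\right\|} $$

If we take the original $\color{red}{\eta}$ to be proportional to $\left\|\nabla E_{in}(\mathbf{w}_{t})\right\|$, that is $\color{red}{\eta}=\color{purple}{\eta}\left\|\nabla E_{in}(\mathbf{w}_{t})\right\|$ for some fixed $\color{purple}{\eta}$, the rule can be rewritten as:

$$ \mathbf{w}_{t+1}=\mathbf{w}_{t}-\color{purple}{\eta}\cdot\nabla E_{in}(\mathbf{w}_{t}) $$

# Gradient Descent

The procedure above is the well-known gradient descent method. Its update rule:

$$ \mathbf{w}_{t+1}=\mathbf{w}_{t}+\color{purple}{\eta}\cdot\left(-\nabla E_{in}(\mathbf{w}_{t})\right) $$

can be applied to any error measure function whose gradient is easy to compute.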
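
The cross-entropy error, its gradient, and the fixed learning rate update can be put together in a few lines. The following is only a minimal sketch in plain Python, not a reference implementation: the function names (`theta`, `cross_entropy_error`, `gradient`, `gradient_descent`), the defaults `eta=0.1` and `steps=1000`, the zero initial weights, and the toy data set are all assumptions made for illustration. Each input is assumed to carry a constant first coordinate $x_{0}=1$ so that $\mathbf{w}$ includes a bias term.

```python
import math


def theta(s):
    """Logistic function 1 / (1 + e^{-s}), arranged to avoid overflow for large |s|."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    z = math.exp(s)
    return z / (1.0 + z)


def dot(w, x):
    """Inner product w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n w^T x_n))."""
    total = 0.0
    for xn, yn in zip(X, y):
        s = -yn * dot(w, xn)
        # ln(1 + e^s) = max(s, 0) + ln(1 + e^{-|s|}), which never overflows
        total += max(s, 0.0) + math.log1p(math.exp(-abs(s)))
    return total / len(X)


def gradient(w, X, y):
    """grad E_in(w) = (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)."""
    g = [0.0] * len(w)
    for xn, yn in zip(X, y):
        coef = theta(-yn * dot(w, xn))
        for i, xi in enumerate(xn):
            g[i] += coef * (-yn * xi)
    return [gi / len(X) for gi in g]


def gradient_descent(X, y, eta=0.1, steps=1000):
    """Repeat w_{t+1} = w_t - eta * grad E_in(w_t), starting from w_0 = 0."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(w, X, y)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


# Toy data, made up for illustration; the first coordinate is the constant x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0],
     [1.0, 2.5], [1.0, 3.0], [1.0, 3.5]]
y = [-1, -1, -1, +1, -1, +1, +1]

w_final = gradient_descent(X, y)
print("E_in at w = 0:     ", cross_entropy_error([0.0, 0.0], X, y))
print("E_in after descent:", cross_entropy_error(w_final, X, y))
print("P(+1 | x = 3.0):   ", theta(dot(w_final, [1.0, 3.0])))
```

At $\mathbf{w}=\mathbf{0}$ every term of the error equals $\ln 2$, so that value is a convenient baseline to compare against. The loop runs a fixed number of steps for simplicity; stopping once the gradient norm falls below a threshold would be an equally reasonable choice.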