# Chapter 9 - Understanding Machine Learning From Theory to Algorithms: Linear Predictors

###### tags: `Understanding Machine Learning From Theory to Algorithms`

## Remark

:::info
Content quoted from the book
:::

:::success
My own understanding
:::

[Course link]()

[TOC]

## Linear Predictors

:::info
In this chapter we will study the family of linear predictors, one of the most useful families of hypothesis classes. Many learning algorithms that are being widely used in practice rely on linear predictors, first and foremost because of the ability to learn them efficiently in many cases. In addition, linear predictors are intuitive, are easy to interpret, and fit the data reasonably well in many natural learning problems.
:::

:::success
In this chapter we study the family of linear predictors, one of the most useful families of hypothesis classes. Many learning algorithms that are widely used in practice rely on linear predictors, first and foremost because in many cases they can be learned efficiently. In addition, linear predictors are intuitive, easy to interpret, and fit the data reasonably well in many natural learning problems.
:::

:::info
We will introduce several hypothesis classes belonging to this family - halfspaces, linear regression predictors, and logistic regression predictors - and present relevant learning algorithms: linear programming and the Perceptron algorithm for the class of halfspaces and the Least Squares algorithm for linear regression. This chapter is focused on learning linear predictors using the ERM approach; however, in later chapters we will see alternative paradigms for learning these hypothesis classes.
:::

:::success
We will introduce several hypothesis classes belonging to this family - halfspaces, linear regression predictors, and logistic regression predictors - and present the relevant learning algorithms: linear programming and the Perceptron algorithm for the class of halfspaces, and the Least Squares algorithm for linear regression. This chapter focuses on learning linear predictors with the ERM approach; in later chapters we will see alternative paradigms for learning these hypothesis classes.

First, we define the class of affine functions as

$$L_d = \left\{h_{w, b}: w \in \mathbb{R}^d, b\in \mathbb{R} \right\}$$

where

$$h_{w, b}(x) = \langle w, x \rangle + b = \left(\sum^d_{i=1}w_ix_i \right) + b$$

This can also be written as

$$L_d = \left\{x \mapsto \langle w, x \rangle + b : w \in \mathbb{R}^d, b \in \mathbb{R} \right\}$$

($\langle w, x \rangle$ denotes the inner product of the two vectors.)

Read the definition this way: $L_d$ is a set of functions, each parameterized by $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$; each function takes a vector $x$ as input and returns the scalar $\langle w, x \rangle + b$ as output.

The different hypothesis classes of linear predictors are compositions of a function $\phi: \mathbb{R} \to \mathcal{Y}$ over $L_d$. For example, in binary classification we can choose $\phi$ to be the [sign function](http://terms.naer.edu.tw/detail/9470344/), while for regression problems, where $\mathcal{Y}=\mathbb{R}$, $\phi$ is simply the [identity function](http://terms.naer.edu.tw/detail/7293691/).
:::

:::info
It may be more convenient to incorporate $b$, called the bias, into $w$ as an extra coordinate and add an extra coordinate with a value of 1 to all $x \in X$; namely, let $w' = (b, w_1, w_2, \cdots, w_d) \in \mathbb{R}^{d+1}$ and let $x' = (1, x_1, x_2, \cdots, x_d) \in \mathbb{R}^{d+1}$. Therefore,

$$h_{w, b}(x) = \langle w, x \rangle + b = \langle w', x' \rangle$$

It follows that each affine function in $\mathbb{R}^{d}$ can be rewritten as a homogeneous linear function in $\mathbb{R}^{d+1}$ applied over the transformation that appends the constant 1 to each input vector. Therefore, whenever it simplifies the presentation, we will omit the bias term and refer to $L_d$ as the class of homogeneous linear functions of the form $h_w(x)=\langle w, x \rangle$. Throughout the book we often use the general term "linear functions" for both affine functions and (homogenous) linear functions.
:::

:::success
It may be more convenient to merge $b$ (called the bias) into $w$ as an extra coordinate and to add an extra coordinate with value 1 to every $x \in X$; that is, let $w' = (b, w_1, w_2, \cdots, w_d) \in \mathbb{R}^{d+1}$ and $x' = (1, x_1, x_2, \cdots, x_d) \in \mathbb{R}^{d+1}$ ($d+1$ because one extra coordinate is added to the original dimension). Therefore,

$$h_{w, b}(x) = \langle w, x \rangle + b = \langle w', x' \rangle$$

It follows that every affine function in $\mathbb{R}^{d}$ can be rewritten as a homogeneous linear function in $\mathbb{R}^{d+1}$ applied to the transformation that appends the constant 1 to each input vector. Therefore, whenever it simplifies the presentation, we omit the bias term and treat $L_d$ as the class of homogeneous linear functions of the form $h_w(x)=\langle w, x \rangle$.

Throughout the book, the general term "linear functions" is used for both affine functions and (homogeneous) linear functions.
:::
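:::success
As a quick supplement (not from the book), below is a minimal NumPy sketch of an affine predictor $h_{w,b}$, its composition with $\phi=\text{sign}$, and the bias-absorption trick just described. The numbers for `w`, `b`, and `x` are made up for illustration.

```python
import numpy as np

def h(w, b, x):
    """Affine predictor h_{w,b}(x) = <w, x> + b."""
    return np.dot(w, x) + b

w = np.array([2.0, -1.0])   # arbitrary example parameters
b = 0.5
x = np.array([1.0, 3.0])

score = h(w, b, x)            # <w, x> + b
label = np.sign(score)        # phi = sign gives a binary classifier; identity gives regression
print(score, label)           # -0.5 -1.0

# Bias absorption: w' = (b, w), x' = (1, x) satisfies <w', x'> = <w, x> + b.
w_prime = np.concatenate(([b], w))
x_prime = np.concatenate(([1.0], x))
assert np.isclose(np.dot(w_prime, x_prime), score)
```
:::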
### 9.1 Halfspaces

:::info
The first hypothesis class we consider is the class of halfspaces, designed for binary classification problems, namely, $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \left\{-1, +1 \right\}$. The class of halfspaces is defined as follows:

$$HS_d = \text{sign} \circ L_d = \left\{x \mapsto \text{sign}(h_{w, b}(x)): h_{w, b} \in L_d \right\}$$

In other words, each halfspace hypothesis in $HS_d$ is parameterized by $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ and upon receiving a vector $x$ the hypothesis returns the label $\text{sign}(\langle w, x \rangle + b)$.
:::

:::success
The first hypothesis class we consider is the class of halfspaces, designed for binary classification problems, that is, $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \left\{-1, +1 \right\}$. The class of halfspaces is defined as follows:

$$HS_d = \text{sign} \circ L_d = \left\{x \mapsto \text{sign}(h_{w, b}(x)): h_{w, b} \in L_d \right\}$$

In other words, every halfspace hypothesis in $HS_d$ is parameterized by $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$; upon receiving a vector $x$, it returns the label $\text{sign}(\langle w, x \rangle + b)$.

To illustrate this hypothesis class geometrically, take $d=2$ as an example. Each hypothesis is determined by a hyperplane that is perpendicular to the vector $w$ and intersects the vertical axis at the point $(0, -b/w_2)$. Instances that are "above" the hyperplane, namely those that form an acute angle with $w$, are labeled positive. Instances that are "below" the hyperplane, namely those that form an obtuse angle with $w$, are labeled negative.

![](https://i.imgur.com/Wi2iUFi.png)
:::
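:::success
A small numeric check of the geometric picture (my own illustration, assuming NumPy): with the made-up parameters below, the boundary crosses the vertical axis at $(0, -b/w_2)$, a point shifted from the boundary in the direction of $w$ gets label $+1$, and a point shifted against $w$ gets label $-1$.

```python
import numpy as np

w = np.array([1.0, 2.0])    # example parameters, chosen arbitrarily
b = -2.0

# The boundary {x : <w, x> + b = 0} meets the vertical axis at (0, -b/w_2).
intercept = np.array([0.0, -b / w[1]])
assert np.isclose(np.dot(w, intercept) + b, 0.0)

x_above = intercept + 0.5 * w    # acute angle with w  -> "above" the hyperplane
x_below = intercept - 0.5 * w    # obtuse angle with w -> "below" the hyperplane
print(np.sign(np.dot(w, x_above) + b))   #  1.0
print(np.sign(np.dot(w, x_below) + b))   # -1.0
```
:::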
:::info
In Section 9.1.3 we will show that $\text{VCdim}(HS_d) = d + 1$. It follows that we can learn halfspaces using the ERM paradigm, as long as the sample size is $\Omega\left(\dfrac{d+\log(1/\delta)}{\epsilon} \right)$. Therefore, we now discuss how to implement an ERM procedure for halfspaces.
:::

:::success
In Section 9.1.3 we will show that $\text{VCdim}(HS_d) = d + 1$. It follows that we can learn halfspaces with the ERM paradigm as long as the sample size is $\Omega\left(\dfrac{d+\log(1/\delta)}{\epsilon} \right)$. Therefore, we now discuss how to implement an ERM procedure for halfspaces.

Below, two solutions are presented for finding an ERM halfspace in the realizable case. In the context of halfspaces, the realizable case is often called the "separable" case, because we can "separate" all positive examples from all negative ones with a hyperplane. Implementing the ERM rule in the nonseparable case (i.e., the agnostic case) is known to be computationally hard (Ben-David & Simon 2001). There are several approaches to learning nonseparable data. The most popular one is to use surrogate loss functions; that is, when learning halfspaces, instead of minimizing the empirical risk we minimize a different loss function. For example, in Section 9.3 we describe the logistic regression approach, which can be implemented efficiently even in the nonseparable case. We will study surrogate loss functions in more detail in Chapter 12.
:::

#### 9.1.1 Linear Programming for the Class of Halfspaces

:::info
Linear programs (LP) are problems that can be expressed as maximizing a linear function subject to linear inequalities. That is,

$$\max_{w \in \mathbb{R}^d}\langle u , w \rangle \\ \text{subject to }Aw \geq v$$

where $w \in \mathbb{R}^d$ is the vector of variables we wish to determine, $A$ is an $m \times d$ matrix, and $v \in \mathbb{R}^m, u \in \mathbb{R}^d$ are vectors. Linear programs can be solved efficiently, and furthermore, there are publicly available implementations of LP solvers.
:::

:::success
Linear programs (LP) are problems that can be expressed as maximizing a linear function subject to linear inequalities. That is,

$$\max_{w \in \mathbb{R}^d}\langle u , w \rangle \\ \text{subject to }Aw \geq v$$

where $w \in \mathbb{R}^d$ is the vector of variables we want to determine, $A$ is an $m \times d$ matrix, and $v \in \mathbb{R}^m, u \in \mathbb{R}^d$ are vectors. Linear programs can be solved efficiently, and publicly available implementations of LP solvers exist.

We will show that the ERM problem for halfspaces in the separable case (recall from above that realizable usually means separable here, so "separable" is used from now on) can be expressed as a linear program. For simplicity, we assume the homogeneous case. Let $S=\left\{(x_i,y_i)\right\}^m_{i=1}$ be a training set of size $m$. Since we assume the separable case, an ERM predictor should have zero error on the training set. That is, we are looking for some vector $w \in \mathbb{R}^d$ for which

$$\text{sign}\left(\langle w, x_i \rangle \right) = y_i, \quad \forall i = 1, \cdots, m$$

Equivalently, we are looking for some vector $w$ for which

$$y_i \langle w, x_i \rangle > 0, \quad \forall i=1, \cdots, m$$
:::

:::info
Let $w^*$ be a vector that satisfies this condition (it must exist since we assume realizability). Define $\gamma=\min_i(y_i \langle w^*, x_i \rangle)$ and let $\bar{w}=\dfrac{w^*}{\gamma}$. Therefore, for all $i$ we have

$$y_i \langle \bar{w}, x_i \rangle = \dfrac{1}{\gamma} y_i \langle w^*, x_i \rangle \geq 1$$

We have thus shown that there exists a vector that satisfies

$$y_i \langle w, x_i \rangle \geq 1, \quad \forall i = 1, \cdots , m \tag{9.1}$$
:::

:::success
Let $w^*$ be a vector that satisfies this condition (it must exist because we assume separability). Define $\gamma=\min_i(y_i \langle w^*, x_i \rangle)$ and let $\bar{w}=\dfrac{w^*}{\gamma}$. Then, for all $i$ we have

$$y_i \langle \bar{w}, x_i \rangle = \dfrac{1}{\gamma} y_i \langle w^*, x_i \rangle \geq 1$$

We have therefore shown that there exists a vector satisfying

$$y_i \langle w, x_i \rangle \geq 1, \quad \forall i = 1, \cdots , m \tag{9.1}$$

Clearly, such a vector is an ERM predictor. To find a vector satisfying Equation (9.1), we can rely on an LP solver as follows:

* Set $A$ to be the $m \times d$ matrix whose rows are the instances multiplied by $y_i$, that is, $A_{i,j}=y_ix_{i, j}$, where $x_{i, j}$ is the $j$'th element of $x_i$.
* Let $v$ be the vector $(1, \cdots, 1)\in \mathbb{R}^m$.

Then, Equation (9.1) can be rewritten as

$$Aw \geq v$$

The LP form requires a maximization objective, yet all the vectors $w$ that satisfy the constraints are equally good candidates as output hypotheses. Therefore, we set a "dummy" objective, $u=(0, \cdots, 0)\in \mathbb{R}^d$.
:::
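:::success
Below is a minimal sketch (mine, not from the book) of this construction using `scipy.optimize.linprog` as the "publicly available LP solver"; the toy training set is made up. Note that `linprog` minimizes $\langle c, w\rangle$ subject to $A_{ub}w \leq b_{ub}$, so the constraint $Aw \geq v$ is passed with flipped signs and the objective is the zero vector.

```python
import numpy as np
from scipy.optimize import linprog

# Toy separable training set (made-up numbers), labels in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
m, d = X.shape

A = y[:, None] * X      # A[i, j] = y_i * x_{i, j}
v = np.ones(m)          # we want A @ w >= v, i.e., y_i <w, x_i> >= 1

# "Dummy" objective c = 0; bounds=(None, None) keeps every coordinate of w unconstrained.
res = linprog(c=np.zeros(d), A_ub=-A, b_ub=-v, bounds=(None, None))

w = res.x
print(res.success, w)
print(np.sign(X @ w))   # reproduces y on this separable sample
```
:::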
#### 9.1.2 Perceptron for Halfspaces

:::info
A different implementation of the ERM rule is the Perceptron algorithm of Rosenblatt (Rosenblatt 1958). The Perceptron is an iterative algorithm that constructs a sequence of vectors $w^{(1)}, w^{(2)}, \cdots$. Initially, $w^{(1)}$ is set to be the all-zeros vector. At iteration $t$, the Perceptron finds an example $i$ that is mislabeled by $w^{(t)}$, namely, an example for which $\text{sign}(\langle w^{(t)}, x_i \rangle) \neq y_i$. Then, the Perceptron updates $w^{(t)}$ by adding to it the instance $x_i$ scaled by the label $y_i$. That is, $w^{(t+1)} = w^{(t)} + y_ix_i$. Recall that our goal is to have $y_i\langle w, x_i \rangle > 0$ for all $i$ and note that

$$y_i \langle w^{(t+1)}, x_i \rangle = y_i \langle w^{(t)} + y_ix_i, x_i \rangle = y_i \langle w^{(t)}, x_i \rangle + \Vert x_i \Vert^2$$
:::

:::success
A different implementation of the ERM rule is the Perceptron algorithm of Rosenblatt. The Perceptron is an iterative algorithm that constructs a sequence of vectors $w^{(1)}, w^{(2)}, \cdots$. Initially, $w^{(1)}$ is set to be the all-zeros vector. At iteration $t$, the Perceptron finds an example $i$ that is mislabeled by $w^{(t)}$, that is, an example for which $\text{sign}(\langle w^{(t)}, x_i \rangle) \neq y_i$. The Perceptron then updates $w^{(t)}$ by adding to it the instance $x_i$ scaled by the label $y_i$, i.e., $w^{(t+1)} = w^{(t)} + y_ix_i$. Recall that our goal is to have $y_i\langle w, x_i \rangle > 0$ for all $i$, and note that

$$y_i \langle w^{(t+1)}, x_i \rangle = y_i \langle w^{(t)} + y_ix_i, x_i \rangle = y_i \langle w^{(t)}, x_i \rangle + \Vert x_i \Vert^2$$

Hence, the Perceptron update guides the solution to be "more correct" on the $i$'th example.

**Batch Perceptron**
input: A training set $(x_1, y_1), \cdots, (x_m, y_m)$
initialize: $w^{(1)} = (0, \cdots,0)$
for $t = 1, 2, \cdots$
&ensp;&ensp;if $(\exists i \text{ s.t. } y_i \langle w^{(t)}, x_i \rangle \leq 0)$ then
&ensp;&ensp;&ensp;&ensp;$w^{(t+1)} = w^{(t)} + y_ix_i$
&ensp;&ensp;else
&ensp;&ensp;&ensp;&ensp;output $w^{(t)}$
:::

:::info
THEOREM 9.1 Assume that $(x_1, y_1), \cdots, (x_m, y_m)$ is separable, let $B=\min \left\{ \Vert w \Vert: \forall i \in [m], y_i \langle w, x_i \rangle \geq 1 \right\}$, and let $R = \max_i \Vert x_i \Vert$. Then, the Perceptron algorithm stops after at most $(RB)^2$ iterations, and when it stops it holds that $\forall i \in [m], y_i \langle w^{(t)}, x_i \rangle > 0$.
:::

:::success
THEOREM 9.1 Assume that $(x_1, y_1), \cdots, (x_m, y_m)$ is separable, let $B=\min \left\{ \Vert w \Vert: \forall i \in [m], y_i \langle w, x_i \rangle \geq 1 \right\}$, and let $R = \max_i \Vert x_i \Vert$. Then, the Perceptron algorithm stops after at most $(RB)^2$ iterations, and when it stops it holds that $\forall i \in [m], y_i \langle w^{(t)}, x_i \rangle > 0$.

Proof: By the definition of the stopping condition, if the Perceptron stops, it must have separated all the examples. We will show that if the Perceptron runs for $T$ iterations, then we must have $T \leq (RB)^2$, which implies that the Perceptron must stop after at most $(RB)^2$ iterations.
:::

:::info
Let $w^*$ be a vector that achieves the minimum in the definition of $B$. That is, $y_i \langle w^*, x_i \rangle \geq 1$ for all $i$, and among all vectors that satisfy these constraints, $w^*$ is of minimal norm. The idea of the proof is to show that after performing $T$ iterations, the cosine of the angle between $w^*$ and $w^{(T+1)}$ is at least $\dfrac{\sqrt{T}}{RB}$. That is, we will show that

$$\dfrac{\langle w^*, w^{(T+1)} \rangle}{\Vert w^* \Vert \Vert w^{(T+1)} \Vert} \geq \dfrac{\sqrt{T}}{RB} \tag{9.2}$$

By the Cauchy-Schwartz inequality, the left-hand side of Equation (9.2) is at most 1. Therefore, Equation (9.2) would imply that

$$1 \geq \dfrac{\sqrt{T}}{RB} \Rightarrow T \leq (RB)^2$$

which will conclude our proof.
:::

:::success
Let $w^*$ be a vector that achieves the minimum in the definition of $B$. That is, $y_i \langle w^*, x_i \rangle \geq 1$ for all $i$, and among all vectors that satisfy these constraints, $w^*$ has minimal norm.

The idea of the proof is to show that after performing $T$ iterations, the cosine of the angle between $w^*$ and $w^{(T+1)}$ is at least $\dfrac{\sqrt{T}}{RB}$. That is, we will show that

$$\dfrac{\langle w^*, w^{(T+1)} \rangle}{\Vert w^* \Vert \Vert w^{(T+1)} \Vert} \geq \dfrac{\sqrt{T}}{RB} \tag{9.2}$$

By the Cauchy-Schwarz inequality, the left-hand side of Equation (9.2) is at most 1 (because $\vert \langle x, y \rangle \vert \leq \Vert x \Vert \cdot \Vert y \Vert$). Therefore, Equation (9.2) would imply that

$$1 \geq \dfrac{\sqrt{T}}{RB} \Rightarrow T \leq (RB)^2$$

which will conclude our proof.
:::

:::info
To show that Equation (9.2) holds, we first show that $\langle w^*, w^{(T+1)} \rangle \geq T$. Indeed, at the first iteration, $w^{(1)} = (0, \cdots, 0)$ and therefore $\langle w^*, w^{(1)} \rangle = 0$, while on iteration $t$, if we update using example $(x_i, y_i)$ we have that

$$\begin{align}\langle w^*, w^{(t+1)} \rangle - \langle w^*, w^{(t)} \rangle & = \langle w^*, w^{(t+1)} - w^{(t)} \rangle \\ & = \langle w^*, y_ix_i \rangle = y_i \langle w^*, x_i \rangle \\ & \geq 1\end{align}$$

Therefore, after performing $T$ iterations, we get:

$$\langle w^*, w^{(T+1)} \rangle = \sum_{t=1}^T \left(\langle w^*, w^{(t+1)} \rangle - \langle w^*, w^{(t)} \rangle \right) \geq T, \tag{9.3}$$

as required.
:::

:::success
To show that Equation (9.2) holds, we first show that $\langle w^*, w^{(T+1)} \rangle \geq T$. Indeed, at the first iteration $w^{(1)} = (0, \cdots, 0)$ and therefore $\langle w^*, w^{(1)} \rangle = 0$, while at iteration $t$, if we update using the example $(x_i, y_i)$, we have

$$\begin{align}\langle w^*, w^{(t+1)} \rangle - \langle w^*, w^{(t)} \rangle & = \langle w^*, w^{(t+1)} - w^{(t)} \rangle \\ & = \langle w^*, y_ix_i \rangle = y_i \langle w^*, x_i \rangle \\ & \geq 1\end{align}$$

Therefore, after performing $T$ iterations, we get:

$$\langle w^*, w^{(T+1)} \rangle = \sum_{t=1}^T \left(\langle w^*, w^{(t+1)} \rangle - \langle w^*, w^{(t)} \rangle \right) \geq T, \tag{9.3}$$

as required.
:::

:::info
Next, we upper bound $\Vert w^{(T+1)} \Vert$.
For each iteration $t$ we have that

$$\begin{align} \Vert w^{(t+1)} \Vert^2 &= \Vert w^{(t)} + y_ix_i \Vert^2 \\ & = \Vert w^{(t)} \Vert^2 + 2y_i \langle w^{(t)}, x_i \rangle + y_i^2 \Vert x_i \Vert^2 \\ & \leq \Vert w^{(t)} \Vert^2 + R^2 \tag{9.4} \end{align}$$

where the last inequality is due to the fact that example $i$ is necessarily such that $y_i \langle w^{(t)}, x_i \rangle \leq 0$, and the norm of $x_i$ is at most $R$. Now, since $\Vert w^{(1)} \Vert^2 = 0$, if we use Equation (9.4) recursively for $T$ iterations, we obtain that

$$\Vert w^{(T+1)} \Vert^2 \leq TR^2 \Rightarrow \Vert w^{(T+1)} \Vert \leq \sqrt{T} R \tag{9.5}$$

Combining Equation (9.3) with Equation (9.5), and using the fact that $\Vert w^* \Vert = B$, we obtain that

$$\dfrac{\langle w^{(T+1)}, w^* \rangle}{\Vert w^* \Vert \Vert w^{(T+1)} \Vert} \geq \dfrac{T}{B\sqrt{T}R} = \dfrac{\sqrt{T}}{BR}$$

We have thus shown that Equation (9.2) holds, and this concludes our proof.
:::

:::success
Next, we upper bound $\Vert w^{(T+1)} \Vert$. For each iteration $t$ we have

$$\begin{align} \Vert w^{(t+1)} \Vert^2 &= \Vert w^{(t)} + y_ix_i \Vert^2 \\ & = \Vert w^{(t)} \Vert^2 + 2y_i \langle w^{(t)}, x_i \rangle + y_i^2 \Vert x_i \Vert^2 \\ & \leq \Vert w^{(t)} \Vert^2 + R^2 \tag{9.4} \end{align}$$

where the last inequality holds because example $i$ is necessarily one for which $y_i \langle w^{(t)}, x_i \rangle \leq 0$, and the norm of $x_i$ is at most $R$. Now, since $\Vert w^{(1)} \Vert^2 = 0$, using Equation (9.4) recursively for $T$ iterations gives

$$\Vert w^{(T+1)} \Vert^2 \leq TR^2 \Rightarrow \Vert w^{(T+1)} \Vert \leq \sqrt{T} R \tag{9.5}$$

Combining Equation (9.3) with Equation (9.5), and using the fact that $\Vert w^* \Vert = B$, we obtain

$$\dfrac{\langle w^{(T+1)}, w^* \rangle}{\Vert w^* \Vert \Vert w^{(T+1)} \Vert} \geq \dfrac{T}{B\sqrt{T}R} = \dfrac{\sqrt{T}}{BR}$$

We have thus shown that Equation (9.2) holds, and this concludes the proof.

Remark 9.1 The Perceptron is simple to implement and is guaranteed to converge. However, the convergence rate depends on the parameter $B$, which in some situations may grow exponentially with $d$. In such cases, it is better to implement the ERM problem by solving a linear program, as described in the previous section. Nevertheless, for many natural data sets the value of $B$ is not too large, and the Perceptron converges quite quickly.
:::
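:::success
As a supplement (mine, not from the book), here is a minimal NumPy sketch of the Batch Perceptron pseudocode above on a made-up separable sample; the `max_iter` argument is my own safeguard and is not part of the book's algorithm, which relies on Theorem 9.1 for termination.

```python
import numpy as np

def batch_perceptron(X, y, max_iter=10_000):
    """Batch Perceptron: update on a mislabeled example until none is left."""
    m, d = X.shape
    w = np.zeros(d)                       # w^(1) = (0, ..., 0)
    for _ in range(max_iter):             # the book's loop "for t = 1, 2, ..."
        mistakes = np.nonzero(y * (X @ w) <= 0)[0]
        if mistakes.size == 0:
            return w                      # every example satisfies y_i <w, x_i> > 0
        i = mistakes[0]
        w = w + y[i] * X[i]               # w^(t+1) = w^(t) + y_i x_i
    return w

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = batch_perceptron(X, y)
print(w, np.sign(X @ w))                  # predictions agree with y on this separable set
```
:::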
#### 9.1.3 The VC Dimension of Halfspaces

:::info
To compute the VC dimension of halfspaces, we start with the homogenous case.

THEOREM 9.2 The VC dimension of the class of homogenous halfspaces in $\mathbb{R}^d$ is $d$.

Proof First, consider the set of vectors $e_1, \cdots, e_d$, where for every $i$ the vector $e_i$ is the all zeros vector except 1 in the $i$'th coordinate. This set is shattered by the class of homogenous halfspaces. Indeed, for every labeling $y_1, \cdots, y_d$, set $w = (y_1, \cdots, y_d)$, and then $\langle w, e_i \rangle = y_i$ for all $i$.

Next, let $x_1, \cdots, x_{d+1}$ be a set of $d + 1$ vectors in $\mathbb{R}^d$. Then, there must exist real numbers $a_1, \cdots, a_{d+1}$, not all of them are zero, such that $\sum_{i=1}^{d+1}a_ix_i = 0$. Let $I = \left\{i : a_i > 0 \right\}$ and $J = \left\{j: a_j < 0 \right\}$. Either $I$ or $J$ is nonempty. Let us first assume that both of them are nonempty. Then,

$$\sum_{i \in I}a_ix_i = \sum_{j \in J}\vert a_j \vert x_j$$
:::

:::success
To compute the VC dimension of halfspaces, we start from the homogeneous case.

THEOREM 9.2 The VC dimension of the class of homogeneous halfspaces in $\mathbb{R}^d$ is $d$.

Proof: First, consider the set of vectors $e_1, \cdots, e_d$, where $e_i$ is the standard vector that has value 1 in the $i$'th coordinate and 0 everywhere else. This set is shattered by the class of homogeneous halfspaces. Indeed, for every labeling $y_1, \cdots, y_d$, set $w = (y_1, \cdots, y_d)$; then $\langle w, e_i \rangle = y_i$ for all $i$.

Next, let $x_1, \cdots, x_{d+1}$ be a set of $d+1$ vectors in $\mathbb{R}^d$. Then there must exist real numbers $a_1, \cdots, a_{d+1}$, not all of them zero, such that $\sum_{i=1}^{d+1}a_ix_i = 0$. Let $I = \left\{i : a_i > 0 \right\}$ and $J = \left\{j: a_j < 0 \right\}$. At least one of $I$ and $J$ is nonempty. Let us first assume that both are nonempty. Then,

$$\sum_{i \in I}a_ix_i = \sum_{j \in J}\vert a_j \vert x_j$$

Note: the standard vectors are $e_1 = \begin{bmatrix}1 \\ 0 \\ 0 \\ \vdots \end{bmatrix}, e_2 = \begin{bmatrix}0 \\ 1 \\ 0 \\ \vdots \end{bmatrix}$, and so on.
:::

:::info
Now, suppose that $x_1, \cdots, x_{d+1}$ are shattered by the class of homogenous halfspaces. Then, there must exist a vector $w$ such that $\langle w, x_i \rangle > 0$ for all $i\in I$ while $\langle w, x_j \rangle < 0$ for every $j \in J$. It follows that

$$0 < \sum_{i \in I} a_i \langle x_i, w \rangle = \left \langle \sum_{i \in I} a_ix_i, w \right \rangle = \left \langle \sum_{j \in J} \vert a_j \vert x_j, w \right \rangle = \sum_{j \in J} \vert a_j \vert \langle x_j, w \rangle < 0$$

which leads to a contradiction. Finally, if $J$ (respectively, $I$) is empty then the right-most (respectively, left-most) inequality should be replaced by an equality, which still leads to a contradiction.
:::

:::success
Now, suppose that $x_1, \cdots, x_{d+1}$ are shattered by the class of homogeneous halfspaces. Then there must exist a vector $w$ such that $\langle w, x_i \rangle > 0$ for all $i\in I$ while $\langle w, x_j \rangle < 0$ for every $j \in J$. It follows that

$$0 < \sum_{i \in I} a_i \langle x_i, w \rangle = \left \langle \sum_{i \in I} a_ix_i, w \right \rangle = \left \langle \sum_{j \in J} \vert a_j \vert x_j, w \right \rangle = \sum_{j \in J} \vert a_j \vert \langle x_j, w \rangle < 0$$

which is a contradiction. Finally, if $J$ (or, respectively, $I$) is empty, then the right-most (respectively, left-most) inequality should be replaced by an equality, which still leads to a contradiction.

THEOREM 9.3 The VC dimension of the class of nonhomogenous halfspaces in $\mathbb{R}^d$ is $d+1$.

Proof: First, as in the proof of Theorem 9.2, it is easy to verify that the set of vectors $0, e_1, \cdots, e_d$ is shattered by the class of nonhomogenous halfspaces. Second, suppose that the vectors $x_1, \cdots, x_{d+2}$ are shattered by the class of nonhomogenous halfspaces. Then, using the reduction described at the beginning of the chapter (appending the constant 1 to each vector), the corresponding $d+2$ vectors in $\mathbb{R}^{d+1}$ would be shattered by the class of homogeneous halfspaces, which contradicts Theorem 9.2.
:::
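:::success
A tiny check of the first half of the proof (my own, assuming NumPy): for every labeling $(y_1, \cdots, y_d)$ of $e_1, \cdots, e_d$, the vector $w = (y_1, \cdots, y_d)$ realizes it, so the standard vectors are shattered by homogeneous halfspaces.

```python
import numpy as np
from itertools import product

d = 3
E = np.eye(d)                                  # rows are e_1, ..., e_d
for labels in product([-1, 1], repeat=d):      # all 2^d possible labelings
    w = np.array(labels, dtype=float)          # w = (y_1, ..., y_d)
    assert np.array_equal(np.sign(E @ w), w)   # sign(<w, e_i>) = y_i for every i
print("all", 2 ** d, "labelings realized")
```
:::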
:::info
![](https://i.imgur.com/cRIKQTt.png)

Figure 9.1 Linear regression for $d = 1$. For instance, the x-axis may denote the age of the baby, and the y-axis her weight.
:::

### 9.2 Linear Regression

:::info
Linear regression is a common statistical tool for modeling the relationship between some "explanatory" variables and some real valued outcome. Cast as a learning problem, the domain set $\mathcal{X}$ is a subset of $\mathbb{R}^d$, for some $d$, and the label set $\mathcal{Y}$ is the set of real numbers. We would like to learn a linear function $h:\mathbb{R}^d \to \mathbb{R}$ that best approximates the relationship between our variables (say, for example, predicting the weight of a baby as a function of her age and weight at birth). Figure 9.1 shows an example of a linear regression predictor for $d = 1$. The hypothesis class of linear regression predictors is simply the set of linear functions,

$$\mathcal{H}_{reg} = L_d = \left\{ x \mapsto \langle x, w \rangle + b : w \in \mathbb{R}^d, b\in \mathbb{R} \right\}$$
:::

:::success
Linear regression is a common statistical tool for modeling the relationship between some "explanatory" variables and a real valued outcome. Cast as a learning problem, the domain set $\mathcal{X}$ is a subset of $\mathbb{R}^d$, for some $d$, and the label set $\mathcal{Y}$ is the set of real numbers. We would like to learn a linear function $h:\mathbb{R}^d \to \mathbb{R}$ that best approximates the relationship between our variables (for example, predicting the weight of a baby as a function of her age and weight at birth). Figure 9.1 shows an example of a linear regression predictor for $d=1$.

The hypothesis class of linear regression predictors is simply the set of linear functions,

$$\mathcal{H}_{reg} = L_d = \left\{ x \mapsto \langle x, w \rangle + b : w \in \mathbb{R}^d, b\in \mathbb{R} \right\}$$
:::

:::info
Next we need to define a loss function for regression. While in classification the definition of the loss is straightforward, as $\ell(h, (x, y))$ simply indicates whether $h(x)$ correctly predicts $y$ or not, in regression, if the baby's weight is 3 kg, both the predictions 3.00001 kg and 4 kg are "wrong," but we would clearly prefer the former over the latter. We therefore need to define how much we shall be "penalized" for the discrepancy between $h(x)$ and $y$. One common way is to use the squared-loss function, namely,

$$\ell(h, (x, y)) = (h(x) - y)^2$$

For this loss function, the empirical risk function is called the Mean Squared Error, namely,

$$L_S(h) = \dfrac{1}{m}\sum^m_{i=1}(h(x_i)-y_i)^2$$
:::

:::success
Next we need to define a loss function for regression. In classification the definition of the loss is straightforward: $\ell(h, (x, y))$ simply indicates whether $h(x)$ correctly predicts $y$ or not. In regression, however, if the baby's weight is 3 kg, both the predictions 3.00001 kg and 4 kg are "wrong," but we would clearly prefer the former over the latter. We therefore need to define how much we should be "penalized" for the discrepancy between $h(x)$ and $y$. One common way is to use the squared-loss function, namely,

$$\ell(h, (x, y)) = (h(x) - y)^2$$

For this loss function, the empirical risk function is called the Mean Squared Error, namely,

$$L_S(h) = \dfrac{1}{m}\sum^m_{i=1}(h(x_i)-y_i)^2$$

In the next subsection we will see how to implement the ERM rule for linear regression with the squared loss. Of course, other loss functions can also be used, for example the absolute value loss function, $\ell(h, (x,y)) = \vert h(x) - y \vert$; implementing the ERM rule with this loss can be done with linear programming (see Exercise 1).

Note that since linear regression is not a binary prediction task, we cannot analyze its sample complexity using the VC-dimension. One possible analysis is via the "discretization trick" ([see Remark 4.1 in Chapter 4](https://hackmd.io/@shaoeChen/r1KzxD9fP/https%3A%2F%2Fhackmd.io%2F%40shaoeChen%2Fr1PLbHK-P)); that is, if we are happy to represent each element of the vector $w$ and the bias $b$ with a finite number of bits (say, a 64-bit floating point number), we can rely on the sample complexity bounds for finite hypothesis classes from Chapter 4. Note, however, that for those bounds to hold the loss function must also be bounded. Later in the book we will describe more rigorous ways to analyze the sample complexity of regression problems.
:::
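:::success
A short sketch (mine, assuming NumPy) of the squared loss and the Mean Squared Error of one candidate linear predictor on a made-up sample:

```python
import numpy as np

def squared_loss(prediction, y):
    """ell(h, (x, y)) = (h(x) - y)^2"""
    return (prediction - y) ** 2

# Toy one-dimensional data: x = age in months, y = weight in kg (made-up numbers).
x = np.array([1.0, 3.0, 6.0, 12.0])
y = np.array([4.0, 5.5, 7.5, 10.0])

w, b = 0.5, 3.8                                # some candidate predictor h(x) = w * x + b
predictions = w * x + b
mse = np.mean(squared_loss(predictions, y))    # L_S(h) = (1/m) * sum_i (h(x_i) - y_i)^2
print(mse)
```
:::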
#### 9.2.1 Least Squares

:::info
Least squares is the algorithm that solves the ERM problem for the hypothesis class of linear regression predictors with respect to the squared loss. The ERM problem with respect to this class, given a training set $S$, and using the homogenous version of $L_d$, is to find

$$\underset{w}{\text{argmin}}\ L_S(h_w) = \underset{w}{\text{argmin}}\ \dfrac{1}{m} \sum^m_{i=1} (\langle w, x_i \rangle -y_i)^2$$

To solve the problem we calculate the gradient of the objective function and compare it to zero. That is, we need to solve

$$\dfrac{2}{m} \sum^m_{i=1} (\langle w, x_i \rangle -y_i) x_i = 0$$

We can rewrite the problem as the problem $Aw = b$ where

$$A=\left(\sum^m_{i=1}x_ix_i^T \right) \qquad \text{and} \qquad b=\sum^m_{i=1} y_ix_i \tag{9.6}$$

Or, in matrix form:

$$A = \begin{pmatrix} \vdots & & \vdots \\ x_1 & \cdots & x_m \\ \vdots & & \vdots \end{pmatrix} \begin{pmatrix} \vdots & & \vdots \\ x_1 & \cdots & x_m \\ \vdots & & \vdots \end{pmatrix}^T \qquad b = \begin{pmatrix} \vdots & & \vdots \\ x_1 & \cdots & x_m \\ \vdots & & \vdots \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} \tag{9.7}$$

If $A$ is invertible then the solution to the ERM problem is

$$w = A^{-1} b$$
:::

:::success
Least squares is the algorithm that solves the ERM problem for the hypothesis class of linear regression predictors with respect to the squared loss. Given a training set $S$, and using the homogenous version of $L_d$, the ERM problem for this class is to find

$$\underset{w}{\text{argmin}}\ L_S(h_w) = \underset{w}{\text{argmin}}\ \dfrac{1}{m} \sum^m_{i=1} (\langle w, x_i \rangle -y_i)^2$$

To solve the problem, we compute the gradient of the objective function and set it to zero. That is, we need to solve

$$\dfrac{2}{m} \sum^m_{i=1} (\langle w, x_i \rangle -y_i) x_i = 0$$

We can rewrite this as the problem $Aw = b$, where

$$A=\left(\sum^m_{i=1}x_ix_i^T \right) \qquad \text{and} \qquad b=\sum^m_{i=1} y_ix_i \tag{9.6}$$

or, equivalently, in the matrix form of Equation (9.7). If $A$ is invertible, the solution to the ERM problem is

$$w = A^{-1} b$$
:::
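:::success
A minimal NumPy sketch (mine) of the Least Squares solution: build $A=\sum_i x_ix_i^T$ and $b=\sum_i y_ix_i$ from a made-up sample and solve $Aw=b$. The commented-out line uses the pseudoinverse, which also covers the case discussed next, where $A$ is not invertible.

```python
import numpy as np

# Toy data for homogeneous linear regression h_w(x) = <w, x>;
# the first coordinate is the constant 1 (the bias trick from the start of the chapter).
X = np.array([[1.0, 1.0], [1.0, 3.0], [1.0, 6.0], [1.0, 12.0]])
y = np.array([4.0, 5.5, 7.5, 10.0])

A = X.T @ X          # A = sum_i x_i x_i^T   (Equation 9.6)
b = X.T @ y          # b = sum_i y_i x_i

w = np.linalg.solve(A, b)        # w = A^{-1} b, valid because A is invertible here
# w = np.linalg.pinv(A) @ b      # A^+ b also works when A is singular (next block)
print(w)
print(X @ w)                     # predictions of the ERM predictor on the sample
```
:::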
:::info
The case in which $A$ is not invertible requires a few standard tools from linear algebra, which are available in Appendix C. It can be easily shown that if the training instances do not span the entire space of $\mathbb{R}^d$ then $A$ is not invertible. Nevertheless, we can always find a solution to the system $Aw = b$ because $b$ is in the range of $A$. Indeed, since $A$ is symmetric we can write it using its eigenvalue decomposition as $A = VDV^T$, where $D$ is a diagonal matrix and $V$ is an orthonormal matrix (that is, $V^TV$ is the identity $d \times d$ matrix). Define $D^+$ to be the diagonal matrix such that $D^+_{i,i}=0$ if $D_{i,i}=0$ and otherwise $D^+_{i,i}=1/D_{i,i}$. Now, define

$$A^+ = VD^+V^T \qquad \text{and} \qquad \hat{w} = A^+b$$

Let $v_i$ denote the $i$'th column of $V$. Then, we have

$$A\hat{w} = AA^+b = VDV^TVD^+V^Tb = VDD^+V^Tb = \sum_{i:D_{i,i} \neq 0} v_i v_i^T b$$

That is, $A\hat{w}$ is the projection of $b$ onto the span of those vectors $v_i$ for which $D_{i,i} \neq 0$. Since the linear span of $x_1, \cdots, x_m$ is the same as the linear span of those $v_i$, and $b$ is in the linear span of the $x_i$, we obtain that $A\hat{w}=b$, which concludes our argument.
:::

:::success
If $A$ is not invertible, a few standard tools from linear algebra are needed; see Appendix C. It is easy to show that if the training instances do not span the entire space $\mathbb{R}^d$, then $A$ is not invertible. Nevertheless, we can always find a solution to the system $Aw=b$, because $b$ is in the range of $A$. Indeed, since $A$ is symmetric we can write its eigenvalue decomposition as $A = VDV^T$, where $D$ is a diagonal matrix and $V$ is an orthonormal matrix (that is, $V^TV$ is the $d \times d$ identity matrix). Define $D^+$ to be the diagonal matrix with $D^+_{i,i}=0$ if $D_{i,i}=0$ and $D^+_{i,i}=1/D_{i,i}$ otherwise. Now define

$$A^+ = VD^+V^T \qquad \text{and} \qquad \hat{w} = A^+b$$

Let $v_i$ denote the $i$'th column of $V$. Then we have

$$A\hat{w} = AA^+b = VDV^TVD^+V^Tb = VDD^+V^Tb = \sum_{i:D_{i,i} \neq 0} v_i v_i^T b$$

That is, $A\hat{w}$ is the projection of $b$ onto the span of those vectors $v_i$ for which $D_{i,i} \neq 0$. Since the linear span of $x_1, \cdots, x_m$ is the same as the linear span of those $v_i$, and $b$ is in the linear span of the $x_i$, we obtain that $A\hat{w}=b$, which concludes the argument.
:::

#### 9.2.2 Linear Regression for Polynomial Regression Tasks

:::info
Some learning tasks call for nonlinear predictors, such as polynomial predictors. Take, for instance, a one dimensional polynomial function of degree $n$, that is,

$$p(x) = a_0 + a_1x + a_2x^2 + \cdots + a_nx^n$$

where $(a_0, \cdots, a_n)$ is a vector of coefficients of size $n + 1$. In the following we depict a training set that is better fitted using a 3rd degree polynomial predictor than using a linear predictor.
:::

:::success
Some learning tasks call for nonlinear predictors, such as polynomial predictors. Take, for instance, a one dimensional polynomial function of degree $n$,

$$p(x) = a_0 + a_1x + a_2x^2 + \cdots + a_nx^n$$

where $(a_0, \cdots, a_n)$ is a vector of coefficients of size $n+1$. The figure below depicts a training set that is better fitted by a 3rd degree polynomial predictor than by a linear predictor.

![](https://i.imgur.com/PSrc9y4.png)
:::

:::info
We will focus here on the class of one dimensional, $n$-degree, polynomial regression predictors, namely

$$\mathcal{H}^n_{\text{poly}} = \{ x \mapsto p(x) \}$$

where $p$ is a one dimensional polynomial of degree $n$, parameterized by a vector of coefficients $(a_0, \cdots, a_n)$. Note that $\mathcal{X} = \mathbb{R}$, since this is a one dimensional polynomial, and $\mathcal{Y} = \mathbb{R}$, as this is a regression problem. One way to learn this class is by reduction to the problem of linear regression, which we have already shown how to solve. To translate a polynomial regression problem to a linear regression problem, we define the mapping $\psi: \mathbb{R} \to \mathbb{R}^{n+1}$ such that $\psi(x) = (1, x, x^2, \cdots, x^n)$. Then we have that

$$p(\psi(x)) = a_0 + a_1x + a_2x^2 + \cdots + a_n x^n = \langle a, \psi(x) \rangle$$

and we can find the optimal vector of coefficients $a$ by using the Least Squares algorithm as shown earlier.
:::

:::success
Here we focus on the class of one dimensional, degree-$n$ polynomial regression predictors, namely

$$\mathcal{H}^n_{\text{poly}} = \{ x \mapsto p(x) \}$$

where $p$ is a one dimensional polynomial of degree $n$, parameterized by a vector of coefficients $(a_0, \cdots, a_n)$. Note that $\mathcal{X} = \mathbb{R}$, since this is a one dimensional polynomial, and $\mathcal{Y} = \mathbb{R}$, since this is a regression problem.

One way to learn this class is by reduction to linear regression, which we have already shown how to solve. To translate a polynomial regression problem into a linear regression problem, we define the mapping $\psi: \mathbb{R} \to \mathbb{R}^{n+1}$ with $\psi(x) = (1, x, x^2, \cdots, x^n)$. Then we have

$$p(\psi(x)) = a_0 + a_1x + a_2x^2 + \cdots + a_n x^n = \langle a, \psi(x) \rangle$$

and we can find the optimal coefficient vector $a$ by using the Least Squares algorithm shown earlier.
:::
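:::success
A short sketch (mine, assuming NumPy) of this reduction: map each $x$ to $\psi(x) = (1, x, x^2, \cdots, x^n)$ and reuse the Least Squares solution from the previous section on the transformed data; the sample is generated from a made-up cubic so the recovered coefficients are easy to check.

```python
import numpy as np

def psi(x, n):
    """Feature map psi(x) = (1, x, x^2, ..., x^n)."""
    return np.array([x ** k for k in range(n + 1)])

# Made-up one-dimensional sample following the cubic p(x) = 1 - 2x + x^3.
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
ys = 1.0 - 2.0 * xs + xs ** 3

n = 3
Phi = np.stack([psi(x, n) for x in xs])   # m x (n + 1) matrix of transformed instances
A = Phi.T @ Phi                           # as in Equation (9.6), with x_i replaced by psi(x_i)
b = Phi.T @ ys
a = np.linalg.pinv(A) @ b                 # coefficient vector (a_0, ..., a_n)
print(np.round(a, 3))                     # approximately [ 1. -2.  0.  1.]
```
:::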
### 9.3 Logistic Regression

:::info
In logistic regression we learn a family of functions $h$ from $\mathbb{R}^d$ to the interval $[0, 1]$. However, logistic regression is used for classification tasks: We can interpret $h(x)$ as the probability that the label of $x$ is 1. The hypothesis class associated with logistic regression is the composition of a sigmoid function $\phi_{\text{sig}}: \mathbb{R} \to [0, 1]$ over the class of linear functions $L_d$. In particular, the sigmoid function used in logistic regression is the logistic function, defined as

$$\phi_{\text{sig}}(z) = \dfrac{1}{1 + \text{exp}(-z)} \tag{9.9}$$
:::

:::success
In logistic regression we learn a family of functions $h$ mapping $\mathbb{R}^d$ to the interval $[0, 1]$. However, logistic regression is used for classification tasks: we can interpret $h(x)$ as the probability that the label of $x$ is 1. The hypothesis class associated with logistic regression is the composition of a sigmoid function $\phi_{\text{sig}}: \mathbb{R} \to [0, 1]$ over the class of linear functions $L_d$. In particular, the sigmoid function used in logistic regression is the logistic function, defined as

$$\phi_{\text{sig}}(z) = \dfrac{1}{1 + \text{exp}(-z)} \tag{9.9}$$

The name "sigmoid" means "S-shaped", as shown below:

![](https://i.imgur.com/cBaHNsg.png)
:::

:::info
The hypothesis class is therefore (where for simplicity we are using homogenous linear functions):

$$H_\text{sig} = \phi_\text{sig} \circ L_d = \{x \mapsto \phi_\text{sig}(\langle w, x \rangle): w \in \mathbb{R}^d \}$$

Note that when $\langle w, x \rangle$ is very large then $\phi_\text{sig}(\langle w, x \rangle)$ is close to 1, whereas if $\langle w, x \rangle$ is very small then $\phi_\text{sig}(\langle w, x \rangle)$ is close to 0. Recall that the prediction of the halfspace corresponding to a vector $w$ is $\text{sign}(\langle w, x \rangle)$. Therefore, the predictions of the halfspace hypothesis and the logistic hypothesis are very similar whenever $\vert \langle w, x \rangle \vert$ is large. However, when $\vert \langle w, x \rangle \vert$ is close to 0 we have that $\phi_\text{sig}(\langle w, x \rangle) \approx \dfrac{1}{2}$. Intuitively, the logistic hypothesis is not sure about the value of the label so it guesses that the label is $\text{sign}(\langle w, x \rangle)$ with probability slightly larger than 50%. In contrast, the halfspace hypothesis always outputs a deterministic prediction of either 1 or -1, even if $\vert \langle w, x \rangle \vert$ is very close to 0.
:::

:::success
The hypothesis class is therefore (using homogenous linear functions for simplicity):

$$H_\text{sig} = \phi_\text{sig} \circ L_d = \{x \mapsto \phi_\text{sig}(\langle w, x \rangle): w \in \mathbb{R}^d \}$$

Note that when $\langle w, x \rangle$ is very large, $\phi_\text{sig}(\langle w, x \rangle)$ is close to 1, whereas when $\langle w, x \rangle$ is very small, $\phi_\text{sig}(\langle w, x \rangle)$ is close to 0. Recall that the prediction of the halfspace corresponding to a vector $w$ is $\text{sign}(\langle w, x \rangle)$. Therefore, whenever $\vert \langle w, x \rangle \vert$ is large, the predictions of the halfspace hypothesis and the logistic hypothesis are very similar. However, when $\vert \langle w, x \rangle \vert$ is close to 0, we have $\phi_\text{sig}(\langle w, x \rangle) \approx \dfrac{1}{2}$. Intuitively, the logistic hypothesis is not sure about the value of the label, so it guesses that the label is $\text{sign}(\langle w, x \rangle)$ with probability only slightly larger than 50%. In contrast, the halfspace hypothesis always outputs a deterministic prediction of either 1 or -1, even when $\vert \langle w, x \rangle \vert$ is very close to 0.
:::

:::info
Next, we need to specify a loss function. That is, we should define how bad it is to predict some $h_w(x) \in [0, 1]$ given that the true label is $y \in \{ \pm 1\}$. Clearly, we would like that $h_w(x)$ would be large if $y = 1$ and that $1 - h_w(x)$ (i.e., the probability of predicting $-1$) would be large if $y = -1$. Note that

$$1 - h_w(x) = 1 - \dfrac{1}{1 + \text{exp}(- \langle w, x \rangle)} = \dfrac{\text{exp}(-\langle w, x \rangle)}{1 + \text{exp}(-\langle w, x \rangle)} = \dfrac{1}{1+\text{exp}(\langle w, x\rangle)}$$

Therefore, any reasonable loss function would increase monotonically with $\dfrac{1}{1 + \text{exp}(y\langle w, x \rangle)}$, or equivalently, would increase monotonically with $1 + \text{exp}(-y \langle w, x \rangle)$. The logistic loss function used in logistic regression penalizes $h_w$ based on the log of $1 + \text{exp}(-y \langle w, x \rangle)$ (recall that log is a monotonic function). That is,

$$\ell(h_w, (x,y)) = \log(1 + \text{exp}(-y \langle w, x \rangle))$$
:::

:::success
Next, we need to specify a loss function. That is, we should define how bad it is to predict some $h_w(x) \in [0, 1]$ when the true label is $y \in \{ \pm 1\}$. Clearly, we would like:

* $h_w(x)$ to be large when $y=1$
* $1 - h_w(x)$ (i.e., the probability of predicting $-1$) to be large when $y = -1$

Note that

$$1 - h_w(x) = 1 - \dfrac{1}{1 + \text{exp}(- \langle w, x \rangle)} = \dfrac{\text{exp}(-\langle w, x \rangle)}{1 + \text{exp}(-\langle w, x \rangle)} = \dfrac{1}{1+\text{exp}(\langle w, x\rangle)}$$

Therefore, any reasonable loss function would increase monotonically with $\dfrac{1}{1 + \text{exp}(y\langle w, x \rangle)}$, or equivalently, would increase monotonically with $1 + \text{exp}(-y \langle w, x \rangle)$. The logistic loss function used in logistic regression penalizes $h_w$ based on the log of $1 + \text{exp}(-y \langle w, x \rangle)$ (recall that log is a monotonic function). That is,

$$\ell(h_w, (x,y)) = \log(1 + \text{exp}(-y \langle w, x \rangle))$$
:::

:::info
Therefore, given a training set $S = (x_1, y_1), \cdots, (x_m, y_m)$, the ERM problem associated with logistic regression is

$$\underset{w \in \mathbb{R}^d}{\text{argmin}}\dfrac{1}{m} \sum^m_{i=1}\log(1+\text{exp}(-y_i \langle w, x_i \rangle)) \tag{9.10}$$

The advantage of the logistic loss function is that it is a convex function with respect to $w$; hence the ERM problem can be solved efficiently using standard methods. We will study how to learn with convex functions, and in particular specify a simple algorithm for minimizing convex functions, in later chapters.

The ERM problem associated with logistic regression (Equation (9.10)) is identical to the problem of finding a Maximum Likelihood Estimator, a well-known statistical approach for finding the parameters that maximize the joint probability of a given data set assuming a specific parametric probability function. We will study the Maximum Likelihood approach in Chapter 24.
:::

:::success
Therefore, given a training set $S = (x_1, y_1), \cdots, (x_m, y_m)$, the ERM problem associated with logistic regression is

$$\underset{w \in \mathbb{R}^d}{\text{argmin}}\dfrac{1}{m} \sum^m_{i=1}\log(1+\text{exp}(-y_i \langle w, x_i \rangle)) \tag{9.10}$$

The advantage of the logistic loss function is that it is convex with respect to $w$; hence the ERM problem can be solved efficiently using standard methods. In later chapters we will study how to learn with convex functions, and in particular describe a simple algorithm for minimizing convex functions.

The ERM problem associated with logistic regression (Equation (9.10)) is identical to the problem of finding a Maximum Likelihood Estimator, a well-known statistical approach for finding the parameters that maximize the joint probability of a given data set under a specific parametric probability function. We will study the Maximum Likelihood approach in Chapter 24.
:::

### 9.4 Summary

:::info
The family of linear predictors is one of the most useful families of hypothesis classes, and many learning algorithms that are being widely used in practice rely on linear predictors. We have shown efficient algorithms for learning linear predictors with respect to the zero-one loss in the separable case and with respect to the squared and logistic losses in the unrealizable case. In later chapters we will present the properties of the loss function that enable efficient learning.

Naturally, linear predictors are effective whenever we assume, as prior knowledge, that some linear predictor attains low risk with respect to the underlying distribution. In the next chapter we show how to construct nonlinear predictors by composing linear predictors on top of simple classes. This will enable us to employ linear predictors for a variety of prior knowledge assumptions.
:::

:::success
The family of linear predictors is one of the most useful families of hypothesis classes, and many learning algorithms that are widely used in practice rely on linear predictors. We have shown efficient algorithms for learning linear predictors with respect to the zero-one loss in the separable case, and with respect to the squared and logistic losses in the unrealizable case. In later chapters we will present the properties of loss functions that enable efficient learning.
:::
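:::success
As a supplement (mine, assuming NumPy), here is a compact sketch of the pieces above: the logistic function (9.9), the hypothesis $h_w(x)=\phi_{\text{sig}}(\langle w, x\rangle)$, the logistic loss, and the empirical risk of Equation (9.10). Plain gradient descent is used here only as one example of the "standard methods" the text alludes to; the book itself defers the optimization algorithms to later chapters. The data is made up and is not separable by a homogeneous halfspace.

```python
import numpy as np

def phi_sig(z):
    """Logistic function (9.9): 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def empirical_risk(w, X, y):
    """Equation (9.10): mean of log(1 + exp(-y_i <w, x_i>))."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5], [0.5, 1.0]])
y = np.array([1, 1, -1, -1, -1])

# Gradient descent on the convex objective; 0.5 is an arbitrary step size.
w = np.zeros(X.shape[1])
for _ in range(500):
    p = phi_sig(-y * (X @ w))                       # = 1 / (1 + exp(y <w, x_i>))
    grad = -np.mean((p * y)[:, None] * X, axis=0)   # gradient of the empirical risk
    w -= 0.5 * grad

print(w, empirical_risk(w, X, y))
print(phi_sig(X @ w))    # h_w(x): estimated probability that each label is 1
```
:::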