# Chapter 5 - Understanding Machine Learning From Theory to Algorithms: The Bias-Complexity Tradeoff

###### tags: `Understanding Machine Learning From Theory to Algorithms`

## Remark

:::info
Original text
:::

:::success
Personal understanding
:::

[Course link]()

[TOC]

## The Bias-Complexity Tradeoff

:::info
In Chapter 2 we saw that unless one is careful, the training data can mislead the learner, and result in overfitting. To overcome this problem, we restricted the search space to some hypothesis class H. Such a hypothesis class can be viewed as reflecting some prior knowledge that the learner has about the task - a belief that one of the members of the class H is a low-error model for the task. For example, in our papayas taste problem, on the basis of our previous experience with other fruits, we may assume that some rectangle in the color-hardness plane predicts (at least approximately) the papaya's tastiness.
:::

:::success
In Chapter 2 we saw that, unless we are careful, the training data can mislead the learner and cause overfitting. To overcome this, we restrict the search space to some hypothesis class $\mathcal{H}$. Such a hypothesis class can be viewed as reflecting the learner's prior knowledge about the task: a belief that some member of $\mathcal{H}$ is a low-error model for it. Taking the papaya problem from Chapter 2 as an example, based on our experience with other fruits we might assume that some rectangle in the color-hardness plane predicts how a papaya tastes.
:::

:::info
Is such prior knowledge really necessary for the success of learning? Maybe there exists some kind of universal learner, that is, a learner who has no prior knowledge about a certain task and is ready to be challenged by any task? Let us elaborate on this point. A specific learning task is defined by an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, where the goal of the learner is to find a predictor $h: \mathcal{X} \to \mathcal{Y}$, whose risk, $L_\mathcal{D}(h)$, is small enough. The question is therefore whether there exist a learning algorithm $A$ and a training set size $m$, such that for every distribution $\mathcal{D}$, if $A$ receives $m$ i.i.d. examples from $\mathcal{D}$, there is a high chance it outputs a predictor $h$ that has a low risk.
:::

:::success
But is such prior knowledge really necessary for learning to succeed? Perhaps there exists some universal learner that needs no prior knowledge about the task and is ready to be challenged by any task? Let us elaborate. A specific learning task is defined by an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, and the learner's goal is to find a predictor $h: \mathcal{X} \to \mathcal{Y}$ whose true risk $L_\mathcal{D}(h)$ is small enough. The question is thus whether there exist a learning algorithm $A$ and a training set size $m$ such that, for every distribution $\mathcal{D}$, if $A$ receives $m$ i.i.d. examples from $\mathcal{D}$, it outputs a low-risk predictor $h$ with high probability.
:::

:::info
The first part of this chapter addresses this question formally. The No-Free-Lunch theorem states that no such universal learner exists. To be more precise, the theorem states that for binary classification prediction tasks, for every learner there exists a distribution on which it fails. We say that the learner fails if, upon receiving i.i.d. examples from that distribution, its output hypothesis is likely to have a large risk, say, $\geq 0.3$, whereas for the same distribution, there exists another learner that will output a hypothesis with a small risk. In other words, the theorem states that no learner can succeed on all learnable tasks - every learner has tasks on which it fails while other learners succeed.
:::

:::success
The first part of this chapter treats this question formally. The No-Free-Lunch theorem states that no such universal learner exists. More precisely, for binary classification tasks, for every learner there exists a distribution on which it fails. We say the learner fails if, upon receiving i.i.d. examples from that distribution, its output hypothesis is likely to have a large risk, say $\geq 0.3$, while for the same distribution there exists another learner that outputs a hypothesis with small risk. In other words, no learner succeeds on all learnable tasks - every learner has tasks on which it fails while other learners succeed.
:::

:::info
Therefore, when approaching a particular learning problem, defined by some distribution $\mathcal{D}$, we should have some prior knowledge on $\mathcal{D}$. One type of such prior knowledge is that $\mathcal{D}$ comes from some specific parametric family of distributions. We will study learning under such assumptions later on in Chapter 24. Another type of prior knowledge on $\mathcal{D}$, which we assumed when defining the PAC learning model, is that there exists $h$ in some predefined hypothesis class $\mathcal{H}$, such that $L_\mathcal{D}(h)=0$.
A softer type of prior knowledge on $\mathcal{D}$ is assuming that $\min_{h \in \mathcal{H}}L_\mathcal{D}(h)$ is small. In a sense, this weaker assumption on $\mathcal{D}$ is a prerequisite for using the agnostic PAC model, in which we require that the risk of the output hypothesis will not be much larger than $\min_{h \in \mathcal{H}}L_\mathcal{D}(h)$.
:::

:::success
Therefore, when tackling a particular learning problem defined by some distribution $\mathcal{D}$, we should have prior knowledge about $\mathcal{D}$. One type of such prior knowledge is that $\mathcal{D}$ comes from a specific [parametric family of distributions](http://terms.naer.edu.tw/detail/6049/); we will study learning under such assumptions in Chapter 24. Another type, which we assumed when defining the PAC learning model, is that some predefined hypothesis class $\mathcal{H}$ contains an $h$ with $L_\mathcal{D}(h)=0$. A softer type of prior knowledge on $\mathcal{D}$ is the assumption that $\min_{h \in \mathcal{H}}L_\mathcal{D}(h)$ is small. In a sense, this weaker assumption is a prerequisite for the agnostic PAC model, in which we require that the risk of the output hypothesis not be much larger than $\min_{h \in \mathcal{H}}L_\mathcal{D}(h)$.
:::

:::info
In the second part of this chapter we study the benefits and pitfalls of using a hypothesis class as a means of formalizing prior knowledge. We decompose the error of an ERM algorithm over a class $\mathcal{H}$ into two components. The first component reflects the quality of our prior knowledge, measured by the minimal risk of a hypothesis in our hypothesis class, $\min_{h \in \mathcal{H}} L_\mathcal{D}(h)$. This component is also called the approximation error, or the bias of the algorithm toward choosing a hypothesis from $\mathcal{H}$. The second component is the error due to overfitting, which depends on the size or the complexity of the class $\mathcal{H}$ and is called the estimation error. These two terms imply a tradeoff between choosing a more complex $\mathcal{H}$ (which can decrease the bias but increases the risk of overfitting) or a less complex $\mathcal{H}$ (which might increase the bias but decreases the potential overfitting).
:::

:::success
In the second part of this chapter we study the benefits and pitfalls of using a hypothesis class as a way of formalizing prior knowledge. We decompose the error of an ERM algorithm over a class $\mathcal{H}$ into two parts. The first reflects the quality of our prior knowledge, measured by the minimal risk achievable by a hypothesis in the class, $\min_{h \in \mathcal{H}}L_\mathcal{D}(h)$; it is also called the approximation error, or the bias of the algorithm toward choosing a hypothesis from $\mathcal{H}$. The second part is the error due to overfitting, which depends on the size or complexity of $\mathcal{H}$ and is called the estimation error. These two terms imply a tradeoff between choosing a more complex $\mathcal{H}$ (which lowers the bias but raises the risk of overfitting) and a less complex $\mathcal{H}$ (which raises the bias but reduces the potential overfitting).
:::

### 5.1 The No-Free-Lunch Theorem

:::info
In this part we prove that there is no universal learner. We do this by showing that no learner can succeed on all learning tasks, as formalized in the following theorem:

THEOREM 5.1 (No-Free-Lunch) Let $A$ be any learning algorithm for the task of binary classification with respect to the $0-1$ loss over a domain $\mathcal{X}$. Let $m$ be any number smaller than $\vert \mathcal{X} \vert / 2$, representing a training set size. Then, there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ such that:

1. There exists a function $f: \mathcal{X} \to \{ 0, 1 \}$ with $L_\mathcal{D}(f)=0$.
2. With probability of at least $1/7$ over the choice of $S \sim \mathcal{D}^m$ we have that $L_\mathcal{D}(A(S)) \geq 1/8$.
:::

:::success
Here we prove that no universal learner exists, by showing that no learner can succeed on all learning tasks, as the following theorem formalizes:

THEOREM 5.1 (No-Free-Lunch) Let $A$ be any learning algorithm for binary classification with respect to the $0-1$ loss over a domain $\mathcal{X}$. Let $m$ be any number representing a training set size, with the sole condition that it be smaller than $\vert \mathcal{X} \vert / 2$. Then there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ such that:

1. There exists a function $f: \mathcal{X} \to \{ 0, 1 \}$ with $L_\mathcal{D}(f)=0$.
2. With probability at least $1/7$ over the choice of $S \sim \mathcal{D}^m$, we have $L_\mathcal{D}(A(S)) \geq 1/8$.
:::

:::info
This theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner.
Indeed, a trivial successful learner in this case would be an ERM learner with the hypothesis class $\mathcal{H} = \{ f \}$, or more generally, ERM with respect to any finite hypothesis class that contains $f$ and whose size satisfies the equation $m \geq 8 \log (7 \vert \mathcal{H} \vert / 6)$ (see Corollary 2.3).
:::

:::success
The theorem says that for every learner there exists a task on which it fails, even though that task can be learned successfully by another learner. Indeed, in such a case a trivial successful learner would be an ERM learner with the hypothesis class $\mathcal{H} = \{ f \}$, or more generally, ERM with respect to any finite hypothesis class that contains $f$ and whose size satisfies $m \geq 8 \log (7 \vert \mathcal{H} \vert / 6)$ (see Corollary 2.3).
:::

:::info
Proof Let $C$ be a subset of $\mathcal{X}$ of size $2m$. The intuition of the proof is that any learning algorithm that observes only half of the instances in $C$ has no information on what should be the labels of the rest of the instances in $C$. Therefore, there exists a "reality," that is, some target function $f$, that would contradict the labels that $A(S)$ predicts on the unobserved instances in $C$.
:::

:::success
Proof Let $C$ be a subset of $\mathcal{X}$ of size $2m$. The intuition is that any learning algorithm that observes only half of the instances in $C$ has no clue what the labels of the other half should be, since it has seen only the observed half. Hence there exists a "reality", i.e., some target function $f$, that contradicts the labels $A(S)$ predicts on the unobserved instances in $C$.
:::

:::info
Note that there are $T=2^{2m}$ possible functions from $C \to \{0, 1\}$. Denote these functions by $f_1, \cdots, f_T$.
For each such function, let $\mathcal{D}_i$ be a distribution over $C \times \{0,1\}$ defined by

$$
\mathcal{D}_i(\{(x, y)\}) = \begin{cases} 1/\vert C \vert & \text{if $y=f_i(x)$} \\[2ex] 0 & \text{otherwise} \end{cases}
$$
:::

:::success
Note that there are $T=2^{2m}$ possible functions from $C \to \{0, 1\}$; denote them $f_1, \cdots, f_T$. For each such function, let $\mathcal{D}_i$ be the distribution over $C \times \{0,1\}$ defined by

$$
\mathcal{D}_i(\{(x, y)\}) = \begin{cases} 1/\vert C \vert & \text{if $y=f_i(x)$} \\[2ex] 0 & \text{otherwise} \end{cases}
$$
:::

:::info
That is, the probability to choose a pair $(x, y)$ is $1/\vert C \vert$ if the label $y$ is indeed the true label according to $f_i$, and the probability is 0 if $y \neq f_i(x)$. Clearly, $L_{\mathcal{D}_i}(f_i)=0$.
:::

:::success
That is, the probability of choosing the pair $(x, y)$ is $1/\vert C \vert$ if the label $y$ agrees with $f_i$, and 0 if $y \neq f_i(x)$. Clearly, $L_{\mathcal{D}_i}(f_i)=0$.
:::

:::info
We will show that for every algorithm, $A$, that receives a training set of $m$ examples from $C \times \{0,1\}$ and returns a function $A(S):C \to \{0, 1\}$ it holds that

$$\max_{i \in [T]} \space \underset{S \sim \mathcal{D}^m_i}{\mathbb{E}}[L_{\mathcal{D}_i}(A(S))] \geq 1/4 \tag{5.1}$$
:::

:::success
We will show that every algorithm $A$ that receives a training set of $m$ examples from $C \times \{0,1\}$ and returns a function $A(S):C \to \{0, 1\}$ satisfies

$$\max_{i \in [T]} \space \underset{S \sim \mathcal{D}^m_i}{\mathbb{E}}[L_{\mathcal{D}_i}(A(S))] \geq 1/4 \tag{5.1}$$
:::

:::info
Clearly, this means that for every algorithm, $A'$, that receives a training set of $m$ examples from $\mathcal{X} \times \{0,1\}$ there exist a function $f:\mathcal{X} \to \{0, 1\}$ and a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$, such that $L_\mathcal{D}(f)=0$ and

$$\underset{S \sim \mathcal{D}^m}{\mathbb{E}}[L_\mathcal{D}(A'(S))] \geq 1/4 \tag{5.2}$$

It is easy to verify that the preceding suffices for showing that $\mathbb{P}[L_\mathcal{D}(A'(S)) \geq 1/8] \geq 1/7$,
which is what we need to prove (see Exercise 1).
:::

:::success
Clearly, this means that for every algorithm $A'$ that receives a training set of $m$ examples from $\mathcal{X} \times \{0,1\}$, there exist a function $f:\mathcal{X} \to \{0, 1\}$ and a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$ such that $L_\mathcal{D}(f)=0$ and

$$\underset{S \sim \mathcal{D}^m}{\mathbb{E}}[L_\mathcal{D}(A'(S))] \geq 1/4 \tag{5.2}$$

From the preceding it is easy to verify that $\mathbb{P}[L_\mathcal{D}(A'(S)) \geq 1/8] \geq 1/7$, which is what we need to prove (see Exercise 1).
:::

:::info
We now turn to proving that Equation (5.1) holds. There are $k=(2m)^m$ possible sequences of $m$ examples from $C$. Denote these sequences by $S_1, \cdots, S_k$. Also, if $S_j = (x_1, \cdots, x_m)$ we denote by $S^i_j$ the sequence containing the instances in $S_j$ labeled by the function $f_i$, namely, $S^i_j = ((x_1, f_i(x_1)), \cdots, (x_m, f_i(x_m)))$. If the distribution is $\mathcal{D}_i$ then the possible training sets $A$ can receive are $S_1^i, \cdots, S_k^i$, and all these training sets have the same probability of being sampled. Therefore,

$$\underset{S \sim \mathcal{D}^m_i}{\mathbb{E}}[L_{\mathcal{D}_i}(A(S))] = \dfrac{1}{k} \sum^k_{j=1}L_{\mathcal{D}_i}(A(S^i_j)) \tag{5.3}$$
:::

:::success
We now turn to proving that Equation (5.1) holds. There are $k=(2m)^m$ possible sequences of $m$ examples from $C$; denote them $S_1, \cdots, S_k$. Moreover, if $S_j = (x_1, \cdots, x_m)$, we write $S^i_j$ for the sequence containing the instances of $S_j$ labeled by $f_i$, i.e., $S^i_j = ((x_1, f_i(x_1)), \cdots, (x_m, f_i(x_m)))$. If the distribution is $\mathcal{D}_i$, then the possible training sets $A$ can receive are $S_1^i, \cdots, S_k^i$, all sampled with equal probability. Therefore,

$$\underset{S \sim \mathcal{D}^m_i}{\mathbb{E}}[L_{\mathcal{D}_i}(A(S))] = \dfrac{1}{k} \sum^k_{j=1}L_{\mathcal{D}_i}(A(S^i_j)) \tag{5.3}$$
:::

:::info
Using the facts that "maximum" is larger than "average" and that "average" is larger than "minimum," we have

$$\begin{aligned} \max_{i \in [T]} \dfrac{1}{k} \sum^k_{j=1} L_{\mathcal{D}_i}(A(S^i_j)) & \geq \dfrac{1}{T} \sum^T_{i=1} \dfrac{1}{k} \sum^k_{j=1} L_{\mathcal{D}_i}(A(S^i_j)) \\ &=\dfrac{1}{k} \sum^k_{j=1} \dfrac{1}{T} \sum^T_{i=1} L_{\mathcal{D}_i}(A(S^i_j)) \\ &\geq \min_{j \in [k]} \dfrac{1}{T} \sum^T_{i=1} L_{\mathcal{D}_i}(A(S^i_j))\end{aligned} \tag{5.4}$$
:::

:::success
Using the facts that a maximum is at least an average and an average is at least a minimum, we get

$$\begin{aligned} \max_{i \in [T]} \dfrac{1}{k} \sum^k_{j=1} L_{\mathcal{D}_i}(A(S^i_j)) & \geq \dfrac{1}{T} \sum^T_{i=1} \dfrac{1}{k} \sum^k_{j=1} L_{\mathcal{D}_i}(A(S^i_j)) \\ &=\dfrac{1}{k} \sum^k_{j=1} \dfrac{1}{T} \sum^T_{i=1} L_{\mathcal{D}_i}(A(S^i_j)) \\ &\geq \min_{j \in [k]} \dfrac{1}{T} \sum^T_{i=1} L_{\mathcal{D}_i}(A(S^i_j))\end{aligned} \tag{5.4}$$
:::

:::info
Next, fix some $j \in [k]$. Denote $S_j = (x_1, \cdots, x_m)$ and let $v_1, \cdots, v_p$ be the examples in $C$ that do not appear in $S_j$. Clearly, $p \geq m$. Therefore, for every function $h:C \to \{0, 1\}$ and every $i$ we have

$$\begin{aligned} L_{\mathcal{D}_i}(h) &=\dfrac{1}{2m} \sum_{x \in C}\mathbb{I}_{[h(x) \neq f_i(x)]} \\ & \geq \dfrac{1}{2m} \sum^p_{r=1} \mathbb{I}_{[h(v_r) \neq f_i(v_r)]} \\ & \geq \dfrac{1}{2p} \sum^p_{r=1} \mathbb{I}_{[h(v_r) \neq f_i(v_r)]}\end{aligned} \tag{5.5}$$

Therefore,

$$\begin{aligned}\dfrac{1}{T} \sum^T_{i=1} L_{\mathcal{D}_i}(A(S^i_j)) & \geq \dfrac{1}{T} \sum^T_{i=1} \dfrac{1}{2p} \sum^p_{r=1} \mathbb{I}_{[A(S^i_j)(v_r) \neq f_i(v_r)]} \\ & =\dfrac{1}{2p} \sum^p_{r=1} \dfrac{1}{T} \sum^T_{i=1} \mathbb{I}_{[A(S^i_j)(v_r) \neq f_i(v_r)]} \\ & \geq \dfrac{1}{2} \cdot \min_{r \in [p]} \dfrac{1}{T} \sum^T_{i=1} \mathbb{I}_{[A(S^i_j)(v_r) \neq f_i(v_r)]}\end{aligned} \tag{5.6}$$
:::

:::success
Next, fix some $j \in [k]$. Write $S_j = (x_1, \cdots, x_m)$ and let $v_1, \cdots, v_p$ be the examples in $C$ that do not appear in $S_j$. Clearly, $p \geq m$. Therefore, for every function $h:C \to \{0, 1\}$ and every $i$ we get Equation (5.5) above, and hence Equation (5.6). (The notation here differs slightly from the book, which uses $\mathbb{1}$ for the indicator; since $\mathbb{1}$ does not render well in markdown, the indicator is written as $\mathbb{I}$ throughout.)
:::

:::info
Next, fix some $r \in [p]$.
We can partition all the functions in $f_1, \cdots, f_T$ into $T/2$ disjoint pairs, where for a pair $(f_i, f_{i'})$ we have that for every $c \in C$, $f_i(c) \neq f_{i'}(c)$ if and only if $c=v_r$. Since for such a pair we must have $S_j^i = S^{i'}_j$, it follows that

$$\mathbb{I}_{[A(S^i_j)(v_r) \neq f_i(v_r)]} + \mathbb{I}_{[A(S_j^{i'})(v_r) \neq f_{i'}(v_r)]} = 1$$

which yields

$$\dfrac{1}{T} \sum^T_{i=1} \mathbb{I}_{[A(S^i_j)(v_r) \neq f_i(v_r)]}=\dfrac{1}{2}$$

Combining this with Equation (5.6), Equation (5.4), and Equation (5.3), we obtain that Equation (5.1) holds, which concludes our proof.
:::

:::success
Next, fix some $r \in [p]$. We can partition the functions $f_1, \cdots, f_T$ into $T/2$ disjoint pairs, where a pair $(f_i, f_{i'})$ satisfies, for every $c \in C$, $f_i(c) \neq f_{i'}(c)$ if and only if $c=v_r$. Since such a pair must have $S_j^i = S^{i'}_j$, it follows that

$$\mathbb{I}_{[A(S^i_j)(v_r) \neq f_i(v_r)]} + \mathbb{I}_{[A(S_j^{i'})(v_r) \neq f_{i'}(v_r)]} = 1$$

which yields

$$\dfrac{1}{T} \sum^T_{i=1} \mathbb{I}_{[A(S^i_j)(v_r) \neq f_i(v_r)]}=\dfrac{1}{2}$$

Combining this with Equations (5.6), (5.4), and (5.3), we obtain that Equation (5.1) holds, which concludes the proof.
:::

#### 5.1.1 No-Free-Lunch and Prior Knowledge

:::info
How does the No-Free-Lunch result relate to the need for prior knowledge? Let us consider an ERM predictor over the hypothesis class $\mathcal{H}$ of all the functions $f$ from $\mathcal{X}$ to $\{0, 1\}$. This class represents lack of prior knowledge: Every possible function from the domain to the label set is considered a good candidate. According to the No-Free-Lunch theorem, any algorithm that chooses its output from hypotheses in $\mathcal{H}$, and in particular the ERM predictor, will fail on some learning task.
Therefore, this class is not PAC learnable, as formalized in the following corollary:
:::

:::success
How does the No-Free-Lunch result relate to the need for prior knowledge? Consider an ERM predictor over the hypothesis class $\mathcal{H}$ of all functions $f$ from $\mathcal{X}$ to $\{0, 1\}$. This class represents a lack of prior knowledge: every possible function from the domain to the label set is considered a good candidate. By the No-Free-Lunch theorem, any algorithm that chooses its output from the hypotheses in $\mathcal{H}$, and in particular the ERM predictor, will fail on some learning task. Therefore, this class is not PAC learnable, as the following corollary formalizes:
:::

:::info
COROLLARY 5.2 Let $\mathcal{X}$ be an infinite domain set and let $\mathcal{H}$ be the set of all functions from $\mathcal{X}$ to $\{0, 1\}$. Then, $\mathcal{H}$ is not PAC learnable.

Proof Assume, by way of contradiction, that the class is learnable. Choose some $\epsilon < 1/8$ and $\delta < 1/7$. By the definition of PAC learnability, there must be some learning algorithm $A$ and an integer $m = m(\epsilon, \delta)$, such that for any data-generating distribution over $\mathcal{X} \times \{0, 1\}$, if for some function $f: \mathcal{X} \to \{0, 1\}$, $L_\mathcal{D}(f)=0$, then with probability greater than $1 - \delta$ when $A$ is applied to samples $S$ of size $m$, generated i.i.d. by $\mathcal{D}$, $L_\mathcal{D}(A(S)) \leq \epsilon$. However, applying the No-Free-Lunch theorem, since $\vert \mathcal{X} \vert > 2m$, for every learning algorithm (and in particular for the algorithm $A$), there exists a distribution $\mathcal{D}$ such that with probability greater than $1/7 > \delta$, $L_\mathcal{D}(A(S)) > 1/8 > \epsilon$, which leads to the desired contradiction.
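
(A sanity check of my own, not from the book.) The quantity bounded in Equation (5.1) can be computed by brute force for a tiny domain. The sketch below takes $m=2$ (so $\vert C \vert = 4$), enumerates all $T=2^{2m}$ labelings and all $k=(2m)^m$ training sequences, and evaluates one concrete learner; the learner (memorize the sample, predict 0 elsewhere) and all names are my own illustration, not the book's.

```python
from itertools import product

def nfl_max_expected_risk(m, learner):
    """Max over labelings f_i of E_{S ~ D_i^m}[L_{D_i}(A(S))], where D_i is
    uniform over C = {0, ..., 2m-1} labeled by f_i, as in the proof of
    Theorem 5.1 -- exactly the quantity bounded in Equation (5.1)."""
    C = range(2 * m)
    seqs = list(product(C, repeat=m))        # all k = (2m)^m training sequences
    worst = 0.0
    for f in product([0, 1], repeat=2 * m):  # all T = 2^{2m} labelings of C
        risks = []
        for xs in seqs:
            h = learner([(x, f[x]) for x in xs])                  # train on S_j^i
            risks.append(sum(h(x) != f[x] for x in C) / (2 * m))  # L_{D_i}(h)
        worst = max(worst, sum(risks) / len(risks))
    return worst

def memorize_default_zero(S):
    """One concrete learner: memorize the training labels, predict 0 elsewhere."""
    seen = dict(S)
    return lambda x: seen.get(x, 0)

# Equation (5.1) promises this is at least 1/4 for ANY learner:
print(nfl_max_expected_risk(2, memorize_default_zero))
```

Here the worst labeling is the all-ones function (every unseen point is misclassified), and the printed value is well above the $1/4$ of Equation (5.1); swapping in any other learner cannot bring the maximum below $1/4$.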
:::

:::success
COROLLARY 5.2 Let $\mathcal{X}$ be an infinite domain set and let $\mathcal{H}$ be the set of all functions mapping $\mathcal{X}$ to $\{0, 1\}$. Then $\mathcal{H}$ is not PAC learnable.

Proof Assume, by way of contradiction, that the class is learnable. Choose some $\epsilon < 1/8$ and $\delta < 1/7$. By the definition of PAC learnability, there exist a learning algorithm $A$ and an integer $m = m(\epsilon, \delta)$ such that, for any data-generating distribution over $\mathcal{X} \times \{0, 1\}$, if some function $f: \mathcal{X} \to \{0, 1\}$ has $L_\mathcal{D}(f)=0$, then when $A$ is applied to a sample $S$ of size $m$ drawn i.i.d. from $\mathcal{D}$, with probability greater than $1 - \delta$ we have $L_\mathcal{D}(A(S)) \leq \epsilon$. However, applying the No-Free-Lunch theorem, since $\vert \mathcal{X} \vert > 2m$, for every learning algorithm (and in particular for $A$) there exists a distribution $\mathcal{D}$ such that with probability greater than $1/7 > \delta$ we get $L_\mathcal{D}(A(S)) > 1/8 > \epsilon$, which yields the desired contradiction.
:::

:::info
How can we prevent such failures? We can escape the hazards foreseen by the No-Free-Lunch theorem by using our prior knowledge about a specific learning task, to avoid the distributions that will cause us to fail when learning that task. Such prior knowledge can be expressed by restricting our hypothesis class.
:::

:::success
How can we prevent such failures? We can escape the hazards foreseen by the No-Free-Lunch theorem by using our prior knowledge about the specific learning task to avoid the distributions that would make us fail on it. Such prior knowledge can be expressed by restricting our hypothesis class.
:::

:::info
But how should we choose a good hypothesis class? On the one hand, we want to believe that this class includes the hypothesis that has no error at all (in the PAC setting), or at least that the smallest error achievable by a hypothesis from this class is indeed rather small (in the agnostic setting). On the other hand, we have just seen that we cannot simply choose the richest class - the class of all functions over the given domain. This tradeoff is discussed in the following section.
:::

:::success
But how should we choose a good hypothesis class? On the one hand, we want to believe that the class includes a hypothesis with no error at all (in the PAC setting), or at least that the smallest error achievable by a hypothesis in the class is indeed rather small (in the agnostic setting). On the other hand, we have just seen that we cannot simply choose the richest class - the class of all functions over the given domain. The following section discusses this tradeoff.
:::

### 5.2 Error Decomposition

:::info
To answer this question we decompose the error of an $ERM_\mathcal{H}$ predictor into two components as follows. Let $h_S$ be an $ERM_\mathcal{H}$ hypothesis. Then, we can write

$$L_\mathcal{D}(h_S) = \epsilon_{app} + \epsilon_{est} \space \text{where:} \space \epsilon_{app} = \min_{h \in \mathcal{H}} L_\mathcal{D}(h), \space \epsilon_{est} = L_\mathcal{D}(h_S) - \epsilon_{app} \tag{5.7}$$
:::

:::success
To answer this question, we decompose the error of an $ERM_\mathcal{H}$ predictor into the two components below. Let $h_S$ be an $ERM_\mathcal{H}$ hypothesis. Then we can write

$$L_\mathcal{D}(h_S) = \epsilon_{app} + \epsilon_{est} \space \text{where:} \space \epsilon_{app} = \min_{h \in \mathcal{H}} L_\mathcal{D}(h), \space \epsilon_{est} = L_\mathcal{D}(h_S) - \epsilon_{app} \tag{5.7}$$
:::

:::info
The Approximation Error - the minimum risk achievable by a predictor in the hypothesis class. This term measures how much risk we have because we restrict ourselves to a specific class, namely, how much inductive bias we have. The approximation error does not depend on the sample size and is determined by the hypothesis class chosen. Enlarging the hypothesis class can decrease the approximation error. Under the realizability assumption, the approximation error is zero.
In the agnostic case, however, the approximation error can be large.
:::

:::success
The Approximation Error - the minimum risk achievable by a predictor in the hypothesis class. This term measures how much risk we incur because we restrict ourselves to a specific class, that is, how much inductive bias we have. The [approximation error](http://terms.naer.edu.tw/detail/3645883/) does not depend on the sample size; it is determined by the hypothesis class we choose, and enlarging the hypothesis class can decrease it. Under the realizability assumption the approximation error is zero, but in the agnostic case it can be large. (In fact, it always includes the error of the Bayes optimal predictor (see Chapter 3), the minimal yet inevitable error due to the possible nondeterminism of the world in this model. Sometimes in the literature the approximation error refers not to $\min_{h \in \mathcal{H}}L_\mathcal{D}(h)$ but to the excess error over the Bayes optimal predictor, namely $\min_{h \in \mathcal{H}}L_\mathcal{D}(h) - \epsilon_{\text{Bayes}}$.)
:::

:::info
The Estimation Error - the difference between the approximation error and the error achieved by the ERM predictor. The estimation error results because the empirical risk (i.e., training error) is only an estimate of the true risk, and so the predictor minimizing the empirical risk is only an estimate of the predictor minimizing the true risk. The quality of this estimation depends on the training set size and on the size, or complexity, of the hypothesis class. As we have shown, for a finite hypothesis class, $\epsilon_{est}$ increases (logarithmically) with $\vert \mathcal{H} \vert$ and decreases with $m$. We can think of the size of $\mathcal{H}$ as a measure of its complexity. In future chapters we will define other complexity measures of hypothesis classes.
:::

:::success
The Estimation Error - the difference between the approximation error and the error achieved by the ERM predictor. The [estimation error](http://terms.naer.edu.tw/detail/150589/) arises because the empirical risk (the training error) is only an estimate of the true risk, so the predictor minimizing the empirical risk is only an estimate of the predictor minimizing the true risk. How good this estimate is depends on the training set size and on the size, or complexity, of the hypothesis class. As shown earlier, for a finite hypothesis class $\epsilon_{est}$ increases (logarithmically) with $\vert \mathcal{H} \vert$ and decreases with $m$. We can view the size of $\mathcal{H}$ as a measure of its complexity; later chapters will define other complexity measures of hypothesis classes.
:::

:::info
Since our goal is to minimize the total risk, we face a tradeoff, called the bias-complexity tradeoff.
On one hand, choosing $\mathcal{H}$ to be a very rich class decreases the approximation error but at the same time might increase the estimation error, as a rich $\mathcal{H}$ might lead to overfitting. On the other hand, choosing $\mathcal{H}$ to be a very small set reduces the estimation error but might increase the approximation error or, in other words, might lead to underfitting. Of course, a great choice for $\mathcal{H}$ is the class that contains only one classifier - the Bayes optimal classifier. But the Bayes optimal classifier depends on the underlying distribution $\mathcal{D}$, which we do not know (indeed, learning would have been unnecessary had we known $\mathcal{D}$).
:::

:::success
Since our goal is to minimize the total risk, we face a tradeoff, called the bias-complexity tradeoff. On one hand, choosing a rich class as $\mathcal{H}$ lowers the approximation error but may also raise the estimation error, since a rich $\mathcal{H}$ can lead to overfitting. On the other hand, taking a very small set as $\mathcal{H}$ lowers the estimation error but may raise the approximation error; in other words, it may lead to underfitting. Of course, a great choice for $\mathcal{H}$ would be the class containing a single classifier - the Bayes optimal classifier. But the Bayes optimal classifier depends on the underlying distribution $\mathcal{D}$, which we do not know (if we knew it, there would be no need to learn).
:::

:::info
Learning theory studies how rich we can make $\mathcal{H}$ while still maintaining reasonable estimation error. In many cases, empirical research focuses on designing good hypothesis classes for a certain domain. Here, "good" means classes for which the approximation error would not be excessively high. The idea is that although we are not experts and do not know how to construct the optimal classifier, we still have some prior knowledge of the specific problem at hand, which enables us to design hypothesis classes for which both the approximation error and the estimation error are not too large. Getting back to our papayas example, we do not know how exactly the color and hardness of a papaya predict its taste, but we do know that papaya is a fruit and on the basis of previous experience with other fruit we conjecture that a rectangle in the color-hardness space may be a good predictor.
:::

:::success
Learning theory studies how rich an $\mathcal{H}$ we can afford while still maintaining a reasonable estimation error. In many cases, [empirical research](http://terms.naer.edu.tw/detail/3131458/) focuses on designing good hypothesis classes for a particular domain, where "good" means a class whose approximation error is not excessively high. The idea is that although we are not experts and do not know how to construct the optimal classifier, we still have some prior knowledge about the specific problem at hand, which lets us design hypothesis classes for which both the approximation error and the estimation error are not too large. Returning to the papaya example: we do not know exactly how a papaya's color and hardness predict its taste, but we do know that papaya is a fruit, and based on experience with other fruits we conjecture that some rectangle in the color-hardness space may be a good predictor.
:::

### 5.3 Summary

:::info
The No-Free-Lunch theorem states that there is no universal learner. Every learner has to be specified to some task, and use some prior knowledge about that task, in order to succeed. So far we have modeled our prior knowledge by restricting our output hypothesis to be a member of a chosen hypothesis class. When choosing this hypothesis class, we face a tradeoff, between a larger, or more complex, class that is more likely to have a small approximation error, and a more restricted class that would guarantee that the estimation error will be small. In the next chapter we will study in more detail the behavior of the estimation error. In Chapter 7 we will discuss alternative ways to express prior knowledge.
:::

:::success
The No-Free-Lunch theorem states that no universal learner exists. Every learner must be specialized to some task and use prior knowledge about that task in order to succeed. So far we have modeled our prior knowledge by restricting the output hypothesis to be a member of a chosen hypothesis class. In choosing this class we face a tradeoff between a larger, more complex class, which is more likely to have a small approximation error, and a more restricted class, which guarantees a small estimation error. The next chapter studies the behavior of the estimation error in more detail, and Chapter 7 discusses alternative ways of expressing prior knowledge.
:::
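
:::success
A small numerical illustration of the decomposition in Equation (5.7), my own sketch rather than anything from the book: for a finite class, uniform convergence bounds the estimation error by $\epsilon_{est} \leq 2\sqrt{\log(2\vert \mathcal{H} \vert / \delta)/(2m)}$ (the exact form and constants are taken, as an assumption here, from the finite-class bound of Chapter 4). The code tabulates this bound to show the logarithmic growth with $\vert \mathcal{H} \vert$ and the $1/\sqrt{m}$ decay; the function name is mine.

```python
import math

def estimation_error_bound(H_size, m, delta=0.1):
    """Uniform-convergence bound on eps_est for a finite hypothesis class:
    eps_est <= 2 * sqrt(log(2|H|/delta) / (2m)).
    Grows only logarithmically with |H| but shrinks like 1/sqrt(m)."""
    return 2 * math.sqrt(math.log(2 * H_size / delta) / (2 * m))

# Enlarging H can only lower eps_app, while the eps_est bound creeps up with log|H|:
for H_size in (10, 10**3, 10**6):
    print(H_size, round(estimation_error_bound(H_size, m=1000), 3))
```

The bias-complexity tradeoff is visible in the numbers: a million-fold larger class costs well under a doubling of the estimation-error bound, so a moderately richer $\mathcal{H}$ is often worth it when it plausibly lowers the approximation error.
:::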