# Chapter 2 - Understanding Machine Learning From Theory to Algorithms: A Gentle Start

###### tags: `Understanding Machine Learning From Theory to Algorithms`

## Remark

:::info
Book excerpts
:::

:::success
Personal understanding
:::

[Course link](https://www.youtube.com/watch?v=aILazXK059Y)

[TOC]

## A Gentle Start

:::info
You have to learn how to predict whether a papaya you see in the market is tasty or not. First, you need to decide which features of a papaya your prediction should be based on. On the basis of your previous experience with other fruits, you decide to use two features: the papaya's color, ranging from dark green, through orange and red to dark brown, and the papaya's softness, ranging from rock hard to mushy. Your input for figuring out your prediction rule is a sample of papayas that you have examined for color and softness and then tasted and found out whether they were tasty or not.
:::

:::success
The running example: we arrive on a Pacific island where papayas are an important local food, so we want to learn to tell good papayas from bad ones. Drawing on our experience with other fruits, we describe each papaya with two features:
1. color
2. softness

We want to predict, from these two features alone, whether a papaya is tasty.
:::

### 2.1 A Formal Model – The Statistical Learning Framework

:::info
1. The learner's input
    * Domain set
    * Label set
    * Training data
2. The learner's output
3. A simple data-generation model
4. Measures of success
:::

:::success
Before the analysis, a few definitions:
1. The learner's input:
    * Domain set: an arbitrary set, denoted $\mathcal{X}$. In the papaya example it is the set of all papayas. Domain points are usually represented by feature vectors (here, color and softness). Domain points are also called instances, and $\mathcal{X}$ is the instance space.
    * Label set: in the papaya example the label set has two elements, $\{0, 1\}$ or $\{-1, +1\}$, denoted $\mathcal{Y}$; 1 means tasty and 0 means not tasty.
    * Training data: denoted $S$, a finite sequence of pairs $(x_1, y_1), \cdots, (x_m, y_m)$, i.e., a sequence of labeled domain points. Also called the training examples or the training set.
2. The learner's output: the learner outputs a prediction rule $h: \mathcal{X} \rightarrow \mathcal{Y}$
    * also called a predictor, a hypothesis, or a classifier
    * used to predict the labels of new domain points
    * $A(S)$ denotes the hypothesis returned by a learning algorithm $A$ on training set $S$
3. A simple data-generation model: we assume the instances (the domain points in the domain set) are generated by some probability distribution, denoted $\mathcal{D}$. Importantly, the learner does not know this distribution, and it can be arbitrary.
    * We assume there is a correct labeling function $f: \mathcal{X} \rightarrow \mathcal{Y}$ such that $y_i = f(x_i)$.
    * The labeling function is unknown to the learner; the learner is trying to figure it out.
    * Each point in the training set $S$ is obtained by sampling $x_i$ according to $\mathcal{D}$ and then labeling it with $f$.
4. Measures of success: we define the error of a classifier as a probability, namely the probability that it mislabels a random data point drawn from the underlying distribution. That is, the error of $h$ is the probability, over an instance $x$ sampled according to $\mathcal{D}$, that $h(x) \neq f(x)$: the learner's prediction disagrees with the truth.
    * Given a domain subset $A \subset \mathcal{X}$ ($A$ is contained in $\mathcal{X}$) and the probability distribution $\mathcal{D}$, $\mathcal{D}(A)$ determines how likely it is to observe a point $x \in A$.
    * In many cases $A$ is viewed as an event and expressed via a function $\pi: \mathcal{X} \rightarrow \left\{ 0, 1 \right\}$, namely $A = \left\{ x \in \mathcal{X}: \pi(x) = 1 \right\}$.
    * In that case $\mathcal{D}(A)$ can also be written as the probability $\mathbb{P}_{x \sim \mathcal{D}}[\pi(x)]$.
    * The error of a prediction rule $h: \mathcal{X} \rightarrow \mathcal{Y}$ is defined as
$$L_{\mathcal{D}, f}(h) \stackrel{\text{def}}{=} \mathbb{P}_{x \sim \mathcal{D}} \left[h(x) \neq f(x) \right] \stackrel{\text{def}}{=} \mathcal{D}(\left\{x: h(x) \neq f(x) \right\}) \tag{2.1}$$
The subscript $\mathcal{D}, f$ indicates that the error is measured with respect to the distribution $\mathcal{D}$ and the correct labeling function $f$.
    * $L_{(\mathcal{D}, f)}(h)$ has several synonyms: generalization error, risk, or the true error of $h$. We use the letter $L$ for the learner's loss.
5. A note about the information available to the learner: the distribution $\mathcal{D}$ and the correct labeling function $f$ are unknown to the learner; all it can do is observe them through the training set.
:::

### 2.2 Empirical Risk Minimization

:::info
As mentioned earlier, a learning algorithm receives as input a training set $S$, sampled from an unknown distribution $\mathcal{D}$ and labeled by some target function $f$, and should output a predictor $h_S: \mathcal{X} \rightarrow \mathcal{Y}$ (the subscript $S$ emphasizes the fact that the output predictor depends on $S$). The goal of the algorithm is to find $h_S$ that minimizes the error with respect to the unknown $\mathcal{D}$ and $f$.
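:::

:::success
The definitions of Section 2.1 can be made concrete with a tiny simulation. This is a minimal sketch of my own (not from the book): a hypothetical finite papaya domain with a uniform distribution $\mathcal{D}$, a true labeling function $f$, and the true error $L_{\mathcal{D},f}(h)$ of Equation (2.1) computed exactly by enumeration.

```python
import itertools

# Hypothetical toy setup (not from the book): a finite papaya domain where
# each instance is (color, softness), both scored 0..9.
X = list(itertools.product(range(10), range(10)))  # domain set, |X| = 100

def f(x):
    """True labeling function (unknown to the learner in the framework)."""
    color, softness = x
    return 1 if 3 <= color <= 7 and 3 <= softness <= 7 else 0  # "tasty" region

def true_error(h, domain):
    """L_{D,f}(h) under a uniform D over a finite domain (Eq. 2.1)."""
    return sum(1 for x in domain if h(x) != f(x)) / len(domain)

# A predictor that always answers "not tasty":
h_const = lambda x: 0

# Its true error is exactly the probability mass of the tasty region:
print(true_error(h_const, X))  # 25 tasty points out of 100 -> 0.25
```

On an infinite or unknown domain this enumeration is impossible, which is exactly why the learner must fall back on the training error below.
:::

:::info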
:::

:::success
A learning algorithm takes a training set $S$ as input; $S$ is sampled from an unknown distribution $\mathcal{D}$ and labeled by some target function $f$. The algorithm outputs a predictor $h_S: \mathcal{X} \rightarrow \mathcal{Y}$ (the subscript $S$ emphasizes that the output predictor depends on $S$). The goal is to find an $h_S$ that minimizes the error with respect to the unknown $\mathcal{D}$ and $f$.

But as noted above, $\mathcal{D}$ and $f$ are unknown to the learner, so all we can compute from the training set is the training error, i.e., the error the classifier incurs over the training sample:
$$L_S(h) \stackrel{\text{def}}{=} \dfrac{\vert \left\{ i \in [m] : h(x_i) \neq y_i \right\} \vert}{m} \tag{2.2}$$
where $[m] = \{1, \ldots, m\}$ and $m$ is the number of examples in the training set. Intuitively, this is just (number of mistakes) / (number of examples).

The training sample is the only snapshot of the real world the learner gets to see, so minimizing error on it is a sensible strategy. Finding a predictor $h$ that minimizes $L_S(h)$ is called Empirical Risk Minimization, or ERM for short.

Note the distinction between the two losses defined so far:
* $L_S$ is the empirical error: the error of the learned predictor $h$ on the training set $S$ itself.
* $L_{(\mathcal{D}, f)}$ is the true error: the error of the learned predictor $h$ on the real world.
:::

#### 2.2.1 Something May Go Wrong – Overfitting

:::info
Assume that the probability distribution $\mathcal{D}$ is such that instances are distributed uniformly within the gray square and the labeling function, $f$, determines the label to be 1 if the instance is within the inner blue square, and 0 otherwise. The area of the gray square in the picture is 2 and the area of the blue square is 1. Consider the following predictor:
$$h_S(x)= \left\{ \begin{array}{l} y_i \ \text{ if } \exists i \in [m] \text{ s.t. } x_i=x \\ 0 \ \text{ otherwise} \end{array} \right. \tag{2.3}$$
We have found a predictor whose performance on the training set is excellent, yet its performance on the true "world" is very poor. This phenomenon is called overfitting.
:::

:::success
Even though ERM looks reasonable, it can easily fall into a trap. Back to the papaya example: suppose the papayas are distributed as in the figure below.

![](https://i.imgur.com/dd2QyTZ.png)

Instances are distributed uniformly over the gray square, and the labeling function $f$ labels instances inside the blue square as 1 and everything else as 0. The gray square has area 2 and the blue square has area 1. Consider the predictor of Equation (2.3): it memorizes the training set, predicting $y_i$ whenever the input $x$ equals some training point $x_i$, and 0 otherwise.

For this predictor, every point drawn from the training sample $[m]$ is predicted correctly, so $L_S(h_S)=0$ (the loss on the training set $S$ is always 0). The ERM rule may therefore select this predictor; as the author notes, no classifier can have lower empirical loss. On the other hand, since the distribution is continuous, any finite training set has probability zero, so outside the training points $h_S$ effectively predicts 0 everywhere, while the blue square, half the probability mass (area 1 out of 2), is truly labeled 1. Hence the true error is $L_{\mathcal{D}, f}(h_S)=1/2$.

So this predictor achieves $L_S(h_S)=0$ on the training set (the predictor of Equation 2.3) but $L_{\mathcal{D}}(h_S)=1/2$ on the real "world" (the gray square). This phenomenon is called overfitting: the hypothesis works fine on the training set but performs poorly on other data. Training a model means fitting a distribution to the training data in the hope that it represents the distribution of all data in the world; when the model fits the training set too closely, generalization suffers.
:::

### 2.3 Empirical Risk Minimization with Inductive Bias

:::info
We will search for conditions under which there is a guarantee that ERM does not overfit, namely, conditions under which when the ERM predictor has good performance with respect to the training data, it is also highly likely to perform well over the underlying data distribution.
:::

:::success
We just saw that ERM can easily overfit, but the author looks for conditions that guarantee ERM does not overfit: conditions under which a predictor that fits the training set well is also likely to perform well on the underlying data distribution.
:::

:::info
A common solution is to apply the ERM learning rule over a restricted search space. Formally, the learner should choose in advance (before seeing the data) a set of predictors. This set is called a hypothesis class and is denoted by $\mathcal{H}$. Each $h \in \mathcal{H}$ is a function mapping from $\mathcal{X}$ to $\mathcal{Y}$.
:::

:::success
A common solution is to restrict the search space: the learner chooses, in advance, a set of predictors. This set is called a hypothesis class, denoted $\mathcal{H}$. Every $h \in \mathcal{H}$ is a function mapping $\mathcal{X} \rightarrow \mathcal{Y}$.
:::

:::info
For a given class $\mathcal{H}$, and a training sample, $S$, the $\mathrm{ERM}_\mathcal{H}$ learner uses the ERM rule to choose a predictor $h \in \mathcal{H}$, with the lowest possible error over $S$. By restricting the learner to choosing a predictor from $\mathcal{H}$, we bias it toward a particular set of predictors. Such restrictions are often called an inductive bias.
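:::

:::success
The $\mathrm{ERM}_\mathcal{H}$ rule above can be sketched in a few lines. This is my own minimal sketch, assuming a hypothetical finite class of axis-aligned "rectangle" predictors over integer (color, softness) scores and a tiny hand-made training set; none of these specific values come from the book.

```python
import itertools

# Hypothetical finite hypothesis class (illustrative, not from the book):
# axis-aligned rectangle predictors over integer (color, softness) in 0..9,
# h_{a,b,c,d}(x) = 1 iff a <= color <= b and c <= softness <= d.
def rect_predictor(a, b, c, d):
    return lambda x: 1 if a <= x[0] <= b and c <= x[1] <= d else 0

H = [rect_predictor(a, b, c, d)
     for a, b in itertools.combinations_with_replacement(range(10), 2)
     for c, d in itertools.combinations_with_replacement(range(10), 2)]

def empirical_risk(h, S):
    """L_S(h) of Equation (2.2): fraction of training examples h gets wrong."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    """ERM_H: pick a hypothesis in H with minimal training error."""
    return min(H, key=lambda h: empirical_risk(h, S))

# Tiny hand-made training set: tasty papayas have mid-range color and softness.
S = [((5, 5), 1), ((4, 6), 1), ((1, 1), 0), ((9, 2), 0), ((0, 8), 0)]
h_S = erm(H, S)
print(empirical_risk(h_S, S))  # some rectangle separates these points -> 0.0
```

Restricting the search to rectangles is exactly an inductive bias: the learner can no longer memorize arbitrary point sets the way the predictor of Equation (2.3) does.
:::

:::info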
:::

:::success
Given a hypothesis class $\mathcal{H}$ and a training sample $S$, the $\mathrm{ERM}_\mathcal{H}$ learner uses the ERM rule to choose a predictor $h \in \mathcal{H}$ with the lowest possible error over $S$, written
$$\mathrm{ERM}_\mathcal{H}(S) \in \arg\min_{h \in \mathcal{H}} L_S(h)$$
where $\arg\min$ picks, from the whole hypothesis class $\mathcal{H}$, a hypothesis $h$ that minimizes $L_S(h)$. By restricting the learner to choosing a predictor from $\mathcal{H}$, we bias it toward a particular set of predictors. Such restrictions are often called an inductive bias.

For more on inductive bias, see:
* the discussion on [Zhihu](https://www.zhihu.com/question/264264203)
* Zhou Zhihua's "watermelon book", which also mentions it: the learner's preference for certain hypotheses during learning is an inductive bias
* [Wikiwand](https://www.wikiwand.com/zh-tw/%E6%AD%B8%E7%B4%8D%E5%81%8F%E7%BD%AE): when a learner predicts outputs for inputs it has not encountered, it makes some assumptions (Mitchell, 1980); the inductive bias of a learning algorithm is the set of these assumptions.
:::

:::info
Since the choice of such a restriction is determined before the learner sees the training data, it should ideally be based on some prior knowledge about the problem to be learned.
:::

:::success
The restriction is chosen before the learner sees the training data, so ideally it should be based on prior knowledge about the problem being learned; the restriction should be relevant to the goal. In a sense, even the features we choose are a kind of restriction. Back to the papaya example: we might choose the class $\mathcal{H}$ of predictors determined by axis-aligned rectangles in the (color, softness) plane.

On the other hand, the earlier overfitting example shows that choosing $\mathcal{H}$ to contain all functions that assign the value 1 to a finite set of domain points is not enough to guarantee that $\mathrm{ERM}_\mathcal{H}$ will not overfit. A fundamental question in learning theory is: over which hypothesis classes does $\mathrm{ERM}_\mathcal{H}$ learning not lead to overfitting? Intuitively, a more restricted hypothesis class better protects against overfitting, but it also imposes a stronger inductive bias (see the inductive bias references above).
:::

#### 2.3.1 Finite Hypothesis Classes

:::info
The simplest type of restriction on a class is imposing an upper bound on its size (that is, the number of predictors $h \in \mathcal{H}$). In this section, we show that if $\mathcal{H}$ is a finite class then $\mathrm{ERM}_\mathcal{H}$ will not overfit, provided it is based on a sufficiently large training sample (this size requirement will depend on the size of $\mathcal{H}$).
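:::

:::success
To see concretely why an unrestricted class fails, the memorizing predictor of Equation (2.3) can be simulated. This is my own sketch on a hypothetical discrete stand-in for the gray/blue square example (a finite domain of 200 points with half the mass labeled 1; the specific numbers are made up): the training error is 0, yet the true error is close to 1/2.

```python
# Hypothetical discrete version of the overfitting example (not from the book):
# a finite domain of 200 points under a uniform D; the first 100 points play the
# role of the "blue square" (label 1), the rest are labeled 0.
domain = list(range(200))
f = lambda x: 1 if x < 100 else 0

# A small, fixed training set (five points from the label-1 region, chosen
# deterministically for illustration).
S = [(x, f(x)) for x in [0, 1, 2, 3, 4]]

# The memorizing predictor of Equation (2.3): answer y_i on seen points, else 0.
memo = dict(S)
h_S = lambda x: memo.get(x, 0)

L_S = sum(1 for x, y in S if h_S(x) != y) / len(S)            # training error
L_D = sum(1 for x in domain if h_S(x) != f(x)) / len(domain)  # true error

print(L_S, L_D)  # 0.0 on the sample, but 0.475 on the whole domain
```

The memorizer belongs to a huge, unrestricted class; the rest of this section shows that bounding the size of $\mathcal{H}$ rules this failure out, given enough samples.
:::

:::info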
:::

:::success
The simplest restriction on a class is an upper bound on its size (the number of predictors $h \in \mathcal{H}$). The claim here: if $\mathcal{H}$ is a finite class, then $\mathrm{ERM}_\mathcal{H}$ will not overfit, provided the training sample is large enough (how large depends on the size of $\mathcal{H}$).
:::

:::info
Limiting the learner to prediction rules within some finite hypothesis class may be considered as a reasonably mild restriction. In our papayas example, we mentioned previously the class of axis aligned rectangles. While this is an infinite class, if we discretize the representation of real numbers, say, by using a 64 bits floating-point representation, the hypothesis class becomes a finite class.
:::

:::success
Restricting the learner to some finite hypothesis class is a reasonably mild restriction. Back to the papaya example: the class of axis-aligned rectangles mentioned earlier is infinite, but if we discretize the representation of real numbers, say by using 64-bit floating point for each coordinate, the class becomes finite, since only finitely many rectangles can then be represented.

Now let us analyze the performance of $\mathrm{ERM}_\mathcal{H}$ learning, assuming $\mathcal{H}$ is a finite class. For a training set $S$ labeled according to some labeling function $f: \mathcal{X} \rightarrow \mathcal{Y}$, let $h_S$ denote the result of applying $\mathrm{ERM}_\mathcal{H}$ to $S$, namely:
$$h_S \in \arg\min_{h \in \mathcal{H}} L_S(h) \tag{2.4}$$
We make one simplifying assumption:
* Definition 2.1 (Realizability Assumption): there exists $h^* \in \mathcal{H} \ s.t. \ L_{(\mathcal{D}, f)}(h^*)=0$. That is, there is a hypothesis $h^*$ whose true error is zero, which implies that on the instances of any $S$ sampled from $\mathcal{D}$ and labeled by $f$ we get (with probability 1) $L_S(h^*)=0$.

The realizability assumption implies that for every ERM hypothesis we have $L_S(h_S)=0$; what we really care about, though, is the true risk of $h_S$, namely $L_{(\mathcal{D}, f)}(h_S)$.
:::

:::info
Clearly, any guarantee on the error with respect to the underlying distribution, $\mathcal{D}$, for an algorithm that has access only to a sample $S$ should depend on the relationship between $\mathcal{D}$ and $S$. The common assumption in statistical machine learning is that the training sample $S$ is generated by sampling points from the distribution $\mathcal{D}$ independently of each other.
:::

:::success
For an algorithm that sees only the training sample $S$, any guarantee about the error with respect to the underlying distribution $\mathcal{D}$ must depend on the relationship between $\mathcal{D}$ and $S$. The common assumption in statistical machine learning is that $S$ is generated by sampling points from $\mathcal{D}$ independently of each other. More precisely:
* The i.i.d. assumption: the examples in the training set are independently and identically distributed (i.i.d.) according to $\mathcal{D}$. That is, every $x_i$ in $S$ is freshly sampled from $\mathcal{D}$ and then labeled by the labeling function $f$.
* [Independent and identically distributed](https://zh.wikipedia.org/wiki/%E7%8B%AC%E7%AB%8B%E5%90%8C%E5%88%86%E5%B8%83): in a random process, each draw is a random variable; if the draws follow the same distribution and are mutually independent, they are i.i.d. A fair die is the standard example: every face has probability 1/6 and every roll is independent. Note that "identically distributed" does not mean the outcomes are equally likely: for a loaded die the rolls are still independent and identically distributed, even though the faces have different probabilities.
* This assumption is denoted $S \sim \mathcal{D}^m$, where $m$ is the size of $S$ and $\mathcal{D}^m$ is the distribution over $m$-tuples induced by drawing each example i.i.d. from $\mathcal{D}$.

The training set $S$ is a window through which the learner observes the distribution $\mathcal{D}$ and the labeling function $f$. The larger the sample, the more likely it is to accurately reflect the distribution and labeling that generated it. This echoes the earlier point: we fit a distribution to the training set hoping it reflects the real world's data distribution, so (for now) the more data the better.
:::

:::info
Since $L_{\mathcal{D}, f}(h_S)$ depends on the training set, $S$, and that training set is picked by a random process, there is randomness in the choice of the predictor $h_S$ and, consequently, in the risk $L_{\mathcal{D}, f}(h_S)$. Formally, we say that it is a random variable.
:::

:::success
Because $L_{\mathcal{D}, f}(h_S)$ depends on the training set $S$, and the training set is picked by a random process, the chosen predictor $h_S$, and hence the risk $L_{\mathcal{D}, f}(h_S)$, is random; formally, it is a random variable. The world's data is far too vast for a training set $S$ to represent it perfectly (again, $S$ is just the learner's window onto the world), and the sampling process may, with some probability, yield an unrepresentative training set.
:::

:::info
If we go back to the papaya tasting example, there is always some (small) chance that all the papayas we have happened to taste were not tasty, in spite of the fact that, say, 70% of the papayas in our island are tasty. In such a case, $\mathrm{ERM}_\mathcal{H}(S)$ may be the constant function that labels every papaya as "not tasty" (and has 70% error on the true distribution of papayas in the island). We will therefore address the probability to sample a training set for which $L_{\mathcal{D}, f}(h_S)$ is not too large. Usually, we denote the probability of getting a nonrepresentative sample by $\delta$, and call $(1-\delta)$ the confidence parameter of our prediction.
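:::

:::success
The chance of a nonrepresentative sample in the papaya story is easy to quantify. A minimal sketch, using the 70%-tasty figure from the example: under the i.i.d. assumption, the probability that an $m$-example training set contains only non-tasty papayas is $(0.3)^m$, one concrete way for the sample (and hence the constant "not tasty" predictor) to be misleading.

```python
# From the running example: 70% of the papayas on the island are tasty.
p_tasty = 0.7

def p_all_not_tasty(m):
    """Probability that an i.i.d. sample of m papayas is all non-tasty,
    i.e. one concrete kind of nonrepresentative training set."""
    return (1 - p_tasty) ** m

for m in [1, 5, 10, 20]:
    print(m, p_all_not_tasty(m))
```

The probability decays exponentially in $m$, which previews the role the sample size plays in the bound derived next.
:::

:::info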
:::

:::success
Back to the papaya example: even if 70% of the papayas on the island are tasty, there is always some (small) chance that every papaya we happened to taste was not tasty. In that case $\mathrm{ERM}_\mathcal{H}(S)$ may be a constant function that labels every papaya "not tasty".

On the island's true distribution this causes a 70% error rate: 70% of the papayas really are tasty, but because we happened to taste only papayas from the 30% that are not, the predictor always answers "not tasty" and is wrong 70% of the time.

So we want to address the probability of sampling a training set for which $L_{\mathcal{D}, f}(h_S)$ is not too large; that is, we want the loss induced by our training sample to be small. We denote the probability of getting a nonrepresentative sample by $\delta$, and call $(1-\delta)$ the confidence parameter of our prediction.
:::

:::info
On top of that, since we cannot guarantee perfect label prediction, we introduce another parameter for the quality of prediction, the accuracy parameter, commonly denoted by $\epsilon$. We interpret the event $L_{\mathcal{D}, f}(h_S) > \epsilon$ as a failure of the learner, while if $L_{\mathcal{D}, f}(h_S) \leq \epsilon$ we view the output of the algorithm as an approximately correct predictor. Therefore (fixing some labeling function $f: \mathcal{X} \rightarrow \mathcal{Y}$), we are interested in upper bounding the probability to sample an $m$-tuple of instances that will lead to failure of the learner. Formally, let $S\vert _x = (x_1, \cdots, x_m)$ be the instances of the training set. We would like to upper bound
$$\mathcal{D}^m(\left\{S \vert_x: L_{\mathcal{D}, f}(h_S) > \epsilon \right\})$$
Let $\mathcal{H}_B$ be the set of "bad" hypotheses, that is,
$$\mathcal{H}_B = \left\{h \in \mathcal{H}: L_{\mathcal{D}, f}(h) > \epsilon \right\}$$
In addition, let
$$M=\left\{S \vert_x: \exists h \in \mathcal{H}_B, L_S(h)=0 \right\}$$
be the set of misleading samples: Namely, for every $S \vert_x \in M$, there is a "bad" hypothesis, $h \in \mathcal{H}_B$, that looks like a "good" hypothesis on $S \vert_x$. Now, recall that we would like to bound the probability of the event $L_{\mathcal{D}, f}(h_S) > \epsilon$. But, since the realizability assumption implies that $L_S(h_S)=0$, it follows that the event $L_{\mathcal{D}, f}(h_S) > \epsilon$ can only happen if for some $h \in \mathcal{H}_B$ we have $L_S(h)=0$. In other words, this event will only happen if our sample is in the set of misleading samples, $M$. Formally, we have shown that
$$\left\{S \vert_x: L_{\mathcal{D}, f}(h_S) > \epsilon \right\} \subseteq M$$
Note that we can rewrite $M$ as
$$M = \bigcup_{h \in \mathcal{H}_B} \left\{S \vert_x: L_S(h)=0 \right\} \tag{2.5}$$
Hence,
$$\mathcal{D}^m(\left\{S \vert_x: L_{\mathcal{D}, f}(h_S) > \epsilon \right\}) \leq \mathcal{D}^m(M) = \mathcal{D}^m(\bigcup_{h \in \mathcal{H}_B} \left\{S \vert_x: L_S(h)=0 \right\}) \tag{2.6}$$
:::

:::success
Because we cannot guarantee perfect prediction, we introduce another parameter for the quality of prediction, the accuracy parameter, denoted $\epsilon$:
1. $L_{\mathcal{D}, f}(h_S) > \epsilon$: the learner has failed
2. $L_{\mathcal{D}, f}(h_S) \leq \epsilon$: the predictor is approximately correct

Now fix some labeling function $f: \mathcal{X} \rightarrow \mathcal{Y}$. We are interested in upper bounding the probability that the sampled data makes the learner fail. Writing the instances of the training set as $S\vert _x = (x_1, \cdots, x_m)$, we want to upper bound
$$\mathcal{D}^m(\left\{S \vert_x: L_{\mathcal{D}, f}(h_S) > \epsilon \right\})$$
As noted in the i.i.d. discussion above, $\mathcal{D}^m$ gives the probability of drawing a particular $m$-tuple; what we want to know is the probability that the drawn $m$-tuple causes $L_{\mathcal{D}, f}(h_S) > \epsilon$.

Next, let $\mathcal{H}_B$ denote the set of bad hypotheses:
$$\mathcal{H}_B = \left\{h \in \mathcal{H}: L_{\mathcal{D}, f}(h) > \epsilon \right\}$$
This is intuitive: $\mathcal{H}_B$ is simply the set of hypotheses $h$ in the whole class $\mathcal{H}$ whose true error exceeds $\epsilon$, i.e., the bad ones.

Then define the set of misleading samples $M$:
$$M=\left\{S \vert_x: \exists h \in \mathcal{H}_B, L_S(h)=0 \right\}$$
That is, $M$ collects the samples $S\vert _x = (x_1, \cdots, x_m)$ on which some bad hypothesis $h \in \mathcal{H}_B$ achieves $L_S(h)=0$ and therefore misleads us into thinking it is a good hypothesis.

Back to what we care about: we want to bound the probability of the event $L_{\mathcal{D}, f}(h_S) > \epsilon$. Recall Definition 2.1 above, the realizability assumption: there exists $h^* \in \mathcal{H} \ s.t. \ L_{(\mathcal{D}, f)}(h^*)=0$, so whenever the data is sampled from $\mathcal{D}$ and labeled by $f$, we get $L_S(h^*)=0$; in particular the ERM output satisfies $L_S(h_S)=0$, since the training set $S$ was indeed sampled from $\mathcal{D}$.

So when can the event $L_{\mathcal{D}, f}(h_S) > \epsilon$ happen? Only when the chosen hypothesis is a bad one, $h \in \mathcal{H}_B$, that nevertheless has $L_S(h)=0$; in other words, only when our sample lies in the set of misleading samples $M$. This shows:
$$\left\{S \vert_x: L_{\mathcal{D}, f}(h_S) > \epsilon \right\} \subseteq M$$
In words: every training set that causes $L_{\mathcal{D}, f}(h_S) > \epsilon$ is contained in the misleading set $M$, and $M$ can be rewritten as
$$M = \bigcup_{h \in \mathcal{H}_B} \left\{S \vert_x: L_S(h)=0 \right\} \tag{2.5}$$
Putting this together:
$$\mathcal{D}^m(\left\{S \vert_x: L_{\mathcal{D}, f}(h_S) > \epsilon \right\}) \leq \mathcal{D}^m(M) = \mathcal{D}^m(\bigcup_{h \in \mathcal{H}_B} \left\{S \vert_x: L_S(h)=0 \right\}) \tag{2.6}$$
The left-hand side is the probability of a subset of the misleading set $M$, so it cannot exceed $\mathcal{D}^m(M)$; and since $M$ equals the union $\bigcup_{h \in \mathcal{H}_B} \left\{S \vert_x: L_S(h)=0 \right\}$, its probability equals $\mathcal{D}^m(\bigcup_{h \in \mathcal{H}_B} \left\{S \vert_x: L_S(h)=0 \right\})$.
:::

:::info
LEMMA 2.2 (Union Bound) For any two sets $A, B$ and a distribution $\mathcal{D}$ we have
$$\mathcal{D}(A \cup B) \leq \mathcal{D}(A) + \mathcal{D}(B)$$
:::

:::success
The union bound is also known as [Boole's inequality](https://zh.wikipedia.org/wiki/%E5%B8%83%E5%B0%94%E4%B8%8D%E7%AD%89%E5%BC%8F): the probability of a union of events is at most the sum of the probabilities of the individual events:
$$\mathcal{D}(A \cup B) \leq \mathcal{D}(A) + \mathcal{D}(B)$$
Applying it to Equation (2.6):
$$\mathcal{D}^m(\left\{S \vert_x: L_{\mathcal{D}, f}(h_S) > \epsilon \right\}) \leq \mathcal{D}^m(M) \leq \sum_{h \in \mathcal{H}_B} \mathcal{D}^m(\left\{S \vert_x: L_S(h)=0 \right\}) \tag{2.7}$$
This is easy to see: the right-hand side of Equation (2.6), $\mathcal{D}^m(\bigcup_{h \in \mathcal{H}_B} \left\{S \vert_x: L_S(h)=0 \right\})$, is the probability of a union of events, so it is at most the sum of the individual probabilities, $\sum_{h \in \mathcal{H}_B} \mathcal{D}^m(\left\{S \vert_x: L_S(h)=0 \right\})$.
:::

:::info
Next, let us bound each summand of the right-hand side of the preceding inequality. Fix some "bad" hypothesis $h \in \mathcal{H}_B$. The event $L_S(h)=0$ is equivalent to the event $\forall i, h(x_i)=f(x_i)$. Since the examples in the training set are sampled i.i.d. we get that
\begin{align} \mathcal{D}^m(\left\{S \vert_x: L_S(h)=0 \right\}) & = \mathcal{D}^m(\left\{S \vert_x: \forall i,h(x_i)=f(x_i) \right\}) \\ & = \prod_{i=1}^m \mathcal{D}(\left\{ x_i: h(x_i)=f(x_i) \right\}) \tag{2.8} \end{align}
:::

:::success
Next we bound each summand in Equation (2.7):
* Fix some bad hypothesis $h \in \mathcal{H}_B$ and, for the moment, just consider the sampled data.
* The event $L_S(h)=0$ is equivalent to $\forall i, h(x_i)=f(x_i)$: on every sampled $x_i$, the predicted label agrees with the true label.

Because the examples in the training set are sampled i.i.d.:
\begin{align} \mathcal{D}^m(\left\{S \vert_x: L_S(h)=0 \right\}) & = \mathcal{D}^m(\left\{S \vert_x: \forall i,h(x_i)=f(x_i) \right\}) \\ & = \prod_{i=1}^m \mathcal{D}(\left\{ x_i: h(x_i)=f(x_i) \right\}) \tag{2.8} \end{align}
The first step is the equivalence above; the product in the last step says that the probability of the whole sample is the product of the per-example probabilities, $\mathcal{D}(x_1) \cdot \mathcal{D}(x_2) \cdots \mathcal{D}(x_m)$, where the probability that a single example is labeled correctly is $\mathcal{D}(\left\{ x_i: h(x_i)=f(x_i) \right\})$. From $h \in \mathcal{H}_B$ we get:
$$\mathcal{D}(\left\{ x_i: h(x_i)=f(x_i) \right\}) = 1 - L_{(\mathcal{D}, f)}(h) \leq 1 - \epsilon$$
At 1:12:30 of the lecture the instructor remarks that this step feels like a jump, and I agree. His explanation at 1:09:50 helps:
* Think of the space of all papayas, split into the part $h$ labels correctly and the part $h$ labels wrongly. If $h \in \mathcal{H}_B$, we do not know which points fall into the wrong part; the one thing we do know is its size: for $h \in \mathcal{H}_B$ the probability of the wrong part exceeds $\epsilon$, so the probability of drawing a correctly labeled point is at most $1-\epsilon$.
* Equation (2.1) already defined $L_{\mathcal{D}, f}(h) = \mathcal{D}(\left\{x: h(x) \neq f(x) \right\})$, and for a bad hypothesis this probability exceeds $\epsilon$, so the probability of drawing a correct point is naturally at most $1-\epsilon$.

Combining this with Equation (2.8) and the inequality $1 - \epsilon \leq e^{- \epsilon}$, we get, for every $h \in \mathcal{H}_B$:
$$\mathcal{D}^m(\left\{S \vert_x: L_S(h)=0 \right\}) \leq (1 - \epsilon)^m \leq e^{- \epsilon m} \tag{2.9}$$
The previous display bounds the probability that a single drawn example is labeled correctly; we draw $m$ examples, hence the $m$-th power, giving Equation (2.9).

Combining Equation (2.9) with Equation (2.7) yields the conclusion:
$$\mathcal{D}^m(\left\{S \vert_x: L_{(\mathcal{D}, f)}(h_S) > \epsilon \right\}) \leq \vert \mathcal{H}_B \vert e^{-\epsilon m} \leq \vert \mathcal{H} \vert e^{-\epsilon m}$$
We wanted to limit the chance of drawing a bad training set $S$:
$$\mathcal{D}^m(\text{bad } S) \leq \mathcal{D}^m(M) = \mathcal{D}^m(\bigcup_{h \in \mathcal{H}_B} \left\{S \vert_x: L_S(h)=0 \right\}) \leq \vert \mathcal{H}_B \vert \cdot (1 - \epsilon)^m \leq \vert \mathcal{H} \vert \cdot (1 - \epsilon)^m$$
The probability of sampling a bad training set $S$ is at most the probability of the misleading set $M$; the probability of $M$ equals that of the union over all bad hypotheses; by the union bound and Equation (2.9) this is at most $\vert \mathcal{H}_B \vert \cdot (1 - \epsilon)^m$; and since $\vert \mathcal{H}_B \vert \leq \vert \mathcal{H} \vert$, it is at most $\vert \mathcal{H} \vert \cdot (1 - \epsilon)^m$. This is where the lecture's derivation ends.
:::

:::info
![](https://i.imgur.com/mYB60OJ.png)
Figure 2.1 Each point in the large circle represents a possible $m$-tuple of instances. Each colored oval represents the set of "misleading" $m$-tuples of instances for some "bad" predictor $h \in \mathcal{H}_B$. The ERM can potentially overfit whenever it gets a misleading training set $S$. That is, for some $h \in \mathcal{H}_B$ we have $L_S(h)=0$. Equation (2.9) guarantees that for each individual bad hypothesis, $h \in \mathcal{H}_B$, at most a $(1 - \epsilon)^m$-fraction of the training sets would be misleading. In particular, the larger $m$ is, the smaller each of these colored ovals becomes. The union bound formalizes the fact that the area representing the training sets that are misleading with respect to some $h \in \mathcal{H}_B$ (that is, the training sets in $M$) is at most the sum of the areas of the colored ovals. Therefore, it is bounded by $\vert \mathcal{H}_B \vert$ times the maximum size of a colored oval. Any sample $S$ outside the colored ovals cannot cause the ERM rule to overfit.
:::

:::success
![](https://i.imgur.com/mYB60OJ.png)
Figure 2.1: each point in the large circle represents a possible $m$-tuple of instances. Each colored oval represents the set of misleading $m$-tuples for some bad predictor $h \in \mathcal{H}_B$. ERM can overfit whenever it receives a misleading training set $S$, i.e., when for some $h \in \mathcal{H}_B$ we get $L_S(h)=0$. Equation (2.9) guarantees that for each individual bad hypothesis $h \in \mathcal{H}_B$, at most a $(1 - \epsilon)^m$-fraction of the training sets is misleading; in particular, the larger $m$ is, the smaller each colored oval becomes. The union bound formalizes that the region of training sets that are misleading for some $h \in \mathcal{H}_B$ (the training sets in $M$) is at most the sum of the areas of the colored ovals, and is therefore bounded by $\vert \mathcal{H}_B \vert$ times the maximum size of a colored oval. Any sample $S$ outside the colored ovals cannot cause the ERM rule to overfit.
:::

:::info
COROLLARY 2.3 Let $\mathcal{H}$ be a finite hypothesis class. Let $\delta \in (0, 1)$ and $\epsilon > 0$ and let $m$ be an integer that satisfies
$$m \geq \dfrac{\log(\vert \mathcal{H} \vert / \delta)}{\epsilon}$$
:::

:::success
COROLLARY 2.3 assumes that $\mathcal{H}$ is a finite hypothesis class, $\delta \in (0, 1)$ (the probability of a nonrepresentative sample), $\epsilon > 0$ (the accuracy parameter), and that $m$ is an integer satisfying
$$m \geq \dfrac{\log(\vert \mathcal{H} \vert / \delta)}{\epsilon}$$
:::

:::info
Then, for any labeling function, $f$, and for any distribution, $\mathcal{D}$, for which the realizability assumption holds (that is, for some $h \in \mathcal{H}, L_{(\mathcal{D}, f)}(h)=0$), with probability of at least $1 - \delta$ over the choice of an i.i.d. sample $S$ of size $m$, we have that for every ERM hypothesis, $h_S$, it holds that
$$L_{(\mathcal{D}, f)}(h_S) \leq \epsilon$$
:::

:::success
Then, for any labeling function $f$ and any distribution $\mathcal{D}$ for which the realizability assumption holds (that is, for some $h \in \mathcal{H}$, $L_{(\mathcal{D}, f)}(h)=0$), with probability at least $1 - \delta$ over the choice of an i.i.d. sample $S$ of size $m$, every ERM hypothesis $h_S$ satisfies:
$$L_{(\mathcal{D}, f)}(h_S) \leq \epsilon$$
:::

:::info
The preceding corollary tells us that for a sufficiently large $m$, the $\mathrm{ERM}_\mathcal{H}$ rule over a finite hypothesis class will be probably (with confidence $1 - \delta$) approximately (up to an error of $\epsilon$) correct. In the next chapter we formally define the model of Probably Approximately Correct (PAC) learning.
:::

:::success
The corollary tells us that with a sufficiently large $m$, the $\mathrm{ERM}_\mathcal{H}$ rule over a finite hypothesis class is probably (with confidence $1 - \delta$) approximately (up to error $\epsilon$) correct (PAC). The PAC model is defined formally in the next chapter.
:::
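:::

:::success
The sample-size requirement in Corollary 2.3 is easy to compute. A minimal sketch with made-up illustrative numbers ($\vert \mathcal{H} \vert = 1000$, $\delta = 0.01$, $\epsilon = 0.1$, none of them from the book), also checking that the resulting $m$ drives the failure bound $\vert \mathcal{H} \vert e^{-\epsilon m}$ from the derivation above below $\delta$:

```python
import math

def sample_complexity(H_size, delta, epsilon):
    """Smallest integer m with m >= log(|H| / delta) / epsilon (Corollary 2.3)."""
    return math.ceil(math.log(H_size / delta) / epsilon)

H_size, delta, epsilon = 1000, 0.01, 0.1  # made-up illustrative values
m = sample_complexity(H_size, delta, epsilon)
print(m)  # 116 examples suffice here

# Sanity check: with this m, the failure bound |H| * e^(-epsilon * m) is <= delta.
print(H_size * math.exp(-epsilon * m) <= delta)  # True
```

Note how weakly $m$ depends on $\vert \mathcal{H} \vert$: the dependence is only logarithmic, so even a much larger finite class costs few extra examples.
:::

:::info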
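
:::success
To close the chapter, the whole argument can be exercised end to end. This is my own hypothetical experiment (not from the book): a finite class of threshold predictors on $\{0,\ldots,99\}$ with a realizable target (a threshold at 37, chosen arbitrarily), $m$ taken from Corollary 2.3, and repeated trials counting how often $\mathrm{ERM}_\mathcal{H}$ fails, i.e., how often $L_{\mathcal{D},f}(h_S) > \epsilon$.

```python
import math
import random

random.seed(0)  # fixed seed so the experiment is reproducible

# Hypothetical setup: domain X = {0,...,99}, uniform D, finite class of
# threshold predictors h_t(x) = 1 iff x >= t, realizable target f = h_37.
X = range(100)
thresholds = range(101)                 # |H| = 101
f = lambda x: 1 if x >= 37 else 0

def true_error(t):
    """L_{D,f}(h_t) under uniform D: fraction of points where h_t and f disagree."""
    return sum(1 for x in X if (1 if x >= t else 0) != f(x)) / 100

epsilon, delta = 0.1, 0.05
m = math.ceil(math.log(len(thresholds) / delta) / epsilon)  # Corollary 2.3

failures = 0
for _ in range(200):
    S = [(x, f(x)) for x in (random.choice(X) for _ in range(m))]  # i.i.d. sample
    # ERM_H: pick the threshold with the lowest empirical risk on S.
    t_hat = min(thresholds,
                key=lambda t: sum(1 for x, y in S if (1 if x >= t else 0) != y))
    if true_error(t_hat) > epsilon:
        failures += 1

print(m, failures / 200)  # failure rate should come out well below delta
```

The observed failure rate is typically far below $\delta$, which is consistent with the corollary: the bound guarantees failure probability at most $\delta$, and for this particular class it is quite loose.
:::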