Understanding Machine Learning: From Theory to Algorithms
In the previous chapter, we decomposed the error of the $\mathrm{ERM}_{\mathcal{H}}$ rule into approximation error and estimation error. The approximation error depends on the fit of our prior knowledge (as reflected by the choice of the hypothesis class $\mathcal{H}$) to the underlying unknown distribution. In contrast, the definition of PAC learnability requires that the estimation error be bounded uniformly over all distributions.
Our current goal is to figure out which classes $\mathcal{H}$ are PAC learnable, and to characterize exactly the sample complexity of learning a given hypothesis class. So far we have seen that finite classes are learnable, but that the class of all functions (over an infinite-size domain) is not. What makes one class learnable and the other unlearnable? Can infinite-size classes be learnable, and, if so, what determines their sample complexity?
We begin the chapter by showing that infinite classes can indeed be learnable, and thus, finiteness of the hypothesis class is not a necessary condition for learnability. We then present a remarkably crisp characterization of the family of learnable classes in the setup of binary valued classification with the zero-one loss. This characterization was first discovered by Vladimir Vapnik and Alexey Chervonenkis in 1970 and relies on a combinatorial notion called the Vapnik-Chervonenkis dimension (VC-dimension). We formally define the VC-dimension, provide several examples, and then state the fundamental theorem of statistical learning theory, which integrates the concepts of learnability, VC-dimension, the ERM rule, and uniform convergence.
In Chapter 4 we saw that finite classes are learnable, and in fact the sample complexity of a hypothesis class is upper bounded by the log of its size. To show that the size of the hypothesis class is not the right characterization of its sample complexity, we first present a simple example of an infinite-size hypothesis class that is learnable.
Example 6.1 Let $\mathcal{H}$ be the set of threshold functions over the real line, namely, $\mathcal{H} = \{h_a : a \in \mathbb{R}\}$, where $h_a : \mathbb{R} \to \{0,1\}$ is a function such that $h_a(x) = \mathbb{1}_{[x < a]}$. To remind the reader, $\mathbb{1}_{[x < a]}$ is 1 if $x < a$ and 0 otherwise. Clearly, $\mathcal{H}$ is of infinite size. Nevertheless, the following lemma shows that $\mathcal{H}$ is learnable in the PAC model using the ERM algorithm.
Lemma 6.1 Let $\mathcal{H}$ be the class of thresholds as defined earlier. Then, $\mathcal{H}$ is PAC learnable, using the ERM rule, with sample complexity of $m_{\mathcal{H}}(\epsilon, \delta) \le \lceil \log(2/\delta)/\epsilon \rceil$.
Proof Let $a^*$ be a threshold such that the hypothesis $h^*(x) = \mathbb{1}_{[x < a^*]}$ achieves $L_{\mathcal{D}}(h^*) = 0$. Let $\mathcal{D}_x$ be the marginal distribution over the domain $\mathcal{X}$ and let $a_0 < a^* < a_1$ be such that
$$\mathbb{P}_{x \sim \mathcal{D}_x}[x \in (a_0, a^*)] = \mathbb{P}_{x \sim \mathcal{D}_x}[x \in (a^*, a_1)] = \epsilon.$$
(If $\mathcal{D}_x((-\infty, a^*)) \le \epsilon$ we set $a_0 = -\infty$, and similarly for $a_1$.) Given a training set $S$, let $b_0 = \max\{x : (x, 1) \in S\}$ and $b_1 = \min\{x : (x, 0) \in S\}$ (if no example in $S$ is positive we set $b_0 = -\infty$, and if no example in $S$ is negative we set $b_1 = \infty$). Let $b_S$ be a threshold corresponding to an ERM hypothesis, $h_S$, which implies that $b_S \in (b_0, b_1)$. Therefore, a sufficient condition for $L_{\mathcal{D}}(h_S) \le \epsilon$ is that both $b_0 \ge a_0$ and $b_1 \le a_1$. In other words,
$$\mathbb{P}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S) > \epsilon] \le \mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0 \vee b_1 > a_1],$$
and using the union bound we can bound the preceding by
$$\mathbb{P}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S) > \epsilon] \le \mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0] + \mathbb{P}_{S \sim \mathcal{D}^m}[b_1 > a_1]. \qquad (6.1)$$
The event $b_0 < a_0$ happens if and only if all examples in $S$ are not in the interval $(a_0, a^*)$, whose probability mass is defined to be $\epsilon$, namely,
$$\mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0] = \mathbb{P}_{S \sim \mathcal{D}^m}[\forall (x, y) \in S,\ x \notin (a_0, a^*)] = (1 - \epsilon)^m \le e^{-\epsilon m}.$$
Since we assume $m > \log(2/\delta)/\epsilon$, it follows that the preceding is at most $\delta/2$. In the same way it is easy to see that $\mathbb{P}_{S \sim \mathcal{D}^m}[b_1 > a_1] \le \delta/2$. Combining with Equation (6.1) we conclude our proof.
We see, therefore, that while finiteness of $\mathcal{H}$ is a sufficient condition for learnability, it is not a necessary condition. As we will show, a property called the VC-dimension of a hypothesis class gives the correct characterization of its learnability. To motivate the definition of the VC-dimension, let us recall the No-Free-Lunch theorem (Theorem 5.1) and its proof. There, we have shown that without restricting the hypothesis class, for any learning algorithm, an adversary can construct a distribution for which the learning algorithm will perform poorly, while there is another learning algorithm that will succeed on the same distribution. To do so, the adversary used a finite set $C \subset \mathcal{X}$ and considered a family of distributions that are concentrated on elements of $C$. Each distribution was derived from a "true" target function from $C$ to $\{0,1\}$. To make any algorithm fail, the adversary used the power of choosing a target function from the set of all possible functions from $C$ to $\{0,1\}$.
When considering PAC learnability of a hypothesis class $\mathcal{H}$, the adversary is restricted to constructing distributions for which some hypothesis $h \in \mathcal{H}$ achieves a zero risk. Since we are considering distributions that are concentrated on elements of $C$, we should study how $\mathcal{H}$ behaves on $C$, which leads to the following definition.
DEFINITION 6.2 (Restriction of $\mathcal{H}$ to $C$) Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to $\{0,1\}$ and let $C = \{c_1, \ldots, c_m\} \subset \mathcal{X}$. The restriction of $\mathcal{H}$ to $C$ is the set of functions from $C$ to $\{0,1\}$ that can be derived from $\mathcal{H}$. That is,
$$\mathcal{H}_C = \{(h(c_1), \ldots, h(c_m)) : h \in \mathcal{H}\},$$
where we represent each function from $C$ to $\{0,1\}$ as a vector in $\{0,1\}^{|C|}$.
If the restriction of $\mathcal{H}$ to $C$ is the set of all functions from $C$ to $\{0,1\}$, then we say that $\mathcal{H}$ shatters the set $C$. Formally:
DEFINITION 6.3 (Shattering) A hypothesis class $\mathcal{H}$ shatters a finite set $C \subset \mathcal{X}$ if the restriction of $\mathcal{H}$ to $C$ is the set of all functions from $C$ to $\{0,1\}$. That is, $|\mathcal{H}_C| = 2^{|C|}$.
Example 6.2 Let $\mathcal{H}$ be the class of threshold functions over $\mathbb{R}$. Take a set $C = \{c_1\}$. Now, if we take $a = c_1 + 1$, then we have $h_a(c_1) = 1$, and if we take $a = c_1 - 1$, then we have $h_a(c_1) = 0$. Therefore, $\mathcal{H}_C$ is the set of all functions from $C$ to $\{0,1\}$, and $\mathcal{H}$ shatters $C$. Now take a set $C = \{c_1, c_2\}$, where $c_1 \le c_2$. No $h \in \mathcal{H}$ can account for the labeling $(0, 1)$, because any threshold that assigns the label 0 to $c_1$ must assign the label 0 to $c_2$ as well. Therefore not all functions from $C$ to $\{0,1\}$ are included in $\mathcal{H}_C$; hence $C$ is not shattered by $\mathcal{H}$.
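Shattering is easy to check by brute force in small cases. The sketch below is ours (the helper names restriction and shatters are not from the book); for thresholds, only thresholds placed between and beyond the sample points matter, so a finite list of candidate hypotheses suffices.

```python
def restriction(hyps, C):
    """H_C: the set of labelings of C realized by hypotheses in hyps."""
    return {tuple(h(c) for c in C) for h in hyps}

def shatters(hyps, C):
    """True iff every one of the 2^|C| labelings of C is realized."""
    return len(restriction(hyps, C)) == 2 ** len(C)

# Threshold functions h_a(x) = 1[x < a], with candidate a's around the points.
points = [0.0, 1.0]
candidates = sorted(points) + [max(points) + 1.0]
thresholds = [lambda x, a=a: int(x < a) for a in candidates]

print(shatters(thresholds, [0.0]))        # True: a singleton is shattered
print(shatters(thresholds, [0.0, 1.0]))   # False: labeling (0, 1) is unrealizable
```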
Getting back to the construction of an adversarial distribution as in the proof of the No-Free-Lunch theorem (Theorem 5.1), we see that whenever some set $C$ is shattered by $\mathcal{H}$, the adversary is not restricted by $\mathcal{H}$, as it can construct a distribution over $C$ based on any target function from $C$ to $\{0,1\}$, while still maintaining the realizability assumption. This immediately yields:
COROLLARY 6.4 Let $\mathcal{H}$ be a hypothesis class of functions from $\mathcal{X}$ to $\{0,1\}$. Let $m$ be a training set size. Assume that there exists a set $C \subset \mathcal{X}$ of size $2m$ that is shattered by $\mathcal{H}$. Then, for any learning algorithm, $A$, there exist a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$ and a predictor $h \in \mathcal{H}$ such that $L_{\mathcal{D}}(h) = 0$ but with probability of at least $1/7$ over the choice of $S \sim \mathcal{D}^m$ we have that $L_{\mathcal{D}}(A(S)) \ge 1/8$.
Corollary 6.4 tells us that if $\mathcal{H}$ shatters some set $C$ of size $2m$ then we cannot learn $\mathcal{H}$ using $m$ examples. Intuitively, if a set $C$ is shattered by $\mathcal{H}$, and we receive a sample containing half the instances of $C$, the labels of these instances give us no information about the labels of the rest of the instances in $C$: every possible labeling of the rest of the instances can be explained by some hypothesis in $\mathcal{H}$. Philosophically,
If someone can explain every phenomenon, his explanations are worthless.
This leads us directly to the definition of the VC-dimension.
DEFINITION 6.5 (VC-dimension) The VC-dimension of a hypothesis class $\mathcal{H}$, denoted $\mathrm{VCdim}(\mathcal{H})$, is the maximal size of a set $C \subset \mathcal{X}$ that can be shattered by $\mathcal{H}$. If $\mathcal{H}$ can shatter sets of arbitrarily large size we say that $\mathcal{H}$ has infinite VC-dimension.
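Definition 6.5 likewise turns into a brute-force computation for toy classes over a finite domain. The sketch below reuses shatters from above; it is exponential in the domain size and meant only as an executable restatement of the definition.

```python
from itertools import combinations

def vcdim(hyps, domain):
    """Largest k such that some subset of the domain of size k is shattered."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(hyps, list(C)) for C in combinations(domain, k)):
            d = k  # some set of size k is shattered
    return d

domain = [0.0, 1.0, 2.0]
thresholds = [lambda x, a=a: int(x < a) for a in [0.0, 1.0, 2.0, 3.0]]
print(vcdim(thresholds, domain))  # 1, matching the VC-dimension of thresholds
```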
A direct consequence of Corollary 6.4 is therefore:
THEOREM 6.6 Let $\mathcal{H}$ be a class of infinite VC-dimension. Then, $\mathcal{H}$ is not PAC learnable.
Proof Since $\mathcal{H}$ has an infinite VC-dimension, for any training set size $m$ there exists a shattered set of size $2m$, and the claim follows by Corollary 6.4.
We shall see later in this chapter that the converse is also true: a finite VC-dimension guarantees learnability. Hence, the VC-dimension characterizes PAC learnability. But before delving into more theory, we first show several examples.
In this section we calculate the VC-dimension of several hypothesis classes. To show that $\mathrm{VCdim}(\mathcal{H}) = d$ we need to show two things:
1. There exists a set $C$ of size $d$ that is shattered by $\mathcal{H}$.
2. Every set $C$ of size $d + 1$ is not shattered by $\mathcal{H}$.
Threshold functions. Let $\mathcal{H}$ be the class of threshold functions over $\mathbb{R}$. Recall Example 6.2, where we have shown that for an arbitrary set $C = \{c_1\}$, $\mathcal{H}$ shatters $C$; therefore $\mathrm{VCdim}(\mathcal{H}) \ge 1$. We have also shown that for an arbitrary set $C = \{c_1, c_2\}$ where $c_1 \le c_2$, $\mathcal{H}$ does not shatter $C$. We therefore conclude that $\mathrm{VCdim}(\mathcal{H}) = 1$.
Intervals. Let $\mathcal{H}$ be the class of intervals over $\mathbb{R}$, namely, $\mathcal{H} = \{h_{a,b} : a < b\}$, where $h_{a,b} : \mathbb{R} \to \{0,1\}$ is a function such that $h_{a,b}(x) = \mathbb{1}_{[x \in (a,b)]}$. Take the set $C = \{1, 2\}$. Then, $\mathcal{H}$ shatters $C$ (make sure you understand why) and therefore $\mathrm{VCdim}(\mathcal{H}) \ge 2$. Now take an arbitrary set $C = \{c_1, c_2, c_3\}$ and assume without loss of generality that $c_1 \le c_2 \le c_3$. Then, the labeling $(1, 0, 1)$ cannot be obtained by an interval and therefore $\mathcal{H}$ does not shatter $C$. We therefore conclude that $\mathrm{VCdim}(\mathcal{H}) = 2$.
Axis aligned rectangles. Let $\mathcal{H}$ be the class of axis aligned rectangles, formally:
$$\mathcal{H} = \{h_{(a_1, a_2, b_1, b_2)} : a_1 \le a_2 \text{ and } b_1 \le b_2\},$$
where
$$h_{(a_1, a_2, b_1, b_2)}(x_1, x_2) = \begin{cases} 1 & \text{if } a_1 \le x_1 \le a_2 \text{ and } b_1 \le x_2 \le b_2 \\ 0 & \text{otherwise} \end{cases}$$
We shall show in the following that $\mathrm{VCdim}(\mathcal{H}) = 4$. To prove this we need to find a set of 4 points that is shattered by $\mathcal{H}$ and show that no set of 5 points can be shattered by $\mathcal{H}$. Finding a set of 4 points that is shattered is easy (see Figure 6.1). Now, consider any set $C \subset \mathbb{R}^2$ of 5 points. In $C$, take a leftmost point (whose first coordinate is the smallest in $C$), a rightmost point (first coordinate is the largest), a lowest point (second coordinate is the smallest), and a highest point (second coordinate is the largest). Without loss of generality, denote $C = \{c_1, \ldots, c_5\}$ and let $c_5$ be the point that was not selected. Now, define the labeling $(1, 1, 1, 1, 0)$. It is impossible to obtain this labeling by an axis aligned rectangle. Indeed, such a rectangle must contain $c_1, \ldots, c_4$; but in this case the rectangle contains $c_5$ as well, because its coordinates are within the intervals defined by the selected points. So, $C$ is not shattered by $\mathcal{H}$, and therefore $\mathrm{VCdim}(\mathcal{H}) = 4$.
Figure 6.1 Left: 4 points that are shattered by axis aligned rectangles. Right: Any axis aligned rectangle cannot label $c_5$ by 0 and the rest of the points by 1.
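As a sanity check of this example, the helpers above can confirm both directions numerically. The enumeration rect_hyps is our own illustration: for a finite point set, only rectangles whose corners come from the points' coordinates (padded at the extremes) produce distinct labelings.

```python
from itertools import combinations_with_replacement

def rect_hyps(points):
    """Enumerate axis aligned rectangles on a grid of the points' coordinates."""
    xs = sorted({p[0] for p in points})
    ys = sorted({p[1] for p in points})
    xs = [xs[0] - 1] + xs + [xs[-1] + 1]   # pad so rectangles can exclude extremes
    ys = [ys[0] - 1] + ys + [ys[-1] + 1]
    return [lambda p, a1=a1, a2=a2, b1=b1, b2=b2:
                int(a1 <= p[0] <= a2 and b1 <= p[1] <= b2)
            for a1, a2 in combinations_with_replacement(xs, 2)
            for b1, b2 in combinations_with_replacement(ys, 2)]

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # 4 points in "diamond" position
print(shatters(rect_hyps(diamond), diamond))   # True: VCdim >= 4
five = diamond + [(0, 0)]                      # a 5th, interior point
print(shatters(rect_hyps(five), five))         # False: (1,1,1,1,0) unrealizable
```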
Finite classes. Let $\mathcal{H}$ be a finite class. Then, clearly, for any set $C$ we have $|\mathcal{H}_C| \le |\mathcal{H}|$ and thus $C$ cannot be shattered if $|\mathcal{H}| < 2^{|C|}$. This implies that $\mathrm{VCdim}(\mathcal{H}) \le \log_2(|\mathcal{H}|)$. This shows that the PAC learnability of finite classes follows from the more general statement of PAC learnability of classes with finite VC-dimension, which we shall see in the next section. Note, however, that the VC-dimension of a finite class $\mathcal{H}$ can be significantly smaller than $\log_2(|\mathcal{H}|)$. For example, let $\mathcal{X} = \{1, \ldots, k\}$, for some integer $k$, and consider the class of threshold functions (as defined in Example 6.2). Then, $|\mathcal{H}| = k$ but $\mathrm{VCdim}(\mathcal{H}) = 1$. Since $k$ can be arbitrarily large, the gap between $\log_2(|\mathcal{H}|)$ and the VC-dimension can be arbitrarily large.
In the previous examples, the VC-dimension happened to equal the number of parameters defining the hypothesis class. While this is often the case, it is not always true. Consider, for example, the domain $\mathcal{X} = \mathbb{R}$, and the hypothesis class $\mathcal{H} = \{h_\theta : \theta \in \mathbb{R}\}$ where $h_\theta(x) = \lceil 0.5 \sin(\theta x) \rceil$. It is possible to prove that $\mathrm{VCdim}(\mathcal{H}) = \infty$, namely, for every $d$, one can find $d$ points that are shattered by $\mathcal{H}$ (see Exercise 8).
We have already shown that a class of infinite VC-dimension is not learnable. The converse statement is also true, leading to the fundamental theorem of statistical learning theory:
THEOREM 6.7 (The Fundamental Theorem of Statistical Learning) Let $\mathcal{H}$ be a hypothesis class of functions from a domain $\mathcal{X}$ to $\{0,1\}$ and let the loss function be the 0-1 loss. Then, the following are equivalent:
1. $\mathcal{H}$ has the uniform convergence property.
2. Any ERM rule is a successful agnostic PAC learner for $\mathcal{H}$.
3. $\mathcal{H}$ is agnostic PAC learnable.
4. $\mathcal{H}$ is PAC learnable.
5. Any ERM rule is a successful PAC learner for $\mathcal{H}$.
6. $\mathcal{H}$ has a finite VC-dimension.
The proof of the theorem is given in the next section.
Not only does the VC-dimension characterize PAC learnability; it even determines the sample complexity.
THEOREM 6.8 (The Fundamental Theorem of Statistical Learning – Quantitative Version) Let $\mathcal{H}$ be a hypothesis class of functions from a domain $\mathcal{X}$ to $\{0,1\}$ and let the loss function be the 0-1 loss. Assume that $\mathrm{VCdim}(\mathcal{H}) = d < \infty$. Then, there are absolute constants $C_1, C_2$ such that:
1. $\mathcal{H}$ has the uniform convergence property with sample complexity
$$C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \le m_{\mathcal{H}}^{\mathrm{UC}}(\epsilon, \delta) \le C_2 \frac{d + \log(1/\delta)}{\epsilon^2}$$
2. $\mathcal{H}$ is agnostic PAC learnable with sample complexity
$$C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \le m_{\mathcal{H}}(\epsilon, \delta) \le C_2 \frac{d + \log(1/\delta)}{\epsilon^2}$$
3. $\mathcal{H}$ is PAC learnable with sample complexity
$$C_1 \frac{d + \log(1/\delta)}{\epsilon} \le m_{\mathcal{H}}(\epsilon, \delta) \le C_2 \frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon}$$
The proof of this theorem is given in Chapter 28.
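Since $C_1, C_2$ are unspecified absolute constants, the theorem pins down only the shape of the sample complexity. A small sketch (with the purely illustrative choice $C_2 = 1$, an assumption of ours, not a value from the book) shows the scaling of the agnostic upper bound: halving $\epsilon$ quadruples the bound, while the dependence on $\delta$ is only logarithmic.

```python
import math

def agnostic_sample_bound(d, eps, delta, C2=1.0):
    """Shape of Theorem 6.8, item 2: C2 * (d + log(1/delta)) / eps^2.
    C2 is an unspecified absolute constant; C2=1 is purely illustrative."""
    return math.ceil(C2 * (d + math.log(1 / delta)) / eps ** 2)

for eps in (0.1, 0.05):
    print(eps, agnostic_sample_bound(d=4, eps=eps, delta=0.01))
```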
Remark 6.3 We stated the fundamental theorem for binary classification tasks. A similar result holds for some other learning problems such as regression with the absolute loss or the squared loss. However, the theorem does not hold for all learning tasks. In particular, learnability is sometimes possible even though the uniform convergence property does not hold (we will see an example in Chapter 13, Exercise 2). Furthermore, in some situations, the ERM rule fails but learnability is possible with other learning rules.
We have already seen that $1 \rightarrow 2$ in Chapter 4. The implications $2 \rightarrow 3$ and $3 \rightarrow 4$ are trivial and so is $2 \rightarrow 5$. The implications $4 \rightarrow 6$ and $5 \rightarrow 6$ follow from the No-Free-Lunch theorem. The difficult part is to show that $6 \rightarrow 1$. The proof is based on two main claims:
1. If $\mathrm{VCdim}(\mathcal{H}) = d$, then even though $\mathcal{H}$ might be infinite, its restriction to a finite set $C \subset \mathcal{X}$ has "effective" size $|\mathcal{H}_C|$ of only $O(|C|^d)$; this is Sauer's lemma, proved next.
2. Uniform convergence holds whenever the effective size of the class is small, as formalized in Theorem 6.11.
We defined the notion of shattering by considering the restriction of $\mathcal{H}$ to a finite set of instances. The growth function measures the maximal "effective" size of $\mathcal{H}$ on a set of $m$ examples. Formally:
DEFINITION 6.9 (Growth Function) Let $\mathcal{H}$ be a hypothesis class. Then the growth function of $\mathcal{H}$, denoted $\tau_{\mathcal{H}} : \mathbb{N} \to \mathbb{N}$, is defined as
$$\tau_{\mathcal{H}}(m) = \max_{C \subset \mathcal{X} : |C| = m} |\mathcal{H}_C|.$$
In words, $\tau_{\mathcal{H}}(m)$ is the number of different functions from a set $C$ of size $m$ to $\{0,1\}$ that can be obtained by restricting $\mathcal{H}$ to $C$.
Obviously, if $\mathrm{VCdim}(\mathcal{H}) = d$ then for any $m \le d$ we have $\tau_{\mathcal{H}}(m) = 2^m$. In such cases, $\mathcal{H}$ induces all possible functions from $C$ to $\{0,1\}$. The following beautiful lemma, proposed independently by Sauer, Shelah, and Perles, shows that when $m$ becomes larger than the VC-dimension, the growth function increases polynomially rather than exponentially with $m$.
LEMMA 6.10 (Sauer-Shelah-Perles) Let $\mathcal{H}$ be a hypothesis class with $\mathrm{VCdim}(\mathcal{H}) \le d < \infty$. Then, for all $m$, $\tau_{\mathcal{H}}(m) \le \sum_{i=0}^{d} \binom{m}{i}$. In particular, if $m > d + 1$ then $\tau_{\mathcal{H}}(m) \le (em/d)^d$.
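A quick numeric comparison makes the lemma's message visible: for $m \le d$ the bound coincides with $2^m$, and beyond the VC-dimension it grows only like $m^d$. A minimal sketch:

```python
import math

def sauer_bound(m, d):
    """Sum_{i=0}^{d} C(m, i), the Sauer-Shelah-Perles bound on tau_H(m)."""
    return sum(math.comb(m, i) for i in range(d + 1))

d = 2  # e.g. intervals over the real line have VCdim 2
for m in (2, 5, 10, 20):
    poly = (math.e * m / d) ** d if m > d + 1 else None
    print(m, sauer_bound(m, d), 2 ** m, poly)
```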
To prove the lemma it suffices to prove the following stronger claim: For any $C = \{c_1, \ldots, c_m\}$ we have
$$\forall \mathcal{H},\quad |\mathcal{H}_C| \le |\{B \subseteq C : \mathcal{H} \text{ shatters } B\}|. \qquad (6.3)$$
The reason why Equation (6.3) is sufficient to prove the lemma is that if $\mathrm{VCdim}(\mathcal{H}) \le d$ then no set whose size is larger than $d$ is shattered by $\mathcal{H}$ and therefore
$$|\{B \subseteq C : \mathcal{H} \text{ shatters } B\}| \le \sum_{i=0}^{d} \binom{m}{i}.$$
When $m > d + 1$ the right-hand side of the preceding is at most $(em/d)^d$ (see Lemma A.5 in Appendix A).
We are left with proving Equation (6.3) and we do it using an inductive argument. For $m = 1$, no matter what $\mathcal{H}$ is, either both sides of Equation (6.3) equal 1 or both sides equal 2 (the empty set is always considered to be shattered by $\mathcal{H}$). Assume Equation (6.3) holds for sets of size $k < m$ and let us prove it for sets of size $m$. Fix $\mathcal{H}$ and $C = \{c_1, \ldots, c_m\}$. Denote $C' = \{c_2, \ldots, c_m\}$ and, in addition, define the following two sets:
$$Y_0 = \{(y_2, \ldots, y_m) : (0, y_2, \ldots, y_m) \in \mathcal{H}_C \vee (1, y_2, \ldots, y_m) \in \mathcal{H}_C\},$$
and
$$Y_1 = \{(y_2, \ldots, y_m) : (0, y_2, \ldots, y_m) \in \mathcal{H}_C \wedge (1, y_2, \ldots, y_m) \in \mathcal{H}_C\}.$$
It is easy to verify that $|\mathcal{H}_C| = |Y_0| + |Y_1|$. Additionally, since $Y_0 = \mathcal{H}_{C'}$, using the induction assumption (applied on $\mathcal{H}$ and $C'$) we have that
$$|Y_0| = |\mathcal{H}_{C'}| \le |\{B \subseteq C' : \mathcal{H} \text{ shatters } B\}| = |\{B \subseteq C : c_1 \notin B \wedge \mathcal{H} \text{ shatters } B\}|.$$
Next, define $\mathcal{H}' \subseteq \mathcal{H}$ to be
$$\mathcal{H}' = \{h \in \mathcal{H} : \exists h' \in \mathcal{H} \text{ s.t. } (1 - h'(c_1), h'(c_2), \ldots, h'(c_m)) = (h(c_1), h(c_2), \ldots, h(c_m))\},$$
namely, $\mathcal{H}'$ contains pairs of hypotheses that agree on $C'$ and differ on $c_1$. Using this definition, it is clear that if $\mathcal{H}'$ shatters a set $B \subseteq C'$ then it also shatters the set $B \cup \{c_1\}$ and vice versa. Combining this with the fact that $Y_1 = \mathcal{H}'_{C'}$ and using the inductive assumption (now applied on $\mathcal{H}'$ and $C'$) we obtain that
$$|Y_1| = |\mathcal{H}'_{C'}| \le |\{B \subseteq C' : \mathcal{H}' \text{ shatters } B\}| = |\{B \subseteq C' : \mathcal{H}' \text{ shatters } B \cup \{c_1\}\}| = |\{B \subseteq C : c_1 \in B \wedge \mathcal{H}' \text{ shatters } B\}| \le |\{B \subseteq C : c_1 \in B \wedge \mathcal{H} \text{ shatters } B\}|.$$
Overall, we have shown that
$$|\mathcal{H}_C| = |Y_0| + |Y_1| \le |\{B \subseteq C : c_1 \notin B \wedge \mathcal{H} \text{ shatters } B\}| + |\{B \subseteq C : c_1 \in B \wedge \mathcal{H} \text{ shatters } B\}| = |\{B \subseteq C : \mathcal{H} \text{ shatters } B\}|,$$
which concludes our proof.
In this section we prove that if $\mathcal{H}$ has small effective size then it enjoys the uniform convergence property. Formally,
THEOREM 6.11 Let $\mathcal{H}$ be a class and let $\tau_{\mathcal{H}}$ be its growth function. Then, for every $\mathcal{D}$ and every $\delta \in (0,1)$, with probability of at least $1 - \delta$ over the choice of $S \sim \mathcal{D}^m$ we have
$$|L_{\mathcal{D}}(h) - L_S(h)| \le \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\delta \sqrt{2m}}.$$
Before proving the theorem, let us first conclude the proof of Theorem 6.7.
Proof of Theorem 6.7 It suffices to prove that if the VC-dimension is finite then the uniform convergence property holds. We will prove that
$$m_{\mathcal{H}}^{\mathrm{UC}}(\epsilon, \delta) \le 4 \frac{2d}{(\delta\epsilon)^2} \log\left(\frac{2d}{(\delta\epsilon)^2}\right) + \frac{4d \log(2e/d)}{(\delta\epsilon)^2}.$$
From Sauer's lemma we have that for $m > d$, $\tau_{\mathcal{H}}(2m) \le (2em/d)^d$. Combining this with Theorem 6.11 we obtain that with probability of at least $1 - \delta$,
$$|L_S(h) - L_{\mathcal{D}}(h)| \le \frac{4 + \sqrt{d \log(2em/d)}}{\delta \sqrt{2m}}.$$
For simplicity assume that $\sqrt{d \log(2em/d)} \ge 4$; hence,
$$|L_S(h) - L_{\mathcal{D}}(h)| \le \frac{1}{\delta} \sqrt{\frac{2 d \log(2em/d)}{m}}.$$
To ensure that the preceding is at most $\epsilon$ we need that
$$m \ge \frac{2 d \log(2em/d)}{(\delta \epsilon)^2}.$$
Standard algebraic manipulations (see Lemma A.2 in Appendix A) show that a sufficient condition for the preceding to hold is that
$$m \ge 4 \frac{2d}{(\delta\epsilon)^2} \log\left(\frac{2d}{(\delta\epsilon)^2}\right) + \frac{4d \log(2e/d)}{(\delta\epsilon)^2}.$$
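As a numeric sanity check of this step, one can plug the sufficient $m$ back into the requirement $m \ge 2d \log(2em/d)/(\delta\epsilon)^2$. The constants below follow our reconstruction of the bound, so treat the exact numbers as illustrative.

```python
import math

def sufficient_m(d, eps, delta):
    """The sufficient sample size from the display above."""
    a = 2 * d / (delta * eps) ** 2
    return 4 * a * math.log(a) + 4 * d * math.log(2 * math.e / d) / (delta * eps) ** 2

def required_m(m, d, eps, delta):
    """The implicit requirement m >= 2 d log(2em/d) / (delta * eps)^2."""
    return 2 * d * math.log(2 * math.e * m / d) / (delta * eps) ** 2

d, eps, delta = 3, 0.1, 0.1
m = sufficient_m(d, eps, delta)
print(m >= required_m(m, d, eps, delta))  # True: the implicit bound is satisfied
```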
Remark 6.4 The upper bound on $m_{\mathcal{H}}^{\mathrm{UC}}$ we derived in the proof of Theorem 6.7 is not the tightest possible. A tighter analysis that yields the bounds given in Theorem 6.8 can be found in Chapter 28.
We will start by showing that
$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] \le \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\sqrt{2m}}. \qquad (6.4)$$
Since the random variable $\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|$ is nonnegative, the proof of the theorem follows directly from the preceding using Markov's inequality (see Section B.1).
To bound the left-hand side of Equation (6.4) we first note that for every $h \in \mathcal{H}$, we can rewrite $L_{\mathcal{D}}(h) = \mathbb{E}_{S' \sim \mathcal{D}^m}[L_{S'}(h)]$, where $S' = z'_1, \ldots, z'_m$ is an additional i.i.d. sample. Therefore,
$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] = \mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \left|\mathbb{E}_{S' \sim \mathcal{D}^m}[L_{S'}(h)] - L_S(h)\right|\right].$$
A generalization of the triangle inequality yields
$$\left|\mathbb{E}_{S' \sim \mathcal{D}^m}[L_{S'}(h) - L_S(h)]\right| \le \mathbb{E}_{S' \sim \mathcal{D}^m}\left|L_{S'}(h) - L_S(h)\right|,$$
and the fact that the supremum of an expectation is smaller than the expectation of the supremum yields
$$\sup_{h \in \mathcal{H}} \mathbb{E}_{S' \sim \mathcal{D}^m}\left|L_{S'}(h) - L_S(h)\right| \le \mathbb{E}_{S' \sim \mathcal{D}^m} \sup_{h \in \mathcal{H}} \left|L_{S'}(h) - L_S(h)\right|.$$
Formally, the previous two inequalities follow from Jensen's inequality. Combining all we obtain
$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] \le \mathbb{E}_{S, S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m} \left|\sum_{i=1}^{m} \left(\ell(h, z'_i) - \ell(h, z_i)\right)\right|\right]. \qquad (6.5)$$
The expectation on the right-hand side is over a choice of two i.i.d. samples $S = z_1, \ldots, z_m$ and $S' = z'_1, \ldots, z'_m$. Since all of these $2m$ vectors are chosen i.i.d., nothing will change if we replace the name of the random vector $z_i$ with the name of the random vector $z'_i$. If we do it, instead of the term $\left(\ell(h, z'_i) - \ell(h, z_i)\right)$ in Equation (6.5) we will have the term $-\left(\ell(h, z'_i) - \ell(h, z_i)\right)$. It follows that for every $\sigma \in \{\pm 1\}^m$ we have that Equation (6.5) equals
$$\mathbb{E}_{S, S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m} \left|\sum_{i=1}^{m} \sigma_i \left(\ell(h, z'_i) - \ell(h, z_i)\right)\right|\right].$$
Since this holds for every $\sigma \in \{\pm 1\}^m$, it also holds if we sample each component of $\sigma$ uniformly at random from the uniform distribution over $\{\pm 1\}$, denoted $U_{\pm}$. Hence, Equation (6.5) also equals
$$\mathbb{E}_{\sigma \sim U_{\pm}^m} \mathbb{E}_{S, S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m} \left|\sum_{i=1}^{m} \sigma_i \left(\ell(h, z'_i) - \ell(h, z_i)\right)\right|\right],$$
and by the linearity of expectation it also equals
$$\mathbb{E}_{S, S' \sim \mathcal{D}^m} \mathbb{E}_{\sigma \sim U_{\pm}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m} \left|\sum_{i=1}^{m} \sigma_i \left(\ell(h, z'_i) - \ell(h, z_i)\right)\right|\right].$$
Next, fix $S$ and $S'$, and let $C$ be the instances appearing in $S$ and $S'$. Then, we can take the supremum only over $h \in \mathcal{H}_C$. Therefore,
$$\mathbb{E}_{\sigma \sim U_{\pm}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m} \left|\sum_{i=1}^{m} \sigma_i \left(\ell(h, z'_i) - \ell(h, z_i)\right)\right|\right] = \mathbb{E}_{\sigma \sim U_{\pm}^m}\left[\max_{h \in \mathcal{H}_C} \frac{1}{m} \left|\sum_{i=1}^{m} \sigma_i \left(\ell(h, z'_i) - \ell(h, z_i)\right)\right|\right].$$
Fix some $h \in \mathcal{H}_C$ and denote $\theta_h = \frac{1}{m} \sum_{i=1}^{m} \sigma_i \left(\ell(h, z'_i) - \ell(h, z_i)\right)$. Since $\mathbb{E}[\theta_h] = 0$ and $\theta_h$ is an average of independent variables, each of which takes values in $[-1, 1]$, we have by Hoeffding's inequality that for every $\rho > 0$,
$$\mathbb{P}[|\theta_h| > \rho] \le 2 \exp\left(-m\rho^2/2\right).$$
Applying the union bound over $h \in \mathcal{H}_C$, we obtain that for any $\rho > 0$,
$$\mathbb{P}\left[\max_{h \in \mathcal{H}_C} |\theta_h| > \rho\right] \le 2\,|\mathcal{H}_C| \exp\left(-m\rho^2/2\right).$$
Finally, Lemma A.4 in Appendix A tells us that the preceding implies
$$\mathbb{E}\left[\max_{h \in \mathcal{H}_C} |\theta_h|\right] \le \frac{4 + \sqrt{\log(|\mathcal{H}_C|)}}{\sqrt{2m}}.$$
Combining all with the definition of $\tau_{\mathcal{H}}$, we have shown that
$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] \le \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\sqrt{2m}},$$
which is Equation (6.4) and concludes the proof of Theorem 6.11.
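Equation (6.4) can also be probed empirically. The Monte Carlo sketch below is our construction, not the book's: it estimates $\mathbb{E}_S \sup_h |L_{\mathcal{D}}(h) - L_S(h)|$ for the threshold class with $x \sim U[0,1]$ and true threshold $a^* = 0.5$, where the true risk is simply $L_{\mathcal{D}}(h_a) = |a - a^*|$ and $\tau_{\mathcal{H}}(2m) = 2m + 1$. The observed expectation sits well below the (loose) bound.

```python
import numpy as np

rng = np.random.default_rng(1)
a_star, m, trials = 0.5, 50, 200
gaps = []
for _ in range(trials):
    xs = np.sort(rng.uniform(0, 1, m))
    ys = (xs < a_star).astype(int)
    # evaluate at each sample point and just above it (both one-sided limits)
    cands = np.concatenate((xs, xs + 1e-9, [0.0, 1.0]))
    sup_gap = max(abs(abs(a - a_star) - np.mean((xs < a).astype(int) != ys))
                  for a in cands)
    gaps.append(sup_gap)

print(np.mean(gaps))                                       # observed E sup gap
print((4 + np.sqrt(np.log(2 * m + 1))) / np.sqrt(2 * m))   # bound (6.4)
```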
The fundamental theorem of learning theory characterizes PAC learnability of classes of binary classifiers using the VC-dimension. The VC-dimension of a class is a combinatorial property that denotes the maximal sample size that can be shattered by the class. The fundamental theorem states that a class is PAC learnable if and only if its VC-dimension is finite, and it specifies the sample complexity required for PAC learning. The theorem also shows that if a problem is at all learnable, then uniform convergence holds and therefore the problem is learnable using the ERM rule.