
Chapter 6 - Understanding Machine Learning From Theory to Algorithms: The VC-Dimension

tags: Understanding Machine Learning From Theory to Algorithms


The VC-Dimension

In the previous chapter, we decomposed the error of the $ERM_{\mathcal{H}}$ rule into approximation error and estimation error. The approximation error depends on the fit of our prior knowledge (as reflected by the choice of the hypothesis class $\mathcal{H}$) to the underlying unknown distribution. In contrast, the definition of PAC learnability requires that the estimation error be bounded uniformly over all distributions.


Our current goal is to figure out which classes $\mathcal{H}$ are PAC learnable, and to characterize exactly the sample complexity of learning a given hypothesis class. So far we have seen that finite classes are learnable, but that the class of all functions (over an infinite size domain) is not. What makes one class learnable and the other unlearnable? Can infinite-size classes be learnable, and, if so, what determines their sample complexity?


We begin the chapter by showing that infinite classes can indeed be learnable, and thus, finiteness of the hypothesis class is not a necessary condition for learnability. We then present a remarkably crisp characterization of the family of learnable classes in the setup of binary valued classification with the zero-one loss. This characterization was first discovered by Vladimir Vapnik and Alexey Chervonenkis in 1970 and relies on a combinatorial notion called the Vapnik-Chervonenkis dimension (VC-dimension). We formally define the VC-dimension, provide several examples, and then state the fundamental theorem of statistical learning theory, which integrates the concepts of learnability, VC-dimension, the ERM rule, and uniform convergence.


6.1 Infinite-Size Classes Can Be Learnable

In Chapter 4 we saw that finite classes are learnable, and in fact the sample complexity of a hypothesis class is upper bounded by the log of its size. To show that the size of the hypothesis class is not the right characterization of its sample complexity, we first present a simple example of an infinite-size hypothesis class that is learnable.


Example 6.1 Let $\mathcal{H}$ be the set of threshold functions over the real line, namely, $\mathcal{H} = \{h_a : a \in \mathbb{R}\}$, where $h_a : \mathbb{R} \to \{0,1\}$ is a function such that $h_a(x) = \mathbb{1}_{[x < a]}$. To remind the reader, $\mathbb{1}_{[x < a]}$ is $1$ if $x < a$ and $0$ otherwise. Clearly, $\mathcal{H}$ is of infinite size. Nevertheless, the following lemma shows that $\mathcal{H}$ is learnable in the PAC model using the ERM algorithm.


Lemma 6.1 Let $\mathcal{H}$ be the class of thresholds as defined earlier. Then, $\mathcal{H}$ is PAC learnable, using the ERM rule, with sample complexity of $m_{\mathcal{H}}(\epsilon,\delta) \leq \lceil \log(2/\delta)/\epsilon \rceil$.

Proof Let $a^{\star}$ be a threshold such that the hypothesis $h^{\star}(x) = \mathbb{1}_{[x < a^{\star}]}$ achieves $L_{\mathcal{D}}(h^{\star}) = 0$. Let $\mathcal{D}_x$ be the marginal distribution over the domain $\mathcal{X}$ and let $a_0 < a^{\star} < a_1$ be such that

$$\mathbb{P}_{x \sim \mathcal{D}_x}[x \in (a_0, a^{\star})] = \mathbb{P}_{x \sim \mathcal{D}_x}[x \in (a^{\star}, a_1)] = \epsilon$$



(If $\mathcal{D}_x(-\infty, a^{\star}) \leq \epsilon$ we set $a_0 = -\infty$, and similarly for $a_1$.) Given a training set $S$, let $b_0 = \max\{x : (x,1) \in S\}$ and $b_1 = \min\{x : (x,0) \in S\}$ (if no example in $S$ is positive we set $b_0 = -\infty$, and if no example in $S$ is negative we set $b_1 = \infty$). Let $b_S$ be a threshold corresponding to an ERM hypothesis, $h_S$, which implies that $b_S \in (b_0, b_1)$. Therefore, a sufficient condition for $L_{\mathcal{D}}(h_S) \leq \epsilon$ is that both $b_0 \geq a_0$ and $b_1 \leq a_1$. In other words,

$$\mathbb{P}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S) > \epsilon] \leq \mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0 \lor b_1 > a_1]$$

and using the union bound we can bound the preceding by

$$\mathbb{P}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S) > \epsilon] \leq \mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0] + \mathbb{P}_{S \sim \mathcal{D}^m}[b_1 > a_1] \tag{6.1}$$


The event $b_0 < a_0$ happens if and only if all examples in $S$ are not in the interval $(a_0, a^{\star})$, whose probability mass is defined to be $\epsilon$, namely,

$$\mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0] = \mathbb{P}_{S \sim \mathcal{D}^m}[\forall (x,y) \in S,\ x \notin (a_0, a^{\star})] = (1-\epsilon)^m \leq e^{-\epsilon m}$$


Since we assume $m > \log(2/\delta)/\epsilon$ it follows that the preceding quantity is at most $\delta/2$. In the same way it is easy to see that $\mathbb{P}_{S \sim \mathcal{D}^m}[b_1 > a_1] \leq \delta/2$. Combining with Equation (6.1) we conclude our proof.

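The proof also dictates a concrete learner. The Python sketch below is ours, not from the book (the names `erm_threshold`, `a_star` are made up for illustration): it implements the ERM rule for the threshold class and lets you watch the true error shrink as $m$ grows, in line with the $\lceil \log(2/\delta)/\epsilon \rceil$ sample complexity.

```python
import random

def erm_threshold(sample):
    """ERM for H = {x -> 1[x < a]} on a realizable sample: any threshold
    in (b0, b1] has zero empirical risk, where b0 is the rightmost positive
    point and b1 the leftmost negative point."""
    b0 = max((x for x, y in sample if y == 1), default=None)
    b1 = min((x for x, y in sample if y == 0), default=None)
    if b0 is None:
        return b1            # no positive examples: predict 0 on every sample point
    if b1 is None:
        return b0 + 1.0      # no negative examples: predict 1 on every sample point
    assert b0 < b1, "sample is not realizable by a threshold"
    return (b0 + b1) / 2

random.seed(0)
a_star = 0.3                 # true concept: x -> 1[x < 0.3]
for m in (10, 100, 1000):
    S = [(x, int(x < a_star)) for x in (random.random() for _ in range(m))]
    a_hat = erm_threshold(S)
    # under the uniform marginal on [0, 1], L_D(h_S) = |a_hat - a_star|
    print(m, abs(a_hat - a_star))
```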

6.2 The VC-Dimension

We see, therefore, that while finiteness of $\mathcal{H}$ is a sufficient condition for learnability, it is not a necessary condition. As we will show, a property called the VC-dimension of a hypothesis class gives the correct characterization of its learnability. To motivate the definition of the VC-dimension, let us recall the No-Free-Lunch theorem (Theorem 5.1) and its proof. There, we have shown that without restricting the hypothesis class, for any learning algorithm, an adversary can construct a distribution for which the learning algorithm will perform poorly, while there is another learning algorithm that will succeed on the same distribution. To do so, the adversary used a finite set $C \subset \mathcal{X}$ and considered a family of distributions that are concentrated on elements of $C$. Each distribution was derived from a "true" target function from $C$ to $\{0,1\}$. To make any algorithm fail, the adversary used the power of choosing a target function from the set of all possible functions from $C$ to $\{0,1\}$.


When considering PAC learnability of a hypothesis class $\mathcal{H}$, the adversary is restricted to constructing distributions for which some hypothesis $h \in \mathcal{H}$ achieves a zero risk. Since we are considering distributions that are concentrated on elements of $C$, we should study how $\mathcal{H}$ behaves on $C$, which leads to the following definition.


DEFINITION 6.2 (Restriction of $\mathcal{H}$ to $C$) Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to $\{0,1\}$ and let $C = \{c_1, \dots, c_m\} \subset \mathcal{X}$. The restriction of $\mathcal{H}$ to $C$ is the set of functions from $C$ to $\{0,1\}$ that can be derived from $\mathcal{H}$. That is,

$$\mathcal{H}_C = \{(h(c_1), \dots, h(c_m)) : h \in \mathcal{H}\}$$

where we represent each function from $C$ to $\{0,1\}$ as a vector in $\{0,1\}^{|C|}$.

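For a concrete feel of Definition 6.2, here is a small brute-force sketch (ours; `restriction_thresholds` is a hypothetical helper, not book code) that computes $\mathcal{H}_C$ for the threshold class by trying one candidate threshold per "gap" between points:

```python
def restriction_thresholds(C):
    """Return H_C for H = {x -> 1[x < a]} as a set of label vectors.
    One threshold below all points, one in each gap, and one above all
    points is enough to realize every achievable labeling."""
    pts = sorted(C)
    cands = [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]
    return {tuple(int(c < a) for c in C) for a in cands}

print(restriction_thresholds([0.1, 0.5, 0.9]))
# {(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)}: only m + 1 of the 2^m vectors
```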

If the restriction of $\mathcal{H}$ to $C$ is the set of all functions from $C$ to $\{0,1\}$, then we say that $\mathcal{H}$ shatters the set $C$. Formally:

DEFINITION 6.3 (Shattering) A hypothesis class $\mathcal{H}$ shatters a finite set $C \subset \mathcal{X}$ if the restriction of $\mathcal{H}$ to $C$ is the set of all functions from $C$ to $\{0,1\}$. That is, $|\mathcal{H}_C| = 2^{|C|}$.


Example 6.2 Let $\mathcal{H}$ be the class of threshold functions over $\mathbb{R}$. Take a set $C = \{c_1\}$. Now, if we take $a = c_1 + 1$, then we have $h_a(c_1) = 1$, and if we take $a = c_1 - 1$, then we have $h_a(c_1) = 0$. Therefore, $\mathcal{H}_C$ is the set of all functions from $C$ to $\{0,1\}$, and $\mathcal{H}$ shatters $C$. Now take a set $C = \{c_1, c_2\}$, where $c_1 \leq c_2$. No $h \in \mathcal{H}$ can account for the labeling $(0,1)$, because any threshold that assigns the label $0$ to $c_1$ must assign the label $0$ to $c_2$ as well. Therefore not all functions from $C$ to $\{0,1\}$ are included in $\mathcal{H}_C$; hence $C$ is not shattered by $\mathcal{H}$.

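Shattering is a finite check: $C$ is shattered iff $|\mathcal{H}_C| = 2^{|C|}$. Using the `restriction_thresholds` sketch from above, Example 6.2 can be verified mechanically (again purely illustrative):

```python
def shatters_thresholds(C):
    # C is shattered iff the restriction contains all 2^|C| label vectors
    return len(restriction_thresholds(C)) == 2 ** len(C)

print(shatters_thresholds([0.5]))        # True: a single point is shattered
print(shatters_thresholds([0.2, 0.7]))   # False: the labeling (0, 1) is unreachable
```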

Getting back to the construction of an adversarial distribution as in the proof of the No-Free-Lunch theorem (Theorem 5.1), we see that whenever some set $C$ is shattered by $\mathcal{H}$, the adversary is not restricted by $\mathcal{H}$, as they can construct a distribution over $C$ based on any target function from $C$ to $\{0,1\}$, while still maintaining the realizability assumption. This immediately yields:

COROLLARY 6.4 Let $\mathcal{H}$ be a hypothesis class of functions from $\mathcal{X}$ to $\{0,1\}$. Let $m$ be a training set size. Assume that there exists a set $C \subset \mathcal{X}$ of size $2m$ that is shattered by $\mathcal{H}$. Then, for any learning algorithm, $A$, there exist a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$ and a predictor $h \in \mathcal{H}$ such that $L_{\mathcal{D}}(h) = 0$ but with probability of at least $1/7$ over the choice of $S \sim \mathcal{D}^m$ we have that $L_{\mathcal{D}}(A(S)) \geq 1/8$.


Corollary 6.4 tells us that if $\mathcal{H}$ shatters some set $C$ of size $2m$ then we cannot learn $\mathcal{H}$ using $m$ examples. Intuitively, if a set $C$ is shattered by $\mathcal{H}$, and we receive a sample containing half the instances of $C$, the labels of these instances give us no information about the labels of the rest of the instances in $C$ - every possible labeling of the rest of the instances can be explained by some hypothesis in $\mathcal{H}$. Philosophically,

If someone can explain every phenomenon, his explanations are worthless.


This leads us directly to the definition of the VC dimension.

DEFINITION 6.5 (VC-dimension) The VC-dimension of a hypothesis class $\mathcal{H}$, denoted $VCdim(\mathcal{H})$, is the maximal size of a set $C \subset \mathcal{X}$ that can be shattered by $\mathcal{H}$. If $\mathcal{H}$ can shatter sets of arbitrarily large size we say that $\mathcal{H}$ has infinite VC-dimension.


A direct consequence of Corollary 6.4 is therefore:

THEOREM 6.6 Let $\mathcal{H}$ be a class of infinite VC-dimension. Then, $\mathcal{H}$ is not PAC learnable.


Proof Since $\mathcal{H}$ has an infinite VC-dimension, for any training set size $m$, there exists a shattered set of size $2m$, and the claim follows by Corollary 6.4.


We shall see later in this chapter that the converse is also true: A finite VC-dimension guarantees learnability. Hence, the VC-dimension characterizes PAC learnability. But before delving into more theory, we first show several examples.


6.3 Examples

In this section we calculate the VC-dimension of several hypothesis classes. To show that $VCdim(\mathcal{H}) = d$ we need to show that:

  1. There exists a set $C$ of size $d$ that is shattered by $\mathcal{H}$.
  2. Every set $C$ of size $d+1$ is not shattered by $\mathcal{H}$.


6.3.1 Threshold Functions

Let $\mathcal{H}$ be the class of threshold functions over $\mathbb{R}$. Recall Example 6.2, where we have shown that for an arbitrary set $C = \{c_1\}$, $\mathcal{H}$ shatters $C$; therefore $VCdim(\mathcal{H}) \geq 1$. We have also shown that for an arbitrary set $C = \{c_1, c_2\}$ where $c_1 \leq c_2$, $\mathcal{H}$ does not shatter $C$. We therefore conclude that $VCdim(\mathcal{H}) = 1$.


6.3.2 Intervals

Let $\mathcal{H}$ be the class of intervals over $\mathbb{R}$, namely, $\mathcal{H} = \{h_{a,b} : a, b \in \mathbb{R}, a < b\}$, where $h_{a,b} : \mathbb{R} \to \{0,1\}$ is a function such that $h_{a,b}(x) = \mathbb{1}_{[x \in (a,b)]}$. Take the set $C = \{1, 2\}$. Then, $\mathcal{H}$ shatters $C$ (make sure you understand why) and therefore $VCdim(\mathcal{H}) \geq 2$. Now take an arbitrary set $C = \{c_1, c_2, c_3\}$ and assume without loss of generality that $c_1 \leq c_2 \leq c_3$. Then, the labeling $(1, 0, 1)$ cannot be obtained by an interval and therefore $\mathcal{H}$ does not shatter $C$. We therefore conclude that $VCdim(\mathcal{H}) = 2$.

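The "no $(1,0,1)$" observation gives a mechanical test (our brute-force sketch, not book code): a labeling is realizable by an interval iff no negative point falls strictly between the smallest and largest positive points, so we can enumerate every labeling to test shattering.

```python
from itertools import product

def interval_realizable(points, labels):
    """h_{a,b}(x) = 1[x in (a, b)]: realizable iff no 0-labeled point lies
    strictly between the smallest and largest 1-labeled points."""
    pos = [x for x, y in zip(points, labels) if y == 1]
    if not pos:
        return True          # an empty interval realizes the all-zeros labeling
    lo, hi = min(pos), max(pos)
    return all(not (lo < x < hi) for x, y in zip(points, labels) if y == 0)

def shattered_by_intervals(points):
    return all(interval_realizable(points, lab)
               for lab in product([0, 1], repeat=len(points)))

print(shattered_by_intervals([1, 2]))     # True  -> VCdim >= 2
print(shattered_by_intervals([1, 2, 3]))  # False -> (1, 0, 1) fails, so VCdim = 2
```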

6.3.3 Axis Aligned Rectangles

Let $\mathcal{H}$ be the class of axis aligned rectangles, formally:

$$\mathcal{H} = \{h_{(a_1,a_2,b_1,b_2)} : a_1 \leq a_2 \text{ and } b_1 \leq b_2\}$$

where

$$h_{(a_1,a_2,b_1,b_2)}(x_1,x_2) = \begin{cases} 1 & \text{if } a_1 \leq x_1 \leq a_2 \text{ and } b_1 \leq x_2 \leq b_2 \\ 0 & \text{otherwise} \end{cases}$$


We shall show in the following that $VCdim(\mathcal{H}) = 4$. To prove this we need to find a set of 4 points that are shattered by $\mathcal{H}$ and show that no set of 5 points can be shattered by $\mathcal{H}$. Finding a set of 4 points that are shattered is easy (see Figure 6.1). Now, consider any set $C \subset \mathbb{R}^2$ of 5 points. In $C$, take a leftmost point (whose first coordinate is the smallest in $C$), a rightmost point (first coordinate is the largest), a lowest point (second coordinate is the smallest), and a highest point (second coordinate is the largest). Without loss of generality, denote $C = \{c_1, \dots, c_5\}$ and let $c_5$ be the point that was not selected. Now, define the labeling $(1, 1, 1, 1, 0)$. It is impossible to obtain this labeling by an axis aligned rectangle. Indeed, such a rectangle must contain $c_1, \dots, c_4$; but in this case the rectangle contains $c_5$ as well, because its coordinates are within the intervals defined by the selected points. So, $C$ is not shattered by $\mathcal{H}$, and therefore $VCdim(\mathcal{H}) = 4$.


Figure 6.1 Left: 4 points that are shattered by axis aligned rectangles. Right: Any axis aligned rectangle cannot label $c_5$ by 0 and the rest of the points by 1.

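The geometric argument reduces to a bounding-box test, which makes a brute-force check easy (our illustrative sketch): a labeling is realizable iff the axis aligned bounding box of the positive points contains no negative point.

```python
from itertools import product

def rect_realizable(points, labels):
    """Realizable iff the bounding box of the 1-labeled points contains
    no 0-labeled point (the box itself is then a witness rectangle)."""
    pos = [p for p, y in zip(points, labels) if y == 1]
    if not pos:
        return True
    x1, x2 = min(p[0] for p in pos), max(p[0] for p in pos)
    y1, y2 = min(p[1] for p in pos), max(p[1] for p in pos)
    return all(not (x1 <= p[0] <= x2 and y1 <= p[1] <= y2)
               for p, y in zip(points, labels) if y == 0)

def shattered_by_rects(points):
    return all(rect_realizable(points, lab)
               for lab in product([0, 1], repeat=len(points)))

four = [(0, 1), (1, 0), (2, 1), (1, 2)]     # a "diamond", as in Figure 6.1
print(shattered_by_rects(four))             # True  -> VCdim >= 4
print(shattered_by_rects(four + [(1, 1)]))  # False -> (1,1,1,1,0) fails
```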

6.3.4 Finite Classes

Let $\mathcal{H}$ be a finite class. Then, clearly, for any set $C$ we have $|\mathcal{H}_C| \leq |\mathcal{H}|$ and thus $C$ cannot be shattered if $|\mathcal{H}| < 2^{|C|}$. This implies that $VCdim(\mathcal{H}) \leq \log_2(|\mathcal{H}|)$. This shows that the PAC learnability of finite classes follows from the more general statement of PAC learnability of classes with finite VC-dimension, which we shall see in the next section. Note, however, that the VC-dimension of a finite class $\mathcal{H}$ can be significantly smaller than $\log_2(|\mathcal{H}|)$. For example, let $\mathcal{X} = \{1, \dots, k\}$, for some integer $k$, and consider the class of threshold functions (as defined in Example 6.2). Then, $|\mathcal{H}| = k$ but $VCdim(\mathcal{H}) = 1$. Since $k$ can be arbitrarily large, the gap between $\log_2(|\mathcal{H}|)$ and $VCdim(\mathcal{H})$ can be arbitrarily large.
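This gap is easy to see numerically. In the sketch below (ours; counting conventions at the domain boundary may shift $|\mathcal{H}|$ by one), thresholds over $\mathcal{X} = \{1, \dots, k\}$ yield a class whose size grows with $k$ while no two-point set is ever shattered:

```python
import math

def threshold_class_on(k):
    """Distinct restrictions of x -> 1[x < a] to X = {1, ..., k}:
    one candidate threshold per cut position suffices."""
    X = range(1, k + 1)
    cands = [i + 0.5 for i in range(k + 1)]
    return {tuple(int(x < a) for x in X) for a in cands}

for k in (4, 16, 256):
    H = threshold_class_on(k)
    # log2|H| grows without bound, yet VCdim stays 1 (Section 6.3.1)
    print(k, len(H), round(math.log2(len(H)), 2))
```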


6.3.5 VC-Dimension and the Number of Parameters

In the previous examples, the VC-dimension happened to equal the number of parameters defining the hypothesis class. While this is often the case, it is not always true. Consider, for example, the domain $\mathcal{X} = \mathbb{R}$, and the hypothesis class $\mathcal{H} = \{h_\theta : \theta \in \mathbb{R}\}$ where $h_\theta : \mathcal{X} \to \{0,1\}$ is defined by $h_\theta(x) = \lceil 0.5 \sin(\theta x) \rceil$. It is possible to prove that $VCdim(\mathcal{H}) = \infty$, namely, for every $d$, one can find $d$ points that are shattered by $\mathcal{H}$ (see Exercise 8).


6.4 The Fundamental Theorem of PAC learning

We have already shown that a class of infinite VC-dimension is not learnable. The converse statement is also true, leading to the fundamental theorem of statistical learning theory:

THEOREM 6.7 (The Fundamental Theorem of Statistical Learning) Let $\mathcal{H}$ be a hypothesis class of functions from a domain $\mathcal{X}$ to $\{0,1\}$ and let the loss function be the $0{-}1$ loss. Then, the following are equivalent:

  1. $\mathcal{H}$ has the uniform convergence property.
  2. Any ERM rule is a successful agnostic PAC learner for $\mathcal{H}$.
  3. $\mathcal{H}$ is agnostic PAC learnable.
  4. $\mathcal{H}$ is PAC learnable.
  5. Any ERM rule is a successful PAC learner for $\mathcal{H}$.
  6. $\mathcal{H}$ has a finite VC-dimension.

The proof of the theorem is given in the next section.


Not only does the VC-dimension characterize PAC learnability; it even determines the sample complexity.

THEOREM 6.8 (The Fundamental Theorem of Statistical Learning – Quantitative Version) Let $\mathcal{H}$ be a hypothesis class of functions from a domain $\mathcal{X}$ to $\{0,1\}$ and let the loss function be the $0{-}1$ loss. Assume that $VCdim(\mathcal{H}) = d < \infty$. Then, there are absolute constants $C_1, C_2$ such that:

  1. $\mathcal{H}$ has the uniform convergence property with sample complexity
     $$C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \leq m^{UC}_{\mathcal{H}}(\epsilon,\delta) \leq C_2 \frac{d + \log(1/\delta)}{\epsilon^2}$$
  2. $\mathcal{H}$ is agnostic PAC learnable with sample complexity
     $$C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \leq m_{\mathcal{H}}(\epsilon,\delta) \leq C_2 \frac{d + \log(1/\delta)}{\epsilon^2}$$
  3. $\mathcal{H}$ is PAC learnable with sample complexity
     $$C_1 \frac{d + \log(1/\delta)}{\epsilon} \leq m_{\mathcal{H}}(\epsilon,\delta) \leq C_2 \frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon}$$

The proof of this theorem is given in Chapter 28.

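The constants $C_1, C_2$ are not specified by the theorem, but plugging in $C_2 = 1$ gives a feel for the rates (purely illustrative; the numbers below are not real sample sizes):

```python
import math

def agnostic_upper(d, eps, delta, C2=1.0):
    # illustrative upper bound C2 * (d + log(1/delta)) / eps^2 from Theorem 6.8
    return C2 * (d + math.log(1 / delta)) / eps ** 2

for d in (1, 10, 100):
    print(d, round(agnostic_upper(d, eps=0.1, delta=0.05)))
# grows linearly in the VC-dimension d and quadratically in 1/eps
```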

Remark 6.3 We stated the fundamental theorem for binary classification tasks. A similar result holds for some other learning problems such as regression with the absolute loss or the squared loss. However, the theorem does not hold for all learning tasks. In particular, learnability is sometimes possible even though the uniform convergence property does not hold (we will see an example in Chapter 13, Exercise 2). Furthermore, in some situations, the ERM rule fails but learnability is possible with other learning rules.


6.5 Proof of Theorem 6.7

We have already seen that $1 \rightarrow 2$ in Chapter 4. The implications $2 \rightarrow 3$ and $3 \rightarrow 4$ are trivial and so is $2 \rightarrow 5$. The implications $4 \rightarrow 6$ and $5 \rightarrow 6$ follow from the No-Free-Lunch theorem. The difficult part is to show that $6 \rightarrow 1$. The proof is based on two main claims:

  • If $VCdim(\mathcal{H}) = d$ then even though $\mathcal{H}$ might be infinite, when restricting it to a finite set $C \subset \mathcal{X}$, its "effective" size, $|\mathcal{H}_C|$, is only $O(|C|^d)$. That is, the size of $\mathcal{H}_C$ grows polynomially rather than exponentially with $|C|$. This claim is often referred to as Sauer's lemma, but it has also been stated and proved independently by Shelah and by Perles. The formal statement is given in Section 6.5.1 later.
  • In Chapter 4 we have shown that finite hypothesis classes enjoy the uniform convergence property. In Section 6.5.2 later we generalize this result and show that uniform convergence holds whenever the hypothesis class has a "small effective size." By "small effective size" we mean classes for which $|\mathcal{H}_C|$ grows polynomially with $|C|$.


6.5.1 Sauer's Lemma and the Growth Function

We defined the notion of shattering by considering the restriction of $\mathcal{H}$ to a finite set of instances. The growth function measures the maximal "effective" size of $\mathcal{H}$ on a set of $m$ examples. Formally:

DEFINITION 6.9 (Growth Function) Let $\mathcal{H}$ be a hypothesis class. Then the growth function of $\mathcal{H}$, denoted $\tau_{\mathcal{H}} : \mathbb{N} \to \mathbb{N}$, is defined as

$$\tau_{\mathcal{H}}(m) = \max_{C \subset \mathcal{X} : |C| = m} |\mathcal{H}_C|$$


In words, $\tau_{\mathcal{H}}(m)$ is the number of different functions from a set $C$ of size $m$ to $\{0,1\}$ that can be obtained by restricting $\mathcal{H}$ to $C$.


Obviously, if $VCdim(\mathcal{H}) = d$ then for any $m \leq d$ we have $\tau_{\mathcal{H}}(m) = 2^m$. In such cases, $\mathcal{H}$ induces all possible functions from $C$ to $\{0,1\}$. The following beautiful lemma, proposed independently by Sauer, Shelah, and Perles, shows that when $m$ becomes larger than the VC-dimension, the growth function increases polynomially rather than exponentially with $m$.

LEMMA 6.10 (Sauer-Shelah-Perles) Let $\mathcal{H}$ be a hypothesis class with $VCdim(\mathcal{H}) \leq d < \infty$. Then, for all $m$, $\tau_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i}$. In particular, if $m > d + 1$ then $\tau_{\mathcal{H}}(m) \leq (em/d)^d$.

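The polynomial-versus-exponential contrast in the lemma is easy to tabulate (our sketch):

```python
import math

def sauer_bound(m, d):
    # sum_{i=0}^{d} C(m, i), the bound of Lemma 6.10
    return sum(math.comb(m, i) for i in range(min(m, d) + 1))

d = 3
for m in (2, 4, 8, 16, 32):
    poly = (math.e * m / d) ** d if m > d + 1 else float("nan")
    print(m, 2 ** m, sauer_bound(m, d), round(poly, 1))
# for m <= d the bound equals 2^m; beyond that it grows like m^d, not 2^m
```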

Proof of Sauer's Lemma

To prove the lemma it suffices to prove the following stronger claim: For any $C = \{c_1, \dots, c_m\}$ we have

$$\forall \mathcal{H},\quad |\mathcal{H}_C| \leq |\{B \subseteq C : \mathcal{H} \text{ shatters } B\}| \tag{6.3}$$


The reason why Equation (6.3) is sufficient to prove the lemma is that if $VCdim(\mathcal{H}) \leq d$ then no set whose size is larger than $d$ is shattered by $\mathcal{H}$ and therefore

$$|\{B \subseteq C : \mathcal{H} \text{ shatters } B\}| \leq \sum_{i=0}^{d} \binom{m}{i}$$

When $m > d + 1$ the right-hand side of the preceding is at most $(em/d)^d$ (see Lemma A.5 in Appendix A).


We are left with proving Equation (6.3) and we do it using an inductive argument. For $m = 1$, no matter what $\mathcal{H}$ is, either both sides of Equation (6.3) equal 1 or both sides equal 2 (the empty set is always considered to be shattered by $\mathcal{H}$). Assume Equation (6.3) holds for sets of size $k < m$ and let us prove it for sets of size $m$. Fix $\mathcal{H}$ and $C = \{c_1, \dots, c_m\}$. Denote $C' = \{c_2, \dots, c_m\}$ and in addition, define the following two sets:

$$Y_0 = \{(y_2, \dots, y_m) : (0, y_2, \dots, y_m) \in \mathcal{H}_C \lor (1, y_2, \dots, y_m) \in \mathcal{H}_C\}$$

and

$$Y_1 = \{(y_2, \dots, y_m) : (0, y_2, \dots, y_m) \in \mathcal{H}_C \land (1, y_2, \dots, y_m) \in \mathcal{H}_C\}$$


It is easy to verify that $|\mathcal{H}_C| = |Y_0| + |Y_1|$. Additionally, since $Y_0 = \mathcal{H}_{C'}$, using the induction assumption (applied on $\mathcal{H}$ and $C'$) we have that

$$|Y_0| = |\mathcal{H}_{C'}| \leq |\{B \subseteq C' : \mathcal{H} \text{ shatters } B\}| = |\{B \subseteq C : c_1 \notin B \land \mathcal{H} \text{ shatters } B\}|$$


Next, define $\mathcal{H}' \subseteq \mathcal{H}$ to be

$$\mathcal{H}' = \{h \in \mathcal{H} : \exists h' \in \mathcal{H} \text{ s.t. } (1 - h'(c_1), h'(c_2), \dots, h'(c_m)) = (h(c_1), h(c_2), \dots, h(c_m))\}$$

namely, $\mathcal{H}'$ contains pairs of hypotheses that agree on $C'$ and differ on $c_1$. Using this definition, it is clear that if $\mathcal{H}'$ shatters a set $B \subseteq C'$ then it also shatters the set $B \cup \{c_1\}$ and vice versa. Combining this with the fact that $Y_1 = \mathcal{H}'_{C'}$ and using the inductive assumption (now applied on $\mathcal{H}'$ and $C'$) we obtain that

$$|Y_1| = |\mathcal{H}'_{C'}| \leq |\{B \subseteq C' : \mathcal{H}' \text{ shatters } B\}| = |\{B \subseteq C' : \mathcal{H}' \text{ shatters } B \cup \{c_1\}\}| = |\{B \subseteq C : c_1 \in B \land \mathcal{H}' \text{ shatters } B\}| \leq |\{B \subseteq C : c_1 \in B \land \mathcal{H} \text{ shatters } B\}|$$


Overall, we have shown that

$$|\mathcal{H}_C| = |Y_0| + |Y_1| \leq |\{B \subseteq C : c_1 \notin B \land \mathcal{H} \text{ shatters } B\}| + |\{B \subseteq C : c_1 \in B \land \mathcal{H} \text{ shatters } B\}| = |\{B \subseteq C : \mathcal{H} \text{ shatters } B\}|$$

which concludes our proof.

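Claim (6.3) can also be sanity-checked by brute force on a tiny domain (our sketch; the class is represented directly by its set of label vectors on $C$):

```python
import random
from itertools import combinations, product

def shattered_subsets(H, m):
    """All B subseteq {0, ..., m-1} whose projections of H realize every
    pattern in {0,1}^|B| (the empty set always qualifies)."""
    out = []
    for r in range(m + 1):
        for B in combinations(range(m), r):
            patterns = {tuple(h[i] for i in B) for h in H}
            if len(patterns) == 2 ** r:
                out.append(B)
    return out

random.seed(1)
m = 4
H = set(random.sample(list(product([0, 1], repeat=m)), 6))  # a random class on C
print(len(H), "<=", len(shattered_subsets(H, m)))  # Equation (6.3) in action
```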

6.5.2 Uniform Convergence for Classes of Small Effective Size

In this section we prove that if $\mathcal{H}$ has small effective size then it enjoys the uniform convergence property. Formally,

THEOREM 6.11 Let $\mathcal{H}$ be a class and let $\tau_{\mathcal{H}}$ be its growth function. Then, for every $\mathcal{D}$ and every $\delta \in (0,1)$, with probability of at least $1 - \delta$ over the choice of $S \sim \mathcal{D}^m$ we have

$$\forall h \in \mathcal{H},\quad |L_{\mathcal{D}}(h) - L_S(h)| \leq \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\delta\sqrt{2m}}$$


Before proving the theorem, let us first conclude the proof of Theorem 6.7.

Proof of Theorem 6.7 It suffices to prove that if the VC-dimension is finite then the uniform convergence property holds. We will prove that

$$m^{UC}_{\mathcal{H}}(\epsilon,\delta) \leq 4\frac{16d}{(\delta\epsilon)^2}\log\left(\frac{16d}{(\delta\epsilon)^2}\right) + \frac{16d\log(2e/d)}{(\delta\epsilon)^2}$$


From Sauer's lemma we have that for $m > d$, $\tau_{\mathcal{H}}(2m) \leq (2em/d)^d$. Combining this with Theorem 6.11 we obtain that with probability of at least $1 - \delta$,

$$|L_S(h) - L_{\mathcal{D}}(h)| \leq \frac{4 + \sqrt{d\log(2em/d)}}{\delta\sqrt{2m}}$$


For simplicity assume that $\sqrt{d\log(2em/d)} \geq 4$; hence,

$$|L_S(h) - L_{\mathcal{D}}(h)| \leq \frac{1}{\delta}\sqrt{\frac{2d\log(2em/d)}{m}}$$


To ensure that the preceding is at most $\epsilon$ we need that

$$m \geq \frac{2d\log(m)}{(\delta\epsilon)^2} + \frac{2d\log(2e/d)}{(\delta\epsilon)^2}$$


Standard algebraic manipulations (see Lemma A.2 in Appendix A) show that a sufficient condition for the preceding to hold is that

$$m \geq 4\frac{2d}{(\delta\epsilon)^2}\log\left(\frac{2d}{(\delta\epsilon)^2}\right) + \frac{4d\log(2e/d)}{(\delta\epsilon)^2}$$


Remark 6.4 The upper bound on $m^{UC}_{\mathcal{H}}$ we derived in the proof of Theorem 6.7 is not the tightest possible. A tighter analysis that yields the bounds given in Theorem 6.8 can be found in Chapter 28.


Proof of Theorem 6.11 *

We will start by showing that

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] \leq \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\sqrt{2m}} \tag{6.4}$$


Since the random variable $\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|$ is nonnegative, the proof of the theorem follows directly from the preceding using Markov's inequality (see Section B.1).


To bound the left-hand side of Equation (6.4) we first note that for every $h \in \mathcal{H}$, we can rewrite $L_{\mathcal{D}}(h) = \mathbb{E}_{S' \sim \mathcal{D}^m}[L_{S'}(h)]$, where $S' = z'_1, \dots, z'_m$ is an additional i.i.d. sample. Therefore,

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] = \mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \left|\mathbb{E}_{S' \sim \mathcal{D}^m}[L_{S'}(h)] - L_S(h)\right|\right]$$


A generalization of the triangle inequality yields

$$\left|\mathbb{E}_{S' \sim \mathcal{D}^m}[L_{S'}(h) - L_S(h)]\right| \leq \mathbb{E}_{S' \sim \mathcal{D}^m}\left|L_{S'}(h) - L_S(h)\right|$$

and the fact that the supremum of an expectation is smaller than the expectation of the supremum yields

$$\sup_{h \in \mathcal{H}} \mathbb{E}_{S' \sim \mathcal{D}^m}\left|L_{S'}(h) - L_S(h)\right| \leq \mathbb{E}_{S' \sim \mathcal{D}^m}\sup_{h \in \mathcal{H}}\left|L_{S'}(h) - L_S(h)\right|$$


Formally, the previous two inequalities follow from Jensen's inequality. Combining all we obtain

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] \leq \mathbb{E}_{S,S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{S'}(h) - L_S(h)|\right] = \mathbb{E}_{S,S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right] \tag{6.5}$$



The expectation on the right-hand side is over a choice of two i.i.d. samples $S = z_1, \dots, z_m$ and $S' = z'_1, \dots, z'_m$. Since all of these $2m$ vectors are chosen i.i.d., nothing will change if we replace the name of the random vector $z_i$ with the name of the random vector $z'_i$. If we do it, instead of the term $(\ell(h,z'_i) - \ell(h,z_i))$ in Equation (6.5) we will have the term $(\ell(h,z_i) - \ell(h,z'_i))$. It follows that for every $\sigma \in \{\pm 1\}^m$ we have that Equation (6.5) equals

$$\mathbb{E}_{S,S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right]$$


Since this holds for every $\sigma \in \{\pm 1\}^m$, it also holds if we sample each component of $\sigma$ uniformly at random from the uniform distribution over $\{\pm 1\}$, denoted $U_{\pm}$.


Hence, Equation (6.5) also equals

$$\mathbb{E}_{\sigma \sim U_{\pm}^m}\mathbb{E}_{S,S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right]$$

and by the linearity of expectation it also equals

$$\mathbb{E}_{S,S' \sim \mathcal{D}^m}\mathbb{E}_{\sigma \sim U_{\pm}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right]$$


Next, fix $S$ and $S'$, and let $C$ be the instances appearing in $S$ and $S'$. Then, we can take the supremum only over $h \in \mathcal{H}_C$. Therefore,

$$\mathbb{E}_{\sigma \sim U_{\pm}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right] = \mathbb{E}_{\sigma \sim U_{\pm}^m}\left[\max_{h \in \mathcal{H}_C} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right]$$


Fix some $h \in \mathcal{H}_C$ and denote $\theta_h = \frac{1}{m}\sum_{i=1}^m \sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)$. Since $\mathbb{E}[\theta_h] = 0$ and $\theta_h$ is an average of independent variables, each of which takes values in $[-1,1]$, we have by Hoeffding's inequality that for every $\rho > 0$,

$$\mathbb{P}[|\theta_h| > \rho] \leq 2\exp(-2m\rho^2)$$

Applying the union bound over $h \in \mathcal{H}_C$, we obtain that for any $\rho > 0$,

$$\mathbb{P}\left[\max_{h \in \mathcal{H}_C} |\theta_h| > \rho\right] \leq 2|\mathcal{H}_C|\exp(-2m\rho^2)$$

Finally, Lemma A.4 in Appendix A tells us that the preceding implies

$$\mathbb{E}\left[\max_{h \in \mathcal{H}_C} |\theta_h|\right] \leq \frac{4 + \sqrt{\log(|\mathcal{H}_C|)}}{\sqrt{2m}}$$


Combining all with the definition of $\tau_{\mathcal{H}}$, we have shown that

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] \leq \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\sqrt{2m}}$$

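To see the bound of Equation (6.4) at work, here is an illustrative Monte Carlo check (ours, not from the book) for the threshold class, where $\tau_{\mathcal{H}}(2m) = 2m + 1$ and the supremum over thresholds can be computed exactly on a sample:

```python
import math, random

def sup_gap(sample, a_star=0.5):
    """sup over thresholds a of |L_D(h_a) - L_S(h_a)| for the uniform marginal
    on [0, 1] with true labels 1[x < a_star]; checking one threshold per gap
    (plus the endpoints 0 and 1) suffices."""
    xs = sorted(x for x, _ in sample)
    cands = [0.0] + [(p + q) / 2 for p, q in zip(xs, xs[1:])] + [1.0]
    m = len(sample)
    gaps = []
    for a in cands:
        L_D = abs(a - a_star)                               # true 0-1 risk of h_a
        L_S = sum(int(x < a) != y for x, y in sample) / m   # empirical risk
        gaps.append(abs(L_D - L_S))
    return max(gaps)

random.seed(0)
for m in (50, 500, 5000):
    runs = []
    for _ in range(20):
        S = [(x, int(x < 0.5)) for x in (random.random() for _ in range(m))]
        runs.append(sup_gap(S))
    bound = (4 + math.sqrt(math.log(2 * m + 1))) / math.sqrt(2 * m)
    print(m, round(sum(runs) / len(runs), 4), "<=", round(bound, 4))
# the expected sup-gap decays roughly like 1/sqrt(m), as Equation (6.4) predicts
```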

6.6 Summary

The fundamental theorem of learning theory characterizes PAC learnability of classes of binary classifiers using VC-dimension. The VC-dimension of a class is a combinatorial property that denotes the maximal sample size that can be shattered by the class. The fundamental theorem states that a class is PAC learnable if and only if its VC-dimension is finite and specifies the sample complexity required for PAC learning. The theorem also shows that if a problem is at all learnable, then uniform convergence holds and therefore the problem is learnable using the ERM rule.
