
Chapter 6 - Understanding Machine Learning From Theory to Algorithms: The VC-Dimension

tags: Understanding Machine Learning From Theory to Algorithms


The VC-Dimension

In the previous chapter, we decomposed the error of the $ERM_{\mathcal{H}}$ rule into approximation error and estimation error. The approximation error depends on the fit of our prior knowledge (as reflected by the choice of the hypothesis class $\mathcal{H}$) to the underlying unknown distribution. In contrast, the definition of PAC learnability requires that the estimation error be bounded uniformly over all distributions.


Our current goal is to figure out which classes $\mathcal{H}$ are PAC learnable, and to characterize exactly the sample complexity of learning a given hypothesis class. So far we have seen that finite classes are learnable, but that the class of all functions (over an infinite size domain) is not. What makes one class learnable and the other unlearnable? Can infinite-size classes be learnable, and, if so, what determines their sample complexity?


We begin the chapter by showing that infinite classes can indeed be learnable, and thus, finiteness of the hypothesis class is not a necessary condition for learnability. We then present a remarkably crisp characterization of the family of learnable classes in the setup of binary valued classification with the zero-one loss. This characterization was first discovered by Vladimir Vapnik and Alexey Chervonenkis in 1970 and relies on a combinatorial notion called the Vapnik-Chervonenkis dimension (VC-dimension). We formally define the VC-dimension, provide several examples, and then state the fundamental theorem of statistical learning theory, which integrates the concepts of learnability, VC-dimension, the ERM rule, and uniform convergence.


6.1 Infinite-Size Classes Can Be Learnable

In Chapter 4 we saw that finite classes are learnable, and in fact the sample complexity of a hypothesis class is upper bounded by the log of its size. To show that the size of the hypothesis class is not the right characterization of its sample complexity, we first present a simple example of an infinite-size hypothesis class that is learnable.


Example 6.1 Let $\mathcal{H}$ be the set of threshold functions over the real line, namely, $\mathcal{H} = \{h_a : a \in \mathbb{R}\}$, where $h_a : \mathbb{R} \to \{0,1\}$ is a function such that $h_a(x) = \mathbb{1}_{[x < a]}$. To remind the reader, $\mathbb{1}_{[x < a]}$ is $1$ if $x < a$ and $0$ otherwise. Clearly, $\mathcal{H}$ is of infinite size. Nevertheless, the following lemma shows that $\mathcal{H}$ is learnable in the PAC model using the ERM algorithm.


Lemma 6.1 Let $\mathcal{H}$ be the class of thresholds as defined earlier. Then, $\mathcal{H}$ is PAC learnable, using the ERM rule, with sample complexity of $m_{\mathcal{H}}(\epsilon,\delta) \leq \lceil \log(2/\delta)/\epsilon \rceil$.

Proof Let $a^{\star}$ be a threshold such that the hypothesis $h^{\star}(x) = \mathbb{1}_{[x < a^{\star}]}$ achieves $L_{\mathcal{D}}(h^{\star}) = 0$. Let $\mathcal{D}_x$ be the marginal distribution over the domain $\mathcal{X}$ and let $a_0 < a^{\star} < a_1$ be such that

$$\mathbb{P}_{x \sim \mathcal{D}_x}[x \in (a_0, a^{\star})] = \mathbb{P}_{x \sim \mathcal{D}_x}[x \in (a^{\star}, a_1)] = \epsilon$$



(If $\mathcal{D}_x(-\infty, a^{\star}) \leq \epsilon$ we set $a_0 = -\infty$, and similarly for $a_1$.) Given a training set $S$, let $b_0 = \max\{x : (x,1) \in S\}$ and $b_1 = \min\{x : (x,0) \in S\}$ (if no example in $S$ is positive we set $b_0 = -\infty$, and if no example in $S$ is negative we set $b_1 = \infty$). Let $b_S$ be a threshold corresponding to an ERM hypothesis, $h_S$, which implies that $b_S \in (b_0, b_1)$. Therefore, a sufficient condition for $L_{\mathcal{D}}(h_S) \leq \epsilon$ is that both $b_0 \geq a_0$ and $b_1 \leq a_1$. In other words,

$$\mathbb{P}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S) > \epsilon] \leq \mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0 \lor b_1 > a_1]$$

and using the union bound we can bound the preceding by

$$\mathbb{P}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S) > \epsilon] \leq \mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0] + \mathbb{P}_{S \sim \mathcal{D}^m}[b_1 > a_1] \tag{6.1}$$


The event $b_0 < a_0$ happens if and only if all examples in $S$ are not in the interval $(a_0, a^{\star})$, whose probability mass is defined to be $\epsilon$, namely,

$$\mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0] = \mathbb{P}_{S \sim \mathcal{D}^m}[\forall (x,y) \in S,\ x \notin (a_0, a^{\star})] = (1-\epsilon)^m \leq e^{-\epsilon m}$$


Since we assume $m > \log(2/\delta)/\epsilon$ it follows that the preceding quantity is at most $\delta/2$. In the same way it is easy to see that $\mathbb{P}_{S \sim \mathcal{D}^m}[b_1 > a_1] \leq \delta/2$. Combining with Equation (6.1) we conclude our proof.

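The proof also dictates a concrete learner. The Python sketch below is ours, not from the book (the names `erm_threshold`, `a_star` are made up for illustration): it implements the ERM rule for the threshold class and lets you watch the true error shrink as $m$ grows, in line with the $\lceil \log(2/\delta)/\epsilon \rceil$ sample complexity.

```python
import random

def erm_threshold(sample):
    """ERM for H = {x -> 1[x < a]} on a realizable sample: any threshold
    in (b0, b1] has zero empirical risk, where b0 is the rightmost positive
    point and b1 the leftmost negative point."""
    b0 = max((x for x, y in sample if y == 1), default=None)
    b1 = min((x for x, y in sample if y == 0), default=None)
    if b0 is None:
        return b1            # no positive examples: predict 0 on every sample point
    if b1 is None:
        return b0 + 1.0      # no negative examples: predict 1 on every sample point
    assert b0 < b1, "sample is not realizable by a threshold"
    return (b0 + b1) / 2

random.seed(0)
a_star = 0.3                 # true concept: x -> 1[x < 0.3]
for m in (10, 100, 1000):
    S = [(x, int(x < a_star)) for x in (random.random() for _ in range(m))]
    a_hat = erm_threshold(S)
    # under the uniform marginal on [0, 1], L_D(h_S) = |a_hat - a_star|
    print(m, abs(a_hat - a_star))
```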

6.2 The VC-Dimension

We see, therefore, that while finiteness of $\mathcal{H}$ is a sufficient condition for learnability, it is not a necessary condition. As we will show, a property called the VC-dimension of a hypothesis class gives the correct characterization of its learnability. To motivate the definition of the VC-dimension, let us recall the No-Free-Lunch theorem (Theorem 5.1) and its proof. There, we have shown that without restricting the hypothesis class, for any learning algorithm, an adversary can construct a distribution for which the learning algorithm will perform poorly, while there is another learning algorithm that will succeed on the same distribution. To do so, the adversary used a finite set $C \subset \mathcal{X}$ and considered a family of distributions that are concentrated on elements of $C$. Each distribution was derived from a "true" target function from $C$ to $\{0,1\}$. To make any algorithm fail, the adversary used the power of choosing a target function from the set of all possible functions from $C$ to $\{0,1\}$.


When considering PAC learnability of a hypothesis class $\mathcal{H}$, the adversary is restricted to constructing distributions for which some hypothesis $h \in \mathcal{H}$ achieves a zero risk. Since we are considering distributions that are concentrated on elements of $C$, we should study how $\mathcal{H}$ behaves on $C$, which leads to the following definition.


DEFINITION 6.2 (Restriction of $\mathcal{H}$ to $C$) Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to $\{0,1\}$ and let $C = \{c_1, \dots, c_m\} \subset \mathcal{X}$. The restriction of $\mathcal{H}$ to $C$ is the set of functions from $C$ to $\{0,1\}$ that can be derived from $\mathcal{H}$. That is,

$$\mathcal{H}_C = \{(h(c_1), \dots, h(c_m)) : h \in \mathcal{H}\}$$

where we represent each function from $C$ to $\{0,1\}$ as a vector in $\{0,1\}^{|C|}$.

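For a concrete feel of Definition 6.2, here is a small brute-force sketch (ours; `restriction_thresholds` is a hypothetical helper, not book code) that computes $\mathcal{H}_C$ for the threshold class by trying one candidate threshold per "gap" between points:

```python
def restriction_thresholds(C):
    """Return H_C for H = {x -> 1[x < a]} as a set of label vectors.
    One threshold below all points, one in each gap, and one above all
    points is enough to realize every achievable labeling."""
    pts = sorted(C)
    cands = [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]
    return {tuple(int(c < a) for c in C) for a in cands}

print(restriction_thresholds([0.1, 0.5, 0.9]))
# {(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)}: only m + 1 of the 2^m vectors
```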

If the restriction of $\mathcal{H}$ to $C$ is the set of all functions from $C$ to $\{0,1\}$, then we say that $\mathcal{H}$ shatters the set $C$. Formally:

DEFINITION 6.3 (Shattering) A hypothesis class $\mathcal{H}$ shatters a finite set $C \subset \mathcal{X}$ if the restriction of $\mathcal{H}$ to $C$ is the set of all functions from $C$ to $\{0,1\}$. That is, $|\mathcal{H}_C| = 2^{|C|}$.


Example 6.2 Let $\mathcal{H}$ be the class of threshold functions over $\mathbb{R}$. Take a set $C = \{c_1\}$. Now, if we take $a = c_1 + 1$, then we have $h_a(c_1) = 1$, and if we take $a = c_1 - 1$, then we have $h_a(c_1) = 0$. Therefore, $\mathcal{H}_C$ is the set of all functions from $C$ to $\{0,1\}$, and $\mathcal{H}$ shatters $C$. Now take a set $C = \{c_1, c_2\}$, where $c_1 \leq c_2$. No $h \in \mathcal{H}$ can account for the labeling $(0,1)$, because any threshold that assigns the label $0$ to $c_1$ must assign the label $0$ to $c_2$ as well. Therefore not all functions from $C$ to $\{0,1\}$ are included in $\mathcal{H}_C$; hence $C$ is not shattered by $\mathcal{H}$.

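Shattering is a finite check: $C$ is shattered iff $|\mathcal{H}_C| = 2^{|C|}$. Using the `restriction_thresholds` sketch from above, Example 6.2 can be verified mechanically (again purely illustrative):

```python
def shatters_thresholds(C):
    # C is shattered iff the restriction contains all 2^|C| label vectors
    return len(restriction_thresholds(C)) == 2 ** len(C)

print(shatters_thresholds([0.5]))        # True: a single point is shattered
print(shatters_thresholds([0.2, 0.7]))   # False: the labeling (0, 1) is unreachable
```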

Getting back to the construction of an adversarial distribution as in the proof of the No-Free-Lunch theorem (Theorem 5.1), we see that whenever some set $C$ is shattered by $\mathcal{H}$, the adversary is not restricted by $\mathcal{H}$, as they can construct a distribution over $C$ based on any target function from $C$ to $\{0,1\}$, while still maintaining the realizability assumption. This immediately yields:

COROLLARY 6.4 Let $\mathcal{H}$ be a hypothesis class of functions from $\mathcal{X}$ to $\{0,1\}$. Let $m$ be a training set size. Assume that there exists a set $C \subset \mathcal{X}$ of size $2m$ that is shattered by $\mathcal{H}$. Then, for any learning algorithm, $A$, there exist a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$ and a predictor $h \in \mathcal{H}$ such that $L_{\mathcal{D}}(h) = 0$ but with probability of at least $1/7$ over the choice of $S \sim \mathcal{D}^m$ we have that $L_{\mathcal{D}}(A(S)) \geq 1/8$.


Corollary 6.4 tells us that if $\mathcal{H}$ shatters some set $C$ of size $2m$ then we cannot learn $\mathcal{H}$ using $m$ examples. Intuitively, if a set $C$ is shattered by $\mathcal{H}$, and we receive a sample containing half the instances of $C$, the labels of these instances give us no information about the labels of the rest of the instances in $C$ - every possible labeling of the rest of the instances can be explained by some hypothesis in $\mathcal{H}$. Philosophically,

If someone can explain every phenomenon, his explanations are worthless.


This leads us directly to the definition of the VC dimension.

DEFINITION 6.5 (VC-dimension) The VC-dimension of a hypothesis class $\mathcal{H}$, denoted $VCdim(\mathcal{H})$, is the maximal size of a set $C \subset \mathcal{X}$ that can be shattered by $\mathcal{H}$. If $\mathcal{H}$ can shatter sets of arbitrarily large size we say that $\mathcal{H}$ has infinite VC-dimension.


A direct consequence of Corollary 6.4 is therefore:

THEOREM 6.6 Let $\mathcal{H}$ be a class of infinite VC-dimension. Then, $\mathcal{H}$ is not PAC learnable.


Proof Since $\mathcal{H}$ has an infinite VC-dimension, for any training set size $m$, there exists a shattered set of size $2m$, and the claim follows by Corollary 6.4.


We shall see later in this chapter that the converse is also true: A finite VC-dimension guarantees learnability. Hence, the VC-dimension characterizes PAC learnability. But before delving into more theory, we first show several examples.


6.3 Examples

In this section we calculate the VC-dimension of several hypothesis classes. To show that $VCdim(\mathcal{H}) = d$ we need to show that:

  1. There exists a set $C$ of size $d$ that is shattered by $\mathcal{H}$.
  2. Every set $C$ of size $d+1$ is not shattered by $\mathcal{H}$.


6.3.1 Threshold Functions

Let $\mathcal{H}$ be the class of threshold functions over $\mathbb{R}$. Recall Example 6.2, where we have shown that for an arbitrary set $C = \{c_1\}$, $\mathcal{H}$ shatters $C$; therefore $VCdim(\mathcal{H}) \geq 1$. We have also shown that for an arbitrary set $C = \{c_1, c_2\}$ where $c_1 \leq c_2$, $\mathcal{H}$ does not shatter $C$. We therefore conclude that $VCdim(\mathcal{H}) = 1$.


6.3.2 Intervals

Let $\mathcal{H}$ be the class of intervals over $\mathbb{R}$, namely, $\mathcal{H} = \{h_{a,b} : a, b \in \mathbb{R}, a < b\}$, where $h_{a,b} : \mathbb{R} \to \{0,1\}$ is a function such that $h_{a,b}(x) = \mathbb{1}_{[x \in (a,b)]}$. Take the set $C = \{1, 2\}$. Then, $\mathcal{H}$ shatters $C$ (make sure you understand why) and therefore $VCdim(\mathcal{H}) \geq 2$. Now take an arbitrary set $C = \{c_1, c_2, c_3\}$ and assume without loss of generality that $c_1 \leq c_2 \leq c_3$. Then, the labeling $(1, 0, 1)$ cannot be obtained by an interval and therefore $\mathcal{H}$ does not shatter $C$. We therefore conclude that $VCdim(\mathcal{H}) = 2$.

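The "no $(1,0,1)$" observation gives a mechanical test (our brute-force sketch, not book code): a labeling is realizable by an interval iff no negative point falls strictly between the smallest and largest positive points, so we can enumerate every labeling to test shattering.

```python
from itertools import product

def interval_realizable(points, labels):
    """h_{a,b}(x) = 1[x in (a, b)]: realizable iff no 0-labeled point lies
    strictly between the smallest and largest 1-labeled points."""
    pos = [x for x, y in zip(points, labels) if y == 1]
    if not pos:
        return True          # an empty interval realizes the all-zeros labeling
    lo, hi = min(pos), max(pos)
    return all(not (lo < x < hi) for x, y in zip(points, labels) if y == 0)

def shattered_by_intervals(points):
    return all(interval_realizable(points, lab)
               for lab in product([0, 1], repeat=len(points)))

print(shattered_by_intervals([1, 2]))     # True  -> VCdim >= 2
print(shattered_by_intervals([1, 2, 3]))  # False -> (1, 0, 1) fails, so VCdim = 2
```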

6.3.3 Axis Aligned Rectangles

Let $\mathcal{H}$ be the class of axis aligned rectangles, formally:

$$\mathcal{H} = \{h_{(a_1,a_2,b_1,b_2)} : a_1 \leq a_2 \text{ and } b_1 \leq b_2\}$$

where

$$h_{(a_1,a_2,b_1,b_2)}(x_1,x_2) = \begin{cases} 1 & \text{if } a_1 \leq x_1 \leq a_2 \text{ and } b_1 \leq x_2 \leq b_2 \\ 0 & \text{otherwise} \end{cases}$$


We shall show in the following that $VCdim(\mathcal{H}) = 4$. To prove this we need to find a set of 4 points that are shattered by $\mathcal{H}$ and show that no set of 5 points can be shattered by $\mathcal{H}$. Finding a set of 4 points that are shattered is easy (see Figure 6.1). Now, consider any set $C \subset \mathbb{R}^2$ of 5 points. In $C$, take a leftmost point (whose first coordinate is the smallest in $C$), a rightmost point (first coordinate is the largest), a lowest point (second coordinate is the smallest), and a highest point (second coordinate is the largest). Without loss of generality, denote $C = \{c_1, \dots, c_5\}$ and let $c_5$ be the point that was not selected. Now, define the labeling $(1, 1, 1, 1, 0)$. It is impossible to obtain this labeling by an axis aligned rectangle. Indeed, such a rectangle must contain $c_1, \dots, c_4$; but in this case the rectangle contains $c_5$ as well, because its coordinates are within the intervals defined by the selected points. So, $C$ is not shattered by $\mathcal{H}$, and therefore $VCdim(\mathcal{H}) = 4$.


Figure 6.1 Left: 4 points that are shattered by axis aligned rectangles. Right: Any axis aligned rectangle cannot label $c_5$ by 0 and the rest of the points by 1.

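The geometric argument reduces to a bounding-box test, which makes a brute-force check easy (our illustrative sketch): a labeling is realizable iff the axis aligned bounding box of the positive points contains no negative point.

```python
from itertools import product

def rect_realizable(points, labels):
    """Realizable iff the bounding box of the 1-labeled points contains
    no 0-labeled point (the box itself is then a witness rectangle)."""
    pos = [p for p, y in zip(points, labels) if y == 1]
    if not pos:
        return True
    x1, x2 = min(p[0] for p in pos), max(p[0] for p in pos)
    y1, y2 = min(p[1] for p in pos), max(p[1] for p in pos)
    return all(not (x1 <= p[0] <= x2 and y1 <= p[1] <= y2)
               for p, y in zip(points, labels) if y == 0)

def shattered_by_rects(points):
    return all(rect_realizable(points, lab)
               for lab in product([0, 1], repeat=len(points)))

four = [(0, 1), (1, 0), (2, 1), (1, 2)]     # a "diamond", as in Figure 6.1
print(shattered_by_rects(four))             # True  -> VCdim >= 4
print(shattered_by_rects(four + [(1, 1)]))  # False -> (1,1,1,1,0) fails
```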

6.3.4 Finite Classes

Let $\mathcal{H}$ be a finite class. Then, clearly, for any set $C$ we have $|\mathcal{H}_C| \leq |\mathcal{H}|$ and thus $C$ cannot be shattered if $|\mathcal{H}| < 2^{|C|}$. This implies that $VCdim(\mathcal{H}) \leq \log_2(|\mathcal{H}|)$. This shows that the PAC learnability of finite classes follows from the more general statement of PAC learnability of classes with finite VC-dimension, which we shall see in the next section. Note, however, that the VC-dimension of a finite class $\mathcal{H}$ can be significantly smaller than $\log_2(|\mathcal{H}|)$. For example, let $\mathcal{X} = \{1, \dots, k\}$, for some integer $k$, and consider the class of threshold functions (as defined in Example 6.2). Then, $|\mathcal{H}| = k$ but $VCdim(\mathcal{H}) = 1$. Since $k$ can be arbitrarily large, the gap between $\log_2(|\mathcal{H}|)$ and $VCdim(\mathcal{H})$ can be arbitrarily large.
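This gap is easy to see numerically. In the sketch below (ours; counting conventions at the domain boundary may shift $|\mathcal{H}|$ by one), thresholds over $\mathcal{X} = \{1, \dots, k\}$ yield a class whose size grows with $k$ while no two-point set is ever shattered:

```python
import math

def threshold_class_on(k):
    """Distinct restrictions of x -> 1[x < a] to X = {1, ..., k}:
    one candidate threshold per cut position suffices."""
    X = range(1, k + 1)
    cands = [i + 0.5 for i in range(k + 1)]
    return {tuple(int(x < a) for x in X) for a in cands}

for k in (4, 16, 256):
    H = threshold_class_on(k)
    # log2|H| grows without bound, yet VCdim stays 1 (Section 6.3.1)
    print(k, len(H), round(math.log2(len(H)), 2))
```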


6.3.5 VC-Dimension and the Number of Parameters

In the previous examples, the VC-dimension happened to equal the number of parameters defining the hypothesis class. While this is often the case, it is not always true. Consider, for example, the domain $\mathcal{X} = \mathbb{R}$, and the hypothesis class $\mathcal{H} = \{h_\theta : \theta \in \mathbb{R}\}$ where $h_\theta : \mathcal{X} \to \{0,1\}$ is defined by $h_\theta(x) = \lceil 0.5 \sin(\theta x) \rceil$. It is possible to prove that $VCdim(\mathcal{H}) = \infty$, namely, for every $d$, one can find $d$ points that are shattered by $\mathcal{H}$ (see Exercise 8).


6.4 The Fundamental Theorem of PAC learning

We have already shown that a class of infinite VC-dimension is not learnable. The converse statement is also true, leading to the fundamental theorem of statistical learning theory:

THEOREM 6.7 (The Fundamental Theorem of Statistical Learning) Let $\mathcal{H}$ be a hypothesis class of functions from a domain $\mathcal{X}$ to $\{0,1\}$ and let the loss function be the $0{-}1$ loss. Then, the following are equivalent:

  1. $\mathcal{H}$ has the uniform convergence property.
  2. Any ERM rule is a successful agnostic PAC learner for $\mathcal{H}$.
  3. $\mathcal{H}$ is agnostic PAC learnable.
  4. $\mathcal{H}$ is PAC learnable.
  5. Any ERM rule is a successful PAC learner for $\mathcal{H}$.
  6. $\mathcal{H}$ has a finite VC-dimension.

The proof of the theorem is given in the next section.


Not only does the VC-dimension characterize PAC learnability; it even determines the sample complexity.

THEOREM 6.8 (The Fundamental Theorem of Statistical Learning – Quantitative Version) Let $\mathcal{H}$ be a hypothesis class of functions from a domain $\mathcal{X}$ to $\{0,1\}$ and let the loss function be the $0{-}1$ loss. Assume that $VCdim(\mathcal{H}) = d < \infty$. Then, there are absolute constants $C_1, C_2$ such that:

  1. $\mathcal{H}$ has the uniform convergence property with sample complexity
     $$C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \leq m^{UC}_{\mathcal{H}}(\epsilon,\delta) \leq C_2 \frac{d + \log(1/\delta)}{\epsilon^2}$$
  2. $\mathcal{H}$ is agnostic PAC learnable with sample complexity
     $$C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \leq m_{\mathcal{H}}(\epsilon,\delta) \leq C_2 \frac{d + \log(1/\delta)}{\epsilon^2}$$
  3. $\mathcal{H}$ is PAC learnable with sample complexity
     $$C_1 \frac{d + \log(1/\delta)}{\epsilon} \leq m_{\mathcal{H}}(\epsilon,\delta) \leq C_2 \frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon}$$

The proof of this theorem is given in Chapter 28.

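The constants $C_1, C_2$ are not specified by the theorem, but plugging in $C_2 = 1$ gives a feel for the rates (purely illustrative; the numbers below are not real sample sizes):

```python
import math

def agnostic_upper(d, eps, delta, C2=1.0):
    # illustrative upper bound C2 * (d + log(1/delta)) / eps^2 from Theorem 6.8
    return C2 * (d + math.log(1 / delta)) / eps ** 2

for d in (1, 10, 100):
    print(d, round(agnostic_upper(d, eps=0.1, delta=0.05)))
# grows linearly in the VC-dimension d and quadratically in 1/eps
```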

Remark 6.3 We stated the fundamental theorem for binary classification tasks. A similar result holds for some other learning problems such as regression with the absolute loss or the squared loss. However, the theorem does not hold for all learning tasks. In particular, learnability is sometimes possible even though the uniform convergence property does not hold (we will see an example in Chapter 13, Exercise 2). Furthermore, in some situations, the ERM rule fails but learnability is possible with other learning rules.


6.5 Proof of Theorem 6.7

We have already seen that $1 \rightarrow 2$ in Chapter 4. The implications $2 \rightarrow 3$ and $3 \rightarrow 4$ are trivial and so is $2 \rightarrow 5$. The implications $4 \rightarrow 6$ and $5 \rightarrow 6$ follow from the No-Free-Lunch theorem. The difficult part is to show that $6 \rightarrow 1$. The proof is based on two main claims:

  • If $VCdim(\mathcal{H}) = d$ then even though $\mathcal{H}$ might be infinite, when restricting it to a finite set $C \subset \mathcal{X}$, its "effective" size, $|\mathcal{H}_C|$, is only $O(|C|^d)$. That is, the size of $\mathcal{H}_C$ grows polynomially rather than exponentially with $|C|$. This claim is often referred to as Sauer's lemma, but it has also been stated and proved independently by Shelah and by Perles. The formal statement is given in Section 6.5.1 later.
  • In Chapter 4 we have shown that finite hypothesis classes enjoy the uniform convergence property. In Section 6.5.2 later we generalize this result and show that uniform convergence holds whenever the hypothesis class has a "small effective size." By "small effective size" we mean classes for which $|\mathcal{H}_C|$ grows polynomially with $|C|$.


6.5.1 Sauer's Lemma and the Growth Function

We defined the notion of shattering by considering the restriction of $\mathcal{H}$ to a finite set of instances. The growth function measures the maximal "effective" size of $\mathcal{H}$ on a set of $m$ examples. Formally:

DEFINITION 6.9 (Growth Function) Let $\mathcal{H}$ be a hypothesis class. Then the growth function of $\mathcal{H}$, denoted $\tau_{\mathcal{H}} : \mathbb{N} \to \mathbb{N}$, is defined as

$$\tau_{\mathcal{H}}(m) = \max_{C \subset \mathcal{X} : |C| = m} |\mathcal{H}_C|$$


In words, $\tau_{\mathcal{H}}(m)$ is the number of different functions from a set $C$ of size $m$ to $\{0,1\}$ that can be obtained by restricting $\mathcal{H}$ to $C$.


Obviously, if $VCdim(\mathcal{H}) = d$ then for any $m \leq d$ we have $\tau_{\mathcal{H}}(m) = 2^m$. In such cases, $\mathcal{H}$ induces all possible functions from $C$ to $\{0,1\}$. The following beautiful lemma, proposed independently by Sauer, Shelah, and Perles, shows that when $m$ becomes larger than the VC-dimension, the growth function increases polynomially rather than exponentially with $m$.

LEMMA 6.10 (Sauer-Shelah-Perles) Let $\mathcal{H}$ be a hypothesis class with $VCdim(\mathcal{H}) \leq d < \infty$. Then, for all $m$, $\tau_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i}$. In particular, if $m > d + 1$ then $\tau_{\mathcal{H}}(m) \leq (em/d)^d$.

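The polynomial-versus-exponential contrast in the lemma is easy to tabulate (our sketch):

```python
import math

def sauer_bound(m, d):
    # sum_{i=0}^{d} C(m, i), the bound of Lemma 6.10
    return sum(math.comb(m, i) for i in range(min(m, d) + 1))

d = 3
for m in (2, 4, 8, 16, 32):
    poly = (math.e * m / d) ** d if m > d + 1 else float("nan")
    print(m, 2 ** m, sauer_bound(m, d), round(poly, 1))
# for m <= d the bound equals 2^m; beyond that it grows like m^d, not 2^m
```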

Proof of Sauer's Lemma

To prove the lemma it suffices to prove the following stronger claim: For any $C = \{c_1, \dots, c_m\}$ we have

$$\forall \mathcal{H},\quad |\mathcal{H}_C| \leq |\{B \subseteq C : \mathcal{H} \text{ shatters } B\}| \tag{6.3}$$


The reason why Equation (6.3) is sufficient to prove the lemma is that if $VCdim(\mathcal{H}) \leq d$ then no set whose size is larger than $d$ is shattered by $\mathcal{H}$ and therefore

$$|\{B \subseteq C : \mathcal{H} \text{ shatters } B\}| \leq \sum_{i=0}^{d} \binom{m}{i}$$

When $m > d + 1$ the right-hand side of the preceding is at most $(em/d)^d$ (see Lemma A.5 in Appendix A).


We are left with proving Equation (6.3) and we do it using an inductive argument. For $m = 1$, no matter what $\mathcal{H}$ is, either both sides of Equation (6.3) equal 1 or both sides equal 2 (the empty set is always considered to be shattered by $\mathcal{H}$). Assume Equation (6.3) holds for sets of size $k < m$ and let us prove it for sets of size $m$. Fix $\mathcal{H}$ and $C = \{c_1, \dots, c_m\}$. Denote $C' = \{c_2, \dots, c_m\}$ and in addition, define the following two sets:

$$Y_0 = \{(y_2, \dots, y_m) : (0, y_2, \dots, y_m) \in \mathcal{H}_C \lor (1, y_2, \dots, y_m) \in \mathcal{H}_C\}$$

and

$$Y_1 = \{(y_2, \dots, y_m) : (0, y_2, \dots, y_m) \in \mathcal{H}_C \land (1, y_2, \dots, y_m) \in \mathcal{H}_C\}$$


It is easy to verify that $|\mathcal{H}_C| = |Y_0| + |Y_1|$. Additionally, since $Y_0 = \mathcal{H}_{C'}$, using the induction assumption (applied on $\mathcal{H}$ and $C'$) we have that

$$|Y_0| = |\mathcal{H}_{C'}| \leq |\{B \subseteq C' : \mathcal{H} \text{ shatters } B\}| = |\{B \subseteq C : c_1 \notin B \land \mathcal{H} \text{ shatters } B\}|$$


Next, define $\mathcal{H}' \subseteq \mathcal{H}$ to be

$$\mathcal{H}' = \{h \in \mathcal{H} : \exists h' \in \mathcal{H} \text{ s.t. } (1 - h'(c_1), h'(c_2), \dots, h'(c_m)) = (h(c_1), h(c_2), \dots, h(c_m))\}$$

namely, $\mathcal{H}'$ contains pairs of hypotheses that agree on $C'$ and differ on $c_1$. Using this definition, it is clear that if $\mathcal{H}'$ shatters a set $B \subseteq C'$ then it also shatters the set $B \cup \{c_1\}$ and vice versa. Combining this with the fact that $Y_1 = \mathcal{H}'_{C'}$ and using the inductive assumption (now applied on $\mathcal{H}'$ and $C'$) we obtain that

$$|Y_1| = |\mathcal{H}'_{C'}| \leq |\{B \subseteq C' : \mathcal{H}' \text{ shatters } B\}| = |\{B \subseteq C' : \mathcal{H}' \text{ shatters } B \cup \{c_1\}\}| = |\{B \subseteq C : c_1 \in B \land \mathcal{H}' \text{ shatters } B\}| \leq |\{B \subseteq C : c_1 \in B \land \mathcal{H} \text{ shatters } B\}|$$


Overall, we have shown that

$$|\mathcal{H}_C| = |Y_0| + |Y_1| \leq |\{B \subseteq C : c_1 \notin B \land \mathcal{H} \text{ shatters } B\}| + |\{B \subseteq C : c_1 \in B \land \mathcal{H} \text{ shatters } B\}| = |\{B \subseteq C : \mathcal{H} \text{ shatters } B\}|$$

which concludes our proof.

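Claim (6.3) can also be sanity-checked by brute force on a tiny domain (our sketch; the class is represented directly by its set of label vectors on $C$):

```python
import random
from itertools import combinations, product

def shattered_subsets(H, m):
    """All B subseteq {0, ..., m-1} whose projections of H realize every
    pattern in {0,1}^|B| (the empty set always qualifies)."""
    out = []
    for r in range(m + 1):
        for B in combinations(range(m), r):
            patterns = {tuple(h[i] for i in B) for h in H}
            if len(patterns) == 2 ** r:
                out.append(B)
    return out

random.seed(1)
m = 4
H = set(random.sample(list(product([0, 1], repeat=m)), 6))  # a random class on C
print(len(H), "<=", len(shattered_subsets(H, m)))  # Equation (6.3) in action
```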

6.5.2 Uniform Convergence for Classes of Small Effective Size

In this section we prove that if $\mathcal{H}$ has small effective size then it enjoys the uniform convergence property. Formally,

THEOREM 6.11 Let $\mathcal{H}$ be a class and let $\tau_{\mathcal{H}}$ be its growth function. Then, for every $\mathcal{D}$ and every $\delta \in (0,1)$, with probability of at least $1 - \delta$ over the choice of $S \sim \mathcal{D}^m$ we have

$$\forall h \in \mathcal{H},\quad |L_{\mathcal{D}}(h) - L_S(h)| \leq \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\delta\sqrt{2m}}$$


Before proving the theorem, let us first conclude the proof of Theorem 6.7.

Proof of Theorem 6.7 It suffices to prove that if the VC-dimension is finite then the uniform convergence property holds. We will prove that

$$m^{UC}_{\mathcal{H}}(\epsilon,\delta) \leq 4\frac{16d}{(\delta\epsilon)^2}\log\left(\frac{16d}{(\delta\epsilon)^2}\right) + \frac{16d\log(2e/d)}{(\delta\epsilon)^2}$$


From Sauer's lemma we have that for $m > d$, $\tau_{\mathcal{H}}(2m) \leq (2em/d)^d$. Combining this with Theorem 6.11 we obtain that with probability of at least $1 - \delta$,

$$|L_S(h) - L_{\mathcal{D}}(h)| \leq \frac{4 + \sqrt{d\log(2em/d)}}{\delta\sqrt{2m}}$$


For simplicity assume that $\sqrt{d\log(2em/d)} \geq 4$; hence,

$$|L_S(h) - L_{\mathcal{D}}(h)| \leq \frac{1}{\delta}\sqrt{\frac{2d\log(2em/d)}{m}}$$


To ensure that the preceding is at most $\epsilon$ we need that

$$m \geq \frac{2d\log(m)}{(\delta\epsilon)^2} + \frac{2d\log(2e/d)}{(\delta\epsilon)^2}$$


Standard algebraic manipulations (see Lemma A.2 in Appendix A) show that a sufficient condition for the preceding to hold is that

$$m \geq 4\frac{2d}{(\delta\epsilon)^2}\log\left(\frac{2d}{(\delta\epsilon)^2}\right) + \frac{4d\log(2e/d)}{(\delta\epsilon)^2}$$


Remark 6.4 The upper bound on $m^{UC}_{\mathcal{H}}$ we derived in the proof of Theorem 6.7 is not the tightest possible. A tighter analysis that yields the bounds given in Theorem 6.8 can be found in Chapter 28.


Proof of Theorem 6.11 *

We will start by showing that

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] \leq \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\sqrt{2m}} \tag{6.4}$$


Since the random variable $\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|$ is nonnegative, the proof of the theorem follows directly from the preceding using Markov's inequality (see Section B.1).


To bound the left-hand side of Equation (6.4) we first note that for every $h \in \mathcal{H}$, we can rewrite $L_{\mathcal{D}}(h) = \mathbb{E}_{S' \sim \mathcal{D}^m}[L_{S'}(h)]$, where $S' = z'_1, \dots, z'_m$ is an additional i.i.d. sample. Therefore,

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] = \mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \left|\mathbb{E}_{S' \sim \mathcal{D}^m}[L_{S'}(h)] - L_S(h)\right|\right]$$


A generalization of the triangle inequality yields

$$\left|\mathbb{E}_{S' \sim \mathcal{D}^m}[L_{S'}(h) - L_S(h)]\right| \leq \mathbb{E}_{S' \sim \mathcal{D}^m}\left|L_{S'}(h) - L_S(h)\right|$$

and the fact that the supremum of an expectation is smaller than the expectation of the supremum yields

$$\sup_{h \in \mathcal{H}} \mathbb{E}_{S' \sim \mathcal{D}^m}\left|L_{S'}(h) - L_S(h)\right| \leq \mathbb{E}_{S' \sim \mathcal{D}^m}\sup_{h \in \mathcal{H}}\left|L_{S'}(h) - L_S(h)\right|$$


Formally, the previous two inequalities follow from Jensen's inequality. Combining all we obtain

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] \leq \mathbb{E}_{S,S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{S'}(h) - L_S(h)|\right] = \mathbb{E}_{S,S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right] \tag{6.5}$$



The expectation on the right-hand side is over a choice of two i.i.d. samples $S = z_1, \dots, z_m$ and $S' = z'_1, \dots, z'_m$. Since all of these $2m$ vectors are chosen i.i.d., nothing will change if we replace the name of the random vector $z_i$ with the name of the random vector $z'_i$. If we do it, instead of the term $(\ell(h,z'_i) - \ell(h,z_i))$ in Equation (6.5) we will have the term $(\ell(h,z_i) - \ell(h,z'_i))$. It follows that for every $\sigma \in \{\pm 1\}^m$ we have that Equation (6.5) equals

$$\mathbb{E}_{S,S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right]$$


Since this holds for every $\sigma \in \{\pm 1\}^m$, it also holds if we sample each component of $\sigma$ uniformly at random from the uniform distribution over $\{\pm 1\}$, denoted $U_{\pm}$.


Hence, Equation (6.5) also equals

$$\mathbb{E}_{\sigma \sim U_{\pm}^m}\mathbb{E}_{S,S' \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right]$$

and by the linearity of expectation it also equals

$$\mathbb{E}_{S,S' \sim \mathcal{D}^m}\mathbb{E}_{\sigma \sim U_{\pm}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right]$$


Next, fix $S$ and $S'$, and let $C$ be the instances appearing in $S$ and $S'$. Then, we can take the supremum only over $h \in \mathcal{H}_C$. Therefore,

$$\mathbb{E}_{\sigma \sim U_{\pm}^m}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right] = \mathbb{E}_{\sigma \sim U_{\pm}^m}\left[\max_{h \in \mathcal{H}_C} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)\right|\right]$$


Fix some $h \in \mathcal{H}_C$ and denote $\theta_h = \frac{1}{m}\sum_{i=1}^m \sigma_i\left(\ell(h,z'_i) - \ell(h,z_i)\right)$. Since $\mathbb{E}[\theta_h] = 0$ and $\theta_h$ is an average of independent variables, each of which takes values in $[-1,1]$, we have by Hoeffding's inequality that for every $\rho > 0$,

$$\mathbb{P}[|\theta_h| > \rho] \leq 2\exp(-2m\rho^2)$$

Applying the union bound over $h \in \mathcal{H}_C$, we obtain that for any $\rho > 0$,

$$\mathbb{P}\left[\max_{h \in \mathcal{H}_C} |\theta_h| > \rho\right] \leq 2|\mathcal{H}_C|\exp(-2m\rho^2)$$

Finally, Lemma A.4 in Appendix A tells us that the preceding implies

$$\mathbb{E}\left[\max_{h \in \mathcal{H}_C} |\theta_h|\right] \leq \frac{4 + \sqrt{\log(|\mathcal{H}_C|)}}{\sqrt{2m}}$$


Combining all with the definition of $\tau_{\mathcal{H}}$, we have shown that

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[\sup_{h \in \mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)|\right] \leq \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\sqrt{2m}}$$

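To see the bound of Equation (6.4) at work, here is an illustrative Monte Carlo check (ours, not from the book) for the threshold class, where $\tau_{\mathcal{H}}(2m) = 2m + 1$ and the supremum over thresholds can be computed exactly on a sample:

```python
import math, random

def sup_gap(sample, a_star=0.5):
    """sup over thresholds a of |L_D(h_a) - L_S(h_a)| for the uniform marginal
    on [0, 1] with true labels 1[x < a_star]; checking one threshold per gap
    (plus the endpoints 0 and 1) suffices."""
    xs = sorted(x for x, _ in sample)
    cands = [0.0] + [(p + q) / 2 for p, q in zip(xs, xs[1:])] + [1.0]
    m = len(sample)
    gaps = []
    for a in cands:
        L_D = abs(a - a_star)                               # true 0-1 risk of h_a
        L_S = sum(int(x < a) != y for x, y in sample) / m   # empirical risk
        gaps.append(abs(L_D - L_S))
    return max(gaps)

random.seed(0)
for m in (50, 500, 5000):
    runs = []
    for _ in range(20):
        S = [(x, int(x < 0.5)) for x in (random.random() for _ in range(m))]
        runs.append(sup_gap(S))
    bound = (4 + math.sqrt(math.log(2 * m + 1))) / math.sqrt(2 * m)
    print(m, round(sum(runs) / len(runs), 4), "<=", round(bound, 4))
# the expected sup-gap decays roughly like 1/sqrt(m), as Equation (6.4) predicts
```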

6.6 Summary

The fundamental theorem of learning theory characterizes PAC learnability of classes of binary classifiers using VC-dimension. The VC-dimension of a class is a combinatorial property that denotes the maximal sample size that can be shattered by the class. The fundamental theorem states that a class is PAC learnable if and only if its VC-dimension is finite and specifies the sample complexity required for PAC learning. The theorem also shows that if a problem is at all learnable, then uniform convergence holds and therefore the problem is learnable using the ERM rule.
