Understanding Machine Learning: From Theory to Algorithms
In tackling machine learning (and computer science in general) we face some deep philosophical questions. Questions like, “What does it mean to learn?” and, “Can a computer learn?” and, “How do you define simplicity?” and, “Why does Occam’s Razor work? (Why do simple hypotheses do well at modelling reality?)” In a very deep sense, learning theorists take these philosophical questions — or at least aspects of them — give them fleshy mathematical bodies, and then answer them with theorems and proofs. These fleshy bodies might have imperfections or they might only address one small part of a big question, but the more we think about them the closer we get to robust answers and, as a reader of this blog might find relevant, useful applications. But the glamorous big-picture stuff is an important part of the allure of learning theory.
But before we jump too far ahead of ourselves, we need to get through the basics. In this post we’ll develop the basic definitions of the theory of PAC-learning. It will be largely mathematical, but fear not: we’ll mix in a wealth of examples to clarify the austere symbols.
Some historical notes: PAC learning was invented by Leslie Valiant in 1984, and it birthed a new subfield of computer science called computational learning theory and won Valiant some of computer science’s highest awards. Since then there have been numerous modifications of PAC learning, and also models that are entirely different from PAC learning. One other goal of learning theorists (as with computational complexity researchers) is to compare the power of different learning models. We’ll discuss this more later once we have more learning models under our belts.
If you’re interested in following along with a book, the best introduction to the subject is the first few chapters of An Introduction to Computational Learning Theory.
So let’s jump right in and see what this award-winning definition is all about.
The core idea of PAC-learnability is easy to understand, and we'll start with a simple example to explain it. Imagine a game between two players. Player 1 generates numbers $x$ at random in some fixed way, and in Player 1's mind he has an interval $[a, b]$. Whenever Player 1 gives out an $x$, he must also say whether it's in the interval (that is, whether $a \leq x \leq b$). Let's say that Player 1 reports a 1 if $x$ is in the interval, and a 0 otherwise. We'll call this number the label of $x$, and call the pair $(x, \text{label})$ a sample, or an example. We recognize that the zero and one correspond to "yes" and "no" answers to some question (Is this email spam? Does the user click on my ad? etc.), and so sometimes the labels are instead $\pm 1$, and referred to as "positive" or "negative" examples. We'll use the positive/negative terminology here, so positive is a 1 and negative is a 0.
Player 2 (we’re on her side) sees a bunch of samples and her goal is to determine a and b. Of course Player 2 can’t guess the interval exactly if the endpoints are real numbers, because Player 1 only gives out finitely many samples. But whatever interval Player 2 does guess at the end can be tested against Player 1’s number-producing scheme. That is, we can compute the probability that Player 2’s interval will give an incorrect label if Player 1 were to continue giving out numbers indefinitely. If this error is small (taking into account how many samples were given), then Player 2 has “learned” the interval. And if Player 2 plays this game over and over and usually wins (no matter what strategy or interval Player 1 decides to use!), then we say this problem is PAC-learnable.
PAC stands for Probably Approximately Correct, and our number guessing game makes it clear what this means. Approximately correct means the interval is close enough to the true interval that the error will be small on new samples, and Probably means that if we play the game over and over we’ll usually be able to get a good approximation. That is, we’ll find an approximately good interval with high probability.
Indeed, one might already have a good algorithm in mind to learn intervals. Simply take the largest and smallest positive examples and use those as the endpoints of your interval. It’s not hard to see why this works, but if we want to prove it (or anything) is PAC-learnable, then we need to solidify these ideas with mathematical definitions.
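To make the game and this simple learner concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the theory: the choice of a standard normal as Player 1's distribution, the hard-coded hidden interval, and the function names are all placeholders. It just shows the "take the extreme positive examples" rule in code.

```python
import random

def make_player1(a, b, seed=0):
    """Player 1: draws x from a fixed distribution (here a standard normal)
    and labels it 1 if a <= x <= b, else 0."""
    rng = random.Random(seed)
    def draw_sample():
        x = rng.gauss(0, 1)
        return x, 1 if a <= x <= b else 0
    return draw_sample

def learn_interval(samples):
    """Player 2's rule: the tightest interval containing the positive examples."""
    positives = [x for x, label in samples if label == 1]
    if not positives:
        return None  # the "empty" hypothesis: label everything negative
    return min(positives), max(positives)

# Example run: the hidden interval is [-0.5, 1.0].
draw = make_player1(-0.5, 1.0)
samples = [draw() for _ in range(100)]
print(learn_interval(samples))  # an interval sitting just inside [-0.5, 1.0]
```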
First let’s settle the random number generation scheme. In full generality, rather than numbers we’ll just have some set . Could be finite, could be infinite, no restrictions. And we’re getting samples randomly generated from according to some fixed but arbitrary and unknown distribution . To be completely rigorous, the samples are independent and identically distributed (they’re all drawn from the same and independently so). This is Player 1’s dastardly decision: how should he pick his method to generate random numbers so as to bring Player 2’s algorithm to the most devastating ruin?
So then Player 1 is reduced to a choice of distribution $D$ over $X$, and since we said that Player 2's algorithm has to do well with high probability no matter what $D$ is, then the definition becomes something like this:
A problem is PAC-learnable if there is an algorithm $A$ which will likely win the game for all distributions $D$ over $X$.
Now we have to talk about how "intervals" fit in to the general picture. Because if we're going to talk about learning in general, we won't always be working with intervals to make decisions. So we're really saying that Player 1 picks some function $c$ for classifying points in $X$ as a 0 or a 1. We'll call this a concept, or a target, and it's the thing Player 2 is trying to learn. That is, Player 2 is producing her own function $h$ that also labels points in $X$, and we're comparing it to $c$. We call a function generated by Player 2 a hypothesis (hence the use of the letter $h$).
And how can we compare them? Well, we can compute the probability that they differ. We call this the error:
$\displaystyle \textup{err}_{c,D}(h) = \textup{P}_{x \sim D}(h(x) \neq c(x))$

That is, the probability under the distribution $D$ that Player 1's $c$ and Player 2's $h$ disagree.
One would say this aloud: "The error of the hypothesis $h$ with respect to the concept $c$ and the distribution $D$ is the probability over $x$ drawn via $D$ that $h(x)$ and $c(x)$ differ." Some might write the "differ" part as the symmetric difference of the two functions as sets. And then it becomes a probability density, if that's your cup of tea (it's not mine).
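Since the distribution is only ever seen through samples, the error can also be estimated empirically. The sketch below is illustrative only: the function name `empirical_error`, the normal distribution, and the particular intervals are all placeholder assumptions; it just approximates $\textup{err}_{c,D}(h)$ by drawing fresh points and counting disagreements.

```python
import random

def empirical_error(h, c, draw_x, trials=100_000):
    """Monte Carlo estimate of err_{c,D}(h) = P_{x ~ D}(h(x) != c(x)),
    where draw_x() samples a single point from the distribution D."""
    disagreements = 0
    for _ in range(trials):
        x = draw_x()
        if h(x) != c(x):
            disagreements += 1
    return disagreements / trials

# Illustrative usage: target interval [0, 2], hypothesis [0.1, 1.9], D = N(0, 1).
rng = random.Random(1)
c = lambda x: 1 if 0.0 <= x <= 2.0 else 0
h = lambda x: 1 if 0.1 <= x <= 1.9 else 0
print(empirical_error(h, c, lambda: rng.gauss(0, 1)))  # mass of [0, 0.1) and (1.9, 2]
```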
So now for a problem to be PAC-learnable we can say something like,
A problem is PAC-learnable if there is an algorithm $A$ which, for any distribution $D$ and any concept $c$, will, when given some independently drawn samples and with high probability, produce a hypothesis whose error is small.
There are still a few untrimmed hedges in this definition (like "some," "small," and "high"), but there's a more important problem: there are just too many possible concepts! Even for finite sets: there are $2^n$ possible $\{0,1\}$-valued functions on a set of $n$ elements, and if we hope to run in polynomial time we can only possibly express a minuscule fraction of those functions. Going back to the interval game, it'd be totally unreasonable to expect Player 2 to be able to get a reasonable hypothesis (using intervals or not!) if Player 1's chosen concept is arbitrary. (The mathematician in me is imagining some crazy rule using non-measurable sets, but suffice it to say: you might think you know everything about the real numbers, but you don't.)
So we need to boil down what possibilities there are for the concepts $c$ and the allowed expressive power of the learner. This is what concept classes are for.
A concept class $\mathsf{C}$ over $X$ is a family of functions $X \to \{0, 1\}$. That's all.
No, okay, there’s more to the story, but for now it’s just a shift of terminology. Now we can define the class of labeling functions induced by a choice of intervals. One might do this by taking to be the set of all characteristic functions of intervals, if and 0 otherwise. Now the concept class becomes the sole focus of our algorithm. That is, the algorithm may use knowledge of the concept class to produce its hypotheses. So our working definition becomes:
A problem is PAC-learnable if there is an algorithm $A$ which, for any distribution $D$ of samples and any concept $c$ chosen from some concept class $\mathsf{C}$, will, when given some independently drawn samples and with high probability, produce a hypothesis $h$ whose error is small.
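The characteristic-function view of the interval class can be made concrete in a couple of lines of code. This is a purely illustrative sketch: the name `interval_concept` is a placeholder, and the class itself is of course infinite, so we only ever construct individual members of it.

```python
def interval_concept(a, b):
    """The characteristic function f_{(a,b)}: returns 1 if a <= x <= b, else 0."""
    return lambda x: 1 if a <= x <= b else 0

# The concept class C is the family {interval_concept(a, b) : a <= b}.
f = interval_concept(2.0, 5.0)
print(f(3.0), f(7.0))  # 1 0
```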
As a short prelude to future posts: we’ll be able to prove that, if the concept class is sufficiently simple (think, “low dimension”) then any algorithm that does something reasonable will be able to learn the concept class. But that will come later. Now we turn to polishing the rest of this definition.
We don’t want to phrase the definition in terms of games, so it’s time to remove the players from the picture. What we’re really concerned with is whether there’s an algorithm which can produce good hypotheses when given random data. But we have to solidify the “giving” process and exactly what limits are imposed on the algorithm.
It sounds daunting, but the choices are quite standard as far as computational complexity goes. Rather than say the samples come as a big data set as they might in practice, we want the algorithm to be able to decide how much data it needs. To do this, we provide it with a query function which, when accessed, spits out a sample in unit time. Then we’re interested in learning the concept with a reasonable number of calls to the query function.
And now we can iron out those words like "some" and "small" and "high" in our working definition. Since we're going for small error, we'll introduce a parameter $\varepsilon$ to represent our desired error bound. That is, our goal is to find a hypothesis $h$ such that $\textup{err}_{c,D}(h) \leq \varepsilon$ with high probability. And as $\varepsilon$ gets smaller and smaller (as we expect more and more of it), we want to allow our algorithm more time to run, so we limit our algorithm to run in time and space polynomial in $1/\varepsilon$.
We need another parameter to control the "high probability" part as well, so we'll introduce $\delta$ to represent the small fraction of the time we allow our learning algorithm to have high error. And so our goal becomes to, with probability at least $1 - \delta$, produce a hypothesis whose error is at most $\varepsilon$. In symbols, we want
$\displaystyle \textup{P}(\textup{err}_{c,D}(h) \leq \varepsilon) \geq 1 - \delta$
Note that the $\textup{P}$ refers to the probability over which samples you happen to get when you call the query function (and any other random choices made by the algorithm). The "high probability" hence refers to the unlikely event that you get data which is unrepresentative of the distribution generating it. Note that this is not the probability over which distribution is chosen; an algorithm which learns must still learn no matter what $D$ is.
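One way to see this "probability over the samples" is by simulation: fix a target interval, a distribution, $\varepsilon$, $\delta$, and a sample size, rerun the learner many times on fresh samples, and check that the fraction of runs achieving error at most $\varepsilon$ is at least $1 - \delta$. The sketch below is only an illustration under assumed choices (a normal distribution, the interval $[-0.5, 1.0]$, 100 samples per run); none of these numbers come from the definition itself.

```python
import random

def run_trial(rng, a, b, m, test_points=20_000):
    """One run of the game: draw m labeled samples from N(0,1), learn the tightest
    interval containing the positives, then estimate that hypothesis's true error."""
    xs = [rng.gauss(0, 1) for _ in range(m)]
    positives = [x for x in xs if a <= x <= b]
    lo, hi = (min(positives), max(positives)) if positives else (0.0, -1.0)  # empty hypothesis
    wrong = 0
    for _ in range(test_points):
        x = rng.gauss(0, 1)
        if (a <= x <= b) != (lo <= x <= hi):
            wrong += 1
    return wrong / test_points

eps, delta, m = 0.1, 0.05, 100
rng = random.Random(2)
errors = [run_trial(rng, -0.5, 1.0, m) for _ in range(500)]
frac_good = sum(e <= eps for e in errors) / len(errors)
print(f"fraction of runs with error <= {eps}: {frac_good:.3f}")  # expect >= 1 - delta = 0.95
```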
And again as we restrict $\delta$ more and more, we want the algorithm to be allowed more time to run. So we require the algorithm runs in time polynomial in both $1/\varepsilon$ and $1/\delta$.
And now we have all the pieces to state the full definition.
Definition: Let $X$ be a set, and $\mathsf{C}$ a concept class over $X$. We say that $\mathsf{C}$ is PAC-learnable if there is an algorithm $A(\varepsilon, \delta)$ with access to a query function for $\mathsf{C}$ and runtime $O(\textup{poly}(\frac{1}{\varepsilon}, \frac{1}{\delta}))$, such that for all $c \in \mathsf{C}$, all distributions $D$ over $X$, and all inputs $\varepsilon, \delta$ between 0 and 1/2, the probability that $A$ produces a hypothesis $h$ with error at most $\varepsilon$ is at least $1 - \delta$. In symbols,

$\displaystyle \textup{P}_D \left( \textup{P}_{x \sim D}(h(x) \neq c(x)) \leq \varepsilon \right) \geq 1 - \delta$

where the first $\textup{P}_D$ is the probability over samples drawn from $D$ during the execution of the program to produce $h$. Equivalently, we can express this using the error function,

$\displaystyle \textup{P}_D \left( \textup{err}_{c,D}(h) \leq \varepsilon \right) \geq 1 - \delta$
Excellent.
Now that we have this definition we can return to our problem of learning intervals on the real line. Our concept class is the set of all characteristic functions of intervals (and we’ll add in the empty set for the default case). And the algorithm we proposed to learn these intervals was quite simple: just grab a bunch of sample points, take the biggest and smallest positive examples, and use those as the endpoints of your hypothesis interval.
Let’s now prove that this algorithm can learn any interval with any distribution over real numbers. This proof will have the following form:
- Fix an arbitrary (unknown) distribution $D$ and target interval, draw $m$ samples, and form the hypothesis interval from the largest and smallest positive examples.
- Bound the error of that hypothesis by the probability that a positive example falls in one of two "bad" regions, one on each side of the hypothesis interval.
- Choose $m$ large enough that this probability is at most $\delta$, so that the error exceeds $\varepsilon$ only with probability at most $\delta$.
So fix any distribution $D$ over the real line and say we have our $m$ samples, we picked the max and min of the positive examples, and our interval is $[a_1, b_1]$ when the target concept is $[a_0, b_0]$. We can notice one thing: our hypothesis is contained in the true interval, $[a_1, b_1] \subseteq [a_0, b_0]$. That's because the samples never lie, so the largest positive sample we saw must be at most the largest possible positive example, and vice versa. In other words, $a_0 \leq a_1 \leq b_1 \leq b_0$. And so the probability of our hypothesis producing an error is just the probability that $D$ produces a positive example in the two intervals $A = [a_0, a_1), B = (b_1, b_0]$.
To spell that out: inside $[a_1, b_1]$ both the hypothesis and the target label points positive, and outside $[a_0, b_0]$ both label points negative, so the only misclassified points are the positive examples that land in $A$ or $B$.
This is all setup for the second bullet point above. The total error is at most the sum of the probabilities that a positive sample shows up in each of $A, B$ separately.
Here’s a picture.
If we can guarantee that each of the green pieces is smaller than $\varepsilon / 2$ with high probability, then we'll be done. Let's look at $A$; the same argument will hold for $B$. Define $A'$ to be the interval (sharing its left endpoint $a_0$ with $A$) which is just big enough that the probability that a positive example is drawn from $A'$ under $D$ is exactly $\varepsilon / 2$. Here's another picture to clarify that.
(Figure: the hypothesis interval sits inside the true interval; the two green regions on either side are where errors can occur.)
(Figure: the pink region is $A'$, whose total probability weight under $D$ is exactly $\varepsilon / 2$; if the green region $A$ were larger than $A'$, it would contribute too much error.)
We’ll be in great shape if it’s already the case that , because that implies the probability we draw a positive example from is at most . So we’re worried about the possibility that . But this can only happen if we never saw a point from as a sample during the run of our algorithm. Since we had samples, we can compute in terms of the probability of never seeing a sample from .
The probability of a single sample not being in $A'$ is just $1 - \varepsilon / 2$ (by definition!). Recalling our basic probability theory, the draws are independent events, and so the probability of missing $A'$ in all $m$ draws is equal to the product of the probabilities of each individual miss. That is, the probability that our chosen $A$ contributes error greater than $\varepsilon / 2$ is at most

$\displaystyle (1 - \varepsilon / 2)^m$
The same argument applies to $B$, so we know by the union bound that the probability of error occurring in either $A$ or $B$ is at most the sum of the probabilities of large error in each piece, so that

$\displaystyle \textup{P}_D(\textup{err}_{c,D}(h) > \varepsilon) \leq 2(1 - \varepsilon / 2)^m$
Now for the third bullet. We want the chance that the error is big to be smaller than $\delta$, so that we'll have low error with probability $> 1 - \delta$. So simply set

$\displaystyle 2(1 - \varepsilon / 2)^m \leq \delta$
And solve for $m$. Using the fact that $1 - x \leq e^{-x}$ (which is proved by Taylor series), it's enough to solve

$\displaystyle 2 e^{-\varepsilon m / 2} \leq \delta,$
And a fine solution is $m \geq \frac{2}{\varepsilon} \ln \left( \frac{2}{\delta} \right)$.
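For a concrete sense of scale (these particular numbers are just an illustration plugged into the bound): with $\varepsilon = 0.1$ and $\delta = 0.05$,

$\displaystyle m \geq \frac{2}{0.1} \ln \left( \frac{2}{0.05} \right) = 20 \ln 40 \approx 73.8,$

so 74 samples suffice to get error at most $0.1$ with probability at least $0.95$.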
Now to cover all our bases: our algorithm simply computes $m$ for its inputs $\varepsilon, \delta$, queries that many samples, and computes the tightest-fitting interval containing the positive examples. Since the number of samples is polynomial in $\frac{1}{\varepsilon}, \frac{1}{\delta}$ (and our algorithm doesn't do anything complicated), we comply with the time and space bounds. And finally we just proved that the chance our algorithm will misclassify an $\varepsilon$ fraction of new points drawn from $D$ is at most $\delta$. So we have proved the theorem:
Theorem: Intervals on the real line are PAC-learnable.
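Putting the pieces together, here is a minimal end-to-end sketch of the learner the theorem describes: compute $m$ from $\varepsilon$ and $\delta$, query that many samples, and return the tightest interval around the positive examples. The query function shown (a normal distribution with a hard-coded hidden interval) is only an assumed stand-in for Player 1's unknown choices, and the function names are illustrative.

```python
import math
import random

def pac_learn_interval(query, eps, delta):
    """Query m = ceil((2/eps) * ln(2/delta)) labeled samples and return the
    tightest interval containing the positive examples (None = empty hypothesis)."""
    m = math.ceil((2 / eps) * math.log(2 / delta))
    samples = [query() for _ in range(m)]
    positives = [x for x, label in samples if label == 1]
    if not positives:
        return None
    return min(positives), max(positives)

# Stand-in for Player 1: hidden interval [-0.5, 1.0], distribution D = N(0, 1).
rng = random.Random(3)
def query():
    x = rng.gauss(0, 1)
    return x, 1 if -0.5 <= x <= 1.0 else 0

print(pac_learn_interval(query, eps=0.1, delta=0.05))  # uses m = 74 samples here
```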
As an exercise, see if you can generalize the argument to axis-aligned rectangles in the plane. What about arbitrary axis-aligned boxes in $d$-dimensional space? Where does $d$ show up in the number of samples needed? Is this still efficient?
There are a few more technical details we've ignored in the course of this post, but the important ideas are clear. We have a formal model of learning which allows for certain pre-specified levels of imperfection, and we proved that one can learn how to recognize intervals in this model. It's a far cry from decision trees and neural networks, but it's a solid foundation to build upon.
However, the definition we presented here for PAC-learning is not quite complete. It turns out, as we'll see in the next post, that forcing the PAC-learning algorithm to produce hypotheses from the same concept class it's trying to learn makes some problems that should be easy become hard. This is just because it could require the algorithm to represent some simple hypothesis in a convoluted form; in the next post we'll see that this is not an idle threat, and we'll make a slight modification to the PAC definition.
However, PAC-learning is far from sacred. In particular, requiring a single algorithm to succeed no matter what the distribution $D$ is was a deliberate choice, and it's quite a strong requirement for a learning algorithm. There are also versions of PAC that remove other assumptions from the definition, such as the assumption that the oracle for the target concept is honest (noise-free) or that some available hypothesis is actually true (realizability). These all give rise to interesting learning models, and discovering the relationships between the models is the end goal.
And so the kinds of questions we ask are: can we classify all PAC-learnable problems? Can we find a meta-algorithm that would work on any PAC-learnable concept class given some assumptions? How does PAC-learning relate to other definitions of learning? Say, one where we don’t require it to work for every distribution; would that really allow us to solve more problems?
It’s a question of finding out the deep truths of mathematics now, but we promise that this series will soon come back around to practical applications, for learning theory naturally entails the design and analysis of fascinating algorithms.
Until next time!