# Chapter 3-Understanding Machine Learning From Theory to Algorithms: A Formal Learning Model
###### tags: `Understanding Machine Learning From Theory to Algorithms`
## Remark
:::info
Original text (from the book)
:::
:::success
Personal understanding
:::
It is recommended to first read [Probably Approximately Correct — a Formal Theory of Learning (translation)](https://hackmd.io/@shaoeChen/HyuI98iGw)
[Course link]()
[TOC]
## A Formal Learning Model
### 3.1 PAC Learning
:::info
In the previous chapter we have shown that for a finite hypothesis class, if the ERM rule with respect to that class is applied on a sufficiently large training sample (whose size is independent of the underlying distribution or labeling function) then the output hypothesis will be probably approximately correct. More generally, we now define Probably Approximately Correct (PAC) learning.
:::
:::success
In the previous chapter we showed that, for a finite hypothesis class, if the ERM rule with respect to that class is applied to a sufficiently large training sample (whose size does not depend on the underlying distribution or labeling function), then the output hypothesis will be probably approximately correct. More generally, we now define Probably Approximately Correct (PAC) learning.
:::
:::info
DEFINITION 3.1 (PAC Learnability) A hypothesis class H is PAC learnable if there exist a function $m_\mathcal{H} : (0, 1)^2 \rightarrow \mathbb{N}$ and a learning algorithm with the following property: For every $\epsilon, \delta \in (0, 1)$, for every distribution $\mathcal{D}$ over $\mathcal{X}$ , and for every labeling function $f: \mathcal{X} \rightarrow \left\{0,1 \right\}$, if the realizable assumption holds with respect to $\mathcal{H}, \mathcal{D}, f$, then when running the learning algorithm on $m \geq m_\mathcal{H}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$ and labeled by $f$, the algorithm returns a hypothesis h such that, with probability of at least $1 - \delta$ (over the choice of the examples), $L_{(\mathcal{D}, f)}(h) \leq \epsilon$.
:::
:::success
DEFINITION 3.1 (PAC Learnability) A hypothesis class $\mathcal{H}$ is PAC learnable if the following hold:
* there exists a function $m_\mathcal{H} : (0, 1)^2 \rightarrow \mathbb{N}$
    * $\mathbb{N}$: the natural numbers, i.e., the positive integers
* there exists a learning algorithm with the following property:
    * for every $\epsilon, \delta \in (0, 1)$, every distribution $\mathcal{D}$ over $\mathcal{X}$, and every labeling function $f: \mathcal{X} \rightarrow \left\{0,1 \right\}$, if the realizable assumption holds with respect to $\mathcal{H}, \mathcal{D}, f$, then when the learning algorithm is run on $m \geq m_\mathcal{H}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$ and labeled by $f$, it returns a hypothesis $h$ such that, with probability at least $1 - \delta$ (over the choice of the examples), $L_{(\mathcal{D}, f)}(h) \leq \epsilon$

Probably Approximately Correct learnability involves two approximation parameters:
1. the accuracy parameter $\epsilon$, which bounds how far the output classifier may be from the optimal one; this corresponds to the "approximately correct" in PAC
2. the confidence parameter $\delta$, which bounds how likely the classifier is to meet that accuracy requirement; this corresponds to the "probably" in PAC

Under the data access model we study here, these approximations are unavoidable. Since the training set is generated at random, there is always a small chance that it is uninformative (for example, under repeated sampling there is some chance that the sample contains only a single domain point). Moreover, even if we are lucky enough to obtain a sample that represents $\mathcal{D}$ very accurately, it is still a finite sample and will always miss some fine details of $\mathcal{D}$. The accuracy parameter $\epsilon$ forgives the learner's classifier for making such small mistakes. (A small simulation sketch of this guarantee follows right after this block.)
:::
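:::success
As a toy illustration of Definition 3.1 (my own sketch, not from the book): the domain is $[0, 1]$ with a uniform $\mathcal{D}$, the labeling function is the threshold $f(x)=\mathbf{1}[x \geq 0.3]$, and $\mathcal{H}$ is a small finite class of threshold classifiers that contains $f$, so the realizable assumption holds. ERM returns a consistent hypothesis, and we estimate how often its true error exceeds $\epsilon$; all parameter values here are arbitrary assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0, 1, 101)   # finite H: h_t(x) = 1[x >= t]
f_threshold = 0.3                     # true labeling function f(x) = 1[x >= 0.3], contained in H
eps, delta, m = 0.1, 0.05, 100        # accuracy, confidence, sample size (illustrative values)

def true_error(t):
    # D is uniform on [0, 1], so L_(D,f)(h_t) is the length of the disagreement interval
    return abs(t - f_threshold)

trials, failures = 2000, 0
for _ in range(trials):
    x = rng.uniform(0, 1, size=m)
    y = (x >= f_threshold).astype(int)
    # ERM: pick a threshold with minimal (here zero) empirical error
    emp_err = [np.mean((x >= t).astype(int) != y) for t in thresholds]
    t_hat = thresholds[int(np.argmin(emp_err))]
    failures += true_error(t_hat) > eps

print(f"fraction of runs with L_(D,f)(h) > eps: {failures / trials:.4f} (PAC asks for <= delta = {delta})")
```
:::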
:::info
Sample Complexity
The function $m_\mathcal{H}: (0, 1)^2 \rightarrow \mathbb{N}$ determines the sample complexity of learning $\mathcal{H}$: that is, how many examples are required to guarantee a probably approximately correct solution. The sample complexity is a function of the accuracy ($\epsilon$) and confidence ($\delta$) parameters. It also depends on properties of the hypothesis class $\mathcal{H}$ - for example, for a finite class we showed that the sample complexity depends on log the size of $\mathcal{H}$.
:::
:::success
Sample Complexity
The function $m_\mathcal{H}: (0, 1)^2 \rightarrow \mathbb{N}$ determines the sample complexity of learning $\mathcal{H}$, that is, how many examples are required to guarantee a probably approximately correct solution. The sample complexity is a function of the accuracy parameter ($\epsilon$) and the confidence parameter ($\delta$). It also depends on properties of the hypothesis class $\mathcal{H}$; for example, we showed that for a finite class the sample complexity depends on the log of the size of $\mathcal{H}$.
:::
:::info
Note that if $\mathcal{H}$ is PAC learnable, there are many functions $m_{\mathcal{H}}$ that satisfy the requirements given in the definition of PAC learnability. Therefore, to be precise, we will define the sample complexity of learning H to be the "minimal function", in the sense that for any $\epsilon$; $\delta$, $m_\mathcal{H}(\epsilon, \delta)$ is the minimal integer that satisfies the requirements of PAC learning with accuracy $\epsilon$ and confidence $\delta$.
:::
:::success
Note that if $\mathcal{H}$ is PAC learnable, there are many functions $m_{\mathcal{H}}$ that satisfy the requirements in the definition of PAC learnability. Therefore, to be precise, we define the sample complexity of learning $\mathcal{H}$ to be the minimal such function, in the sense that for any $\epsilon$ and $\delta$, $m_\mathcal{H}(\epsilon, \delta)$ is the minimal integer that satisfies the requirements of PAC learning with accuracy $\epsilon$ and confidence $\delta$.
Recall the conclusion of the previous chapter. It can be rephrased as:
COROLLARY 3.2 Every finite hypothesis class is PAC learnable with sample complexity
$$m_\mathcal{H}(\epsilon, \delta) \leq \left \lceil \frac{\log(\vert \mathcal{H} \vert / \delta)}{\epsilon} \right \rceil$$
There are also infinite classes that are learnable (see Exercise 3). Later we will show that what determines the PAC learnability of a class is not its finiteness but a combinatorial measure called the VC dimension. (A quick numeric check of this bound follows right after this block.)
:::
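:::success
A quick numeric check of the bound in Corollary 3.2 (the class sizes and parameter values below are arbitrary examples, and the natural logarithm is used):
```python
import math

def sample_complexity_bound(h_size, eps, delta):
    """Corollary 3.2 upper bound for a finite class: ceil(log(|H| / delta) / eps)."""
    return math.ceil(math.log(h_size / delta) / eps)

for h_size in (10, 1_000, 10**6):
    print(h_size, sample_complexity_bound(h_size, eps=0.1, delta=0.05))
# |H| = 10   -> 53 examples
# |H| = 10^3 -> 100 examples
# |H| = 10^6 -> 169 examples  (the bound grows only logarithmically in |H|)
```
:::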
### 3.2 A More General Learning Model
:::info
The model we have just described can be readily generalized, so that it can be made relevant to a wider scope of learning tasks. We consider generalizations in two aspects:
* Removing the Realizability Assumption
* Learning Problems beyond Binary Classification
:::
:::success
The model just described can readily be generalized so that it applies to a wider range of learning tasks. We consider generalizations in two respects:
* Removing the Realizability Assumption
    * So far we required that, under the realizability assumption, the learning algorithm succeeds for a pair consisting of a distribution $\mathcal{D}$ and a labeling function $f$. For practical learning tasks this assumption may be too strong (can we really guarantee that there is a rectangle in the color-softness space that fully determines which papayas are tasty?). In the next subsection we describe the agnostic PAC model, in which the realizability assumption is no longer used.
* Learning Problems beyond Binary Classification
    * The learning tasks discussed so far involve predicting a binary label for a given example (tasty or not tasty). However, different learning tasks take different forms. For example, one may wish to predict a real-valued number (say, the temperature at 9:00 p.m. tomorrow) or to pick a label out of a finite set of labels (such as the topic of tomorrow's newspaper). It turns out that, by considering a variety of loss functions, our analysis of learning extends easily to such settings; this is discussed in Section 3.2.2.
:::
#### 3.2.1 Releasing the Realizability Assumption - Agnostic PAC Learning
:::info
A More Realistic Model for the Data-Generating Distribution
Recall that the realizability assumption requires that there exists $h^* \in \mathcal{H}$ such that $\mathbb{P}_{x \sim \mathcal{D}} \left[h^*(x)=f(x) \right] = 1$. In many practical problems this assumption does not hold. Furthermore, it is maybe more realistic not to assume that the labels are fully determined by the features we measure on input elements (in the case of the papayas, it is plausible that two papayas of the same color and softness will have different taste). In the following, we relax the realizability assumption by replacing the "target labeling function" with a more flexible notion, a data-labels generating distribution.
:::
:::success
A More Realistic Model for the Data-Generating Distribution
* The realizability assumption requires that there exists $h^* \in \mathcal{H}$ with $\mathbb{P}_{x \sim \mathcal{D}} \left[h^*(x)=f(x) \right] = 1$. In many practical problems this assumption does not hold. Moreover, it may be more realistic not to assume that the labels are fully determined by the features we measure on the input elements (in the papaya example, two papayas with the same color and softness may well taste different).
* In what follows, the realizability assumption is relaxed by replacing the "target labeling function" with a more flexible notion, a data-labels generating distribution.
:::
:::info
Formally, from now on, let $\mathcal{D}$ be a probability distribution over $\mathcal{X} \mathrm{x} \mathcal{Y}$, where, as before, $\mathcal{X}$ is our domain set and $\mathcal{Y}$ is a set of labels (usually we will consider $\mathcal{Y} = \left\{0, 1 \right\}$). That is, $\mathcal{D}$ is a joint distribution over domain points and labels. One can view such a distribution as being composed of two parts: a distribution $\mathcal{D}_x$ over unlabeled domain points (sometimes called the marginal distribution) and a conditional probability over labels for each domain point, $\mathcal{D}((x, y)\vert x)$. In the papaya example, $\mathcal{D}_x$ determines the probability of encountering a papaya whose color and hardness fall in some color-hardness values domain, and the conditional probability is the probability that a papaya with color and hardness represented by $x$ is tasty. Indeed, such modeling allows for two papayas that share the same color and hardness to belong to different taste categories.
:::
:::success
From here on we adjust some of the assumptions:
* the distribution $\mathcal{D}$ is a probability distribution over $\mathcal{X} \mathrm{x} \mathcal{Y}$ (previously it was over $\mathcal{X}$ only), where $\mathcal{X}$ is the domain set and $\mathcal{Y}$ is the set of labels, usually $\mathcal{Y} = \left\{0, 1 \right\}$
* $\mathcal{D}$ is a joint distribution over domain points and labels
* such a distribution can be viewed as composed of two parts:
    1. a distribution $\mathcal{D}_x$ over the unlabeled domain points (sometimes called the marginal distribution)
    2. a conditional probability over labels for each domain point, $\mathcal{D}((x, y)\vert x)$
In the papaya example:
* $\mathcal{D}_x$ determines the probability of encountering a papaya whose color and hardness fall in some color-hardness value domain
* the conditional probability is the probability that a papaya whose color and hardness are represented by $x$ is tasty
Indeed, such modeling allows two papayas with the same color and hardness to belong to different taste categories. (A toy sampling sketch of this decomposition follows right after this block.)
:::
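:::success
A toy sampling sketch of this decomposition (my own illustration; the marginal and the conditional probabilities below are made up, in the spirit of the papaya example):
```python
import numpy as np

rng = np.random.default_rng(1)

def sample_joint(m):
    """Draw m pairs (x, y): x from a marginal D_x, then y from the conditional P[y=1 | x]."""
    # marginal D_x over (color, hardness): here simply uniform on [0, 1]^2 (an assumption)
    x = rng.uniform(0, 1, size=(m, 2))
    # conditional P[y=1 | x]: "tastier" when color is high and hardness is moderate (made up)
    p_tasty = np.clip(x[:, 0] * (1 - np.abs(x[:, 1] - 0.5)), 0, 1)
    y = (rng.uniform(size=m) < p_tasty).astype(int)
    return x, y

x, y = sample_joint(5)
print(x, y)   # two identical x's could receive different labels y
```
:::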
:::info
The empirical and the True Error Revised
For a probability distribution, $\mathcal{D}$, over $\mathcal{X} \mathrm{x} \mathcal{Y}$, one can measure how likely $h$ is to make an error when labeled points are randomly drawn according to $\mathcal{D}$. We redefine the true error (or risk) of a prediction rule $h$ to be
$$L_{\mathcal{D}}(h) \stackrel{\text{def}}{=} \mathbb{P}_{(x, y) \sim \mathcal{D}} \left[h(x) \neq y \right] \stackrel{\text{def}}{=} \mathcal{D}(\left\{(x, y): h(x) \neq y \right\}) \tag{3.1}$$
We would like to find a predictor, $h$, for which that error will be minimized. However, the learner does not know the data generating $\mathcal{D}$. What the learner does have access to is the training data, $S$. The definition of the empirical risk remains the same as before, namely,
$$L_S(h) \stackrel{\text{def}}{=} \dfrac{\vert \left\{i \in [m] : h(x_i) \neq y_i \right\} \vert}{m}$$
Given $S$, a learner can compute $L_S(h)$ for any function $h: \mathcal{X} \rightarrow \left\{0, 1 \right\}$. Note that $L_S(h)=L_{\mathcal{D}(\text{uniform over } S)}(h)$.
:::
:::success
For a probability distribution $\mathcal{D}$ over $\mathcal{X} \mathrm{x} \mathcal{Y}$, one way to evaluate $h$ is to measure how likely it is to err when a labeled point is drawn at random according to $\mathcal{D}$.
The true error of a prediction rule $h$ is redefined as:
$$L_{\mathcal{D}}(h) \stackrel{\text{def}}{=} \mathbb{P}_{(x, y) \sim \mathcal{D}} \left[h(x) \neq y \right] \stackrel{\text{def}}{=} \mathcal{D}(\left\{(x, y): h(x) \neq y \right\}) \tag{3.1}$$
For comparison, the earlier definition of the true error was:
$$L_{(\mathcal{D}, f)}(h) \stackrel{\text{def}}{=} \mathbb{P}_{x \sim \mathcal{D}} \left[h(x) \neq f(x) \right] \stackrel{\text{def}}{=} \mathcal{D}(\left\{x: h(x) \neq f(x) \right\}) \tag{2.1}$$
We would like to find a predictor $h$ that minimizes this true error. The learner does not know the data-generating distribution $\mathcal{D}$; all it has access to is the training data $S$.
The definition of the empirical risk remains the same as before, namely:
$$L_S(h) \stackrel{\text{def}}{=} \dfrac{\vert \left\{i \in [m] : h(x_i) \neq y_i \right\} \vert}{m}$$
Given a sample $S$, the learner can compute $L_S(h)$ for any function $h: \mathcal{X} \rightarrow \left\{0, 1 \right\}$. Note that $L_S(h)=L_{\mathcal{D}(\text{uniform over } S)}(h)$. (A short sketch computing $L_S(h)$ follows right after this block.)
:::
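:::success
A minimal sketch of computing the empirical risk $L_S(h)$ (the sample and the classifier below are made up for illustration):
```python
import numpy as np

def empirical_risk(h, S):
    """L_S(h) = |{i in [m] : h(x_i) != y_i}| / m for a labeled sample S = [(x_1, y_1), ...]."""
    return float(np.mean([h(x) != y for x, y in S]))

S = [(0.1, 0), (0.4, 1), (0.6, 1), (0.9, 0)]   # a toy sample
h = lambda x: int(x >= 0.5)                    # a toy classifier
print(empirical_risk(h, S))                    # 0.5: h errs on (0.4, 1) and (0.9, 0)
```
This is exactly the true risk of $h$ under the uniform distribution over the points of $S$.
:::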
:::info
The Goal
We wish to find some hypothesis, $h: \mathcal{X} \rightarrow \mathcal{Y}$, that (probably approximately) minimizes the true risk, $L_\mathcal{D}(h)$.
:::
:::success
The Goal
We wish to find some hypothesis $h: \mathcal{X} \rightarrow \mathcal{Y}$ that (probably approximately) minimizes the true risk $L_\mathcal{D}(h)$.
:::
:::info
The Bayes Optimal Predictor.
Given any probability distribution $\mathcal{D}$ over $\mathcal{X} \mathrm{x} \left\{0, 1 \right\}$, the best label predicting function from $\mathcal{X}$ to $\left\{0, 1\right\}$ will be
$$
f_\mathcal{D}(x) =
\begin{cases}
1, & \text{if $\mathbb{P}\left[y=1 \vert x \right] \geq 1/2 $} \\
0, & \text{otherwise}
\end{cases}
$$
It is easy to verify (see Exercise 7) that for every probability distribution $\mathcal{D}$, the Bayes optimal predictor $f_\mathcal{D}$ is optimal, in the sense that no other classifier, $g: \mathcal{X} \rightarrow \left\{ 0, 1 \right\}$, has a lower error. That is, for every classifier $g$, $L_\mathcal{D}(f_\mathcal{D}) \leq L_\mathcal{D}(g)$.
:::
:::success
The Bayes Optimal Predictor
Given any probability distribution $\mathcal{D}$ over $\mathcal{X} \mathrm{x} \left\{0, 1 \right\}$, the best label-predicting function from $\mathcal{X}$ to $\left\{0, 1\right\}$ is:
$$
f_\mathcal{D}(x) =
\begin{cases}
1, & \text{if $\mathbb{P}\left[y=1 \vert x \right] \geq 1/2 $} \\
0, & \text{otherwise}
\end{cases}
$$
In words: given $x$, if the probability that $y=1$ is at least 1/2 then $f_\mathcal{D}(x)=1$, and otherwise it is 0.
The verification is Exercise 7: for every probability distribution $\mathcal{D}$, the Bayes optimal predictor $f_\mathcal{D}$ is optimal in the sense that:
1. no other classifier $g: \mathcal{X} \rightarrow \left\{ 0, 1 \right\}$ has a lower error
2. for every classifier $g$, $L_\mathcal{D}(f_\mathcal{D}) \leq L_\mathcal{D}(g)$
However, we do not know the whole distribution $\mathcal{D}$, so we cannot make use of this optimal predictor $f_\mathcal{D}$; as in Chapter 2, all the learner gets to see is the training sample $S$. Clearly, we cannot expect the learning algorithm to find a hypothesis whose error is lower than that of the Bayes predictor.
Later we will prove that once we make no prior assumptions about the data-generating distribution $\mathcal{D}$, no algorithm can be guaranteed to find a predictor as good as the Bayes predictor. Instead, we require the learning algorithm to find a predictor whose error is not much larger than the best possible error of a predictor in some given benchmark hypothesis class. Of course, the strength of this requirement depends on the choice of that hypothesis class. (A small Monte Carlo sketch comparing the Bayes predictor with another classifier follows right after this block.)
:::
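:::success
A small Monte Carlo sketch comparing the Bayes predictor with another classifier (my own toy example; the conditional probability $\eta(x)=\mathbb{P}\left[y=1 \vert x \right]$ and the competitor $g$ are assumptions):
```python
import numpy as np

rng = np.random.default_rng(2)
eta = lambda x: x ** 2                              # assumed P[y=1 | x], with x uniform on [0, 1]

f_bayes = lambda x: (eta(x) >= 0.5).astype(int)     # Bayes optimal predictor f_D
g       = lambda x: (x >= 0.3).astype(int)          # some other classifier

m = 200_000
x = rng.uniform(0, 1, size=m)
y = (rng.uniform(size=m) < eta(x)).astype(int)

for name, clf in [("Bayes f_D", f_bayes), ("g", g)]:
    print(name, float(np.mean(clf(x) != y)))
# the estimated error of f_D should not exceed that of g (up to Monte Carlo noise)
```
:::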
:::info
definition 3.3 (Agnostic PAC Learnability) A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if there exist a function $m_\mathcal{H}: (0, 1)^2 \rightarrow \mathbb{N}$ and a learning algorithm with the following property: For every $\epsilon, \delta \in (0, 1)$ and for every distribution $\mathcal{D}$ over $\mathcal{X} \mathrm{x} \mathcal{Y}$, when running the learning algorithm on $m \geq m_\mathcal{H}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1 - \delta$ (over the choice of the m training examples),
$$L_\mathcal{D}(h) \leq \min_{h' \in \mathcal{H}} L_\mathcal{D}(h') + \epsilon$$
:::
:::success
definition 3.3 (Agnostic PAC Learnability)
A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if the following hold:
* there exists a function $m_\mathcal{H}: (0, 1)^2 \rightarrow \mathbb{N}$
* there exists a learning algorithm with the following property:
    * for every $\epsilon, \delta \in (0, 1)$ and every distribution $\mathcal{D}$ over $\mathcal{X} \mathrm{x} \mathcal{Y}$, when the learning algorithm is run on $m \geq m_\mathcal{H}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$, it returns a hypothesis $h$ such that, with probability at least $1 - \delta$ (over the choice of the $m$ training examples):
$$L_\mathcal{D}(h) \leq \min_{h' \in \mathcal{H}} L_\mathcal{D}(h') + \epsilon$$
For comparison, the guarantee proved at the end of Chapter 2 was:
$$L_{(\mathcal{D}, f)}(h_S) \leq \epsilon$$
Clearly, if the realizability assumption holds, agnostic PAC learning provides the same guarantee as PAC learning. In that sense, agnostic PAC learning generalizes PAC learning (in the realizable case, $\min_{h' \in \mathcal{H}} L_\mathcal{D}(h')=0$). When the realizability assumption does not hold, no learner can guarantee an arbitrarily small error. Nevertheless, under the definition of agnostic PAC learning, the learner can still declare success if its error is not much larger than the best error achievable by a predictor from the class $\mathcal{H}$. This contrasts with PAC learning, where the learner is required to achieve a small error in absolute terms rather than relative to the best error achievable by the hypothesis class. (A toy ERM sketch in this agnostic setting follows right after this block.)
:::
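:::success
A toy ERM sketch in this agnostic setting (everything here, the finite class of thresholds, the noisy conditional probability, and the sample size, is my own assumption): we compare the true risk of the ERM hypothesis against the best risk in the class.
```python
import numpy as np

rng = np.random.default_rng(3)
thresholds = np.linspace(0, 1, 51)            # finite H of threshold classifiers h_t(x) = 1[x >= t]
eta = lambda x: 0.1 + 0.8 * (x >= 0.4)        # noisy P[y=1 | x]: labels disagree with 1[x >= 0.4] 10% of the time

def risk(t, n=200_000):
    # Monte Carlo estimate of the true 0-1 risk L_D(h_t)
    x = rng.uniform(0, 1, size=n)
    y = (rng.uniform(size=n) < eta(x)).astype(int)
    return float(np.mean((x >= t).astype(int) != y))

best_risk = min(risk(t) for t in thresholds)  # approx. min_{h' in H} L_D(h') (about 0.1 here)

m = 500
x = rng.uniform(0, 1, size=m)
y = (rng.uniform(size=m) < eta(x)).astype(int)
t_erm = min(thresholds, key=lambda t: np.mean((x >= t).astype(int) != y))

print("L_D(h_ERM) ~", risk(t_erm), "   min_h' L_D(h') ~", best_risk)
# agnostic PAC only promises the first value is within eps of the second, with probability >= 1 - delta
```
:::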
#### 3.2.2 The Scope of Learning Problems Modeled
:::info
We next extend our model so that it can be applied to a wide variety of learning tasks. Let us consider some examples of different learning tasks.
* Multiclass Classification
Our classification does not have to be binary. Take, for example, the task of document classification: We wish to design a program that will be able to classify given documents according to topics (e.g., news, sports, biology, medicine). A learning algorithm for such a task will have access to examples of correctly classified documents and, on the basis of these examples, should output a program that can take as input a new document and output a topic classification for that document. Here, the domain set is the set of all potential documents. Once again, we would usually represent documents by a set of features that could include counts of different key words in the document, as well as other possibly relevant features like the size of the document or its origin. The label set in this task will be the set of possible document topics (so Y will be some large finite set). Once we determine our domain and label sets, the other components of our framework look exactly the same as in the papaya tasting example: Our training sample will be a finite sequence of (feature vector, label) pairs, the learner's output will be a function from the domain set to the label set, and, finally, for our measure of success, we can use the probability, over (document, topic) pairs, of the event that our predictor suggests a wrong label.
* Regression
In this task, one wishes to find some simple pattern in the data - a functional relationship between the $\mathcal{X}$ and $\mathcal{Y}$ components of the data. For example, one wishes to find a linear function that best predicts a baby's birth weight on the basis of ultrasound measures of his head circumference, abdominal circumference, and femur length. Here, our domain set $\mathcal{X}$ is some subset of $\mathbb{R}^3$ (the three ultrasound measurements), and the set of "labels," $\mathcal{Y}$, is the set of real numbers (the weight in grams). In this context, it is more adequate to call $\mathcal{Y}$ the target set. Our training data as well as the learner's output are as before (a finite sequence of $(x, y)$ pairs, and a function from $\mathcal{X}$ to $\mathcal{Y}$, respectively). However, our measure of success is different. We may evaluate the quality of a hypothesis function, $h : \mathcal{X} \rightarrow \mathcal{Y}$, by the expected square difference between the true labels and their predicted values, namely,
$$L_\mathcal{D}(h) \stackrel{\text{def}}{=} \mathbb{E}_{(x, y) \sim \mathcal{D}}(h(x) - y)^2 \tag{3.2}$$
:::
:::success
We can now extend the model to various other kinds of learning tasks. Here are some examples of different learning tasks:
* Multiclass Classification
The classification no longer has to be binary ($\left\{0, 1\right\}$). For example, consider classifying documents by topic, such as news, sports, or entertainment. A learning algorithm for such a task has access to examples of correctly classified documents and should output a model that takes a document as input and outputs a topic classification for it. Here the domain set is the set of all possible documents. As usual, we represent a document by a set of features, which may include counts of different key words in the document as well as other relevant features (such as the document's size or origin). The label set for this task is the set of possible document topics, so $\mathcal{Y}$ is some large finite set. Once the domain and label sets are determined, the other components of the framework look exactly the same as in the papaya example: the training sample is a finite sequence of (feature vector, label) pairs, the learner's output is a function from the domain set to the label set, and success is measured by the probability, over (document, topic) pairs, that the predictor suggests a wrong label.
* Regression
In this kind of task we wish to discover some simple pattern in the data, namely a functional relationship between $\mathcal{X}$ and $\mathcal{Y}$. For example, we may want a linear function that predicts a baby's birth weight from the head circumference, abdominal circumference, and femur length measured by ultrasound. In this example the domain set $\mathcal{X}$ is a subset of $\mathbb{R}^3$ (the three ultrasound measurements), and the "labels" $\mathcal{Y}$ are the real numbers (the weight in grams). In this context it is more appropriate to call $\mathcal{Y}$ the target set. The training data and the learner's output are as before: the training data is still a finite sequence of pairs, and the output is still a function mapping $\mathcal{X}$ to $\mathcal{Y}$. The measure of success, however, is different; we may evaluate the quality of a hypothesis by the expected squared difference between the true values and the predicted values, as follows:
$$L_\mathcal{D}(h) \stackrel{\text{def}}{=} \mathbb{E}_{(x, y) \sim \mathcal{D}}(h(x) - y)^2 \tag{3.2}$$
This expression says that the loss $L_\mathcal{D}(h)$ is the expectation, over $(x, y)$ sampled from $\mathcal{D}$, of the squared difference between the prediction $h(x)$ and the true value $y$. (A small Monte Carlo sketch of this risk follows right after this block.)
:::
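:::success
A Monte Carlo sketch of the squared-loss risk in Equation (3.2) (the data-generating process and the candidate linear predictor are invented for illustration and are not taken from the book):
```python
import numpy as np

rng = np.random.default_rng(4)

def sample_D(m):
    """Toy stand-in for D: three 'ultrasound measurements' x and a noisy linear birth weight y (grams)."""
    x = rng.normal(loc=[34.0, 35.0, 7.0], scale=1.0, size=(m, 3))          # made-up scales
    y = 100.0 * x[:, 0] + 60.0 * x[:, 1] - 80.0 * x[:, 2] + rng.normal(0.0, 150.0, size=m)
    return x, y

def squared_risk(h, m=100_000):
    # Monte Carlo estimate of L_D(h) = E[(h(x) - y)^2], Equation (3.2)
    x, y = sample_D(m)
    return float(np.mean((h(x) - y) ** 2))

h = lambda x: 100.0 * x[:, 0] + 60.0 * x[:, 1] - 80.0 * x[:, 2]            # a candidate linear predictor
print(squared_risk(h))   # about 150^2 = 22500, the irreducible noise variance of this toy D
```
:::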
:::info
Generalized Loss Functions
Given any set $\mathcal{H}$ (that plays the role of our hypotheses, or models) and some domain $Z$ let $\ell$ be any function from $\mathcal{H} \mathrm{x} Z$ to the set of nonnegative real numbers, $\ell : \mathcal{H} \mathrm{x} Z \rightarrow \mathbb{R}_+$
We call such functions loss functions.
Note that for prediction problems, we have that $Z = \mathcal{X} \mathrm{x} \mathcal{Y}$. However, our notion of the loss function is generalized beyond prediction tasks, and therefore it allows $Z$ to be any domain of examples (for instance, in unsupervised learning tasks such as the one described in Chapter 22, $Z$ is not a product of an instance domain and a label domain).
:::
:::success
To accommodate a wide variety of learning tasks, we generalize our formal measure of success as follows:
* Generalized Loss Functions
Given any set $\mathcal{H}$ (playing the role of our hypotheses, or models) and some domain $Z$, let $\ell$ be any function from $\mathcal{H} \mathrm{x} Z$ to the set of nonnegative real numbers, $\ell : \mathcal{H} \mathrm{x} Z \rightarrow \mathbb{R}_+$.
We call such functions loss functions.
For prediction problems we have $Z = \mathcal{X} \mathrm{x} \mathcal{Y}$. However, this notion of a loss function generalizes beyond prediction tasks, so $Z$ may be any domain of examples (for instance, in unsupervised learning tasks such as the one described in Chapter 22, $Z$ is not a product of an instance domain and a label domain).
We now define the risk function to be the expected loss of a classifier $h \in \mathcal{H}$ with respect to a probability distribution $\mathcal{D}$ over $Z$, namely:
$$L_\mathcal{D}(h) \stackrel{\text{def}}{=} \mathbb{E}_{z \sim \mathcal{D}} \left[\ell(h, z)\right] \tag{3.3}$$
That is, we consider the expectation of the loss of $h$ over objects $z$ drawn at random according to $\mathcal{D}$. Similarly, we define the empirical risk to be the expected loss over a given sample $S=(z_1, \cdots, z_m) \in Z^m$, namely:
$$L_S(h) \stackrel{\text{def}}{=} \dfrac{1}{m} \sum^m_{i=1} \ell(h, z_i) \tag{3.4}$$
(A short generic empirical-risk sketch follows right after this block.)
:::
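:::success
A generic empirical-risk helper matching Equation (3.4), with the 0-1 and squared losses plugged in (a minimal sketch; the sample and hypothesis are made up):
```python
def empirical_risk(h, S, loss):
    """L_S(h) = (1/m) * sum_i loss(h, z_i), Equation (3.4); S is a list of examples z_i."""
    return sum(loss(h, z) for z in S) / len(S)

zero_one = lambda h, z: 0 if h(z[0]) == z[1] else 1   # z = (x, y), 0-1 loss
square   = lambda h, z: (h(z[0]) - z[1]) ** 2         # z = (x, y), squared loss

S = [(1.0, 1), (2.0, 0), (3.0, 1)]                    # a made-up sample
h = lambda x: int(x >= 1.5)                           # a made-up hypothesis
print(empirical_risk(h, S, zero_one))                 # 2/3: h errs on (1.0, 1) and (2.0, 0)
print(empirical_risk(h, S, square))                   # same value here, since h outputs 0/1
```
:::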
:::info
The loss functions used in the preceding examples of classification and regression tasks are as follows:
* 0-1 loss: Here, our random variable $z$ ranges over the set of pairs $\mathcal{X} \mathrm{x} \mathcal{Y}$ and the loss function is
$$
\ell_{0-1}(h, (x, y)) \stackrel{\text{def}}{=}
\begin{cases}
0 & \text{if $h(x) = y$} \\
1, & \text{if $h(x) \neq y$}
\end{cases}
$$
* Square Loss: Here, our random variable $z$ ranges over the set of pairs $\mathcal{X} \mathrm{x} \mathcal{Y}$ and the loss function is
$$\ell_{sq}(h, (x, y)) \stackrel{\text{def}}{=} (h(x) - y)^2$$
:::
:::success
The loss functions used in the preceding classification and regression examples are as follows:
* 0-1 loss: the random variable $z$ ranges over the set of pairs $\mathcal{X} \mathrm{x} \mathcal{Y}$, and the loss function is
$$
\ell_{0-1}(h, (x, y)) \stackrel{\text{def}}{=}
\begin{cases}
0, & \text{if $h(x) = y$} \\
1, & \text{if $h(x) \neq y$}
\end{cases}
$$
This loss function is used for binary and multiclass classification problems.
Note that for a random variable $\alpha$ taking values in $\left\{0, 1\right\}$, we have $\mathbb{E}_{\alpha \sim \mathcal{D}}\left[\alpha \right]=\mathbb{P}_{\alpha \sim \mathcal{D}}\left[\alpha=1 \right]$. Consequently, for this loss, the definitions of $L_\mathcal{D}(h)$ in Equation (3.3) and Equation (3.1) agree (a short derivation follows right after this block).
* Square Loss: the random variable $z$ ranges over the set of pairs $\mathcal{X} \mathrm{x} \mathcal{Y}$, and the loss function is
$$\ell_{sq}(h, (x, y)) \stackrel{\text{def}}{=} (h(x) - y)^2$$
This loss function is used for regression problems.
We will see more examples of loss functions later in the book.
To summarize, we now formally define agnostic PAC learnability for general loss functions.
DEFINITION 3.4 (Agnostic PAC Learnability for General Loss Functions):
A hypothesis class $\mathcal{H}$ is agnostic PAC learnable with respect to a set $Z$ and a loss function $\ell : \mathcal{H} \mathrm{x} Z \rightarrow \mathbb{R}_+$ if the following hold:
1. there exists a function $m_\mathcal{H}: (0, 1)^2 \rightarrow \mathbb{N}$
2. there exists a learning algorithm such that, for every $\epsilon, \delta \in (0, 1)$ and every distribution $\mathcal{D}$ over $Z$, when the algorithm is run on $m \geq m_\mathcal{H}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$, it returns $h \in \mathcal{H}$ such that, with probability at least $1 - \delta$ (over the choice of the $m$ training examples),
$$L_\mathcal{D}(h) \leq \min_{h' \in \mathcal{H}} L_\mathcal{D}(h') + \epsilon$$
where $L_\mathcal{D}(h) = \mathbb{E}_{z \sim \mathcal{D}} \left[\ell(h, z) \right]$
:::
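:::success
Spelling out the step mentioned above: with the 0-1 loss the loss value is an indicator, so $\mathbb{E}\left[\mathbf{1}_A\right]=\mathbb{P}\left[A\right]$ gives
$$L_\mathcal{D}(h) \stackrel{\text{def}}{=} \mathbb{E}_{z \sim \mathcal{D}} \left[\ell_{0-1}(h, z)\right] = \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[\mathbf{1}_{[h(x) \neq y]}\right] = \mathbb{P}_{(x, y) \sim \mathcal{D}} \left[h(x) \neq y \right]$$
which is exactly the definition in Equation (3.1).
:::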
:::info
Remark 3.1 (A Note About Measurability\*) In the aforementioned definition, for every $h \in \mathcal{H}$, we view the function $\ell(h, \cdot):Z \rightarrow \mathbb{R}_+$ as a random variable and define $L_\mathcal{D}(h)$ to be the expected value of this random variable. For that, we need to require that the function $\ell(h, \cdot)$ is measurable. Formally, we assume that there is a $\sigma$-algebra of subsets of $Z$, over which the probability $\mathcal{D}$ is defined, and that the preimage of every initial segment in $\mathbb{R}_+$ is in this $\sigma$-algebra. In the specific case of binary classification with the 0-1 loss, the $\sigma$-algebra is over $\mathcal{X} \mathrm{x} \left\{0, 1 \right\}$ and our assumption on $\ell$ is equivalent to the assumption that for every $h$, the set $\left\{(x, h(x)): x \in \mathcal{X} \right\}$ is in the $\sigma$-algebra.
:::
:::success
Remark 3.1 (A Note About Measurability\*) In the definition above, for every $h \in \mathcal{H}$, we view the function $\ell(h, \cdot):Z \rightarrow \mathbb{R}_+$ as a random variable and define $L_\mathcal{D}(h)$ to be the expected value of this random variable. For that, we need to require that the function $\ell(h, \cdot)$ is measurable. Formally, we assume that there is a $\sigma$-algebra of subsets of $Z$, over which the probability $\mathcal{D}$ is defined, and that the preimage of every initial segment in $\mathbb{R}_+$ is in this $\sigma$-algebra. In the specific case of binary classification with the 0-1 loss, the $\sigma$-algebra is over $\mathcal{X} \mathrm{x} \left\{0, 1 \right\}$, and our assumption on $\ell$ is equivalent to assuming that for every $h$, the set $\left\{(x, h(x)): x \in \mathcal{X} \right\}$ is in the $\sigma$-algebra.
:::
:::info
Remark 3.2 (Proper versus Representation-Independent Learning*) In the preceding definition, we required that the algorithm will return a hypothesis from $\mathcal{H}$. In some situations, $\mathcal{H}$ is a subset of a set $\mathcal{H'}$, and the loss function can be naturally extended to be a function from $\mathcal{H'} \mathrm{x} Z$ to the reals. In this case, we may allow the algorithm to return a hypothesis $h' \in \mathcal{H'}$, as long as it satisfies the requirement $L_\mathcal{D}(h') \leq \min_{h \in \mathcal{H}} L_\mathcal{D}(h) + \epsilon$. Allowing the algorithm to output a hypothesis from $\mathcal{H'}$ is called representation independent learning, while proper learning occurs when the algorithm must output a hypothesis from $\mathcal{H}$. Representation independent learning is sometimes called "improper learning," although there is nothing improper in representation independent learning.
:::
:::success
Remark 3.2 (Proper versus Representation-Independent Learning\*) In the preceding definition we required that the algorithm return a hypothesis from $\mathcal{H}$. In some situations $\mathcal{H}$ is a subset of a set $\mathcal{H'}$, and the loss function can be naturally extended to a function from $\mathcal{H'} \mathrm{x} Z$ to the reals. In this case we may allow the algorithm to return a hypothesis $h' \in \mathcal{H'}$, as long as it satisfies $L_\mathcal{D}(h') \leq \min_{h \in \mathcal{H}} L_\mathcal{D}(h) + \epsilon$. Allowing the algorithm to output a hypothesis from $\mathcal{H'}$ is called representation independent learning, whereas proper learning requires the algorithm to output a hypothesis from $\mathcal{H}$. Representation independent learning is sometimes called "improper learning," although there is nothing improper about it.
:::
### 3.3 Summary
:::info
In this chapter we defined our main formal learning model - PAC learning. The basic model relies on the realizability assumption, while the agnostic variant does not impose any restrictions on the underlying distribution over the examples. We also generalized the PAC model to arbitrary loss functions. We will sometimes refer to the most general model simply as PAC learning, omitting the "agnostic" prefix and letting the reader infer what the underlying loss function is from the context. When we would like to emphasize that we are dealing with the original PAC setting we mention that the realizability assumption holds. In Chapter 7 we will discuss other notions of learnability.
:::
:::success
In this chapter we defined our main formal learning model, PAC learning. The basic model relies on the realizability assumption, while the agnostic variant places no restriction on the underlying distribution over the examples. We also generalized the PAC model to arbitrary loss functions. Sometimes the most general model is simply referred to as PAC learning, dropping the "agnostic" prefix and letting the reader infer the underlying loss function from the context. When we want to emphasize that we are in the original PAC setting, we mention that the realizability assumption holds. Chapter 7 discusses other notions of learnability.
:::