# Learning Theory: (Agnostic) Probably Approximately Correct Learning (Translation)

###### tags: `Understanding Machine Learning From Theory to Algorithms` `翻譯`

[Original article](https://towardsdatascience.com/learning-theory-agnostic-probably-approximately-correct-learning-dfd0d7c76467)

:::info
In my [previous article](https://towardsdatascience.com/learning-theory-empirical-risk-minimization-d3573f90ff77), I discussed what Empirical Risk Minimization (ERM) is and proved that it yields a satisfactory hypothesis under certain assumptions. Now I want to discuss Probably Approximately Correct (PAC) learning (which is quite a mouthful but kinda cool), a generalization of ERM. For those who are not familiar with ERM, I suggest reading my previous article on the topic, since it is a prerequisite for understanding PAC learning.
:::

:::info
Remember that when analyzing ERM we came to the conclusion that, for a finite hypothesis space $\mathcal{H}$, and under the assumption that the hypothesis space contains a hypothesis with zero true error (the realizability assumption), we can arrive at a hypothesis whose true error is lower than $\epsilon$ with probability at least $1 - \delta$. Based on these parameters we could calculate how many samples we need to achieve such accuracy, and we arrived at the following lower bound on the number of samples:

$$m \geq \dfrac{\log(\vert \mathcal{H} \vert / \delta)}{\epsilon}$$
:::

:::info
This fits into the general PAC learning framework. The following formal definition is taken from the book [Understanding Machine Learning](https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf):

![Definition of PAC learnability](https://miro.medium.com/max/700/1*11Sd7GFBJlpVO_L9Ew2jng.png)
:::

:::info
For me at least, this definition was a bit confusing at first. What does it mean? The definition states that a hypothesis class is PAC learnable if there exist a function $m_\mathcal{H}$ and an algorithm such that, for any labeling function $f$, any distribution $\mathcal{D}$ over the domain of inputs $X$ (the instance space), and any $\epsilon$ and $\delta$, running the algorithm on $m \geq m_\mathcal{H}(\epsilon, \delta)$ samples produces a hypothesis $h$ whose true error is lower than $\epsilon$ with probability at least $1 - \delta$. A labeling function is nothing more than a fixed function $f$ that assigns a label to every point in the domain.
:::
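To make the bound above concrete, here is a minimal Python sketch (my own illustration, not from the original article; the function name and the example numbers are made up) that turns the finite-class bound $m \geq \log(\vert \mathcal{H} \vert / \delta)/\epsilon$ into a sample-complexity function $m_\mathcal{H}(\epsilon, \delta)$:

```python
import math

def sample_complexity(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Smallest integer m satisfying m >= log(|H| / delta) / epsilon,
    i.e. the m_H(epsilon, delta) suggested by the finite-class ERM bound
    (under the realizability assumption)."""
    return math.ceil(math.log(hypothesis_count / delta) / epsilon)

# Example: 10,000 hypotheses, 5% target error, 99% confidence (delta = 0.01).
print(sample_complexity(10_000, epsilon=0.05, delta=0.01))  # prints 277
```

Lowering $\epsilon$ or $\delta$ makes the required sample size grow, which matches the intuition in the next paragraph that higher accuracy demands a more faithful sample of the domain.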
:::info
Here, the hypothesis class can be any type of binary classifier, since the labeling function that assigns labels to the examples from the domain assigns either the label 0 or the label 1. The function $m_\mathcal{H}$ gives us a bound on the minimal number of samples we need in order to achieve an error lower than $\epsilon$ with confidence $1 - \delta$. The accuracy parameter $\epsilon$ naturally controls the necessary sample size: the higher the accuracy we require, the more faithful a sample of the domain our training set needs to be, and therefore the more samples are necessary to achieve that accuracy.
:::

## Making the Model Agnostic

:::info
The above model has a certain drawback: it is not general enough, because of the realizability assumption (explained in [Empirical Risk Minimization](https://towardsdatascience.com/learning-theory-empirical-risk-minimization-d3573f90ff77)). Nobody guarantees that our current hypothesis space contains a hypothesis with a true error of 0; the model may simply fail to capture the problem. Another way of looking at it is that perhaps the labels are not well determined by the data because of missing features.
:::

:::info
The way we circumvent the realizability assumption is by replacing the labeling function with a data-labels distribution. You can look at this as introducing uncertainty into the labeling, since one data point can now carry different labels (the same feature values may be observed with different labels). So why is it called agnostic PAC learning? The word agnostic comes from the fact that the learning is agnostic towards the data-labels distribution: the learner tries to find the best possible hypothesis while making no assumptions about that distribution. What changes in this case? The definition of the true error changes, since the label of a data point is now a distribution over the possible labels. We also cannot guarantee that the learner achieves the minimal possible true error, since we do not know the data-labels distribution and therefore cannot account for the label uncertainty.
:::
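To see why the realizability assumption has to go, here is a small self-contained Python simulation (my own sketch, not from the original article; the toy distribution, the threshold class, and all numbers are made up for illustration). The labels are noisy, so even the best hypothesis in the class has a nonzero true error, and the most we can hope for is that ERM gets within $\epsilon$ of that best hypothesis:

```python
import random

random.seed(0)

def sample(m):
    """Draw m points from a toy data-labels distribution over X = {0, ..., 9}:
    the label tends to be 1 when x >= 5, but it is flipped with probability 0.2,
    so no deterministic hypothesis can reach a true error of 0."""
    data = []
    for _ in range(m):
        x = random.randrange(10)
        y = int(x >= 5)
        if random.random() < 0.2:  # label noise: the same x can get both labels
            y = 1 - y
        data.append((x, y))
    return data

# A small finite hypothesis class: threshold classifiers h_t(x) = 1 if x >= t.
hypotheses = [lambda x, t=t: int(x >= t) for t in range(11)]

def error(h, data):
    """Fraction of points in `data` that h labels incorrectly."""
    return sum(h(x) != y for x, y in data) / len(data)

train = sample(500)
test = sample(100_000)  # a large sample to approximate the true error

erm = min(hypotheses, key=lambda h: error(h, train))   # empirical risk minimizer
best = min(hypotheses, key=lambda h: error(h, test))   # best hypothesis in the class

print("approximate true error of the ERM hypothesis:", error(erm, test))
print("approximate best achievable error in the class:", error(best, test))
```

On this distribution both printed numbers come out close to the 0.2 noise level; the agnostic guarantee only promises that the first is within $\epsilon$ of the second, not that either of them is close to zero.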
:::info
After those considerations, we arrive at the following formal definition from the book:

![Definition of agnostic PAC learnability](https://miro.medium.com/max/700/1*Qs2-BgqJw1iCbPa44Mv72w.png)
:::

:::info
Notice how this definition differs from the definition of PAC learnability. By introducing the data-labels distribution $\mathcal{D}$, we only require that the true error of the learned hypothesis is less than or equal to the true error of the best hypothesis in the class plus a factor $\epsilon$ (this is written out as a single inequality in the short note at the end of this post). This also encapsulates PAC learning itself: if the hypothesis space happens to contain an optimal hypothesis with a true error of 0, the agnostic guarantee reduces to the PAC one, but we also allow for the possibility that no such hypothesis exists. These definitions are going to be useful later on when explaining the VC dimension and proving the No Free Lunch theorem.
:::

:::info
In case the terminology was a bit foreign to you, I advise you to take a look at Learning Theory: Empirical Risk Minimization, or to take a more detailed look at the brilliant book by Ben-David mentioned in the article. Other than that, keep machine learning!
:::
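A short closing note (my own addition, not from the original article, written in notation consistent with the book, where $L_{\mathcal{D}}(h)$ denotes the true error of $h$ under the data-labels distribution $\mathcal{D}$): the agnostic guarantee described above can be written as a single inequality over the draw of the training sample,

$$\Pr\Big[\, L_{\mathcal{D}}(h) \;\leq\; \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon \,\Big] \;\geq\; 1 - \delta,$$

and whenever some $h^{*} \in \mathcal{H}$ satisfies $L_{\mathcal{D}}(h^{*}) = 0$, the minimum vanishes and the statement collapses to $\Pr\big[\, L_{\mathcal{D}}(h) \leq \epsilon \,\big] \geq 1 - \delta$, which is exactly the realizable PAC guarantee.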