# Chapter 4 - Understanding Machine Learning From Theory to Algorithms: Learning via Uniform Convergence

###### tags: `Understanding Machine Learning From Theory to Algorithms`

## Remark

:::info
Book content
:::

:::success
Personal understanding
:::

[Course link]()

[TOC]

## Learning via Uniform Convergence

:::info
The first formal learning model that we have discussed was the PAC model. In Chapter 2 we have shown that under the realizability assumption, any finite hypothesis class is PAC learnable. In this chapter we will develop a general tool, uniform convergence, and apply it to show that any finite class is learnable in the agnostic PAC model with general loss functions, as long as the range of the loss function is bounded.
:::

:::success
The first formal learning model we discussed was the PAC model. In Chapter 2 we showed that, under the realizability assumption, any finite hypothesis class is PAC learnable. In this chapter we develop a general tool, uniform convergence, and apply it to show that any finite class is learnable in the agnostic PAC model with a general loss function, as long as the range of the loss function is bounded.
:::

### 4.1 Uniform Convergence Is Sufficient for Learnability

:::info
The idea behind the learning condition discussed in this chapter is very simple. Recall that, given a hypothesis class, $\mathcal{H}$, the ERM learning paradigm works as follows: Upon receiving a training sample, $S$, the learner evaluates the risk (or error) of each $h$ in $\mathcal{H}$ on the given sample and outputs a member of $\mathcal{H}$ that minimizes this empirical risk. The hope is that an $h$ that minimizes the empirical risk with respect to $S$ is a risk minimizer (or has risk close to the minimum) with respect to the true data probability distribution as well. For that, it suffices to ensure that the empirical risks of all members of $\mathcal{H}$ are good approximations of their true risk. Put another way, we need that uniformly over all hypotheses in the hypothesis class, the empirical risk will be close to the true risk, as formalized in the following.
:::

:::success
The idea behind the learning condition discussed in this chapter is very simple. Recall that, given a hypothesis class $\mathcal{H}$, the ERM learning paradigm works as follows:

* Upon receiving a training sample $S$, the learner evaluates the risk (or error) of every $h$ in $\mathcal{H}$ on the given sample and outputs the member of $\mathcal{H}$ that minimizes this empirical risk. The hope is that an $h$ that minimizes the empirical risk over $S$ is also a risk minimizer (or has risk close to the minimum) with respect to the true data distribution.

For that, it suffices to ensure that the empirical risks of all members of $\mathcal{H}$ are good approximations of their true risks. In other words, we need the empirical risk to be close to the true risk uniformly over all hypotheses in the class, as formalized next.

DEFINITION 4.1 ($\epsilon$-representative sample)
A training set $S$ is called $\epsilon$-representative (w.r.t. domain $Z$, hypothesis class $\mathcal{H}$, loss function $\ell$, and distribution $\mathcal{D}$) if

$$\forall h \in \mathcal{H}, \quad \vert L_S(h) - L_\mathcal{D}(h) \vert \leq \epsilon$$

That is, for every $h \in \mathcal{H}$, the absolute difference between the empirical risk $L_S(h)$ and the true risk $L_\mathcal{D}(h)$ is at most $\epsilon$. Intuitively, the two values differ by at most $\epsilon$ regardless of which one is larger, so once we drop the absolute value we may bound either direction; this is exactly what the proof of the lemma below exploits.

Here $S = (z_1, \cdots , z_m)$ (or $S = ((x_1, y_1), \cdots, (x_m, y_m))$) and $L_S(h) = \dfrac{1}{m} \sum_{z \in S} \ell(h, z)$. As long as $S$ really is representative, we can say that $\mathrm{ERM}_\mathcal{H}$ is a good learning strategy.
:::
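To make Definition 4.1 concrete, here is a small simulation sketch of my own (not from the book). It assumes a finite class of threshold hypotheses $h_\theta(x)=\mathrm{sign}(x-\theta)$ and a simple noisy-label Gaussian distribution, computes the empirical risk $L_S(h)$ of every hypothesis on one sample, and checks whether the sample is $\epsilon$-representative against a Monte Carlo estimate of the true risks.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small finite hypothesis class of thresholds: h_theta(x) = sign(x - theta)
thresholds = np.linspace(-2, 2, 21)

def draw(n):
    """Sample n labeled points from an assumed distribution:
    x ~ N(0, 1), y = sign(x - 0.5) with labels flipped with probability 0.1."""
    x = rng.normal(size=n)
    y = np.sign(x - 0.5)
    y[rng.random(n) < 0.1] *= -1
    return x, y

def risks(x, y):
    """0-1 risk of every threshold hypothesis on the sample (x, y);
    on a training set this is L_S(h) = (1/m) * sum of losses."""
    return np.array([np.mean(np.sign(x - t) != y) for t in thresholds])

# Monte Carlo stand-in for the true risks L_D(h)
L_D = risks(*draw(500_000))

# One training sample S of size m
m, eps = 500, 0.05
L_S = risks(*draw(m))

# S is eps-representative iff the gap is at most eps for EVERY h in H
print("max_h |L_S(h) - L_D(h)| =", np.max(np.abs(L_S - L_D)))
print("S is eps-representative:", bool(np.all(np.abs(L_S - L_D) <= eps)))
```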
:::info
The next simple lemma states that whenever the sample is $(\epsilon / 2)$-representative, the ERM learning rule is guaranteed to return a good hypothesis.
:::

:::success
The next lemma states that whenever the sample is $(\epsilon / 2)$-representative, the ERM learning rule is guaranteed to return a good hypothesis.

LEMMA 4.2
Assume that a training set $S$ is $\dfrac{\epsilon}{2}$-representative (w.r.t. domain $Z$, hypothesis class $\mathcal{H}$, loss function $\ell$, and distribution $\mathcal{D}$). Then, any output of $\mathrm{ERM}_\mathcal{H}(S)$, namely, any $h_S \in \arg \min_{h \in \mathcal{H}} L_S(h)$, satisfies

$$L_\mathcal{D}(h_S) \leq \min_{h \in \mathcal{H}}L_\mathcal{D}(h) + \epsilon$$

Note that $L_{\mathcal{D}}(h_S)$ is the true risk of the hypothesis learned by ERM, while $\min_{h \in \mathcal{H}} L_\mathcal{D}(h)$ ranges over every hypothesis in the class. The lemma says that the true risk of the ERM output $h_S$ is at most the lowest true risk achievable in $\mathcal{H}$ plus $\epsilon$. Keep in mind that $h_S$ is only the best hypothesis on the data set $S$; it is not necessarily the best one on the true distribution.
:::

:::info
Proof For every $h \in \mathcal{H}$,

$$L_\mathcal{D}(h_S) \leq L_S(h_S) + \dfrac{\epsilon}{2} \leq L_S(h) + \dfrac{\epsilon}{2} \leq L_\mathcal{D}(h) + \dfrac{\epsilon}{2} + \dfrac{\epsilon}{2} = L_\mathcal{D}(h) + \epsilon$$

where the first and third inequalities are due to the assumption that $S$ is $\dfrac{\epsilon}{2}$-representative (Definition 4.1) and the second inequality holds since $h_S$ is an ERM predictor.
:::

:::success
Proof. For every $h \in \mathcal{H}$:

$$L_\mathcal{D}(h_S) \leq L_S(h_S) + \dfrac{\epsilon}{2} \leq L_S(h) + \dfrac{\epsilon}{2} \leq L_\mathcal{D}(h) + \dfrac{\epsilon}{2} + \dfrac{\epsilon}{2} = L_\mathcal{D}(h) + \epsilon$$

The first and third inequalities follow from the assumption that $S$ is $\dfrac{\epsilon}{2}$-representative (Definition 4.1), and the second inequality holds because $h_S$ is an ERM predictor.

This lemma shows that, to ensure that the ERM rule is an agnostic PAC learner, it suffices to show that with probability at least $1-\delta$ over the random choice of the training set, the training set is $\epsilon$-representative. The uniform convergence condition formalizes this requirement.

Checking the accompanying lecture, the statement proved there is not

$$L_\mathcal{D}(h_S) \leq \min_{h \in \mathcal{H}}L_\mathcal{D}(h) + \epsilon$$

but rather

$$L_\mathcal{D}(h_S) \leq \min_{h \in \mathcal{H}}L_\mathcal{D}(h) + 2\epsilon$$

Personally I find the $2\epsilon$ version less confusing, since Definition 4.1 bounds each gap by $\epsilon$:

* $L_\mathcal{D}(h_S) \leq L_S(h_S) + \epsilon$ follows from Definition 4.1: for any $h \in \mathcal{H}$ we have $\vert L_S(h) - L_\mathcal{D}(h) \vert \leq \epsilon$, so the two quantities differ by at most $\epsilon$, and dropping the absolute value in either direction is allowed.
* $L_\mathcal{D}(h_S) \leq L_S(h_S) + \epsilon \leq L_S(h) + \epsilon$: the second inequality holds because $h_S$ is chosen by ERM, so its empirical risk is the smallest in the class. Even though both $h \in \mathcal{H}$ and $h_S \in \mathcal{H}$, at worst $h$ happens to be $h_S$ itself, and the inequality still holds.
* The final inequality adds another $\epsilon$ when switching between distributions: whenever we move from $S$ to $\mathcal{D}$ (or vice versa) under $\epsilon$-representativeness, $\epsilon$ acts as a margin, which yields $L_\mathcal{D}(h) + 2\epsilon$.

The argument above proves the $2\epsilon$ version of Lemma 4.2. Since the definition is an assumption we are free to rescale, replacing $\epsilon$ with $\epsilon/2$ in the assumption turns the $2\epsilon$ conclusion into $\epsilon$, which is where the $\dfrac{\epsilon}{2}$ comes from.
:::
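As a sanity check of Lemma 4.2 (again my own sketch, using the same assumed threshold class and noisy-label distribution as above), the following draws many training sets, runs ERM on each, and verifies that whenever a sample happens to be $(\epsilon/2)$-representative, the ERM output's true risk stays within $\epsilon$ of the best true risk in the class.

```python
import numpy as np

rng = np.random.default_rng(1)
thresholds = np.linspace(-2, 2, 21)
eps, m, n_trials = 0.1, 500, 200

def draw(n):
    """Sample n labeled points from the assumed noisy-threshold distribution."""
    x = rng.normal(size=n)
    y = np.sign(x - 0.5)
    y[rng.random(n) < 0.1] *= -1
    return x, y

def risks(x, y):
    """0-1 risk of every threshold hypothesis on (x, y)."""
    return np.array([np.mean(np.sign(x - t) != y) for t in thresholds])

# Approximate the true risks L_D(h) with one very large sample
L_D = risks(*draw(500_000))
best = L_D.min()

violations = 0
for _ in range(n_trials):
    L_S = risks(*draw(m))
    h_S = np.argmin(L_S)                      # ERM: minimize empirical risk
    representative = np.all(np.abs(L_S - L_D) <= eps / 2)
    if representative and L_D[h_S] > best + eps:
        violations += 1                       # Lemma 4.2 says this cannot happen

print("violations of Lemma 4.2:", violations)  # expected: 0
```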
:::info
DEFINITION 4.3 (Uniform Convergence) We say that a hypothesis class $\mathcal{H}$ has the uniform convergence property (w.r.t. a domain $Z$ and a loss function $\ell$) if there exists a function $m_\mathcal{H}^{UC}: (0, 1)^2 \rightarrow \mathbb{N}$ such that for every $\epsilon, \delta \in (0, 1)$ and for every probability distribution $\mathcal{D}$ over $Z$, if $S$ is a sample of $m \geq m_\mathcal{H}^{UC}(\epsilon, \delta)$ examples drawn i.i.d. according to $\mathcal{D}$, then, with probability of at least $1-\delta$, $S$ is $\epsilon$-representative.
:::

:::success
We say that a hypothesis class $\mathcal{H}$ has the uniform convergence property (w.r.t. a domain $Z$ and a loss function $\ell$) if the following holds:

* There exists a function $m_\mathcal{H}^{UC}: (0, 1)^2 \rightarrow \mathbb{N}$ such that for every $\epsilon, \delta \in (0, 1)$ and every probability distribution $\mathcal{D}$ over $Z$,
* if $S$ is a sample of $m \geq m_\mathcal{H}^{UC}(\epsilon, \delta)$ examples drawn i.i.d. from $\mathcal{D}$, then with probability at least $1-\delta$, $S$ is $\epsilon$-representative.

Similarly to the definition of sample complexity for PAC learning, the function $m_\mathcal{H}^{UC}$ measures the (minimal) sample complexity of obtaining the uniform convergence property, namely, how many examples we need to ensure that with probability at least $1-\delta$ the sample is $\epsilon$-representative.

This builds on what was said above: if $S$ is $\epsilon$-representative, then ERM is a good strategy; and because ERM is a good strategy, the class is agnostic PAC learnable.
:::

:::info
COROLLARY 4.4 If a class $\mathcal{H}$ has the uniform convergence property with a function $m_\mathcal{H}^{UC}$ then the class is agnostically PAC learnable with the sample complexity $m_\mathcal{H}(\epsilon, \delta) \leq m_\mathcal{H}^{UC}(\epsilon/2, \delta)$. Furthermore, in that case, the $\mathrm{ERM}_\mathcal{H}$ paradigm is a successful agnostic PAC learner for $\mathcal{H}$.
:::

:::success
The term "uniform" here refers to having a fixed sample size that works for all members of $\mathcal{H}$ and for all possible probability distributions over the domain. The following corollary follows directly from Lemma 4.2 and the definition of uniform convergence.

COROLLARY 4.4
If a class $\mathcal{H}$ has the uniform convergence property with a function $m_\mathcal{H}^{UC}$, then the class is agnostically PAC learnable with sample complexity $m_\mathcal{H}(\epsilon, \delta) \leq m_\mathcal{H}^{UC}(\epsilon/2, \delta)$. Furthermore, in that case, the $\mathrm{ERM}_\mathcal{H}$ paradigm is a successful agnostic PAC learner for $\mathcal{H}$. Next, we derive an upper bound on $m_\mathcal{H}^{UC}$.
:::
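A rough empirical reading of Definition 4.3 (my own sketch, with the same assumed threshold class and distribution as before): for a few sample sizes $m$, estimate the probability that an i.i.d. sample of size $m$ is $\epsilon$-representative; the estimate should approach 1 as $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
thresholds = np.linspace(-2, 2, 21)
eps, n_trials = 0.1, 300

def draw(n):
    x = rng.normal(size=n)
    y = np.sign(x - 0.5)
    y[rng.random(n) < 0.1] *= -1
    return x, y

def risks(x, y):
    return np.array([np.mean(np.sign(x - t) != y) for t in thresholds])

L_D = risks(*draw(500_000))   # Monte Carlo stand-in for the true risks

for m in (50, 200, 800):
    hits = sum(
        np.all(np.abs(risks(*draw(m)) - L_D) <= eps)
        for _ in range(n_trials)
    )
    # Fraction of samples of size m that were eps-representative
    print(f"m={m:4d}  P[S is eps-representative] ~ {hits / n_trials:.2f}")
```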
### 4.2 Finite Classes Are Agnostic PAC Learnable

:::info
In view of Corollary 4.4, the claim that every finite hypothesis class is agnostic PAC learnable will follow once we establish that uniform convergence holds for a finite hypothesis class. To show that uniform convergence holds we follow a two step argument, similar to the derivation in Chapter 2. The first step applies the union bound while the second step employs a measure concentration inequality. We now explain these two steps in detail.
:::

:::success
In view of Corollary 4.4, once we establish that uniform convergence holds for finite hypothesis classes, it follows that every finite hypothesis class is agnostic PAC learnable.

To show that uniform convergence holds, we follow a two-step argument, similar to the derivation in Chapter 2:

1. The first step applies the union bound.
2. The second step employs a measure concentration inequality.

The two steps in detail:

* Fix some $\epsilon, \delta$. We need to find a sample size $m$ that guarantees the following: for any distribution $\mathcal{D}$, with probability at least $1-\delta$ over an i.i.d. sample $S=(z_1, \cdots, z_m)$ drawn from $\mathcal{D}$, every $h \in \mathcal{H}$ satisfies $\vert L_S(h) - L_\mathcal{D}(h) \vert \leq \epsilon$. That is,

$$\mathcal{D}^m(\left\{ S:\forall h \in \mathcal{H}, \vert L_S(h) - L_\mathcal{D}(h) \vert \leq \epsilon \right\}) \geq 1 - \delta$$

Intuitively, this says: the probability (under $\mathcal{D}^m$, i.e., over drawing $m$ examples from $\mathcal{D}$) of the event that the data set $S$ makes every $h \in \mathcal{H}$ satisfy $\vert L_S(h) - L_\mathcal{D}(h) \vert \leq \epsilon$ is at least $1-\delta$.

Equivalently, we need to show that

$$\mathcal{D}^m(\left\{S: \exists h \in \mathcal{H}, \vert L_S(h) - L_\mathcal{D}(h) \vert > \epsilon \right\}) < \delta$$

This says: the probability of drawing a sample for which there exists some $h \in \mathcal{H}$ with $\vert L_S(h) - L_\mathcal{D}(h) \vert > \epsilon$ is less than $\delta$.

Writing

$$\left\{S: \exists h \in \mathcal{H}, \vert L_S(h) - L_\mathcal{D}(h) \vert > \epsilon \right\} = \bigcup_{h \in \mathcal{H}} \left\{S: \vert L_S(h) - L_\mathcal{D}(h) \vert > \epsilon \right\}$$

and applying the union bound (Lemma 2.2), we obtain

$$\mathcal{D}^m(\left\{S: \exists h \in \mathcal{H}, \vert L_S(h) - L_\mathcal{D}(h) \vert > \epsilon \right\}) \leq \sum_{h \in \mathcal{H}} \mathcal{D}^m(\left\{S: \vert L_S(h) - L_\mathcal{D}(h) \vert > \epsilon \right\}) \tag{4.1}$$
:::

:::info
Our second step will be to argue that each summand of the right-hand side of this inequality is small enough (for a sufficiently large $m$). That is, we will show that for any fixed hypothesis, $h$, (which is chosen in advance prior to the sampling of the training set), the gap between the true and empirical risks, $\vert L_S(h) - L_\mathcal{D}(h) \vert$, is likely to be small.
:::

:::success
The second step argues that each summand on the right-hand side of this inequality is small enough (for a sufficiently large $m$). That is, we will show that for any fixed hypothesis $h$ (chosen in advance, before the training set is sampled), the gap between the true and empirical risks, $\vert L_S(h) - L_\mathcal{D}(h) \vert$, is likely to be small.
:::

:::info
Recall that $L_\mathcal{D}(h) = \mathbb{E}_{z \sim \mathcal{D}} \left[\ell(h, z) \right]$ and that $L_S(h) = \dfrac{1}{m} \sum^m_{i=1} \ell(h, z_i)$. Since each $z_i$ is sampled i.i.d. from $\mathcal{D}$, the expected value of the random variable $\ell(h, z_i)$ is $L_\mathcal{D}(h)$. By the linearity of expectation, it follows that $L_\mathcal{D}(h)$ is also the expected value of $L_S(h)$. Hence, the quantity $\vert L_\mathcal{D}(h) - L_S(h) \vert$ is the deviation of the random variable $L_S(h)$ from its expectation. We therefore need to show that the measure of $L_S(h)$ is concentrated around its expected value.
:::

:::success
Recall the two quantities:

* $L_\mathcal{D}(h) = \mathbb{E}_{z \sim \mathcal{D}} \left[\ell(h, z) \right]$
* $L_S(h) = \dfrac{1}{m} \sum^m_{i=1} \ell(h, z_i)$

Since each $z_i$ is sampled i.i.d. from $\mathcal{D}$, the expected value of the random variable $\ell(h, z_i)$ is $L_\mathcal{D}(h)$. By linearity of expectation, $L_\mathcal{D}(h)$ is also the expected value of $L_S(h)$. Hence $\vert L_\mathcal{D}(h) - L_S(h) \vert$ is the deviation of the random variable $L_S(h)$ from its expectation, and we need to show that the measure of $L_S(h)$ is concentrated around its expected value.

A basic statistical fact, the law of large numbers, says that as the number of samples grows, the empirical average approaches the expected value: as $m$ goes to infinity, the empirical mean converges to the true expectation. This applies to $L_S(h)$, since it is the empirical average of $m$ i.i.d. random variables. However, the law of large numbers is only an asymptotic result, so it gives no information about the gap between the empirical estimate and the true value for any given finite sample size.

We therefore use a measure concentration inequality due to Hoeffding, which quantifies the gap between an empirical average and its expected value.
:::
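A quick simulation of this concentration effect (my own sketch; the Bernoulli variables below are an assumption standing in for the bounded losses $\ell(h, z_i)$): it draws the empirical mean of $m$ i.i.d. $[0,1]$-valued variables many times, reports how often the mean deviates from its expectation by more than $\epsilon$, and compares with the $2\exp(-2m\epsilon^2)$ bound given by the Hoeffding inequality stated next.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for the losses l(h, z_i): i.i.d. Bernoulli(mu) variables in [0, 1]
mu, eps, n_trials = 0.3, 0.05, 20_000

for m in (100, 400, 1600):
    # Each row is one training sample; row means play the role of L_S(h)
    means = rng.binomial(1, mu, size=(n_trials, m)).mean(axis=1)
    deviation_prob = np.mean(np.abs(means - mu) > eps)
    hoeffding_bound = 2 * np.exp(-2 * m * eps**2)   # Lemma 4.5 with b - a = 1
    print(f"m={m:5d}  P[|L_S - L_D| > eps] ~ {deviation_prob:.4f}"
          f"   Hoeffding bound = {hoeffding_bound:.4f}")
```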
:::info
LEMMA 4.5 (Hoeffding's Inequality) Let $\theta_1, \cdots, \theta_m$ be a sequence of i.i.d. random variables and assume that for all $i$, $\mathbb{E}\left[\theta_i \right]=\mu$ and $\mathbb{P} \left[a \leq \theta_i \leq b \right]=1$. Then, for any $\epsilon > 0$

$$\mathbb{P}\left[\left\vert \dfrac{1}{m} \sum^m_{i=1} \theta_i - \mu \right\vert > \epsilon \right] \leq 2 \exp(-2m\epsilon^2 / (b-a)^2)$$
:::

:::success
LEMMA 4.5 ([Hoeffding's Inequality](https://zh.wikipedia.org/wiki/%E9%9C%8D%E5%A4%AB%E4%B8%81%E4%B8%8D%E7%AD%89%E5%BC%8F), [additional notes](https://www.itread01.com/content/1546706721.html))

* Let $\theta_1, \cdots, \theta_m$ be a sequence of i.i.d. random variables and assume that for all $i$, $\mathbb{E}\left[\theta_i \right]=\mu$ and $\mathbb{P} \left[a \leq \theta_i \leq b \right]=1$. Then, for any $\epsilon > 0$,

$$\mathbb{P}\left[\left\vert \dfrac{1}{m} \sum^m_{i=1} \theta_i - \mu \right\vert > \epsilon \right] \leq 2 \exp(-2m\epsilon^2 / (b-a)^2)$$

The condition $\mathbb{P} \left[a \leq \theta_i \leq b \right]=1$ means that each $\theta_i$ is bounded, and clearly the larger $m$ is, the smaller the right-hand side becomes. The proof can be found in Appendix B.

Back to our problem: let $\theta_i$ be the random variable $\ell(h, z_i)$. Since $h$ is fixed and $z_1, \cdots, z_m$ are sampled i.i.d., $\theta_1, \cdots, \theta_m$ are also i.i.d. random variables. Furthermore, $L_S(h) = \dfrac{1}{m} \sum^m_{i=1} \theta_i$ and $L_\mathcal{D}(h) = \mu$.

Assume further that the range of $\ell$ is $\left[0, 1\right]$, so that $\theta_i \in \left[0, 1 \right]$. We therefore obtain

$$\mathcal{D}^m(\left\{S:\vert L_S(h) - L_\mathcal{D}(h) \vert > \epsilon \right\}) = \mathbb{P} \left[\left\vert \dfrac{1}{m} \sum^m_{i=1} \theta_i - \mu \right\vert > \epsilon \right] \leq 2 \exp(-2m\epsilon^2) \tag{4.2}$$

Combining this with Equation (4.1) yields

$$\begin{aligned}\mathcal{D}^m(\left\{S:\exists h \in \mathcal{H}, \vert L_S(h) - L_\mathcal{D}(h) \vert > \epsilon \right\}) & \leq \sum_{h \in \mathcal{H}}2 \exp(-2m\epsilon^2) \\ &=2\vert \mathcal{H} \vert \exp(-2m\epsilon^2)\end{aligned}$$

Finally, if we choose

$$m \geq \dfrac{\log(2 \vert \mathcal{H} \vert / \delta)}{2\epsilon^2}$$

then

$$\mathcal{D}^m(\left\{S:\exists h \in \mathcal{H}, \vert L_S(h) - L_\mathcal{D}(h) \vert > \epsilon \right\}) \leq \delta$$
:::

:::info
COROLLARY 4.6 Let $\mathcal{H}$ be a finite hypothesis class, let $Z$ be a domain, and let $\ell: \mathcal{H} \times Z \rightarrow \left[0, 1\right]$ be a loss function. Then, $\mathcal{H}$ enjoys the uniform convergence property with sample complexity

$$m_\mathcal{H}^{UC}(\epsilon, \delta) \leq \left \lceil \dfrac{\log(2\vert \mathcal{H} \vert / \delta)}{2\epsilon^2} \right \rceil$$

Furthermore, the class is agnostically PAC learnable using the ERM algorithm with sample complexity

$$m_\mathcal{H}(\epsilon, \delta) \leq m_\mathcal{H}^{UC}(\epsilon/2, \delta) \leq \left\lceil \dfrac{2\log(2\vert\mathcal{H}\vert /\delta)}{\epsilon^2} \right \rceil$$
:::

:::success
COROLLARY 4.6
Let $\mathcal{H}$ be a finite hypothesis class, let $Z$ be a domain, and let $\ell: \mathcal{H} \times Z \rightarrow \left[0, 1\right]$ be a loss function. Then $\mathcal{H}$ enjoys the uniform convergence property with sample complexity

$$m_\mathcal{H}^{UC}(\epsilon, \delta) \leq \left \lceil \dfrac{\log(2\vert \mathcal{H} \vert / \delta)}{2\epsilon^2} \right \rceil$$

Furthermore, the class is agnostically PAC learnable using the ERM algorithm, with sample complexity

$$m_\mathcal{H}(\epsilon, \delta) \leq m_\mathcal{H}^{UC}(\epsilon/2, \delta) \leq \left\lceil \dfrac{2\log(2\vert\mathcal{H}\vert /\delta)}{\epsilon^2} \right \rceil$$
:::

:::info
Remark 4.1 (The "Discretization Trick") While the preceding corollary only applies to finite hypothesis classes, there is a simple trick that allows us to get a very good estimate of the practical sample complexity of infinite hypothesis classes. Consider a hypothesis class that is parameterized by $d$ parameters. For example, let $\mathcal{X} = \mathbb{R}, \mathcal{Y} = \left\{ \pm 1 \right\}$, and the hypothesis class, $\mathcal{H}$, be all functions of the form $h_\theta(x)=\mathrm{sign}(x - \theta)$. That is, each hypothesis is parameterized by one parameter, $\theta \in \mathbb{R}$, and the hypothesis outputs 1 for all instances larger than $\theta$ and outputs -1 for instances smaller than $\theta$. This is a hypothesis class of an infinite size. However, if we are going to learn this hypothesis class in practice, using a computer, we will probably maintain real numbers using floating point representation, say, of 64 bits. It follows that in practice, our hypothesis class is parameterized by the set of scalars that can be represented using a 64 bits floating point number. There are at most $2^{64}$ such numbers; hence the actual size of our hypothesis class is at most $2^{64}$. More generally, if our hypothesis class is parameterized by $d$ numbers, in practice we learn a hypothesis class of size at most $2^{64d}$. Applying Corollary 4.6 we obtain that the sample complexity of such classes is bounded by $\dfrac{128d + 2 \log(2 / \delta)}{\epsilon^2}$. This upper bound on the sample complexity has the deficiency of being dependent on the specific representation of real numbers used by our machine. In Chapter 6 we will introduce a rigorous way to analyze the sample complexity of infinite size hypothesis classes. Nevertheless, the discretization trick can be used to get a rough estimate of the sample complexity in many practical situations.
:::

:::success
Remark 4.1 (The "Discretization Trick")
Although the preceding corollary applies only to finite hypothesis classes, there is a simple trick that gives a very good estimate of the practical sample complexity of infinite hypothesis classes. Consider a hypothesis class parameterized by $d$ parameters.

For example:

* Let $\mathcal{X} = \mathbb{R}, \mathcal{Y} = \left\{ \pm 1 \right\}$, and let the hypothesis class $\mathcal{H}$ consist of all functions of the form $h_\theta(x)=\mathrm{sign}(x - \theta)$.
* Each hypothesis is parameterized by a single parameter $\theta \in \mathbb{R}$.
* It outputs 1 for all instances larger than $\theta$.
* It outputs -1 for all instances smaller than $\theta$.

This is a hypothesis class of infinite size. However, if we learn this hypothesis class in practice, using a computer, we will probably represent real numbers in floating point, say with 64 bits. In practice, then, the hypothesis class is parameterized by the set of scalars that can be represented by a 64-bit floating point number:

* There are at most $2^{64}$ such numbers.
* Hence the actual size of the hypothesis class is at most $2^{64}$.
* More generally, if the hypothesis class is parameterized by $d$ numbers, in practice we learn a hypothesis class of size at most $2^{64d}$.

Applying Corollary 4.6, the sample complexity of such classes is bounded by $\dfrac{128d + 2 \log(2 / \delta)}{\epsilon^2}$. This upper bound has the deficiency of depending on the specific representation of real numbers used by the machine. Chapter 6 introduces a rigorous way to analyze the sample complexity of infinite hypothesis classes. Nevertheless, the discretization trick can be used to get a rough estimate of the sample complexity in many practical situations.
:::
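To make these bounds concrete, here is a small helper of my own that plugs numbers into Corollary 4.6 for a finite class of size $\vert\mathcal{H}\vert$, and into the discretization-trick bound of Remark 4.1 for a class with $d$ parameters stored as 64-bit floats. As in the book's derivation, $\log$ is the natural logarithm; the specific values of $\epsilon$, $\delta$, and $\vert\mathcal{H}\vert$ below are just illustrative.

```python
import math

def m_uc(h_size: int, eps: float, delta: float) -> int:
    """Corollary 4.6: upper bound on m_H^UC(eps, delta) for a finite class."""
    return math.ceil(math.log(2 * h_size / delta) / (2 * eps**2))

def m_agnostic(h_size: int, eps: float, delta: float) -> int:
    """Corollary 4.6: m_H(eps, delta) <= m_H^UC(eps/2, delta)."""
    return m_uc(h_size, eps / 2, delta)

def m_discretized(d: int, eps: float, delta: float) -> float:
    """Remark 4.1: the (128d + 2 log(2/delta)) / eps^2 bound for a class
    parameterized by d 64-bit floating point numbers (|H| <= 2^(64d))."""
    return (128 * d + 2 * math.log(2 / delta)) / eps**2

eps, delta = 0.01, 0.05
print("finite class, |H| = 1000:")
print("  m_H^UC(eps, delta) <=", m_uc(1000, eps, delta))
print("  m_H(eps, delta)    <=", m_agnostic(1000, eps, delta))
print("one real parameter, d = 1 (Remark 4.1):")
print("  sample complexity  <=", round(m_discretized(1, eps, delta)))
```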
### 4.3 Summary

:::info
If the uniform convergence property holds for a hypothesis class $\mathcal{H}$ then in most cases the empirical risks of hypotheses in $\mathcal{H}$ will faithfully represent their true risks. Uniform convergence suffices for agnostic PAC learnability using the ERM rule. We have shown that finite hypothesis classes enjoy the uniform convergence property and are hence agnostic PAC learnable.
:::

:::success
If the uniform convergence property holds for a hypothesis class $\mathcal{H}$, then in most cases the empirical risks of hypotheses in $\mathcal{H}$ faithfully represent their true risks. Uniform convergence suffices for agnostic PAC learnability via the ERM rule. We have shown that finite hypothesis classes enjoy the uniform convergence property and are therefore agnostic PAC learnable.
:::