Introduction to Statistical Learning and Deep Learning

tags: `Statistical Learning and Deep Learning`

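Statistical learning refers to a set of tools for modeling and understanding complex datasets. It is a subfield of statistics that has flourished in recent years and has blended with computer science and machine learning. The field covers many methods, such as regularized regression, classification, graphical models, and, more recently, deep learning.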

What statistical learning means

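Statistical learning means building a probabilistic model from data in order to make predictions or carry out analysis. It differs from machine learning in emphasis: machine learning is mainly about designing and analyzing algorithms that let a computer "learn" automatically, where "learning" means that the computer fits a predictive model on its own.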

Modeling from data

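Why build a model from data rather than describe the data in some simple, direct way? Intuitively, the direct approach can work. Suppose we want to run sentiment analysis on customer reviews of one of the company's products: given lists of positive and negative words, we can roughly sketch where the product stands in customers' minds. But a company will not analyze just one product; it will analyze many products together. In a setting that complex, we need a more generalized way to describe the data, which both saves time and improves accuracy. As another example, language changes over time: 「森77」 is a negative internet slang term that emerged only in recent years. In a changing environment like this, the model has to be adjusted accordingly.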

Analysis methods in statistical learning (1)

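Once we have the data, we first examine which approach suits it best, so as to improve the model's efficiency and accuracy. The categories introduced below are not absolute; they are simply the most common ways of classifying methods in statistical learning.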

Supervised learning

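Supervised learning can be broadly divided into two types. The first is the regression model, whose goal is to predict a specific numeric value from features. For example, a film studio may want to predict the box office of an upcoming film over its first two weeks of release; the prediction target is then the box office gross, and the features could include the cast, the plot, and so on.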

The other type of supervised learning is classification: given features, assign each observation to a category. In the sentiment analysis example above, we could classify words as positive, neutral, or negative.
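As a rough illustration of classification, the sketch below trains a small multinomial Naive Bayes classifier on labeled review snippets. Naive Bayes is chosen here only as a convenient example and is not a method named in the notes; the training sentences, the `TRAIN` list, and the function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled reviews (invented for illustration).
TRAIN = [
    ("great product love it", "positive"),
    ("love the quality great value", "positive"),
    ("terrible quality hate it", "negative"),
    ("hate the terrible service", "negative"),
    ("the product arrived today", "neutral"),
    ("the package arrived on time today", "neutral"),
]


def train_naive_bayes(examples):
    """Count how often each word appears under each label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log-posterior score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids log(0).
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
print(predict("love this great product", model))
print(predict("terrible service", model))
```

The labels play the role of the "supervision": the classifier only learns which words go with which sentiment because every training sentence comes with its category.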

Unsupervised learning

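Unsupervised learning automatically classifies or groups the input data without any pre-labeled training examples. The first approach is clustering, where we want to see how the data group together.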
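One common clustering method is k-means. The notes do not name a specific algorithm, so the following is only a minimal sketch under that assumption; the data points and the choice to initialize from the first `k` points are hypothetical and made for reproducibility.

```python
import numpy as np


def kmeans(points, k, n_iter=20):
    """Plain k-means: alternate between assigning points and updating centroids."""
    # Assumption: start from the first k points so the run is deterministic.
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Hypothetical 2-D data: two visibly separated groups, no labels given.
points = np.array([
    [1.0, 1.0], [5.0, 5.2], [1.2, 0.8], [0.9, 1.1],
    [5.1, 4.9], [4.8, 5.0],
])
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

No labels are supplied anywhere; the grouping comes entirely from the distances between the points.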

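The other approach is dimension reduction and visualization. As the name suggests, dimension reduction lowers the dimensionality of high-dimensional data, with the aim of bringing complex data down to the three dimensions (or fewer) that humans can perceive. Text is one example of high-dimensional data. A common way to handle text is to use dummy variables, also known as one-hot encoding, where a combination of 0s and 1s represents a word. For example, given the phrases "one dog" and "one cat", the three distinct words can be encoded as follows: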

one: [1, 0, 0]
cat: [0, 1, 0]
dog: [0, 0, 1]

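This approach looks convenient, but it produces high-dimensional vectors and makes the dataset much larger. It also has a weakness as a data representation: in the example above, the pairwise distances between one, dog, and cat are all $\sqrt{2}$, yet in real life dog and cat should be the closest pair. To address this, we place similar words close to one another (word embedding).

What follows is a demonstration on Chinese characters. The training data are news articles from tw.yahoo.com from 2005 to 2012, with 87,848,812 Chinese characters in total and 9,405 distinct characters. Each character is represented as a 200-dimensional vector (word embedding) and projected onto a two-dimensional plane with t-SNE; in the projection, other surnames appear near the surname 「陳」.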
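Returning to the one-hot example, the equal-distance claim can be checked numerically. The sketch below uses the same three vectors as the listing above; the 2-D `embedding` values in the second half are made-up numbers, included only to show what "similar words placed close together" looks like.

```python
import itertools
import numpy as np

vocab = ["one", "cat", "dog"]
# Row i of the identity matrix is the one-hot vector of word i.
one_hot = {word: row for word, row in zip(vocab, np.eye(len(vocab)))}

# Every pair of distinct one-hot vectors differs in exactly two positions,
# so every Euclidean distance is sqrt(1 + 1) = sqrt(2).
for a, b in itertools.combinations(vocab, 2):
    dist = np.linalg.norm(one_hot[a] - one_hot[b])
    print(a, b, round(float(dist), 4))

# Hypothetical 2-D embeddings (numbers invented for illustration only):
# similar words are placed near each other.
embedding = {
    "one": np.array([0.0, 1.0]),
    "cat": np.array([1.0, 0.1]),
    "dog": np.array([1.0, -0.1]),
}
cat_dog = np.linalg.norm(embedding["cat"] - embedding["dog"])
cat_one = np.linalg.norm(embedding["cat"] - embedding["one"])
print(cat_dog < cat_one)
```

With one-hot vectors the distance carries no information about meaning; with a dense embedding the distance between cat and dog can be made smaller than the distance between cat and one.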

Reinforcement learning

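When people make decisions, they usually take an action based on the current state of the environment. Taking an action has two consequences. First, the environment gives feedback, that is, a reward. Second, the action changes the environment, moving it into a new state. People generally revise their policy according to the rewards the environment gives, trying to maximize their long-term reward. Reinforcement learning aims to have a machine, also called an agent, imitate this sequence of behavior.^[1]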
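The state, action, reward, and policy loop described above can be sketched with tabular Q-learning. Q-learning is one possible algorithm, picked here for brevity and not named in the notes; the five-state corridor environment, the constants, and the function names are hypothetical.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = [-1, +1]    # move left or move right
GAMMA = 0.9           # discount factor for long-term reward
ALPHA = 0.5           # learning rate


def step(state, action):
    """Environment: apply the action, return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    # q[state][action_index] estimates the long-term reward of that choice.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            a = rng.randrange(len(ACTIONS))  # explore with random actions
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward
            # reward + discounted value of the best next action.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
            if done:
                break
    return q


q = train()
# Greedy policy: in each non-goal state pick the action with the larger estimate.
policy = ["left" if row[0] > row[1] else "right" for row in q[:-1]]
print(policy)
```

The agent is never told that moving right is correct. It only sees rewards, and the discounted update propagates the value of the goal back to earlier states.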

In fact, in economics, intertemporal decision-making is an example of reinforcement learning.

Analysis methods in statistical learning (2)

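Another way to classify statistical learning methods is to divide them into instance-based learning and model-based learning. The former memorizes all of the training data. Think of ~~poor Taiwanese students~~ students diligently preparing for midterms and finals, committing everything taught in class to memory before the exam.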
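k-nearest neighbors is a standard instance-based method. The sketch below is a minimal version under assumed toy data; the class name `KNNClassifier` and its methods are hypothetical, not taken from a library.

```python
from collections import Counter
import numpy as np


class KNNClassifier:
    """Instance-based learner: 'training' only stores the examples."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # No parameters are estimated; the whole training set is memorized.
        self.X = np.asarray(X, dtype=float)
        self.y = list(y)
        return self

    def predict_one(self, x):
        # Compare the query with every stored example.
        dists = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        nearest = np.argsort(dists)[: self.k]
        votes = Counter(self.y[i] for i in nearest)
        return votes.most_common(1)[0][0]


# Hypothetical toy data: two features, two classes.
X_train = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [4.0, 4.2], [4.1, 3.9], [3.9, 4.0]]
y_train = ["a", "a", "a", "b", "b", "b"]

knn = KNNClassifier(k=3).fit(X_train, y_train)
print(knn.predict_one([1.1, 1.0]))
print(knn.predict_one([4.0, 4.0]))
```

All of the work happens at prediction time, and the memory cost grows with the size of the training set, which is the trade-off of memorizing everything.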

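Unlike instance-based learning, the latter does not memorize all of the data; it memorizes only the model's parameters. A regression model is the best example: once the data are fed into a specified OLS regression model, the resulting coefficients are the parameters we keep.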
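A minimal sketch of this idea, assuming noise-free hypothetical data generated from y = 1 + 2*x1 - 3*x2: after the OLS fit, the data are deleted and prediction relies on the coefficients alone.

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2*x1 - 3*x2 with no noise.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Add a column of ones so the first coefficient is the intercept.
design = np.column_stack([np.ones(len(X)), X])

# Least-squares fit; coef holds [intercept, slope_x1, slope_x2].
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# From here on the training data are no longer needed:
# prediction uses only the three stored coefficients.
del X, y, design
x_new = np.array([1.0, 3.0, 2.0])  # leading 1.0 is for the intercept
print(np.round(coef, 6))
print(float(x_new @ coef))
```

However many rows the training set has, the fitted model here is just three numbers.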

Analysis methods in statistical learning (3)

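The last common classification is batch learning versus incremental learning. The former can be thought of as cramming: all of the data are fed into the model in one go. The latter feeds the data into the model in batches. It is mostly used while data are still being collected, or in the middle and later stages of training, when there is a concern that the data are too large for the computer to process at once; feeding the data in batches reduces the computational burden.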
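The contrast can be sketched with linear regression on simulated data. The batch version solves the problem with every row in memory; the incremental version visits the data in chunks of 100 rows and keeps only two running sums. The data, chunk size, and variable names are assumptions made for illustration.

```python
import numpy as np

# Simulated data: intercept column plus two features, with a little noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=1000)

# Batch learning: solve the least-squares problem with all rows at once.
batch_coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Incremental learning: visit the data in chunks and keep only running sums.
xtx = np.zeros((3, 3))
xty = np.zeros(3)
for start in range(0, len(X), 100):
    X_chunk = X[start:start + 100]
    y_chunk = y[start:start + 100]
    xtx += X_chunk.T @ X_chunk   # accumulate X'X
    xty += X_chunk.T @ y_chunk   # accumulate X'y
    # The current estimate could be refreshed here after every chunk.
incremental_coef = np.linalg.solve(xtx, xty)

print(np.allclose(batch_coef, incremental_coef))
```

This shortcut works because least squares depends on the data only through those two sums. For models without such a property, incremental learning is commonly done with stochastic gradient updates instead, where the chunked result approximates the batch result.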

The myth of data volume (power of data)

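It is commonly held that the more data we have when building a model, the better, and all the more so for complex models. A simple example: English has many easily confused words, and the task is to train a disambiguation model for confusion sets such as {principle, principal}, {then, than}, {to, two, too}, and {weather, whether}. The trend chart for this task shows that accuracy increases as the amount of data grows.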

Study notes

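These notes are based on the course "Statistical Learning and Deep Learning" taught by Professor 盧信銘 at National Taiwan University in the 111-1 semester.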

K-Nearest Neighbors (KNN) algorithm

References

Pattern Recognition and Machine Learning by Christopher M. Bishop; ISBN 0-387-31073-8.
Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurélien Géron; ISBN 978-1-491-96229-9.
Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville; https://www.deeplearningbook.org/
Dive into Deep Learning by Aston Zhang, Zack C. Lipton, Mu Li, Alex J. Smola; https://d2l.ai/ and https://github.com/dsgiitr/d2l-pytorch

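[1] Excerpted from 人工智慧與增強學習1：什麼是增強學習？ (Artificial Intelligence and Reinforcement Learning 1: What Is Reinforcement Learning?)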
