人工智慧導論期末整理

人工智慧導論期末整理

Chapter 19. Learning from examples

Entropy

討論一個隨機變數不確定性的方式。

$E n t r o p y = - \sum_{k} P (v_{k}) * l o g_{2} P (v_{k})$
對於 boolean variable 來說，
$e n t r o p y$ 為底下公式

$E n t r o p y = B (P) = - (P * l o g_{2} P + (1 - P) * l o g_{2} (1 - P))$
- 如果 training set 中有 p 個 positive ， n 個 negative，整個 set 的
  $e n t r o p y$ 為
  $B (\frac{p}{p + n})$
- 如果使用一個 attribute 把一個 set 分成
  $p_{k}$ 個 positive 和
  $n_{k}$ 個 negative，整個 set 剩下的
  $e n t r o p y$ 為
  $R e m a i n d e r = \sum_{k = 1}^{d} \frac{p_{k} + n_{k}}{p + n} * B (\frac{p_{k}}{p_{k} + n_{k}})$
透過一個 attribute 得到的資訊
$(G a i n)$ 為原先已知減掉
$R e m a i n d e r$

decision tree pruning

從一個已經完整建好的 decision tree 開始做，將不重要的 attribute 刪除。
- 決定是否重要可從
  $G a i n$ 得知，如果
  $G a i n ≃ 0$ 代表這個 attribute 不重要，也就是
  $\frac{p_{k}}{p_{k} + n_{k}} ≃ \frac{p}{p + n}$ 。

Statistical significance test

TBA, not something I could read
p.28 and p.29

independent and identically distributed (i.i.d.)

每個 sample 值與之前的 sample 直接無關
而且每次 sample 之間的機率分布狀況皆相同

k-fold cross validation

將資料分成
$k$ 等分，每次皆拿一部分作為 testing，其他部分作為 training data，將每次結果取平均作為整體模型的表現
- 如果
  $k = n$ ，這代表只使用一筆資料作為 testing，別名為 leave-one-out cross-validation (LOOCV)

expected loss

對於 hypothesis
$h$ 和 loss function
$L$ 來說，expected generalization loss(GenLoss) 為
$G e n L o s s_{L} (h) = \sum_{(x, y) \in ϵ} L (y, h (x)) * P (x, y), h (x) = \hat{y}$ ，而
$ϵ$ 為所有輸入輸出 (x, y) 組合
- 最佳的 hypothesis
  $h^{*}$ 即為上面式子中得出最小值的
  $h$ ，也就是
  $h^{*} = a r g m i n G e n L o s s_{L} (h)$ 。
因為實際的
$P (x, y)$ 未知，我們只能用 empirical loss 預測，假設一個 example set
$E$ ，預測算式為
$E m p L o s s_{L, E} (h) = \frac{1}{N} \sum_{(x, y) \in E} L (y, h (x))), h (x) = \hat{y}$
- 最佳的 hypothesis
  ${\hat{h}}^{*} = a r g m i n E m p L o s s_{L, E} (h)$

Regularization

也可以把 Cost 定義為 empirical loss 和 model complexity 的 weighted sum，也就是
$C o s t (h) = E m p L o s s (h) + λ * C o m p l e x i t y (h)$
- 此時最佳的 hypothesis
  $h^{*} = a r g m i n C o s t (h)$
- 而
  $λ$ 要找一個能讓 validation 判別最準確的數值。
- 透過懲罰過高 complexity 的方式定義 cost 就是regularization

Univariate(Multivariate) linear regression

??
p.46

Non-parametric model

parametric model 是透過一組參數去描述整筆資料，訓練完了之後就不需要原始資料。
一個方法是把所有的 training data 做為模型的參數，這方法叫做 instanced-based learning 或是 memory-based learning。

Table lookup

在遇到一個輸入參數時直接尋找 training data 中是否有相同的輸入，有的話就輸出對應值；沒有的話就輸出一個預設數值。

Nearest neighbor(KNN)

給定一個輸入，找距離最近的 K 個數值當作數值處理的依據
- 對於 classification 來說，將每個數值做為投票，較多數的做為輸出。(因此需要 K 為奇數)
- 對於 regression 來說，可輸出所有數值的平均或是中位數。又或是使用 linear regression 的方式取值。
計算距離的方式是 Minkowski Distance 或稱為
$L_{p} n o r m$ ，定義為
$L_{p} = (\sum_{i} | x_{j, i} - x_{q, i} |^{p})^{1 / p}$ 。
- $p = 2$ 代表是 Eulicean distance，而
  $p = 1$ 代表是 Manhattan distance。
- 對於 boolean variable，兩個點的 attribute 差異稱為 Hamming distance。
- 計算 distance 的實際數字容易受到數量級的影響，因此需要先經過 normalize 的處理。
- 另一個計算 distance 的更複雜方式稱為 Mahalanobis distance
在資料是低維度的狀況下，KNN 的表現還算不錯；但是到了高維度，KNN 會因為很難找到距離夠近的鄰居而讓表現變得非常差。( Curse of Dimensionality )

k-dimentional tree (k-d tree)

balanced binary tree
每個 node 都是將一個 attribute 做一個 threshold ，讓一半的元素落在一邊，另一半落在另一邊。
這方法需要非常大量的 data，至少
$2^{n}$ 個資料才能很快速的查詢到需要的資料。

Locality-sensitive hashing

如果兩點間距離小於
$c r$ ，這兩點在 hash 後落在同個 bin 的機率很高，反之亦然。
從 bit-string 的 subset 隨機取值作為 hash function 。

Locally weighted regression

給很接近輸入數值的點很高的 weight ，距離越遠 weight 越低，透過這種方式計算平均值。
計算 weight 的 function 稱為 kernel。

Support vector machine (SVM)

目標是建構出 maximum margin seperator。
透過 kernel function 將低維度的輸入資料映射到高維度後再進行 linear seperation。這方法叫做 kernel trick。
SVM 是 non-parametric model。
一個核心點是 SVM 認為有些資料相較於其他資料更加重要，如果給這些更重要的資料更多關注可以讓整個 model generalize 得更好。
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Ensemble learning

將許多的 hypothesis 級合起來一起做出預測。
- 以 Classification 來說，這方法就是多數決。
- 以 Regression 來說，可以輸出這些預測的平均值。
Random forest 就是 decision tree 的 ensemble learning。透過隨機改變如何取 attribute 來建出不同的 tree。
- Random forest 不受 overfitting 影響。
- 經過證明，在 Random Forest 內加入越多 Tree 後，錯誤會逐漸收斂。
Stacked generalization(stacking) 是拿同一個 training data 在不同的 model 底下訓練，並將前面的訓練結果加入原先資料中在給下一層的 model 訓練。
Boosting 一開始先將所有 training data 的 weight 設為 1。每次訓練時對於分類錯誤的 trainging data 要提升對應的 weight；而分類正確的 data 則減少 weight。
- 調整完 weight 後救生成了一個新的 hypothesis ，重複直到有 K 個 hypothesis 後再互相比較表現狀況。
- 一個比較著名的 algorithm 是 ADABOOST ，只要 learning algorithm 的表現狀況比亂猜還要好，這個方法可以在足夠大的 K 之後完美的分類所有 data。
如果取樣不符合 i.i.d.，可以使用 online learning 的方式。也就是 agent 在當下獲取環境資訊並做出決策，然後再檢驗是否符合預設答案。

trust, intrepretability, explainability

trust: 我們有沒有辦法信任這個 model
intrepretability: 我們能不能解釋 model 內的那些參數產生了這個輸出。
explainability: 我們能不能理解為何這個輸入經過這個 model 後產生了這個輸出。

Chapter 20. Learning Probabilistic Models

Statistical learning

Baysian learning 在給定資料的狀況下計算每個 hypothesis 的機率，並且做出預測。也就是預測的結果會使用到全部的 hypothesis，並使用這個 hypothesis 對應的機率做為 weight。
使用貝式定理，可在只有事前機率和 likelihood 就得到事後機率
$P (h_{i} | d) = P (d | h_{i}) * P (h_{i})$ 。
- $P (h_{i})$ 是事前機率
- $P (d | h_{i})$ 是在特定的 hypothesis 狀況下的 likelihood，可透過
  $P (d | h_{i}) = Π_{j} P (d_{j} | j_{i})$ 計算得到(假設取樣皆 i.i.d.)。
對於錯誤的 hypothesis，事後機率會逐漸趨向零；對於正確的 hypothesis，事後機率會逐漸趨向實際分布狀況。

Maximum a posteriori(MAP hypothesis)

直接用一個最有可能的 hypothesis 去做預測，最佳 hypothesis 的取法是讓
$P (h_{i} | d)$ 最大化。
- 將
  $P (h_{i} | d) = P (d | h_{i}) * P (h_{i})$ 最大化的過程可視為將
  $- l o g_{2} P (d | h_{i}) - l o g_{2} P (h_{i})$ 最小化(取
  $l o g_{2}$ 後再轉負數)。
  $- l o g_{2} P (d | h_{i})$ 可當成是給定 hypothesis 後需要多少 bit 才能選定這筆 data，而
  $- l o g_{2} P (h_{i})$ 可以視為需要多少 bit 才能選定這個 hypothesis。
  - 這個方法則被稱為是 minimum description length(MDL)，直接計算這個 hypothesis 和 data 能夠取得多少 bits 的資料。
- 如果 hypothesis space 是 uniform，MAP 就變成只需要找到一個
  $h_{i}$ 可以最大化
  $P (d | h_{i})$ 。也就是 Maximum-likelihood hypothesis (
  $h_{M L}$ )。

Density estimation

在已知資料後，訓練一個模型的這回事。
TBA, uncertain, p.16

Naive bayes model

"Class" variable C (要預測的目標)設成 root，"attribute" variable
$X_{i}$ 設成 leaves。
稱為 "naive" 的原因是在給定某個 Class 時，假設每個 attribute 之間皆 conditionally independent。然而這個假設很常是個錯誤的假設。

Bayesian parameter learning

先針對所有可能的 hypothesis 定義事前機率的分布狀況，這種作法稱為 hypothesis prior。當 data 開始進來後，再逐步更新 posterior probablility。

beta distribution

這種 distribution 定義為
$b e t a [a, b] (p) = α * p^{a - 1} * (1 - p)^{b - 1}$ ，
$α$ 是 normalization constant。而 uniform distribution 正是
$b e t a [1, 1] (p)$ 。

Expectation-maximization algorithm

先隨機對於所有 data points 指派對應的 component
Expectation step(E-Step): 計算資料
$x_{j}$ 由 component
$i$ 生成的機率，也就是
$p_{i j} = P (C = i | x_{j}) = α * P (x_{j} | C = i) * P (C = i)$ ，其中
$P (x_{j} | C = i)$
$x_{j}$ 落在第
$i$ 個 component 的機率，而
$P (C = i)$ 是第
$i$ 個 component 的 weight 數值。
Maximization step(M-Step): 計算新的 mean, covariance, component weight。其中用到的數值
$n_{i} = \sum_{j} p_{i j}$ 為目前 component
$i$ 中的 data point 個數，以及
$N$ 是所有的 data point 個數。
- $μ_{i} \leftarrow \sum_{j} p i j * x_{j} / n_{i}$
- $\sum_{i} \leftarrow \sum_{j} p_{i j} * (x_{j} - μ_{i}) * (x_{i} - μ_{j})^{T} / n_{i}$
- $w_{i} \leftarrow n_{i} / N$

E-step 可視為是計算 hidden indicator variable
$Z_{i j}$ 的期望值，而
$Z_{i j}$ 在
$x_{j}$ 是被 component
$i$ 生成時才為 1，反之為 0。

Chapter 21. Deep Learning

Vanishing gradient

在某些狀況下，偏微分數值會變得極小。又因為偏微分的過程需要使用到 chain rule，越前面的 layer 更容易受到這個極小值的影響讓 weight 幾乎沒辦法更新。

Spatial invariance

在一個小區塊內的 pixel 不會相差太大。

Convolutional neural network (CNN)

在空間中距離很接近的 pixel 值可以透過 kernel 的運算傳送到下一層。
- stride: 每次運算之間 kernel 中心移動的 pixel 數量。
- receptive field: 這個 neuron 接收的涵蓋範圍。
- pooling and downsampling: 透過 pooling 運算，會讓輸出的圖片稍微變小，這叫做 downsampling。
- tensor: 任意維度的陣列
  - 其中 vector 就是一維的 tensor，matrix 則是二維的 tensor。

Residual network

將前面 layer 的資訊直接傳送到後面的 layer。

Learning algorithm

通常使用 Stochastic gradient decent(SGD)，透過 SGD 的隨機性讓學習過程不容易停在 local minimum。
另一個方法是使用 momentum，把前面幾次的數據納入考量後加入更新數值內。

Generalization

Weight decay: 在 loss function 中對於 weight 加一個 penalty
$λ$ 值。
- 對於能否幫助 generalization 不太明顯，但依舊是一個方法。
- 這方法也是 maximum a posteriori learning 的一個形式。
Dropout: 對於每個 neuron 加一個機率 p，在訓練時如果發生會直接停用這個 neuron，但是在實際使用模型時必須停用 dropout。
- 對於每個 minibatch，如果有 p 機率的 dropout，在輸出結果時需要多做一次除以 p。
- 這方法也可當成是在訓練時加入 noise。

Recurrent neural network(RNN)

將之前訓練的結果往回接到前面的 layer，讓這些成果可以在新的 input 就有影響。
Markov assumption: 每個 hidden state 都可以從之前的 input 取得足夠多的資料。
表示 Grediant 的式子有 recursive 的特性，因此也稱為 back-propagation through time。
- 其中包含了一個與
  $w_{z, z}$ 成正比的項，如果這個值小於 1，可能會發生 vanishing gradient 的問題；如果這個值大於 1，則會發生 exploding gradient 的問題。

Chapter 24. Deep Learning for Natural Language Processing

one-hot vector

來表示目前的字詞，但是這個方法沒辦法描述各個字之間的相似度。

word-embedding

這個方式可以讓模型學習各個字之間的相似度。

bidirectional RNN

這個方法讓文字辨識時可以取得這個字左邊及右邊的文字。

Attention

讓模型學習每個字對於判別的重要性，在模型中會以機率的方式呈現。
- 同時這種機率也可以讓人更容易解釋為何模型會這樣判別。(interpretable)

decoding – beam search

多緒平行選擇每一步的最適合字詞，並輸出最佳的字詞組合。

masked language models(MLM)

將部分字詞刻意 mask 起來，要求模型可以預測這些字詞。

Chapter 25. Computer Vision

CV 中兩大問題分別是 reconstruction 和 recognition。也就是根據圖形內容重建出世界以及辨識出圖形內的物件。

pinhole camera

針孔攝影機前端的開口就是光圈(Aperture)
focal length(焦距): 從針孔開口位置到 image plane 的距離
- 下圖中在場景中
  $P$ 點的座標為
  $(X, Y, Z)$ ，而在 image plane 上的
  $P^{'}$ 點的座標為
  $(x, y, z)$ 。
- 透過相似三角形可以得知
  $\frac{- x}{f} = \frac{X}{Z}, \frac{- y}{f} = \frac{Y}{Z}$ ，因此
  $x = \frac{- f X}{Z}, y = \frac{- f Y}{Z}$ 。
  - 這種成像方法也被稱為 image projection。隨著
    $Z$ 越大(物件離鏡頭越遠)，影像的大小也變小；而
    $x$ 和
    $y$ 的負號代表影像經過上下左右的顛倒。
- 針孔攝影機的問題是需要足夠小的光圈才能保持對焦。而過小的光圈會因為光線不足而讓影像暗；過大的光圈則有動態模糊的問題。
- 在焦點上產生的平面稱為 focal plane(焦平面)，而在焦點附近可以產生足夠清晰影像的距離稱為 depth of field(景深)。

到鏡頭的光量會受到三個因素影響
1. 環境光的光強度
2. 物件是面光還是背光
3. 從物件反射到鏡頭內的光量
  - 大多物件在反射光時會以漫反射(diffuse reflection)的方式將光反射回去；與漫反射相對地點則是鏡面反射(specular reflection)。

Edge

edge 是影像亮度有大幅度地改變的直線或曲線。
一個偵測 edge 的方式是 gaussian filter，設計一個特殊的 kernel 並透過 convolution 的方式找到 edge。

Chapter 22. Reinforcement Learning

agent 透過收到獎勵(reinforcement)來學習最佳的做法。

Action-utility learning

一個 Q-learning 的 agent 會學習一個 action-utility function(Q-function)，在特定的 state 內做出最佳的 action。
- 而從對應的 state 到 action 的過程就是 policy，找到最佳的 policy 就是 policy search。
在 passive reinforcement learning 的狀況下，agent 的 policy
$π$ 是固定的，也就是 action
$π (s)$ 是固定的，而此時的 agent 要學習 utility function
$U^{π} (s)$ 。
enviornment history: 曾經經過的 state 和做過的 action 紀錄。
每次在一個 state
$s$ ，因為做了 action
$a$ 而到了 state
$s^{'}$ 就會得到一個 reward
$R (s, a, s^{'})$ 。這個 reward 值可正可負，正值可以視為獎勵；而負值可以視為懲罰。

Bellman equation

計算 utility 的一個式子，表示為
$U (s) = m a x_{a \in A (s)} \sum_{s^{'}} P (s^{'} | s, a) [R (s, a, s^{'}) + γ U (s^{'})]$

Passive reinforcement learning

utility 在透過 policy
$π$ 之下，定義為 reward 的加總期望值，也就是
$U^{π} (s) = E [\sum_{t = 0}^{\infty} γ^{t} R (S_{t}, π (S_{t}), S_{t + 1})]$ ，其中
$γ$ 為 discount factor。

Temporal-difference learning

這裡的 utility 的更新方法為
$U^{π} (s) \leftarrow U^{π} (s) + α [R (S_{t}, π (S_{t}), S_{t + 1}) + γ U^{π} (s^{'}) - U^{π} (s)]$ ，其中
$α$ 是 learning rate。而算式中使用到了之前 state 的資料，因此稱為 temporal difference。

Active reinforcement learning

一個 active agent 要能自己決定當下要做哪個 action。
agent 必須決定何時應該探索還沒走過的 state，以及何時應該使用目前已知的最佳 action。
- 定義一個很樂觀的 utility 估計值
  $U^{+} (s) = m a x_{a} f (\sum_{s^{'}} P (s^{'} | s, a) [R (s, a, s^{'}) + γ U (s^{'})], N (s, a))$ ，其中
  $N (s, a)$ 為在 state
  $s$ 中嘗試 action
  $a$ 的個數，
  $f (u, n)$ 是 exploration function，用這個 function 決定應該如何做到探索和利用的 trade off。

Temporal-difference Q-learning

學習方法是使用 Q-function
$Q (s, a)$ 而不是之前提到的
$U (s)$ ，這個 function 定義為
$Q (s, a) = \sum_{s^{'}} P (s^{'} | s, a) [R (s, a, s^{'}) + γ max_{a^{'}} Q (s^{'}, a^{'})]$
而更新的方式為
$Q (s, a) \leftarrow Q (s, a) + α [R (S_{t}, π (S_{t}), S_{t + 1}) + γ m a x_{a^{'}} Q (s^{'}, a^{'}) - Q (s, a)]$

期末題目整理

What is the main difference between CNN and RNN?
- CNN is suitable for spatial data like images. RNN is used for temporal data, also called sequential data.
What is overfitting?
- Overfitting refers to the condition when the model completely fits the training data but fails to generalize the testing unseen data.
How can we reduce overfitting? Describe at least two tips.
- k folds cross-validation
- Regularization
- Pruning
- Dropout
- Ensembling
- Batch normalization
What strategies can apply to reduce overfitting when learning a decision tree?
- decision tree pruning
  - build a full tree
  - find a test node
  - if irrelevant（only noise in data) –> replace with leaf node
What is the kernel trick in SVM.
- 用kernel function(低維度的計算)在很高維的空間中有效率的找到optimal linear separator，這個結果map回低維度時，可能會對應到一個扭曲非線性的boundary.
Describe the idea of ensemble learning.
- The idea of ensemble learning is to select a collection, or ensemble, of hypotheses from the hypothesis space and combine their predictions.
Describe the idea of random forest.
- The key idea is randomly vary the attribute choices. At each split point in constructing the tree, select a random sampling of attributes, and then compute which of those gives the highest information gain.

reference
王承揚

人工智慧導論期末整理

Chapter 19. Learning from examples

Entropy

decision tree pruning

Statistical significance test

independent and identically distributed (i.i.d.)

k-fold cross validation

expected loss

Regularization

Univariate(Multivariate) linear regression

Non-parametric model

Table lookup

Nearest neighbor(KNN)

k-dimentional tree (k-d tree)

Locality-sensitive hashing

Locally weighted regression

Support vector machine (SVM)

Ensemble learning

trust, intrepretability, explainability

Chapter 20. Learning Probabilistic Models

Statistical learning

Maximum a posteriori(MAP hypothesis)

Density estimation

Naive bayes model

Bayesian parameter learning

beta distribution

Expectation-maximization algorithm

Chapter 21. Deep Learning

Vanishing gradient

Spatial invariance

Convolutional neural network (CNN)

Residual network

Learning algorithm

Generalization

Recurrent neural network(RNN)

Chapter 24. Deep Learning for Natural Language Processing

one-hot vector

word-embedding

bidirectional RNN

Attention

decoding – beam search

masked language models(MLM)

Chapter 25. Computer Vision

pinhole camera

Edge

Chapter 22. Reinforcement Learning

Action-utility learning

Bellman equation

Passive reinforcement learning

Temporal-difference learning

Active reinforcement learning

Temporal-difference Q-learning

期末題目整理

tags: 1111_courses intro_to_AI

Read more

LeetCode Summary

FDA 期末考古

資料庫作業五

資料庫作業四

tags: `1111_courses` `intro_to_AI`