Pattern Recognition
===
Textbook
===
[Pattern Recognition (downloadable directly via an NCTU IP)](https://www.sciencedirect.com/science/book/9781597492720)
## Chapter 1: Clustering
### Intro.
| | Clustering | PR |
|---|---|---|
| Problem to solve | Given $\{x_1, x_2,\dots,x_n\}$, partition the $n$ points into $k$ clusters | Given $k$ clusters $\{C_1, C_2,\dots,C_k\}$ and a new data point $x_i$, which cluster should $x_i$ be assigned to? |
| One possible solution | $C_1 = \{x_1, x_2\},\ C_2=\{x_3, x_5\},\ \dots,\ C_k=\{x_4,x_6,x_7,\dots,x_n\}$ | $x_i\in C_3$ |
| Solution | No single optimal solution; the only question is whether the clustering result is acceptable | There is a correct answer; we want the misclassification rate to be as low as possible (the theoretical optimum is given by the Bayes classifier) |
Given n data points --> Clustering --> k clusters
Given a new data point --> PR --> which cluster does it belong to?
### 1.1 [k-means](https://hackmd.io/s/BkyMzbSsG#Ch1-k-means)
### 1.2 [Hierarchical Method](https://hackmd.io/s/BkyMzbSsG#Ch2-Hierarchical-Method-%E9%9A%8E%E5%B1%A4%E5%BC%8F)
## Chapter 2: Classifiers Based on Bayes Decision Theory
### 2.1 Introduction
Given a classification task of $M$ classes, $\omega_1, \omega_2, \dots, \omega_M$, and a feature vector $\boldsymbol{x}$, we form the $M$ conditional probabilities $P(\omega_i|\boldsymbol{x}), i = 1, 2, \dots, M$.
:::success
Goal: $\max\limits_i P(\omega_i|\boldsymbol{x})$
:::
### 2.2 Bayes Decision Theory
Given the a priori probabilities $P(\omega_i)$ and the class-conditional probability density functions $p(\boldsymbol{x}|\omega_i)$:
#### Bayes rule
$$
P(\omega_i|\boldsymbol{x}) = \frac{p(\boldsymbol{x}|\omega_i)P(\omega_i)}{p(\boldsymbol{x})}
$$
where $p(\boldsymbol{x})$ is the pdf of $\boldsymbol{x}$
$$
p(\boldsymbol{x}) = \sum_{i = 1}^{M}p(\boldsymbol{x}|\omega_i)P(\omega_i)
$$
#### Bayes classification rule for two-class case
$$
\text{If }P(\omega_1|\boldsymbol{x})>P(\omega_2|\boldsymbol{x}),\ \ \boldsymbol{x}\text{ is classified to }\omega_1 \\
\text{If }P(\omega_2|\boldsymbol{x})>P(\omega_1|\boldsymbol{x}),\ \ \boldsymbol{x}\text{ is classified to }\omega_2 \\
\Bigg\Downarrow \text{ Bayes rule} \\
p(\boldsymbol{x}|\omega_1)P(\omega_1)\gtrless p(\boldsymbol{x}|\omega_2)P(\omega_2) \\
\Bigg\Downarrow \text{if }P(\omega_1)=P(\omega_2) \\
p(\boldsymbol{x}|\omega_1) \gtrless p(\boldsymbol{x}|\omega_2)
$$
#### Minimizing the Classification Error Probability
++**Thm**++ The Bayesian classifier is optimal with respect to minimizing the classification error probability.
++**pf**++
Let $R_i$ be the region of the feature space in which we decide in favor of $\omega_i$.
$$
P_e = P(\boldsymbol{x}\in R_2, \omega_1)+P(\boldsymbol{x} \in R_1, \omega_2)
$$
:::info
$\boldsymbol{x}\in R_2$ although it actually belongs to class $\omega_1$, or $\boldsymbol{x}\in R_1$ although it actually belongs to class $\omega_2$
:::
$$
\begin{align}
P_e &= P(\boldsymbol{x}\in R_2|\omega_1)P(\omega_1)+P(\boldsymbol{x}\in R_1|\omega_2)P(\omega_2)\qquad(\text{joint probability})\\
& =P(\omega_1)\int\limits_{R_2}p(\boldsymbol{x}|\omega_1)d\boldsymbol{x}+P(\omega_2)\int\limits_{R_1}p(\boldsymbol{x}|\omega_2)d\boldsymbol{x}
\end{align}
$$

:::danger
TODO: find a better figure.
:::
or using Bayes rule
$$
P_e = \int\limits_{R_2}P(\omega_1|\boldsymbol{x})p(\boldsymbol{x})d\boldsymbol{x}+\int\limits_{R_1}P(\omega_2|\boldsymbol{x})p(\boldsymbol{x})d\boldsymbol{x}
$$
Error is minimized if the partitioning regions $R_1$ and $R_2$ of the feature space are chosen $s.t.$
$$
R_1: P(\omega_1|\boldsymbol{x})>P(\omega_2|\boldsymbol{x}) \\
R_2: P(\omega_2|\boldsymbol{x}) > P(\omega_1|\boldsymbol{x})
$$
**Some Explanation**
Since the union of $R_1, R_2$ covers all the space, we have
$$
\int\limits_{R_1}P(\omega_1|\boldsymbol{x})p(\boldsymbol{x})d\boldsymbol{x}+\int\limits_{R_2}P(\omega_1|\boldsymbol{x})p(\boldsymbol{x})d\boldsymbol{x} = P(\omega_1)
$$
and thus
$$
P_e = P(\omega_1) - \int\limits_{R_1}(P(\omega_1|\boldsymbol{x})-P(\omega_2|\boldsymbol{x}))p(\boldsymbol{x})d\boldsymbol{x}
$$
which suggests the above partitioning choice.
**Generalization**
$\boldsymbol{x}$ is assigned to class $\omega_i$ if $P(\omega_i|\boldsymbol{x})>P(\omega_j|\boldsymbol{x}) \quad\forall j \ne i$
#### Minimizing the Average Risk
Let $R_j, j = 1,2,\dots,M$ be the regions of the feature space assigned to classes $\omega_j$ respectively.
Let the *loss* $\lambda_{ki}$ be the penalty term associated with the misclassification in which a feature vector $\boldsymbol{x}$ that belongs to class $\omega_k$ lies in $R_i$, $i \ne k$.
Let the *loss matrix* $L$ be the matrix whose entries are $l_{ki} = \lambda_{ki}$:
$$L=\begin{bmatrix}
\lambda_{11} & \lambda_{12} & \cdots & \lambda_{1M} \\
\lambda_{21} & \lambda_{22} & \cdots & \lambda_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
\lambda_{M1} & \lambda_{M2} & \cdots & \lambda_{MM}
\end{bmatrix}$$
:::info
A feature vector $\boldsymbol{x}$ that belongs to class $\omega_k$ but falls in $R_i$ (i.e., $\boldsymbol{x}$ is classified to class $\omega_i$) incurs a penalty of $\lambda_{ki}$.
Typically $\lambda_{kk}=0,\ \forall k = 1,2,\dots,M$.
:::
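For instance, a hypothetical two-class loss matrix (not taken from the text) could be
$$
L = \begin{bmatrix} 0 & 1 \\ 5 & 0 \end{bmatrix}
$$
meaning that misclassifying an $\omega_2$ sample as $\omega_1$ ($\lambda_{21}=5$) is five times as costly as the opposite error ($\lambda_{12}=1$), while correct decisions incur no penalty.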
The *risk* or *loss* associated with $\omega_k$ is defined as
$$
r_k = \sum_{i=1}^{M} \lambda_{ki}\int\limits_{R_i}p(\boldsymbol{x}|\omega_k)d\boldsymbol{x}
$$
:::info
The integral is the probability that a feature vector from class $\omega_k$ is assigned to class $\omega_i$.
:::
**Goal**: choose the partitioning regions $R_j$ so that the *average risk* $r$ is minimized.
$$
\begin{align}
r &= \sum_{k=1}^{M}r_kP(\omega_k) \\
&= \sum_{i=1}^{M}\int\limits_{R_i} \biggl( \sum_{k=1}^{M}\lambda_{ki}p(\boldsymbol{x}|\omega_k)P(\omega_k) \biggr)d\boldsymbol{x}
\end{align}
$$
This is achieved if each of the integrals is minimized.
Select partitioning regions $s.t.$
$$
\boldsymbol{x} \in R_i\ \ \text{if}\ \ l_i \equiv \sum_{k=1}^{M}\lambda_{ki}p(\boldsymbol{x}|\omega_k)P(\omega_k)<l_j\equiv\sum_{k=1}^{M}\lambda_{kj}p(\boldsymbol{x}|\omega_k)P(\omega_k)\ \ \forall j \ne i
$$
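A minimal sketch of this minimum-risk rule, using hypothetical priors, 1-D Gaussian likelihoods, and the asymmetric loss matrix from the example above:
```python
# Minimal sketch: pick the class with the smallest conditional risk
# l_i = sum_k lambda_ki * p(x|w_k) * P(w_k). All numbers are hypothetical.
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.5])                   # P(w_k)
likelihoods = [norm(0.0, 1.0), norm(2.0, 1.0)]  # p(x|w_k)
Lambda = np.array([[0.0, 1.0],                  # Lambda[k, i] = lambda_ki
                   [5.0, 0.0]])                 # missing w_2 costs more

def min_risk_class(x):
    p = np.array([lk.pdf(x) for lk in likelihoods]) * priors  # p(x|w_k)P(w_k)
    risks = Lambda.T @ p    # risks[i] = sum_k lambda_ki * p(x|w_k) * P(w_k)
    return np.argmin(risks) + 1                 # 1-based class index

# x = 1.0 is equidistant from both means, but the high cost of missing
# w_2 pushes the decision toward w_2.
print(min_risk_class(1.0))   # -> 2
```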
**The two-class case**
:::danger
TODO: to be written.
:::
### 2.3 Decision Boundary
:::info
There is a little content here, but it is not very important; to be added later.
:::
### 2.4 Bayesian Classifier for Normal Distribution
:::danger
In $l$-dimensional space, the Gaussian pdf $N(\boldsymbol{\mu_i}, \Sigma_i)$ for the $i^{th}$ class is:
$$
p(\boldsymbol{x}|\omega_i) = \frac{1}{(\sqrt{2\pi})^l \sqrt {|\Sigma_i|}}\exp\left(-0.5 (\boldsymbol{x}-\boldsymbol{\mu_i})^T \Sigma_i ^{-1} (\boldsymbol{x}-\boldsymbol{\mu_i})\right)
$$
where
$$
\boldsymbol{\mu_i} = E[\boldsymbol{x}] \text{ is the }(l\times1) \text{ mean vector}\\
\Sigma_i = E[(\boldsymbol{x}-\boldsymbol{\mu_i})(\boldsymbol{x}-\boldsymbol{\mu_i})^T] \text{ is the } (l \times l) \text{ covariance matrix of class } i
$$
Memorize this (red) block carefully.
:::
:::info
Example with $l=2$:
$\Sigma = \begin{bmatrix}\sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2\end{bmatrix}$
:::
:::success
assign $\boldsymbol{x}$ to class $i$ $\iff$ $P(\omega_i)p(\boldsymbol{x}|\omega_i)$ is max
:::
And $P(\omega_i)p(\boldsymbol{x}|\omega_i)$ is max $\iff$ $\ln P(\omega_i) + \ln p(\boldsymbol{x}|\omega_i)$ is max
++**def**++ $g_i(\boldsymbol{x}) = -0.5 (\boldsymbol{x}-\boldsymbol{\mu_i})^T \Sigma_i ^{-1} (\boldsymbol{x}-\boldsymbol{\mu_i}) + \ln P(\omega_i) + c_i$ ,
$c_i = \bbox[yellow]{-\frac{l}{2} \ln 2\pi} - \frac{1}{2}\ln |\Sigma_i|$
The yellow-highlighted term is the same for every class, so it can be dropped from the comparison.
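A minimal sketch (with made-up 2-D parameters) of evaluating $g_i(\boldsymbol{x})$ for each class and assigning $\boldsymbol{x}$ to the largest; only the class-dependent terms are kept, as noted above:
```python
# Minimal sketch: Gaussian discriminant g_i(x) without the common
# -(l/2) ln(2*pi) term. Means, covariances, and priors are hypothetical.
import numpy as np

def g(x, mu, Sigma, prior):
    """g_i(x) = -0.5 (x-mu)^T Sigma^{-1} (x-mu) + ln P(w_i) - 0.5 ln|Sigma_i|"""
    d = x - mu
    return (-0.5 * d @ np.linalg.inv(Sigma) @ d
            + np.log(prior)
            - 0.5 * np.log(np.linalg.det(Sigma)))

mu1, S1 = np.array([0.0, 0.0]), np.eye(2)
mu2, S2 = np.array([3.0, 3.0]), np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([1.0, 2.0])
scores = [g(x, mu1, S1, 0.5), g(x, mu2, S2, 0.5)]
print(np.argmax(scores) + 1)   # assign x to the class with the largest g_i(x)
```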
**Minimum Distance Classifiers**
If the classes are ++equi-probable++ and have the ++same covariance++ ( $P(\omega_i) = 1/M$ and $\Sigma_i = \Sigma,\ \forall i$ ),
then define the new $g_i(\boldsymbol{x}) = -0.5 (\boldsymbol{x}-\boldsymbol{\mu_i})^T \Sigma ^{-1} (\boldsymbol{x}-\boldsymbol{\mu_i})$
- If $\Sigma = \sigma^2 I$,
then $\max\limits_{i}g_i(\boldsymbol{x})$ amounts to assigning $\boldsymbol{x}$ to the $\omega_i$ whose **Euclidean distance** $d_E \equiv ||\boldsymbol{x}-\boldsymbol{\mu_i}||$ is minimum
- Otherwise (non-diagonal, or diagonal but $\sigma_{i}\ne\sigma_j$),
then $\max\limits_{i}g_i(\boldsymbol{x})$ amounts to assigning $\boldsymbol{x}$ to the $\omega_i$ whose **Mahalanobis distance** $d_m \equiv \sqrt{(\boldsymbol{x}-\boldsymbol{\mu_i})^T \Sigma ^{-1} (\boldsymbol{x}-\boldsymbol{\mu_i})}$ is minimum (see the sketch below)
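A minimal sketch of both cases under the stated assumptions (equi-probable classes with a shared, hypothetical covariance); with $\Sigma = \sigma^2 I$ the Mahalanobis distance reduces to a scaled Euclidean distance, so one function covers both rules:
```python
# Minimal sketch: minimum-(Mahalanobis-)distance classifier with a shared
# covariance. Means and covariance are hypothetical.
import numpy as np

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # mu_i
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])             # shared covariance
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, mu):
    d = x - mu
    return np.sqrt(d @ Sigma_inv @ d)

x = np.array([1.0, 2.0])
# Assign x to the class whose mean is nearest in Mahalanobis distance;
# with Sigma = sigma^2 * I this becomes the Euclidean-distance rule.
print(np.argmin([mahalanobis(x, mu) for mu in means]) + 1)
```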
### 2.5 Estimation of Unknown PDF
Not important.
### 2.6 Classifier using k nearest neighbors (kNN) Rule
From the $N$ stored points, find the $k$ points nearest to $\boldsymbol{x}$ and assign $\boldsymbol{x}$ to the class to which most of them belong.
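A minimal sketch of this rule on a hypothetical 2-D training set, using Euclidean distance and a simple majority vote:
```python
# Minimal sketch: kNN classification by majority vote among the k nearest
# training points. The training data below are hypothetical.
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the k nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array([1, 1, 1, 2, 2, 2])
print(knn_classify(np.array([0.5, 0.5]), X, y, k=3))  # -> 1
print(knn_classify(np.array([5.5, 5.0]), X, y, k=3))  # -> 2
```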
:::danger
Let $P_B$ be the classification error probability of the Bayesian classifier and $P_{NN}$ the classification error probability of the 1-NN rule.
As $N \to \infty$,
$$
P_B \le P_{NN} \le P_B(2-\frac{M}{M-1}P_B) \le 2P_B
$$
This will be on the exam.
:::
:::danger
For two classes,
$$
P_{Bayes} \le P_{kNN} \le P_{Bayes} + \sqrt{\frac{2P_{NN}}{k}}
$$
Therefore,
$$
\text{as } k \to \infty,\quad P_{kNN} \to P_{Bayes}
$$
This will be on the exam.
:::
If $P_{Bayes}$ is small, then
$P_{NN} \cong 2P_{Bayes}$
$P_{3NN} \cong P_{Bayes} + 3(P_{Bayes})^2$
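For example, plugging a hypothetical $P_{Bayes} = 0.05$ into these approximations gives
$$
P_{NN} \cong 2(0.05) = 0.10,\qquad P_{3NN} \cong 0.05 + 3(0.05)^2 = 0.0575
$$
so already at $k=3$ the error is much closer to the Bayes error.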
## Reference
1. S. Theodoridis and K. Koutroumbas, *Pattern Recognition*, 4th ed., Elsevier (Academic Press), 2009.