# SS22 Advanced Machine Learning Key Points Summary
[Tutorial](https://www.youtube.com/watch?v=P_og8H-VkIY&list=PLwJRxp3blEvZ8AKMXOy0fc0cqT61GsKCG)
## Linear Regression:
### Probabilistic Regression:
$x$: training examples (data)
$t$: target values
$y(x)=w^{T}x$: a linear function of the inputs $x$ with parameters $w$

* Assumptions in this formula:
1. Target values $t$ are generated by adding noise to the function value $y(x)$.

2. The noise follows a Gaussian distribution.

* Assumption behind applying the product over the $N$ data points: the samples are drawn i.i.d., so $p(t|X,w,\beta)$ factorizes and expresses the **conditional likelihood** (see the factorized form after this list).

* We can directly optimise for $w$ in probabilistic regression. (maximum likelihood)
* Due to **the presence of the $\beta$ parameter**, maximum likelihood regression **is possible**.
* Due to **the presence of the $\beta$ parameter**, a maximum likelihood estimate for $w$ would be more precise.
* A maximum likelihood estimate for $w$ is likely to overfit to the dataset.
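
For reference, under the i.i.d. and Gaussian-noise assumptions above the conditional likelihood factorizes as

$$p(\mathbf{t}\mid X,w,\beta)=\prod_{n=1}^{N}\mathcal{N}\!\bigl(t_n \mid w^{T}x_n,\ \beta^{-1}\bigr)$$

where $\beta$ is the precision (inverse variance) of the noise.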
---
* What is the optimal regression function under the squared error loss?
The conditional mean of the posterior distribution $p(t|x)$, i.e. $y(x)=\mathbb{E}[t \mid x]$.

* What effect does an L2 regularizer have on the weights $\omega$?
L2 regularization penalizes the sum of squares of the weights, shrinking them towards small values (weight decay) but usually not driving them exactly to zero.


* What effect does an L1 regularizer have on the weights $\omega$?
L1 regularization penalizes the sum of absolute values of the weights. The L1-regularized solution is sparse: many weights are driven exactly to zero.
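
For reference, one common way to write the two regularized sum-of-squares error functions (the feature map $\phi$ and regularization strength $\lambda$ are my notation, matching the Ridge question below):

$$E_{L2}(w)=\frac{1}{2}\sum_{n=1}^{N}\bigl(t_n-w^{T}\phi(x_n)\bigr)^{2}+\frac{\lambda}{2}\lVert w\rVert_2^{2},\qquad E_{L1}(w)=\frac{1}{2}\sum_{n=1}^{N}\bigl(t_n-w^{T}\phi(x_n)\bigr)^{2}+\lambda\lVert w\rVert_1$$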

* Briefly explain the concept of MAP (Maximum-A-Posteriori) and the difference from MLE (Maximum Likelihood Estimation).
[MLE](https://medium.com/qiubingcheng/%E6%9C%80%E5%A4%A7%E6%A6%82%E4%BC%BC%E4%BC%B0%E8%A8%88-maximum-likelihood-estimation-mle-78a281d5f1d)
[MLE vs. MAP](https://www.ycc.idv.tw/deep-dl_3.html)
Two ways to estimate model parameters:
(1) Maximizing the likelihood (MLE) indirectly looks for a set of parameters $\theta$ that makes the model likely to produce the data. (The goal is to use the actually observed samples to find the model parameters that are most likely to have generated them.)
(2) Maximizing the posterior (MAP) directly asks, via Bayes' theorem, which $\theta$ is most probable given the observed data (i.e. it additionally takes a prior over $\theta$ into account).

* List at least 2 advantages of using a kernel function compared to non-linear basis functions for regression.
(1) The kernel lets us work in a high-dimensional (even infinite-dimensional) feature space without ever computing the features explicitly (kernel trick).
(2) The computational cost depends on the number of data points rather than on the dimensionality of the feature space, and the basis functions never have to be designed by hand.

* Write down closed-form solution for parameters $\omega$ under Ridge Regression. Briefly explain all variables used.
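One standard way to write the answer (Bishop-style design-matrix notation, given here as a reference rather than the lecture's exact form):

$$w=\bigl(\lambda I+\Phi^{T}\Phi\bigr)^{-1}\Phi^{T}\mathbf{t}$$

where $\Phi$ is the $N\times M$ design matrix whose $n$-th row is $\phi(x_n)^{T}$, $\mathbf{t}$ is the vector of target values, $\lambda$ is the regularization strength, and $I$ is the $M\times M$ identity matrix.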


* Is $\phi(x,x^{'}) = x-x^{'}$ a valid kernel function? Justify your answer.
No. A valid kernel must correspond to an inner product in some feature space, which requires it to be symmetric and positive semi-definite; $x - x^{'}$ changes sign when the arguments are swapped, so it is not symmetric and therefore not a valid kernel.
### Bayesian Regression:
* Briefly explain how the predictive distribution for Bayesian regression can be obtained.
Combine two terms, the **Gaussian noise** model and the **posterior over the parameters**, and integrate (marginalize) over $w$; see the integral below.

* How is $w$ optimized in Bayesian regression?
Determine $w$ **by maximizing the posterior (= minimizing the regularized sum-of-squares error)**
- Assumption: the noise is (zero-mean) Gaussian and the prior over $w$ is Gaussian, so the posterior distribution is also Gaussian.
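
For reference, the predictive distribution is obtained by marginalizing the Gaussian noise model over the parameter posterior:

$$p(t\mid x, X, \mathbf{t})=\int p(t\mid x,w)\,p(w\mid X,\mathbf{t})\,dw$$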



## Bayesian Network:
* Conditional independence assumptions: it is the links (edges) present in the graph, and just as importantly the missing ones, that encode the conditional independence structure.
### Bayes ball:
* Three cases for conditional independence (d-separation):
(1) Head-to-tail (chain) $a \rightarrow c \rightarrow b$: the path is blocked when $c$ is observed.
(2) Tail-to-tail (common parent) $a \leftarrow c \rightarrow b$: the path is blocked when $c$ is observed.
(3) Head-to-head (v-structure) $a \rightarrow c \leftarrow b$: the path is blocked when neither $c$ nor any of its descendants is observed; observing $c$ (or a descendant) unblocks it.

### Undirected graph:
* Steps for converting a directed graph to an undirected graph (see the sketch after this list):
(1) **Add undirected links** to marry the parents of each node
(2) **Drop the arrows** on the original links (moral graph)
(3) **Find the maximal cliques** and initialize all clique potentials to 1
(4) Take each **conditional distribution factor** of the original directed graph and multiply it into one clique potential.
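
A minimal Python sketch of steps (1) and (2), i.e. moralization; the dict-of-parents representation and the toy DAG are my own illustration, not code from the lecture:

```python
from itertools import combinations

def moralize(parents):
    """Return the undirected (moral) graph edge set of a DAG.

    parents: dict mapping each node to the list of its parent nodes.
    """
    edges = set()
    for child, pars in parents.items():
        # Step (2): drop the arrows, keeping every parent-child link as an undirected edge.
        for p in pars:
            edges.add(frozenset((p, child)))
        # Step (1): "marry" the parents by connecting all parents of the same child.
        for p1, p2 in combinations(pars, 2):
            edges.add(frozenset((p1, p2)))
    return edges

# Toy example: the v-structure a -> c <- b becomes the triangle {a, b, c}.
dag = {"a": [], "b": [], "c": ["a", "b"]}
print(sorted(tuple(sorted(e)) for e in moralize(dag)))
# [('a', 'b'), ('a', 'c'), ('b', 'c')]
```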
### Message passing Algorithm:
Message **from variable node to factor node**: the **product** of the incoming messages from the other neighbouring factor nodes. Message **from factor node to variable node**: the **sum**, over the factor's other variables, of the factor value times the product of the incoming messages from those variables.
### Sum-product Algorithm:
* Goal: find the **exact marginals** for an undirected graph, and determine conditional independence on undirected or directed graphs.
* Sum/Max-Product BP for exact inference in tree-shaped MRFs.
* Original graph can be undirected tree or directed tree/polytree.

* Algorithm (a Python sketch follows this list):
1. Pick an arbitrary node as root.
2. Compute and propagate messages from the leaf nodes to the root, storing received messages at every node.
3. Compute and propagate messages from the root to the leaf nodes, storing received messages at every node.
4. Compute the product of received messages at each node for which the marginal is required, and normalize if necessary.
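
A minimal NumPy sketch of these steps for the simplest tree, a chain-shaped MRF; the unary/pairwise potential-table representation is my own illustration, not code from the lecture:

```python
import numpy as np

def chain_marginals(unary, pairwise):
    """Sum-product on a chain MRF.

    unary:    list of length N, unary[i] has shape (K,)       -- node potentials
    pairwise: list of length N-1, pairwise[i] has shape (K, K) -- potential between nodes i and i+1
    Returns the normalized marginal of every node.
    """
    N = len(unary)
    # Forward pass: fwd[i] = message arriving at node i from the left.
    fwd = [np.ones_like(unary[0]) for _ in range(N)]
    for i in range(1, N):
        fwd[i] = pairwise[i - 1].T @ (unary[i - 1] * fwd[i - 1])
    # Backward pass: bwd[i] = message arriving at node i from the right.
    bwd = [np.ones_like(unary[0]) for _ in range(N)]
    for i in range(N - 2, -1, -1):
        bwd[i] = pairwise[i] @ (unary[i + 1] * bwd[i + 1])
    # Marginal = local potential times both incoming messages, normalized.
    marginals = []
    for i in range(N):
        m = unary[i] * fwd[i] * bwd[i]
        marginals.append(m / m.sum())
    return marginals

# Toy example: 3 binary nodes with a pairwise potential that favours equal neighbours.
unary = [np.array([0.9, 0.1]), np.array([0.5, 0.5]), np.array([0.5, 0.5])]
pairwise = [np.array([[0.8, 0.2], [0.2, 0.8]])] * 2
for i, m in enumerate(chain_marginals(unary, pairwise)):
    print(f"p(x_{i}) =", m)
```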
### Max-Sum algorithm
* With proper initialization, it can compute either unconditional or conditional **max-marginal** probabilities. (note: maximum marginals $\neq$ joint maximum)
* Max-Product / Max-Sum BP for exact MAP inference (finding the most probable configuration) in tree-shaped MRFs.

### Junction Tree Algorithm
* Junction Tree algorithm for converting arbitrary MRFs into trees.
* Use case: you have a directed graph and you want to find the exact marginals.

### s-t mincut Algorithm
* Use case: you have defined an MRF structure on an image and would like to perform foreground-background segmentation.
## Sampling:
* Proposal distribution $q(z)$: a simpler distribution that is easy to sample from, e.g. a Gaussian or Gamma distribution
* Target distribution $p(z)$: the distribution we actually want samples or expectations from
* $f(z)$: the function whose expectation $\mathbb{E}[f]$ we want to evaluate (in importance sampling its values are weighted by $p(z)/q(z)$)
### Rejection Sampling:
* One of the independent sampling methods
* Requirement: $kq(z) \ge \tilde{p}(z)$ for all $z$
(the tilde denotes an un-normalized distribution)

1. Problem: It is impractical to find a good proposal distribution in high dimensions.
2. Solution: We require that $kq(z)$ be close to the required distribution, so that the rejection rate stays small.
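
A minimal NumPy sketch of rejection sampling; the un-normalized bimodal target, the Gaussian proposal, and the constant $k$ are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):
    """Un-normalized target density (toy bimodal example)."""
    return np.exp(-0.5 * (z - 2) ** 2) + 0.5 * np.exp(-0.5 * (z + 2) ** 2)

def q(z, sigma=3.0):
    """Gaussian proposal density q(z) = N(0, sigma^2)."""
    return np.exp(-0.5 * (z / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

k = 12.0  # large enough that k * q(z) >= p_tilde(z) for this toy target

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        z = rng.normal(0.0, 3.0)          # draw from the proposal q(z)
        u = rng.uniform(0.0, k * q(z))    # draw uniformly under the envelope k*q(z)
        if u <= p_tilde(z):               # accept if the point falls under p_tilde(z)
            samples.append(z)
    return np.array(samples)

print(rejection_sample(5))
```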
### Importance Sampling:
* One of the independent sampling methods
1. Goal: not to sample from $p(z)$ itself, but to evaluate expectations of the form $\mathbb{E}[f]=\int f(z)\,p(z)\,dz$ (see the estimator after this list).



2. Problem:
(1) The number of terms grows exponentially with the number of dimensions.
(2) If none of the samples fall in the regions where $p(z)f(z)$ is large (relative to $q(z)$), the error of the result increases.
3. Solution: $q(z)$ should not be small in regions where $p(z)$ is significant.
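
For reference, the importance-sampling estimator with samples $z^{(l)}\sim q(z)$, $l=1,\dots,L$:

$$\mathbb{E}[f]=\int f(z)\,\frac{p(z)}{q(z)}\,q(z)\,dz\;\simeq\;\frac{1}{L}\sum_{l=1}^{L}\frac{p(z^{(l)})}{q(z^{(l)})}\,f(z^{(l)})$$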
### Metropolis Hastings Algorithm:
* One of the Markov Chain Monte Carlo (MCMC) sampling algorithms
* Properties of Markov chains:
(1) Invariant (stationary) distribution: a distribution $p^{*}(z)$ is invariant if applying the transition kernel leaves it unchanged, $p^{*}(z) = \sum_{z'} T(z \mid z')\, p^{*}(z')$.
(2) Detailed balance: $p^{*}(z)\,T(z' \mid z) = p^{*}(z')\,T(z \mid z')$; detailed balance is sufficient for $p^{*}$ to be an invariant distribution (the chain is then reversible).
(3) Homogeneous: the transition probabilities are the same at every step.
(4) Symmetric proposal distribution: $q(z' \mid z) = q(z \mid z')$; in this case Metropolis-Hastings reduces to the original Metropolis algorithm.

* Algorithm: from the current state $z$, propose $z^{*} \sim q(z^{*} \mid z)$ and accept it with probability
$A(z^{*}, z) = \min\!\left(1,\ \frac{\tilde{p}(z^{*})\, q(z \mid z^{*})}{\tilde{p}(z)\, q(z^{*} \mid z)}\right)$; otherwise keep the old state $z$ (see the sketch below).
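
A minimal NumPy sketch of Metropolis-Hastings with a symmetric Gaussian random-walk proposal (so the $q$ terms cancel in the acceptance ratio); the un-normalized target is a hypothetical toy density:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):
    """Un-normalized target density (toy bimodal example)."""
    return np.exp(-0.5 * (z - 2) ** 2) + 0.5 * np.exp(-0.5 * (z + 2) ** 2)

def metropolis_hastings(n_samples, step=1.0, z0=0.0):
    z = z0
    samples = []
    for _ in range(n_samples):
        z_new = z + rng.normal(0.0, step)                     # symmetric random-walk proposal
        accept_prob = min(1.0, p_tilde(z_new) / p_tilde(z))   # q cancels because it is symmetric
        if rng.uniform() < accept_prob:
            z = z_new                                         # accept the proposal
        samples.append(z)                                     # a rejected move repeats the old state
    return np.array(samples)

chain = metropolis_hastings(10000)
print("empirical mean:", chain.mean())
```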
## Latent Variable Models:
https://zhuanlan.zhihu.com/p/85467515
* Concept: an unobserved latent variable ($z$) with a prior distribution is introduced; the observed data ($x$) are then described by the model conditioned on $z$.
* Example (using ancestral sampling; see the sketch after this list):

1. Sample w from p(w)
2. Sample y from p(y)
3. Sample $z_i$ from p($z_i$|w)
4. Sample $x_i$ from p($x_i$|$z_i$,y)
5. Accept the sample if it corresponds to the observed value
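
A minimal NumPy sketch of steps 1-4 for a model with this structure; the concrete distributions (Beta, Gamma, Bernoulli, Gaussian) are hypothetical choices for illustration, not the ones from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def ancestral_sample(n):
    """Sample (w, y, z, x) in topological order of the graphical model."""
    w = rng.beta(2.0, 2.0)              # 1. sample w from p(w)             (toy: Beta prior)
    y = rng.gamma(2.0, 1.0)             # 2. sample y from p(y)             (toy: Gamma prior)
    z = rng.random(n) < w               # 3. sample z_i from p(z_i | w)     (toy: Bernoulli(w))
    x = rng.normal(z.astype(float), y)  # 4. sample x_i from p(x_i | z_i, y) (toy: Gaussian)
    # 5. to condition on observed x, one would keep only samples matching the observation.
    return w, y, z, x

w, y, z, x = ancestral_sample(5)
print(w, y, z, x)
```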
### Conjugate prior:
https://www.youtube.com/watch?v=aPNrhR0dFi8
1. Conjugate priors have the same functional form as the likelihood, so that the posterior has the same form as the prior.
2. The conjugate prior over the mean $\mu$ of a Gaussian distribution with fixed variance is not a Dirichlet distribution (it is a Gaussian).
3. Dirichlet distribution is a multivariate generalization of the beta distribution. Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.

4. Conjugate priors are useful when solving integrals that are otherwise not analytically solvable.
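
A standard worked example of conjugacy (Beta prior with a Bernoulli/binomial likelihood, $m$ successes out of $N$ trials):

$$\mathrm{Beta}(\mu\mid a,b)\times \mu^{m}(1-\mu)^{N-m}\;\propto\;\mathrm{Beta}(\mu\mid a+m,\ b+N-m)$$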
## Generative Adversarial Network (GAN):
https://youtu.be/jNY1WBb8l4U
* Concept: It's like a two-player game (minimax game). Generator and discriminator compete against each other and progress simultaneously.
* Goal: $\arg\max$ over $D$ and $\arg\min$ over $G$ (for the generator, the smaller the objective value on its generated images, the better; see the value function below)

* The generator needs to **generate** (synthesize) data and fool the discriminator into believing the synthetic data is real.
* The discriminator needs to assign the correct real/fake labels to the data.
* Use backpropagation to update the parameters of both models.
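
For reference, the minimax value function of the original GAN (Goodfellow et al.):

$$\min_{G}\max_{D} V(D,G)=\mathbb{E}_{x\sim p_{\text{data}}}\bigl[\log D(x)\bigr]+\mathbb{E}_{z\sim p_{z}}\bigl[\log\bigl(1-D(G(z))\bigr)\bigr]$$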

## Variational Autoencoder (VAE):
https://youtu.be/3oHlf8-J3Nc


* The encoder's input and the decoder's output should be as close as possible -> reconstruction
* A high-dimensional image is mapped to a low-dimensional (latent) vector
* Which variables would you pass to the loss function while training an AutoEncoder? **Reconstructed input data and input data**
* Use backpropagation to update parameters of models.
* VAEs use the **Evidence Lower Bound (ELBO)** loss (see the formula below).
* VAEs sample from the latent space.

* VAEs apply the reparametrization trick so that the sampling step becomes differentiable and gradients can be backpropagated through it.
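
For reference, the ELBO that a VAE maximizes and the reparametrization trick for a Gaussian latent (written in the usual encoder $q_{\phi}$ / decoder $p_{\theta}$ notation):

$$\mathcal{L}(x)=\mathbb{E}_{q_{\phi}(z\mid x)}\bigl[\log p_{\theta}(x\mid z)\bigr]-\mathrm{KL}\bigl(q_{\phi}(z\mid x)\,\Vert\,p(z)\bigr),\qquad z=\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon,\ \ \epsilon\sim\mathcal{N}(0,I)$$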
## Attention & Transformer:
https://youtu.be/hYdO9CscNes
1. Query ($q$) and key-value pairs ($k$, $v$): each input $a^{i}$ is projected to $q^{i}=W^{q}a^{i}$, $k^{i}=W^{k}a^{i}$, $v^{i}=W^{v}a^{i}$.
2. Attention score: $\alpha_{i,j} = q^{i}\cdot k^{j}$ (dot product, usually scaled by $1/\sqrt{d_k}$).
3. Softmax normalization: $\alpha' = \mathrm{softmax}(\alpha)$.
4. Output: $O = VA'$, i.e. each output is a weighted sum of the value vectors, weighted by the normalized attention scores.
5. Three types of parameters to be learned: $W^{q}, W^{k}, W^{v}$ (see the matrix form below).
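
In matrix form (scaled dot-product attention, as in the original Transformer paper):

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,\qquad Q=XW^{q},\ K=XW^{k},\ V=XW^{v}$$

where $X$ stacks the input vectors as rows and $d_k$ is the key dimension.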


* A CNN is a simplified version of self-attention
* Self-attention = CNN + learnable receptive field
* Differences between RNN and self-attention / Transformer (where distant positions are treated as if adjacent):
(1) Sequential computation vs. parallel computation
(2) An ordered sequence of inputs vs. able to operate on unordered sets, or on ordered sequences with positional encoding
(3) Transformer: requires a lot of memory
* Attention is permutation invariant, so positional encoding is needed.

* Encoders: commonly self-attention (Transformer), but RNN, CNN, or BERT can also be used
* Multi-Head Self-Attention Layer
<style>
h2{color:#382FB5;}
h3{color:#5353E8;}
h4{color:#5e79ff;}
h5{color:#05b8ff;}
</style>