# CS361 Final Notes
## Ch1 & Ch2
### Data types: Categorical/Ordinal/Continuous
### Location parameters:
- Mean (**Lecture 2**)
- Symbol: $\mu$
- Definition: $\mu$ = $\frac1N\sum_{i=1}^N x_i$
- Properties
- $mean(k x) = k\cdot mean(x)$
- $mean(x+c) = mean(x) + c$
- Median (**Lecture 3**)
	- Definition: middle value of the sorted data
- Properties
- $median(k\cdot x)=k\cdot median(x)$
- $median(x+c)=median(x)+c$
- Mode (**Lecture 3**)
- Peak of histogram (most frequent value)
### Scale Parameter
- Standard deviation (**Lecture 2**)
- Symbol: $\sigma$
- Definition: $\sigma=\sqrt{\frac1N\sum_{i=1}^N(x_i-mean(x_i))^2}$
- Properties
- $std(kx)=|k|\cdot std(x)$
- $std(x+c)=std(x)$
- Chebyshev's Inequality:
		- At most $\frac{N}{k^2}$ items are $k$ or more standard deviations away from the mean
- Variance (**Lecture 2**)
- Symbol: $\sigma^2$
- Definition: $\sigma^2=\frac1N\sum_{i=1}^N(x_i-mean(x_i))^2$
- Properties
- $var(kx)=k^2var(x)$
- $var(x+c)=var(x)$
- Percentile (**Lecture 3**)
- Definition: kth percentile means k% of data is smaller
- Median is 50th percentile
- IQR (**Lecture 3**)
- Definition: IQR = 75th percentile - 25th percentile
- Properties
- $iqr(k\cdot x)=|k|\cdot iqr(x)$
- $iqr(x+c)=iqr(x)$
### Standard coordinate (**Lecture 2**)
- Symbol: $\hat{x}$
- Definition: $\hat x=\frac{x-mean(x)}{std(x)}$
- Properties
- $\mu(\hat{x})=0$
- $\sigma(\hat{x})=1$
- $\sigma^2(\hat{x})=1$, unitless.
### Outlier (**Lecture 3**)
- A point is an outlier if it lies below the 25th percentile minus $1.5\cdot iqr$ or above the 75th percentile plus $1.5\cdot iqr$
- mean and std are sensitive to outliers; median and iqr are not
### Skew (**Lecture 3**)
- Describes which side the longer tail of the histogram is on
- Right-skewed: mean $>$ median
- Left-skewed: mean $<$ median
### Correlation Coefficient (**Lecture 3**)
- $corr(x,y)=\frac1N\sum_{i=1}^N\hat x_i\hat y_i=mean(\hat x\hat y)=\sum_{i=1}^N\frac{\hat x_i}{\sqrt N}\frac{\hat y_i}{\sqrt N}$
	- Does not change when translating the data
- unitless and defined in standard coordinate
- bounded in $[-1, 1]$
- $corr(ax+b,cy+d)=sign(ac)corr(x,y)$
- causation $\Rightarrow$ correlation, correlation $\nRightarrow$ causation
### Prediction (**Lecture 3**)
- Use standard coordinates
- $\hat y^P = a\hat x +b$.
- Error $u=\hat y - \hat y^P = \hat y - a\hat x -b$
- Least-squares choice: $a=r$, $b=0$, so $\hat y^P = r\hat x$ (for a particular point $\hat x_0$, $\hat y_0^P=r\hat x_0$).
- RMS error = $\sqrt{mean(u^2)}=\sqrt{1-r^2}$
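A minimal numpy sketch of the above (the data values are made up for illustration): compute $r$ in standard coordinates, predict with $\hat y^P=r\hat x$, and check that the RMS error matches $\sqrt{1-r^2}$.

```python
import numpy as np

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Standard coordinates: subtract the mean, divide by the (population) std
x_hat = (x - x.mean()) / x.std()
y_hat = (y - y.mean()) / y.std()

# Correlation coefficient r = mean(x_hat * y_hat)
r = np.mean(x_hat * y_hat)

# Prediction in standard coordinates and its RMS error
y_hat_pred = r * x_hat
rms = np.sqrt(np.mean((y_hat - y_hat_pred) ** 2))

print(r, rms, np.sqrt(1 - r**2))   # the last two agree
```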
---
## Ch3
### Outcome/Sample Space/Event (**Lecture 4**)
- Outcome: a possible result of a random, repeatable experiment
- Sample Space: set of all possible outcomes, symbol: $\Omega$
- Event: subset of sample space
- Venn diagram: events behave like sets and can be combined with set operations
### Probability Axiom (**Lecture 4**)
- $P(E)\ge0$
- $P(\Omega) = 1$
- Probability of disjoint events is additive: $P(E_1\cup E_2\cup\dots)=\sum_i P(E_i)$ if $E_i\cap E_j=\emptyset$ for all $i\ne j$
### Properties of probability (**Lecture 4**)
- Complement:$P(E^c) = 1-P(E)$
- Difference: $P(E_1-E_2)=P(E_1) - P(E_1\cap E_2)$
- Union: $P(E_1\cup E_2)=P(E_1) +P(E_2) - P(E_1\cap E_2)$
- Multiple union, see slide
### Calculate probability
- For an event made of countably many outcomes: $P(E) = \sum_{A_i\in E} P(A_i)$ (**Lecture 4**)
	- Equally likely outcomes: $P(E)=\frac{\text{number of outcomes in } E}{\text{total number of outcomes}}$
- Use complement (**Lecture 4**)
- $P(E)=1-P(E^c)$
- Example
- Senate & Birthday problem (**Lecture 5**)
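For the birthday problem, the complement trick is easy to check numerically; a small sketch (numpy assumed, 365 equally likely birthdays):

```python
import numpy as np

def p_shared_birthday(n):
    """P(at least two of n people share a birthday), via the complement:
    P(no shared birthday) = (365/365) * (364/365) * ... * ((365-n+1)/365)."""
    days = np.arange(365, 365 - n, -1)
    return 1.0 - np.prod(days / 365.0)

print(p_shared_birthday(23))   # ~0.507, already above one half
```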
### Conditional Probability (**Lecture 5**)
- Definition of conditional probability: the probability of A given B
- $P(A|B)=\frac{P(A\cap B)}{P(B)}$
- Multiplication->Probability Tree
- Symmetry of joint event
- $P(A\cap B)=P(A|B)P(B)$
- $P(B\cap A)=P(B|A)P(A)$
- Therefore $P(A|B)P(B)=P(B|A)P(A)$
### Bayes Rule (**Lecture 5**)
- $P(A|B)=\frac{P(B|A)P(A)}{P(B)}=\frac{P(A\cap B)}{P(B)}$
### Total Probability (**Lecture 6**)
- $P(B)=P(B\cap A) + P(B\cap A^c)=\sum_j P(B\cap A_j)=\sum_j P(B|A_j)P(A_j)$, where the events $A_j$ partition the sample space
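A small worked sketch combining total probability and Bayes' rule; the numbers (a rare condition and an imperfect test) are made up for illustration:

```python
# Made-up numbers for a rare condition and an imperfect test
p_A      = 0.01   # P(A): prior probability of having the condition
p_B_A    = 0.95   # P(B|A): test positive given the condition
p_B_notA = 0.05   # P(B|A^c): false positive rate

# Total probability: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
p_B = p_B_A * p_A + p_B_notA * (1 - p_A)

# Bayes rule: P(A|B) = P(B|A)P(A) / P(B)
p_A_B = p_B_A * p_A / p_B
print(p_B, p_A_B)   # P(A|B) is only ~0.16 despite the accurate test
```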
### Independence (**Lecture 6**)
- $P(A|B)=P(A)$ or $P(B|A)=P(B)$
- Event $A$ and $B$ are independent iff $P(A\cap B)=P(A)P(B)$
	- Two disjoint events that each have probability $>0$ are dependent
	- Any event is independent of the empty set
- Pairwise independence does not imply mutual independence
- Conditional Independence. $P(A\cap B|C)=P(A|C)P(B|C)$
---
## Ch4
### Random Variable (**Lecture 7**)
- A random variable is a function that maps outcomes to numbers
- Discrete/continuous/mixed
- Facts:
- Have probability function
- Can be conditioned on events or other random variable
- Have averages
- $P(X=x)$ / $P(x)$ / $p(x)$
- $\ge0$
- sum to 1
- Cumulative distribution
	- $P(X\le x)$
	- Discrete: $P(X\le x) =\sum_{x_i\le x} P(x_i)$
- Conditional Probability
- $P(x|y)=\frac{P(x,y)}{P(y)}$
- Joint probability
- See slide 34
### Expected Value (**Lecture 8**)
- Expected value of a random variable $X$ is $E[X]=\sum xP(x)$
- It's a weighted sum
- $E[kX] = kE[X]$
- $E[X+Y]=E[X] + E[Y]$
- $E[kX+c] = kE[X]+c$
- If $f$ is a function of a random variable, $Y=f(X)$ is also a random variable
	- We can compute its expectation directly from the distribution of $X$ (change of variable)
- $E[Y]=E[f(X)]=\sum f(x)P(x)$
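A tiny sketch of the weighted-sum definition and the change-of-variable rule, with a made-up pmf:

```python
import numpy as np

# A discrete random variable: values and a made-up probability function P(x)
x = np.array([1, 2, 3, 4])
p = np.array([0.1, 0.2, 0.3, 0.4])   # sums to 1

e_x  = np.sum(x * p)        # E[X]    = sum x P(x)
e_x2 = np.sum(x**2 * p)     # E[f(X)] = sum f(x) P(x), here f(x) = x^2

# Note E[X^2] is not (E[X])^2; the gap is var[X] (next subsection)
print(e_x, e_x2, e_x2 - e_x**2)   # 3.0, 10.0, 1.0
```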
### Variance of R.V.(**Lecture 8**)
- $var[X]=E[(X-E[X])^2]=E[X^2]-E[X]^2$
	- $var[X]\ge0$
	- $var[kX]=k^2var[X]$
	- Recall that $E[X^2]=\sum x^2P(x)$
- $var[X+Y]=var[X]+var[Y]+2cov(X,Y)$
- Can't assume $E[g(X)]=g(E[X])$ in general
### Covariance (**Lecture 8**)
- $cov(X,Y)=E[(X-E[X])(Y-E[Y])]=E[XY]-E[X]E[Y]$
	- $cov(X,X)=E[(X-E[X])^2]=var[X]$
- $corr(X,Y)=\frac{cov(X,Y)}{\sigma_X\sigma_Y}=\frac{E[XY]-E[X]E[Y]}{\sigma_X\sigma_Y}$
- When the covariance is zero, $E[XY]=E[X]E[Y]$. Zero covariance is necessary for independence **but not sufficient**
- These are equivalent:
- $cov(X,Y)=corr(X,Y)=0$
- $E[XY]=E[X]E[Y]$
- $var[X+Y]=var[X]+var[Y]$
- Uncorrelated != independent
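The classic counterexample for "uncorrelated ≠ independent" is $X$ uniform on $\{-1,0,1\}$ with $Y=X^2$; a quick numeric check:

```python
import numpy as np

# X uniform on {-1, 0, 1}; Y = X^2 is completely determined by X,
# so X and Y are clearly dependent.
xs = np.array([-1, 0, 1])
p  = np.array([1/3, 1/3, 1/3])
ys = xs**2

e_x, e_y, e_xy = np.sum(xs*p), np.sum(ys*p), np.sum(xs*ys*p)

# cov(X, Y) = E[XY] - E[X]E[Y] = 0: uncorrelated, yet not independent
print(e_xy - e_x * e_y)   # 0.0
```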
### Weak Law of Large Numbers (**Lecture 9**)
- If a random experiment is repeated many times, the average of the observations converges to the expected value
	- Justifies using simulations to estimate expected values of random variables
	- Justifies using histograms of large random samples to approximate probability distributions
- Some details omitted here; see pg 40.
### Markov's and Chebyshev's Inequality (**Lecture 9, Pg30**)
- PS: I don't think I ever saw these in hw and mt
- Markov: for any random variable $X$ with $X\ge0$ and any constant $a>0$
- $P(X\ge a)\le\frac{E[X]}{a}$
- Chebyshev: For any random variable X and constant $a >0$:
- $P(|X−E[X]|≥a)≤\frac{var[X]}{a^2}$
### IID (**Lecture 9**)
- If $X_1,\dots,X_N$ are independent and have the identical probability function $P(x)$, then the randomly generated numbers are called IID samples
- Sample mean ($\bar X$) is a random variable
- $E[\bar X]=E[X]$
- $var[\bar X]=\frac{var[X]}{N}$
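A simulation sketch (numpy assumed; the die and sample sizes are arbitrary choices) checking $E[\bar X]=E[X]$ and $var[\bar X]=var[X]/N$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 25, 100_000

# Each row is one realization of N IID rolls of a fair six-sided die
# (E[X] = 3.5, var[X] = 35/12).
rolls = rng.integers(1, 7, size=(trials, N))
sample_means = rolls.mean(axis=1)

print(sample_means.mean(), 3.5)            # E[X_bar] = E[X]
print(sample_means.var(), (35/12) / N)     # var[X_bar] = var[X] / N
```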
### Indicator function (**Lecture 10**)
- With $Y_i=\mathbb{1}[c_1\le X_i<c_2]$, $E[Y_i]=P(c_1\le X_i<c_2)=P(c_1\le X<c_2)$
---
## Ch5
### Table for Common Probability Distributions
[See also](https://compass2g.illinois.edu/bbcswebdav/pid-335998-dt-announcement-rid-62445723_1/courses/cs_361_120208_194044/Table_SomeDistri.pdf)
#### Discrete
| Name | Probability Distribution | $E[X]$ | $var[X]$ | Constraints | Parameters |
| - | - | - | - | - | - |
| Bernoulli | $P(x)=\begin{cases}p,&x=1\\1-p,&x=0\\0,&otherwise\end{cases}$| $p$ | $p(1-p)$ | $0\le p\le1$ | $X\sim Bernoulli(p)$ |
| Binomial | $P(X=k)={n \choose k}p^k(1-p)^{n-k}$ | $np$ | $np(1-p)$ | $0≤p≤1$ | $X\sim B(n,p)$ |
| Geometric | $P(X=k)=(1-p)^{k-1}p$ | $\frac{1}{p}$ | $\frac{1-p}{p^2}$ | $0≤p≤1$ | $X\sim G(p)$ |
| Poisson | $P(X=k)=\frac{e^{-\lambda} \lambda^k}{k!}$ | $\lambda$ | $\lambda$ | $\lambda≥0$ | $X\sim Po(\lambda)$ |
#### Continuous
| Name | Probability Distribution Function | Expected Value ($E[X]$) | Variance ($var[X]$) | Constraints | Parameters |
| - | - | - | - | - | - |
| Normal | $\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\mu$ | $\sigma^2$ | | $X\sim N(\mu, \sigma^2)$ |
| Standard Normal | $\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$ | $0$ | $1$ | | $X\sim N(0, 1)$ |
| Exponential | $\lambda e^{-\lambda x}$, $x\ge0$ | $\frac1\lambda$ | $\frac{1}{\lambda^2}$ | $\lambda>0$ | $X\sim Exp(\lambda)$ |
| Uniform | $p(x)=\frac{1}{b-a}$, $a\le x\le b$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | $a<b$ | $X\sim U(a,b)$ |
### Probability Density Function (**Lecture 11**)
- $p(x)\ge0$ and $\int_{-\infty}^\infty p(x)\,dx=1$
- Differences from a (discrete) probability distribution
	- Not the probability that $X=x$
- Can exceed 1
- $E[X]=\int_{-\infty}^\infty xp(x)dx$
### Central limit theorem (**Lecture 12**)
- The distribution of the sum of $N$ independent, identically distributed (IID) random variables tends toward a normal distribution as $N$ grows
- Example
	- A binomial distribution looks approximately normal when $N$ is large
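A quick CLT sketch: sums of IID uniform(0,1) variables, which are individually far from normal, are approximately normally distributed for moderate $N$ (the sizes below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 30, 50_000

# Distribution of the sum of N IID uniform(0,1) variables
sums = rng.uniform(0, 1, size=(trials, N)).sum(axis=1)

# Mean and variance match the normal the CLT predicts: N*1/2 and N*1/12;
# a histogram of `sums` would look like a bell curve.
print(sums.mean(), N * 0.5)
print(sums.var(), N / 12)
```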
---
## Ch6
### Population and Sample (**Lecture 13**)
- Population
	- the entire dataset $\{x\}$, of finite size
	- has population mean popmean($\{x\}$) and population standard deviation popsd($\{x\}$), which are fixed numbers
- Sample
- A random subset of population, sampled with replacement
- sample mean $X^{(N)}$ is a random variable
- IID samples
		- The sample mean of a population is very similar to the sample mean of $N$ random variables if the samples are IID samples: randomly & independently drawn with replacement
### Sample mean of a population (**Lecture 13**)
- $X^{(N)}=\frac1N(X_1+X_2+...X_N)$
- $E[X^{(N)}]=E[X^{(1)}]=popmean({x})$
	- $X^{(N)}$ is an unbiased estimator of the population mean
- $var[X^{(N)}]=\frac{popvar(X)}{N}$
- $std[X^{(N)}]=\frac{popsd(X)}{\sqrt N}$
- $stdunbiased(\{x\})=\sqrt{\frac{1}{N-1}\sum_i (x_i-mean(\{x\}))^2}$
- $stderr(\{x\}) = \frac{stdunbiased(\{x\})}{\sqrt N}$, an estimate of $std[X^{(N)}]=\frac{popsd(\{x\})}{\sqrt N}$ when the population sd is unknown
- Interpret standard error
- Sample mean is a random variable and has its own probability distribution, stderr is an estimate of the sample mean’s standard deviation
	- When $N$ is large (roughly $N>30$), by the CLT the sample mean is approximately normally distributed
- 68%: ±1 stderr
- 95%: ±2 stderr
- 99.7%: ±3 stderr
- Centered confidence intervals
	- For a fraction $1-2a$ of realized sample means, the population mean lies within the $\pm b$ interval around the realized sample mean
- Hypothesis Test
- Determine whether a claim holds, see lower section.
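A short sketch of the standard error and a ±2-stderr (≈95%) interval for a made-up sample; strictly, with $N$ this small the t-distribution of the next subsection would give a slightly wider interval:

```python
import numpy as np

# A made-up realized sample
x = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7, 5.3, 4.1])
N = len(x)

mean = x.mean()
std_unbiased = x.std(ddof=1)              # divides by N - 1
stderr = std_unbiased / np.sqrt(N)

# Approximate 95% confidence interval: mean +/- 2 standard errors
print(mean, stderr)
print(mean - 2*stderr, mean + 2*stderr)
```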
### t-distribution (**Lecture 13**)
- Use when $N<30$, degree of freedom = $N - 1$
### Sample statistics and Bootstrap (**Lecture 14**)
- A sample statistic is a statistic computed from the data set formed by the realized sample
- First draw the sample (without replacement) from the population, then bootstrap-resample it with replacement, and finally compute the bootstrap statistics
- The distribution simulated from bootstrapping is called the empirical distribution. It is not the true population distribution, so there is a statistical error; the number of bootstrap replicates may also not be enough, so there is a numerical error as well
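A bootstrap sketch (numpy assumed, made-up data): resample the realized sample with replacement, recompute the statistic each time, and read intervals off the empirical distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# The realized sample (made up); we want the sampling distribution of its median
sample = np.array([3, 7, 1, 9, 4, 6, 2, 8, 5, 10])

# Bootstrap: resample WITH replacement, many replicates
B = 10_000
replicates = rng.choice(sample, size=(B, len(sample)), replace=True)
boot_medians = np.median(replicates, axis=1)

# Empirical distribution of the statistic; e.g., a 95% bootstrap interval
print(np.percentile(boot_medians, [2.5, 97.5]))
```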
---
## Ch7
### Hypothesis Test (**Lecture 15**)
- Assume hypothesis $H_0$ is true
- Define test statistic
- $x=\frac{\text{sample mean}-\text{hypothesized value}}{\text{standard error}}$
	- When $N>30$, use the normal distribution; otherwise use the t-distribution with $N-1$ degrees of freedom
	- The fraction $f$ of the test statistic's distribution that lies closer to the hypothesized value than the observed statistic is computed as an integral (area under the density)
- Report P-Value
	- defined as $p=1-f$: the probability of seeing a test statistic at least as extreme as the observed one, assuming $H_0$ is true
- By convention, reject when $p<0.05$
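A sketch of the recipe above for a made-up sample with hypothesized mean 5.0 (scipy assumed for the t-distribution CDF):

```python
import numpy as np
from scipy import stats

# Made-up sample; H0: the population mean is 5.0
x = np.array([5.4, 5.1, 4.8, 5.6, 5.3, 5.7, 4.9, 5.5, 5.2, 5.8])
N, hypothesized = len(x), 5.0

stderr = x.std(ddof=1) / np.sqrt(N)
t_stat = (x.mean() - hypothesized) / stderr

# N < 30, so use the t-distribution with N - 1 degrees of freedom.
# Two-sided p-value = probability of a statistic at least this extreme under H0.
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=N - 1))
print(t_stat, p_value)   # reject H0 at the 5% level if p_value < 0.05
```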
### Chi-squared test (**Lecture 15**)
- Tests independence between two categorical variables (the rows and columns of a contingency table)
- Degrees of freedom $=(r-1)(c-1)$ for a table with $r$ rows and $c$ columns
- Very versatile
---
## Ch9
### Maximum Likelihood Estimation(MLE) (**Lecture 15**)
- Estimate the parameter assuming a certain distribution
- We write the probability of seeing data D given parameter $\theta$ as $L(\theta)=P(D|\theta)$
- MLE of $\theta$ is $\hat\theta= \arg\max_\theta L(\theta)$
- L is **not a probability distribution**
- With IID, $L(\theta)=\prod P(x_i|\theta)$
- Unreliable with few data
### Common MLE (**Lecture 15**)
- Binomial $\hat\theta =\frac kN$
- Poisson $\hat\theta =\frac {\sum_i^N k_i}N$
- Normal: $\hat\mu=\frac1N\sum_i x_i$ (the sample mean) and $\hat\sigma^2=\frac1N\sum_i(x_i-\hat\mu)^2$
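A quick check of the Poisson MLE (made-up counts): the closed form $\hat\lambda = \frac{\sum_i k_i}{N}$ matches a direct maximization of the log-likelihood over a grid:

```python
import numpy as np

# Made-up Poisson counts; the MLE of lambda is the sample mean
k = np.array([2, 3, 1, 4, 2, 5, 3, 2])
lam_mle = k.mean()                       # 2.75

# Check by maximizing the log-likelihood on a grid:
# log L(lam) = sum_i (k_i * log(lam) - lam) + const
lams = np.linspace(0.1, 10, 1000)
loglik = np.array([np.sum(k * np.log(l) - l) for l in lams])
print(lam_mle, lams[np.argmax(loglik)])  # both ~2.75
```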
### Sum/difference of Independent Normals (**end of Lecture 15**)
- The sum/difference of independent normals is again normal: $X\pm Y\sim N(\mu_X\pm\mu_Y,\ \sigma_X^2+\sigma_Y^2)$. For the algebra, see slides ***Page 55***
### Bayesian Inference (**Lecture 16**)
- Want to maximize **posterior**, probability of $\theta$ given observed D.
- $P(\theta|D)$, $\theta$ is R.V.
- It's a probability distribution
	- Its maximum is called the maximum a posteriori (MAP) estimate
- From Bayes Rule
- $P(\theta|D)=\frac{P(D|\theta)P(\theta)}{P(D)}$
- Note that $P(D|\theta)=L(\theta)$
- Benefit
- Allows us to include beliefs about $\theta$
	- Useful when there is little data
	- Yields a full posterior distribution rather than only a point estimate
### Beta Distribution (**Lecture 16/17**)
- $P(\theta)=K(\alpha, \beta)\theta^{\alpha-1}(1-\theta)^{\beta-1}$
	- $K(\alpha, \beta)=\frac{\Gamma (\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$ -> don't think we need to know this
- Note that $Beta(\alpha = 1, \beta = 1)$ is uniform
- Use beta distribution for conjugate prior for binomial
- $P(\theta|D)=K(\alpha+k, \beta+N-k)\theta^{\alpha+k-1}(1-\theta)^{\beta+N-k-1}$
- maximized at $\hat\theta=\frac{\alpha-1+k}{\alpha+\beta-2+N}$
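A small sketch of the conjugate update and the MAP estimate (the prior and data values are made up):

```python
# Beta(a, b) prior on theta; binomial data with k successes in N trials
a, b = 2, 2              # made-up prior (alpha, beta)
N, k = 10, 7             # made-up data

# Conjugacy: the posterior is Beta(a + k, b + N - k)
post_a, post_b = a + k, b + N - k

# MAP estimate = mode of the Beta posterior
theta_map = (post_a - 1) / (post_a + post_b - 2)
theta_mle = k / N        # the MLE ignores the prior
print(theta_map, theta_mle)   # 0.666..., 0.7
```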
### Common Prior (**Lecture 17**)
|Likelihood|Conjugate|
|--|--|
|Bernoulli/Geometric/Binomial| Beta|
|Poisson/Exponential|Gamma|
|Normal with known $\sigma^2$|Normal|
But what is gamma distribution and what is Normal with known $\sigma^2$ ??
---
## Ch10
---
The rest is omitted for now since it is newer material and makes up a small percentage of the exam.
# To be reorganized into the sections above later
### Handling Multi-Dimensional Data
* Data matrix is $N\times d$ in dimension
* $N$ rows of entries
* $d$ columns of features
* $x_i^{(j)}$ denotes the $j$th component of the $i$th entry
* Mean centering is the process where the mean of each feature (column) is subtracted from every entry in that column
### Covariance Matrix
* Covariance matrix contains the covariance between every pair of components
* Shorthand $covmat$
	* $covmat=\frac{X_cX_c^T}{N}$, where $X_c$ is the mean-centered data matrix with data items as columns ($d\times N$); with the $N\times d$ convention above it is $\frac{X_c^TX_c}{N}$
* Diagonal entries are the variance for those components, since $cov(X,X)=var[X]$
* Properties
* Covmat is **Positive Semi-Definite**
* $\lambda_i≥0$ for all eigenvalues
* **Positive Definite** is the case where they are strictly larger than 0
	* Covmat is also symmetric, i.e. $A_{ij}=A_{ji}$, since $cov(X_i,X_j)=cov(X_j,X_i)$
* Covmat is diagonalizable ($A=PDP^{-1}$)
### Principal Component Analysis (PCA)
* Steps of doing PCA
1. Mean center the data, transform $X$ to $X_c$
2. Diagonalize covmat
	3. Find the eigenvalues and eigenvectors of covmat; sort the eigenvalues in decreasing order
	4. Project the mean-centered data onto the eigenvectors with the largest eigenvalues (the principal components) to reduce dimension; see the sketch below
* MSE of PCA
	* The mean squared error of a top-$r$ representation equals the sum of the eigenvalues of the omitted components
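A PCA sketch in numpy following the steps above (the data is randomly generated; the $N\times d$ row convention is assumed, so covmat $=\frac{X_c^TX_c}{N}$). It also checks that the MSE of a top-$r$ representation equals the sum of the omitted eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up N x d data (rows are entries, columns are features)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])

# 1. Mean center each feature
Xc = X - X.mean(axis=0)

# 2. Covariance matrix (d x d) and its eigendecomposition
covmat = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(covmat)          # ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Project onto the top r principal components
r = 2
projected = Xc @ eigvecs[:, :r]

# MSE of the r-dimensional representation = sum of the omitted eigenvalues
reconstructed = projected @ eigvecs[:, :r].T
mse = np.mean(np.sum((Xc - reconstructed) ** 2, axis=1))
print(mse, eigvals[r:].sum())                      # approximately equal
```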
### ML Overview
* General categories
* Classification
* k Nearest Neighbour (kNN)
* Decision Tree
		* Random Forest
* Naïve Bayesian Classifier
* Support Vector Machine (SVM)
* Regression
* Least Square Solution of Linear Regression
* Unsupervised Learning
		* Principal Component Analysis
* Hierarchical Clustering
* kMeans Clustering
### Least Square Solution
* For data $X$ and $y$, least square solution is $(X^TX)^{-1}X^Ty$
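A minimal sketch (numpy, made-up points) of the closed-form least squares solution, fitting an intercept and slope:

```python
import numpy as np

# Made-up data: fit y = b0 + b1 * x by least squares
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Closed form (X^T X)^{-1} X^T y, solved without forming an explicit inverse
beta = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq gives the same answer and is numerically safer
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta, beta_lstsq)
```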
### Sample Problem: kMeans Clustering
* Problem: Consider the dataset $\{(0, 0), (0, 2), (0, 10), (20, 0), (20, 2), (20, 10)\}$. Find the eventual cluster centers after convergence of the k-means clustering algorithm with initial cluster centers $c_1 = (0, 0)$ and $c_2 = (20, 0)$. Also find the eventual cluster centers after convergence if the initial cluster centers are $c_1 = (0, 0)$ and $c_2 = (0, 2)$.
* Solution:
	* First part
		* Iteration 0: $c_1=(0,0)$ with $\{(0, 0), (0, 2), (0, 10)\}$, $c_2=(20,0)$ with $\{(20, 0), (20, 2), (20, 10)\}$
		* Iteration 1: $c_1=(0,4)$ with $\{(0, 0), (0, 2), (0, 10)\}$, $c_2=(20,4)$ with $\{(20, 0), (20, 2), (20, 10)\}$
		* Clustering does not change, so the algorithm is complete: final centers $(0,4)$ and $(20,4)$
	* Second part
		* Iteration 0: $c_1=(0,0)$ with $\{(0, 0), (20, 0)\}$, $c_2=(0,2)$ with $\{(0, 2), (0, 10), (20, 2), (20, 10)\}$
		* Iteration 1: $c_1=(10,0)$ with $\{(0, 0), (0, 2), (20, 0), (20, 2)\}$, $c_2=(10,6)$ with $\{(0, 10), (20, 10)\}$
		* Iteration 2: $c_1=(10,1)$, $c_2=(10,10)$; the assignments no longer change, so the final centers are $(10,1)$ and $(10,10)$
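A plain numpy k-means sketch (the helper `kmeans` below is written just for this note, not from the course code) that reproduces both parts of the problem:

```python
import numpy as np

def kmeans(points, centers, iters=100):
    """Assign each point to its nearest center, move each center to the
    mean of its assigned points, and repeat until the centers stop moving."""
    points, centers = np.asarray(points, float), np.asarray(centers, float)
    for _ in range(iters):
        # Distance from every point to every center, shape (n_points, k)
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

data = [(0, 0), (0, 2), (0, 10), (20, 0), (20, 2), (20, 10)]
print(kmeans(data, [(0, 0), (20, 0)]))   # -> (0, 4) and (20, 4)
print(kmeans(data, [(0, 0), (0, 2)]))    # -> (10, 1) and (10, 10)
```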
### Markov Chain
### MSE of PCA