# CS361 Final Notes
## Ch1 & Ch2
### Data types: Categorical/Ordinal/Continuous
### Location parameters:
- Mean (**Lecture 2**)
- Symbol: $\mu$
- Definition: $\mu$ = $\frac1N\sum_{i=1}^N x_i$
- Properties
- $mean(k x) = k\cdot mean(x)$
- $mean(x+c) = mean(x) + c$
- Median (**Lecture 3**)
	- Definition: middle value of the sorted data
- Properties
- $median(k\cdot x)=k\cdot median(x)$
- $median(x+c)=median(x)+c$
- Mode (**Lecture 3**)
- Peak of histogram (most frequent value)
### Scale Parameter
- Standard deviation (**Lecture 2**)
- Symbol: $\sigma$
- Definition: $\sigma=\sqrt{\frac1N\sum_{i=1}^N(x_i-mean(x_i))^2}$
- Properties
- $std(kx)=|k|\cdot std(x)$
- $std(x+c)=std(x)$
- Chebyshev's Inequality:
		- At most $\frac{N}{k^2}$ items are $k$ or more standard deviations away from the mean
- Variance (**Lecture 2**)
- Symbol: $\sigma^2$
- Definition: $\sigma^2=\frac1N\sum_{i=1}^N(x_i-mean(x_i))^2$
- Properties
- $var(kx)=k^2var(x)$
- $var(x+c)=var(x)$
- Percentile (**Lecture 3**)
- Definition: kth percentile means k% of data is smaller
- Median is 50th percentile
- IQR (**Lecture 3**)
- Definition: IQR = 75th percentile - 25th percentile
- Properties
- $iqr(k\cdot x)=|k|\cdot iqr(x)$
- $iqr(x+c)=iqr(x)$
### Standard coordinate (**Lecture 2**)
- Symbol: $\hat{x}$
- Definition: $\hat x=\frac{x-mean(x)}{std(x)}$
- Properties
- $\mu(\hat{x})=0$
- $\sigma(\hat{x})=1$
- $\sigma^2(\hat{x})=1$, unitless.
### Outlier (**Lecture 3**)
- A point is an outlier if it lies below the 25th percentile minus $1.5\cdot iqr$ or above the 75th percentile plus $1.5\cdot iqr$
- mean and std are sensitive to outliers; median and iqr are not
### Skew (**Lecture 3**)
- Describes which side the longer tail of the histogram is on
- Right-skewed: mean $>$ median
- Left-skewed: mean $<$ median
### Correlation Coefficient (**Lecture 3**)
- $corr(x,y)=\frac1N\sum_{i=1}^N\hat x_i\hat y_i=mean(\hat x\hat y)=\sum_{i=1}^N\frac{\hat x_i}{\sqrt N}\frac{\hat y_i}{\sqrt N}$
	- Does not change when translating the data
- unitless and defined in standard coordinate
- bounded in $[-1, 1]$
- $corr(ax+b,cy+d)=sign(ac)corr(x,y)$
- causation $\Rightarrow$ correlation, correlation $\nRightarrow$ causation
### Prediction (**Lecture 3**)
- Use standard coordinates
- $\hat y^P = a\hat x +b$.
- Error $u=\hat y - \hat y^P = \hat y - a\hat x -b$
- Least-squares choice: $a=r$, $b=0$, so $\hat y^P = r\hat x$ (for a particular point $\hat x_0$, $\hat y_0^P=r\hat x_0$).
- RMS error = $\sqrt{mean(u^2)}=\sqrt{1-r^2}$
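A minimal numpy sketch of the above (the data values are made up for illustration): compute $r$ in standard coordinates, predict with $\hat y^P=r\hat x$, and check that the RMS error matches $\sqrt{1-r^2}$.

```python
import numpy as np

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Standard coordinates: subtract the mean, divide by the (population) std
x_hat = (x - x.mean()) / x.std()
y_hat = (y - y.mean()) / y.std()

# Correlation coefficient r = mean(x_hat * y_hat)
r = np.mean(x_hat * y_hat)

# Prediction in standard coordinates and its RMS error
y_hat_pred = r * x_hat
rms = np.sqrt(np.mean((y_hat - y_hat_pred) ** 2))

print(r, rms, np.sqrt(1 - r**2))   # the last two agree
```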
---
## Ch3
### Outcome/Sample Space/Event (**Lecture 4**)
- Outcome: a possible result of a random, repeatable experiment
- Sample Space: set of all possible outcomes, symbol: $\Omega$
- Event: subset of sample space
- Venn diagram: events behave like sets and can be combined with set operations
### Probability Axiom (**Lecture 4**)
- $P(E)\ge0$
- $P(\Omega) = 1$
- Probability of disjoint events is additive: $P(E_1\cup E_2\cup\dots)=\sum_i P(E_i)$ if $E_i\cap E_j=\emptyset$ for all $i\ne j$
### Properties of probability (**Lecture 4**)
- Complement:$P(E^c) = 1-P(E)$
- Difference: $P(E_1-E_2)=P(E_1) - P(E_1\cap E_2)$
- Union: $P(E_1\cup E_2)=P(E_1) +P(E_2) - P(E_1\cap E_2)$
- Multiple union, see slide
### Calculate probability
- For an event made of countably many outcomes: $P(E) = \sum_{A_i\in E} P(A_i)$ (**Lecture 4**)
	- Equally likely outcomes: $P(E)=\frac{\text{number of outcomes in } E}{\text{total number of outcomes}}$
- Use complement (**Lecture 4**)
- $P(E)=1-P(E^c)$
- Example
- Senate & Birthday problem (**Lecture 5**)
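For the birthday problem, the complement trick is easy to check numerically; a small sketch (numpy assumed, 365 equally likely birthdays):

```python
import numpy as np

def p_shared_birthday(n):
    """P(at least two of n people share a birthday), via the complement:
    P(no shared birthday) = (365/365) * (364/365) * ... * ((365-n+1)/365)."""
    days = np.arange(365, 365 - n, -1)
    return 1.0 - np.prod(days / 365.0)

print(p_shared_birthday(23))   # ~0.507, already above one half
```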
### Conditional Probability (**Lecture 5**)
- Definition of conditional probability: the probability of A given B
- $P(A|B)=\frac{P(A\cap B)}{P(B)}$
- Multiplication->Probability Tree
- Symmetry of joint event
- $P(A\cap B)=P(A|B)P(B)$
- $P(B\cap A)=P(B|A)P(A)$
- Therefore $P(A|B)P(B)=P(B|A)P(A)$
### Bayes Rule (**Lecture 5**)
- $P(A|B)=\frac{P(B|A)P(A)}{P(B)}=\frac{P(A\cap B)}{P(B)}$
### Total Probability (**Lecture 6**)
- $P(B)=P(B\cap A) + P(B\cap A^c)=\sum_j P(B\cap A_j)=\sum_j P(B|A_j)P(A_j)$, where the events $A_j$ partition the sample space
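A small worked sketch combining total probability and Bayes' rule; the numbers (a rare condition and an imperfect test) are made up for illustration:

```python
# Made-up numbers for a rare condition and an imperfect test
p_A      = 0.01   # P(A): prior probability of having the condition
p_B_A    = 0.95   # P(B|A): test positive given the condition
p_B_notA = 0.05   # P(B|A^c): false positive rate

# Total probability: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
p_B = p_B_A * p_A + p_B_notA * (1 - p_A)

# Bayes rule: P(A|B) = P(B|A)P(A) / P(B)
p_A_B = p_B_A * p_A / p_B
print(p_B, p_A_B)   # P(A|B) is only ~0.16 despite the accurate test
```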
### Independence (**Lecture 6**)
- $P(A|B)=P(A)$ or $P(B|A)=P(B)$
- Event $A$ and $B$ are independent iff $P(A\cap B)=P(A)P(B)$
	- Two disjoint events that each have probability $>0$ are dependent
	- Any event is independent of the empty set
- Pairwise independence does not imply mutual independence
- Conditional Independence. $P(A\cap B|C)=P(A|C)P(B|C)$
---
## Ch4
### Random Variable (**Lecture 7**)
- A random variable is a function that maps outcomes to numbers
- Discrete/continuous/mixed
- Facts:
- Have probability function
- Can be conditioned on events or other random variable
- Have averages
- $P(X=x)$ / $P(x)$ / $p(x)$
- $\ge0$
- sum to 1
- Cumulative distribution
	- $P(X\le x)$
	- Discrete: $P(X\le x) =\sum_{x_i\le x} P(x_i)$
- Conditional Probability
- $P(x|y)=\frac{P(x,y)}{P(y)}$
- Joint probability
- See slide 34
### Expected Value (**Lecture 8**)
- Expected value of a random variable $X$ is $E[X]=\sum xP(x)$
- It's a weighted sum
- $E[kX] = kE[X]$
- $E[X+Y]=E[X] + E[Y]$
- $E[kX+c] = kE[X]+c$
- If $f$ is a function of a random variable, $Y=f(X)$ is also a random variable
	- We can compute its expectation directly from the distribution of $X$ (change of variable)
- $E[Y]=E[f(X)]=\sum f(x)P(x)$
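A tiny sketch of the weighted-sum definition and the change-of-variable rule, with a made-up pmf:

```python
import numpy as np

# A discrete random variable: values and a made-up probability function P(x)
x = np.array([1, 2, 3, 4])
p = np.array([0.1, 0.2, 0.3, 0.4])   # sums to 1

e_x  = np.sum(x * p)        # E[X]    = sum x P(x)
e_x2 = np.sum(x**2 * p)     # E[f(X)] = sum f(x) P(x), here f(x) = x^2

# Note E[X^2] is not (E[X])^2; the gap is var[X] (next subsection)
print(e_x, e_x2, e_x2 - e_x**2)   # 3.0, 10.0, 1.0
```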
### Variance of R.V.(**Lecture 8**)
- $var[X]=E[(X-E[X])^2]=E[X^2]-E[X]^2$
	- $var[X]\ge0$
	- $var[kX]=k^2var[X]$
	- Recall that $E[X^2]=\sum x^2P(x)$
- $var[X+Y]=var[X]+var[Y]+2cov(X,Y)$
- Can't assume $E[g(X)]=g(E[X])$ in general
### Covariance (**Lecture 8**)
- $cov(X,Y)=E[(X-E[X])(Y-E[Y])]=E[XY]-E[X]E[Y]$
	- $cov(X,X)=E[(X-E[X])^2]=var[X]$
- $corr(X,Y)=\frac{cov(X,Y)}{\sigma_X\sigma_Y}=\frac{E[XY]-E[X]E[Y]}{\sigma_X\sigma_Y}$
- When the covariance is zero, $E[XY]=E[X]E[Y]$. Zero covariance is necessary for independence **but not sufficient**
- These are equivalent:
- $cov(X,Y)=corr(X,Y)=0$
- $E[XY]=E[X]E[Y]$
- $var[X+Y]=var[X]+var[Y]$
- Uncorrelated != independent
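The classic counterexample for "uncorrelated ≠ independent" is $X$ uniform on $\{-1,0,1\}$ with $Y=X^2$; a quick numeric check:

```python
import numpy as np

# X uniform on {-1, 0, 1}; Y = X^2 is completely determined by X,
# so X and Y are clearly dependent.
xs = np.array([-1, 0, 1])
p  = np.array([1/3, 1/3, 1/3])
ys = xs**2

e_x, e_y, e_xy = np.sum(xs*p), np.sum(ys*p), np.sum(xs*ys*p)

# cov(X, Y) = E[XY] - E[X]E[Y] = 0: uncorrelated, yet not independent
print(e_xy - e_x * e_y)   # 0.0
```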
### Weak Law of Large Numbers (**Lecture 9**)
- If a random experiment is repeated many times, the average of the observations converges to the expected value
	- Justifies using simulations to estimate expected values of random variables
	- Justifies using histograms of large random samples to approximate probability distributions
- Some details omitted here; see pg 40.
### Markov's and Chebyshev's Inequality (**Lecture 9, Pg30**)
- PS: I don't think I ever saw these in hw and mt
- Markov: for any random variable $X$ with $X\ge0$ and any constant $a>0$
- $P(X\ge a)\le\frac{E[X]}{a}$
- Chebyshev: For any random variable X and constant $a >0$:
- $P(|X−E[X]|≥a)≤\frac{var[X]}{a^2}$
### IID (**Lecture 9**)
- If $X_1,\dots,X_N$ are independent and have the identical probability function $P(x)$, then the randomly generated numbers are called IID samples
- Sample mean ($\bar X$) is a random variable
- $E[\bar X]=E[X]$
- $var[\bar X]=\frac{var[X]}{N}$
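A simulation sketch (numpy assumed; the die and sample sizes are arbitrary choices) checking $E[\bar X]=E[X]$ and $var[\bar X]=var[X]/N$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 25, 100_000

# Each row is one realization of N IID rolls of a fair six-sided die
# (E[X] = 3.5, var[X] = 35/12).
rolls = rng.integers(1, 7, size=(trials, N))
sample_means = rolls.mean(axis=1)

print(sample_means.mean(), 3.5)            # E[X_bar] = E[X]
print(sample_means.var(), (35/12) / N)     # var[X_bar] = var[X] / N
```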
### Indicator function (**Lecture 10**)
- With $Y_i=\mathbb{1}[c_1\le X_i<c_2]$, $E[Y_i]=P(c_1\le X_i<c_2)=P(c_1\le X<c_2)$
---
## Ch5
### Table for Common Probability Distributions
[See also](https://compass2g.illinois.edu/bbcswebdav/pid-335998-dt-announcement-rid-62445723_1/courses/cs_361_120208_194044/Table_SomeDistri.pdf)
#### Discrete
| Name | Probability Distribution | $E[X]$ | $var[X]$ | Constraints | Parameters |
| - | - | - | - | - | - |
| Bernoulli | $P(x)=\begin{cases}p,&x=1\\1-p,&x=0\\0,&otherwise\end{cases}$| $p$ | $p(1-p)$ | $0\le p\le1$ | $X\sim Bernoulli(p)$ |
| Binomial | $P(X=k)={n \choose k}p^k(1-p)^{n-k}$ | $np$ | $np(1-p)$ | $0≤p≤1$ | $X\sim B(n,p)$ |
| Geometric | $P(X=k)=(1-p)^{k-1}p$ | $\frac{1}{p}$ | $\frac{1-p}{p^2}$ | $0≤p≤1$ | $X\sim G(p)$ |
| Poisson | $P(X=k)=\frac{e^{-\lambda} \lambda^k}{k!}$ | $\lambda$ | $\lambda$ | $\lambda≥0$ | $X\sim Po(\lambda)$ |
#### Continuous
| Name | Probability Distribution Function | Expected Value ($E[X]$) | Variance ($var[X]$) | Constraints | Parameters |
| - | - | - | - | - | - |
| Normal | $\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\mu$ | $\sigma^2$ | | $X\sim N(\mu, \sigma^2)$ |
| Standard Normal | $\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$ | $0$ | $1$ | | $X\sim N(0, 1)$ |
| Exponential | $\lambda e^{-\lambda x}$, $x\ge0$ | $\frac1\lambda$ | $\frac{1}{\lambda^2}$ | $\lambda>0$ | $X\sim Exp(\lambda)$ |
| Uniform | $p(x)=\frac{1}{b-a}$, $a\le x\le b$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | $a<b$ | $X\sim U(a,b)$ |
### Probability Density Function (**Lecture 11**)
- $p(x)\ge0$ and $\int_{-\infty}^\infty p(x)\,dx=1$
- Differences from a (discrete) probability distribution
	- Not the probability that $X=x$
- Can exceed 1
- $E[X]=\int_{-\infty}^\infty xp(x)dx$
### Central limit theorem (**Lecture 12**)
- The distribution of the sum of $N$ independent, identically distributed (IID) random variables tends toward a normal distribution as $N$ grows
- Example
	- A binomial distribution looks approximately normal when $N$ is large
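A quick CLT sketch: sums of IID uniform(0,1) variables, which are individually far from normal, are approximately normally distributed for moderate $N$ (the sizes below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 30, 50_000

# Distribution of the sum of N IID uniform(0,1) variables
sums = rng.uniform(0, 1, size=(trials, N)).sum(axis=1)

# Mean and variance match the normal the CLT predicts: N*1/2 and N*1/12;
# a histogram of `sums` would look like a bell curve.
print(sums.mean(), N * 0.5)
print(sums.var(), N / 12)
```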
---
## Ch6
### Population and Sample (**Lecture 13**)
- Population
	- the entire dataset $\{x\}$, of finite size
	- has population mean popmean($\{x\}$) and population standard deviation popsd($\{x\}$), which are fixed numbers
- Sample
- A random subset of population, sampled with replacement
- sample mean $X^{(N)}$ is a random variable
- IID samples
		- The sample mean of a population is very similar to the sample mean of $N$ random variables if the samples are IID samples: randomly & independently drawn with replacement
### Sample mean of a population (**Lecture 13**)
- $X^{(N)}=\frac1N(X_1+X_2+...X_N)$
- $E[X^{(N)}]=E[X^{(1)}]=popmean({x})$
	- $X^{(N)}$ is an unbiased estimator of the population mean
- $var[X^{(N)}]=\frac{popvar(X)}{N}$
- $std[X^{(N)}]=\frac{popsd(X)}{\sqrt N}$
- $stdunbiased(\{x\})=\sqrt{\frac{1}{N-1}\sum_i (x_i-mean(\{x\}))^2}$
- $stderr(\{x\}) = \frac{stdunbiased(\{x\})}{\sqrt N}$, an estimate of $std[X^{(N)}]=\frac{popsd(\{x\})}{\sqrt N}$ when the population sd is unknown
- Interpret standard error
- Sample mean is a random variable and has its own probability distribution, stderr is an estimate of the sample mean’s standard deviation
	- When $N$ is large (roughly $N>30$), by the CLT the sample mean is approximately normally distributed
- 68%: ±1 stderr
- 95%: ±2 stderr
- 99.7%: ±3 stderr
- Centered confidence intervals
	- For a fraction $1-2a$ of realized sample means, the population mean lies within the $\pm b$ interval around the realized sample mean
- Hypothesis Test
- Determine whether a claim holds, see lower section.
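A short sketch of the standard error and a ±2-stderr (≈95%) interval for a made-up sample; strictly, with $N$ this small the t-distribution of the next subsection would give a slightly wider interval:

```python
import numpy as np

# A made-up realized sample
x = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7, 5.3, 4.1])
N = len(x)

mean = x.mean()
std_unbiased = x.std(ddof=1)              # divides by N - 1
stderr = std_unbiased / np.sqrt(N)

# Approximate 95% confidence interval: mean +/- 2 standard errors
print(mean, stderr)
print(mean - 2*stderr, mean + 2*stderr)
```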
### t-distribution (**Lecture 13**)
- Use when $N<30$, degree of freedom = $N - 1$
### Sample statistics and Bootstrap (**Lecture 14**)
- A sample statistic is a statistic computed from the data set formed by the realized sample
- First draw the sample (without replacement) from the population, then bootstrap-resample it with replacement, and finally compute the bootstrap statistics
- The distribution simulated from bootstrapping is called the empirical distribution. It is not the true population distribution, so there is a statistical error; the number of bootstrap replicates may also not be enough, so there is a numerical error as well
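A bootstrap sketch (numpy assumed, made-up data): resample the realized sample with replacement, recompute the statistic each time, and read intervals off the empirical distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# The realized sample (made up); we want the sampling distribution of its median
sample = np.array([3, 7, 1, 9, 4, 6, 2, 8, 5, 10])

# Bootstrap: resample WITH replacement, many replicates
B = 10_000
replicates = rng.choice(sample, size=(B, len(sample)), replace=True)
boot_medians = np.median(replicates, axis=1)

# Empirical distribution of the statistic; e.g., a 95% bootstrap interval
print(np.percentile(boot_medians, [2.5, 97.5]))
```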
---
## Ch7
### Hypothesis Test (**Lecture 15**)
- Assume hypothesis $H_0$ is true
- Define test statistic
- $x=\frac{\text{sample mean}-\text{hypothesized value}}{\text{standard error}}$
	- When $N>30$, use the normal distribution; otherwise use the t-distribution with $N-1$ degrees of freedom
	- The fraction $f$ of the test statistic's distribution that lies closer to the hypothesized value than the observed statistic is computed as an integral (area under the density)
- Report P-Value
	- defined as $p=1-f$: the probability of seeing a test statistic at least as extreme as the observed one, assuming $H_0$ is true
- By convention, reject when $p<0.05$
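A sketch of the recipe above for a made-up sample with hypothesized mean 5.0 (scipy assumed for the t-distribution CDF):

```python
import numpy as np
from scipy import stats

# Made-up sample; H0: the population mean is 5.0
x = np.array([5.4, 5.1, 4.8, 5.6, 5.3, 5.7, 4.9, 5.5, 5.2, 5.8])
N, hypothesized = len(x), 5.0

stderr = x.std(ddof=1) / np.sqrt(N)
t_stat = (x.mean() - hypothesized) / stderr

# N < 30, so use the t-distribution with N - 1 degrees of freedom.
# Two-sided p-value = probability of a statistic at least this extreme under H0.
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=N - 1))
print(t_stat, p_value)   # reject H0 at the 5% level if p_value < 0.05
```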
### Chi-squared test (**Lecture 15**)
- Tests independence between two categorical variables (the rows and columns of a contingency table)
- Degrees of freedom $=(r-1)(c-1)$ for a table with $r$ rows and $c$ columns
- Very versatile
---
## Ch9
### Maximum Likelihood Estimation(MLE) (**Lecture 15**)
- Estimate the parameter assuming a certain distribution
- We write the probability of seeing data D given parameter $\theta$ as $L(\theta)=P(D|\theta)$
- MLE of $\theta$ is $\hat\theta= \arg\max_\theta L(\theta)$
- L is **not a probability distribution**
- With IID, $L(\theta)=\prod P(x_i|\theta)$
- Unreliable with few data
### Common MLE (**Lecture 15**)
- Binomial $\hat\theta =\frac kN$
- Poisson $\hat\theta =\frac {\sum_i^N k_i}N$
- Normal: $\hat\mu=\frac1N\sum_i x_i$ (the sample mean) and $\hat\sigma^2=\frac1N\sum_i(x_i-\hat\mu)^2$
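A quick check of the Poisson MLE (made-up counts): the closed form $\hat\lambda = \frac{\sum_i k_i}{N}$ matches a direct maximization of the log-likelihood over a grid:

```python
import numpy as np

# Made-up Poisson counts; the MLE of lambda is the sample mean
k = np.array([2, 3, 1, 4, 2, 5, 3, 2])
lam_mle = k.mean()                       # 2.75

# Check by maximizing the log-likelihood on a grid:
# log L(lam) = sum_i (k_i * log(lam) - lam) + const
lams = np.linspace(0.1, 10, 1000)
loglik = np.array([np.sum(k * np.log(l) - l) for l in lams])
print(lam_mle, lams[np.argmax(loglik)])  # both ~2.75
```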
### Sum/difference of Independent Normals (**end of Lecture 15**)
- The sum/difference of independent normals is again normal: $X\pm Y\sim N(\mu_X\pm\mu_Y,\ \sigma_X^2+\sigma_Y^2)$. For the algebra, see slides ***Page 55***
### Bayesian Inference (**Lecture 16**)
- Want to maximize **posterior**, probability of $\theta$ given observed D.
- $P(\theta|D)$, $\theta$ is R.V.
- It's a probability distribution
	- Its maximum is called the maximum a posteriori (MAP) estimate
- From Bayes Rule
- $P(\theta|D)=\frac{P(D|\theta)P(\theta)}{P(D)}$
- Note that $P(D|\theta)=L(\theta)$
- Benefit
- Allows us to include beliefs about $\theta$
	- Useful when there is little data
	- Yields a full posterior distribution rather than only a point estimate
### Beta Distribution (**Lecture 16/17**)
- $P(\theta)=K(\alpha, \beta)\theta^{\alpha-1}(1-\theta)^{\beta-1}$
	- $K(\alpha, \beta)=\frac{\Gamma (\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$ -> don't think we need to know this
- Note that $Beta(\alpha = 1, \beta = 1)$ is uniform
- Use beta distribution for conjugate prior for binomial
- $P(\theta|D)=K(\alpha+k, \beta+N-k)\theta^{\alpha+k-1}(1-\theta)^{\beta+N-k-1}$
- maximized at $\hat\theta=\frac{\alpha-1+k}{\alpha+\beta-2+N}$
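A small sketch of the conjugate update and the MAP estimate (the prior and data values are made up):

```python
# Beta(a, b) prior on theta; binomial data with k successes in N trials
a, b = 2, 2              # made-up prior (alpha, beta)
N, k = 10, 7             # made-up data

# Conjugacy: the posterior is Beta(a + k, b + N - k)
post_a, post_b = a + k, b + N - k

# MAP estimate = mode of the Beta posterior
theta_map = (post_a - 1) / (post_a + post_b - 2)
theta_mle = k / N        # the MLE ignores the prior
print(theta_map, theta_mle)   # 0.666..., 0.7
```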
### Common Prior (**Lecture 17**)
|Likelihood|Conjugate|
|--|--|
|Bernoulli/Geometric/Binomial| Beta|
|Poisson/Exponential|Gamma|
|Normal with known $\sigma^2$|Normal|
But what is gamma distribution and what is Normal with known $\sigma^2$ ??
---
## Ch10
---
The rest is omitted for now since it is newer material and makes up a small percentage of the exam.
# To be reorganized into the sections above later
### Handling Multi-Dimensional Data
* Data matrix is $N\times d$ in dimension
* $N$ rows of entries
* $d$ columns of features
* $x_i^{(j)}$ denotes the $j$th component of the $i$th entry
* Mean centering is the process where the mean of each feature (column) is subtracted from every entry in that column
### Covariance Matrix
* Covariance matrix contains the covariance between every pair of components
* Shorthand $covmat$
	* $covmat=\frac{X_cX_c^T}{N}$, where $X_c$ is the mean-centered data matrix with data items as columns ($d\times N$); with the $N\times d$ convention above it is $\frac{X_c^TX_c}{N}$
* Diagonal entries are the variance for those components, since $cov(X,X)=var[X]$
* Properties
* Covmat is **Positive Semi-Definite**
* $\lambda_i≥0$ for all eigenvalues
* **Positive Definite** is the case where they are strictly larger than 0
	* Covmat is also symmetric, i.e. $A_{ij}=A_{ji}$, since $cov(X_i,X_j)=cov(X_j,X_i)$
* Covmat is diagonalizable ($A=PDP^{-1}$)
### Principal Component Analysis (PCA)
* Steps of doing PCA
1. Mean center the data, transform $X$ to $X_c$
2. Diagonalize covmat
	3. Find the eigenvalues and eigenvectors of covmat; sort the eigenvalues in decreasing order
	4. Project the mean-centered data onto the eigenvectors with the largest eigenvalues (the principal components) to reduce dimension; see the sketch below
* MSE of PCA
	* The mean squared error of a top-$r$ representation equals the sum of the eigenvalues of the omitted components
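A PCA sketch in numpy following the steps above (the data is randomly generated; the $N\times d$ row convention is assumed, so covmat $=\frac{X_c^TX_c}{N}$). It also checks that the MSE of a top-$r$ representation equals the sum of the omitted eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up N x d data (rows are entries, columns are features)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])

# 1. Mean center each feature
Xc = X - X.mean(axis=0)

# 2. Covariance matrix (d x d) and its eigendecomposition
covmat = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(covmat)          # ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Project onto the top r principal components
r = 2
projected = Xc @ eigvecs[:, :r]

# MSE of the r-dimensional representation = sum of the omitted eigenvalues
reconstructed = projected @ eigvecs[:, :r].T
mse = np.mean(np.sum((Xc - reconstructed) ** 2, axis=1))
print(mse, eigvals[r:].sum())                      # approximately equal
```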
### ML Overview
* General categories
* Classification
* k Nearest Neighbour (kNN)
* Decision Tree
		* Random Forest
* Naïve Bayesian Classifier
* Support Vector Machine (SVM)
* Regression
* Least Square Solution of Linear Regression
* Unsupervised Learning
		* Principal Component Analysis
* Hierarchical Clustering
* kMeans Clustering
### Least Square Solution
* For data $X$ and $y$, least square solution is $(X^TX)^{-1}X^Ty$
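A minimal sketch (numpy, made-up points) of the closed-form least squares solution, fitting an intercept and slope:

```python
import numpy as np

# Made-up data: fit y = b0 + b1 * x by least squares
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Closed form (X^T X)^{-1} X^T y, solved without forming an explicit inverse
beta = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq gives the same answer and is numerically safer
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta, beta_lstsq)
```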
### Sample Problem: kMeans Clustering
* Problem: Consider the dataset $\{(0, 0), (0, 2), (0, 10), (20, 0), (20, 2), (20, 10)\}$. Find the eventual cluster centers after convergence of the k-means clustering algorithm with initial cluster centers $c_1 = (0, 0)$ and $c_2 = (20, 0)$. Also find the eventual cluster centers after convergence if the initial cluster centers are $c_1 = (0, 0)$ and $c_2 = (0, 2)$.
* Solution:
	* First part
		* Iteration 0: $c_1=(0,0)$ with $\{(0, 0), (0, 2), (0, 10)\}$, $c_2=(20,0)$ with $\{(20, 0), (20, 2), (20, 10)\}$
		* Iteration 1: $c_1=(0,4)$ with $\{(0, 0), (0, 2), (0, 10)\}$, $c_2=(20,4)$ with $\{(20, 0), (20, 2), (20, 10)\}$
		* Clustering does not change, so the algorithm is complete: final centers $(0,4)$ and $(20,4)$
	* Second part
		* Iteration 0: $c_1=(0,0)$ with $\{(0, 0), (20, 0)\}$, $c_2=(0,2)$ with $\{(0, 2), (0, 10), (20, 2), (20, 10)\}$
		* Iteration 1: $c_1=(10,0)$ with $\{(0, 0), (0, 2), (20, 0), (20, 2)\}$, $c_2=(10,6)$ with $\{(0, 10), (20, 10)\}$
		* Iteration 2: $c_1=(10,1)$, $c_2=(10,10)$; the assignments no longer change, so the final centers are $(10,1)$ and $(10,10)$
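A plain numpy k-means sketch (the helper `kmeans` below is written just for this note, not from the course code) that reproduces both parts of the problem:

```python
import numpy as np

def kmeans(points, centers, iters=100):
    """Assign each point to its nearest center, move each center to the
    mean of its assigned points, and repeat until the centers stop moving."""
    points, centers = np.asarray(points, float), np.asarray(centers, float)
    for _ in range(iters):
        # Distance from every point to every center, shape (n_points, k)
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

data = [(0, 0), (0, 2), (0, 10), (20, 0), (20, 2), (20, 10)]
print(kmeans(data, [(0, 0), (20, 0)]))   # -> (0, 4) and (20, 4)
print(kmeans(data, [(0, 0), (0, 2)]))    # -> (10, 1) and (10, 10)
```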
### Markov Chain
### MSE of PCA