# [Prerequisites for CS467](https://probml.github.io/pml-book/book1.html)

## Linear Algebra

### Vector
- A vector is a list of numbers
- $x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^n$
- $x^\top = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}$

### Inner product
- given two vectors $x, y \in \mathbb{R}^n$
- notation: $\langle x,y\rangle$ or $x^\top y$
- $x^\top y = y^\top x = \sum_{i=1}^{n} x_i y_i$ (a scalar)

### Euclidean norm (2-norm)
- the length of a vector
- given a vector $x \in \mathbb{R}^n$
- $\|x\| = \sqrt{\sum_{i=1}^{n} x_i^2}$
- $\|x\|^2 = x^\top x$
- unit vector: $\frac{x}{\|x\|}$

### Orthogonal and orthonormal
- given three vectors $x, y, z \in \mathbb{R}^n$
- if $\langle x,y\rangle = 0$, then $x$ and $y$ are **orthogonal** (perpendicular) to each other
- the set of vectors $\{x,y,z\}$ is said to be **orthonormal** if $x, y, z$ are mutually orthogonal and $\|x\| = \|y\| = \|z\| = 1$

### Linear independence
- given a set of $m$ vectors $\{v_1, v_2, \ldots, v_m\}$
- the set is **linearly dependent** if some vector in the set can be represented as a linear combination of the remaining vectors, e.g.,
  - $v_m = \sum_{i=1}^{m-1} \alpha_i v_i,\ \alpha_i \in \mathbb{R}$
- otherwise, the set is **linearly independent**

### Span
- the **span** of $\{v_1, \ldots, v_m\}$ is the set of all vectors that can be expressed as a linear combination of $\{v_1, \ldots, v_m\}$
  - $\text{span}(\{v_1, \ldots, v_m\}) = \Big\{u : u = \sum_{i=1}^{m} \alpha_i v_i \Big\}$
- if $\{v_1, \ldots, v_m\}$ is linearly independent, where $v_i \in \mathbb{R}^m$, then any vector $u \in \mathbb{R}^m$ can be written as a linear combination of $v_1$ through $v_m$
  - in other words, $\text{span}(\{v_1, \ldots, v_m\}) = \mathbb{R}^m$
- e.g.,
  $v_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix},\ v_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix},\quad \text{span}\bigg(\Big\{ \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix} \Big\}\bigg) = \mathbb{R}^2$
- we call $\{v_1, \ldots, v_m\}$ a **basis** of $\mathbb{R}^m$

### Matrix
- A matrix $A \in \mathbb{R}^{m \times n}$ is a 2d array of numbers with $m$ rows and $n$ columns
- if $m = n$, we call $A$ a square matrix
- Matrix multiplication
  - given two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$
  - $C = AB \in \mathbb{R}^{m \times p}$, where $C_{ij} = \sum_{k=1}^n A_{ik} B_{kj}$
  - not commutative in general: $AB \ne BA$
- Matrix-vector multiplication
  - $Ax$ can be viewed as a linear combination of the columns of $A$, weighted by the entries of $x$

### Transpose $A^\top$
- The transpose of a matrix results from flipping the rows and columns
- $(A^\top)_{ij} = A_{ji}$
- Some properties:
  - $(AB)^\top = B^\top A^\top$
  - $(A+B)^\top = A^\top + B^\top$
- If $A^\top = A$, we call $A$ a **symmetric** matrix

### Trace $\text{tr}(A)$
- The trace of a square matrix is the sum of its diagonal entries
- $\text{tr}(A) = \sum_{i=1}^n A_{ii}$
- If $A, B$ are square matrices, then $\text{tr}(AB) = \text{tr}(BA)$
  - $\text{tr}(AB) = \sum_{i=1}^n \sum_{k=1}^n A_{ik} B_{ki} = \sum_{k=1}^n \sum_{i=1}^n B_{ki} A_{ik} = \text{tr}(BA)$

### Inverse $A^{-1}$
- The inverse of a square matrix $A$ is the unique matrix such that
  - $A^{-1} A = I = A A^{-1}$
- Let $A$ be an $n \times n$ matrix. $A$ is invertible iff
  - the column vectors of $A$ are linearly independent (they span $\mathbb{R}^n$)
  - $\det(A) \neq 0$
- Assume that both $A$ and $B$ are invertible.
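Several of the identities above (inner product, norm, transpose, trace) are easy to verify numerically; here is a quick sketch, assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
y = rng.standard_normal(4)

# inner product: x^T y = sum_i x_i y_i (a scalar)
assert np.isclose(x @ y, np.sum(x * y))

# Euclidean norm: ||x||^2 = x^T x
assert np.isclose(np.linalg.norm(x) ** 2, x @ x)

A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# matrix multiplication is not commutative in general
assert not np.allclose(A @ B, B @ A)

# transpose of a product: (AB)^T = B^T A^T
assert np.allclose((A @ B).T, B.T @ A.T)

# cyclic property of the trace: tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```

Random Gaussian matrices almost never commute, so the non-commutativity check passes for essentially any seed.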
  Some properties:
  - $(AB)^{-1} = B^{-1} A^{-1}$
  - $(A^{-1})^\top = (A^\top)^{-1}$

### Orthogonal matrix
- a square matrix is orthogonal if all its columns are **orthonormal**
- $U$ is an orthogonal matrix iff
  - $U^\top U = I = U U^\top$
  - $U^{-1} = U^\top$
- Note: if $U$ is not square but has orthonormal columns, we still have $U^\top U = I$ but not $U U^\top = I$

## Calculus

### Chain rule
- $\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}$ (single-variable)
- e.g., $y = (x^2 + 1)^3,\ \frac{dy}{dx} = ?$
  - let $u = x^2 + 1$, then $y = u^3$
  - $\frac{dy}{dx} = 3(x^2+1)^2 (2x)$

### Critical points
- $f' > 0 \Rightarrow f$ is increasing
- $f' < 0 \Rightarrow f$ is decreasing
- $f'' > 0 \Rightarrow f'$ is increasing $\Rightarrow f$ is concave up
- $f'' < 0 \Rightarrow f'$ is decreasing $\Rightarrow f$ is concave down
- We call $x = a$ a critical point of a continuous function $f(x)$ if either $f'(a) = 0$ or $f'(a)$ is undefined

### Taylor series
- $f(x) = \sum\limits_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x-a)^n$ (single-variable)
- second-order approximation:
  - $f(x) \approx f(a) + f'(a)(x-a) + \frac{1}{2} f''(a)(x-a)^{2}$ (single-variable)
    - for $x$ close enough to $a$
  - $f(w) \approx f(u) + \nabla f(u)^{\top}(w - u) + \frac{1}{2} (w - u)^{\top} H_f(u) (w - u)$ (multivariate)
    - $w, u \in \mathbb{R}^d$, for $w$ close enough to $u$
    - $\nabla_w f^{\top} = \big[\frac{\partial f}{\partial w_1}, \cdots, \frac{\partial f}{\partial w_d}\big]$
    - $H_f$ is the Hessian matrix, where $(H_f)_{ij} = \frac{\partial^{2} f}{\partial w_{i} \partial w_{j}}$

## Probability

### Conditional probability
- $P(A|B)$: the conditional probability of event $A$ happening given that event $B$ has occurred
- $P(A|B) = \frac{P(A, B)}{P(B)}$
- $P(A|B) = \frac{P(B|A) P(A)}{P(B)}$ (**Bayes' rule**)
- $\{A_i\}_{i=1}^n$ is a partition of the sample space.
  We have $P(B) = \sum_{i=1}^n P(B, A_i) = \sum_{i=1}^n P(B|A_i) P(A_i)$ (law of total probability)

### Independence and conditional independence
- random variable $X$ is independent of $Y$ iff
  - $P(X|Y) = P(X)$
  - $P(X,Y) = P(X) P(Y)$
- random variable $X$ is conditionally independent of $Y$ given $Z$ iff
  - $P(X|Y,Z) = P(X|Z)$
  - $P(X,Y|Z) = P(X|Z) P(Y|Z)$

### Expectation
- $\mathcal{X}$: the **sample space**, the set of all possible outcomes of an experiment
  - e.g., the face of a die: $\mathcal{X} = \{1,2,3,4,5,6\}$
- $p(x)$: probability mass function (discrete) or probability density function (continuous)
- $\mathbb{E}[X] = \sum_{x \in \mathcal{X}} x \, p(x)$ (discrete)
- $\mathbb{E}[X] = \int_{\mathcal{X}} x \, p(x) \, dx$ (continuous)
- more generally, $\mathbb{E}[g(X)] = \int_{\mathcal{X}} g(x) \, p(x) \, dx$
- Linearity of expectation
  - $\mathbb{E}[a X + b] = a \, \mathbb{E}[X] + b$
  - $\mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
- If $X$ and $Y$ are independent, $\mathbb{E}[XY] = \mathbb{E}[X] \, \mathbb{E}[Y]$
  - why? plug $P(X,Y) = P(X) P(Y)$ into the definition of expectation
  - the converse is *not* true

### Variance and covariance
- $\text{Var}[X] = \mathbb{E}\left[(X-\mu)^{2}\right]$, where $\mu = \mathbb{E}[X]$
  - $= \mathbb{E}\left[X^{2}\right] - \mu^{2}$
- $\text{Var}[a X + b] = a^2 \, \text{Var}[X]$
- $\text{Cov}[X,Y] = \mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right]$
- If $X$ and $Y$ are independent, then $\text{Cov}[X, Y] = 0$
  - the converse is *not* true

#### Covariance matrix
- Consider $d$ random variables $X_1, \ldots, X_d$
- $\text{Cov}[X_i, X_j] = \mathbb{E}\left[(X_i - \mu_{X_i})(X_j - \mu_{X_j})\right]$
- We can write a $d \times d$ matrix $\Sigma$, where $\Sigma_{ij} = \text{Cov}[X_i, X_j]$
- when $i = j$ (diagonal), $\mathbb{E}\left[(X_i - \mu_{X_i})(X_i - \mu_{X_i})\right] = \text{Var}[X_i]$
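The variance and covariance definitions above can be sanity-checked from samples; a small sketch, assuming NumPy is available (sample-based, so the independence check holds only approximately):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.standard_normal(n)  # Var[X] = 1 in theory
Y = rng.standard_normal(n)  # drawn independently of X

# Var[aX + b] = a^2 Var[X]
a, b = 3.0, 5.0
assert abs(np.var(a * X + b) - a**2 * np.var(X)) < 1e-6

# 2x2 covariance matrix Sigma of (X, Y)
Sigma = np.cov(np.stack([X, Y]))

# independent => Cov[X, Y] = 0 (approximately, from finite samples)
assert abs(Sigma[0, 1]) < 0.02

# diagonal entries of Sigma are the variances
assert abs(Sigma[0, 0] - np.var(X, ddof=1)) < 1e-8
```

Note that `np.cov` uses the unbiased estimator (`ddof=1`) by default, hence the matching `ddof=1` in the last check.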
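Bayes' rule and the law of total probability from the conditional-probability section can be checked with a tiny discrete example; the probabilities here are made-up numbers for illustration:

```python
# partition of the sample space: A1, A2 (hypothetical probabilities)
P_A = {"A1": 0.3, "A2": 0.7}

# conditional probabilities P(B | Ai) (also made up)
P_B_given_A = {"A1": 0.9, "A2": 0.2}

# law of total probability: P(B) = sum_i P(B | Ai) P(Ai)
P_B = sum(P_B_given_A[a] * P_A[a] for a in P_A)
assert abs(P_B - 0.41) < 1e-12  # 0.9*0.3 + 0.2*0.7 = 0.41

# Bayes' rule: P(A1 | B) = P(B | A1) P(A1) / P(B)
P_A1_given_B = P_B_given_A["A1"] * P_A["A1"] / P_B
print(P_A1_given_B)  # 0.27 / 0.41, about 0.66
```

Observing $B$ raises the probability of $A_1$ from $0.3$ to about $0.66$, since $B$ is much more likely under $A_1$.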
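The chain-rule example and the second-order Taylor approximation from the calculus section can also be checked numerically. A minimal sketch using $f(x) = (x^2+1)^3$, with the second derivative worked out by hand (an addition to the notes):

```python
def f(x):
    return (x**2 + 1) ** 3

def fprime(x):
    # chain-rule result from the notes: 3 (x^2 + 1)^2 * 2x
    return 3 * (x**2 + 1) ** 2 * (2 * x)

def fsecond(x):
    # differentiating fprime with the product + chain rules
    return 6 * (x**2 + 1) ** 2 + 24 * x**2 * (x**2 + 1)

a = 0.5

# central finite difference agrees with the chain-rule derivative
h = 1e-6
fd = (f(a + h) - f(a - h)) / (2 * h)
assert abs(fd - fprime(a)) < 1e-4

# second-order Taylor approximation around a, evaluated at a nearby point
d = 0.01
taylor2 = f(a) + fprime(a) * d + 0.5 * fsecond(a) * d**2
assert abs(taylor2 - f(a + d)) < 1e-4
```

The Taylor error shrinks like $d^3$ here, so halving $d$ makes the approximation roughly eight times better.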