# $\mathcal{Probability \ Measure \ and \ Conditional \ Expectation}$

Author: $\mathcal{CrazyDog}$, Ph.D. in Electrical Engineering, NCTU email: kccddb@gmail.com Date: 20241030 Copyright: CC BY-NC-SA

<style> .blue { color: blue; } .bg { color: blue; } .bgblue { color: blue; font-size: 24px; font-weight: bold; } .red { color: red; font-size: 24px; font-weight: bold; } h1 {text-align: center;} h2 {text-align: center;} h3 {text-align: center;} </style>

[家扶基金會](https://www.ccf.org.tw/support/donate/channel) [台灣之心愛護動物協會(Heart of Taiwan Animal Care, HOTAC)](https://www.hotac.org.tw/content-donate_main) [無國界醫生](https://www.msf.org.tw/about-us/who-we-are)

**Mathematics (calculus, linear algebra, probability, statistics), data structures, algorithms, DSP, microprocessors and OS are the internal kung fu; network programming and embedded systems are the combat craft.**

![1CCTfPz](https://hackmd.io/_uploads/B1zKJAFia.jpg)

<h1>Put learning into practice</h1>

<h3>Learning without thought is labor lost; thought without learning is perilous; learning followed by thought brings insight.</h3>

<h3>♥ With ten thousand true books in mind, you hold half the world of letters and arms ♥</h3>

---

<h3> Probability Space </h3>

**$\sigma$-field**

Definition 1.1. Let $F$ be a collection of subsets of $\Omega$. $F$ is called a "$\sigma$-field" or "$\sigma$-algebra" if the following hold: (i) $\Omega \in F$; (ii) $A \in F$ implies that $A^c \in F$; and (iii) if $A_1,A_2, \ldots \in F$ is a countable sequence of sets, then $\bigcup A_i \in F$.

**measure, measurable space**

Definition 1.2. A measure is a nonnegative **countably** additive set function on a measurable space $(\Omega,F)$, i.e., a function $\mu : F \to R$ such that (i) $\mu(A) \geq 0$ for all $A \in F$, and (ii) if $A_i \in F$ is a **countable** sequence of disjoint sets (i.e., $A_i \cap A_j = \emptyset$ for all $i \ne j$), then $\mu (\bigcup A_i) =\sum \mu(A_i)$.

The tuple $(\Omega,F)$, where $F$ is a "$\sigma$-field", is called a measurable space.

e.g.,
1. $(\Omega,F)$ with $F = \{\emptyset,\Omega\}$. $F$ is the smallest "$\sigma$-field" on $\Omega$, called the trivial "$\sigma$-field".
2. $F=2^{\Omega}$ is the largest "$\sigma$-field" on $\Omega$.

1.
Let $P$ be a measure on $(\Omega, F)$ with $0 \leq P(A) \leq P(\Omega)=1$ for all $A \in F$; then $(\Omega, F, P)$ is a **probability space** with **probability measure** $P$.

**Borel $\sigma$-algebra**

In mathematics, a Borel set is any set in a topological space that can be formed from open sets (or, equivalently, from closed sets) through the operations of countable union, countable intersection, and relative complement. Borel sets are named after Émile Borel. For a topological space X, **the collection of all Borel sets on X forms a σ-algebra, known as the Borel algebra or Borel σ-algebra. The Borel algebra on X is the smallest σ-algebra containing all open sets (or, equivalently, all closed sets).**

2. [Lebesgue measure on $(R^n,B)$, where $B$ is the Borel $\sigma$-algebra](https://en.wikipedia.org/wiki/Lebesgue_measure)

Lebesgue measure $\mu$ on $R$: $\mu (a,b)=b-a$, where $(a,b)$ is an open interval.

If $f$ is differentiable in $[a,b]$ and the Lebesgue integral $\int_a^x f'(t)dt$ exists, then $f(x)-f(a)=\int_a^x f'(t)dt$.

**Riemann integral** $\int_{[a,b]} f dt=$ **Lebesgue integral** $\int_{[a,b]} f dt$ if they exist.

**Counting measure and $l^p$ spaces.** Let $X$ be any countable set, $M = P(X)$ ($M$: $\sigma$-algebra, $P(X)$ = power set of $X$) and let $\mu$ be the counting measure. Recall that $\mu(A)$ is the number of points in $A$ if $A$ is finite and equals $\infty$ otherwise. Integration is simply $\int f d\mu = \sum_{x \in X}f(x)$ for any non-negative function $f$, and $L^p$ is denoted by $l^p$.

**Definition 1.3.** Given two measurable spaces $(\Omega,F)$ and $(S,B)$, a function $f : \Omega \to S$ is measurable if $f^{−1}(A) \in F$ for any $A \in B$.

**random variable (measurable function)**

**Definition 1.4.** A random variable is a **measurable function** from $(\Omega,F)$ to $(R, B)$, where $B$ is the Borel $\sigma$-field of $R$. A random vector is a measurable function from $(\Omega,F)$ to $(R^d, B^d)$, where $B^d$ is the Borel $\sigma$-field of $R^d$.
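To make Definitions 1.1–1.4 concrete, here is a minimal sketch in Python; the fair-die space, the uniform measure, and the parity map $X$ are illustrative choices of mine, not from the text:

```python
from fractions import Fraction

# A tiny concrete probability space: a fair die.
# Omega is finite, F = 2^Omega (the largest sigma-field), P is uniform.
omega = {1, 2, 3, 4, 5, 6}
P = {w: Fraction(1, 6) for w in omega}

def prob(event):
    """P(A) for A subset of Omega (countable additivity is a finite sum here)."""
    return sum(P[w] for w in event)

# With F = 2^Omega every function Omega -> R is measurable,
# since every preimage is automatically in F.
X = {w: w % 2 for w in omega}            # indicator of "odd outcome"

event = {w for w in omega if X[w] == 1}  # {X = 1} = X^{-1}({1})
print(prob(event))                       # 1/2
print(sum(X[w] * P[w] for w in omega))   # E[X] = 1/2
```

Finite examples like this hide the measurability issue entirely; it only bites once $\Omega$ is uncountable and $F$ must be smaller than $2^{\Omega}$.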
$X: \Omega \to R$ is a random variable, $P( a \le X \le b) \doteq P( \{\omega \in \Omega \mid a \le X(\omega) \le b\})$.

Distribution function of $X$: $F_X(x) \doteq P(X \leq x)$

**Expectation, conditional expectation**

Probability space $(\Omega, F, P)$

1. Indicator function $1_A$: $1_A(\omega)=1$ if $\omega \in A$, $1_A(\omega)=0$ if $\omega \in \Omega \setminus A$, where $A \in F$. $E[1_A] \doteq \int_{\Omega}1_A (\omega)d P(\omega) \doteq P(A)$.
2. Simple random variable. A r.v. $\phi$ is a simple r.v. if $\phi(\omega) = \sum_{i=1}^n a_i 1_{A_i}(\omega)$, where the $A_i$'s are disjoint measurable sets. $E[ \phi ] \doteq \sum_{i=1}^n a_i E [1_{A_i}]$. If $E[ \phi ] < \infty$, then $\phi$ is **integrable**.

**$E[X] \doteq \int X(\omega)dP$ if it exists.**

$(X,\Sigma ,\mu )$ is a measure space; a property $\textbf{P}$ is said to hold **almost everywhere (almost surely)** in $X$ if there exists a set $N \in \Sigma$ with $\mu (N)=0$ such that all $x \in X \setminus N$ have the property $\textbf{P}$. Define $0 \times \infty =0$.

**e.g.**, $Q$: rational numbers. Let $f(x)=1$ if $x \in Q$, $f(x)=0$ if $x \in R \setminus Q$. Then $f(x)=0$ **a.e.** in $R$, because $Q$ is countable and $\mu(Q)=0$, where $\mu$ is the Lebesgue measure. $\int f d\mu= 0$ (as a Lebesgue integral).

**Strong Law of Large Numbers**

The sample average converges almost surely to the expected value, i.e., $\bar{X}_n \to \mu$ a.s.
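A quick Monte Carlo illustration of the strong law; the Uniform(0,1) choice and the sample sizes are mine:

```python
import random

# SLLN illustration: sample averages of i.i.d. Uniform(0,1) draws
# should approach the expected value mu = 1/2 as n grows.
random.seed(0)

def sample_average(n):
    return sum(random.random() for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, sample_average(n))  # the averages drift toward 0.5
```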
![](https://hackmd.io/_uploads/S1zAVF853.jpg)

**Schwarz and Hölder Inequality**

Let $(S, \Sigma, \mu)$ be a measure space, $1 <p <\infty$ and $\frac{1}{p}+\frac{1}{q}=1$.

$\int |xy|d\mu \leq (\int |x|^p d\mu)^{\frac{1}{p}}(\int |y|^q d\mu)^{\frac{1}{q}}$

$||xy||_{L^1} \leq ||x||_{L^p}||y||_{L^q}$

$\mu=$ counting measure, $S=\{1,..,n \}$:

$\sum_{i=1}^n |x_i y_i| \leq (\sum_{i=1}^n |x_i|^p)^ {\frac{1}{p}}(\sum_{i=1}^n |y_i|^q)^ {\frac{1}{q}}$

**$p=q=2$: Schwarz inequality**

Hölder's inequality becomes an equality if and only if $|x |^p$ and $|y|^q$ are **linearly dependent** in $L^1(μ)$, meaning that there exist real numbers $α, β ≥ 0$, not both zero, such that $α|x|^p$ = $β |y|^q$ $μ$-almost everywhere.

**Minkowski Inequality**

$1 \leq p < \infty$

$(\int |x \pm y|^p d\mu)^{\frac{1}{p}} \leq (\int |x|^p d\mu)^{\frac{1}{p}}+(\int |y|^p d\mu)^{\frac{1}{p}}$

$||x \pm y||_p \leq ||x||_p+||y||_p$

![simple1](https://hackmd.io/_uploads/Hkehl56H6.jpg)

Remark. Random variable $X:\Omega \to R$

**Markov's inequality**

If $X$ is a nonnegative random variable and $a > 0$, $P(X \geq a) \leq \frac{E[X]}{a}$.

$\bf{Proof.}$ Define $1_{\{X \geq a\}}$; then $a 1_{\{X \geq a\}} \leq X$ a.s. Hence $E[a1_{\{X \geq a\}}]=aP(X \geq a) \leq E[X]$. $\bf{Q.E.D.}$

Extended version for nondecreasing functions: if $\phi$ is a nondecreasing nonnegative function, $X$ is a (not necessarily nonnegative) random variable, and $\phi(a) > 0$, then $P(X >a) \leq \frac{E[\phi(X)]}{\phi(a)}$.

Proof. $\{ \omega : \ X(\omega)>a \} \subseteq \{\omega : \ \phi(X(\omega)) \geq \phi(a) \}$, so apply Markov's inequality to $\phi(X)$.

**[Hoeffding's inequality](https://en.wikipedia.org/wiki/Hoeffding%27s_inequality)**

Hoeffding's inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount.

Let $X_1, ..., X_n$ be independent random variables such that $a_{i} \leq X_{i} \leq b_{i}$ almost surely.
Then Hoeffding's theorem states that, for all $t > 0$, with $S_n=\sum_{i=1}^nX_i$,

$P(S_n -E[S_n] \geq t) \leq exp(-\frac{2t^2}{\sum_i (b_i-a_i)^2})$

$P(|S_n -E[S_n]| \geq t) \leq 2 exp(-\frac{2t^2}{\sum_i (b_i-a_i)^2})$

Proof. (**HINT**) $P(S_n -E[S_n] \geq t) =P(exp(s(S_n -E[S_n])) \geq exp(st))$ for $s >0$, since $x \mapsto e^{sx}$ is strictly increasing for $s>0$. Hint: Markov's inequality (extended version).

![Hoeffding 1](https://hackmd.io/_uploads/BkAM1LnLT.jpg)

[Hoeffding’s inequality and convex ordering](https://eventuallyalmosteverywhere.wordpress.com/tag/increasing-convex-order/)

[Concentration inequality](https://en.wikipedia.org/wiki/Concentration_inequality)

---

**Fundamental concept**: $X$ is a r.v.; if there exist simple r.v.'s $\phi_i$ such that $\phi_i \to X$ and $E[\phi_i] \to \bar{X} <\infty$, then $E[X] \doteq \int X(\omega)dP \doteq \bar{X}$.

**conditional expectation**

$E[X|Y=1]$ is the conditional expectation of $X$ given the event $\{Y=1\}$ (a number; by contrast, $E[X|Y]$ is **a random variable**).

**Radon–Nikodym theorem**

Let $\mu$ be a $\sigma$-finite measure and $\lambda$ a **signed measure** on a $\sigma$-algebra $\Sigma$ of subsets of $\Omega$. Assume $\lambda$ is absolutely continuous with respect to $\mu$ ($\lambda \ll \mu$: $\mu(A)=0 \Rightarrow \lambda(A)=0$). Then there is a Borel measurable function $g: \Omega \to \bar{R}$ such that $\lambda(A)=\int_A d \lambda=\int_A g \, d\mu$ for all $A \in \Sigma$. If $h$ is another such function, then $g=h$ a.e. $[\mu]$.

$g$ is called the **Radon–Nikodym derivative or density** of $\lambda$ with respect to $\mu$, written $\frac{d\lambda}{d\mu}$.

:::info
**Fundamental theorem of calculus**

![FTC_geometric.svg](https://hackmd.io/_uploads/r18ZKwEoa.png)

![Fundamental_theorem_of_calculus_(animation_)](https://hackmd.io/_uploads/rJus1DNsT.gif)

$F(x) \doteq \int_a^x f(t)dt, \ \frac{dF(t)}{dt}=f(t), t \in (a,x)$

$dF \to d\lambda$
$dt \to d\mu$
$\frac{dF}{dt} \to \frac{d\lambda}{d\mu}$
:::

$X$ is an integrable random variable. Assume $G$ is a $\sigma$-field, $G \subset F$.
The **conditional expectation** of $X$ given $G$, $E[X | G]$ (a **$G$-measurable random variable**), is given by $\int_B X dP=\int_B E[X |G] dP$ (or $E[X1_B] = E[E[X|G]1_B]$) for $B \in G$ (**by the Radon–Nikodym theorem**).

**The $\sigma$-field generated by a random variable $Y$ is denoted by $\sigma_Y$.** Then $E[X|Y]=E[X|\sigma_Y]$.

**conditional density**

Let $X, Y$ be r.v.'s with joint density $f_{XY}(x,y)$.

$P(Y \in C| x-h < X <x+h)=\frac{P(x-h<X<x+h , Y \in C)}{P(x-h<X<x+h)} \sim \frac{2h\int_Cf_{XY}(x,y)dy} { \int_{x-h}^{x+h}f_X(u)du}$ $\simeq \frac{\int_Cf_{XY}(x,y)dy}{f_X(x)}$.

Define $h(y|x)=f_{XY}(x,y)/f_X(x)$, the conditional density of $Y$ given $X=x$.

Let $B(x)$ be a measurable set depending on $x$; given $X=x$, the event occurs iff $Y \in B(x)$. $P(Y \in B(x)|X=x)=\int 1_{B}(x,y)h(y|x)dy=\int_{B(x)}h(y|x)dy$, by Fubini's theorem.

**Properties.**
1. If $G$ is the trivial field $\{\emptyset,\Omega\}$, then $E[X|G] = E[X]$ (a constant random variable).
2. $E[X|X]=X$
3. If $X$ is $G$-measurable, then $E[XY|G] = XE[Y|G]$.
4. If $G_1 \subset G$, then $E( E[X|G]|G_1)=E[X|G_1]$.
5. If $\sigma(X)$ (the $\sigma$-field generated by $X$) and $G$ are **independent**, then $E[X|G]=E[X]$ (a constant random variable).
6. Monotone convergence. If $0 \leq X_n$ and $X_n \uparrow X$ with $E[|X|] < \infty$, then $E[X_n|G] \uparrow E[X|G]$.
7. Dominated convergence. If $\lim_{n \to \infty}X_n =X$ a.s. and $|X_n| \leq Y$ with $E[Y] < \infty$, then $\lim_{n \to \infty}E[X_n|G] =E[X|G]$.
8. Fatou's lemma. If $0 \leq X_n$, then $E[\liminf X_n|G] \leq \liminf E[X_n|G]$.

![1280px-ConvexFunction.svg](https://hackmd.io/_uploads/Bk22lIXia.png)

10. $|E[X|G]| \leq E[|X|\,|G]$.
11. Jensen's inequality. If $g(x)$ is a convex function on an interval $I$, i.e., for all $x$, $y \in I$, $\lambda \in (0,1)$, $g(\lambda x+(1-\lambda)y) \leq \lambda g(x)+(1-\lambda)g(y)$, and $X$ is a r.v. with range $I$, then $g(E[X|G]) \leq E[g(X)|G]$.
12. Let $X$ and $Y$ be two independent r.v.s and $\phi(x,y)$ be such that $E[|\phi(X,Y)|] < \infty$.
Then $E[(\phi(X,Y))|Y]=G(Y)$, where $G(y)=E[\phi(X,y)]$.
13. $E[\,\cdot\,|G]$ is a **projection operator** on $L^1(\Omega,F,P)$, since $E[E[\,\cdot\,|G]|G]=E[\,\cdot\,|G]$.

$L^p$ space: $(X,\Sigma ,\mu )$ is a measure space, and $1 \leq p <\infty$. Let $||f||_p \doteq (\int |f|^p d\mu)^{\frac{1}{p}} < \infty$. Notice that $||f||_p=0 \Rightarrow f=0$ **a.e.**

**Remark.** $f_n \to f$ in $L^2 \nrightarrow f_n \to f$ a.e., but there exists a subsequence $f_{n_j} \to f$ a.e.

$f_n \to f$ a.e. $\nrightarrow f_n \to f$ in $L^2$, e.g., $f_n=\sqrt{n}x^n$, $0\leq x<1$.

**There are still many mathematical issues to deal with, e.g. seminorms; in engineering we therefore work with good functions (finite energy, measurable functions, complete inner product spaces, ...) and $p=2$ — the most important spaces being $L^2$ and $l^2$.** Interested readers should study Functional Analysis.

![](https://hackmd.io/_uploads/Hyky76_o3.png)

**Gibbs phenomenon**

**Fourier Series**

Let $\phi_n(x) =e^{2\pi i nx}$, $\hat{f}(n)=\int_0^1 f(x) \overline{\phi_n(x)}dx$ and $S_N(x)=\sum_{n=-N}^{N}\hat{f}(n)\phi_n(x)$. Then
1. $||\phi_n||^2=\int_0^1 |\phi_n(x)|^2dx=1$
2. If $n \neq k$, $<\phi_n,\phi_k>=0$
3. If $f(x)=\sum_{n=-N}^N c_n\phi_n(x)$, then $\int_0^1 |f|^2=\sum_{n=-N}^N |c_n|^2$
4. $\sum_{n=-N}^N |\hat{f}(n)|^2= \int_0^1 |S_N(x)|^2 dx$

**Bessel’s Inequality**

Let $f$ be a periodic, square-integrable function on $[0,1]$, and write $\hat{f}$ for its Fourier coefficients. Then, for every $N$, we have $\sum_{n=-N}^N |\hat{f}(n)|^2 \leq \int_0^1 |f|^2$.

**Hence we have $f(x)=\sum c_n\phi_n(x)$ in the $L^2$ sense ($\lim_{N \to \infty} ||f-S_N||_{L^2}=0$), if $f \in L^2[0,1]$.**

**Lebesgue–Stieltjes Integral**

$X$ is a r.v. with distribution function $F_X(x)$ (a right-continuous function), where $F(b)-F(a)=P(X(\omega) \in (a,b])$.

Then $E[h(X)]=\int_{\Omega} h(X(\omega))dP(\omega)=\int h(x)dP_X(x)=\int h(x)d F_X(x)$.

**Remark.** $X$ is a discrete r.v.: $F_X(x_{n+1})-F_X(x_n)=P(X=x_{n+1})$, and $E[X]=\sum x_n P(X=x_n)$.

**Sums of independent random variables**

$X,Y$ discrete r.v.'s, $Z=X+Y$: $P(Z=z)=\sum_y p_X(z-y)p_Y(y)$, a **convolution**.

Continuous r.v.'s:
$Z=X+Y$, $f_Z(z)=\int f_X(z-y)f_Y(y)dy$

**Proof.** Let $1_{A}$, $A=\{ \omega \in \Omega \mid Z=X(\omega)+Y(\omega) \leq z\}$.

$P(Z\leq z)=E[1_A]=E[E[1_A|Y]]=E[P(X+Y \leq z\,|Y)]$

$E[P(X+Y \leq z\,|Y)]=\int F_X(z-y)dF_Y$, since $X,Y$ are independent.

**[Moment generating function (mgf)](https://en.wikipedia.org/wiki/Moment-generating_function) of $X$**

$m(t)=E[e^{tX}]=E[\sum_{n=0}^{\infty}\frac{ t^nX^n}{n!}]=\sum_{n=0}^{\infty}\frac{ t^nE[X^n]}{n!}$

$E[X^n]=\frac{d^nm(t)}{dt^n}|_{t=0}$

For a **random vector** $\vec{X}$: $m_{\vec{X}}(\vec{t}) \doteq E[e^{<\vec{t},\vec{X}>}]$

**Characteristic function of $X$**

$\phi(t)=E[e^{itX}]=\int e^{itx}dF_X(x)$

If $f_X(x)=F_X^{'}(x)$, then $\phi(t)=\int e^{itx}f_X(x)dx$

**Proposition.** There is a **bijection** between **probability distributions** and **characteristic functions**. That is, for any two random variables $X_1$, $X_2$, both **have the same probability distribution** if and only if $\phi_{X_1}=\phi_{X_2}$.

If $X,Y$ are independent, then $\phi_{X+Y}=\phi_X \phi_Y$.

For a **random vector** $\vec{X}$: $\phi_{\vec {X}}(\vec{t}) \doteq E[e^{i<\vec{t},\vec{X}>}]$

**Proposition.** Suppose that $X, Y$ are two random vectors whose **mgf**s $\phi_X(t), \phi_Y(t)$ exist for all $t \in B_{\varepsilon}(0)$ for some $\varepsilon >0$. If $\phi_X(t) =\phi_Y(t)$ for all $t \in B_{\varepsilon}(0)$, then the cdfs of $X$ and $Y$ are identical. ($B_{\varepsilon}(0)=(-\varepsilon, \varepsilon)$ in $R^1$.)

**Central limit theorem**

$X_1, X_2, ...$ i.i.d., $E[X_i]=\mu$, $VAR(X_i)=\sigma^2$, $\bar{X}_n=\frac{X_1+X_2+...+X_n}{n}$

$\sqrt{n}(\bar{X}_n-\mu) \to N(0,\sigma^2)$ (**convergence in distribution**)

**Proof.** Let $M_n(t)=E[e^{t\sqrt{n}(\bar{X}_n-\mu)}]$. **Using Taylor’s formula with a remainder**, find $M_n(t)$, then take the limit $n \to \infty$: $M_n(t) \to e^{\frac{1}{2}\sigma^2 t^2}$.

Ref. Introduction To Stochastic Calculus With Applications, by F. C. Klebaner

$L^2(P)$ processes (second-order processes): processes for which $E[|X_t|^2] < \infty$.
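The CLT above can be checked numerically. This sketch (my own choice of Uniform(0,1) summands and sample sizes) compares $P(\sqrt{n}(\bar X_n - \mu)/\sigma \le 1)$ against $\Phi(1)$:

```python
import math
import random

# Monte Carlo check of the CLT: for i.i.d. Uniform(0,1) summands
# (mu = 1/2, sigma^2 = 1/12), sqrt(n)*(Xbar_n - mu)/sigma is roughly N(0,1),
# so P(... <= 1) should be close to Phi(1) ~ 0.8413.
random.seed(1)
n, trials = 500, 5_000
sigma = math.sqrt(1 / 12)

def standardized_mean():
    xbar = sum(random.random() for _ in range(n)) / n
    return math.sqrt(n) * (xbar - 0.5) / sigma

hits = sum(standardized_mean() <= 1.0 for _ in range(trials))
phi_1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))  # standard normal CDF at 1
print(hits / trials, phi_1)
```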
---

<h3>Markov Chain</h3>

![](https://hackmd.io/_uploads/BykPy9Eo3.jpg)

**Discrete-time Markov chain**

A discrete-time Markov chain is a sequence of random variables $X_1, X_2, X_3, ...$ with the Markov property $P(X_{n+1}=x_{n+1}|X_{n}=x_n,X_{n-1},....X_1)=P(X_{n+1}=x_{n+1}|X_n=x_n)$

Time-homogeneous: $P(X_{n+1}=x_{n+1}|X_{n}=x_n)=P(X_n=x_{n+1}|X_{n-1}=x_{n})$

A discrete-time Markov chain with memory (or a Markov chain of order $m$), where $m$ is finite, is a process satisfying $P(X_{n+1}=x_{n+1}|X_{n}=x_n,X_{n-1},...,X_{n-m+1},....X_1)=P(X_{n+1}=x_{n+1}|X_n=x_n,...,X_{n-m+1}=x_{n-m+1})$

Assume $X_n=j \in \{0,1,2,...\}$; let $p_{ij}=P(X_n=j|X_{n-1}=i)$.

$P=[p_{ij}]$: **transition matrix**

$P(X_{n+1}=k|X_{n-1}=i)=\sum_j P(X_{n+1}=k,X_n=j|X_{n-1}=i)=\sum_j P(X_{n+1}=k|X_n=j,X_{n-1}=i)P(X_n=j|X_{n-1}=i)$

Hence $P(X_{n+1}=k|X_{n-1}=i)=\sum_j p_{ij}p_{jk}$.

Assume $\pi^0=[\pi_0,\pi_1,...]$ is the initial distribution. $P(X_{2}=k|X_{0}=i)=\sum_j p_{ij}p_{jk}$, so $P(X_2=k)=\sum_i\pi_i\sum_j p_{ij}p_{jk}$.

Let $\pi^n(j) =P(X_n=j)$; then $\pi^n=\pi^0 P^n$. Assume $\lim_{n \to \infty} \pi^n=\pi$ exists; then we have $\pi P=\pi$.

Matrix form: $P^2=P \times P$

Stationary probability: $\pi=\pi P$ and $\sum \pi_i=1$.

**Continuous-time Markov chain**

Knowing $X(t)=i$, $X(t+h)=j$ is independent of previous values $(X_{s}:s<t)$, and as $h \to 0$, for all $j$ and all $t$, $P(X(t+h)=j|X(t)=i)=\delta_{ij}+q_{ij}h+o(h)$, where (little-o) $\lim_{h \to 0} \frac{o(h)}{h}=0$, $\delta_{ii}=1$, and $\delta_{ij}=0$ if $i \ne j$. $q_{ij}$ is the **transition rate**.

![](https://hackmd.io/_uploads/SJxhZgOji2.png)

Time-homogeneous: $p_{ij}(t)=P(X(t+s)=j|X(s)=i).$ Then
1. $\frac{p_{ij}(\delta t)}{\delta t}=\frac{q_{ij}\delta t+o(\delta t)}{\delta t}=q_{ij}+o(\delta t)/ \delta t$, if $i \neq j$
2. $\frac{p_{ii}(\delta t)-1}{\delta t}=q_{ii}+o(\delta t)/ \delta t$, if $i=j$.
3. $\sum_{j}p_{ij}(t)=1$, $0 \leq p_{ij}(t) \leq 1$
4. $q_{ii} <0$
5. **Chapman–Kolmogorov equation**
6.
$P_i(X(t+s)=k)=\sum_j P_i(X(t)=j,X(t+s)=k)$; hence $p_{ik}(t+s)=\sum_j p_{ij}(t)p_{jk}(s)$. We have $\bf{P(t+s)=P(t)P(s)}$
7. $\lim_{t \downarrow 0}P(t)=I$
8. $Q \doteq [q_{ij}]$, $q_{ij}$ infinitesimal transition rate, $Q$ infinitesimal generator
9. $\sum_j q_{ij}=0$
10. $P(t)=exp(tQ)$, $t \geq 0$
11. Transition rate $q_{ij}$, $i \neq j$

From 1. and 2., the holding time in state $i$ is exponentially distributed with rate $-q_{ii}$, and, with $P=[P_{ij}(t)]$ and $P_{ij}(t)=p_{ij}(t)$:

$\frac{P(t+h)-P(t)}{h}=P(t)\frac{(P(h)-I)}{h}$

$P^{'}=\lim_{h \downarrow 0}P(t)\frac{(P(h)-I)}{h}=P(t)Q$

$\frac{P(h+s)-P(s)}{h}=\frac{(P(h)-I)}{h}P(s)$

Then $P'=QP=PQ$.

For the steady state, $P'=0$, so $QP=PQ=0$. The steady-state probability $\pi$ satisfies $\pi Q=0$ and $\sum \pi_i=1$.

If the limit exists, then $P^{n} \to \hat{P}$ as $n\to \infty$.

Kolmogorov **backward** equation: $P'=QP$, $P(0)=I$.

**Poisson process** with parameter $\lambda$, with state space $\{0,1,2...\}$:
1. $q_{ij}=\lambda$, for $j=i+1$
2. $q_{ii}=-\lambda$, $i=0,1,2...$
3. $q_{ij}=0$, otherwise.

**Global balance equations**

$\frac{p_{ij}(\delta t)}{\delta t}=\frac{q_{ij}\delta t+o(\delta t)}{\delta t}=q_{ij}+o(\delta t)/ \delta t$, if $i \neq j$

![](https://hackmd.io/_uploads/H19S3Lijn.jpg)

$0=\sum_j \pi_j q_{ji}$, or equivalently $\pi_i \sum_{j \in S-\{i\}}q_{ij}=\sum_{j \in S-\{i\}} \pi_j q_{ji}$. The left-hand side represents the $\bf{total \ flow \ out \ of \ state \ i \ into \ states \ other \ than}$ $i$, while the right-hand side represents the $\bf{total \ flow \ out \ of \ all \ states}$ $j \neq i$ into state $i$.
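The fixed point $\pi = \pi P$ of the discrete-time discussion above can be found by simply iterating $\pi^n = \pi^0 P^n$; the 2-state transition matrix below is a made-up example:

```python
# Stationary distribution of a small discrete-time Markov chain by
# power iteration: pi^n = pi^0 P^n converges to the pi solving pi = pi P.
P = [[0.9, 0.1],
     [0.4, 0.6]]

def step(pi, P):
    """One step of pi <- pi P (row vector times matrix)."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

pi = [1.0, 0.0]            # initial distribution pi^0
for _ in range(200):       # pi^n = pi^0 P^n
    pi = step(pi, P)

print(pi)  # converges to the stationary distribution [0.8, 0.2]
```

Balance check by hand: $0.1\,\pi_0 = 0.4\,\pi_1$ with $\pi_0+\pi_1=1$ gives $\pi=(0.8,\,0.2)$.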
**stationary probability (steady-state probability)** $\pi Q=0$, and $\sum \pi_i=1$

<h1>Hidden Markov Models (HMM)</h1>

![HiddenMarkovModel.svg](https://hackmd.io/_uploads/SyGOE90Ip.png)

:::info
**<h3>Assumptions</h3>**

$Q=q_1,q_2,...,q_N$ a set of $N$ states (Markov chain)

$A=[a_{ij}]$ transition matrix (Markov chain)

$O=o_1, o_2, ...,o_T$ a sequence of $T$ observations

$B=b_i(o_t)$ a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation $o_t$ being generated from a state $i$

$\vec{\pi}=[\pi_1,...,\pi_N]$ an initial probability distribution (Markov chain)

**Output independence**: $p(o_t|q_1,...,q_T,o_1,...,o_T)=p(o_t|q_t)$

$\lambda =(\vec{\pi},A,B)$ the model
:::

1. **(likelihood)** Determine the likelihood $P(O|\lambda)$.
2. Given an observation sequence $O$ and an HMM $\lambda$, discover the **best** **hidden** state sequence $Q$. **best???** ML, maximum mutual information (MMI), $arg \ max$, $LLM$, $MAP$...
3. **(learning)** Given an observation sequence $O$ and the set of states in the HMM, learn the HMM parameters $A$ and $B$.

$P(O,Q)=P(O|Q)P(Q)=\prod_{t=1}^Tp(o_t|q_t) \prod_{t=1}^Tp(q_t|q_{t-1})$

$P(O)=\sum_{Q}P(O,Q)$

$a_t(j)=P(o_1,...,o_t,q_t=j|\lambda)$

$a_{t-1}(i)$: the forward path probability from the previous time step

$b_j(o_t)$: the state observation likelihood of the observation symbol $o_t$ given the current state $j$

**1. Initialization**: $a_1(j)=\pi_j b_j(o_1)$, $1 \leq j \leq N$.

**2. Recursion**: the **forward** probability at time $t$ is $a_t(j)=\sum_{i=1}^Na_{t-1}(i)a_{ij}b_j(o_t)$.

**3. Termination**: $P(O|\lambda)=\sum_{j=1}^N a_T(j)$.

The **backward** probability $\beta$ is the probability of seeing the observations from time $t +1$ to the end, given that we are in state $i$ at time $t$ (and given $\lambda$): $\beta_t(i)=P(o_{t+1},...,o_{T}|q_t=i,\lambda)$

1. **Initialization**: $\beta_T(i)=1$, $i=1,...,N$
2.
**Recursion**: $\beta_t(i)=\sum_j a_{ij}b_j(o_{t+1})\beta_{t+1}(j)$, $1 \leq i \leq N$, $1 \leq t <T$
3. **Termination**: $P(O|\lambda)=\sum_j \pi_j b_j(o_1)\beta_1(j)$

<span class="blue">

$$
\begin{cases}
q_i &\longmapsto & b_j(o_{t+1}) \longmapsto & & q_j & \\
\leftarrow a_t(i) & t & & \beta_{t+1} \to
\end{cases}
$$

</span>

Let $\eta_t(i,j) =P(q_t=i,q_{t+1}=j|O, \lambda)=\frac{ a_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j)}{P(O|\lambda)}$. Hence $\eta_t(i,j)=\frac{ a_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j)}{\sum_i \sum_j a_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j)}$

$\gamma_t(i)=\sum_j \eta_t(i,j)$

Let $\pi_i$ be the expected number of times in state $i$ at time $t=1$.
Let $M_{ij}$ be the expected number of transitions from state $i$ to state $j$.
Let $K_i=\sum_j M_{ij}$ be the expected number of transitions from state $i$.
Let $S_j$ be the expected number of times in state $j$.
Let $B_j(k)$ be the expected number of times in state $j$ while observing $o_k$.

:::info
EM (expectation–maximization) algorithm **(No guarantee exists that the sequence converges to a maximum likelihood estimator.)**

**initialize $\pi$, A and B** (**a big problem**)

iterate until convergence:

**E-step (expectation step, update variables)**
$\gamma_t(i)$, $\eta_t(i,j)$, $\forall i,j,t$

**M-step (maximization step, update hypothesis)**
$\hat{a}_{ij}=\frac{M_{ij}}{K_i}=\frac{\sum_{t=1}^{T-1} \eta_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
$\hat{b}_j(k)=\frac{\sum_{t:\,o_t=v_k} \gamma_t(j)}{\sum_{t=1}^T \gamma_t(j)}$

**return A, B**

---

[Markov Model](https://web.ntnu.edu.tw/~algo/HiddenMarkovModel.html)
[Hidden Markov Model (part 2)](https://medium.com/ai-academy-taiwan/hidden-markov-model-part-2-ac46dcdd42d1)
[Hidden Markov Model (part 3)](https://medium.com/ai-academy-taiwan/hidden-markov-model-part-2-ac46dcdd42d1)
[EM algorithm, convergence???](https://www.statlect.com/fundamentals-of-statistics/EM-algorithm)
[如何辨別機器學習模型的好壞?秒懂Confusion
Matrix](https://ycc.idv.tw/confusion-matrix.html?fbclid=IwAR38n8o0iL9wREJkZHwmu2m1G1Se3z6TafiAsW35UXgYEdh7oFE2s8tc22g)
[Hidden Markov Models](https://web.stanford.edu/~jurafsky/slp3/A.pdf)
[A tutorial on hidden Markov models and selected applications in speech recognition, by L.R. Rabiner](https://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf)
:::

---

**memoryless property**

$X \geq 0$, $P(X>t+s|X>s)=P(X>t)$

Let $f(x)=P(X>x)$; then $P(X>t)=P(X>t+s|X>s)=P(X>t+s,X>s)/P(X>s)$. Hence $f(t+s)=f(t)f(s)$.

$f(x)$ monotonically decreasing with $f(0)=1$ and $f(t+s)=f(t)f(s)$ $\Rightarrow$ exponential distribution with parameter $\lambda >0$: $P(X>s)=e^{-\lambda s}$

Proof.
1. Check rational numbers $n/m$: $f(n/m)=f(1)^{n/m}$.
2. For general $x$, take rationals $0<r_n<x<q_n$ with $r_n \to x$ and $q_n \to x$.
3. $f(x)=f(1)^x=e^{\ln(f(1))\,x}=e^{-\lambda x}$.

---

![](https://hackmd.io/_uploads/Hy6a7Cki2.jpg)

**Little's Theorem**

$N(t)$: number of customers in the system at time $t$
$\alpha(t)$: number of customers who arrived in $[0,t]$
$T_i$: time spent in the system by the $i$th arriving customer
$\beta(t)$: number of departures up to time $t$

$\frac{1}{t} \int_0^tN(s)ds \leq \frac{1}{t}\sum_{i=1}^{\alpha(t)}T_i=\frac{\alpha(t)}{t} \frac{\sum_{i=1}^{\alpha(t)}T_i}{\alpha(t)}$

$\frac{1}{t} \int_0^tN(s)ds \geq \frac{1}{t}\sum_{i=1}^{\beta(t)}T_i=\frac{\beta(t)}{t} \frac{\sum_{i=1}^{\beta(t)}T_i}{\beta(t)}$

Let $t \to \infty$, $\lim \frac{\alpha(t)}{t} =\lim \frac{\beta(t)}{t}=\lambda$:

$N=\lambda T$

[Ref. Data Networks, by Dimitri P. Bertsekas and Robert G. Gallager](https://web.mit.edu/dimitrib/www/datanets.html)

**PASTA property** (**Poisson Arrivals See Time Averages**)

The probability of the state as seen by an outside random observer (**Poisson arrival**) is the same as the probability of the state seen by an arriving customer.

Proof.
(**The proof is too hard for EE — just know the result; it rests mainly on the memoryless property.**)

**The statistics seen by arrivals and the time-averaged statistics are not necessarily the same.** Suppose the "arrivals" are police officers who always patrol at fixed times, and a thief knows the patrol schedule; then the police never see the thief. **The police conclude that public order is excellent and there are no thieves; the thief says the police are fools.**

**Lemma.** Let $X_1, …, X_n$ be independent exponentially distributed random variables with rate parameters $\lambda_1, …, \lambda_n$. Then $X=min(X_1,X_2,...X_n)$ is also exponentially distributed, with parameter $\lambda=\sum_i \lambda_i$.

Proof. $P(X>x)=P(min (X_1,X_2,...X_n)>x)=P(X_1>x,X_2>x,...,X_n>x)$. Hence, $P(X>x)=\prod_i P(X_i>x)=\prod_i e^{-x\lambda_i} =e^{-x \sum_i \lambda_i}$.

**Theorem.** **Merging independent Poisson processes $\Rightarrow$ Poisson process**

**Splitting a Poisson process**

Let $N(t)$ be a Poisson process with rate $\lambda$. Here, we divide $N(t)$ into two processes $N_1(t)$ and $N_2(t)$ in the following way. For each arrival, a coin with $P(H)=p$ is tossed. If the coin lands heads up, the arrival is sent to the first process ($N_1(t)$); otherwise it is sent to the second process. The coin tosses are independent of each other and are independent of $N(t)$. Then $N_1(t)$ is a Poisson process with rate $λp$; $N_2(t)$ is a Poisson process with rate $λ(1−p)$; and $N_1(t)$ and $N_2(t)$ are independent.

Proof. $N(t)$ = number of arrivals.

$P(N_1(t)=m,N_2(t)=k|N(t)=m+k)=\frac{(m+k)!}{m!k!}p^m(1-p)^k$

$P(N_1(t)=m,N_2(t)=k)=\frac{(p\lambda t)^m e^{-p\lambda t}}{m!}\frac{((1-p)\lambda t)^k e^{-(1-p)\lambda t}}{k!}$

---

**M/M/1 Queue**

![](https://hackmd.io/_uploads/B1Zvu5yn3.png)

1. Arrivals occur at rate $\lambda$ according to a **Poisson process** and move the process from state $i$ to $i + 1$.
2. Service times have an exponential distribution with rate parameter $\mu$ in the M/M/1 queue, where $\frac{1}{\mu}$ is the mean service time.
3. All arrival times and service times are (usually) assumed to be independent of one another.
4. A single server serves customers one at a time from the front of the queue, according to a first-come, first-served discipline.
When the service is complete, the customer leaves the queue and **the number of customers** in the system is reduced by one.
5. The buffer is of infinite size, so there is no limit on the number of customers it can contain.

The model can be described as a continuous-time Markov chain with transition rate matrix

![](https://hackmd.io/_uploads/BJhmv5132.png)

![](https://hackmd.io/_uploads/rJlDw9J23.png)

$\rho = \frac{\lambda}{\mu}$

$\pi_i=(1-\rho)\rho^i$

$P(S>n)=\rho^{n+1}$ — **not linear**: $\log P(S>n)=(n+1)\log \rho$

$N=\sum \ i\pi_i =\frac{\rho}{1-\rho}$

$N=\lambda W$, by Little's formula

![](https://hackmd.io/_uploads/Hk22K9k33.png)

![](https://hackmd.io/_uploads/SkfBus1nh.jpg)

![](https://hackmd.io/_uploads/HJb5uiy3h.jpg)

![](https://hackmd.io/_uploads/Bkv8q3J22.png)

M/G/1 system, $X$: the service time

**Pollaczek–Khinchin (P-K) formula**: $W=\frac{\lambda \bar{X^2}}{2(1-\rho)}$

Proof.
$W_i$ = waiting time in queue of the $i$th customer
$R_i$ = **residual service time** seen by the $i$th customer. (By this we mean that if customer $j$ is already being served when $i$ arrives, $R_i$ is the remaining time until customer $j$'s service is complete. If no customer is in service, i.e., the system is empty when $i$ arrives, then $R_i$ is zero.)
$X_i$ = service time of the $i$th customer
$N_i$ = number of customers found waiting in queue by the $i$th customer upon arrival

Then $W_i=R_i+\sum_{j=i-N_i}^{i-1}X_j$, with $N_i$ and $X_{i-1}, ... ,X_{i- N_i}$ independent:

$E[W_i]=E[R_i]+E[\sum_j E[X_j|N_i]]=E[R_i]+\bar{X}E[N_i]$

$W=\lim_{i \to \infty} E[W_i]=\lim_{i \to \infty} ( E[R_i]+\bar X E[N_i])=R+\bar{X}N_Q$

$N_Q=\lambda W$ and $\rho=\lambda \bar{X}$. Hence $W=\frac{R}{1-\rho}$.

Find $\lim_{i \to \infty}E[R_i]=\frac{1}{2 }\lambda\bar{X^2}$, where $M(t)$ is the number of service completions within $[0, t]$ and $\lim \frac{M(t)}{t}=\lambda$ for a **stable** queueing system.
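A numeric sanity check of the formulas above. For exponential service, $\bar{X^2}=2/\mu^2$, so the P-K formula should reduce to the M/M/1 waiting time $\rho/(\mu-\lambda)$; the rates here are made-up:

```python
# M/M/1 sanity check: stationary distribution pi_i = (1 - rho) * rho**i,
# mean number in system N = rho / (1 - rho), and Little's formula N = lambda * W.
lam, mu = 2.0, 5.0                      # hypothetical arrival and service rates
rho = lam / mu                          # utilization, must be < 1 for stability

pi = [(1 - rho) * rho**i for i in range(200)]   # geometric, tail truncated
N = sum(i * p for i, p in enumerate(pi))        # mean number in system

print(N, rho / (1 - rho))               # agree: ~0.6667
print(N / lam)                          # W, by Little's formula

# P-K formula with exponential service: E[X^2] = 2 / mu^2, and
# W = lam * E[X^2] / (2 * (1 - rho)) reduces to rho / (mu - lam).
Wq_pk = lam * (2 / mu**2) / (2 * (1 - rho))
print(Wq_pk, rho / (mu - lam))          # agree: ~0.1333
```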
![](https://hackmd.io/_uploads/BksdkRy3h.jpg)

<h1>Stochastic Dynamic Programming</h1>

A discrete-time Markov chain with **memory (or order)** $m$ is a process satisfying $P(X_{n+1}=j|X_n,X_{n-1},...,X_0)=P(X_{n+1}=j|X_n,...,X_{n-m+1})$.

[RL: Reinforcement Learning](https://www.terasoft.com.tw/support/tech_articles/reinforcement_learning_a_brief_guide.asp)

NLP: Neuro-Linguistic Programming
NLP: Natural Language Processing
RNN: Recurrent neural network — handwriting recognition was among the earliest successful applications of RNNs (2009)

![balance1](https://hackmd.io/_uploads/SkkKvhaIa.jpg)

[ML (Expectation-Maximization) Algorithm](https://www.geeksforgeeks.org/ml-expectation-maximization-algorithm/)

The following also cover the important Viterbi algorithm:

[Hidden Markov Model (part 1), by Kinna Chen](https://medium.com/ai-academy-taiwan/hidden-markov-model-part-1-d80d56811c2a)
[Hidden Markov Model (part 2)](https://medium.com/ai-academy-taiwan/hidden-markov-model-part-2-ac46dcdd42d1)
[Hidden Markov Model (part 3): the training theory of HMMs in depth](https://medium.com/ai-academy-taiwan/hidden-markov-model-part-3-dd2ec1480408)
[Markov Model](https://web.ntnu.edu.tw/~algo/HiddenMarkovModel.html)
[Speech and Language Processing. Daniel Jurafsky & James H. Martin. Chapter A](https://web.stanford.edu/~jurafsky/slp3/A.pdf)
[A tutorial on hidden Markov models and selected applications in speech recognition, by Rabiner](https://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf)

---

<h1>Large Deviations and Chernoff Bound</h1>

$X$ is a random variable; let $Y=e^{sX}$ and $\alpha=e^{s\delta}$. From Markov's inequality, we have $P(Y \geq \alpha) \leq \frac{E[e^{sX}]}{\alpha}=E[e^{s(X-\delta)}]$.
Assume $s>0$. Then $\{ e^{sX} \geq e^{s\delta} \} \Longleftrightarrow \{ sX \geq s\delta \} \Longleftrightarrow \{ X \geq \delta \}$, so

$P(Y \geq \alpha)=P(X \geq \delta ) \leq e^{-s\delta}E[e^{sX}]$

Let $g(s)=e^{-s\delta}E[e^{sX}]$. Then

$g^{'}(s)=E[(X-\delta)e^{s(X-\delta)}]$

$g^{''}(s)=E[(X-\delta)^2e^{s(X-\delta)}] >0$

so $g(s)$ is a convex function of $s$, with $g^{'}(0)=E[X]-\delta<0$ when $\delta > E[X]$; hence $g^{'}(s^*)=0$ at some minimizer $s^* >0$.

Hence $P(X\geq \delta) \leq exp(-s^{*} \delta) E[exp(s^*X)]$ for $\delta >E[X]$.

The exponential function is convex, so by Jensen's inequality, $E[e^{sX}] \geq e^{sE[X]}$, $s>0$.

Similarly, if $s<0$: $P(X \leq \delta) \leq exp(-s^{*} \delta) E[exp(s^*X)]$ for $\delta <E[X]$.

$S_N=\sum_{i=1}^N X_i /N$, $X_1, X_2, ...X_N$ i.i.d. Then we have $P(S_N \geq \delta)\leq exp(-s^{*} \delta) E[exp(s^*S_N)]$ for $\delta >E[S_N]$.

**Cramér's theorem (large deviations)**

Logarithmic moment generating function of $X$: $\Lambda(s)=log E[e^{sX}]$

![](https://hackmd.io/_uploads/BybJBzZh2.png)

Legendre transform of $\Lambda$: $\Lambda^*(x)=sup_{s \in R}(sx-\Lambda(s))$

$X_1, X_2, ...,X_N$ i.i.d.: $\lim_{N \to \infty}\frac{1}{N} log(P(\frac{\sum_{i=1}^N X_i}{N} \geq x))=-\Lambda^*(x)$ for all $x>E[X_1]$

$\Lambda^*$: rate function

**Theorem (Cramér’s Theorem)** Given a sequence of i.i.d. real-valued random variables $X_i$, $i ≥ 1$, with a common moment generating function $M(θ) =E[exp(θX)]$, $E[X]=\mu$, and rate function $I(x)=\sup_{\theta} \{ \theta x-\log M(\theta) \}$, the following hold:
1. For any closed set $F \subseteq R$, $\limsup_{n \to \infty} \frac{1}{n}\log P(S_n \in F) \leq - \inf_{x \in F} I(x)$
2. For any open set $U \subseteq R$, $\liminf_{n \to \infty} \frac{1}{n} \log P(S_n \in U) \geq -\inf_{x\in U}I(x)$

Remark.
1. $[a, \infty)$ is a closed set.
2. $M(0)=1$.
3. $I(x)$ is a convex non-negative function satisfying $I(µ) = 0$.
$I(x)$ is an increasing function on $[µ, ∞)$ and a decreasing function on $(−∞, µ]$.
$I(x)=sup_{\theta \geq 0} (\theta x-logM(\theta))$ for $x \geq \mu$.
$I(x)=sup_{\theta \leq 0} (\theta x-logM(\theta))$ for $x \leq \mu$.

[Introductory examples and definitions. Elena Kosygina](https://sites.math.northwestern.edu/~auffing/SNAP/Notes1.pdf)
[A basic introduction to large deviations: Theory, applications, simulations∗](https://arxiv.org/pdf/1106.4146.pdf)

<h1>Stochastic Orders</h1>

**Increasing convex order**

Let $X, Y$ be random variables such that $E[\phi(X)] \leq E[\phi(Y)]$ for all increasing convex functions $\phi:R \to R$, provided the expectations exist (denoted by $X \leq_{icx} Y$).

Properties
1. $X \leq_{icx} Y$ iff $E[X-a]^+ \leq E[Y-a]^+$ for all $a$.
2. If $X \leq_{icx} Y$, then $E[X|\Theta] \leq_{icx} E[Y|\Theta]$.
3. $X_1,...,X_n$ i.i.d., $Y_1,...,Y_n$ i.i.d.: if $X_i \leq_{icx} Y_i$, then $\sum X_i \leq_{icx} \sum Y_i$.

$E[X-c]^+$ = loss rate

Ref. Stochastic Orders and Their Applications (Probability and Mathematical Statistics), Moshe Shaked.

---

<h2>Information Entropy</h2>

**Discrete random experiment**

$log(\frac{1}{P(E)})$, where $E$ is an event. When $P(E)$ is close to 1, the surprisal of the event is low, but if $P(E)$ is close to 0, the surprisal of the event is high.

1. $log(x), x>0$ monotone increasing
2. $log(1)=0$
3. $log(a \times b )=log(a)+log(b)$

e.g., a computer exploding the moment it boots (probability very low) is a big surprise; the sun rising in the east has probability ~1 (no news outlet reports it).

**measure of uncertainty** or **entropy** (average amount of information): the smaller $p$ is, the larger $-log_2 (p)$ is.

$X$ is a discrete random variable, $P(X=x_i)=p_i$, $x_i \neq x_j$ if $i \neq j$.

$H(X)=H(p_1,p_2,...,p_n)=-\sum_{i=1}^n p_i \ log_2 \ p_i$ **(bits)**; if $p_i=0$, $p_i \ log \ p_i \doteq 0$.

Remark.
1. $lim_{x \to 0^+}x \, log \, x \doteq 0$, so $H(X) \geq 0$. (Note: for continuous random variables, treated later, the differential entropy can be negative.)
2. Below, $log$ denotes $log_2$.
3. Self-information: $I(E_k)=-log P(E_k)$ for an event $E_k$.
4. ($P_I$) Continuity. $H(X)$ is continuous in $p_i$.
5. ($P_{II}$) Symmetry.
$H(p_1,p_2,...,p_n)=H(p_2,p_1,...,p_n)$
6. ($P_{III}$) **Extremal Property:** the maximum of $H(p_1, p_2,...,p_n)$ is $H(\frac{1}{n}, ...,\frac{1}{n})$.
7. $(P_{IV})$ **Additivity**. The event $E_n$ is divided into disjoint subsets such that $E_n=\cup_{k=1}^m F_k$, with $P(F_k)=q_k$ and $p_n=\sum_{k=1}^m q_k$. Let $H_1=H(p_1,...,p_n)$, $H_2=H(p_1,...,p_{n-1},q_1,q_2,...,q_m)$, $H_3=H(\frac{q_1}{p_n},..., \frac{q_m}{p_n})$; then $H_2=H_1+p_n H_3$.
8. **Joint entropy** $H(X,Y) \doteq -\sum_k \sum_j p(k,j)log \ p(k,j)$, where $p(k,j) =P(X=x_k, Y=y_j)$.
9. **Conditional entropy** $H(X|Y) \doteq \sum_j P(Y=y_j)H(X|Y=y_j)$
10. **Measure of Mutual Information** $I(x_i;y_j) \doteq log \frac{p(x_i|y_j)}{p(x_i)}=log \frac{p(x_i,y_j)}{p(x_i)p(y_j)}=I(y_j;x_i)$, and $I(X;Y)=\sum_i \sum_j p(x_i,y_j)I(x_i; y_j)$
11. Symmetry: $H(p_1,...,p_n)$ is invariant under permutations of $(p_1,...,p_n)$.
12. Small for small probabilities: $H(1-p,p) \to 0$ as $p \to 0$.
13. **Kraft–McMillan inequality**. Let $S=\{s_1, s_2,..., s_n\}$ be encoded into a uniquely decodable code over an alphabet of size $r$ with codeword lengths $l_1, l_2, ...,l_n$. Then $\sum_{i=1}^n r^{-l_i} \leq 1$. Conversely, for a given set of natural numbers $l_1, l_2, ...,l_n$ satisfying the above inequality, there exists a uniquely decodable code over an alphabet of size $r$ with those codeword lengths.

**Remark.** For $r=2$, set $q(x_i)=2^{-l_i}$; then $E_p[l]=-E_p[\frac{ln(q(x))}{ln \ 2}]=-\sum p(x_i)log_2(q(x_i))=H(p,q)=$ **cross entropy**.

Proof (of the extremal property). With $p_n=1-\sum_{i=1}^{n-1}p_i$,
$\frac{dH}{dp_k}=-(log_2 e +log \ p_k)+(log_2 e +log \ p_n)=-log(\frac{p_k}{p_n})=0$, hence $p_k=p_n$. Thus $p_i=\frac{1}{n}$. It is a maximum and not a minimum, since $H(1,0,...,0)=0$.

Property.
1. $H(X) \geq H(X|Y)$, with **equality** if and only if $X, Y$ are **statistically independent**.

Fact. $ln \ x \leq x-1$ and $log \ x =ln \ x \ log \ e$ for $x>0$.

Proof.
$H(X|Y)-H(X)=\sum_j \sum_k p(x_k,y_j) log \frac {p(x_k)}{p(x_k|y_j)}$
$\leq log \ e \ \sum_j \sum_k p(x_k,y_j)(\frac{p(x_k)}{p(x_k|y_j)}-1)=log \ e \ (\sum_j \sum_k p(x_k)p(y_j)-1)=0$

2. $I(X;Y)=H(X)+H(Y)-H(X,Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)$
3. **If $X, Y$ are independent, then $I(X;Y)=0$, since $H(X|Y)=H(X)$**. **No information is transmitted through the channel.**
4. In a discrete communication system the channel capacity is the maximum of the transinformation: $C=max \ I(X;Y)$ ([A Mathematical Theory of Communication, by CE Shannon](https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf))
5. (Discrete Noiseless Channel) Let $X=\{x_i\}$ be the alphabet of a source containing $n$ symbols. $C=max \ I(X;Y)= max \ I(X;X)=max \ H(X)=H(\frac{1}{n},...,\frac{1}{n})=log \ n$ bits per symbol.
6. (Discrete Noisy Channel) The noise characteristic $P(Y=y_j|X=x_k)$ of the channel is **prespecified**. $C=max \ I(X;Y)$.

Idea: repeatedly merge the two symbols with the smallest probabilities, forming a binary tree. This yields a **prefix-free binary code** (a set of codewords) with minimum expected codeword length (equivalently, a tree with minimum weighted path length from the root).

Application: jpeg [Patent controversy (jpeg patent dispute)](https://en.wikipedia.org/wiki/JPEG)

![330px-Huffman_coding_example.svg](https://hackmd.io/_uploads/HJLC2bkB6.png)

7. [Huffman coding ~ a particular type of optimal prefix code that is commonly used for lossless data compression](https://en.wikipedia.org/wiki/Huffman_coding)

$X$ is a continuous random variable; the "differential entropy" is $H(X) \triangleq -\int f_X(t)log(f_X(t))dt$ if it exists.

Remark. **Riemann sum**: $-\sum f_X(t_i)\Delta_i log(f_X(t_i)\Delta_i) = -\sum f_X(t_i)logf_X (t_i) \Delta_i -\sum f_X(t_i)\Delta_i log(\Delta_i)$. The first sum tends to $-\int f_X(t) log f_X(t) dt$, while for a uniform grid $\Delta_i=\Delta$ the second term $\approx -log \ \Delta \to \infty$ as $\Delta \to 0$; so differential entropy is not the limit of the discrete entropies.

[Differential Entropy and Probability Distributions](https://www.mtm.ufsc.br/~taneja/book/node14.html)

---

In the following, $P, Q$ are distribution functions, with densities $p=P^{'}$, $q= Q^{'}$ when $P, Q$ are continuous distribution functions.
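The identities above, $I(X;Y)=H(X)+H(Y)-H(X,Y)$ and $I(X;Y)=\sum_i \sum_j p(x_i,y_j) log \frac{p(x_i,y_j)}{p(x_i)p(y_j)}$, can be checked numerically on a small joint distribution; the joint pmf values below are made up for illustration:

```python
import math

def H(dist):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# A small joint pmf p(x, y); the values are hypothetical.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

HX, HY, HXY = H(px.values()), H(py.values()), H(joint.values())
I = HX + HY - HXY                       # I(X;Y) = H(X) + H(Y) - H(X,Y)
I_direct = sum(p * math.log2(p / (px[a] * py[b]))
               for (a, b), p in joint.items())

print(HX, HY, HXY, I, I_direct)
```

Both expressions for $I(X;Y)$ agree, and $I(X;Y) \geq 0$, consistent with $H(X) \geq H(X|Y)$.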
**Kullback–Leibler divergence** $D_{KL}(P||Q)$ and Cross Entropy $H(P,Q)$

---

$P$: true distribution

$D_{KL}(P||Q)$ (also called relative entropy, information gain, or information divergence) is a type of statistical distance: **a measure of how one probability distribution $P$ differs from a second, reference probability distribution $Q$.**

![320px-KL-Gauss-Example](https://hackmd.io/_uploads/Hkf_gkAN6.png)

![kl](https://hackmd.io/_uploads/HkY9alyST.jpg)

$D_{KL}(P||Q)=\sum P(x)log \frac{P(x)}{Q(x)}$ for discrete probability distributions $P, Q$ $(P(x) >0, Q(x) >0)$.

$D_{KL}(P||Q)=-\sum P(x) log Q(x) - (-\sum P(x)log(P(x)))=H(P,Q)-H(P)$, where $H(P,Q)$ is the **cross entropy** of $P$ and $Q$.

**Remarks:**

$P$: true distribution

**Cross Entropy:** average number of total bits to represent an event when coding with $Q$ instead of $P$.

**Relative Entropy (KL Divergence):** average number of **extra** bits to represent an event when coding with $Q$ instead of $P$.

In general, $D_{KL}(P||Q) \neq D_{KL}(Q||P)$, so it is **not** a metric. Likewise $H(P,Q) \neq H(Q,P)$, so it is **not** a metric either.

$D_{KL}(P||Q) \geq 0$, $D_{KL}(P||P)=0$, and $H(P,P)=H(P)$, which is nonzero in general.

$D_{KL}(P||Q)=\int p(x)log \frac{p(x)}{q(x)}dx$ for continuous probability distributions with $p=P^{'}, q= Q^{'}$.

$I(X;Y)=H(X)+H(Y)-H(X,Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)$

$I(X;Y)=D_{KL}(P(X,Y)||P_X \otimes P_Y)$, where $P_X \otimes P_Y (x,y)=P_X(x)P_Y(y)$

![Entropy-mutual-information-relative-entropy-relation-diagram.svg](https://hackmd.io/_uploads/H1ADrmer6.png)

**Relation between maximum likelihood estimation (MLE) and the KL divergence $D_{KL}(P||Q)$**

Assume $P=p_{\theta}(x)$ is the true distribution and $Q=p_{\hat{\theta}}(x)$ is the approximating (model) distribution.
$D_{KL}=\sum_x p_{\theta} (x)log(\frac{p_{\theta}(x)}{p_{\hat{\theta}}(x)})$

$D_{KL} = E_{\theta}[ log(\frac{p_{\theta}(x)}{p_{\hat{\theta}}(x)})]=E_{\theta}[log(p_{\theta}(x))] -E_{\theta}[log(p_{\hat{\theta}}(x)) ]$

$\hat{\theta}_{best}= arg \ min_{\hat{\theta}} D_{KL}(p_{\theta}||p_{\hat{\theta}})$

Given $n$ samples, $E_{\theta}[log(p_{\hat{\theta}}(x)) ] \simeq \frac{1}{n} \sum_{i=1}^n log (p_{\hat{\theta}}(x_i))$ (sample mean).

$\hat{\theta}_{best} \simeq arg \ min_{\hat{\theta}} \ E_{\theta}[log(p_{\theta}(x))] -\frac{1}{n} \sum log (p_{\hat{\theta}}(x_i)) =arg \ min_{\hat{\theta}}-\frac{1}{n} \sum log (p_{\hat{\theta}}(x_i))$
$= arg \ max_{\hat{\theta}} \frac{1}{n} \sum log (p_{\hat{\theta}}(x_i)) = arg \ max_{\hat{\theta}} \prod_{i=1}^n p_{\hat{\theta}}({x_i})$,

since the first term does not depend on $\hat{\theta}$.

**Key Point:** **minimize the KL divergence as a loss function**

See the examples:
[Kullback-Leibler Divergence Explained](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained)

[A Gentle Introduction to Cross-Entropy for Machine Learning](https://machinelearningmastery.com/cross-entropy-for-machine-learning/)

<h1>logistic regression</h1>

**logistic function**

![320px-Logistic-curve.svg](https://hackmd.io/_uploads/By9I64dB6.png)

$L=1, k=1, x_0=0$

A logistic function or logistic curve is a common S-shaped curve (**sigmoid curve**) with the equation $f(x)=\frac{L}{1+e^{-k(x-x_0)}}$, where $x_{0}$ is the value of the function's midpoint, $L$ is the supremum of the values of the function, and $k$ is the logistic growth rate or steepness of the curve. $f: (-\infty,\infty) \to (0,L)$

If $y = f(x) = a / (1 + b c^{–x})$, the **inverse function** is $f^{-1}(y)=\frac{-ln (\frac{(a-y)}{by})}{ln \ c}$.
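A minimal numerical check of this inverse formula, in the common special case $a=1, b=1, c=e$ (the standard sigmoid and its inverse, the logit):

```python
import math

def sigmoid(x):
    """Logistic function with L = 1, k = 1, x0 = 0."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(y):
    """Inverse of the sigmoid: -ln((1 - y)/y) = ln(y / (1 - y))."""
    return -math.log((1.0 - y) / y)

for x in (-3.0, -0.5, 0.0, 2.0):
    y = sigmoid(x)
    assert 0.0 < y < 1.0                 # range of the sigmoid is (0, 1)
    assert abs(logit(y) - x) < 1e-9      # logit inverts sigmoid
print("ok")
```

The logit maps probabilities in $(0,1)$ back to the whole real line, which is why it serves as the link function in logistic regression.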
Let $c=e, a=1, b=1$; then $f^{-1}(y)=-ln \frac{1-y}{y}=ln \frac{y}{1-y}$ (the logit).

![desmos-graph](https://hackmd.io/_uploads/HJ8L-r_r6.png)

$-ln(1-x)$ and $-ln(x)$ are convex functions.

![desmos-graph](https://hackmd.io/_uploads/B18cPS_rT.png)

![1280px-Normal_Distribution_PDF.svg](https://hackmd.io/_uploads/SJNm1OdHa.png)

The meaning of the standard deviation $\sigma$:

[Chebyshev's inequality](https://en.wikipedia.org/wiki/Chebyshev%27s_inequality)

$P(|X-\mu| \geq k \sigma) \leq \frac{1}{k^2}$, if $k>1$.

Let $f(k)=\frac{1}{k^2}$, i.e., $\mu=0, \sigma=1$

![chy _ Desmos_page-0001](https://hackmd.io/_uploads/r1rxQFuS6.jpg)

Birnbaum–Raymond–Zuckerman inequality

$X=(X_1, X_2, ...,X_n), \mu=(\mu_1,...,\mu_n), \sigma=(\sigma_1,...,\sigma_n)$

$P(||X-\mu|| \geq k ||\sigma||) \leq \frac{1}{k^2}$, if $k>1$.

[Logistic Regression — Detailed Overview](https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc)

---

**Bernoulli process**

A Bernoulli process is a finite or infinite sequence of independent random variables $X_1, X_2, X_3, ...$ such that for each $i$, the value of $X_i$ is either 0 or 1, and the probability $p$ that $X_i = 1$ is the same for all $i$. In other words, a Bernoulli process is **a sequence of independent identically distributed Bernoulli trials.**

The probability of a particular sequence with $k$ ones in $n$ trials is $P =p^k(1-p)^{n-k}$.

**Bernoulli MLE (maximum likelihood estimation)**

$L(p) =\prod_{i=1}^n p^{X_i} (1-p)^{1-X_i}$

**log-likelihood**

$LL(p) =log \prod_{i=1}^n p^{X_i} (1-p)^{1-X_i}=(\sum_{i=1}^n X_i)log \ p+(n-\sum_{i=1}^n X_i)log(1-p)$

$\frac{\partial LL(p)}{\partial p}=0 \Rightarrow \hat{p}=\frac {\sum X_i}{n}=sample \ mean$

[Activation function](https://en.wikipedia.org/wiki/Activation_function)

---

Ref.
[Information Theory - Robert Ash](https://doc.lagout.org/Others/Information%20Theory/Information%20Theory/Information%20Theory%20-%20Robert%20Ash.pdf)

Personally, I find the following "physics" concepts very important and interesting (they require some knowledge of the calculus of variations):

[Action principles (least action principle)](https://en.wikipedia.org/wiki/Action_principles)

[The Feynman Lectures on Physics](https://www.feynmanlectures.caltech.edu/I_toc.html)

[From the laws of reflection & refraction to variational principles](https://apmr.matelys.com/BasicsMechanics/VariationalPrinciplesStartingWithSnellDescartesLaw/index.html)

[Theoretical Concepts in Physics, Malcolm S. Longair, University of Cambridge](https://www.cambridge.org/core/books/theoretical-concepts-in-physics/EB289A9081FA4101C4B943AEAC1AAF6E)

[Introduction to the calculus of variations](https://www.open.edu/openlearn/mod/resource/view.php?id=72745)

Very well written, but too difficult for EE/CS readers: Mathematical Methods of Classical Mechanics, V.I. Arnold.