# $\mathcal{Metric \ Space, Vector \ Space, Normed \ Space, Inner \ Product \ Space, Hilbert \ Space}$
Author: $\mathcal{CrazyDog}$, Ph.D. in Electrical Engineering, NCTU, R.O.C.
email: kccddb@gmail.com
Date: 20250304
Copyright: CC BY-NC-SA
<style>
.blue {
color: blue;
}
.bg {
color: blue;
}
.bgblue {
color: blue;
font-size: 24px;
font-weight: bold;
}
.blues {
color: blue;
font-size: 16px;
font-weight: bold;
}
.red {
color: red;
font-size: 24px;
font-weight: bold;
}
.reds {
color: red;
font-size: 16px;
font-weight: bold;
}
h1 {text-align: center;}
h2 {text-align: center;}
h3 {text-align: center;
color: red;}
h4 {text-align: center;}
</style>

<h1>Standing on the Shoulders of Giants</h1>
<span class="blues">
<h3>With ten thousand true books in mind, you command half the world of theory and practice</h3>
</span>
**Mathematics (logic, calculus, linear algebra, probability, statistics), data structures, algorithms, DSP, microprocessors and OS are the internal foundations.**
**The fundamentals must be solid.**
Network programming and embedded systems are the combat skills.
**For all matrix computations, beware of numerical errors and the possibility of overflow.**
[家扶基金會](https://www.ccf.org.tw/support/donate/channel)
[台灣之心愛護動物協會(Heart of Taiwan Animal Care, HOTAC)](https://www.hotac.org.tw/content-donate_main)
[無國界醫生](https://www.msf.org.tw/about-us/who-we-are)
Some figures are taken from Wikipedia; thanks to their creators.
Regarding the Dirac delta function, a book that EE readers can accept is Griffel, D. H., *Applied Functional Analysis*. In fact, that integral is neither a Riemann integral nor a Lebesgue integral; it should be understood as a Schwartz distribution (generalized function).
<h1>Learning without thinking is labour lost; thinking without learning is perilous.
Learning followed by thinking brings insight.</h1>
This note briefly integrates the important concepts of metric, norm, inner product, the Schwarz inequality and the $L^p$ norm, and introduces $\cos(\theta)$ as a measure of correlation and other ideas important in engineering and physics.
<h3>Metric Space (賦距空間)</h3>
Let $X$ be a set and define a function $d: (X,X) \to R$ such that
1. $d(x,y) \geq 0$
2. $d(x,y)+d(y,z) \geq d(x,z)$
3. $d(x,y)=d(y,x)$
4. $d(x,y)=0$ iff $x=y$

We say $(X,d)$ is a metric space (賦距空間) with metric $d (.,.)$.
e.g., Euclidean Space.
Distance between $x \in X$ and the subset $A \subset X$
$d(x,A) = \inf \{ d(x,y): y \in A\}$
:::info
Applications:
$d_p(x,y)=[\int | x(t)-y(t) |^p dt ]^{1/p}$, $p \geq 1$,
is also a metric (by the **Minkowski inequality**) for **good** (e.g., $L^p$) functions.
**inner product** $< ., .>$ on vector space $X$:
$x, y, z$ vectors $\in X$
$< ., .>: X \otimes X \to$ $R$ or $C$
1. $<x+y,z>=<x,z>+<y,z>$
2. $<\alpha x,y>=\alpha <x,y>$
3. $<x,y>=\overline {<y,x>}$
4. $<x,x> > 0$ when $x \ne 0$

Define norm:
$||x||=\sqrt{<x,x>}$ $\geq 0$
$||x+y|| \leq ||x||+||y||$ (triangle inequality)
e.g.,
1. $\vec{x}, \vec{y} \in R^n$, $<\vec{x},\vec{y}>=\sum_{i=1}^nx_iy_i$
2. $<f,g>\doteq \int_I fg$, where $f, g$ are real **functions on $I=[a,b]$ (interval).
Remark.**
$\int_I fg$ : **Lebesgue integral** (if the Riemann integral exists on $I$, the two integrals agree)
e.g., Let $I=[0,1]$, $f(x)=x^2$, and $g(x)=0$ if $x \in A=Q \cap [0,1]$ (the rationals in $I$), $g(x)=1$ if $x \in I-A$. Then $\int_I f =\int_I f(x)dx$ (Riemann integral) and $\int_Ig=1$ (Lebesgue), while the Riemann integral $\int g(x)dx$ does not exist.
[Probability Measure and Conditional Expectation](https://hackmd.io/@pingulinux/probspace)
[Functions are Vectors](https://thenumb.at/Functions-are-Vectors/)
3. $<f,g>\doteq \int f \overline {g}$, where f, g are **complex** functions.
$||f||\doteq \sqrt{<f,f>} < \infty$, $L^2$ norm
$L^p$ norm $=(\int |f|^p)^{\frac{1}{p}}$, $1\leq p <\infty$
$L^2$ space (square-integrable functions) and $l^2$ space (square-summable sequences) are used in digital communications, quantum mechanics, etc.
An $L^2$ space is also a Hilbert space (the key special cases here are $L^2$ and $l^2$).
Here $\int$ denotes the Lebesgue integral.

Definition. A Hilbert space is a complete inner product space.
For science and engineering, the most important spaces are $L^2$ and $l^2$: they come with notions of both distance and angle.

The Schwarz inequality, illustrated on the **unit circle** of the Euclidean plane.
Angles can also be defined in abstract spaces (even between functions), e.g., in $L^2$ space.
Define $cos (\theta) = \frac {<x,y>}{||x||\, ||y||}$, so $|cos (\theta)| \leq 1$ when $||x||\, ||y|| >0$, via the Schwarz inequality.
$||f_n -f||_{L^2} \to 0$ **does not** imply $f_n \to f$ pointwise.
e.g., Gibbs phenomenon

:::
**Trace of a matrix**
$A$ is an $n \times n$ real matrix, $tr(A) \doteq \sum_{i=1}^n A[i,i]$. Then
$tr(A+B)=tr(A)+tr(B)$
$tr(\alpha A)=\alpha \ tr(A)$
$tr(A^T)=tr(A)$
**complex numbers**
The inner product of matrices is given by $tr(B^*A)$.
[Let $F$ be a subfield of the complex numbers $C$; then $M_{m,n}(F)$ is an inner product space.](https://sharmaeklavya2.github.io/theoremdep/nodes/linear-algebra/matrices/inner-product-space.html)
The proof is not difficult; please work through it yourself.
[Matrices form an inner-product space](https://sharmaeklavya2.github.io/theoremdep/nodes/linear-algebra/matrices/inner-product-space.html)
:::info
e.g.,
$<f,g> = \int_{I} f(t)g(t)dt$ (Lebesgue integral) is an **inner product** on $L^2(I)$, where $I=[a,b]$, provided $\int_{I}fg$ exists.
For any interval $I=[a,a+\delta t] \subset R$, $\delta t >0$, the length (i.e., measure) of the interval =$\delta t$.
**Schwarz Inequality**:
**$|<x,y>| \leq ||x||\, ||y||$**
**with equality holding in the Cauchy–Schwarz Inequality if and only if $x$ and $y$ are linearly dependent.**
The Schwarz inequality allows one to extend the notion of **"angle between two vectors**" to any real inner-product space.
Define $cos (\theta) = \frac {<x,y>}{||x||\, ||y||}$, so $|cos (\theta)| \leq 1$ when $||x||\, ||y|| >0$.
**$cos (\theta)$ represents the correlation between the two vectors.**
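A minimal numerical sketch (assuming Python with numpy; the vectors below are arbitrary illustrative data) of the Cauchy–Schwarz inequality and of $\cos(\theta)$ as a correlation-like quantity:
```python
import numpy as np

# Two arbitrary vectors in R^5 (illustrative data only)
x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
y = np.array([0.5, 1.5, -0.5, 1.0, 2.5])

inner = np.dot(x, y)                        # <x, y>
norm_x, norm_y = np.linalg.norm(x), np.linalg.norm(y)

# Cauchy-Schwarz: |<x,y>| <= ||x|| ||y||
assert abs(inner) <= norm_x * norm_y + 1e-12

# cos(theta) = <x,y> / (||x|| ||y||), a correlation-like quantity in [-1, 1]
cos_theta = inner / (norm_x * norm_y)
print("cos(theta) =", cos_theta)
```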
<h2>Random Variables</h2>
Probability space $(\Omega, F, P)$
$X,Y$ random variables; by the Schwarz inequality, we have the covariance inequality:
$<X,Y> \doteq E[XY]$, since $E[XY] \doteq \int X(\omega)Y(\omega)dP$ (Lebesque integral)
$cov(X,Y)=<X-E[X],Y-E[Y]>$
$var(X)=E[(X-E[X])^2]$
$var(X)var(Y) \geq cov(X,Y)^2$
$X=[X_1,...,X_n]$ random vector
**covariance matrix (dispersion matrix)** $COV_{X}(i,j)=E[(X_i-E[X_i])(X_j-E[X_j])]$
**eigenspace**
$H$ is a Hilbert space and $U:H \to H$ a bounded linear operator. If there exists an element $x_0 \neq 0$ with the property that $Ux_0=\lambda x_0$, then $\lambda$ is an eigenvalue and $x_0$ an eigenvector. The **eigenvectors** belonging to a given eigenvalue $\lambda$ form the **eigenspace** (or the **characteristic space** of $U$) $H_{\lambda}$ corresponding to $\lambda$.
**Lemma.** $H$ is a Hilbert space, $U:H \to H$ **self-adjoint**, $Ux=\lambda_1x$ and $Uy=\lambda_2y$, $\lambda_1 \ne \lambda_2$, $\lambda_1 \ne 0$; then **the eigenspace of $\lambda_1 \bot$ the eigenspace of $\lambda_2$.**
**Proof.** Assume $Ux=\lambda_1x$, $Uy=\lambda_2y$. Then $<x,y>=\frac{1}{\lambda_1}<Ux,y>=\frac{1}{\lambda_1}<x,Uy>=\frac{\lambda_2}{\lambda_1}<x,y>$ (the eigenvalues of a self-adjoint operator are real), which is possible only if $<x,y>=0$.
:::
**<h2>Rayleigh quotient</h2>**
Any **covariance matrix** is **symmetric and positive semi-definite and its main diagonal contains variances.**
A **square matrix** is called **positive definite** if it is symmetric and all its eigenvalues $λ$ are positive, that is $λ > 0$.
If $A$ is **positive definite**, then it is **invertible** and $det A > 0$.
**A real symmetric matrix $A$ is positive definite if and only if $x^TAx > 0$ for every column $x \ne 0$ in $R^n$.**
e.g.,
$A=\left(
\begin{array}{ccc}
2 & -1 &0 \\
-1 & 2 & -1 \\
0 & -1 & 2\\
\end{array}
\right)$ is positive-definite.
$N=\left(
\begin{array}{cc}
1& 1 \\
-1 & 1 \\
\end{array}
\right)$ is not positive-definite in $C^2$, since $z^*Nz=2+2i$ is not real, where $z=[1,i]^T$ (even though $x^TNx>0$ for every real $x \ne 0$).
$M$ is a real-valued $n \times n$ **symmetric positive-definite matrix.**
Then $<x,y>_{M} \doteq <x, My>_{R^n}=x^T M y$ is an **inner product**.
**Rayleigh quotient** $R(M,x)=\frac{<x,Mx>}{||x||^2}$
$R(M,x) \in [\lambda _{\min },\lambda _{\max } ]$, where
$\lambda _{\min },\lambda _{\max }$ are respectively the smallest and largest eigenvalues of $M$.
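A small numpy check (a sketch, not part of the original notes) that the Rayleigh quotient of a symmetric positive-definite matrix always lies between its smallest and largest eigenvalues; the matrix $M$ below is the tridiagonal example given above.
```python
import numpy as np

M = np.array([[2., -1., 0.],
              [-1., 2., -1.],
              [0., -1., 2.]])          # symmetric positive definite (see the example above)
eigvals = np.linalg.eigvalsh(M)        # ascending eigenvalues for a symmetric matrix

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.normal(size=3)
    R = x @ M @ x / (x @ x)            # Rayleigh quotient R(M, x)
    assert eigvals[0] - 1e-12 <= R <= eigvals[-1] + 1e-12

print("lambda_min, lambda_max =", eigvals[0], eigvals[-1])
```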
The **geometric multiplicity** of an eigenvalue is the **dimension** of the linear space of its associated eigenvectors (i.e., its **eigenspace**).
**characteristic polynomial** $f_A(\lambda)=det (\lambda I-A)$
$A:R^n \to R^n$, $dim(N(\lambda I-A))+dim(R(\lambda I-A)) =n$, where
**null space** $N(L)= \{ x \in X:Lx=0 \}$
and
**range space** $R(L)=\{ y= Lx: x \in X \}$
**geometric multiplicity of $\lambda=dim(N(\lambda I-A))$.**
e.g.,
In multivariate statistics and probability theory, the **scatter matrix** is a statistic that is used to make estimates of the covariance matrix.
$M=COV_X$
**Scatter Matrix S**
**classification**

Given **n samples** of **m-dimensional data**, represented as the $m \times n$ matrix, $x=[x_1,...,x_n]$, the sample mean is $\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i$
The **scatter matrix** is the $m \times m$ **positive semi-definite matrix**
$S=\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^T$
[Hermitian 矩陣特徵值的變化界定](https://ccjou.wordpress.com/2010/03/16/hermitian-%E7%9F%A9%E9%99%A3%E7%89%B9%E5%BE%B5%E5%80%BC%E7%9A%84%E8%AE%8A%E5%8C%96%E7%95%8C%E5%AE%9A/)
[Scatter matrix , Covariance and Correlation Explained](https://medium.com/@raghavan99o/scatter-matrix-covariance-and-correlation-explained-14921741ca56)
[Scatter Plot](https://byjus.com/maths/scatter-plot/)
---
:::success
Applications:
數位通訊
機率統計
檢測與估計
machine learning
:::
---
Assume $I=[0,2\pi]$, $<f,g> = \int_{I} f(t)g(t)dt$ and $cos(\theta)=\frac {<f,g>}{||f|| ||g||} \leq 1$ (i.e., f,g 也有抽象夾角觀念),
$\int_0^{2\pi}\cos(\omega t)\sin(\omega t)\,dt =0$ (for integer $\omega$)
Define $e_x = cos(\omega t)/||cos(\omega t)||$ and $e_y=sin(\omega t)/||sin(\omega t)||$ in $[0,2\pi]$.
$cos(\theta)=<e_x,e_y>=0$ and $||e_x||=||e_y||=1$
$e_x, e_y$ form a basis of a vector space, just like in $R^2$: a vector space with notions of distance, inner product, and angle.
**Probability space $(\Omega, F, P)$**
Ref. [Probability measure and conditional expectation for EE](https://hackmd.io/@pingulinux/probspace)
Assume $n$ is noise and $<f,g>=0$, $f,g \in L^2 (\Omega \otimes [0,T])$ (random processes, stochastic processes, or random functions on $[0,T]$). Implicitly, we consider $f,g \in L^2 \ or \ l^2$ (Hilbert space).
**Correlation Receiver** ($<f,g>=\int_0^T fg$)~**Matched Filter Receiver**
Assume $<f,g>=0$.
1. $<f+n, f>=<f,f>+<n,f>=||f||^2+<n,f>$
2. $<f+n, g>=<f,g>+<n,g>=<n,g>$
Notice that $<f,h> \leq ||f||||h||$ by Schwarz Inequality.
Assume $f, g \in L^2$ or $l^2$ (random process, random sequence), $<f,g> \doteq E[\int_0^T fg\, dt]=E[sample \ path \ integral]$.
$<n,f> \leq ||n||\,||f||$ and $<n,\frac{f}{||n||}> \leq ||f||$ by the inner product properties and the Schwarz inequality.
:::info
Let $n^{'}=<n,h>$ and $||n^{'}||=N=\sqrt{E} < \infty$
Maximizing **SNR**(Signal-to-noise ratio) $\frac{||<f,h>||^2}{||n^{'}||^2}=\frac{ average \ signal \ power \times T}{average \ noise \ power \times T}$ (digital communication) to get good performance.
$||<f,h>|| \leq ||f||\,||h||$, **and the two sides are equal if and only if $f$ and $h$ are linearly dependent**, i.e., $h=\alpha f$, $\alpha \ne 0$. Let $g(t)=f(T-t)$; then $(f \star g)(T)=<f,f>$ on $[0,T]$.
**This is the mathematical principle behind the basic correlation receiver / matched filter of digital communication.**
:::
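A hedged simulation sketch (Python/numpy; the pulse shape, noise level, and sample count are made-up illustrative choices) of the correlation receiver: correlating the noisy received signal against the transmitted pulse $f$ concentrates the signal energy, while the orthogonal pulse $g$ collects almost none.
```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 1.0, 1000
t = np.linspace(0.0, T, N, endpoint=False)
dt = T / N

f = np.sin(2 * np.pi * 5 * t)          # transmitted pulse
g = np.cos(2 * np.pi * 5 * t)          # orthogonal pulse: <f, g> ~ 0
n = rng.normal(scale=0.5, size=N)      # white-ish noise samples

def inner(a, b):
    return np.sum(a * b) * dt          # <a, b> = integral over [0, T]

r = f + n                              # received signal
print("<f+n, f> =", inner(r, f))       # ~ ||f||^2 + small noise term
print("<f+n, g> =", inner(r, g))       # ~ noise only (near 0)
print("||f||^2  =", inner(f, f))
```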
---
**Linearly Independent**
A sequence of vectors $\vec{v}_1, \vec{v}_2,...,\vec{v}_k$ from a vector space V is said to be linearly dependent, if there exist scalars $a_1, a_2,...,a_k$ not all zero, such that $\sum_{i=1}^k a_i\vec{v}_i=0$. **A set of vectors is linearly dependent if and only if one of them is zero or a linear combination of the others.**
A sequence of vectors $\vec{v}_1, \vec{v}_2,...,\vec{v}_k$ is said to be **linearly independent** if it is not linearly dependent, that is, if the equation $\sum_{i=1}^k a_i\vec{v}_i=0$ can only be satisfied by $a_i=0$, $i=1,2,...,k$.
The kernel of a linear map, also known as the null space or nullspace, is the linear subspace of the domain of the map which is mapped to the zero vector.
A linear map $L : V \to W$ between two vector spaces $V$ and $W$
**the kernel of $L$:** $ker(L)=\{v \in V|L(v)=0 \}\doteq L^{-1}(0)$
**The image of $L$:** $img(L)\doteq \{ L(v): v \in V\}$
$W$ is a subspace of $V$, $img(L(W))=\{ L(v): v \in W\}$.
e.g.,
$B=\left(
\begin{array}{ccc}
1 & 0 & 0\\
0 & 1 & 0 \\
0 & 0 & 1 \\
\end{array}
\right)$
$ker(B-1I)^3=ker \left(
\begin{array}{ccc}
0 & 0 & 0\\
0 & 0 & 0 \\
0 & 0 & 0 \\
\end{array}
\right)=R^3$
$img(B)=R^3$
**eigenvalues** of $B$: 1, 1, 1
:::info
**Let $K_l \doteq ker(A-\lambda I)^l$ for an eigenvalue $\lambda$; then $\{0\} \neq K_1 \subseteq K_2 \subseteq K_3 \subseteq \cdots$
Why?**
:::
**<h3>rank</h3>**
**Assume $f: V^m \to V^n$, $f:\vec{x} \to A\vec{x}$, $\vec{x} \in V^m$, where $V^i$ is an $i$-dimensional vector space and $A$ is an $n \times m$ matrix.**
**rank** of $f$ = **rank** $A \doteqdot dim \ f(V^m)$.
A sequence of vectors $V(A)$ denotes the set of all (finite) linear combinations of the points in $A$, i.e., **the linear subspace spanned by $A$** (or **linear span of A~span(A)**, **線性生成空間**).
e.g., $A=\{x_1,x_2,...x_n\}$, **$span(A)=\{x:x=\sum_{i=1}^n\lambda_ix_i,\ \lambda_i \in R \}$**
:::info
**Lemma**. $dimf(V^m)+dim \ ker(f)=m$
:::
**Proof.**
Let dim $f^{-1}(0)=m-r$ and $\{ u_{r+1},...,u_m \}$ be a basis of $f^{-1}(0)$. Then choose $u_i, 1 \leq i \leq r$ such that $\{u_1,...u_r,...,u_m \}$ is a basis of $V^m$.
**Claim** $f(u_1),...f(u_r)$ is a basis of $f(V^m)$ and dim$f(V^m)=r$. $f(V^m)=span \{f(u_1),...,f(u_r),...,f(u_m)\}$. Since $f(u_{r+1})=...=f(u_m)=0$, then $f(V^m)=span \{ f(u_1),...,f(u_r)\}$.
**Claim $f(u_1),...,f(u_r)$ are linearly independent.**
**Suppose** $\sum_{i=1}^r\alpha_i f(u_i)=0$. Then $f(\sum_{i=1}^r\alpha_i u_i)=0$, i.e., $\sum_{i=1}^r \alpha_i u_i \in f^{-1}(0)$. Hence there exist scalars $\alpha_i, r+1 \leq i \leq m$, such that $\sum_{i=1}^r \alpha_i u_i +\sum_{i=r+1}^m \alpha_iu_i=0$. Since $u_i, 1 \leq i \leq m$, are linearly independent, $\alpha_i=0, i=1,...,m$. Hence $f(u_1),...,f(u_r)$ are linearly independent and $\dim f(V^m)=r$.
**Corollary. Let $f$ be a linear transformation of $V^n$ (into itself). Then $f$ is one-to-on if and only if $f$ is onto.**
Proof. $f(x_1)=f(x_2) \Leftrightarrow f(x_1-x_2)=0$. If $f^{-1}(0)= \{ 0 \}$, then $x_1=x_2$; hence $f$ is one-to-one. Conversely, it is trivial. $f^{-1}(0)=\{0\} \Leftrightarrow$ $dim \ f^{-1}(0)=0 \Leftrightarrow dim \ f(V^n)=n \Leftrightarrow f(V^n)=V^n$.
**Corollary.** $n \times m$ matrix $A=(a_{ij})$. Let $\vec{a}_1,..., \vec{a}_m$ denote the column vectors of $A$. For $\vec{x} \in V^m, \vec{x}=[x_1,...x_m]^T$, $f(\vec{x})=A\vec{x}=\sum_i x_i \vec{a}_i$ and $f(V^m)=span \{\vec{a_1},...,\vec{a_m} \}$. **$rank \ A =$** **the maximal number of linearly independent columns of $A$**
Corollary. Let $A$ be an $n \times m$ matrix and $B$ an $m \times l$ matrix; then $rank \ AB \leq \min(rank \ A, rank \ B)$, since $rank \ AB = dim \ ABV^l \leq dim \ AV^m = rank \ A$, and $rank \ AB = dim \ ABV^l \leq dim \ BV^l =rank \ B$.
**Corollary. $P, Q$ nonsingular (invertible), then $rank \ PAQ =rank \ PA = \ rank \ AQ= rank A$.**
**Corollary.** Let $A$ be an $m$ by $n$ matrix. Then $rank(A)=m$ iff rows of $A$ are linearly independent.
**Corollary.** $A, B$ $n \times m$ matrices. $rank \ (A + B) \leq rank \ A + rank \ B$, since $rank \ (A+B) =dim (A+B)V^m \leq dim (AV^m+BV^m) \leq dim AV^m +dim BV^m =rank \ A+rank \ B$
This is the principle behind the method you use to compute rank in your linear algebra course.
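A quick numpy illustration (a sketch with arbitrary random matrices) of the rank inequalities above: $rank(AB) \le \min(rank\,A, rank\,B)$ and $rank(A+B) \le rank\,A + rank\,B$.
```python
import numpy as np

rng = np.random.default_rng(2)
# Build a low-rank matrix from an outer product (rank <= 2)
A = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))   # 5x4, rank <= 2
B = rng.normal(size=(4, 3))                             # 4x3, generically rank 3

rA = np.linalg.matrix_rank(A)
rB = np.linalg.matrix_rank(B)
rAB = np.linalg.matrix_rank(A @ B)
print(rA, rB, rAB)
assert rAB <= min(rA, rB)

C = rng.normal(size=(5, 1)) @ rng.normal(size=(1, 4))   # another rank-1 5x4 matrix
assert np.linalg.matrix_rank(A + C) <= rA + np.linalg.matrix_rank(C)
```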

Let $H$ be a Hilbert space and $M$ a subset of $H$. Define $M^{\bot}$, the orthogonal complement of $M$, by
$M^{\bot}= \{x \in H: <x,y>=0 \ for \ all \ y \in M\}$.
**Closed set**
A set $A$ in a metric space $(X,d)$ is a closed set if and only if **every** convergent sequence $\{x_n\}$ with $x_n \in A$ has **its limit in $A$.**
e.g., $a_n=1-\frac{1}{n+1}$, $n=1,2,...$, $\lim_{n \to \infty} a_n =1$
$a_n \in (0,1)$, but $1 \notin (0,1)$, so the interval $(0,1)$ is not a closed set.
The interval $[0,1]$ is a closed set.
**Projection theorem**
Let $M$ be any **closed** linear subspace of a Hilbert space $H$. Then $H=M+M^{\bot}$ (the linear subspace generated by $\{M, M^{\bot}\}$). Moreover, each $x \in H$ can be expressed uniquely as $x=m+n$, where $m \in M$, $n \in M^{\bot}$, and $||x||^2=||m||^2+||n||^2$.

Let $\hat{P}x =m$, $\hat{P}$ projection operator. Then $<\hat{P}x,n>=0$
Let $A$ be a nonempty set in a linear space $X$. The set is **linearly independent** if and only if for each $x \ne 0$ in $V(A)$ there is one and only one finite subset of $A$, say $\{x_1,x_2,...,x_n\}$, and a unique $n$-tuple of nonzero scalars $\{a_1, a_2,...,a_n\}$, such that $x=a_1x_1+...+a_nx_n$.
$\{a_1, a_2,...a_n\}$: **coordinates**
$\{x_1,x_2...x_n\}$: **basis**
[**Hilbert projection theorem**](https://en.wikipedia.org/wiki/Hilbert_projection_theorem)
e.g., $L^2$ and $l^2$.
Operator Norm $A:X \to Y$
$||A||_p= sup_{x \ne 0} \frac{||Ax||_p}{||x||_p}$, $1 \leq p \leq \infty$
1. $||A+B|| \leq ||A||+||B||$
2. $||AB|| \leq ||A||||B||$
3. $||A|| = ||A^*||$
$P$: orthogonal projection operator ($P \ne 0$); then $P^2=P$ and $||P||=1$.
:::info
Recall property 3 of the inner product: $<x,y>=\overline {<y,x>}$
:::

**Definition.** Let $X$, $Y$ be Hilbert spaces and $A:X \to Y$ a bounded linear operator. The adjoint operator (共軛算子) $A^*: Y \to X$ is defined by the equation
$<x, A^* y>_X=<Ax,y>_Y$.

<span class="reds">**Remark (重要).**</span>
1. The adjoint may also be called the **Hermitian conjugate**.
2. Hermitian matrix (or self-adjoint matrix) $A=[a_{ij}]=[\bar{a_{ji}}]$
3. **A square matrix $A$ is Hermitian if and only if it is unitarily diagonalizable with real eigenvalues, i.e.,** $A=Q\Lambda Q^*$ ([spectral decomposition](https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix) of A), where $\Lambda$ is a diagonal matrix.
4. $A: H \to H$ is called **self-adjoint** if $A=A^{*}$.
5. **$A: H \to H$ is a bounded** **self-adjoint** operator iff $<x,Ax>=<Ax,x>=\overline{<x,Ax>}$ is **real valued** for all $x \in H$, since $<x,y>=\overline {<y,x>}$
6. $T:H \to H$ is a bounded linear transformation. $T$ is said to be normal if $TT^*=T^*T$.
7. If $T$ **self-adjoint**, then it is normal.
8. A square matrix $U \in C^{n \times n}$ is **unitary** if $U^*U=UU^*=I$.
9. A square matrix $U \in R^{n \times n}$ is **orthogonal** if $U^TU=UU^T=I$.
e.g.,
$A=\left(
\begin{array}{ccc}
1 & 1 &0 \\
0 & 1 &1 \\
1 & 0 & 1\\
\end{array}
\right)$
is **normal**, since $A^*A=AA^*$ and its eigenvalues are $2, \frac{1 \pm i \sqrt{3}}{2}$. $A$ is neither unitary, Hermitian, nor skew-Hermitian matrix.
$A=\left(
\begin{array}{cc}
2 & 1+i \\
1-i & 3 \\
\end{array}
\right)$
$A$ is **self-adjoint**, and it can be checked (e.g., using the characteristic polynomial) that the eigenvalues of $A$ are $1, \ 4$.
$B=\left(
\begin{array}{ccc}
1 & 0 & 0\\
0 & 1 & 0 \\
0 & 0 & 2 \\
\end{array}
\right)$
$B$ is self-adjoint, and it can be checked (e.g., using the characteristic polynomial) that the eigenvalues of $B$ are $1, \ 1, \ 2$.
**Geometric multiplicity** of $\lambda=1$ is $2$.
Let $T(x)=Ax$, then $||T||$ is the **square root** of the largest eigenvalue of the **symmetric matrix** $A^TA$, i.e., $||T||= sup \{ \sqrt{\lambda}: \lambda \ eigenvalue \ of \ A^*A \}$.
**Why?**
e.g.,
$A=\left(
\begin{array}{ccc}
2 & 0 & 0 \\
3 & 0 & 2 \\
\end{array}
\right)$
Then $A^TA=\left(
\begin{array}{ccc}
13 & 0 & 6 \\
0 & 0 &0\\
6 & 0 & 4 \\
\end{array}
\right)$
, which has eigenvalues $\{0,1,16\}$, so $||A||=4$
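A numpy verification (sketch) of the claim above: the operator 2-norm of $A$ equals the square root of the largest eigenvalue of $A^TA$; the matrix is the $2\times 3$ example just given.
```python
import numpy as np

A = np.array([[2., 0., 0.],
              [3., 0., 2.]])

AtA = A.T @ A
lam_max = np.linalg.eigvalsh(AtA)[-1]     # largest eigenvalue of A^T A
print(np.sqrt(lam_max))                   # -> 4.0
print(np.linalg.norm(A, 2))               # spectral norm of A, also 4.0
```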
The normed space of all bounded linear operators from the $X$ into $Y$ is denoted $B(X,Y)$.
**Lemma 1.** $X,Y,Z$ Hilbert spaces, $S \in B(X,Y)$ and $T \in B(Y,Z)$, then $||TS|| \leq ||T||||S||$.
**Lemma 2. 類似您學過的矩陣運算** $X, Y, Z$ Hilbert spaces.
1. If $A_1, A_2 \in B(X,Y)$, $(A_1+A_2)^*=A_1^*+A_2^*$.
2. If $\alpha \in R$, $(\alpha A)^*=\alpha A^*$
3. $A_1 \in B(X,Y)$, $A_2 \in B(Y,Z)$ then $(A_2A_1)^*=A_1^*A_2^*$.
4. If $A^{-1}$ exists, then $(A^{-1})^*=(A^*)^{-1}$.
5. $(A^*)^*=A$
$H$ a Hilbert space, $A \in B(X,X)$, $A$ is said to be **self-adjoint operator (自[共軛](https://blog.csdn.net/Strive_For_Future/article/details/108274742)算子)** if $A=A^*$.
$||A^*A|| \leq ||A^*||||A||=||A||^2$
$A^*Ax=\lambda x$
軛 (yoke): the curved piece of wood at the two ends of a carriage crossbar that rests on the neck of an ox or horse.
e.g.,
[Covariance matrix](https://en.wikipedia.org/wiki/Covariance_matrix)
e.g.,
$A:R^n \to R^n$, $A^T=A^*=A$, $A^T$ transpose of $A$.
e.g.,
$B=\left(
\begin{array}{ccc}
1 & 0 & 0\\
0 & 1 & 0 \\
0 & 0 & 2 \\
\end{array}
\right)$
$B$ is a self-adjoint positive definite operator.
e.g.,
Fredholm (Volterra) integral equation of the first kind
Let $X=Y=L^2[0,1]$ and define $Ax=\int_0^1 k(t,s)x(s)ds$, $t \in [0,1]$, with $|k(t,s)| < \infty$ on $[0,1] \otimes [0,1]$, where $k(t,s)$ is called the **kernel function**. Check that $<Ax,y>=\int x(t)\int k(s,t)y(s)\,ds\,dt$. Then $A^*y=\int k(s,t)y(s)ds$.
**FACT(重要)**:
1. Every linear operator defined on **a finite-dimensional normed space is compact** (hence a **bounded operator, and so continuous**).
2. For any matrix $A$, **the eigenvalues of $A^*A$ are always real and
non-negative** (in other words, $A^*A$ and $AA^*$ are positive semi-definite).
**Proof.** Assume $A^*Ax=\lambda x$ with $<x,x>=1$. Then $\lambda=<A^*Ax,x>=<Ax,Ax>=||Ax||^2 \geq 0$ (and $\lambda$ is real, since $A^*A$ is self-adjoint).
3. **Singular Value Decomposition(重要)**
Any matrix $A \in C^{m \times n}$ can be decomposed as $A=U\Sigma V^*$, where $U \in C^{m \times m}$ and $V \in C^{n \times n}$ are unitary matrices. The matrix $\Sigma \in R^{m \times n}$ is “diagonal,” with non-negative elements on the main diagonal. The non-zero elements of $\Sigma$ are called the singular values of $A$ and satisfy $\sigma_i=\sqrt{i\text{-th eigenvalue of } A^*A}$.
**Proof (sketch).** Assume $rank(A)=m$. Since $AA^*$ is Hermitian and positive definite, there exist a unitary $U$ and a diagonal matrix $\Lambda=diag(\lambda_1,...,\lambda_m)=diag(\sigma_1^2,...,\sigma_m^2)=\Sigma_1^2>0$ such that $U \Lambda U^*=AA^*$.
Let $V_1=A^*U\Sigma_1^{-1}$. Then $V_1^*V_1=I$. Construct $V=[V_1,V_2] \in C^{n \times n}$ by choosing the columns of $V_2$ so that $V$ is unitary, and let $\Sigma=[\Sigma_1,0] \in R^{m \times n}$. Then $\Sigma V^*=\Sigma_1V_1^*=U^*A$, i.e., $A=U\Sigma V^*$.
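A short numpy sketch checking that the singular values returned by `np.linalg.svd` are the square roots of the eigenvalues of $A^*A$ (the matrix below is arbitrary illustrative data).
```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 5)) + 1j * rng.normal(size=(3, 5))   # arbitrary complex 3x5 matrix

U, s, Vh = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vh
eig = np.linalg.eigvalsh(A.conj().T @ A)           # eigenvalues of A*A (real, >= 0)
sigma_from_eig = np.sqrt(np.clip(eig, 0.0, None))[::-1][:len(s)]

print(np.allclose(s, sigma_from_eig))              # True: sigma_i = sqrt(i-th eigenvalue of A*A)
print(np.allclose(A, (U * s) @ Vh))                # True: the decomposition reconstructs A
```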
**Proposition.**
Let $H_1$ and $H_2$ be Hilbert spaces with inner products $<.,.>_1$ and $<.,.>_2$ respectively. Let $L: H_1 \to H_2$ be a bounded linear operator. Define the adjoint $L^*: H_2 \to H_1$ so that $<y,Lx>_2=<L^*y,x>_1$ holds for all $x \in H_1$ and all $y \in H_2$.
1. $L^*$ is a bounded linear operator.
2. $||L||=||L^*||$
3. $L=L^{**}$
4. $U: H_1 \to H_2$ is unitary iff $U^*U=I_1$ and $UU^*=I_2$, where $I_i$ is the identity operator on $H_i$.
<span class="bg"> $A$ self-adjoint bounded linear operator on a real Hilbert space $H$ . $A$ is said to be positive semidefinite (positive definite) if $<x,Ax> \ \geq 0 (>0)$, for all $x \in H$</span>.
e.g.,
$(\Omega, F, P)$ is a **probability space**, $X_i \in L^2(P)$ i.i.d. $E[X_i^2]>0$ and $E[X_i]=0$
:::info
<h2>(sample) covariance matrix (重要)</h2>
**The covariance matrix $[K_{X_i X_j}]=[E[X_iX_j]]$ is positive semi-definite and symmetric.**
Proof:[ Positive semi-definiteness of the covariance matrix](https://statproofbook.github.io/P/covmat-psd.html)
<span class="reds">Hence $[K_{X_i X_j}]\vec{V}=\lambda \vec{V}$, the eigenvalue $\lambda >0$ </span> if $[K_{X_i X_j}]$ is positive definite.
**HOME WORK. Prove it.**
**Fact.** $||A||=||A^* ||$ (in general this needs the Hahn–Banach Theorem, which is too hard for EE)
**Sample covariance matrix Q**
For a sample of vectors $x_i=(x_{i1},x_{i2},...,x_{ik})^T$, $i=1,...,n$, where $n$ is the sample size, the sample mean is $\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i$.
The sample covariance matrix is $Q=\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^T$.
Assume $y \ne 0 \in R^k$ and $Q=[q_{ij}]$; then $y^TQy=\frac{1}{n} \sum_{i=1}^n ((x_i-\bar{x})^Ty )^2 \geq 0$.
**Sample covariance matrix** is always **positive semi-definite.**
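A numpy sketch (arbitrary random data) showing that the sample covariance matrix is symmetric and positive semi-definite: all its eigenvalues are $\ge 0$.
```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 3
X = rng.normal(size=(n, k)) @ np.array([[1., 0.5, 0.],
                                        [0., 1.0, 0.2],
                                        [0., 0.0, 1.0]])   # correlated samples, rows = observations

xbar = X.mean(axis=0)
Q = (X - xbar).T @ (X - xbar) / n          # sample covariance matrix (1/n convention)

print(np.allclose(Q, Q.T))                 # symmetric
print(np.linalg.eigvalsh(Q))               # all eigenvalues >= 0 (positive semi-definite)
```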
For an $L^2$ random process, the [Karhunen–Loève theorem](https://en.wikipedia.org/wiki/Kosambi%E2%80%93Karhunen%E2%80%93Lo%C3%A8ve_theorem) also yields a **covariance matrix**, used in Principal Component Analysis.
In practice, see also:
MIT OpenCourseWare:
[Singular values](https://ocw.mit.edu/courses/6-241j-dynamic-systems-and-control-spring-2011/resources/mit6_241js11_lec04/)
[Matrix Perturbations](https://ocw.mit.edu/courses/6-241j-dynamic-systems-and-control-spring-2011/855766ddb3a22777d5425518c0ad51d2_MIT6_241JS11_chap05.pdf)
[Dynamic Systems And Control](https://ocw.mit.edu/courses/6-241j-dynamic-systems-and-control-spring-2011/pages/lecture-notes/)
[Generalized eigenvalue problem](https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix#Generalized_eigenvalue_problem)
[線代啟示錄](https://ccjou.wordpress.com/about/)
:::
---
**Geometric Analysis of Operators (geometric properties of the projection operator)**
You must understand this concept; it is one of the key ideas of Hilbert space theory. Many of the proofs are too hard for you, so I explain it in the simplest possible way. Without strong mathematics you will not follow everything, <span class="reds">"and that is perfectly normal"</span>, so don't blame yourself.
**Null space $N(L)$**
$L:X \to Y$ is a linear transformation, $N(L) \doteq \{x \in X: Lx=0 \}$
**Range space $R(L)$**
$R(L) \doteq \{y=Lx: x \in X \}$
$P:H \to H$ is an orthogonal projection on a Hilbert space $H$; then
1. $P$ is the identity operator on $R(P)$ and
2. $P$ is the zero operator on $N(P)$.
Hence, $H=R(P)+N(P)$.
**Let $\{P_1, P_2,...,P_m \}$ be projection operators with $P_iP_j =0$ if $i \neq j$, $P_i \neq 0$, and $I=\sum_{i=1}^m P_i$; then $H=\sum_{i=1}^m R(P_i)$.**
**Assume $\lambda_i \in C$, $i=1,2,..m$, $\lambda_i \neq \lambda_j$, if $i \neq j$. Let $L=\sum \lambda_i P_i$.**
**Once we write "continuous", students get lost; at least for finite dimensions (matrices) and bounded operators (the cases relevant to physics and engineering) the statement is correct.**
Assume each $P_i$ is a continuous projection operator (e.g., a finite-dimensional projection operator, bounded operator, ODE, ...); then for any vector $x \in H$ there exist $x_i \in R(P_i)$ such that $Lx=\sum_{i=1}^m \lambda_i x_i$
**Eigenvalue-Eigenvector**
$\lambda x- Lx=0$ has a solution $\lambda_i$, $i=1,..., m$(eigenvalue), then $x_i \in R(P_i)$ (eigenvector) and $R(P_i)=N(\lambda_i I-L)$
:::info
$L\vec{x}=\lambda \vec{x}$
Intuitively, an eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it.
:::
**Riesz Representation Theorem**
$H$ is a Hilbert space and $l$ is a bounded linear functional ($||l|| < \infty$) on $H$. Then there is **one and only one** vector $y \in H$ such that
$l(x)=<x,y>$ for all $x \in H$.
e.g., $\vec{x} \in L^2[a,b], \ \vec{y} \in L^2[a,b]$, $f(\vec{x})=\int_{[a,b]} y(t)x(t)dt$. Then $f$ is a bounded linear functional and $f(\vec{x})=<\vec{x},\vec{y}>$.
**Every continuous linear functional on a Hilbert space can be represented as an inner product.**
**Remark.** **A linear operator is bounded if and only if it is continuous**.
---
**In circuit terms, $<f,g>$ is an integrator.** Do you see why SNR is so important, and why it is expressed in terms of power?
**$l^2$ space**
$x=\{x_k\}, y=\{y_k\}, k=1,2,...$
$<x,y>=\sum_k x_ky_k$ is an inner product on $l^2$
---
**Probability space $(\Omega,\mathcal{F},P)$**
$\Omega \ne \phi$ sample space
$\mathcal{F}$ is the $\sigma$-field of subsets of sample space $\Omega$
$P: \mathcal{F} \to [0,1]$ probability measure with $P(\phi)=0$
$A \in \mathcal{F}$, $A\subseteq \Omega$ is an event
$X$ is a random variable($\mathcal{F}$-measurable function from $\Omega \to R$).
$P(X \leq x)=Prob\{\omega \in \Omega|X(\omega) \leq x \}$
$X$, $Y$ $\in L^2(\Omega,\mathcal{F},P)$ are random variables with $E[X^2]<\infty$ and $E[Y^2]<\infty$
$<X,Y>\doteq E[XY]$ is an **inner product**
$<X-EX, Y-EY>\doteq E[(X-E[X])(Y-E[Y])]$ is an inner product
$||X|| \doteq \sqrt{<X,X>}$ is the **norm** of $X$
**Markov's inequality**
If X is a nonnegative random variable and a > 0,
$P(X \geq a) \leq \frac{E[X]}{a}$
If $\phi$ is a monotonically increasing nonnegative function for the nonnegative reals, X is a random variable, $a ≥ 0$, and $\phi(a) > 0$, then
$P(X \geq a ) \leq \frac{E[\phi (X)]}{\phi(a)}$
e.g.,
$P(|X|\geq a) \leq \frac{E|X|^n}{a^n}$
Let $Y=X-E[X]$; then we obtain Chebyshev's inequality.
**It shows that "almost all" of the values of a random variable are "close" to the mean.**
**Functions, sequences, and random variables can also be viewed as vectors.**
$\sqrt{E[X^2]}$ is a norm if $E[X^2]< \infty$.
correlation coefficient of $X-E[X]$ and $Y-E[Y]$:
correlation coefficient $= \frac{<X-EX,Y-EY>}{||X-EX||\,||Y-EY||}$
[Hamming distance is a metric ](https://planetmath.org/hammingmetric)
Linear independence
Projection operator
basis
orthonormal basis
Independent random variables
...
Please study linear algebra and probability textbooks on your own; writing this out would take very long, and bringing in measure theory would fill a book and be too hard for most CS/EE readers. The Wikipedia articles are not always well written.
**Extension to random vectors:**
If $f \in L^2$ and $f=a_1 e_1+a_2 e_2+...a_ne_n$,
$||e_i||=1$ and $<e_i,e_j>=0$ if $i \neq j$ (orthonormal basis, $i=1,2,3...$)
then
$<f,e_i>=a_i$.
$f \leftrightarrow_{one-to-one} \vec{a}=(a_1,a_2,...,a_n)$, **one-to-one mapping**
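A small numerical sketch (Python/numpy, with a crude Riemann-sum inner product; the basis and coefficients are illustrative) of recovering the coefficients $a_i = <f, e_i>$ of a function expanded in an orthonormal basis.
```python
import numpy as np

N = 10_000
t = np.linspace(0.0, 2 * np.pi, N, endpoint=False)
dt = 2 * np.pi / N

def inner(u, v):
    return np.sum(u * v) * dt               # <u, v> on [0, 2*pi] (Riemann sum)

# Orthonormal basis functions on [0, 2*pi]
e1 = np.cos(t) / np.sqrt(np.pi)
e2 = np.sin(t) / np.sqrt(np.pi)

a1, a2 = 3.0, -2.0                           # illustrative coefficients
f = a1 * e1 + a2 * e2

print(inner(e1, e1), inner(e2, e2), inner(e1, e2))   # ~1, ~1, ~0
print(inner(f, e1), inner(f, e2))                    # recovers a1, a2
```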


e.g., **QAM** (Quadrature amplitude modulation)
$\vec{g_i}=a_{i1} \vec{e_1}+a_{i2}\vec{e_2}$
$\vec{g_i} \leftrightarrow (a_{i1},a_{i2})$
$\vec{g}=(g_1,g_2,...,g_{64})$ ~ the principle of 64-QAM (Quadrature amplitude modulation)
Why?
**Wi-Fi 6/6E 1024 QAM**
**5th generation mobile networks (5G) 512-QAM or 1024-QAM**
**4G 256-QAM or 64-QAM**
---
Assume $\vec{f}=(f_1,f_2,...f_m)$ if $f_i \in L^2$ and $f_i=\sum_{j=1}^{n}a_{ij} e_j$. Then $\vec{f}^T=A\vec{e}^T$, where $\vec{e}=(e_1,e_2,...e_n)$ and $A=[a_{ij}]$
$\vec{f} \leftrightarrow [a_{ij}]$ one-to-one mapping
---
**Homework:**
$a_{ij}=?$
Hint: $<f_i,e_k>$
This is the foundation of estimation and prediction.
It uses the [covariance matrix](https://en.wikipedia.org/wiki/Covariance_matrix) from probability.
Because the Chinese translations of these terms differ, please use the English terms.
$X_i$ is a random variable and $E[X_i]=0$, $\vec{X}=(X_1,X_2,...X_m)$ if $X_i \in L^2$
e.g., $X'=X-E[X]$ if $E[X]\ne 0$
----
[Concave vs. Convex](https://www.grammarly.com/blog/concave-vs-convex/)
**Jensen's inequality**

$t_1 \in [0,1]$, $f(t_1 x_1+(1-t_1)x_2)\leq t_1f(x_1)+(1-t_1)f(x_2)$
(Some references swap the convex/concave conventions; please check for yourself.)
If $X$ is a random variable and $\phi$ is a convex function, then
$\phi( E[X]) \leq E[\phi(X)]$
e.g.,
1. $\phi(x)=(x-C)^+=max\{x-C,0\}$, $C \geq 0$. $\phi(x)$ is an increasing convex function on $[0,\infty)$
e.g.,
Let $X$ be the traffic rate and $C$ the bandwidth; then $E[\phi(X)]$ denotes the average loss rate. Assume $X_1,...,X_n$ are iid samples; then $\frac {\sum_{i=1}^n \phi(X_i)\Delta t}{n\Delta t} \to E[\phi(X)]$.

2. $\phi(x)=-ln(1-x)$ increasing convex function on $[0,1)$

3. $\phi(x)=-ln(x)$ decreasing convex function on $(0,1]$
If the function $\phi$ is concave, then
$\phi( E[X]) \geq E[\phi(X)]$
**Jensen's inequality (Random vector $\vec{X}$)**
Let $\phi$ be a convex function on a convex set $C \subset R^n$, $\vec{X}$ is a random vector with $P(\vec{X} \in C)=1$, then $E[\vec{X}] \in C$, $E[\phi(X)]$ exists and $\phi(E[\vec{X}]) \leq E(\phi(\vec{X}))$.
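A Monte Carlo sketch (Python/numpy; the exponential distribution and $\phi(x)=(x-C)^+$ are illustrative choices matching the loss-rate example above) of Jensen's inequality $\phi(E[X]) \le E[\phi(X)]$ for a convex $\phi$.
```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.exponential(scale=1.0, size=200_000)   # illustrative traffic-rate samples, E[X] = 1

C = 1.5                                        # illustrative "bandwidth"
phi = lambda x: np.maximum(x - C, 0.0)         # convex function (x - C)^+

lhs = phi(X.mean())                            # phi(E[X]) (Monte Carlo estimate of E[X])
rhs = phi(X).mean()                            # E[phi(X)]
print(lhs, rhs, lhs <= rhs)                    # Jensen: lhs <= rhs
```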
---
**linear time-invariant (LTI) system and convolution**
Let $<\delta(t), f(t)> \doteq f(0)$, i.e., $f(0)=\int f(t)\delta (t) dt$ for EE (In fact, this is **not** the Riemann integral).
Then $f(t)=<\delta(\tau-t), f(\tau)>$, by **LTI properties**.
**Notice that $f(t)=<\delta(\tau-t), f(\tau)>$ is incorrect if the system is not time-invariant.**
---
**Nyquist–Shannon sampling theorem**— If a function
$x(t)$ contains no frequencies higher than $W$ hertz, then it can be completely determined from its ordinates at a sequence of points spaced less than $\frac{1}{2W}$, i.e., sample rate
$f_{s}> 2W$
Ref. [Principles of Digital Communication, by Robert G. Gallager](https://www.mit.edu/~6.450/handouts/6.450book.pdf)
This book is demanding; without a teacher it is not easy to follow, and it requires a certain level of mathematics: it uses Lebesgue integration, $L^2$ spaces, entropy, Markov chains, the Viterbi algorithm (a dynamic programming algorithm), ...
---
Let $h(t)$ be the **impulse response**, i.e., the output when the input is $\delta (t)$. Assume $h(t)$ exists.
Assume $f(t)$ is the **band-limited** input process of the LTI system;
then the output process $g(t) =<h(t-\tau),f(\tau)>=<h(\tau), f(t-\tau)>$
$=\int h(\tau)f(t-\tau)d \tau=\int h(t-\tau)f(\tau)d \tau$,
by LTI properties.
We define $f \star h (t)=<h(t-\tau), f(\tau)>$ as the convolution of $f$ and $h$.


:::info
$f \to \fbox{LTI system: h(t)} \to f \star h$
:::
$(f \star g) \star h=f \star (g \star h)$
$f \star h=h \star f$
$f \star (g+h)=f \star g +f \star h$
**Discrete time**
Assume
$f(n)$ is the discrete-time band-limited input process of the LTI system.
Let $\delta(n) =1$ if $n=0$ and $\delta(n) =0$ if $n \neq 0$.
Let $<\delta(n), f(n)> \doteq f(0)$.
$f(n)=<\delta(m-n), f(m)>$ by LTI properties.
:::info
$\delta(n) \to \ \fbox{LTI system} \to \ h(n)$
$f(n) \to \ \fbox{LTI system: h(n)} \to f \star h (n)$
:::
Assume discrete time impulse response is given by $h(n)$,
we define $f \star h (n)=<h(n-m), f(m)>$ as the convolution of $f$ and $h$.
Then $f \star h (n)$ is the output sequence of the LTI system by LTI properties.
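A discrete-time sketch (Python/numpy; the impulse response and input below are illustrative) showing that the LTI output is the convolution $f \star h$, and that feeding in $\delta(n)$ returns $h(n)$.
```python
import numpy as np

h = np.array([1.0, 0.5, 0.25])          # illustrative impulse response h(n)
f = np.array([1.0, 2.0, 0.0, -1.0])     # illustrative input sequence f(n)

delta = np.zeros(4); delta[0] = 1.0
print(np.convolve(delta, h))            # the impulse comes back as h(n) (padded with zeros)

y = np.convolve(f, h)                   # output (f * h)(n), length len(f)+len(h)-1
# The same thing written as y(n) = sum_m f(m) h(n - m)
y_manual = [sum(f[m] * h[n - m] for m in range(len(f)) if 0 <= n - m < len(h))
            for n in range(len(f) + len(h) - 1)]
print(np.allclose(y, y_manual))         # True
```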
**Discrete Fourier transform** (DFT)
$\vec{x}=\{x_0,x_1,...,x_{N-1}\}$ complex numbers; the DFT of $\vec{x}$ is $X_k=\sum_{n=0}^{N-1} x_n e^{-2 \pi j kn /N}$
$x_n=\frac{1}{N}\sum_{k=0}^{N-1}X_ke^{2\pi jkn/N}$
$\vec{u_k}$ with entries $u_k[n]= e^{2 \pi j kn /N}$, $k=0,...,N-1$, form an **orthogonal basis** over the set of $N$-dimensional complex vectors.
Claim $<\vec{u_k}, \vec{u_j}>=N\delta_{kj}$, i.e., $< \frac{\vec{u_k}}{\sqrt{N}}, \frac{\vec{u_j}}{\sqrt{N}}>=\delta_{kj}$, where $\delta_{kj}=1$ if $k=j$ and $\delta_{kj}=0$ if $k \neq j$.

Parseval's theorem
$<\vec{x},\vec{y}>=\frac{1}{N} <\vec {X}, \vec{Y}>=\frac{1}{N}\sum_k X_kY_k^*$
If we instead define the DFT symmetrically as $X_k=\frac{1}{\sqrt{N}} \sum_{n=0}^{N-1}x_ne^{-2\pi jkn/N}$ (with the same factor in the inverse), then we have $<\vec{x},\vec{y}>=<\vec {X}, \vec{Y}>$
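A numpy check (a sketch with arbitrary random vectors) of the Parseval relation for the unnormalized DFT used above: $<\vec x,\vec y> = \frac{1}{N}<\vec X,\vec Y>$.
```python
import numpy as np

rng = np.random.default_rng(6)
N = 64
x = rng.normal(size=N) + 1j * rng.normal(size=N)
y = rng.normal(size=N) + 1j * rng.normal(size=N)

X, Y = np.fft.fft(x), np.fft.fft(y)        # unnormalized DFT: X_k = sum_n x_n e^{-2*pi*j*k*n/N}

lhs = np.vdot(y, x)                        # <x, y> = sum_n x_n * conj(y_n)
rhs = np.vdot(Y, X) / N                    # (1/N) <X, Y>
print(np.allclose(lhs, rhs))               # True (Parseval / Plancherel)
```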
In practice, we can choose $N$~$2^n$
There are two main types of DFT errors: "[aliasing](https://en.m.wikipedia.org/wiki/File:FFT_aliasing_600.gif)" and “[leakage](https://en.wikipedia.org/wiki/Spectral_leakage)"
Increase sampling rate +Window Function...
[183--FFT系列:甚麼是洩漏leakage?窗函數window如何改善洩漏leakage?](https://www.youtube.com/watch?v=51yzsu0f174)
**n-dimentional DFT**
$\vec{x}=[x_1,x_2,...,x_n]$
$\vec{n}=[n_1,n_2,...,n_n]$
$\vec{k}=[k_1,k_2,...,k_n]$
$N={N_1,N_2,...,N_n}$,
$0 \leq k_i <N_i$
$F_{\vec{x}}(\vec{k})=\sum_{n1=0}^{N_1-1}...\sum_{n_n=0}^{N_n-1}f_{\vec{x}}(n_1,n_2...,n_n) exp(-2\pi j (\frac{n_1 k_1}{N_1}+...+\frac{n_n k_n}{N_n} )).$
$f_{\vec{x}}(\vec{n})=\frac{1}{N_1 \cdots N_n} \sum_{k_1=0}^{N_1-1} ...\sum_{k_n=0}^{N_n-1} F_{\vec{x}}(\vec{k})\, exp(2\pi j \sum_{i=1}^n n_ik_i/N_i)$

2D Convolution Animation(by Michael Plotke)
[Image Filtering and Edge Detection](https://sbme-tutorials.github.io/2018/cv/notes/4_week4.html)
**Z-transform**
The Z-transform converts a discrete-time signal, which is a sequence of real or complex numbers, into a complex frequency-domain (the z-domain or z-plane) representation.
Bilateral Z-transform
$X(z)=\sum_{n=-\infty}^{\infty} x_n z^{-n}$
Unilateral Z-transform
$X(z)=\sum_{n=0}^{\infty} x_n z^{-n}$
inverse Z-transform
$x_n=\frac{1}{2\pi j}\oint_{C} X(z)z^{n-1}dz$,
where $C$ is a counterclockwise closed path encircling the origin and entirely in the region of convergence (ROC).

In control theory, a **causal** system (also known as a physical or nonanticipative system) is a system where the output depends on past and current inputs but not future inputs, i.e., the output $y(t_{0})$ depends only on the input $x(t)$ for values $t \leq t_{0}$.
[Cauchy's integral formula](https://en.wikipedia.org/wiki/Cauchy%27s_integral_formula)
[Image Analysis by Andrew Zisserman](https://www.robots.ox.ac.uk/~az/lectures/ia/)
---
[Vector Integral Calculus](https://www.feynmanlectures.caltech.edu/II_03.html?fbclid=IwAR1gcdYXAIIbZJdkJC55pBRVGrCYvRC18Xn-_VLFujsCrTIJexLG62msY90)
---
Q:
Assume the input signal is $f(t)$, $t \in [0,T]$, with $f(t) \in L^2[0,T]$; find an LTI filter that maximizes the SNR at time $T$ when the signal reaches the receiver over an AWGN channel. Please see books, wiki, or [通訊實驗matched filter結報](https://hackmd.io/@HsuChiChen/matched-filter)....
A: matched filter.
Idea: $<f+n,g>$ with $g(t)=f(T-t)$, where $n$ is white noise. (Simple amplification does not help, because an LTI amplifier amplifies the noise as well.)
The same result holds in $l^2[0,N-1]$.
---
This is the basic principle of digital communication, e.g., QAM: the multipliers and integrators you learned about are actually performing inner products. This is also one reason many graduate entrance exams include linear algebra; it matters.
The $e_i$ are known, so it suffices to transmit the $a_i$; if the receiver can recover the $a_i$, it can reconstruct $f$.
---
SNR (Signal-to-noise ratio) Why?
**Parseval's Theorem for Hilbert space (e.g., $L^2, l^2$)**
Define Fourier Transform of a real function g(x) as
$S(f) \doteq (\Phi\circ g)(f) \doteq \int g(\xi)e^{ -2 \pi j f \xi }d\xi$, where $f$ means frequency (not angular frequency).
Fourier inversion integral:
$g(t) = \int S(f) e^{2 \pi j f t }d f$
F: frequency domain
T: time domain
Then we have
$||S||_{L^2(F)}=||g||_{L^2(T)}$
and
$<g_1, g_2>_T=<\Phi\circ g_1,\Phi \circ g_2>_F= \int S_1(f) \overline{S_2(f)} df$.
$\Phi: L^2_T \to L^2_F$ is a unitary operator.
$\Phi^{-1}=\Phi^*$, since let $f_i=\Phi (g_i)$, $<g_1,\Phi^{-1} (f_2)>_T=<\Phi (g_1), f_2>_F$.
**Time scaling**
Let $a \in R, a \ne 0$, if $h(x) = g(ax)$, then $\Phi \circ h (f)=\frac{1}{|a|} (\Phi \circ g)(\frac{f}{a})$
**Differentiation**
$\Phi (g^{'})(f)=2 \pi j f\, (\Phi g)(f)$
$\Phi (g^{(n)})(f)= (2 \pi j f)^n (\Phi g)(f)$
:::info
**multi-dimensional Fourier Transform**
$S(\vec{f})=(\Phi \circ g)(\vec{f}) \doteq \int g(\vec{\xi})e^{ -2 \pi j \vec{f} \centerdot \vec{\xi} }d \vec{\xi}$
Fourier inversion integral:
$g(\vec{x})=\int S(\vec{f}) e^{2 \pi j \vec{f} \centerdot \vec{x} }d \vec{f}$,
where $\vec{a} \centerdot \vec{b} =a_1b_1+a_2b_2+...a_nb_n$
**Plancherel theory**
The Fourier transform on very nice functions $G(R^n)$ (e.g., the Schwartz class) is an $L^2$-isometry: more precisely, for any
$f, g \in G(R^n)$, we have $<f,g>=<S(f),S(g)>$
:::
**conservation of energy**
**Parseval's Theorem states that the total energy computed in the time domain must equal the total energy computed in the frequency domain. It is a statement of conservation of energy.**
[Parseval’s Theorem](https://www.storyofmathematics.com/parsevals-theorem/)
[matched filter, SNR, AWGN channel](https://courses.grainger.illinois.edu/ece361/sp2011/Newlectures/Lecture03.pdf)
Assume $X(t)$ is a zero-mean **wide-sense stationary (WSS)** random process. The auto-correlation function is given by $R_{XX}(\tau)=R_{XX}(t_1-t_2)=E[X(t_1)X(t_2)]$.
Let $P_{X}(f)=\Phi(R_{XX})(f)$, $P_X(f)$ is the **power spectral density** of $X(t)$.
Let $Y(t)$ be the output random process of an LTI system whose input process $X(t)$ is WSS; then $Y(t)$ is also a WSS random process.
**Principal Component Analysis (PCA) 主成分分析**
**Karhunen–Loève theorem(Karhunen-Loève Expansion)**
$X(t)$ is a $L^2$ random process. $X(t) \in L^2(\Omega, F,P)$, $E[X(t)]=0$, covariance function $K_X(s,t)=E[X(s)X(t)]$, $s, t \in [a,b]$. Let $T_{K_X}: f \to T_{K_X}f=\int_{a}^b K_X(s,.)f(s)ds$ be a linear operator $L^2[a,b] \to L^2[a,b]$.
Since $T_{K_X}$ is a linear operator, it makes sense to talk about its eigenvalues $\lambda_k$ and eigenfunctions $e_k$, which are found by solving the homogeneous Fredholm integral equation of the second kind
$\int_a^b K_X(s,t)e_k(s)ds=\lambda_k e_k(t)$
The covariance function $K_X$ satisfies the definition of a Mercer kernel.
An important special case
1. $K_X(s,t)=K_X(t,s)$ symmetric continuous function
2. $K_X$ is said to be positive-definite if and only if $\sum \sum K(x_i,x_j)c_i c_j >0$ for all finite sequences of points $x_1, ..., x_n$ of $[a, b]$ and all choices of real numbers $c_1, ..., c_n$.
**Remark.** An $N \times N$ symmetric real matrix
$A$ is said to be positive-definite if $x^TAx >0$ for all non-zero $x \in R^N$. Assume $Ax =\lambda x$ with $x \ne 0$; then $x^TAx = \lambda x^Tx >0$, so **we have $\lambda >0$**.
e.g.,
$I=\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 &0\\0 & 0&1 \end{pmatrix}$ is positive definite.
**<span class="blue">eigenvalue's geometric multiplicity</span>**
The **geometric multiplicity** of an eigenvalue is the **dimension** of the linear space of its associated eigenvectors (i.e., its <span class="red">eigenspace</span>).
The dimension of the eigenspace $E$ associated with $λ$, or equivalently the maximum number of linearly independent eigenvectors associated with $λ$, is referred to as the eigenvalue's geometric multiplicity $\gamma _{A}(\lambda )$. Because E is also the nullspace of $(A − λI)$, the **geometric multiplicity** of $λ$ is the dimension of the nullspace of $(A − λI)$, also called the nullity of $(A − λI)$, which relates to the dimension and rank of $(A − λI)$ as $\gamma_{A}(\lambda)=n-rank(A-\lambda I)$
By **Mercer's theorem** (too hard for EE), there consequently exists a set $\lambda_k, e_k(t)$ of eigenvalues and eigenfunctions of $T_{K_X}$ forming an orthonormal basis of $L^2([a,b])$, and $K_X$ can be expressed as
1. $K_{X}(s,t)=\sum _{{k=1}}^{\infty }\lambda _{k}e_{k}(s)e_{k}(t)$.
2. $X(t)=\sum_{k=1}^{\infty} Z_k e_k(t)$ converge in $L^2$
3. $Z_k= \int_a^b X(t)e_k (t)dt$. Furthermore,
4. $E[Z_k]=0$ and $E[Z_iZ_j]=\delta_{ij}\lambda_j$
5. Sorting the eigenvalues in decreasing order: $\alpha_1 \geq \alpha_2 \geq \cdots \geq \alpha_n \geq \cdots$, where $\alpha_i \in \{ \lambda_1, ..., \lambda_n,... \}$
[Implementing a Principal Component Analysis (PCA) 主成分分析](https://sebastianraschka.com/Articles/2014_pca_step_by_step.html)
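A minimal PCA sketch (Python/numpy; the 2-D correlated data are synthetic) following the recipe above: form the sample covariance matrix, take its eigen-decomposition, and sort the eigenvalues in decreasing order to get the principal components.
```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
z = rng.normal(size=(n, 2))
X = z @ np.array([[3.0, 0.0],
                  [1.0, 0.5]])               # synthetic correlated 2-D data (rows = samples)

Xc = X - X.mean(axis=0)                      # center the data
C = Xc.T @ Xc / n                            # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)         # ascending eigenvalues, orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]            # sort in decreasing order (step 5 above)
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                        # projections onto the principal components
print("explained variance ratio:", eigvals / eigvals.sum())
```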
**Discrete Fourier Transform(DFT)**
$u_k[n]=e^{ \frac{2j \pi}{N}kn}$, $n=0,1,...,N-1$, form an orthogonal basis over the set of $N$-dimensional complex vectors.
$\vec{x}=[x_0,x_1,...,x_{N-1}]^T$, $\vec{u}_k=[u_k[0],u_k[1],...,u_k[N-1]]^T$
$X_k= <\vec{x},\vec{u}_k> = \vec{x}^T \vec{u}_k^*$ is the $k$-th DFT coefficient of $\vec{x}$.
---
**$\nabla f(\vec{x})$** and **mathematical optimization**

**Claim.** $\nabla f(\vec{x})$ is **perpendicular** to the level curve (level surface) $f(\vec{x})=c$.
**Proof.**
Assume $\vec{x} \in R^n$; a hyperplane with a normal vector $\vec{n}$ through the point $\vec{x}_0$ is given by $<\vec{x}-\vec{x}_0,\vec{n}>=0$.
Hence $<\vec{x}-\vec{x}_0,\nabla f(\vec{x}_0)>=0$ defines the tangent plane, and $\nabla f(\vec{x}_0)$ is **perpendicular** to the level curve $f(\vec{x})=c$ at $\vec{x}_0$.
**Q.E.D.**
**You may need to assume existence first (that is the hard part); the method itself is correct.** (**What is written on wiki is not necessarily correct, and it is best to read the English version.**)
[Lagrange multiplier](https://en.wikipedia.org/wiki/Lagrange_multiplier)
$\vec{x}=[x_1,x_2,...,x_n], \vec{\lambda}=[\lambda_1,...,\lambda_m], g_i(\vec{x})=0, i=1,...,m, \ \vec{g}(\vec{x})=[g_1, ...,g_m]$. Find the local maxima and minima of $f(\vec{x})$ via
Let $L(\vec{x},\vec{\lambda})=f(\vec{x})-<\vec{\lambda}, \vec{g}(\vec{x})>$
$\bigtriangledown_{\vec{x},\vec{\lambda}}L(\vec{x},\vec{\lambda})=0$. $\lambda_1,...,\lambda_m$ Lagrange multiplier
[Lagrange Multipliers~examples](https://tutorial.math.lamar.edu/classes/calciii/lagrangemultipliers.aspx)
[Karush–Kuhn–Tucker conditions](https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions)
---
Taylor series with remainder

The Taylor polynomials for $ln(1 + x)$ only provide accurate approximations in the range $−1 < x ≤ 1$. **For $x > 1$, Taylor polynomials of higher degree provide worse approximations.**
$f(x)=f(a)+f^{'}(a)(x-a)+\frac{f^{''}(a)}{2!}(x-a)^2+R(x-a)$, where $\lim_{h \to 0}\frac{R(h)}{h^2}=0$.
$f :R^n \to R$ is differentiable at $a$ if there exist a linear functional $L:R^n \to R$ and a function $h:R^n \to R$ such that $f(x)=f(a)+L(x-a)+h(x)||x-a||$, where $\lim_{x \to a}h(x)=0$; then $L=df(a)$.
$df(a)(v)=\sum_{i =1}^{n}\frac{\partial f}{\partial x_i}{v_i}=<\nabla f,v>$.
A linear functional on a real vector space $V$ is a function $T:V \to R$ which satisfies the following properties:
1. $T(v+w)=T(v)+T(w)$, and
2. $T(\alpha v)=\alpha T(v)$.
[Taylor Polynomials of Functions of Two Variables](https://math.libretexts.org/Bookshelves/Calculus/Supplemental_Modules_(Calculus)/Multivariable_Calculus/3%3A_Topics_in_Partial_Derivatives/Taylor__Polynomials_of_Functions_of_Two_Variables?fbclid=IwY2xjawGdTrxleHRuA2FlbQIxMQABHfPg1ZVkbeaHhwGMWTFkVNzUOQ95xdZrhWpr35QLqFF05OTmj-04g2EmGg_aem_40VU5XtLhZR5FVhHnIyc_w)
This good book is suitable only for EECS readers with strong mathematics (PhD level and above):
Gâteaux derivative:
[Optimization by Vector Space Methods: Luenberger, David G.](https://sites.science.oregonstate.edu/~show/old/142_Luenberger.pdf)
---
[Affine transformation 仿射轉換, $y=Ax+b$](https://en.wikipedia.org/wiki/Affine_transformation)


An affine transformation preserves:
1. **collinearity between points:** three or more points which lie on the same line (called collinear points) continue to be collinear after the transformation.
2. **parallelism**: two or more lines which are parallel, continue to be parallel after the transformation.
3. **convexity of sets**: a convex set continues to be convex after the transformation. Moreover, the extreme points of the original set are mapped to the extreme points of the transformed set.
4. **ratios of lengths of parallel line segments**
5. **barycenters (center of mass) of weighted collections of points**
A better book (it uses functional analysis) is:
Optimization by Vector Space Methods, David G. Luenberger
The Laplacian of $f$: $\Delta f =\sum_i \frac{\partial^2 f}{\partial x_i^2}$. The Laplacian is invariant under all Euclidean transformations: rotations and translations.
[邊緣偵測 - 拉普拉斯算子 ( Laplacian Operator )](https://medium.com/%E9%9B%BB%E8%85%A6%E8%A6%96%E8%A6%BA/%E9%82%8A%E7%B7%A3%E5%81%B5%E6%B8%AC-%E6%8B%89%E6%99%AE%E6%8B%89%E6%96%AF%E7%AE%97%E5%AD%90-laplacian-operator-ea877f1945a0)
[OpenCV 基礎篇-邊緣檢測(edge detection)](https://hackmd.io/@cws0701/SJiFl5Ghq)
---
**Statistical Inference, detection and estimation**
**<h1>Linear regression(線性迴歸)</h1>**
---
**Least-Squares Estimate and Projection Theorem**
$y$ is an $m$-vector and $W=[w_1,w_2,...,w_n]$ is an $m \times n$ matrix with linearly independent columns $(m \geq n)$. Then there is a unique $n$-vector $\hat{\beta}$ which minimizes $||y-W\beta||$ over all $\beta$, and the estimator is $\hat{\beta}=(W^TW)^{-1}W^Ty$ if the inverse matrix exists.
Proof.
By the projection theorem and the linear independence of $\{w_1,w_2,...,w_n\}$, $z=\hat{p}y$ is the projection of $y$ onto the span of $\{w_1,w_2,...,w_n\}$, where $\hat{p}$ is the projection operator. Write $y=x+z$ with $x$ orthogonal to that span; then $<w_i,y>=<w_i,z>$, $i=1, 2, ...,n$.
Hence $W^Ty=W^Tz=W^TW\hat{\beta}$. If $(W^TW)^{-1}$ exists, then $\hat{\beta}=(W^TW)^{-1}W^Ty$.
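A numpy sketch (synthetic $W$, $\beta$, and noise) of the least-squares estimate derived above, comparing the closed form $\hat\beta=(W^TW)^{-1}W^Ty$ with `np.linalg.lstsq`.
```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 50, 3
W = rng.normal(size=(m, n))                  # m x n, columns linearly independent (generically)
beta_true = np.array([1.0, -2.0, 0.5])       # illustrative true coefficients
y = W @ beta_true + 0.1 * rng.normal(size=m) # noisy observations

beta_hat = np.linalg.inv(W.T @ W) @ W.T @ y  # closed-form projection-theorem solution
beta_lstsq, *_ = np.linalg.lstsq(W, y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))     # True
```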
**Gauss-Markov estimate**
An $n\times n$ symmetric real matrix $M$ is said to be positive-definite if $x^TMx>0$ for all non-zero $x$ in $R^n$.
$\epsilon$ is a r.v., e.g., $\epsilon$~ $N(0,\sigma^2)$
$y=W\beta+\epsilon$, $E[\epsilon]=0$ and $Q=E[\epsilon \epsilon^T]$. Assume $Q$ is positive definite.

Proof.
$<x,y>_Q:=x^TQy=<x,Qy>$. Then $<x,y>_Q$ is an inner product.
Hence we can find the estimator **$\hat{\beta}$** by projection theorem.
A square matrix $A$
$A$ is Hermitian if and only if it is equal to its adjoint, that is, it satisfies $<Ax,y>=<x,Ay>$, i.e., $A=A^H$, $x,y \in C^n$.
**e.g., [Covariance (共變異數, 又稱協方差) matrix](https://en.wikipedia.org/wiki/Covariance_matrix)**
$A$ square matrix
$A$ is **Hermitian** if and only if it is unitarily diagonalizable with real eigenvalues.
A matrix $Q$ is **positive-definite** iff $Q$ is symmetric or Hermitian, and **all its eigenvalues are real and positive**.
**HW. Why?**
The estimator is said to be **unbiased** if and only if $E[\hat{\beta}]=\beta$.

**Kalman Filter (recursive estimation)**
Probability space $(\Omega,\mathcal{F},P)$
n-dimensional dynamic model of a random process on $L^2(P)$
1. $x(k):\Omega \to R^n, u(k):\Omega \to R^n$, $x(k+1)=\Phi(k)x(k)+u(k)$, $\Phi(k) \sim n \times n$ matrix, $E[u(k)]=0$ and $E[u(k)u(l)^{T}]=Q(k)\delta_{kl}$.
2. $\hat{x(0)}$ initial estimate of the initial random vector $x(0)$
3. $P(0)=E[(\hat{x}(0)-x(0))(\hat{x}(0)-x(0))^T]$
4. Measurement of the process, assumed to be of the form $v(k)=M(k)x(k)+w(k)$, where $M(k)$ is an $m \times n$ matrix and $w(k)$ is an $m$-dimensional measurement error having zero mean and $E[w(k)w(j)^T]=R(k)\delta_{kj}$, where $R(k)$ is positive definite.
5. $\hat{x}(k|j)$ the optimal estimate of $x(k)$ given the observation $v(0), v(1),...v(j)$. Let $V(j)=$ span $\{ v(0), v(1),...,v(j) \}$
6. covariance matrix $P(k)=E[(\hat{x}(k|k-1)-x(k))(\hat{x}(k|k-1)-x(k))^T]$
7. Assume $u(k)$ is orthogonal to $v(k)$ and $x(k)$.
8. If $Y_1, Y_2 \subseteq L^2(P)$, the projection of a vector $\beta$ onto $Y_1+Y_2=Y_1 \oplus \hat{Y}_2$, where $\hat{Y_2}$ is orthogonal to $Y_1$. Hence we can find the projection $\beta$ onto the subspace of $Y_1+Y_2$, denoted $\hat{\beta}$.
9. The optimal estimate of $\Phi(k)x(k)$ is $\Phi(k)\hat{x}(k|k)$, since $u(k)$ is orthogonal to $v(k)$ and $x(k)$.
10. Get the optimal estimate $\hat{x}(k+1|k)$ of $x(k+1)=\Phi(k)x(k)+u(k)$, given the observations up to $v(k)$.
Ref. Optimization by Vector Space Methods. David G. Luenberger
The key step of the proof is the **Projection Theorem**; the formulas are long, so I will not type them out. A scalar numerical sketch follows.
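A scalar ($n=m=1$) Kalman-filter sketch in Python/numpy, written only to illustrate the predict/update recursion; the model parameters ($\Phi$, $Q$, $M$, $R$) and noise levels are made-up, and this is not the full vector derivation referenced above.
```python
import numpy as np

rng = np.random.default_rng(9)
Phi, Q = 1.0, 0.01          # state transition and process-noise variance (illustrative)
M, R = 1.0, 0.25            # measurement model and measurement-noise variance (illustrative)

x = 0.0                     # true (hidden) state
x_hat, P = 0.0, 1.0         # initial estimate and its error variance

for k in range(50):
    # simulate the system: x(k+1) = Phi*x(k) + u(k),  v(k) = M*x(k) + w(k)
    x = Phi * x + rng.normal(scale=np.sqrt(Q))
    v = M * x + rng.normal(scale=np.sqrt(R))

    # predict
    x_pred = Phi * x_hat
    P_pred = Phi * P * Phi + Q
    # update (projection onto the span of past + new observations)
    K = P_pred * M / (M * P_pred * M + R)      # Kalman gain
    x_hat = x_pred + K * (v - M * x_pred)
    P = (1.0 - K * M) * P_pred

print("final estimate", x_hat, "true state", x, "error variance", P)
```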
---
For optimization, convexity and concavity are very important for the existence of local minima and local maxima (note that some articles swap the two terms).
Distribution functions are left-continuous in some conventions and right-continuous in others.

**concave function**
---
**[convex function](https://en.wikipedia.org/wiki/Convex_function)**

convex set

non-convex set
Let f be a function of many variables defined on a **convex set** $S$. Then **$f$ is convex** if for all $x \in S$, for all $y\in S$ and all $\alpha \ \in (0,1)$ we have $f(\alpha x+(1-\alpha)y) \leq \alpha f(x)+(1-\alpha)f(y)$



<h1>maximum likelihood test</h1>
**likelihood**: the chance that something will happen
In probability theory, **odds** provide a measure of the likelihood of a particular outcome, e.g., $odds=\frac{p}{1-p}$, $p \in (0,1)$

$-ln(1-x)=x+\frac{x^2}{2}+\frac{x^3}{3}+...$ convex function on $[0,1)$

$-ln(x)$ convex function on $(0,1]$
**log-convex**
A function $f(x):D \subset R^n \to R$ is log-convex if $f(\theta x +(1-\theta)y) \leq f(x)^{\theta}f(y)^{1-\theta}$, where $D$ is a convex set and $0 < \theta <1$, i.e., $\log f(x)$ is convex on $D$.
**<span class="red">Key Point:</span>**
A function $f$ of many variables defined on a convex set $S$ is convex **if and only if** the set of points on or above its graph is convex: **$\{(x, y): x ∈ S \ and \ y ≥ f(x) \}$ is convex.**
**Jensen's inequality**
A function $f$ of many variables defined on a **convex set** $S$ is convex if and only if for all $n ≥ 2$ $f(λ_1x_1 + ... + λ_nx_n) \leq λ_1f(x_1) + ... + λ_nf(x_n)$ for all $x_1 ∈ S, ..., x_n ∈ S$ and all $λ_1 ≥ 0, ..., λ_n ≥ 0$ with $\sum \lambda_i =1$. Let $y_i=f(x_i)$, then $f(<\vec{\lambda}, \vec{x}>) \leq < \vec{\lambda}, \vec{y}>$
[3.3 Concave and convex functions of many variables](https://mjo.osborne.economics.utoronto.ca/index.php/tutorial/index/1/cvn/t)
Bernoulli sampling
$B(k)=p^k(1-p)^{n-k}$
$f(k)=ln(B(k))=k ln(p)+(n-k)ln(1-p)$
$p=0.1,n=100, f(k)$

$g(x)=x^k(1-x)^{n-k}$
$n=50, k=6, g(x)$

---
$S$ is a binary random variable with a-$\bf{priori}$ probabilities $P(S=a_0)=p_0$ and $P(S=a_1)=p_1$.
The receiver of a communication system observes a random variable $R$ (**maybe** a continuous random variable).
$\hat{S}$ denotes the estimator of $S$.
The conditional densities $f_{R|S}(R=r|a_i), i \in \{0,1\}$, are called $\bf{likelihoods}$. $f_R(r)=p_0 f_{R|S}(r|a_0)+p_1 f_{R|S}(r|a_1)$. The a-$\bf{posteriori}$ probability of $S$ is given by $p_{S|R}(a_i|r)=\frac{p_i f_{R|S}(r|a_i)}{f_R(r)}$ (Bayes' theorem).
Let $\Lambda (r)=\frac{f_{R|S}(r|a_0)}{f_{R|S}(r|a_1)}$ and threshold $\eta =p_1/p_0$.
$\Lambda (r)$ is called the $\bf{likelihood}$ $\bf{ratio}$.
$\hat{S}=a_0$ if $\Lambda (r) \geq \eta$.
$\hat{S}=a_1$ if $\Lambda (r) < \eta$.
This is called a $\bf{maximum \ likelihood \ test}$ when $p_0=p_1$ (i.e., $\eta=1$); with unequal priors it is the MAP test.
$\hat{S}$: **maximum likelihood estimator**
**Probability of error $P(e)$**
$P(e|S=a_0)=P(\hat{S}=a_1|S=a_0)=P(\Lambda(r)<\eta|S=a_0)$
$P(e|S=a_1)=P(\hat{S}=a_0|S=a_1)=P(\Lambda(r) \geq \eta|S=a_1)$
Hence,
$P(e)=p_0P(\Lambda(r)<\eta|S=a_0)+p_1P(\Lambda(r) \geq \eta|S=a_1)$
**In practice, beware of arithmetic overflow/underflow:**
$\Lambda(r)$ can be very small, $e.g. \ 10^{-9}$
The **log likelihood ratio**, $LLR(r)=ln(\Lambda(r))$ is convenient for use with Gaussian noise with $Z = N(0,N_0/2)$.
$V=S+Z$, $f_{V|S}(v|a_0)=\frac{1}{\sqrt{(\pi N_0)}}exp(\frac{-(v-a_0)^2}{N_0})$
$\Lambda(v)=exp(\frac{-(v-a_0)^2 +(v-a_1)^2}{N_0})$
Hence,
$LLR(v)=\frac{(v-a_1)^2 -(v-a_0)^2}{N_0}=\frac{||v-a_1||^2-||v-a_0||^2}{N_0}=\frac{2va_0-2va_1+a_1^2-a_0^2}{N_0}$
$R^N$ space
e.g., $\vec{S}=[S_0,S_1, ...,S_{N-1}]$ and $\vec{R}=[R_0, R_1,...,R_{N-1}]$. Assume $\vec{R}=\vec{S}+\vec{Z}$, where $\vec{Z}=[Z_0,Z_1,..., Z_{N-1}]$ and the $Z_i$ form an iid random sequence with $N(0, \sigma^2)$ distribution.
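A simulation sketch (Python; the levels $a_0, a_1$ and noise variance are illustrative) of the threshold/LLR detector above with equal priors, comparing the simulated error rate to the theoretical $Q\!\left(\frac{|a_0-a_1|}{2\sigma}\right)$ with $\sigma^2=N_0/2$.
```python
import math
import numpy as np

rng = np.random.default_rng(10)
a0, a1 = +1.0, -1.0            # illustrative antipodal levels
N0 = 0.5                       # noise PSD, so sigma^2 = N0/2
sigma = math.sqrt(N0 / 2)

n = 200_000
bits = rng.integers(0, 2, size=n)            # 0 -> a0, 1 -> a1, equal priors
s = np.where(bits == 0, a0, a1)
v = s + rng.normal(scale=sigma, size=n)      # v = s + Gaussian noise

# With equal priors the threshold test reduces to comparing v with (a0 + a1)/2
decide_a1 = v < (a0 + a1) / 2
errors = np.mean(decide_a1 != (bits == 1))

Q = lambda x: 0.5 * math.erfc(x / math.sqrt(2))
print("simulated P(e) =", errors)
print("theoretical Q(|a0-a1|/(2*sigma)) =", Q(abs(a0 - a1) / (2 * sigma)))
```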
**MAP (Maximum A Posterior Estimation)**
$arg \ max_{x \in S} f(x) \doteq \{x \in S: f(s) \leq f(x) \ for \ all \ s \in S\}$
e.g., $arg \ max_{x \in R} (1-|x|)=\{0\}$, i.e., at $x=0$
Assume $X_1, X_2, ...,X_n$ IID, $\hat{\theta}_{MAP}=arg \ max _{\theta} f(\theta|X_1, X_2, ..., X_n)=arg \ max _{\theta} f(X_1, X_2, ...,X_n |\theta)g(\theta)/h(X_1,X_2, ...,X_n)$
$=arg \ max _{\theta} \prod f(X_i|\theta)g(\theta)$, since IID.
Ref. [Principles of Digital Communication, by Robert G. Gallager
](https://www.mit.edu/~6.450/handouts/6.450book.pdf)
---
**Sufficient statistic** 充分統計量
A **statistic** is a function $T = f(X_1, X_2, \cdots , X_n)$ of the random sample $X_1, X_2, \cdots , X_n$.
In statistics, a statistic is **sufficient** with respect to a statistical model and its associated unknown parameter if "no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter"
Roughly, given a set $X$ of independent identically distributed data conditioned on an unknown parameter $\theta$ , a sufficient statistic is a function $T(X)$ whose value contains all the information needed to compute any estimate of the parameter (e.g., a maximum likelihood estimate).
A statistic $t = T(X)$ is sufficient for underlying parameter θ precisely if the conditional probability distribution of the data $X$, given the statistic $t = T(X)$, does not depend on the parameter $θ$.
e.g., sample mean $\hat{\mu}=T(\vec{X};\mu)=\frac{\sum_{i=1}^N X_i}{N}$
Suppose we have a statistical model, parameterized by a real number $θ$, giving rise to a probability distribution for observed data, $P_{\theta }(x)=P(x\mid \theta )$, and a statistic $\hat{\theta}$ which serves as an estimator of $θ$ based on any observed data $x$. The bias of the estimator is
$bias(\hat{\theta}) \doteq E_{x\mid \theta }[\hat{\theta}]-\theta$,
where $E_{x\mid \theta }[ . ]$ denotes expected value over the distribution $P(x\mid \theta )$. An estimator is said to be **unbiased** if its bias is equal to zero for all values of parameter $θ$.
e.g.,
**sample mean** $\hat{\mu}$: **unbiased** estimator.
**sample variance I** $S^2 = \frac{1}{N} \sum_{i=1}^N (X_i-\hat{\mu})^2$: biased estimator, since $E[S^2]=(1-\frac{1}{N})\sigma^2 <\sigma^2$
**sample variance II** $V^2 = \frac{1}{N-1} \sum_{i=1}^N (X_i-\hat{\mu})^2$ **unbiased** estimator.
**statistic** $T_{max}=max \{X_1, ..., X_N\}$
HW. Find $E[T_{max}]$ if the $X_i$ are iid and $uniform(0,\theta)$.
A:$\frac{N}{N+1} \theta$
Hint:
$\{\omega \in \Omega| T_{max}(\omega) \leq x \}$ if and only if $\{\omega \in \Omega |X_i(\omega) \leq x, for \ all \ i=1,...,N\}$
**statistic** $T_{min}=min \{X_1, ..., X_N\}$
**Nonparametric statistics(無母數統計)**
Nonparametric statistics is a type of statistical analysis that does not rely on the assumption of a specific underlying distribution (such as the normal distribution), or any other specific assumptions about the population parameters (such as mean and variance).
**Fisher–Neyman factorization theorem**
If the probability density function is $ƒ_θ(x)$, then $T$ is **sufficient** for $θ$ **if and only if** nonnegative functions $g$ and $h$ can be found such that $f_{\theta }(x)=h(x) g_{\theta }(T(x))$ i.e. the density $ƒ$ can be factored into a product such that one factor, $h$, does not depend on $θ$ and the other factor, which does depend on $θ$, depends on $x$ only through $T(x)$.
e.g., $f(x_1,x_2,...,x_n; \lambda)=f(x_1;\lambda)...f(x_n;\lambda)$
# **Statistical Hypothesis Testing(統計假說檢定)**
(a) A statistical hypothesis.
(b) A test of a hypothesis against an alternative hypothesis and the associated concept of the critical region of the test.
(c) The power of a test.
**Definition 1**. Let C be that subset of the sample space which, in accordance with a prescribed test, leads to the **rejection** of the hypothesis under consideration. Then C is called the critical region of the test.
**Definition 2.** The power function of a test of a statistical hypothesis $H_0$ **against** an alternative hypothesis $H_1$ is that function, defined for all distributions under consideration, which yields the probability that the sample point falls in the critical region $C$ of the test, that is, a function that yields the probability of rejecting the hypothesis under consideration. The value of the power function at a parameter point is called the power of the test at that point.
**Definition 3**. Let $H_0$ denote a hypothesis that is to be tested against an alternative hypothesis $H_1$ in accordance with a prescribed test. The significance level of the test (or **the size of the critical region C**) is the maximum value (actually supremum) of the power function of the test when $H_0$ is true.
**Type II error** (false negative, 偽陰性): the test result says you don't have coronavirus, but you actually do.
**Type I error** (false positive, 偽陽性): the test result says you have coronavirus, but you actually don't.
(A false positive is a case judged positive although actually negative; a false negative is a case judged negative although actually positive.)
**Neyman-Pearson Theorem**
Let $\vec{X}=X_1, X_2,...,X_n$ denote a **random sample** from a distribution with density function $f(x;\theta)$. Then the joint p.d.f. of $\vec{X}=X_1, X_2,...,X_n$ is $L(\vec{x};\theta)=f(x_1;\theta)....f(x_n;\theta)$. Let $k>0$ and $\Omega=\{\theta_1, \theta_2\}$. Let $C$ be a subset of the sample space such that:
1. $\Lambda(\vec{x};\theta_1,\theta_2)=\frac{L(\vec{x};\theta_1)}{L(\vec{x};\theta_2)} \leq k$, $\vec{x} \in C$
2. $\frac{L(\vec{x};\theta_1)}{L(\vec{x};\theta_2)} \geq k$, $\vec{x} \in C^*$
3. power function (檢定力): $\alpha=P(\vec{X} \in C; H_0)$ (**the probability of falsely rejecting the $H_0$ hypothesis is $\alpha$**: **type I error**, false positive, 偽陽性)
4. level of significance (significance level) $\alpha$ (顯著水準)
Then $C$ is the **best critical region of size $\alpha$** for testing the simple hypothesis $H_0: \theta=\theta_1$ against $H_1:\theta=\theta_2$
**Example**
$f(x;\theta)=\frac{1}{\sqrt{2\pi}} exp(-(x-\theta)^2 /2)$
$H_0: \theta=\theta_0=0$
$H_1: \theta=\theta_1=1$
$\Lambda(\vec{x};0,1)=exp(-\sum x_i+n/2)$
If $k>0$, then $\sum x_i \geq n/2-ln \ k=c$
The ciritical region $C=\{ \vec{x}: \sum x_i \geq c\}$
Assume $n$ samples, $c_1=c/n$, $\bar{X}=\sum_{i=1}^n X_i /n$; under $H_0$, $\bar{X}$~$N(0,1/n)$.
$P(\bar{X} \geq c_1;H_0)=\alpha$ (**type I error**); hence, given $\alpha$, one can find $c_1$ and then compute the probability $P(\bar{X} \geq c_1;H_1)= \frac{1}{\sqrt{2\pi /n}}$ $\int_{c_1}^{\infty}exp(- (\bar{x}-1)^2/(2/n))\,d\bar{x}$
1. density function
2. $H_0$ $H_1$
3. level of significance $\alpha$ (顯著水準)
4. Type I error, Type II error
5. sample size
If $n=25$ and $\alpha=0.05$, then $c_1 \approx 0.329$, $k \approx exp(n/2-nc_1)$, and the power of the best test of $H_0$ against $H_1$ is $0.05$ when $H_0$ is true. $P(\bar{X} \geq c_1;H_1) \approx 0.999$ when $H_1$ is true.
**Type II error**: $\beta = P(Fail \ to \ Reject \ H_0 | H_0 \ is \ False)=1-P(\bar{X} \geq c_1;H_1)$
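A short check (Python standard library `statistics.NormalDist`, no extra packages) of the numbers in the example above: with $n=25$ and $\alpha=0.05$, the cutoff is $c_1\approx 0.329$, the power under $H_1$ is $\approx 0.999$, and $\beta$ is its complement.
```python
from statistics import NormalDist

n, alpha = 25, 0.05
sd = (1 / n) ** 0.5                      # std. dev. of the sample mean (variance 1/n)

H0 = NormalDist(mu=0.0, sigma=sd)        # distribution of X-bar under H0
H1 = NormalDist(mu=1.0, sigma=sd)        # distribution of X-bar under H1

c1 = H0.inv_cdf(1 - alpha)               # P(X-bar >= c1 ; H0) = alpha
power = 1 - H1.cdf(c1)                   # P(X-bar >= c1 ; H1)
beta = 1 - power                         # Type II error

print("c1 =", c1)                        # ~0.329
print("power =", power)                  # ~0.9996
print("beta =", beta)
```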
---
**Proof.**
If there is another critical region $A$ of size $\alpha$, then
claim $\int_C L(\theta_2)-\int_A L(\theta_2) \geq 0$
$\int_C L(\theta_2)-\int_AL(\theta_2)$
$=\int_{C \bigcap A} L(\theta_2)+\int_{C \bigcap A^* }L(\theta_2)-\int_{A \bigcap C}L(\theta_2)-\int_{A \bigcap C^*} L(\theta_2)$
$=\int_{C \bigcap A^*}L(\theta_2)-\int_{A \bigcap C^*} L(\theta_2)$
Since $L(\theta_2) \geq \frac{1}{k} L(\theta_1)$ at each point of $C$ and $L(\theta_2) \leq \frac{1}{k} L(\theta_1)$ at each point of $C^*$, then
$\int_{C \bigcap A^*}L(\theta_2)-\int_{A \bigcap C^*} L(\theta_2) \geq \frac{1}{k} (\int_{C \bigcap A^*} L(\theta_1)-\int_{A \bigcap C^*} L(\theta_1))$
$= \frac{1}{k} (\int_{C \bigcap A^*} L(\theta_1)+\int_{C \bigcap A} L(\theta_1)-\int_{A \bigcap C} L(\theta_1)-\int_{A \bigcap C^*} L(\theta_1))$
$=\frac{1}{k}(\int_C L(\theta_1)- \int_A L(\theta_1))=\frac{1}{k}(\alpha - \alpha)=0$. Hence $\int_C L(\theta_2)-\int_A L(\theta_2) \geq 0$.
If the random variables are of the discrete type, the proof is the same, with integration replaced by summation.
[Type I & Type II Errors | Differences, Examples, Visualizations](https://www.scribbr.com/statistics/type-i-and-type-ii-errors/)
[Neyman-Pearson Lemma: Definition](https://www.statisticshowto.com/neyman-pearson-lemma/)
[Introduction to Mathematical Statistics](https://online.stat.psu.edu/stat415/section/6)
[Neyman–Pearson lemma](https://en.wikipedia.org/wiki/Neyman%E2%80%93Pearson_lemma)
[A First Course on Statistical Inference](https://bookdown.org/egarpor/inference/)
[Confidence interval(信賴區間)](https://en.wikipedia.org/wiki/Confidence_interval)
[Confidence Intervals](https://www.baeldung.com/cs/confidence-intervals-statistics)
[Understanding the Confusion Matrix (II)](https://dev.to/overrideveloper/understanding-the-confusion-matrix-264i)
Ref. Introduction to Mathematical Statistics, R. V. Hogg and A. T. Craig
---
Successive Approximation
**Taylor's formula with remainder**
[TAYLOR’S THEOREM WITH REMAINDER](https://courses.lumenlearning.com/calculus2/chapter/taylors-theorem-with-remainder/)
---
During class breaks I used to like asking students this question: events A, B are probabilistically independent means $P(A B)=P(A)P(B)$. Why?
Usually some answered "it's a definition", and many just gave blank stares~
---
A:
$n$: number of trials
$n_A$: number of times event $A$ occurs; $n_B$: number of times event $B$ occurs
$n_{AB}$: number of times $A \cap B$ occurs
$\frac{n_A}{n} \sim P(A)$
$\frac{n_B}{n} \sim P(B)$
$\frac{n_{A \cap B}}{n} \sim P(A \cap B)$,
If whether event $B$ occurs does not affect the statistics of event $A$, i.e.,
$\frac{n_{A \cap B}}{n_B} \sim P(A)$, when $n_B \to \infty$ (i.e., P(A|B)=P(A) conditional probability)
$\frac{n_{A \cap B}}{n} \sim \frac{n_B}{n} \frac {n_A}{n}$, as $n_A, n_B, n \to \infty$
Hence,
$P(A B)=P(A)P(B)$: events $A$, $B$ are (probabilistically) independent
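A small Monte Carlo sketch of the frequency argument above (illustrative only; the event probabilities 0.3 and 0.5 are assumed values):

```python
# For two independent simulated events, n_{A∩B}/n approaches (n_A/n)(n_B/n).
import random

random.seed(0)
n = 200_000
nA = nB = nAB = 0
for _ in range(n):
    a = random.random() < 0.3   # event A, P(A) = 0.3 (assumed)
    b = random.random() < 0.5   # event B, P(B) = 0.5 (independent draw)
    nA += a
    nB += b
    nAB += a and b

print(nAB / n, (nA / n) * (nB / n))   # both ~ 0.15 = P(A)P(B)
```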
[Probability measure by CrazyDog](https://hackmd.io/@pingulinux/probspace)
---
Another question: what is the most important theorem in calculus? Why? Blank stares...
A:
微積分基本定理(Fundamental theorem of calculus)
Besides linear algebra, in which of your courses is an orthonormal basis used? Blank stares...
$B$ is an orthonormal basis of $H$ (an $L^2$ or $l^2$ space); then
$\forall \vec{x} \in H$, $\vec{x}=\sum_{\vec{e} \in B } <\vec{x},\vec{e}>\vec{e}$.
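A small finite-dimensional sketch of this expansion (illustrative, using numpy; the orthonormal basis comes from a QR factorization of a random matrix, an assumption of this example):

```python
# x = sum_i <x, e_i> e_i for an orthonormal basis {e_i} of R^5.
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # columns of Q: an orthonormal basis
x = rng.standard_normal(5)

coeffs = Q.T @ x                    # <x, e_i> for each basis vector e_i
x_rebuilt = Q @ coeffs              # sum_i <x, e_i> e_i
print(np.allclose(x, x_rebuilt))    # True
```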
**Simulation:**
Assume $X$ is a random variable, $F_X(x)=P(X \le x)$.
Let (generalized inverse function) $F_X^{-1} (u) =\inf_x \{x| F_X(x)\geq u\}$, where $0<u<1$.
Assume $U$ is a uniform random variable on $[0,1]$.
Then $P(F_X^{-1}(U) \leq x) = P(U \leq F_X(x))=F_X(x)$.
e.g., $X$ exponential distribution, $F_X(x)=1-e^{-\lambda x}$, $x \geq 0$.
$F_X^{-1}(u)=-ln(1-u)/\lambda$
[Inverse transform sampling](https://en.wikipedia.org/wiki/Inverse_transform_sampling)
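A minimal Python sketch of the exponential example above (illustrative only; $\lambda=2$ is an assumed value):

```python
# Inverse transform sampling: X = F^{-1}(U) = -ln(1-U)/lambda with U ~ Uniform[0,1).
import random
import math

def sample_exponential(lam: float) -> float:
    u = random.random()                 # U ~ Uniform[0, 1)
    return -math.log(1.0 - u) / lam     # F_X^{-1}(u)

random.seed(1)
lam = 2.0
samples = [sample_exponential(lam) for _ in range(100_000)]
print(sum(samples) / len(samples))      # ~ 1/lam = 0.5
```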
---
**Successive Approximation, Newton's Method, Fixed Point Theorem**
In successive approximation, what matters is whether the chosen **initial value** **converges to the "correct solution"** and the **efficiency of the convergence**.
Definition. A sequence $\{x_n\}$ in a normed space is said to be a **Cauchy sequence** if $||x_n-x_m|| \to 0$ as $n, m \to \infty$.
Remark. If $x_n \to x$, then $||x_n-x_m|| \leq ||x_n-x||+||x-x_m|| \to 0$.
Definition. A normed linear vector space $X$ is **complete** if **every Cauchy sequence** from $X$ has **a limit in $X$**. A complete normed linear vector space is called a **Banach space**.
Remark. **Hilbert space is complete.**
**contraction mapping**
$X$ is a normed space, $S \subseteq X, T:S \to S$. Then $T$ is a contraction mapping if there exists $\alpha \in [0,1)$ such that $||T(x_1)-T(x_2)|| \leq \alpha ||x_1-x_2||$ for all $x_1, x_2 \in S$.
**Contraction Mapping Theorem**
Let $X$ be a Hilbert (or Banach) space, $S \subseteq X$ a closed subset of $X$, and $T: S \to S$ a contraction mapping. Then there is a unique $x_0 \in S$ satisfying $x_0=T(x_0)$. **Furthermore, $x_0$ can be obtained by the method of Successive Approximation starting from an arbitrary initial point in $S$.**
Proof. Select an **arbitrary** element $x_1 \in S$.
Let $x_{n+1}=T(x_n)$; then $||x_{n+1}-x_n||=||T(x_n)-T(x_{n-1})|| \leq \alpha ||x_{n}-x_{n-1}||$. Therefore, $||x_{n+1}-x_n|| \leq \alpha^{n-1} ||x_2-x_1||$, and $||x_{n+p}-x_n|| \leq ||x_{n+p}-x_{n+p-1}||+...+||x_{n+1}-x_n||$
Hence $||x_{n+p}-x_n|| \leq \frac{\alpha^{n-1}}{1-\alpha}||x_2-x_1||$, so $\{x_n\}$ is a Cauchy sequence. Since $S$ is closed and $X$ is complete, there exists an $x_0 \in S$ such that $x_n \to x_0$.
Claim $x_0=T(x_0)$
Proof.
$||x_0-T(x_0)||=||x_0-x_n+x_n-T(x_0)|| \leq ||x_0-x_n||+||x_n -T(x_0)||\leq ||x_0-x_n||+\alpha||x_{n-1}-x_0||$. **By appropriate choice of $n$**, the right-hand side can be made arbitrarily small. Thus $||x_0-T(x_0)||=0$; $x_0= T(x_0)$.
Claim $x_0$ is unique.
Proof.
Assume $x_0$ and $y_0$ are fixed points. Then $||x_0-y_0||=||T(x_0)-T(y_0)|| \leq \alpha ||x_0-y_0||$. Thus $x_0=y_0$.
[Newton's Method](https://en.wikipedia.org/wiki/Newton%27s_method)

$x_{n+1}$ is a better approximation than $x_n$ for the root $x$ of the function $f$ (blue curve)
wiki

Iteration typically improves the approximation
wiki
Solving an equation of the form $P(x)=0$.
$x_{n+1}=x_n -\frac{P(x_n)}{P'(x_n)}$
Let $T(x)=x-[P'(x)]^{-1}P(x)$. A fixed point of $T$ is the solution of $P(x)=0$.
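A minimal Python sketch of this fixed-point iteration, applied to the assumed example $P(x)=x^2-2$ (the choice of initial value matters, as noted above):

```python
# Successive approximation with T(x) = x - P(x)/P'(x), i.e., Newton's method.
def newton(P, dP, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        x_next = x - P(x) / dP(x)       # x_{n+1} = T(x_n)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)   # ~ 1.4142135623730951 (sqrt(2))
```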
Extending this to Hilbert (Banach) spaces is too hard for EE.
---
**Exterior algebra**
Let $V$ be a vector space over a field $K$.
Let $e_1, ..., e_n$ be a basis of the vector space $V$. Definition. The exterior algebra $\Lambda V$ is the vector space with basis the formal symbols
$e_{i_1}\Lambda ... \Lambda e_{i_a}$
where $a \geq 0$ and $i_1 < ... < i_a$ is a strictly increasing sequence. The exterior product is
$(e_{i_1} \Lambda ... \Lambda e_{i_a}) \Lambda (e_{j_1} \Lambda ... \Lambda e_{j_b}) = 0 \ or \ \pm e_{k_1} \Lambda ... \Lambda e_{k_{a+b}}$
where the first case occurs when the sequences $i$ and $j$ have an element in common. $\Lambda$ is called the exterior product (or wedge product).
e.g., $\Lambda R^3=K\oplus R^3 \oplus \Lambda^2 R^3 \oplus \Lambda^3 R^3$


The cross product (blue vector) in relation to the exterior product (light blue parallelogram). The length of the cross product is to the length of the parallel **unit** vector (red) as the size of the exterior product is to the size of the reference parallelogram (light red).
The Cartesian plane $R^2$ is a real vector space equipped with a basis consisting of a pair of unit vectors $e_x, e_y$.
$u,v,w$ vectors, $\alpha, s$ real numbers.
$v \Lambda v = 0$
$v \Lambda u = - u \Lambda v$
$(u+v) \Lambda w= u \Lambda w + v \Lambda w$
$\alpha v \Lambda su =\alpha s (v \Lambda u)$
$v=ae_x+be_y$, $w=ce_x+de_y$, $v \Lambda w= ace_x \Lambda e_x+bc e_y \Lambda e_x+ade_x \Lambda e_y+bd e_y \Lambda e_y=-bce_x \Lambda e_y+ade_x \Lambda e_y=$ $\begin{vmatrix} a & b\\ c & d \end{vmatrix}e_x \Lambda e_y$.

Area of the parallelogram $=| ad-bc |$
Such an area is called the signed area of the parallelogram.
Let $u=u_1 e_1+u_2 e_2+u_3 e_3$, $v=v_1e_1+v_2e_2+v_3 e_3$; then $u \Lambda v=(u_1v_2-u_2v_1)e_1 \Lambda e_2+(u_2v_3-u_3v_2) e_2 \Lambda e_3+(u_3v_1-u_1v_3) e_3 \Lambda e_1$. Let $w=w_1e_1+w_2e_2 +w_3e_3$,
**Determinant**
$det=\begin{vmatrix}
u_1 & u_2 & u_3\\
v_1 & v_2 & v_3\\
w_1 & w_2 & w_3
\end{vmatrix}$
$u \Lambda v \Lambda w = det \ e_1 \Lambda e_2 \Lambda e_3$
Volume of the parallelepiped $= | det |$
$u_i \in R^n$ vectors,$u_i= \sum a_{ij}e_j$. $u_1 \Lambda u_2... \Lambda u_n = \begin{vmatrix} a_{11} & a_{12} & ...\\ a_{21} & ...\\...\\a_{n1} & ... \end{vmatrix} e_1 \Lambda e_2...\Lambda e_n$
**oriented volume of $u_i, i=1,2,...,n$** $=\begin{vmatrix} a_{11} & a_{12} & ...\\ a_{21} & ...\\...\\a_{n1} & ... \end{vmatrix}$
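A small numpy sketch of these formulas (illustrative; the vectors are assumed examples): the coefficient of $e_x \Lambda e_y$ (resp. $e_1 \Lambda e_2 \Lambda e_3$) is a determinant, so the signed area and oriented volume can be computed with `np.linalg.det`.

```python
# Signed area in R^2 and oriented volume in R^3 via determinants.
import numpy as np

v = np.array([3.0, 1.0])          # v = a e_x + b e_y
w = np.array([1.0, 2.0])          # w = c e_x + d e_y
signed_area = np.linalg.det(np.vstack([v, w]))   # ad - bc
print(signed_area)                # 5.0; parallelogram area = |ad - bc|

u3 = np.array([1.0, 0.0, 0.0])
v3 = np.array([0.0, 2.0, 0.0])
w3 = np.array([0.0, 0.0, 3.0])
oriented_volume = np.linalg.det(np.vstack([u3, v3, w3]))
print(abs(oriented_volume))       # 6.0; volume of the parallelepiped
```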
Let $R^n$ be an n-dimensional real vector space.
A form of degree 1 (or a **1-form**) is a linear function $\omega : R^n \to R$, i.e., $\omega(\lambda_1 \eta_1 + \lambda_2 \eta_2)=\lambda_1 \omega (\eta_1)+ \lambda_2 \omega(\eta_2)$, where $\lambda_i \in R$ and $\eta_i \in R^n$.
A **2-form** is a function on pairs of vectors $\omega^2: \ R^n \times R^n \to R$ which is bilinear and skew-symmetric.
e.g., **oriented area** $S(\eta_1,\eta_2)=\begin{vmatrix} \eta_{11}& \eta_{12}\\ \eta_{21} & \eta_{22} \end{vmatrix}$ is a 2 form.
A k-form is a function of $k$ vectors which is k-linear and antisymmetric.
$\omega(\eta_{i_1},...\eta_{i_k}) =(-1)^v \omega(\eta_1,...\eta_k)$
$v=\left\{
\begin{array}{c}
0 \ \ if \ the \ permutation \ i_1,...,i_k \ is \ even; \\
1 \ \ if \ the \ permutation \ i_1,...,i_k \ is \ odd. \\
\end{array}
\right.$
If you do not remember what a permutation is, recall determinants.
e.g., oriented volume on $R^n$
**Differentiable manifold (微分流形)**
A **chart** is an open set $U$ in the Euclidean coordinate space $\vec{q}=(q_1,...,q_n)$ together with a one-to-one mapping $\phi$ of $U$ onto some subset of $M$, $\phi: U \to \phi U \subset M$.
A set $M$ is given the structure of a differential manifold if $M$ is provided with a **finite or countable** collection of charts, so that every point is represented in at least one chart.
**1-form $df_x$ on the tangent space $TM_x$**
Let $f: M \to R$ be a differentiable function on the manifold $M$ (imagine a function of many variables $f:R^n \to R$). The differential $df|_x$ of $f$ at $x$ is a linear map $df_x:TM_x \to R$ of the tangent space to $M$ at $x$ into the real line.
tangent bundle $TM = \cup_x TM_x$
Definition. A 1-form on a manifold is a smooth map $\omega : TM \to R$ of the tangent bundle of $M$ to the line, linear on each tangent space $TM_x$.
**Differential k-form**
Definition. A differential k-form $\omega^k |_x$ at a point $x$ of a manifold $M$ is an exterior k-form on the tangent space $TM_x$ to $M$ at $x$, i.e., a k-linear skew-symmetric function of k vectors $\eta_1, ..., \eta_k$ tangent to $M$ at $x$.
**Theorem.** Every differential k-form on $R^n$ with a given coordinate system $x_1, ..., x_n$ can be written uniquely in the form $\omega^k= \sum_{i_1 <...<i_k} a_{i_1,...i_k}(x)dx_{i_1} \Lambda ... \Lambda dx_{i_k}$ where the $a_{i_1,...i_k}(x)$ are smooth functions on $R^n$.
---
<h2>
GF(2) Galois Field
</h2>
In mathematics, a **field** is a set on which addition, subtraction, multiplication, and division are defined and behave as the corresponding operations on rational and real numbers do.
Constructivist math teaches the most basic mathematical concepts. **It is not wrong**; it is just that, for elementary through high school students, exams and teaching have taken it to an extreme. Sigh~ just like the independence of two events, which to students is simply called a "definition".
**GF(2) GF:Galois field**
The modulo-2 addition can be implemented with an **XOR ($\oplus$)** gate, and the modulo-2 multiplication can be implemented with an **AND ($\otimes$)** gate.
The set $\{0,1\}$ together with modulo-2 addition and multiplication is called a binary field, denoted GF(2).
**Remark.**
1. Every element $x$ of GF(2) satisfies $x \oplus x = 0$ and therefore $−x = x$
2. $x, y \in GF(2)$, $x-y= x \oplus y$
3. $1 \otimes 1=1$, **the only invertible element is 1**. (e.g., rational number $4 \times (1/4)=1, 1/4 := 4 ^{-1}, a/b:=a \times b^{-1}, b \ne 0$)
4. Hence, GF(2) is a field.
**Vector Space over GF(2)**
A binary n-tuple $\vec{a}=(a_1, a_2,...,a_n)$ is an ordered sequence with components from GF(2).
Define an **addition operation** for any two binary n-tuples: $\vec{a}+\vec{b}=(a_1 \oplus b_1, ..., a_n \oplus b_n)$, where $\oplus$ = XOR.
Define a **scalar multiplication** between an element $c$ in GF(2) and a binary n-tuple $\vec{a}=(a_1,a_2,…,a_n)$ as follows: $c \times \vec{a}=(c \otimes a_1, c \otimes a_2,...,c \otimes a_n)$, where $\otimes$ = AND.
The set $V^n$ together with the addition defined for any two binary n-tuples in $V^n$ and the scalar multiplication defined between an element in GF(2) and a binary n-tuple in $V^n$ is called a vector space over GF(2).
Hamming distance $d(\vec{a},\vec{b})=\sum |a_i-b_i|=\sum |a_i \oplus b_i|$
HW:
[Show that Hamming distance is a metric.](https://planetmath.org/hammingmetric)
An (n,k) **convolutional encoder (CE)** will encode a k-bit input block into an n-bit output block, which depends on the current input block and the m preceding input blocks.
A convolutional encoder is a discrete linear time-invariant system.
**Remark. Why LTI system?**
input of CE $\vec{0} \to$ CE $\to \vec{0}$
input of CE $\delta(n) \to$ CE $\to \vec{g}$
input of CE $\delta(n-m) \to$ CE $\to \vec{g}(n-m)$ and
$a \otimes \delta(n-m) \to$ CE $\to a \otimes \vec{g}(n-m)$, $a \in GF(2)$
since the convolutional encoder is a finite state machine (FSM) with initial state $\vec{0}$. Assume $\vec{g} \in V^n$; then the CE is an LTI system. **Hence we can define the impulse response of the CE.**
(2,1) convolutional encoder:
Assume the discrete-time impulse responses are given by $g^0(n)=[1,0,1,1]$ and $g^1(n)=[1,1,1,1]$ over GF(2).

input sequence=$\vec{U}$
output $\vec{V^0}=g^0 \star \vec{U}$ and $\vec{V^1}=g^1 \star \vec{U}$
output sequence=$(V^0_0,V^1_0,V^0_1,V^1_1,...)$
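A minimal Python sketch of this (2,1) encoder (illustrative; the input sequence is an assumed example): each output stream is the GF(2) convolution of the input with $g^0$ or $g^1$, and the two streams are interleaved.

```python
# GF(2) convolution: y[n] = XOR_k ( g[k] AND u[n-k] ).
def gf2_convolve(u, g):
    y = [0] * (len(u) + len(g) - 1)
    for n, un in enumerate(u):
        for k, gk in enumerate(g):
            y[n + k] ^= un & gk
    return y

g0 = [1, 0, 1, 1]          # impulse responses from the example (over GF(2))
g1 = [1, 1, 1, 1]
u = [1, 0, 1, 1, 0]        # an assumed input sequence for illustration

v0 = gf2_convolve(u, g0)
v1 = gf2_convolve(u, g1)
out = [b for pair in zip(v0, v1) for b in pair]   # interleave: V0_0, V1_0, V0_1, V1_1, ...
print(v0, v1, out)
```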
Decoding convolutional codes:
**Viterbi algorithm**...by VLSI or Software
---
**Statistical Inference (統計推論)**
Formally, the partial derivative with respect to $\theta$ of the natural logarithm of the likelihood function is called the **score**.
$s(X;\theta)$=score=$\frac{\partial}{\partial \theta}log L(X|\theta)$.
e.g.,
$L(X|\theta)=\prod_{i=1}^N f_X(x_i|\theta)$, $X_1,...,X_N$ iid.
$p(x_i|\mu, \sigma)=\frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{1}{2 \sigma^2}(x_i-\mu)^2}$
$L(X|\mu, \sigma)=\prod_i p(x_i|\mu, \sigma)$
In general, $\{f(\vec{x}|\theta): \theta \in \Theta \}$, where $\Theta$ is called the **parameter space**.

**Maximum likelihood estimator (MLE)**
$\hat{\theta}=arg \ max_{\theta}L(x|\theta)$
Therefore, the MLE can be found by solving $\frac{\partial}{\partial \theta}log L(X|\theta)=0$ or $\nabla_{\vec{\theta}}log (L(X|\vec{\theta}))=\vec{0}$ if it exists, since log(.) is a **monotone increasing function** and
$\frac{d}{dx}ln(f(x))=\frac{f(x)^{'}}{f(x)}$, $f(x)>0$.
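A small sketch of this MLE recipe for an exponential sample (illustrative; the true rate $\lambda=2$ and the inverse-transform sampler are assumptions of this example, not part of the notes):

```python
# MLE for an exponential sample: d/dλ log L = n/λ - Σ x_i = 0  ⇒  λ_hat = n / Σ x_i = 1/x̄.
import random
import math

random.seed(0)
lam_true = 2.0
xs = [-math.log(1.0 - random.random()) / lam_true for _ in range(100_000)]

lam_hat = len(xs) / sum(xs)   # closed-form solution of the score equation
print(lam_hat)                # ~ 2.0
```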
**HW. $arg \ max_{x \in [0,2\pi]}cos(x)=?$**
The expected value (the first moment) of the score, evaluated at the **true parameter** value **$\theta$** , is 0.
1. $E[s(X;\theta)|\theta]=0$
2. $I(\theta)=E[s(X;\theta)^2|\theta]$
The **Fisher information** $I(\theta)$ is defined to be **the variance of the score**.
Proof.
$\frac{d \ ln f(x)}{dx}=\frac{f^{'}(x)}{f(x)}$
$E[s(X;\theta)|\theta]$
$=\int \frac{\partial}{\partial \theta} ln L(x|\theta)L(x|\theta)dx$
$=\int \frac{ \frac{\partial \ L(x|\theta)}{\partial \theta}}{L(x|\theta)} L(x|\theta)dx=\frac{\partial}{\partial \theta}\int L(x|\theta) d x=\frac{\partial}{\partial \theta}(1)=0$ (interchanging differentiation and integration under regularity conditions, and using $\int L(x|\theta)dx=1$)
**score vector**
$\vec{\theta}=[\theta_1,...,\theta_n]^T$
the **log-likelihood function** $\log L(x;\vec{\theta})$
**score vector** $s(x;\vec{\theta})=\nabla_{\vec{\theta}}\log L(x;\vec{\theta})$
Under mild regularity conditions, the expected value of the score is equal to zero:
$E(s(X;\vec{\theta})|\vec{\theta})=\vec{0}$
[Score Function and Fisher Information Matrix](https://bobondemon.github.io/2022/01/07/Score-Function-and-Fisher-Information-Matrix/)
[Fisher information](https://en.wikipedia.org/wiki/Fisher_information)
The information matrix is the **covariance matrix of the score**.
Fisher information matrix (FIM)
$\vec{\theta}=[\theta_1,...,\theta_n]^T$
$I(\vec{\theta})=E[\nabla_{\vec{\theta}}\log L(X|\vec{\theta}) \ (\nabla_{\vec{\theta}}\log L(X|\vec{\theta}))^T \ | \ \vec{\theta}]$, i.e., $(I(\vec{\theta}))_{ij}=E[\frac{\partial}{\partial \theta_i}\log L(X|\vec{\theta}) \ \frac{\partial}{\partial \theta_j}\log L(X|\vec{\theta}) \ | \ \vec{\theta}]$
If ${T}_{\theta}(X)$ is an unbiased estimator of $\theta$
then the Cramér–Rao bound reduces to
$COV_{\theta}(T_{\theta}(X)) \geq I(\theta)^{-1}$
**[Cramér–Rao bound (Rao-Cramér inequality, Rao-Cramér lower bound)](https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound)**
---
<h2>
Number Theory
</h2>
1. $(a,b)$=gcd(a,b)=greatest common divisor of $a$ and $b$
2. if $(a,b)=1$, then $a$ and $b$ are said to be relatively prime.
3. Let $a, \ b,\ p \in Z$ with $p$ prime. If $p|ab$ then $p|a$ or $p|b$.
4. Let $a, \ b,\ m \in Z$ with $m>0$. Then $a$ is said to be congruent to $b$ modulo $m$, denoted $a \equiv b \ mod \ m$ (同餘), if $m \ | \ (a-b)$.
5. Every integer greater than $1$ has a prime divisor.
6. If $n$ is a composite number(合成數), there exist $a,b \in Z$ such that $n=ab$, $1<a<n, 1 <b <n,$ e.g., $14=2 \times 7.$
7. Let $n$ be a composite number(合成數). Then $n$ has a prime divisor $p$ with $p \leq \sqrt{n}$.
8. $1$ is neither a prime number nor a composite number. Every positive integer can be written as a product of prime numbers, e.g., $60=(2^2) (3) (5)$. In general, $n=p_1 ^{a_1} ... p_r^{a_r}$ if $n>1$, where the $p_i$ are prime numbers.
9. $a_i \in Z$, $a_1,...,a_n$ not all zero. If $(a_1,...,a_n)=1$, $a_1,...,a_n$ are said to be **relatively prime**. If $(a_i,a_j)=1$ for all pairs $i,j$ with $i \ne j$, then $a_1,...,a_n$ are said to be **pairwise relatively prime**.
10. A set of integers such that every integer is congruent modulo $m$ to exactly one integer of the set is said to be a **complete residue(餘數) system modulo $m$**.
e.g.,
$\{0,1,...,m-1 \}$ is a complete residue system modulo $m$.
The set of integers {1,5} is a reduced residue system modulo 6.
11. $6a \equiv 6b \ mod \ 3 \nRightarrow a \equiv b \ mod \ 3$
12. $a, b,c \in Z$, $ca \equiv cb \ mod \ m \leftrightarrow a \equiv b \ mod \ (m/(c,m))$
13. Definition: An arithmetic function is a function whose domain is the set of positive integers.
14. An arithmetic function (算術函數) $f$ is said to be **multiplicative** if $f(mn)=f(m)f(n)$ whenever $m$ and $n$ are relatively prime positive integers.
15. the least common multiple (最小公倍數) of two integers $a$ and $b$ = $lcm(a,b)=$ the smallest positive integer that is divisible by both $a$ and $b$
Definition. Let $a,b \in Z$. A congruence of the form $ax \equiv b \mod \ m$ is said to be a **linear congruence in the variable $x$ (一元線性同餘方程).**
Definition. Let $a,m \in Z$ with $m>0$ and $(a,m)=1$. The order of $a$ modulo $m$, denoted $ord_m \ a$, is the least positive integer $n$ for which $a^n \equiv 1 \ mod \ m$.
Definition. Let $r,m \in Z$ with $m>0$ and $(r,m)=1$. Then $r$ is said to be a **primitive root (原根) modulo $m$ if $ord_m r= \phi(m)$.**
e.g.,
$3$ is a primitive root modulo $7$ since $ord_7 3=6=\phi(7)$.
$2$ is a primitive root modulo $13$, $ord_{13} 2= \phi(13)$
$2$ is not a primitive root modulo $7$, since $ord_7 \ 2=3 \neq \phi(7)=6$
**Theorem.** Let $ax \equiv b \mod \ m$ and $d=(a,m)$. If $d \nmid b$, then the congruence has no solution in $Z$; if $d | b$, then the congruence has exactly $d$ incongruent solutions modulo $m$ in $Z$.
Proof. Elementary Number Theory, chap 2.2, by STRAYER.
**Definition.** Any solution of the linear congruence in one variable $ax \equiv 1 \mod \ m$ is said to be a (multiplicative) **inverse** of $a$ modulo $m$.
e.g., the inverse of **5 mod 16** is 13, since $5 \times 13 \equiv 1 \ mod \ 16$
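A minimal Python sketch of computing such an inverse with the extended Euclidean algorithm (illustrative; the function name `mod_inverse` is mine, not from the text):

```python
# Extended Euclidean algorithm: find s with s*a ≡ gcd(a, m) (mod m);
# an inverse exists iff (a, m) = 1.
def mod_inverse(a: int, m: int) -> int:
    r0, r1 = a % m, m
    s0, s1 = 1, 0
    while r1 != 0:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        s0, s1 = s1, s0 - q * s1
    if r0 != 1:
        raise ValueError("a is not invertible modulo m")
    return s0 % m

print(mod_inverse(5, 16))   # 13
```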

**Swiss mathematician Euler: a master is a master; who knows what was in that head when he wrote down this important theory.** **Public keys and private keys rely on exactly this theorem.**
**Euler's Theorem.**
Let $a, m \in Z$ with $m>0$. If $(a,m)=1$, then $a^{\phi(m)} \equiv 1 \ mod \ m$, where the **Euler phi-function (Euler's totient function)**, denoted $\phi(n)$, is $\phi(n)=| \{x \in Z: 1 \leq x \leq n;(x,n)=1 \}|$.
Remark.
1. $|S|$ cardinality of the set $S$ (the number of elements of the set $S$). **$\phi(n)$ is the number of positive integers less than or equal to $n$ that are relatively prime to $n$**. e.g., $\phi(1)=1, \ \phi(3)=2, \ \phi(5)=4, \ \phi(6)=2, \ \phi(8)=4$
2. $\phi(p)=p-1$, **if $p$ is a prime number**.
3. **$p, q$ two distinct prime numbers:** for $n=pq$, $\phi(n)=\phi(pq)=\phi(p)\phi(q)=(p-1)(q-1)$. **The Euler phi-function $\phi(n)$ is multiplicative.**
**Proof.**
Consider the matrix $[kp+i]$, $1 \leq i \leq p$, $0 \leq k \leq q-1$ (its entries are the integers $1,...,pq$). If $(p,i)=d >1$, then no entry of the $i$-th row is relatively prime to $pq$, since $d|p$ and $d|(kp+i)$; so only rows with $(p,i)=1$ matter. For such a row $[i, p+i,2p+i,...,(q-1)p+i]$, each of these integers is relatively prime to $p$, and exactly $\phi(q)$ of them are relatively prime to $q$; those $\phi(q)$ integers are therefore relatively prime to $pq$. More directly: among the residues $[0,1,...,pq-1]$, the ones that are not relatively prime to $n=pq$ are the set $\{p, 2p,...,(q-1)p \}$, the set $\{q, 2q,...,(p-1)q\}$, and $0$. Hence $\phi(n)=pq-[(q-1)+(p-1)+1]=(p-1)(q-1)=\phi(p)\phi(q)$.
4. $a^{ \phi(m)+1} \equiv a \ mod \ m$ (for $(a,m)=1$)
Proof. Let $0<r_1, r_2,...,r_{\phi(m)}\leq m$ with $(r_i,m)=1, i=1,...,\phi(m)$. Consider $r_1a, r_2a,...,r_{\phi(m)}a$. Since $(a,m)=1$, $a$ has an inverse $a'$. Hence if $r_i a \equiv r_j a \ mod \ m$ with $i \neq j$, then $r_iaa' \equiv r_jaa' \ mod \ m$, i.e., $r_i \equiv r_j \ mod \ m$, which is impossible. So the least non-negative residues modulo $m$ of the integers $r_1a, r_2a,...,r_{\phi(m)}a$ are a rearrangement of $r_1,...,r_{\phi(m)}$. Then $(r_1a)(r_2a)...(r_{\phi(m)}a) \equiv r_1...r_{\phi(m)} \ mod \ m$. Hence $m \ | \ a^{\phi(m)} r_1...r_{\phi(m)}-r_1...r_{\phi(m)}$, and since $(r_1...r_{\phi(m)},m)=1$, we have
$m \ | \ a^{\phi(m)}-1$.
**Compute Euler Phi-Function:**
Theorem. Let $p$ be a prime number and let $0< a \in Z$, then $\phi(p^a)=p^a-p^{a-1}$
Proof. The positive integers not exceeding $p^a$ that are not relatively prime to $p^a$ are $p, 2p, 3p,...p^{a-1}p$, hence $\phi(p^a)=p^a-p^{a-1}$.
**Theorem. Let $0<n \in Z$; then $\phi(n)=n \prod_{p|n, \ p \ prime}(1-\frac{1}{p})$.**
e.g.,
$\phi(12)=12 \prod_{p|12, \ p \ prime} (1-\frac{1}{p})= 12 (1-1/2)(1-1/3)=4$
$504=2^3 3^2 7$,
$\phi(504)=504 (1-\frac{1}{2})(1-\frac{1}{3})(1-\frac{1}{7})= 144$
Proof. Assume $n >1$ and $n=p_1^{a_1}...p_r^{a_r}$ with $p_1,...,p_r$ distinct prime numbers and $a_1,...,a_r$ positive integers. Then $\phi(n)=\phi(p_1^{a_1}...p_r^{a_r})=\phi(p_1^{a_1})...\phi(p_r^{a_r})$ since the Euler phi-function is multiplicative. Thus $\phi(n)=(p_1^{a_1}-p_1^{a_1-1})...(p_r^{a_r}-p_r^{a_r-1})=n\prod_{p|n, \ p \ prime}(1-\frac{1}{p})$.
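A small Python sketch of the formula $\phi(n)=n\prod_{p|n}(1-\frac{1}{p})$ via trial-division factorization, plus a numeric check of Euler's theorem (illustrative only; the function name `euler_phi` is mine):

```python
# φ(n) by removing the factor 1/p for each prime divisor p of n.
from math import gcd

def euler_phi(n: int) -> int:
    result, m, p = n, n, 2
    while p * p <= m:
        if m % p == 0:
            while m % p == 0:
                m //= p
            result -= result // p      # multiply result by (1 - 1/p)
        p += 1
    if m > 1:                          # leftover prime factor
        result -= result // m
    return result

print(euler_phi(12), euler_phi(504))   # 4, 144
a, m = 5, 504
assert gcd(a, m) == 1 and pow(a, euler_phi(m), m) == 1   # Euler's theorem
```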
---
**The RSA Encryption/Decryption Scheme**
Let $p$ and $q$ be distinct odd prime numbers (typically, very large), let $m=pq$, and let $e$ be a positive integer such that $(e, \phi(m) )=1$. The ordered pair $\{e,m\}$ is the public key. Let $d \equiv e^{-1} \ mod \ \phi(m)$; then $\{d,m\}$ is the private key.
**Encryption**
block $P$ to a new block $C$
$P^e \equiv C \ mod \ m, \ 0 \leq C < m$
**Decryption**
block $C$ to block $P$
$C^{d} \equiv P \ mod \ m, \ 0\leq P <m$
Remark. $ed \equiv 1 \ mod \ \phi(m) \to$ $ed=k\phi(m)+1$ for some integer $k$, so $C^{d} \equiv (P^e)^d \equiv P^{ed} \equiv P^{k\phi(m)+1} \equiv P^{k\phi(m)}P\equiv P \ mod \ m$ by Euler's theorem (for $(P,m)=1$; the case $(P,m)>1$ follows from the Chinese Remainder Theorem since $m=pq$ is squarefree).
**Example.**
1. select two prime numbers, p=7 and q=17
2. n=pq=119
3. $\phi(n)=(p-1)(q-1)=96$
4. select $e$ relatively prime to $\phi(n)=96$, e.g., $e=5$.
5. $de \equiv 1 \ mod \ 96$ and $d <96$, then $d=77$.
public key = {5,119}
private key = {77,119}
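A toy Python sketch of this RSA example (illustrative only; real RSA uses very large primes and padding, and the plaintext block $P=65$ is an assumed value):

```python
# RSA toy example with p=7, q=17, e=5, d=77 from the text.
p, q = 7, 17
n = p * q                      # 119, part of both keys
phi = (p - 1) * (q - 1)        # 96
e = 5                          # public exponent, (e, phi) = 1
d = pow(e, -1, phi)            # 77, private exponent (modular inverse, Python 3.8+)

P = 65                         # assumed plaintext block, 0 <= P < n, (P, n) = 1
C = pow(P, e, n)               # encryption: C = P^e mod n
assert pow(C, d, n) == P       # decryption recovers P
print(n, phi, d, C)
```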
**Diffie-Hellman Key Exchange Algorithm**


1. Global Public Elements
$q$ a prime number, $\alpha < q$ and $\alpha$ a primitive root of $q$
2. User A Key Generation
Select private $X_A$, $X_A <q$
Calculate Public $Y_A$, $Y_A \equiv \alpha^{X_A} \ mod \ q$
3. User B Key Generation
Select private $X_B$, $X_B <q$
Calculate Public $Y_B$, $Y_B \equiv \alpha^{X_B} \ mod \ q$.
4. Generation of Secret Key by User A
$K \equiv Y_B^{X_A} \ mod \ q$
5. Generation of Secret Key by User B
$K \equiv Y_A^{X_B} \ mod \ q$
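A toy Python sketch of this exchange (illustrative only; $q=353$ with primitive root $\alpha=3$ are small assumed values, far too small for real use):

```python
# Diffie-Hellman toy exchange: both sides end up with alpha^(X_A * X_B) mod q.
import random

q, alpha = 353, 3              # global public elements (3 is a primitive root of 353)

random.seed(7)
x_a = random.randrange(1, q)   # user A's private key
x_b = random.randrange(1, q)   # user B's private key

y_a = pow(alpha, x_a, q)       # user A's public key
y_b = pow(alpha, x_b, q)       # user B's public key

k_a = pow(y_b, x_a, q)         # secret computed by A
k_b = pow(y_a, x_b, q)         # secret computed by B
assert k_a == k_b              # both sides share the same key
print(k_a)
```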
Ref. Cryptography and Network Security, by William Stallings
Ref. [Introduction to Higher Mathematics, by Patrick Keef and David Guichard.](https://www.whitman.edu/mathematics/higher_math_online/)
**Chinese Remainder Theorem (中國餘數定理, 孫子定理, 韓信點兵)**

Let $m_1, m_2,..., m_n$ be pairwise relatively prime positive integers and let $b_1,...,b_n$ be any integers. The system of linear congruences in one variable given by
$x \equiv b_1 \ mod \ m_1$
$x \equiv b_2 \ mod \ m_2$
...
$x \equiv b_n \ mod \ m_n$
has a unique solution modulo $m_1m_2...m_n$.
Proof.
**Existence**
Let $M=m_1m_2...m_n$ and let $M_i=M/m_i$. Then $(M_i,m_i)=1$ since the $m_i$ are pairwise relatively prime. Thus $M_ix_i \equiv 1 \ mod \ m_i$ has a solution $x_i$ for each $i$. Form $x = \sum_{i=1}^n b_iM_ix_i$; then $x \equiv 0+0+...+b_iM_ix_i+...+0 \equiv b_i \ mod \ m_i$ for each $i$.
**Uniqueness**
Let $x'$ be another solution. We have $x' \equiv b_i \ mod \ m_i$ and $x \equiv x' \ mod \ m_i$, so $m_i | (x-x')$ for all $i$. Then $M|(x-x')$, from which $x \equiv x' \ mod \ M$.
e.g.,
$x \equiv 2 \mod \ 3$
$x \equiv 1 \mod \ 4$
$x \equiv 3 \mod \ 5$
$M=(3)(4)(5)=60$, $M_1=20,M_2=15,M_3=12$, $x_1 \equiv 2 \ mod \ 3$ , $x_2 \equiv 3 \ mod \ 4$, $x_3 \equiv 3 \ mod \ 5$, then $x=b_1M_1 x_1+b_2M_2 x_2+b_3M_3x_3 \equiv 233 \equiv 53 \ mod \ 60$
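A minimal Python sketch of the constructive proof above, applied to this example (illustrative; `pow(M_i, -1, m_i)` requires Python 3.8+ and the function name `crt` is mine):

```python
# CRT: x = Σ b_i * M_i * x_i mod M, where M_i = M/m_i and M_i * x_i ≡ 1 (mod m_i).
from math import prod

def crt(residues, moduli):
    M = prod(moduli)
    x = 0
    for b_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        x_i = pow(M_i, -1, m_i)        # inverse of M_i modulo m_i
        x += b_i * M_i * x_i
    return x % M

print(crt([2, 1, 3], [3, 4, 5]))       # 53
```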
**Tensor Product of Hilbert Spaces**
Let $H_1$ and $H_2$ be Hilbert spaces over a field $K$, e.g., $R$. For each $\phi_1 \in H_1$, $\phi_2 \in H_2$, let $\phi_1 \otimes \phi_2$ denote the conjugate (for complex scalars) bilinear form which acts on $H_1 \times H_2$ by
$(\phi_1 \otimes \phi_2)(\psi_1,\psi_2)=<\psi_1, \phi_1>_{H_1}<\psi_2,\phi_2>_{H_2}.$
Let $E$ be the set of finite linear combinations of such conjugate bilinear forms; we define an inner product $< \cdot , \cdot >_E$ on $E$ by defining $<\phi \otimes \psi, \eta \otimes \mu>_E=<\phi, \eta><\psi,\mu>$ and extending by linearity to $E$ (a tensor: a multilinear mapping, a multidimensional array...). The **tensor product** $H_1 \otimes H_2$ is the completion of $E$ under this inner product.
Fact 1. $< \ , \ >_E$ is well defined and positive definite.
Fact 2. If $\{ \phi_i \}$ and $\{ \psi_j \}$ are orthonormal bases for $H_1$ and $H_2$, then $\{\phi_i \otimes \psi_j\}$ is an orthonormal basis for $H_1 \otimes H_2$.
Then we have the tensor product $H_1 \otimes H_2 \otimes ... \otimes H_n$.
Fact 3.
$V \otimes (W \otimes U) \cong (V \otimes W) \otimes U$
$V \otimes K \cong K \otimes V \cong V$
Ref. Functional Analysis, M. Reed, B. Simon
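A small finite-dimensional sketch of Fact 2 (illustrative, using numpy; the random orthonormal bases are an assumption of this example): for orthonormal bases of $R^2$ and $R^3$, the Kronecker products of the basis vectors form an orthonormal basis of $R^2 \otimes R^3 \cong R^6$.

```python
# Kronecker products of orthonormal basis vectors give an orthonormal basis
# of the (finite-dimensional) tensor product space.
import numpy as np

Q1, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((2, 2)))
Q2, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((3, 3)))

B = np.column_stack([np.kron(Q1[:, i], Q2[:, j])
                     for i in range(2) for j in range(3)])   # 6 vectors in R^6
print(np.allclose(B.T @ B, np.eye(6)))   # True: the tensor-product basis is orthonormal
```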
---
Appendix
If your mathematics background is solid, books by the great Soviet and Russian mathematician Vladimir Arnold are excellent (math-department level required).
"A clear writing style that combines mathematical rigor with physical intuition"
helps others understand.
Mathematical Methods of Classical Mechanics, by V. I. Arnold
(This book also requires graduate-level mathematics; it suits those heading for graduate school or a Ph.D.)
Recommended undergraduate linear algebra and probability books (they seem to all come from math departments; the point is to build concepts, not calculation, and **calculation and applications help understanding**):
Linear Algebra:
Introduction to Linear Algebra, Gilbert Strang
Probability:
A First Course in Probability, Sheldon Ross
---
Ref.
**Mathematical Methods and Algorithms for Signal Processing, by Todd Moon (Author), Wynn Stirling (Author)**
The appendix has a concise treatment of measure theory.
**This book suits students aiming for graduate school, or EE/CS professionals who want to strengthen their skills; I once used it to teach estimation and prediction theory.**
---
**Stochastic Processes, by Sheldon M. Ross** (does not use measure theory; suitable for graduate study in EE, CS, statistics, and applied math)
Linear Operator Theory in Engineering and Science (Applied Mathematical Sciences, 40), by Arch W. Naylor (Author), George R. Sell (Author)
Real Analysis and Probability
Probability and Mathematical Statistics, by Robert B. Ash
(math-graduate-school level, deeper)
When I have time I will write about the magical Dirac delta function $\delta(t)$.
At the time, $\int$ $\delta$ $(t)dt=1$ could not be handled by the Riemann integral or the Lebesgue integral.
Paul Dirac, the genius physicist and one of the founders of quantum mechanics.
There is also Évariste Galois: the principles behind coding were laid down by him when he was young, in what is now called Galois theory (Galois fields).
He died in a duel at the age of 20.
The mathematical theory came only later:
Schwartz distributions, or generalized functions.
These three books require a background at the level of a mathematics department or above.
**Applied Functional Analysis** (Dover Books on Mathematics), Reprint edition,
by **D. H. Griffel** (Author).
This one is relatively easy for EE to follow. Many Dover books are inexpensive reprints of good books with no copyright issues, and this is a highly regarded one; taking the corresponding math-graduate course is not that easy.