# $\mathcal{Probability \ Measure \ and \ Conditional \ Expectation}$

Author: $\mathcal{CrazyDog}$, Ph.D. in Electrical Engineering, NCTU email: kccddb@gmail.com Date: 20241030 Copyright: CC BY-NC-SA

<style> .blue { color: blue; } .bg { color: blue; } .bgblue { color: blue; font-size: 24px; font-weight: bold; } .red { color: red; font-size: 24px; font-weight: bold; } h1 {text-align: center;} h2 {text-align: center;} h3 {text-align: center;} </style>

[家扶基金會](https://www.ccf.org.tw/support/donate/channel) [台灣之心愛護動物協會(Heart of Taiwan Animal Care, HOTAC)](https://www.hotac.org.tw/content-donate_main) [無國界醫生](https://www.msf.org.tw/about-us/who-we-are)

**Mathematics (calculus, linear algebra, probability, statistics), data structures, algorithms, DSP, microprocessors and OS are the internal kung fu; network programming and embedded systems are the combat craft.**

![1CCTfPz](https://hackmd.io/_uploads/B1zKJAFia.jpg)

<h1>Put learning into practice</h1>

<h3>Learning without thought is labor lost; thought without learning is perilous; learning followed by thought brings insight.</h3>

<h3>♥ With ten thousand true books in mind, you hold half the world of letters and arms ♥</h3>

---

<h3> Probability Space </h3>

**$\sigma$-field**

Definition 1.1. Let $F$ be a collection of subsets of $\Omega$. $F$ is called a "$\sigma$-field" or "$\sigma$-algebra" if the following hold: (i) $\Omega \in F$; (ii) $A \in F$ implies that $A^c \in F$; and (iii) if $A_1,A_2, \ldots \in F$ is a countable sequence of sets, then $\bigcup A_i \in F$.

**measure, measurable space**

Definition 1.2. A measure is a nonnegative **countably** additive set function on a measurable space $(\Omega,F)$, i.e., a function $\mu : F \to R$ such that (i) $\mu(A) \geq 0$ for all $A \in F$, and (ii) if $A_i \in F$ is a **countable** sequence of disjoint sets (i.e., $A_i \cap A_j = \emptyset$ for all $i \ne j$), then $\mu (\bigcup A_i) =\sum \mu(A_i)$.

The tuple $(\Omega,F)$, where $F$ is a "$\sigma$-field", is called a measurable space.

e.g.,
1. $(\Omega,F)$ with $F = \{\emptyset,\Omega\}$. $F$ is the smallest "$\sigma$-field" on $\Omega$, called the trivial "$\sigma$-field".
2. $F=2^{\Omega}$ is the largest "$\sigma$-field" on $\Omega$.

1.
Let $P$ be a measure on $(\Omega, F)$ with $0 \leq P(A) \leq P(\Omega)=1$ for all $A \in F$; then $(\Omega, F, P)$ is a **probability space** with **probability measure** $P$.

**Borel $\sigma$-algebra**

In mathematics, a Borel set is any set in a topological space that can be formed from open sets (or, equivalently, from closed sets) through the operations of countable union, countable intersection, and relative complement. Borel sets are named after Émile Borel. For a topological space X, **the collection of all Borel sets on X forms a σ-algebra, known as the Borel algebra or Borel σ-algebra. The Borel algebra on X is the smallest σ-algebra containing all open sets (or, equivalently, all closed sets).**

2. [Lebesgue measure on $(R^n,B)$, where $B$ is the Borel $\sigma$-algebra](https://en.wikipedia.org/wiki/Lebesgue_measure)

Lebesgue measure $\mu$ on $R$: $\mu (a,b)=b-a$, where $(a,b)$ is an open interval.

If $f$ is differentiable in $[a,b]$ and the Lebesgue integral $\int_a^x f'(t)dt$ exists, then $f(x)-f(a)=\int_a^x f'(t)dt$.

**Riemann integral** $\int_{[a,b]} f dt=$ **Lebesgue integral** $\int_{[a,b]} f dt$ if they exist.

**Counting measure and $l^p$ spaces.** Let $X$ be any countable set, $M = P(X)$ ($M$: $\sigma$-algebra, $P(X)$ = power set of $X$) and let $\mu$ be the counting measure. Recall that $\mu(A)$ is the number of points in $A$ if $A$ is finite and equals $\infty$ otherwise. Integration is simply $\int f d\mu = \sum_{x \in X}f(x)$ for any non-negative function $f$, and $L^p$ is denoted by $l^p$.

**Definition 1.3.** Given two measurable spaces $(\Omega,F)$ and $(S,B)$, a function $f : \Omega \to S$ is measurable if $f^{−1}(A) \in F$ for any $A \in B$.

**random variable (measurable function)**

**Definition 1.4.** A random variable is a **measurable function** from $(\Omega,F)$ to $(R, B)$, where $B$ is the Borel $\sigma$-field of $R$. A random vector is a measurable function from $(\Omega,F)$ to $(R^d, B^d)$, where $B^d$ is the Borel $\sigma$-field of $R^d$.
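To make Definitions 1.1–1.4 concrete, here is a minimal sketch in Python; the fair-die space, the uniform measure, and the parity map $X$ are illustrative choices of mine, not from the text:

```python
from fractions import Fraction

# A tiny concrete probability space: a fair die.
# Omega is finite, F = 2^Omega (the largest sigma-field), P is uniform.
omega = {1, 2, 3, 4, 5, 6}
P = {w: Fraction(1, 6) for w in omega}

def prob(event):
    """P(A) for A subset of Omega (countable additivity is a finite sum here)."""
    return sum(P[w] for w in event)

# With F = 2^Omega every function Omega -> R is measurable,
# since every preimage is automatically in F.
X = {w: w % 2 for w in omega}            # indicator of "odd outcome"

event = {w for w in omega if X[w] == 1}  # {X = 1} = X^{-1}({1})
print(prob(event))                       # 1/2
print(sum(X[w] * P[w] for w in omega))   # E[X] = 1/2
```

Finite examples like this hide the measurability issue entirely; it only bites once $\Omega$ is uncountable and $F$ must be smaller than $2^{\Omega}$.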
$X: \Omega \to R$ is a random variable, $P( a \le X \le b) \doteq P( \{\omega \in \Omega \mid a \le X(\omega) \le b\})$.

Distribution function of $X$: $F_X(x) \doteq P(X \leq x)$

**Expectation, conditional expectation**

Probability space $(\Omega, F, P)$

1. Indicator function $1_A$: $1_A(\omega)=1$ if $\omega \in A$, $1_A(\omega)=0$ if $\omega \in \Omega \setminus A$, where $A \in F$. $E[1_A] \doteq \int_{\Omega}1_A (\omega)d P(\omega) \doteq P(A)$.
2. Simple random variable. A r.v. $\phi$ is a simple r.v. if $\phi(\omega) = \sum_{i=1}^n a_i 1_{A_i}(\omega)$, where the $A_i$'s are disjoint measurable sets. $E[ \phi ] \doteq \sum_{i=1}^n a_i E [1_{A_i}]$. If $E[ \phi ] < \infty$, then $\phi$ is **integrable**.

**$E[X] \doteq \int X(\omega)dP$ if it exists.**

$(X,\Sigma ,\mu )$ is a measure space; a property $\textbf{P}$ is said to hold **almost everywhere (almost surely)** in $X$ if there exists a set $N \in \Sigma$ with $\mu (N)=0$ such that all $x \in X \setminus N$ have the property $\textbf{P}$. Define $0 \times \infty =0$.

**e.g.**, $Q$: rational numbers. Let $f(x)=1$ if $x \in Q$, $f(x)=0$ if $x \in R \setminus Q$. Then $f(x)=0$ **a.e.** in $R$, because $Q$ is countable and $\mu(Q)=0$, where $\mu$ is the Lebesgue measure. $\int f d\mu= 0$ (as a Lebesgue integral).

**Strong Law of Large Numbers**

The sample average converges almost surely to the expected value, i.e., $\bar{X}_n \to \mu$ a.s.
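A quick Monte Carlo illustration of the strong law; the Uniform(0,1) choice and the sample sizes are mine:

```python
import random

# SLLN illustration: sample averages of i.i.d. Uniform(0,1) draws
# should approach the expected value mu = 1/2 as n grows.
random.seed(0)

def sample_average(n):
    return sum(random.random() for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, sample_average(n))  # the averages drift toward 0.5
```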
![](https://hackmd.io/_uploads/S1zAVF853.jpg)

**Schwarz and Hölder Inequality**

Let $(S, \Sigma, \mu)$ be a measure space, $1 <p <\infty$ and $\frac{1}{p}+\frac{1}{q}=1$.

$\int |xy|d\mu \leq (\int |x|^p d\mu)^{\frac{1}{p}}(\int |y|^q d\mu)^{\frac{1}{q}}$

$||xy||_{L^1} \leq ||x||_{L^p}||y||_{L^q}$

$\mu=$ counting measure, $S=\{1,..,n \}$:

$\sum_{i=1}^n |x_i y_i| \leq (\sum_{i=1}^n |x_i|^p)^ {\frac{1}{p}}(\sum_{i=1}^n |y_i|^q)^ {\frac{1}{q}}$

**$p=q=2$: Schwarz inequality**

Hölder's inequality becomes an equality if and only if $|x |^p$ and $|y|^q$ are **linearly dependent** in $L^1(μ)$, meaning that there exist real numbers $α, β ≥ 0$, not both zero, such that $α|x|^p$ = $β |y|^q$ $μ$-almost everywhere.

**Minkowski Inequality**

$1 \leq p < \infty$

$(\int |x \pm y|^p d\mu)^{\frac{1}{p}} \leq (\int |x|^p d\mu)^{\frac{1}{p}}+(\int |y|^p d\mu)^{\frac{1}{p}}$

$||x \pm y||_p \leq ||x||_p+||y||_p$

![simple1](https://hackmd.io/_uploads/Hkehl56H6.jpg)

Remark. Random variable $X:\Omega \to R$

**Markov's inequality**

If $X$ is a nonnegative random variable and $a > 0$, $P(X \geq a) \leq \frac{E[X]}{a}$.

$\bf{Proof.}$ Define $1_{\{X \geq a\}}$; then $a 1_{\{X \geq a\}} \leq X$ a.s. Hence $E[a1_{\{X \geq a\}}]=aP(X \geq a) \leq E[X]$. $\bf{Q.E.D.}$

Extended version for nondecreasing functions: if $\phi$ is a nondecreasing nonnegative function, $X$ is a (not necessarily nonnegative) random variable, and $\phi(a) > 0$, then $P(X >a) \leq \frac{E[\phi(X)]}{\phi(a)}$.

Proof. $\{ \omega : \ X(\omega)>a \} \subseteq \{\omega : \ \phi(X(\omega)) \geq \phi(a) \}$, so apply Markov's inequality to $\phi(X)$.

**[Hoeffding's inequality](https://en.wikipedia.org/wiki/Hoeffding%27s_inequality)**

Hoeffding's inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount.

Let $X_1, ..., X_n$ be independent random variables such that $a_{i} \leq X_{i} \leq b_{i}$ almost surely.
Then Hoeffding's theorem states that, for all $t > 0$, with $S_n=\sum_{i=1}^nX_i$,

$P(S_n -E[S_n] \geq t) \leq exp(-\frac{2t^2}{\sum_i (b_i-a_i)^2})$

$P(|S_n -E[S_n]| \geq t) \leq 2 exp(-\frac{2t^2}{\sum_i (b_i-a_i)^2})$

Proof. (**HINT**) $P(S_n -E[S_n] \geq t) =P(exp(s(S_n -E[S_n])) \geq exp(st))$ for $s >0$, since $x \mapsto e^{sx}$ is strictly increasing for $s>0$. Hint: Markov's inequality (extended version).

![Hoeffding 1](https://hackmd.io/_uploads/BkAM1LnLT.jpg)

[Hoeffding’s inequality and convex ordering](https://eventuallyalmosteverywhere.wordpress.com/tag/increasing-convex-order/)

[Concentration inequality](https://en.wikipedia.org/wiki/Concentration_inequality)

---

**Fundamental concept**: $X$ is a r.v.; if there exist simple r.v.'s $\phi_i$ such that $\phi_i \to X$ and $E[\phi_i] \to \bar{X} <\infty$, then $E[X] \doteq \int X(\omega)dP \doteq \bar{X}$.

**conditional expectation**

$E[X|Y=1]$ is the conditional expectation of $X$ given the event $\{Y=1\}$ (a number; by contrast, $E[X|Y]$ is **a random variable**).

**Radon–Nikodym theorem**

Let $\mu$ be a $\sigma$-finite measure and $\lambda$ a **signed measure** on a $\sigma$-algebra $\Sigma$ of subsets of $\Omega$. Assume $\lambda$ is absolutely continuous with respect to $\mu$ ($\lambda \ll \mu$: $\mu(A)=0 \Rightarrow \lambda(A)=0$). Then there is a Borel measurable function $g: \Omega \to \bar{R}$ such that $\lambda(A)=\int_A d \lambda=\int_A g \, d\mu$ for all $A \in \Sigma$. If $h$ is another such function, then $g=h$ a.e. $[\mu]$.

$g$ is called the **Radon–Nikodym derivative or density** of $\lambda$ with respect to $\mu$, written $\frac{d\lambda}{d\mu}$.

:::info
**Fundamental theorem of calculus**

![FTC_geometric.svg](https://hackmd.io/_uploads/r18ZKwEoa.png)

![Fundamental_theorem_of_calculus_(animation_)](https://hackmd.io/_uploads/rJus1DNsT.gif)

$F(x) \doteq \int_a^x f(t)dt, \ \frac{dF(t)}{dt}=f(t), t \in (a,x)$

$dF \to d\lambda$
$dt \to d\mu$
$\frac{dF}{dt} \to \frac{d\lambda}{d\mu}$
:::

$X$ is an integrable random variable. Assume $G$ is a $\sigma$-field, $G \subset F$.
The **conditional expectation** of $X$ given $G$, $E[X | G]$ (a **$G$-measurable random variable**), is given by $\int_B X dP=\int_B E[X |G] dP$ (or $E[X1_B] = E[E[X|G]1_B]$) for $B \in G$ (**by the Radon–Nikodym theorem**).

**The $\sigma$-field generated by a random variable $Y$ is denoted by $\sigma_Y$.** Then $E[X|Y]=E[X|\sigma_Y]$.

**conditional density**

Let $X, Y$ be r.v.'s with joint density $f_{XY}(x,y)$.

$P(Y \in C| x-h < X <x+h)=\frac{P(x-h<X<x+h , Y \in C)}{P(x-h<X<x+h)} \sim \frac{2h\int_Cf_{XY}(x,y)dy} { \int_{x-h}^{x+h}f_X(u)du}$ $\simeq \frac{\int_Cf_{XY}(x,y)dy}{f_X(x)}$.

Define $h(y|x)=f_{XY}(x,y)/f_X(x)$, the conditional density of $Y$ given $X=x$.

Let $B(x)$ be a measurable set depending on $x$; given $X=x$, the event occurs iff $Y \in B(x)$. $P(Y \in B(x)|X=x)=\int 1_{B}(x,y)h(y|x)dy=\int_{B(x)}h(y|x)dy$, by Fubini's theorem.

**Properties.**
1. If $G$ is the trivial field $\{\emptyset,\Omega\}$, then $E[X|G] = E[X]$ (a constant random variable).
2. $E[X|X]=X$
3. If $X$ is $G$-measurable, then $E[XY|G] = XE[Y|G]$.
4. If $G_1 \subset G$, then $E( E[X|G]|G_1)=E[X|G_1]$.
5. If $\sigma(X)$ (the $\sigma$-field generated by $X$) and $G$ are **independent**, then $E[X|G]=E[X]$ (a constant random variable).
6. Monotone convergence. If $0 \leq X_n$ and $X_n \uparrow X$ with $E[|X|] < \infty$, then $E[X_n|G] \uparrow E[X|G]$.
7. Dominated convergence. If $\lim_{n \to \infty}X_n =X$ a.s. and $|X_n| \leq Y$ with $E[Y] < \infty$, then $\lim_{n \to \infty}E[X_n|G] =E[X|G]$.
8. Fatou's lemma. If $0 \leq X_n$, then $E[\liminf X_n|G] \leq \liminf E[X_n|G]$.

![1280px-ConvexFunction.svg](https://hackmd.io/_uploads/Bk22lIXia.png)

10. $|E[X|G]| \leq E[|X|\,|G]$.
11. Jensen's inequality. If $g(x)$ is a convex function on an interval $I$, i.e., for all $x$, $y \in I$, $\lambda \in (0,1)$, $g(\lambda x+(1-\lambda)y) \leq \lambda g(x)+(1-\lambda)g(y)$, and $X$ is a r.v. with range $I$, then $g(E[X|G]) \leq E[g(X)|G]$.
12. Let $X$ and $Y$ be two independent r.v.s and $\phi(x,y)$ be such that $E[|\phi(X,Y)|] < \infty$.
Then $E[(\phi(X,Y))|Y]=G(Y)$, where $G(y)=E[\phi(X,y)]$.
13. $E[\,\cdot\,|G]$ is a **projection operator** on $L^1(\Omega,F,P)$, since $E[E[\,\cdot\,|G]|G]=E[\,\cdot\,|G]$.

$L^p$ space: $(X,\Sigma ,\mu )$ is a measure space, and $1 \leq p <\infty$. Let $||f||_p \doteq (\int |f|^p d\mu)^{\frac{1}{p}} < \infty$. Notice that $||f||_p=0 \Rightarrow f=0$ **a.e.**

**Remark.** $f_n \to f$ in $L^2 \nrightarrow f_n \to f$ a.e., but there exists a subsequence $f_{n_j} \to f$ a.e.

$f_n \to f$ a.e. $\nrightarrow f_n \to f$ in $L^2$, e.g., $f_n=\sqrt{n}x^n$, $0\leq x<1$.

**There are still many mathematical issues to deal with, e.g. seminorms; in engineering we therefore work with good functions (finite energy, measurable functions, complete inner product spaces, ...) and $p=2$ — the most important spaces being $L^2$ and $l^2$.** Interested readers should study Functional Analysis.

![](https://hackmd.io/_uploads/Hyky76_o3.png)

**Gibbs phenomenon**

**Fourier Series**

Let $\phi_n(x) =e^{2\pi i nx}$, $\hat{f}(n)=\int_0^1 f(x) \overline{\phi_n(x)}dx$ and $S_N(x)=\sum_{n=-N}^{N}\hat{f}(n)\phi_n(x)$. Then
1. $||\phi_n||^2=\int_0^1 |\phi_n(x)|^2dx=1$
2. If $n \neq k$, $<\phi_n,\phi_k>=0$
3. If $f(x)=\sum_{n=-N}^N c_n\phi_n(x)$, then $\int_0^1 |f|^2=\sum_{n=-N}^N |c_n|^2$
4. $\sum_{n=-N}^N |\hat{f}(n)|^2= \int_0^1 |S_N(x)|^2 dx$

**Bessel’s Inequality**

Let $f$ be a periodic, square-integrable function on $[0,1]$, and write $\hat{f}$ for its Fourier coefficients. Then, for every $N$, we have $\sum_{n=-N}^N |\hat{f}(n)|^2 \leq \int_0^1 |f|^2$.

**Hence we have $f(x)=\sum c_n\phi_n(x)$ in the $L^2$ sense ($\lim_{N \to \infty} ||f-S_N||_{L^2}=0$), if $f \in L^2[0,1]$.**

**Lebesgue–Stieltjes Integral**

$X$ is a r.v. with distribution function $F_X(x)$ (a right-continuous function), where $F(b)-F(a)=P(X(\omega) \in (a,b])$.

Then $E[h(X)]=\int_{\Omega} h(X(\omega))dP(\omega)=\int h(x)dP_X(x)=\int h(x)d F_X(x)$.

**Remark.** $X$ is a discrete r.v.: $F_X(x_{n+1})-F_X(x_n)=P(X=x_{n+1})$, and $E[X]=\sum x_n P(X=x_n)$.

**Sums of independent random variables**

$X,Y$ discrete r.v.'s, $Z=X+Y$: $P(Z=z)=\sum_y p_X(z-y)p_Y(y)$, a **convolution**.

Continuous r.v.'s:
$Z=X+Y$, $f_Z(z)=\int f_X(z-y)f_Y(y)dy$

**Proof.** Let $1_{A}$, $A=\{ \omega \in \Omega \mid Z=X(\omega)+Y(\omega) \leq z\}$.

$P(Z\leq z)=E[1_A]=E[E[1_A|Y]]=E[P(X+Y \leq z\,|Y)]$

$E[P(X+Y \leq z\,|Y)]=\int F_X(z-y)dF_Y$, since $X,Y$ are independent.

**[Moment generating function (mgf)](https://en.wikipedia.org/wiki/Moment-generating_function) of $X$**

$m(t)=E[e^{tX}]=E[\sum_{n=0}^{\infty}\frac{ t^nX^n}{n!}]=\sum_{n=0}^{\infty}\frac{ t^nE[X^n]}{n!}$

$E[X^n]=\frac{d^nm(t)}{dt^n}|_{t=0}$

For a **random vector** $\vec{X}$: $m_{\vec{X}}(\vec{t}) \doteq E[e^{<\vec{t},\vec{X}>}]$

**Characteristic function of $X$**

$\phi(t)=E[e^{itX}]=\int e^{itx}dF_X(x)$

If $f_X(x)=F_X^{'}(x)$, then $\phi(t)=\int e^{itx}f_X(x)dx$

**Proposition.** There is a **bijection** between **probability distributions** and **characteristic functions**. That is, for any two random variables $X_1$, $X_2$, both **have the same probability distribution** if and only if $\phi_{X_1}=\phi_{X_2}$.

If $X,Y$ are independent, then $\phi_{X+Y}=\phi_X \phi_Y$.

For a **random vector** $\vec{X}$: $\phi_{\vec {X}}(\vec{t}) \doteq E[e^{i<\vec{t},\vec{X}>}]$

**Proposition.** Suppose that $X, Y$ are two random vectors whose **mgf**s $\phi_X(t), \phi_Y(t)$ exist for all $t \in B_{\varepsilon}(0)$ for some $\varepsilon >0$. If $\phi_X(t) =\phi_Y(t)$ for all $t \in B_{\varepsilon}(0)$, then the cdfs of $X$ and $Y$ are identical. ($B_{\varepsilon}(0)=(-\varepsilon, \varepsilon)$ in $R^1$.)

**Central limit theorem**

$X_1, X_2, ...$ i.i.d., $E[X_i]=\mu$, $VAR(X_i)=\sigma^2$, $\bar{X}_n=\frac{X_1+X_2+...+X_n}{n}$

$\sqrt{n}(\bar{X}_n-\mu) \to N(0,\sigma^2)$ (**convergence in distribution**)

**Proof.** Let $M_n(t)=E[e^{t\sqrt{n}(\bar{X}_n-\mu)}]$. **Using Taylor’s formula with a remainder**, find $M_n(t)$, then take the limit $n \to \infty$: $M_n(t) \to e^{\frac{1}{2}\sigma^2 t^2}$.

Ref. Introduction To Stochastic Calculus With Applications, by F. C. Klebaner

$L^2(P)$ processes (second-order processes): processes for which $E[|X_t|^2] < \infty$.
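The CLT above can be checked numerically. This sketch (my own choice of Uniform(0,1) summands and sample sizes) compares $P(\sqrt{n}(\bar X_n - \mu)/\sigma \le 1)$ against $\Phi(1)$:

```python
import math
import random

# Monte Carlo check of the CLT: for i.i.d. Uniform(0,1) summands
# (mu = 1/2, sigma^2 = 1/12), sqrt(n)*(Xbar_n - mu)/sigma is roughly N(0,1),
# so P(... <= 1) should be close to Phi(1) ~ 0.8413.
random.seed(1)
n, trials = 500, 5_000
sigma = math.sqrt(1 / 12)

def standardized_mean():
    xbar = sum(random.random() for _ in range(n)) / n
    return math.sqrt(n) * (xbar - 0.5) / sigma

hits = sum(standardized_mean() <= 1.0 for _ in range(trials))
phi_1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))  # standard normal CDF at 1
print(hits / trials, phi_1)
```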
---

<h3>Markov Chain</h3>

![](https://hackmd.io/_uploads/BykPy9Eo3.jpg)

**Discrete-time Markov chain**

A discrete-time Markov chain is a sequence of random variables $X_1, X_2, X_3, ...$ with the Markov property $P(X_{n+1}=x_{n+1}|X_{n}=x_n,X_{n-1},....X_1)=P(X_{n+1}=x_{n+1}|X_n=x_n)$

Time-homogeneous: $P(X_{n+1}=x_{n+1}|X_{n}=x_n)=P(X_n=x_{n+1}|X_{n-1}=x_{n})$

A discrete-time Markov chain with memory (or a Markov chain of order $m$), where $m$ is finite, is a process satisfying $P(X_{n+1}=x_{n+1}|X_{n}=x_n,X_{n-1},...,X_{n-m+1},....X_1)=P(X_{n+1}=x_{n+1}|X_n=x_n,...,X_{n-m+1}=x_{n-m+1})$

Assume $X_n=j \in \{0,1,2,...\}$; let $p_{ij}=P(X_n=j|X_{n-1}=i)$.

$P=[p_{ij}]$: **transition matrix**

$P(X_{n+1}=k|X_{n-1}=i)=\sum_j P(X_{n+1}=k,X_n=j|X_{n-1}=i)=\sum_j P(X_{n+1}=k|X_n=j,X_{n-1}=i)P(X_n=j|X_{n-1}=i)$

Hence $P(X_{n+1}=k|X_{n-1}=i)=\sum_j p_{ij}p_{jk}$.

Assume $\pi^0=[\pi_0,\pi_1,...]$ is the initial distribution. $P(X_{2}=k|X_{0}=i)=\sum_j p_{ij}p_{jk}$, so $P(X_2=k)=\sum_i\pi_i\sum_j p_{ij}p_{jk}$.

Let $\pi^n(j) =P(X_n=j)$; then $\pi^n=\pi^0 P^n$. Assume $\lim_{n \to \infty} \pi^n=\pi$ exists; then we have $\pi P=\pi$.

Matrix form: $P^2=P \times P$

Stationary probability: $\pi=\pi P$ and $\sum \pi_i=1$.

**Continuous-time Markov chain**

Knowing $X(t)=i$, $X(t+h)=j$ is independent of previous values $(X_{s}:s<t)$, and as $h \to 0$, for all $j$ and all $t$, $P(X(t+h)=j|X(t)=i)=\delta_{ij}+q_{ij}h+o(h)$, where (little-o) $\lim_{h \to 0} \frac{o(h)}{h}=0$, $\delta_{ii}=1$, and $\delta_{ij}=0$ if $i \ne j$. $q_{ij}$ is the **transition rate**.

![](https://hackmd.io/_uploads/SJxhZgOji2.png)

Time-homogeneous: $p_{ij}(t)=P(X(t+s)=j|X(s)=i).$ Then
1. $\frac{p_{ij}(\delta t)}{\delta t}=\frac{q_{ij}\delta t+o(\delta t)}{\delta t}=q_{ij}+o(\delta t)/ \delta t$, if $i \neq j$
2. $\frac{p_{ii}(\delta t)-1}{\delta t}=q_{ii}+o(\delta t)/ \delta t$, if $i=j$.
3. $\sum_{j}p_{ij}(t)=1$, $0 \leq p_{ij}(t) \leq 1$
4. $q_{ii} <0$
5. **Chapman–Kolmogorov equation**
6.
$P_i(X(t+s)=k)=\sum_j P_i(X(t)=j,X(t+s)=k)$; hence $p_{ik}(t+s)=\sum_j p_{ij}(t)p_{jk}(s)$. We have $\bf{P(t+s)=P(t)P(s)}$
7. $\lim_{t \downarrow 0}P(t)=I$
8. $Q \doteq [q_{ij}]$, $q_{ij}$ infinitesimal transition rate, $Q$ infinitesimal generator
9. $\sum_j q_{ij}=0$
10. $P(t)=exp(tQ)$, $t \geq 0$
11. Transition rate $q_{ij}$, $i \neq j$

From 1. and 2., the holding time in state $i$ is exponentially distributed with rate $-q_{ii}$, and, with $P=[P_{ij}(t)]$ and $P_{ij}(t)=p_{ij}(t)$:

$\frac{P(t+h)-P(t)}{h}=P(t)\frac{(P(h)-I)}{h}$

$P^{'}=\lim_{h \downarrow 0}P(t)\frac{(P(h)-I)}{h}=P(t)Q$

$\frac{P(h+s)-P(s)}{h}=\frac{(P(h)-I)}{h}P(s)$

Then $P'=QP=PQ$.

For the steady state, $P'=0$, so $QP=PQ=0$. The steady-state probability $\pi$ satisfies $\pi Q=0$ and $\sum \pi_i=1$.

If the limit exists, then $P^{n} \to \hat{P}$ as $n\to \infty$.

Kolmogorov **backward** equation: $P'=QP$, $P(0)=I$.

**Poisson process** with parameter $\lambda$, with state space $\{0,1,2...\}$:
1. $q_{ij}=\lambda$, for $j=i+1$
2. $q_{ii}=-\lambda$, $i=0,1,2...$
3. $q_{ij}=0$, otherwise.

**Global balance equations**

$\frac{p_{ij}(\delta t)}{\delta t}=\frac{q_{ij}\delta t+o(\delta t)}{\delta t}=q_{ij}+o(\delta t)/ \delta t$, if $i \neq j$

![](https://hackmd.io/_uploads/H19S3Lijn.jpg)

$0=\sum_j \pi_j q_{ji}$, or equivalently $\pi_i \sum_{j \in S-\{i\}}q_{ij}=\sum_{j \in S-\{i\}} \pi_j q_{ji}$. The left-hand side represents the $\bf{total \ flow \ out \ of \ state \ i \ into \ states \ other \ than}$ $i$, while the right-hand side represents the $\bf{total \ flow \ out \ of \ all \ states}$ $j \neq i$ into state $i$.
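The fixed point $\pi = \pi P$ of the discrete-time discussion above can be found by simply iterating $\pi^n = \pi^0 P^n$; the 2-state transition matrix below is a made-up example:

```python
# Stationary distribution of a small discrete-time Markov chain by
# power iteration: pi^n = pi^0 P^n converges to the pi solving pi = pi P.
P = [[0.9, 0.1],
     [0.4, 0.6]]

def step(pi, P):
    """One step of pi <- pi P (row vector times matrix)."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

pi = [1.0, 0.0]            # initial distribution pi^0
for _ in range(200):       # pi^n = pi^0 P^n
    pi = step(pi, P)

print(pi)  # converges to the stationary distribution [0.8, 0.2]
```

Balance check by hand: $0.1\,\pi_0 = 0.4\,\pi_1$ with $\pi_0+\pi_1=1$ gives $\pi=(0.8,\,0.2)$.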
**stationary probability (steady-state probability)** $\pi Q=0$, and $\sum \pi_i=1$

<h1>Hidden Markov Models (HMM)</h1>

![HiddenMarkovModel.svg](https://hackmd.io/_uploads/SyGOE90Ip.png)

:::info
**<h3>Assumptions</h3>**

$Q=q_1,q_2,...,q_N$ a set of $N$ states (Markov chain)

$A=[a_{ij}]$ transition matrix (Markov chain)

$O=o_1, o_2, ...,o_T$ a sequence of $T$ observations

$B=b_i(o_t)$ a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation $o_t$ being generated from a state $i$

$\vec{\pi}=[\pi_1,...,\pi_N]$ an initial probability distribution (Markov chain)

**Output independence**: $p(o_t|q_1,...,q_T,o_1,...,o_T)=p(o_t|q_t)$

$\lambda =(\vec{\pi},A,B)$ the model
:::

1. **(likelihood)** Determine the likelihood $P(O|\lambda)$.
2. Given an observation sequence $O$ and an HMM $\lambda$, discover the **best** **hidden** state sequence $Q$. **best???** ML, maximum mutual information (MMI), $arg \ max$, $LLM$, $MAP$...
3. **(learning)** Given an observation sequence $O$ and the set of states in the HMM, learn the HMM parameters $A$ and $B$.

$P(O,Q)=P(O|Q)P(Q)=\prod_{t=1}^Tp(o_t|q_t) \prod_{t=1}^Tp(q_t|q_{t-1})$

$P(O)=\sum_{Q}P(O,Q)$

$a_t(j)=P(o_1,...,o_t,q_t=j|\lambda)$

$a_{t-1}(i)$: the forward path probability from the previous time step

$b_j(o_t)$: the state observation likelihood of the observation symbol $o_t$ given the current state $j$

**1. Initialization**: $a_1(j)=\pi_j b_j(o_1)$, $1 \leq j \leq N$.

**2. Recursion**: the **forward** probability at time $t$ is $a_t(j)=\sum_{i=1}^Na_{t-1}(i)a_{ij}b_j(o_t)$.

**3. Termination**: $P(O|\lambda)=\sum_{j=1}^N a_T(j)$.

The **backward** probability $\beta$ is the probability of seeing the observations from time $t +1$ to the end, given that we are in state $i$ at time $t$ (and given $\lambda$): $\beta_t(i)=P(o_{t+1},...,o_{T}|q_t=i,\lambda)$

1. **Initialization**: $\beta_T(i)=1$, $i=1,...,N$
2.
**Recursion**: $\beta_t(i)=\sum_j a_{ij}b_j(o_{t+1})\beta_{t+1}(j)$, $1 \leq i \leq N$, $1 \leq t <T$
3. **Termination**: $P(O|\lambda)=\sum_j \pi_j b_j(o_1)\beta_1(j)$

<span class="blue">

$$
\begin{cases}
q_i &\longmapsto & b_j(o_{t+1}) \longmapsto & & q_j & \\
\leftarrow a_t(i) & t & & \beta_{t+1} \to
\end{cases}
$$

</span>

Let $\eta_t(i,j) =P(q_t=i,q_{t+1}=j|O, \lambda)=\frac{ a_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j)}{P(O|\lambda)}$. Hence $\eta_t(i,j)=\frac{ a_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j)}{\sum_i \sum_j a_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j)}$

$\gamma_t(i)=\sum_j \eta_t(i,j)$

Let $\pi_i$ be the expected number of times in state $i$ at time $t=1$.
Let $M_{ij}$ be the expected number of transitions from state $i$ to state $j$.
Let $K_i=\sum_j M_{ij}$ be the expected number of transitions from state $i$.
Let $S_j$ be the expected number of times in state $j$.
Let $B_j(k)$ be the expected number of times in state $j$ while observing $o_k$.

:::info
EM (expectation–maximization) algorithm **(No guarantee exists that the sequence converges to a maximum likelihood estimator.)**

**initialize $\pi$, A and B** (**a big problem**)

iterate until convergence:

**E-step (expectation step, update variables)**
$\gamma_t(i)$, $\eta_t(i,j)$, $\forall i,j,t$

**M-step (maximization step, update hypothesis)**
$\hat{a}_{ij}=\frac{M_{ij}}{K_i}=\frac{\sum_{t=1}^{T-1} \eta_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
$\hat{b}_j(k)=\frac{\sum_{t:\,o_t=v_k} \gamma_t(j)}{\sum_{t=1}^T \gamma_t(j)}$

**return A, B**

---

[Markov Model](https://web.ntnu.edu.tw/~algo/HiddenMarkovModel.html)
[Hidden Markov Model (part 2)](https://medium.com/ai-academy-taiwan/hidden-markov-model-part-2-ac46dcdd42d1)
[Hidden Markov Model (part 3)](https://medium.com/ai-academy-taiwan/hidden-markov-model-part-2-ac46dcdd42d1)
[EM algorithm, convergence???](https://www.statlect.com/fundamentals-of-statistics/EM-algorithm)
[如何辨別機器學習模型的好壞?秒懂Confusion
Matrix](https://ycc.idv.tw/confusion-matrix.html?fbclid=IwAR38n8o0iL9wREJkZHwmu2m1G1Se3z6TafiAsW35UXgYEdh7oFE2s8tc22g)
[Hidden Markov Models](https://web.stanford.edu/~jurafsky/slp3/A.pdf)
[A tutorial on hidden Markov models and selected applications in speech recognition, by L.R. Rabiner](https://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf)
:::

---

**memoryless property**

$X \geq 0$, $P(X>t+s|X>s)=P(X>t)$

Let $f(x)=P(X>x)$; then $P(X>t)=P(X>t+s|X>s)=P(X>t+s,X>s)/P(X>s)$. Hence $f(t+s)=f(t)f(s)$.

$f(x)$ monotonically decreasing with $f(0)=1$ and $f(t+s)=f(t)f(s)$ $\Rightarrow$ exponential distribution with parameter $\lambda >0$: $P(X>s)=e^{-\lambda s}$

Proof.
1. Check rational numbers $n/m$: $f(n/m)=f(1)^{n/m}$.
2. For general $x$, take rationals $0<r_n<x<q_n$ with $r_n \to x$ and $q_n \to x$.
3. $f(x)=f(1)^x=e^{\ln(f(1))\,x}=e^{-\lambda x}$.

---

![](https://hackmd.io/_uploads/Hy6a7Cki2.jpg)

**Little's Theorem**

$N(t)$: number of customers in the system at time $t$
$\alpha(t)$: number of customers who arrived in $[0,t]$
$T_i$: time spent in the system by the $i$th arriving customer
$\beta(t)$: number of departures up to time $t$

$\frac{1}{t} \int_0^tN(s)ds \leq \frac{1}{t}\sum_{i=1}^{\alpha(t)}T_i=\frac{\alpha(t)}{t} \frac{\sum_{i=1}^{\alpha(t)}T_i}{\alpha(t)}$

$\frac{1}{t} \int_0^tN(s)ds \geq \frac{1}{t}\sum_{i=1}^{\beta(t)}T_i=\frac{\beta(t)}{t} \frac{\sum_{i=1}^{\beta(t)}T_i}{\beta(t)}$

Let $t \to \infty$, $\lim \frac{\alpha(t)}{t} =\lim \frac{\beta(t)}{t}=\lambda$:

$N=\lambda T$

[Ref. Data Networks, by Dimitri P. Bertsekas and Robert G. Gallager](https://web.mit.edu/dimitrib/www/datanets.html)

**PASTA property** (**Poisson Arrivals See Time Averages**)

The probability of the state as seen by an outside random observer (**Poisson arrival**) is the same as the probability of the state seen by an arriving customer.

Proof.
(**The proof is too hard for EE — just know the result; it rests mainly on the memoryless property.**)

**The statistics seen by arrivals and the time-averaged statistics are not necessarily the same.** Suppose the "arrivals" are police officers who always patrol at fixed times, and a thief knows the patrol schedule; then the police never see the thief. **The police conclude that public order is excellent and there are no thieves; the thief says the police are fools.**

**Lemma.** Let $X_1, …, X_n$ be independent exponentially distributed random variables with rate parameters $\lambda_1, …, \lambda_n$. Then $X=min(X_1,X_2,...X_n)$ is also exponentially distributed, with parameter $\lambda=\sum_i \lambda_i$.

Proof. $P(X>x)=P(min (X_1,X_2,...X_n)>x)=P(X_1>x,X_2>x,...,X_n>x)$. Hence, $P(X>x)=\prod_i P(X_i>x)=\prod_i e^{-x\lambda_i} =e^{-x \sum_i \lambda_i}$.

**Theorem.** **Merging independent Poisson processes $\Rightarrow$ Poisson process**

**Splitting a Poisson process**

Let $N(t)$ be a Poisson process with rate $\lambda$. Here, we divide $N(t)$ into two processes $N_1(t)$ and $N_2(t)$ in the following way. For each arrival, a coin with $P(H)=p$ is tossed. If the coin lands heads up, the arrival is sent to the first process ($N_1(t)$); otherwise it is sent to the second process. The coin tosses are independent of each other and are independent of $N(t)$. Then $N_1(t)$ is a Poisson process with rate $λp$; $N_2(t)$ is a Poisson process with rate $λ(1−p)$; and $N_1(t)$ and $N_2(t)$ are independent.

Proof. $N(t)$ = number of arrivals.

$P(N_1(t)=m,N_2(t)=k|N(t)=m+k)=\frac{(m+k)!}{m!k!}p^m(1-p)^k$

$P(N_1(t)=m,N_2(t)=k)=\frac{(p\lambda t)^m e^{-p\lambda t}}{m!}\frac{((1-p)\lambda t)^k e^{-(1-p)\lambda t}}{k!}$

---

**M/M/1 Queue**

![](https://hackmd.io/_uploads/B1Zvu5yn3.png)

1. Arrivals occur at rate $\lambda$ according to a **Poisson process** and move the process from state $i$ to $i + 1$.
2. Service times have an exponential distribution with rate parameter $\mu$ in the M/M/1 queue, where $\frac{1}{\mu}$ is the mean service time.
3. All arrival times and service times are (usually) assumed to be independent of one another.
4. A single server serves customers one at a time from the front of the queue, according to a first-come, first-served discipline.
When the service is complete, the customer leaves the queue and **the number of customers** in the system is reduced by one.
5. The buffer is of infinite size, so there is no limit on the number of customers it can contain.

The model can be described as a continuous-time Markov chain with transition rate matrix

![](https://hackmd.io/_uploads/BJhmv5132.png)

![](https://hackmd.io/_uploads/rJlDw9J23.png)

$\rho = \frac{\lambda}{\mu}$

$\pi_i=(1-\rho)\rho^i$

$P(S>n)=\rho^{n+1}$ — **not linear**: $\log P(S>n)=(n+1)\log \rho$

$N=\sum \ i\pi_i =\frac{\rho}{1-\rho}$

$N=\lambda W$, by Little's formula

![](https://hackmd.io/_uploads/Hk22K9k33.png)

![](https://hackmd.io/_uploads/SkfBus1nh.jpg)

![](https://hackmd.io/_uploads/HJb5uiy3h.jpg)

![](https://hackmd.io/_uploads/Bkv8q3J22.png)

M/G/1 system, $X$: the service time

**Pollaczek–Khinchin (P-K) formula**: $W=\frac{\lambda \bar{X^2}}{2(1-\rho)}$

Proof.
$W_i$ = waiting time in queue of the $i$th customer
$R_i$ = **residual service time** seen by the $i$th customer. (By this we mean that if customer $j$ is already being served when $i$ arrives, $R_i$ is the remaining time until customer $j$'s service is complete. If no customer is in service, i.e., the system is empty when $i$ arrives, then $R_i$ is zero.)
$X_i$ = service time of the $i$th customer
$N_i$ = number of customers found waiting in queue by the $i$th customer upon arrival

Then $W_i=R_i+\sum_{j=i-N_i}^{i-1}X_j$, with $N_i$ and $X_{i-1}, ... ,X_{i- N_i}$ independent:

$E[W_i]=E[R_i]+E[\sum_j E[X_j|N_i]]=E[R_i]+\bar{X}E[N_i]$

$W=\lim_{i \to \infty} E[W_i]=\lim_{i \to \infty} ( E[R_i]+\bar X E[N_i])=R+\bar{X}N_Q$

$N_Q=\lambda W$ and $\rho=\lambda \bar{X}$. Hence $W=\frac{R}{1-\rho}$.

Find $\lim_{i \to \infty}E[R_i]=\frac{1}{2 }\lambda\bar{X^2}$, where $M(t)$ is the number of service completions within $[0, t]$ and $\lim \frac{M(t)}{t}=\lambda$ for a **stable** queueing system.
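A numeric sanity check of the formulas above. For exponential service, $\bar{X^2}=2/\mu^2$, so the P-K formula should reduce to the M/M/1 waiting time $\rho/(\mu-\lambda)$; the rates here are made-up:

```python
# M/M/1 sanity check: stationary distribution pi_i = (1 - rho) * rho**i,
# mean number in system N = rho / (1 - rho), and Little's formula N = lambda * W.
lam, mu = 2.0, 5.0                      # hypothetical arrival and service rates
rho = lam / mu                          # utilization, must be < 1 for stability

pi = [(1 - rho) * rho**i for i in range(200)]   # geometric, tail truncated
N = sum(i * p for i, p in enumerate(pi))        # mean number in system

print(N, rho / (1 - rho))               # agree: ~0.6667
print(N / lam)                          # W, by Little's formula

# P-K formula with exponential service: E[X^2] = 2 / mu^2, and
# W = lam * E[X^2] / (2 * (1 - rho)) reduces to rho / (mu - lam).
Wq_pk = lam * (2 / mu**2) / (2 * (1 - rho))
print(Wq_pk, rho / (mu - lam))          # agree: ~0.1333
```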
![](https://hackmd.io/_uploads/BksdkRy3h.jpg)

<h1>Stochastic Dynamic Programming</h1>

A discrete-time Markov chain with **memory (or order)** $m$ is a process satisfying $P(X_{n+1}=j|X_n,X_{n-1},...,X_0)=P(X_{n+1}=j|X_n,...,X_{n-m+1})$.

[RL: Reinforcement Learning](https://www.terasoft.com.tw/support/tech_articles/reinforcement_learning_a_brief_guide.asp)

NLP: Neuro-Linguistic Programming
NLP: Natural Language Processing
RNN: Recurrent neural network — handwriting recognition was among the earliest successful applications of RNNs (2009)

![balance1](https://hackmd.io/_uploads/SkkKvhaIa.jpg)

[ML (Expectation-Maximization) Algorithm](https://www.geeksforgeeks.org/ml-expectation-maximization-algorithm/)

The following also cover the important Viterbi algorithm:

[Hidden Markov Model (part 1), by Kinna Chen](https://medium.com/ai-academy-taiwan/hidden-markov-model-part-1-d80d56811c2a)
[Hidden Markov Model (part 2)](https://medium.com/ai-academy-taiwan/hidden-markov-model-part-2-ac46dcdd42d1)
[Hidden Markov Model (part 3): the training theory of HMMs in depth](https://medium.com/ai-academy-taiwan/hidden-markov-model-part-3-dd2ec1480408)
[Markov Model](https://web.ntnu.edu.tw/~algo/HiddenMarkovModel.html)
[Speech and Language Processing. Daniel Jurafsky & James H. Martin. Chapter A](https://web.stanford.edu/~jurafsky/slp3/A.pdf)
[A tutorial on hidden Markov models and selected applications in speech recognition, by Rabiner](https://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf)

---

<h1>Large Deviations and Chernoff Bound</h1>

$X$ is a random variable; let $Y=e^{sX}$ and $\alpha=e^{s\delta}$. From Markov's inequality, we have $P(Y \geq \alpha) \leq \frac{E[e^{sX}]}{\alpha}=E[e^{s(X-\delta)}]$.
Assume $s>0$. Then $\{ e^{sX} \geq e^{s\delta} \} \Longleftrightarrow \{ sX \geq s\delta \} \Longleftrightarrow \{ X \geq \delta \}$, so

$P(Y \geq \alpha)=P(X \geq \delta ) \leq e^{-s\delta}E[e^{sX}]$

Let $g(s)=e^{-s\delta}E[e^{sX}]$. Then

$g^{'}(s)=E[(X-\delta)e^{s(X-\delta)}]$

$g^{''}(s)=E[(X-\delta)^2e^{s(X-\delta)}] >0$

so $g(s)$ is a convex function of $s$, with $g^{'}(0)=E[X]-\delta<0$ when $\delta > E[X]$; hence $g^{'}(s^*)=0$ at some minimizer $s^* >0$.

Hence $P(X\geq \delta) \leq exp(-s^{*} \delta) E[exp(s^*X)]$ for $\delta >E[X]$.

The exponential function is convex, so by Jensen's inequality, $E[e^{sX}] \geq e^{sE[X]}$, $s>0$.

Similarly, if $s<0$: $P(X \leq \delta) \leq exp(-s^{*} \delta) E[exp(s^*X)]$ for $\delta <E[X]$.

$S_N=\sum_{i=1}^N X_i /N$, $X_1, X_2, ...X_N$ i.i.d. Then we have $P(S_N \geq \delta)\leq exp(-s^{*} \delta) E[exp(s^*S_N)]$ for $\delta >E[S_N]$.

**Cramér's theorem (large deviations)**

Logarithmic moment generating function of $X$: $\Lambda(s)=log E[e^{sX}]$

![](https://hackmd.io/_uploads/BybJBzZh2.png)

Legendre transform of $\Lambda$: $\Lambda^*(x)=sup_{s \in R}(sx-\Lambda(s))$

$X_1, X_2, ...,X_N$ i.i.d.: $\lim_{N \to \infty}\frac{1}{N} log(P(\frac{\sum_{i=1}^N X_i}{N} \geq x))=-\Lambda^*(x)$ for all $x>E[X_1]$

$\Lambda^*$: rate function

**Theorem (Cramér’s Theorem)** Given a sequence of i.i.d. real-valued random variables $X_i$, $i ≥ 1$, with a common moment generating function $M(θ) =E[exp(θX)]$, $E[X]=\mu$, and rate function $I(x)=\sup_{\theta} \{ \theta x-\log M(\theta) \}$, the following hold:
1. For any closed set $F \subseteq R$, $\limsup_{n \to \infty} \frac{1}{n}\log P(S_n \in F) \leq - \inf_{x \in F} I(x)$
2. For any open set $U \subseteq R$, $\liminf_{n \to \infty} \frac{1}{n} \log P(S_n \in U) \geq -\inf_{x\in U}I(x)$

Remark.
1. $[a, \infty)$ is a closed set.
2. $M(0)=1$.
3. $I(x)$ is a convex non-negative function satisfying $I(µ) = 0$.
$I(x)$ is an increasing function on $[µ, ∞)$ and a decreasing function on $(−∞, µ]$.
$I(x)=sup_{\theta \geq 0} (\theta x-logM(\theta))$ for $x \geq \mu$.
$I(x)=sup_{\theta \leq 0} (\theta x-logM(\theta))$ for $x \leq \mu$.

[Introductory examples and definitions. Elena Kosygina](https://sites.math.northwestern.edu/~auffing/SNAP/Notes1.pdf)
[A basic introduction to large deviations: Theory, applications, simulations∗](https://arxiv.org/pdf/1106.4146.pdf)

<h1>Stochastic Orders</h1>

**Increasing convex order**

Let $X, Y$ be random variables such that $E[\phi(X)] \leq E[\phi(Y)]$ for all increasing convex functions $\phi:R \to R$, provided the expectations exist (denoted by $X \leq_{icx} Y$).

Properties
1. $X \leq_{icx} Y$ iff $E[X-a]^+ \leq E[Y-a]^+$ for all $a$.
2. If $X \leq_{icx} Y$, then $E[X|\Theta] \leq_{icx} E[Y|\Theta]$.
3. $X_1,...,X_n$ i.i.d., $Y_1,...,Y_n$ i.i.d.: if $X_i \leq_{icx} Y_i$, then $\sum X_i \leq_{icx} \sum Y_i$.

$E[X-c]^+$ = loss rate

Ref. Stochastic Orders and Their Applications (Probability and Mathematical Statistics), Moshe Shaked.

---

<h2>Information Entropy</h2>

**Discrete random experiment**

$log(\frac{1}{P(E)})$, where $E$ is an event. When $P(E)$ is close to 1, the surprisal of the event is low, but if $P(E)$ is close to 0, the surprisal of the event is high.

1. $log(x), x>0$ monotone increasing
2. $log(1)=0$
3. $log(a \times b )=log(a)+log(b)$

e.g., a computer exploding the moment it boots (probability very low) is a big surprise; the sun rising in the east has probability ~1 (no news outlet reports it).

**measure of uncertainty** or **entropy** (average amount of information): the smaller $p$ is, the larger $-log_2 (p)$ is.

$X$ is a discrete random variable, $P(X=x_i)=p_i$, $x_i \neq x_j$ if $i \neq j$.

$H(X)=H(p_1,p_2,...,p_n)=-\sum_{i=1}^n p_i \ log_2 \ p_i$ **(bits)**; if $p_i=0$, $p_i \ log \ p_i \doteq 0$.

Remark.
1. $lim_{x \to 0^+}x \, log \, x \doteq 0$, so $H(X) \geq 0$. (Note: for continuous random variables, treated later, the differential entropy can be negative.)
2. Below, $log$ denotes $log_2$.
3. Self-information: $I(E_k)=-log P(E_k)$ for an event $E_k$.
4. ($P_I$) Continuity. $H(X)$ is continuous in $p_i$.
5. ($P_{II}$) Symmetry.
$H(p_1,p_2,...,p_n)=H(p_2,p_1,...,p_n)$
6. ($P_{III}$) **Extremal Property:** the maximum of $H(p_1, p_2,...,p_n)$ is $H(\frac{1}{n}, ...,\frac{1}{n})$.
7. $(P_{IV})$ **Additivity**. The event $E_n$ is divided into disjoint subsets such that $E_n=\cup_{k=1}^m F_k$, with $P(F_k)=q_k$ and $p_n=\sum_{k=1}^m q_k$. Let $H_1=H(p_1,...,p_n)$, $H_2=H(p_1,...,p_{n-1},q_1,q_2,...,q_m)$, $H_3=H(\frac{q_1}{p_n},..., \frac{q_m}{p_n})$; then $H_2=H_1+p_n H_3$.
8. **Joint entropy** $H(X,Y) \doteq -\sum_k \sum_j p(k,j)log \ p(k,j)$, where $p(k,j) =P(X=x_k, Y=y_j)$.
9. **Conditional entropy** $H(X|Y) \doteq \sum_j P(Y=y_j)H(X|Y=y_j)$
10. **Measure of Mutual Information** $I(x_i;y_j) \doteq log \frac{p(x_i|y_j)}{p(x_i)}=log \frac{p(x_i,y_j)}{p(x_i)p(y_j)}=I(y_j;x_i)$, and $I(X;Y)=\sum_i \sum_j p(x_i,y_j)I(x_i; y_j)$
11. Symmetry: $H(p_1,...,p_n)$ is invariant under permutations of $(p_1,...,p_n)$.
12. Small for small probabilities: $H(1-p,p) \to 0$ as $p \to 0$.
13. **Kraft–McMillan inequality**. Let $S=\{s_1, s_2,..., s_n\}$ be encoded into a uniquely decodable code over an alphabet of size $r$ with codeword lengths $l_1, l_2, ...,l_n$. Then $\sum_{i=1}^n r^{-l_i} \leq 1$. Conversely, for a given set of natural numbers $l_1, l_2, ...,l_n$ satisfying the above inequality, there exists a uniquely decodable code over an alphabet of size $r$ with those codeword lengths.

**Remark.** For $r=2$, set $q(x_i)=2^{-l_i}$; then $E_p[l]=-E_p[\frac{ln(q(x))}{ln \ 2}]=-\sum p(x_i)log_2(q(x_i))=H(p,q)=$ **cross entropy**.

Proof (of the extremal property). With $p_n=1-\sum_{i=1}^{n-1}p_i$,
$\frac{dH}{dp_k}=-(log_2 e +log \ p_k)+(log_2 e +log \ p_n)=-log(\frac{p_k}{p_n})=0$, hence $p_k=p_n$. Thus $p_i=\frac{1}{n}$. It is a maximum and not a minimum, since $H(1,0,...,0)=0$.

Property.
1. $H(X) \geq H(X|Y)$, with **equality** if and only if $X, Y$ are **statistically independent**.

Fact. $ln \ x \leq x-1$ and $log \ x =ln \ x \ log \ e$ for $x>0$.

Proof.
$H(X|Y)-H(X)=\sum_j \sum_k p(x_k,y_j) log \frac {p(x_k)}{p(x_k|y_j)}$
$\leq log \ e \ \sum_j \sum_k p(x_k,y_j)(\frac{p(x_k)}{p(x_k|y_j)}-1)=log \ e \ (\sum_j \sum_k p(x_k)p(y_j)-1)=0$

2. $I(X;Y)=H(X)+H(Y)-H(X,Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)$
3. **If $X, Y$ are independent, then $I(X;Y)=0$, since $H(X|Y)=H(X)$**. **No information is transmitted through the channel.**
4. In a discrete communication system the channel capacity is the maximum of the transinformation: $C=max \ I(X;Y)$ ([A Mathematical Theory of Communication, by CE Shannon](https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf))
5. (Discrete Noiseless Channel) Let $X=\{x_i\}$ be the alphabet of a source containing $n$ symbols. $C=max \ I(X;Y)= max \ I(X;X)=max \ H(X)=H(\frac{1}{n},...,\frac{1}{n})=log \ n$ bits per symbol.
6. (Discrete Noisy Channel) The noise characteristic $P(Y=y_j|X=x_k)$ of the channel is **prespecified**. $C=max \ I(X;Y)$.

Idea: repeatedly merge the two symbols with the smallest probabilities, forming a binary tree. This yields a **prefix-free binary code** (a set of codewords) with minimum expected codeword length (equivalently, a tree with minimum weighted path length from the root).

Application: jpeg [Patent controversy (jpeg patent dispute)](https://en.wikipedia.org/wiki/JPEG)

![330px-Huffman_coding_example.svg](https://hackmd.io/_uploads/HJLC2bkB6.png)

7. [Huffman coding ~ a particular type of optimal prefix code that is commonly used for lossless data compression](https://en.wikipedia.org/wiki/Huffman_coding)

$X$ is a continuous random variable; the "differential entropy" is $H(X) \triangleq -\int f_X(t)log(f_X(t))dt$ if it exists.

Remark. **Riemann sum**: $-\sum f_X(t_i)\Delta_i log(f_X(t_i)\Delta_i) = -\sum f_X(t_i)logf_X (t_i) \Delta_i -\sum f_X(t_i)\Delta_i log(\Delta_i)$. The first sum tends to $-\int f_X(t) log f_X(t) dt$, while for a uniform grid $\Delta_i=\Delta$ the second term $\approx -log \ \Delta \to \infty$ as $\Delta \to 0$; so differential entropy is not the limit of the discrete entropies.

[Differential Entropy and Probability Distributions](https://www.mtm.ufsc.br/~taneja/book/node14.html)

---

In the following, $P, Q$ are distribution functions, with densities $p=P^{'}$, $q= Q^{'}$ when $P, Q$ are continuous distribution functions.
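The identities above, $I(X;Y)=H(X)+H(Y)-H(X,Y)$ and $I(X;Y)=\sum_i \sum_j p(x_i,y_j) log \frac{p(x_i,y_j)}{p(x_i)p(y_j)}$, can be checked numerically on a small joint distribution; the joint pmf values below are made up for illustration:

```python
import math

def H(dist):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# A small joint pmf p(x, y); the values are hypothetical.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

HX, HY, HXY = H(px.values()), H(py.values()), H(joint.values())
I = HX + HY - HXY                       # I(X;Y) = H(X) + H(Y) - H(X,Y)
I_direct = sum(p * math.log2(p / (px[a] * py[b]))
               for (a, b), p in joint.items())

print(HX, HY, HXY, I, I_direct)
```

Both expressions for $I(X;Y)$ agree, and $I(X;Y) \geq 0$, consistent with $H(X) \geq H(X|Y)$.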
**Kullback–Leibler divergence** $D_{KL}(P||Q)$ and Cross Entropy $H(P,Q)$

---

$P$: true distribution

$D_{KL}(P||Q)$ (also called relative entropy, information gain, or information divergence) is a type of statistical distance: **a measure of how one probability distribution $P$ differs from a second, reference probability distribution $Q$.**

![320px-KL-Gauss-Example](https://hackmd.io/_uploads/Hkf_gkAN6.png)

![kl](https://hackmd.io/_uploads/HkY9alyST.jpg)

$D_{KL}(P||Q)=\sum P(x)log \frac{P(x)}{Q(x)}$ for discrete probability distributions $P, Q$ $(P(x) >0, Q(x) >0)$.

$D_{KL}(P||Q)=-\sum P(x) log Q(x) - (-\sum P(x)log(P(x)))=H(P,Q)-H(P)$, where $H(P,Q)$ is the **cross entropy** of $P$ and $Q$.

**Remarks:**

$P$: true distribution

**Cross Entropy:** average number of total bits to represent an event when coding with $Q$ instead of $P$.

**Relative Entropy (KL Divergence):** average number of **extra** bits to represent an event when coding with $Q$ instead of $P$.

In general, $D_{KL}(P||Q) \neq D_{KL}(Q||P)$, so it is **not** a metric. Likewise $H(P,Q) \neq H(Q,P)$, so it is **not** a metric either.

$D_{KL}(P||Q) \geq 0$, $D_{KL}(P||P)=0$, and $H(P,P)=H(P)$, which is nonzero in general.

$D_{KL}(P||Q)=\int p(x)log \frac{p(x)}{q(x)}dx$ for continuous probability distributions with $p=P^{'}, q= Q^{'}$.

$I(X;Y)=H(X)+H(Y)-H(X,Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)$

$I(X;Y)=D_{KL}(P(X,Y)||P_X \otimes P_Y)$, where $P_X \otimes P_Y (x,y)=P_X(x)P_Y(y)$

![Entropy-mutual-information-relative-entropy-relation-diagram.svg](https://hackmd.io/_uploads/H1ADrmer6.png)

**Relation between maximum likelihood estimation (MLE) and the KL divergence $D_{KL}(P||Q)$**

Assume $P=p_{\theta}(x)$ is the true distribution and $Q=p_{\hat{\theta}}(x)$ is the approximating (model) distribution.
$D_{KL}=\sum_x p_{\theta} (x)log(\frac{p_{\theta}(x)}{p_{\hat{\theta}}(x)})$

$D_{KL} = E_{\theta}[ log(\frac{p_{\theta}(x)}{p_{\hat{\theta}}(x)})]=E_{\theta}[log(p_{\theta}(x))] -E_{\theta}[log(p_{\hat{\theta}}(x)) ]$

$\hat{\theta}_{best}= arg \ min_{\hat{\theta}} D_{KL}(p_{\theta}||p_{\hat{\theta}})$

Given $n$ samples, $E_{\theta}[log(p_{\hat{\theta}}(x)) ] \simeq \frac{1}{n} \sum_{i=1}^n log (p_{\hat{\theta}}(x_i))$ (sample mean).

$\hat{\theta}_{best} \simeq arg \ min_{\hat{\theta}} \ E_{\theta}[log(p_{\theta}(x))] -\frac{1}{n} \sum log (p_{\hat{\theta}}(x_i)) =arg \ min_{\hat{\theta}}-\frac{1}{n} \sum log (p_{\hat{\theta}}(x_i))$
$= arg \ max_{\hat{\theta}} \frac{1}{n} \sum log (p_{\hat{\theta}}(x_i)) = arg \ max_{\hat{\theta}} \prod_{i=1}^n p_{\hat{\theta}}({x_i})$,

since the first term does not depend on $\hat{\theta}$.

**Key Point:** **minimize the KL divergence as a loss function**

See the examples:
[Kullback-Leibler Divergence Explained](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained)

[A Gentle Introduction to Cross-Entropy for Machine Learning](https://machinelearningmastery.com/cross-entropy-for-machine-learning/)

<h1>logistic regression</h1>

**logistic function**

![320px-Logistic-curve.svg](https://hackmd.io/_uploads/By9I64dB6.png)

$L=1, k=1, x_0=0$

A logistic function or logistic curve is a common S-shaped curve (**sigmoid curve**) with the equation $f(x)=\frac{L}{1+e^{-k(x-x_0)}}$, where $x_{0}$ is the value of the function's midpoint, $L$ is the supremum of the values of the function, and $k$ is the logistic growth rate or steepness of the curve. $f: (-\infty,\infty) \to (0,L)$

If $y = f(x) = a / (1 + b c^{–x})$, the **inverse function** is $f^{-1}(y)=\frac{-ln (\frac{(a-y)}{by})}{ln \ c}$.
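A minimal numerical check of this inverse formula, in the common special case $a=1, b=1, c=e$ (the standard sigmoid and its inverse, the logit):

```python
import math

def sigmoid(x):
    """Logistic function with L = 1, k = 1, x0 = 0."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(y):
    """Inverse of the sigmoid: -ln((1 - y)/y) = ln(y / (1 - y))."""
    return -math.log((1.0 - y) / y)

for x in (-3.0, -0.5, 0.0, 2.0):
    y = sigmoid(x)
    assert 0.0 < y < 1.0                 # range of the sigmoid is (0, 1)
    assert abs(logit(y) - x) < 1e-9      # logit inverts sigmoid
print("ok")
```

The logit maps probabilities in $(0,1)$ back to the whole real line, which is why it serves as the link function in logistic regression.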
Let $c=e, a=1, b=1$; then $f^{-1}(y)=-ln \frac{1-y}{y}=ln \frac{y}{1-y}$ (the logit).

![desmos-graph](https://hackmd.io/_uploads/HJ8L-r_r6.png)

$-ln(1-x)$ and $-ln(x)$ are convex functions.

![desmos-graph](https://hackmd.io/_uploads/B18cPS_rT.png)

![1280px-Normal_Distribution_PDF.svg](https://hackmd.io/_uploads/SJNm1OdHa.png)

The meaning of the standard deviation $\sigma$:

[Chebyshev's inequality](https://en.wikipedia.org/wiki/Chebyshev%27s_inequality)

$P(|X-\mu| \geq k \sigma) \leq \frac{1}{k^2}$, if $k>1$.

Let $f(k)=\frac{1}{k^2}$, i.e., $\mu=0, \sigma=1$

![chy _ Desmos_page-0001](https://hackmd.io/_uploads/r1rxQFuS6.jpg)

Birnbaum–Raymond–Zuckerman inequality

$X=(X_1, X_2, ...,X_n), \mu=(\mu_1,...,\mu_n), \sigma=(\sigma_1,...,\sigma_n)$

$P(||X-\mu|| \geq k ||\sigma||) \leq \frac{1}{k^2}$, if $k>1$.

[Logistic Regression — Detailed Overview](https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc)

---

**Bernoulli process**

A Bernoulli process is a finite or infinite sequence of independent random variables $X_1, X_2, X_3, ...$ such that for each $i$, the value of $X_i$ is either 0 or 1, and the probability $p$ that $X_i = 1$ is the same for all $i$. In other words, a Bernoulli process is **a sequence of independent identically distributed Bernoulli trials.**

The probability of a particular sequence with $k$ ones in $n$ trials is $P =p^k(1-p)^{n-k}$.

**Bernoulli MLE (maximum likelihood estimation)**

$L(p) =\prod_{i=1}^n p^{X_i} (1-p)^{1-X_i}$

**log-likelihood**

$LL(p) =log \prod_{i=1}^n p^{X_i} (1-p)^{1-X_i}=(\sum_{i=1}^n X_i)log \ p+(n-\sum_{i=1}^n X_i)log(1-p)$

$\frac{\partial LL(p)}{\partial p}=0 \Rightarrow \hat{p}=\frac {\sum X_i}{n}=sample \ mean$

[Activation function](https://en.wikipedia.org/wiki/Activation_function)

---

Ref.
[Information Theory - Robert Ash](https://doc.lagout.org/Others/Information%20Theory/Information%20Theory/Information%20Theory%20-%20Robert%20Ash.pdf)

Personally, I find the following "physics" concepts very important and interesting (they require some knowledge of the calculus of variations):

[Action principles (least action principle)](https://en.wikipedia.org/wiki/Action_principles)

[The Feynman Lectures on Physics](https://www.feynmanlectures.caltech.edu/I_toc.html)

[From the laws of reflection & refraction to variational principles](https://apmr.matelys.com/BasicsMechanics/VariationalPrinciplesStartingWithSnellDescartesLaw/index.html)

[Theoretical Concepts in Physics, Malcolm S. Longair, University of Cambridge](https://www.cambridge.org/core/books/theoretical-concepts-in-physics/EB289A9081FA4101C4B943AEAC1AAF6E)

[Introduction to the calculus of variations](https://www.open.edu/openlearn/mod/resource/view.php?id=72745)

Very well written, but too difficult for EE/CS readers: Mathematical Methods of Classical Mechanics, V.I. Arnold.