
Probability Measure and Conditional Expectation

Author:

CrazyDog, Ph.D. in Electrical Engineering, NCTU
email: kccddb@gmail.com
Date: 20241030

Copyright: CC BY-NC-SA

家扶基金會

台灣之心愛護動物協會(Heart of Taiwan Animal Care, HOTAC)

無國界醫生
Mathematics (calculus, linear algebra, probability, statistics), Data Structures, Algorithms, DSP, Microprocessors and OS are the inner foundations;
network programming and embedded systems are the fighting techniques.


Put what you learn into practice.

"Learning without thinking leads to confusion; thinking without learning is perilous; thinking after learning brings insight."

♥ With ten thousand books in mind, you command half the world of the pen and the sword ♥

Probability Space

σ-field

Definition 1.1. Let $\mathcal{F}$ be a collection of subsets of $\Omega$. $\mathcal{F}$ is called a "$\sigma$-field" or "$\sigma$-algebra" if the following hold:
(i) $\Omega\in\mathcal{F}$;
(ii) $A\in\mathcal{F}$ implies that $A^c\in\mathcal{F}$; and
(iii) if $A_1,A_2,\ldots\in\mathcal{F}$ is a countable sequence of sets, then $\bigcup_i A_i\in\mathcal{F}$.

measure, measurable space

Definition 1.2. A measure is a nonnegative countably additive set function on a measurable space $(\Omega,\mathcal{F})$, i.e., a function $\mu:\mathcal{F}\to\bar{\mathbb{R}}$ such that
(i) $\mu(A)\ge 0$ for all $A\in\mathcal{F}$, and
(ii) if $A_i\in\mathcal{F}$ is a countable sequence of disjoint sets (i.e., $A_i\cap A_j=\emptyset$ for all $i\ne j$), then
$$\mu\Big(\bigcup_i A_i\Big)=\sum_i\mu(A_i).$$

The pair $(\Omega,\mathcal{F})$, where $\mathcal{F}$ is a "$\sigma$-field", is called a measurable space.

e.g.,

  1. $(\Omega,\mathcal{F})$ with $\mathcal{F}=\{\emptyset,\Omega\}$. $\mathcal{F}$ is the smallest "$\sigma$-field" on $\Omega$, called the trivial "$\sigma$-field".

  2. $\mathcal{F}=2^\Omega$ is the largest "$\sigma$-field" on $\Omega$.

  3. If $P$ is a measure on $(\Omega,\mathcal{F})$ with $0\le P(A)\le P(\Omega)=1$, then $(\Omega,\mathcal{F},P)$ is a probability space with probability measure $P$.

Borel σ-algebra

In mathematics, a Borel set is any set in a topological space that can be formed from open sets (or, equivalently, from closed sets) through the operations of countable union, countable intersection, and relative complement. Borel sets are named after Émile Borel.

For a topological space X, the collection of all Borel sets on X forms a σ-algebra, known as the Borel algebra or Borel σ-algebra. The Borel algebra on X is the smallest σ-algebra containing all open sets (or, equivalently, all closed sets).

  1. Lebesgue measure on $(\mathbb{R}^n,\mathcal{B})$, where $\mathcal{B}$ is the Borel $\sigma$-algebra.

Lebesgue measure

$\mu$ on $\mathbb{R}$: $\mu((a,b))=b-a$, where $(a,b)$ is an open interval.

If $f$ is differentiable on $[a,b]$ and the Lebesgue integral $\int_a^x f'(t)\,dt$ exists, then
$$f(x)-f(a)=\int_a^x f'(t)\,dt.$$

Riemann Integral

Riemann integral $\int_{[a,b]}f\,dt$ = Lebesgue integral $\int_{[a,b]}f\,dt$, if both exist.

Counting measure and $l^p$ spaces.

Let $X$ be any countable set, $\mathcal{M}=\mathcal{P}(X)$ ($\mathcal{M}$: $\sigma$-algebra, $\mathcal{P}(X)$ = power set of $X$), and let $\mu$ be the counting measure. Recall that $\mu(A)$ is the number of points in $A$ if $A$ is finite and equals $\infty$ otherwise. Integration is simply
$$\int f\,d\mu=\sum_{x\in X}f(x)$$
for any non-negative function $f$, and $L^p$ is denoted by $l^p$.

Definition 1.3. Given two measurable spaces $(\Omega,\mathcal{F})$ and $(S,\mathcal{B})$, a function $f:\Omega\to S$ is measurable if $f^{-1}(A)\in\mathcal{F}$ for any $A\in\mathcal{B}$.

random variable (measurable function)

Definition 1.4. A random variable is a measurable function from $(\Omega,\mathcal{F})$ to $(\mathbb{R},\mathcal{B})$, where $\mathcal{B}$ is the Borel $\sigma$-field of $\mathbb{R}$. A random vector is a measurable function from $(\Omega,\mathcal{F})$ to $(\mathbb{R}^d,\mathcal{B}^d)$, where $\mathcal{B}^d$ is the Borel $\sigma$-field of $\mathbb{R}^d$.

$X:\Omega\to\mathbb{R}$ is a random variable; $P(a\le X\le b)\equiv P(\{\omega\in\Omega\mid a\le X(\omega)\le b\})$.
Distribution function of $X$: $F_X(x)\equiv P(X\le x)$

Expectation, conditional expectation
Probability space $(\Omega,\mathcal{F},P)$

  1. Indicator function $1_A$: $1_A(\omega)=1$ if $\omega\in A$, $1_A(\omega)=0$ if $\omega\in\Omega\setminus A$, where $A\in\mathcal{F}$. $E[1_A]\equiv\int_\Omega 1_A(\omega)\,dP(\omega)\equiv P(A)$.
  2. Simple random variable. A r.v. $\phi$ is a simple r.v. if $\phi(\omega)=\sum_{i=1}^n a_i 1_{A_i}(\omega)$, where the $A_i$'s are disjoint measurable sets. $E[\phi]\equiv\sum_{i=1}^n a_i E[1_{A_i}]$. If $E[|\phi|]<\infty$, then $\phi$ is integrable.

$E[X]\equiv\int X(\omega)\,dP$ if it exists.

$(X,\Sigma,\mu)$ is a measure space. A property $P$ is said to hold almost everywhere (almost surely) in $X$ if there exists a set $N\in\Sigma$ with $\mu(N)=0$ such that every $x\in X\setminus N$ has the property $P$. Define $0\times\infty=0$.

e.g.,

$\mathbb{Q}$: rational numbers. Let $f(x)=1$ if $x\in\mathbb{Q}$, $f(x)=0$ if $x\in\mathbb{R}\setminus\mathbb{Q}$. Then $f(x)=0$ a.e. in $\mathbb{R}$, because $\mathbb{Q}$ is countable and $\mu(\mathbb{Q})=0$, where $\mu$ is the Lebesgue measure. $\int f\,d\mu=0$ (as a Lebesgue integral).

Strong Law of Large Numbers
The sample average converges almost surely to the expected value, i.e.,
$$\lim_{n\to\infty}\bar{X}_n=\mu,\quad\text{a.s.}$$
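A minimal Monte Carlo sketch of the statement above (Python standard library; the Uniform(0,1) distribution, seed, and sample sizes are illustrative choices, not from the original notes):

```python
# SLLN illustration: the sample mean of i.i.d. Uniform(0,1) draws
# approaches the expected value 1/2 as n grows.
import random

random.seed(0)

def running_mean(n):
    """Sample mean of n i.i.d. Uniform(0,1) variables."""
    return sum(random.random() for _ in range(n)) / n

# Absolute error of the sample mean for increasing sample sizes.
errors = [abs(running_mean(n) - 0.5) for n in (100, 10_000, 1_000_000)]
```

The absolute error of the sample mean shrinks as $n$ grows, in line with a.s. convergence.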


Schwarz and Hölder Inequality. Let $(S,\Sigma,\mu)$ be a measure space, $1<p<\infty$ and $\frac1p+\frac1q=1$. Then
$$\int|xy|\,d\mu\le\Big(\int|x|^p\,d\mu\Big)^{1/p}\Big(\int|y|^q\,d\mu\Big)^{1/q},$$
i.e.,
$$\|xy\|_{L^1}\le\|x\|_{L^p}\|y\|_{L^q}.$$

$\mu$ = counting measure, $S=\{1,\ldots,n\}$:
$$\sum_{i=1}^n|x_iy_i|\le\Big(\sum_{i=1}^n|x_i|^p\Big)^{1/p}\Big(\sum_{i=1}^n|y_i|^q\Big)^{1/q}$$

$p=q=2$: Schwarz inequality.

Hölder's inequality becomes an equality if and only if $|x|^p$ and $|y|^q$ are linearly dependent in $L^1(\mu)$, meaning that there exist real numbers $\alpha,\beta\ge0$, not both of them zero, such that $\alpha|x|^p=\beta|y|^q$ $\mu$-almost everywhere.

Minkowski Inequality

$1\le p<\infty$:
$$\Big(\int|x\pm y|^p\Big)^{1/p}\le\Big(\int|x|^p\Big)^{1/p}+\Big(\int|y|^p\Big)^{1/p},$$
i.e.,
$$\|x\pm y\|_p\le\|x\|_p+\|y\|_p$$


Remark. Random variable $X:\Omega\to\mathbb{R}$.

Markov's inequality

If $X$ is a nonnegative random variable and $a>0$, then
$$P(X\ge a)\le\frac{E[X]}{a}$$

Proof.

Define $1_{\{X\ge a\}}$; then $a\,1_{\{X\ge a\}}\le X$ a.s. Hence $E[a\,1_{\{X\ge a\}}]=aP(X\ge a)\le E[X]$.
Q.E.D.
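The bound can be checked numerically. A sketch assuming an Exponential(1) sample (an arbitrary choice for illustration):

```python
# Numeric check of Markov's inequality P(X >= a) <= E[X]/a for a
# nonnegative r.v.; here X ~ Exponential(1).
import random

random.seed(1)
samples = [random.expovariate(1.0) for _ in range(200_000)]
mean = sum(samples) / len(samples)          # E[X] is 1 for Exponential(1)

# Empirical tail probability never exceeds the Markov bound E[X]/a.
ok = all(
    sum(x >= a for x in samples) / len(samples) <= mean / a
    for a in (0.5, 1.0, 2.0, 4.0)
)
```

For the exponential distribution the bound is loose (the true tail is $e^{-a}$), which is typical of Markov's inequality.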

Extended version for nondecreasing functions

If $\phi$ is a nondecreasing nonnegative function, $X$ is a (not necessarily nonnegative) random variable, and $\phi(a)>0$, then
$$P(|X|\ge a)\le\frac{E[\phi(|X|)]}{\phi(a)}$$

Proof.

$$\{\omega:\ |X(\omega)|\ge a,\ \omega\in\Omega\}\subseteq\{\omega:\ \phi(|X(\omega)|)\ge\phi(a),\ \omega\in\Omega\},$$
then apply Markov's inequality to $\phi(|X|)$.

Hoeffding's inequality

Hoeffding's inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount.

Let $X_1,\ldots,X_n$ be independent random variables such that $a_i\le X_i\le b_i$ almost surely, and let $S_n=\sum_{i=1}^n X_i$.

Then Hoeffding's theorem states that, for all $t>0$:
$$P(S_n-E[S_n]\ge t)\le\exp\Big(-\frac{2t^2}{\sum_i(b_i-a_i)^2}\Big)$$
$$P(|S_n-E[S_n]|\ge t)\le2\exp\Big(-\frac{2t^2}{\sum_i(b_i-a_i)^2}\Big)$$

Proof. (HINT)

$P(S_n-E[S_n]\ge t)=P(\exp(s(S_n-E[S_n]))\ge\exp(st))$ for $s>0$, since $e^{sx}$ is a monotone increasing function of $x$ for $s>0$.
Hint: Markov inequality (extended version), then optimize over $s$.
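A numeric sanity check of the one-sided bound, assuming i.i.d. Uniform(0,1) variables ($a_i=0$, $b_i=1$); $n$, $t$ and the trial count are illustrative:

```python
# Empirical check of Hoeffding's one-sided bound for S_n = sum of n
# i.i.d. Uniform(0,1) variables, so sum (b_i - a_i)^2 = n and E[S_n] = n/2.
import math
import random

random.seed(2)
n, t, trials = 50, 5.0, 20_000
bound = math.exp(-2 * t * t / n)        # exp(-2 t^2 / sum (b_i-a_i)^2)

hits = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    if s - n / 2 >= t:                  # deviation above the mean by >= t
        hits += 1
empirical = hits / trials
```

The empirical deviation probability stays below the Hoeffding bound (and is in fact much smaller).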


Hoeffding’s inequality and convex ordering

Concentration inequality


Fundamental concept:

$X$ is a r.v.; if there exist simple r.v.s $\phi_i$ such that $\phi_i\uparrow X$ and $E[\phi_i]\uparrow\bar X<\infty$, then $E[X]\equiv\int X(\omega)\,dP\equiv\bar X$.

conditional expectation

$E[X\mid Y=1]$: the conditional expectation of $X$ given the event $\{Y=1\}$ (a number; by contrast, $E[X\mid Y]$ defined below is a random variable).

Radon-Nikodym theorem
Let $\mu$ be a $\sigma$-finite measure and $\lambda$ a signed measure on a $\sigma$-field $\Sigma$ of subsets of $\Omega$. Assume $\lambda$ is absolutely continuous with respect to $\mu$ ($\lambda\ll\mu$: if $\mu(A)=0$ then $\lambda(A)=0$). Then there is a Borel measurable function $g:\Omega\to\bar{\mathbb{R}}$ such that
$$\lambda(A)=\int_A d\lambda=\int_A g\,d\mu\quad\text{for all }A\in\Sigma.$$

If $h$ is another such function, then $g=h$ a.e. $[\mu]$.

$g$ is called the Radon-Nikodym derivative or density of $\lambda$ with respect to $\mu$, written $\frac{d\lambda}{d\mu}$.

Fundamental theorem of calculus

$$F(x)\equiv\int_a^x f(t)\,dt,\qquad\frac{dF(t)}{dt}=f(t),\ t\in(a,x)$$

Analogy: $dF\leftrightarrow d\lambda$, $dt\leftrightarrow d\mu$, $\frac{dF}{dt}\leftrightarrow\frac{d\lambda}{d\mu}$.

$X$ is an integrable random variable. Assume $\mathcal{G}$ is a $\sigma$-field, $\mathcal{G}\subseteq\mathcal{F}$. The conditional expectation of $X$ given $\mathcal{G}$, $E[X\mid\mathcal{G}]$ (a $\mathcal{G}$-measurable random variable), is given by
$$\int_B X\,dP=\int_B E[X\mid\mathcal{G}]\,dP\quad(\text{or } E[X1_B]=E[E[X\mid\mathcal{G}]1_B])\quad\text{for }B\in\mathcal{G}$$
(it exists by the Radon-Nikodym theorem).

The $\sigma$-field generated by random variable $Y$ is denoted by $\sigma_Y$.
Then $E[X\mid Y]=E[X\mid\sigma_Y]$.

conditional density
Let $X,Y$ be r.v.s with joint density $f_{XY}(x,y)$.
$$P(Y\in C\mid x-h<X<x+h)=\frac{P(x-h<X<x+h,\ Y\in C)}{P(x-h<X<x+h)}\approx\frac{2h\int_C f_{XY}(x,y)\,dy}{\int_{x-h}^{x+h}f_X(u)\,du}\to\frac{\int_C f_{XY}(x,y)\,dy}{f_X(x)}\quad\text{as }h\to0.$$

Define $h(y\mid x)=f_{XY}(x,y)/f_X(x)$, the conditional density of $Y$ given $X=x$.

Let $B(x)\in\mathcal{F}$ and $X=x$; then $B(x)$ will occur iff $Y\in B(x)$.

$$P(Y\in B(x)\mid X=x)=\int 1_{B(x)}(y)\,h(y\mid x)\,dy=\int_{B(x)}h(y\mid x)\,dy,\quad\text{by Fubini's theorem.}$$

Properties.

  1. If $\mathcal{G}$ is the trivial field $\{\emptyset,\Omega\}$, then $E[X\mid\mathcal{G}]=E[X]$ (a constant random variable).
  2. $E[X\mid X]=X$
  3. If $X$ is $\mathcal{G}$-measurable, then $E[XY\mid\mathcal{G}]=XE[Y\mid\mathcal{G}]$.
  4. If $\mathcal{G}_1\subseteq\mathcal{G}$, then $E(E[X\mid\mathcal{G}]\mid\mathcal{G}_1)=E[X\mid\mathcal{G}_1]$.
  5. If $\sigma(X)$ (the $\sigma$-field generated by $X$) and $\mathcal{G}$ are independent, then $E[X\mid\mathcal{G}]=E[X]$ (a constant random variable).
  6. Monotone convergence. If $0\le X_n$ and $X_n\uparrow X$ with $E[|X|]<\infty$, then $E[X_n\mid\mathcal{G}]\uparrow E[X\mid\mathcal{G}]$.
  7. Dominated convergence. If $\lim_{n\to\infty}X_n=X$ a.s. and $|X_n|\le Y$ with $E[Y]<\infty$, then $\lim_{n\to\infty}E[X_n\mid\mathcal{G}]=E[X\mid\mathcal{G}]$.
  8. Fatou's lemma. If $0\le X_n$, then $E[\liminf X_n\mid\mathcal{G}]\le\liminf E[X_n\mid\mathcal{G}]$.


  1. $|E[X\mid\mathcal{G}]|\le E[|X|\mid\mathcal{G}]$.
  2. Jensen's inequality. If $g(x)$ is a convex function on an interval $I$, i.e., for all $x,y\in I$ and $\lambda\in(0,1)$, $g(\lambda x+(1-\lambda)y)\le\lambda g(x)+(1-\lambda)g(y)$, and $X$ is a r.v. with range $I$, then $g(E[X\mid\mathcal{G}])\le E[g(X)\mid\mathcal{G}]$.
  3. Let $X$ and $Y$ be two independent r.v.s and $\phi(x,y)$ be such that $E[|\phi(X,Y)|]<\infty$. Then $E[\phi(X,Y)\mid Y]=G(Y)$, where $G(y)=E[\phi(X,y)]$.
  4. $E[\cdot\mid\mathcal{G}]$ is a projection operator on $L^1(\Omega,\mathcal{F},P)$, since $E[E[\cdot\mid\mathcal{G}]\mid\mathcal{G}]=E[\cdot\mid\mathcal{G}]$.

Lp space

$(X,\Sigma,\mu)$ is a measure space, and $1\le p<\infty$. Let $\|f\|_p\equiv(\int|f|^p\,d\mu)^{1/p}<\infty$. Notice that $\|f\|_p=0\Leftrightarrow f=0$ a.e.

Remark.

$f_n\to f$ in $L^2$ does not imply $f_n\to f$ a.e.; there exists a subsequence $f_{n_j}\to f$ a.e.
$f_n\to f$ a.e. does not imply $f_n\to f$ in $L^2$; e.g., $f_n=nx^n$, $0\le x<1$.

There are still many mathematical issues to handle (e.g., seminorms). Hence in engineering we work with "good" functions (finite energy, measurable functions, complete inner product spaces) and with $p=2$, the most important case: the $L^2$ and $l^2$ spaces. Interested readers should study Functional Analysis.

Gibbs phenomenon

Fourier Series
Let $\phi_n(x)=e^{2\pi inx}$, $\hat f(n)=\int_0^1 f(x)\overline{\phi_n(x)}\,dx$ and $S_N(x)=\sum_{n=-N}^N\hat f(n)\phi_n(x)$. Then

  1. $\|\phi_n\|^2=\int_0^1|\phi_n(x)|^2\,dx=1$
  2. If $n\ne k$, $\langle\phi_n,\phi_k\rangle=0$
  3. If $f(x)=\sum_{n=-N}^N c_n\phi_n(x)$, then $\int_0^1|f|^2=\sum_{n=-N}^N|c_n|^2$
  4. $\sum_{n=-N}^N|\hat f(n)|^2=\int_0^1|S_N(x)|^2\,dx$

Bessel's Inequality

Let $f$ be a periodic, square-integrable function on $[0,1]$, and write $\hat f$ for its Fourier coefficients. Then, for every $N$, we have
$$\sum_{n=-N}^N|\hat f(n)|^2\le\int_0^1|f|^2$$

Hence we have $f(x)=\sum c_n\phi_n(x)$ in the $L^2$ sense ($\lim_{N\to\infty}\|f-S_N\|_{L^2}=0$), if $f\in L^2[0,1]$.

Lebesgue-Stieltjes Integral

$X$ is a r.v. with distribution function $F_X(x)$, where $F(b)-F(a)=P(X(\omega)\in(a,b])$ and $F$ is right-continuous.

Then
$$E[h(X)]=\int_\Omega h(X(\omega))\,dP(\omega)=\int h(x)\,dP_X(x)=\int h(x)\,dF_X(x)$$

Remark.

$X$ a discrete r.v., $F_X(x_{n+1})-F_X(x_n)=P(X=x_{n+1})$; then $E[X]=\sum x_n P(X=x_n)$.

Sums of independent random variables

$X,Y$ discrete r.v.s:

$Z=X+Y$, $P(Z=z)=\sum_y p_X(z-y)p_Y(y)$ (convolution)

continuous r.v.s:

$Z=X+Y$,
$$f_Z(z)=\int f_X(z-y)f_Y(y)\,dy$$

Proof.

Let $1_A$, $A=\{\omega\in\Omega\mid Z=X(\omega)+Y(\omega)\le z\}$.
$$P(Z\le z)=E[1_A]=E[E[1_A\mid Y]]$$
$$E[E[1_A\mid Y]]=\int F_X(z-y)\,dF_Y(y),$$
since $X,Y$ are independent.
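The discrete convolution formula can be sketched directly; the fair four-sided-die pmf is an illustrative choice:

```python
# Sum of independent discrete r.v.s: the pmf of Z = X + Y is the
# convolution of the individual pmfs.
def convolve_pmf(px, py):
    """px, py: dicts value -> probability. Returns the pmf of X + Y."""
    pz = {}
    for x, p in px.items():
        for y, q in py.items():
            pz[x + y] = pz.get(x + y, 0.0) + p * q
    return pz

die = {k: 0.25 for k in (1, 2, 3, 4)}   # fair four-sided die
pz = convolve_pmf(die, die)             # pmf of the sum of two dice
```

The result is a valid pmf, with the familiar triangular shape peaking at $z=5$.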

Moment generating function (mgf) of $X$:
$$m(t)=E[e^{tX}]=E\Big[\sum_{n=0}^\infty\frac{t^nX^n}{n!}\Big]=\sum_{n=0}^\infty\frac{t^nE[X^n]}{n!}$$
$$E[X^n]=\frac{d^nm(t)}{dt^n}\Big|_{t=0}$$

$X$ a random vector: $m_X(t)\equiv E[e^{\langle t,X\rangle}]$

Characteristic function of $X$:

$$\phi(t)=E[e^{itX}]=\int e^{itx}\,dF_X(x)$$
If $f_X(x)=F_X'(x)$, then $\phi(t)=\int e^{itx}f_X(x)\,dx$.

Proposition.
There is a bijection between probability distributions and characteristic functions. That is, for any two random variables $X_1$, $X_2$, both have the same probability distribution if and only if $\phi_{X_1}=\phi_{X_2}$.

$X,Y$ independent $\Rightarrow$ $\phi_{X+Y}=\phi_X\phi_Y$

$X$ a random vector: $\phi_X(t)\equiv E[e^{i\langle t,X\rangle}]$

Proposition.
Suppose that $X,Y$ are two random vectors whose mgfs $\phi_X(t),\phi_Y(t)$ exist for all $t\in B_\varepsilon(0)$ for some $\varepsilon>0$. If $\phi_X(t)=\phi_Y(t)$ for all $t\in B_\varepsilon(0)$, then the cdfs of $X$ and $Y$ are identical.
($B_\varepsilon(0)=(-\varepsilon,\varepsilon)$ in $\mathbb{R}^1$.)

Central limit theorem

$X_1,X_2,\ldots$ i.i.d., $E[X_i]=\mu$, $\mathrm{VAR}(X_i)=\sigma^2$, $\bar X=\frac{X_1+X_2+\cdots+X_n}{n}$
$$\sqrt n(\bar X-\mu)\to N(0,\sigma^2)\quad\text{(convergence in distribution)}$$

Proof (sketch).

$M(t)=E[e^{t\sqrt n(\bar X-\mu)}]$.
Using Taylor's formula with a remainder, find $M(t)$, then take the limit $n\to\infty$:
$$M(t)\to e^{\frac12\sigma^2t^2}$$
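A simulation sketch of the statement (Uniform(0,1) summands, so $\sigma^2=1/12$; sample sizes and seed are illustrative):

```python
# CLT illustration: sqrt(n)*(sample mean - mu) of i.i.d. Uniform(0,1)
# draws should be approximately N(0, 1/12).
import math
import random

random.seed(3)
n, trials = 100, 10_000
vals = []
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    vals.append(math.sqrt(n) * (xbar - 0.5))      # mu = 1/2

var = sum(v * v for v in vals) / trials           # near sigma^2 = 1/12
# Fraction within one standard deviation; near 0.683 for a normal law.
frac_in_1sigma = sum(abs(v) <= math.sqrt(1 / 12) for v in vals) / trials
```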

Ref. Introduction To Stochastic Calculus With Applications, by F. C. Klebaner

$L^2(P)$ processes (second-order processes): processes for which $E[|X_t|^2]<\infty$.

Markov Chain

Discrete-time Markov chain

A discrete-time Markov chain is a sequence of random variables $X_1,X_2,X_3,\ldots$ with the Markov property
$$P(X_{n+1}=x_{n+1}\mid X_n=x_n,X_{n-1},\ldots,X_1)=P(X_{n+1}=x_{n+1}\mid X_n=x_n)$$

Time-homogeneous:
$$P(X_{n+1}=x\mid X_n=y)=P(X_n=x\mid X_{n-1}=y)$$

A discrete-time Markov chain with memory (or a Markov chain of order $m$), where $m$ is finite, is a process satisfying
$$P(X_{n+1}=x_{n+1}\mid X_n=x_n,X_{n-1},\ldots,X_1)=P(X_{n+1}=x_{n+1}\mid X_n=x_n,\ldots,X_{n-m+1})$$

Assume $X_n=j\in\{0,1,2,\ldots\}$; let $p_{ij}=P(X_n=j\mid X_{n-1}=i)$.

$P=[p_{ij}]$: transition matrix

$$P(X_{n+1}=k\mid X_{n-1}=i)=\sum_j P(X_{n+1}=k,X_n=j\mid X_{n-1}=i)=\sum_j P(X_{n+1}=k\mid X_n=j,X_{n-1}=i)\,P(X_n=j\mid X_{n-1}=i)$$

Hence $P(X_{n+1}=k\mid X_{n-1}=i)=\sum_j p_{ij}p_{jk}$.
Assume $\pi_0=[\pi_0(0),\pi_0(1),\ldots]$ is the initial probability (row) vector. $P(X_2=k\mid X_0=i)=\sum_j p_{ij}p_{jk}$, so $P(X_2=k)=\sum_i\pi_0(i)\sum_j p_{ij}p_{jk}$. Let $\pi_n(j)=P(X_n=j)$; then $\pi_n=\pi_0P^n$. If $\lim_{n\to\infty}\pi_n=\pi$ exists, then $\pi P=\pi$.

Matrix form: $P^2=P\times P$

Stationary probability

$\pi=\pi P$ and $\sum_i\pi_i=1$.
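The iteration $\pi_n=\pi_0P^n\to\pi$ can be sketched by power iteration; the 3-state transition matrix is made up for illustration:

```python
# Stationary distribution of a finite DTMC by power iteration:
# pi_n = pi_0 P^n converges to pi with pi P = pi.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

def step(pi, P):
    """One update pi <- pi P (row vector times matrix)."""
    return [sum(pi[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

pi = [1.0, 0.0, 0.0]            # arbitrary initial distribution
for _ in range(200):
    pi = step(pi, P)

# Residual of the fixed-point equation pi = pi P.
residual = max(abs(a - b) for a, b in zip(pi, step(pi, P)))
```

For an irreducible aperiodic chain the residual shrinks geometrically, and `pi` stays a probability vector throughout.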

Continuous-time Markov chain
Knowing $X(t)=i$, $X(t+h)=j$ is independent of previous values $(X_s:s<t)$, and as $h\to0$, for all $j$ and for all $t$:

$$P(X(t+h)=j\mid X(t)=i)=\delta_{ij}+q_{ij}h+o(h),$$
where (little-o) $\lim_{h\to0}\frac{o(h)}{h}=0$, $\delta_{ii}=1$, and $\delta_{ij}=0$ if $i\ne j$.

$q_{ij}$: transition rate.

Time-homogeneous:

$p_{ij}(t)=P(X(t+s)=j\mid X(s)=i)$. Then

  1. $\frac{p_{ij}(\delta t)}{\delta t}=\frac{q_{ij}\,\delta t+o(\delta t)}{\delta t}=q_{ij}+o(\delta t)/\delta t$, if $i\ne j$
  2. $\frac{p_{ii}(\delta t)-1}{\delta t}=q_{ii}+o(\delta t)/\delta t$, if $i=j$.
  3. $\sum_j p_{ij}(t)=1$, $0\le p_{ij}(t)\le1$
  4. $q_{ii}<0$
  5. Chapman–Kolmogorov equation:
  6. $P_i(X(t+s)=k)=\sum_j P_i(X(t)=j,X(t+s)=k)$, hence $p_{ik}(t+s)=\sum_j p_{ij}(t)p_{jk}(s)$. We have $P(t+s)=P(t)P(s)$.
  7. $\lim_{t\to0}P(t)=I$
  8. $Q\equiv[q_{ij}]$, $q_{ij}$ infinitesimal transition rate, $Q$ infinitesimal generator
  9. $\sum_j q_{ij}=0$
  10. $P(t)=\exp(tQ)$, $t\ge0$
  11. Transition rate $q_{ij}$, $i\ne j$

From 1. and 2., the holding time in state $i$ is exponential with rate $-q_{ii}$, and $P'=QP$, where $P(t)=[p_{ij}(t)]$.

$$\frac{P(t+h)-P(t)}{h}=\frac{P(t)(P(h)-I)}{h},\qquad P'=\lim_{h\to0}\frac{P(t)(P(h)-I)}{h}=P(t)Q$$

$$\frac{P(h+s)-P(s)}{h}=\frac{(P(h)-I)P(s)}{h}$$

Then
$$P'=QP=PQ$$

For steady state, $P'=0$; then $QP=PQ=0$.

The steady-state probability $\pi$ satisfies $\pi Q=0$ and $\sum_i\pi_i=1$.

If $\lim_{t\to\infty}p_{ij}(t)$ exists for all $i,j$, then $P(t)\to\hat P$ as $t\to\infty$.

Kolmogorov backward equation

$P'=QP$, $P(0)=I$.

Poisson process with parameter $\lambda$, with state space $\{0,1,2,\ldots\}$:

  1. $q_{ij}=\lambda$, for $j=i+1$
  2. $q_{ii}=-\lambda$, $i=0,1,2,\ldots$
  3. $q_{ij}=0$, otherwise.

Global balance equations

From $\pi Q=0$: $\sum_j\pi_j q_{ji}=0$, or equivalently
$$\pi_i\sum_{j\ne i}q_{ij}=\sum_{j\ne i}\pi_j q_{ji}.$$

The left-hand side represents the total flow out of state $i$ into states other than $i$, while the right-hand side represents the total flow from all states $j\ne i$ into state $i$.

stationary probability (steady-state probability)

$\pi Q=0$, and $\sum_i\pi_i=1$

Hidden Markov Models (HMM)


Assumptions

$Q=q_1,q_2,\ldots,q_N$: a set of $N$ states (Markov chain)
$A=[a_{ij}]$: transition matrix (Markov chain)
$O=o_1,o_2,\ldots,o_T$: a sequence of $T$ observations
$B=b_i(o_t)$: a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation $o_t$ being generated from a state $i$
$\pi=[\pi_1,\ldots,\pi_N]$: an initial probability distribution (Markov chain)

Output independence:

$$p(o_i\mid q_1,\ldots,q_T,o_1,\ldots,o_T)=p(o_i\mid q_i)$$

$\lambda=(\pi,A,B)$: model

  1. (likelihood) Determine the likelihood $P(O\mid\lambda)$.
  2. (decoding) Given an observation sequence $O$ and an HMM $\lambda$, discover the best hidden state sequence $Q$. Best in what sense? ML, maximum mutual information (MMI), $\arg\max$, MAP, …
  3. (learning) Given an observation sequence $O$ and the set of states in the HMM, learn the HMM parameters $A$ and $B$.

$$P(O,Q)=P(O\mid Q)P(Q)=\prod_{i=1}^T p(o_i\mid q_i)\prod_{i=1}^T p(q_i\mid q_{i-1})$$
$$P(O)=\sum_Q P(O,Q)$$

$\alpha_t(j)=P(o_1,\ldots,o_t,q_t=j\mid\lambda)$
$\alpha_{t-1}(i)$: the previous forward path probability from the previous time step

$b_j(o_t)$: the state observation likelihood of the observation symbol $o_t$ given the current state $j$

1. Initialization: initialize $\pi$, $A$ and $B$; $\alpha_1(j)=\pi_jb_j(o_1)$.

2. Recursion: the forward probability at time $t$ is
$\alpha_t(j)=\sum_{i=1}^N\alpha_{t-1}(i)a_{ij}b_j(o_t)$

3. Termination: $P(O\mid\lambda)=\sum_{j=1}^N\alpha_T(j)$

The backward probability $\beta$ is the probability of seeing the observations from time $t+1$ to the end, given that we are in state $i$ at time $t$ (and given $\lambda$):
$$\beta_t(i)=P(o_{t+1},\ldots,o_T\mid q_t=i,\lambda)$$

  1. Initialization: $\beta_T(i)=1$, $i=1,\ldots,N$

  2. Recursion: $\beta_t(i)=\sum_j a_{ij}b_j(o_{t+1})\beta_{t+1}(j)$, $1\le i\le N$, $1\le t<T$

  3. Termination: $P(O\mid\lambda)=\sum_j\pi_jb_j(o_1)\beta_1(j)$

Let
$$\eta_t(i,j)=P(q_t=i,q_{t+1}=j\mid O,\lambda)=\frac{\alpha_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j)}{P(O\mid\lambda)}.$$
Hence
$$\eta_t(i,j)=\frac{\alpha_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j)}{\sum_i\sum_j\alpha_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j)}$$
$$\gamma_t(i)=\sum_j\eta_t(i,j)$$

Let $\pi_i$ be the expected number of times in state $i$ at $t=1$.

Let $M_{ij}$ be the expected number of transitions from state $i$ to state $j$.

Let $K_i=\sum_j M_{ij}$ be the expected number of transitions from state $i$.

Let $S_j$ be the expected number of times in state $j$.

Let $B_j(k)$ be the expected number of times in state $j$ while observing $O=o_k$.

EM (expectation-maximization) algorithm

(No guarantee exists that the sequence converges to a maximum likelihood estimator.)

initialize $\pi$, $A$ and $B$ (a big problem)
iterate until convergence:

E-step (expectation step, update variables): compute $\gamma_t(j)$ and $\eta_t(i,j)$ for all $i,j,t$.

M-step (maximization step, update hypothesis):
$$\hat a_{ij}=\frac{M_{ij}}{K_i}=\frac{\sum_{t=1}^{T-1}\eta_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}$$
$$\hat b_j(k)=\frac{\sum_{t=1,\ o_t=v_k}^{T}\gamma_t(j)}{\sum_{t=1}^T\gamma_t(j)}$$

return $A,B$
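The likelihood computation (problem 1) can be sketched with the forward algorithm above; the 2-state model, emission matrix and observation sequence below are hypothetical:

```python
# A minimal sketch of the HMM forward algorithm computing P(O | lambda).
def forward(pi, A, B, obs):
    """alpha_t(j) = P(o_1..o_t, q_t = j); returns P(O | lambda)."""
    N = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(N)]             # initialization
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]                              # recursion
    return sum(alpha)                                            # termination

pi = [0.6, 0.4]                       # hypothetical initial distribution
A = [[0.7, 0.3], [0.4, 0.6]]          # hypothetical transition matrix
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]  # hypothetical emission probabilities
likelihood = forward(pi, A, B, [0, 1, 2])
```

In practice the recursion is done in log space (or with per-step scaling) to avoid underflow on long sequences.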


Markov Model
Hidden Markov Model (part 2)
Hidden Markov Model (part 3)

EM algorithm, convergence???
How to tell a good machine-learning model from a bad one? Understanding the Confusion Matrix in seconds

Hidden Markov Models

A tutorial on hidden Markov models and selected applications in speech recognition, by L.R. Rabiner


memoryless property

$X\ge0$, $P(X>t+s\mid X>s)=P(X>t)$

Let $f(x)=P(X>x)$; then $P(X>t)=P(X>t+s\mid X>s)=P(X>t+s,X>s)/P(X>s)$.

Hence $f(t+s)=f(t)f(s)$.

$f(x)$ is monotonically decreasing, $f(0)=1$ and $f(t+s)=f(t)f(s)$ $\Rightarrow$ exponential distribution with parameter $\lambda>0$:

$$P(X>s)=e^{-\lambda s}$$

Proof.

  1. Check rational numbers $n/m$: $f(n/m)=f(1)^{n/m}$.
  2. For irrational $x$, take rationals $0<r_n<x<q_n$ with $r_n\uparrow x$ and $q_n\downarrow x$, and use monotonicity.
  3. $f(x)=f(1)^x=e^{\ln(f(1))x}=e^{-\lambda x}$, with $\lambda=-\ln f(1)$.
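A one-line numeric check of the property for the exponential survival function ($\lambda$, $s$, $t$ are arbitrary illustrative values):

```python
# Memoryless property: P(X > t+s | X > s) = P(X > t) for X ~ Exp(lambda).
import math

lam, s, t = 0.5, 1.0, 2.0
surv = lambda x: math.exp(-lam * x)     # survival function P(X > x)
lhs = surv(t + s) / surv(s)             # P(X > t+s | X > s)
rhs = surv(t)                           # P(X > t)
```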

Little's Theorem

$N(t)$: number of customers in the system at time $t$

$\alpha(t)$: number of customers who arrived in $[0,t]$

$T_i$: time spent in the system by the $i$th arriving customer
$\beta(t)$: number of departures up to time $t$

$$\frac1t\sum_{i=1}^{\beta(t)}T_i\le\frac1t\int_0^t N(\tau)\,d\tau\le\frac1t\sum_{i=1}^{\alpha(t)}T_i=\frac{\alpha(t)}{t}\cdot\frac{\sum_{i=1}^{\alpha(t)}T_i}{\alpha(t)}$$

Let $t\to\infty$ and $\lim\frac{\alpha(t)}{t}=\lim\frac{\beta(t)}{t}=\lambda$; then

$$N=\lambda T$$

Ref. Data Networks, by Dimitri P. Bertsekas and Robert G. Gallager

PASTA property (Poisson Arrivals See Time Averages)

The probability of the state as seen by an outside random observer (Poisson arrival) is the same as the probability of the state seen by an arriving customer.

Proof. (Too hard for EE; knowing the result is enough. It relies mainly on the memoryless property.)

The statistics seen by arrivals and the time-average statistics are not necessarily the same.

Suppose the arrivals are police officers who always patrol at fixed times, and the thief knows the patrol schedule; then the police never see the thief.

The police conclude that public order is excellent and there are no thieves;
the thief concludes that the police are fools.

Lemma. Let $X_1,\ldots,X_n$ be independent exponentially distributed random variables with rate parameters $\lambda_1,\ldots,\lambda_n$. Then

$$X=\min(X_1,X_2,\ldots,X_n)$$

is also exponentially distributed, with parameter

$$\lambda=\sum_i\lambda_i$$

Proof.

$$P(X>x)=P(\min(X_1,\ldots,X_n)>x)=P(X_1>x,X_2>x,\ldots,X_n>x)$$
Hence,
$$P(X>x)=\prod_iP(X_i>x)=\prod_ie^{-x\lambda_i}=e^{-x\sum_i\lambda_i}$$

Theorem.
Merging Independent Poisson Processes

Poisson Process

Splitting a Poisson Process
Let $N(t)$ be a Poisson process with rate $\lambda$. Here, we divide $N(t)$ into two processes $N_1(t)$ and $N_2(t)$ in the following way. For each arrival, a coin with $P(H)=p$ is tossed. If the coin lands heads up, the arrival is sent to the first process ($N_1(t)$); otherwise it is sent to the second process. The coin tosses are independent of each other and are independent of $N(t)$. Then $N_1(t)$ is a Poisson process with rate $\lambda p$; $N_2(t)$ is a Poisson process with rate $\lambda(1-p)$; and $N_1(t)$ and $N_2(t)$ are independent.

Proof.

$N(t)$ = number of arrivals.

$$P(N_1(t)=m,N_2(t)=k\mid N(t)=m+k)=\frac{(m+k)!}{m!\,k!}p^m(1-p)^k$$

$$P(N_1(t)=m,N_2(t)=k)=\frac{(p\lambda t)^me^{-p\lambda t}}{m!}\cdot\frac{((1-p)\lambda t)^ke^{-(1-p)\lambda t}}{k!}$$
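A simulation sketch of the thinning argument; the rate, coin bias and horizon are illustrative, and the Poisson sampler uses Knuth's product-of-uniforms method:

```python
# Splitting a Poisson process: thinning N(t) with an independent coin of
# bias p yields a Poisson count with mean p * lambda * t.
import math
import random

random.seed(4)
lam, p, t, trials = 3.0, 0.4, 10.0, 20_000

def poisson(mu):
    """Sample a Poisson(mu) count (Knuth's product-of-uniforms method)."""
    L = math.exp(-mu)
    k, prod = 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

counts = []
for _ in range(trials):
    n = poisson(lam * t)                               # arrivals in [0, t]
    counts.append(sum(random.random() < p for _ in range(n)))  # heads only

mean_n1 = sum(counts) / trials   # should be near p * lambda * t = 12
```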


M/M/1 Queue

  1. Arrivals occur at rate λ according to a Poisson process and move the process from state
    i
    to
    i+1
    .
  2. Service times have an exponential distribution with rate parameter
    μ
    in the M/M/1 queue, where
    1μ
    is the mean service time.
  3. All arrival times and service times are (usually) assumed to be independent of one another.
  4. A single server serves customers one at a time from the front of the queue, according to a first-come, first-served discipline. When the service is complete the customer leaves the queue and the number of customers in the system reduces by one.
  5. The buffer is of infinite size, so there is no limit on the number of customers it can contain.

The model can be described as a continuous-time Markov chain (a birth-death process) with transition rates $q_{i,i+1}=\lambda$ and $q_{i+1,i}=\mu$.

$$\rho=\frac{\lambda}{\mu},\qquad\pi_i=(1-\rho)\rho^i$$

$P(S>n)=\rho^{n+1}$ is not linear in $n$: $\log P(S>n)=(n+1)\log\rho$.

$$N=\sum_ii\pi_i=\frac{\rho}{1-\rho}$$

$N=\lambda W$, by Little's formula.
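The closed-form quantities can be cross-checked numerically ($\lambda$ and $\mu$ are illustrative; the geometric distribution is truncated for summation):

```python
# M/M/1 sanity check: with rho = lambda/mu < 1, pi_i = (1-rho) rho^i,
# N = rho/(1-rho), and Little's formula gives W = N/lambda = 1/(mu-lambda).
lam, mu = 2.0, 5.0
rho = lam / mu

pi = [(1 - rho) * rho**i for i in range(200)]    # truncated stationary dist.
N_from_pi = sum(i * p for i, p in enumerate(pi)) # mean number in system
N_closed = rho / (1 - rho)
W = N_closed / lam                               # Little's formula
```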

M/G/1 system

$X$: the service time

Pollaczek-Khinchin (P-K) formula:

$$W=\frac{\lambda\overline{X^2}}{2(1-\rho)}$$

Proof.

$W_i$ = waiting time in queue of the $i$th customer

$R_i$ = residual service time seen by the $i$th customer. By this we mean that if customer $j$ is already being served when $i$ arrives, $R_i$ is the remaining time until customer $j$'s service time is complete. If no customer is in service (i.e., the system is empty when $i$ arrives), then $R_i$ is zero.

$X_i$ = service time of the $i$th customer

$N_i$ = number of customers found waiting in queue by the $i$th customer upon arrival

Then
$$W_i=R_i+\sum_{j=i-N_i}^{i-1}X_j$$
$N_i$ and $X_{i-1},\ldots,X_{i-N_i}$ are independent, so
$$E[W_i]=E[R_i]+\bar XE[N_i]$$

$$W=\lim_{i\to\infty}E[W_i]=\lim_{i\to\infty}(E[R_i]+\bar XE[N_i])=R+\bar XN_Q$$

$N_Q=\lambda W$ by Little's formula; hence $W=\frac{R}{1-\rho}$.

One can show
$$R=\lim_{i\to\infty}E[R_i]=\frac12\lambda\overline{X^2},$$
where $M(t)$ is the number of service completions within $[0,t]$ and $\lim_{t\to\infty}\frac{M(t)}{t}=\lambda$ for a stable queueing system.

Stochastic Dynamic Programming

A discrete-time Markov chain with memory (or order) $m$ is a process satisfying

$$P(X_{n+1}=j\mid X_n,X_{n-1},\ldots,X_0)=P(X_{n+1}=j\mid X_n,\ldots,X_{n-m+1}).$$

RL: Reinforcement Learning

NLP: Neuro-Linguistic Programming

NLP: Natural Language Processing

RNN: Recurrent neural network; handwriting recognition was one of the earliest successful applications of RNNs (2009)


EM (Expectation-Maximization) Algorithm

Includes the important Viterbi Algorithm
Hidden Markov Model (part 1), by Kinna Chen
Hidden Markov Model (part 2)
Hidden Markov Model (part 3)
A deeper look at the training theory of Hidden Markov Models

Markov Model

Speech and Language Processing. Daniel Jurafsky & James H. Martin. Chapter A

A tutorial on hidden Markov models and selected applications in speech recognition, by Rabiner


Large Deviations and Chernoff Bound

Assume $s>0$. Then
$$\{e^{sX}\ge e^{s\delta}\}=\{sX\ge s\delta\}=\{X\ge\delta\}$$

$$P(X\ge\delta)\le e^{-s\delta}E[e^{sX}]$$

Let $g(s)=e^{-s\delta}E[e^{sX}]$. Then
$$g'(s)=E[(X-\delta)e^{s(X-\delta)}],\qquad g''(s)=E[(X-\delta)^2e^{s(X-\delta)}]>0,$$
so $g$ is convex for $s>0$, and $g'(s)|_{s=0}=E[X]-\delta<0$ when $\delta>E[X]$, so the bound can be minimized over some $s>0$.

Hence
$$P(X\ge\delta)\le\inf_{s>0}e^{-s\delta}E[e^{sX}]\quad\text{for }\delta>E[X].$$

The exponential function is convex, so by Jensen's inequality,
$$E[e^{sX}]\ge e^{sE[X]},\quad s>0.$$

Similarly, if $s<0$:
$$P(X\le\delta)\le e^{-s\delta}E[e^{sX}]\quad\text{for }\delta<E[X].$$

$S_N=\sum_{i=1}^NX_i/N$, $X_1,X_2,\ldots,X_N$ i.i.d. Then
$$P(S_N\ge\delta)\le e^{-s\delta}E[e^{sS_N}]\quad\text{for }\delta>E[S_N].$$

Cramér's theorem (large deviations)

Logarithmic moment generating function of $X$:
$$\Lambda(s)=\log E[e^{sX}]$$

Legendre transform of $\Lambda$:
$$\Lambda^*(x)=\sup_{s\in\mathbb{R}}(sx-\Lambda(s))$$

$X_1,X_2,\ldots,X_N$ i.i.d.:
$$\lim_{N\to\infty}\frac1N\log P\Big(\sum_{i=1}^NX_i\ge Nx\Big)=-\Lambda^*(x)\quad\text{for all }x>E[X_1]$$

$\Lambda^*$: rate function

Theorem (Cramér's Theorem)

Given a sequence of i.i.d. real-valued random variables $X_i$, $i\ge1$, with a common moment generating function $M(\theta)=E[\exp(\theta X)]$, $E[X]=\mu$, and rate function $I(a)=\sup_\theta\{\theta a-\log M(\theta)\}$, the following holds:

  1. For any closed set $F\subseteq\mathbb{R}$: $\limsup_n\frac1n\log P(S_n\in F)\le-\inf_{x\in F}I(x)$
  2. For any open set $U\subseteq\mathbb{R}$: $\liminf_n\frac1n\log P(S_n\in U)\ge-\inf_{x\in U}I(x)$

Remark.

  1. $[a,\infty)$ is a closed set.

  2. $M(0)=1$.

  3. $I(x)$ is a convex non-negative function satisfying $I(\mu)=0$. $I(x)$ is an increasing function on $[\mu,\infty)$ and a decreasing function on $(-\infty,\mu]$. $I(x)=\sup_{\theta\ge0}(\theta x-\log M(\theta))$ for $x\ge\mu$; $I(x)=\sup_{\theta\le0}(\theta x-\log M(\theta))$ for $x\le\mu$.

Introductory examples and definitions. Elena Kosygina

A basic introduction to large deviations: theory, applications, simulations

Stochastic Orders

Increasing convex order

Let $X,Y$ be random variables such that $E[\phi(X)]\le E[\phi(Y)]$ for all increasing convex functions $\phi:\mathbb{R}\to\mathbb{R}$, provided the expectations exist (denoted by $X\le_{icx}Y$).

Properties

  1. $X\le_{icx}Y$ iff $E[(X-a)^+]\le E[(Y-a)^+]$ for all $a$.
  2. If $X\le_{icx}Y$, then $E[X\mid\Theta]\le_{icx}E[Y\mid\Theta]$.
  3. $X_1,\ldots,X_n$ i.i.d., $Y_1,\ldots,Y_n$ i.i.d.; if $X_i\le_{icx}Y_i$, then $\sum X_i\le_{icx}\sum Y_i$.

$E[(X-c)^+]$ = loss rate

Ref. Stochastic Orders and Their Applications (Probability and Mathematical Statistics)
Moshe Shaked (Author)


Information Entropy

Discrete random experiment

$\log\frac{1}{P(E)}$, $E$ is an event.

When $P(E)$ is close to 1, the surprisal of the event is low, but if $P(E)$ is close to 0, the surprisal of the event is high.

  1. $\log(x)$, $x>0$, is monotone increasing
  2. $\log(1)=0$
  3. $\log(a\times b)=\log(a)+\log(b)$

e.g., a computer exploding right at boot has very low probability (high surprisal); the sun rising in the east has probability $\approx1$ (no news reports it).

measure of uncertainty or entropy (average amount of information)

An outcome of probability $p$ carries $-\log_2(p)$ bits of surprisal.

$X$ is a discrete random variable, $P(X=x_i)=p_i$, $x_i\ne x_j$ if $i\ne j$:
$$H(X)=H(p_1,p_2,\ldots,p_n)=-\sum_{i=1}^np_i\log_2p_i\ \text{(bits)};\quad\text{if }p_i=0,\ p_i\log p_i\equiv0$$

Remark.

  1. $\lim_{x\to0^+}x\log x=0$, so $H(X)\ge0$. Note that for continuous random variables (below), differential entropy can be negative.
  2. Below, $\log$ denotes $\log_2$.
  3. Self-information $I(E_k)=-\log P(E_k)$ for event $E_k$.
  4. ($P_I$) Continuity: $H(X)$ is continuous in the $p_i$.
  5. ($P_{II}$) Symmetry: $H(p_1,p_2,\ldots,p_n)=H(p_2,p_1,\ldots,p_n)$.
  6. ($P_{III}$) Extremal property: the maximum of $H(p_1,\ldots,p_n)$ is $H(\frac1n,\ldots,\frac1n)$.
  7. ($P_{IV}$) Additivity: the event $E_n$ is divided into disjoint subsets such that $E_n=\bigcup_{k=1}^mF_k$, $p_n=\sum_{k=1}^mq_k$, $P(F_k)=q_k$. With $H_1=H(p_1,\ldots,p_n)$, $H_2=H(p_1,\ldots,p_{n-1},q_1,q_2,\ldots,q_m)$, and $H_3=H(\frac{q_1}{p_n},\ldots,\frac{q_m}{p_n})$, we have $H_2=H_1+p_nH_3$.
  8. Joint entropy $H(X,Y)\equiv-\sum_k\sum_jp(k,j)\log p(k,j)$, where $p(k,j)=P(X=x_k,Y=y_j)$.
  9. Conditional entropy $H(X\mid Y)\equiv\sum_jP(Y=y_j)H(X\mid Y=y_j)$.
  10. Measure of mutual information:
$$I(x_i;y_j)\equiv\log\frac{p(x_i\mid y_j)}{p(x_i)}=\log\frac{p(x_i,y_j)}{p(x_i)p(y_j)}=I(y_j;x_i)$$
$$I(X;Y)=\sum_i\sum_jp(x_i,y_j)I(x_i;y_j)$$
  11. Symmetry: $H(p_1,\ldots,p_n)$ is invariant under permutation of $(p_1,\ldots,p_n)$.
  12. Small for small probabilities: $H(1-p,p)\to0$ as $p\to0$.
  13. Kraft–McMillan inequality. Let $S=s_1,s_2,\ldots,s_n$ be encoded into a uniquely decodable code over an alphabet of size $r$ with codeword lengths $l_1,l_2,\ldots,l_n$. Then $\sum_{i=1}^nr^{-l_i}\le1$. Conversely, for a given set of natural numbers $l_1,l_2,\ldots,l_n$ satisfying the above inequality, there exists a uniquely decodable code over an alphabet of size $r$ with those codeword lengths.
Remark. With $q(x_i)=\frac1{2^{l_i}}$,
$$E_p[l]=E_p\Big[\frac{-\ln q(x)}{\ln2}\Big]=-\sum p(x_i)\log_2(q(x_i))=H(p,q)=\text{cross entropy}.$$

Proof.

$p_n=1-\sum_{i=1}^{n-1}p_i$:
$$\frac{\partial H}{\partial p_k}=-(\log_2e+\log p_k)+(\log_2e+\log p_n)=\log\Big(\frac{p_n}{p_k}\Big)=0,$$
hence $p_k=p_n$. Thus $p_i=\frac1n$.
It is a maximum and not a minimum, since $H(1,0,\ldots,0)=0$.

Property.

  1. $H(X)\ge H(X\mid Y)$; the equality sign holds if and only if $X,Y$ are statistically independent.
    Fact. $\ln x\le x-1$ and $\log x=\ln x\,\log e$, if $x>0$.

    Proof.
    $$H(X\mid Y)-H(X)=\sum_j\sum_kp(x_k,y_j)\log\frac{p(x_k)}{p(x_k\mid y_j)}\le\sum_j\sum_kp(x_k,y_j)\Big(\frac{p(x_k)}{p(x_k\mid y_j)}-1\Big)\log e=0$$

  2. $I(X;Y)=H(X)+H(Y)-H(X,Y)=H(X)-H(X\mid Y)=H(Y)-H(Y\mid X)$

  3. If $X,Y$ are independent, then $I(X;Y)=0$, since $H(X\mid Y)=H(X)$. No information is transmitted through the channel.

  4. In a discrete communication system the channel capacity is the maximum of the transinformation:

    $C=\max I(X;Y)$ (A Mathematical Theory of Communication, by C.E. Shannon)

  5. (Discrete noiseless channel) Let $X=\{x_i\}$ be the alphabet of a source containing $n$ symbols. $C=\max I(X;Y)=\max I(X;X)=\max H(X)=H(\frac1n,\ldots,\frac1n)=\log n$ bits per symbol.

  6. (Discrete noisy channel) The noise characteristic $P(Y=y_j\mid X=x_k)$ of the channel is prespecified. $C=\max I(X;Y)$.


idea: repeatedly merge the two symbols of smallest probability, forming a binary tree.
A prefix-free binary code (a set of codewords) with minimum expected codeword length (equivalently, a tree with minimum weighted path length from the root).

Application: JPEG

Patent controversy (JPEG patent disputes)

  1. Huffman coding: a particular type of optimal prefix code that is commonly used for lossless data compression.

$X$ is a continuous random variable; the "differential entropy" is
$$H(X)\equiv-\int f_X(t)\log(f_X(t))\,dt$$
if it exists.

Remark. Riemann sum:
$$-\sum_if_X(t_i)\Delta_i\log(f_X(t_i)\Delta_i)=-\sum_if_X(t_i)\log f_X(t_i)\,\Delta_i-\sum_if_X(t_i)\Delta_i\log(\Delta_i).$$
Assume $-\sum_if_X(t_i)\log f_X(t_i)\,\Delta_i\to-\int f_X(t)\log f_X(t)\,dt$; the remaining $-\log\Delta_i$ term diverges as $\Delta_i\to0$, which is why differential entropy is not simply the limit of the discrete entropies.


$P,Q$ are distribution functions; $p=P'$, $q=Q'$ if $P,Q$ are continuous distribution functions.
Kullback–Leibler divergence $D_{KL}(P\|Q)$ and Cross Entropy $H(P,Q)$

$P$: true distribution
$D_{KL}(P\|Q)$ (relative entropy, KL divergence, information gain, information divergence) is a type of statistical distance: a measure of how one probability distribution $P$ is different from a second, reference probability distribution $Q$.

$$D_{KL}(P\|Q)=\sum P(x)\log\frac{P(x)}{Q(x)}\quad\text{for discrete probability distributions }P,Q\ (P(x)>0,\ Q(x)>0).$$
$$D_{KL}(P\|Q)=-\sum P(x)\log Q(x)-\Big(-\sum P(x)\log P(x)\Big)=H(P,Q)-H(P),$$
where $H(P,Q)$ is the cross entropy of $P$ and $Q$.

Remarks:

$P$: true distribution
Cross entropy: average number of total bits to represent an event from $Q$ instead of $P$.
Relative entropy (KL divergence): average number of extra bits to represent an event from $Q$ instead of $P$.

In general, $D_{KL}(P\|Q)\ne D_{KL}(Q\|P)$, so it is not a metric.
$H(P,Q)\ne H(Q,P)$; it is not a metric either.
$D_{KL}(P\|Q)\ge0$, $D_{KL}(P\|P)=0$, and $H(P,P)=H(P)\ge0$.

$$D_{KL}(P\|Q)=\int p(x)\log\frac{p(x)}{q(x)}\,dx\quad\text{for continuous probability distributions with }p=P',\ q=Q'.$$

$$I(X;Y)=H(X)+H(Y)-H(X,Y)=H(X)-H(X\mid Y)=H(Y)-H(Y\mid X)$$

$$I(X;Y)=D_{KL}(P_{(X,Y)}\,\|\,P_X\otimes P_Y),\quad\text{where }P_X\otimes P_Y(x,y)=P_X(x)P_Y(y)$$
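A short numeric sketch of the discrete definitions, verifying $D_{KL}(P\|Q)=H(P,Q)-H(P)$ and the asymmetry (the two pmfs are made up for illustration):

```python
# Discrete KL divergence and cross entropy, in bits.
import math

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.4, 0.4]

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def cross_entropy(p, q):
    return -sum(x * math.log2(y) for x, y in zip(p, q))

def kl(p, q):
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

d_pq = kl(P, Q)
d_qp = kl(Q, P)
# D_KL(P||Q) = H(P,Q) - H(P) should hold exactly (up to float error).
identity_gap = abs(d_pq - (cross_entropy(P, Q) - entropy(P)))
```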

The relation between maximum likelihood estimation (MLE) and the KL divergence $D_{KL}(P\|Q)$:

Assume $P=p_\theta(x)$ is the real distribution and $Q=p_{\hat\theta}(x)$ is the approximate distribution.
$$D_{KL}=\sum_xp_\theta(x)\log\frac{p_\theta(x)}{p_{\hat\theta}(x)}$$

$$D_{KL}=E_\theta\Big[\log\frac{p_\theta(x)}{p_{\hat\theta}(x)}\Big]=E_\theta[\log p_\theta(x)]-E_\theta[\log p_{\hat\theta}(x)]$$

$$\hat\theta_{best}=\arg\min_{\hat\theta}D_{KL}(p_\theta\|p_{\hat\theta})$$

Assume $n$ samples; $E_\theta[\log p_{\hat\theta}(x)]\approx\frac1n\sum_{i=1}^n\log p_{\hat\theta}(x_i)$ (sample mean). Then
$$\hat\theta_{best}\approx\arg\min_{\hat\theta}\Big(E_\theta[\log p_\theta(x)]-\frac1n\sum_i\log p_{\hat\theta}(x_i)\Big)=\arg\min_{\hat\theta}\Big(-\frac1n\sum_i\log p_{\hat\theta}(x_i)\Big)$$
$$=\arg\max_{\hat\theta}\frac1n\sum_i\log p_{\hat\theta}(x_i)=\arg\max_{\hat\theta}\prod_{i=1}^np_{\hat\theta}(x_i).$$

Key Point:
Minimizing the KL divergence as a loss function is equivalent to maximum likelihood estimation.

See the examples:
Kullback-Leibler Divergence Explained
A Gentle Introduction to Cross-Entropy for Machine Learning

logistic regression

logistic function

(logistic curve with $L=1$, $k=1$, $x_0=0$)

A logistic function or logistic curve is a common S-shaped curve (sigmoid curve) with the equation
$$f(x)=\frac{L}{1+e^{-k(x-x_0)}},$$
where $x_0$ is the value of the function's midpoint, $L$ is the supremum of the values of the function, and $k$ is the logistic growth rate or steepness of the curve.

$f:(-\infty,\infty)\to(0,L)$

If $y=f(x)=a/(1+bc^{-x})$, the inverse function is
$$f^{-1}(y)=-\frac{\ln\frac{a-y}{by}}{\ln c}.$$

Let $c=e$, $a=1$, $b=1$; then $f^{-1}(y)=\ln\frac{y}{1-y}$ (the logit).

$-\ln(1-x)$ and $-\ln(x)$ are convex functions.


standard deviation

The meaning of $\sigma$:
Chebyshev's inequality
$$P(|X-\mu|\ge k\sigma)\le\frac1{k^2},\quad\text{informative for }k>1.$$

Let $f(k)=\frac1{k^2}$, i.e., $\mu=0$, $\sigma=1$.

Birnbaum–Raymond–Zuckerman inequality

$X=(X_1,X_2,\ldots,X_n)$, $\mu=(\mu_1,\ldots,\mu_n)$, $\sigma=(\sigma_1,\ldots,\sigma_n)$:

$$P(\|X-\mu\|\ge k\|\sigma\|)\le\frac1{k^2},\quad k>1.$$

Logistic Regression — Detailed Overview


Bernoulli process
A Bernoulli process is a finite or infinite sequence of independent random variables $X_1,X_2,X_3,\ldots$ such that for each $i$, the value of $X_i$ is either 0 or 1, and the probability $p$ that $X_i=1$ is the same for all $i$. In other words, a Bernoulli process is a sequence of independent identically distributed Bernoulli trials.

$P=p^k(1-p)^{n-k}$

Bernoulli MLE (maximum likelihood estimation)

$$L(p)=\prod_{i=1}^np^{X_i}(1-p)^{1-X_i}=p^{\sum_iX_i}(1-p)^{n-\sum_iX_i}$$

log likelihood:

$$LL(p)=\sum_{i=1}^nX_i\log p+\Big(n-\sum_{i=1}^nX_i\Big)\log(1-p)$$

$$\frac{\partial LL(p)}{\partial p}=0\ \Rightarrow\ \hat p=\frac{\sum_iX_i}{n}=\text{sample mean}$$
Activation function


Ref. Information Theory - Robert Ash

Physics concepts I personally find important and interesting (they require knowledge of the calculus of variations):
Action principles (least action principle)
The Feynman Lectures on Physics
From the laws of reflection & refraction to variational principles

Theoretical Concepts in Physics, Malcolm S. Longair, University of Cambridge
Introduction to the calculus of variations

This book is excellent, but too difficult for EE/CS readers:
Mathematical Methods of Classical Mechanics, V.I. Arnold.