Author: , Ph.D. in Electrical Engineering, NCTU
email: kccddb@gmail.com
Date: 20241030
Copyright: CC BY-NC-SA
台灣之心愛護動物協會 (Heart of Taiwan Animal Care, HOTAC)
無國界醫生 (Médecins Sans Frontières / Doctors Without Borders)
Mathematics (calculus, linear algebra, probability, statistics), Data Structures, Algorithms, DSP, Microprocessors, and OS are the inner discipline (內功心法);
network programming and embedded systems are the combat craft (戰鬥工法).
σ-field
Definition 1.1. Let $\mathcal{F}$ be a collection of subsets of $\Omega$. $\mathcal{F}$ is called a "σ-field" or "σ-algebra" if the following hold:
(i) $\emptyset \in \mathcal{F}$.
(ii) $A \in \mathcal{F}$ implies that $A^c \in \mathcal{F}$, and
(iii) if $A_1, A_2, \ldots \in \mathcal{F}$ is a countable sequence of sets, then $\bigcup_i A_i \in \mathcal{F}$.
measure, measurable space
Definition 1.2. A measure is a nonnegative countably additive set function on a measurable space $(\Omega, \mathcal{F})$, i.e., a function $\mu: \mathcal{F} \to [0, \infty]$ such that
(i) $\mu(A) \ge 0$ for all $A \in \mathcal{F}$, and
(ii) if $A_1, A_2, \ldots \in \mathcal{F}$ is a countable sequence of disjoint sets (i.e., $A_i \cap A_j = \emptyset$ for all $i \ne j$), then $\mu\left(\bigcup_i A_i\right) = \sum_i \mu(A_i)$.
The pair $(\Omega, \mathcal{F})$, where $\mathcal{F}$ is a σ-field, is called a measurable space.
e.g.,
$\mathcal{F} = \{\emptyset, \Omega\}$ is the smallest σ-field on $\Omega$, called the trivial σ-field.
$2^\Omega$, the power set of $\Omega$, is the largest σ-field on $\Omega$.
Let $\mu$ be a measure on $(\Omega, \mathcal{F})$ with $\mu(\Omega) = 1$; then $(\Omega, \mathcal{F}, \mu)$ is a probability space with probability measure $\mu$.
Borel algebra
In mathematics, a Borel set is any set in a topological space that can be formed from open sets (or, equivalently, from closed sets) through the operations of countable union, countable intersection, and relative complement. Borel sets are named after Émile Borel.
For a topological space X, the collection of all Borel sets on X forms a σ-algebra, known as the Borel algebra or Borel σ-algebra. The Borel algebra on X is the smallest σ-algebra containing all open sets (or, equivalently, all closed sets).
Lebesgue measure on $\mathbb{R}$: $\lambda((a, b)) = b - a$, where $(a, b)$ is an open interval.
If $F$ is differentiable on $[a, b]$ and the Lebesgue integral exists, then $\int_a^b F'(x)\,dx = F(b) - F(a)$.
Riemann integral = Lebesgue integral if they both exist.
Counting measure and $\ell^p$ spaces.
Let $S$ be any countable set, with σ-algebra $\mathcal{F} = 2^S$ (the power set of $S$),
and let $\mu$ be the counting measure. Recall that $\mu(A)$ is the number of points in $A$ if $A$ is finite and equals $\infty$ otherwise. Integration is simply
$\int_S f\,d\mu = \sum_{s \in S} f(s)$ for any non-negative function $f$, and the resulting space is denoted by $\ell^p$.
Definition 1.3. Given two measurable spaces $(\Omega, \mathcal{F})$ and $(S, \mathcal{S})$, a function $f: \Omega \to S$ is measurable if $f^{-1}(A) \in \mathcal{F}$ for any $A \in \mathcal{S}$.
random variable (measurable function)
Definition 1.4. A random variable $X$ is a measurable function from $(\Omega, \mathcal{F})$ to $(\mathbb{R}, \mathcal{B})$, where $\mathcal{B}$ is the Borel σ-field of $\mathbb{R}$. A random vector is a measurable function from $(\Omega, \mathcal{F})$ to $(\mathbb{R}^n, \mathcal{B}^n)$, where $\mathcal{B}^n$ is the Borel σ-field of $\mathbb{R}^n$.
If $X$ is a random variable, then $\{X \le x\} = \{\omega : X(\omega) \le x\} \in \mathcal{F}$.
Distribution function of $X$: $F(x) = P(X \le x)$.
Expectation, conditional expectation
Probability space $(\Omega, \mathcal{F}, P)$:
$E[X] = \int_\Omega X\,dP$, if it exists.
$(\Omega, \mathcal{F}, \mu)$ is a measure space; a property
is said to hold almost everywhere (almost surely) in $\Omega$ if there exists a set $N \in \mathcal{F}$ with $\mu(N) = 0$ and all $\omega \in \Omega \setminus N$ have the property.
e.g., let $\mathbb{Q}$ denote the rational numbers. Let $f(x) = 1$ if $x \in \mathbb{Q}$, $f(x) = 0$ if $x \notin \mathbb{Q}$. Then $f = 0$ a.e. in $\mathbb{R}$, because $\mathbb{Q}$ is countable and $\lambda(\mathbb{Q}) = 0$, where $\lambda$ is the Lebesgue measure. Hence $\int f\,d\lambda = 0$ (as a Lebesgue integral).
Strong Law of Large Numbers
The sample average converges almost surely to the expected value, i.e., for i.i.d. $X_1, X_2, \ldots$ with $E|X_1| < \infty$,
$\frac{1}{n}\sum_{i=1}^{n} X_i \to E[X_1]$ a.s. as $n \to \infty$.
Schwarz and Hölder Inequality. Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, and $p, q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$. Then $\int |fg|\,d\mu \le \|f\|_p \|g\|_q$ (Hölder's inequality).
With the counting measure, $\sum_i |a_i b_i| \le \left(\sum_i |a_i|^p\right)^{1/p}\left(\sum_i |b_i|^q\right)^{1/q}$.
For $p = q = 2$ this is the Schwarz inequality.
Hölder's inequality becomes an equality if and only if $|f|^p$ and $|g|^q$ are linearly dependent in $L^1$, meaning that there exist real numbers $\alpha, \beta \ge 0$, not both of them zero, such that $\alpha |f|^p = \beta |g|^q$ almost everywhere.
Minkowski Inequality: $\|f + g\|_p \le \|f\|_p + \|g\|_p$ for $p \ge 1$.
Remark. For random variables: $E|XY| \le (E|X|^p)^{1/p}(E|Y|^q)^{1/q}$.
Markov's inequality
If $X$ is a nonnegative random variable and $a > 0$, then $P(X \ge a) \le \frac{E[X]}{a}$.
Define $Y = a\,\mathbf{1}_{\{X \ge a\}}$; then $Y \le X$ a.s. Hence $a\,P(X \ge a) = E[Y] \le E[X]$.
Extended version for nondecreasing functions:
If $\varphi$ is a nondecreasing nonnegative function, $X$ is a (not necessarily nonnegative) random variable, and $\varphi(a) > 0$, then $P(X \ge a) \le \frac{E[\varphi(X)]}{\varphi(a)}$.
Proof. Since $\varphi$ is nondecreasing, $X \ge a$ implies $\varphi(X) \ge \varphi(a)$; hence $P(X \ge a) \le P(\varphi(X) \ge \varphi(a)) \le \frac{E[\varphi(X)]}{\varphi(a)}$ by Markov's inequality.
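As a quick numerical sanity check of Markov's inequality, the following minimal sketch simulates an exponential random variable (an illustrative choice of ours, not from the text) and compares empirical tail probabilities with the bound $E[X]/a$.

```python
# Minimal sketch checking Markov's inequality P(X >= a) <= E[X]/a
# for an exponential random variable; the distribution is illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)  # nonnegative X with E[X] = 1

for a in [1.0, 2.0, 5.0]:
    empirical = np.mean(x >= a)   # estimate of P(X >= a)
    bound = x.mean() / a          # Markov bound E[X]/a
    print(f"a={a}: P(X>=a) ~ {empirical:.4f} <= bound {bound:.4f}")
```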
Hoeffding's inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount.
Let $X_1, \ldots, X_n$ be independent random variables such that $a_i \le X_i \le b_i$ almost surely, and let $S_n = X_1 + \cdots + X_n$.
Then Hoeffding's theorem states that, for all $t > 0$,
$P(S_n - E[S_n] \ge t) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$.
Proof. (HINT) $P(S_n - E[S_n] \ge t) = P(e^{s(S_n - E[S_n])} \ge e^{st})$ for $s > 0$, since $e^{sx}$ is a monotone increasing function of $x$ for $s > 0$.
Hint. Markov inequality (extended version)
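A small simulation sketch of the bound, using Uniform(0,1) variables as an illustrative choice (so $\sum (b_i - a_i)^2 = n$); the Hoeffding bound should dominate the empirical tail.

```python
# Sketch comparing the empirical tail P(S_n - E[S_n] >= t) with the
# Hoeffding bound exp(-2 t^2 / sum (b_i - a_i)^2); parameters illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 50_000
samples = rng.uniform(0.0, 1.0, size=(trials, n))  # X_i in [0, 1]
deviations = samples.sum(axis=1) - n * 0.5         # S_n - E[S_n]

for t in [2.0, 4.0, 6.0]:
    empirical = np.mean(deviations >= t)
    bound = np.exp(-2 * t**2 / n)                  # sum (b_i - a_i)^2 = n
    print(f"t={t}: empirical {empirical:.5f} <= Hoeffding {bound:.5f}")
```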
Hoeffding’s inequality and convex ordering
Fundamental concept: if $X \ge 0$ is a r.v. and there exist simple r.v.'s $X_n$ such that $0 \le X_n \uparrow X$, then $E[X_n] \uparrow E[X]$.
conditional expectation
$E[X \mid A]$ (a random variable when $A$ varies): the conditional expectation of $X$ given the event $A$, $E[X \mid A] = \frac{E[X \mathbf{1}_A]}{P(A)}$ for $P(A) > 0$.
Radon-Nikodym theorem
Let $\mu$ be a σ-finite measure and $\nu$ a signed measure on the σ-field $\mathcal{F}$ of subsets of $\Omega$. Assume $\nu$ is absolutely continuous with respect to $\mu$ ($\nu \ll \mu$: $\nu(A) = 0$ if $\mu(A) = 0$). Then there is a Borel measurable function $f$ such that
$\nu(A) = \int_A f\,d\mu$ for all $A \in \mathcal{F}$.
If $g$ is another such function, then $f = g$ $\mu$-a.e.
$f$ is called the Radon-Nikodym derivative, or density, of $\nu$ with respect to $\mu$, written $f = \frac{d\nu}{d\mu}$.
Fundamental theorem of calculus analogue: if $\nu((a, b]) = \int_a^b f\,d\lambda$, then the Radon-Nikodym derivative $f = d\nu/d\lambda$ plays the role of $F'$.
$X$ is an integrable random variable. Assume $\mathcal{G}$ is a σ-field, $\mathcal{G} \subseteq \mathcal{F}$. The conditional expectation of $X$ given $\mathcal{G}$, denoted $E[X \mid \mathcal{G}]$ (a $\mathcal{G}$-measurable random variable), is given by $\int_A E[X \mid \mathcal{G}]\,dP = \int_A X\,dP$ for all $A \in \mathcal{G}$ (existence and a.s. uniqueness by the Radon-Nikodym theorem).
The σ-field generated by a random variable $Y$ is denoted by $\sigma(Y)$. Then $E[X \mid Y] := E[X \mid \sigma(Y)]$.
conditional density
Let r.v.'s $X, Y$ have joint density $f_{X,Y}(x, y)$.
$f_X(x) = \int f_{X,Y}(x, y)\,dy$.
Define the conditional density of $Y$, given $X = x$: $f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}$ for $f_X(x) > 0$.
Let $g(x) = E[Y \mid X = x] = \int y\,f_{Y|X}(y \mid x)\,dy$; then $E[g(X)] = E[Y]$,
by Fubini's theorem.
Properties. $E[E[X \mid \mathcal{G}]] = E[X]$; if $X$ is $\mathcal{G}$-measurable, $E[XY \mid \mathcal{G}] = X\,E[Y \mid \mathcal{G}]$; tower property: $E[E[X \mid \mathcal{G}_2] \mid \mathcal{G}_1] = E[X \mid \mathcal{G}_1]$ for $\mathcal{G}_1 \subseteq \mathcal{G}_2$.
$L^p$ space
$(\Omega, \mathcal{F}, \mu)$ is a measure space, and $p \ge 1$. Let $L^p = \{f : \int |f|^p\,d\mu < \infty\}$, with $\|f\|_p = \left(\int |f|^p\,d\mu\right)^{1/p}$. Notice that $\|f\|_p = 0$ iff $f = 0$ a.e.
Remark. If $f_n \to f$ in $L^p$, there exists one sub-sequence $f_{n_k} \to f$ a.e.
(but $L^p$ convergence does not force a.e. convergence of the whole sequence, e.g., the "typewriter" sequence of indicators of shrinking intervals converges to 0 in $L^p$ but at no point).
There are still many mathematical issues to handle, e.g., seminorms. So in engineering we work with good functions (finite energy, measurable functions, complete inner product spaces, ...) and, most importantly, the $L^2$ space. Interested readers should study Functional Analysis.
Gibbs phenomenon
Fourier Series
Let $f \in L^2[-\pi, \pi]$, with Fourier coefficients $\hat{f}(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\,e^{-inx}\,dx$. Then $f \sim \sum_{n=-\infty}^{\infty} \hat{f}(n)\,e^{inx}$.
Bessel's Inequality
Let $f$ be a periodic, square-integrable function, and write $\hat{f}$ for its Fourier coefficients. Then, for every $N$, we have $\sum_{n=-N}^{N} |\hat{f}(n)|^2 \le \frac{1}{2\pi}\int_{-\pi}^{\pi} |f(x)|^2\,dx$.
Hence we have $S_N f \to f$ in the $L^2$ sense ($\|S_N f - f\|_2 \to 0$), if $f \in L^2$.
Lebesgue-Stieltjes Integral
$X$ is a r.v. with distribution function $F(x) = P(X \le x)$, where $F$ is a nondecreasing,
right-continuous function.
Then $E[g(X)] = \int g(x)\,dF(x)$.
Remark.
If $X$ is a discrete r.v., $P(X = x_i) = p_i$, then $E[g(X)] = \sum_i g(x_i)\,p_i$.
Sums of independent random variables
Discrete r.v.'s: $P(X + Y = z) = \sum_x P(X = x)\,P(Y = z - x)$,
a convolution.
Continuous r.v.'s: $f_{X+Y}(z) = \int f_X(x)\,f_Y(z - x)\,dx$.
Proof.
Let $Z = X + Y$.
$P(Z \le z) = \iint_{x+y \le z} f_X(x)\,f_Y(y)\,dx\,dy = \int f_X(x)\,F_Y(z - x)\,dx$, since $X, Y$ are independent; differentiating in $z$ gives the convolution.
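The discrete case can be checked directly with a convolution; the two-dice example below is our own illustration, not from the text.

```python
# Sketch: the pmf of Z = X + Y for independent discrete r.v.'s is the
# convolution of the individual pmfs; here two fair dice.
import numpy as np

die = np.full(6, 1/6)              # pmf of one fair die on {1,...,6}
pmf_sum = np.convolve(die, die)    # pmf of the sum on {2,...,12}

for value, prob in zip(range(2, 13), pmf_sum):
    print(f"P(X+Y={value}) = {prob:.4f}")  # peaks at 7 with probability 6/36
```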
Moment generating function (mgf) of $X$: $M_X(t) = E[e^{tX}]$.
$\mathbf{X}$ ~ random vector: $M_{\mathbf{X}}(\mathbf{t}) = E[e^{\mathbf{t}^T\mathbf{X}}]$.
Characteristic function of $X$: $\varphi_X(t) = E[e^{itX}]$.
If $E|X|^n < \infty$, then $\varphi_X^{(n)}(0) = i^n\,E[X^n]$.
Proposition.
There is a bijection between probability distributions and characteristic functions. That is, for any two random variables $X_1$, $X_2$, both have the same probability distribution if and only if $\varphi_{X_1} = \varphi_{X_2}$.
If $X, Y$ are independent, then $\varphi_{X+Y}(t) = \varphi_X(t)\,\varphi_Y(t)$.
$\mathbf{X}$ ~ random vector, then $\varphi_{\mathbf{X}}(\mathbf{t}) = E[e^{i\mathbf{t}^T\mathbf{X}}]$.
Proposition.
Suppose that $\mathbf{X}, \mathbf{Y}$ are two random vectors whose mgfs
exist for all $\|\mathbf{t}\| < \delta$ for some $\delta > 0$. If $M_{\mathbf{X}}(\mathbf{t}) = M_{\mathbf{Y}}(\mathbf{t})$ for all such $\mathbf{t}$, then the cdfs of $\mathbf{X}$ and $\mathbf{Y}$ are identical.
Central limit theorem
Let $X_1, X_2, \ldots$ be i.i.d., $E[X_i] = \mu$, $\operatorname{Var}(X_i) = \sigma^2 < \infty$. Then
$\frac{S_n - n\mu}{\sigma\sqrt{n}} \Rightarrow N(0, 1)$, where $S_n = X_1 + \cdots + X_n$:
convergence in distribution.
Proof.
Using Taylor's formula with a remainder on the characteristic function,
find $\varphi_{(S_n - n\mu)/(\sigma\sqrt{n})}(t) = \left(1 - \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right)\right)^n$.
Take the limit: it tends to $e^{-t^2/2}$, the characteristic function of $N(0, 1)$.
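A simulation sketch of the statement (not of the characteristic-function proof): standardized sums of i.i.d. Uniform(0,1) variables, an illustrative choice of ours, should show roughly standard-normal tail probabilities.

```python
# Sketch of the CLT: standardized sums of i.i.d. uniforms approach N(0,1).
import numpy as np

rng = np.random.default_rng(2)
n, trials = 100, 50_000
mu, sigma = 0.5, np.sqrt(1/12)              # mean/std of Uniform(0,1)

s = rng.uniform(size=(trials, n)).sum(axis=1)
z = (s - n * mu) / (sigma * np.sqrt(n))     # (S_n - n mu) / (sigma sqrt(n))

# Standard normal reference: P(Z > 1) ~ 0.159, P(Z > 2) ~ 0.023.
for c in [1.0, 2.0]:
    print(f"P(Z > {c}) ~ {np.mean(z > c):.4f}")
```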
Ref. Introduction to Stochastic Calculus with Applications, by F. C. Klebaner.
$L^2$ processes (second-order processes): processes for which $E[X_t^2] < \infty$ for all $t$.
Discrete-time Markov chain
A discrete-time Markov chain is a sequence of random variables $X_0, X_1, X_2, \ldots$ with the Markov property: $P(X_{n+1} = x \mid X_0 = x_0, \ldots, X_n = x_n) = P(X_{n+1} = x \mid X_n = x_n)$.
Time-homogeneous: $P(X_{n+1} = j \mid X_n = i) = p_{ij}$, independent of $n$.
A discrete-time Markov chain with memory $m$ (or a Markov chain of order $m$), where $m$ is finite, is a process satisfying $P(X_n = x_n \mid X_{n-1}, \ldots, X_1) = P(X_n = x_n \mid X_{n-1}, \ldots, X_{n-m})$.
Assume the state space is $\{1, \ldots, N\}$; let $p_{ij} = P(X_{n+1} = j \mid X_n = i)$.
$P = [p_{ij}]$: transition matrix.
Hence $P(X_{n+1} = j) = \sum_i P(X_n = i)\,p_{ij}$.
Assume $\pi^{(0)}$ is the initial probability (row vector); then $\pi^{(n)} = \pi^{(0)} P^n$. Let $\pi^{(n+1)} = \pi^{(n)} P$. Assume $\pi = \lim_{n \to \infty} \pi^{(n)}$ exists; then we have $\pi = \pi P$.
Matrix form: $\pi = \pi P$.
Stationary probability: $\pi P = \pi$ and $\sum_i \pi_i = 1$.
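A minimal sketch of computing $\pi = \pi P$ numerically; the 3-state transition matrix is a made-up example.

```python
# Sketch: solve pi = pi P by power iteration and by the left eigenvector.
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])

pi = np.array([1.0, 0.0, 0.0])   # any initial probability row vector
for _ in range(1000):            # pi^(n+1) = pi^(n) P
    pi = pi @ P
print("power iteration:", pi)

# Cross-check: left eigenvector of P for eigenvalue 1, normalized to sum 1.
vals, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
print("eigenvector    :", v / v.sum())
```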
Continuous-time Markov chain
Then, knowing $X(t)$,
$X(t+h)$ is independent of previous values $\{X(s) : s < t\}$, and as $h \to 0$, for all $j$ and for all $t$,
$P(X(t+h) = j \mid X(t) = i) = q_{ij}(t)\,h + o(h)$ for $j \ne i$, where $o(h)$ (little-o) satisfies $o(h)/h \to 0$, and $P(X(t+h) = i \mid X(t) = i) = 1 - \sum_{j \ne i} q_{ij}(t)\,h + o(h)$.
$q_{ij}$: transition rate.
Time Homogeneous: $q_{ij}(t) = q_{ij}$ for all $t$.
Then the chain stays in each state an exponential time and jumps.
From 1. and 2., we have (holding time in state $i$) ~ $\operatorname{Exp}(v_i)$ and (next state $= j$) with probability $q_{ij}/v_i$, where $v_i = \sum_{j \ne i} q_{ij}$ and the jumps are independent of the holding times.
Then, for $p_j(t) = P(X(t) = j)$: $p_j'(t) = \sum_{k \ne j} p_k(t)\,q_{kj} - p_j(t)\sum_{k \ne j} q_{jk}$.
For steady-state, $p_j'(t) = 0$; then $\pi_j \sum_{k \ne j} q_{jk} = \sum_{k \ne j} \pi_k q_{kj}$.
The steady-state probability $\pi$ satisfies $\pi Q = 0$, and $\sum_j \pi_j = 1$.
If $\pi$ exists, then $p_j(t) \to \pi_j$ as $t \to \infty$.
Kolmogorov backward equation: $P'(t) = Q\,P(t)$, $P(0) = I$.
Poisson process with parameter $\lambda$ with state space $\{0, 1, 2, \ldots\}$: $q_{i,i+1} = \lambda$, all other transition rates zero.
Global balance equations
$\pi_j \sum_{k \ne j} q_{jk} = \sum_{k \ne j} \pi_k q_{kj}$, if the chain is in steady state,
for each state $j$;
or, for a set of states $S$,
$\sum_{j \in S} \sum_{k \notin S} \pi_j q_{jk} = \sum_{j \in S} \sum_{k \notin S} \pi_k q_{kj}$.
The left-hand side represents the rate of flow out of state $j$, while the right-hand side represents the rate of flow into state $j$.
stationary probability (steady-state probability)
$\pi Q = 0$, and $\sum_j \pi_j = 1$.
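A minimal sketch of solving $\pi Q = 0$ with $\sum_j \pi_j = 1$; the 3-state rate matrix is an invented example (off-diagonal entries are rates $q_{ij}$, rows sum to zero).

```python
# Sketch: steady-state probabilities of a CTMC from the global balance
# equations pi Q = 0 plus the normalization sum(pi) = 1.
import numpy as np

Q = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -2.0,  1.0],
              [ 2.0,  2.0, -4.0]])

# Replace one balance equation with the constraint sum(pi) = 1.
A = np.vstack([Q.T[:-1], np.ones(3)])
b = np.array([0.0, 0.0, 1.0])
pi = np.linalg.solve(A, b)
print("steady-state pi:", pi)    # satisfies pi Q = 0 up to rounding
```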
$Q = q_1, \ldots, q_N$: a set of N states (Markov Chain)
$A = [a_{ij}]$: transition matrix (Markov Chain)
$O = o_1 o_2 \cdots o_T$: a sequence of observations
$B = b_j(o_t)$: a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation $o_t$ being generated from a state $j$
$\pi$: an initial probability distribution (Markov Chain)
Output Independence: $P(o_t \mid q_1 \ldots q_T, o_1 \ldots o_T) = P(o_t \mid q_t)$
model $\lambda = (A, B, \pi)$
$\alpha_{t-1}(i)$: the previous forward path probability from the previous time step
$b_j(o_t)$: the state observation likelihood of the observation symbol $o_t$ given the current state $j$
1. Initialization: $\alpha_1(j) = \pi_j\,b_j(o_1)$, $1 \le j \le N$.
Then the forward probability at time $t$: $\alpha_t(j) = P(o_1, \ldots, o_t, q_t = j \mid \lambda)$.
2. Recursion: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\,a_{ij}\,b_j(o_t)$, $1 \le j \le N$, $1 < t \le T$.
3. Termination: $P(O \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j)$.
The backward probability $\beta_t(i)$ is the probability of seeing the observations from time $t + 1$ to the end, given
that we are in state $i$ at time $t$ (and given $\lambda$): $\beta_t(i) = P(o_{t+1}, \ldots, o_T \mid q_t = i, \lambda)$.
Initialization: $\beta_T(i) = 1$, $1 \le i \le N$.
Recursion: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)$, $1 \le i \le N$, $1 \le t < T$.
Termination: $P(O \mid \lambda) = \sum_{j=1}^{N} \pi_j\,b_j(o_1)\,\beta_1(j)$.
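A minimal sketch of the forward recursion above on a toy 2-state HMM; all model parameters and the observation sequence are invented for illustration.

```python
# Sketch of the forward algorithm: alpha_1 = pi * b(o_1),
# alpha_t = (alpha_{t-1} A) * b(o_t), P(O|lambda) = sum_j alpha_T(j).
import numpy as np

A = np.array([[0.7, 0.3],          # a_ij = P(q_{t+1}=j | q_t=i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],     # b_j(o) = P(o | q=j), 3 symbols
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
obs = [0, 2, 1]                    # observation indices o_1, o_2, o_3

alpha = pi * B[:, obs[0]]          # initialization
for o in obs[1:]:                  # recursion
    alpha = (alpha @ A) * B[:, o]

print("P(O | lambda) =", alpha.sum())  # termination
```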
Let $\gamma_t(j) = P(q_t = j \mid O, \lambda) = \frac{\alpha_t(j)\,\beta_t(j)}{P(O \mid \lambda)}$ and $\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) = \frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{P(O \mid \lambda)}$. Hence:
Let $\sum_{t=1}^{T} \gamma_t(i)$ be the expected number of times at state $i$.
Let $\sum_{t=1}^{T-1} \xi_t(i, j)$ be the expected number of transitions from state $i$ to state $j$.
Let $\sum_{t=1}^{T-1} \sum_j \xi_t(i, j)$ be the expected number of transitions from state $i$.
Let $\sum_{t=1}^{T} \gamma_t(j)$ be the expected number of times in state $j$.
Let $\sum_{t\,:\,o_t = v_k} \gamma_t(j)$ be the expected number of times in state $j$ and observing $v_k$.
EM (expectation-maximization) algorithm
(No guarantee exists that the sequence converges to a maximum likelihood estimator.)
initialize $\pi$, A and B (a big problem)
iterate until convergence:
E-step (Expectation step, update variables $\gamma_t(j)$, $\xi_t(i, j)$)
M-step (Maximization step, update hypothesis: re-estimate $\pi$, A, B)
return A, B
Markov Model
Hidden Markov Model (part 2)
Hidden Markov Model (part 3)
EM algorithm, convergence???
How to tell a good machine learning model from a bad one? Understanding the Confusion Matrix in seconds (如何辨別機器學習模型的好壞?秒懂Confusion Matrix)
A tutorial on hidden Markov models and selected applications in speech recognition, by L.R. Rabiner
memoryless property
$P(X > s + t \mid X > t) = P(X > s)$ for all $s, t \ge 0$.
Let $g(t) = P(X > t)$; then $g(s + t) = g(s)\,g(t)$.
Hence $g(t) = e^{-\lambda t}$.
$g$ is monotonically decreasing, and $X$ follows an exponential distribution with parameter $\lambda$.
Proof.
Little's Theorem
$N(t)$ = number of customers in the system at time $t$
$\alpha(t)$ = number of customers who arrived in $[0, t]$
$T_i$ = time spent in the system by the $i$-th arriving customer
$\beta(t)$ = number of departures up to time $t$
Let $\lambda = \lim_{t \to \infty} \frac{\alpha(t)}{t}$, $\bar{N} = \lim_{t \to \infty} \frac{1}{t}\int_0^t N(s)\,ds$, $\bar{T} = \lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n T_i$; then $\bar{N} = \lambda \bar{T}$.
Ref. Data Networks, by Dimitri P. Bertsekas and Robert G. Gallager.
PASTA property (Poisson Arrivals See Time Averages)
The probability of the state as seen by an outside random observer (Poisson arrival) is the same as the probability of the state seen by an arriving customer
Proof. (Too hard for EE students; just knowing the result is enough. It follows mainly from the memoryless property.)
The statistics as seen by arrivals and the time-averaged statistics are not necessarily the same.
Suppose the arrivals are police officers who always patrol at fixed times, and the thief knows the patrol schedule; then the police never see the thief.
The police conclude: public order is good, there are no thieves.
The thief says: the police are fools.
Lemma. Let $X_1, \ldots, X_n$ be independent exponentially distributed random variables with rate parameters $\lambda_1, \ldots, \lambda_n$. Then
$\min(X_1, \ldots, X_n)$ is also exponentially distributed, with parameter $\lambda_1 + \cdots + \lambda_n$.
Proof.
$P(\min_i X_i > x) = \prod_{i=1}^{n} P(X_i > x) = \prod_{i=1}^{n} e^{-\lambda_i x}$. Hence, $P(\min_i X_i > x) = e^{-(\lambda_1 + \cdots + \lambda_n)x}$.
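A quick simulation sketch of the lemma, with illustrative rates 1, 2, 3 (so the minimum should be exponential with rate 6).

```python
# Sketch verifying: min of independent exponentials with rates 1, 2, 3
# is exponential with rate 6.
import numpy as np

rng = np.random.default_rng(3)
rates = np.array([1.0, 2.0, 3.0])
x = rng.exponential(scale=1.0 / rates, size=(200_000, 3))
m = x.min(axis=1)

print("sample mean of min:", m.mean())     # near 1/6
print("P(min > 0.2):", np.mean(m > 0.2))   # near exp(-6*0.2) ~ 0.301
```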
Theorem.
Merging Independent Poisson Processes: the sum of independent Poisson processes with rates $\lambda_1, \lambda_2$ is a Poisson process with rate $\lambda_1 + \lambda_2$.
Splitting a Poisson Process
Let $N(t)$ be a Poisson process with rate $\lambda$.
Here, we divide $N(t)$ into two processes $N_1(t)$ and $N_2(t)$ in the following way. For each arrival, a coin with $P(\text{heads}) = p$ is tossed. If the coin lands heads up, the arrival is sent to the first process ($N_1(t)$); otherwise it is sent to the second process. The coin tosses are independent of each other and are independent of $N(t)$. Then,
$N_1(t)$ is a Poisson process with rate $\lambda p$;
$N_2(t)$ is a Poisson process with rate $\lambda(1 - p)$;
$N_1(t)$ and $N_2(t)$ are independent.
Proof.
$N(t)$ = number of arrivals in $(0, t]$; condition on $N(t)$ and use binomial thinning of the Poisson distribution.
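A simulation sketch of the splitting theorem by thinning Poisson counts; $\lambda$, $p$, and $t$ are illustrative choices.

```python
# Sketch: thinning a rate-lambda Poisson count with independent coin flips.
import numpy as np

rng = np.random.default_rng(4)
lam, p, t, trials = 5.0, 0.3, 1.0, 100_000

n = rng.poisson(lam * t, size=trials)          # N(t), total arrivals
n1 = rng.binomial(n, p)                        # heads go to process 1
n2 = n - n1                                    # tails go to process 2

print("mean/var of N1:", n1.mean(), n1.var())  # both near lambda*p*t = 1.5
print("mean/var of N2:", n2.mean(), n2.var())  # both near 3.5
print("cov(N1, N2):", np.cov(n1, n2)[0, 1])    # near 0: independence
```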
M/M/1 Queue
The model can be described as a continuous-time Markov chain (a birth-death chain) with transition rates $q_{i,i+1} = \lambda$ (arrivals) and $q_{i,i-1} = \mu$ (services); utilization $\rho = \lambda/\mu < 1$.
$L = \frac{\rho}{1 - \rho}$ is not linear in $\rho$ (it blows up as $\rho \to 1$).
$W = L/\lambda$, by Little's formula.
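A small arithmetic sketch of these M/M/1 formulas, with illustrative $\lambda$ and $\mu$.

```python
# Sketch of textbook M/M/1 formulas: rho = lambda/mu, L = rho/(1-rho),
# and W = L/lambda from Little's formula.
lam, mu = 8.0, 10.0
rho = lam / mu                  # utilization, must be < 1 for stability

L = rho / (1 - rho)             # mean number in system (nonlinear in rho)
W = L / lam                     # mean time in system, by Little's formula
Lq = rho**2 / (1 - rho)         # mean number waiting in queue
Wq = Lq / lam                   # mean waiting time in queue

print(f"rho={rho:.2f}  L={L:.2f}  W={W:.3f}  Lq={Lq:.2f}  Wq={Wq:.3f}")
```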
M/G/1 system
$X$: the service time
Pollaczek-Khinchin (P-K) formula: $W = \frac{\lambda\,E[X^2]}{2(1 - \rho)}$, where $\rho = \lambda E[X]$.
Proof.
$W_i$ = waiting time in queue of the $i$-th customer
$R_i$ = residual service time seen by the $i$-th customer. By this we mean that if customer $j$ is already being served when $i$ arrives, $R_i$ is the remaining time until customer $j$'s service time is complete. If no customer is in service (i.e., the system is empty when $i$ arrives), then $R_i$ is zero.
$X_i$ = service time of the $i$-th customer
$N_i$ = number of customers found waiting in queue by the $i$-th customer upon arrival
Then
$W_i = R_i + \sum_{j=i-N_i}^{i-1} X_j$; $N_i$ and the service times are independent, so
$E[W] = E[R] + \frac{E[N_Q]}{\mu} = E[R] + \rho\,E[W]$, using Little's formula $E[N_Q] = \lambda E[W]$. Hence
$E[W] = \frac{E[R]}{1 - \rho}$.
Find $E[R]$: $\frac{1}{t}\int_0^t r(s)\,ds = \frac{1}{t}\sum_{i=1}^{M(t)} \frac{X_i^2}{2} \to \frac{\lambda E[X^2]}{2}$, where $M(t)$ is the number of service completions within $[0, t]$, and the limit holds for the stable queueing system.
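A small sketch of the P-K formula, comparing exponential service (M/M/1) against deterministic service (M/D/1) with the same illustrative rates as above.

```python
# Sketch of the P-K formula W = lambda E[X^2] / (2 (1 - rho)) for M/G/1.
lam, mu = 8.0, 10.0
rho = lam / mu

# E[X^2] = 2/mu^2 for exponential service, 1/mu^2 for deterministic service.
for name, ex2 in [("M/M/1", 2.0 / mu**2), ("M/D/1", 1.0 / mu**2)]:
    W = lam * ex2 / (2 * (1 - rho))   # mean waiting time in queue
    print(f"{name}: Wq = {W:.3f}")    # M/D/1 waits half as long as M/M/1
```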
RL: Reinforcement Learning (強化學習)
NLP: Neuro-Linguistic Programming (神經語言規劃)
NLP: Natural Language Processing (自然語言處理)
RNN: Recurrent Neural Network (循環神經網路); handwriting recognition was the earliest successful application of RNNs (2009)
EM (Expectation-Maximization) Algorithm
including the important Viterbi Algorithm
Hidden Markov Model (part 1), by Kinna Chen
Hidden Markov Model (part 2)
Hidden Markov Model (part 3)
An in-depth look at the training theory of Hidden Markov Models (深入了解 Hidden Markov Model 的訓練理論)
Markov Model
Speech and Language Processing. Daniel Jurafsky & James H. Martin. Chapter A
A tutorial on hidden Markov models and selected applications in speech recognition, by Rabiner
Assume the mgf $M(t) = E[e^{tX}]$ exists.
Let $t > 0$.
$e^{tx}$ is an increasing convex function of $x$ for $t > 0$.
Hence for $t > 0$: $P(X \ge a) = P(e^{tX} \ge e^{ta}) \le e^{-ta}\,M(t)$.
The exponential function is convex, so by Jensen's inequality, $M(t) = E[e^{tX}] \ge e^{t\,E[X]}$.
Similarly, if $t < 0$:
$P(X \le a) \le e^{-ta}\,M(t)$ for $t < 0$.
Let $X_1, \ldots, X_n$ be i.i.d.
Then we have, for $t > 0$: $P\left(\sum_{i=1}^n X_i \ge na\right) \le e^{-n(ta - \log M(t))}$.
Cramér's theorem (large deviations)
logarithmic moment generating function of $X$: $\Lambda(t) = \log E[e^{tX}]$
Legendre transform of $\Lambda$: $\Lambda^*(x) = \sup_{t \in \mathbb{R}} \left(tx - \Lambda(t)\right)$
$X_1, X_2, \ldots$ i.i.d.
Assume $\Lambda(t) < \infty$ for all $t$.
$\Lambda^*$: rate function
Theorem (Cramér's Theorem)
Given a sequence of i.i.d. real-valued random variables $X_1, X_2, \ldots$ with a common moment generating function finite for all $t$, and rate function $\Lambda^*$, the following holds: for $x > E[X_1]$,
$\lim_{n \to \infty} \frac{1}{n}\log P\left(\frac{1}{n}\sum_{i=1}^n X_i \ge x\right) = -\Lambda^*(x)$.
Remark.
The level set $\{x : \Lambda^*(x) \le c\}$ is a closed set.
$\Lambda^*$ is a convex non-negative function satisfying $\Lambda^*(E[X_1]) = 0$. $\Lambda^*$ is an increasing function on $[E[X_1], \infty)$ and a decreasing function on $(-\infty, E[X_1]]$.
$P\left(\frac{1}{n}\sum_{i=1}^n X_i \ge x\right) \le e^{-n\Lambda^*(x)}$ for $x \ge E[X_1]$.
$P\left(\frac{1}{n}\sum_{i=1}^n X_i \le x\right) \le e^{-n\Lambda^*(x)}$ for $x \le E[X_1]$.
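A numerical sketch of the rate function as a Legendre transform, for a fair Bernoulli variable (our own example); note the finite-$n$ empirical exponent approaches $-\Lambda^*(x)$ only slowly because of polynomial prefactors.

```python
# Sketch: Lambda*(x) = sup_t (t x - Lambda(t)) for Bernoulli(1/2), compared
# with the empirical exponent (1/n) log P(mean >= x0) from simulation.
import numpy as np

t = np.linspace(-20, 20, 4001)
Lambda = np.log(0.5 + 0.5 * np.exp(t))   # log mgf of Bernoulli(1/2)

def rate(x):
    return np.max(x * t - Lambda)        # Legendre transform on the grid

n, x0 = 50, 0.7                          # threshold above the mean 0.5
rng = np.random.default_rng(5)
s = rng.integers(0, 2, size=(100_000, n)).mean(axis=1)
print("empirical (1/n) log P :", np.log(np.mean(s >= x0)) / n)
print("Cramer limit -Lambda* :", -rate(x0))  # empirical tends here as n grows
```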
Introductory examples and definitions. Elena Kosygina
A basic introduction to large deviations: Theory, applications, simulations, by Hugo Touchette
Increasing convex order
Let $X, Y$ be random variables such that $E[\phi(X)] \le E[\phi(Y)]$ for all increasing convex functions $\phi$, provided the expectations exist (denoted by $X \le_{icx} Y$).
Properties
$E[(X - c)^+]$ = loss rate (e.g., expected overflow beyond a buffer or threshold $c$); if $X \le_{icx} Y$, then $E[(X - c)^+] \le E[(Y - c)^+]$ for all $c$, since $(x - c)^+$ is increasing and convex.
Ref. Stochastic Orders and Their Applications (Probability and Mathematical Statistics), by Moshe Shaked and J. George Shanthikumar.
Discrete random experiment
$A$ is an event. The surprisal of $A$ is $I(A) = -\log P(A)$.
When $P(A)$ is close to 1, the surprisal of the event is low, but if
$P(A)$ is close to 0, the surprisal of the event is high.
e.g., a computer exploding the moment it is powered on (very low probability, very surprising); the sun rising in the east has probability ~1 (no news outlet reports it).
measure of uncertainty or entropy (average amount of information)
$X$ is a discrete random variable, $H(X) = -\sum_i p_i \log p_i$, if the series converges, where $p_i = P(X = x_i)$.
$H(X) = -\sum_i p_i \log_2 p_i$ (bits), if the logarithm is taken base 2.
Remark. $H(X) \le \log n$ if $X$ takes $n$ values, with equality iff $p_i = 1/n$.
Proof.
$\ln x \le x - 1$, hence $\sum_i p_i \ln\frac{1/n}{p_i} \le \sum_i p_i\left(\frac{1/n}{p_i} - 1\right) = 0$. Thus $H(X) \le \log n$.
It is a maximum and not a minimum, since $H$ is concave.
Property.
$H(X, Y) \le H(X) + H(Y)$; the equality signs hold if and only if $X, Y$ are statistically independent.
Fact. $I(X; Y) = H(X) - H(X \mid Y) \ge 0$, and $I(X; Y) = 0$, if $X, Y$ are independent.
Proof.
If $X, Y$ are independent, then $H(X \mid Y) = H(X)$, since $P(X \mid Y) = P(X)$. No information is transmitted through the channel.
In a discrete communication system the channel capacity is the maximum of the transinformation $I(X; Y)$: $C = \max_{P(X)} I(X; Y)$. (A Mathematical Theory of Communication, by C. E. Shannon)
(Discrete Noiseless Channel) Let $\{a_1, \ldots, a_M\}$ be the alphabet of a source containing $M$ symbols. $C = \log_2 M$ bits per symbol.
(Discrete Noisy Channel) The noise characteristic of the channel is prespecified. $C = \max_{P(X)} I(X; Y)$.
idea: repeatedly merge the two symbols with the smallest probabilities, forming a binary tree
A prefix-free binary code (a set of codewords) with minimum expected codeword length (equivalently, a tree with minimum weighted path length from the root).
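A minimal Huffman-coding sketch of the merge-the-two-smallest idea; the symbol probabilities below are an arbitrary example of ours.

```python
# Sketch: Huffman coding by repeatedly merging the two least-probable nodes.
import heapq

def huffman(probs):
    """Return a prefix-free code (dict: symbol -> bit string)."""
    # Heap items: (probability, tiebreaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two smallest probabilities
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

code = huffman({"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1})
print(code)  # shorter codewords for more probable symbols
```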
Application: JPEG
Patent controversy (the JPEG patent dispute, JPEG 專利爭議)
$X$ is a continuous random variable with density $f$; the "differential entropy" is $h(X) = -\int f(x)\,\log f(x)\,dx$, if it exists.
Remark. Quantizing with bin width $\Delta$ and taking the Riemann sum gives $H(X_\Delta) \approx h(X) - \log\Delta$. Assume the integral exists; note $h(X)$ can be negative.
Differential Entropy and Probability Distributions
(e.g., among continuous distribution functions with fixed variance, the Gaussian maximizes the differential entropy)
Kullback–Leibler divergence and Cross Entropy
$P$: true distribution, $Q$: reference (approximate) distribution
Kullback-Leibler divergence (相對熵 relative entropy, KL散度, 訊息增益 information gain, 訊息散度 information divergence) is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q.
$D_{KL}(P \| Q) = \sum_x P(x)\,\log\frac{P(x)}{Q(x)}$ for discrete probability distributions $P, Q$.
$D_{KL}(P \| Q) = H(P, Q) - H(P)$, where $H(P, Q) = -\sum_x P(x)\,\log Q(x)$ is the cross entropy (交叉熵) of $P$ and $Q$.
Remarks:
$P$: true distribution
Cross Entropy: average number of total bits to represent an event when coding with $Q$ instead of $P$.
Relative Entropy (KL Divergence): average number of extra bits to represent an event when coding with $Q$ instead of $P$.
In general, $D_{KL}(P \| Q) \ne D_{KL}(Q \| P)$: it is not a metric.
$D_{KL}(P \| Q) \ge 0$, and $D_{KL}(P \| Q) = 0$ iff $P = Q$.
$D_{KL}(P \| Q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx$ for continuous probability distributions with densities $p$ and $q$, where $P \ll Q$ (i.e., $p(x) = 0$ wherever $q(x) = 0$) is required.
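A short sketch verifying $D_{KL}(P\|Q) = H(P, Q) - H(P)$ and the asymmetry, on two made-up discrete distributions.

```python
# Sketch computing H(P), H(P,Q) and D_KL for discrete distributions.
import numpy as np

P = np.array([0.5, 0.3, 0.2])   # "true" distribution
Q = np.array([0.4, 0.4, 0.2])   # reference distribution

H_P = -np.sum(P * np.log2(P))          # entropy of P (bits)
H_PQ = -np.sum(P * np.log2(Q))         # cross entropy H(P,Q)
D_PQ = np.sum(P * np.log2(P / Q))      # KL divergence D_KL(P||Q)
D_QP = np.sum(Q * np.log2(Q / P))      # D_KL(Q||P) differs: not a metric

print(f"H(P)={H_P:.4f} H(P,Q)={H_PQ:.4f} D(P||Q)={D_PQ:.4f} D(Q||P)={D_QP:.4f}")
print("H(P) + D(P||Q) =", H_P + D_PQ)  # equals H(P,Q)
```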
The relationship between maximum likelihood estimation (MLE) and KL divergence:
Assume $P$ is the real distribution and $Q_\theta$ is the approximate distribution.
Assume $N$ samples $x_1, \ldots, x_N \sim P$; by the law of large numbers the sample mean $\frac{1}{N}\sum_i \log Q_\theta(x_i) \to E_P[\log Q_\theta(X)]$.
Hence $\arg\max_\theta \frac{1}{N}\sum_i \log Q_\theta(x_i) \approx \arg\min_\theta \left(-E_P[\log Q_\theta(X)]\right) = \arg\min_\theta D_{KL}(P \| Q_\theta)$.
Key Point:
Minimizing the KL divergence (equivalently, the cross entropy) as a loss function is equivalent to maximizing the likelihood.
See the examples:
Kullback-Leibler Divergence Explained
A Gentle Introduction to Cross-Entropy for Machine Learning
logistic function
A logistic function or logistic curve is a common S-shaped curve (sigmoid curve) with the equation
$f(x) = \frac{L}{1 + e^{-k(x - x_0)}}$, where
$x_0$ is the value of the function's midpoint;
$L$ is the supremum of the values of the function;
$k$ is the logistic growth rate or steepness of the curve.
If $L = 1$, $k = 1$, $x_0 = 0$, we get the standard logistic (sigmoid) function $\sigma(x) = \frac{1}{1 + e^{-x}}$, with
inverse function $\sigma^{-1}(y) = \ln\frac{y}{1 - y}$ (the logit).
Let $x \in (0, 1)$ be the predicted probability; the per-sample cross-entropy losses
$-\ln(1 - x)$ (for label 0) and
$-\ln(x)$ (for label 1) are each a
convex function of $x$ on $(0, 1)$.
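A small sketch of the standard logistic function, its logit inverse, and the convex per-sample cross-entropy losses just described.

```python
# Sketch: standard logistic (L=1, k=1, x0=0), its logit inverse, and the
# binary cross-entropy losses -ln(p) / -ln(1-p).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(y):                      # inverse function: ln(y / (1 - y))
    return np.log(y / (1.0 - y))

x = np.array([-2.0, 0.0, 2.0])
y = sigmoid(x)
print("sigmoid:", y)
print("logit(sigmoid(x)) recovers x:", logit(y))

# Binary cross-entropy per sample: -ln(p) if label 1, -ln(1-p) if label 0.
p, label = 0.9, 1
loss = -np.log(p) if label == 1 else -np.log(1 - p)
print("BCE loss:", loss)
```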
The meaning of the standard deviation
Chebyshev's inequality
$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$, if $E[X] = \mu$ and $\operatorname{Var}(X) = \sigma^2 < \infty$.
Let $k = 2$, i.e., at least $1 - \frac{1}{4} = 75\%$ of the values lie within 2 standard deviations of the mean.
Birnbaum–Raymond–Zuckerman inequality
(a Chebyshev-type bound for several random variables jointly, if their variances exist)
Logistic Regression — Detailed Overview
Bernoulli process
A Bernoulli process is a finite or infinite sequence of independent random variables $X_1, X_2, X_3, \ldots$, such that for each $i$, the value of $X_i$ is either 0 or 1, and for all values of $i$ the probability $p$ that $X_i = 1$ is the same. In other words, a Bernoulli process is a sequence of independent identically distributed Bernoulli trials.
Bernoulli MLE (Maximum Likelihood Estimation)
Likelihood of $n$ samples: $L(p) = \prod_{i=1}^n p^{x_i}(1 - p)^{1 - x_i}$; log-likelihood: $\ell(p) = \left(\sum_i x_i\right)\ln p + \left(n - \sum_i x_i\right)\ln(1 - p)$. Setting $\ell'(p) = 0$ gives $\hat{p} = \frac{1}{n}\sum_i x_i$.
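A numeric sketch confirming that the Bernoulli log-likelihood peaks at the sample mean; the true $p = 0.3$ is an illustrative choice.

```python
# Sketch: grid-maximize the Bernoulli log-likelihood and compare with x_bar.
import numpy as np

rng = np.random.default_rng(6)
x = rng.binomial(1, 0.3, size=10_000)   # i.i.d. Bernoulli(0.3) samples

def log_likelihood(p, x):
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 99)
ll = [log_likelihood(p, x) for p in grid]
print("grid argmax :", grid[int(np.argmax(ll))])
print("sample mean :", x.mean())        # closed-form MLE p_hat = x_bar
```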
Ref. Information Theory, by Robert Ash.
A "physics" concept I personally find important and interesting (requires knowledge of the calculus of variations):
Action principles (least action principle)
The Feynman Lectures on Physics
From the laws of reflection & refraction to variational principles
Theoretical Concepts in Physics, Malcolm S. Longair, University of Cambridge
Introduction to the calculus of variations
This book is very well written, but too difficult for CS/EE students:
Mathematical Methods of Classical Mechanics, V.I. Arnold.