
[Crude Draft] Data Analysis (Half Semester) Course @ IIT Dharwad

B. N. Bharath
Assistant professor,
Department of Electrical, Electronics and Communication Engineering (EECE),
IIT Dharwad


Commenting

  • Please write your name and roll number followed by your comments. This helps me know who has commented. Also, please do not hesitate to comment. All the comments will be taken seriously, and no comment is a stupid comment

Logistics

There will be two evaluations: one quiz ($40\%$) and one final exam ($60\%$). The quiz may also contain some assignments (given during the regular classes), such as writing code or solving problems.

  • Quiz: 31/10/2023 from 8:30 AM to 9:30 AM (regular class hours). The syllabus will include everything that is covered until 20/10/2023.

  • Final exam will be as per the academic calendar shared by the academic office.

  • A repeat exam or evaluation is not encouraged. If someone misses the quiz for a genuine reason, then I may use the final exam marks to grade, with a possible penalty. The penalty will be decided based on the reason for absence.

  • The answer scripts will be distributed on 24th November in the PC. More details will be conveyed by the TAs.


References

  • "Introduction to statistics," MIT Opencourseware, Link: click here
  • "All of statistics (A concise course in statistical inference)" by Larry Wasserman, Springer.
  • "Introduction to Probability, statistics and Random processes," click here
  • Nice problems can be found here click here

Announcements

  • The classes will be held in UG4 in the Permanent Campus on Thursdays and Fridays, and in PC300 on Tuesdays.

Introduction

Let us start with the question: what is data analysis? It refers to providing inference and insights using data. The classical approach involves modelling the data using some sort of a function or a probability distribution, and using tools from probability to make inferences. One might wonder why we need data analysis. Imagine that you are an investment banker and you have to decide whether to invest in some stocks. One can use past experience, the current political situation, etc., to infer. A more systematic approach is to use the past data and provide inference using probabilistic tools in addition to one's rich experience. This course covers classical approaches to these problems, and provides some insights and tools necessary to solve a certain class of problems. Let us begin by looking at a stylized example involving a coin toss.
In particular, let us consider the following four example problems.

  • Problem 1: Given a coin, find its bias.

  • Problem 2: We have two coins, one with bias $1/2$ and the other with bias $3/4$. Imagine that I am going to randomly pick one of them and give it to you. You are supposed to find whether I have given you the biased coin or the unbiased coin.

  • Problem 3 (more challenging): Same as Problem 2 with bias $3/4$ replaced by "not $1/2$", i.e., biased coin versus unbiased coin.

  • Problem 4 (more challenging): Same as Problem 2 with bias $3/4$ replaced by "bias is a rational number" and bias $1/2$ replaced by "bias is an irrational number", i.e., rational bias versus irrational bias.

A typical sample path when you toss a coin $n = 15$ times (or sample Bernoulli with bias $1/2$) is given below:

$$[1 \;\; 0 \;\; 0 \;\; 1 \;\; 1 \;\; 0 \;\; 1 \;\; 1 \;\; 1 \;\; 0 \;\; 0 \;\; 0 \;\; 1 \;\; 0 \;\; 1].$$

Answer to Problem 1

In the first problem, our end result will be the bias, which is any number between $0$ and $1$. On the other hand, Problem 2 asks us to find the right answer among two possibilities, i.e., biased and unbiased. Problem 1 is called an estimation problem, while the second (third and fourth) problem is called a hypothesis testing or classification or decision problem. First, let us begin by modeling the experiment involved in Problem 1 above. The first step is to identify the experiment involved, i.e., tossing a coin whose sample space is $\Omega := \{H, T\}$. We start by modelling each outcome of the coin by a random variable (rv), i.e., let the outcome of the $i$-th coin toss be $X_i : \Omega \to \{0,1\}$, $i = 1, 2, \ldots, n$, which is defined as
$$X_i = \begin{cases} 1 & \text{if the outcome is } H\\ 0 & \text{otherwise.}\end{cases}$$

Note that each coin toss is independent of the others, i.e., the outcome of any coin toss will not impact the other outcomes. A mathematical way of saying this is that $X_1, X_2, \ldots, X_n$ are independent rvs, whose definition is as follows:

Definition (independent rvs): We say that $X_1, X_2, \ldots, X_n$ are independent rvs if the joint distribution factorizes as
$$p_{X_i : i \in I}(x_i : i \in I) = \prod_{i \in I} p_{X_i}(x_i)$$
for any subset $I \subseteq \{1, 2, \ldots, n\}$. If the distributions of all the $X_i$'s are the same, then we say that the rvs are identically distributed. If $X_1, X_2, \ldots, X_n$ are both independent and identically distributed (i.i.d. for short), we use the notation $X_1, X_2, \ldots, X_n \overset{\text{i.i.d.}}{\sim} p_{X_1}(x)$. For continuous rvs, it suffices to replace the PMF by the PDF.

Note: Sometimes we use $\perp$ to indicate independence. For example, $X \perp Y$ means $X$ and $Y$ are independent rvs.

Thus, by our intuition, it makes sense to assume that $X_1, X_2, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \text{Bern}(p)$, where $p$ is unknown! Now, our goal is to find this $p$. Again, guided by our intuition, a natural estimate is to count the number of heads and divide it by $n$. Mathematically,
$$\hat{p}_n := \frac{1}{n}\sum_{i=1}^{n} X_i.$$

Let us closely look at what we want our estimate to be.

  • First, on an average (true average), the estimate should be true; mathematically, $\mathbb{E}\hat{p}_n = p$.
    • An estimator with the above property is in general called an unbiased estimator. We will see more about this later.
  • Second, as we increase the number of tosses (more samples), the estimation should become "better" (right now we do not know what better means).

Let us pause for a moment to see how to measure the "goodness" of an estimator; the word "estimator" is not formally defined yet, but for now, $\hat{p}_n$ is our estimator :-) There are several ways to measure "goodness". The following are some natural measures:

  • Mean Absolute Error (MAE): Look at the difference between the estimator and the true value (i.e., $|\hat{p}_n - p|$). Unfortunately, this error is random, and cannot be relied on. Having done a course in probability, it is natural to look at the (true) average of this quantity, which is defined below:
    $$\text{MAE}_n := \mathbb{E}|\hat{p}_n - p|.$$
  • Mean Square Error (MSE): Instead of looking at the absolute error, we can look at the squared error, which results in the following notion:
    $$\text{MSE}_n := \mathbb{E}|\hat{p}_n - p|^2.$$

One might be wondering: why not the $3$rd power of the absolute error? For that matter, why not the $k$-th power of the absolute error? The thought process is correct. However, these higher powers lead to a complicated expression for the mean (whatever) error, and are not amenable to mathematical analysis. But $\text{MSE}_n$ is a nice quantity that can be analyzed! So, let us go ahead with the $\text{MSE}_n$. First, let us see if our estimator is unbiased. Towards this, consider
$$\mathbb{E}\hat{p}_n = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] \overset{\text{(linearity)}}{=} \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}X_i \overset{\text{(i.i.d.)}}{=} \frac{1}{n}\, n\, \mathbb{E}X_1 \overset{\mathbb{E}X_1 = p}{=} p.$$

Hence our estimator is unbiased! Now, let us look at the $\text{MSE}_n$ performance:
$$\begin{aligned}\text{MSE}_n &:= \mathbb{E}|\hat{p}_n - p|^2 \overset{\text{(why?)}}{=} \text{Var}(\hat{p}_n) = \mathbb{E}\left|\frac{1}{n}\sum_{i=1}^{n}(X_i - p)\right|^2\\ &= \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}(X_i - p)^2 + \frac{1}{n^2}\sum_{i \neq j}\mathbb{E}\left[(X_i - p)(X_j - p)\right]\\ &\overset{X_i \perp X_j}{=} \frac{\text{Var}(X_1)}{n} + \frac{1}{n^2}\sum_{i \neq j}\mathbb{E}(X_i - p)\,\mathbb{E}(X_j - p) = \frac{\text{Var}(X_1)}{n}.\end{aligned}$$

This leads to the following result.

Theorem: The estimator $\hat{p}_n$ satisfies $\text{MSE}_n = O\!\left(\frac{1}{n}\right)$. In particular, asymptotically, $\lim_{n\to\infty}\text{MSE}_n = 0$.


The following plot shows the MSE versus $n$ for a coin with bias $1/2$. Clearly, the MSE goes to zero as $n$ is increased. Further, the rate of decay is of the order $1/n$.

[Figure: MSE versus $n$ for a coin with bias $1/2$.]
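
To reproduce a plot like the one above, the following is a minimal Python sketch (the sample sizes, the number of trials and the bias $p = 1/2$ below are my own illustrative choices, not values fixed in these notes) that simulates $\hat{p}_n$ and estimates $\text{MSE}_n$ empirically.

import numpy as np
import matplotlib.pyplot as plt

p = 0.5                      # true bias, assumed here for the simulation
num_trials = 5000            # repetitions used to average the squared error
sample_sizes = np.arange(10, 1010, 50)

mse = []
for n in sample_sizes:
    # num_trials independent experiments, each with n Bernoulli(p) tosses
    tosses = np.random.rand(num_trials, n) < p
    p_hat = tosses.mean(axis=1)              # \hat{p}_n for each experiment
    mse.append(np.mean((p_hat - p) ** 2))    # empirical MSE_n

plt.loglog(sample_sizes, mse, '-b', label='empirical MSE')
plt.loglog(sample_sizes, p * (1 - p) / sample_sizes, '--r', label='Var(X_1)/n')
plt.xlabel('n'); plt.ylabel('MSE'); plt.legend(); plt.grid(True)
plt.show()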

Note: Estimators with the above asymptotic property are called efficient estimators. The notation $O(\cdot)$ is Landau's Big-O notation. It is often used to remove nonsense (sometimes not-so-nonsense) terms while analyzing something. Say you are interested in the behaviour of the MSE with respect to $n$ and not so much in the other terms; then we can use this notation. Mathematically, it means the following: we say that $g(x) = O(f(x))$ if $\lim_{x\to\infty}\frac{g(x)}{f(x)} = \text{constant}$.

Is there any other way of measuring the "goodness" of an estimator? To answer this, let us revisit the error, often called the empirical error, $|\hat{p}_n - p|$. As noticed earlier, this error is random, and we cannot bank on it. Note that our estimator should not result in a bad error, i.e., $|\hat{p}_n - p| > \epsilon$, where $\epsilon$ is the tolerance and depends on the application. Whenever the error is bad, we refer to it as the "Bad" event. Obviously, we cannot ensure that the bad error will never happen, as $|\hat{p}_n - p|$ is random. Thus, a more sensible thing to ask for is to make the probability of the "Bad" event small, i.e.,
$$\Pr\{|\hat{p}_n - p| > \epsilon\} < \delta$$
for some $\delta > 0$. Note that $X := |\hat{p}_n - p|$ is non-negative. In general, it is very difficult to verify whether $\Pr\{|\hat{p}_n - p| > \epsilon\}$ is small. Instead, if we can somehow show that something larger than $\Pr\{|\hat{p}_n - p| > \epsilon\}$ is small, then our job is done. Thus, we need an upper bound on $\Pr\{X > \epsilon\}$; this is the essence of the next topic.

Applications of Large Deviation Bounds

[Ignore the fancy title for now :-)] We noticed above that we need an upper bound on $\Pr\{X > \text{"something large"}\}$, referred to as the tail probability (think why?). Here $X$ is a non-negative rv. Consider
$$\mathbb{E}X = \int_{0}^{\infty} x f_X(x)\,dx \geq \int_{\epsilon}^{\infty} x f_X(x)\,dx \geq \epsilon \Pr\{X > \epsilon\}.$$
Rearranging the above results in the following result.

Theorem (Markov inequality): Let $X \geq 0$. Then,
$$\Pr\{X > \epsilon\} \leq \frac{\mathbb{E}X}{\epsilon}.$$

Now, let us try to apply the above result to our estimation problem. Note that the non-negative rv $X$ is $|\hat{p}_n - p|$. Applying the Markov inequality, we get $\Pr\{|\hat{p}_n - p| > \epsilon\} \leq \frac{\mathbb{E}|\hat{p}_n - p|}{\epsilon}$. It is hard to compute the term $\mathbb{E}|\hat{p}_n - p|$. However, in the above, we could compute $\mathbb{E}|\hat{p}_n - p|^2$ (with a square), whereas Markov involves the term without the square. The idea is to see if we can convert it to the squared version. It is an easy exercise once you note that $\Pr\{X > \epsilon\} = \Pr\{X^2 > \epsilon^2\}$, and then apply the Markov inequality to get
$$\Pr\{X > \epsilon\} \leq \frac{\mathbb{E}X^2}{\epsilon^2}.$$

The above, when applied to $|X - \mathbb{E}X|$, gets a special name:

Theorem (Chebyshev's inequality): Let $X$ be any rv (not necessarily non-negative). Then,
$$\Pr\{|X - \mathbb{E}X| > \epsilon\} \leq \frac{\mathbb{E}|X - \mathbb{E}X|^2}{\epsilon^2} = \frac{\text{Var}(X)}{\epsilon^2}.$$


Note that our estimator is unbiased. Using this fact and applying Chebyshev's inequality to our estimator results in
$$\Pr\{|\hat{p}_n - p| > \epsilon\} \leq \frac{\text{Var}(\hat{p}_n)}{\epsilon^2} \overset{\text{again, why?}}{=} \frac{\text{MSE}_n}{\epsilon^2} = \frac{\text{Var}(X_1)}{n\epsilon^2} \overset{n\to\infty}{\longrightarrow} 0.$$

There are other ways of stating this result. For example, if we want $\Pr\{|\hat{p}_n - p| > \epsilon\} < \delta$ for some $\delta > 0$, then we can ensure this by choosing $n$ such that $\frac{\text{Var}(X_1)}{n\epsilon^2} < \delta$. In other words, if $n > \frac{\text{Var}(X_1)}{\delta\epsilon^2}$, then we can ensure $\Pr\{|\hat{p}_n - p| > \epsilon\} < \delta$. A fancy way of saying this is as follows:


With a probability of at least $1 - \delta$, the error $|\hat{p}_n - p| < \epsilon$, provided the number of samples $n > \left\lceil\frac{\text{Var}(X_1)}{\delta\epsilon^2}\right\rceil$. Here, the symbol $\lceil\cdot\rceil$ denotes the ceiling operation. This style of result is often called a Probably Approximately Correct (PAC) answer!
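
As a quick numerical illustration (the values of $\epsilon$, $\delta$ and the worst-case variance below are my own choices, not part of the notes), the sample size promised by this Chebyshev/PAC argument can be computed as follows.

import math

eps, delta = 0.05, 0.1      # tolerance and failure probability (assumed values)
var_x1 = 0.25               # Var(X_1) = p(1-p) <= 1/4 for any coin

n_required = math.ceil(var_x1 / (delta * eps ** 2))
print(n_required)           # number of tosses sufficient for the PAC guarantee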


There is another name for the above: it is called an interval estimate. Instead of giving one estimate $\hat{p}_n$, we give a range of values around $\hat{p}_n$ in which $p$ should lie with reasonably high probability. Naturally, if one gives only $\hat{p}_n$ (one value), it is called a point estimate, for obvious reasons. Essentially, the above result is saying $\hat{p}_n \approx p$ (approximately; approximation error of $\epsilon$) with a probability of at least $1 - \delta$ (probably). Hence the name PAC. This style of result is very common in theoretical machine learning. In this course, we can give more definitive guarantees.

Side Remark: Oftentimes, people may ask what guarantee you can give for your estimator/algorithm/hypothesis test, etc. What they mean is: can you say "some sort of average error is small, or the probability of error is small, or, in general, some performance measure is small (assuming small means good)"? In this course, we will provide many such guarantees. The above PAC or MSE expression is one such result on guarantees.

Exercise 1: Suppose you are given $n$ samples from a uniform distribution, $X_1, X_2, \ldots, X_n \overset{\text{i.i.d.}}{\sim}\text{Unif}(\theta)$, where $\theta$ is unknown.

  • Come up with an estimator for $\theta$; denote it by $\hat{\theta}_n$.
  • Is your estimate unbiased? If not, can you come up with an unbiased estimate?
  • What is the $\text{MSE}_n$ performance of the proposed estimator? Does it go to zero as $n$ goes to $\infty$? If so, at what rate?
  • Can you give a PAC style result, i.e., with a probability of at least $1 - \delta$, the error $|\hat{\theta}_n - \theta| < \epsilon$ provided the number of samples $n > \;?$. Fill in the blank here.

Practice Problem: A symmetric die is thrown $600$ times. Find a lower bound for the probability of getting $80$ to $120$ sixes.

Practice Problem: (Prof. Stankova, UCB) Let the PDF of $X$ be $g(x) = 5x^{-6}$ for $x \geq 1$ and $0$ otherwise. What bound does Chebyshev's inequality give for the probability $\Pr(X \geq 2.5)$? For what value of $a$ can we say $\Pr(X \geq a) \leq 15\%$?

Practice Problem: (Prof. Stankova, UCB) Let the PDF of $X$ be $g(x) = \frac{2}{3}x$ for $1 \leq x \leq 2$, and $0$ everywhere else. Give a bound using Chebyshev's inequality for $\Pr(10/9 \leq X \leq 2)$.

Practice Problem (Comparing bound with actual): Find an upper bound on $\Pr\{X \geq a\}$ for some $a$, where $X \sim \text{Exp}(\lambda)$. Compare the actual value with the upper bound.

Practice Problem (finite higher moments may help): Let $X \sim \mathcal{N}(0, 1)$. Use Chebyshev's inequality to bound $\Pr\{|X| > a\}$ for some $a > 0$. Compare this with the bound obtained using $\Pr\{|X| > a\} = \Pr\{|X|^k > a^k\}$; here, bound it using the Markov inequality. This requires you to find the $k$-th moment of $X$. Compare your bound for different values of $k$, and comment. Also comment on what happens if we increase $k$ indefinitely.

Practice Problem (coding exercise): Simulate Exercise 1. Plot the MSE performance of your estimator versus the number of samples. You can use any coding language (Python or Matlab preferred).

Answer to Problem 2

As mentioned earlier, the second problem is a hypothesis testing or a decision problem; we need to decide whether the coin has bias $1/2$ or bias $3/4$. The first question that we encounter is: on what basis should we decide? A natural metric here is the probability of making a mistake, i.e., saying bias $1/2$ when the coin has bias $3/4$, and vice-versa. Let us write this error event as
$$\mathcal{E} := \{\text{Say } 3/4 \text{ when } 1/2\}\cup\{\text{Say } 1/2 \text{ when } 3/4\}.$$
To compute the probability of error, we first note that the two events in the union are disjoint. Thus,
$$\Pr\{\mathcal{E}\} = \Pr\{\text{Say } 3/4 \text{ when } 1/2\} + \Pr\{\text{Say } 1/2 \text{ when } 3/4\}.$$

Immediately, one may wonder: who will say $1/2$? What is the "decision rule"? We have to clarify these before we delve into the solution.

Decision rule: Intuitively, we observe and then make decisions. What do we observe here? Easy: we toss the coin $n$ times and observe the outcomes. Mathematically, we observe $n$ Bernoulli random variables $X_i$, $i = 1, 2, \ldots, n$. Soon we realize that a rule should be a function of $X_1, X_2, \ldots, X_n$. Mathematically, a rule here is a function $\Phi : \{0,1\}^n \to \{0,1\}$; in simple terms, it is denoted by $\Phi(X_1, X_2, \ldots, X_n)$, and it takes values in $\{0,1\}$. Here $0$ means bias $3/4$ and $1$ means bias $1/2$. Seems very complicated, isn't it? Not really! Let's move on. The event "Say $3/4$" means $\Phi(X_1, X_2, \ldots, X_n) = 0$ for the decision rule $\Phi(\cdot)$; easy? Similarly for "Say $1/2$". There is another event to take care of, i.e., "when $1/2$" :-( This means that the true coin had bias $1/2$. Let us denote this event by $H_1$. You should excuse me for using too many notations. Here the fancy $H$ refers to hypothesis and the subscript $1$ is as per my madness. If you don't like it, you can use $0$ instead. Similarly, $H_0$ for "when $3/4$".

Are $H_0$ and $H_1$ random? Are they deterministic? Many questions remain to be answered. If I choose the coin randomly from the urn, then these events are random, and we can associate probabilities with them. However, if these events are deterministic but unknown, then we cannot assign probabilities. We need to understand why these events can be deterministic but unknown. In fact, I can pick the coin randomly, and there is no way by which you can figure out how I have chosen the coin. For example, I can go into a secret room, pick the coin and show it to you. Note that I may choose to pick the coin deterministically as well. There are so many variants. Here, let us keep things simple. Let us assume that the coins are picked randomly. In particular, I am going to pick the coin with bias $3/4$ with probability $\pi_0$. Let $\pi_1 := 1 - \pi_0$. More precisely, $\pi_0 = \Pr\{H_0\} = 1 - \Pr\{H_1\}$.

With the above setup, we can write the probability of error as
$$\begin{aligned}\Pr\{\mathcal{E}\} &= \Pr\{\Phi(X_1,\ldots,X_n) = 0 \text{ and when } 1/2\} + \Pr\{\Phi(X_1,\ldots,X_n) = 1 \text{ and when } 3/4\}\\ &\overset{\text{Bayes rule}}{=} \Pr\{\Phi(X_1,\ldots,X_n) = 0 \mid \text{when } 1/2\}\Pr\{\text{when } 1/2\} + \Pr\{\Phi(X_1,\ldots,X_n) = 1 \mid \text{when } 3/4\}\Pr\{\text{when } 3/4\}\\ &\overset{\text{why?}}{=} \Pr\{\Phi(X_1,\ldots,X_n) = 0 \mid \text{when } 1/2\}\times\pi_1 + \Pr\{\Phi(X_1,\ldots,X_n) = 1 \mid \text{when } 3/4\}\times\pi_0.\end{aligned}$$

Now, the goal is to design a "decision rule" $\Phi(\cdot)$ that minimizes the above error, i.e.,
$$\min_{\Phi : \{0,1\}^n \to \{0,1\}}\Pr\{\mathcal{E}\}.$$

Seems complicated, and it is indeed not simple! This is because the minimization is over functions. To solve this, let us understand what a decision rule $\Phi(\cdot)$ can possibly do. It observes $n$ samples from $\{0,1\}$ and decides in favor of zero or one. The input to $\Phi(\cdot)$ is a set of $2^n$ possible binary sequences. It decides some in favor of one and some in favor of zero. Thus, $\Phi(\cdot)$ partitions the space of $2^n$ sequences into two regions. Let us collect all the sequences that are mapped in favor of $1$ in one set, say $\mathcal{D}_1$, which is defined as
$$\mathcal{D}_1 := \{(x_1, x_2, \ldots, x_n) : \Phi(x_1, x_2, \ldots, x_n) = 1 \text{ and } x_i \in \{0,1\}\}.$$

The region corresponding to $0$ is denoted by $\mathcal{D}_0 = \mathcal{D}_1^c$. To proceed further (also, I am a bit lazy), we need the following notation.

Notation: We use $X^n := (X_1, X_2, \ldots, X_n)$ and $x^n := (x_1, x_2, \ldots, x_n)$. Imagine me writing (I meant LaTeXing) all these arrays every time!

Using the above, a part of the probability of error can be written as
$$\Pr\{\Phi(X_1,\ldots,X_n) = 1 \mid \text{when } 3/4\} \overset{\text{why?}}{=} \sum_{x^n \in \mathcal{D}_1} p_{X^n\mid\text{when } 3/4}(x^n).$$
The above follows from the fact that the probability of a rv (a random vector in this case) falling in a region is the sum of the PMF over that region (an integral in the continuous case). In the above, the region corresponds to the region where you (i.e., $\Phi(\cdot)$) will make a decision in favor of $H_1$. The above argument applies to conditional probability as well (recall that the conditional PMF also satisfies all the axioms of probability). Further, we note that
$$\Pr\{\Phi(X_1,\ldots,X_n) = 0 \mid \text{when } 1/2\} = 1 - \Pr\{\Phi(X_1,\ldots,X_n) = 1 \mid \text{when } 1/2\} = 1 - \sum_{x^n \in \mathcal{D}_1} p_{X^n\mid\text{when } 1/2}(x^n).$$

Let us look at the above more carefully. First, conditioning on "when $1/2$" means that it corresponds to an unbiased coin. Thus, $p_{X^n\mid\text{when } 1/2}(x^n)$ is the probability of observing the $n$-tuple $x^n$ when you throw an unbiased coin $n$ times. By noting that $x^n \in \{0,1\}^n$, we have
$$p_{X^n\mid\text{when } 3/4}(x^n) = \left(\frac{3}{4}\right)^{\sum_{i=1}^{n} x_i}\times\left(\frac{1}{4}\right)^{n - \sum_{i=1}^{n} x_i},$$
since $\sum_{i=1}^{n} x_i$ gives us the total number of ones in the sequence (or the total number of heads). Similarly,
$$p_{X^n\mid\text{when } 1/2}(x^n) \overset{\text{why?}}{=} \frac{1}{2^n}.$$

Recall that
$$\begin{aligned}\Pr\{\mathcal{E}\} &= \Pr\{\Phi(X_1,\ldots,X_n) = 0 \mid \text{when } 1/2\}\times\pi_1 + \Pr\{\Phi(X_1,\ldots,X_n) = 1 \mid \text{when } 3/4\}\times\pi_0\\ &\overset{\text{substitution}}{=} \left(1 - \sum_{x^n \in \mathcal{D}_1} p_{X^n\mid\text{when } 1/2}(x^n)\right)\pi_1 + \sum_{x^n \in \mathcal{D}_1} p_{X^n\mid\text{when } 3/4}(x^n)\times\pi_0\\ &= \pi_1 + \sum_{x^n \in \mathcal{D}_1}\left[p_{X^n\mid\text{when } 3/4}(x^n)\times\pi_0 - p_{X^n\mid\text{when } 1/2}(x^n)\times\pi_1\right].\end{aligned}$$

The problem was to minimize the probability of error (i.e., the above expression) with respect to $\Phi(\cdot)$. Note that the above expression depends on $\Phi(\cdot)$ only through $\mathcal{D}_1$. Thus, we need to choose $\Phi(\cdot)$ such that the $\mathcal{D}_1$ above makes the expression smaller. At first glance, it appears to be a hard problem. However, let us look at it more closely. First, as explained earlier, any choice of $\Phi(\cdot)$ will only split the space of $2^n$ sequences into two disjoint sets. In a way, we have to split the space into two pieces such that the resulting probability of error is small. If we split the space (i.e., choose $\mathcal{D}_1$) such that the summand, i.e., $p_{X^n\mid\text{when } 3/4}(x^n)\times\pi_0 - p_{X^n\mid\text{when } 1/2}(x^n)\times\pi_1$, is positive for some $x^n \in \mathcal{D}_1$, then it will only increase the probability of error. Therefore, if $\Phi(\cdot)$ is optimal, then it should include in $\mathcal{D}_1$ only those $x^n$ for which the summand is negative. This leads to the following optimal rule:
$$\frac{p_{X^n\mid\text{when } 3/4}(x^n)}{p_{X^n\mid\text{when } 1/2}(x^n)} \underset{x^n \in \mathcal{D}_1}{\overset{x^n \in \mathcal{D}_0}{\gtrless}} \frac{\pi_1}{\pi_0} = \frac{\pi_1}{1 - \pi_1}.$$

Note that the event $x^n \in \mathcal{D}_1$ corresponds to choosing the bias $1/2$ coin. Moreover, no matter what rule you pick, all the probabilities are fixed beforehand. The only thing that you can ensure is that the summand is negative. That is exactly what the above rule is doing. The above leads to the following result.


Theorem: For Problem 2, the optimal way of deciding whether the coin has bias $1/2$ or not is through the following log-likelihood ratio (LLR) rule:
$$\log\frac{p_{X^n\mid\text{when } 3/4}(x^n)}{p_{X^n\mid\text{when } 1/2}(x^n)} \underset{\text{choose bias } 1/2}{\overset{\text{choose bias } 3/4}{\gtrless}} \log\left(\frac{\pi_1}{1 - \pi_1}\right).$$


Now, let us see what it results in by substituting for the conditional probabilities:
$$\mathrm{LLR}(x^n) := \frac{1}{n}\sum_{i=1}^{n} x_i \underset{\text{choose bias } 1/2}{\overset{\text{choose bias } 3/4}{\gtrless}} \frac{1}{n\log 3}\log\left(\frac{\pi_1}{1 - \pi_1}\right) + \frac{\log 2}{\log 3} := \tau.$$

The left hand side is the relative frequency and the right hand side is a specific threshold! When $\pi_1 = 1/2$, the threshold becomes $\approx 0.63$; a number almost at the midpoint of $1/2$ and $3/4$! Intuition works!
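
A minimal Python sketch of this test (the choices $n = 100$, $\pi_1 = 1/2$ and the number of trials below are mine, for illustration only): generate tosses from one of the two coins and compare the relative frequency with the threshold $\tau = \log 2/\log 3 \approx 0.63$.

import numpy as np

rng = np.random.default_rng(0)
n, num_trials = 100, 10000
tau = np.log(2) / np.log(3)            # threshold for pi_1 = 1/2

errors = 0
for _ in range(num_trials):
    true_bias = rng.choice([0.5, 0.75])            # pick a coin uniformly at random
    tosses = rng.random(n) < true_bias
    decide_three_quarters = tosses.mean() >= tau   # relative-frequency (LLR) rule
    if decide_three_quarters != (true_bias == 0.75):
        errors += 1

print("empirical probability of error:", errors / num_trials)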

Side remark: Why should we do all this if we knew the answer beforehand based on our intuition? I have two answers. First, it's fun; second, the above settles the issue. Nobody can question it by asking what the guarantee is that there are no scenarios where our intuition fails.

Real world examples

Example 1 (Effectiveness of a new drug): The theory and methodology developed above play a significant role in clinical trials and medical research. Imagine that you are in a drug company attempting to find a drug for a disease. Naturally, one needs to figure out whether the treatment using the drug is effective or not. You can formulate this as a hypothesis testing problem, where the null hypothesis corresponds to no effect of the drug and the alternate hypothesis corresponds to the drug being very effective.

Test case: Let us say it is believed that the drug discovered by the company reduces the blood pressure (BP) of obese patients. The drug company decides to collect BP data of $N$ obese patients for $D$ days. After that, the drug is administered to all of them. Now, the goal is to test your hypothesis, i.e., effective versus ineffective, given the BP measurements of all the patients before and after the medication. How do we solve this problem? (Will fill in the details later.)

Exercise 2: Consider two coins with biases $q_1$ and $q_2$ ($q_2 > q_1$) in Problem 2, and repeat the whole exercise to show that the following rule is optimal with respect to the probability of error:
$$\frac{1}{n}\sum_{i=1}^{n} x_i \underset{\text{choose bias } q_1}{\overset{\text{choose bias } q_2}{\gtrless}} \text{threshold}.$$

Find the threshold.

The above analysis does not tell us how much error we incur by using the above LLR rule. We need to investigate this next. Towards this, let us start with the probability of error. The error occurs if the LLR rule says the bias is $1/2$ but the true bias is $3/4$, and vice-versa. This leads to
$$\begin{aligned}\Pr\{\mathcal{E}\} &= \Pr\{\mathrm{LLR}(x^n) > \tau \text{ when bias } 1/2\} + \Pr\{\mathrm{LLR}(x^n) < \tau \text{ when bias } 3/4\}\\ &= \Pr\{\mathrm{LLR}(x^n) > \tau \mid \text{when bias } 1/2\}\,\pi_1 + \Pr\{\mathrm{LLR}(x^n) < \tau \mid \text{when bias } 3/4\}\,\pi_0.\end{aligned}$$

Let us use the symbol $\Pr_{1/2}(\mathrm{LLR}(X^n) > \tau)$ to denote the first term above. A similar expression will be used for the second term, with conditioning on bias $3/4$. Note that
$$\Pr_{1/2}(\mathrm{LLR}(X^n) > \tau) = \Pr_{1/2}\left\{\frac{1}{n}\sum_{i=1}^{n} X_i > \tau\right\}.$$

Similarly,
$$\Pr_{3/4}(\mathrm{LLR}(X^n) < \tau) = \Pr_{3/4}\left\{\frac{1}{n}\sum_{i=1}^{n} X_i < \tau\right\} = \Pr_{3/4}\left\{\frac{1}{n}\sum_{i=1}^{n}(-X_i) > -\tau\right\}.$$

Observing the above two expressions, we realize that we need to evaluate a quantity of the form $\Pr\left(\frac{1}{n}\sum_{i=1}^{n} Z_i > \tau\right)$, where the $Z_i$'s are i.i.d. rvs. As in the case of Markov, a closed form expression for such things seldom exists, and hence we need to go for an upper bound. It turns out that this sort of thing appears frequently in statistics, and hence let us try to give a general bound on such a quantity.


Chernoff Bounding Technique

Let $Z_1, Z_2, \ldots, Z_n$ be i.i.d. rvs with distribution $p_{Z_1}(z)$, often written as $Z_1, Z_2, \ldots, Z_n \overset{\text{i.i.d.}}{\sim} p_{Z_1}(z)$. Note that each $Z_i$ need not be non-negative. Then, for any $s > 0$,
$$\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} Z_i > \tau\right\} = \Pr\left\{\exp\left(\frac{s}{n}\sum_{i=1}^{n} Z_i\right) > e^{s\tau}\right\} \leq \frac{\mathbb{E}\exp\left(\frac{s}{n}\sum_{i=1}^{n} Z_i\right)}{e^{s\tau}} = e^{-s\tau}\,\mathbb{E}\exp\left(\frac{s}{n}\sum_{i=1}^{n} Z_i\right).$$

The above bound is true for any $s > 0$. Hence, taking the infimum over all $s > 0$ results in the following result.

Theorem (Chernoff bound): Let $Z_1, Z_2, \ldots, Z_n$ be i.i.d. rvs with distribution $p_{Z_1}(z)$. Then,
$$\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} Z_i > \tau\right\} \leq \inf_{s > 0}\left[e^{-s\tau}\,\mathbb{E}\exp\left(\frac{s}{n}\sum_{i=1}^{n} Z_i\right)\right].$$

Practice Problem: Find the Chernoff bound on $\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} Z_i > \tau\right\}$ when $Z_1, Z_2, \ldots, Z_n$ are i.i.d. rvs having an exponential distribution with rate $\lambda$.

Let us apply the above to Bernoulli random variables: $Z_1, Z_2, \ldots, Z_n \overset{\text{i.i.d.}}{\sim}\text{Bern}(p)$ for some $p \in [0,1]$. In order to apply the above bound, we need to compute $\mathbb{E}\exp\left(\frac{s}{n}\sum_{i=1}^{n} Z_i\right)$, which is simplified as follows:
$$\mathbb{E}\exp\left(\frac{s}{n}\sum_{i=1}^{n} Z_i\right) = \mathbb{E}\prod_{i=1}^{n} e^{sZ_i/n} \overset{\text{i.i.d.}}{=} \prod_{i=1}^{n}\mathbb{E}\,e^{sZ_i/n}.$$

Now, it remains to compute $\mathbb{E}\,e^{sZ_i/n}$. Let us have some fun by using some calculus. This is the time you should realize that the various subjects that you have learnt, especially mathematics, will come in handy here. Note that
$$\mathbb{E}\,e^{sZ_i/n} = p\,e^{s/n} + (1 - p) = 1 + p\,(e^{s/n} - 1).$$
Now, use your calculus knowledge to prove the following.

Exercise 3: Show that $1 + y \leq e^{y}$.

Using the result in the above exercise with $y = p(e^{s/n} - 1)$, we get
$$\mathbb{E}\,e^{sZ_i/n} \leq e^{p(e^{s/n} - 1)}.$$

Thus,
$$\mathbb{E}\exp\left(\frac{s}{n}\sum_{i=1}^{n} Z_i\right) \leq \prod_{i=1}^{n} e^{p(e^{s/n} - 1)} = e^{np(e^{s/n} - 1)}.$$

Substituting this in the Chernoff bound, we get
$$\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} Z_i > \tau\right\} \leq \inf_{s > 0}\left[e^{-s\tau}\,e^{np(e^{s/n} - 1)}\right].$$
Noting that the exponent is a convex function of $s$, a simple differentiation leads to an optimal $s$ of $s^* = n\log\frac{\tau}{p}$. It is important to note that we need $\tau > p$ (so that $s^* > 0$). Substituting this above leads to
$$\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} Z_i > \tau\right\} \leq e^{-n\left[\tau\log\frac{\tau}{p} - \tau + p\right]}.$$

Denote $\alpha_p := \tau\log\frac{\tau}{p} - \tau + p$. Using this, the bound becomes
$$\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} Z_i > \tau\right\} \leq e^{-\alpha_p n}.$$

The above takes care of the first term in the probability of error expression. The second term looks something like $\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} Z_i < \tau\right\}$. Note that this probability stays bounded away from zero if $\tau$ is too large. Intuitively, for a coin with bias $3/4$, the probability of the relative frequency being far away from $3/4$ is small. Thus, if $\tau$ is sufficiently below $3/4$, then we get a reasonable bound; this is the essence of the following exercise.

Exercise 4: Find a bound on $\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} Z_i < \tau\right\}$ when the $Z_i$'s are i.i.d. Bernoulli random variables with bias $p$ and $\tau < p$. Hint: Use the Chernoff bounding technique.

The answer to Exercise 4 will be of the form $\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} Z_i < \tau\right\} \leq e^{-\beta_p n}$, where $\beta_p := \tau\log\frac{\tau}{p} + p - \tau$. It is required that $\beta_p > 0$ (which holds for $\tau < p$). To declutter things, let me summarize the two bounds below:


  • For $\tau > p$, we have $\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} Z_i > \tau\right\} \leq e^{-\alpha_p n}$ for some appropriate $\alpha_p > 0$.
  • For $\tau < p$, we have $\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} Z_i < \tau\right\} \leq e^{-\beta_p n}$ for some appropriate $\beta_p > 0$.
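
To get a feel for how tight these exponential bounds are, here is a small Python sketch (the values of $p$, $\tau$, $n$ and the number of trials are my own illustrative choices) comparing the bound $e^{-\alpha_p n}$ with the empirical tail probability for Bernoulli samples.

import numpy as np

rng = np.random.default_rng(1)
p, tau, n, num_trials = 0.5, 0.63, 100, 200000

# Chernoff-style bound derived above (valid for tau > p)
alpha_p = tau * np.log(tau / p) - tau + p
bound = np.exp(-alpha_p * n)

# empirical estimate of Pr{ (1/n) sum Z_i > tau } for Z_i ~ Bern(p)
means = (rng.random((num_trials, n)) < p).mean(axis=1)
empirical = np.mean(means > tau)

print("bound:", bound, "empirical:", empirical)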

Now we are ready to use the above result to prove a bound on the probability of error. It is a long way up in these notes where you can find what I am writing next, namely the two terms in the probability of error expression:
$$\Pr_{1/2}(\mathrm{LLR}(X^n) > \tau) = \Pr_{1/2}\left\{\frac{1}{n}\sum_{i=1}^{n} X_i > \tau\right\} \overset{\text{think why?}}{\leq} e^{-\alpha_{1/2} n},$$
$$\Pr_{3/4}(\mathrm{LLR}(X^n) < \tau) = \Pr_{3/4}\left\{\frac{1}{n}\sum_{i=1}^{n} X_i < \tau\right\} \overset{\text{think why?}}{\leq} e^{-\beta_{3/4} n}.$$
Thus the total probability of error is given by
$$\Pr\{\mathcal{E}\} \leq \pi_1 e^{-\alpha_{1/2} n} + (1 - \pi_1) e^{-\beta_{3/4} n}.$$

The above leads to the following result.

Theorem: For Problem 2, with the optimal LLR rule, the probability of error satisfies the following upper bound:
$$\Pr\{\mathcal{E}\} \leq 2\max\{\pi_1, \pi_0\}\, e^{-\min\{\alpha_{1/2},\,\beta_{3/4}\}\, n}. \tag{A}$$


Practice Problem: Find the number of samples so that the probability of error is at most $\delta$.

Exercise 4: Find explicit expressions for $\beta_{3/4}$ and $\alpha_{1/2}$.

Exercise 5: Find an upper bound on the probability of error for the optimal rule that you found for the problem in Exercise 2.

Generalizing the above Settings

Detection theory

It's time for us to generalize Problem 2. This involved tossing one of the coins $n$ times, resulting in $n$ i.i.d. rvs. Unlike the estimation problem, the distribution of these rvs depends on the coin chosen: $\text{Bern}(1/2)$ or $\text{Bern}(3/4)$. This leads to the following generalization.

  • Problem 2 (Detection problem): Given $n$ samples
    $$X_1, X_2, \ldots, X_n \overset{\text{i.i.d.}}{\sim}\begin{cases} p_{X^n}(x^n; H_0) \text{ or } f_{X^n}(x^n; H_0) & \text{if } H_0 \text{ is true}\\ p_{X^n}(x^n; H_1) \text{ or } f_{X^n}(x^n; H_1) & \text{if } H_1 \text{ is true.}\end{cases}$$

    The unknown here is whether the null hypothesis $H_0$ is true or the alternate hypothesis. The goal of the decision rule is to figure out the "right" strategy for resolving the issue. We resolved this issue when $H_0$ was "bias $3/4$" and the alternate was "bias $1/2$", and the distributions were Bernoulli with $3/4$ and $1/2$, respectively. In this general scenario, we will do something analogous to what we did for the coin toss problem. Let us consider the probability of error as a metric. As in the case of the coin toss, let us start by looking at the decision rule
    $$\Phi(X_1, X_2, \ldots, X_n) : \mathbb{R}^n \to \{0,1\}.$$

The error event is (exactly analogous to the coin toss)
$$\mathcal{E} := \{\text{Say } H_0 \text{ when } H_1\}\cup\{\text{Say } H_1 \text{ when } H_0\}.$$
To compute the probability of error, we first note that the two events in the union are disjoint. Thus,
$$\Pr\{\mathcal{E}\} = \Pr\{\text{Say } H_0 \text{ when } H_1\} + \Pr\{\text{Say } H_1 \text{ when } H_0\}.$$

Now, using the definition of "Say $H_1$" and "Say $H_0$" in terms of $\Phi(\cdot)$, we get the following:
$$\Pr\{\mathcal{E}\} \overset{\text{why?}}{=} \Pr\{\Phi(X_1,\ldots,X_n) = 0 \mid \text{when } H_1 \text{ is true}\}\times\pi_1 + \Pr\{\Phi(X_1,\ldots,X_n) = 1 \mid \text{when } H_0 \text{ is true}\}\times\pi_0.$$
The rest of the proof is very similar to the coin tossing example, and is hence left as an exercise. The final result is the following.


Theorem: For Problem 2, the optimal way of deciding whether the hypothesis $H_1$ is true or not is through the following log-likelihood ratio (LLR) rule:

  • Discrete case:
    $$\log\frac{p_{X^n}(x^n; H_0)}{p_{X^n}(x^n; H_1)} \underset{\text{choose } H_1}{\overset{\text{choose } H_0}{\gtrless}} \log\left(\frac{\pi_1}{1 - \pi_1}\right).$$

  • Continuous case:
    $$\log\frac{f_{X^n}(x^n; H_0)}{f_{X^n}(x^n; H_1)} \underset{\text{choose } H_1}{\overset{\text{choose } H_0}{\gtrless}} \log\left(\frac{\pi_1}{1 - \pi_1}\right).$$


Example: Say you get to observe $n$ i.i.d. samples either from $\text{Exp}(\lambda_0)$ or $\text{Exp}(\lambda_1)$. Further, assume that $\pi_1 = 1/2$ and $1/4$ (two cases).

Solution: Since the rvs involved are continuous, we need to consider the second equation in the theorem. Calling the hypothesis corresponding to the rate $\lambda_0$ as $H_0$, we have
$$f_{X^n\mid\text{when } H_0}(x^n) = \lambda_0^n\exp\left\{-\lambda_0\sum_{i=1}^{n} x_i\right\}.$$

Further,
$$f_{X^n\mid\text{when } H_1}(x^n) = \lambda_1^n\exp\left\{-\lambda_1\sum_{i=1}^{n} x_i\right\}.$$

Taking the ratio of the above two, and taking the logarithm, results in the following rule (assuming $\lambda_0 > \lambda_1$):
$$\frac{1}{n}\sum_{i=1}^{n} x_i \underset{\text{choose } H_0}{\overset{\text{choose } H_1}{\gtrless}} \frac{\tau}{n(\lambda_0 - \lambda_1)} + \frac{1}{\lambda_0 - \lambda_1}\log\left(\frac{\lambda_0}{\lambda_1}\right),$$
where $\tau := \log\left(\frac{1 - \pi_1}{\pi_1}\right)$.
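
The following is a minimal Python sketch of this decision rule (the rates $\lambda_0 = 2$, $\lambda_1 = 1$, $\pi_1 = 1/2$ and $n$ are assumed values for illustration, not fixed by the notes).

import numpy as np

rng = np.random.default_rng(2)
lam0, lam1, pi1, n = 2.0, 1.0, 0.5, 50

# threshold from the rule derived above (lam0 > lam1 assumed)
tau = np.log((1 - pi1) / pi1)
threshold = tau / (n * (lam0 - lam1)) + np.log(lam0 / lam1) / (lam0 - lam1)

# draw one batch of samples under H0 (rate lam0) and apply the rule
x = rng.exponential(scale=1 / lam0, size=n)   # numpy uses scale = 1/rate
decide_H1 = x.mean() > threshold
print("sample mean:", x.mean(), "threshold:", threshold, "decide H1:", decide_H1)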

Practice Question: Try using different values of $\pi_1$ (especially the ones given in the example above) and interpret the result.

Practice Question: Find the probability of error for the above decision rule.

Example: Say you get to observe $n$ i.i.d. samples either from $\mathcal{N}(\mu_0, 1)$ or $\mathcal{N}(\mu_1, 1)$. Further, assume that $\pi_1 = 1/2$ and $1/4$ (two cases). Find the decision rule. Also, find the probability of error.

The solution for the above problem will be of the form
$$\frac{1}{n}\sum_{i=1}^{n} X_i \underset{\text{choose } H_1}{\overset{\text{choose } H_0}{\gtrless}} \tau_g,$$
where $\tau_g$ is the appropriate threshold for the Gaussian case.

Practice Problem: Find the threshold $\tau_g$ for this problem.

Now, let us look at the probability of error. The error probability can be written as
$$P_e := \Pr\left\{\frac{1}{n}\sum_{i=1}^{n} X_i > \tau_g \,\Big|\, H_1\right\}\pi_1 + \Pr\left\{\frac{1}{n}\sum_{i=1}^{n} X_i < \tau_g \,\Big|\, H_0\right\}\pi_0.$$
From here on, it is downhill, as we have already seen how to solve such problems in the context of Bernoulli rvs. The trick is the same. First, let us consider the Chernoff bound for the first term above, i.e.,
$$\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} X_i > \tau_g \,\Big|\, H_1\right\} \leq \inf_{s > 0}\, e^{-sn\tau_g}\prod_{i=1}^{n}\mathbb{E}\left[e^{sX_i}\right].$$

The following is a way of making the document fancy.

Lemma (Moment Generating Function (MGF)): For $X_i \sim \mathcal{N}(\mu_1, 1)$, $\mathbb{E}\left[e^{sX_i}\right] = e^{\frac{s}{2}(s + 2\mu_1)}$.

Practice Problem: Let $X \sim \text{Exp}(\lambda)$. Find the MGF of $X$.

The name given to $\mathbb{E}e^{sX}$ is the MGF. More on this later.

Practice problem: Prove the above Lemma.

Using this Lemma, the above can be simplified to
$$\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} X_i > \tau_g \,\Big|\, H_1\right\} \leq \exp\left\{-\frac{n(\tau_g - \mu_1)^2}{2}\right\}.$$

Now, an easy exercise (literally follow the same steps as above, with $-1$ multiplied on both sides) results in the following, under the minor restriction that $\tau_g < \mu_0$ (always satisfied in our setting):
$$\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} X_i < \tau_g \,\Big|\, H_0\right\} \leq \exp\left\{-\frac{n(\mu_0 - \tau_g)^2}{2}\right\}.$$

Practice Problem: Show the above inequality.

Order wise, the probability of error decays exponentially in $n$:
$$P_e \leq 2\max\{\pi_1, \pi_0\}\cdot O(e^{-cn}) \quad \text{for some constant } c > 0.$$

So far, we have been looking at a scenario where the hypotheses are picked randomly with some known distribution. It is possible that the hypotheses may be picked randomly but we do not know the probabilities. One fix is to assume the worst case, and design for the worst case. Intuitively, the worst case is $\pi_1 = 1/2$. An alternative is to solve a problem that balances the various errors. Towards explaining this, we will consider two kinds of errors, popularly known as the probability of detection and the probability of false alarm. More precisely,
$$P_D(\Phi) := \Pr\{\text{decide } H_1 \mid H_1 \text{ is true}\}$$
and
$$P_F(\Phi) := \Pr\{\text{decide } H_1 \mid H_0 \text{ is true}\}.$$

Ideally, the objective should be to make $P_D(\Phi)$ as high as possible and $P_F(\Phi)$ as low as possible. Unfortunately, this is not possible simultaneously; if you want to make $P_D(\Phi)$ high (close to $1$), then decide in favor of $H_1$ always! However, this results in a false alarm probability of $1$. On the other hand, if we want to make the false alarm as small as possible, one possibility is to decide in favor of $H_0$ all the time. In this case, the detection probability will be $0$! Thus, there exists a tradeoff between the two. One possible way to balance this tradeoff is to solve the following problem:
$$\max_{\Phi}\; P_D(\Phi)\quad\text{such that } P_F(\Phi) \leq \alpha \tag{NeyPear-Problem}$$
for some $\alpha > 0$. The solution to the above is called the Neyman-Pearson test. In this course, we will not look at how to solve the problem, but will state the solution.


Lemma (Neyman-Pearson (NP) Lemma): The solution to the "NeyPear-Problem" is the log-likelihood ratio rule given below.

  • Discrete case:
    $$\log\frac{p_{X^n}(x^n; H_0)}{p_{X^n}(x^n; H_1)} \underset{\text{choose } H_1}{\overset{\text{choose } H_0}{\gtrless}} \tau.$$

  • Continuous case:
    $$\log\frac{f_{X^n}(x^n; H_0)}{f_{X^n}(x^n; H_1)} \underset{\text{choose } H_1}{\overset{\text{choose } H_0}{\gtrless}} \tau.$$
    In the above, the threshold is chosen so that the constraint on the false alarm probability, namely $P_F(\Phi) \leq \alpha$, is satisfied.

The proof of the above can be found in the famous "Elements of Information Theory" by Cover and Thomas, and in many other books on detection and estimation theory. Now, let us try to apply this solution to the following radar detection problem.

Example (Radar target detection): Consider a radar that sends a short pulse which gets reflected from a target (say a plane), and the radar observes the reflection. In fact, we are assuming that the pulses are really short, so that the speed of the aircraft is negligible! That is, the time difference between the pulses is so short that the airplane would have hardly moved. This is theory, so we need not worry about the practicality :-) If you did not follow the above, don't worry, ignore it and move on (then why did I write this? hmm)!

Assume that the radar sends $n$ pulses one after the other. This can be modelled using the following:
$$X_1, X_2, \ldots, X_n \overset{\text{i.i.d.}}{\sim}\begin{cases}\theta + W & \text{if the target is present}\\ W & \text{otherwise,}\end{cases}$$
where $W \sim \mathcal{N}(0, 1)$, and $\theta$ is a fixed number which represents the distance from the radar to the target. It is not so difficult to realize that we cannot assign probabilities to the two hypotheses, viz., target present and target not present. Naturally, NP is the way to go. It is easy to see that under the "target present" hypothesis, the distribution is $\mathcal{N}(\theta, 1)$, while under "target not present", the distribution of $X_i$ becomes $\mathcal{N}(0, 1)$, i.e., pure noise. Noting that the $X_i$'s are i.i.d., and applying the NP Lemma, we get
$$\bar{S}_n := \frac{1}{n}\sum_{i=1}^{n} X_i \underset{\text{choose } H_0}{\overset{\text{choose } H_1}{\gtrless}} \tau.$$

Note the change in the direction of the inequality. Now, we need to find the threshold $\tau$ such that the constraint on $P_F(\Phi)$ is satisfied. Let us first compute the false alarm probability:
$$P_F(\Phi) = \Pr\{\bar{S}_n > \tau \mid H_0 \text{ is true}\}.$$

In general, one can use the Chernoff technique to bound the above. In particular,
$$P_F(\Phi) \leq \inf_{s > 0}\left[e^{-ns\tau}\prod_{i=1}^{n}\mathbb{E}\left[e^{sX_i} \mid H_0 \text{ is true}\right]\right] = \inf_{s > 0}\left[e^{-ns\tau}\prod_{i=1}^{n} e^{s^2/2}\right] = e^{-n\tau^2/2}.$$

Now, we need $P_F(\Phi) \leq \alpha$, which is satisfied if $e^{-n\tau^2/2} \leq \alpha$. This can be achieved by choosing $\tau = \sqrt{\frac{2}{n}\log\frac{1}{\alpha}}$.
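
As a sanity check, the following Python sketch (with $n$, $\alpha$ and the number of trials chosen by me for illustration) estimates the false alarm rate of this threshold under $H_0$ and compares it with $\alpha$.

import numpy as np

rng = np.random.default_rng(3)
n, alpha, num_trials = 25, 0.05, 100000

tau = np.sqrt(2 * np.log(1 / alpha) / n)   # threshold from the Chernoff-based design

# under H_0 the samples are pure noise N(0,1)
noise = rng.standard_normal((num_trials, n))
false_alarm_rate = np.mean(noise.mean(axis=1) > tau)

print("tau:", tau, "empirical P_F:", false_alarm_rate, "target alpha:", alpha)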
There are several approaches to the above problem. An alternate approach is to use the fact that the sum of independent Gaussians is Gaussian. We will first show that this is indeed true.

Sum of Gaussians is Gaussian: Let $X$ and $Y$ be two independent Gaussian random variables with zero means and variance $\sigma^2$. Let us first write the CDF of $Z = X + Y$ using the total probability law for continuous rvs:
$$\Pr\{Z \leq z\} \overset{\text{why?}}{=} \int_{-\infty}^{\infty}\Pr\{X + Y \leq z \mid Y = y\}\, f_Y(y)\,dy.$$

A term in the integrand can be written as
$$\Pr\{X + Y \leq z \mid Y = y\} = \Pr\{X \leq z - y \mid Y = y\} = \Pr\{X \leq z - y\} = \int_{-\infty}^{z - y} f_X(x)\,dx,$$
where the second equality follows from the fact that $X$ and $Y$ are independent. Substituting this, we get
$$\Pr\{Z \leq z\} = \int_{-\infty}^{\infty}\int_{-\infty}^{z - y} f_X(x) f_Y(y)\,dx\,dy.$$

Now, taking the derivative with respect to $z$, we get
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y) f_Y(y)\,dy.$$

Substituting the Gaussian PDFs and after a long, laborious algebra, we get the desired result, which is left as an exercise.

Practice Problem: Show that the above integral results in $Z \sim \mathcal{N}(0, 2\sigma^2)$, i.e.,
$$f_Z(z) = \frac{1}{2\sqrt{\pi}\,\sigma}\, e^{-z^2/4\sigma^2}.$$

The second method is to use the moment method, i.e., using the MGF. This will not be dealt with here.

Exercise 6: Solve the following problem taking the NP approach:
$$X_1, X_2, \ldots, X_n \overset{\text{i.i.d.}}{\sim}\begin{cases}\text{Exp}(\lambda_0) & \text{under } H_0\\ \text{Exp}(\lambda_1) & \text{under } H_1.\end{cases}$$

Let us revisit detection problems a little later. Now, we will look at the estimation problem, and some "interesting" things that we can do there.


Why theory?

You may have noted that the optimal decision rules so far turned out to be intuitive. Then the question is: why do we need theory!?

As an example, consider the following hypothesis testing problem:
$$X_1, X_2, \ldots, X_n \overset{\text{i.i.d.}}{\sim}\begin{cases}\mathcal{N}(0, 1) & \text{under } H_0\\ p\,\mathcal{N}(-5, 1) + (1 - p)\,\mathcal{N}(5, 1) & \text{under } H_1.\end{cases}$$

  • If the $n$ samples are drawn under the null hypothesis, the histogram looks like a single bell curve centred at zero.

  • If the $n$ samples are drawn under the alternate hypothesis, the histogram looks like a mixture with two modes (around $\pm 5$).

What are the mean and variance under the two hypotheses? The mean is zero for both, and the variance is $26p + 26(1 - p) = 26$ for the alternate and $1$ under the null hypothesis. Intuitively, the test should depend on the variance. But what exactly should we use? Maybe the variance! It works to some extent, but may not work as well as the LLR based test, since the latter is the optimal test.

Code to play around with (MATLAB/Python)

MMSE versus LMMSE (estimation): Use the following Python code (thanks to ChatGPT for the conversion from MATLAB to Python); modify and use it as per your wish.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # Set a random seed for reproducibility

num_trial = 5000
num_samp = 1
snr = np.arange(1, 101, 10)  # Equivalent to 1:10:100 in MATLAB

mse_mmse = []
mse_lmmse = []

for SNR in snr:
    data = np.random.randn(num_trial, num_samp)

    # Generate noise
    noise = np.random.randn(num_trial, num_samp) / np.sqrt(SNR)

    # Generate observations
    y = data + noise

    # MMSE Estimate
    data_mmse = y * (SNR / (1 + SNR))

    # LMMSE
    data_lmmse = y

    # Error calculation
    error_mmse = data - data_mmse
    error_lmmse = data - data_lmmse

    # MSE calculations
    mse_mmse.append(np.mean(np.square(error_mmse)))
    mse_lmmse.append(np.mean(np.square(error_lmmse)))

plt.figure(1)
plt.semilogy(10 * np.log10(snr), mse_mmse, '-b')
plt.semilogy(10 * np.log10(snr), mse_lmmse, '-r')
plt.legend(['mmse', 'lmmse'])
plt.grid(True)
plt.xlabel('SNR in dB')
plt.ylabel('MSE')
plt.show()

Hypothesis testing: Use the following MATLAB code; modify and use it as per your wish.
% Hypothesis testing with mixture distribution
% Shows LLR test is optimal!
clear all;
clc;

num_samples = 100;
mixture = 1;
count = 0;
for temp = 1:10000
if mixture == 1
bern = rand(1,num_samples)>0.5;
mix_samples = (bern>0).*(-1+randn(1,num_samples)) + (bern<=0).*(1 + randn(1,num_samples));
else mix_samples = randn(1,num_samples);
end
% figure(1);
% hist(mix_samples,40);
% grid on;
% legend('Mixture Gaussian');


% figure(2)
% rand_samples = 5.*randn(1,num_samples);
% hist(rand_samples,40);
% grid on;
% legend('Standard Gaussian');

% Hypothesis test
likelihoodnum = 0;
likelihoodden = 0;
for temp = 1:length(mix_samples)
    likelihoodnum = likelihoodnum + log(0.5*exp(-(mix_samples(temp)-1)^2/2) + 0.5*exp(-(mix_samples(temp)+1)^2/2));
end

for temp = 1:length(mix_samples)
    likelihoodden = likelihoodden + log(exp(-mix_samples(temp)^2/2));
end
det_llr = 1;

if likelihoodnum>likelihoodden
    disp("mixture:");
else 
    det_llr = 0;
    disp("single");
end

est_var = mean((mix_samples-mean(mix_samples)).^2);
det = 1;
if est_var > 2 % Can use different threshold instead of 2
    disp("mixture: average metric")
else 
    det = 0;
    disp("single: average metric");
end

if det == 1
    count = count + 1;
end

end
count

Estimation theory

First, let us try to generalize Problem 1. Note that the solution involved tossing the coin $n$ times, resulting in $n$ i.i.d. rvs with $\text{Bern}(p)$ distribution. The task is to find the parameter $p$. Similarly, in Exercise 1, $n$ samples from a uniform distribution $X_1, X_2, \ldots, X_n \overset{\text{i.i.d.}}{\sim}\text{Unif}(\theta)$ were given, and the task was to find the unknown parameter $\theta$. This leads to the following generalization of the estimation problem.

  • Problem 1 (Estimation problem): Given $n$ samples $X_1, X_2, \ldots, X_n \overset{\text{i.i.d.}}{\sim} p_{X^n}(x^n; \theta)$ or $f_{X^n}(x^n; \theta)$, find the unknown parameter $\theta$. Here, $\theta$ is a parameter of the distribution. For example, it could be the mean, the variance, the maximum value, etc.
    Examples for $f_{X^n}(x^n; \theta)$ and $p_{X^n}(x^n; \theta)$ include (i) Gaussian with mean $\theta$ and variance $1$, (ii) uniform distribution with domain $[0, \theta]$, (iii) geometric distribution with parameter $p$, (iv) Poisson with rate $\lambda$ and (v) exponential with rate $\lambda$, and many more.
    Many questions need to be addressed, such as: what should be the metric with respect to which we need to find $\theta$? How do we know that a given estimator (mostly derived from intuition) is good with respect to a given metric? This will be the topic of study in this course.

How do we solve this general problem? As in the example above, first we need to find a metric with respect to which we should find our estimate. Of course, we will start with the MSE metric. In particular, given an estimator $\hat{\theta}_n \in \Theta \subseteq \mathbb{R}$ (in general, this can be $\mathbb{R}^n$) which depends on the $n$ observations, the MSE metric is
$$\text{MSE}_n := \mathbb{E}|\theta - \hat{\theta}_n|^2,$$
where the expectation is with respect to the joint distribution of $X_1, X_2, \ldots, X_n$. One wishes to solve the following problem:
$$\min_{\hat{\theta}_n}\left\{\text{MSE}_n := \mathbb{E}|\theta - \hat{\theta}_n|^2\right\}.$$

It is important to note that $\theta$ has to be random for the above to make sense. In this case, the expectation above is also with respect to $\theta$. It is possible that $\theta$ need not be random; for example, in Problem 1 above, $p$ is a deterministic but unknown quantity. The approach that we are going to describe, with MSE as the metric, does not work when $\theta$ is not random. More on this later.

Simple things first: What if you don't get to observe anything, but you have to make an estimate of $\theta$, where $\theta$ is random, drawn according to a known distribution? In this case, we would like to solve the following problem:
$$\min_{\hat{\theta}}\left\{\text{MSE} := \mathbb{E}|\theta - \hat{\theta}|^2\right\}.$$

Note that $\hat{\theta}$ does not depend on the number of samples $n$, as the estimator does not have access to samples! First, let us expand
$$\text{MSE} \overset{\text{why?}}{=} \mathbb{E}\theta^2 + \hat{\theta}^2 - 2\hat{\theta}\,\mathbb{E}\theta.$$

Minimizing the above with respect to $\hat{\theta}$, we get $\hat{\theta} = \mathbb{E}\theta$. This is intuitively very appealing, as one would bet on the mean when nothing is known!

What if we are given $n$ samples that can potentially depend on $\theta$? How do we solve such problems? Before doing this, let us recall what we want out of an estimator:

  • First, on an average (true average), the estimate should be true; mathematically, $\mathbb{E}[\hat{\theta}_n \mid \theta] = \theta$.
    • An estimator with the above property is in general called an unbiased estimator.
  • Second, as we increase the number of samples, the estimation should become "better" (in terms of the metric that we are using).

I urge the reader to compare this with the list that we had when we studied Problem 1. It looks very similar, isn't it? Now, let us make some of these notions more formal.

Definition (estimator): Given $n$ observations $X_1, X_2, \ldots, X_n$, an estimator is a map $\hat{\theta}_n(X_1, X_2, \ldots, X_n)$.

Definition (unbiased estimator): We say that an estimator $\hat{\theta}_n(X_1, X_2, \ldots, X_n)$ is an unbiased estimator if $\mathbb{E}[\hat{\theta}_n(X_1, X_2, \ldots, X_n) \mid \theta] = \theta$, the true value.

Note that there is a conditioning with respect to $\theta$. This is because $\theta$ is sampled from a distribution. A straightforward way of coming up with an estimator is by making a linear combination of the observations, i.e.,
$$\hat{\theta}_n = \sum_{i=1}^{n}\alpha_i X_i.$$

The above estimator is called a linear estimator, for obvious reasons. Now, the goal is to solve the following problem:
$$\min_{\alpha_1, \alpha_2, \ldots, \alpha_n}\mathbb{E}\left\|\sum_{i=1}^{n}\alpha_i X_i - \theta\right\|_2^2.$$
Note that we have used the norm above, which takes care of vector valued estimators. However, to begin with, let us assume that the estimator is scalar, and the above is an absolute value. The first thing to note is that the above function (as a function of each $\alpha_i$) is convex. Thus, differentiating with respect to each $\alpha_j$ and equating to zero results in
$$2\,\mathbb{E}\left(\sum_{i=1}^{n}\alpha_i X_i - \theta\right)X_j = 0 \quad \text{for all } j.$$

Using linearity of expectation, the above can be written as
$$\sum_{i=1}^{n}\alpha_i\,\mathbb{E}(X_i X_j) = \mathbb{E}(\theta X_j) \quad \text{for all } j.$$

Matrix algebra comes in handy while dealing with the above. In particular, defining (i) a matrix $\Psi$ whose $ji$-th entry is $\mathbb{E}(X_i X_j)$, (ii) a vector $\mu_X$ whose $i$-th entry is $\mathbb{E}(\theta X_i)$, and (iii) a vector $\alpha$ whose $i$-th entry is $\alpha_i$, leads to the following:
$$\alpha^T\Psi^T = \mu_X^T \iff \Psi\alpha = \mu_X.$$

The above leads to the following solution, also known as the Linear Minimum Mean Squared Error (LMMSE) estimator:
$$\alpha^* = \Psi^{-1}\mu_X.$$

If the inverse of the matrix does not exist, then one can use what is called the pseudo-inverse. For now, let us not worry about the technicalities but stick to the above solution.
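
A minimal Python sketch of this recipe (the toy model below, $X_i = \theta + $ unit-variance noise with $\theta \sim \mathcal{N}(0, 1)$, is my own choice for illustration): estimate the second-moment matrix $\Psi$ and the cross-moment vector $\mu_X$ from samples, solve $\Psi\alpha = \mu_X$, and form $\hat{\theta}_n = \sum_i \alpha_i X_i$.

import numpy as np

rng = np.random.default_rng(4)
n, num_trials = 5, 200000

# assumed toy model: theta ~ N(0,1), X_i = theta + N(0,1) noise, i = 1..n
theta = rng.standard_normal(num_trials)
X = theta[:, None] + rng.standard_normal((num_trials, n))

Psi = X.T @ X / num_trials            # (i,j) entry approximates E[X_i X_j]
mu_X = X.T @ theta / num_trials       # i-th entry approximates E[theta X_i]
alpha = np.linalg.solve(Psi, mu_X)    # LMMSE weights: solve Psi alpha = mu_X

theta_hat = X @ alpha
print("weights:", alpha)              # roughly 1/(n+1) each for this model
print("empirical MSE:", np.mean((theta_hat - theta) ** 2))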

It is possible that the above need not be the best solution overall. However, it is indeed the best among all linear estimators. The first thing to note is that the above is, in general, not an unbiased estimator.

Geometric Interpretation: In order to appreciate and understand the LMMSE better, let us consider a problem where the randomness is stripped off, i.e., the $x_i$'s and $\theta$ are deterministic vectors lying in $\mathbb{R}^d$.

The figure below shows the geometry of Least Squares (LS).

[Figure: geometry of Least Squares, the projection of $\theta$ onto the span of the observations.]

Suppose we have $n$ observations $x_1, x_2, \ldots, x_n$ lying in $\mathbb{R}^d$ (for concreteness, assume that they lie in the $x$-$y$ plane), as shown in the figure above (labelled as the span of $x$). We want an estimate in $\mathbb{R}^d$ (the $x$-$y$ plane) that is closest to $\theta$. For simplicity, let the estimate be a linear combination of these $n$ observations, i.e., $\sum_{i=1}^{n}\alpha_i x_i$, and hence it lies in $\mathbb{R}^d$ (the $x$-$y$ plane). See how linear algebra is useful here! Thus, we need to adjust the weights $\alpha_i$'s such that the distance between the linear combination and $\theta$ is small, i.e., we need to solve the following problem:
$$\min_{\alpha_1, \alpha_2, \ldots, \alpha_n}\left\|\sum_{i=1}^{n}\alpha_i x_i - \theta\right\|_2^2.$$

Intuitively, the optimal estimate would be the vector obtained by dropping a perpendicular from $\theta$ onto the span of the observations, i.e.,
$$\left(\theta - \sum_{i=1}^{n}\alpha_i x_i\right) \perp \sum_{i=1}^{n}\alpha_i x_i.$$
In other words,
$$\left\langle\theta - \sum_{i=1}^{n}\alpha_i x_i,\; \sum_{i=1}^{n}\alpha_i x_i\right\rangle = 0.$$

Solving the above leads to the following:
$$\sum_{i=1}^{n}\alpha_i\langle\theta, x_i\rangle = \sum_{i,j=1}^{n}\alpha_i\alpha_j\langle x_i, x_j\rangle.$$

Using $\alpha := (\alpha_1, \alpha_2, \ldots, \alpha_n)^T$, $a := (\langle\theta, x_1\rangle, \langle\theta, x_2\rangle, \ldots, \langle\theta, x_n\rangle)^T$, and letting the $ij$-th entry of a matrix $\Psi$ be $\langle x_i, x_j\rangle$, the above can be written as
$$\alpha^T a = \alpha^T\Psi\alpha.$$

Note that the matrix $\Psi$ is a symmetric matrix. The above leads to $a = \Psi\alpha$, which gives $\alpha = \Psi^{-1}a$; a solution similar to the LMMSE solution. This solution is called the Least Squares (LS) solution, and the corresponding problem is called an LS problem.

In the LMMSE setting, the difficulty is that $\theta$ and the $X_i$'s are random, and hence there is no immediate meaning of "perpendicular". An obvious approach is to bring in the structure of a vector space with an inner product, in which case the notion of perpendicularity makes sense. The following is an aside, which is not a part of this course; however, interested readers can go through it.


We use the following approach:

  • Turning the space of rvs into a vector space: Consider the following set
    $$L(\mathbb{R}) := \{X : \mathbb{E}X^2 < \infty\}.$$
    Note that $L(\mathbb{R})$ is a vector space with respect to the usual sum and scalar product.
  • Define an inner product: Consider the following
    $$\langle X, Y\rangle := \mathbb{E}(XY).$$
    The above is clearly an inner product, mimicking the dot product, on a strange space which mimics $\mathbb{R}^2$. The above induces a notion of distance:
    $$\|X - Y\|_{L(\mathbb{R})}^2 := \langle X - Y, X - Y\rangle.$$
    Note the similarity between this and the Euclidean distance, which is obtained as $x^T x$.

Going by analogy with the LS problem, we can guess the solution to be given by
$$\alpha^T a = \alpha^T\Psi\alpha,$$
where the $ij$-th entry of $\Psi$ is $(\Psi)_{ij} = \mathbb{E}(X_i X_j)$ and $(a)_i = \mathbb{E}(X_i\theta)$. Now, the final solution is similar to the geometric problem above.

Example (LMMSE): Let $X_1, X_2, \ldots, X_n \overset{\text{i.i.d.}}{\sim}\mathcal{N}(\theta, 1)$ given $\theta$, where $\theta$ is sampled from some distribution $f_{\Theta}(\theta)$ independently of the additive noise in the $X_i$'s. In this case, the matrix $\Psi$ has $\mathbb{E}\theta^2$ in the off-diagonal entries and $1 + \mathbb{E}\theta^2$ on the diagonal (why?). This leads to an estimator that is biased.

How do we construct an LMMSE-style estimator that is unbiased? One way is to modify the above estimator to the following:
$$\hat{\theta}_n = \sum_{i=1}^{n}\alpha_i X_i + b.$$
Now the optimization problem becomes
$$\min_{\alpha_1, \alpha_2, \ldots, \alpha_n;\, b}\mathbb{E}\left|\sum_{i=1}^{n}\alpha_i X_i + b - \theta\right|^2.$$

A similar approach as above can be followed to solve the problem.

Exercise 7: Solve the problem above, and find the optimal $\alpha_i$'s and $b$.

Next, we will look at the general solution to the Minimum MSE (MMSE) problem above, i.e.,
$$\min_{g(X^n)}\mathbb{E}|g(X^n) - \theta|^2.$$
Let $g(X^n)$ be any estimator, and consider the following estimator, which minimizes the conditional MSE:
$$f(X^n) := \arg\min_{g(X^n)}\mathbb{E}\left[|g(X^n) - \theta|^2 \,\big|\, X^n = x^n\right].$$

Now, we show that the above estimator solves the MMSE problem.


Theorem: The conditional mean, i.e., $f(X^n) = \mathbb{E}[\theta \mid X^n]$, is the optimal solution to the MMSE problem.

Proof: Consider the following for any estimator $g(X^n)$:
$$\mathbb{E}|g(X^n) - \theta|^2 = \mathbb{E}_{X^n}\left[\mathbb{E}\left[|g(X^n) - \theta|^2 \,\big|\, X^n = x^n\right]\right] \overset{(a)}{\geq} \mathbb{E}_{X^n}\left[\mathbb{E}\left[|f(X^n) - \theta|^2 \,\big|\, X^n = x^n\right]\right] = \mathbb{E}\left[|f(X^n) - \theta|^2\right],$$
where $(a)$ follows since $f(X^n)$ minimizes the conditional expectation for every $x^n$. Hence $f(X^n)$ is indeed the optimal estimator.

Now, it remains to solve the following problem:
$$f(X^n) := \arg\min_{g(X^n)}\mathbb{E}\left[|g(X^n) - \theta|^2 \,\big|\, X^n = x^n\right].$$


Theorem: The solution to the above problem is given by $f(X^n) = \mathbb{E}[\theta \mid X^n = x^n]$.

Proof: First, let us expand the conditional expectation:
$$\mathbb{E}\left[|g(X^n) - \theta|^2 \,\big|\, X^n = x^n\right] = g^2(x^n) + \mathbb{E}[\theta^2 \mid X^n = x^n] - 2g(x^n)\,\mathbb{E}[\theta \mid X^n = x^n].$$

Now, differentiate with respect to $g(\cdot)$; what do you mean by this? For now, let us be a little sloppy and do the usual differentiation with respect to the number $g(x^n)$; equating it to zero proves the theorem.

Based on our intuition from the geometry and the LS problem, we know that the error is orthogonal to the estimate. Note that in our new language of inner products and norms, we expect the following, which is indeed true!

Lemma: The error is uncorrelated with the estimate, i.e.,
$$\left\langle\theta - \mathbb{E}[\theta \mid X^n],\; \mathbb{E}[\theta \mid X^n]\right\rangle = \mathbb{E}\left[\left(\theta - \mathbb{E}[\theta \mid X^n]\right)\mathbb{E}[\theta \mid X^n]\right] = 0.$$
Proof: Follows by using (the tower property of) conditional expectation. Easy exercise!

Example: Consider the following density:
$$f(x, y) = \begin{cases}\frac{12}{11}(x + 1) & \text{if } 0 \leq x \leq y,\; 0 \leq y \leq 1,\\ 0 & \text{otherwise.}\end{cases}$$
Find the MMSE estimator of $X$ given the observation $Y$.

Example: Consider $Y = X + W$, where $W \sim \mathcal{N}(0, 1/\text{SNR})$ and $X \sim \mathcal{N}(0, 1)$ are independent Gaussians. We get to observe $Y$. Compute the MMSE estimate of $X$ given $Y$.
Solution: The MMSE estimate is $\hat{X} := \mathbb{E}[X \mid Y = y]$. This requires us to compute the conditional density $f_{X|Y}(x|y)$, which is given by
$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{f_{Y|X}(y|x)\,f_X(x)}{f_Y(y)}.$$

In the above, $f_{Y|X}(y|x)$ is $\mathcal{N}(x, 1/\text{SNR})$. Using this, we get
$$\frac{f_{Y|X}(y|x)\,f_X(x)}{f_Y(y)} = \frac{\sqrt{\frac{\text{SNR}}{2\pi}}\exp\left\{-\frac{\text{SNR}(y - x)^2}{2}\right\}\times\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{x^2}{2}\right\}}{\sqrt{\frac{\text{SNR}}{2\pi(1 + \text{SNR})}}\exp\left\{-\frac{\text{SNR}\,y^2}{2(1 + \text{SNR})}\right\}} = \sqrt{\frac{1 + \sigma_{\text{eff}}^2}{2\pi\sigma_{\text{eff}}^2}}\exp\left\{-\frac{\left(x - \frac{y}{1 + \sigma_{\text{eff}}^2}\right)^2}{2\frac{\sigma_{\text{eff}}^2}{1 + \sigma_{\text{eff}}^2}}\right\},$$
where we have used the fact that the density of $Y$ is again Gaussian (why?), and $\sigma_{\text{eff}}^2 := \frac{1}{\text{SNR}}$. Thus, the distribution of $X$ conditioned on $Y = y$ is again Gaussian with mean $\frac{y}{1 + \sigma_{\text{eff}}^2} = \frac{\text{SNR}}{1 + \text{SNR}}\,y$ and variance $\frac{\sigma_{\text{eff}}^2}{1 + \sigma_{\text{eff}}^2}$. Now, the conditional expectation is easy to compute; it is simply the mean of $X$ conditioned on $Y = y$, which is $\frac{\text{SNR}}{1 + \text{SNR}}\,y$. The MMSE estimate is simply a scaling of the observation:
$$\hat{X} := \frac{\text{SNR}}{1 + \text{SNR}}\,y.$$

A few observations are in order:

  • The estimate is biased, i.e.,
    $$\mathbb{E}(\hat{X} \mid X = x) = \frac{x}{1 + \sigma_{\text{eff}}^2} \neq x!$$

  • The MSE of the estimator is
    $$\text{MSE} = \mathbb{E}\left(\frac{Y}{1 + \sigma_{\text{eff}}^2} - X\right)^2 = \mathbb{E}\left(\frac{X + W}{1 + \sigma_{\text{eff}}^2} - X\right)^2 = a^2\,\mathbb{E}X^2 + b^2\,\mathbb{E}W^2 - 2ab\,\mathbb{E}(XW) = O\!\left(\frac{1}{\text{SNR}}\right),$$
    where $a = \frac{\sigma_{\text{eff}}^2}{1 + \sigma_{\text{eff}}^2}$, $b = \frac{1}{1 + \sigma_{\text{eff}}^2}$, and $\mathbb{E}(XW) = \mathbb{E}X\,\mathbb{E}W = 0$.

  • The error is uncorrelated with the estimate! Check this yourself!

  • A natural question to ask is how the above compares with LS. Since there is only one observation, finding the LS estimate is easy. In particular,
    $$\hat{X}_{LS} = \mathbb{E}(XY)\,Y = \mathbb{E}X^2\times Y = Y.$$
    It is clear that the LS is an unbiased estimate while the MMSE is not. Further, the MSE of the LS is $1/\text{SNR}$! Order wise, both MMSE and LS are the same. See the figure below for a comparison. In fact, observe that at very high SNR, i.e., in the low noise condition, the MMSE and LS both approach the same performance, as expected.

  • Recall that when we started this course, we listed down the requirements for an estimator: it should be unbiased and also have good MSE performance. If the estimator is unbiased, then the MSE $\mathbb{E}\|\hat{\theta} - \theta\|^2$ is the variance. Hence we need a Minimum Variance Unbiased Estimator (MVUE). This is beyond the scope of this course, and will be dealt with in more detail in other courses.

[Figure: MSE versus SNR for the MMSE and LMMSE estimators.]

Code to plot the MSE of MMSE and LMMSE (MATLAB).
%% MMSE versus LMMSE
clear all; clc;
num_trial = 5000;
num_samp = 1;
snr = 1:10:100;
temp = 0;
mse_mmse = [];
mse_lmmse = [];
for i = 1:length(snr)
    SNR = snr(i);
    % Generate n=1 i.i.d. N(0,1) samples: X^n
    data = randn(num_trial, num_samp);
    % Generate n=1 i.i.d. noise samples: W^n
    noise = randn(num_trial, num_samp)./sqrt(SNR);
    % Generate observations: Y^n (n = 1 here)
    y = data + noise;
    % MMSE estimate
    data_mmse = y.*(SNR/(1 + SNR));
    % LMMSE
    data_lmmse = y;
    % Error calculation
    error_mmse = data - data_mmse;
    error_lmmse = data - data_lmmse;
    % MSE calculations
    temp = temp + 1;
    mse_mmse(temp) = trace(error_mmse*error_mmse')/num_trial;
    mse_lmmse(temp) = trace(error_lmmse*error_lmmse')/num_trial;
end
figure(1);
semilogy(10*log10(snr), mse_mmse, '-b');
hold on;
semilogy(10*log10(snr), mse_lmmse, '-r');
legend('mmse', 'lmmse');
grid on;
xlabel('SNR in dB');
ylabel('MSE')

Exercise: Show that the LS and MMSE approaches result in the same estimator for the above (Gaussian) model. Also, convince yourself why the MMSE curve is not linear while the LMMSE curve is linear.

Exercise: Consider the following joint distribution of $X$ and $Y$.

| $p_{X,Y}(x,y)$ | $Y=-1$ | $Y=0$ | $Y=1$ |
|---|---|---|---|
| $X=0$ | 1/6 | 1/6 | 0 |
| $X=1$ | 0 | 0 | 1/3 |
| $X=3$ | 1/6 | 1/6 | 0 |

  • Assume that you observe $X$, and find an MMSE estimate of $Y$.
  • Assume that you observe $Y$, and find an MMSE estimate of $X$.

Exercise: Let $X$ and $Y$ have joint PDF
$$f_{X,Y}(x,y) = x + y \quad \text{for } 0 \leq x \leq 1,\; 0 \leq y \leq 1.$$

Find the MMSE estimate of $X$ given the observation $Y$.

Maximum Likelihood (ML) Estimate

It is evident from the above discussion that if $\theta$ is deterministic and unknown, the MSE cannot be used as a metric to find an estimate of $\theta$. We use the following example to introduce a method of estimation called the Maximum Likelihood method; a method similar to what is done in the hypothesis testing case, but with a small twist.

Example: Consider Problem 1, where we are supposed to find an estimate of $p$ given $n$ tosses $X_1, X_2, \ldots, X_n$. But $p$ is deterministic and unknown! Let us look at the likelihood of observing the outcomes $x_1, x_2, \ldots, x_n$ when a coin of bias $q$ is tossed:
$$\Pr\{x_1, x_2, \ldots, x_n \mid q \text{ is the coin}\} = q^{\sum_{i=1}^{n} x_i}\times(1 - q)^{n - \sum_{i=1}^{n} x_i}.$$

A natural choice of $q$ would be something close to $p$, as $p$ has generated these samples. However, I do not know $p$. But, intuitively, I should choose a $q$ that maximizes the above, i.e.,
$$\max_{q}\Pr\{x_1, x_2, \ldots, x_n \mid q \text{ is the coin}\}.$$

Solving the above leads to the following solution; this is called the maximum likelihood estimate, or ML estimate for short:
$$\hat{q}_n = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

One can generalize the above and say that if $X_1, X_2, \ldots, X_n \sim f_{X^n}(x_1, x_2, \ldots, x_n; \theta)$ or $p_{X^n}(x_1, x_2, \ldots, x_n; \theta)$, then the ML estimate is given by
$$\hat{\theta}_n^{ML} := \arg\max_{\theta}\left\{f_{X^n}(x_1, x_2, \ldots, x_n; \theta) \text{ or } p_{X^n}(x_1, x_2, \ldots, x_n; \theta)\right\}.$$

Now, we have so many questions. Is the ML estimate unbiased? Does it lead to a very low MSE when $n$ is large? Before answering these questions, let us get used to ML by considering more examples.


Example: Let $X_1, X_2, \ldots, X_n$ be i.i.d. from $\text{Exp}(\theta)$. Find an ML estimate of $\theta$.

First, we need to find the joint density of the $n$ rvs:
$$f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} f_{X_i}(x_i; \theta) = \theta^n\exp\left(-\theta\sum_{i=1}^{n} x_i\right).$$

The ML estimate of $\theta$ is given by
$$\max_{\theta}\left[\theta^n\exp\left(-\theta\sum_{i=1}^{n} x_i\right)\right].$$

Taking the log, differentiating and equating to zero results in $\hat{\theta}_n^{ML} = \left[\frac{1}{n}\sum_{i=1}^{n} x_i\right]^{-1}$; an intuitively appealing solution.


Example: Let $X_1, X_2, \ldots, X_n$ be i.i.d. from $\mathcal{N}(0, \theta^2)$. Find an ML estimate of $\theta$.

First, we need to find the joint density of the $n$ rvs:
$$f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} f_{X_i}(x_i; \theta) = \frac{1}{(2\pi)^{n/2}\theta^n}\exp\left(-\frac{1}{2\theta^2}\sum_{i=1}^{n} x_i^2\right).$$

The ML estimate of $\theta$ is given by
$$\max_{\theta}\frac{1}{(2\pi)^{n/2}\theta^n}\exp\left(-\frac{1}{2\theta^2}\sum_{i=1}^{n} x_i^2\right).$$

Taking the log, differentiating and equating to zero results in $\hat{\theta}_n^{ML} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$; again an intuitively appealing solution.

Example: Let $X_1, X_2, \ldots, X_n$ be i.i.d. from $\text{Unif}(\theta)$. Find an ML estimate of $\theta$.

First, we need to find the joint density of the $n$ rvs:
$$f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} f_{X_i}(x_i; \theta) = \frac{1}{\theta^n} \quad \text{for all } x_i \leq \theta,\; i = 1, 2, \ldots, n.$$

Thus, maximizing the above to get an ML estimate amounts to solving the problem below:
$$\max_{\theta > 0}\frac{1}{\theta^n}\quad\text{such that } \theta \geq \max\{x_1, x_2, \ldots, x_n\}.$$

The solution is trivial, i.e., $\hat{\theta}_n := \max\{x_1, x_2, \ldots, x_n\}$; again an intuitively appealing solution.
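
To see these ML estimators in action, here is a small Python sketch (the true parameter values and the sample size are my own choices for illustration) that draws samples and evaluates the closed-form ML estimates derived above.

import numpy as np

rng = np.random.default_rng(5)
n = 10000

# Exp(theta): ML estimate is 1 / (sample mean)
theta_exp = 2.0
x_exp = rng.exponential(scale=1 / theta_exp, size=n)
print("Exp rate ML:", 1 / x_exp.mean())                     # close to 2.0

# N(0, theta^2): ML estimate is the square root of the average of squares
theta_gauss = 1.5
x_gauss = rng.normal(0, theta_gauss, size=n)
print("Gaussian std ML:", np.sqrt(np.mean(x_gauss ** 2)))   # close to 1.5

# Unif(0, theta): ML estimate is the sample maximum
theta_unif = 3.0
x_unif = rng.uniform(0, theta_unif, size=n)
print("Uniform max ML:", x_unif.max())                      # close to (slightly below) 3.0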

Practice Problem: Use the MATLAB code provided above to generate $n = 1000$ samples from a Gaussian distribution with mean $\mu = 3$ and variance $1.4$, and use the samples to fit a Gaussian distribution to them. Think of how to use the ML estimate to do this. Also, understand the code.

Practice Problem: Let $f_{X|\Theta}(x|\theta) = \theta e^{-\theta x}$ for $x \geq 0$ and $0$ otherwise. Let $f_{\Theta}(\theta) = \alpha e^{-\alpha\theta}$ for $\theta \geq 0$. Find the MMSE and ML estimates of $\theta$ given a single observation $X = x$.
Practice Problem: Suppose that the random variable $X$ is uniformly distributed between $-3$ and $3$, denoted $\text{Unif}[-3, 3]$. The data $Y$ is a noisy measurement of $X$ given by
$$Y = X + N,$$
where $N$ is independent of $X$, and is Laplacian distributed with PDF
$$p_N(n) = \frac{1}{2}e^{-|n|}.$$

Find the minimum mean-square error estimate of $X$ given $Y$. Also, solve $\max_{x \in [-3, 3]} f_{Y|X}(y|x)$. This is called Maximum a Posteriori (MAP) estimation.

Next, we will take a slight detour. So far, we have assumed that the observations $X_1, X_2, \ldots, X_n$ are i.i.d., and hence described them through a joint PDF which is a product. However, in general, the observations should be described through a general joint PDF. This is difficult! So, let us look at (possibly correlated) observations that are Gaussian. We can stack the $n$ rvs to get a vector (whose entries are correlated with each other). How do you describe the joint distribution? The answer is as follows (turn this problem into something that we know how to describe, i.e., turn it into one variable):

Definition: A random vector $(X_1, X_2, \ldots, X_n)$ is said to be jointly Gaussian if any linear combination
$$X = \sum_{i=1}^{n} a_i X_i$$
results in a Gaussian random variable, for any $a_i \in \mathbb{R}$, $i = 1, 2, \ldots, n$.

The above implies that each $X_i$ is a Gaussian rv.

Definition: We say that $X \in \mathbb{R}^n$ is jointly Gaussian with mean vector $\mu := (\mu_1, \mu_2, \ldots, \mu_n)$, where $\mu_i = \mathbb{E}X_i$, and covariance matrix $\Sigma := \mathbb{E}\left[(X - \mu)(X - \mu)^T\right]$ if the joint density is
$$f_X(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left\{-\frac{(x - \mu)^T\Sigma^{-1}(x - \mu)}{2}\right\}.$$

Generally, we use the shorthand notation $X \sim \mathcal{N}(\mu, \Sigma)$. In the above, $|\Sigma|$ denotes the determinant of the matrix. Note that $\Sigma \in \mathbb{R}^{n\times n}$ is a symmetric positive semidefinite matrix (why?).
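
A small Python sketch (the particular $\mu$ and $\Sigma$ below are arbitrary illustrative choices) showing how to sample such a vector and evaluate the density above using numpy/scipy.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(6)
mu = np.array([1.0, -1.0, 0.5])                 # assumed mean vector
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 0.1 * np.eye(3)               # symmetric positive definite covariance

samples = rng.multivariate_normal(mu, Sigma, size=5)
density_at_mu = multivariate_normal(mean=mu, cov=Sigma).pdf(mu)

# a linear combination of jointly Gaussian coordinates is again Gaussian
a = np.array([2.0, -1.0, 3.0])
lin_comb = samples @ a
print(samples.shape, density_at_mu, lin_comb[:3])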

Now, let us consider the following problems, one on detection and the other on estimation.

  • Detection: $X := (X_1, X_2, \ldots, X_n) \sim \mathcal{N}(0, \Sigma)$ under $H_0$ and $\mathcal{N}(\mu, \Sigma)$ under $H_1$. Find the test to distinguish between the two.
    Solution: The test is again the LLR, assuming $\pi_1 = 1/2$:
    $$x^T\Sigma^{-1}\mu - \frac{1}{2}\mu^T\Sigma^{-1}\mu \underset{\text{choose } H_0}{\overset{\text{choose } H_1}{\gtrless}} 0.$$

  • Estimation: Let $X := (X_1, X_2, \ldots, X_n) \sim \mathcal{N}(\Theta, \Sigma)$. Find an estimate of $\Theta$.
    Solution: The ML estimate is (MMSE is not possible here. Why?) the solution of
    $$\min_{v}\;(x - v)^T\Sigma^{-1}(x - v).$$

IMPORTANT NOTE: So far, we have looked at both estimation and detection problems. In these problems, we assumed some kind of distribution. However, in practice you do not know the distribution. One can come up with a test without knowing the distribution. Based on this, two terms are used in statistics, namely (a) parametric test or estimation: this refers to the case where you know (a parametric form of) the distribution; and (b) non-parametric test or estimation: this is where you do not assume any distribution. Hence the tests/estimators that we considered in these notes so far are called parametric tests/estimators.


Finally, some assignment problems on the above topics.

  1. Given $n$ samples $Y_1, Y_2, \ldots, Y_n$, suppose we use the estimator $\hat{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$. Show that $n = O\!\left(\frac{r^2}{\epsilon^2\delta}\right)$, where $r = \frac{\sqrt{\text{Var}[Y]}}{\mathbb{E}[Y]}$, is sufficient to ensure
    $$\Pr\left[|\hat{Y} - \mathbb{E}[Y]| \geq \epsilon\,\mathbb{E}[Y]\right] \leq \delta.$$

  2. Suppose a sample $X_1, \ldots, X_n$ is modelled by a Poisson distribution with a parameter denoted by $\lambda$, so that
    $$f_X(x; \lambda) = \frac{\lambda^x}{x!}e^{-\lambda}, \quad x = 0, 1, 2, \ldots$$
    for some $\lambda > 0$. Estimate the unknown parameter $\lambda$ using the MLE. Show that asymptotically (in $n$), the MSE of the estimate goes to zero.

  3. You have a coin that you think is biased. You flip it $7$ times and get the sequence HHHHHHT. What is the maximum likelihood estimate of the probability of getting heads?

  4. You think a die is rigged so that it will roll $4$ less often. You continually roll the die until you get a $4$, and you have to roll $10$ times before this happens. Is your suspicion correct at a confidence level of $\alpha = 0.05$?

  5. Two cars A and B are travelling independently in the same direction. The speed of car A is normally distributed with mean $100$ km/hr and standard deviation $5$ km/hr, while the speed of car B is normally distributed with mean $100$ km/hr and standard deviation $2$ km/hr. Car A was initially $3$ km ahead of car B. Find the probability that after $3$ hours, they are within a distance of $3$ km from each other.

  6. Let $X$ and $Y$ be two continuous random variables with joint PDF
    $$f_{X,Y}(x,y) = x + y, \quad 0 \leq x \leq 1 \text{ and } 0 \leq y \leq 1.$$
    Find the MMSE estimator of $Y$ given $X$. Also, find the MLE of $X$. Comment on which is better in terms of computation and MSE.

  7. Consider the linear regression model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ for $i = 1, 2, \ldots, 6$, where $\beta_0$ and $\beta_1$ are unknown parameters and the $\epsilon_i$'s are independent and identically distributed random variables having the $\mathcal{N}(0, 1)$ distribution. The data on $(x_i, y_i)$ are given in Table 1. Find the constants that minimize the squared error.

Table 1 (data for Problem 7): [table image not available].

  8. Given that the number of products produced in a factory during a week is a random variable with a mean of $50$ and a variance of $25$, what can be inferred about the probability that the weekly production falls within the range of $40$ to $60$ products?

  9. Let us consider a random sample $X_1, X_2, \ldots, X_n$, where $X_i = 0$ if a randomly selected person does not hold a racing bike, and $X_i = 1$ if a randomly selected person does hold a racing bike. Assuming that the $X_i$'s are independent Bernoulli random variables with unknown parameter $p$, find the maximum likelihood estimator of $p$, the proportion of people who hold a racing bike.

  10. Consider the following hypothesis testing problem:
    $$(X_1, X_2, \ldots, X_n) \sim \begin{cases}\mathcal{N}(0, \sigma^2 I) & \text{under } H_0\\ \mathcal{N}(0, S) & \text{under } H_1.\end{cases}$$
    Find the ML rule for the above.

  11. Not included in the exam (Number of collisions): If we throw $m$ balls into $n$ bins, the average number of collisions $Y_{m,n}$ is $\lambda_{m,n} = \binom{m}{2}\frac{1}{n}$. Use Chebyshev's inequality to show that
    $$\Pr\left[|Y_{m,n} - \lambda_{m,n}| \geq c\sqrt{\lambda_{m,n}}\right] \leq \frac{1}{c^2}.$$
    Next, suppose we choose $m = \sqrt{2n}$; then $\lambda_{m,n} \approx 1$. Use Chernoff bounds plus the union bound to bound the probability that no bin has more than $1$ ball.

  12. Let $X_1, X_2, \ldots, X_n$ ($n \geq 2$) be independent and identically distributed random variables with probability density function
    $$f_X(x) = \begin{cases}\frac{1}{x^2} & \text{if } x \geq 1\\ 0 & \text{otherwise.}\end{cases}$$
    Then which of the following have/has finite expectation?
    (i) $X_1$, (ii) $\frac{1}{X_2}$, (iii) $\sqrt{X_1}$, and (iv) $\min\{X_1, X_2, X_3, \ldots, X_n\}$.

  13. Assume that you observe one sample whose distribution is
    $$f_X(x) = \begin{cases}\frac{3x^2}{2} \text{ for } -1 \leq x \leq 1 & \text{under } H_1\\ \text{Unif}[-1, 1] & \text{under } H_0.\end{cases}$$
    Find the Neyman-Pearson rule to achieve a false alarm rate of $0.1$. Find the probability of detection ($P_d$). Plot a graph showing $P_d$ versus the probability of false alarm $P_f$.

  14. Let the observations be modelled as
    $$X_n = \alpha X_{n-1} + W_n,$$
    where $W_n \overset{\text{i.i.d.}}{\sim}\mathcal{N}(0, \sigma^2)$ and $\alpha > 0$ is a known constant. Find an MMSE estimate of $X_n$ given all the past observations $X_1, X_2, \ldots, X_{n-1}$. Find the MSE of your estimator.