B. N. Bharath
Assistant professor,
Department of Electrical, Electronics and Communication Engineering (EECE),
IIT Dharwad
There will be two evaluations: one quiz and one final exam. The quiz may also contain some assignments (given during the regular classes), such as writing code or solving problems.
Quiz: in the morning, during regular class hours. The syllabus will include everything that is covered until the quiz.
Final exam will be as per the academic calendar shared by the academic office.
A repeat exam or evaluation will not be encouraged. If someone misses the quiz for a genuine reason, then I may use the final exam marks to grade, with a possible penalty. The penalty will be decided based on the reason for the absence.
The answer scripts will be distributed in November in the PC. More details will be conveyed by the TAs.
Let us start with the question: what is data analysis? It refers to providing inference and insights using data. The classical approach involves modelling the data using some sort of a function or a probability distribution, and then using tools from probability to make inferences. One might wonder why we need data analysis at all. Imagine that you are an investment banker and you have to decide whether to invest in some stocks. One can use past experience, the current political situation, etc., to infer. A more systematic approach is to use the past data and provide inference using probabilistic tools in addition to one's rich experience. This course covers classical approaches to these problems, and provides some of the insights and tools necessary to solve a certain class of problems. Let us begin by looking at a stylized example involving a coin toss.
In particular, let us consider the following four example problems:
Problem 1: Given a coin, find its bias.
Problem 2: We have two coins, one with a known bias $p \neq 1/2$ and the other with bias $1/2$. Imagine that I am going to randomly pick one of them and give it to you. You are supposed to find whether I have given you the biased coin or the unbiased coin.
Problem 3 (more challenging): Same as Problem 2 with "bias $p$" replaced by "bias not equal to $1/2$", i.e., a biased coin (unknown bias) versus an unbiased coin.
Problem 4 (more challenging): Same as Problem 2 with "bias $p$" replaced by "bias is a rational number" and "bias $1/2$" replaced by "bias is an irrational number", i.e., rational bias versus irrational bias.
A typical sample path when you toss a coin $n$ times (or sample a Bernoulli rv with bias $p$, $n$ times) looks like $1, 0, 0, 1, 1, \ldots, 0$.
In the first problem, our end result will be the bias, which is any number between $0$ and $1$. On the other hand, Problem 2 asks us to find the right answer among two possibilities, i.e., biased and unbiased. Problem 1 is called an estimation problem, while the second (also the third and fourth) problem is called a hypothesis testing or classification or decision problem. First, let us begin by modeling the experiment involved in Problem 1 above. The first step is to identify the experiment involved, i.e., tossing a coin, whose sample space is $\{\text{heads}, \text{tails}\}$. We start by modelling each outcome of the coin by a random variable (rv), i.e., let the outcome of the $i$-th coin toss be $X_i$, $i = 1, 2, \ldots, n$, which is defined as
$$X_i = \begin{cases} 1 & \text{if the } i\text{-th toss results in heads}, \\ 0 & \text{otherwise}. \end{cases}$$
Note that each coin toss is independent of the others, i.e., the outcome of any coin toss will not impact the other outcomes. A mathematical way of saying this is that $X_1, X_2, \ldots, X_n$ are independent rvs, whose definition is as follows:
Definition (independent rvs): We say that $X_1, X_2, \ldots, X_n$ are independent rvs if the joint distribution factorizes, i.e., $\Pr\{X_{i_1} = x_{i_1}, \ldots, X_{i_k} = x_{i_k}\} = \prod_{j=1}^{k} \Pr\{X_{i_j} = x_{i_j}\}$ for any subset $\{i_1, \ldots, i_k\} \subseteq \{1, 2, \ldots, n\}$. If the distributions of all the $X_i$'s are the same, then we say that the rvs are identically distributed. If $X_1, \ldots, X_n$ are both independent and identically distributed (i.i.d. for short), we use the notation $X_i \overset{\text{i.i.d.}}{\sim} P$. For continuous rvs, it suffices to replace the PMF by the PDF.
Note: Sometimes we use $\perp$ to indicate independence. For example, $X \perp Y$ means that $X$ and $Y$ are independent rvs.
Thus, by our intuition, it makes sense to assume that $X_i \overset{\text{i.i.d.}}{\sim} \text{Bern}(p)$, where $p \in [0,1]$ is unknown! Now, our goal is to find this $p$. Again, guided by our intuition, a natural estimate is to count the number of heads and divide it by the total number of tosses $n$. Mathematically,
$$\hat{p}_n := \frac{1}{n} \sum_{i=1}^{n} X_i.$$
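As a quick sanity check, here is a minimal Python sketch of this estimator (the true bias $0.3$, the sample size, and the variable names are illustrative assumptions, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true, n = 0.3, 1000          # illustrative unknown bias and number of tosses

# Simulate n i.i.d. Bernoulli(p) coin tosses (1 = heads, 0 = tails).
tosses = rng.binomial(1, p_true, size=n)

# Relative-frequency estimate: number of heads divided by n.
p_hat = tosses.mean()
print(f"true bias = {p_true}, estimate = {p_hat:.3f}")
```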
Let us closely look at what we want our estimate to be.
Let us pause for a moment to see how to measure the "goodness" of an estimator (the word "estimator" is not formally defined yet; for now, $\hat{p}_n$ is our estimator :-)). There are several ways to measure "goodness". The following are some natural measures: the mean absolute error $\mathbb{E}|\hat{p}_n - p|$ and the mean squared error (MSE) $\mathbb{E}(\hat{p}_n - p)^2$.
One might be wondering: why not the 3rd power of the absolute error? For that matter, why not the $k$-th power of the absolute error? The thought process is correct. However, these higher powers lead to complicated expressions for the mean (whatever) error, and are not amenable to mathematical analysis. But the MSE is a nice quantity that can be analyzed! So, let us go ahead with the MSE. First, let us see if our estimator is unbiased. Towards this, consider
$$\mathbb{E}[\hat{p}_n] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[X_i] = \frac{1}{n} \cdot n p = p.$$
Hence our estimator is unbiased! Now, let us look at the MSE performance:
$$\mathbb{E}(\hat{p}_n - p)^2 = \mathrm{Var}(\hat{p}_n) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{p(1-p)}{n}.$$
This leads to the following result.
Theorem: The estimator $\hat{p}_n$ satisfies $\mathbb{E}(\hat{p}_n - p)^2 = \frac{p(1-p)}{n}$. In particular, in the asymptotic regime, the MSE behaves as $O(1/n)$ and vanishes as $n \to \infty$.
The following plot shows the MSE versus $n$ for a coin with bias $p$. Clearly, the MSE goes to zero as $n$ is increased. Further, the rate of decay is of the order $O(1/n)$.
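The plot is easy to reproduce. Below is a minimal simulation sketch (the bias $0.3$ and the number of Monte Carlo trials are illustrative choices); it overlays the empirical MSE with the theoretical $p(1-p)/n$ curve:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
p, trials = 0.3, 2000
ns = np.arange(10, 1001, 30)

# Empirical MSE of the relative-frequency estimator for each n.
mse = [np.mean((rng.binomial(n, p, size=trials) / n - p) ** 2) for n in ns]

plt.loglog(ns, mse, "o", label="empirical MSE")
plt.loglog(ns, p * (1 - p) / ns, label="p(1-p)/n")
plt.xlabel("n"); plt.ylabel("MSE"); plt.legend(); plt.show()
```

On the log-log scale, both curves fall with slope $-1$, which is the $O(1/n)$ rate.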
Note: Estimators with the above asymptotic property are called efficient estimators. The notation $O(\cdot)$ is Landau's Big-O notation. It is often used to suppress nuisance (sometimes not-so-nuisance) terms while analyzing something. Say you are interested in the behaviour of the MSE with respect to $n$ and not so much in the other terms; then we can use this notation. Mathematically, it means the following: we say that $f(n) = O(g(n))$ if $\limsup_{n \to \infty} \frac{|f(n)|}{g(n)} < \infty$.
Is there any other way of measuring the "goodness" of an estimator? To answer this, let us revisit the error, often called the empirical error, $|\hat{p}_n - p|$. As noticed earlier, this error is random, and we cannot bank on it directly. Note that our estimator should not result in a bad error, i.e., $|\hat{p}_n - p| > \epsilon$, where $\epsilon > 0$ is the tolerance, and depends on the application. Whenever the error exceeds the tolerance, we refer to it as the "Bad" event. Obviously, we cannot ensure that the Bad event will never happen, as $\hat{p}_n$ is random. Thus, a more sensible thing to ask for is to make the probability of the "Bad" event small, i.e.,
$$\Pr\{|\hat{p}_n - p| > \epsilon\} \le \delta$$
for some small $\delta > 0$. Note that $|\hat{p}_n - p|$ is non-negative. In general, it is very difficult to verify whether $\Pr\{|\hat{p}_n - p| > \epsilon\}$ is small. Instead, if we can somehow show that something larger than this probability is small, then our job will be done. Thus, we need an upper bound on $\Pr\{|\hat{p}_n - p| > \epsilon\}$; this is the essence of the next topic.
[Ignore the fancy title for now :-)] We noticed above that we need an upper bound on $\Pr\{X > \epsilon\}$, referred to as the tail probability (think why?). Here $X$ is a non-negative rv. Consider
$$\mathbb{E}[X] \ge \mathbb{E}\big[X\,\mathbb{1}\{X > \epsilon\}\big] \ge \epsilon\,\Pr\{X > \epsilon\}.$$
Rearranging the above yields the following result.
Theorem (Markov's inequality): Let $X \ge 0$ and $\epsilon > 0$. Then,
$$\Pr\{X > \epsilon\} \le \frac{\mathbb{E}[X]}{\epsilon}.$$
Now, let us try to apply the above result to our estimation problem. Note that the non-negative rv of interest is $|\hat{p}_n - p|$. Applying the Markov inequality, we get $\Pr\{|\hat{p}_n - p| > \epsilon\} \le \frac{\mathbb{E}|\hat{p}_n - p|}{\epsilon}$. It is hard to compute the term $\mathbb{E}|\hat{p}_n - p|$. However, in the above, we could compute $\mathbb{E}(\hat{p}_n - p)^2$ (with a square). However, Markov involves the term without the square. The idea is to see if we can convert it to the squared version. It is an easy exercise once you note that $\Pr\{|\hat{p}_n - p| > \epsilon\} = \Pr\{(\hat{p}_n - p)^2 > \epsilon^2\}$, and then apply the Markov inequality to get
$$\Pr\{|\hat{p}_n - p| > \epsilon\} \le \frac{\mathbb{E}(\hat{p}_n - p)^2}{\epsilon^2}.$$
The above, when applied to the rv $|X - \mathbb{E}[X]|$ for an arbitrary rv $X$, gets a special name:
Theorem (Chebyshev's Inequality): Let $X$ be any rv (not necessarily non-negative) with finite variance. Then,
$$\Pr\{|X - \mathbb{E}[X]| > \epsilon\} \le \frac{\mathrm{Var}(X)}{\epsilon^2}.$$
Note that our estimator is unbiased. Using this fact and applying Chebyshev's inequality to our estimator results in
$$\Pr\{|\hat{p}_n - p| > \epsilon\} \le \frac{p(1-p)}{n\epsilon^2} \le \frac{1}{4n\epsilon^2},$$
where the last inequality uses $p(1-p) \le 1/4$.
There are other ways of stating this result. For example, if we want $\Pr\{|\hat{p}_n - p| > \epsilon\} \le \delta$ for some $\delta > 0$, then we can ensure this by choosing $n$ such that $\frac{1}{4n\epsilon^2} \le \delta$. In other words, if $n \ge \frac{1}{4\epsilon^2\delta}$, then we can ensure $\Pr\{|\hat{p}_n - p| > \epsilon\} \le \delta$. A fancy way of saying this is as follows:
With a probability of at least $1 - \delta$, the error satisfies $|\hat{p}_n - p| \le \epsilon$, provided the number of samples $n \ge \left\lceil \frac{1}{4\epsilon^2\delta} \right\rceil$. Here, the symbol $\lceil \cdot \rceil$ denotes the ceiling operation. This style of result is often called a Probably Approximately Correct (PAC) answer!
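As a worked illustration of this PAC statement, the sketch below (the tolerance $\epsilon = 0.05$, confidence $\delta = 0.1$, and bias $0.3$ are illustrative assumptions) computes the Chebyshev-based sample size and checks it empirically:

```python
import math
import numpy as np

eps, delta = 0.05, 0.1                      # illustrative tolerance and confidence
n = math.ceil(1 / (4 * eps**2 * delta))     # Chebyshev-based sample size
print("n =", n)                             # 1000 for these values

# Empirical check: fraction of runs with |p_hat - p| > eps should be <= delta.
rng = np.random.default_rng(1)
p = 0.3                                     # illustrative true bias
p_hats = rng.binomial(n, p, size=5000) / n
print("empirical bad-event prob:", np.mean(np.abs(p_hats - p) > eps))
```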
This PAC-style guarantee goes by another name: an interval estimate. Instead of giving one estimate $\hat{p}_n$, we give a range of values around $\hat{p}_n$ in which $p$ should lie with reasonably high probability. Naturally, if one gives only $\hat{p}_n$ (a single value), it is called a point estimate, for obvious reasons. Essentially, the above result is saying that $p$ is within $\epsilon$ of $\hat{p}_n$ (approximately; approximation error of $\epsilon$) with a probability of at least $1-\delta$ (probably). Hence the name PAC. This style of result is very common in theoretical machine learning. In this course, we will give more definitive guarantees.
Side Remark: Oftentimes, people may ask what guarantee you can give for your estimator/algorithm/hypothesis test, etc. What they mean is: can you say that "some sort of average "error" is small, or the probability of error is small, or, in general, some performance measure is small (assuming small means good)". In this course, we will provide many such guarantees. The above PAC or MSE expression is one such guarantee.
Exercise: Suppose you are given $n$ i.i.d. samples from a uniform distribution $\mathcal{U}[0, \theta]$, where $\theta > 0$ is unknown. Propose an estimator of $\theta$ and study its bias and MSE.
Practice Problem: A symmetric die is thrown $n$ times. Find a lower bound on the probability that the number of sixes lies within a given range around $n/6$.
Practice Problem: (Prof. Stankova, UCB) Let for and otherwise. What bound does Chebyshev’s inequality give for the probability ? For what value of can we say ?
Practice Problem: (Prof. Stankova, UCB): Let for , and everywhere else. Give a bound using Chebyshev’s for .
Practice Problem (Comparing bound with actual): Find an upper bound on for some , where . Compare your actual value with the upper bound.
Practice Problem (finite higher moments may help): Let $X$ be a rv with finite higher moments. Use Chebyshev's inequality to bound $\Pr\{X > a\}$ for some $a$. Compare this with the bound obtained using the $k$-th moment of $X$ via Markov's inequality; this requires you to find the $k$-th moment of $X$. Compare your bounds for different values of $k$, and comment. Comment on what happens if we increase $k$ indefinitely.
Practice Problem (coding exercise): Simulate the exercise above. Plot the MSE performance of your estimator versus the number of samples. You can use any coding language (Python or MATLAB preferred).
As mentioned earlier, the second problem is a hypothesis testing or decision problem; we need to decide whether the coin has bias $p$ or bias $1/2$. The first question that we encounter is: on what basis should we decide? A natural metric here is the probability of making a mistake, i.e., saying "bias $p$" when the coin has bias $1/2$, and vice-versa. Let us write down this error event:
$$\mathcal{E} = \big(\{\text{say bias } p\} \cap \{\text{bias } 1/2\}\big) \cup \big(\{\text{say bias } 1/2\} \cap \{\text{bias } p\}\big).$$
To compute the probability of error, we first note that the two events in the union are disjoint. Thus,
$$P_e = \Pr\big(\{\text{say bias } p\} \cap \{\text{bias } 1/2\}\big) + \Pr\big(\{\text{say bias } 1/2\} \cap \{\text{bias } p\}\big).$$
Immediately, one may wonder: who "says" this, and what is the "decision rule"? We have to clarify these notions before we delve into the solution.
Decision rule: Intuitively, we observe and then make decisions. What do we observe here? Easy: we toss the coin $n$ times and observe the outcomes. Mathematically, we observe the Bernoulli random variables $X_i$, $i = 1, 2, \ldots, n$. Soon we realize that a rule should be a function of $X_1, \ldots, X_n$. Mathematically, a rule here is a function $\phi: \{0,1\}^n \to \{0,1\}$; in simple terms, it is denoted by $\phi$, and it takes values in $\{0, 1\}$. Here $\phi = 1$ means "say bias $p$" and $\phi = 0$ means "say bias $1/2$". Seems very complicated, isn't it? Not really! Let's move on. The event "Say $0$" means $\phi(X_1, \ldots, X_n) = 0$ for the decision rule $\phi$; easy? Similarly for "Say $1$". There is another event to take care of, i.e., "when the coin is unbiased" :-(. This refers to the event that the true coin has bias $1/2$. Let us denote this event by $H_0$. You should excuse me for using too many notations. Here the fancy $H$ refers to "hypothesis" and the subscript $0$ is as per my madness. If you don't like it, you can use some other symbol instead. Similarly, $H_1$ denotes "when the bias is $p$".
Are $H_0$ and $H_1$ random events? Are they deterministic? Many questions remain to be answered. If I choose the coin randomly from an urn, then these events are random, and we can associate probabilities with them. However, if these events are deterministic but unknown, then we cannot assign probabilities. We need to understand why these events can be deterministic or unknown. In fact, I can pick the coin randomly, and there is no way by which you can figure out how I have chosen the coin. For example, I can go into a secret room, pick the coin, and show it to you. Note that I may choose to pick the coin deterministically as well. There are so many variants. Here, let us keep things simple: assume that the coins are picked randomly. In particular, I am going to pick the coin with bias $1/2$ with probability $\pi_0$. Let $\pi_1 := 1 - \pi_0$. More precisely, $\Pr\{H_0\} = \pi_0$ and $\Pr\{H_1\} = \pi_1$.
With the above setup, we can write the probability of error as
$$P_e = \pi_0 \Pr\{\phi = 1 \mid H_0\} + \pi_1 \Pr\{\phi = 0 \mid H_1\}.$$
Now, the goal is to design a "decision rule" $\phi$ that minimizes the above error, i.e.,
$$\min_{\phi}\; P_e(\phi).$$
Seems complicated, and it is indeed not simple! This is because the minimization is over functions. To solve this, let us understand what a decision rule could possibly do. It observes the samples $X_1, \ldots, X_n$ and decides in favor of zero or one. The input to $\phi$ is one of the $2^n$ possible binary sequences. It decides some sequences in favor of one and the rest in favor of zero. Thus, $\phi$ partitions the space of sequences into two regions. Let us collect all the sequences that are mapped in favor of $1$ in one set, say $\mathcal{A}_1$, which is defined as
$$\mathcal{A}_1 := \{(x_1, \ldots, x_n) \in \{0,1\}^n : \phi(x_1, \ldots, x_n) = 1\}.$$
The region corresponding to $0$ is denoted by $\mathcal{A}_0 = \mathcal{A}_1^c$. To proceed further (also, I am a bit lazy), we need the following notation:
Notation: We use $X^n := (X_1, X_2, \ldots, X_n)$ and $x^n := (x_1, x_2, \ldots, x_n)$. Imagine me writing (I mean LaTeXing) all these arrays every time!
Using the above, a part of the probability of error can be written as
$$\Pr\{\phi = 1 \mid H_0\} = \Pr\{X^n \in \mathcal{A}_1 \mid H_0\} = \sum_{x^n \in \mathcal{A}_1} \Pr\{X^n = x^n \mid H_0\}.$$
The above follows from the fact that the probability of a rv (a random vector in this case) falling in a region is the sum of the PMF over that region (an integral in the continuous case). In the above, $\mathcal{A}_1$ corresponds to the region where you (i.e., $\phi$) make a decision in favor of $1$. The above argument applies to conditional probability as well (recall that a conditional PMF also satisfies all the axioms of probability). Further, we note that
$$P_e = \pi_0 \sum_{x^n \in \mathcal{A}_1} \Pr\{X^n = x^n \mid H_0\} + \pi_1 \sum_{x^n \in \mathcal{A}_0} \Pr\{X^n = x^n \mid H_1\}.$$
Let us look at the above more carefully. First, conditioning on "when $H_0$" means that it corresponds to an unbiased coin. Thus, $\Pr\{X^n = x^n \mid H_0\}$ is the probability of observing the $n$-tuple $x^n$ when you throw an unbiased coin $n$ times. By noting that the tosses are i.i.d., we have
$$\Pr\{X^n = x^n \mid H_0\} = \prod_{i=1}^n \Big(\frac{1}{2}\Big)^{x_i}\Big(\frac{1}{2}\Big)^{1-x_i} = \Big(\frac{1}{2}\Big)^n,$$
since $\sum_{i=1}^n x_i$ gives us the total number of ones in the sequence (or the total number of heads). Similarly,
$$\Pr\{X^n = x^n \mid H_1\} = p^{\sum_i x_i}(1-p)^{n - \sum_i x_i}.$$
Recall that
$$P_e = \pi_0 \sum_{x^n \in \mathcal{A}_1} \Pr\{X^n = x^n \mid H_0\} + \pi_1 \Big(1 - \sum_{x^n \in \mathcal{A}_1} \Pr\{X^n = x^n \mid H_1\}\Big) = \pi_1 + \sum_{x^n \in \mathcal{A}_1} \big(\pi_0 \Pr\{X^n = x^n \mid H_0\} - \pi_1 \Pr\{X^n = x^n \mid H_1\}\big).$$
The problem was to minimize the probability of error (i.e., the above expression) with respect to $\phi$. Note that the above expression depends on $\phi$ only through $\mathcal{A}_1$. Thus, we need to choose $\mathcal{A}_1$ so as to make the expression smaller. At first glance, this appears to be a hard problem. However, let us look at it more closely. First, as explained earlier, any choice of $\phi$ will only split the space of sequences into two disjoint sets. In a way, we have to split the space into two pieces such that the resulting probability of error is small. If we split the space (i.e., choose $\mathcal{A}_1$) such that the summand, i.e., $\pi_0 \Pr\{X^n = x^n \mid H_0\} - \pi_1 \Pr\{X^n = x^n \mid H_1\}$, is positive for some $x^n \in \mathcal{A}_1$, then it will only increase the probability of error. Therefore, if $\phi$ is optimal, then it should include in $\mathcal{A}_1$ only those $x^n$ for which the summand is negative. This leads to the following optimal rule:
$$\mathcal{A}_1^* = \big\{x^n : \pi_1 \Pr\{X^n = x^n \mid H_1\} > \pi_0 \Pr\{X^n = x^n \mid H_0\}\big\}.$$
Note that the event $\{X^n \in \mathcal{A}_1^*\}$ corresponds to choosing the coin with bias $p$. Moreover, no matter what rule you pick, all the probabilities above are fixed beforehand; the only thing that you can ensure is that the summand is negative. That is exactly what the above rule is doing. The above leads to the following result.
Theorem: For Problem 2, the optimal way of deciding whether the coin has bias $p$ or is unbiased is through the following log-likelihood ratio (LLR) rule: decide "bias $p$" if
$$\log \frac{\Pr\{X^n = x^n \mid H_1\}}{\Pr\{X^n = x^n \mid H_0\}} \ge \log \frac{\pi_0}{\pi_1},$$
and "unbiased" otherwise.
Now, let us see what this results in by substituting for the conditional probabilities. Assuming $p > 1/2$, a little algebra reduces the rule to: decide "bias $p$" if
$$\frac{1}{n}\sum_{i=1}^n x_i \;\ge\; \frac{\frac{1}{n}\log\frac{\pi_0}{\pi_1} + \log\frac{1/2}{1-p}}{\log\frac{p}{1-p}}.$$
The left hand side is the relative frequency of heads, and the right hand side is a specific threshold! When the priors are equal ($\pi_0 = \pi_1$), the threshold becomes $\frac{\log\frac{1/2}{1-p}}{\log\frac{p}{1-p}}$, a number almost at the midpoint of $1/2$ and $p$! Intuition works!
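Here is a compact Python sketch of this test (equal priors; the bias $p = 0.75$ and the sample size are illustrative assumptions):

```python
import numpy as np

def llr_coin_test(tosses, p, prior0=0.5):
    """Decide 1 ("bias p") or 0 ("fair") via the relative-frequency threshold."""
    n = len(tosses)
    # Threshold from the LLR rule for Bern(p) vs Bern(1/2), assuming p > 1/2.
    tau = (np.log(prior0 / (1 - prior0)) / n
           + np.log(0.5 / (1 - p))) / np.log(p / (1 - p))
    return int(np.mean(tosses) >= tau)

rng = np.random.default_rng(0)
p = 0.75                                   # illustrative bias
tosses = rng.binomial(1, p, size=100)      # coin is actually biased here
print("decision:", llr_coin_test(tosses, p))   # 1 means "bias p"
```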
Side remark: Why should we do all this if we knew the answer beforehand based on our intuition? I have two answers. First, it's fun; second, the above settles the issue. Nobody can question us by asking what the guarantee is that there are no scenarios where our intuition fails.
Example (Effectiveness of a new drug): The theory and methodology developed above play a significant role in clinical trials and medical research. Imagine that you are in a drug company attempting to find a drug for a disease. Naturally, one needs to figure out whether the treatment using the drug is effective or not. You can formulate this as a hypothesis testing problem, where the null hypothesis corresponds to no effect of the drug and the alternate hypothesis corresponds to the drug being very effective.
Test case: Let us say it is believed that the drug discovered by the company reduces blood pressure (BP) in obese patients. The drug company decides to collect BP data of obese patients for a number of days. After that, the drug is administered to all of them. Now, the goal is to test your hypothesis, i.e., effective versus ineffective, given the BP measurements of all the patients before and after the medication. How do we solve this problem? (Details will be filled in later.)
Exercise: Consider two coins with biases $p_0$ and $p_1$ ($p_1 > p_0$) in Problem 2, and repeat the whole exercise to show that the corresponding LLR rule is optimal with respect to the probability of error.
Find the corresponding threshold.
The above analysis does not tell us how much error we incur by using the above LLR rule. We need to investigate this next. Towards this, let us start with the probability of error. An error occurs if the LLR rule says the bias is $p$ when the true bias is $1/2$, and vice-versa. This leads to
$$P_e = \pi_0 \Pr\{\phi = 1 \mid H_0\} + \pi_1 \Pr\{\phi = 0 \mid H_1\}.$$
Let us use the symbol $P_{e,0}$ to denote the first conditional probability above; a similar symbol $P_{e,1}$ will be used for the second term, with conditioning on bias $p$. Note that
$$P_{e,0} = \Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge \tau \,\Big|\, H_0\Big\},$$
where $\tau$ denotes the threshold of the LLR rule.
Similarly,
$$P_{e,1} = \Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i < \tau \,\Big|\, H_1\Big\}.$$
Observing the above two expressions, we realize that we need to evaluate a quantity of the form $\Pr\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\}$, where the $X_i$'s are i.i.d. rvs. As in the case of Markov's inequality, a closed-form expression for such quantities seldom exists, and hence we need to go for an upper bound. It turns out that quantities of this sort appear frequently in statistics, and hence let us try to give a general bound.
Let $X_1, \ldots, X_n$ be i.i.d. rvs with distribution $P$, often written $X_i \overset{\text{i.i.d.}}{\sim} P$. Note that each $X_i$ need not be non-negative. Then, for any $\lambda > 0$,
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\Big\} = \Pr\Big\{e^{\lambda \sum_{i=1}^n X_i} \ge e^{\lambda n a}\Big\} \le e^{-\lambda n a}\big(\mathbb{E}[e^{\lambda X_1}]\big)^n,$$
where the inequality follows from Markov's inequality and independence.
The above bound is true for any $\lambda > 0$. Hence, taking the infimum over all $\lambda > 0$ results in the following result.
Theorem (Chernoff bound): Let $X_1, \ldots, X_n$ be i.i.d. rvs with distribution $P$. Then,
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\Big\} \le \inf_{\lambda > 0} \exp\Big(-n\big(\lambda a - \log \mathbb{E}[e^{\lambda X_1}]\big)\Big).$$
Practice Problem: Find the Chernoff bound on $\Pr\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\}$ when $X_1, \ldots, X_n$ are i.i.d. rvs having an exponential distribution with rate $\mu$.
Let us apply the above to Bernoulli random variables: $X_i \overset{\text{i.i.d.}}{\sim} \text{Bern}(p)$ for some $p \in (0,1)$. In order to apply the above bound, we need to compute $\mathbb{E}[e^{\lambda X_1}]$, which is simplified as follows:
$$\mathbb{E}[e^{\lambda X_1}] = (1-p) + p\,e^{\lambda}.$$
Now, it remains to compute the infimum over $\lambda > 0$. Let us have some fun by using some calculus; this is when you should realize that the various subjects you have learnt, especially mathematics, come in handy. Note that
$$(1-p) + p\,e^{\lambda} = 1 + p\,(e^{\lambda} - 1).$$
Now, use your calculus knowledge to prove the following.
Exercise: Show that $1 + x \le e^x$ for all $x \in \mathbb{R}$.
Using the result in the above exercise with $x = p(e^{\lambda} - 1)$, we get
$$\mathbb{E}[e^{\lambda X_1}] \le \exp\big(p(e^{\lambda} - 1)\big).$$
Thus,
$$\big(\mathbb{E}[e^{\lambda X_1}]\big)^n \le \exp\big(np(e^{\lambda} - 1)\big).$$
Substituting this in the Chernoff bound, we get
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\Big\} \le \exp\Big(-n\big(\lambda a - p(e^{\lambda} - 1)\big)\Big).$$
Noting that the exponent above is a convex function of $\lambda$, a simple differentiation leads to the optimal $\lambda^* = \log\frac{a}{p}$. It is important to note that $\lambda^* > 0$ requires $a > p$. Substituting this above leads to
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\Big\} \le \exp\Big(-n\big(a\log\tfrac{a}{p} - a + p\big)\Big).$$
Denote $D(a, p) := a\log\frac{a}{p} - a + p$. Using this, the bound becomes
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\Big\} \le e^{-n D(a, p)}.$$
The above takes care of the first term in the probability of error expression. The second term looks something like $\Pr\{\frac{1}{n}\sum_{i=1}^n X_i < \tau \mid H_1\}$, a lower-tail probability. Intuitively, for a coin with bias $p$, the probability of the relative frequency being far away from $p$ is small. Thus, if $\tau$ is away from $p$, then we get a reasonable bound; this is the essence of the following exercise.
Exercise: Find a bound on $\Pr\{\frac{1}{n}\sum_{i=1}^n X_i \le a\}$ when the $X_i$'s are i.i.d. Bernoulli random variables with bias $p$ and $a < p$. Hint: Use the Chernoff bounding technique with $\lambda < 0$.
The answer to the above exercise will be of the form $e^{-n E_1(a, p)}$, where $E_1(a, p) > 0$; it is required that $a < p$. To declutter things, let me summarize the two bounds below:
$$\Pr\Big\{\frac{1}{n}\sum_i X_i \ge a\Big\} \le e^{-n D(a, p)} \ \ (a > p), \qquad \Pr\Big\{\frac{1}{n}\sum_i X_i \le a\Big\} \le e^{-n E_1(a, p)} \ \ (a < p).$$
Now we are ready to use the above result to prove a bound on the probability of error. It is a long way up in these notes where you can find what I am writing next; the first term in the probability of error expression satisfies
$$P_{e,0} = \Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge \tau \,\Big|\, H_0\Big\} \le e^{-n D(\tau, 1/2)},$$
since under $H_0$ the bias is $1/2$ and $\tau > 1/2$.
Thus the total probability of error is given by
$$P_e \le \pi_0\, e^{-n D(\tau, 1/2)} + \pi_1\, e^{-n E_1(\tau, p)}.$$
The above leads to the following result.
Theorem: For Problem 2, with the optimal LLR rule, the probability of error satisfies the following upper bound:
$$P_e \le \pi_0\, e^{-n D(\tau, 1/2)} + \pi_1\, e^{-n E_1(\tau, p)} \le e^{-n \min\{D(\tau, 1/2),\, E_1(\tau, p)\}}.$$
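The exponential decay promised by the theorem is visible in simulation. The sketch below (bias $p = 0.75$, equal priors, and the Monte Carlo size are illustrative assumptions) estimates the error probability of the threshold rule for increasing $n$; on a semi-log plot the curve is roughly a straight line, i.e., $P_e \approx e^{-nE}$ for some $E > 0$:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
p, trials = 0.75, 50000               # illustrative bias and Monte Carlo size
tau = np.log(0.5 / (1 - p)) / np.log(p / (1 - p))   # equal-prior threshold

ns, perr = np.arange(10, 151, 10), []
for n in ns:
    # Pick the coin with equal priors, then toss it n times.
    biased = rng.random(trials) < 0.5
    rel_freq = rng.binomial(n, np.where(biased, p, 0.5)) / n
    decisions = rel_freq >= tau       # decide "bias p" above the threshold
    perr.append(np.mean(decisions != biased))

plt.semilogy(ns, perr, "o-")
plt.xlabel("n"); plt.ylabel("probability of error"); plt.show()
```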
Practice Problem: Find the number of samples $n$ so that the probability of error is at most a prescribed $\delta > 0$.
Exercise: Find explicit expressions for $D(\tau, 1/2)$ and $E_1(\tau, p)$.
Exercise: Find an upper bound on the probability of error for the optimal rule that you found for the problem in the exercise above (biases $p_0$ versus $p_1$).
It is time for us to generalize Problem 2. That problem involved tossing one of the coins $n$ times, resulting in $n$ i.i.d. rvs. Unlike the estimation problem, the distribution of these rvs depends on the coin chosen: $\text{Bern}(1/2)$ or $\text{Bern}(p)$. This leads to the following generalization: based on $n$ i.i.d. samples drawn from either a distribution $P_0$ (under hypothesis $H_0$) or a distribution $P_1$ (under hypothesis $H_1$), decide which hypothesis is true.
The error event is (exactly analogous to the coin-toss case):
To compute the probability of error, we first note that the two events in the union are disjoint. Thus,
$$P_e = \Pr\big(\{\text{Say } H_1\} \cap H_0\big) + \Pr\big(\{\text{Say } H_0\} \cap H_1\big).$$
Now, using the definitions of "Say $H_0$" and "Say $H_1$" in terms of the decision rule $\phi$, we get the following:
$$P_e = \pi_0 \Pr\{\phi = 1 \mid H_0\} + \pi_1 \Pr\{\phi = 0 \mid H_1\}.$$
The rest of the proof is very similar to the coin tossing example, and hence is left as an exercise. The final result is the following.
Theorem: For the general problem, the optimal way of deciding which hypothesis is true is through the following log-likelihood ratio (LLR) rule.
Discrete case:
$$\log \frac{\Pr\{X^n = x^n \mid H_1\}}{\Pr\{X^n = x^n \mid H_0\}} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \log\frac{\pi_0}{\pi_1}.$$
Continuous case:
$$\log \frac{f_1(x^n)}{f_0(x^n)} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \log\frac{\pi_0}{\pi_1},$$
where $f_0$ and $f_1$ are the joint densities of the observations under $H_0$ and $H_1$, respectively.
Example: Say you get to observe $n$ i.i.d. samples either from an exponential distribution with rate $\lambda_0$ or from one with rate $\lambda_1 > \lambda_0$. Further, assume that $\pi_0 = \pi_1 = 1/2$. Find the optimal decision rule.
Solution: Since the rvs involved are continuous, we need to consider the second equation in the theorem. Calling the joint density corresponding to rate $\lambda_1$ $f_1$, we have
$$f_1(x^n) = \prod_{i=1}^n \lambda_1 e^{-\lambda_1 x_i} = \lambda_1^n\, e^{-\lambda_1 \sum_{i=1}^n x_i}.$$
Further,
$$f_0(x^n) = \lambda_0^n\, e^{-\lambda_0 \sum_{i=1}^n x_i}.$$
Taking the ratio of the above two and then taking the logarithm results in the following rule: decide in favor of rate $\lambda_1$ if
$$\frac{1}{n}\sum_{i=1}^n x_i \le \tau,$$
and in favor of rate $\lambda_0$ otherwise,
where $\tau := \frac{\log(\lambda_1/\lambda_0)}{\lambda_1 - \lambda_0}$.
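A quick numerical sketch of this test (the rates $\lambda_0 = 1$ and $\lambda_1 = 2$ are illustrative assumptions):

```python
import numpy as np

def exp_llr_test(x, lam0=1.0, lam1=2.0):
    """Decide H1 (rate lam1) vs H0 (rate lam0); equal priors, lam1 > lam0."""
    tau = np.log(lam1 / lam0) / (lam1 - lam0)
    return int(np.mean(x) <= tau)     # a larger rate means a smaller mean

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=50)   # samples truly from rate 2
print("decision:", exp_llr_test(x))           # 1 means "rate lam1"
```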
Practice Question: Try using different values of $\lambda_0$ and $\lambda_1$ (especially the ones given in the example above) and interpret the resulting threshold.
Practice Question: Find the probability of error for the above decision rule.
Example: Say you get to observe $n$ i.i.d. samples either from $\mathcal{N}(\mu_0, \sigma^2)$ or from $\mathcal{N}(\mu_1, \sigma^2)$ with $\mu_1 > \mu_0$. Further, assume that $\pi_0 = \pi_1 = 1/2$. Find the decision rule. Also, find the probability of error.
The solution to the above problem will be of the form
$$\frac{1}{n}\sum_{i=1}^n x_i \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \tau,$$
where $\tau$ is the appropriate threshold for the Gaussian case.
Practice Problem: Find the threshold for the problem.
Now, let us look at the probability of error. The error probability can be written as
$$P_e = \frac{1}{2}\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i > \tau \,\Big|\, H_0\Big\} + \frac{1}{2}\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \le \tau \,\Big|\, H_1\Big\}.$$
From here on, it is downhill, as we have already seen how to solve such problems in the context of Bernoulli rvs. The trick is the same. First, let us consider the Chernoff bound for the first term above, i.e., $\Pr\{\frac{1}{n}\sum_{i=1}^n X_i > \tau \mid H_0\}$.
The following is a way of making the document fancy.
Lemma (Moment Generating Function (MGF)): For $X \sim \mathcal{N}(\mu, \sigma^2)$,
$$\mathbb{E}[e^{\lambda X}] = e^{\lambda\mu + \lambda^2\sigma^2/2}.$$
Practice Problem: Let . Find the MGF of .
The name given to $\mathbb{E}[e^{\lambda X}]$, viewed as a function of $\lambda$, is the MGF. More on this later.
Practice problem: Prove the above Lemma.
Using this Lemma, the above can be simplified to
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i > \tau \,\Big|\, H_0\Big\} \le \exp\Big(-\frac{n(\tau - \mu_0)^2}{2\sigma^2}\Big),$$
valid when $\tau > \mu_0$.
Now, an easy exercise (literally following the same steps as above) with both sides multiplied by $-1$ results in (with a minor restriction that $\tau < \mu_1$; always satisfied in the asymptotics)
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \le \tau \,\Big|\, H_1\Big\} \le \exp\Big(-\frac{n(\mu_1 - \tau)^2}{2\sigma^2}\Big).$$
Practice Problem: Show the above inequality.
Order-wise, the probability of error becomes
$$P_e \le \exp\Big(-\frac{n(\mu_1 - \mu_0)^2}{8\sigma^2}\Big),$$
taking $\tau$ to be the midpoint $(\mu_0 + \mu_1)/2$ for equal priors.
So far, we have been looking at a scenario where the hypotheses are picked randomly with some known distribution. It is possible that the hypotheses may be picked randomly but we do not know the probabilities. One fix is to assume the worst case and design for it. Intuitively, the worst case is $\pi_0 = \pi_1 = 1/2$. An alternative to this is to solve a problem that balances the various errors. Towards explaining this, we will consider two kinds of errors, popularly known as the probability of detection and the probability of false alarm. More precisely,
$$P_D := \Pr\{\phi = 1 \mid H_1\}$$
and
$$P_{FA} := \Pr\{\phi = 1 \mid H_0\}.$$
Ideally, the objective should be to make $P_D$ as high as possible and $P_{FA}$ as low as possible. Unfortunately, this is not possible simultaneously; if you want to make $P_D$ high (close to $1$), then decide in favor of $H_1$ always! However, this results in a false alarm probability of $1$. On the other hand, if we want to make the false alarm as small as possible, a possibility is to decide in favor of $H_0$ all the time. In this case, the detection probability will be $0$! Thus, there exists a tradeoff between the two. One possible way to balance this tradeoff is to solve the following problem:
$$\max_{\phi}\; P_D \quad \text{subject to} \quad P_{FA} \le \alpha$$
for some $\alpha \in (0,1)$. The solution to the above is called the Neyman-Pearson test. In this course, we will not look at how to solve this problem, but we will state the solution.
Lemma (Neyman-Pearson (NP) Lemma): The solution to the above problem is the log-likelihood ratio test given below.
Discrete case:
$$\log \frac{P_1(x^n)}{P_0(x^n)} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \tau.$$
Continuous case:
$$\log \frac{f_1(x^n)}{f_0(x^n)} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \tau.$$
In the above, the threshold $\tau$ is chosen so that the constraint on the false alarm probability, namely $P_{FA} = \alpha$, is satisfied.
The proof of the above can be found in the famous "Elements of Information Theory" by Cover and Thomas, and in many other books on detection and estimation theory. Now, let us try to apply this solution to the following radar detection problem.
Example (Radar target detection): Consider a radar that sends a short pulse which gets reflected from a target (say, a plane), and the radar observes the reflection. In fact, we are assuming that the pulses are really short, so that the motion of the aircraft is negligible! That is, the time difference between the pulses is so short that the airplane would have hardly moved. This is theory, so we need not worry about the practicality :-) If you did not follow the above, don't worry, ignore it and move on (then why did I write this? hmm...)!
Assume that it sends $n$ pulses one after the other. The received samples can be modelled as follows:
$$Y_i = \begin{cases} \mu + W_i, & \text{target present}, \\ W_i, & \text{target not present}, \end{cases} \qquad i = 1, \ldots, n,$$
where $W_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, and $\mu$ is a fixed number which represents the distance from the radar to the target. It is not so difficult to realize that we cannot assign probabilities to the two hypotheses, viz., target present and target not present. Naturally, NP is the way to go. It is easy to see that under the "target present" hypothesis, the distribution of $Y_i$ is $\mathcal{N}(\mu, \sigma^2)$, while under "target not present", the distribution of $Y_i$ becomes $\mathcal{N}(0, \sigma^2)$: pure noise. Noting that the $Y_i$'s are i.i.d., and applying the NP Lemma, we get the rule: decide "target present" if
$$\frac{1}{n}\sum_{i=1}^n y_i \ge \tau.$$
Note the change in the inequality. Now, we need to find the threshold $\tau$ such that the false alarm constraint is satisfied. Let us first compute the false alarm probability:
$$P_{FA} = \Pr\Big\{\frac{1}{n}\sum_{i=1}^n Y_i \ge \tau \,\Big|\, \text{target not present}\Big\} = \Pr\Big\{\frac{1}{n}\sum_{i=1}^n W_i \ge \tau\Big\}.$$
In general, one can use the Chernoff technique to bound the above. In particular,
$$P_{FA} \le \exp\Big(-\frac{n\tau^2}{2\sigma^2}\Big),$$
using the Gaussian MGF as before (valid for $\tau > 0$).
Now, we need $P_{FA} \le \alpha$, which is satisfied if $\exp\big(-\frac{n\tau^2}{2\sigma^2}\big) \le \alpha$. This can be achieved by choosing $\tau \ge \sqrt{\frac{2\sigma^2}{n}\log\frac{1}{\alpha}}$. There are several approaches to the above problem. An alternate approach is to use the fact that the sum of independent Gaussians is Gaussian. We will first show that this is indeed true.
Sum of Gaussians is Gaussian: Let $X$ and $Y$ be two independent Gaussian random variables with zero means and variance $\sigma^2$, and let $Z := X + Y$. Let us first write the CDF of $Z$ using the total probability law for continuous rvs:
$$F_Z(z) = \Pr\{X + Y \le z\} = \int_{-\infty}^{\infty} \Pr\{X + Y \le z \mid Y = y\}\, f_Y(y)\, dy.$$
A term in the integrand can be written as
$$\Pr\{X + Y \le z \mid Y = y\} = \Pr\{X \le z - y \mid Y = y\} = \Pr\{X \le z - y\} = F_X(z - y),$$
where the second equality follows from the fact that $X$ and $Y$ are independent. Substituting this, we get
$$F_Z(z) = \int_{-\infty}^{\infty} F_X(z - y)\, f_Y(y)\, dy.$$
Now, taking the derivative with respect to $z$, we get
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\, dy.$$
Substituting the Gaussian PDFs and performing a long, laborious piece of algebra, we get the desired result, which is left as an exercise.
Practice Problem: Show that the above integral results in a Gaussian density, i.e., $Z \sim \mathcal{N}(0, 2\sigma^2)$.
The second method is to use the moment method, i.e., the MGF. This will not be dealt with here.
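Returning to the radar example: since $\frac{1}{n}\sum_i W_i \sim \mathcal{N}(0, \sigma^2/n)$, the threshold can also be set exactly using the Gaussian tail function instead of the Chernoff bound. A minimal sketch (the noise level, $n$, and $\alpha$ are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

def np_radar_threshold(n, sigma, alpha):
    """Exact NP threshold: P(N(0, sigma^2/n) > tau) = alpha."""
    return norm.isf(alpha, loc=0.0, scale=sigma / np.sqrt(n))

n, sigma, alpha = 100, 1.0, 0.01      # illustrative values
tau = np_radar_threshold(n, sigma, alpha)
print("tau =", tau)

# Monte Carlo check of the false-alarm rate under "target not present".
rng = np.random.default_rng(0)
noise_means = rng.normal(0, sigma, size=(20000, n)).mean(axis=1)
print("empirical P_FA:", np.mean(noise_means > tau))
```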
Exercise: Solve the following problem taking the NP approach:
Let us revisit detection problems a little later. Now, we will look at the estimation problem, and some "interesting" things that we can do there.
You may have noted that the optimal decision rules so far have turned out to be intuitive. Then the question is: why do we need the theory!?
As an example, consider the following hypothesis testing problem:
$$H_0: X_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_0^2) \quad \text{versus} \quad H_1: X_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_1^2).$$
What are the mean and variance under the two hypotheses? The mean is zero for both, while the variance is $\sigma_1^2$ under the alternate and $\sigma_0^2$ under the null hypothesis. Intuitively, the test should depend on the variance. But what statistic do we have to use? Maybe the empirical variance! It works to some extent, but it may not work as well as the LLR-based test, since the latter is optimal.
First, let us try to generalize Problem 1. Note that the solution involved tossing the coin $n$ times, resulting in $n$ i.i.d. rvs with a $\text{Bern}(p)$ distribution; the task is to find the parameter $p$. Similarly, in the exercise above, samples from a uniform distribution $\mathcal{U}[0, \theta]$ were given, and the task was to find the unknown parameter $\theta$. This leads to the following generalization of the estimation problem: given $n$ i.i.d. samples $X_1, \ldots, X_n$ from a distribution that depends on an unknown parameter $\theta$, estimate $\theta$.
How do we solve this general problem? As in the example above, first we need to fix a metric with respect to which we find our estimate. Of course, we will start with the MSE metric. In particular, given an estimator $\hat\theta(X_1, \ldots, X_n)$ (in general, this can be vector-valued) which depends on the observations, the MSE metric is
$$\mathrm{MSE}(\hat\theta) := \mathbb{E}\big[(\theta - \hat\theta(X_1, \ldots, X_n))^2\big],$$
where the expectation is with respect to the joint distribution of $(X_1, \ldots, X_n, \theta)$. One wishes to solve the following problem:
$$\min_{\hat\theta}\; \mathbb{E}\big[(\theta - \hat\theta(X_1, \ldots, X_n))^2\big].$$
It is important to note that $\theta$ has to be random for the above to make sense; in this case, the expectation above is also with respect to $\theta$. It is possible that $\theta$ is not random. For example, in Problem 1 above, the bias $p$ is a deterministic but unknown quantity. The approach that we are going to describe, with the MSE as the metric, does not work when $\theta$ is not random. More on this later.
Simple things first: what if you don't get to observe anything, but you have to make an estimate of $\theta$, where $\theta$ is drawn according to a known distribution? In this case, we would like to solve the following problem:
$$\min_{a}\; \mathbb{E}(\theta - a)^2.$$
Note that the estimate $a$ does not depend on the number of samples, as the estimator does not have access to any samples! First, let us expand:
$$\mathbb{E}(\theta - a)^2 = \mathbb{E}[\theta^2] - 2a\,\mathbb{E}[\theta] + a^2.$$
Minimizing the above with respect to $a$, we get $a^* = \mathbb{E}[\theta]$. This is intuitively very appealing, as one would bet on the mean when nothing else is known!
What if we are given samples $X_1, \ldots, X_n$ that can potentially depend on $\theta$? How do we solve such problems? Before doing this, let us recall what we want out of an estimator: it should (ideally) be unbiased, and it should have a small MSE.
I urge the reader to compare this with the list that we had when we studied Problem 1. It looks very similar, isn't it? Now, let us make some of these notions more formal.
Definition (estimator): Given observations $X_1, X_2, \ldots, X_n$ taking values in a set $\mathcal{X}$, an estimator is a map $\hat\theta: \mathcal{X}^n \to \Theta$, where $\Theta$ is the set of possible parameter values.
Definition (unbiased estimator): We say that an estimator $\hat\theta$ is an unbiased estimator if $\mathbb{E}[\hat\theta \mid \theta] = \theta$, the true value.
Note that there is a conditioning with respect to $\theta$. This is because $\theta$ itself is sampled from a distribution. A straightforward way of coming up with an estimator is by taking a linear combination of the observations, i.e.,
$$\hat\theta = \sum_{i=1}^n a_i X_i.$$
The above estimator is called a linear estimator, for obvious reasons. Now, the goal is to solve the following problem:
$$\min_{a_1, \ldots, a_n}\; \mathbb{E}\Big\|\theta - \sum_{i=1}^n a_i X_i\Big\|^2.$$
Note that we have used the norm above, which takes care of vector-valued estimators. However, to begin with, let us assume that the estimator is scalar, so that the norm above is an absolute value. The first thing to note is that the above objective (as a function of each $a_i$) is convex. Thus, differentiating with respect to each $a_j$ and equating to zero results in
$$\mathbb{E}\Big[\Big(\theta - \sum_{i=1}^n a_i X_i\Big) X_j\Big] = 0, \quad j = 1, \ldots, n.$$
Using the linearity of expectation, the above can be written as
$$\sum_{i=1}^n a_i\,\mathbb{E}[X_i X_j] = \mathbb{E}[\theta X_j], \quad j = 1, \ldots, n.$$
Matrix algebra comes in handy while dealing with the above. In particular, defining (i) a matrix $\mathbf{R}$ whose $(i,j)$-th entry is $\mathbb{E}[X_i X_j]$, (ii) a vector $\mathbf{a}$ whose $i$-th entry is $a_i$, and (iii) a vector $\mathbf{r}$ whose $j$-th entry is $\mathbb{E}[\theta X_j]$, leads to the following:
$$\mathbf{R}\mathbf{a} = \mathbf{r}.$$
The above leads to the following solution, also known as the Linear Minimum Mean Squared Error (LMMSE) estimator:
$$\mathbf{a}^* = \mathbf{R}^{-1}\mathbf{r}, \qquad \hat\theta_{\text{LMMSE}} = \sum_{i=1}^n a_i^* X_i.$$
If the inverse of the matrix $\mathbf{R}$ does not exist, then one can use what is called the pseudo-inverse. For now, let us not worry about such technicalities, but stick to the above solution.
It is possible that the above is not the best solution overall. However, it is indeed the best among the simplest (linear) solutions. Note that the above is not, in general, an unbiased estimator.
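A small numerical sketch of the LMMSE recipe (the model $X_i = \theta + W_i$ with Gaussian $\theta$ and noise, and all the numbers, are illustrative assumptions; the covariances are written down directly from the model):

```python
import numpy as np

rng = np.random.default_rng(0)
n, var_theta, var_w = 5, 2.0, 1.0     # illustrative model parameters

# Model: X_i = theta + W_i, theta ~ N(0, var_theta), W_i ~ N(0, var_w) i.i.d.
# Then E[X_i X_j] = var_theta + var_w * (i == j) and E[theta X_j] = var_theta.
R = var_theta * np.ones((n, n)) + var_w * np.eye(n)
r = var_theta * np.ones(n)
a = np.linalg.solve(R, r)             # LMMSE weights a = R^{-1} r
print("weights:", a)                  # equal weights var_theta/(n*var_theta + var_w)

# Apply the estimator to one simulated draw.
theta = rng.normal(0, np.sqrt(var_theta))
x = theta + rng.normal(0, np.sqrt(var_w), size=n)
print("theta =", theta, " LMMSE estimate =", a @ x)
```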
Geometric Interpretation: In order to appreciate and understand the LMMSE better, let us consider a problem where the randomness is stripped off, i.e., the observations $x_1, \ldots, x_n$ and the target $y$ are deterministic vectors lying in $\mathbb{R}^3$.
(Figure: the geometry of Least Squares (LS).)
Suppose we have observations $x_1, \ldots, x_n$ lying in $\mathbb{R}^3$ (for concreteness, assume that their span is a plane), as shown in the figure above (labelled as the span of the $x_i$'s). We want an estimate in the plane that is closest to $y$. For simplicity, let the estimate be a linear combination of these observations, i.e., $\hat y = \sum_{i=1}^n a_i x_i$, which hence lies in the plane. See how linear algebra is useful here! Thus, we need to adjust the weights $a_i$ such that the distance between the linear combination and $y$ is small, i.e., we need to solve the following problem:
$$\min_{a_1, \ldots, a_n}\; \Big\|y - \sum_{i=1}^n a_i x_i\Big\|^2.$$
Intuitively, the optimal estimate would be the vector obtained by dropping a perpendicular from $y$ onto the plane, i.e., the error is orthogonal to every observation:
$$\Big(y - \sum_{i=1}^n a_i x_i\Big) \perp x_j, \quad j = 1, \ldots, n.$$
In other words,
$$\Big\langle y - \sum_{i=1}^n a_i x_i,\; x_j \Big\rangle = 0, \quad j = 1, \ldots, n.$$
Solving the above leads to the following:
$$\sum_{i=1}^n a_i \langle x_i, x_j \rangle = \langle y, x_j \rangle, \quad j = 1, \ldots, n.$$
Using $\mathbf{a} := (a_1, \ldots, a_n)^T$, letting the $(i,j)$-th entry of a matrix $\mathbf{G}$ be $\langle x_i, x_j \rangle$, and letting the $j$-th entry of a vector $\mathbf{g}$ be $\langle y, x_j \rangle$, the above can be written compactly.
Note that the matrix $\mathbf{G}$ is a symmetric matrix. The above leads to $\mathbf{G}\mathbf{a} = \mathbf{g}$, which gives $\mathbf{a}^* = \mathbf{G}^{-1}\mathbf{g}$; a solution similar to the LMMSE solution. This solution is called the Least Squares (LS) solution, and the corresponding problem is called an LS problem.
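A minimal numerical sketch of the LS solution (the vectors are illustrative); the normal equations are solved directly, and `np.linalg.lstsq`, which falls back on the pseudo-inverse in the rank-deficient case, gives the same answer:

```python
import numpy as np

# Illustrative data: two observation vectors in R^3 (columns of X) and a target y.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

# Normal equations: G a = g with G = X^T X (Gram matrix), g = X^T y.
G, g = X.T @ X, X.T @ y
a = np.linalg.solve(G, g)
print("weights:", a)

# Same solution via lstsq.
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq weights:", a_lstsq)
```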
In the LMMSE setting, the difficulty is that $\theta$ and the $X_i$'s are random, and hence there is no immediate meaning for "perpendicular". An obvious approach is to bring in the structure of a vector space with an inner product, in which case the notion of perpendicularity makes sense. The following is an aside, which is not a part of this course; however, interested readers can go through it.
We use the following approach: view rvs as vectors, with the inner product between two rvs $U$ and $V$ defined as $\langle U, V \rangle := \mathbb{E}[UV]$, and the norm $\|U\| := \sqrt{\mathbb{E}[U^2]}$.
Going by analogy with the LS problem, we can guess the solution to be
$$\mathbf{a}^* = \mathbf{R}^{-1}\mathbf{r},$$
where the $(i,j)$-th entry of $\mathbf{R}$ is $\langle X_i, X_j \rangle = \mathbb{E}[X_i X_j]$ and the $j$-th entry of $\mathbf{r}$ is $\langle \theta, X_j \rangle = \mathbb{E}[\theta X_j]$. Now, the final solution is similar to that of the geometric problem above.
Example (LMMSE): Let $X_i = \theta + W_i$, where $\theta$ is sampled from some distribution independently of the zero-mean noise rvs $W_i$. In this case, the matrix $\mathbf{R}$ has $\mathbb{E}[\theta^2]$ in the off-diagonal entries and $\mathbb{E}[\theta^2] + \mathbb{E}[W_i^2]$ on the diagonal (why?). This leads to an estimator that is biased.
How do we construct an unbiased estimator along the lines of the LMMSE? One way is to modify the above estimator to the following affine form:
$$\hat\theta = \sum_{i=1}^n a_i X_i + b.$$
Now the optimization problem becomes
$$\min_{a_1, \ldots, a_n,\, b}\; \mathbb{E}\Big(\theta - \sum_{i=1}^n a_i X_i - b\Big)^2.$$
A similar approach as above can be followed to solve the problem.
Exercise: Solve the problem above, and find the optimal $a_i$'s and $b$.
Next, we will look at the general solution to the Minimum MSE (MMSE) problem above, i.e.,
$$\min_{\hat\theta(\cdot)}\; \mathbb{E}\big(\theta - \hat\theta(X_1, \ldots, X_n)\big)^2,$$
where the minimization is over all functions of the observations.
Assume that $g(X_1, \ldots, X_n)$ is any estimator, and consider the following estimator, which depends on the conditional mean:
$$\hat\theta_{\text{MMSE}} := \mathbb{E}[\theta \mid X_1, \ldots, X_n].$$
Now, we show that the above estimator solves the MMSE problem.
Theorem: The conditional mean, i.e., $\mathbb{E}[\theta \mid X_1, \ldots, X_n]$, is the optimal solution to the MMSE problem.
Proof: Writing $X^n := (X_1, \ldots, X_n)$, consider the following for any estimator $g(X^n)$:
$$\mathbb{E}\big(\theta - g(X^n)\big)^2 = \mathbb{E}\big(\theta - \mathbb{E}[\theta \mid X^n] + \mathbb{E}[\theta \mid X^n] - g(X^n)\big)^2 = \mathbb{E}\big(\theta - \mathbb{E}[\theta \mid X^n]\big)^2 + \mathbb{E}\big(\mathbb{E}[\theta \mid X^n] - g(X^n)\big)^2 \ge \mathbb{E}\big(\theta - \mathbb{E}[\theta \mid X^n]\big)^2,$$
where the cross term vanishes by the tower property of conditional expectation.
Hence $\mathbb{E}[\theta \mid X^n]$ is indeed the optimal estimator.
Now, it remains to solve the following problem:
$$\min_{a}\; \mathbb{E}\big[(\theta - a)^2 \mid X^n = x^n\big].$$
Theorem: The solution to the above problem is given by $a^* = \mathbb{E}[\theta \mid X^n = x^n]$.
Proof: First, let us expand the conditional expectation:
$$\mathbb{E}\big[(\theta - a)^2 \mid X^n = x^n\big] = \mathbb{E}[\theta^2 \mid X^n = x^n] - 2a\,\mathbb{E}[\theta \mid X^n = x^n] + a^2.$$
Now, "differentiating with respect to $a$": what do you mean by this? For now, let us be a little sloppy and do the usual differentiation; equating the derivative to zero proves the theorem.
Based on our intuition from the geometry and the LS problem, we know that the error is orthogonal to the estimate. In our new language of inner products and norms, we expect the following, which is indeed true!
Lemma: The error is uncorrelated with the estimate, i.e.,
$$\mathbb{E}\big[\big(\theta - \mathbb{E}[\theta \mid X^n]\big)\,\mathbb{E}[\theta \mid X^n]\big] = 0.$$
Proof: Follows by using conditional expectation. Easy exercise!
Example: Consider the following density
Find the MMSE estimator of given the observation .
Example: Consider $Y = X + W$, where $X \sim \mathcal{N}(0, \sigma_X^2)$ and $W \sim \mathcal{N}(0, \sigma_W^2)$ are independent Gaussians. We get to observe $Y$. Compute the MMSE estimate of $X$ given $Y$.
Solution: The MMSE estimate is $\mathbb{E}[X \mid Y]$. This requires us to compute the conditional density $f_{X|Y}(x \mid y)$, which is given by
$$f_{X|Y}(x \mid y) = \frac{f_{Y|X}(y \mid x)\, f_X(x)}{f_Y(y)}.$$
In the above, $f_{Y|X}(y \mid x)$ is the density of $\mathcal{N}(x, \sigma_W^2)$ evaluated at $y$, since conditioned on $X = x$, we have $Y = x + W$. Using this, we get
$$f_{X|Y}(x \mid y) = \frac{1}{f_Y(y)} \cdot \frac{1}{\sqrt{2\pi\sigma_W^2}}\, e^{-\frac{(y - x)^2}{2\sigma_W^2}} \cdot \frac{1}{\sqrt{2\pi\sigma_X^2}}\, e^{-\frac{x^2}{2\sigma_X^2}},$$
where we use the fact that the density of $Y$ is again Gaussian (why?), with $Y \sim \mathcal{N}(0, \sigma_X^2 + \sigma_W^2)$. Completing the square shows that the distribution of $X$ conditioned on $Y = y$ is again Gaussian, with mean $\frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\, y$ and variance $\frac{\sigma_X^2 \sigma_W^2}{\sigma_X^2 + \sigma_W^2}$. Now, the conditional expectation is easy to compute; it is simply the mean of $X$ conditioned on $Y = y$. The MMSE estimate is simply a scaling of the observation:
$$\hat X_{\text{MMSE}} = \mathbb{E}[X \mid Y] = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\, Y.$$
A few observations are in order:
The estimate is biased, i.e., $\mathbb{E}[\hat X_{\text{MMSE}} \mid X] = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\, X \neq X$!
The MSE of the estimator is
$$\mathbb{E}\big(X - \hat X_{\text{MMSE}}\big)^2 = \frac{\sigma_X^2 \sigma_W^2}{\sigma_X^2 + \sigma_W^2}.$$
Error is uncorrelated with the estimate! Check this yourself!
A natural question to ask is: how does the above compare with LS? Since there is only one observation, finding the LS estimate is easy. In particular,
$$\hat X_{\text{LS}} = Y.$$
It is clear that LS is an unbiased estimator while the MMSE estimator is not. Further, the MSE of LS is $\sigma_W^2$! Order-wise, both MMSE and LS are the same. See the figure below for a comparison. In fact, observe that at very high SNR $\sigma_X^2/\sigma_W^2$, i.e., in the low-noise regime, MMSE and LS both approach the same performance, as expected.
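The comparison in the figure is easy to reproduce; the sketch below (unit signal power is an illustrative choice) plots both MSEs against the SNR $\sigma_X^2/\sigma_W^2$:

```python
import numpy as np
import matplotlib.pyplot as plt

var_x = 1.0                                   # illustrative signal power
snr = np.logspace(-1, 2, 50)                  # SNR = var_x / var_w
var_w = var_x / snr

mse_mmse = var_x * var_w / (var_x + var_w)    # MMSE error (derived above)
mse_ls = var_w                                # LS error (X_hat = Y)

plt.loglog(snr, mse_mmse, label="MMSE")
plt.loglog(snr, mse_ls, "--", label="LS")
plt.xlabel("SNR"); plt.ylabel("MSE"); plt.legend(); plt.show()
```

At high SNR the two curves merge, matching the observation above.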
Recall that when we started this course, we listed the requirements on an estimator: it should be unbiased and also have good MSE performance. If the estimator is unbiased, then the MSE is the variance. Hence, we would like a Minimum Variance Unbiased Estimator (MVUE). This is beyond the scope of this course, and will be dealt with in more detail in other courses.
Exercise: Show that LS and MMSE result in the same estimator for the above model (Gaussian). Also, convince yourself why the MMSE curve is not linear while the LMMSE curve is linear.
Exercise: Consider the following joint distribution of and .
Exercise: Let and have joint PDF
Find the MMSE estimate of given the observation .
It is evident from the above discussion that if $\theta$ is deterministic and unknown, the MSE cannot be used as a metric to find an estimate of $\theta$. We use the following example to introduce a method of estimation called the Maximum Likelihood (ML) method; a method similar to what is done in the hypothesis testing case, but with a small twist.
Example: Consider Problem 1, where we are supposed to find an estimate of $p$ given the $n$ tosses $X_1, \ldots, X_n$. But $p$ is deterministic and unknown! Let us look at the likelihood of observing the outcomes $x_1, \ldots, x_n$ when a coin with bias $q$ is tossed $n$ times:
$$L(q) := q^{\sum_{i=1}^n x_i}\,(1 - q)^{n - \sum_{i=1}^n x_i}.$$
A natural choice of $q$ would be something close to $p$, as $p$ has generated these samples. However, I do not know $p$. But, intuitively, I should choose a $q$ that maximizes the likelihood above, i.e.,
$$\hat p = \arg\max_{q \in [0,1]}\; q^{\sum_i x_i}\,(1 - q)^{n - \sum_i x_i}.$$
Solving the above leads to the following solution; this is called the maximum likelihood estimate, or ML estimate for short:
$$\hat p_{\text{ML}} = \frac{1}{n}\sum_{i=1}^n x_i.$$
One can generalize the above and say that if $X_i \overset{\text{i.i.d.}}{\sim} p_\theta$ (a PMF) or $X_i \overset{\text{i.i.d.}}{\sim} f_\theta$ (a PDF), then the ML estimate is given by
$$\hat\theta_{\text{ML}} = \arg\max_{\theta} \prod_{i=1}^n p_\theta(x_i) \quad \text{or} \quad \hat\theta_{\text{ML}} = \arg\max_{\theta} \prod_{i=1}^n f_\theta(x_i).$$
Now, we have many questions. Is the ML estimate unbiased? Does it lead to a very low MSE when $n$ is large? Before answering these questions, let us get used to ML by considering more examples.
Example: Let $X_1, \ldots, X_n$ be i.i.d. from $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known. Find an ML estimate of $\mu$.
First, we need to find the joint density of the rvs:
$$f_\mu(x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(x_i - \mu)^2}{2\sigma^2}\Big).$$
The ML estimate of $\mu$ is given by
$$\hat\mu_{\text{ML}} = \arg\max_{\mu} f_\mu(x_1, \ldots, x_n).$$
Taking the log, differentiating, and equating to zero results in $\hat\mu_{\text{ML}} = \frac{1}{n}\sum_{i=1}^n x_i$; an intuitively appealing solution.
Example: Let $X_1, \ldots, X_n$ be i.i.d. from an exponential distribution with rate $\lambda$. Find an ML estimate of $\lambda$.
First, we need to find the joint density of the rvs:
$$f_\lambda(x_1, \ldots, x_n) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n\, e^{-\lambda \sum_{i=1}^n x_i}, \quad x_i \ge 0.$$
The ML estimate of $\lambda$ is given by
$$\hat\lambda_{\text{ML}} = \arg\max_{\lambda > 0}\; \lambda^n\, e^{-\lambda \sum_i x_i}.$$
Taking the log, differentiating, and equating to zero results in $\hat\lambda_{\text{ML}} = \frac{n}{\sum_{i=1}^n x_i}$, the reciprocal of the sample mean; again an intuitively appealing solution.
Example: Let $X_1, \ldots, X_n$ be i.i.d. from $\mathcal{U}[0, \theta]$. Find an ML estimate of $\theta$.
First, we need to find the joint density of the rvs:
$$f_\theta(x_1, \ldots, x_n) = \frac{1}{\theta^n}\,\mathbb{1}\Big\{\max_{1 \le i \le n} x_i \le \theta\Big\}\,\mathbb{1}\Big\{\min_{1 \le i \le n} x_i \ge 0\Big\}.$$
Thus, maximizing the above to get the ML estimate amounts to solving the problem below:
$$\hat\theta_{\text{ML}} = \arg\max_{\theta \ge \max_i x_i}\; \frac{1}{\theta^n}.$$
The solution is trivial, i.e., $\hat\theta_{\text{ML}} = \max_{1 \le i \le n} x_i$; again an intuitively appealing solution.
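A short simulation (the true $\theta = 2$ is an illustrative assumption) shows that this ML estimator systematically underestimates $\theta$, i.e., it is biased, although the bias vanishes as $n$ grows (in fact $\mathbb{E}[\max_i X_i] = \frac{n}{n+1}\theta$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, trials = 2.0, 10000            # illustrative true parameter

for n in (5, 50, 500):
    # ML estimate for Uniform[0, theta] is the sample maximum.
    est = rng.uniform(0, theta, size=(trials, n)).max(axis=1)
    print(f"n={n:4d}  mean of ML estimate = {est.mean():.4f}  (theta = {theta})")
```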
Practice Problem: Use the MATLAB code provided above to generate samples from a Gaussian distribution with a given mean and variance, and use the samples to fit a Gaussian distribution to them. Think of how to use the ML estimate to do the same. Also, make sure you understand the code.
Practice Problem: Let for and otherwise. Let for . Find MMSE and ML estimates of given a single observation .
Practice Problem: Suppose that the random variable is uniformly distributed between and , denoted . The data is a noisy measurement of given by
where is independent of , and is Laplacian distributed with PDF
Find the minimum mean-square error estimate of $X$ given $Y$. Also, solve $\max_x f_{X|Y}(x \mid y)$; this is called Maximum a Posteriori (MAP) estimation.
Next, we will take a slight detour. So far, we have assumed that the observations are i.i.d., and hence described them through a joint PDF which is a product. However, in general, the observations have to be described through a general joint PDF. This is difficult! So, let us look at (possibly correlated) observations that are Gaussian. We can stack the rvs to get a vector (whose entries are correlated with each other). How do we describe the joint distribution? The answer is as follows (turn this problem into something that we know how to describe, i.e., turn it into one variable):
Definition: A random vector $\mathbf{X} = (X_1, \ldots, X_n)^T$ is said to be jointly Gaussian if any linear combination
$$\sum_{i=1}^n a_i X_i$$
results in a Gaussian random variable for any $a_i \in \mathbb{R}$, $i = 1, \ldots, n$.
The above implies that each $X_i$ is a Gaussian rv.
Definition: We say that $\mathbf{X}$ is jointly Gaussian with mean vector $\boldsymbol\mu$, where $\mu_i = \mathbb{E}[X_i]$, and covariance matrix $\mathbf{C}$, where $C_{ij} = \mathbb{E}[(X_i - \mu_i)(X_j - \mu_j)]$, if the joint density is
$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\,|\mathbf{C}|^{1/2}} \exp\Big(-\frac{1}{2}(\mathbf{x} - \boldsymbol\mu)^T \mathbf{C}^{-1} (\mathbf{x} - \boldsymbol\mu)\Big).$$
Generally, we use the shorthand notation $\mathbf{X} \sim \mathcal{N}(\boldsymbol\mu, \mathbf{C})$. In the above, $|\mathbf{C}|$ denotes the determinant of the matrix. Note that $\mathbf{C}$ is a symmetric positive semidefinite matrix (why?).
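A quick numerical illustration of the definition (all the numbers are illustrative assumptions): draw samples of a jointly Gaussian vector and check that an arbitrary linear combination $\mathbf{a}^T\mathbf{X}$ has mean $\mathbf{a}^T\boldsymbol\mu$ and variance $\mathbf{a}^T\mathbf{C}\mathbf{a}$, as a Gaussian rv should:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])                 # illustrative mean vector
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])                 # illustrative covariance matrix

X = rng.multivariate_normal(mu, C, size=100000)

a = np.array([0.5, 2.0])                   # an arbitrary linear combination
z = X @ a
print("empirical mean/var  :", z.mean(), z.var())
print("theoretical mean/var:", a @ mu, a @ C @ a)
```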
Now, let us consider the following problems, one on detection and the other on estimation.
Detection: $\mathbf{X} \sim \mathcal{N}(\boldsymbol\mu_0, \mathbf{C})$ under $H_0$ and $\mathbf{X} \sim \mathcal{N}(\boldsymbol\mu_1, \mathbf{C})$ under $H_1$. Find the test to distinguish between the two.
Solution: The test is again the LLR, assuming $\pi_0 = \pi_1 = 1/2$:
$$\log \frac{f_1(\mathbf{x})}{f_0(\mathbf{x})} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; 0.$$
Estimation: Let $\mathbf{X} \sim \mathcal{N}(\boldsymbol\mu, \mathbf{C})$ with $\boldsymbol\mu$ deterministic but unknown. Find an estimate of $\boldsymbol\mu$.
Solution: The ML estimate is (MMSE is not possible. Why?):
IMPORTANT NOTE: So far, we have looked at both estimation and detection problems. In these problems, we assumed some kind of distribution. However, in practice, you do not know the distribution. One can come up with tests without knowing the distribution. Based on this, two terms are used in statistics, namely (a) parametric tests or estimation, which refers to the case where you know the distribution (up to some parameters), and (b) non-parametric tests or estimation, where you do not assume any distribution. Hence, the tests/estimators that we considered in these notes so far are called parametric.
Given samples . Suppose we use the estimator . Show that , where is sufficient to ensure:
Suppose a sample is modelled by a Poisson distribution with a parameter denoted by $\lambda$. Estimate the unknown parameter $\lambda$ by using the MLE. Show that asymptotically (in the number of samples), the MSE of the estimate is zero.
You have a coin that you think is biased. You flip it $n$ times and record the resulting sequence of heads and tails. What is the maximum likelihood estimate for the probability of getting heads?
You think a die is rigged so that it will roll less often. You continually roll the die until you get and you have to roll times before this happens. Is your suspicion correct with a confidence level of ?
Two cars A and B are travelling independently in the same direction. The speed of car A is normally distributed with mean km/hr and standard deviation km/hr, while the speed of car B is normally distributed with mean km/hr and standard deviation km/hr. Car A was initially 3 km ahead of car B. Find the probability that after hours, they are within a distance of km from each other.
Let $X$ and $Y$ be two continuous random variables with joint PDF
Find the MMSE estimator of $X$ given $Y$. Also, find the MLE of $X$. Comment on which is better in terms of computation and MSE.
Consider the linear regression model $y_i = a + b x_i + \epsilon_i$ for $i = 1, \ldots, n$, where $a$ and $b$ are unknown parameters and the $\epsilon_i$ are independent and identically distributed random variables having a $\mathcal{N}(0, \sigma^2)$ distribution. The data on $(x_i, y_i)$ are given in the table below. Find the constants that minimize the squared error.
[Table: data for the linear regression problem.]
Given that the number of products produced in a factory during a week is a random variable with a mean of and a variance of , what can be inferred about the likelihood that the weekly production falls within the range of to products?
Let us consider a random sample $X_1, \ldots, X_n$, where $X_i = 0$ if a randomly selected person does not hold a racing bike, and $X_i = 1$ if a randomly selected person does hold a racing bike. Assuming that the $X_i$ are independent Bernoulli random variables with unknown parameter $p$, find the maximum likelihood estimator of $p$, the proportion of persons who hold a racing bike.
Consider the following hypothesis testing problem:
Find the ML rule for the above.
Not included in the exam (Number of collisions): If we throw $m$ balls into $n$ bins, compute the average number of collisions. Use Chebyshev's inequality to show that:
Next, for the specified choice of the parameters, use Chernoff bounds plus the union bound to bound the probability that no bin has more than one ball.
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables with probability density function
Then which of the following have/has finite expectation?
(i) , (ii) , (iii) , and (iv) .
Assume that you observe one sample whose distribution is
Find the Neyman-Pearson rule to achieve a false alarm rate of $\alpha$. Find the probability of detection ($P_D$). Plot a graph showing $P_D$ versus the probability of false alarm $P_{FA}$.
Let the observation be modelled as
where $W_i$ is the noise and $a$ is a known constant. Find the MMSE estimate of $\theta$ given all the past observations $Y_1, \ldots, Y_n$. Find the MSE of your estimator.