B. N. Bharath
Assistant professor,
Department of Electrical, Electronics and Communication Engineering (EECE),
IIT Dharwad
There will be two evaluations: one quiz and one final exam. The quiz may also contain some assignments (given during the regular classes), such as writing code or solving problems.
Quiz: in the morning, during regular class hours. The syllabus will include everything that is covered until the quiz.
Final exam will be as per the academic calendar shared by the academic office.
A repeat exam or evaluation will not be encouraged. If someone misses the quiz for a genuine reason, then I may use the final exam marks to grade, with a possible penalty. The penalty will be decided based on the reason for the absence.
The answer scripts will be distributed in November in the PC. More details will be conveyed by the TAs.
Let us start with the question: what is data analysis? It refers to providing inference and insights using data. The classical approach involves modelling the data using some sort of a function or a probability distribution, and then using tools from probability to make inferences. One might wonder why we need data analysis at all. Imagine that you are an investment banker and you have to decide whether to invest in some stocks. One can use past experience, the current political situation, etc., to infer. A more systematic approach is to use the past data and provide inference using probabilistic tools in addition to one's rich experience. This course covers classical approaches to these problems, and provides some of the insights and tools necessary to solve a certain class of problems. Let us begin by looking at a stylized example involving a coin toss.
In particular, let us consider the following four example problems:
Problem 1: Given a coin, find its bias.
Problem 2: We have two coins, one with a known bias $p \neq 1/2$ and the other with bias $1/2$. Imagine that I am going to randomly pick one of them and give it to you. You are supposed to find whether I have given you the biased coin or the unbiased coin.
Problem 3 (more challenging): Same as Problem 2 with "bias $p$" replaced by "bias not equal to $1/2$", i.e., a biased coin (unknown bias) versus an unbiased coin.
Problem 4 (more challenging): Same as Problem 2 with "bias $p$" replaced by "bias is a rational number" and "bias $1/2$" replaced by "bias is an irrational number", i.e., rational bias versus irrational bias.
A typical sample path when you toss a coin $n$ times (or sample a Bernoulli rv with bias $p$, $n$ times) looks like $1, 0, 0, 1, 1, \ldots, 0$.
In the first problem, our end result will be the bias, which is any number between $0$ and $1$. On the other hand, Problem 2 asks us to find the right answer among two possibilities, i.e., biased and unbiased. Problem 1 is called an estimation problem, while the second (also the third and fourth) problem is called a hypothesis testing or classification or decision problem. First, let us begin by modeling the experiment involved in Problem 1 above. The first step is to identify the experiment involved, i.e., tossing a coin, whose sample space is $\{\text{heads}, \text{tails}\}$. We start by modelling each outcome of the coin by a random variable (rv), i.e., let the outcome of the $i$-th coin toss be $X_i$, $i = 1, 2, \ldots, n$, which is defined as
$$X_i = \begin{cases} 1 & \text{if the } i\text{-th toss results in heads}, \\ 0 & \text{otherwise}. \end{cases}$$
Note that each coin toss is independent of the others, i.e., the outcome of any coin toss will not impact the other outcomes. A mathematical way of saying this is that $X_1, X_2, \ldots, X_n$ are independent rvs, whose definition is as follows:
Definition (independent rvs): We say that $X_1, X_2, \ldots, X_n$ are independent rvs if the joint distribution factorizes, i.e., $\Pr\{X_{i_1} = x_{i_1}, \ldots, X_{i_k} = x_{i_k}\} = \prod_{j=1}^{k} \Pr\{X_{i_j} = x_{i_j}\}$ for any subset $\{i_1, \ldots, i_k\} \subseteq \{1, 2, \ldots, n\}$. If the distributions of all the $X_i$'s are the same, then we say that the rvs are identically distributed. If $X_1, \ldots, X_n$ are both independent and identically distributed (i.i.d. for short), we use the notation $X_i \overset{\text{i.i.d.}}{\sim} P$. For continuous rvs, it suffices to replace the PMF by the PDF.
Note: Sometimes we use $\perp$ to indicate independence. For example, $X \perp Y$ means that $X$ and $Y$ are independent rvs.
Thus, by our intuition, it makes sense to assume that $X_i \overset{\text{i.i.d.}}{\sim} \text{Bern}(p)$, where $p \in [0,1]$ is unknown! Now, our goal is to find this $p$. Again, guided by our intuition, a natural estimate is to count the number of heads and divide it by the total number of tosses $n$. Mathematically,
$$\hat{p}_n := \frac{1}{n} \sum_{i=1}^{n} X_i.$$
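As a quick sanity check, here is a minimal Python sketch of this estimator (the true bias $0.3$, the sample size, and the variable names are illustrative assumptions, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true, n = 0.3, 1000          # illustrative unknown bias and number of tosses

# Simulate n i.i.d. Bernoulli(p) coin tosses (1 = heads, 0 = tails).
tosses = rng.binomial(1, p_true, size=n)

# Relative-frequency estimate: number of heads divided by n.
p_hat = tosses.mean()
print(f"true bias = {p_true}, estimate = {p_hat:.3f}")
```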
Let us closely look at what we want our estimate to be.
Let us pause for a moment to see how to measure the "goodness" of an estimator (the word "estimator" is not formally defined yet; for now, $\hat{p}_n$ is our estimator :-)). There are several ways to measure "goodness". The following are some natural measures: the mean absolute error $\mathbb{E}|\hat{p}_n - p|$ and the mean squared error (MSE) $\mathbb{E}(\hat{p}_n - p)^2$.
One might be wondering: why not the 3rd power of the absolute error? For that matter, why not the $k$-th power of the absolute error? The thought process is correct. However, these higher powers lead to complicated expressions for the mean (whatever) error, and are not amenable to mathematical analysis. But the MSE is a nice quantity that can be analyzed! So, let us go ahead with the MSE. First, let us see if our estimator is unbiased. Towards this, consider
$$\mathbb{E}[\hat{p}_n] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[X_i] = \frac{1}{n} \cdot n p = p.$$
Hence our estimator is unbiased! Now, let us look at the MSE performance:
$$\mathbb{E}(\hat{p}_n - p)^2 = \mathrm{Var}(\hat{p}_n) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{p(1-p)}{n}.$$
This leads to the following result.
Theorem: The estimator $\hat{p}_n$ satisfies $\mathbb{E}(\hat{p}_n - p)^2 = \frac{p(1-p)}{n}$. In particular, in the asymptotic regime, the MSE behaves as $O(1/n)$ and vanishes as $n \to \infty$.
The following plot shows the MSE versus $n$ for a coin with bias $p$. Clearly, the MSE goes to zero as $n$ is increased. Further, the rate of decay is of the order $O(1/n)$.
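The plot is easy to reproduce. Below is a minimal simulation sketch (the bias $0.3$ and the number of Monte Carlo trials are illustrative choices); it overlays the empirical MSE with the theoretical $p(1-p)/n$ curve:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
p, trials = 0.3, 2000
ns = np.arange(10, 1001, 30)

# Empirical MSE of the relative-frequency estimator for each n.
mse = [np.mean((rng.binomial(n, p, size=trials) / n - p) ** 2) for n in ns]

plt.loglog(ns, mse, "o", label="empirical MSE")
plt.loglog(ns, p * (1 - p) / ns, label="p(1-p)/n")
plt.xlabel("n"); plt.ylabel("MSE"); plt.legend(); plt.show()
```

On the log-log scale, both curves fall with slope $-1$, which is the $O(1/n)$ rate.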
Note: Estimators with the above asymptotic property are called efficient estimators. The notation $O(\cdot)$ is Landau's Big-O notation. It is often used to suppress nuisance (sometimes not-so-nuisance) terms while analyzing something. Say you are interested in the behaviour of the MSE with respect to $n$ and not so much in the other terms; then we can use this notation. Mathematically, it means the following: we say that $f(n) = O(g(n))$ if $\limsup_{n \to \infty} \frac{|f(n)|}{g(n)} < \infty$.
Is there any other way of measuring the "goodness" of an estimator? To answer this, let us revisit the error, often called the empirical error, $|\hat{p}_n - p|$. As noticed earlier, this error is random, and we cannot bank on it directly. Note that our estimator should not result in a bad error, i.e., $|\hat{p}_n - p| > \epsilon$, where $\epsilon > 0$ is the tolerance, and depends on the application. Whenever the error exceeds the tolerance, we refer to it as the "Bad" event. Obviously, we cannot ensure that the Bad event will never happen, as $\hat{p}_n$ is random. Thus, a more sensible thing to ask for is to make the probability of the "Bad" event small, i.e.,
$$\Pr\{|\hat{p}_n - p| > \epsilon\} \le \delta$$
for some small $\delta > 0$. Note that $|\hat{p}_n - p|$ is non-negative. In general, it is very difficult to verify whether $\Pr\{|\hat{p}_n - p| > \epsilon\}$ is small. Instead, if we can somehow show that something larger than this probability is small, then our job will be done. Thus, we need an upper bound on $\Pr\{|\hat{p}_n - p| > \epsilon\}$; this is the essence of the next topic.
[Ignore the fancy title for now :-)] We noticed above that we need an upper bound on $\Pr\{X > \epsilon\}$, referred to as the tail probability (think why?). Here $X$ is a non-negative rv. Consider
$$\mathbb{E}[X] \ge \mathbb{E}\big[X\,\mathbb{1}\{X > \epsilon\}\big] \ge \epsilon\,\Pr\{X > \epsilon\}.$$
Rearranging the above yields the following result.
Theorem (Markov's inequality): Let $X \ge 0$ and $\epsilon > 0$. Then,
$$\Pr\{X > \epsilon\} \le \frac{\mathbb{E}[X]}{\epsilon}.$$
Now, let us try to apply the above result to our estimation problem. Note that the non-negative rv of interest is $|\hat{p}_n - p|$. Applying the Markov inequality, we get $\Pr\{|\hat{p}_n - p| > \epsilon\} \le \frac{\mathbb{E}|\hat{p}_n - p|}{\epsilon}$. It is hard to compute the term $\mathbb{E}|\hat{p}_n - p|$. However, in the above, we could compute $\mathbb{E}(\hat{p}_n - p)^2$ (with a square). However, Markov involves the term without the square. The idea is to see if we can convert it to the squared version. It is an easy exercise once you note that $\Pr\{|\hat{p}_n - p| > \epsilon\} = \Pr\{(\hat{p}_n - p)^2 > \epsilon^2\}$, and then apply the Markov inequality to get
$$\Pr\{|\hat{p}_n - p| > \epsilon\} \le \frac{\mathbb{E}(\hat{p}_n - p)^2}{\epsilon^2}.$$
The above, when applied to the rv $|X - \mathbb{E}[X]|$ for an arbitrary rv $X$, gets a special name:
Theorem (Chebyshev's Inequality): Let $X$ be any rv (not necessarily non-negative) with finite variance. Then,
$$\Pr\{|X - \mathbb{E}[X]| > \epsilon\} \le \frac{\mathrm{Var}(X)}{\epsilon^2}.$$
Note that our estimator is unbiased. Using this fact and applying Chebyshev's inequality to our estimator results in
$$\Pr\{|\hat{p}_n - p| > \epsilon\} \le \frac{p(1-p)}{n\epsilon^2} \le \frac{1}{4n\epsilon^2},$$
where the last inequality uses $p(1-p) \le 1/4$.
There are other ways of stating this result. For example, if we want $\Pr\{|\hat{p}_n - p| > \epsilon\} \le \delta$ for some $\delta > 0$, then we can ensure this by choosing $n$ such that $\frac{1}{4n\epsilon^2} \le \delta$. In other words, if $n \ge \frac{1}{4\epsilon^2\delta}$, then we can ensure $\Pr\{|\hat{p}_n - p| > \epsilon\} \le \delta$. A fancy way of saying this is as follows:
With a probability of at least $1 - \delta$, the error satisfies $|\hat{p}_n - p| \le \epsilon$, provided the number of samples $n \ge \left\lceil \frac{1}{4\epsilon^2\delta} \right\rceil$. Here, the symbol $\lceil \cdot \rceil$ denotes the ceiling operation. This style of result is often called a Probably Approximately Correct (PAC) answer!
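As a worked illustration of this PAC statement, the sketch below (the tolerance $\epsilon = 0.05$, confidence $\delta = 0.1$, and bias $0.3$ are illustrative assumptions) computes the Chebyshev-based sample size and checks it empirically:

```python
import math
import numpy as np

eps, delta = 0.05, 0.1                      # illustrative tolerance and confidence
n = math.ceil(1 / (4 * eps**2 * delta))     # Chebyshev-based sample size
print("n =", n)                             # 1000 for these values

# Empirical check: fraction of runs with |p_hat - p| > eps should be <= delta.
rng = np.random.default_rng(1)
p = 0.3                                     # illustrative true bias
p_hats = rng.binomial(n, p, size=5000) / n
print("empirical bad-event prob:", np.mean(np.abs(p_hats - p) > eps))
```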
This PAC-style guarantee goes by another name: an interval estimate. Instead of giving one estimate $\hat{p}_n$, we give a range of values around $\hat{p}_n$ in which $p$ should lie with reasonably high probability. Naturally, if one gives only $\hat{p}_n$ (a single value), it is called a point estimate, for obvious reasons. Essentially, the above result is saying that $p$ is within $\epsilon$ of $\hat{p}_n$ (approximately; approximation error of $\epsilon$) with a probability of at least $1-\delta$ (probably). Hence the name PAC. This style of result is very common in theoretical machine learning. In this course, we will give more definitive guarantees.
Side Remark: Oftentimes, people may ask what guarantee you can give for your estimator/algorithm/hypothesis test, etc. What they mean is: can you say that "some sort of average "error" is small, or the probability of error is small, or, in general, some performance measure is small (assuming small means good)". In this course, we will provide many such guarantees. The above PAC or MSE expression is one such guarantee.
Exercise: Suppose you are given $n$ i.i.d. samples from a uniform distribution $\mathcal{U}[0, \theta]$, where $\theta > 0$ is unknown. Propose an estimator of $\theta$ and study its bias and MSE.
Practice Problem: A symmetric die is thrown $n$ times. Find a lower bound on the probability that the number of sixes lies within a given range around $n/6$.
Practice Problem: (Prof. Stankova, UCB) Let for and otherwise. What bound does Chebyshev’s inequality give for the probability ? For what value of can we say ?
Practice Problem: (Prof. Stankova, UCB): Let for , and everywhere else. Give a bound using Chebyshev’s for .
Practice Problem (Comparing bound with actual): Find an upper bound on for some , where . Compare your actual value with the upper bound.
Practice Problem (finite higher moments may help): Let $X$ be a rv with finite higher moments. Use Chebyshev's inequality to bound $\Pr\{X > a\}$ for some $a$. Compare this with the bound obtained using the $k$-th moment of $X$ via Markov's inequality; this requires you to find the $k$-th moment of $X$. Compare your bounds for different values of $k$, and comment. Comment on what happens if we increase $k$ indefinitely.
Practice Problem (coding exercise): Simulate the exercise above. Plot the MSE performance of your estimator versus the number of samples. You can use any coding language (Python or MATLAB preferred).
As mentioned earlier, the second problem is a hypothesis testing or decision problem; we need to decide whether the coin has bias $p$ or bias $1/2$. The first question that we encounter is: on what basis should we decide? A natural metric here is the probability of making a mistake, i.e., saying "bias $p$" when the coin has bias $1/2$, and vice-versa. Let us write down this error event:
$$\mathcal{E} = \big(\{\text{say bias } p\} \cap \{\text{bias } 1/2\}\big) \cup \big(\{\text{say bias } 1/2\} \cap \{\text{bias } p\}\big).$$
To compute the probability of error, we first note that the two events in the union are disjoint. Thus,
$$P_e = \Pr\big(\{\text{say bias } p\} \cap \{\text{bias } 1/2\}\big) + \Pr\big(\{\text{say bias } 1/2\} \cap \{\text{bias } p\}\big).$$
Immediately, one may wonder: who "says" this, and what is the "decision rule"? We have to clarify these notions before we delve into the solution.
Decision rule: Intuitively, we observe and then make decisions. What do we observe here? Easy: we toss the coin $n$ times and observe the outcomes. Mathematically, we observe the Bernoulli random variables $X_i$, $i = 1, 2, \ldots, n$. Soon we realize that a rule should be a function of $X_1, \ldots, X_n$. Mathematically, a rule here is a function $\phi: \{0,1\}^n \to \{0,1\}$; in simple terms, it is denoted by $\phi$, and it takes values in $\{0, 1\}$. Here $\phi = 1$ means "say bias $p$" and $\phi = 0$ means "say bias $1/2$". Seems very complicated, isn't it? Not really! Let's move on. The event "Say $0$" means $\phi(X_1, \ldots, X_n) = 0$ for the decision rule $\phi$; easy? Similarly for "Say $1$". There is another event to take care of, i.e., "when the coin is unbiased" :-(. This refers to the event that the true coin has bias $1/2$. Let us denote this event by $H_0$. You should excuse me for using too many notations. Here the fancy $H$ refers to "hypothesis" and the subscript $0$ is as per my madness. If you don't like it, you can use some other symbol instead. Similarly, $H_1$ denotes "when the bias is $p$".
Are $H_0$ and $H_1$ random events? Are they deterministic? Many questions remain to be answered. If I choose the coin randomly from an urn, then these events are random, and we can associate probabilities with them. However, if these events are deterministic but unknown, then we cannot assign probabilities. We need to understand why these events can be deterministic or unknown. In fact, I can pick the coin randomly, and there is no way by which you can figure out how I have chosen the coin. For example, I can go into a secret room, pick the coin, and show it to you. Note that I may choose to pick the coin deterministically as well. There are so many variants. Here, let us keep things simple: assume that the coins are picked randomly. In particular, I am going to pick the coin with bias $1/2$ with probability $\pi_0$. Let $\pi_1 := 1 - \pi_0$. More precisely, $\Pr\{H_0\} = \pi_0$ and $\Pr\{H_1\} = \pi_1$.
With the above setup, we can write the probability of error as
$$P_e = \pi_0 \Pr\{\phi = 1 \mid H_0\} + \pi_1 \Pr\{\phi = 0 \mid H_1\}.$$
Now, the goal is to design a "decision rule" $\phi$ that minimizes the above error, i.e.,
$$\min_{\phi}\; P_e(\phi).$$
Seems complicated, and it is indeed not simple! This is because the minimization is over functions. To solve this, let us understand what a decision rule could possibly do. It observes the samples $X_1, \ldots, X_n$ and decides in favor of zero or one. The input to $\phi$ is one of the $2^n$ possible binary sequences. It decides some sequences in favor of one and the rest in favor of zero. Thus, $\phi$ partitions the space of sequences into two regions. Let us collect all the sequences that are mapped in favor of $1$ in one set, say $\mathcal{A}_1$, which is defined as
$$\mathcal{A}_1 := \{(x_1, \ldots, x_n) \in \{0,1\}^n : \phi(x_1, \ldots, x_n) = 1\}.$$
The region corresponding to $0$ is denoted by $\mathcal{A}_0 = \mathcal{A}_1^c$. To proceed further (also, I am a bit lazy), we need the following notation:
Notation: We use $X^n := (X_1, X_2, \ldots, X_n)$ and $x^n := (x_1, x_2, \ldots, x_n)$. Imagine me writing (I mean LaTeXing) all these arrays every time!
Using the above, a part of the probability of error can be written as
$$\Pr\{\phi = 1 \mid H_0\} = \Pr\{X^n \in \mathcal{A}_1 \mid H_0\} = \sum_{x^n \in \mathcal{A}_1} \Pr\{X^n = x^n \mid H_0\}.$$
The above follows from the fact that the probability of a rv (a random vector in this case) falling in a region is the sum of the PMF over that region (an integral in the continuous case). In the above, $\mathcal{A}_1$ corresponds to the region where you (i.e., $\phi$) make a decision in favor of $1$. The above argument applies to conditional probability as well (recall that a conditional PMF also satisfies all the axioms of probability). Further, we note that
$$P_e = \pi_0 \sum_{x^n \in \mathcal{A}_1} \Pr\{X^n = x^n \mid H_0\} + \pi_1 \sum_{x^n \in \mathcal{A}_0} \Pr\{X^n = x^n \mid H_1\}.$$
Let us look at the above more carefully. First, conditioning on "when $H_0$" means that it corresponds to an unbiased coin. Thus, $\Pr\{X^n = x^n \mid H_0\}$ is the probability of observing the $n$-tuple $x^n$ when you throw an unbiased coin $n$ times. By noting that the tosses are i.i.d., we have
$$\Pr\{X^n = x^n \mid H_0\} = \prod_{i=1}^n \Big(\frac{1}{2}\Big)^{x_i}\Big(\frac{1}{2}\Big)^{1-x_i} = \Big(\frac{1}{2}\Big)^n,$$
since $\sum_{i=1}^n x_i$ gives us the total number of ones in the sequence (or the total number of heads). Similarly,
$$\Pr\{X^n = x^n \mid H_1\} = p^{\sum_i x_i}(1-p)^{n - \sum_i x_i}.$$
Recall that
$$P_e = \pi_0 \sum_{x^n \in \mathcal{A}_1} \Pr\{X^n = x^n \mid H_0\} + \pi_1 \Big(1 - \sum_{x^n \in \mathcal{A}_1} \Pr\{X^n = x^n \mid H_1\}\Big) = \pi_1 + \sum_{x^n \in \mathcal{A}_1} \big(\pi_0 \Pr\{X^n = x^n \mid H_0\} - \pi_1 \Pr\{X^n = x^n \mid H_1\}\big).$$
The problem was to minimize the probability of error (i.e., the above expression) with respect to $\phi$. Note that the above expression depends on $\phi$ only through $\mathcal{A}_1$. Thus, we need to choose $\mathcal{A}_1$ so as to make the expression smaller. At first glance, this appears to be a hard problem. However, let us look at it more closely. First, as explained earlier, any choice of $\phi$ will only split the space of sequences into two disjoint sets. In a way, we have to split the space into two pieces such that the resulting probability of error is small. If we split the space (i.e., choose $\mathcal{A}_1$) such that the summand, i.e., $\pi_0 \Pr\{X^n = x^n \mid H_0\} - \pi_1 \Pr\{X^n = x^n \mid H_1\}$, is positive for some $x^n \in \mathcal{A}_1$, then it will only increase the probability of error. Therefore, if $\phi$ is optimal, then it should include in $\mathcal{A}_1$ only those $x^n$ for which the summand is negative. This leads to the following optimal rule:
$$\mathcal{A}_1^* = \big\{x^n : \pi_1 \Pr\{X^n = x^n \mid H_1\} > \pi_0 \Pr\{X^n = x^n \mid H_0\}\big\}.$$
Note that the event $\{X^n \in \mathcal{A}_1^*\}$ corresponds to choosing the coin with bias $p$. Moreover, no matter what rule you pick, all the probabilities above are fixed beforehand; the only thing that you can ensure is that the summand is negative. That is exactly what the above rule is doing. The above leads to the following result.
Theorem: For Problem 2, the optimal way of deciding whether the coin has bias $p$ or is unbiased is through the following log-likelihood ratio (LLR) rule: decide "bias $p$" if
$$\log \frac{\Pr\{X^n = x^n \mid H_1\}}{\Pr\{X^n = x^n \mid H_0\}} \ge \log \frac{\pi_0}{\pi_1},$$
and "unbiased" otherwise.
Now, let us see what this results in by substituting for the conditional probabilities. Assuming $p > 1/2$, a little algebra reduces the rule to: decide "bias $p$" if
$$\frac{1}{n}\sum_{i=1}^n x_i \;\ge\; \frac{\frac{1}{n}\log\frac{\pi_0}{\pi_1} + \log\frac{1/2}{1-p}}{\log\frac{p}{1-p}}.$$
The left hand side is the relative frequency of heads, and the right hand side is a specific threshold! When the priors are equal ($\pi_0 = \pi_1$), the threshold becomes $\frac{\log\frac{1/2}{1-p}}{\log\frac{p}{1-p}}$, a number almost at the midpoint of $1/2$ and $p$! Intuition works!
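Here is a compact Python sketch of this test (equal priors; the bias $p = 0.75$ and the sample size are illustrative assumptions):

```python
import numpy as np

def llr_coin_test(tosses, p, prior0=0.5):
    """Decide 1 ("bias p") or 0 ("fair") via the relative-frequency threshold."""
    n = len(tosses)
    # Threshold from the LLR rule for Bern(p) vs Bern(1/2), assuming p > 1/2.
    tau = (np.log(prior0 / (1 - prior0)) / n
           + np.log(0.5 / (1 - p))) / np.log(p / (1 - p))
    return int(np.mean(tosses) >= tau)

rng = np.random.default_rng(0)
p = 0.75                                   # illustrative bias
tosses = rng.binomial(1, p, size=100)      # coin is actually biased here
print("decision:", llr_coin_test(tosses, p))   # 1 means "bias p"
```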
Side remark: Why should we do all this if we knew the answer beforehand based on our intuition? I have two answers. First, it's fun; second, the above settles the issue. Nobody can question us by asking what the guarantee is that there are no scenarios where our intuition fails.
Example (Effectiveness of a new drug): The theory and methodology developed above play a significant role in clinical trials and medical research. Imagine that you are in a drug company attempting to find a drug for a disease. Naturally, one needs to figure out whether the treatment using the drug is effective or not. You can formulate this as a hypothesis testing problem, where the null hypothesis corresponds to no effect of the drug and the alternate hypothesis corresponds to the drug being very effective.
Test case: Let us say it is believed that the drug discovered by the company reduces blood pressure (BP) in obese patients. The drug company decides to collect BP data of obese patients for a number of days. After that, the drug is administered to all of them. Now, the goal is to test your hypothesis, i.e., effective versus ineffective, given the BP measurements of all the patients before and after the medication. How do we solve this problem? (Details will be filled in later.)
Exercise: Consider two coins with biases $p_0$ and $p_1$ ($p_1 > p_0$) in Problem 2, and repeat the whole exercise to show that the corresponding LLR rule is optimal with respect to the probability of error.
Find the corresponding threshold.
The above analysis does not tell us how much error we incur by using the above LLR rule. We need to investigate this next. Towards this, let us start with the probability of error. An error occurs if the LLR rule says the bias is $p$ when the true bias is $1/2$, and vice-versa. This leads to
$$P_e = \pi_0 \Pr\{\phi = 1 \mid H_0\} + \pi_1 \Pr\{\phi = 0 \mid H_1\}.$$
Let us use the symbol $P_{e,0}$ to denote the first conditional probability above; a similar symbol $P_{e,1}$ will be used for the second term, with conditioning on bias $p$. Note that
$$P_{e,0} = \Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge \tau \,\Big|\, H_0\Big\},$$
where $\tau$ denotes the threshold of the LLR rule.
Similarly,
$$P_{e,1} = \Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i < \tau \,\Big|\, H_1\Big\}.$$
Observing the above two expressions, we realize that we need to evaluate a quantity of the form $\Pr\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\}$, where the $X_i$'s are i.i.d. rvs. As in the case of Markov's inequality, a closed-form expression for such quantities seldom exists, and hence we need to go for an upper bound. It turns out that quantities of this sort appear frequently in statistics, and hence let us try to give a general bound.
Let $X_1, \ldots, X_n$ be i.i.d. rvs with distribution $P$, often written $X_i \overset{\text{i.i.d.}}{\sim} P$. Note that each $X_i$ need not be non-negative. Then, for any $\lambda > 0$,
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\Big\} = \Pr\Big\{e^{\lambda \sum_{i=1}^n X_i} \ge e^{\lambda n a}\Big\} \le e^{-\lambda n a}\big(\mathbb{E}[e^{\lambda X_1}]\big)^n,$$
where the inequality follows from Markov's inequality and independence.
The above bound is true for any $\lambda > 0$. Hence, taking the infimum over all $\lambda > 0$ results in the following result.
Theorem (Chernoff bound): Let $X_1, \ldots, X_n$ be i.i.d. rvs with distribution $P$. Then,
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\Big\} \le \inf_{\lambda > 0} \exp\Big(-n\big(\lambda a - \log \mathbb{E}[e^{\lambda X_1}]\big)\Big).$$
Practice Problem: Find the Chernoff bound on $\Pr\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\}$ when $X_1, \ldots, X_n$ are i.i.d. rvs having an exponential distribution with rate $\mu$.
Let us apply the above to Bernoulli random variables: $X_i \overset{\text{i.i.d.}}{\sim} \text{Bern}(p)$ for some $p \in (0,1)$. In order to apply the above bound, we need to compute $\mathbb{E}[e^{\lambda X_1}]$, which is simplified as follows:
$$\mathbb{E}[e^{\lambda X_1}] = (1-p) + p\,e^{\lambda}.$$
Now, it remains to compute the infimum over $\lambda > 0$. Let us have some fun by using some calculus; this is when you should realize that the various subjects you have learnt, especially mathematics, come in handy. Note that
$$(1-p) + p\,e^{\lambda} = 1 + p\,(e^{\lambda} - 1).$$
Now, use your calculus knowledge to prove the following.
Exercise: Show that $1 + x \le e^x$ for all $x \in \mathbb{R}$.
Using the result in the above exercise with $x = p(e^{\lambda} - 1)$, we get
$$\mathbb{E}[e^{\lambda X_1}] \le \exp\big(p(e^{\lambda} - 1)\big).$$
Thus,
$$\big(\mathbb{E}[e^{\lambda X_1}]\big)^n \le \exp\big(np(e^{\lambda} - 1)\big).$$
Substituting this in the Chernoff bound, we get
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\Big\} \le \exp\Big(-n\big(\lambda a - p(e^{\lambda} - 1)\big)\Big).$$
Noting that the exponent above is a convex function of $\lambda$, a simple differentiation leads to the optimal $\lambda^* = \log\frac{a}{p}$. It is important to note that $\lambda^* > 0$ requires $a > p$. Substituting this above leads to
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\Big\} \le \exp\Big(-n\big(a\log\tfrac{a}{p} - a + p\big)\Big).$$
Denote $D(a, p) := a\log\frac{a}{p} - a + p$. Using this, the bound becomes
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge a\Big\} \le e^{-n D(a, p)}.$$
The above takes care of the first term in the probability of error expression. The second term looks something like $\Pr\{\frac{1}{n}\sum_{i=1}^n X_i < \tau \mid H_1\}$, a lower-tail probability. Intuitively, for a coin with bias $p$, the probability of the relative frequency being far away from $p$ is small. Thus, if $\tau$ is away from $p$, then we get a reasonable bound; this is the essence of the following exercise.
Exercise: Find a bound on $\Pr\{\frac{1}{n}\sum_{i=1}^n X_i \le a\}$ when the $X_i$'s are i.i.d. Bernoulli random variables with bias $p$ and $a < p$. Hint: Use the Chernoff bounding technique with $\lambda < 0$.
The answer to the above exercise will be of the form $e^{-n E_1(a, p)}$, where $E_1(a, p) > 0$; it is required that $a < p$. To declutter things, let me summarize the two bounds below:
$$\Pr\Big\{\frac{1}{n}\sum_i X_i \ge a\Big\} \le e^{-n D(a, p)} \ \ (a > p), \qquad \Pr\Big\{\frac{1}{n}\sum_i X_i \le a\Big\} \le e^{-n E_1(a, p)} \ \ (a < p).$$
Now we are ready to use the above result to prove a bound on the probability of error. It is a long way up in these notes where you can find what I am writing next; the first term in the probability of error expression satisfies
$$P_{e,0} = \Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \ge \tau \,\Big|\, H_0\Big\} \le e^{-n D(\tau, 1/2)},$$
since under $H_0$ the bias is $1/2$ and $\tau > 1/2$.
Thus the total probability of error is given by
$$P_e \le \pi_0\, e^{-n D(\tau, 1/2)} + \pi_1\, e^{-n E_1(\tau, p)}.$$
The above leads to the following result.
Theorem: For Problem 2, with the optimal LLR rule, the probability of error satisfies the following upper bound:
$$P_e \le \pi_0\, e^{-n D(\tau, 1/2)} + \pi_1\, e^{-n E_1(\tau, p)} \le e^{-n \min\{D(\tau, 1/2),\, E_1(\tau, p)\}}.$$
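The exponential decay promised by the theorem is visible in simulation. The sketch below (bias $p = 0.75$, equal priors, and the Monte Carlo size are illustrative assumptions) estimates the error probability of the threshold rule for increasing $n$; on a semi-log plot the curve is roughly a straight line, i.e., $P_e \approx e^{-nE}$ for some $E > 0$:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
p, trials = 0.75, 50000               # illustrative bias and Monte Carlo size
tau = np.log(0.5 / (1 - p)) / np.log(p / (1 - p))   # equal-prior threshold

ns, perr = np.arange(10, 151, 10), []
for n in ns:
    # Pick the coin with equal priors, then toss it n times.
    biased = rng.random(trials) < 0.5
    rel_freq = rng.binomial(n, np.where(biased, p, 0.5)) / n
    decisions = rel_freq >= tau       # decide "bias p" above the threshold
    perr.append(np.mean(decisions != biased))

plt.semilogy(ns, perr, "o-")
plt.xlabel("n"); plt.ylabel("probability of error"); plt.show()
```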
Practice Problem: Find the number of samples $n$ so that the probability of error is at most a prescribed $\delta > 0$.
Exercise: Find explicit expressions for $D(\tau, 1/2)$ and $E_1(\tau, p)$.
Exercise: Find an upper bound on the probability of error for the optimal rule that you found for the problem in the exercise above (biases $p_0$ versus $p_1$).
It is time for us to generalize Problem 2. That problem involved tossing one of the coins $n$ times, resulting in $n$ i.i.d. rvs. Unlike the estimation problem, the distribution of these rvs depends on the coin chosen: $\text{Bern}(1/2)$ or $\text{Bern}(p)$. This leads to the following generalization: based on $n$ i.i.d. samples drawn from either a distribution $P_0$ (under hypothesis $H_0$) or a distribution $P_1$ (under hypothesis $H_1$), decide which hypothesis is true.
The error event is (exactly analogous to the coin-toss case):
To compute the probability of error, we first note that the two events in the union are disjoint. Thus,
$$P_e = \Pr\big(\{\text{Say } H_1\} \cap H_0\big) + \Pr\big(\{\text{Say } H_0\} \cap H_1\big).$$
Now, using the definitions of "Say $H_0$" and "Say $H_1$" in terms of the decision rule $\phi$, we get the following:
$$P_e = \pi_0 \Pr\{\phi = 1 \mid H_0\} + \pi_1 \Pr\{\phi = 0 \mid H_1\}.$$
The rest of the proof is very similar to the coin tossing example, and hence is left as an exercise. The final result is the following.
Theorem: For the general problem, the optimal way of deciding which hypothesis is true is through the following log-likelihood ratio (LLR) rule.
Discrete case:
$$\log \frac{\Pr\{X^n = x^n \mid H_1\}}{\Pr\{X^n = x^n \mid H_0\}} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \log\frac{\pi_0}{\pi_1}.$$
Continuous case:
$$\log \frac{f_1(x^n)}{f_0(x^n)} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \log\frac{\pi_0}{\pi_1},$$
where $f_0$ and $f_1$ are the joint densities of the observations under $H_0$ and $H_1$, respectively.
Example: Say you get to observe $n$ i.i.d. samples either from an exponential distribution with rate $\lambda_0$ or from one with rate $\lambda_1 > \lambda_0$. Further, assume that $\pi_0 = \pi_1 = 1/2$. Find the optimal decision rule.
Solution: Since the rvs involved are continuous, we need to consider the second equation in the theorem. Calling the joint density corresponding to rate $\lambda_1$ $f_1$, we have
$$f_1(x^n) = \prod_{i=1}^n \lambda_1 e^{-\lambda_1 x_i} = \lambda_1^n\, e^{-\lambda_1 \sum_{i=1}^n x_i}.$$
Further,
$$f_0(x^n) = \lambda_0^n\, e^{-\lambda_0 \sum_{i=1}^n x_i}.$$
Taking the ratio of the above two and then taking the logarithm results in the following rule: decide in favor of rate $\lambda_1$ if
$$\frac{1}{n}\sum_{i=1}^n x_i \le \tau,$$
and in favor of rate $\lambda_0$ otherwise,
where $\tau := \frac{\log(\lambda_1/\lambda_0)}{\lambda_1 - \lambda_0}$.
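A quick numerical sketch of this test (the rates $\lambda_0 = 1$ and $\lambda_1 = 2$ are illustrative assumptions):

```python
import numpy as np

def exp_llr_test(x, lam0=1.0, lam1=2.0):
    """Decide H1 (rate lam1) vs H0 (rate lam0); equal priors, lam1 > lam0."""
    tau = np.log(lam1 / lam0) / (lam1 - lam0)
    return int(np.mean(x) <= tau)     # a larger rate means a smaller mean

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=50)   # samples truly from rate 2
print("decision:", exp_llr_test(x))           # 1 means "rate lam1"
```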
Practice Question: Try using different values of $\lambda_0$ and $\lambda_1$ (especially the ones given in the example above) and interpret the resulting threshold.
Practice Question: Find the probability of error for the above decision rule.
Example: Say you get to observe $n$ i.i.d. samples either from $\mathcal{N}(\mu_0, \sigma^2)$ or from $\mathcal{N}(\mu_1, \sigma^2)$ with $\mu_1 > \mu_0$. Further, assume that $\pi_0 = \pi_1 = 1/2$. Find the decision rule. Also, find the probability of error.
The solution to the above problem will be of the form
$$\frac{1}{n}\sum_{i=1}^n x_i \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \tau,$$
where $\tau$ is the appropriate threshold for the Gaussian case.
Practice Problem: Find the threshold for the problem.
Now, let us look at the probability of error. The error probability can be written as
$$P_e = \frac{1}{2}\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i > \tau \,\Big|\, H_0\Big\} + \frac{1}{2}\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \le \tau \,\Big|\, H_1\Big\}.$$
From here on, it is downhill, as we have already seen how to solve such problems in the context of Bernoulli rvs. The trick is the same. First, let us consider the Chernoff bound for the first term above, i.e., $\Pr\{\frac{1}{n}\sum_{i=1}^n X_i > \tau \mid H_0\}$.
The following is a way of making the document fancy.
Lemma (Moment Generating Function (MGF)): For $X \sim \mathcal{N}(\mu, \sigma^2)$,
$$\mathbb{E}[e^{\lambda X}] = e^{\lambda\mu + \lambda^2\sigma^2/2}.$$
Practice Problem: Let . Find the MGF of .
The name given to $\mathbb{E}[e^{\lambda X}]$, viewed as a function of $\lambda$, is the MGF. More on this later.
Practice problem: Prove the above Lemma.
Using this Lemma, the above can be simplified to
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i > \tau \,\Big|\, H_0\Big\} \le \exp\Big(-\frac{n(\tau - \mu_0)^2}{2\sigma^2}\Big),$$
valid when $\tau > \mu_0$.
Now, an easy exercise (literally following the same steps as above) with both sides multiplied by $-1$ results in (with a minor restriction that $\tau < \mu_1$; always satisfied in the asymptotics)
$$\Pr\Big\{\frac{1}{n}\sum_{i=1}^n X_i \le \tau \,\Big|\, H_1\Big\} \le \exp\Big(-\frac{n(\mu_1 - \tau)^2}{2\sigma^2}\Big).$$
Practice Problem: Show the above inequality.
Order-wise, the probability of error becomes
$$P_e \le \exp\Big(-\frac{n(\mu_1 - \mu_0)^2}{8\sigma^2}\Big),$$
taking $\tau$ to be the midpoint $(\mu_0 + \mu_1)/2$ for equal priors.
So far, we have been looking at a scenario where the hypotheses are picked randomly with some known distribution. It is possible that the hypotheses may be picked randomly but we do not know the probabilities. One fix is to assume the worst case and design for it. Intuitively, the worst case is $\pi_0 = \pi_1 = 1/2$. An alternative to this is to solve a problem that balances the various errors. Towards explaining this, we will consider two kinds of errors, popularly known as the probability of detection and the probability of false alarm. More precisely,
$$P_D := \Pr\{\phi = 1 \mid H_1\}$$
and
$$P_{FA} := \Pr\{\phi = 1 \mid H_0\}.$$
Ideally, the objective should be to make $P_D$ as high as possible and $P_{FA}$ as low as possible. Unfortunately, this is not possible simultaneously; if you want to make $P_D$ high (close to $1$), then decide in favor of $H_1$ always! However, this results in a false alarm probability of $1$. On the other hand, if we want to make the false alarm as small as possible, a possibility is to decide in favor of $H_0$ all the time. In this case, the detection probability will be $0$! Thus, there exists a tradeoff between the two. One possible way to balance this tradeoff is to solve the following problem:
$$\max_{\phi}\; P_D \quad \text{subject to} \quad P_{FA} \le \alpha$$
for some $\alpha \in (0,1)$. The solution to the above is called the Neyman-Pearson test. In this course, we will not look at how to solve this problem, but we will state the solution.
Lemma (Neyman-Pearson (NP) Lemma): The solution to the above problem is the log-likelihood ratio test given below.
Discrete case:
$$\log \frac{P_1(x^n)}{P_0(x^n)} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \tau.$$
Continuous case:
$$\log \frac{f_1(x^n)}{f_0(x^n)} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \tau.$$
In the above, the threshold $\tau$ is chosen so that the constraint on the false alarm probability, namely $P_{FA} = \alpha$, is satisfied.
The proof of the above can be found in the famous "Elements of Information Theory" by Cover and Thomas, and in many other books on detection and estimation theory. Now, let us try to apply this solution to the following radar detection problem.
Example (Radar target detection): Consider a radar that sends a short pulse which gets reflected from a target (say, a plane), and the radar observes the reflection. In fact, we are assuming that the pulses are really short, so that the motion of the aircraft is negligible! That is, the time difference between the pulses is so short that the airplane would have hardly moved. This is theory, so we need not worry about the practicality :-) If you did not follow the above, don't worry, ignore it and move on (then why did I write this? hmm...)!
Assume that it sends $n$ pulses one after the other. The received samples can be modelled as follows:
$$Y_i = \begin{cases} \mu + W_i, & \text{target present}, \\ W_i, & \text{target not present}, \end{cases} \qquad i = 1, \ldots, n,$$
where $W_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, and $\mu$ is a fixed number which represents the distance from the radar to the target. It is not so difficult to realize that we cannot assign probabilities to the two hypotheses, viz., target present and target not present. Naturally, NP is the way to go. It is easy to see that under the "target present" hypothesis, the distribution of $Y_i$ is $\mathcal{N}(\mu, \sigma^2)$, while under "target not present", the distribution of $Y_i$ becomes $\mathcal{N}(0, \sigma^2)$: pure noise. Noting that the $Y_i$'s are i.i.d., and applying the NP Lemma, we get the rule: decide "target present" if
$$\frac{1}{n}\sum_{i=1}^n y_i \ge \tau.$$
Note the change in the inequality. Now, we need to find the threshold $\tau$ such that the false alarm constraint is satisfied. Let us first compute the false alarm probability:
$$P_{FA} = \Pr\Big\{\frac{1}{n}\sum_{i=1}^n Y_i \ge \tau \,\Big|\, \text{target not present}\Big\} = \Pr\Big\{\frac{1}{n}\sum_{i=1}^n W_i \ge \tau\Big\}.$$
In general, one can use the Chernoff technique to bound the above. In particular,
$$P_{FA} \le \exp\Big(-\frac{n\tau^2}{2\sigma^2}\Big),$$
using the Gaussian MGF as before (valid for $\tau > 0$).
Now, we need $P_{FA} \le \alpha$, which is satisfied if $\exp\big(-\frac{n\tau^2}{2\sigma^2}\big) \le \alpha$. This can be achieved by choosing $\tau \ge \sqrt{\frac{2\sigma^2}{n}\log\frac{1}{\alpha}}$. There are several approaches to the above problem. An alternate approach is to use the fact that the sum of independent Gaussians is Gaussian. We will first show that this is indeed true.
Sum of Gaussians is Gaussian: Let $X$ and $Y$ be two independent Gaussian random variables with zero means and variance $\sigma^2$, and let $Z := X + Y$. Let us first write the CDF of $Z$ using the total probability law for continuous rvs:
$$F_Z(z) = \Pr\{X + Y \le z\} = \int_{-\infty}^{\infty} \Pr\{X + Y \le z \mid Y = y\}\, f_Y(y)\, dy.$$
A term in the integrand can be written as
$$\Pr\{X + Y \le z \mid Y = y\} = \Pr\{X \le z - y \mid Y = y\} = \Pr\{X \le z - y\} = F_X(z - y),$$
where the second equality follows from the fact that $X$ and $Y$ are independent. Substituting this, we get
$$F_Z(z) = \int_{-\infty}^{\infty} F_X(z - y)\, f_Y(y)\, dy.$$
Now, taking the derivative with respect to $z$, we get
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\, dy.$$
Substituting the Gaussian PDFs and performing a long, laborious piece of algebra, we get the desired result, which is left as an exercise.
Practice Problem: Show that the above integral results in a Gaussian density, i.e., $Z \sim \mathcal{N}(0, 2\sigma^2)$.
The second method is to use the moment method, i.e., the MGF. This will not be dealt with here.
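Returning to the radar example: since $\frac{1}{n}\sum_i W_i \sim \mathcal{N}(0, \sigma^2/n)$, the threshold can also be set exactly using the Gaussian tail function instead of the Chernoff bound. A minimal sketch (the noise level, $n$, and $\alpha$ are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

def np_radar_threshold(n, sigma, alpha):
    """Exact NP threshold: P(N(0, sigma^2/n) > tau) = alpha."""
    return norm.isf(alpha, loc=0.0, scale=sigma / np.sqrt(n))

n, sigma, alpha = 100, 1.0, 0.01      # illustrative values
tau = np_radar_threshold(n, sigma, alpha)
print("tau =", tau)

# Monte Carlo check of the false-alarm rate under "target not present".
rng = np.random.default_rng(0)
noise_means = rng.normal(0, sigma, size=(20000, n)).mean(axis=1)
print("empirical P_FA:", np.mean(noise_means > tau))
```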
Exercise: Solve the following problem taking the NP approach:
Let us revisit detection problems a little later. Now, we will look at the estimation problem, and some "interesting" things that we can do there.
You may have noted that the optimal decision rules so far have turned out to be intuitive. Then the question is: why do we need the theory!?
As an example, consider the following hypothesis testing problem:
$$H_0: X_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_0^2) \quad \text{versus} \quad H_1: X_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_1^2).$$
What are the mean and variance under the two hypotheses? The mean is zero for both, while the variance is $\sigma_1^2$ under the alternate and $\sigma_0^2$ under the null hypothesis. Intuitively, the test should depend on the variance. But what statistic do we have to use? Maybe the empirical variance! It works to some extent, but it may not work as well as the LLR-based test, since the latter is optimal.
First, let us try to generalize Problem 1. Note that the solution involved tossing the coin $n$ times, resulting in $n$ i.i.d. rvs with a $\text{Bern}(p)$ distribution; the task is to find the parameter $p$. Similarly, in the exercise above, samples from a uniform distribution $\mathcal{U}[0, \theta]$ were given, and the task was to find the unknown parameter $\theta$. This leads to the following generalization of the estimation problem: given $n$ i.i.d. samples $X_1, \ldots, X_n$ from a distribution that depends on an unknown parameter $\theta$, estimate $\theta$.
How do we solve this general problem? As in the example above, first we need to fix a metric with respect to which we find our estimate. Of course, we will start with the MSE metric. In particular, given an estimator $\hat\theta(X_1, \ldots, X_n)$ (in general, this can be vector-valued) which depends on the observations, the MSE metric is
$$\mathrm{MSE}(\hat\theta) := \mathbb{E}\big[(\theta - \hat\theta(X_1, \ldots, X_n))^2\big],$$
where the expectation is with respect to the joint distribution of $(X_1, \ldots, X_n, \theta)$. One wishes to solve the following problem:
$$\min_{\hat\theta}\; \mathbb{E}\big[(\theta - \hat\theta(X_1, \ldots, X_n))^2\big].$$
It is important to note that $\theta$ has to be random for the above to make sense; in this case, the expectation above is also with respect to $\theta$. It is possible that $\theta$ is not random. For example, in Problem 1 above, the bias $p$ is a deterministic but unknown quantity. The approach that we are going to describe, with the MSE as the metric, does not work when $\theta$ is not random. More on this later.
Simple things first: what if you don't get to observe anything, but you have to make an estimate of $\theta$, where $\theta$ is drawn according to a known distribution? In this case, we would like to solve the following problem:
$$\min_{a}\; \mathbb{E}(\theta - a)^2.$$
Note that the estimate $a$ does not depend on the number of samples, as the estimator does not have access to any samples! First, let us expand:
$$\mathbb{E}(\theta - a)^2 = \mathbb{E}[\theta^2] - 2a\,\mathbb{E}[\theta] + a^2.$$
Minimizing the above with respect to $a$, we get $a^* = \mathbb{E}[\theta]$. This is intuitively very appealing, as one would bet on the mean when nothing else is known!
What if we are given samples $X_1, \ldots, X_n$ that can potentially depend on $\theta$? How do we solve such problems? Before doing this, let us recall what we want out of an estimator: it should (ideally) be unbiased, and it should have a small MSE.
I urge the reader to compare this with the list that we had when we studied Problem 1. It looks very similar, isn't it? Now, let us make some of these notions more formal.
Definition (estimator): Given observations $X_1, X_2, \ldots, X_n$ taking values in a set $\mathcal{X}$, an estimator is a map $\hat\theta: \mathcal{X}^n \to \Theta$, where $\Theta$ is the set of possible parameter values.
Definition (unbiased estimator): We say that an estimator $\hat\theta$ is an unbiased estimator if $\mathbb{E}[\hat\theta \mid \theta] = \theta$, the true value.
Note that there is a conditioning with respect to $\theta$. This is because $\theta$ itself is sampled from a distribution. A straightforward way of coming up with an estimator is by taking a linear combination of the observations, i.e.,
$$\hat\theta = \sum_{i=1}^n a_i X_i.$$
The above estimator is called a linear estimator, for obvious reasons. Now, the goal is to solve the following problem:
$$\min_{a_1, \ldots, a_n}\; \mathbb{E}\Big\|\theta - \sum_{i=1}^n a_i X_i\Big\|^2.$$
Note that we have used the norm above, which takes care of vector-valued estimators. However, to begin with, let us assume that the estimator is scalar, so that the norm above is an absolute value. The first thing to note is that the above objective (as a function of each $a_i$) is convex. Thus, differentiating with respect to each $a_j$ and equating to zero results in
$$\mathbb{E}\Big[\Big(\theta - \sum_{i=1}^n a_i X_i\Big) X_j\Big] = 0, \quad j = 1, \ldots, n.$$
Using the linearity of expectation, the above can be written as
$$\sum_{i=1}^n a_i\,\mathbb{E}[X_i X_j] = \mathbb{E}[\theta X_j], \quad j = 1, \ldots, n.$$
Matrix algebra comes in handy while dealing with the above. In particular, defining (i) a matrix $\mathbf{R}$ whose $(i,j)$-th entry is $\mathbb{E}[X_i X_j]$, (ii) a vector $\mathbf{a}$ whose $i$-th entry is $a_i$, and (iii) a vector $\mathbf{r}$ whose $j$-th entry is $\mathbb{E}[\theta X_j]$, leads to the following:
$$\mathbf{R}\mathbf{a} = \mathbf{r}.$$
The above leads to the following solution, also known as the Linear Minimum Mean Squared Error (LMMSE) estimator:
$$\mathbf{a}^* = \mathbf{R}^{-1}\mathbf{r}, \qquad \hat\theta_{\text{LMMSE}} = \sum_{i=1}^n a_i^* X_i.$$
If the inverse of the matrix $\mathbf{R}$ does not exist, then one can use what is called the pseudo-inverse. For now, let us not worry about such technicalities, but stick to the above solution.
It is possible that the above is not the best solution overall. However, it is indeed the best among the simplest (linear) solutions. Note that the above is not, in general, an unbiased estimator.
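A small numerical sketch of the LMMSE recipe (the model $X_i = \theta + W_i$ with Gaussian $\theta$ and noise, and all the numbers, are illustrative assumptions; the covariances are written down directly from the model):

```python
import numpy as np

rng = np.random.default_rng(0)
n, var_theta, var_w = 5, 2.0, 1.0     # illustrative model parameters

# Model: X_i = theta + W_i, theta ~ N(0, var_theta), W_i ~ N(0, var_w) i.i.d.
# Then E[X_i X_j] = var_theta + var_w * (i == j) and E[theta X_j] = var_theta.
R = var_theta * np.ones((n, n)) + var_w * np.eye(n)
r = var_theta * np.ones(n)
a = np.linalg.solve(R, r)             # LMMSE weights a = R^{-1} r
print("weights:", a)                  # equal weights var_theta/(n*var_theta + var_w)

# Apply the estimator to one simulated draw.
theta = rng.normal(0, np.sqrt(var_theta))
x = theta + rng.normal(0, np.sqrt(var_w), size=n)
print("theta =", theta, " LMMSE estimate =", a @ x)
```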
Geometric Interpretation: In order to appreciate and understand the LMMSE better, let us consider a problem where the randomness is stripped off, i.e., the observations $x_1, \ldots, x_n$ and the target $y$ are deterministic vectors lying in $\mathbb{R}^3$.
(Figure: the geometry of Least Squares (LS).)
Suppose we have observations $x_1, \ldots, x_n$ lying in $\mathbb{R}^3$ (for concreteness, assume that their span is a plane), as shown in the figure above (labelled as the span of the $x_i$'s). We want an estimate in the plane that is closest to $y$. For simplicity, let the estimate be a linear combination of these observations, i.e., $\hat y = \sum_{i=1}^n a_i x_i$, which hence lies in the plane. See how linear algebra is useful here! Thus, we need to adjust the weights $a_i$ such that the distance between the linear combination and $y$ is small, i.e., we need to solve the following problem:
$$\min_{a_1, \ldots, a_n}\; \Big\|y - \sum_{i=1}^n a_i x_i\Big\|^2.$$
Intuitively, the optimal estimate would be the vector obtained by dropping a perpendicular from $y$ onto the plane, i.e., the error is orthogonal to every observation:
$$\Big(y - \sum_{i=1}^n a_i x_i\Big) \perp x_j, \quad j = 1, \ldots, n.$$
In other words,
$$\Big\langle y - \sum_{i=1}^n a_i x_i,\; x_j \Big\rangle = 0, \quad j = 1, \ldots, n.$$
Solving the above leads to the following:
$$\sum_{i=1}^n a_i \langle x_i, x_j \rangle = \langle y, x_j \rangle, \quad j = 1, \ldots, n.$$
Using $\mathbf{a} := (a_1, \ldots, a_n)^T$, letting the $(i,j)$-th entry of a matrix $\mathbf{G}$ be $\langle x_i, x_j \rangle$, and letting the $j$-th entry of a vector $\mathbf{g}$ be $\langle y, x_j \rangle$, the above can be written compactly.
Note that the matrix $\mathbf{G}$ is a symmetric matrix. The above leads to $\mathbf{G}\mathbf{a} = \mathbf{g}$, which gives $\mathbf{a}^* = \mathbf{G}^{-1}\mathbf{g}$; a solution similar to the LMMSE solution. This solution is called the Least Squares (LS) solution, and the corresponding problem is called an LS problem.
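A minimal numerical sketch of the LS solution (the vectors are illustrative); the normal equations are solved directly, and `np.linalg.lstsq`, which falls back on the pseudo-inverse in the rank-deficient case, gives the same answer:

```python
import numpy as np

# Illustrative data: two observation vectors in R^3 (columns of X) and a target y.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

# Normal equations: G a = g with G = X^T X (Gram matrix), g = X^T y.
G, g = X.T @ X, X.T @ y
a = np.linalg.solve(G, g)
print("weights:", a)

# Same solution via lstsq.
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq weights:", a_lstsq)
```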
In the LMMSE setting, the difficulty is that $\theta$ and the $X_i$'s are random, and hence there is no immediate meaning for "perpendicular". An obvious approach is to bring in the structure of a vector space with an inner product, in which case the notion of perpendicularity makes sense. The following is an aside, which is not a part of this course; however, interested readers can go through it.
We use the following approach: view rvs as vectors, with the inner product between two rvs $U$ and $V$ defined as $\langle U, V \rangle := \mathbb{E}[UV]$, and the norm $\|U\| := \sqrt{\mathbb{E}[U^2]}$.
Going by analogy with the LS problem, we can guess the solution to be
$$\mathbf{a}^* = \mathbf{R}^{-1}\mathbf{r},$$
where the $(i,j)$-th entry of $\mathbf{R}$ is $\langle X_i, X_j \rangle = \mathbb{E}[X_i X_j]$ and the $j$-th entry of $\mathbf{r}$ is $\langle \theta, X_j \rangle = \mathbb{E}[\theta X_j]$. Now, the final solution is similar to that of the geometric problem above.
Example (LMMSE): Let $X_i = \theta + W_i$, where $\theta$ is sampled from some distribution independently of the zero-mean noise rvs $W_i$. In this case, the matrix $\mathbf{R}$ has $\mathbb{E}[\theta^2]$ in the off-diagonal entries and $\mathbb{E}[\theta^2] + \mathbb{E}[W_i^2]$ on the diagonal (why?). This leads to an estimator that is biased.
How do we construct an unbiased estimator along the lines of the LMMSE? One way is to modify the above estimator to the following affine form:
$$\hat\theta = \sum_{i=1}^n a_i X_i + b.$$
Now the optimization problem becomes
$$\min_{a_1, \ldots, a_n,\, b}\; \mathbb{E}\Big(\theta - \sum_{i=1}^n a_i X_i - b\Big)^2.$$
A similar approach as above can be followed to solve the problem.
Exercise: Solve the problem above, and find the optimal $a_i$'s and $b$.
Next, we will look at the general solution to the Minimum MSE (MMSE) problem above, i.e.,
$$\min_{\hat\theta(\cdot)}\; \mathbb{E}\big(\theta - \hat\theta(X_1, \ldots, X_n)\big)^2,$$
where the minimization is over all functions of the observations.
Assume that $g(X_1, \ldots, X_n)$ is any estimator, and consider the following estimator, which depends on the conditional mean:
$$\hat\theta_{\text{MMSE}} := \mathbb{E}[\theta \mid X_1, \ldots, X_n].$$
Now, we show that the above estimator solves the MMSE problem.
Theorem: The conditional mean, i.e., $\mathbb{E}[\theta \mid X_1, \ldots, X_n]$, is the optimal solution to the MMSE problem.
Proof: Writing $X^n := (X_1, \ldots, X_n)$, consider the following for any estimator $g(X^n)$:
$$\mathbb{E}\big(\theta - g(X^n)\big)^2 = \mathbb{E}\big(\theta - \mathbb{E}[\theta \mid X^n] + \mathbb{E}[\theta \mid X^n] - g(X^n)\big)^2 = \mathbb{E}\big(\theta - \mathbb{E}[\theta \mid X^n]\big)^2 + \mathbb{E}\big(\mathbb{E}[\theta \mid X^n] - g(X^n)\big)^2 \ge \mathbb{E}\big(\theta - \mathbb{E}[\theta \mid X^n]\big)^2,$$
where the cross term vanishes by the tower property of conditional expectation.
Hence $\mathbb{E}[\theta \mid X^n]$ is indeed the optimal estimator.
Now, it remains to solve the following problem:
$$\min_{a}\; \mathbb{E}\big[(\theta - a)^2 \mid X^n = x^n\big].$$
Theorem: The solution to the above problem is given by $a^* = \mathbb{E}[\theta \mid X^n = x^n]$.
Proof: First, let us expand the conditional expectation:
$$\mathbb{E}\big[(\theta - a)^2 \mid X^n = x^n\big] = \mathbb{E}[\theta^2 \mid X^n = x^n] - 2a\,\mathbb{E}[\theta \mid X^n = x^n] + a^2.$$
Now, "differentiating with respect to $a$": what do you mean by this? For now, let us be a little sloppy and do the usual differentiation; equating the derivative to zero proves the theorem.
Based on our intuition from the geometry and the LS problem, we know that the error is orthogonal to the estimate. In our new language of inner products and norms, we expect the following, which is indeed true!
Lemma: The error is uncorrelated with the estimate, i.e.,
$$\mathbb{E}\big[\big(\theta - \mathbb{E}[\theta \mid X^n]\big)\,\mathbb{E}[\theta \mid X^n]\big] = 0.$$
Proof: Follows by using conditional expectation. Easy exercise!
Example: Consider the following density
Find the MMSE estimator of given the observation .
Example: Consider $Y = X + W$, where $X \sim \mathcal{N}(0, \sigma_X^2)$ and $W \sim \mathcal{N}(0, \sigma_W^2)$ are independent Gaussians. We get to observe $Y$. Compute the MMSE estimate of $X$ given $Y$.
Solution: The MMSE estimate is $\mathbb{E}[X \mid Y]$. This requires us to compute the conditional density $f_{X|Y}(x \mid y)$, which is given by
$$f_{X|Y}(x \mid y) = \frac{f_{Y|X}(y \mid x)\, f_X(x)}{f_Y(y)}.$$
In the above, $f_{Y|X}(y \mid x)$ is the density of $\mathcal{N}(x, \sigma_W^2)$ evaluated at $y$, since conditioned on $X = x$, we have $Y = x + W$. Using this, we get
$$f_{X|Y}(x \mid y) = \frac{1}{f_Y(y)} \cdot \frac{1}{\sqrt{2\pi\sigma_W^2}}\, e^{-\frac{(y - x)^2}{2\sigma_W^2}} \cdot \frac{1}{\sqrt{2\pi\sigma_X^2}}\, e^{-\frac{x^2}{2\sigma_X^2}},$$
where we use the fact that the density of $Y$ is again Gaussian (why?), with $Y \sim \mathcal{N}(0, \sigma_X^2 + \sigma_W^2)$. Completing the square shows that the distribution of $X$ conditioned on $Y = y$ is again Gaussian, with mean $\frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\, y$ and variance $\frac{\sigma_X^2 \sigma_W^2}{\sigma_X^2 + \sigma_W^2}$. Now, the conditional expectation is easy to compute; it is simply the mean of $X$ conditioned on $Y = y$. The MMSE estimate is simply a scaling of the observation:
$$\hat X_{\text{MMSE}} = \mathbb{E}[X \mid Y] = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\, Y.$$
A few observations are in order:
The estimate is biased, i.e., $\mathbb{E}[\hat X_{\text{MMSE}} \mid X] = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\, X \neq X$!
The MSE of the estimator is
$$\mathbb{E}\big(X - \hat X_{\text{MMSE}}\big)^2 = \frac{\sigma_X^2 \sigma_W^2}{\sigma_X^2 + \sigma_W^2}.$$
Error is uncorrelated with the estimate! Check this yourself!
A natural question to ask is: how does the above compare with LS? Since there is only one observation, finding the LS estimate is easy. In particular,
$$\hat X_{\text{LS}} = Y.$$
It is clear that LS is an unbiased estimator while the MMSE estimator is not. Further, the MSE of LS is $\sigma_W^2$! Order-wise, both MMSE and LS are the same. See the figure below for a comparison. In fact, observe that at very high SNR $\sigma_X^2/\sigma_W^2$, i.e., in the low-noise regime, MMSE and LS both approach the same performance, as expected.
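The comparison in the figure is easy to reproduce; the sketch below (unit signal power is an illustrative choice) plots both MSEs against the SNR $\sigma_X^2/\sigma_W^2$:

```python
import numpy as np
import matplotlib.pyplot as plt

var_x = 1.0                                   # illustrative signal power
snr = np.logspace(-1, 2, 50)                  # SNR = var_x / var_w
var_w = var_x / snr

mse_mmse = var_x * var_w / (var_x + var_w)    # MMSE error (derived above)
mse_ls = var_w                                # LS error (X_hat = Y)

plt.loglog(snr, mse_mmse, label="MMSE")
plt.loglog(snr, mse_ls, "--", label="LS")
plt.xlabel("SNR"); plt.ylabel("MSE"); plt.legend(); plt.show()
```

At high SNR the two curves merge, matching the observation above.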
Recall that when we started this course, we listed the requirements on an estimator: it should be unbiased and also have good MSE performance. If the estimator is unbiased, then the MSE is the variance. Hence, we would like a Minimum Variance Unbiased Estimator (MVUE). This is beyond the scope of this course, and will be dealt with in more detail in other courses.
Exercise: Show that LS and MMSE result in the same estimator for the above model (Gaussian). Also, convince yourself why the MMSE curve is not linear while the LMMSE curve is linear.
Exercise: Consider the following joint distribution of and .
Exercise: Let and have joint PDF
Find the MMSE estimate of given the observation .
It is evident from the above discussion that if $\theta$ is deterministic and unknown, the MSE cannot be used as a metric to find an estimate of $\theta$. We use the following example to introduce a method of estimation called the Maximum Likelihood (ML) method; a method similar to what is done in the hypothesis testing case, but with a small twist.
Example: Consider Problem 1, where we are supposed to find an estimate of $p$ given the $n$ tosses $X_1, \ldots, X_n$. But $p$ is deterministic and unknown! Let us look at the likelihood of observing the outcomes $x_1, \ldots, x_n$ when a coin with bias $q$ is tossed $n$ times:
$$L(q) := q^{\sum_{i=1}^n x_i}\,(1 - q)^{n - \sum_{i=1}^n x_i}.$$
A natural choice of $q$ would be something close to $p$, as $p$ has generated these samples. However, I do not know $p$. But, intuitively, I should choose a $q$ that maximizes the likelihood above, i.e.,
$$\hat p = \arg\max_{q \in [0,1]}\; q^{\sum_i x_i}\,(1 - q)^{n - \sum_i x_i}.$$
Solving the above leads to the following solution; this is called the maximum likelihood estimate, or ML estimate for short:
$$\hat p_{\text{ML}} = \frac{1}{n}\sum_{i=1}^n x_i.$$
One can generalize the above and say that if $X_i \overset{\text{i.i.d.}}{\sim} p_\theta$ (a PMF) or $X_i \overset{\text{i.i.d.}}{\sim} f_\theta$ (a PDF), then the ML estimate is given by
$$\hat\theta_{\text{ML}} = \arg\max_{\theta} \prod_{i=1}^n p_\theta(x_i) \quad \text{or} \quad \hat\theta_{\text{ML}} = \arg\max_{\theta} \prod_{i=1}^n f_\theta(x_i).$$
Now, we have many questions. Is the ML estimate unbiased? Does it lead to a very low MSE when $n$ is large? Before answering these questions, let us get used to ML by considering more examples.
Example: Let $X_1, \ldots, X_n$ be i.i.d. from $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known. Find an ML estimate of $\mu$.
First, we need to find the joint density of the rvs:
$$f_\mu(x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(x_i - \mu)^2}{2\sigma^2}\Big).$$
The ML estimate of $\mu$ is given by
$$\hat\mu_{\text{ML}} = \arg\max_{\mu} f_\mu(x_1, \ldots, x_n).$$
Taking the log, differentiating, and equating to zero results in $\hat\mu_{\text{ML}} = \frac{1}{n}\sum_{i=1}^n x_i$; an intuitively appealing solution.
Example: Let $X_1, \ldots, X_n$ be i.i.d. from an exponential distribution with rate $\lambda$. Find an ML estimate of $\lambda$.
First, we need to find the joint density of the rvs:
$$f_\lambda(x_1, \ldots, x_n) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n\, e^{-\lambda \sum_{i=1}^n x_i}, \quad x_i \ge 0.$$
The ML estimate of $\lambda$ is given by
$$\hat\lambda_{\text{ML}} = \arg\max_{\lambda > 0}\; \lambda^n\, e^{-\lambda \sum_i x_i}.$$
Taking the log, differentiating, and equating to zero results in $\hat\lambda_{\text{ML}} = \frac{n}{\sum_{i=1}^n x_i}$, the reciprocal of the sample mean; again an intuitively appealing solution.
Example: Let $X_1, \ldots, X_n$ be i.i.d. from $\mathcal{U}[0, \theta]$. Find an ML estimate of $\theta$.
First, we need to find the joint density of the rvs:
$$f_\theta(x_1, \ldots, x_n) = \frac{1}{\theta^n}\,\mathbb{1}\Big\{\max_{1 \le i \le n} x_i \le \theta\Big\}\,\mathbb{1}\Big\{\min_{1 \le i \le n} x_i \ge 0\Big\}.$$
Thus, maximizing the above to get the ML estimate amounts to solving the problem below:
$$\hat\theta_{\text{ML}} = \arg\max_{\theta \ge \max_i x_i}\; \frac{1}{\theta^n}.$$
The solution is trivial, i.e., $\hat\theta_{\text{ML}} = \max_{1 \le i \le n} x_i$; again an intuitively appealing solution.
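A short simulation (the true $\theta = 2$ is an illustrative assumption) shows that this ML estimator systematically underestimates $\theta$, i.e., it is biased, although the bias vanishes as $n$ grows (in fact $\mathbb{E}[\max_i X_i] = \frac{n}{n+1}\theta$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, trials = 2.0, 10000            # illustrative true parameter

for n in (5, 50, 500):
    # ML estimate for Uniform[0, theta] is the sample maximum.
    est = rng.uniform(0, theta, size=(trials, n)).max(axis=1)
    print(f"n={n:4d}  mean of ML estimate = {est.mean():.4f}  (theta = {theta})")
```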
Practice Problem: Use the MATLAB code provided above to generate samples from a Gaussian distribution with a given mean and variance, and use the samples to fit a Gaussian distribution to them. Think of how to use the ML estimate to do the same. Also, make sure you understand the code.
Practice Problem: Let for and otherwise. Let for . Find MMSE and ML estimates of given a single observation .
Practice Problem: Suppose that the random variable is uniformly distributed between and , denoted . The data is a noisy measurement of given by
where is independent of , and is Laplacian distributed with PDF
Find the minimum mean-square error estimate of $X$ given $Y$. Also, solve $\max_x f_{X|Y}(x \mid y)$; this is called Maximum a Posteriori (MAP) estimation.
Next, we will take a slight detour. So far, we have assumed that the observations are i.i.d., and hence described them through a joint PDF which is a product. However, in general, the observations have to be described through a general joint PDF. This is difficult! So, let us look at (possibly correlated) observations that are Gaussian. We can stack the rvs to get a vector (whose entries are correlated with each other). How do we describe the joint distribution? The answer is as follows (turn this problem into something that we know how to describe, i.e., turn it into one variable):
Definition: A random vector $\mathbf{X} = (X_1, \ldots, X_n)^T$ is said to be jointly Gaussian if any linear combination
$$\sum_{i=1}^n a_i X_i$$
results in a Gaussian random variable for any $a_i \in \mathbb{R}$, $i = 1, \ldots, n$.
The above implies that each $X_i$ is a Gaussian rv.
Definition: We say that $\mathbf{X}$ is jointly Gaussian with mean vector $\boldsymbol\mu$, where $\mu_i = \mathbb{E}[X_i]$, and covariance matrix $\mathbf{C}$, where $C_{ij} = \mathbb{E}[(X_i - \mu_i)(X_j - \mu_j)]$, if the joint density is
$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\,|\mathbf{C}|^{1/2}} \exp\Big(-\frac{1}{2}(\mathbf{x} - \boldsymbol\mu)^T \mathbf{C}^{-1} (\mathbf{x} - \boldsymbol\mu)\Big).$$
Generally, we use the shorthand notation $\mathbf{X} \sim \mathcal{N}(\boldsymbol\mu, \mathbf{C})$. In the above, $|\mathbf{C}|$ denotes the determinant of the matrix. Note that $\mathbf{C}$ is a symmetric positive semidefinite matrix (why?).
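A quick numerical illustration of the definition (all the numbers are illustrative assumptions): draw samples of a jointly Gaussian vector and check that an arbitrary linear combination $\mathbf{a}^T\mathbf{X}$ has mean $\mathbf{a}^T\boldsymbol\mu$ and variance $\mathbf{a}^T\mathbf{C}\mathbf{a}$, as a Gaussian rv should:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])                 # illustrative mean vector
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])                 # illustrative covariance matrix

X = rng.multivariate_normal(mu, C, size=100000)

a = np.array([0.5, 2.0])                   # an arbitrary linear combination
z = X @ a
print("empirical mean/var  :", z.mean(), z.var())
print("theoretical mean/var:", a @ mu, a @ C @ a)
```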
Now, let us consider the following problems, one on detection and the other on estimation.
Detection: $\mathbf{X} \sim \mathcal{N}(\boldsymbol\mu_0, \mathbf{C})$ under $H_0$ and $\mathbf{X} \sim \mathcal{N}(\boldsymbol\mu_1, \mathbf{C})$ under $H_1$. Find the test to distinguish between the two.
Solution: The test is again the LLR, assuming $\pi_0 = \pi_1 = 1/2$:
$$\log \frac{f_1(\mathbf{x})}{f_0(\mathbf{x})} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; 0.$$
Estimation: Let $\mathbf{X} \sim \mathcal{N}(\boldsymbol\mu, \mathbf{C})$ with $\boldsymbol\mu$ deterministic but unknown. Find an estimate of $\boldsymbol\mu$.
Solution: The ML estimate is (MMSE is not possible. Why?):
IMPORTANT NOTE: So far, we have looked at both estimation and detection problems. In these problems, we assumed some kind of distribution. However, in practice, you do not know the distribution. One can come up with tests without knowing the distribution. Based on this, two terms are used in statistics, namely (a) parametric tests or estimation, which refers to the case where you know the distribution (up to some parameters), and (b) non-parametric tests or estimation, where you do not assume any distribution. Hence, the tests/estimators that we considered in these notes so far are called parametric.
Given samples . Suppose we use the estimator . Show that , where is sufficient to ensure:
Suppose a sample is modelled by a Poisson distribution with a parameter denoted by $\lambda$. Estimate the unknown parameter $\lambda$ by using the MLE. Show that asymptotically (in the number of samples), the MSE of the estimate is zero.
You have a coin that you think is biased. You flip it $n$ times and record the resulting sequence of heads and tails. What is the maximum likelihood estimate for the probability of getting heads?
You think a die is rigged so that it will roll less often. You continually roll the die until you get and you have to roll times before this happens. Is your suspicion correct with a confidence level of ?
Two cars A and B are travelling independently in the same direction. The speed of car A is normally distributed with mean km/hr and standard deviation km/hr, while the speed of car B is normally distributed with mean km/hr and standard deviation km/hr. Car A was initially 3 km ahead of car B. Find the probability that after hours, they are within a distance of km from each other.
Let $X$ and $Y$ be two continuous random variables with joint PDF
Find the MMSE estimator of $X$ given $Y$. Also, find the MLE of $X$. Comment on which is better in terms of computation and MSE.
Consider the linear regression model $y_i = a + b x_i + \epsilon_i$ for $i = 1, \ldots, n$, where $a$ and $b$ are unknown parameters and the $\epsilon_i$ are independent and identically distributed random variables having a $\mathcal{N}(0, \sigma^2)$ distribution. The data on $(x_i, y_i)$ are given in the table below. Find the constants that minimize the squared error.
[Table: data for the linear regression problem.]
Given that the number of products produced in a factory during a week is a random variable with a mean of and a variance of , what can be inferred about the likelihood that the weekly production falls within the range of to products?
Let us consider a random sample $X_1, \ldots, X_n$, where $X_i = 0$ if a randomly selected person does not hold a racing bike, and $X_i = 1$ if a randomly selected person does hold a racing bike. Assuming that the $X_i$ are independent Bernoulli random variables with unknown parameter $p$, find the maximum likelihood estimator of $p$, the proportion of persons who hold a racing bike.
Consider the following hypothesis testing problem:
Find the ML rule for the above.
Not included in the exam (Number of collisions): If we throw $m$ balls into $n$ bins, compute the average number of collisions. Use Chebyshev's inequality to show that:
Next, for the specified choice of the parameters, use Chernoff bounds plus the union bound to bound the probability that no bin has more than one ball.
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables with probability density function
Then which of the following have/has finite expectation?
(i) , (ii) , (iii) , and (iv) .
Assume that you observe one sample whose distribution is
Find the Neyman-Pearson rule to achieve a false alarm rate of $\alpha$. Find the probability of detection ($P_D$). Plot a graph showing $P_D$ versus the probability of false alarm $P_{FA}$.
Let the observation be modelled as
where $W_i$ is the noise and $a$ is a known constant. Find the MMSE estimate of $\theta$ given all the past observations $Y_1, \ldots, Y_n$. Find the MSE of your estimator.