# Sampling and Estimation Problems in Probability

In this article, I address an ***important*** topic: it contains the foundational knowledge that helps you better understand ***MLE*** and ***MAP***, two extremely important methods for ***probabilistic*** models in machine learning.

## 1. What is a sample?

+ The collection, organization, aggregation, and processing of data is called ***descriptive statistics***.
+ Aggregating such statistical data leads to the concept of a ***data sample***.
+ The parent set of all ***data samples*** is the ***population***.
+ Simply put, a ***data sample*** reflects part of the information in the ***population***. Therefore, to obtain complete information about a data-related problem, we would have to operate on the ***population***.

So why do we need ***the definition of a data sample***?

+ Do you have enough resources to ***handle*** such a large dataset?
+ Can you keep up with data changes in such a fast-paced technological era?

You probably already know the answer. ***The answer is no***. Working on the ***population*** is impossible. Therefore, solutions such as ***sampling*** and ***studying samples*** were developed.

### 1.1. Sampling methods

There are many sampling methods, and each is suitable for a different type of data.

#### 1.1.1 Simple Random Sampling

As the name suggests, we randomly select a ***subset*** from the ***population***:

+ With replacement: when a data point is drawn from the population, an identical data point is put back into the population. In other words, ***sampling with replacement*** allows data points to be repeated.
+ Without replacement: each data point can be selected only once.

#### 1.1.2 Cluster Sampling

Divide the population into ***k*** groups, perform random sampling within each of these ***k*** groups, and collect the results to obtain the ***cluster sample***. This method is suitable when the group sizes are highly ***imbalanced***. In machine learning, when splitting the ***training and test sets***, we use a similar idea, called ***stratified sampling***, to avoid ***highly imbalanced group sizes*** between the splits.

#### 1.1.3 Judgment Sampling

This is a sampling method based on expert opinions about the research subject.

### 1.2 Random sample

A sample is called a random sample of size ***n*** from a ***population*** with original random variable $X$ if it is a set of variables $X_1, X_2, ..., X_n$ satisfying:

+ the ***n*** variables $X_1, X_2, ..., X_n$ are independent
+ the ***n*** variables $X_1, X_2, ..., X_n$ have the same probability distribution as $X$

Because the variables are ***independent***, the ***joint probability or density function*** is the ***product of the marginal functions***:

+ If $X$ is a discrete variable:
\begin{equation}
p_n(x_1, x_2, ..., x_n) = P(X_1= x_1, X_2 = x_2, ..., X_n = x_n) = \prod_{i = 1}^np(X_i = x_i)
\end{equation}
+ If $X$ is a continuous variable:
\begin{equation}
f_n(x_1, x_2, ..., x_n) = \prod_{i = 1}^nf(x_i)
\end{equation}

We often hear the term ***statistic***. So ***what is a statistic***? Simply put, a statistic is a function $Y = g(X_1, X_2, ..., X_n)$ of the values of a random sample.

***Example:***

+ $g(X_1, X_2, ..., X_n) = \frac{1}{n} \sum\limits_{i = 1}^nX_i = \bar{X}$ is a statistic
+ $g(X_1, X_2, ..., X_n) = \frac{1}{n} \sum\limits_{i = 1}^n(X_i - \bar{X})^2$ is also a statistic

***Since $X_i$ is the random variable whose observed value is $x_i$, I will write $X_i$ instead of $x_i$ in the formulas below for readability.***
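To make these ideas concrete, here is a minimal Python sketch (a hypothetical setup: NumPy's random generator, an invented population array, and a sample size of 50 chosen purely for illustration) that draws simple random samples with and without replacement and evaluates the two statistics above on the result.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up "population" of 10,000 values (for illustration only)
population = rng.normal(loc=5.0, scale=2.0, size=10_000)

n = 50  # sample size

# Simple random sampling
with_replacement = rng.choice(population, size=n, replace=True)      # points may repeat
without_replacement = rng.choice(population, size=n, replace=False)  # each point at most once

# Two statistics of the sample: the sample mean and the (biased) sample variance
x_bar = with_replacement.mean()                       # (1/n) * sum(X_i)
s2_biased = ((with_replacement - x_bar) ** 2).mean()  # (1/n) * sum((X_i - X_bar)^2)

print(f"sample mean     = {x_bar:.3f}")
print(f"sample variance = {s2_biased:.3f}")
```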
## 2. Sample characteristics

### 2.1 Sample mean

\begin{equation}
\bar{X} = \frac{1}{n} \sum\limits_{i = 1}^nX_i
\end{equation}

Since the $X_i$ are random variables, ***$\bar{X}$ is also a random variable***. Assume the original random variable $X$ ***(the population random variable)*** has $EX = \alpha$ and $VX = \sigma^2$. Because the $X_i$ come from a random sample of the population, they have the same distribution as $X$ (by definition), so $EX_i = \alpha$ and $VX_i = \sigma^2$. Therefore:

+ $E\bar{X} = E[\frac{1}{n} \sum\limits_{i = 1}^nX_i] = \frac{1}{n}(EX_1 + EX_2 + ... + EX_n) = \alpha$
+ $V\bar{X} = \frac{1}{n^2}(VX_1 + VX_2 + ... + VX_n) = \frac{\sigma^2}{n}$ (since the $X_i$ are pairwise independent)

In the previous article, I mentioned that $VX$ ***characterizes data dispersion***. Since $V\bar{X} < VX_i$ for all $1 \le i \le n$ (whenever $n > 1$), ***the values of $\bar{X}$ are more stable around the expectation than those of $X$***.

### 2.2 Sample variance

\begin{equation}
S^2 = \frac{1}{n}\sum\limits_{i = 1}^n(X_i - \bar{X})^2
\end{equation}

We have:

\begin{equation}
S^2 = \frac{1}{n}\sum\limits_{i = 1}^n(X_i - \bar{X})^2 = \frac{1}{n} \sum\limits_{i = 1}^nX_i^2 - \frac{1}{n^2}\left(\sum\limits_{i = 1}^nX_i\right)^2
\end{equation}

Since $(X_1 + X_2 + ... + X_n)^2 = \sum\limits_{i = 1}^nX_i^2 + \sum\limits_{i \ne j}X_iX_j$, we have:

\begin{equation}
S^2 = \frac{n - 1}{n^2}\sum\limits_{i = 1}^nX_i^2 - \frac{1}{n^2}\sum\limits_{i \ne j}X_iX_j
\end{equation}

Because the variables $X_i$ are pairwise independent and identically distributed as $X$:

+ $E(X_iX_j) = EX_iEX_j = \alpha^2$ for $i \ne j$
+ $E(X_i^2) = E(X^2)$

Taking expectations term by term (there are $n$ squared terms and $n(n-1)$ cross terms) gives:

\begin{equation}
ES^2 = \frac{n - 1}{n} VX = \frac{n - 1}{n} \sigma^2
\end{equation}

We see that $ES^2 \ne \sigma^2$, so we rescale $S^2$ so that its expectation equals $\sigma^2$. The result is called the ***unbiased sample variance $s^2$***:

\begin{equation}
s^2 = \frac{n}{n-1}S^2 = \frac{1}{n - 1}\sum\limits_{i = 1}^n(X_i - \bar{X})^2
\end{equation}

Then:

\begin{equation}
Es^2 = \sigma^2
\end{equation}
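The bias of $S^2$ is easy to check numerically. Below is a minimal simulation sketch (assumptions: NumPy is available; the normal population, $\sigma^2 = 4$, $n = 5$, and the number of trials are invented for illustration) that averages both estimators over many samples and compares them with $\frac{n-1}{n}\sigma^2$ and $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma2 = 4.0   # true variance of the population variable X (illustrative)
n = 5          # sample size
trials = 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))

S2 = samples.var(axis=1, ddof=0)  # biased sample variance:   (1/n)     * sum (X_i - X_bar)^2
s2 = samples.var(axis=1, ddof=1)  # unbiased sample variance: (1/(n-1)) * sum (X_i - X_bar)^2

print("E[S^2] ~", S2.mean(), " expected:", (n - 1) / n * sigma2)  # ~ 3.2
print("E[s^2] ~", s2.mean(), " expected:", sigma2)                # ~ 4.0
```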
## 3. Estimation problems in probability

### 3.1 Point estimation

#### 3.1.1 Parameter estimation

Assume the original random variable $X$ has a known ***distribution law*** (e.g., normal, uniform, ...), but the parameter $\theta$ of its ***probability or density function*** is unknown. ***Parameter estimation is the process of using the observed sample $x_1, x_2, ..., x_n$ to determine $\theta$***. The estimate of $\theta$ is denoted by $\hat{\theta}$:

+ If $\hat{\theta}$ is a single point, this is ***point estimation***
+ If $\hat{\theta}$ is an interval, this is ***interval estimation***

#### 3.1.2 How to estimate $\theta$

We always want the best possible estimate of $\theta$. Given the observed sample $x_1, x_2, ..., x_n$, there exists a true value of $\theta$, but we do not know it. Therefore, we seek the $\hat{\theta}$ for which the ***density function (for continuous variables) or probability function (for discrete variables)*** best matches the observed data. ***The best estimate of $\theta$ is the value that maximizes the probability of observing the sample $x_1, x_2, ..., x_n$.***

Assume $X$ has probability or density function $f(x \mid \theta)$. Because $X_1, X_2, ..., X_n$ are independent and identically distributed as $X$, the likelihood is:

\begin{equation}
L(x \mid \theta) = \prod\limits_{i=1}^nf(x_i \mid \theta)
\end{equation}

***Note:*** Taking derivatives of products is cumbersome, so we take the natural logarithm of $L(x \mid \theta)$ before differentiating. Since $\ln(x)$ is monotonically increasing for $x > 0$, maximizing $\ln(L)$ is equivalent to maximizing $L$.

#### 3.1.3 Examples of point estimation for the parameters of some previously studied distributions

##### 3.1.3.1 Estimating the parameter $\lambda$ of the Poisson distribution

Since the original random variable $X$ follows a ***Poisson*** distribution, each variable $X_i, \forall 1 \le i \le n$, also follows a ***Poisson*** distribution. The probability mass function for each $X_i$ is:

\begin{equation}
p(X_i = x_i \mid \lambda) = e^{-\lambda} \frac{\lambda^{x_i}}{x_i!}, \quad \lambda > 0
\end{equation}

Then:

\begin{equation}
L(x \mid \lambda) = P(X_1=x_1, X_2=x_2, ..., X_n=x_n \mid \lambda) = \prod\limits_{i=1}^np(X_i = x_i \mid \lambda) = e^{-n\lambda} \frac{\lambda^{\sum\limits_{i = 1}^nx_i}}{\prod\limits_{i = 1}^nx_i!}
\end{equation}

***Since $L(x \mid \lambda) > 0$, take the natural logarithm of both sides:***

\begin{equation}
\ln(L(x \mid \lambda)) = -n\lambda + \ln\lambda \times \sum\limits_{i = 1}^nx_i - \ln\left(\prod\limits_{i = 1}^nx_i!\right)
\end{equation}

***Because $\ln(L(x \mid \lambda))$ is differentiable for all $\lambda > 0$, we can take the first derivative:***

\begin{equation}
\frac{\partial \ln(L(x \mid \lambda))}{\partial \lambda} = -n + \frac{\sum\limits_{i = 1}^nx_i}{\lambda}
\end{equation}

Setting it equal to zero:

\begin{equation}
\frac{\partial \ln(L(x \mid \lambda))}{\partial \lambda} = 0 \Leftrightarrow \lambda = \frac{\sum\limits_{i = 1}^nx_i}{n}
\end{equation}

The second derivative is:

\begin{equation}
\frac{\partial^2 \ln(L(x \mid \lambda))}{\partial \lambda^2} = -\frac{1}{\lambda^2}\sum\limits_{i = 1}^nx_i < 0
\end{equation}

***Note: Since the $X_i$ follow a Poisson distribution, $x_i \ge 0$; the second derivative is strictly negative as long as at least one $x_i > 0$.***

***Thus, the best estimate for the parameter $\lambda$ is $\hat{\lambda} = \frac{\sum\limits_{i = 1}^nx_i}{n}$.***
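As a sanity check of the Poisson result, the sketch below (assuming NumPy; the true $\lambda$ and sample size are made up for illustration) compares the closed-form estimate $\hat{\lambda} = \bar{X}$ with a brute-force maximization of the log-likelihood over a grid.

```python
import numpy as np

rng = np.random.default_rng(2)

true_lambda = 3.5                           # invented "true" parameter
x = rng.poisson(lam=true_lambda, size=1000)

# Closed-form MLE derived above: the sample mean
lambda_hat = x.mean()

# Log-likelihood up to an additive constant (the -sum(ln(x_i!)) term does not depend on lambda)
def log_likelihood(lam: float) -> float:
    return -len(x) * lam + np.log(lam) * x.sum()

grid = np.linspace(0.5, 8.0, 2000)
lambda_grid = grid[np.argmax([log_likelihood(l) for l in grid])]

print(f"lambda_hat (sample mean) = {lambda_hat:.4f}")
print(f"argmax over grid         = {lambda_grid:.4f}")  # should agree up to grid resolution
```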
##### 3.1.3.2 Estimating the parameters $\alpha, \sigma^2$ of the normal distribution $N(\alpha, \sigma^2)$

Since the original random variable $X$ follows a ***normal distribution***, each variable $X_i, \forall 1 \le i \le n$, also follows a ***normal distribution***. The probability density function for each $X_i$ is:

\begin{equation}
f(x_i \mid \{\alpha, \sigma \}) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x_i - \alpha)^2}{2\sigma^2}}
\end{equation}

Then:

\begin{equation}
L(x \mid \{\alpha, \sigma \}) = (2\pi\sigma^2)^{\frac{-n}{2}}e^{-\frac{\sum\limits_{i = 1}^n(x_i - \alpha)^2}{2\sigma^2}}
\end{equation}

***Since $L(x \mid \{\alpha, \sigma \}) > 0$, take the natural logarithm of both sides:***

\begin{equation}
\ln(L(x \mid \{\alpha, \sigma \})) = -\frac{n}{2}\ln(2\pi\sigma^2) -\frac{\sum\limits_{i = 1}^n(x_i - \alpha)^2}{2\sigma^2}
\end{equation}

Because $\sigma > 0$, $\ln(L(x \mid \{\alpha, \sigma \}))$ is differentiable with respect to $\alpha$ and $\sigma^2$. Taking the partial derivatives:

\begin{equation}
\frac{\partial \ln(L(x \mid \{\alpha, \sigma \}))}{\partial \alpha} = \frac{1}{\sigma^2}\sum\limits_{i = 1}^n(x_i - \alpha)
\end{equation}

\begin{equation}
\frac{\partial \ln(L(x \mid \{\alpha, \sigma \}))}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum\limits_{i = 1}^n(x_i - \alpha)^2}{2\sigma^4}
\end{equation}

Setting both derivatives to zero and solving gives:

+ $\hat{\alpha} = \bar{X} = \frac{1}{n}\sum\limits_{i=1}^nx_i$
+ $\hat{\sigma}^2 = S^2 = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{X})^2$

Checking the second-order conditions confirms this is a maximum. ***Thus, the estimators for the normal distribution parameters are:***

+ $\hat{\alpha} = \bar{X}$
+ $\hat{\sigma}^2 = S^2$

##### 3.1.3.3 Estimating the parameter $p$ of the Bernoulli distribution

Since the original random variable $X$ follows a ***Bernoulli distribution***, each variable $X_i$ also follows a ***Bernoulli distribution***. The probability mass function is:

\begin{equation}
f(x_i \mid p) = p^{x_i}(1-p)^{1-x_i}, \quad 0 < p < 1
\end{equation}

Then:

\begin{equation}
L(x \mid p) = p^{\sum\limits_{i=1}^nx_i}(1-p)^{n-\sum\limits_{i=1}^nx_i}
\end{equation}

***Since $L(x \mid p) > 0$, take the natural logarithm:***

\begin{equation}
\ln(L(x \mid p)) = \ln p \times \sum\limits_{i=1}^nx_i + \ln(1-p) \times (n-\sum\limits_{i=1}^nx_i)
\end{equation}

Taking the derivative:

\begin{equation}
\frac{\partial \ln(L(x \mid p))}{\partial p} = \frac{\sum\limits_{i=1}^nx_i}{p} + \frac{n-\sum\limits_{i=1}^nx_i}{p-1}
\end{equation}

Setting it to zero and solving gives:

\begin{equation}
\hat{p} = \frac{\sum\limits_{i=1}^nx_i}{n}
\end{equation}

The second derivative is negative, so this is a maximum. ***Thus, the estimator for the Bernoulli parameter is $\hat{p} = \frac{\sum\limits_{i=1}^nx_i}{n}$.***

##### 3.1.3.4 Estimating the parameter vector $p$ of the categorical distribution

Unlike the Bernoulli distribution, where the variables take values in ***{0, 1}***, here each $X_i$ can take one of $m$ different values. The ***vector $p = [p_1, p_2, ..., p_m]$, $0 < p_i < 1$***, satisfies $p_1 + p_2 + ... + p_m = 1$. Since the original variable $X$ follows a ***categorical (multinoulli) distribution***, the variables $X_i$ also follow a ***categorical distribution***. Maximizing the likelihood (with $n_i$ denoting the number of observations equal to value $i$) gives the estimated parameter vector $\hat{p} = [\hat{p_1}, \hat{p_2}, ..., \hat{p_m}]$ with:

$$\hat{p_i} = \frac{n_i}{n_1 + n_2 + ... + n_m}, \quad 1 \le i \le m$$
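These closed-form estimators are trivial to compute with NumPy. Below is a minimal sketch for the normal and categorical cases (the "true" parameters and sample sizes are invented purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(3)

# Normal case: invented "true" parameters
alpha_true, sigma_true = 1.5, 0.7
x = rng.normal(loc=alpha_true, scale=sigma_true, size=5000)

# MLE derived above: sample mean and *biased* sample variance
alpha_hat = x.mean()
sigma2_hat = np.mean((x - alpha_hat) ** 2)   # same as np.var(x, ddof=0)

print(f"alpha_hat  = {alpha_hat:.4f}  (true {alpha_true})")
print(f"sigma2_hat = {sigma2_hat:.4f} (true {sigma_true**2:.4f})")

# Categorical case: the MLE is simply the vector of relative frequencies
cats = rng.choice(3, size=1000, p=[0.2, 0.5, 0.3])  # invented class probabilities
counts = np.bincount(cats, minlength=3)              # n_1, n_2, n_3
p_hat = counts / counts.sum()
print("p_hat =", p_hat)  # should be close to [0.2, 0.5, 0.3]
```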
### 3.2 Interval estimation

As we know, when estimating quantities such as ***exam scores*** or ***the height of Asians***, point estimation can have ***large bias when the sample size is small*** and is ***hard to evaluate***. Therefore, we turn to ***interval estimation***, which is more reliable and objective. ***Interval estimation is built on top of point estimation.***

We want to estimate an interval for the parameter $\theta$ with confidence level $1 - \alpha$. That is, we find an interval $(\theta_1, \theta_2)$ such that:

$$P(\theta_1 < \theta < \theta_2) = 1 - \alpha$$

In a previous article, I introduced the concept of ***quantiles***: the $k\%$ quantile of a random variable $X$ is a value $x_k$ satisfying:

$$P(X < x_k) = \frac{k}{100}$$

So we look for quantiles $\theta_{\alpha_1}, \theta_{1 - \alpha_2}$ with $\alpha_1 + \alpha_2 = \alpha$ such that:

+ $P(\theta < \theta_{\alpha_1}) = \alpha_1$
+ $P(\theta < \theta_{1 - \alpha_2}) = 1-\alpha_2$

Then:

$$P(\theta_{\alpha_1} < \theta < \theta_{1 - \alpha_2}) = 1-\alpha_2 - \alpha_1 = 1 - \alpha$$

***As the saying goes, rain has raincoats and sun has umbrellas: if the above is still unclear, here comes an example.***

Consider the problem of estimating a confidence interval for the expectation, in the case of the normal distribution. Assume the original random variable $X$ (the population variable) follows a ***normal distribution*** with expectation $EX = \mu$ and variance $VX = \sigma^2$, and that the variance is known: $\sigma^2 = \sigma_0^2$. We estimate a confidence interval for $\mu$ with confidence level $1 - \alpha$.

Define:

$$Z = \frac{\bar{X} - \mu}{\sigma_0}\sqrt{n}$$

Then $Z$ follows a normal distribution with ***mean 0*** and ***variance 1***; in other words, $Z$ follows the ***standard normal distribution***. We find $z_{\alpha_1}, z_{1 - \alpha_2}$ such that:

+ $\alpha_1 + \alpha_2 = \alpha$
+ $P(Z < z_{\alpha_1}) = \alpha_1$
+ $P(Z < z_{1 - \alpha_2}) = 1 - \alpha_2$

By the symmetry of the standard normal distribution:

$$P(Z < -z_{1 - \alpha_1}) = P(Z > z_{1 - \alpha_1}) = 1 - P(Z < z_{1 - \alpha_1}) = 1 - (1 - \alpha_1) = \alpha_1 = P(Z < z_{\alpha_1})$$

so $z_{\alpha_1} = -z_{1 - \alpha_1}$. Thus:

\begin{equation}
P(Z < z_{1 - \alpha_2}) - P(Z < z_{\alpha_1}) = 1 - (\alpha_1 + \alpha_2) = 1- \alpha
\end{equation}

***or:***

\begin{equation}
P(-z_{1 - \alpha_1} < Z < z_{1 - \alpha_2}) = 1- \alpha
\end{equation}

Therefore, the confidence interval for $Z$ is $(-z_{1 - \alpha_1}; z_{1 - \alpha_2})$. Substituting $Z = \frac{\bar{X} - \mu}{\sigma_0}\sqrt{n}$ and solving for $\mu$ (the inequalities flip direction) gives:

$$\bar{X} - \frac{\sigma_0}{\sqrt{n}}z_{1 - \alpha_2} < \mu < \bar{X} + \frac{\sigma_0}{\sqrt{n}}z_{1 - \alpha_1}$$

For a given $1 - \alpha$, there are infinitely many pairs $(\alpha_1, \alpha_2)$. We consider some special cases:

+ Symmetric confidence interval: a symmetric interval satisfies $z_{1 - \alpha_1} = z_{1 - \alpha_2}$, or equivalently $\alpha_1 = \alpha_2 = \frac{\alpha}{2}$. Let $z_b = z_{1 - \frac{\alpha}{2}}$; then:
$$\bar{X} - \frac{\sigma_0}{\sqrt{n}}z_b < \mu < \bar{X} + \frac{\sigma_0}{\sqrt{n}}z_b$$
The quantity $\epsilon = \frac{\sigma_0}{\sqrt{n}}z_b$ is the ***estimation accuracy***: with confidence $1 - \alpha$, the deviation of $\bar{X}$ from $\mu$ is at most $\epsilon$.
+ Right-sided confidence interval: this interval corresponds to $\alpha_1 = 0, \alpha_2 = \alpha$. Let $z_b = z_{1 - \alpha}$; then:
$$\bar{X} - \frac{\sigma_0}{\sqrt{n}}z_b < \mu < +\infty$$
+ Left-sided confidence interval: this interval corresponds to $\alpha_1 = \alpha, \alpha_2 = 0$. Let $z_b = z_{1 - \alpha}$; then:
$$ -\infty < \mu < \bar{X} + \frac{\sigma_0}{\sqrt{n}}z_b$$

Computing $z_b$ amounts to finding a quantile of the ***standard normal distribution***, which we look up through the ***Laplace function*** $\phi(z) = P(0 < Z < z)$ tabulated in the ***Laplace table***. For example:

+ $z_b = z_{1 - \alpha}$ means $\phi(z_b) = \frac{1}{2} - \alpha$
+ $z_b = z_{1 - \frac{\alpha}{2}}$ means $\phi(z_b) = \frac{1 - \alpha}{2}$

***Review the previous article if you have forgotten these notions.*** Looking up the ***Laplace table*** gives the result; you can refer to a ***Laplace table*** [here](https://cdn.slidesharecdn.com/ss_thumbnails/banggiatrihamlaplace-150919075726-lva1-app6892-thumbnail-4.jpg?cb=1442649495). ***In exams, you will always be provided with a table of Laplace values.***
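When working in code rather than with a printed table, the quantile $z_b$ can be obtained directly from the standard normal distribution. Below is a minimal sketch (assuming NumPy and SciPy are available); the numbers are those of the worked example in section 4 below, so the output can be compared with the result there.

```python
import numpy as np
from scipy.stats import norm

# Figures from the worked example in section 4 below
x_bar = 10.0   # sample mean (million/month)
sigma0 = 2.0   # known population standard deviation
n = 500        # sample size
alpha = 0.05   # confidence level 1 - alpha = 95%

# z_b = z_{1 - alpha/2}: the (1 - alpha/2) quantile of the standard normal,
# i.e. the value a Laplace table would give for phi(z_b) = (1 - alpha)/2
z_b = norm.ppf(1 - alpha / 2)    # ~ 1.96

eps = sigma0 / np.sqrt(n) * z_b  # estimation accuracy
print(f"z_b = {z_b:.3f}, eps = {eps:.3f}")
print(f"{x_bar - eps:.3f} < mu < {x_bar + eps:.3f}")  # ~ 9.825 < mu < 10.175
```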
This article only covers the case where the ***variance is known***. You can refer to confidence interval estimation methods for ***proportions and variances*** in Chapter 4 of the lecture notes by ***Dr. Lê Xuân Lý - HUST***, available [here](https://drive.google.com/drive/folders/1d-X3Q3-mjwma8H-KH_DeusLt7CiL2MFR).

## 4. Example

The monthly revenue of a store is a random variable $X$ (million/month) with standard deviation 2 million/month. A random survey of 500 stores of similar scale gives an average revenue of 10 million/month. With 95% confidence, estimate an interval for the average revenue of stores of that scale.

***Solution:***

Define:

$$Z = \frac{\bar{X} - \mu}{\sigma_0}\sqrt{n}$$

From the formula above, the symmetric confidence interval for the mean $\mu$ is:

$$\bar{X} - \frac{\sigma_0}{\sqrt{n}}z_b < \mu < \bar{X} + \frac{\sigma_0}{\sqrt{n}}z_b$$

with $z_b = z_{1 - \frac{\alpha}{2}}$, $1-\alpha=95\%$, $\sigma_0=2$, $\bar{x} = 10$, $n = 500$. Then $\phi(z_b) = \frac{1 - \alpha}{2} = 0.475$, so $z_b = 1.96$ (from the ***Laplace table***).

***Substituting these values, we get:***

$$9.825 < \mu < 10.175$$

## 5. References

+ [Lecture slides by Dr. Lê Xuân Lý - HUST](https://drive.google.com/drive/folders/1d-X3Q3-mjwma8H-KH_DeusLt7CiL2MFR)
+ [Probability and Statistics textbook - Tống Đình Quy](https://drive.google.com/file/d/1Yw02kvncpFp6WiyP9kZWSEuWyvLL3sYO/view)