# Sampling and Estimation Problems in Probability
In this article, I address an ***important*** topic because it provides the foundational knowledge needed to understand two methods: ***MLE*** and ***MAP*** (two extremely important methods for ***probabilistic*** models in machine learning).
## 1. What is a sample?
+ Data collection, organization, aggregation, and processing are called ***descriptive statistics***.
+ Aggregating the above statistical data leads to the concept of a ***data sample***.
+ The parent set of all ***data samples*** is the ***population***.
+ Simply put, a ***data sample*** reflects a part of the information of the ***population***. Therefore, to obtain complete information about a data-related problem, we would have to operate on the ***population***. So why do we need ***the definition of a data sample***?
+ Do you have enough resources to ***handle*** a large dataset?
+ Can you keep up with data changes in such a fast-paced technological era?
You probably already know the answer: ***it is no***. Working on the entire ***population*** is practically impossible. Therefore, solutions such as ***sampling*** and ***studying samples*** were developed.
### 1.1. Sampling methods
There are many sampling methods, and each method is suitable for different types of data.
#### 1.1.1 Simple Random Sampling
As the name suggests, we randomly select a ***subset*** from the ***population***:
+ With replacement: after a data point is drawn from the population, it is returned before the next draw. In other words, ***sampling with replacement*** allows a data point to appear in the sample more than once.
+ Without replacement: each data point can be selected at most once. Both variants are illustrated in the sketch below.
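To make this concrete, here is a minimal sketch of both variants in Python (assuming NumPy is available; the 1000-point `population` array is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # reproducible random generator
population = np.arange(1000)         # a hypothetical population of 1000 data points

# Simple random sampling WITHOUT replacement: each point appears at most once
sample_without = rng.choice(population, size=10, replace=False)

# Simple random sampling WITH replacement: points may repeat
sample_with = rng.choice(population, size=10, replace=True)

print(sample_without)
print(sample_with)
```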
#### 1.1.2 Cluster Sampling
Divide the population into ***k*** groups and perform random sampling within these ***k*** groups; collecting the results gives the ***cluster sample***. This method should be used when the ***group sizes are highly imbalanced***. In machine learning, when splitting ***training and test sets***, we use a similar idea, called ***stratified sampling***, to avoid ***highly imbalanced group sizes*** between the two splits (see the sketch below).
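Below is a small sketch of that stratified-splitting idea, assuming scikit-learn is available; the arrays `X`, `y` and the 90/10 class ratio are hypothetical choices for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90/10 class ratio in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(np.bincount(y_train), np.bincount(y_test))  # roughly 72/8 and 18/2
```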
#### 1.1.3 Judgment Sampling
This is a sampling method based on expert opinions about the research subject.
### 1.2 Random sample
A sample is called a random sample of size ***n*** from a ***population*** with original random variable $X$ if it is a set of variables $X_1, X_2, ..., X_n$ satisfying:
+ ***n*** variables $X_1, X_2, ..., X_n$ are independent
+ ***n*** variables $X_1, X_2, ..., X_n$ have the same probability distribution as $X$
Because the variables are ***independent***, the ***joint probability or density functions*** are computed as the ***product of marginal functions***:
+ If $X$ is a discrete variable:
\begin{equation}
p_n(x_1, x_2, ..., x_n) = P(X_1= x_1, X_2 = x_2, ..., X_n = x_n) = \prod_{i = 1}^nP(X_i = x_i)
\end{equation}
+ If $X$ is a continuous variable:
\begin{equation}
f_n(x_1, x_2, ..., x_n) = \prod_{i = 1}^nf(x_i)
\end{equation}
We often hear the term ***statistic***. So ***what is a statistic***?
Simply put, a statistic is a function $Y = g(X_1, X_2, ..., X_n)$ of the random sample.
***Example:***
+ $g(X_1, X_2, ..., X_n) = \frac{1}{n} \sum\limits_{i = 1}^nX_i = \bar{X}$ is a statistic
+ $g(X_1, X_2, ..., X_n) = \frac{1}{n} \sum\limits_{i = 1}^n(X_i - \bar{X})^2$ is also a statistic
***Since $X_i$ is itself a random variable whose observed value is $x_i$, I will write $X_i$ instead of $x_i$ in the formulas below for readability.***
## 2. Sample characteristics
### 2.1 Sample mean
\begin{equation}
\bar{X} = \frac{1}{n} \sum\limits_{i = 1}^nX_i
\end{equation}
Since $X_i$ are random variables, ***$\bar{X}$ is also a random variable***.
Assume the original random variable $X$ ***(the population random variable)*** has $EX = \alpha$ and $VX = \sigma^2$.
Because $X_i$ belong to a random sample from the population, $X_i$ have the same distribution as $X$ (by definition), so $EX_i = \alpha$ and $VX_i = \sigma^2$.
+ $E\bar{X} = E[\frac{1}{n} \sum\limits_{i = 1}^nX_i] = \frac{1}{n}(EX_1 + EX_2 + ... + EX_n) = \alpha$
+ $V\bar{X} = \frac{1}{n^2}(VX_1 + VX_2 + ... + VX_n) = \frac{\sigma^2}{n}$ (since the $X_i$ are pairwise independent)
In the previous article, I mentioned that $VX$ ***characterizes data dispersion***. Since $V\bar{X} < VX_i$ for all $1 \le i \le n$ (whenever $n > 1$), ***the values of $\bar{X}$ are more tightly concentrated around the expectation than those of $X$***.
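These two identities are easy to check empirically. A minimal simulation sketch (assuming NumPy; the normal population with $\alpha = 5$, $\sigma = 2$ and the sample size $n = 25$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
alpha, sigma, n = 5.0, 2.0, 25           # population mean, std dev, sample size
num_samples = 100_000                    # number of independent samples of size n

# Each row is one random sample X_1, ..., X_n; each row mean is one realization of X-bar
samples = rng.normal(loc=alpha, scale=sigma, size=(num_samples, n))
sample_means = samples.mean(axis=1)

print(sample_means.mean())               # close to alpha = 5
print(sample_means.var())                # close to sigma^2 / n = 4 / 25 = 0.16
```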
### 2.2 Sample variance
\begin{equation}
S^2 = \frac{1}{n}\sum\limits_{i = 1}^n(X_i - \bar{X})^2
\end{equation}
We have:
\begin{equation}
S^2 = \frac{1}{n}\sum\limits_{i = 1}^n(X_i - \bar{X})^2 = \frac{1}{n} \sum\limits_{i = 1}^nX_i^2 - \frac{1}{n^2}(\sum\limits_{i = 1}^nX_i)^2
\end{equation}
Since $(X_1 + X_2 + ... + X_n)^2 = \sum\limits_{i = 1}^nX_i^2 + \sum\limits_{i \ne j}X_iX_j$, we have:
\begin{equation}
S^2 = \frac{n - 1}{n^2}\sum\limits_{i = 1}^nX_i^2 - \frac{1}{n^2}\sum\limits_{i \ne j}X_iX_j
\end{equation}
Because the variables $X_i$ are pairwise independent and identically distributed as $X$:
+ $E(X_iX_j) = EX_iEX_j = \alpha^2$
+ $E(X_i^2) = E(X^2)$
Thus:
\begin{equation}
ES^2 = \frac{n - 1}{n} VX = \frac{n - 1}{n} \sigma^2
\end{equation}
We see that $ES^2 \ne \sigma^2$, so we rescale $S^2$ to obtain an estimator whose expectation equals $\sigma^2$. The result is called the ***unbiased sample variance $s^2$***:
\begin{equation}
s^2 = \frac{n}{n-1}S^2 = \frac{1}{n - 1}\sum\limits_{i = 1}^n(X_i - \bar{X})^2
\end{equation}
Then:
\begin{equation}
Es^2 = \sigma^2
\end{equation}
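A small sketch of the bias (assuming NumPy): averaged over many samples, $S^2$ approaches $\frac{n-1}{n}\sigma^2$, while the unbiased $s^2$ (NumPy's `ddof=1`) approaches $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
sigma2, n, num_samples = 4.0, 10, 100_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(num_samples, n))

S2 = samples.var(axis=1, ddof=0)   # biased sample variance (divides by n)
s2 = samples.var(axis=1, ddof=1)   # unbiased sample variance (divides by n - 1)

print(S2.mean())   # close to (n - 1) / n * sigma^2 = 3.6
print(s2.mean())   # close to sigma^2 = 4.0
```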
## 3. Estimation problems in probability
### 3.1 Point estimation
#### 3.1.1 Parameter estimation
Assume the original random variable $X$ has a known ***distribution law*** (e.g., normal, uniform, ...), but the parameters $\theta$ of the ***probability or density function*** are unknown.
***Parameter estimation is the process of using observed samples $x_1, x_2, ..., x_n$ to determine $\theta$***. The estimate of $\theta$ is denoted by $\hat{\theta}$:
+ If $\hat{\theta}$ is a point, this is ***point estimation***
+ If $\hat{\theta}$ is an interval, this is ***interval estimation***
#### 3.1.2 How to estimate $\theta$
We always want the best possible estimate of $\theta$.
Given observed samples $x_1, x_2, ..., x_n$, there exists a true value of $\theta$, but we do not know it.
Therefore, we seek $\hat{\theta}$ such that the ***density function (for continuous variables) or probability function (for discrete variables)*** assigns the highest possible probability to the observed data $x_1, x_2, ..., x_n$.
***The best estimate of $\theta$ is the value that maximizes the probability of observing the sample $x_1, x_2, ..., x_n$.***
Assume $X$ has probability or density function $f(x \mid \theta)$.
Because $X_1, X_2, ..., X_n$ are independent and identically distributed as $X$, the likelihood is:
\begin{equation}
L(x \mid \theta) = \prod\limits_{i=1}^nf(x_i \mid \theta)
\end{equation}
***Note:*** Taking derivatives of products is difficult, so we take the natural logarithm of $L(x \mid \theta)$ before differentiation. Since $ln(x)$ is monotonically increasing for $x > 0$, maximizing $ln(L)$ is equivalent to maximizing $L$.
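To illustrate the log trick, here is a generic sketch (assuming NumPy and SciPy) that maximizes a log-likelihood numerically by minimizing its negative; the exponential model and its true mean of 2 are illustrative assumptions, not part of the examples below.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import expon

rng = np.random.default_rng(seed=0)
x = rng.exponential(scale=2.0, size=200)   # hypothetical sample, true mean = 2

# ln L(x | theta) is a sum of log-densities; sums are much easier to
# differentiate and to optimize numerically than the original product.
def neg_log_likelihood(mean):
    return -np.sum(expon.logpdf(x, scale=mean))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
print(result.x, x.mean())   # for the exponential, the MLE of the mean is x-bar
```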
#### 3.1.3 Examples of point estimation for parameters in some previously studied distributions
##### 3.1.3.1 Estimating parameter $\lambda$ of the Poisson distribution
Since the original random variable $X$ follows a ***Poisson*** distribution, the individual variables $X_i, \forall 1 \le i \le n$, also follow a ***Poisson*** distribution.
The probability mass function for each $X_i$ is:
\begin{equation}
p(X_i = x_i \mid \lambda) = e^{-\lambda} \frac{\lambda^{x_i}}{x_i!}, \lambda > 0
\end{equation}
Then:
\begin{equation}
L(x \mid \lambda) = L(X_1=x_1, X_2=x_2, ..., X_n=x_n \mid \lambda) = \prod\limits_{i=1}^nf(x_i \mid \lambda) = e^{-n\lambda} \frac{\lambda^{\sum\limits_{i = 1}^nx_i}}{\prod\limits_{i = 1}^nx_i!}
\end{equation}
***Since $L(x \mid \lambda) > 0$, take the natural logarithm of both sides:***
\begin{equation}
ln(L(x \mid \lambda)) = -n\lambda + ln\lambda \times \sum\limits_{i = 1}^nx_i - ln(\prod\limits_{i = 1}^nx_i!)
\end{equation}
***Because $ln(L(x \mid \lambda))$ is differentiable for all $\lambda > 0$, we can take the first derivative:***
\begin{equation}
\frac{\partial ln(L(x \mid \lambda))}{\partial \lambda} = -n + \frac{\sum\limits_{i = 1}^nx_i}{\lambda}
\end{equation}
Setting it equal to zero:
\begin{equation}
\frac{\partial ln(L(x \mid \lambda))}{\partial \lambda} = 0 \Leftrightarrow \lambda = \frac{\sum\limits_{i = 1}^nx_i}{n}
\end{equation}
The second derivative is:
\begin{equation}
\frac{\partial^2 ln(L(x \mid \lambda))}{\partial \lambda^2} = -\frac{1}{\lambda^2}\sum\limits_{i = 1}^nx_i < 0
\end{equation}
***Note: Since the $X_i$ follow a Poisson distribution, $x_i \ge 0$; as long as at least one observation is positive, $\sum\limits_{i = 1}^nx_i > 0$ and the second derivative is strictly negative.***
***Thus, the best estimate for parameter $\lambda$ is $\hat{\lambda} = \frac{\sum\limits_{i = 1}^nx_i}{n}$.***
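A quick numerical check of this result (a sketch assuming NumPy and SciPy; the true $\lambda = 3$ is an arbitrary illustrative choice): the log-likelihood at $\hat{\lambda} = \bar{X}$ should be at least as large as at nearby values of $\lambda$.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(seed=0)
x = rng.poisson(lam=3.0, size=1000)          # hypothetical observations

lam_hat = x.mean()                           # closed-form MLE: sum(x_i) / n

def log_likelihood(lam):
    return np.sum(poisson.logpmf(x, mu=lam))

# The log-likelihood at lam_hat is at least as large as at nearby values
for lam in (lam_hat - 0.5, lam_hat, lam_hat + 0.5):
    print(lam, log_likelihood(lam))
```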
##### 3.1.3.2 Estimating parameters $\alpha, \sigma^2$ of the normal distribution $N(\alpha, \sigma^2)$
Since the original random variable $X$ follows a ***normal distribution***, the individual variables $X_i, \forall 1 \le i \le n$, also follow a ***normal distribution***.
The probability density function for each $X_i$ is:
\begin{equation}
f(x_i \mid \{\alpha, \sigma \}) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x_i - \alpha)^2}{2\sigma^2}}
\end{equation}
Then:
\begin{equation}
L(x \mid \{\alpha, \sigma \}) = (2\pi\sigma^2)^{\frac{-n}{2}}e^{-\frac{\sum\limits_{i = 1}^n(x_i - \alpha)^2}{2\sigma^2}}
\end{equation}
***Since $L(x \mid \{\alpha, \sigma \}) > 0$, take the natural logarithm of both sides:***
\begin{equation}
ln(L(x \mid \{\alpha, \sigma \})) = -\frac{n}{2}ln(2\pi\sigma^2) -\frac{\sum\limits_{i = 1}^n(x_i - \alpha)^2}{2\sigma^2}
\end{equation}
Because $\sigma > 0$, $ln(L(x \mid \{\alpha, \sigma \}))$ is differentiable for all $\alpha$ and all $\sigma > 0$. Taking partial derivatives:
\begin{equation}
\frac{\partial ln(L(x \mid \{\alpha, \sigma \}))}{\partial \alpha} = \frac{1}{\sigma^2}\sum\limits_{i = 1}^n(x_i - \alpha)
\end{equation}
\begin{equation}
\frac{\partial ln(L(x \mid \{\alpha, \sigma \}))}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum\limits_{i = 1}^n(x_i - \alpha)^2}{2\sigma^4}
\end{equation}
Setting both partial derivatives to zero and solving gives:
+ $\hat{\alpha} = \bar{X} = \frac{1}{n}\sum\limits_{i=1}^nx_i$
+ $\hat{\sigma}^2 = S^2 = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{X})^2$
The second derivatives are negative, confirming a maximum.
***Thus, the estimators for the normal distribution parameters are:***
+ $\hat{\alpha} = \bar{X}$
+ $\hat{\sigma}^2 = S^2$
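The same kind of check for the normal case (a sketch assuming NumPy; the true parameters $\alpha = 1$, $\sigma = 2$ are illustrative): the MLEs are the sample mean and the biased sample variance, i.e. `var` with `ddof=0`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(loc=1.0, scale=2.0, size=5000)   # hypothetical sample: alpha = 1, sigma^2 = 4

alpha_hat = x.mean()            # MLE of the mean: X-bar
sigma2_hat = x.var(ddof=0)      # MLE of the variance: biased S^2 (divides by n)

print(alpha_hat, sigma2_hat)    # close to 1.0 and 4.0
```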
##### 3.1.3.3 Estimating parameter $p$ for the Bernoulli distribution
Since the original random variable $X$ follows a ***Bernoulli distribution***, the individual variables $X_i$ also follow a ***Bernoulli distribution***.
The probability mass function is:
\begin{equation}
f(x_i \mid p) = p^{x_i}(1-p)^{1-x_i}, 0 < p < 1
\end{equation}
Then:
\begin{equation}
L(x \mid p) = p^{\sum\limits_{i=1}^nx_i}(1-p)^{n-\sum\limits_{i=1}^nx_i}
\end{equation}
***Since $L(x \mid p) > 0$, take the natural logarithm:***
\begin{equation}
ln(L(x \mid p)) = lnp \times \sum\limits_{i=1}^nx_i + ln(1-p) \times (n-\sum\limits_{i=1}^nx_i)
\end{equation}
Taking the derivative:
\begin{equation}
\frac{\partial ln(L(x \mid p))}{\partial p} = \frac{\sum\limits_{i=1}^nx_i}{p} + \frac{n-\sum\limits_{i=1}^nx_i}{p-1}
\end{equation}
Solving gives:
\begin{equation}
\hat{p} = \frac{\sum\limits_{i=1}^nx_i}{n}
\end{equation}
The second derivative is negative, so this is a maximum.
***Thus, the estimator for the Bernoulli parameter is $\hat{p} = \frac{\sum\limits_{i=1}^nx_i}{n}$.***
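And a short sketch for the Bernoulli case (assuming NumPy; $p = 0.3$ is an illustrative true value): the estimator is simply the fraction of ones in the sample.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.binomial(n=1, p=0.3, size=2000)   # hypothetical 0/1 observations

p_hat = x.mean()                          # MLE: sum(x_i) / n = fraction of ones
print(p_hat)                              # close to 0.3
```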
##### 3.1.3.4 Estimating parameter vector $p$ for the categorical distribution
Unlike the Bernoulli distribution, where variables take values in ***{0,1}***, here each $X_i$ can take one of $m$ different values.
***Vector $p = [p_1, p_2, ..., p_m], 0 < p_i < 1$*** satisfies $p_1 + p_2 + ... + p_m = 1$.
Since the original variable $X$ follows a ***categorical (multinoulli) distribution***, the variables $X_i$ also follow a ***categorical distribution***.
Let $n_i$ denote the number of observations equal to the $i$-th value, so that $n_1 + n_2 + ... + n_m = n$. The log-likelihood is $\sum\limits_{i=1}^m n_i \, ln(p_i)$; maximizing it under the constraint $p_1 + p_2 + ... + p_m = 1$ (for example, with a Lagrange multiplier) gives the estimator:
$$p_i = \hat{p_i} = \frac{n_i}{n_1 + n_2 + ... + n_m}, \quad 1 \le i \le m$$
Thus, the estimated parameter vector is:
$$\hat{p} = [\hat{p_1}, \hat{p_2}, ..., \hat{p_m}], \quad \hat{p_i} = \frac{n_i}{n_1 + n_2 + ... + n_m}, \quad 1 \le i \le m$$
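For the categorical case, the estimator is just the vector of empirical frequencies. A minimal sketch (assuming NumPy; the three categories and their true probabilities are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
true_p = np.array([0.2, 0.5, 0.3])                 # hypothetical categorical parameters
x = rng.choice(len(true_p), size=5000, p=true_p)   # observations coded as 0, 1, 2

counts = np.bincount(x, minlength=len(true_p))     # n_i: how many times value i occurs
p_hat = counts / counts.sum()                      # MLE: n_i / n

print(p_hat)                                       # close to [0.2, 0.5, 0.3]
```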
### 3.2 Interval estimation
As we know, when we estimate a quantity such as ***exam scores*** or ***the average height of Asians***, point estimation can have ***large error when the sample size is small*** and is ***hard to evaluate***. Therefore, we turn to ***interval estimation***, which is more reliable and objective.
***Interval estimation is built based on point estimation.***
We need to estimate an interval for parameter $\theta$ with confidence level $1 - \alpha$. That is, we find an interval $(\theta_1, \theta_2)$ such that:
$$P(\theta_1 < \theta < \theta_2) = 1 - \alpha$$
In a previous article ***[here]()***, I mentioned the concept of ***quantiles*** as follows:
The $k$% quantile of a random variable $X$ is a value $x_k$ satisfying:
$$P(X < x_k) = \frac{k}{100}$$
That is, we find quantiles $\theta_{\alpha_1}, \theta_{1 - \alpha_2}$ such that $\alpha_1 + \alpha_2 = \alpha$:
+ $P(\theta < \theta_{\alpha_1}) = \alpha_1$
+ $P(\theta < \theta_{1 - \alpha_2}) = 1-\alpha_2$
Then:
$$P(\theta_{\alpha_1} < \theta < \theta_{1 - \alpha_2}) = 1-\alpha_2 - \alpha_1 = 1 - \alpha$$
***If the above still feels abstract, here comes an example.***
We consider the problem of confidence interval estimation for the expectation.
This example considers the normal distribution.
Assume the original random variable $X$ (population variable) follows a ***normal distribution*** with expectation $EX = \mu$ and variance $VX = \sigma^2$.
We estimate a confidence interval for $\mu$ with confidence level $1 - \alpha$.
Assume: the variance is known, $\sigma^2 = \sigma_0^2$.
Define:
$$Z = \frac{\bar{X} - \mu}{\sigma_0}\sqrt{n}$$
Then $Z$ follows a normal distribution with ***mean = 0*** and ***variance = 1***. In other words, $Z$ ***follows the [standard normal distribution]()***.
We find $z_{\alpha_1}, z_{1 - \alpha_2}$ such that:
+ $\alpha_1 + \alpha_2 = \alpha$
+ $P(Z < z_{\alpha_1}) = \alpha_1$
+ $P(Z < z_{1 - \alpha_2}) = 1 - \alpha_2$
Since $Z$ follows a normal distribution:
$$P(Z < -z_{1 - \alpha_1}) = P(Z > z_{1 - \alpha_1}) = 1 - P(Z < z_{1 - \alpha_1}) = 1 - (1 - \alpha_1) = \alpha_1 = P(Z < z_{\alpha_1})$$
Thus:
\begin{equation}
P(Z < z_{1 - \alpha_2}) - P(Z < z_{\alpha_1}) = 1 - (\alpha_1 + \alpha_2) = 1- \alpha
\end{equation}
***Equivalently:***
\begin{equation}
P(-z_{1 - \alpha_1} < Z < z_{1 - \alpha_2}) = 1- \alpha
\end{equation}
Therefore, the confidence interval for $Z$ is $(-z_{1 - \alpha_1}; z_{1 - \alpha_2})$.
Solving this double inequality for $\mu$ (recall $Z = \frac{\bar{X} - \mu}{\sigma_0}\sqrt{n}$) gives:
$$\bar{X} - \frac{\sigma_0}{\sqrt{n}}z_{1 - \alpha_2} < \mu < \bar{X} + \frac{\sigma_0}{\sqrt{n}}z_{1 - \alpha_1}$$
For a given $1 - \alpha$, there are infinitely many pairs $(\alpha_1, \alpha_2)$. We choose some special cases:
+ Symmetric confidence interval:
A symmetric interval satisfies $z_{1 - \alpha_1} = z_{1 - \alpha_2}$, or equivalently:
$$\alpha_1 = \alpha_2 = \frac{\alpha}{2}$$
Let $z_b = z_{1 - \frac{\alpha}{2}}$, then:
$$\bar{X} - \frac{\sigma_0}{\sqrt{n}}z_b < \mu < \bar{X} + \frac{\sigma_0}{\sqrt{n}}z_b$$
The quantity $\epsilon = \frac{\sigma_0}{\sqrt{n}}z_b$ is the ***estimation accuracy***: with confidence $1 - \alpha$, the sample mean $\bar{X}$ deviates from $\mu$ by at most $\epsilon$.
+ Right-sided confidence interval:
This interval satisfies $\alpha_1 = 0, \alpha_2 = \alpha$.
Let $z_b = z_{1 - \alpha}$, then:
$$\bar{X} - \frac{\sigma_0}{\sqrt{n}}z_b < \mu < +\infty$$
+ Left-sided confidence interval:
This interval satisfies $\alpha_1 = \alpha, \alpha_2 = 0$.
Let $z_b = z_{1 - \alpha}$, then:
$$ -\infty < \mu < \bar{X} + \frac{\sigma_0}{\sqrt{n}}z_b$$
Computing $z_b$ means finding a quantile of the variable $Z$, which follows the ***standard normal distribution***; in practice we use the ***Laplace function*** $\phi(x) = P(0 < Z < x)$.
Examples:
+ $z_b = z_{1 - \alpha}$ means: $\phi(z_b) = \frac{1}{2} - \alpha$
+ $z_b = z_{1 - \frac{\alpha}{2}}$ means: $\phi(z_b) = \frac{1 - \alpha}{2}$
***Review the [previous article]() if you’ve forgotten***
Looking up the ***Laplace table*** gives the result.
You can refer to the ***Laplace table*** ***[here](https://cdn.slidesharecdn.com/ss_thumbnails/banggiatrihamlaplace-150919075726-lva1-app6892-thumbnail-4.jpg?cb=1442649495)***.
***In exams, you will always be provided with a table of Laplace (standard normal) values.***
This article only covers the case where the ***variance is known***. For confidence interval estimation of ***proportions and variances***, see Chapter 4 of the lecture notes by ***Dr. Lê Xuân Lý - HUST*** [here](https://drive.google.com/drive/folders/1d-X3Q3-mjwma8H-KH_DeusLt7CiL2MFR).
## 4. Example
The revenue of a store is a random variable $X$ (million/month) with standard deviation 2 million/month. A random survey of 500 stores of similar scale yields an average revenue of 10 million/month. With 95% confidence, estimate the interval for the average revenue of stores of that scale.
***Solution:***
Define:
$$Z = \frac{\bar{X} - \mu}{\sigma_0}\sqrt{n}$$
From the formula above, the symmetric confidence interval for the mean $\mu$ is:
$$\bar{X} - \frac{\sigma_0}{\sqrt{n}}z_b < \mu < \bar{X} + \frac{\sigma_0}{\sqrt{n}}z_b$$
With $z_b = z_{1 - \frac{\alpha}{2}}$, $1-\alpha=95\%$, $\sigma_0=2$, $\bar{x} = 10$, $n = 500$.
Then: $\phi(z_b) = \frac{1 - \alpha}{2} = 0.475$, so $z_b = 1.96$ (from the ***Laplace table***).
***Substituting values, we get:***
$$9.825 < \mu < 10.175$$
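The same computation in code (a sketch assuming NumPy and SciPy; `norm.ppf` returns the standard normal quantile, which replaces the manual Laplace-table lookup):

```python
import numpy as np
from scipy.stats import norm

x_bar, sigma0, n = 10.0, 2.0, 500
confidence = 0.95
alpha = 1 - confidence

z_b = norm.ppf(1 - alpha / 2)          # quantile z_{1 - alpha/2}, approximately 1.96
eps = sigma0 / np.sqrt(n) * z_b        # estimation accuracy (half-width of the interval)

print(x_bar - eps, x_bar + eps)        # approximately (9.825, 10.175)
```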
## 5. References
+ ***[Lecture slides by Dr. Lê Xuân Lý - HUST](https://drive.google.com/drive/folders/1d-X3Q3-mjwma8H-KH_DeusLt7CiL2MFR)***
+ ***[Probability and Statistics textbook - Tống Đình Quy](https://drive.google.com/file/d/1Yw02kvncpFp6WiyP9kZWSEuWyvLL3sYO/view)***