---
title: MS8
tags: teach:MS
---
# Chapter 8
Estimation methods:
- Least squares estimation (already seen in regression)
- The method of moments
- The method of Maximum likelihood
## 8.4 The method of moments
### The method of moments
The $k$th moment of a probability law is defined as
$$\mu_k = E(X^k),$$
where $X$ is a random variable. If $X_i\stackrel{i.i.d.}{\sim}X$, the $k$th sample moment is defined as
$$\hat{\mu}_k = \frac{1}{n}\sum_{i=1}^n X_i^k.$$
:::success
**Three steps for a method of moments**
1. Calculate low order (population) moments.
2. Express the parameters in terms of the (population) moments.
3. Insert the sample moments into the expressions from step 2.
:::
### Example: $Bernoulli(p)$
1. $\mu_1=E(X) = p$
2. $p=\mu_1$.
3. $\hat{p}=\hat{\mu}_1$
### Example: Poisson distribution, $Poisson(\lambda)$
1. The first moment for the Poisson distribution is
$$E(X) = \lambda. $$
2. Express the parameters with moments:
$$\lambda = E(X)$$
3. Therefore, the method of moments estimate of $\lambda$ is $$\hat{\lambda}=\bar{X}.$$
### Example: Normal distribution, $N(\mu,\sigma^2)$
1. The first and second moments of the normal distribution are\begin{align*}
&\mu_1 = E(X) = \mu,\\
&\mu_2 = E(X^2) = \mu^2 + \sigma^2.
\end{align*}
2. Express parameters with moments:
\begin{eqnarray*}
\mu &=& E(X)=\mu_1,\\
\sigma^2 &=& E(X^2) - (E(X))^2=\mu_2-\mu_1^2.
\end{eqnarray*}
3. Estimate parameters with sample moments:
\begin{eqnarray*}
\hat{\mu} &=& \hat{\mu}_1 = \bar{X},\\
\hat{\sigma}^2 &=& \hat{\mu}_2 - \hat{\mu}_1^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2 =\frac{1}{n}\sum(X_i-\bar{X})^2.
\end{eqnarray*}
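As a quick illustration, here is a minimal Python sketch (using `numpy` with a small hypothetical sample `x`) that plugs the sample moments into the expressions above:
```python
import numpy as np

# Hypothetical i.i.d. sample assumed to come from N(mu, sigma^2)
x = np.array([2.1, -0.3, 1.4, 0.8, 2.9, 1.7])

mu1_hat = x.mean()                 # first sample moment
mu2_hat = np.mean(x**2)            # second sample moment

mu_hat = mu1_hat                   # method of moments estimate of mu
sigma2_hat = mu2_hat - mu1_hat**2  # equals (1/n) * sum((x - x.mean())**2)
```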
### Example: Gamma distribution, $Gamma(\alpha,\lambda)$
1. The first two moments of the gamma distribution are
\begin{align*}
&\mu_1 = \frac{\alpha}{\lambda},\\
&\mu_2 = \frac{\alpha}{\lambda^2}+(\frac{\alpha}{\lambda})^2=\frac{\alpha(\alpha+1)}{\lambda^2}.
\end{align*}
2. Plugging $\mu_1 = \frac{\alpha}{\lambda}$ into the formula for $\mu_2$, we have
$$\mu_2 = \frac{\alpha(\alpha+1)}{\lambda^2}=\mu_1^2 + \frac{\mu_1}{\lambda}.$$
Therefore, we have
$$\lambda = \frac{\mu_1}{\mu_2 - \mu_1^2},\quad \alpha = \frac{\mu_1^2}{\mu_2 - \mu_1^2}.$$
3. The method of moments estimates the parameters by
$$\hat{\lambda} = \frac{\hat{\mu}_1}{\hat{\mu}_2- \hat{\mu}_1^2},\quad
\hat{\alpha} = \frac{\hat{\mu}_1^2}{\hat{\mu}_2- \hat{\mu}_1^2}.$$
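A minimal Python sketch of this recipe, assuming a hypothetical sample `x` drawn from a gamma distribution:
```python
import numpy as np

# Hypothetical i.i.d. sample assumed to come from Gamma(alpha, lambda)
x = np.array([0.8, 1.9, 0.4, 2.7, 1.1, 0.6, 3.2, 1.5])

mu1_hat = x.mean()       # first sample moment
mu2_hat = np.mean(x**2)  # second sample moment

lambda_hat = mu1_hat / (mu2_hat - mu1_hat**2)
alpha_hat = mu1_hat**2 / (mu2_hat - mu1_hat**2)
```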
### Example: Angular Distribution
The angle $\theta$ at which electrons are emitted in muon decay has a distribution with the density
$$f(x|\alpha)=\frac{1+\alpha x}{2},\; -1\leq x\leq 1,\; -1\leq \alpha\leq 1,$$
where $x = \cos\theta$. Find the method of moment estimate of $\alpha$.
#### Sol.
1. Because
$$E[X] = \int_{-1}^{1} x\frac{1+\alpha x}{2}dx = \left[\frac{x^2}{4}+\frac{\alpha}{2}\frac{x^3}{3}\right]^{x=1}_{x=-1}=\frac{\alpha}{3},$$we have
$$\mu_1 = \frac{\alpha}{3}.$$
2. Express the parameter with the moment: $$\alpha=3\mu_1.$$
3. The method of moments estimates $\alpha$ by $$\hat{\alpha} = 3\hat{\mu}_1 = 3\bar{X}.$$
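A tiny Python sketch, assuming `x` holds hypothetical observed values of $\cos\theta$:
```python
import numpy as np

x = np.array([-0.4, 0.9, 0.1, 0.7, -0.8, 0.3])  # hypothetical values of cos(theta)
alpha_hat = 3 * x.mean()
# Note: 3 * x.mean() can fall outside [-1, 1] for small samples,
# in which case it is usually truncated to the parameter space.
```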
## 8.5 The method of maximum likelihood
:::success
**Two steps**
1. Write down the likelihood.
2. The MLEs are the arguments to maximize the likelihood.
:::
### The likelihood
- Suppose that random variables $X_1,\ldots,X_n$ have a joint density or frequency function $f(x_1,\ldots,x_n|\theta)$.
- Given observed values $X_i=x_i$, where $i=1,\ldots,n$, the likelihood of $\theta$, as a function of the data $x_1,\ldots,x_n$, is defined as
$$lik(\theta) = f(x_1,\ldots,x_n|\theta).$$
### The maximum likelihood estimate
#### Definition
- (Plain language) *The maximum likelihood estimate (mle)* of $\theta$ is that value of $\theta$ that maximizes the likelihood; that is, it makes the observed data "most probable" or "most likely".
- (Mathematical notations) $\hat{\theta}_{mle}$ is the mle of $\theta$, if
$$\hat{\theta}_{mle} = \arg\max_{\theta} lik(\theta).$$
### Using log-likelihood for numerical ease
- Denote the log-likelihood function $l(\theta)$ to be
$$l(\theta) = \log lik(\theta). $$
- Note that the logarithm is a strictly increasing function, so maximizing $l(\theta)$ is equivalent to maximizing $lik(\theta)$; it is often easier to work with $l(\theta)$.
- If the $X_i$ are assumed to be i.i.d., their joint density is the product of the marginal densities, and the likelihood is
$$lik(\theta) = \prod_{i=1}^n f(X_i|\theta).$$
- For an i.i.d. sample, the log-likelihood is
$$l(\theta) = \sum_{i=1}^n \log[f(X_i|\theta)].$$
### Example: $Bernoulli(p)$
$X_i\sim Bernoulli(p)$.
1. The likelihood is
$$lik(p)=\prod_{i=1}^n p^{X_i} (1-p)^{1-X_i}. $$
2. To maximize the likelihood, we maximize the log-likelihood:
$$l(p) = \log(p) \sum_{i=1}^n X_i+ \log(1-p) \sum_{i=1}^n(1-X_i).$$
The first-order-condition requires
$$\frac{d l(p)}{dp} = \frac{1}{p}\sum_{i=1}^nX_i + \frac{-1}{1-p}\sum_{i=1}^n (1-X_i) = 0.$$
Solving for $p$,
\begin{align*}
& (1-p)\sum X_i -p(n-\sum X_i)=0,\\
& p(\sum X_i+n - \sum X_i) = \sum X_i,\\
& \hat{p}_{mle} = \frac{\sum X_i}{n}=\bar{X}.
\end{align*}
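As a sanity check, one can maximize the log-likelihood numerically and compare with the closed form $\hat{p}_{mle}=\bar{X}$. A sketch using `scipy.optimize.minimize_scalar` on a hypothetical 0/1 sample `x`:
```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # hypothetical Bernoulli sample

def neg_loglik(p):
    # negative of l(p) = log(p)*sum(x) + log(1-p)*sum(1-x)
    return -(np.log(p) * x.sum() + np.log(1 - p) * (len(x) - x.sum()))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())  # the numerical maximizer agrees with p_hat = x_bar
```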
### Example: Poisson distribution
1. The pmf is
$$P(X=x)=\frac{\lambda^xe^{-\lambda}}{x!}.$$
The likelihood is
$$lik(\lambda)=\prod_{i=1}^n \frac{\lambda^{X_i}e^{-\lambda}}{X_i !} .$$
2. The log-likelihood is
\begin{eqnarray*}
l(\lambda) &=&\sum_{i=1}^n (X_i\log(\lambda)-\lambda -\log(X_i!))\\
&=& \log \lambda(\sum_{i=1}^n X_i) -n\lambda -\sum_{i=1}^n \log X_i!.
\end{eqnarray*}
To maximize the log-likelihood, the first-order-condition requires
$$l'(\lambda) = \frac{d}{d\lambda} l(\lambda) = \frac{1}{\lambda} \sum_{i=1}^n X_i - n = 0.$$
Therefore, $\hat{\lambda}_{mle} = \bar{X}$.
### Example: Normal distribution, $N(\mu,\sigma^2)$.
1. The pdf is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$The likelihood is
$$lik(\mu,\sigma) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{X_i-\mu}{\sigma}\right)^2\right).$$
2. The log-likelihood is
$$l(\mu,\sigma) = -n\log\sigma -\frac{n}{2}\log2\pi-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i-\mu)^2.$$To maximize the log-likelihood, the first-order-conditions require
\begin{eqnarray*}
\frac{\partial l(\mu,\sigma)}{\partial \mu} &=& \frac{1}{\sigma^2} \sum(X_i-\mu) = 0,\\
\frac{\partial l(\mu, \sigma)}{\partial \sigma} &=& \frac{-n}{\sigma}+\frac{1}{\sigma^3} \sum(X_i -\mu)^2 =0.
\end{eqnarray*}Therefore, the mle estimates for $\mu$ and $\sigma$ are
\begin{eqnarray*}
\hat{\mu}_{mle} &=& \bar{X},\\
\hat{\sigma}_{mle} &=& \sqrt{\frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^2}.
\end{eqnarray*}
### Example: Gamma distribution.
1. The density is
$$f(x|\alpha,\lambda) = \frac{1}{\Gamma(\alpha)}\lambda^{\alpha}x^{\alpha-1}e^{-\lambda x},\;0\leq x.$$
The likelihood is
$$lik(\alpha,\lambda) = \prod_{i=1}^n \frac{1}{\Gamma(\alpha)}\lambda^{\alpha}X_i^{\alpha-1}e^{-\lambda X_i}.$$
2. The log-likelihood is
\begin{eqnarray*}l(\alpha,\lambda)
& =& \sum_{i=1}^n \left(\alpha\log\lambda +(\alpha-1)\log X_i -\lambda X_i -\log \Gamma(\alpha)\right)\\
&=& n\alpha \log \lambda +(\alpha-1)\sum_{i=1}^n \log X_i-\lambda \sum X_i-n\log\Gamma(\alpha).
\end{eqnarray*}
To maximize the log-likelihood, the first-order-conditions are
\begin{eqnarray*}
\frac{\partial l(\alpha, \lambda)}{\partial \alpha} &=& n\log(\lambda) + \sum_{i=1}^n \log X_i - n\frac{\Gamma'(\alpha)}{\Gamma(\alpha)}=0,\\
\frac{\partial l(\alpha, \lambda)}{\partial \lambda } &=& \frac{n\alpha }{\lambda} - \sum{X_i}=0.
\end{eqnarray*}
From the second equation, $$\hat{\lambda} =\frac{n\hat{\alpha}}{\sum X_i} = \frac{\hat{\alpha}}{\bar{X}}.$$
Plugging $\hat{\lambda}$ into the first equation of the first-order-conditions gives
$$n\log\hat{\alpha} - n\log\bar{X} +\sum \log X_i - n\frac{\Gamma'(\hat{\alpha})}{\Gamma(\hat{\alpha})}=0.$$
- We need numerical procedures to solve this equation; there are no explicit formulas for the mle of $\alpha$ and $\lambda$.
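For instance, a minimal Python sketch (with a hypothetical sample `x` and an ad hoc root bracket) that solves the equation above for $\hat{\alpha}$ using `scipy`'s digamma function $\Gamma'(\alpha)/\Gamma(\alpha)$ and a root finder, then sets $\hat{\lambda}=\hat{\alpha}/\bar{X}$:
```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma  # digamma(a) = Gamma'(a) / Gamma(a)

x = np.array([0.8, 1.9, 0.4, 2.7, 1.1, 0.6, 3.2, 1.5])  # hypothetical gamma sample
xbar = x.mean()
mean_log = np.mean(np.log(x))

def profile_score(alpha):
    # the first-order condition above, divided by n:
    # log(alpha) - log(xbar) + mean(log x) - digamma(alpha) = 0
    return np.log(alpha) - np.log(xbar) + mean_log - digamma(alpha)

alpha_hat = brentq(profile_score, 1e-6, 100.0)  # ad hoc bracket; widen if no sign change
lambda_hat = alpha_hat / xbar
```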
### Example (d) Muon Decay
- Recall the density:
$$f(x|\alpha) = \frac{1+\alpha x}{2},\; -1\leq x\leq 1,\;-1\leq \alpha\leq 1.$$
- The log-likelihood is
$$l(\alpha) = \sum_{i=1}^n \log(1+\alpha X_i) -n\log 2.$$
- The first-order-condition is
$$\frac{\partial l(\alpha)}{\partial \alpha} =\sum_{i=1}^n \frac{X_i}{1+\alpha X_i} = 0.$$
- Find $\hat{\alpha}$ satisfying
$$\sum \frac{X_i}{1+\hat{\alpha}X_i} = 0$$
using numerical procedures. (No closed-form formula exists for the mle.)
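A minimal sketch of such a numerical procedure, assuming `x` holds hypothetical values of $\cos\theta$; it brackets the root of the score inside the parameter space $(-1,1)$:
```python
import numpy as np
from scipy.optimize import brentq

x = np.array([-0.4, 0.9, 0.1, 0.7, -0.8, 0.3])  # hypothetical values of cos(theta)

def score(alpha):
    return np.sum(x / (1 + alpha * x))  # derivative of the log-likelihood

# The score is strictly decreasing in alpha, so a sign change in (-1, 1)
# pins down the mle; otherwise the maximum sits on the boundary.
alpha_hat = brentq(score, -0.999, 0.999)
```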
### Definition of consistent estimate
Let $\hat{\theta}_n$ be an estimator of a parameter $\theta$ based on a sample of size $n$.
Then $\hat{\theta}_n$ is said to be consistent in probability if
- (plain language) $\hat{\theta}_n$ converges in probability to $\theta$ as $n$ approaches infinity;
- (mathematical notation)
$$\hat{\theta}_n\stackrel{p}{\rightarrow}\theta,\quad n\rightarrow \infty.$$
- (mathematical definition) that is, for any $\varepsilon>0$,
$$P\left(|\hat{\theta}_n-\theta|>\varepsilon \right)\rightarrow 0\;\mbox{as}\;n\rightarrow\infty. $$
### Unbiased estimator
If $E(\hat{\theta}_n) = \theta$ for all $\theta$, we say $\hat{\theta}_n$ is an unbiased estimator of $\theta$.
### Example: Consider $X_i\sim N(\mu, \sigma^2)$.
- Is $\bar{X}$ unbiased?
- Is $S^2$ unbiased?
- Show that $\hat{\sigma}_{mle}^2$ is biased but consistent.
#### Solution
- Because $E[\bar{X}]=\mu$, $\bar{X}$ is an unbiased estimator of $\mu$.
- Because $\frac{(n-1)S^2}{\sigma^2}\sim \chi^2_{n-1}$, we have $E\left(\frac{(n-1)S^2}{\sigma^2}\right)=n-1$, so $E(S^2)=\sigma^2$. Hence, $S^2$ is an unbiased estimator of $\sigma^2$.
- On the other hand, $\hat{\sigma}_{mle}^2$ is biased, since $E(\hat{\sigma}_{mle}^2)=\frac{n-1}{n}\sigma^2\neq\sigma^2$, but it is consistent:
$$\hat{\sigma}_{mle}^2=\frac{1}{n}\sum(X_i-\bar{X})^2 =\frac{n-1}{n}\times \frac{1}{n-1}\sum(X_i-\bar{X})^2 =\frac{n-1}{n}S^2\stackrel{p}{\rightarrow}\sigma^2, $$
because $\frac{n-1}{n}\rightarrow 1$ and $S^2\stackrel{p}{\rightarrow}\sigma^2$.
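A small simulation sketch (hypothetical values $\mu=5$, $\sigma^2=4$) illustrating that $\hat{\sigma}^2_{mle}$ is biased downward for small $n$ but approaches $\sigma^2$ as $n$ grows, while $S^2$ is unbiased for every $n$:
```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 5.0, 4.0  # hypothetical true parameters

for n in (10, 100, 1000):
    samples = rng.normal(mu, np.sqrt(sigma2), size=(10_000, n))
    sigma2_mle = samples.var(axis=1, ddof=0)  # divides by n (the mle)
    s2 = samples.var(axis=1, ddof=1)          # divides by n - 1 (unbiased)
    print(n, sigma2_mle.mean(), s2.mean())    # mle average is about (n-1)/n * 4
```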
:::success
Course stops here! 2021/12/21 :heart:
:::
### Theorem A.
Under appropriate smoothness conditions on $f$, the mle from an i.i.d. sample is consistent.
### A sketch of the proof (skipped).
Idea of the proof: $$\hat{\theta}_{MLE}\stackrel{p}{\rightarrow} \tilde{\theta}_{MLE}=\theta_0.$$
- Let $\theta_0$ be the true parameter.
- By the law of large numbers, we have
$$\frac{1}{n}l(\theta)=\frac{1}{n}\sum_{i=1}^n \log f(X_i|\theta) \stackrel{p}{\rightarrow} E[\log(f(X|\theta))].$$
- Let $\hat{\theta}_{MLE}$ be the maximizer of $l(\theta)$ and $\tilde{\theta}_{MLE}$ be the maximizer of $E[\log(f(X|\theta))]$.
- The maximizer of $l(\theta)$ is then close to the maximizer of $E[\log f(X|\theta)]$, i.e., $$\hat{\theta}_{MLE}\stackrel{p}{\rightarrow} \tilde{\theta}_{MLE}.$$
- Now, we will show that $\theta_0$ maximizes $E[\log f(X|\theta)]$: To maximize $E[\log f(X|\theta)]$, the first-order-condition is $$\frac{\partial }{\partial \theta }E[\log f(X|\theta)] = \frac{\partial }{\partial \theta}\int \log f(x|\theta) f(x|\theta_0) dx = \int \frac{f'(x|\theta)}{f(x|\theta)}f(x|\theta_0) dx.$$Here, the prime denotes differentiation with respect to $\theta$.
- If $\theta = \theta_0$, the last expression becomes
$$\int \frac{f'(x|\theta_0)}{f(x|\theta_0)}f(x|\theta_0) dx
= \int f'(x|\theta_0) dx
= \left.\frac{\partial}{\partial \theta}\int f(x|\theta) dx\right|_{\theta=\theta_0}
=\frac{\partial}{\partial \theta} (1) = 0.
$$
- This means that $\theta_0$ is a stationary point and hopefully a maximum.
- This explains $\hat{\theta}_{mle}\stackrel{p}{\rightarrow} \theta_0$.
### Lemma A
- Define $I(\theta)$ by
$$I(\theta) = E\left[\frac{\partial}{\partial \theta}\log f(X|\theta)\right]^{2}.$$
- Under appropriate smoothness conditions on $f$, $I(\theta)$ may also be expressed as
$$I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2}\log f(X|\theta)\right].$$
#### Proof (skipped)
(The proof is rather tricky. Assume that integration and differentiation can be interchanged.)
- First, because $\int f(x|\theta) dx =1$, we have $\frac{\partial}{\partial \theta} \int f(x|\theta) dx = 0$.
- Second, note that
$$\frac{\partial }{\partial \theta} f(x|\theta) = \left[\frac{\partial}{\partial \theta}\log f(x|\theta)\right]f(x|\theta).$$
- Third,
$$0 = \frac{\partial}{\partial\theta}\int f(x|\theta)dx = \int\frac{\partial}{\partial \theta}f(x|\theta)dx = \int \left[\frac{\partial}{\partial \theta}\log f(x|\theta)\right]f(x|\theta)dx. $$
- Therefore, we have
$$\frac{\partial }{\partial\theta}\int \left[\frac{\partial}{\partial \theta}\log f(x|\theta)\right]f(x|\theta)dx = 0.$$
- By the product rule of differentiation, the left-hand-side equals
\begin{eqnarray*}
&&\int \left[\frac{\partial^2}{\partial\theta^2}\log f(x|\theta)\right]f(x|\theta)dx + \int \frac{\partial}{\partial\theta}\log f(x|\theta) \frac{\partial}{\partial\theta} f(x|\theta)dx\\
&=&\int \left[\frac{\partial^2}{\partial\theta^2}\log f(x|\theta)\right]f(x|\theta)dx + \int \left[\frac{\partial}{\partial\theta}\log f(x|\theta) \right]^2 f(x|\theta)dx=0.
\end{eqnarray*}
- As a result, we have
$$-E\left[\frac{\partial^2}{\partial\theta^2}\log f(X|\theta)\right] = E\left[\frac{\partial}{\partial\theta}\log f(X|\theta)\right]^2. $$
### Large sample distribution of an mle
The large sample distribution of an mle is approximately normal with mean $\theta_0$ and variance $1/(nI(\theta_0))$. Since this is merely a limiting result, which holds as the sample size tends to infinity, we say that the mle is asymptotically unbiased and refer to the variance of the limiting normal distribution as the asymptotic variance of the mle.
- (plain language) Under smoothness conditions on $f$, the probability distribution of $\sqrt{nI(\theta_0)}(\hat{\theta}-\theta_0)$ tends to a standard normal distribution.
- (mathematical notation) Under smoothness conditions on $f$, we have
$$\frac{\hat{\theta}-\theta_0}{\sqrt{\frac{1}{nI(\theta_0)}}}\stackrel{d}{\rightarrow}Z,$$
where $Z\sim N(0,1)$.
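A small simulation sketch of this result for the Poisson mle $\hat{\lambda}=\bar{X}$ (hypothetical $\lambda_0=3$, $n=200$): the standardized quantity $\sqrt{nI(\lambda_0)}(\hat{\lambda}-\lambda_0)$ should have mean close to 0 and variance close to 1.
```python
import numpy as np

rng = np.random.default_rng(1)
lam0, n, reps = 3.0, 200, 5_000  # hypothetical true value, sample size, replications

lam_hat = rng.poisson(lam0, size=(reps, n)).mean(axis=1)  # mle in each replication
z = np.sqrt(n * (1.0 / lam0)) * (lam_hat - lam0)          # I(lambda) = 1/lambda (Poisson example below)
print(z.mean(), z.var())  # close to 0 and 1 if the normal approximation holds
```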
### Confidence interval from mle
Let $\hat{\theta}$ be an mle. Using the asymptotic normality of the mle, an approximate $100(1-\alpha)\%$ confidence interval for $\theta$ is
$$\hat{\theta}\pm z_{\alpha/2}\sqrt{\frac{1}{nI(\hat{\theta})}},$$
where the unknown $\theta_0$ in the asymptotic variance is replaced by $\hat{\theta}$.
### Example: $N(\mu,\sigma^2)$
- Recall that the mles of $\mu$ and $\sigma^2$ from an i.i.d. normal sample are
\begin{align*}
&\hat{\mu} =\bar{X},\\
& \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2.
\end{align*}
- Find confidence intervals for $\mu$ and $\sigma^2$.
- Recall that $\frac{\sqrt{n}(\bar{X}-\mu)}{S}\sim t_{n-1}$.
- We have
$$P\left(-t_{n-1}(\alpha/2)\leq \frac{\sqrt{n}(\bar{X}-\mu)}{S}\leq t_{n-1}(\alpha/2)\right)=1-\alpha.$$
We therefore have
$$P\left(\bar{X}-\frac{S}{\sqrt{n}}t_{n-1}(\alpha/2)\leq \mu \leq \bar{X}+\frac{S}{\sqrt{n}}t_{n-1}(\alpha/2) \right)=1-\alpha.$$
The $100(1-\alpha)\%$ CI for $\mu$ is
$$\bar{X}\pm \frac{S}{\sqrt{n}}t_{n-1}(\alpha/2). $$
- Recall that $\frac{n\hat{\sigma}^2}{\sigma^2} =\frac{\sum(X_i-\bar{X})^2}{\sigma^2}\sim \chi^2_{n-1}$. We have
$$P\left( \chi^2_{n-1}(1-\frac{\alpha}{2}) \leq \frac{n\hat{\sigma}^2}{\sigma^2} \leq \chi^2_{n-1}(\alpha/2)\right)=1-\alpha.$$
- Therefore,
$$P\left(\frac{n\hat{\sigma}^2}{\chi^2_{n-1}(\alpha/2)}\leq \sigma^2 \leq \frac{n\hat{\sigma}^2}{\chi^2_{n-1}(1-\alpha/2)}\right) = 1-\alpha.$$
- The $100(1-\alpha)\%$ C.I. for $\sigma^2$ is
$$\left(\frac{n\hat{\sigma}^2}{\chi^2_{n-1}(\alpha/2)} ,\frac{n\hat{\sigma}^2}{\chi^2_{n-1}(1-\alpha/2)} \right).$$
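These two intervals can be computed directly with `scipy.stats`; a sketch with a hypothetical sample `x` and $\alpha=0.05$ (quantiles follow the upper-tail convention used in these notes):
```python
import numpy as np
from scipy.stats import t, chi2

x = np.array([4.1, 5.3, 3.8, 6.0, 4.9, 5.6, 4.4, 5.1])  # hypothetical normal sample
n, alpha = len(x), 0.05
xbar, s = x.mean(), x.std(ddof=1)
sigma2_mle = x.var(ddof=0)  # so n * sigma2_mle = sum((x - xbar)**2)

# CI for mu: xbar +/- t_{n-1}(alpha/2) * s / sqrt(n)
tq = t.ppf(1 - alpha / 2, df=n - 1)
ci_mu = (xbar - tq * s / np.sqrt(n), xbar + tq * s / np.sqrt(n))

# CI for sigma^2: (n*sigma2_mle / chi2_{n-1}(alpha/2), n*sigma2_mle / chi2_{n-1}(1-alpha/2))
q_upper = chi2.ppf(1 - alpha / 2, df=n - 1)  # chi^2_{n-1}(alpha/2) in the notes' notation
q_lower = chi2.ppf(alpha / 2, df=n - 1)      # chi^2_{n-1}(1 - alpha/2)
ci_sigma2 = (n * sigma2_mle / q_upper, n * sigma2_mle / q_lower)
```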
### Example (b) Poisson Distribution
- Recall that the mle for $\lambda$ is $\hat{\lambda}=\bar{X}$.
- Note that the pmf is $f(x|\lambda) = e^{-\lambda}\lambda^x/x!$, hence
\begin{eqnarray*}
\log f(X|\lambda) &=& -\lambda + X\log \lambda - \log X!,\\
\frac{\partial \log f(X|\lambda) }{\partial\lambda} &=& -1+\frac{X}{\lambda},\\
\frac{\partial^2 \log f(X|\lambda) }{\partial\lambda^2} &=&-\frac{X}{\lambda^2}.
\end{eqnarray*}
- Hence,
$$I(\lambda) = -E\left[\frac{\partial^2 \log f(X|\lambda) }{\partial\lambda^2}\right] = \frac{E(X)}{\lambda^2}=\frac{\lambda}{\lambda^2}=\frac{1}{\lambda}.$$
Recall that the mle has the asymptotic distribution: $\frac{\hat{\theta}-\theta_0}{\sqrt{\frac{1}{nI(\theta_0)}}}\stackrel{d}{\rightarrow} Z$.
- We therefore have
$$\frac{\bar{X} - \lambda}{\sqrt{\lambda/n}}\stackrel{d}{\rightarrow} Z,$$
and, replacing $\lambda$ in the denominator by its estimate $\bar{X}$, approximately
$$\frac{\bar{X} - \lambda}{\sqrt{\bar{X}/n}}\stackrel{d}{\rightarrow} Z.$$
- As a result, the approximate $100(1-\alpha)\%$ C.I. for $\lambda$ is
$$\bar{X}\pm z_{\alpha/2}\sqrt{\frac{\bar{X}}{n}}.$$
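A minimal sketch of this interval with hypothetical Poisson counts `x`:
```python
import numpy as np
from scipy.stats import norm

x = np.array([2, 0, 3, 1, 4, 2, 1, 3, 2, 0])  # hypothetical Poisson counts
n, alpha = len(x), 0.05
lam_hat = x.mean()
z = norm.ppf(1 - alpha / 2)  # z_{alpha/2}
ci = (lam_hat - z * np.sqrt(lam_hat / n), lam_hat + z * np.sqrt(lam_hat / n))
```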
## 8.7 Efficiency and the Cramér-Rao Lower Bound
### Efficiency and the Cramér-Rao Lower Bound
Given two estimates, $\hat{\theta}_1$ and $\hat{\theta}_2$, of $\theta$, $\hat{\theta}_1$ is more efficient than $\hat{\theta}_2$ if $Var(\hat{\theta}_1)<Var(\hat{\theta}_2)$.
### Theorem A. Cramér-Rao Inequality
Let $X_1,\ldots,X_n$ be i.i.d. with density function $f(x|\theta)$. Let $T=t(X_1,\ldots,X_n)$ be an unbiased estimate of $\theta$. Then, under smoothness assumptions on $f(x|\theta)$,
$$Var(T)\geq \frac{1}{nI(\theta)}.$$
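For example, for the Poisson distribution we computed $I(\lambda)=1/\lambda$ above, so the bound is $1/(nI(\lambda))=\lambda/n$. Since the unbiased estimate $\bar{X}$ has
$$Var(\bar{X}) = \frac{\lambda}{n} = \frac{1}{nI(\lambda)},$$
$\bar{X}$ attains the Cramér-Rao lower bound; no unbiased estimate of $\lambda$ has smaller variance.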
## 8.8 Sufficiency
### Sufficiency
- The reason to study sufficient statistics ($T=T(X_1,\ldots,X_n)$) is that a sufficient statistic has some good properties:
  - The mle is a function of $T$ (Corollary A below).
  - Given any estimator, conditioning on $T$ produces a new one with MSE no larger than the original (Rao-Blackwell Theorem).
- So, how do we define sufficiency? The conditional distribution of the data given the sufficient statistic does not involve the parameter.
- How do we find sufficient statistics? By the factorization theorem.
### Definition.
A statistic $T(X_1,\ldots,X_n)$ is said to be sufficient for $\theta$ if the conditional distribution of $X_1,\ldots,X_n$, given $T=t$, does not depend on $\theta$ for any value of $t$.
### Example
Let $X_1,\ldots,X_n$ be a sequence of independent Bernoulli random variables with $P(X_i=1)=\theta$. Verify that $T=\sum_{i=1}^n X_i$ is sufficient for $\theta$.
#### Sol.
:::info
Hint:
- Note that $T$ is short for $T(X_1,\ldots,X_n)$.
- To show that $T$ is sufficient, we need to show:
$$P(X_1=x_1,\ldots,X_n=x_n|T=t)$$ does not depend on $\theta$.
:::
We verify the following, for any $(x_1,\ldots,x_n)$ with $\sum_{i=1}^n x_i = t$:
\begin{eqnarray*}
&& P(X_1=x_1,X_2=x_2,\ldots,X_n=x_n|T=t)\\
&=& \frac{P(X_1=x_1,X_2=x_2,\ldots,X_n=x_n,T=t)}{P(T=t)}\\
& =&\frac{\theta^t(1-\theta)^{n-t}}{C^n_t \theta^t(1-\theta)^{n-t}}\\
&=&\frac{1}{C^{n}_t},\end{eqnarray*}
which does not depend on $\theta$.
### A factorization theorem
A necessary and sufficient condition for $T(X_1,\ldots,X_n)$ to be sufficient for a parameter $\theta$ is that the joint probability function (density function or frequency function) factors in the form
$$f(x_1,\ldots,x_n|\theta) = g\left(T(x_1,\ldots,x_n),\theta\right)h(x_1,\ldots,x_n).$$
#### Proof
($\Leftarrow$: If $f$ can be factorized, $T$ is sufficient.) For notational simplicity, we denote $T=T(X_1,\ldots,X_n)=T(X)$, $X=(X_1,\ldots,X_n)$, and $x=(x_1,\ldots,x_n)$.
If $f(x|\theta) = g\left(T,\theta\right)h(x)$, then
$$P(T=t) = \sum_{T(x)=t}P(X=x) = \sum_{T(x) = t}g(t,\theta)h(x) =g(t,\theta) \sum_{T(x) = t} h(x).
$$Therefore,
$$P(X=x |T=t) = \frac{P(X=x,T=t)}{P(T=t)} = \frac{g(t,\theta) h(x)}{g(t,\theta) \sum_{T(x) = t} h(x)}
=\frac{h(x)}{\sum_{T(x) = t} h(x)}$$
does not depend on $\theta$.
($\Rightarrow$: If $T$ is sufficient, $f$ can be factorized.)
Let
\begin{eqnarray*}
g(t,\theta) &=& P(T=t|\theta),\\
h(x) &= &P(X=x|T=t). \quad(\mbox{This is because $T$ is sufficient.})
\end{eqnarray*}
We then have, writing $t=T(x)$,
$$P(X=x|\theta) = P(T=t|\theta)P(X=x|T=t) = g(t,\theta)h(x).$$
### Example. Bernoulli random variable.
We can decompose the joint pmf of Bernoulli random variables:
\begin{eqnarray*}
f(x_1,\ldots,x_n|\theta) &=& \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i} \\
&=& \theta^{\sum x_i}(1-\theta)^{n-\sum x_i} \\
&=& \left(\frac{\theta}{1-\theta}\right)^{\sum x_i}(1-\theta)^n\\
&=& g(\sum x_i, \theta) h(x_1,\ldots,x_n),
\end{eqnarray*}
where
\begin{eqnarray*}
h(x_1,\ldots,x_n)&=&1, \\
g(t, \theta) &=& \left(\frac{\theta}{1-\theta}\right)^t(1-\theta)^n.
\end{eqnarray*}
Hence, the sufficient statistic is $\sum X_i$.
### Example: Normal random variable.
We decompose the joint density as follows:
\begin{eqnarray*}
f(x|\mu,\sigma) &=& \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2\sigma^2}(x_i-\mu)^2\right]\\
&=& \frac{1}{\sigma^n(2\pi)^{n/2}}\exp\left[- \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right]\\
&=& \frac{1}{\sigma^n(2\pi)^{n/2}}\exp\left[- \frac{1}{2\sigma^2}(\sum x_i^2 -2 \mu\sum x_i + n\mu^2)\right].\end{eqnarray*}
Hence, $\sum X_i$ and $\sum X_i^2$ are jointly sufficient for $(\mu,\sigma^2)$.
### A $k$-parameter member of the exponential family
A $k$-parameter member of the exponential family has a density or frequency function of the form
$$f(x|\theta) = \exp\left[\sum_{i=1}^k c_i(\theta)T_i(x)+d(\theta)+S(x)\right],\quad x\in A,$$
where the set $A$ does not depend on $\theta$. A one-parameter member of the exponential family is
$$f(x|\theta) = \exp\left[c(\theta)T(x)+d(\theta)+S(x)\right],\quad x\in A,$$
where, again, the set $A$ does not depend on $\theta$.
### Example. Bernoulli distribution.
By the decomposition for the pmf of the Bernoulli distribution,
\begin{eqnarray*}
P(X=x) & = & \theta^x (1-\theta)^{1-x} \\
&=& \exp\left[x\log\left(\frac{\theta}{1-\theta}\right)+\log(1-\theta)\right].
\end{eqnarray*}
Here $c(\theta)=\log\frac{\theta}{1-\theta}$, $T(x)=x$, and $d(\theta)=\log(1-\theta)$. By the factorization theorem, for an i.i.d. sample from a one-parameter exponential family the statistic $\sum_{i=1}^n T(X_i)$ is sufficient; since $T(x)=x$, $\sum X_i$ is a sufficient statistic.
### Corollary A.
If $T$ is sufficient for $\theta$, the maximum likelihood estimate is a function of $T$.
#### Sketch of the proof.
For instance, for a one-parameter exponential family, the log-likelihood of an i.i.d. sample is $c(\theta)\sum_{i=1}^n T(X_i)+nd(\theta)+\sum_{i=1}^n S(X_i)$, so the mle satisfies
$$c'(\theta)\sum_{i=1}^n T(X_i) +nd'(\theta)=0.$$
Hence, $\hat{\theta}_{mle}$ is a function of $\sum_{i=1}^n T(X_i)$, which is the sufficient statistic. More generally, because $lik(\theta)=g(T,\theta)h(x)$ by the factorization theorem, the maximizer over $\theta$ depends on the data only through $T$.
### Rao-Blackwell Theorem
Let $\hat{\theta}$ be an estimator of $\theta$ with $E(\hat{\theta}^2)<\infty$ for all $\theta$. Suppose that $T$ is sufficient for $\theta$, and let $\tilde{\theta} = E(\hat{\theta}|T)$. Then, for all $\theta$,
$$E(\tilde{\theta}-\theta)^2 \leq E(\hat{\theta}-\theta)^2.$$
The inequality is strict unless $\hat{\theta} = \tilde{\theta}$.
#### Proof.
1. Recall that $$MSE = E(\hat{\theta}-\theta)^2 = E\left[(\hat{\theta}-E(\hat{\theta}))+(E(\hat{\theta})-\theta)\right]^2 = Var(\hat{\theta})+Bias(\hat{\theta})^2.$$
Because $$E[\tilde{\theta}] = E[E[\hat{\theta}|T]]=E[\hat{\theta}],$$
to compare the MSE for $\hat{\theta}$ and $\tilde{\theta}$, we only need to compare their variances.
2. Recall the formula:
\begin{equation*}
Var(\hat{\theta}) = Var(E(\hat{\theta}|T)) + E(Var(\hat{\theta}|T)).
\end{equation*}We therefore have
\begin{equation*}
Var(\hat{\theta}) = Var(\tilde{\theta}) + E(Var(\hat{\theta}|T)).
\end{equation*}
Because $E(Var(\hat{\theta}|T))\geq 0$, we have
$$Var(\hat{\theta}) \geq Var(\tilde{\theta}).$$
Equality holds only when $Var(\hat{\theta}|T)=0$, i.e., when $\hat{\theta}$ is a function of $T$; in that case $\hat{\theta}=\tilde{\theta}$.
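A small simulation sketch of the theorem for Bernoulli data (hypothetical $\theta=0.3$, $n=20$): start from the crude unbiased estimator $\hat{\theta}=X_1$; conditioning on the sufficient statistic $T=\sum X_i$ gives $\tilde{\theta}=E(X_1|T)=T/n=\bar{X}$, which should have a much smaller MSE.
```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 0.3, 20, 20_000  # hypothetical true value, sample size, replications

x = rng.binomial(1, theta, size=(reps, n))
theta_crude = x[:, 0].astype(float)  # unbiased but wasteful: uses only X_1
theta_rb = x.mean(axis=1)            # E(X_1 | T) = T / n, the Rao-Blackwellized estimator

print(np.mean((theta_crude - theta) ** 2),  # about theta*(1-theta) = 0.21
      np.mean((theta_rb - theta) ** 2))     # about theta*(1-theta)/n = 0.0105
```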