# Parameter Estimation

###### tags: `cvdl2020`

## Preface

There is much theory about parameter estimation, dealing with properties such as the bias and variance of the estimate. This theory is based on analysis of the **probability density functions** defined on the measurement and parameter spaces. In this appendix, we discuss such topics as the bias of an estimator, its variance, the Cramer-Rao lower bound on the variance, and the posterior distribution.

## Maximum Likelihood estimate

An estimate of the parameter vector $\theta$ given a measured value $x$ is a function denoted $\hat \theta(x)$, assigning a parameter vector $\theta$ to a measurement $x$. The maximum likelihood (ML) estimate is given by

$$\hat \theta_{ML} = \arg \max_{\theta} p(x|\theta)$$

## Bias

A desirable property of an estimator is that it can be expected to give the right answer on average. Given a parameter $\theta$, or equivalently in our case a point on the world line, we consider all possible measurements $x$ and from them re-estimate the parameter $\theta$, namely $\hat \theta (x)$. The estimator is called **unbiased** if on average we obtain the original parameter $\theta$ (the true value). In forming this average, we weight the measurements $x$ according to their probability. More formally, the bias of the estimator is defined as

$$E_{\theta}[\hat \theta (x)] - \theta = \int_{x} p(x|\theta) \hat \theta (x) \, dx - \theta$$

The estimator is unbiased if

$$E_{\theta}[\hat \theta (x)] = \theta, \ \forall \theta$$

## Variance

Another important attribute of an estimator is its variance. Consider an experiment being repeated many times with the same model parameters, but a different instantiation of the noise at each trial. Applying our estimator to the measured data, we obtain an estimate for each of these trials. The variance of the estimator is the variance (or covariance matrix) of the estimated values. More precisely, for an estimation problem involving a single parameter we can define the variance as

$$Var_{\theta}(\hat \theta) = E_{\theta}\big[(\hat \theta (x) - E_{\theta}[\hat \theta (x)])^2\big]$$

In the case where the parameters $\theta$ form a vector, $Var_{\theta}(\hat \theta)$ is the covariance matrix

$$Var_{\theta}(\hat \theta) = E_{\theta}\big[(\hat \theta (x) - E_{\theta}[\hat \theta (x)])(\hat \theta (x) - E_{\theta}[\hat \theta (x)])^T\big]$$

Sometimes we are more interested in the variability of the estimate with respect to the original parameter $\theta$, which is the mean-squared error of the estimator. It is related to the variance and the bias by

$$E_{\theta}[(\hat \theta (x) - \theta)(\hat \theta (x) - \theta)^T] = Var_{\theta}(\hat \theta) + bias(\hat \theta)\,bias(\hat \theta)^T$$

## The Cramer-Rao lower bound

It is evident that by adding noise to a set of measurements, information is lost. Consequently, it is not to be expected that any estimator can have zero bias and variance in the presence of noise on the measurements. For unbiased estimators, this notion is formalized by the **Cramer-Rao lower bound**, which bounds from below the variance of any unbiased estimator.

### Statement

Given a probability distribution $p(x|\theta)$, the Fisher score is defined as $V_{\theta}(x) = \partial_{\theta} \log p(x|\theta)$.
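As a quick sanity check, here is a minimal sketch (my own example, not from the text, assuming a scalar Gaussian measurement model $x \sim \mathcal{N}(\theta, \sigma^2)$ with known $\sigma$) that evaluates the Fisher score numerically and verifies that it has zero mean under $p(x|\theta)$:

```python
import numpy as np

# Assumed toy model (not from the text): x ~ N(theta, sigma^2) with known sigma.
# log p(x|theta) = -0.5*log(2*pi*sigma^2) - (x - theta)^2 / (2*sigma^2)
# so the Fisher score is V_theta(x) = d/dtheta log p(x|theta) = (x - theta) / sigma^2.

rng = np.random.default_rng(0)
theta_true, sigma = 2.0, 0.5

def fisher_score(x, theta, sigma):
    """Fisher score V_theta(x) for the Gaussian model above."""
    return (x - theta) / sigma**2

x = rng.normal(theta_true, sigma, size=100_000)

# The score has zero mean under p(x|theta): E_theta[V_theta(x)] = 0.
print(np.mean(fisher_score(x, theta_true, sigma)))     # ~ 0
# Its second moment is 1/sigma^2 = 4 for this model.
print(np.mean(fisher_score(x, theta_true, sigma)**2))  # ~ 4
```

The empirical second moment of the score, here $\approx 1/\sigma^2$, is exactly the Fisher information defined next.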
The Fisher Information Matrix is defined to be

$$F(\theta) = E_{\theta}[V_{\theta}(x)V_{\theta}(x)^T] = \int_{x} V_{\theta}(x)V_{\theta}(x)^T p(x|\theta) \, dx$$

:::success
For an unbiased estimator $\hat \theta(x)$,

$$\det\left(E_{\theta}[(\hat \theta - \theta)(\hat \theta - \theta)^T]\right) \geq 1/\det F(\theta)$$
:::

## The posterior distribution

An alternative to the ML estimate is to consider the probability distribution for the parameters, given the measurements, namely $p(\theta|x)$. This is known as the **posterior distribution**, namely **the distribution for the parameters after the measurements have been taken**. To compute it, we need a **prior distribution** $p(\theta)$ for the parameters **before** any measurements have been taken. The posterior distribution can then be computed by **Bayes' law**

$$p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)}$$

Since the measurement $x$ is fixed, so is its probability $p(x)$, so we may ignore it, leading to $p(\theta|x) \propto p(x|\theta)p(\theta)$. Notice that the maximum of the posterior distribution is known as the **Maximum A Posteriori (MAP) estimate**, $\hat \theta_{MAP} = \arg \max_{\theta} p(\theta|x) = \arg \max_{\theta} p(x|\theta)p(\theta)$.
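To tie the bias, the variance, the Cramer-Rao bound, and the MAP estimate together, here is a minimal sketch (again my own example, assuming $N$ i.i.d. Gaussian measurements of a scalar $\theta$ and a Gaussian prior; the closed-form MAP expression is the standard conjugate-Gaussian result, not something stated above). It compares the empirical variance of the ML estimate with the Cramer-Rao bound $\sigma^2/N$ and contrasts the ML and MAP estimates:

```python
import numpy as np

# Assumed toy model (not from the text): x_i ~ N(theta, sigma^2), i = 1..N, known sigma.
rng = np.random.default_rng(1)
theta_true, sigma, N = 2.0, 0.5, 10   # true parameter, noise std, measurements per trial
trials = 20_000                       # repeated experiments, new noise each trial

# ML estimate for this model: the sample mean (an unbiased estimator).
x = rng.normal(theta_true, sigma, size=(trials, N))
theta_ml = x.mean(axis=1)

print("bias of ML estimate    :", theta_ml.mean() - theta_true)  # ~ 0
print("variance of ML estimate:", theta_ml.var())                # ~ sigma^2 / N
print("Cramer-Rao lower bound :", sigma**2 / N)                  # 1/F(theta), F(theta) = N/sigma^2

# MAP estimate with a Gaussian prior theta ~ N(mu0, tau^2):
# maximizing p(x|theta) p(theta) gives a precision-weighted average of the
# sample mean and the prior mean (standard conjugate-Gaussian result).
mu0, tau = 0.0, 1.0
w = (N / sigma**2) / (N / sigma**2 + 1 / tau**2)
theta_map = w * theta_ml + (1 - w) * mu0
print("mean MAP estimate      :", theta_map.mean())  # pulled toward the prior mean mu0
```

In this model the ML estimate attains the Cramer-Rao bound, while the MAP estimate trades a small bias toward the prior mean for a reduced variance.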