Understanding the concepts of expectation, bias, and variance is crucial for grasping fundamental statistical principles in machine learning. Here’s a beginner-friendly explanation:
The expectation (or expected value) is a core concept in probability theory and statistics, reflecting the average outcome one can expect from a random variable. Mathematically, it's the weighted average of all possible values that this random variable can take on, with the weights being the probabilities of each outcome.
For Discrete Random Variables: If you have a discrete random variable $X$ that can take on values $x_1, x_2, \dots, x_n$ with probabilities $p_1, p_2, \dots, p_n$, the expectation is given by:

$$E[X] = \sum_{i=1}^{n} x_i \, p_i$$
For Continuous Random Variables: If $X$ is continuous with a probability density function $f(x)$, the expectation is:

$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
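The discrete definition can be checked numerically. The minimal sketch below computes the expectation of a fair six-sided die as a weighted sum, then confirms with a Monte Carlo sample average (variable names are illustrative):

```python
import random

# Expectation of a fair six-sided die: E[X] = sum over x of x * p(x).
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
expectation = sum(x * p for x, p in zip(values, probs))
print(expectation)  # close to 3.5 (exactly 3.5 up to float rounding)

# Monte Carlo check: the sample average approaches E[X] as samples grow.
random.seed(0)
samples = [random.choice(values) for _ in range(100_000)]
print(sum(samples) / len(samples))  # also close to 3.5
```

The agreement between the weighted sum and the long-run sample average is exactly what the expectation formalizes.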
Bias in machine learning refers to the error introduced by approximating a real-world problem, which may lead to systematic errors in predictions or estimations. For an estimator $\hat{\theta}$ (a function used to estimate a parameter of a distribution), bias measures the difference between the expected value of the estimator and the true value of the parameter being estimated: $\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$.
An estimator $\hat{\theta}$ is called unbiased if its bias is 0 for all values of $\theta$, meaning that on average it accurately recovers the parameter.
Variance measures the spread of the random variable's values around its mean (expected value), indicating the variability from the average: $\text{Var}(X) = E\big[(X - E[X])^2\big]$. In machine learning, variance captures how much the predictions for a given input vary between different realizations of the model.
In the context of estimators, variance measures how much the estimates of the parameter would differ across different datasets drawn from the same distribution.
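Bias of an estimator can be observed directly by simulation. The sketch below (plain Python, illustrative names) repeatedly draws small samples from a distribution with known variance and shows that the variance estimator that divides by $n$ systematically underestimates, by the factor $(n-1)/n$:

```python
import random
import statistics

random.seed(42)
true_var = 4.0  # population variance of N(0, 2^2)

def biased_var(xs):
    """Sample variance that divides by n (the MLE form)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Draw many small samples and look at the average of the estimates.
estimates = []
for _ in range(20_000):
    xs = [random.gauss(0, 2) for _ in range(5)]
    estimates.append(biased_var(xs))

avg = statistics.mean(estimates)
print(avg)  # around 3.2 = (n-1)/n * 4, i.e. the estimator is biased low
```

Dividing by $n - 1$ instead (Bessel's correction) removes this bias, which is why the "sample variance" is usually defined that way.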
The Mean Square Error (MSE) is a measure used to quantify the difference between the values predicted by a model and the actual values. Mathematically, it is defined as the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value.
For a set of predictions $\hat{y}_i$ and the observed values $y_i$, where $i = 1, \dots, n$ indexes over observations, the MSE is given by:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

In this formula, $n$ is the number of observations, $y_i$ is the actual value of the $i$-th observation, and $\hat{y}_i$ is the corresponding predicted value.
The MSE incorporates both the variance of the estimator (how spread out the predictions are) and its bias (how far the average prediction is from the actual value): $\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2$. Thus, it is a comprehensive measure that evaluates the quality of an estimator or a model in terms of both its precision and its accuracy.
In the context of vectors and matrices (especially in machine learning), the predictions and actual values are often represented as vectors. The operation involving the difference between these vectors, squared, can be expressed using linear algebra notation, especially when dealing with multiple dimensions or multivariate regression. However, in the basic formulation of MSE given above, we directly compute the squared difference without explicitly mentioning vector transposition.
In some contexts, especially when dealing with matrix operations in multivariate regression or when predictions and actual values are vectors, you might see an expression like $(\mathbf{y} - \hat{\mathbf{y}})^\top (\mathbf{y} - \hat{\mathbf{y}})$ for calculating a form of squared error. This notation incorporates matrix transposition ($^\top$) and is used to perform matrix multiplication in a way that results in a scalar value representing the squared error. However, this is a more general scenario than the basic MSE formula for a single predictive model.
The simple version of MSE provided initially is widely used in many machine learning tasks, including regression, where the goal is often to minimize this value to improve model performance.
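Both forms of the MSE give the same number, as the minimal NumPy sketch below shows (the data values are made up for illustration):

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])       # observed values
y_hat = np.array([2.5, 0.0, 2.0, 8.0])    # model predictions

# Elementwise form: the mean of the squared differences.
mse = np.mean((y - y_hat) ** 2)

# Equivalent vector form using a transpose product: (e^T e) / n.
e = y - y_hat
mse_vec = (e @ e) / len(e)

print(mse, mse_vec)  # 0.375 0.375
```

The vector form is the one that generalizes cleanly to the matrix expressions used in multivariate regression.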
Estimation theory is concerned with estimating the values of parameters based on observed data. These parameters define the characteristics of a population, such as its average or variance.
The Fisher Information provides a measure of how much information an observable data sample carries about an unknown parameter of the model that generated the sample; formally, $I(\theta) = E\!\left[\left(\frac{\partial}{\partial \theta} \ln f(X; \theta)\right)^{\!2}\right]$. It's essential for understanding the precision with which we can estimate these parameters.
To derive the log-likelihood formula for a dataset under the assumption that the data points are drawn from a normal (Gaussian) distribution, let's start with the probability density function (pdf) of the normal distribution.
The pdf for a single data point $x$ given the mean $\mu$ and standard deviation $\sigma$ is:

$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
For a dataset $X = \{x_1, x_2, \dots, x_n\}$ consisting of $n$ independently and identically distributed (i.i.d.) observations drawn from this distribution, the likelihood function is the joint probability of observing all data points:

$$L(\mu, \sigma \mid X) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma)$$
Substituting the pdf into the likelihood function gives:

$$L(\mu, \sigma \mid X) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
Taking the natural logarithm of the likelihood function to obtain the log-likelihood simplifies the product into a sum:

$$\ln L(\mu, \sigma \mid X) = -\frac{n}{2} \ln(2\pi) - n \ln \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$
This is the log-likelihood function for a dataset under the assumption that the data points are drawn from a normal distribution, showing how the log-likelihood depends on the mean $\mu$ and standard deviation $\sigma$ of the distribution, as well as the observed data points $x_i$. This formula combines the constant terms and the sum of the squared differences between the observed data points and the mean, scaled by the variance. This expression is essential for methods like Maximum Likelihood Estimation (MLE), where we aim to find the parameters ($\mu$ and $\sigma$) that maximize this log-likelihood.
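The Gaussian log-likelihood above translates directly into code. The sketch below (illustrative function name, synthetic data) evaluates it and checks that the familiar MLE solutions, the sample mean and the sample standard deviation with denominator $n$, score higher than an arbitrary parameter choice:

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """Log-likelihood of data x under N(mu, sigma^2), term by term as derived."""
    n = len(x)
    return (-n / 2 * np.log(2 * np.pi)
            - n * np.log(sigma)
            - np.sum((x - mu) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

# The MLE for mu is the sample mean; for sigma it is the std with denominator n.
mu_hat = x.mean()
sigma_hat = x.std()  # ddof=0, the MLE form

# The log-likelihood at the MLE exceeds that at any other parameter values.
print(gaussian_log_likelihood(x, mu_hat, sigma_hat) >
      gaussian_log_likelihood(x, 0.0, 1.0))  # True
```

Maximizing this function analytically (setting the partial derivatives in $\mu$ and $\sigma$ to zero) recovers exactly those two estimators.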
The PDF for a multivariate normal distribution for a random vector $\mathbf{x} \in \mathbb{R}^d$ with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$ is given by:

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
Here, $|\Sigma|$ denotes the determinant of the covariance matrix $\Sigma$, and $(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is a quadratic form representing the squared Mahalanobis distance between the random vector $\mathbf{x}$ and the mean vector $\boldsymbol{\mu}$, weighted by the inverse of the covariance matrix.
To obtain the log-likelihood of observing a set of $n$ independent and identically distributed (i.i.d.) data points $\mathbf{x}_1, \dots, \mathbf{x}_n$ from this distribution, we take the natural logarithm of the product of their probabilities:

$$\ln L(\boldsymbol{\mu}, \Sigma) = \ln \prod_{i=1}^{n} f(\mathbf{x}_i \mid \boldsymbol{\mu}, \Sigma) = \sum_{i=1}^{n} \ln f(\mathbf{x}_i \mid \boldsymbol{\mu}, \Sigma)$$
Substituting the PDF into the equation above and simplifying, we get:

$$\ln L(\boldsymbol{\mu}, \Sigma) = \sum_{i=1}^{n} \left[ -\frac{d}{2} \ln(2\pi) - \frac{1}{2} \ln|\Sigma| - \frac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right]$$
This expression can be further simplified by aggregating constants and summing the quadratic forms across all observations:

$$\ln L(\boldsymbol{\mu}, \Sigma) = -\frac{nd}{2} \ln(2\pi) - \frac{n}{2} \ln|\Sigma| - \frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu})$$
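The aggregated multivariate log-likelihood maps term by term onto a short NumPy function. The sketch below (illustrative name `mvn_log_likelihood`, synthetic 2-D data) uses `slogdet` for the log-determinant and an `einsum` for the summed quadratic forms, and checks that the true parameters score higher than a shifted mean:

```python
import numpy as np

def mvn_log_likelihood(X, mu, Sigma):
    """Log-likelihood of rows of X under N(mu, Sigma), matching the derived formula."""
    n, d = X.shape
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)  # numerically stable log|Sigma|
    diff = X - mu
    # Sum over i of (x_i - mu)^T Sigma^{-1} (x_i - mu).
    quad = np.einsum('ij,jk,ik->', diff, Sigma_inv, diff)
    return -0.5 * (n * d * np.log(2 * np.pi) + n * logdet + quad)

rng = np.random.default_rng(1)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=500)

print(mvn_log_likelihood(X, mu, Sigma) >
      mvn_log_likelihood(X, mu + 3.0, Sigma))  # True
```

In practice one would use a Cholesky factorization instead of an explicit inverse for large $d$, but the direct form mirrors the derivation above most closely.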
The multivariate normal distribution is a generalization of the univariate normal distribution to multiple variables. It describes the behavior of a random vector in which all linear combinations of the components are normally distributed. This distribution is particularly useful in statistics and machine learning for modeling correlations between variables.
Definition: The covariance matrix, denoted as $\Sigma$, is a square matrix that encapsulates the covariance between each pair of elements in the random vector. For a random vector $\mathbf{X} = (X_1, \dots, X_d)$, the element at the $i$-th row and $j$-th column of $\Sigma$, denoted $\Sigma_{ij}$, is the covariance between $X_i$ and $X_j$. For diagonal elements, where $i = j$, $\Sigma_{ii}$ represents the variance of $X_i$.
Properties: $\Sigma$ is symmetric ($\Sigma = \Sigma^\top$, since $\text{Cov}(X_i, X_j) = \text{Cov}(X_j, X_i)$) and positive semi-definite; for the multivariate normal pdf to be well defined, it must be positive definite so that $\Sigma^{-1}$ exists and $|\Sigma| > 0$.
Role in Multivariate Normal Distribution: The inverse of the covariance matrix, $\Sigma^{-1}$, plays a crucial role in the probability density function (pdf) of the multivariate normal distribution. It appears in the exponent of the pdf, contributing to measuring the "distance" of an observation from the mean $\boldsymbol{\mu}$, taking into account the correlations between the variables.
Geometric Interpretation: The inverse of the covariance matrix can be seen as a transformation that "uncorrelates" the variables, mapping them into a new space where their covariance is zero except for their variances (on the diagonal). This is akin to stretching and rotating the coordinate axes so that the contours of equal probability density of the distribution become circles (in 2D) or spheres (in higher dimensions), rather than ellipses or ellipsoids.
Mathematical Significance: $\Sigma^{-1}$ adjusts the quadratic term $(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$ in the density function so that it accounts for both the variance and covariance of the variables. This term essentially measures the squared Mahalanobis distance, a generalized distance metric that considers the scale and correlation of the data dimensions.
Understanding the covariance matrix and its inverse is foundational for working with multivariate normal distributions in statistics and machine learning. They allow us to model complex relationships between multiple random variables, making the multivariate normal distribution a powerful tool for multivariate analysis, pattern recognition, and machine learning algorithms.
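The geometric interpretation above can be verified numerically. The minimal sketch below (made-up 2-D covariance) computes the squared Mahalanobis distance directly, then "whitens" the vector with a Cholesky factor $L$ (where $\Sigma = L L^\top$) and shows that the plain Euclidean norm of the whitened vector recovers the same value:

```python
import numpy as np

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])
mu = np.array([0.0, 0.0])
x = np.array([2.0, 1.0])

# Squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu).
diff = x - mu
d2 = diff @ np.linalg.inv(Sigma) @ diff

# Whitening: z = L^{-1} (x - mu) maps the variables into a space where
# they are uncorrelated with unit variance, so the ordinary Euclidean
# norm of z equals the Mahalanobis distance.
L = np.linalg.cholesky(Sigma)
z = np.linalg.solve(L, diff)
print(np.isclose(d2, z @ z))  # True
```

This is exactly the "stretch and rotate" picture: in the whitened coordinates, the density's elliptical contours become circles.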
The CRLB provides a theoretical lower limit on the variance of unbiased estimators. It tells us the best precision we can achieve with an unbiased estimator for estimating a parameter.
In statistics, an estimator is a rule or formula that tells us how to calculate an estimate of a given quantity based on observed data. The quantity we're trying to estimate could be any parameter of the population from which the data was sampled, such as the population mean.
The arithmetic mean, often simply called the "mean," is one of the most basic estimators: for observations $x_1, \dots, x_n$ it is $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. It's used to estimate the central tendency or the average of a set of numbers.
An unbiased estimator is a statistical rule used to estimate an unknown parameter of a distribution; it's called "unbiased" when the expected value of the estimate equals the true value of the parameter being estimated.
An estimator is called unbiased if, on average, it gives us the true value of the parameter we're trying to estimate. In more formal terms, an estimator is unbiased if its expected value—the long-run average of the estimates if we could repeat our sampling process an infinite number of times—is equal to the true parameter.
The arithmetic mean $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ is an unbiased estimator for the parameter $\mu$ because its expected value equals the true mean of the distribution from which the samples are drawn. Mathematically, this can be shown as follows:

The expected value operator has a linear property, which means that the expected value of a sum of random variables is equal to the sum of their expected values:

$$E[\bar{X}] = E\!\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i]$$

Since each $X_i$ is drawn from a distribution with mean $\mu$, $E[X_i] = \mu$ for all $i$. Thus,

$$E[\bar{X}] = \frac{1}{n} \cdot n\mu = \mu$$
This demonstrates that the arithmetic mean is an unbiased estimator for $\mu$, fulfilling the criteria for unbiasedness by having its expected value equal to the parameter it estimates.
If you had a bag of marbles, with each marble having a number on it, and you wanted to know the average number, you could take a handful of marbles out and calculate their average by adding up the numbers and dividing by how many marbles you took. If you put them back and took a different handful many, many times, averaging each time, the average of all those averages would be very close to the average number on all marbles in the bag.
This is what it means for the arithmetic mean to be an unbiased estimator of the mean: if you could keep sampling and averaging, on average, you'd get the true mean of the whole population.
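The marble experiment is easy to run in code. The sketch below (plain Python, illustrative setup) fills a bag with marbles numbered 1 to 100, repeatedly averages random handfuls, and shows that the average of those averages lands on the true bag mean:

```python
import random
import statistics

random.seed(7)
bag = list(range(1, 101))  # marbles numbered 1..100; true mean is 50.5
true_mean = statistics.mean(bag)

# Draw many handfuls (without replacement), average each handful,
# then average all those handful averages.
handful_means = []
for _ in range(10_000):
    handful = random.sample(bag, 10)
    handful_means.append(statistics.mean(handful))

print(true_mean, statistics.mean(handful_means))  # the two nearly match
```

Individual handful averages scatter widely, but their long-run average settles on the population mean; that scatter is exactly the variance of the estimator, discussed next.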
To understand the efficiency of the estimator and how it relates to the Cramér-Rao Lower Bound (CRLB), let's go through the concepts step by step.
In the world of statistics, when we talk about the efficiency of an estimator, we are interested in how well the estimator performs in terms of its variance. We prefer estimators that give us results closer to the true parameter value more consistently, i.e., with less variance. Among all unbiased estimators, the one with the lowest variance is considered the most "efficient."
The arithmetic mean estimator $\bar{X}$ of a set of observations has a variance. The variance measures how much the values of $\bar{X}$ would be spread out if we repeatedly took samples from the population and calculated their mean each time. For a normal distribution, the variance of the estimator is given by $\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$, where $\sigma^2$ is the variance of the population and $n$ is the sample size.
For a normal distribution, the Fisher Information for the mean is $I(\mu) = \frac{n}{\sigma^2}$. This tells us that more samples or less variability in our data both increase the amount of information we have about the mean.
The Cramér-Rao Lower Bound is a theoretical limit that tells us the best variance we could possibly achieve with an unbiased estimator of a parameter. In mathematical terms, it is the reciprocal of the Fisher Information: $\text{Var}(\hat{\mu}) \geq \frac{1}{I(\mu)} = \frac{\sigma^2}{n}$. This formula says the lowest variance we can hope for decreases with more samples and increases with more variability in the data.
Now, if we compare the actual variance of $\bar{X}$ (which is $\frac{\sigma^2}{n}$) with the CRLB, we find they are the same. This means our arithmetic mean estimator is as good as it gets: it is the most efficient estimator we could use for the mean because it has the lowest variance that an unbiased estimator could possibly have according to the CRLB.
Think of the CRLB as a speed limit for estimators on the highway of statistics, except here the law says "you cannot go below this variance while remaining an unbiased estimator." If our arithmetic mean sits exactly at that limit, it is the most efficient estimator the "laws" of statistics allow for estimating the population mean.
In summary, the arithmetic mean is not just unbiased (giving us the right answer on average); it's also efficient (giving us the answer with as little random error as possible) when the data comes from a normal distribution. This dual quality of unbiasedness and efficiency makes the arithmetic mean a very powerful and commonly used estimator in statistics.
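This efficiency claim can be checked empirically. The NumPy sketch below (made-up parameters) simulates many repeated datasets from a normal distribution, computes the sample mean of each, and compares the empirical variance of those means against the CRLB $\sigma^2 / n$:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n = 2.0, 25
crlb = sigma ** 2 / n  # = 0.16, the lowest variance an unbiased estimator can have

# Empirical variance of the sample mean across many repeated datasets.
means = rng.normal(loc=0.0, scale=sigma, size=(50_000, n)).mean(axis=1)
print(crlb, means.var())  # the empirical variance sits right at the bound
```

The empirical variance matches $\sigma^2 / n$, confirming that the sample mean attains the Cramér-Rao bound for normal data.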
Understanding these concepts forms the backbone of not only machine learning but also data science and statistics at large. They're essential for analyzing the behavior of algorithms and models, especially in understanding overfitting and underfitting, model selection, and for improving the predictive performance of machine learning models.
To fully grasp the above concepts, you should be familiar with the following: basic probability theory (random variables, distributions, and density functions), summation and integral notation, elementary calculus, and introductory linear algebra (vectors, matrices, transposes, and inverses).
By starting with these foundational topics and gradually exploring more complex concepts, you'll build a solid understanding of estimation theory and its applications in machine learning and data science.
"Introduction to Probability" by Joseph K. Blitzstein and Jessica Hwang: This book provides a comprehensive introduction to probability, covering expectation, variance, and many other fundamental concepts with clarity.
"The Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman: A cornerstone text in statistical learning that explains the trade-off between bias and variance, among other concepts.
Khan Academy’s Statistics and Probability: Offers free online tutorials that cover basics and advanced concepts in statistics and probability, including expectation, bias, and variance.
"Pattern Recognition and Machine Learning" by Christopher M. Bishop: Provides detailed explanations of bias, variance, and expectation within the context of machine learning models.
"Mathematics for Machine Learning" by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong: An excellent resource that covers the mathematical underpinnings necessary for understanding these concepts in machine learning.
"Statistical Inference" by George Casella and Roger L. Berger: Provides a deep dive into estimation theory, unbiased estimators, and much more, giving you a thorough understanding of statistical theory.
Online Courses: Websites like Coursera, edX, and Khan Academy offer courses in statistics and machine learning that start from the basics and advance to more complex topics, including estimation theory and statistical inference.