The Central Limit Theorem (CLT) is a fundamental result in the field of statistics and probability theory. It provides a foundation for understanding why many distributions in nature tend to approximate a normal distribution under certain conditions, even if the original variables themselves are not normally distributed. The theorem states that, given a sufficiently large sample size, the distribution of the sample means will be approximately normally distributed, regardless of the shape of the population distribution, provided the population has a finite variance.
The formula or the mathematical formulation of the CLT can be derived from the concept of convergence in distribution of standardized sums of independent random variables. Let's consider the classical version of the CLT to understand where the formula comes from:
Consider a sequence of independent and identically distributed (i.i.d.) random variables, $X_1, X_2, \ldots, X_n$, each with mean $\mu$ and finite variance $\sigma^2$. The sample mean of these random variables is given by:
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$
The Central Limit Theorem tells us that as $n$ approaches infinity, the distribution of the standardized sample means (i.e., how many standard deviations away the sample mean is from the population mean) converges in distribution to a standard normal distribution. The standardized sample mean is given by:
$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$$
This formula ensures that $Z_n$ has a mean of 0 and a standard deviation of 1:
Mean of $Z_n$: $E[Z_n] = \dfrac{E[\bar{X}_n] - \mu}{\sigma/\sqrt{n}} = \dfrac{\mu - \mu}{\sigma/\sqrt{n}} = 0$
Variance of $Z_n$: $\operatorname{Var}(Z_n) = \dfrac{\operatorname{Var}(\bar{X}_n)}{\sigma^2/n} = \dfrac{\sigma^2/n}{\sigma^2/n} = 1$
Here, $Z_n$ converges in distribution to a standard normal distribution as $n$ becomes large.
The CLT is derived from the properties of characteristic functions or moment-generating functions of probability distributions. In essence, the CLT can be proven by showing that the characteristic function of $Z_n$ converges to the characteristic function of a standard normal distribution as $n$ approaches infinity (see Appendix A.3).
The significance of the CLT lies in its ability to justify the use of the normal distribution in many practical situations, including hypothesis testing, confidence interval construction, and other inferential statistics procedures, even when the underlying population distribution is unknown or non-normal.
We know that the Z-score, or standardization, is the deviation of a data point from the mean in units of standard deviation (see Appendix A.2 on Standardization). Here, the deviation is of the sample mean ($\bar{X}_n$) from the population mean ($\mu$). Therefore, we derive the standard deviation of the sample mean ($\sigma_{\bar{X}}$) as follows.
The variance of the sample mean is derived as follows. Since $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$, its variance is:
$$\operatorname{Var}(\bar{X}_n) = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\operatorname{Var}\!\left(\sum_{i=1}^{n} X_i\right)$$
Because the $X_i$ are i.i.d., the variances add up, and we get:
$$\operatorname{Var}(\bar{X}_n) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}$$
This shows that the variance of the sample mean decreases as the sample size increases.
The standard deviation is the square root of the variance. Therefore, the standard deviation of the sample means, also known as the standard error of the mean (SEM), is:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
Reducing Spread: Dividing by $\sqrt{n}$ reduces the spread of the sampling distribution of the sample mean as the sample size increases. This reflects the fact that larger samples are likely to yield means closer to the population mean ($\mu$), thus decreasing variability among the sample means.
Normalization: The process of dividing the population standard deviation ($\sigma$) by $\sqrt{n}$ normalizes the scale of the sample means' distribution. This normalization ensures that no matter the sample size, the scale (spread) of the distribution of sample means is consistent and comparable.
The CLT states that as $n$ approaches infinity, the distribution of the standardized sample means:
$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$$
converges to a standard normal distribution $N(0, 1)$. Here, $\bar{X}_n$ is the mean of a sample of size $n$, $\mu$ is the population mean, and $\sigma$ is the population standard deviation. The denominator $\sigma/\sqrt{n}$ standardizes the distribution of $\bar{X}_n$ by adjusting for the size of the sample, allowing the theorem to hold across different sample sizes and population variances.
Mathematically, dividing by $\sqrt{n}$ in the calculation of the SEM and the standardization of sample means under the CLT ensures that the variability among sample means decreases with increasing sample size. This adjustment is fundamental to the convergence of the distribution of sample means to a normal distribution, a cornerstone of statistical inference.
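As a quick numerical check of the $\sigma/\sqrt{n}$ scaling, the following sketch (a minimal illustration assuming NumPy is available; the exponential population, seed, and sample sizes are arbitrary choices, not taken from the text) compares the empirical standard deviation of simulated sample means with the theoretical SEM:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                      # population std of an Exponential(scale=2) population
n_realizations = 100_000         # number of simulated sample means per sample size

for n in (5, 25, 100):
    # Draw n_realizations samples of size n from a non-normal population
    samples = rng.exponential(scale=sigma, size=(n_realizations, n))
    sample_means = samples.mean(axis=1)
    empirical_sem = sample_means.std(ddof=1)
    theoretical_sem = sigma / np.sqrt(n)
    print(f"n={n:4d}  empirical SEM={empirical_sem:.4f}  sigma/sqrt(n)={theoretical_sem:.4f}")
```

As $n$ grows, both columns shrink together, mirroring the $\sigma/\sqrt{n}$ behaviour derived above.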
Given the definition of $Z_n$, to find its variance, we use the property that $\operatorname{Var}(aX) = a^2\operatorname{Var}(X)$ for any random variable $X$ and constant $a$ (see proof in Appendix A.1). Applying this to the definition of $Z_n$, we get:
$$\operatorname{Var}(Z_n) = \operatorname{Var}\!\left(\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}\right)$$
Since $\mu$ is a constant, subtracting it from $\bar{X}_n$ does not affect the variance, so we focus on the scaling factor. Applying the variance operator:
$$\operatorname{Var}(Z_n) = \left(\frac{\sqrt{n}}{\sigma}\right)^{2} \operatorname{Var}(\bar{X}_n) = \frac{n}{\sigma^2} \cdot \frac{\sigma^2}{n} = 1$$
This calculation shows that the variance of $Z_n$ is 1.
The derivation shows that the process of standardizing the sample mean results in a new variable $Z_n$ with a variance of 1. This is a crucial step in the application of the CLT because it ensures that $Z_n$ is scaled appropriately to have a standard normal distribution with mean 0 and variance 1 as $n$ becomes large. This standardization allows us to use the properties of the standard normal distribution for statistical inference and hypothesis testing.
Variance is a fundamental statistical measure that quantifies the spread or dispersion of a set of data points or a random variable's values around its mean. Understanding the properties of variance is crucial for statistical analysis, as these properties often underpin the manipulation and interpretation of statistical data. Here are some key properties of variance:
Variance is always non-negative ($\operatorname{Var}(X) \geq 0$). This is because variance is defined as the expected value of the squared deviation from the mean, and a square is always non-negative.
The variance of a constant ($c$) is zero ($\operatorname{Var}(c) = 0$). Since a constant does not vary, its spread around its mean (which is the constant itself) is zero.
Scaling a random variable by a constant factor scales the variance by the square of that factor: $\operatorname{Var}(aX) = a^2\operatorname{Var}(X)$, where $a$ is a constant and $X$ is a random variable. This property was detailed in a previous explanation.
For any two random variables $X$ and $Y$, the variance of their sum is given by $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)$, where $\operatorname{Cov}(X, Y)$ is the covariance of $X$ and $Y$. If $X$ and $Y$ are independent, $\operatorname{Cov}(X, Y) = 0$, and the formula simplifies to $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$.
While the expectation operator is linear ($E[aX + bY] = aE[X] + bE[Y]$), variance is not linear except in specific cases. For independent random variables $X$ and $Y$, and constants $a$ and $b$, $\operatorname{Var}(aX + bY) = a^2\operatorname{Var}(X) + b^2\operatorname{Var}(Y)$. However, for dependent variables, you must also consider the covariance term.
Similar to the sum, the variance of the difference of two random variables is $\operatorname{Var}(X - Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) - 2\operatorname{Cov}(X, Y)$. For independent variables, this simplifies to $\operatorname{Var}(X) + \operatorname{Var}(Y)$, as their covariance is zero.
If a random variable $X$ has a variance of zero ($\operatorname{Var}(X) = 0$), then $X$ is almost surely a constant. This is because no variation from the mean implies that $X$ takes on its mean value with probability 1.
These properties are widely used in statistical modeling, data analysis, and probability theory, especially in the derivation of statistical estimators, hypothesis testing, and in the study of the distributional properties of sums and transformations of random variables.
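The properties listed above are easy to verify numerically. Below is a minimal sketch (assuming NumPy; the distributions and the constants $a$, $b$ are arbitrary illustrative choices) that checks several of them by simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
X = rng.normal(loc=3.0, scale=2.0, size=N)   # Var(X) ≈ 4
Y = rng.uniform(low=0.0, high=6.0, size=N)   # independent of X, Var(Y) = 36/12 = 3
a, b = 2.5, -1.5

print("Var(aX)   vs a^2 Var(X):          ", np.var(a * X), a**2 * np.var(X))
print("Var(X+Y)  vs Var(X)+Var(Y):       ", np.var(X + Y), np.var(X) + np.var(Y))
print("Var(X-Y)  vs Var(X)+Var(Y):       ", np.var(X - Y), np.var(X) + np.var(Y))
print("Var(aX+bY) vs a^2Var(X)+b^2Var(Y):", np.var(a * X + b * Y),
      a**2 * np.var(X) + b**2 * np.var(Y))
print("Var(constant):                    ", np.var(np.full(N, 7.0)))
```

The paired values agree up to sampling noise, and the variance of a constant is exactly zero.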
The plots demonstrate the Central Limit Theorem (CLT) in action for different sample sizes $n$. Each subplot shows the distribution of $Z_n$ for 100,000 realizations, where the $X_i$ are drawn from a uniform distribution. This distribution is overlaid with the density of the standard normal distribution (dotted red curve) for comparison.
Here's what we observe: for small $n$, the histogram of $Z_n$ still reflects the shape of the underlying uniform distribution, but as $n$ increases, it matches the standard normal density more and more closely. These observations align with the CLT, highlighting its power: as the number of samples increases, the distribution of the sum (or mean) of these samples increasingly approximates a normal distribution, even when the original variables are not normally distributed.
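A figure like the one described can be reproduced with a short simulation. The sketch below (assuming NumPy and Matplotlib; the Uniform(0, 1) population and the sample sizes 1, 2, 10, and 30 are illustrative assumptions, not necessarily the settings used for the original plots) histograms the standardized sample mean $Z_n$ and overlays the standard normal density:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_realizations = 100_000
sample_sizes = [1, 2, 10, 30]

# Mean and standard deviation of the Uniform(0, 1) population
mu, sigma = 0.5, np.sqrt(1 / 12)

fig, axes = plt.subplots(1, len(sample_sizes), figsize=(16, 3.5), sharey=True)
z_grid = np.linspace(-4, 4, 400)
normal_pdf = np.exp(-z_grid**2 / 2) / np.sqrt(2 * np.pi)

for ax, n in zip(axes, sample_sizes):
    X = rng.uniform(0, 1, size=(n_realizations, n))
    Z = (X.mean(axis=1) - mu) / (sigma / np.sqrt(n))   # standardized sample mean
    ax.hist(Z, bins=100, density=True, alpha=0.6)
    ax.plot(z_grid, normal_pdf, "r:", linewidth=2)     # standard normal density
    ax.set_title(f"n = {n}")

plt.tight_layout()
plt.show()
```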
The number of realizations, or samples, you choose for generating a histogram affects the smoothness and accuracy of the representation of the underlying distribution. This principle applies not just to histograms of samples from a standard normal distribution, but to histograms of samples from any distribution. With few realizations the histogram is noisy and may misrepresent the distribution's shape; with many realizations it becomes smoother and tracks the underlying density more accurately.
In practical terms, choosing the number of realizations depends on the balance between computational resources and the need for accuracy. For exploratory data analysis or when computational resources are limited, a smaller number of realizations might suffice. However, for precise statistical analysis or when the goal is to closely approximate the underlying distribution, using a larger number of realizations is preferable.
It's also important to note that while increasing the number of realizations improves the approximation to the underlying distribution, it does not change the distribution itself. The Central Limit Theorem (CLT) ensures that, given enough samples, the distribution of sample means will approximate a normal distribution, independent of the number of realizations used to construct each individual histogram.
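To see the effect of the number of realizations directly, one can redraw the same histogram with increasingly many samples. A minimal sketch (assuming NumPy and Matplotlib; the realization counts are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
fig, axes = plt.subplots(1, 3, figsize=(12, 3.5), sharey=True)

for ax, n_realizations in zip(axes, [100, 10_000, 1_000_000]):
    z = rng.standard_normal(n_realizations)
    ax.hist(z, bins=50, density=True)     # fewer realizations -> noisier histogram
    ax.set_title(f"{n_realizations:,} realizations")

plt.tight_layout()
plt.show()
```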
The Central Limit Theorem (CLT) primarily applies to the distribution of sample means, stating that the distribution of the sample mean of a sufficiently large number of independent and identically distributed (i.i.d.) random variables, each with a finite mean and variance, will approximate a normal distribution, regardless of the shape of the original distribution.
For variances, the situation is slightly different. While the CLT does imply that sums (and by extension, averages) of i.i.d. random variables tend toward a normal distribution as the sample size increases, the distribution of sample variances follows a different path. Specifically, when the population is normally distributed, the appropriately scaled sample variance $(n-1)S^2/\sigma^2$ follows a Chi-square ($\chi^2$) distribution with $n-1$ degrees of freedom (see Appendix A.4 for details on the Chi-square distribution).
In summary, while the CLT provides a basis for expecting the sample mean to be normally distributed for large sample sizes regardless of the population distribution, the sample variance follows a Chi-square distribution for normally distributed data and may approach normality under certain conditions for large sample sizes in non-normal populations, but this is not as directly assured as it is for sample means.
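A simulation makes the contrast concrete. The sketch below (assuming NumPy, SciPy, and Matplotlib; the normal population, sample size, and seed are arbitrary choices) draws repeated samples from a normal population and compares the histogram of $(n-1)S^2/\sigma^2$ with the $\chi^2_{n-1}$ density:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n, sigma = 10, 2.0
n_realizations = 100_000

samples = rng.normal(loc=0.0, scale=sigma, size=(n_realizations, n))
S2 = samples.var(axis=1, ddof=1)                 # unbiased sample variance
scaled = (n - 1) * S2 / sigma**2                 # should follow chi-square with n-1 df

x = np.linspace(0, 30, 400)
plt.hist(scaled, bins=100, density=True, alpha=0.6, label="(n-1)S²/σ² (simulated)")
plt.plot(x, stats.chi2.pdf(x, df=n - 1), "r", label=f"Chi-square, df={n - 1}")
plt.legend()
plt.show()
```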
The Fisher Information provides a measure of how much information an observable data sample carries about an unknown parameter of the model that generated the sample. It's essential for understanding the precision with which we can estimate these parameters.
The likelihood function represents the plausibility of a parameter value given specific observed data. Unlike a probability function, which provides the probability of observing data given certain parameter values, the likelihood function considers the parameter (e.g., mean, variance) as variable and the data as fixed.
Suppose we are studying the heights of chicks in a particular town. We collect height data from a random sample of birds, and we wish to estimate the mean height μ for the entire population of chicks in the town. Assume that the heights are normally distributed, which is a reasonable assumption for biological measurements like height. In this example, the parameter μ represents the mean height we are trying to estimate. The likelihood function measures how "likely" or "plausible" different values of μ are given the observed data. The MLE is particularly powerful because it selects the value of μ that makes the observed data most probable under the assumed statistical model.
Suppose you have a coin, and you want to determine whether it's fair. You flip the coin ten times, and it comes up heads six times. The likelihood function in this scenario would help you evaluate how plausible different probabilities of flipping heads (let's denote this probability as $p$) are, given that you observed 6 heads out of 10 flips.
If you have a set of independent and identically distributed (i.i.d.) data points $x_1, x_2, \ldots, x_n$ from a probability distribution $p(x; \theta)$, where $\theta$ represents the parameters of the distribution, then the likelihood function is defined as the product of the probabilities (or probability densities for continuous data) of observing each specific $x_i$:
$$L(\theta) = \prod_{i=1}^{n} p(x_i; \theta)$$
In practice, especially with many data points, it's more convenient to work with the logarithm of the likelihood function, known as the log-likelihood function:
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i; \theta)$$
This transformation is useful because it turns the product into a sum, simplifying both computation and differentiation.
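For the coin example above, the log-likelihood can be written down and maximized over a grid in a few lines. A minimal sketch (assuming NumPy; the grid search is an illustrative shortcut rather than a general-purpose optimizer):

```python
import numpy as np

heads, flips = 6, 10
p_grid = np.linspace(0.01, 0.99, 99)

# Binomial log-likelihood (dropping the constant binomial coefficient):
# log L(p) = heads*log(p) + (flips - heads)*log(1 - p)
log_likelihood = heads * np.log(p_grid) + (flips - heads) * np.log(1 - p_grid)

p_mle = p_grid[np.argmax(log_likelihood)]
print(f"MLE of p ≈ {p_mle:.2f} (analytical answer: heads/flips = {heads / flips})")
```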
Fisher Information quantifies how much information an observable random variable, sampled from a distribution, carries about an unknown parameter upon which the probability depends. Mathematically, Fisher Information is defined as the expected value of the squared gradient (first derivative) of the log-likelihood function (see Appendix A.5), or equivalently, as the negative expectation of the second derivative of the log-likelihood function with respect to the parameter:
$$I(\theta) = E\!\left[\left(\frac{\partial}{\partial\theta}\log L(\theta)\right)^{2}\right] = -E\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\log L(\theta)\right]$$
This definition implies that Fisher Information measures the steepness or curvature of the log-likelihood function around the parameter $\theta$. A steeper curve suggests that the parameter can be estimated with higher precision since small changes in $\theta$ lead to larger changes in the likelihood, making the maximum more distinct and easier to pinpoint accurately.
The likelihood function serves as a bridge between observed data and theoretical models, helping statisticians make inferences about unknown parameters. Fisher Information, derived from the likelihood function, plays a crucial role in assessing the quality and precision of these inferences.
The negative sign in the Fisher Information formula, where Fisher Information is defined as the negative expectation of the second derivative of the log-likelihood function with respect to the parameter $\theta$, is a critical aspect to understand. The reason for this sign arises from the curvature of the log-likelihood function and its implications for parameter estimation.
Second Derivative of the Log-Likelihood Function: The second derivative of the log-likelihood function, $\frac{\partial^{2}}{\partial\theta^{2}}\log L(\theta)$, measures the curvature of the log-likelihood function at a particular point $\theta$. This curvature is crucial in determining the nature of the extremum (maximum or minimum) at that point.
Convexity and Concavity: A negative second derivative indicates that the log-likelihood is concave (curving downward) at that point, which corresponds to a local maximum; a positive second derivative indicates convexity and a local minimum.
Since we are often interested in maximizing the log-likelihood function to find the maximum likelihood estimators, the point of interest (where the first derivative is zero) will generally have a negative second derivative if it is a maximum.
Maximization of Log-Likelihood: In the context of likelihood estimation, we are interested in points where the log-likelihood function is maximized with respect to the parameter $\theta$. At these points, the curvature (second derivative) is negative, reflecting the concavity of the log-likelihood function.
Negative Expectation: Given that the second derivative at the maximum point of the log-likelihood is negative, taking the negative of this expectation (i.e., the negative of a generally negative number) results in a positive value. Fisher Information must be positive as it quantifies the amount of information the data carries about the parameter, where higher values imply more information or precision in the estimation.
Variance of Estimators: Positive Fisher Information is crucial because it relates directly to the precision of estimators through the Cramér-Rao Lower Bound. According to this bound, the variance of any unbiased estimator $\hat{\theta}$ is at least the reciprocal of the Fisher Information:
$$\operatorname{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$
If Fisher Information were not positive (and substantial), this fundamental relationship, which guarantees a lower bound on the variance of estimators, would not hold, undermining the statistical inference process.
In summary, the negative sign in the definition of Fisher Information as the negative of the expected value of the second derivative of the log-likelihood function is necessary to ensure that Fisher Information is a positive quantity, reflecting the concavity of the log-likelihood at its maximum and the precision achievable in parameter estimation.
To calculate the Fisher Information of the likelihood function for $n$ i.i.d. samples from a given probability mass function (pmf) $p(x; \theta)$, we start with the likelihood function:
$$L(\theta) = \prod_{i=1}^{n} p(x_i; \theta)$$
First, we need to compute the logarithm of the likelihood function, known as the log-likelihood function:
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i; \theta)$$
Next, we differentiate the log-likelihood function twice with respect to $\theta$ and then take the negative of its expectation (the Fisher Information of the likelihood function):
$$I_n(\theta) = -E\!\left[\frac{\partial^{2}}{\partial\theta^{2}} \sum_{i=1}^{n} \log p(x_i; \theta)\right]$$
Since the second derivative of the sum of the log-likelihoods is the sum of the second derivatives of the individual log-likelihoods, and knowing that $-E\!\left[\frac{\partial^{2}}{\partial\theta^{2}} \log p(x_i; \theta)\right]$ is the Fisher Information of one sample (see Appendix A.6 on this property):
$$I_n(\theta) = -\sum_{i=1}^{n} E\!\left[\frac{\partial^{2}}{\partial\theta^{2}} \log p(x_i; \theta)\right]$$
Given that $-E\!\left[\frac{\partial^{2}}{\partial\theta^{2}} \log p(x_i; \theta)\right] = I(\theta)$, we can substitute this into the equation, recognizing that each term in the sum is just $I(\theta)$:
$$I_n(\theta) = \sum_{i=1}^{n} I(\theta) = n\,I(\theta)$$
The Fisher Information of the likelihood function for $n$ i.i.d. samples is $n$ times the Fisher Information of a single sample. This result indicates that the amount of information about the parameter $\theta$ contained in $n$ i.i.d. samples is $n$ times the information contained in a single sample. Essentially, as the sample size increases, the total information about $\theta$ increases linearly with $n$, implying that larger sample sizes provide more precise estimates of the parameter $\theta$, as reflected in the decrease in the variance of the estimator.
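As a concrete illustration, consider $n$ i.i.d. Bernoulli($p$) observations, for which the single-sample Fisher Information is $I(p) = 1/(p(1-p))$. The sketch below (assuming NumPy; the Bernoulli model and the parameter values are illustrative assumptions, not the pmf used in the derivation above) estimates $I_n(p)$ by averaging the negative second derivative of the log-likelihood over simulated datasets and checks that it is approximately $n\,I(p)$:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 0.3, 50
n_datasets = 200_000

x = rng.binomial(1, p, size=(n_datasets, n))

# Second derivative of the per-sample log-likelihood  x*log(p) + (1-x)*log(1-p)
second_deriv = -x / p**2 - (1 - x) / (1 - p) ** 2

# Fisher Information of the full likelihood: negative expectation of the summed second derivative
I_n_estimate = -second_deriv.sum(axis=1).mean()
I_single = 1 / (p * (1 - p))

print(f"estimated I_n(p) = {I_n_estimate:.2f}")
print(f"n * I(p)         = {n * I_single:.2f}")
```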
The variance of a random variable measures the dispersion of that variable's values around its mean. The formula for the variance of a random variable $X$ is defined as:
$$\operatorname{Var}(X) = E\!\left[(X - E[X])^{2}\right]$$
where $E[X]$ is the expected value (or mean) of $X$, and $E[\cdot]$ denotes the expectation operator.
Now, let's consider a new random variable $Y = aX$, where $a$ is a constant. We want to derive the variance of $Y$, denoted as $\operatorname{Var}(Y)$ or $\operatorname{Var}(aX)$.
Given $Y = aX$, we apply the variance formula:
$$\operatorname{Var}(Y) = E\!\left[(Y - E[Y])^{2}\right]$$
Since $Y = aX$, we have $\operatorname{Var}(aX) = E\!\left[(aX - E[aX])^{2}\right]$.
The expected value of $aX$ is:
$$E[aX] = aE[X]$$
This is because the expectation operator is linear, and the constant $a$ can be factored out of the expectation.
Substituting $Y = aX$ and $E[aX] = aE[X]$ into the variance formula, we get:
$$\operatorname{Var}(aX) = E\!\left[(aX - aE[X])^{2}\right] = E\!\left[a^{2}(X - E[X])^{2}\right]$$
Since $a$ is a constant, we can factor $a^{2}$ out of the expectation of the squared term:
$$\operatorname{Var}(aX) = a^{2} E\!\left[(X - E[X])^{2}\right]$$
Noting that $E\!\left[(X - E[X])^{2}\right]$ is the definition of $\operatorname{Var}(X)$, we have:
$$\operatorname{Var}(aX) = a^{2}\operatorname{Var}(X)$$
This derivation shows that the variance of $aX$, where $a$ is a constant, is $a^{2}$ times the variance of $X$. The key takeaway is that scaling a random variable by a constant $a$ scales its variance by $a^{2}$, reflecting the squared nature of variance as a measure of dispersion.
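A quick numerical check of this result (assuming NumPy; the choice of distribution and constant is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.gamma(shape=2.0, scale=3.0, size=1_000_000)   # any non-normal random variable
a = -4.0

print("Var(aX)      :", np.var(a * X))
print("a^2 * Var(X) :", a**2 * np.var(X))              # the two agree up to sampling noise
```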
Standardization is a statistical method used to transform random variables into a standard scale without distorting differences in the ranges of values. The process converts original data into a format where the mean of the transformed data is 0 and the standard deviation is 1. This transformation is achieved by subtracting the expected value (mean) from each data point and then dividing by the standard deviation.
The formula for standardizing a random variable $X$ is:
$$Z = \frac{X - \mu}{\sigma}$$
It is basically the deviation of a data point from the mean (i.e., how far the data point is from the mean) per unit standard deviation. For example, $Z = 2$ means that the data point is 2 standard deviations away from the mean.
where $\mu$ is the mean and $\sigma$ is the standard deviation of $X$.
The rationale behind standardization and the specific form of the standardization formula involves several key statistical principles:
Comparability: Standardization allows data from different sources or distributions to be compared directly. Because the standardized data has a mean of 0 and a standard deviation of 1, it removes the units of measurement and normalizes the scale, making different datasets or variables comparable.
Normalization: Many statistical methods and machine learning algorithms assume or perform better when the data is normally distributed or similarly scaled. Standardization can help meet these assumptions or improve performance by giving every variable an equal weight, preventing variables with larger scales from dominating those with smaller scales.
Understanding Z-scores: The standardized value, or Z-score, tells you how many standard deviations away from the mean a data point is. This can be useful for identifying outliers, understanding the distribution of data, and performing statistical tests.
Mathematical Foundation: The formula is grounded in the properties of the normal distribution. In a standard normal distribution, the mean ($\mu$) is 0, and the standard deviation ($\sigma$) is 1. The standardization process transforms the data so that it can be described in terms of how far each observation is from the mean, in units of the standard deviation. This transformation is particularly useful in the context of the Central Limit Theorem, which states that the distribution of the sample means tends towards a normal distribution as the sample size increases, regardless of the shape of the population distribution.
The act of subtracting the mean and dividing by the standard deviation in standardization serves to "normalize" the scale of different variables, enabling direct comparison, simplifying the interpretation of data, and preparing data for further statistical analysis or machine learning modeling. This process leverages the fundamental statistical properties of mean and standard deviation to achieve a standardized scale, where the effects of differing magnitudes among original data values are neutralized.
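In code, standardization is a one-line transformation. A minimal sketch (assuming NumPy; the simulated "height" data are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=170.0, scale=8.0, size=10_000)   # e.g., heights in cm

z = (data - data.mean()) / data.std()                  # z-scores

print("mean of z ≈", round(z.mean(), 6))               # ≈ 0
print("std  of z ≈", round(z.std(), 6))                # ≈ 1
print("a 186 cm observation is", round((186 - data.mean()) / data.std(), 2),
      "standard deviations above the mean")
```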
Deriving the normal distribution mathematically from the Central Limit Theorem (CLT) in a simple, non-technical explanation is challenging due to the advanced mathematical concepts involved, particularly the use of characteristic functions or moment-generating functions. However, I'll outline a basic approach using characteristic functions to give you a sense of how the derivation works. This explanation simplifies several steps and assumes some familiarity with concepts from probability theory.
The characteristic function of a random variable $X$ is defined as the expected value of $e^{itX}$, where $i$ is the imaginary unit and $t$ is a real number:
$$\varphi_X(t) = E\!\left[e^{itX}\right]$$
Characteristic functions are powerful tools in probability theory because they uniquely determine the distribution of a random variable, and they have properties that make them particularly useful for analyzing sums of independent random variables.
Consider $n$ independent and identically distributed (i.i.d.) random variables $X_1, X_2, \ldots, X_n$, each with mean $\mu$ and variance $\sigma^2$. Let $S_n = \sum_{i=1}^{n} X_i$ be their sum. The characteristic function of $S_n$ is:
$$\varphi_{S_n}(t) = \left[\varphi_X(t)\right]^{n}$$
This is because the characteristic function of a sum of independent variables is the product of their individual characteristic functions.
First, standardize $S_n$ to get $Z_n$:
$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$
We want to find the characteristic function of $Z_n$, $\varphi_{Z_n}(t)$. The characteristic function for $S_n$ is $\left[\varphi_X(t)\right]^{n}$, since the variables are i.i.d.
Given the Taylor expansion of $\varphi_X(t)$ around 0:
$$\varphi_X(t) = 1 + itE[X] - \frac{t^{2}}{2}E[X^{2}] + o(t^{2})$$
To adjust this for $Z_n$, note that we're interested in the effect of $t$ on $Z_n$, not on the original variables. The transformation involves a shift and scaling of $t$, considering the definition of $Z_n$. So, we replace $t$ with $t/(\sigma\sqrt{n})$ to reflect the scaling in $Z_n$ and account for the subtraction of $n\mu$, which shifts the mean to 0:
$$\varphi_{Z_n}(t) = e^{-it n\mu/(\sigma\sqrt{n})}\left[\varphi_X\!\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^{n}$$
Substituting the approximation for $\varphi_X$ and simplifying, we aim to show that this converges to $e^{-t^{2}/2}$ as $n \to \infty$.
When you substitute the Taylor expansion into the expression for $\varphi_{Z_n}(t)$ and simplify, focusing on terms up to the second order, you essentially deal with:
$$\varphi_{Z_n}(t) \approx \left(1 - \frac{t^{2}}{2n} + o\!\left(\frac{1}{n}\right)\right)^{n}$$
Since $\mu$ is the mean of the original distribution, and we're considering the sum minus $n\mu$, scaled by $\sigma\sqrt{n}$, the first-order term cancels and this simplifies to:
$$\varphi_{Z_n}(t) \approx \left(1 - \frac{t^{2}}{2n}\right)^{n}$$
As $n \to \infty$, this expression converges to $e^{-t^{2}/2}$, by the limit definition of the exponential function:
$$\lim_{n \to \infty}\left(1 - \frac{t^{2}}{2n}\right)^{n} = e^{-t^{2}/2}$$
This result, $e^{-t^{2}/2}$, is the characteristic function of a standard normal distribution (i.e., a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1). The inverse Fourier transform (or the characteristic function inversion theorem) tells us that the probability density function corresponding to this characteristic function is the PDF of the standard normal distribution:
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}$$
This formula describes the distribution of values that $Z$ can take, where $z$ represents the number of standard deviations away from the mean a particular observation is. The factor $\frac{1}{\sqrt{2\pi}}$ normalizes the area under the curve of the PDF to 1, ensuring that the total probability across all possible outcomes is 1, as required for any probability distribution.
This demonstrates how, under the Central Limit Theorem, the distribution of the standardized sum (or average) of a large number of i.i.d. random variables, regardless of their original distribution, converges to a normal distribution, provided the original variables have a finite mean and variance.
This derivation, while not delving into the full technical rigor of the proofs involving characteristic functions, provides a conceptual bridge from the CLT to the emergence of the normal distribution.
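The key limit in the argument, $\left(1 - t^2/(2n)\right)^n \to e^{-t^2/2}$, can also be checked numerically. A small sketch (assuming NumPy; the values of $t$ and $n$ are arbitrary):

```python
import numpy as np

for t in (0.5, 1.0, 2.0):
    target = np.exp(-t**2 / 2)
    for n in (10, 100, 10_000):
        approx = (1 - t**2 / (2 * n)) ** n
        print(f"t={t}, n={n:6d}: (1 - t²/2n)^n = {approx:.6f}, e^(-t²/2) = {target:.6f}")
```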
The Chi-square distribution is a widely used probability distribution in statistical inference, particularly in hypothesis testing and in constructing confidence intervals. It arises primarily in contexts involving the sum of squared independent, standard normal variables.
The Chi-square distribution with $k$ degrees of freedom is defined as the distribution of a sum of the squares of $k$ independent standard normal random variables. Mathematically, if $Z_1, Z_2, \ldots, Z_k$ are independent and identically distributed (i.i.d.) standard normal random variables ($Z_i \sim N(0, 1)$), then the random variable
$$Q = \sum_{i=1}^{k} Z_i^{2}$$
follows a Chi-square distribution with $k$ degrees of freedom, denoted as $Q \sim \chi^{2}_{k}$.
The probability density function of the Chi-square distribution for $x > 0$ and $k$ degrees of freedom is given by:
$$f(x; k) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{k/2 - 1} e^{-x/2}$$
where $\Gamma(\cdot)$ is the gamma function and $k$ is the number of degrees of freedom.
The Chi-square distribution is crucial in fields like biology, finance, and physics, where it helps in decision-making processes involving uncertainty and variability.
Here's the plot showing Chi-square distributions for various degrees of freedom (df). The degrees of freedom were chosen as 2, 5, and 10 for this illustration. As you can see, curves with fewer degrees of freedom are strongly right-skewed with most of their mass near zero, while curves with more degrees of freedom peak further to the right and look more symmetric.
This visualization helps in understanding how the shape of the Chi-square distribution is influenced by its degrees of freedom, with greater degrees of freedom leading to a more pronounced and symmetrical shape.
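A plot like the one described can be generated as follows (assuming NumPy, SciPy, and Matplotlib; the df values 2, 5, and 10 match the illustration above):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x = np.linspace(0, 30, 500)
for df in (2, 5, 10):
    plt.plot(x, stats.chi2.pdf(x, df=df), label=f"df = {df}")

plt.xlabel("x")
plt.ylabel("density")
plt.title("Chi-square distributions for various degrees of freedom")
plt.legend()
plt.show()
```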
The Fisher Information is a crucial concept in statistical inference, intimately linked to the variability in the score function of the log-likelihood, where the score function represents the first derivative of the log-likelihood with respect to the parameter $\theta$. Here, Fisher Information is portrayed not just as the variance of the score function, but also through a deeper mathematical connection to the curvature of the log-likelihood function.
Note: This score function is different from the Z-score in Appendix A.2. The Z-score is applied to normalize data, whereas the purpose of the score function here is to find the extremum (maximum) of the log-likelihood.
The score function is formally defined as the derivative of the log-likelihood function with respect to the parameter $\theta$:
$$s(\theta) = \frac{\partial}{\partial\theta}\log L(\theta; x)$$
This function measures the sensitivity of the likelihood function to changes in the parameter $\theta$, thus indicating how much information the observed data provide about $\theta$.
Fisher Information, $I(\theta)$, can be understood as the variance of the score function (the two expressions coincide because the expected score is zero under standard regularity conditions), expressed mathematically as:
$$I(\theta) = E\!\left[s(\theta)^{2}\right] = \operatorname{Var}\!\left(s(\theta)\right)$$
However, a more profound expression of Fisher Information comes from the second derivative of the log-likelihood function:
$$I(\theta) = -E\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\log L(\theta; x)\right]$$
This formulation shows that Fisher Information is the negative expectation of the second derivative of the log-likelihood function. It characterizes the curvature of the log-likelihood function around the parameter $\theta$. A sharper curvature (higher absolute value of the second derivative) implies more information about $\theta$ is available from the data, suggesting that estimates of $\theta$ can be made more precisely.
Information Content: Fisher Information quantifies the amount of information that the sample data ($x$) provides about the parameter ($\theta$). A higher Fisher Information suggests that small changes in $\theta$ produce substantial changes in the likelihood, indicating that the data is highly informative regarding $\theta$.
Precision of Estimators: Fisher Information is inversely related to the variance of any unbiased estimator of $\theta$. This relationship is crystallized in the Cramér-Rao Lower Bound, which states that the variance of any unbiased estimator $\hat{\theta}$ of $\theta$ must be at least as great as the reciprocal of the Fisher Information:
$$\operatorname{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$
Thus, greater Fisher Information yields a smaller lower bound on the variance of the estimator, enhancing the precision with which $\theta$ can be estimated.
In essence, the Fisher Information not only provides a measure of the expected sharpness of the log-likelihood function's peak but also fundamentally ties to how confidently parameters can be estimated from given data. This relationship underscores the vital role of Fisher Information in both theoretical statistics and practical data analysis.
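The equivalence of the two formulations can be illustrated numerically for a normal model with known variance, where the per-observation score for $\mu$ is $(x-\mu)/\sigma^2$ and the second derivative of the per-observation log-likelihood is the constant $-1/\sigma^2$. A minimal sketch (assuming NumPy; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

score = (x - mu) / sigma**2            # d/dmu of the per-observation log-likelihood
second_derivative = -1 / sigma**2      # d^2/dmu^2 (constant for the normal model)

print("Var(score)                   :", score.var())
print("-E[second derivative] = 1/σ² :", -second_derivative)
```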
The expectation of a sum of random variables is equal to the sum of their expectations. This is why the expectation of the sum of the second derivatives can be represented as the sum of the expectations of the second derivatives:
$$E\!\left[\sum_{i=1}^{n}\frac{\partial^{2}}{\partial\theta^{2}}\log p(x_i; \theta)\right] = \sum_{i=1}^{n} E\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\log p(x_i; \theta)\right]$$
This makes explicit that each term in the sum is processed individually through the expectation operator, and it emphasizes that the Fisher Information for the entire dataset can be viewed as the sum of the individual Fisher Informations from each data point.
This formulation shows how the Fisher Information for a model based on multiple i.i.d. observations aggregates information from each observation. Since the observations are i.i.d., the information they provide about $\theta$ is additive. The Fisher Information from each observation contributes to the total information available in the sample about the parameter $\theta$.
Note: The statement "The expectation of a sum of random variables is equal to the sum of their expectations" is a fundamental property in probability theory known as the linearity of expectation. This property holds regardless of whether the random variables are independent or not.
Let's describe this mathematically:
Suppose $X_1, X_2, \ldots, X_n$ are random variables. Then, the expectation of their sum is:
$$E\!\left[\sum_{i=1}^{n} X_i\right]$$
By the linearity of expectation, this can be written as:
$$E\!\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i]$$
To see why this is true, consider the definition of the expected value for discrete random variables (the proof is similar for continuous random variables, using integrals instead of sums). The expected value of a random variable $X$ is given by:
$$E[X] = \sum_{x} x\, P(X = x)$$
where the sum is taken over all possible values $x$ that $X$ can take, and $P(X = x)$ is the probability that $X$ takes the value $x$.
For the sum of two random variables $X$ and $Y$, the expectation is:
$$E[X + Y] = \sum_{x}\sum_{y} (x + y)\, P(X = x, Y = y)$$
Using the distributive property of multiplication over addition, this becomes:
$$E[X + Y] = \sum_{x}\sum_{y} x\, P(X = x, Y = y) + \sum_{x}\sum_{y} y\, P(X = x, Y = y)$$
Here, $P(X = x)$ is the marginal probability of $x$, which is obtained by summing the joint probabilities over all possible values of $Y$:
$$P(X = x) = \sum_{y} P(X = x, Y = y)$$
Using the marginal probability, we rewrite the first double sum:
$$\sum_{x}\sum_{y} x\, P(X = x, Y = y) = \sum_{x} x \sum_{y} P(X = x, Y = y) = \sum_{x} x\, P(X = x)$$
Here, $x$ is factored out of the inner sum because it does not depend on $y$, making the expression equivalent to summing $x$ times the joint probability over all $(x, y)$ pairs.
Now, each of these sums can be separated into sums over $x$ and $y$ respectively, which simplifies to:
$$E[X + Y] = \sum_{x} x\, P(X = x) + \sum_{y} y\, P(Y = y) = E[X] + E[Y]$$
This simplification relies on the fact that you can rearrange the sums: summing the joint probabilities over all values of $Y$ (or $X$) yields the marginal probability of $X$ (or $Y$).
The argument for two variables can be extended inductively to any finite number of random variables. For $n$ variables $X_1, X_2, \ldots, X_n$:
$$E\!\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i]$$
This linearity property is extremely useful because it simplifies calculations of expectations in complex situations involving sums of random variables and is a cornerstone in fields such as statistics, finance, and other areas of applied mathematics and engineering.
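A quick numerical illustration of linearity of expectation for dependent variables (assuming NumPy; $Y$ is deliberately constructed to depend on $X$ to emphasize that independence is not required):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.poisson(lam=3.0, size=1_000_000)
Y = X**2 + rng.normal(size=X.size)      # strongly dependent on X

print("E[X + Y]    :", np.mean(X + Y))
print("E[X] + E[Y] :", X.mean() + Y.mean())   # equal, despite the dependence
```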