Central Limit Theorem, Z and t Distributions

After running several statistical tests to assess my models, I decided to dig deeper into the theory and ask myself questions such as “Why is the number of samples relevant for a statistical test?”, “Why does the standard deviation have a square root in the denominator?” or “Why do statisticians differentiate between the Z and t distributions?”
Since I did not find a blog post that answered all these questions, I decided to run some simulations in Python and share the results in this article for anyone interested.
Central Limit Theorem
The central limit theorem states that if you draw sufficiently large random samples from a population with mean μ and standard deviation σ, then the distribution of the sample means will be approximately normally distributed with mean μ and standard deviation σ/√n.
Assumptions of the Central Limit Theorem
- The samples should be drawn at random (the randomization condition).
- The samples drawn should be independent.
- When sampling without replacement, the sample size shouldn’t exceed 10% of the total population.
- The sample size should be sufficiently large (a common rule of thumb is n ≥ 30).
This theorem applies even to variables that are not normally distributed to begin with, excluding a few distributions mentioned in [1]. The theorem is a crucial concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applied to many problems involving other types of distributions.
To showcase this theorem, I created a population of 1,000,000 values with mean μ = 5 and standard deviation σ = 2, shown in Figure 1.

After drawing 1000 samples of each size (2, 5, 10, 20, 30, and 50) and computing their means, the output is the one shown in Figure 2, where each panel corresponds to one of the sample sizes.

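For readers who want to reproduce this, below is a minimal Python sketch of the simulation, assuming a gamma-shaped (non-normal) population with μ = 5 and σ = 2; the article’s exact population distribution is only shown in Figure 1, so that choice is mine.

```python
import numpy as np

rng = np.random.default_rng(42)

# Population matching the article's parameters: mean 5, std 2, one million values.
# A gamma distribution is used here as an arbitrary non-normal choice.
mu, sigma, pop_size = 5, 2, 1_000_000
shape, scale = (mu / sigma) ** 2, sigma ** 2 / mu  # gamma: mean = shape*scale, var = shape*scale^2
population = rng.gamma(shape, scale, pop_size)

# For each sample size, draw 1000 samples, compute their means, and compare
# the empirical standard error with the theoretical sigma / sqrt(n).
for n in (2, 5, 10, 20, 30, 50):
    sample_means = population[rng.integers(0, pop_size, (1000, n))].mean(axis=1)
    print(f"n={n:>2}  mean of sample means = {sample_means.mean():.3f}  "
          f"empirical SE = {sample_means.std(ddof=1):.3f}  "
          f"theoretical SE = {sigma / np.sqrt(n):.3f}")
```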
According to the central limit theorem, the mean and standard deviation for each of those sampling distributions should be μ = 5 and σ/√n = 2/√n, respectively.
From the means indicated in Figure 2, we observe that the mean of the sampling distribution of the mean is the same as the mean of our initial distribution (μ = 5), regardless of the sample size.
Regarding the standard deviation, Table 1 compares the empirical standard deviation with the theoretical one (σ/√n).

As indicated by the theorem, we observe that the empirical and theoretical standard deviations are similar for all sample sizes.
For a better understanding, check out this YouTube video:
https://www.youtube.com/watch?v=b5xQmk9veZ4
Z and t Distributions
The standard normal or Z-distribution is a normal distribution with a mean of 0 and a standard deviation of 1. Any normal distribution can be standardized by converting its values into z-scores. Z-scores tell you how many standard deviations from the mean each value lies.
Figure 3: Standard Normal Distribution Curve
The random variable of a standard normal distribution is known as the standard score or z-score. Every normal random variable X can be transformed into a z-score using the following formula:
z = (X – μ) / σ
where X is a normal random variable, μ is the mean of X, and σ is the standard deviation of X. Converting a normal distribution into a z-distribution allows you to calculate the probability of certain values occurring and to compare different data sets. The normal cumulative distribution function has no simple closed form, so in practice these probabilities are read from z-tables or computed numerically.
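In practice, libraries do this work for us. The sketch below standardizes a value and evaluates the probability with scipy.stats.norm instead of a z-table; the values X = 8, μ = 5 and σ = 2 are purely illustrative.

```python
from scipy.stats import norm

# Illustrative values: a normal variable with mean 5 and std 2.
mu, sigma, x = 5, 2, 8.0

# z-score: how many standard deviations x lies from the mean.
z = (x - mu) / sigma                      # (8 - 5) / 2 = 1.5

# P(X <= x) via the standard normal CDF, replacing a printed z-table.
print(f"z = {z:.2f}, P(X <= {x}) = {norm.cdf(z):.4f}")   # approx. 0.9332
```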
The t-distribution looks similar: it is also centered at zero and bell-shaped, but it is shorter and flatter, and its standard deviation is proportionally larger than that of the Z-distribution. The t-distribution is sensitive to sample size and is used for small samples when the population standard deviation is unknown.
Check out this article for a real-life application of the Central Limit Theorem: https://towardsdatascience.com/central-limit-theorem-a-real-life-application-f638657686e1
The t-statistic is used to test hypotheses about an unknown population mean μ when the value of σ² is unknown. The formula for the t-statistic has the same structure as the z-score formula, except that it uses the estimated standard error in the denominator:
t = (X̄ – μ) / (s / √n)
where X̄ is the sample mean and s is the sample standard deviation. In other words, the z-score uses the actual population variance σ² (or standard deviation), while the t formula uses the corresponding sample variance (or standard deviation) when the population value is not known. To determine how well a t-statistic approximates a z-score, we must determine how well the sample variance approximates the population variance. In practice, the t-statistic is used for small samples.
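As a sanity check, the sketch below computes the t-statistic by hand and compares it with scipy.stats.ttest_1samp; the five-value sample and the hypothesized mean of 5 are illustrative, not data from this article.

```python
import numpy as np
from scipy.stats import ttest_1samp

sample = np.array([4.2, 5.1, 6.3, 4.8, 5.9])   # illustrative data
mu_0 = 5.0                                     # hypothesized population mean

# Manual computation: t = (X_bar - mu) / (s / sqrt(n)), with s the sample std.
n = len(sample)
t_manual = (sample.mean() - mu_0) / (sample.std(ddof=1) / np.sqrt(n))

# Cross-check with scipy's one-sample t-test.
t_scipy, p_value = ttest_1samp(sample, mu_0)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, p = {p_value:.4f}")
```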
The t-distribution gives more probability to observations in the tails of the distribution than the standard normal distribution (a.k.a. the z-distribution). In this way, the t-distribution is more conservative than the standard normal distribution: to reach the same level of confidence or statistical significance, you will need to include a wider range of the data.
Figure 4: Comparison of Z and t distribution in Statistics.
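This conservatism is easy to see in the critical values. The sketch below compares the two-sided 95% critical value of the z-distribution (about 1.96) with the corresponding t critical values for a few degrees of freedom, which I chose for illustration.

```python
from scipy.stats import norm, t

# Two-sided 95% critical value of the standard normal distribution.
print(f"z critical value: {norm.ppf(0.975):.3f}")        # ~1.960

# The t critical value is larger (wider interval) for small df,
# and shrinks toward the z value as df grows.
for df in (2, 5, 10, 30, 100):
    print(f"df = {df:>3}: t critical value = {t.ppf(0.975, df):.3f}")
```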
The Z- and t-distributions are well known for their application in tests of significance. A test of significance enables us to decide whether the deviation between the observed sample statistic (mean) and the hypothesized value of the population parameter, or between the sample statistics (means) of two independent groups, is statistically significant.
Since we are comparing means, it is necessary to use the mean and the standard deviation of the sampling distribution of the mean which, as mentioned above, are μ and σ/√n, respectively.
Also, to use these tests, it is necessary that:
- The data follow a continuous or ordinal scale.
- The data are randomly selected.
- The data are normally distributed.
According to the central limit theorem, the distribution of the sample mean follows a normal distribution. For this reason, some books indicate that the t-test and z-test can be applied without the normality test. Although the central limit theorem guarantees the normal distribution of the sample mean values, it does not guarantee the normal distribution of samples in the population, which is essential for the t-test since its purpose is to compare certain characteristics representing groups [2].
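One way to guard against this, sketched below under my own choice of test and threshold, is to run a Shapiro-Wilk normality check on the sample before trusting a t-test; the simulated data are illustrative.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=2, size=25)   # illustrative small sample

# Shapiro-Wilk tests the null hypothesis that the sample comes from
# a normal distribution; a small p-value casts doubt on normality.
stat, p = shapiro(sample)
print(f"Shapiro-Wilk: W = {stat:.3f}, p = {p:.3f}")
if p < 0.05:
    print("Normality is doubtful; a t-test may be inappropriate here.")
```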
When to use the Z- and t-distributions
Normal distributions are used when the population distribution is assumed to be normal, and both the z- and t-distributions assume a normally distributed population. The t-distribution is similar to the normal distribution, just with fatter tails: it has higher kurtosis, so the probability of getting values very far from the mean is larger than under a normal distribution. The t-distribution can sacrifice some exactness relative to the normal distribution, but this shortcoming only arises when there is a need for perfect normality. The t-distribution should only be used when the population standard deviation is not known.
Considering the assumptions presented above, the question of when to use the Z-distribution or the t-distribution can be answered by following the diagram below.

One of the key points, and probably the most important lesson in this article, is the passage mentioned in [3], which says that the t-distribution describes the standardized distances of the sample mean to the population mean when the population standard deviation is not known, and the observations come from a normally distributed population.
Besides, as mentioned in [4], the t-distribution becomes closer to the normal distribution as the sample size increases. In particular, when the sample size is above 30, the t-distribution is practically identical to the normal distribution. Figure 6 illustrates this convergence.

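The sketch below reproduces the idea behind Figure 6 by overlaying t densities for increasing degrees of freedom on the standard normal density; the df values 1, 5 and 30 are my choice for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, t

x = np.linspace(-4, 4, 400)
plt.plot(x, norm.pdf(x), "k--", label="standard normal (z)")
for df in (1, 5, 30):
    # Higher df -> thinner tails -> closer to the normal curve.
    plt.plot(x, t.pdf(x, df), label=f"t, df = {df}")
plt.legend()
plt.title("t-distribution approaching the standard normal")
plt.show()
```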
Therefore, the t-distribution is mainly used when the sample size is below 30, although it is still possible to use it with larger samples.
To conclude, Figure 5 and Figure 6 show the distributions when the standard deviation of the population is known and unknown, respectively. When the population standard deviation is known, or when it is unknown but the sample size is greater than 30, the distributions approximate a normal distribution, so Z-scores can be used. However, when the population standard deviation is unknown and the sample size is smaller than 30, the distribution varies with the sample size, and t-scores are required.
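To wrap the decision rule into code, here is a hypothetical helper (the function name and the n > 30 cutoff simply encode the rule above; this is a sketch, not a standard API) that returns the appropriate critical value.

```python
from scipy.stats import norm, t

def critical_value(confidence: float, n: int, sigma_known: bool) -> float:
    """Two-sided critical value following the z-vs-t decision rule above."""
    q = 1 - (1 - confidence) / 2
    if sigma_known or n > 30:
        return norm.ppf(q)            # Z-score applies
    return t.ppf(q, df=n - 1)         # t-score with n - 1 degrees of freedom

print(critical_value(0.95, n=12, sigma_known=False))  # t: ~2.201
print(critical_value(0.95, n=12, sigma_known=True))   # z: ~1.960
```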


References:
- Check out this Kaggle notebook to see how you can implement the above concepts in Python.
- Relevant and interesting links related to the topic: https://genomicsclass.github.io/book/pages/clt_and_t-distribution.htm
- https://towardsdatascience.com/central-limit-theorem-a-real-life-application-f638657686e1