Population Parameter:
Mean, $\mu = \Sigma X / N$
where, $\Sigma X$ is the sum of all values in the population and $N$ is the size of the population.
Standard deviation, $\sigma = \sqrt{\frac{\sum(X-\mu)^2}{N}}$
where, $X$ is a value in the population, $\mu$ is the population mean, and $N$ is the size of the population.
Sample Statistic:
Mean, $\bar{x} = \Sigma x / n$
where, $\Sigma x$ is the sum of all values in the sample and $n$ is the size of the sample.
Standard deviation, $s = \sqrt{\frac{\sum(x-\bar{x})^2}{n-1}}$
where, $x$ is a value in the sample, $\bar{x}$ is the sample mean, and $n$ is the size of the sample.
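To make the formulas concrete, here is a minimal NumPy sketch (the toy heights array is invented for illustration). Note that np.std() divides by $N$ by default, and ddof=1 switches it to the $n-1$ sample formula:

import numpy as np

heights = np.array([160.0, 165.0, 170.0, 172.0, 168.0, 175.0])  # toy population

# Population parameters: divide by N
mu = heights.sum() / heights.size
sigma = np.sqrt(((heights - mu) ** 2).sum() / heights.size)  # equals heights.std()

# Sample statistics: divide by n - 1 for the standard deviation
sample = heights[:4]
x_bar = sample.mean()
s = np.sqrt(((sample - x_bar) ** 2).sum() / (sample.size - 1))  # equals sample.std(ddof=1)

print(mu, sigma)
print(x_bar, s)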
A population parameter is a numerical value that describes a characteristic of a population, such as the mean or standard deviation. It is usually unknown and is estimated from sample data. For example, the population mean height of all students in a school is a population parameter.
A sample statistic, on the other hand, is a numerical value that describes a characteristic of a sample, such as the sample mean or sample standard deviation. It is calculated from sample data and used to make inferences about the population. For example, the sample mean height of a group of randomly selected students is a sample statistic.
Point estimates are estimates of population parameters based on sample data. For instance, if we wanted to know the average age of registered voters in India, we could take a survey of registered voters and then use the average age of the respondents as a point estimate of the average age of the population as a whole. The average of a sample is known as the sample mean.
The sample mean is usually not exactly the same as the population mean. This difference can be caused by many factors, such as poor survey design, biased sampling methods, and the randomness inherent to drawing a sample from a population.
import numpy as np
import random
import math
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import poisson
np.random.seed(46)  # For reproducibility
population_age_1 = poisson.rvs(loc=18, mu=35, size=150000)
population_age_2 = poisson.rvs(loc=18, mu=10, size=100000)
population_ages = np.concatenate((population_age_1, population_age_2))
population_ages.mean()
42.991528
sample_ages = np.random.choice(a=population_ages, size=1000)
print(len(sample_ages))
print(sample_ages.mean())
print(population_ages.mean() - sample_ages.mean())
1000
42.535
0.4565280000000058
Our point estimate based on a sample of 1000 individuals underestimates the true population mean by 0.456 years, but it is close. This illustrates an important point: We can get a fairly accurate estimate of a large population by sampling a relatively small subset of individuals.
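To see how sample size affects the estimate, here is a quick sketch (reusing population_ages from above) that draws progressively larger samples. The exact numbers vary from run to run, but the error generally shrinks as n grows:

# Draw samples of increasing size and compare each point estimate to the truth
for size in (10, 100, 1000, 10000):
    estimate = np.random.choice(a=population_ages, size=size).mean()
    print(f"n={size:>5}  estimate={estimate:.3f}  error={population_ages.mean() - estimate:+.3f}")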
population_races = (["white"] * 100000) + (["black"] * 50000) + \
                   (["hispanic"] * 50000) + (["asian"] * 25000) + \
                   (["other"] * 25000)
demo_sample = random.sample(population_races, 1000)
for race in set(demo_sample):
    print(race + " proportion estimate:")
    print(demo_sample.count(race) / 1000)
black proportion estimate:
0.208
white proportion estimate:
0.425
other proportion estimate:
0.084
asian proportion estimate:
0.085
hispanic proportion estimate:
0.198
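Because we constructed population_races ourselves, we can check each point estimate against the true proportions (white 0.4, black 0.2, hispanic 0.2, asian 0.1, other 0.1). A quick sketch:

# Compare each sample proportion to the true population proportion
for race in set(demo_sample):
    true_prop = population_races.count(race) / len(population_races)
    est_prop = demo_sample.count(race) / 1000
    print(f"{race}: true proportion {true_prop:.3f}, estimate {est_prop:.3f}")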
Many statistical procedures assume that the data follows a normal distribution, because the normal distribution has nice properties like symmetry and having the majority of the data clustered within a few standard deviations of the mean. Unfortunately, real-world data is often not normally distributed, and the distribution of a sample tends to mirror the distribution of the population.
plt.hist(population_ages, bins=range(5, 90, 1))
plt.xlabel("Population Ages")
plt.ylabel("Frequency")
plt.show()
plt.hist(sample_ages, bins=range(5, 90, 1))
plt.xlabel("Sample Ages")
plt.ylabel("Frequency")
plt.show()
The sample has roughly the same shape as the underlying population. This suggests that we can't apply techniques that assume a normal distribution to this dataset, since it is not normal.
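We can make that claim quantitative with a normality test such as SciPy's D'Agostino-Pearson test (scipy.stats.normaltest); a tiny p-value means we reject the hypothesis that the data is normal. A minimal sketch:

from scipy.stats import normaltest

# Null hypothesis: the data comes from a normal distribution
stat, p_value = normaltest(sample_ages)
print(f"statistic={stat:.2f}, p-value={p_value:.3g}")  # tiny p-value => reject normality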
In reality, we can still apply techniques that assume a normal distribution, thanks to the central limit theorem.
The central limit theorem is one of the most important results in probability theory and serves as the foundation of many methods of statistical analysis. At a high level, the theorem states that the distribution of many sample means, known as the sampling distribution, will be approximately normal. This holds even if the underlying distribution itself is not normally distributed. As a result, we can treat the sample mean as if it were drawn from a normal distribution.
To illustrate, let's create a sampling distribution by taking 200 samples from our population and then making 200 point estimates of the mean.
point_estimates = list()
for x in range(200):
    sample = np.random.choice(a=population_ages, size=500)
    point_estimates.append(sample.mean())
print(f"Point estimates: {point_estimates}")
Point estimates: [42.522, 43.342, 42.782, 43.156, 43.306, 42.446, 41.49, 42.82, 42.826, 42.648, 42.792, 43.234, 42.142, 43.432, 43.794, 43.946, 44.73, 43.18, 42.728, 42.02, 43.396, 42.602, 43.906, 44.05, 43.144, 43.202, 42.492, 43.504, 42.5, 44.636, 42.038, 42.242, 43.306, 42.666, 42.442, 42.742, 43.414, 43.368, 43.43, 43.44, 43.024, 42.974, 43.0, 43.042, 43.274, 43.364, 42.216, 43.394, 42.326, 42.936, 43.03, 43.22, 43.202, 42.618, 42.246, 43.238, 43.256, 43.218, 43.15, 42.712, 42.892, 42.59, 42.972, 42.77, 43.62, 43.146, 42.802, 45.182, 43.642, 42.032, 43.242, 43.144, 43.608, 42.154, 43.114, 42.838, 43.03, 42.388, 42.69, 42.902, 42.066, 43.452, 42.572, 42.636, 42.386, 42.812, 42.462, 42.844, 42.76, 43.408, 43.284, 43.964, 42.442, 43.528, 43.728, 42.94, 42.792, 42.646, 43.926, 42.464, 42.31, 42.57, 42.806, 44.74, 42.256, 44.926, 42.19, 43.894, 43.694, 42.562, 43.002, 42.13, 42.466, 42.558, 42.824, 43.828, 42.53, 43.036, 42.964, 43.14, 42.244, 43.294, 43.152, 43.018, 42.652, 42.188, 43.778, 42.608, 43.432, 42.332, 43.388, 42.64, 43.592, 42.24, 42.27, 42.188, 42.492, 42.374, 42.71, 42.696, 42.264, 43.336, 42.004, 42.814, 41.616, 41.116, 41.414, 43.036, 43.05, 42.43, 43.058, 43.006, 41.812, 41.674, 42.676, 42.368, 43.222, 42.542, 43.522, 43.072, 43.216, 43.58, 42.98, 42.176, 42.71, 42.962, 43.396, 42.52, 43.406, 43.822, 43.214, 42.864, 43.51, 42.204, 44.01, 43.426, 42.774, 42.47, 43.684, 42.792, 42.704, 42.84, 43.25, 44.13, 43.01, 41.702, 42.198, 42.868, 43.09, 42.736, 42.15, 43.12, 42.392, 42.8, 42.88, 43.358, 41.964, 42.2, 42.802, 43.07]
sns.histplot(point_estimates, kde=True)
plt.show()
The sampling distribution appears to be roughly normal, despite the bimodal population distribution that the samples were drawn from. In addition, the mean of the sampling distribution approaches the true population mean.
print(population_ages.mean())
print(population_ages.mean() - np.array(point_estimates))
print(population_ages.mean() - np.array(point_estimates).mean())
42.991528
[ 0.469528 -0.350472 0.209528 -0.164472 -0.314472 0.545528 1.501528 0.171528 0.165528 0.343528 0.199528 -0.242472 0.849528 -0.440472 -0.802472 -0.954472 -1.738472 -0.188472 0.263528 0.971528 -0.404472 0.389528 -0.914472 -1.058472 -0.152472 -0.210472 0.499528 -0.512472 0.491528 -1.644472 0.953528 0.749528 -0.314472 0.325528 0.549528 0.249528 -0.422472 -0.376472 -0.438472 -0.448472 -0.032472 0.017528 -0.008472 -0.050472 -0.282472 -0.372472 0.775528 -0.402472 0.665528 0.055528 -0.038472 -0.228472 -0.210472 0.373528 0.745528 -0.246472 -0.264472 -0.226472 -0.158472 0.279528 0.099528 0.401528 0.019528 0.221528 -0.628472 -0.154472 0.189528 -2.190472 -0.650472 0.959528 -0.250472 -0.152472 -0.616472 0.837528 -0.122472 0.153528 -0.038472 0.603528 0.301528 0.089528 0.925528 -0.460472 0.419528 0.355528 0.605528 0.179528 0.529528 0.147528 0.231528 -0.416472 -0.292472 -0.972472 0.549528 -0.536472 -0.736472 0.051528 0.199528 0.345528 -0.934472 0.527528 0.681528 0.421528 0.185528 -1.748472 0.735528 -1.934472 0.801528 -0.902472 -0.702472 0.429528 -0.010472 0.861528 0.525528 0.433528 0.167528 -0.836472 0.461528 -0.044472 0.027528 -0.148472 0.747528 -0.302472 -0.160472 -0.026472 0.339528 0.803528 -0.786472 0.383528 -0.440472 0.659528 -0.396472 0.351528 -0.600472 0.751528 0.721528 0.803528 0.499528 0.617528 0.281528 0.295528 0.727528 -0.344472 0.987528 0.177528 1.375528 1.875528 1.577528 -0.044472 -0.058472 0.561528 -0.066472 -0.014472 1.179528 1.317528 0.315528 0.623528 -0.230472 0.449528 -0.530472 -0.080472 -0.224472 -0.588472 0.011528 0.815528 0.281528 0.029528 -0.404472 0.471528 -0.414472 -0.830472 -0.222472 0.127528 -0.518472 0.787528 -1.018472 -0.434472 0.217528 0.521528 -0.692472 0.199528 0.287528 0.151528 -0.258472 -1.138472 -0.018472 1.289528 0.793528 0.123528 -0.098472 0.255528 0.841528 -0.128472 0.599528 0.191528 0.111528 -0.366472 1.027528 0.791528 0.189528 -0.078472]
0.07206800000000158
A point estimate can give you a rough idea of a population parameter like the mean, but estimates are prone to error and taking multiple samples to get improved estimates may not be feasible.
A confidence interval is a range of values above and below a point estimate that captures the true population parameter at some predetermined confidence level. For example, if you want a 95% chance of capturing the true population parameter with a point estimate and a corresponding confidence interval, you would set your confidence level to 95%. Higher confidence levels result in wider confidence intervals.
You calculate a confidence interval by taking a point estimate and then adding and subtracting a margin of error to create a range. The margin of error is based on your desired confidence level, the spread of the data, and the size of your sample. The way you calculate the margin of error depends on whether you know the standard deviation of the population.
If you know the standard deviation of the population, the margin of error is equal to:
$$z * \frac{\sigma}{\sqrt{n}}$$

where $\sigma$ is the population standard deviation, $n$ is the sample size, and $z$ is a number known as the z-critical value. The z-critical value is the number of standard deviations you'd have to go from the mean of the normal distribution to capture the proportion of the data associated with the desired confidence level. For instance, we know that roughly 95% of the data in a normal distribution lies within 2 standard deviations of the mean, so we could use 2 as the z-critical value for a 95% confidence interval (although it is more exact to get z-critical values with stats.norm.ppf()).
Let's calculate a 95% confidence interval for our mean point estimate:
from scipy.stats import norm
sample_size = 1000
sample = np.random.choice(a=population_ages, size=sample_size)
sample_mean = sample.mean()
z_critical = norm.ppf(q=0.975)
print(f"z-critical value: {z_critical}")
pop_stddev = population_ages.std()
margin_of_error = z_critical * (pop_stddev / math.sqrt(sample_size))
confidence_interval = (sample_mean - margin_of_error,
                       sample_mean + margin_of_error)
print(f"Confidence interval: {confidence_interval}")
z-critical value: 1.959963984540054
Confidence interval: (41.60125214484725, 43.24074785515275)
Notice that the confidence interval we calculated captures the true population mean of 42.9915.
The method norm.ppf() takes a probability and returns the z-value (the number of standard deviations from the mean of a standard normal distribution) below which that proportion of the data falls. It is equivalent to a 'one-tail test' on the density plot.
From scipy.stats.norm:
ppf(q, loc=0, scale=1) Percent point function (inverse of cdf — percentiles).
For a one-tailed test:
norm.ppf(q=0.95) returns the z-critical value for a 95% confidence level on a one-tailed test of the standard normal distribution (i.e. a special case of the normal distribution where the mean is 0 and the standard deviation is 1).
For a two-tailed test: If we need to calculate a 'two-tail test' (i.e. we're concerned with values both greater and less than our mean), then we need to split the significance level (i.e. our alpha value), because we're still using a one-tail calculation method. The split in half symbolizes the significance level being apportioned to both tails. A 95% confidence level has a 5% alpha; splitting the 5% alpha across both tails gives 2.5% in each. Taking 2.5% from 100% gives 97.5% as the input for the significance level.
Therefore, if we were concerned with values on both sides of our mean, our code would input 0.975 to represent a 95% confidence level across two tails: norm.ppf(q=0.975)
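A short sketch to make the one-tail/two-tail distinction concrete, using norm.cdf() to verify the round trip:

from scipy.stats import norm

one_tail = norm.ppf(q=0.95)   # 95% of the distribution lies below this z (~1.645)
two_tail = norm.ppf(q=0.975)  # leaves 2.5% in each tail for a 95% CI (~1.960)
print(one_tail, two_tail)
print(norm.cdf(two_tail) - norm.cdf(-two_tail))  # ~0.95, the middle area captured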
Let's create several confidence intervals and plot them to get a better sense of what it means to "capture" the true mean:
sample_size = 1000
intervals = list()
sample_means = list()
for _ in range(25):
    sample = np.random.choice(a=population_ages, size=sample_size)
    sample_mean = sample.mean()
    sample_means.append(sample_mean)
    z_critical = norm.ppf(q=0.975)
    pop_stddev = population_ages.std()
    margin_of_error = z_critical * (pop_stddev / math.sqrt(sample_size))
    confidence_interval = (sample_mean - margin_of_error,
                           sample_mean + margin_of_error)
    intervals.append(confidence_interval)
plt.figure(figsize=(9, 9))
plt.errorbar(x=np.arange(0.1, 25, 1),
             y=sample_means,
             yerr=[(upper - lower) / 2 for lower, upper in intervals],
             fmt='o')
plt.hlines(xmin=0, xmax=25,
           y=42.99,
           linewidth=2.0,
           color="red")
plt.show()
Notice that in the plot above, all but one of the 95% confidence intervals overlap the red line marking the true mean. This is to be expected: since a 95% confidence interval captures the true mean 95% of the time, we'd expect our interval to miss the true mean 5% of the time.
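We can check that coverage claim empirically with a sketch that simulates many intervals and counts how often they contain the true mean; expect a fraction near 0.95, with some run-to-run variation:

# Simulate many 95% confidence intervals and count how many capture the true mean
true_mean = population_ages.mean()
moe = norm.ppf(q=0.975) * (population_ages.std() / math.sqrt(1000))
captured = 0
trials = 1000
for _ in range(trials):
    sample_mean = np.random.choice(a=population_ages, size=1000).mean()
    if sample_mean - moe <= true_mean <= sample_mean + moe:
        captured += 1
print(f"Captured the true mean in {captured / trials:.1%} of intervals")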
If you don't know the standard deviation of the population, you have to use the standard deviation of your sample as a stand-in when creating confidence intervals. Since the sample standard deviation may not match the population parameter, the interval will have more error when you don't know the population standard deviation. To account for this error, we use what's known as a t-critical value instead of the z-critical value. The t-critical value is drawn from what's known as a t-distribution, a distribution that closely resembles the normal distribution but gets wider as the sample size falls. The t-distribution is available in scipy.stats with the nickname "t", so we can get t-critical values with stats.t.ppf().
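To get a feel for this, a quick sketch comparing t-critical values at several degrees of freedom against the z-critical value; the t values shrink toward z as the degrees of freedom grow:

from scipy.stats import norm, t

# t-critical values approach the z-critical value as df increases
print(f"z (normal):  {norm.ppf(0.975):.4f}")
for df in (5, 30, 1000):
    print(f"t (df={df:>4}): {t.ppf(0.975, df=df):.4f}")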
Let's take a new, smaller sample and then create a confidence interval without the population standard deviation, using the t-distribution:
from scipy.stats import t
sample_size = 25
sample = np.random.choice(a=population_ages, size=sample_size)
sample_mean = sample.mean()
t_critical = t.ppf(q=0.975, df=24)  # Get the t-critical value
print("t-critical value:")
print(t_critical)
sample_stdev = sample.std(ddof=1) # Get the sample standard deviation
sigma = sample_stdev / math.sqrt(sample_size) # Standard deviation estimate
margin_of_error = t_critical * sigma
confidence_interval = (sample_mean - margin_of_error,
                       sample_mean + margin_of_error)
print("Confidence interval:")
print(confidence_interval)
t-critical value:
2.0638985616280205
Confidence interval:
(39.169051919281664, 49.07094808071833)
Note: when using the t-distribution, you have to supply the degrees of freedom (df). For this type of test, the degrees of freedom is equal to the sample size minus 1. If you have a large sample size, the t-distribution approaches the normal distribution.
Notice that the t-critical value is larger than the z-critical value we used for the 95% confidence interval. This allows the confidence interval to cast a wider net to make up for the variability caused by using the sample standard deviation in place of the population standard deviation. The end result is a much wider confidence interval (an interval with a larger margin of error).
If you have a large sample, the t-critical value will approach the z-critical value so there is little difference between using the normal distribution vs. the t-distribution:
# Check the difference between critical values with a sample size of 1000
t.ppf(q=0.975, df=999) - norm.ppf(0.975)
0.0023774765933946007
We can also compute the same interval in one call with stats.t.interval():

t.interval(
    confidence=0.95,  # Confidence level
    df=24,            # Degrees of freedom
    loc=sample_mean,  # Sample mean
    scale=sigma       # Standard deviation estimate
)
(39.169051919281664, 49.07094808071833)
We can also make a confidence interval for a point estimate of a population proportion. In this case, the margin of error equals:
$$z * \sqrt{\frac{p(1-p)}{n}}$$

where $z$ is the z-critical value for our confidence level, $p$ is the point estimate of the population proportion, and $n$ is the sample size. Let's calculate a 95% confidence interval for the Hispanic proportion, using 0.181 as the point estimate (a value close to the 0.198 sample proportion we calculated earlier; point estimates vary from sample to sample):
z_critical = norm.ppf(0.975) # Record z-critical value
p = 0.181 # Point estimate of proportion
n = 1000 # Sample size
margin_of_error = z_critical * math.sqrt((p * (1 - p)) / n)
confidence_interval = (p - margin_of_error,  # Calculate the interval
                       p + margin_of_error)
confidence_interval
(0.1571367643828236, 0.2048632356171764)
The output shows that the confidence interval captures the true population parameter of 0.2. As with our population mean point estimates, we can use the scipy stats.distribution.interval() function to calculate a confidence interval for a population proportion for us. In this case we're working with z-critical values, so we want the normal distribution instead of the t-distribution:
norm.interval(
    confidence=0.95,                    # Confidence level
    loc=0.181,                          # Point estimate of proportion
    scale=math.sqrt((p * (1 - p)) / n)  # Scaling factor
)
(0.1571367643828236, 0.2048632356171764)
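As an aside, if you have the statsmodels library installed, its proportion_confint() function computes the same normal-approximation interval directly from counts. A minimal sketch, assuming 181 of our 1000 respondents were Hispanic (matching the p=0.181 point estimate above):

# Sketch assuming statsmodels is available; count=181 corresponds to p=0.181
from statsmodels.stats.proportion import proportion_confint

proportion_confint(count=181, nobs=1000, alpha=0.05, method="normal")
# Should approximately match the norm.interval() result above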