A measure of central tendency is a single value that describes a set of data by identifying the central position within that set. Measures of central tendency are sometimes also called measures of central location or summary statistics.
The following are the common measures of central tendency:
The mean (or average) is the most popular and well-known measure of central tendency. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. The sample mean is usually denoted by $\overline{x}$ (x bar) and the population mean is usually denoted by $\mu$ (mu).
The formula for calculating the mean is as follows:
$$\overline{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
def mean(x):
    '''
    Returns arithmetic mean of array-like object.
    Parameters:
    x : array-like object
    '''
    return sum(x) / len(x)
# Generating 51 random integers from 0 to 99 (the upper bound 100 is exclusive), with a fixed seed for reproducibility
from numpy.random import seed, randint
seed(42)
x = randint(0, 100, 51)
print(x)
[51 92 14 71 60 20 82 86 74 74 87 99 23 2 21 52 1 87 29 37 1 63 59 20 32 75 57 21 88 48 90 58 41 91 59 79 14 61 61 46 61 50 54 63 2 50 6 20 72 38 17]
print(f"The mean is: {mean(x)}")
The mean is: 50.1764705882353
# Using the mean function in numpy
import numpy as np
print(f"The mean is: {np.mean(x)}")
The mean is: 50.1764705882353
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers, that is, values that are unusual compared to the rest of the dataset.
In situations like these we use another measure of central tendency, namely the median.
The median is the middle value of a set of data that has been arranged in order of magnitude. When the number of values is even, it is the average of the two middle values. The median is less affected by outliers and skewed data.
def median(v):
    '''
    Returns median of array-like object.
    Parameters:
    v : array-like object
    '''
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        # Odd number of values: return the middle one
        return sorted_v[midpoint]
    else:
        # Even number of values: average the two middle ones
        low = midpoint - 1
        high = midpoint
        return (sorted_v[low] + sorted_v[high]) / 2
# Using the same dataset x
print(f"The median is: {median(x)}")
The median is: 54
# Using the median function in numpy
print(f"The median is: {np.median(x)}")
The median is: 54.0
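To make the outlier sensitivity described above concrete, here is a small sketch using Python's standard `statistics` module. The list `salaries` and its values are made up for illustration and are not part of the dataset `x`.

```python
import statistics

# Hypothetical data: five typical values (illustrative only)
salaries = [30, 32, 35, 38, 40]
mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)

# Add a single extreme value (an outlier) and recompute both measures
salaries_with_outlier = salaries + [500]
mean_after = statistics.mean(salaries_with_outlier)
median_after = statistics.median(salaries_with_outlier)

print(f"Mean:   {mean_before} -> {mean_after}")
print(f"Median: {median_before} -> {median_after}")
```

In this example one extreme value moves the mean from 35 to 112.5, while the median only shifts from 35 to 36.5.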
The mode is the value that appears most frequently in a data set. A set of data may have one mode, more than one mode, or no mode at all.
from collections import Counter
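def mode(x):
    '''
    Returns a list of modes for the given array-like object.
    Parameters:
    x : array-like object
    '''
    c = Counter(x)
    # Highest frequency among all values (computed once)
    max_count = c.most_common(1)[0][1]
    # Keep every value that reaches that frequency
    return [k for k, v in c.items() if v == max_count]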
# Using the same dataset x
print(f"The mode is: {mode(x)}")
The mode is: [20, 61]
# As numpy does not have a mode function, let's use Python's statistics module
import statistics
print(f"The mode is: {statistics.multimode(x)}")
The mode is: [20, 61]
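The three cases mentioned above (one mode, several modes, no mode) can be sketched with `statistics.multimode`, which is available from Python 3.8. The small lists below are made up for illustration. Note that when every value occurs equally often, `multimode` returns all of them; treating that result as "no mode" is a convention assumed here, not something the function reports.

```python
import statistics

# Illustrative lists, not taken from the dataset x
one_mode = [1, 2, 2, 3]        # 2 occurs most often
two_modes = [1, 1, 2, 2, 3]    # 1 and 2 tie for most frequent
all_unique = [1, 2, 3]         # every value occurs exactly once

modes_one = statistics.multimode(one_mode)
modes_two = statistics.multimode(two_modes)
modes_unique = statistics.multimode(all_unique)

print(modes_one)
print(modes_two)
print(modes_unique)

# One possible convention: no mode if every distinct value ties
has_no_mode = len(modes_unique) == len(set(all_unique))
print(has_no_mode)
```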
Quantiles are cut points that divide the range of a probability distribution into continuous intervals with equal probabilities, or divide the observations in a sample in the same way. The median is a quantile: it is placed so that half of the data lies below it and half lies above it. Because the median cuts a distribution into two equal areas, it is sometimes called the 2-quantile.
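def quantile(x, p):
    '''
    Returns the p-th quantile value of an array-like object
    (no interpolation between neighbouring values).
    Parameters:
    x : array-like object
    p : float value between 0 and 1
    '''
    # Clamp the index so that p = 1 returns the largest value
    p_index = min(int(p * len(x)), len(x) - 1)
    return sorted(x)[p_index]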
# Using the same dataset x
print(f"The 0.3 quantile is: {quantile(x, 0.3)}")
The 0.3 quantile is: 32
# Using quantile function in numpy
print(f"The 0.3 quantile is: {np.quantile(x, 0.3, axis = 0)}")
The 0.3 quantile is: 32.0
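As a sketch of the relationship between the median and other quantiles, the example below asks `np.quantile` for the three quartiles (the 0.25, 0.5 and 0.75 quantiles) in one call and checks that the middle one equals the median. The list `data` is made up for illustration.

```python
import numpy as np

# Illustrative values, not taken from the dataset x
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Quartiles: the cut points at probabilities 0.25, 0.5 and 0.75
q1, q2, q3 = np.quantile(data, [0.25, 0.5, 0.75])
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")

# The 0.5 quantile is the median
print(q2 == np.median(data))
```

By default `np.quantile` interpolates linearly between neighbouring values when the requested position falls between two observations, so its result can differ from a simple index-based lookup on other inputs.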
Dispersion in statistics describes how spread out a set of data is, that is, how far the values are stretched or scattered around a central value. Measures of dispersion quantify this spread objectively, which helps in judging how variable the data are.
If all the values are close together, the spread is low. If, on the other hand, some or all of the values differ by a large amount from the mean (and from each other), the spread is large.
The following are the common measures of dispersion:
Variance is the arithmetic mean of the squares of the deviations of the given values from their arithmetic mean. In simpler language, it is calculated by taking the difference between each number in the data set and the mean, squaring the differences to make them positive, and finally dividing the sum of the squares by the number of values in the data set. Variance is calculated using the following formula, $$ V = \frac{ \sum_{i=1}^{n} (x_{i} - \overline{x})^{2}}{n} $$
def variance(x):
    '''
    Returns variance of an array-like object of numeric values
    Parameters:
    x : array-like object of numeric values
    '''
    length = len(x)
    avg = mean(x)
    # Accumulate the squared deviations from the mean
    sum_squares = 0
    for value in x:
        sum_squares += (value - avg) ** 2
    return sum_squares / length
# Using the same dataset x
print(f"The variance is: {variance(x)}")
The variance is: 782.4198385236446
# Using Variance function in numpy
print(f"The variance is: {np.var(x)}")
The variance is: 782.4198385236447
As variance is produced by squaring the distances from the mean, its unit is the square of the unit of the original data. The standard deviation restores the original unit: it is the positive square root of the variance. Standard deviation is denoted by $\sigma$ (sigma) and can be calculated using the following formula, $$ \sigma = \sqrt{V} = \sqrt{\frac{ \sum_{i=1}^{n} (x_{i} - \overline{x})^{2}}{n}} $$
Note: The denominator depends on whether the variance / standard deviation describes a population or a sample. The population formulas above divide by $n$, while the sample versions divide by $n - 1$.
from math import sqrt
def std(x):
    '''
    Returns standard deviation of array-like object of numeric values
    Parameters:
    x : array-like object of numeric values
    '''
    return sqrt(variance(x))
# Using the same dataset x
print(f"The standard deviation is: {std(x)}")
The standard deviation is: 27.971768598421598
# Using Standard deviation function in numpy
print(f"The standard deviation is: {np.std(x)}")
The standard deviation is: 27.9717685984216
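The note above about the denominator can be illustrated with NumPy's `ddof` ("delta degrees of freedom") argument: `np.var` and `np.std` divide by `n - ddof`, and `ddof` defaults to 0, which gives the population versions. The list `data` below is made up for illustration.

```python
import numpy as np

# Illustrative values, not taken from the dataset x
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population variance / standard deviation: divide by n (ddof=0 is the default)
pop_var = np.var(data)
pop_std = np.std(data)

# Sample variance / standard deviation: divide by n - 1
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

print(f"Population: variance = {pop_var}, std = {pop_std}")
print(f"Sample:     variance = {sample_var}, std = {sample_std}")
```

For these values the sum of squared deviations is 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7. The sample version is always the larger of the two, and the gap shrinks as the number of values grows.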
The range is the simplest measure of dispersion. As a quantity, the range is the difference between the highest and lowest values in a distribution.
def srange(x):
    '''
    Returns range of an array-like object of numeric values
    Parameters:
    x : array-like object of numeric values
    '''
    return max(x) - min(x)
# Using the same dataset x
print(f"The range is: {srange(x)}")
The range is: 98
# Using the ptp (peak to peak) function of numpy
print(f"The range is: {np.ptp(x)}")
The range is: 98
In descriptive statistics, the interquartile range (IQR), also called the midspread or H-spread, is a measure of dispersion equal to the difference between the 75th and 25th percentiles (i.e. the 0.75 and 0.25 quantiles).
The interquartile range is often used to find outliers in data. Outliers here are defined as observations that fall below Q1 − 1.5 IQR or above Q3 + 1.5 IQR, where Q1 and Q3 are the 25th and 75th percentiles. In a boxplot, the highest and lowest values within these limits are indicated by the whiskers of the box (frequently with an additional bar at the end of each whisker), and any outliers are drawn as individual points.
def iqr(x):
    '''
    Returns IQR of array-like object of numeric values
    Parameters:
    x : array-like object of numeric values
    '''
    return np.quantile(x, 0.75) - np.quantile(x, 0.25)
# Using the same dataset x
print(f"The IQR is: {iqr(x)}")
The IQR is: 51.0
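The outlier rule described above (values below Q1 − 1.5 IQR or above Q3 + 1.5 IQR) can be sketched as follows. The helper name `iqr_outliers`, its parameter `k`, and the list `sample` are introduced here for illustration only.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    '''
    Returns the values lying outside [Q1 - k * IQR, Q3 + k * IQR].
    Hypothetical helper for illustration; k = 1.5 is the usual choice.
    '''
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(arr, [0.25, 0.75])
    spread = q3 - q1
    lower = q1 - k * spread
    upper = q3 + k * spread
    # Boolean mask selects everything outside the two limits
    return arr[(arr < lower) | (arr > upper)]

# Illustrative values with one obviously extreme observation
sample = [10, 12, 13, 14, 15, 16, 18, 95]
outliers = iqr_outliers(sample)
print(outliers)
```

Here Q1 = 12.75 and Q3 = 16.5, so the limits are 7.125 and 22.125, and only the value 95 falls outside them.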