Covariance is a measure of how much two random variables vary together. It's similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.
A positive covariance means that the two variables are positively related: they tend to move in the same direction.
A negative covariance means that the two variables are inversely related, or that they move in opposite directions.
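As a quick illustration of the sign, the sketch below builds one series that rises with its input and one that falls, then checks the sign of each covariance with NumPy's `np.cov`. The names `t`, `rising`, and `falling` are made up for this example and are not used elsewhere in the notebook.

```python
import numpy as np

# Hypothetical toy data: two exact linear functions of t
t = np.arange(10)
rising = 2 * t + 1     # increases as t increases
falling = -2 * t + 1   # decreases as t increases

# np.cov returns a 2x2 matrix; the off-diagonal entry is the covariance of the pair
cov_rising = np.cov(t, rising)[0, 1]
cov_falling = np.cov(t, falling)[0, 1]

print(f"Covariance with the rising series:  {cov_rising}")
print(f"Covariance with the falling series: {cov_falling}")
```

The first value comes out positive and the second negative, with the same magnitude, because the two series differ only in the sign of their slope.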
The formula for the sample covariance of $N$ paired observations is as follows,
$$ cov_{x, y} = \frac{ \sum (x_i - \bar{x})(y_i - \bar{y})}{N - 1} $$
where $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$.
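import numpy as np

def covariance(x, y):
    '''
    Returns the sample covariance between two array-like objects of numeric values.
    Parameters:
    x: array-like object of numeric values
    y: array-like object of numeric values
    '''
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    # Accumulate the products of the paired deviations from the means
    total = 0  # avoids shadowing the built-in sum()
    for i in range(len(x)):
        total += (x[i] - x_mean) * (y[i] - y_mean)
    return total / (len(x) - 1)  # divide by N - 1 (sample covariance)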
# Generating x and y:
x = np.arange(10)
y = x + np.random.rand(10) # Adding some noise using rand()
print(f"x: {x}")
print(f"y: {y}")
x: [0 1 2 3 4 5 6 7 8 9]
y: [0.75748157 1.33266435 2.5820461 3.13766692 4.6475828 5.05373785 6.43012799 7.23605775 8.26982824 9.96623949]
# Let's plot the data
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.scatterplot(x=x, y=y)  # `palette` is dropped: it only takes effect when `hue` is set
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Since y increases almost linearly with x, the covariance between x and y should be positive.
print(f"Covariance: {covariance(x, y)}")
Covariance: 9.166253608051877
# Using cov function of numpy to calculate the covariance
print(f"Covariance: {np.cov(x, y)[0][1]}")
Covariance: 9.166253608051877
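np.cov(x, y) always returns the 2x2 covariance matrix rather than a single number. The diagonal entries are the variances of x and y, and the off-diagonal entries [0][1] and [1][0] both hold the covariance between x and y. By default it divides by N - 1, which is why it matches the function above.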
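The layout of that matrix can be checked directly. The sketch below is only an illustration: the names `a`, `b`, and `cov_matrix` and the fixed seed are choices made for this example, and the data are generated the same way as x and y above.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the example is repeatable
a = np.arange(10)
b = a + rng.random(10)           # a plus uniform noise in [0, 1)

cov_matrix = np.cov(a, b)
print(cov_matrix.shape)

# Diagonal entries are the sample variances (ddof=1 means divide by N - 1)
print(np.isclose(cov_matrix[0, 0], np.var(a, ddof=1)))
print(np.isclose(cov_matrix[1, 1], np.var(b, ddof=1)))

# The matrix is symmetric, so either off-diagonal entry gives cov(a, b)
print(np.isclose(cov_matrix[0, 1], cov_matrix[1, 0]))
```

Note that `np.var` divides by N by default, so `ddof=1` is needed to match the diagonal of `np.cov`.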
Correlation is also a measure of how much two random variables change together. However, unlike covariance, it is normalized, so its value does not depend on the units of the variables.
The Pearson product-moment correlation coefficient, also known as $r$, $\rho$, or Pearson's $r$, measures the strength and direction of the linear relationship between two variables. It is defined as the covariance of the variables divided by the product of their standard deviations.
The correlation is unitless and always lies between -1 (perfect anti-correlation) and 1 (perfect correlation).
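To see what "unitless" means in practice, the sketch below rescales one variable and compares covariance and correlation before and after. The data are synthetic and the variable names (`height_m`, `height_cm`, `weight_kg`) are invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice

# Synthetic data: heights in metres and a noisy linear "weight"
height_m = rng.normal(1.7, 0.1, size=50)
weight_kg = 60 + 40 * (height_m - 1.7) + rng.normal(0, 3, size=50)

# The same heights expressed in centimetres
height_cm = height_m * 100

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
r_m = np.corrcoef(height_m, weight_kg)[0, 1]
r_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(f"Covariance (m):   {cov_m}")
print(f"Covariance (cm):  {cov_cm}")
print(f"Correlation (m):  {r_m}")
print(f"Correlation (cm): {r_cm}")
```

Changing the unit multiplies the covariance by 100 but leaves the correlation unchanged, which is why correlation is easier to compare across datasets.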
The formula of Pearson's r is as follows,
$$ \rho \ (or \ r) = \frac{cov_{x, y}}{\sigma_x \sigma_y} = \frac{ \sum (x_i - \bar{x})(y_i - \bar{y})}{ \sqrt{\sum (x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}} $$
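The right-hand side of the formula can be evaluated directly and compared with NumPy's result. This is a minimal sketch; `pearson_r`, `u`, and `v` are names chosen for this example and are not part of NumPy.

```python
import numpy as np

def pearson_r(u, v):
    """Pearson's r computed from the sums of deviations (illustrative helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    du = u - u.mean()   # deviations of u from its mean
    dv = v - v.mean()   # deviations of v from its mean
    # Numerator: sum of products of deviations
    # Denominator: square root of the product of the summed squared deviations
    return np.sum(du * dv) / np.sqrt(np.sum(du ** 2) * np.sum(dv ** 2))

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable
u = np.arange(10)
v = u + rng.random(10)

print(f"Manual r:      {pearson_r(u, v)}")
print(f"np.corrcoef r: {np.corrcoef(u, v)[0, 1]}")
```

The two values agree up to floating-point error. The $N - 1$ factors in the covariance and in the standard deviations cancel, which is why they do not appear in the second form of the formula.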
def correlation(x, y):
    '''
    Returns correlation between two array-like objects of numeric values.
    Parameters:
    x: array-like object of numeric values
    y: array-like object of numeric values
    '''
    # np.corrcoef() always returns the correlation matrix; [0][1] is r for (x, y)
    return np.corrcoef(x, y)[0][1]
# Using the same data x and y as above
print(f"Correlation: {correlation(x, y)}")
Correlation: 0.995402722824334