The Pearson correlation coefficient (r) is a dimensionless statistic that measures the strength and direction of the linear relationship between two continuous variables, ranging from −1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear association. It is calculated as the covariance of the two variables divided by the product of their standard deviations. While correlation quantifies association, it does not imply causation — a fundamental principle in statistical reasoning.
r = Σ[(xᵢ−x̄)(yᵢ−ȳ)] / √[Σ(xᵢ−x̄)² × Σ(yᵢ−ȳ)²]
LaTeX: r = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \, \sum_{i=1}^{n}(y_i-\bar{y})^2}}
| Symbol | Meaning | Unit |
|---|---|---|
| r | Pearson correlation coefficient | dimensionless |
| x_i, y_i | Individual data point values | same as data |
| \bar{x}, \bar{y} | Sample means of x and y | same as data |
| n | Number of data pairs | count |
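The definition describes r as the covariance divided by the product of the standard deviations, while the formula is written with raw sums. The short derivation below sketches why the two agree. The symbols s_{xy}, s_x and s_y (sample covariance and sample standard deviations, using the n − 1 divisor) are notation introduced here for illustration and do not appear in the original text.

```latex
\[
\begin{aligned}
s_{xy} &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}), \qquad
s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad
s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2} \\[6pt]
r &= \frac{s_{xy}}{s_x\, s_y}
   = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
          {\frac{1}{n-1}\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \, \sum_{i=1}^{n}(y_i-\bar{y})^2}}
   = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
          {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \, \sum_{i=1}^{n}(y_i-\bar{y})^2}}
\end{aligned}
\]
```

The 1/(n − 1) factors cancel, so the same value of r results whether the covariance and standard deviations use the n − 1 or the n divisor, as long as the choice is consistent.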
Problem
Two variables: x = [1, 2, 3, 4, 5], y = [2, 4, 5, 4, 5]. Calculate the Pearson correlation coefficient.
Solution
Step 1: x̄ = 3, ȳ = 4.
Step 2: (xᵢ−x̄): −2, −1, 0, 1, 2; (yᵢ−ȳ): −2, 0, 1, 0, 1.
Step 3: Products: 4, 0, 0, 0, 2 → Σ = 6.
Step 4: Σ(xᵢ−x̄)² = 4+1+0+1+4 = 10; Σ(yᵢ−ȳ)² = 4+0+1+0+1 = 6.
Step 5: r = 6 / √(10 × 6) = 6 / √60 = 6/7.746 ≈ 0.775.
Answer
r ≈ 0.775, a strong positive linear correlation by the interpretation bands below (|r| ≥ 0.70)
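The worked solution can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `pearson_r` is chosen here for illustration, and the code follows the five steps of the solution rather than any particular library's implementation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n                      # Step 1: means
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]           # Step 2: deviations from the mean
    dy = [yi - mean_y for yi in y]
    sum_xy = sum(a * b for a, b in zip(dx, dy))  # Step 3: sum of products
    sum_xx = sum(a * a for a in dx)              # Step 4: sums of squares
    sum_yy = sum(b * b for b in dy)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sum_xy / math.sqrt(sum_xx * sum_yy)   # Step 5: ratio


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 3))  # 0.775
```

The explicit check for a zero sum of squares reflects that r has no value when one variable never varies, a case the formula leaves as a division by zero.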
| r Value Range | Strength | Direction | Example |
|---|---|---|---|
| −1.00 | Perfect | Negative | Exact inverse relationship |
| −0.70 to −0.99 | Strong | Negative | Study time vs errors |
| −0.30 to −0.69 | Moderate | Negative | Stress vs sleep |
| −0.29 to 0.29 | Weak/None | — | Shoe size vs IQ |
| 0.30 to 0.69 | Moderate | Positive | Height vs weight |
| 0.70 to 0.99 | Strong | Positive | Temperature vs ice cream sales |
| 1.00 | Perfect | Positive | Exact direct relationship |
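The bands in the table above can be expressed as a small lookup. The sketch below is illustrative only: the function name `describe_r` is hypothetical, the cut-offs 0.30 and 0.70 are the conventions used in the table rather than universal rules, and it assumes |r| = 1 is labelled "perfect" in both directions.

```python
def describe_r(r):
    """Map a correlation coefficient to (strength, direction) using the table's bands."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    magnitude = abs(r)
    if magnitude == 1.0:
        strength = "perfect"
    elif magnitude >= 0.70:
        strength = "strong"
    elif magnitude >= 0.30:
        strength = "moderate"
    else:
        # The table lists no direction for the weak/none band.
        return ("weak/none", None)
    direction = "positive" if r > 0 else "negative"
    return (strength, direction)


print(describe_r(0.775))  # ('strong', 'positive')
```

Values computed in floating point may land at 0.9999999 rather than exactly 1.0, so a real application would probably compare against a small tolerance instead of testing for equality.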
Regression analysis is a set of statistical methods for estimating the relationship between a dependent variable (response) and one or more independent variables (predictors), typically by fitting a mathematical model that minimises prediction error. Simple linear regression fits a straight line through bivariate data using the method of least squares; multiple regression extends this to several predictors. It is used extensively in economics, biology, engineering, and machine learning for prediction and forecasting and, when the study design supports it, for causal inference.
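As a rough illustration of simple linear regression by least squares, the sketch below fits a line to the same data used in the Pearson worked example. The function name `fit_line` is hypothetical; the slope and intercept use the standard closed-form least-squares expressions, slope = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)² and intercept = ȳ − slope × x̄.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    if sum_xx == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"y = {slope:.2f}x + {intercept:.2f}")  # y = 0.60x + 2.20
```

The numerator of the slope is the same sum of products (6) that appears in the correlation calculation, which is why the slope and r always share the same sign.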
A Z-score (also called a standard score) measures how many standard deviations a data point is from the mean of its distribution. It standardises values from different distributions, enabling direct comparison by placing them on a common scale. Z-scores are widely used in quality control, hypothesis testing, and the construction of standard normal tables.
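A Z-score is computed as z = (x − μ) / σ, where μ is the mean and σ the standard deviation of the distribution. The sketch below shows the comparison across scales that the paragraph describes; the function name `z_score` and the two sets of exam figures are made up for illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that `value` lies from `mean`."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / std_dev


# Hypothetical scores on two exams with different scales.
exam_a = z_score(85, mean=70, std_dev=10)     # 1.5
exam_b = z_score(620, mean=500, std_dev=100)  # 1.2
print(exam_a, exam_b)
```

Although 620 is the larger raw number, the first score sits further above its own mean in standard-deviation terms, which is the kind of comparison standardisation makes possible.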
The Central Limit Theorem (CLT) states that the sampling distribution of the mean of independent, identically distributed observations approaches a normal distribution as the sample size n increases, regardless of the shape of the underlying population distribution, provided the population has a finite mean and variance. A common rule of thumb treats n ≥ 30 as large enough for the normal approximation, although strongly skewed or heavy-tailed populations can require larger samples. The CLT is the theoretical foundation for Z-tests, t-tests, confidence intervals, and much of classical inferential statistics.
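A small simulation can make the theorem concrete. The sketch below draws repeated samples from an Exponential(1) population, which is strongly right-skewed with mean 1 and standard deviation 1, and summarises the resulting sample means. The choice of population, the sample sizes, the number of trials, and the helper names `sample_means` and `skewness` are all assumptions made for illustration.

```python
import random
import statistics


def sample_means(n, trials, rng):
    """Return `trials` sample means, each computed from n Exponential(1) draws."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(trials)
    ]


def skewness(values):
    """Third standardised moment; 0 for a symmetric distribution."""
    centre = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return statistics.fmean([((v - centre) / spread) ** 3 for v in values])


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
for n in (2, 30, 200):
    means = sample_means(n, 5000, rng)
    print(
        f"n={n:>3}  mean={statistics.fmean(means):.3f}  "
        f"sd={statistics.stdev(means):.3f}  (theory {1 / n ** 0.5:.3f})  "
        f"skew={skewness(means):.3f}"
    )
```

As n grows, the mean of the sample means should stay near 1, their standard deviation should track 1/√n, and the skewness should shrink toward 0, which is the drift toward normality the theorem describes.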
The term "correlation" comes from the Latin "correlatio" (mutual relation), popularised by Francis Galton in 1888. Karl Pearson formalised the product-moment correlation coefficient formula in 1895, hence the eponym "Pearson's r".