Regression analysis is a set of statistical methods for estimating the relationship between a dependent variable (response) and one or more independent variables (predictors), typically by fitting a mathematical model that minimises prediction error. Simple linear regression fits a straight line through bivariate data using the method of least squares; multiple regression extends this to several predictors. It is used extensively in economics, biology, engineering, and machine learning for prediction, forecasting, and causal inference.
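y = β₀ + β₁x + ε (model for an observed value); ŷ = β₀ + β₁x (fitted line, no error term)
LaTeX: y = \beta_0 + \beta_1 x + \varepsilon, \qquad \hat{y} = \beta_0 + \beta_1 x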
| Symbol | Meaning | Unit |
|---|---|---|
| \hat{y} | Predicted value of the dependent variable | same as y |
| \beta_0 | Intercept (value of ŷ when x = 0) | same as y |
| \beta_1 | Slope (change in ŷ per unit increase in x) | y-units per x-unit |
| x | Independent (predictor) variable | units of x |
| \varepsilon | Error term (observed y minus the value given by the line) | same as y |
Problem
Data: x (hours studied) = [2, 4, 6, 8], y (exam score) = [50, 65, 75, 85]. Fit a simple linear regression and predict score for 7 hours.
Solution
Step 1: n = 4, Σx = 20, Σy = 275, Σx² = 120, Σxy = 2×50 + 4×65 + 6×75 + 8×85 = 1490.
Step 2: β₁ = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) = (4×1490 − 20×275) / (4×120 − 400) = (5960 − 5500) / (480 − 400) = 460/80 = 5.75.
Step 3: β₀ = (Σy − β₁Σx)/n = (275 − 5.75×20)/4 = (275 − 115)/4 = 40.00.
Step 4: ŷ(7) = 40 + 5.75×7 = 40 + 40.25 = 80.25.
Answer
Regression line: ŷ = 40 + 5.75x; predicted score for 7 hours = 80.25
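The hand calculation can be checked by recomputing the sums in code. The following is a minimal sketch using only the Python standard library; the function name `fit_simple_linear` and the variable names are illustrative assumptions, not part of any library.

```python
def fit_simple_linear(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    # The four sums used in the closed-form least-squares formulas.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


# Data from the problem statement: hours studied and exam scores.
hours = [2, 4, 6, 8]
scores = [50, 65, 75, 85]

b0, b1 = fit_simple_linear(hours, scores)
prediction = b0 + b1 * 7  # predicted score for 7 hours of study
print(f"y_hat = {b0:.2f} + {b1:.2f} x; prediction at x = 7: {prediction:.2f}")
```

Computing Σxy inside the function rather than typing it in by hand removes the step where arithmetic slips are most likely.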
| Type | Response Variable | Predictors | Typical Application |
|---|---|---|---|
| Simple linear | Continuous | 1 continuous | Height vs weight |
| Multiple linear | Continuous | ≥2 mixed | House price prediction |
| Logistic | Binary (0/1) | Mixed | Disease classification |
| Polynomial | Continuous | 1 with powers | Curvilinear growth |
| Ridge / Lasso | Continuous | Many (regularised) | High-dimensional data |
| Poisson | Count data | Mixed | Event rate modelling |
The Pearson correlation coefficient (r) is a dimensionless statistic that measures the strength and direction of the linear relationship between two continuous variables, ranging from −1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear association. It is calculated as the covariance of the two variables divided by the product of their standard deviations. While correlation quantifies association, it does not imply causation — a fundamental principle in statistical reasoning.
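As an illustration of the definition (covariance divided by the product of the standard deviations), here is a minimal standard-library sketch. The function name `pearson_r` is an assumption made for this example. The common 1/(n − 1) factors in the covariance and the standard deviations cancel, so the code works with raw sums of deviations.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of products of deviations; the 1/(n-1) factors cancel in the ratio.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)

    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hours studied and exam scores from the worked regression problem.
r = pearson_r([2, 4, 6, 8], [50, 65, 75, 85])
print(f"r = {r:.4f}")
```

A value close to +1 here reflects the nearly straight-line increase of score with hours; it says nothing about whether studying causes the higher scores.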
The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean of independent, identically distributed observations approaches a normal distribution as the sample size n increases, regardless of the shape of the underlying population distribution, provided the population has a finite mean and variance. A common rule of thumb treats n ≥ 30 as large enough for the normal approximation, although strongly skewed or heavy-tailed populations can require considerably larger samples. The CLT is the theoretical justification for the large-sample use of Z-tests, t-tests, confidence intervals, and much of classical inferential statistics.
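A small simulation can make the theorem concrete. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (an arbitrary choice of a skewed distribution) and compares the spread of simulated sample means with the theoretical value σ/√n. The function name `sample_means`, the seed, and the trial count are illustrative assumptions.

```python
import math
import random
import statistics


def sample_means(n, trials, rng):
    """Draw `trials` samples of size n from Exp(1) and return their means."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(trials)
    ]


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers

for n in (2, 30, 200):
    means = sample_means(n, 5000, rng)
    observed_sd = statistics.stdev(means)
    expected_sd = 1.0 / math.sqrt(n)  # sigma / sqrt(n) with sigma = 1
    print(
        f"n={n:>3}: mean of means={statistics.fmean(means):.3f}, "
        f"sd of means={observed_sd:.3f}, sigma/sqrt(n)={expected_sd:.3f}"
    )
```

The simulation checks only the centre and spread of the sample means. Plotting a histogram of `means` for each n would show the shape moving from skewed toward a bell curve.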
Hypothesis testing is a formal statistical procedure for making decisions about a population parameter based on sample data, by evaluating evidence against a null hypothesis (H₀) in favour of an alternative hypothesis (H₁). A test statistic is computed and compared to a critical value or converted to a p-value; if the result is statistically significant (p < α), the null hypothesis is rejected. It underpins scientific research, clinical trials, quality assurance, and data-driven decision-making across all quantitative disciplines.
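The decision rule can be sketched for one simple case: a two-sided one-sample Z-test with known population standard deviation. The function name `one_sample_z_test` and the numbers in the usage example (sample mean 52, hypothesised mean 50, σ = 6, n = 36) are hypothetical values chosen for illustration, not data from any study.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided Z-test of H0: mu = mu0, assuming sigma is known.

    Returns (z, p_value, reject_h0).
    """
    if n < 1 or sigma <= 0:
        raise ValueError("n must be at least 1 and sigma must be positive")

    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error

    # Two-sided p-value: probability of a |Z| at least this large under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Hypothetical example values (assumptions for illustration only).
z, p_value, reject = one_sample_z_test(sample_mean=52, mu0=50, sigma=6, n=36)
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0 at alpha = 0.05: {reject}")
```

Comparing the p-value with α is equivalent to comparing |z| with the critical value for the same α; the sketch uses the p-value form because it reports the strength of evidence as well as the decision.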
Francis Galton coined the term "regression" in 1886 from the Latin "regressio" (a return), observing that extreme parental heights tended to "regress" toward the mean in offspring. Karl Pearson and Udny Yule later formalised linear regression as a statistical method.