Cross-validation is a statistical technique for estimating how well a machine learning model generalizes to an independent dataset. The most common form, k-fold cross-validation, partitions the training data into k equally sized subsets (folds); the model is trained on k−1 folds and evaluated on the remaining fold, and this process is repeated k times so that each fold serves as the evaluation set exactly once. The k scores are then averaged. Because every sample is used for evaluation once, cross-validation provides a more reliable performance estimate than a single train-test split, and it is commonly used to select hyperparameters and compare models.
Cross-Validation Score = (1/k) * sum of Score for each fold i from 1 to k
LaTeX: CV_k = \frac{1}{k} \sum_{i=1}^{k} \text{Score}_i
| Symbol | Meaning | Unit |
|---|---|---|
| CV_k | Mean cross-validation score across all folds | depends on metric (e.g., accuracy %) |
| k | Number of folds | count |
| \text{Score}_i | Model performance score on fold i (e.g., accuracy) | depends on metric |
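The formula above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the helper names (`k_fold_indices`, `cross_val_scores`, `fit_majority`, `accuracy`) and the toy data are assumptions made for the example, the folds are contiguous with no shuffling, and the "model" is just the majority label of the training folds.

```python
from statistics import mean

def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more sample
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_scores(xs, ys, k, fit, score):
    """Return one score per fold; `fit` and `score` are caller-supplied callables."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return scores

# Hypothetical toy model: always predict the majority label seen in training.
def fit_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)

def accuracy(model, test_x, test_y):
    return sum(1 for y in test_y if y == model) / len(test_y)

xs = list(range(10))
ys = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
fold_scores = cross_val_scores(xs, ys, k=5, fit=fit_majority, score=accuracy)
cv_k = mean(fold_scores)  # CV_k = (1/k) * sum of Score_i
print(fold_scores, cv_k)
```

In practice the data is usually shuffled before splitting; the sketch omits that step to keep the fold membership easy to follow.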
Problem
A classifier is evaluated using 5-fold cross-validation. The accuracy scores on each fold are: 82%, 85%, 79%, 88%, and 86%. Calculate the mean CV accuracy and standard deviation.
Solution
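Step 1: Sum the scores: 82 + 85 + 79 + 88 + 86 = 420
Step 2: Mean CV accuracy: 420 / 5 = 84%
Step 3: Compute the variance. Squared deviations from the mean: (−2)², 1², (−5)², 4², 2² = 4, 1, 25, 16, 4. Population variance (dividing by k = 5) = (4 + 1 + 25 + 16 + 4) / 5 = 50 / 5 = 10
Step 4: Standard deviation: σ = √10 ≈ 3.16 percentage points. Dividing by k − 1 = 4 instead gives the sample standard deviation, √12.5 ≈ 3.54.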
Answer
Mean CV accuracy = 84% ± 3.16 percentage points (population standard deviation)
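The arithmetic in the worked example can be checked with Python's `statistics` module. The solution divides the squared deviations by k = 5, which corresponds to the population form (`pstdev`); the sample form (`stdev`, dividing by k − 1) is shown alongside for comparison. The variable names are chosen for this example only.

```python
from statistics import mean, pstdev, stdev

scores = [82, 85, 79, 88, 86]  # per-fold accuracy, in percent

cv_mean = mean(scores)      # 420 / 5
pop_sd = pstdev(scores)     # sqrt(50 / 5), population standard deviation
sample_sd = stdev(scores)   # sqrt(50 / 4), sample standard deviation

print(f"mean = {cv_mean}%")               # mean = 84%
print(f"population sd = {pop_sd:.2f}")    # population sd = 3.16
print(f"sample sd = {sample_sd:.2f}")     # sample sd = 3.54
```

With only five folds the two forms differ noticeably, so it is worth stating which one a reported ± value refers to.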
| Strategy | k Value | Computational Cost | Best Used When |
|---|---|---|---|
| Hold-out (train/test split) | N/A | Very low | Very large datasets |
| k-Fold (k=5) | 5 | Moderate | Standard practice |
| k-Fold (k=10) | 10 | Moderate–high | Recommended default |
| Leave-One-Out (LOOCV) | n (all samples) | Very high | Small datasets |
| Stratified k-Fold | 5–10 | Moderate | Imbalanced class distribution |
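To illustrate the stratified row of the table, the sketch below assigns samples to folds so that each fold keeps roughly the same class proportions as the full dataset. It is a simplified, assumption-laden version: the function name `stratified_fold_ids` is hypothetical, samples are dealt round-robin within each class without shuffling, and labels are assumed to be sortable.

```python
from collections import Counter, defaultdict

def stratified_fold_ids(labels, k):
    """Assign each sample a fold id in 0..k-1, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_of = [None] * len(labels)
    next_fold = 0
    for label in sorted(by_class):
        for idx in by_class[label]:
            fold_of[idx] = next_fold
            next_fold = (next_fold + 1) % k  # continue where the previous class stopped
    return fold_of

# Imbalanced toy labels: 15 samples of class 0 and 5 of class 1 (a 3:1 ratio).
labels = [0] * 15 + [1] * 5
fold_of = stratified_fold_ids(labels, k=5)

per_fold = [
    Counter(labels[i] for i in range(len(labels)) if fold_of[i] == f)
    for f in range(5)
]
for f, counts in enumerate(per_fold):
    print(f, dict(counts))
```

Here every fold ends up with three class-0 samples and one class-1 sample, mirroring the overall 3:1 ratio. A plain contiguous split of the same data would leave the first three folds with no class-1 samples at all.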
Overfitting occurs when a machine learning model fits the training data too closely, including its noise and random fluctuations, and as a result performs poorly on new, unseen data. An overfitted model shows high training accuracy but noticeably lower validation or test accuracy, which indicates that it has memorized patterns specific to the training set rather than learning ones that generalize. Overfitting is more likely with complex models, small datasets, or insufficient regularization.
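The gap between training and test accuracy can be seen in a small experiment. The sketch below uses a 1-nearest-neighbour classifier, chosen here only because it memorizes its training set by construction; the data generator, the 20% label-noise rate, and all function names are assumptions made for the illustration.

```python
import random

random.seed(0)

def make_data(n, noise=0.2):
    """Points x in [0, 1) labelled 1 if x > 0.5, with a fraction of labels flipped."""
    data = []
    for _ in range(n):
        x = random.random()
        label = 1 if x > 0.5 else 0
        if random.random() < noise:
            label = 1 - label  # inject label noise
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def accuracy(train, data):
    return sum(predict_1nn(train, x) == y for x, y in data) / len(data)

train = make_data(50)
test = make_data(200)

train_acc = accuracy(train, train)  # each training point is its own nearest neighbour
test_acc = accuracy(train, test)
print(f"train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
```

Training accuracy is perfect because the model reproduces every noisy label it has seen, while accuracy on fresh data is typically much lower. That difference is the signature of overfitting described above.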
Supervised learning is a machine learning approach where a model is trained on a labeled dataset, meaning each training example is paired with the correct output (label). The model learns a mapping from inputs to outputs by minimizing the difference between its predictions and the true labels. It is the most widely used ML paradigm and underpins applications such as image recognition, speech transcription, and credit scoring.
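As a concrete instance of "minimizing the difference between predictions and true labels", the sketch below fits a straight line to labeled pairs by ordinary least squares, using the closed-form solution for one input variable. The function name `fit_line` and the toy data are assumptions for this example; real applications would typically use a library and a more expressive model.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Labeled training examples: each input x is paired with its correct output y.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # apply the learned mapping to an unseen input
print(slope, intercept, prediction)
```

The toy labels follow y = 2x + 1 exactly, so the fit recovers a slope of 2 and an intercept of 1 and predicts 11 for x = 5.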
Machine learning is a branch of artificial intelligence in which systems learn from data to improve their performance on tasks without being explicitly programmed for each task. It works by identifying statistical patterns in training data and using those patterns to make predictions or decisions on new, unseen data. Machine learning powers applications ranging from spam filters and recommendation engines to medical diagnosis and autonomous vehicles.
"Cross-validation" combines "cross" (from Old English "cros," meaning to pass over or intersect) and "validation" (from Latin "validus," strong or effective). The technique was formalized in statistics in the 1970s, notably by Mervyn Stone, Seymour Geisser, and David M. Allen, and was later adopted widely in machine learning.