Cross-Validation

Short Definition

Cross-validation evaluates model performance by training and testing on multiple data splits.

Definition

Cross-validation is a resampling technique used to assess how well a model generalizes to unseen data. The dataset is divided into multiple subsets (folds), and the model is trained and evaluated repeatedly so that each subset is used for validation at least once.

Cross-validation provides a more robust estimate of performance than a single train/test split.

Why It Matters

Performance estimates from a single split can be noisy or misleading. Cross-validation:

  • reduces dependence on a particular data split
  • provides more stable performance estimates
  • helps compare models and hyperparameters fairly

It is especially useful when datasets are small.

How It Works (Conceptually)

  • Split the dataset into k folds of roughly equal size
  • Train the model on k−1 folds
  • Validate on the remaining held-out fold
  • Repeat k times, so each fold serves as the validation set once
  • Aggregate (typically average) the scores across runs

The aggregated score estimates the model's expected generalization performance on unseen data.

Minimal Python Example

scores = []
for fold in folds:
    model = train(data_except(fold))      # fit on the other k−1 folds
    scores.append(evaluate(model, fold))  # score on the held-out fold
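The pseudocode above can be expanded into a self-contained, runnable sketch. The `k_fold_indices` helper and the mean-predictor "model" are illustrative assumptions, not a standard API; in practice any estimator and metric can be substituted.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, val_indices) for k roughly equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]  # toy dataset

scores = []
for train_idx, val_idx in k_fold_indices(len(data), k=3):
    # "Train": fit a mean predictor on the k-1 training folds
    mean = sum(data[i] for i in train_idx) / len(train_idx)
    # "Validate": mean squared error on the held-out fold
    mse = sum((data[i] - mean) ** 2 for i in val_idx) / len(val_idx)
    scores.append(mse)

avg_mse = sum(scores) / len(scores)  # aggregate across the k runs
```

With scikit-learn available, this loop is typically replaced by `sklearn.model_selection.cross_val_score`, which handles splitting, fitting, and scoring in one call.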


Common Pitfalls

  • Applying shuffled cross-validation to time-series data, where folds must respect temporal order
  • Data leakage across folds, e.g. preprocessing fitted on the full dataset before splitting
  • Treating cross-validation scores as exact rather than as noisy estimates
  • High computational cost, since the model is trained k times
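The leakage pitfall is especially subtle: any statistic computed on the full dataset before splitting (a normalization mean, selected features, and so on) carries information from the validation fold into training. A hypothetical sketch of the mechanism:

```python
# Hypothetical illustration of leakage through preprocessing.
data = [1.0, 2.0, 3.0, 100.0]  # last point plays the validation fold
train, val = data[:3], data[3:]

# Leaky: centering statistic computed on ALL data, validation fold included
leaky_mean = sum(data) / len(data)   # 26.5 -- pulled up by the val point

# Correct: statistic computed on the training folds only
safe_mean = sum(train) / len(train)  # 2.0

leaky_val = val[0] - leaky_mean  # 73.5: the val point looks less extreme
safe_val = val[0] - safe_mean    # 98.0: its true novelty is preserved
```

In scikit-learn, chaining preprocessing and model in a `Pipeline` and cross-validating the pipeline as a whole avoids this, because the preprocessing is refit on each training fold.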

Related Concepts

  • Generalization
  • Evaluation Metrics
  • Train/Test Split
  • Hyperparameter Optimization
  • Data Leakage