Data Leakage

Short Definition

Data leakage occurs when information that would not be available at prediction time improperly influences model training or evaluation.

Definition

Data leakage refers to any situation in which a machine learning model has access—directly or indirectly—to information that would not be available at prediction time. This typically happens when training, validation, or test data are not properly isolated, leading to overly optimistic performance estimates.

Data leakage compromises the validity of evaluation results.

Why It Matters

Models affected by data leakage can appear highly accurate during development but fail in real-world deployment. Because leakage inflates performance metrics, it often goes unnoticed until systems break in production.

Data leakage is one of the most common and dangerous sources of false confidence in machine learning.

Common Forms of Data Leakage

  • Train–test leakage: test data influences training
  • Validation leakage: validation data repeatedly used for tuning
  • Feature leakage: features encode future or target information
  • Temporal leakage: training data includes information from the future
  • Preprocessing leakage: statistics computed on full datasets before splitting

Leakage can occur at any stage of the pipeline.

How Data Leakage Happens

Data leakage often arises from:

  • improper data splitting
  • careless feature engineering
  • global normalization or scaling
  • reuse of evaluation data
  • automated pipelines without safeguards

Leakage is usually accidental, not intentional.

How It Affects Models

  • Inflated evaluation metrics
  • Misleading comparisons between models
  • Poor generalization to real-world data
  • Failed deployments despite “excellent” test results

Leakage causes models to learn shortcuts that do not exist in practice.
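The inflation effect is easy to reproduce. In this sketch, a hypothetical "account_closed" feature is recorded after the outcome it is supposed to predict, so a trivial model that reads it alone scores perfectly; the feature name and data are illustrative:

```python
import random

random.seed(0)

# Hypothetical labels: whether each customer churned.
labels = [random.randint(0, 1) for _ in range(1000)]

# Leaky feature: recorded *after* churn, so it mirrors the label exactly.
account_closed = list(labels)

# A trivial "model" that just reads the leaky feature looks perfect.
predictions = account_closed
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 1.0, a red flag rather than a good model
```

The perfect score says nothing about the model; it only reveals that the feature encodes the target.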

Minimal Conceptual Example

# leakage example: normalization statistic computed on the full dataset
train_data, test_data = [2.0, 4.0, 6.0], [8.0, 10.0]
full_dataset = train_data + test_data
mean = sum(full_dataset) / len(full_dataset)  # 6.0, includes test data
x_train = [x - mean for x in train_data]      # training sees test statistics

This allows test information to influence training indirectly.
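A leakage-free version computes the statistic on the training set only and reuses it when transforming the test set (the data values are illustrative):

```python
# Fit the normalization statistic on training data only.
train_data, test_data = [2.0, 4.0, 6.0], [8.0, 10.0]

train_mean = sum(train_data) / len(train_data)  # 4.0, no test influence

x_train = [x - train_mean for x in train_data]
x_test = [x - train_mean for x in test_data]  # reuse the training statistic
print(x_train)  # [-2.0, 0.0, 2.0]
```

Applying the training-set statistic to the test set mimics deployment, where future inputs cannot contribute to the statistics used to transform them.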

Detecting Data Leakage

Signs of possible leakage include:

  • unusually high validation or test performance
  • minimal performance gap between training and test sets
  • sudden drops in performance after deployment
  • features that seem too predictive

Detection often requires careful pipeline review.
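One lightweight review step is to check each feature's standalone correlation with the target; a near-perfect value often marks a leaked feature. A minimal sketch, where the threshold, feature names, and data are all illustrative:

```python
def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

target = [0, 1, 0, 1, 1, 0]
features = {
    "age": [25, 40, 31, 52, 47, 29],
    "account_closed": [0, 1, 0, 1, 1, 0],  # identical to the target
}

# Flag features that are suspiciously predictive on their own.
suspicious = [name for name, vals in features.items()
              if abs(correlation(vals, target)) > 0.95]
print(suspicious)  # ['account_closed']
```

A flagged feature is not proof of leakage, but it is a prompt to trace where the feature's values come from and when they become available.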

Preventing Data Leakage

Common safeguards include:

  • strict separation of data splits
  • fitting preprocessing steps only on training data
  • time-aware splitting for temporal data
  • limited access to test results
  • reproducible, auditable pipelines

Prevention is easier than detection.
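For temporal data, the time-aware split listed above can be sketched as choosing a cutoff date so that no training record postdates any evaluation record (the dates and records are illustrative):

```python
from datetime import date

# Hypothetical time-stamped records: (timestamp, value).
records = [
    (date(2024, 1, 1), 10),
    (date(2024, 2, 1), 12),
    (date(2024, 3, 1), 11),
    (date(2024, 4, 1), 15),
    (date(2024, 5, 1), 14),
]

cutoff = date(2024, 4, 1)

# Everything before the cutoff trains the model; everything at or after
# the cutoff evaluates it. A random split would leak future into past.
train = [r for r in records if r[0] < cutoff]
test = [r for r in records if r[0] >= cutoff]

# Invariant: every training record precedes every test record.
assert all(t[0] < e[0] for t in train for e in test)
```

The invariant at the end is the property a random shuffle destroys, which is why shuffled splits are unsafe for time-series problems.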

Relationship to Generalization

Data leakage invalidates generalization estimates. A model that benefits from leaked information has not truly learned to generalize—it has memorized artifacts of the evaluation setup.

Leakage undermines trust in all reported metrics.

Related Concepts

  • Data & Distribution
  • Training Data
  • Validation Data
  • Test Data
  • Train/Test Split
  • Distribution Shift
  • Generalization