Short Definition
Data leakage occurs when information that would not be available at prediction time improperly influences model training or evaluation.
Definition
Data leakage refers to any situation in which a machine learning model has access—directly or indirectly—to information that would not be available at prediction time. This typically happens when training, validation, or test data are not properly isolated, leading to overly optimistic performance estimates.
Data leakage compromises the validity of evaluation results.
Why It Matters
Models affected by data leakage can appear highly accurate during development but fail in real-world deployment. Because leakage inflates performance metrics, it often goes unnoticed until systems break in production.
Data leakage is one of the most common and dangerous sources of false confidence in machine learning.
Common Forms of Data Leakage
- Train–test leakage: test data influences training
- Validation leakage: validation data repeatedly used for tuning
- Feature leakage: features encode future or target information
- Temporal leakage: training data includes information from the future
- Preprocessing leakage: statistics computed on full datasets before splitting
Leakage can occur at any stage of the pipeline.
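Feature leakage is often the hardest form to spot, because the leaky column looks like an ordinary input. The following minimal sketch (with an entirely hypothetical churn dataset) shows how a feature that is only populated after the target event gives away the answer:

```python
import random

random.seed(0)

# Hypothetical churn dataset: 'days_since_cancellation' is only
# populated AFTER a customer churns, so it encodes the target.
rows = []
for _ in range(1000):
    churned = random.random() < 0.3
    rows.append({
        "monthly_spend": random.gauss(50, 15),  # legitimate feature
        "days_since_cancellation": random.randint(1, 90) if churned else 0,  # leaky
        "churned": churned,
    })

# A trivial "model" that thresholds the leaky feature is perfect --
# a classic sign that a feature encodes the target.
correct = sum((r["days_since_cancellation"] > 0) == r["churned"] for r in rows)
print(correct / len(rows))  # 1.0: too good to be true
```

Any model trained on this table would score perfectly in evaluation and fail in production, where the leaky column is always zero at prediction time.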
How Data Leakage Happens
Data leakage often arises from:
- improper data splitting
- careless feature engineering
- global normalization or scaling
- reuse of evaluation data
- automated pipelines without safeguards
Leakage is usually accidental, not intentional.
How It Affects Models
- Inflated evaluation metrics
- Misleading comparisons between models
- Poor generalization to real-world data
- Failed deployments despite “excellent” test results
Leakage causes models to learn shortcuts that do not exist in practice.
Minimal Conceptual Example
# leakage example (conceptual)
mean = compute_mean(full_dataset)  # includes test data
x_train = normalize(train_data, mean)
This allows test information to influence training indirectly.
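The leak-free alternative fits the statistic on the training split only and reuses it for the test split. A minimal runnable sketch, where compute_mean and normalize are implemented as simple stand-ins for the helpers above:

```python
def compute_mean(rows):
    # statistic fitted on ONE split only
    return sum(rows) / len(rows)

def normalize(rows, mean):
    return [x - mean for x in rows]

train_data = [1.0, 2.0, 3.0]
test_data = [10.0, 11.0]            # held out; must not influence the mean

mean = compute_mean(train_data)     # fitted on training data only
x_train = normalize(train_data, mean)
x_test = normalize(test_data, mean)  # test data uses the TRAIN mean

print(mean)     # 2.0
print(x_train)  # [-1.0, 0.0, 1.0]
```

The key design point is that the test rows pass through the same transformation, but never contribute to fitting it.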
Detecting Data Leakage
Signs of possible leakage include:
- unusually high validation or test performance
- minimal performance gap between training and test sets
- sudden drops in performance after deployment
- features that seem too predictive
Detection often requires careful pipeline review.
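One cheap pipeline check that catches train-test leakage directly is looking for exact rows shared between splits. A minimal sketch with made-up records:

```python
# Do any identical rows appear in both splits?
train = [("alice", 34), ("bob", 51), ("carol", 29)]
test = [("dave", 42), ("bob", 51)]  # duplicated row -> leakage

overlap = set(train) & set(test)
print(len(overlap))  # any overlap signals train-test leakage
```

This only catches exact duplicates; near-duplicates (the same entity with slightly different values) require fuzzier checks such as matching on an entity identifier.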
Preventing Data Leakage
Common safeguards include:
- strict separation of data splits
- fitting preprocessing steps only on training data
- time-aware splitting for temporal data
- limited access to test results
- reproducible, auditable pipelines
Prevention is easier than detection.
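As one concrete safeguard, time-aware splitting means sorting by timestamp and cutting chronologically rather than shuffling, so no future record can land in the training set. A minimal sketch with illustrative timestamped records:

```python
# Time-aware split: sort by timestamp, then cut -- never shuffle.
records = [
    ("2024-01-05", 1.2),
    ("2024-03-20", 0.8),
    ("2024-02-11", 1.5),
    ("2024-04-02", 0.9),
]
records.sort(key=lambda r: r[0])  # ISO dates sort chronologically as strings

cut = int(len(records) * 0.75)
train, test = records[:cut], records[cut:]

# Every training record precedes every test record.
print(max(t for t, _ in train) < min(t for t, _ in test))  # True
```

A random shuffle before splitting would break this guarantee and let the model train on information from the future of its own test set.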
Relationship to Generalization
Data leakage invalidates generalization estimates. A model that benefits from leaked information has not truly learned to generalize—it has memorized artifacts of the evaluation setup.
Leakage undermines trust in all reported metrics.
Related Concepts
- Data & Distribution
- Training Data
- Validation Data
- Test Data
- Train/Test Split
- Distribution Shift
- Generalization