Short Definition
Target leakage occurs when features reveal information about the target that would not be available at prediction time.
Definition
Target leakage is a specific form of data leakage where input features directly or indirectly encode the target label (or future information correlated with it). As a result, the model learns shortcuts that exploit leaked signals rather than meaningful patterns, leading to deceptively high performance during evaluation.
Target leakage invalidates model evaluation by allowing the model to “peek” at the answer.
Why It Matters
Models affected by target leakage often appear exceptionally accurate during development but fail catastrophically in real-world use. Because the leakage is embedded in features, it can be difficult to detect without domain knowledge and careful pipeline inspection.
Target leakage is one of the most damaging and subtle forms of leakage.
Common Sources of Target Leakage
- Post-outcome features: variables created after the target event occurs
- Proxy features: variables highly correlated with the target due to data collection artifacts
- Aggregations using future data: statistics computed with knowledge of the label
- Human-influenced features: annotations or scores derived from the target itself
Leakage often arises unintentionally during feature engineering.
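As a sketch of the "aggregations using future data" case (hypothetical `orders` data and field names), an aggregate that includes rows recorded after the label event leaks future information, while restricting to rows before the cutoff does not:

```python
from datetime import date

# Hypothetical purchase history for one customer; the churn label is
# decided at `label_date` (names are illustrative, not a real schema).
orders = [
    {"date": date(2024, 1, 5), "amount": 120.0},
    {"date": date(2024, 2, 9), "amount": 80.0},
    {"date": date(2024, 4, 1), "amount": 5.0},  # occurs AFTER the label date
]
label_date = date(2024, 3, 1)

# Leaky: the mean includes an order that happened after the outcome was decided.
leaky_mean = sum(o["amount"] for o in orders) / len(orders)

# Safe: aggregate only over orders visible at prediction time.
past = [o for o in orders if o["date"] < label_date]
safe_mean = sum(o["amount"] for o in past) / len(past)

print(leaky_mean, safe_mean)  # 68.33..., 100.0
```

The two values differ because the leaky aggregate quietly absorbed a post-outcome row; in production, where that row does not yet exist, the feature distribution shifts.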
How Target Leakage Happens
Target leakage typically occurs when:
- features are engineered without considering prediction-time availability
- temporal order is ignored
- labels influence preprocessing or aggregation steps
- business logic unintentionally exposes outcomes
The problem is conceptual, not algorithmic.
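One common instance of labels influencing preprocessing is target (mean) encoding fit on the full dataset before splitting. A minimal sketch with made-up data shows the difference between the leaky and the safe fitting scope:

```python
# Toy dataset: (categorical feature, binary target) pairs (illustrative values).
rows = [("a", 1), ("a", 1), ("a", 0), ("b", 0), ("b", 0), ("b", 1)]
train, valid = rows[:4], rows[4:]

def target_mean_encoding(data):
    """Map each category to the mean of the target computed over `data`."""
    sums, counts = {}, {}
    for cat, y in data:
        sums[cat] = sums.get(cat, 0) + y
        counts[cat] = counts.get(cat, 0) + 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

# Leaky: the encoding is fit on ALL rows, so validation labels shape
# the feature values used to evaluate the model.
leaky_encoding = target_mean_encoding(rows)

# Safe: fit the encoding on training rows only, then apply it everywhere.
safe_encoding = target_mean_encoding(train)

print(leaky_encoding["b"], safe_encoding["b"])
```

The encodings disagree for category `"b"` precisely because the leaky version saw validation labels; any preprocessing step that reads the target must be fit inside the training fold only.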
How It Affects Models
- Inflated training and validation performance
- Minimal gap between training and test metrics, because both sets are contaminated
- Overconfidence in deployment
- Rapid performance collapse in production
The model learns to exploit leaked signals rather than the underlying task.
Minimal Conceptual Example
# target leakage example (conceptual)
features["days_until_default"]  # derived using knowledge of the default event
This feature would not be available at prediction time.
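To make the effect concrete, here is a self-contained sketch with synthetic data and hypothetical feature names: a trivial threshold rule on a leaked feature looks perfect, while the same kind of rule on a legitimate but noisy feature does not.

```python
import random

random.seed(0)

# Synthetic loans: y is 1 if the loan defaults, else 0.
data = []
for _ in range(1000):
    y = random.random() < 0.2
    # Legitimate feature: weakly informative, noisy signal.
    income = random.gauss(30 if y else 50, 15)
    # Leaked feature: derived FROM the outcome (days until the default
    # event; encoded as -1 for loans that never default).
    days_until_default = random.randint(1, 365) if y else -1
    data.append((income, days_until_default, int(y)))

def accuracy(predict):
    return sum(predict(row) == row[2] for row in data) / len(data)

leaky_acc = accuracy(lambda row: int(row[1] >= 0))   # uses the leaked feature
honest_acc = accuracy(lambda row: int(row[0] < 40))  # uses income only

# leaky_acc is 1.0 by construction: the leaked feature encodes the label.
print(f"leaky: {leaky_acc:.3f}, honest: {honest_acc:.3f}")
```

The leaked feature yields perfect accuracy because it is a deterministic function of the label; in deployment, where the default has not yet happened, the feature cannot be computed at all.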
Detecting Target Leakage
Warning signs include:
- near-perfect performance on validation data
- features with implausibly high predictive power
- strong correlations that lack causal explanation
- a large performance drop when a single suspect feature is removed
Detection often requires close collaboration with domain experts.
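A simple first-pass audit, sketched here with stdlib Python only, is to scan each feature's correlation with the target and flag implausibly high values; the 0.95 threshold is an arbitrary illustration, not a standard:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def flag_suspicious(features, target, threshold=0.95):
    """Return names of features whose |correlation| with the target is implausibly high."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) > threshold]

# Toy data: "leaked" mirrors the target exactly, "honest" is only loosely related.
target = [0, 1, 0, 1, 1, 0, 1, 0]
features = {
    "leaked": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
    "honest": [0.2, 0.9, 0.4, 0.5, 0.8, 0.3, 0.4, 0.1],
}
print(flag_suspicious(features, target))  # only "leaked" is flagged
```

A correlation scan is a screening tool, not proof: a flagged feature still needs a domain expert to judge whether the relationship is causal or an artifact of how the data was collected.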
Preventing Target Leakage
Effective prevention strategies include:
- strict time-aware feature construction
- defining features based on prediction-time availability
- isolating labels from feature pipelines
- auditing feature sources and dependencies
- validating features against real deployment constraints
Feature design should mirror real-world usage.
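One way to make time-aware construction an enforced contract rather than a convention is to require an explicit `as_of` timestamp in the feature builder, so features can only be derived from events that existed at the prediction moment. A minimal sketch with a hypothetical event log:

```python
from datetime import datetime

def build_features(events, as_of):
    """Build features from events strictly before `as_of`.

    Enforcing the cutoff inside the builder mirrors deployment, where only
    history up to the prediction moment exists.
    """
    visible = [e for e in events if e["timestamp"] < as_of]
    return {
        "n_events": len(visible),
        "total_amount": sum(e["amount"] for e in visible),
    }

# Hypothetical event log; the last event happens after the prediction moment.
events = [
    {"timestamp": datetime(2024, 1, 10), "amount": 40.0},
    {"timestamp": datetime(2024, 2, 2), "amount": 60.0},
    {"timestamp": datetime(2024, 5, 1), "amount": 999.0},  # future: excluded
]
features = build_features(events, as_of=datetime(2024, 3, 1))
print(features)  # {'n_events': 2, 'total_amount': 100.0}
```

Generating training features by calling the same builder with each example's historical `as_of` keeps training and serving consistent by construction.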
Target Leakage vs Data Leakage
Target leakage is a subset of data leakage. While data leakage broadly includes any improper information flow, target leakage specifically involves the target variable influencing features.
All target leakage is data leakage, but not all data leakage is target leakage.
Relationship to Generalization
Models trained with target leakage do not generalize. Apparent performance gains reflect information leakage, not learned structure. Such models fail to transfer beyond the contaminated evaluation setup.
Related Concepts
- Data & Distribution
- Data Leakage
- Training Data
- Validation Data
- Test Data
- Feature Engineering
- Generalization