Target Leakage

Short Definition

Target leakage occurs when features reveal information about the target that would not be available at prediction time.

Definition

Target leakage is a specific form of data leakage where input features directly or indirectly encode the target label (or future information correlated with it). As a result, the model learns shortcuts that exploit leaked signals rather than meaningful patterns, leading to deceptively high performance during evaluation.

Target leakage invalidates model evaluation by allowing the model to “peek” at the answer.

Why It Matters

Models affected by target leakage often appear exceptionally accurate during development but fail catastrophically in real-world use. Because the leakage is embedded in features, it can be difficult to detect without domain knowledge and careful pipeline inspection.

Target leakage is one of the most damaging and subtle forms of leakage.

Common Sources of Target Leakage

  • Post-outcome features: variables created after the target event occurs
  • Proxy features: variables highly correlated with the target due to data collection artifacts
  • Aggregations using future data: statistics computed with knowledge of the label
  • Human-influenced features: annotations or scores derived from the target itself

Leakage often arises unintentionally during feature engineering.
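As a minimal sketch (using hypothetical loan-default records, where "defaulted" is the target), a post-outcome feature can slip in during feature engineering and act as a direct proxy for the label:

```python
# hypothetical loan records: "defaulted" is the target label
records = [
    {"loan_amount": 5000,  "defaulted": 0},
    {"loan_amount": 12000, "defaulted": 1},
    {"loan_amount": 7000,  "defaulted": 1},
    {"loan_amount": 3000,  "defaulted": 0},
]

# post-outcome feature: collections calls only happen AFTER a default,
# so this column is effectively a proxy for the label itself
for r in records:
    r["num_collections_calls"] = 3 if r["defaulted"] else 0

# the "feature" separates the classes perfectly -- a classic warning sign
perfectly_separable = all(
    (r["num_collections_calls"] > 0) == (r["defaulted"] == 1)
    for r in records
)
```

A model trained on this data would lean almost entirely on `num_collections_calls`, a feature that does not exist at prediction time.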

How Target Leakage Happens

Target leakage typically occurs when:

  • features are engineered without considering prediction-time availability
  • temporal order is ignored
  • labels influence preprocessing or aggregation steps
  • business logic unintentionally exposes outcomes

The problem is conceptual, not algorithmic.
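The third bullet above, labels influencing preprocessing, can be sketched with target (mean) encoding on toy data: computing the encoding over the full dataset, before the train/test split, silently folds test-set labels into the features.

```python
# toy data: a categorical feature and its binary labels
categories = ["a", "a", "b", "b", "b", "a"]
labels     = [1,   1,   0,   0,   1,   0]

def mean_encode(cats, ys):
    # map each category to the mean label observed in (cats, ys)
    groups = {}
    for c, y in zip(cats, ys):
        groups.setdefault(c, []).append(y)
    return {c: sum(v) / len(v) for c, v in groups.items()}

# leaky: the encoding is fit on ALL rows, including future test labels
full_encoding = mean_encode(categories, labels)

# safe: the encoding is fit on training rows only (first 4 here)
train_encoding = mean_encode(categories[:4], labels[:4])

# the encodings differ because the leaky one has seen test labels
encodings_differ = full_encoding != train_encoding
```

The same pattern applies to any label-dependent preprocessing step (scaling by class statistics, label-aware imputation, and so on): fit it inside the training fold only.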

How It Affects Models

  • Inflated training and validation performance
  • Minimal gap between training and test metrics (the leaked feature is present in every split)
  • Overconfidence in deployment
  • Rapid performance collapse in production

The model learns to exploit leaked signals rather than the underlying task.
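This failure mode can be simulated with synthetic data: a "model" that relies on a leaked feature scores perfectly during development, then collapses to near-chance accuracy in production, where the feature is unavailable.

```python
import random

random.seed(0)

def make_rows(n, leak_available):
    """Synthetic rows: a weak genuine signal plus an optional leaked feature."""
    rows = []
    for _ in range(n):
        y = random.randint(0, 1)
        x_real = y if random.random() < 0.6 else 1 - y  # weak genuine signal
        x_leak = y if leak_available else 0             # leak: equals the label
        rows.append((x_real, x_leak, y))
    return rows

def predict(x_real, x_leak):
    # the shortcut the model "learned": rely entirely on the leaked feature
    return x_leak

val = make_rows(200, leak_available=True)    # development data contains the leak
prod = make_rows(200, leak_available=False)  # at serving time the leak is absent

val_acc = sum(predict(a, b) == y for a, b, y in val) / len(val)
prod_acc = sum(predict(a, b) == y for a, b, y in prod) / len(prod)
# val_acc is perfect; prod_acc collapses toward chance
```

The data and the hard-coded `predict` rule are of course contrived, but they mirror what a real learner does when a leaked feature dominates every honest one.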

Minimal Conceptual Example

# target leakage example (conceptual)
# "days_until_default" can only be computed once the default has occurred,
# so the feature encodes the outcome itself
features["days_until_default"]

This feature would not be available at prediction time.

Detecting Target Leakage

Warning signs include:

  • near-perfect performance on validation data
  • individual features with implausibly high predictive power
  • strong feature–target correlations that lack a causal explanation
  • large performance drops when a single suspect feature is removed

Detection often requires close collaboration with domain experts.
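One simple audit for the second warning sign is to score each feature alone with a trivial threshold rule and flag any feature whose single-feature accuracy is implausibly high. A sketch on toy rows (where `feature_b` is deliberately leaked):

```python
# toy rows: (feature_a, feature_b, label) -- feature_b is a leaked copy
rows = [
    (0.2, 0, 0), (0.9, 1, 1), (0.4, 0, 0), (0.7, 1, 1),
    (0.6, 0, 0), (0.3, 1, 1), (0.8, 1, 1), (0.1, 0, 0),
]

def single_feature_accuracy(idx, threshold=0.5):
    # accuracy of a trivial rule: predict positive iff feature >= threshold
    return sum((r[idx] >= threshold) == (r[2] == 1) for r in rows) / len(rows)

scores = {
    "feature_a": single_feature_accuracy(0),
    "feature_b": single_feature_accuracy(1),
}
# flag features whose standalone accuracy is suspiciously close to perfect
suspicious = [name for name, acc in scores.items() if acc >= 0.95]
```

A flagged feature is not proof of leakage, only a prompt to trace its provenance with a domain expert.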

Preventing Target Leakage

Effective prevention strategies include:

  • strict time-aware feature construction
  • defining features based on prediction-time availability
  • isolating labels from feature pipelines
  • auditing feature sources and dependencies
  • validating features against real deployment constraints

Feature design should mirror real-world usage.
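The first two strategies above can be combined into one habit: every aggregation takes the prediction date as a parameter and excludes later events by construction. A minimal sketch, assuming a hypothetical customer-spend feature:

```python
from datetime import date

# hypothetical event log: (event_date, amount_spent)
events = [
    (date(2024, 1, 5),  100),
    (date(2024, 2, 10), 250),
    (date(2024, 3, 20), 400),  # occurs after the prediction date used below
]

def spend_before(prediction_date):
    # time-aware aggregation: events on or after the prediction date
    # are excluded by construction, matching serving-time availability
    return sum(amount for d, amount in events if d < prediction_date)

feature = spend_before(date(2024, 3, 1))  # the March event is excluded
```

Parameterizing every feature by prediction time makes leakage structurally impossible for that feature, rather than something a reviewer must catch by inspection.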

Target Leakage vs Data Leakage

Target leakage is a subset of data leakage. While data leakage broadly includes any improper information flow, target leakage specifically involves the target variable influencing features.

All target leakage is data leakage, but not all data leakage is target leakage.

Relationship to Generalization

Models trained with target leakage do not generalize. Apparent performance gains reflect information leakage, not learned structure. Such models fail to transfer beyond the contaminated evaluation setup.

Related Concepts

  • Data & Distribution
  • Data Leakage
  • Training Data
  • Validation Data
  • Test Data
  • Feature Engineering
  • Generalization