Target Leakage

Short Definition

Target leakage occurs when features reveal information about the target that would not be available at prediction time.

Definition

Target leakage is a specific form of data leakage where input features directly or indirectly encode the target label (or future information correlated with it). As a result, the model learns shortcuts that exploit leaked signals rather than meaningful patterns, leading to deceptively high performance during evaluation.

Target leakage invalidates model evaluation by allowing the model to “peek” at the answer.

Why It Matters

Models affected by target leakage often appear exceptionally accurate during development but fail catastrophically in real-world use. Because the leakage is embedded in features, it can be difficult to detect without domain knowledge and careful pipeline inspection.

Target leakage is one of the most damaging and subtle forms of leakage.

Common Sources of Target Leakage

  • Post-outcome features: variables created after the target event occurs
  • Proxy features: variables highly correlated with the target due to data collection artifacts
  • Aggregations using future data: statistics computed with knowledge of the label
  • Human-influenced features: annotations or scores derived from the target itself

Leakage often arises unintentionally during feature engineering.
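As a minimal sketch (using hypothetical loan-default records, where "defaulted" is the target), a post-outcome feature can slip in during feature engineering and act as a direct proxy for the label:

```python
# hypothetical loan records: "defaulted" is the target label
records = [
    {"loan_amount": 5000,  "defaulted": 0},
    {"loan_amount": 12000, "defaulted": 1},
    {"loan_amount": 7000,  "defaulted": 1},
    {"loan_amount": 3000,  "defaulted": 0},
]

# post-outcome feature: collections calls only happen AFTER a default,
# so this column is effectively a proxy for the label itself
for r in records:
    r["num_collections_calls"] = 3 if r["defaulted"] else 0

# the "feature" separates the classes perfectly -- a classic warning sign
perfectly_separable = all(
    (r["num_collections_calls"] > 0) == (r["defaulted"] == 1)
    for r in records
)
```

A model trained on this data would lean almost entirely on `num_collections_calls`, a feature that does not exist at prediction time.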

How Target Leakage Happens

Target leakage typically occurs when:

  • features are engineered without considering prediction-time availability
  • temporal order is ignored
  • labels influence preprocessing or aggregation steps
  • business logic unintentionally exposes outcomes

The problem is conceptual, not algorithmic.
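The third bullet above, labels influencing preprocessing, can be sketched with target (mean) encoding on toy data: computing the encoding over the full dataset, before the train/test split, silently folds test-set labels into the features.

```python
# toy data: a categorical feature and its binary labels
categories = ["a", "a", "b", "b", "b", "a"]
labels     = [1,   1,   0,   0,   1,   0]

def mean_encode(cats, ys):
    # map each category to the mean label observed in (cats, ys)
    groups = {}
    for c, y in zip(cats, ys):
        groups.setdefault(c, []).append(y)
    return {c: sum(v) / len(v) for c, v in groups.items()}

# leaky: the encoding is fit on ALL rows, including future test labels
full_encoding = mean_encode(categories, labels)

# safe: the encoding is fit on training rows only (first 4 here)
train_encoding = mean_encode(categories[:4], labels[:4])

# the encodings differ because the leaky one has seen test labels
encodings_differ = full_encoding != train_encoding
```

The same pattern applies to any label-dependent preprocessing step (scaling by class statistics, label-aware imputation, and so on): fit it inside the training fold only.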

How It Affects Models

  • Inflated training and validation performance
  • Minimal gap between training and test metrics (the leaked feature is present in every split)
  • Overconfidence in deployment
  • Rapid performance collapse in production

The model learns to exploit leaked signals rather than the underlying task.
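This failure mode can be simulated with synthetic data: a "model" that relies on a leaked feature scores perfectly during development, then collapses to near-chance accuracy in production, where the feature is unavailable.

```python
import random

random.seed(0)

def make_rows(n, leak_available):
    """Synthetic rows: a weak genuine signal plus an optional leaked feature."""
    rows = []
    for _ in range(n):
        y = random.randint(0, 1)
        x_real = y if random.random() < 0.6 else 1 - y  # weak genuine signal
        x_leak = y if leak_available else 0             # leak: equals the label
        rows.append((x_real, x_leak, y))
    return rows

def predict(x_real, x_leak):
    # the shortcut the model "learned": rely entirely on the leaked feature
    return x_leak

val = make_rows(200, leak_available=True)    # development data contains the leak
prod = make_rows(200, leak_available=False)  # at serving time the leak is absent

val_acc = sum(predict(a, b) == y for a, b, y in val) / len(val)
prod_acc = sum(predict(a, b) == y for a, b, y in prod) / len(prod)
# val_acc is perfect; prod_acc collapses toward chance
```

The data and the hard-coded `predict` rule are of course contrived, but they mirror what a real learner does when a leaked feature dominates every honest one.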

Minimal Conceptual Example

# target leakage example (conceptual)
# "days_until_default" can only be computed once the default has occurred,
# so the feature encodes the outcome itself
features["days_until_default"]

This feature would not be available at prediction time.

Detecting Target Leakage

Warning signs include:

  • near-perfect performance on validation data
  • individual features with implausibly high predictive power
  • strong feature–target correlations that lack a causal explanation
  • large performance drops when a single suspect feature is removed

Detection often requires close collaboration with domain experts.
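One simple audit for the second warning sign is to score each feature alone with a trivial threshold rule and flag any feature whose single-feature accuracy is implausibly high. A sketch on toy rows (where `feature_b` is deliberately leaked):

```python
# toy rows: (feature_a, feature_b, label) -- feature_b is a leaked copy
rows = [
    (0.2, 0, 0), (0.9, 1, 1), (0.4, 0, 0), (0.7, 1, 1),
    (0.6, 0, 0), (0.3, 1, 1), (0.8, 1, 1), (0.1, 0, 0),
]

def single_feature_accuracy(idx, threshold=0.5):
    # accuracy of a trivial rule: predict positive iff feature >= threshold
    return sum((r[idx] >= threshold) == (r[2] == 1) for r in rows) / len(rows)

scores = {
    "feature_a": single_feature_accuracy(0),
    "feature_b": single_feature_accuracy(1),
}
# flag features whose standalone accuracy is suspiciously close to perfect
suspicious = [name for name, acc in scores.items() if acc >= 0.95]
```

A flagged feature is not proof of leakage, only a prompt to trace its provenance with a domain expert.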

Preventing Target Leakage

Effective prevention strategies include:

  • strict time-aware feature construction
  • defining features based on prediction-time availability
  • isolating labels from feature pipelines
  • auditing feature sources and dependencies
  • validating features against real deployment constraints

Feature design should mirror real-world usage.
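The first two strategies above can be combined into one habit: every aggregation takes the prediction date as a parameter and excludes later events by construction. A minimal sketch, assuming a hypothetical customer-spend feature:

```python
from datetime import date

# hypothetical event log: (event_date, amount_spent)
events = [
    (date(2024, 1, 5),  100),
    (date(2024, 2, 10), 250),
    (date(2024, 3, 20), 400),  # occurs after the prediction date used below
]

def spend_before(prediction_date):
    # time-aware aggregation: events on or after the prediction date
    # are excluded by construction, matching serving-time availability
    return sum(amount for d, amount in events if d < prediction_date)

feature = spend_before(date(2024, 3, 1))  # the March event is excluded
```

Parameterizing every feature by prediction time makes leakage structurally impossible for that feature, rather than something a reviewer must catch by inspection.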

Target Leakage vs Data Leakage

Target leakage is a subset of data leakage. While data leakage broadly includes any improper information flow, target leakage specifically involves the target variable influencing features.

All target leakage is data leakage, but not all data leakage is target leakage.

Relationship to Generalization

Models trained with target leakage do not generalize. Apparent performance gains reflect information leakage, not learned structure. Such models fail to transfer beyond the contaminated evaluation setup.

Related Concepts

  • Data & Distribution
  • Data Leakage
  • Training Data
  • Validation Data
  • Test Data
  • Feature Engineering
  • Generalization