Processing-Time Leakage

Short Definition

Processing-time leakage occurs when models use information available at data processing time but unavailable at prediction time.

Definition

Processing-time leakage is a form of data leakage that arises when features, labels, or splits are constructed using timestamps tied to data ingestion or processing rather than the true event time. As a result, models are inadvertently trained or evaluated using information that would not have been accessible when a real-world prediction is made.

Processing-time leakage violates operational causality: the model is conditioned on information that had not yet arrived at the moment the prediction would have been made.

Why It Matters

In modern data pipelines, events are often logged, delayed, reprocessed, or backfilled. Using processing time instead of event time can silently introduce future knowledge into training data, producing overly optimistic evaluation results and brittle deployed models.

This type of leakage is especially common in streaming and log-based systems, where ingestion delays and backfills are routine.

Event Time vs Processing Time

  • Event time: when the real-world event actually occurred
  • Processing time: when the system recorded, ingested, or processed the event

These timestamps often differ due to latency, batching, retries, or system failures.
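The gap between the two timestamps can be made concrete with a small sketch. This is an illustrative example, not a prescribed schema: the column names `event_time` and `ingest_time` and the pandas DataFrame layout are assumptions.

```python
import pandas as pd

# Hypothetical events: event_time is when the action happened,
# ingest_time is when the pipeline recorded it (after latency/retries).
data = pd.DataFrame({
    "event_time":  pd.to_datetime(["2024-01-01 10:00", "2024-01-01 11:00"]),
    "ingest_time": pd.to_datetime(["2024-01-01 10:05", "2024-01-02 09:00"]),
})

# The second event occurred on Jan 1 but was not visible until Jan 2.
lag = data["ingest_time"] - data["event_time"]
print(lag.iloc[1])  # 0 days 22:00:00
```

Any split or feature keyed on `ingest_time` treats that second event as if it belonged to the next day, even though it occurred well before.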

Common Sources of Processing-Time Leakage

Typical sources include:

  • splitting data using ingestion timestamps instead of event timestamps
  • computing aggregates over data as processed, not as occurred
  • including late-arriving events in historical features
  • training on labels that became available only after prediction time
  • mixing backfilled data with real-time data incorrectly

Leaked data often looks "correct" in the logs, because ingestion timestamps are internally consistent; it is invalid only relative to what was actually knowable at prediction time.
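The "late-arriving events in historical features" case above can be sketched concretely. The example below contrasts an aggregate computed over all events that occurred before a cutoff with one restricted to events that had actually arrived by the cutoff; column names and values are illustrative assumptions.

```python
import pandas as pd

cutoff = pd.Timestamp("2024-01-01 12:00")

events = pd.DataFrame({
    "event_time":  pd.to_datetime(["2024-01-01 09:00",
                                   "2024-01-01 11:30",
                                   "2024-01-01 11:45"]),
    "ingest_time": pd.to_datetime(["2024-01-01 09:01",
                                   "2024-01-01 14:00",   # arrived late
                                   "2024-01-01 11:46"]),
    "amount": [10.0, 5.0, 7.0],
})

# Leaky: includes every event that occurred before the cutoff,
# even the one that had not yet been ingested at that point.
leaky_total = events.loc[events.event_time < cutoff, "amount"].sum()

# Honest: only events that had both occurred and arrived by the cutoff.
honest_total = events.loc[
    (events.event_time < cutoff) & (events.ingest_time <= cutoff), "amount"
].sum()

print(leaky_total, honest_total)  # 22.0 17.0
```

The leaky aggregate silently incorporates the late-arriving 5.0, which no real-time system could have seen at the cutoff.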

Processing-Time Leakage vs Temporal Feature Leakage

  • Processing-time leakage: caused by incorrect time reference
  • Temporal feature leakage: caused by feature construction using future data

Processing-time leakage often enables temporal feature leakage.

Minimal Conceptual Example

# assumes a pandas DataFrame `data` with both timestamp columns
# and a cutoff timestamp T
# invalid: splits on when rows were ingested (processing time)
train = data[data.ingest_time < T]
# valid: splits on when the events actually occurred (event time)
train = data[data.event_time < T]

How It Affects Evaluation

  • inflated offline performance
  • unrealistic calibration
  • misleading robustness estimates
  • sudden performance collapse in production

The model learns patterns unavailable at inference time.

Detecting Processing-Time Leakage

Warning signs include:

  • performance that degrades sharply after deployment
  • discrepancies between batch and streaming predictions
  • inconsistent results when re-evaluated with event-time splits
  • unexpectedly strong performance from simple temporal features

Detection requires time-aware audits.

Preventing Processing-Time Leakage

Best practices include:

  • always using event-time timestamps for modeling
  • explicitly modeling label latency
  • separating ingestion pipelines from modeling timelines
  • validating with walk-forward or event-time evaluation
  • documenting feature availability windows

Time semantics must be explicit.
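The walk-forward and availability recommendations above can be combined in one sketch: successive event-time cutoffs generate folds, and a row only qualifies for training if it had both occurred and arrived before the cutoff. Column names and the fold scheme are illustrative assumptions.

```python
import pandas as pd

def walk_forward_folds(data: pd.DataFrame, cutoffs):
    """Yield (train_mask, test_mask) pairs for successive event-time
    cutoffs. A training row must both have occurred (event_time) and
    have been available (ingest_time) before the cutoff."""
    for lo, hi in zip(cutoffs, cutoffs[1:]):
        train = (data["event_time"] < lo) & (data["ingest_time"] <= lo)
        test = (data["event_time"] >= lo) & (data["event_time"] < hi)
        yield train, test

events = pd.DataFrame({
    "event_time":  pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "ingest_time": pd.to_datetime(["2024-01-01", "2024-01-04", "2024-01-03"]),
})
cuts = list(pd.to_datetime(["2024-01-02", "2024-01-04"]))
folds = list(walk_forward_folds(events, cuts))
```

Filtering on both timestamps makes each fold reproduce what a deployed model could actually have seen, which is the point of event-time evaluation.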

Relationship to Event-Time Sampling

Event-time sampling is the primary defense against processing-time leakage. Without it, even well-designed evaluation protocols can be invalidated.

Relationship to Generalization

Processing-time leakage inflates apparent generalization by allowing models to exploit future information. True generalization can only be assessed under realistic time constraints.

Related Concepts

  • Data & Distribution
  • Data Leakage
  • Temporal Feature Leakage
  • Event-Time Sampling
  • Label Latency
  • Time-Aware Sampling
  • Walk-Forward Validation
  • Evaluation Protocols