Processing-Time Leakage

Short Definition

Processing-time leakage occurs when models use information available at data processing time but unavailable at prediction time.

Definition

Processing-time leakage is a form of data leakage that arises when features, labels, or splits are constructed using timestamps tied to data ingestion or processing rather than the true event time. As a result, models are inadvertently trained or evaluated using information that would not have been accessible when a real-world prediction is made.

Processing-time leakage violates operational causality: the model is conditioned on information that had not yet arrived at the moment the prediction would have been made.

Why It Matters

In modern data pipelines, events are often logged, delayed, reprocessed, or backfilled. Using processing time instead of event time can silently introduce future knowledge into training data, producing overly optimistic evaluation results and brittle deployed models.

This type of leakage is especially common in streaming and log-based systems, where ingestion delays and backfills are routine.

Event Time vs Processing Time

  • Event time: when the real-world event actually occurred
  • Processing time: when the system recorded, ingested, or processed the event

These timestamps often differ due to latency, batching, retries, or system failures.
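The gap between the two timestamps can be made concrete with a small sketch. This is an illustrative example, not a prescribed schema: the column names `event_time` and `ingest_time` and the pandas DataFrame layout are assumptions.

```python
import pandas as pd

# Hypothetical events: event_time is when the action happened,
# ingest_time is when the pipeline recorded it (after latency/retries).
data = pd.DataFrame({
    "event_time":  pd.to_datetime(["2024-01-01 10:00", "2024-01-01 11:00"]),
    "ingest_time": pd.to_datetime(["2024-01-01 10:05", "2024-01-02 09:00"]),
})

# The second event occurred on Jan 1 but was not visible until Jan 2.
lag = data["ingest_time"] - data["event_time"]
print(lag.iloc[1])  # 0 days 22:00:00
```

Any split or feature keyed on `ingest_time` treats that second event as if it belonged to the next day, even though it occurred well before.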

Common Sources of Processing-Time Leakage

Typical sources include:

  • splitting data using ingestion timestamps instead of event timestamps
  • computing aggregates over data as processed, not as occurred
  • including late-arriving events in historical features
  • training on labels that became available only after prediction time
  • mixing backfilled data with real-time data incorrectly

Leaked data often looks "correct" in the logs, because ingestion timestamps are internally consistent; it is invalid only relative to what was actually knowable at prediction time.
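The "late-arriving events in historical features" case above can be sketched concretely. The example below contrasts an aggregate computed over all events that occurred before a cutoff with one restricted to events that had actually arrived by the cutoff; column names and values are illustrative assumptions.

```python
import pandas as pd

cutoff = pd.Timestamp("2024-01-01 12:00")

events = pd.DataFrame({
    "event_time":  pd.to_datetime(["2024-01-01 09:00",
                                   "2024-01-01 11:30",
                                   "2024-01-01 11:45"]),
    "ingest_time": pd.to_datetime(["2024-01-01 09:01",
                                   "2024-01-01 14:00",   # arrived late
                                   "2024-01-01 11:46"]),
    "amount": [10.0, 5.0, 7.0],
})

# Leaky: includes every event that occurred before the cutoff,
# even the one that had not yet been ingested at that point.
leaky_total = events.loc[events.event_time < cutoff, "amount"].sum()

# Honest: only events that had both occurred and arrived by the cutoff.
honest_total = events.loc[
    (events.event_time < cutoff) & (events.ingest_time <= cutoff), "amount"
].sum()

print(leaky_total, honest_total)  # 22.0 17.0
```

The leaky aggregate silently incorporates the late-arriving 5.0, which no real-time system could have seen at the cutoff.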

Processing-Time Leakage vs Temporal Feature Leakage

  • Processing-time leakage: caused by incorrect time reference
  • Temporal feature leakage: caused by feature construction using future data

Processing-time leakage often enables temporal feature leakage.

Minimal Conceptual Example

# assumes a pandas DataFrame `data` with both timestamp columns
# and a cutoff timestamp T
# invalid: splits on when rows were ingested (processing time)
train = data[data.ingest_time < T]
# valid: splits on when the events actually occurred (event time)
train = data[data.event_time < T]

How It Affects Evaluation

  • inflated offline performance
  • unrealistic calibration
  • misleading robustness estimates
  • sudden performance collapse in production

The model learns patterns unavailable at inference time.

Detecting Processing-Time Leakage

Warning signs include:

  • performance that degrades sharply after deployment
  • discrepancies between batch and streaming predictions
  • inconsistent results when re-evaluated with event-time splits
  • unexpectedly strong performance from simple temporal features

Detection requires time-aware audits.

Preventing Processing-Time Leakage

Best practices include:

  • always using event-time timestamps for modeling
  • explicitly modeling label latency
  • separating ingestion pipelines from modeling timelines
  • validating with walk-forward or event-time evaluation
  • documenting feature availability windows

Time semantics must be explicit.
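The walk-forward and availability recommendations above can be combined in one sketch: successive event-time cutoffs generate folds, and a row only qualifies for training if it had both occurred and arrived before the cutoff. Column names and the fold scheme are illustrative assumptions.

```python
import pandas as pd

def walk_forward_folds(data: pd.DataFrame, cutoffs):
    """Yield (train_mask, test_mask) pairs for successive event-time
    cutoffs. A training row must both have occurred (event_time) and
    have been available (ingest_time) before the cutoff."""
    for lo, hi in zip(cutoffs, cutoffs[1:]):
        train = (data["event_time"] < lo) & (data["ingest_time"] <= lo)
        test = (data["event_time"] >= lo) & (data["event_time"] < hi)
        yield train, test

events = pd.DataFrame({
    "event_time":  pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "ingest_time": pd.to_datetime(["2024-01-01", "2024-01-04", "2024-01-03"]),
})
cuts = list(pd.to_datetime(["2024-01-02", "2024-01-04"]))
folds = list(walk_forward_folds(events, cuts))
```

Filtering on both timestamps makes each fold reproduce what a deployed model could actually have seen, which is the point of event-time evaluation.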

Relationship to Event-Time Sampling

Event-time sampling is the primary defense against processing-time leakage. Without it, even well-designed evaluation protocols can be invalidated.

Relationship to Generalization

Processing-time leakage inflates apparent generalization by allowing models to exploit future information. True generalization can only be assessed under realistic time constraints.

Related Concepts

  • Data & Distribution
  • Data Leakage
  • Temporal Feature Leakage
  • Event-Time Sampling
  • Label Latency
  • Time-Aware Sampling
  • Walk-Forward Validation
  • Evaluation Protocols