Short Definition
Processing-time leakage occurs when models use information available at data processing time but unavailable at prediction time.
Definition
Processing-time leakage is a form of data leakage that arises when features, labels, or splits are constructed using timestamps tied to data ingestion or processing rather than the true event time. As a result, models are inadvertently trained or evaluated using information that would not have been accessible when a real-world prediction is made.
Processing-time leakage violates operational causality.
Why It Matters
In modern data pipelines, events are often logged, delayed, reprocessed, or backfilled. Using processing time instead of event time can silently introduce future knowledge into training data, producing overly optimistic evaluation results and brittle deployed models.
This type of leakage is common in streaming and log-based systems.
Event Time vs Processing Time
- Event time: when the real-world event actually occurred
- Processing time: when the system recorded, ingested, or processed the event
These timestamps often differ due to latency, batching, retries, or system failures.
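The divergence is easy to see when the same events are ordered by each timestamp. A minimal sketch (the record layout and timestamps are hypothetical): one event happens first but is ingested last, so the two orderings disagree.

```python
from datetime import datetime

# Two hypothetical events: e1 happened first but was ingested last (retry delay)
events = [
    {"id": "e1",
     "event_time": datetime(2024, 1, 1, 10, 0),
     "processing_time": datetime(2024, 1, 1, 18, 0)},
    {"id": "e2",
     "event_time": datetime(2024, 1, 1, 11, 0),
     "processing_time": datetime(2024, 1, 1, 11, 5)},
]

by_event = sorted(events, key=lambda e: e["event_time"])
by_processing = sorted(events, key=lambda e: e["processing_time"])

print([e["id"] for e in by_event])       # ['e1', 'e2']
print([e["id"] for e in by_processing])  # ['e2', 'e1'] — late arrival flips the order
```

Any pipeline logic that implicitly assumes the two orderings agree inherits this skew.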
Common Sources of Processing-Time Leakage
Typical sources include:
- splitting data using ingestion timestamps instead of event timestamps
- computing aggregates over data as processed, not as occurred
- including late-arriving events in historical features
- training on labels that became available only after prediction time
- mixing backfilled data with real-time data incorrectly
Leaked data often appears “correct” in the logs but is invalid operationally.
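The second source above — aggregates computed over data as processed rather than as occurred — can be illustrated with a point-in-time feature. In this hypothetical purchase log, a record was backfilled after the feature cutoff; a naive sum over today's warehouse includes it, while a point-in-time correct sum does not.

```python
from datetime import datetime

CUTOFF = datetime(2024, 1, 10)

# Hypothetical purchase log; the second record was backfilled after the cutoff.
purchases = [
    {"amount": 50,
     "event_time": datetime(2024, 1, 5),
     "ingest_time": datetime(2024, 1, 5)},
    {"amount": 30,
     "event_time": datetime(2024, 1, 8),
     "ingest_time": datetime(2024, 1, 12)},  # late arrival
]

# Leaky: every event before the cutoff, as seen in today's warehouse
leaky_total = sum(p["amount"] for p in purchases
                  if p["event_time"] < CUTOFF)

# Point-in-time correct: only records already ingested at the cutoff
correct_total = sum(p["amount"] for p in purchases
                    if p["event_time"] < CUTOFF
                    and p["ingest_time"] <= CUTOFF)

print(leaky_total, correct_total)  # 80 50
```

At prediction time the model would have seen 50, not 80; training on the leaky value teaches it a feature distribution that never exists in production.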
Processing-Time Leakage vs Temporal Feature Leakage
- Processing-time leakage: caused by incorrect time reference
- Temporal feature leakage: caused by feature construction using future data
Processing-time leakage often enables temporal feature leakage.
Minimal Conceptual Example
```python
# invalid (processing-time based)
train = data[data.ingest_time < T]

# valid (event-time based)
train = data[data.event_time < T]
```
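A runnable version of the same contrast, using plain Python records with assumed `event_time`/`ingest_time` fields: a late-arriving historical row is silently dropped by the processing-time split but correctly retained by the event-time split.

```python
from datetime import datetime

T = datetime(2024, 1, 10)

rows = [
    {"event_time": datetime(2024, 1, 8),
     "ingest_time": datetime(2024, 1, 12)},  # late-arriving historical row
    {"event_time": datetime(2024, 1, 9),
     "ingest_time": datetime(2024, 1, 9)},
]

# invalid: processing-time split drops the late-arriving historical row
train_invalid = [r for r in rows if r["ingest_time"] < T]

# valid: event-time split keeps both rows that actually occurred before T
train_valid = [r for r in rows if r["event_time"] < T]

print(len(train_invalid), len(train_valid))  # 1 2
```

The splits diverge exactly when ingestion lag crosses the cutoff, which is also when leakage is most likely.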
How It Affects Evaluation
- inflated offline performance
- unrealistic calibration
- misleading robustness estimates
- sudden performance collapse in production
The model learns patterns unavailable at inference time.
Detecting Processing-Time Leakage
Warning signs include:
- performance that degrades sharply after deployment
- discrepancies between batch and streaming predictions
- inconsistent results when re-evaluated with event-time splits
- unexpectedly strong performance from simple temporal features
Detection requires time-aware audits.
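One simple time-aware audit is to measure the gap between event time and ingestion time across the log. A large maximum lag means processing-time and event-time splits will diverge, which is a warning sign worth investigating. A minimal sketch (`max_ingest_lag` and the record layout are illustrative, not a standard API):

```python
from datetime import datetime, timedelta

def max_ingest_lag(records):
    """Largest gap between when events occurred and when they were ingested.

    A large lag means processing-time splits diverge from event-time splits,
    a warning sign for processing-time leakage.
    """
    return max(r["ingest_time"] - r["event_time"] for r in records)

records = [
    {"event_time": datetime(2024, 1, 1),
     "ingest_time": datetime(2024, 1, 1, 2)},   # 2-hour lag
    {"event_time": datetime(2024, 1, 2),
     "ingest_time": datetime(2024, 1, 4)},      # 2-day lag from a backfill
]

print(max_ingest_lag(records))  # worst-case ingestion lag in the log
```

Comparing this lag distribution against the split cutoffs shows how many records sit on the wrong side of a processing-time boundary.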
Preventing Processing-Time Leakage
Best practices include:
- always using event-time timestamps for modeling
- explicitly modeling label latency
- separating ingestion pipelines from modeling timelines
- validating with walk-forward or event-time evaluation
- documenting feature availability windows
Time semantics must be explicit.
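Walk-forward evaluation on event time can be sketched in a few lines. This is a minimal illustration, not a library API: each fold trains on events strictly before a cutoff and tests on the following window, with the cutoff advancing through event time (never ingestion time).

```python
from datetime import datetime, timedelta

def walk_forward_splits(records, start, end, step):
    """Yield (train, test) folds split on event_time, never ingest_time.

    Each fold trains on events strictly before the cutoff and tests on the
    window [cutoff, cutoff + step).
    """
    cutoff = start
    while cutoff < end:
        train = [r for r in records if r["event_time"] < cutoff]
        test = [r for r in records
                if cutoff <= r["event_time"] < cutoff + step]
        yield train, test
        cutoff += step

records = [{"event_time": datetime(2024, 1, d)} for d in (1, 2, 3, 4)]
folds = list(walk_forward_splits(records,
                                 start=datetime(2024, 1, 2),
                                 end=datetime(2024, 1, 4),
                                 step=timedelta(days=1)))
for train, test in folds:
    print(len(train), len(test))  # training set grows as the cutoff advances
```

Label latency can be layered on top by additionally requiring that each training label's availability time precede the cutoff.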
Relationship to Event-Time Sampling
Event-time sampling is the primary defense against processing-time leakage. Without it, even well-designed evaluation protocols can be invalidated.
Relationship to Generalization
Processing-time leakage inflates apparent generalization by allowing models to exploit future information. True generalization can only be assessed under realistic time constraints.
Related Concepts
- Data & Distribution
- Data Leakage
- Temporal Feature Leakage
- Event-Time Sampling
- Label Latency
- Time-Aware Sampling
- Walk-Forward Validation
- Evaluation Protocols