Short Definition
Training drift and evaluation drift describe different mismatches between how models are trained, evaluated, and deployed.
Definition
Training drift occurs when the data distribution or learning conditions used during training no longer reflect the deployment environment.
Evaluation drift occurs when the evaluation setup (datasets, metrics, protocols, or thresholds) no longer reflects how the model is actually used in deployment.
Training drift breaks learning assumptions; evaluation drift breaks measurement validity.
Why This Distinction Matters
Models can fail either because they were trained on outdated or unrepresentative data (training drift) or because their reported performance no longer measures real-world behavior (evaluation drift). Treating one as the other leads to ineffective fixes—such as retraining when the real issue is flawed evaluation.
Diagnosis determines intervention.
Training Drift
Training drift refers to misalignment between the training data/process and the deployment environment.
Common Causes of Training Drift
- data drift or concept drift over time
- outdated training datasets
- changes in feature availability or pipelines
- evolving user behavior or policies
- delayed or biased labels
- retraining schedules that lag reality
The model learns from the wrong world.
Typical Symptoms
- gradual degradation after deployment
- improved performance after retraining on recent data
- strong offline metrics that decay in production
Training drift primarily affects learning quality.
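One practical training-drift signal is to compare a feature's training-time distribution against recent production traffic, for example with the Population Stability Index (PSI). A minimal sketch, using synthetic stand-ins for the two samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: a training-time snapshot of one feature versus
# recent production traffic whose distribution has shifted.
train = rng.normal(0.0, 1.0, size=10_000)
prod = rng.normal(1.0, 1.5, size=10_000)

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one feature."""
    lo = min(expected.min(), actual.min())
    hi = max(expected.max(), actual.max())
    edges = np.linspace(lo, hi, bins + 1)
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    # Avoid log(0) when a bin is empty on one side.
    e = np.clip(e, 1e-6, None)
    a = np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

score = psi(train, prod)
print(f"PSI = {score:.3f}")  # common rule of thumb: PSI > 0.25 signals a major shift
```

A rising PSI on key features is an early warning that the world the model learned from no longer matches the world it serves.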
Evaluation Drift
Evaluation drift refers to misalignment between evaluation practices and real-world usage.
Common Causes of Evaluation Drift
- static test sets while deployment data changes
- thresholds tuned for old class prevalences
- metrics that no longer reflect business costs
- leaderboard or benchmark overfitting
- evaluation performed on cleaned or idealized data
- missing out-of-distribution (OOD) or stress-test scenarios
The model is measured against the wrong standard.
Typical Symptoms
- stable offline metrics despite poor production outcomes
- sudden failures not predicted by evaluation
- miscalibrated confidence and thresholds
- disagreement between model metrics and user impact
Evaluation drift primarily affects measurement validity.
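A concrete instance of evaluation drift is a decision threshold tuned under an old class prevalence. The sketch below uses synthetic scores (the score distributions and prevalence rates are illustrative assumptions): the stale evaluation set keeps reporting healthy precision while production precision collapses.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n, pos_rate):
    """Hypothetical classifier scores: positives score higher on average."""
    y = (rng.random(n) < pos_rate).astype(int)
    scores = np.where(y == 1,
                      rng.normal(0.7, 0.15, n),
                      rng.normal(0.4, 0.15, n))
    return y, scores

def precision(y, scores, threshold):
    pred = scores >= threshold
    return float((y[pred] == 1).mean()) if pred.any() else 0.0

# Threshold tuned when positives were 30% of traffic...
y_old, s_old = sample(20_000, 0.30)
threshold = 0.55

# ...but production prevalence has since dropped to 3%. The old
# evaluation set still looks fine; live precision is far lower.
y_new, s_new = sample(20_000, 0.03)
print(f"precision on stale eval set: {precision(y_old, s_old, threshold):.2f}")
print(f"precision in production:     {precision(y_new, s_new, threshold):.2f}")
```

Nothing about the model changed between the two measurements; only the evaluation stopped reflecting reality.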
Minimal Conceptual Illustration
Training Drift: Train data ≠ Deploy data
Evaluation Drift: Eval setup ≠ Deploy reality
Corrective Actions
Addressing Training Drift
- retraining on recent or rolling windows
- updating features and pipelines
- handling label latency explicitly
- adopting adaptive or online learning
- revisiting data collection strategies
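The rolling-window idea above can be sketched in a few lines; `fit` is a hypothetical stand-in for any real training routine:

```python
from collections import deque

WINDOW_BATCHES = 4  # retrain only on the most recent 4 batches

window = deque(maxlen=WINDOW_BATCHES)

def fit(examples):
    # Stand-in for a real training routine: the "model" here is just the
    # mean label, enough to show the window tracking recent data.
    labels = [y for _, y in examples]
    return sum(labels) / len(labels)

def on_new_batch(batch):
    """Append the fresh batch (the deque drops the oldest) and retrain."""
    window.append(batch)
    return fit([ex for b in window for ex in b])

# Early traffic is all label 0; after step 4 it flips to label 1.
for t in range(8):
    model = on_new_batch([(None, 0 if t < 4 else 1)] * 100)
print(f"mean label seen by the model: {model:.2f}")  # -> 1.00
```

Because the window holds only the four most recent batches, the retrained model reflects the post-shift distribution; a model trained on all history would still be anchored to the old one.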
Addressing Evaluation Drift
- updating test sets and validation splits
- recalibrating thresholds and metrics
- adding OOD and stress-test evaluations
- aligning metrics with decision costs
- revising evaluation protocols regularly
Fix the mismatch, not the symptom.
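Of these fixes, threshold recalibration is often the cheapest. A minimal sketch, assuming a hypothetical batch of recent production scores with known outcomes:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical recent labeled production data: true outcomes plus scores.
n = 10_000
y = (rng.random(n) < 0.05).astype(int)
scores = np.where(y == 1, rng.normal(0.7, 0.15, n), rng.normal(0.4, 0.15, n))

def recalibrate_threshold(y, scores, target_precision=0.8):
    """Smallest threshold whose precision on recent data meets the target."""
    for t in np.linspace(0.0, 1.0, 201):
        pred = scores >= t
        if pred.any() and float((y[pred] == 1).mean()) >= target_precision:
            return float(t)
    return 1.0  # no threshold meets the target

t = recalibrate_threshold(y, scores)
print(f"recalibrated threshold: {t:.3f}")
```

Rerunning this on each fresh labeled batch keeps the operating point aligned with current prevalence instead of the prevalence the evaluation set happened to have.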
Relationship to Data Drift and Concept Drift
- Training drift is often caused by data drift or concept drift
- Evaluation drift can occur even without data drift
Evaluation drift can hide training drift—or exaggerate it.
Relationship to Generalization
Training drift reduces true generalization by learning from outdated distributions. Evaluation drift distorts perceived generalization by measuring performance under irrelevant conditions.
Both undermine trust in reported results.
Relationship to Reproducibility
Evaluation drift compromises longitudinal comparisons: metrics computed under different protocols or test sets are no longer comparable across time. Training drift makes past results hard to reproduce once the underlying data has changed.
Consistency requires alignment across time.
Common Pitfalls
- retraining without fixing evaluation
- updating metrics without updating data
- relying on static benchmarks indefinitely
- ignoring deployment constraints in evaluation
- assuming production failures imply bad models
Drift is structural, not incidental.
Summary Comparison
| Aspect | Training Drift | Evaluation Drift |
|---|---|---|
| Affects | Learning process | Measurement process |
| Root cause | Data/feature changes | Protocol/metric changes |
| Detectable via | Production degradation, gains after retraining | Metric–outcome mismatch |
| Typical fix | Retraining, data updates | Evaluation redesign |
| Risk if ignored | Model decay | False confidence |
Related Concepts
- Generalization & Evaluation
- Data Drift vs Concept Drift
- Distribution Shift
- Evaluation Protocols
- Threshold Selection
- Calibration
- Rolling Retraining
- Deployment Monitoring