Short Definition
Calibration drift occurs when a model’s predicted probabilities no longer correspond to true outcome frequencies over time.
Definition
Calibration drift refers to the degradation of the alignment between predicted confidence levels and empirical correctness after deployment. A model that was once well-calibrated may become overconfident or underconfident as data distributions, decision policies, or environments change.
Confidence decays even when accuracy appears stable.
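One common formalization of the property that drift erodes: a model is perfectly calibrated when, at every confidence level it reports, the empirical outcome frequency matches that level,

$$
\Pr\!\big(Y = 1 \mid \hat{p}(X) = p\big) = p \quad \text{for all } p \in [0, 1].
$$

Calibration drift means this identity held (at least approximately) at deployment time but fails on later data.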
Why It Matters
Many decisions rely directly on predicted probabilities—for thresholding, ranking, risk estimation, or cost minimization. When calibration drifts, these decisions become systematically biased, increasing expected cost and undermining trust.
Unreliable confidence leads to unreliable decisions.
Common Causes of Calibration Drift
Calibration drift can be driven by:
- distribution shift (covariate, label, or concept shift)
- feedback loops induced by model decisions
- changes in class prevalence
- policy or threshold updates
- retraining without recalibration
- proxy metric optimization
Calibration is context-dependent.
Calibration Drift vs Metric Drift
- Metric drift: a metric's relationship to the quality it measures changes over time
- Calibration drift: confidence–correctness alignment breaks
Calibration drift is a specific, high-impact form of metric drift.
Calibration Drift vs Accuracy Degradation
Calibration can drift even when accuracy remains constant. A model may maintain correctness rates while becoming increasingly overconfident or underconfident.
Accuracy hides confidence failures.
Minimal Conceptual Illustration
Before drift: P=0.8 → ~80% correct (well-calibrated)
After drift: P=0.8 → ~60% correct (overconfident)
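A runnable version of this illustration (all numbers synthetic): the model's reported confidence stays fixed at 0.8, but the outcome frequency behind it falls.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# The model reports P=0.8 for every prediction in this illustration.
# Before drift: outcomes occur at the stated rate; after drift they do not.
correct_before = rng.random(n) < 0.80
correct_after = rng.random(n) < 0.60

print(f"empirical accuracy at P=0.8, before drift: {correct_before.mean():.3f}")
print(f"empirical accuracy at P=0.8, after drift:  {correct_after.mean():.3f}")
```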
Detection Signals
Common indicators of calibration drift include:
- widening gaps in reliability diagrams
- increasing Expected Calibration Error (ECE), as computed in the sketch below
- divergence between negative log-likelihood (NLL) and accuracy
- unstable decision thresholds
- rising decision cost despite stable accuracy
Confidence metrics must be monitored.
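A minimal sketch of the ECE signal, assuming binary labels and equal-width bins; the bin count and synthetic monitoring windows are illustrative:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width binned ECE: weighted mean |accuracy - confidence| per bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if not mask.any():
            continue
        accuracy = labels[mask].mean()    # empirical correctness in this bin
        confidence = probs[mask].mean()   # mean predicted probability in this bin
        ece += mask.mean() * abs(accuracy - confidence)
    return ece

# Monitoring idea: compare ECE on a reference window against a recent window.
rng = np.random.default_rng(1)
probs = rng.uniform(0.5, 1.0, 5000)
labels_ref = rng.random(5000) < probs            # calibrated outcomes
labels_now = rng.random(5000) < (probs - 0.15)   # drifted: overconfident by ~0.15
print(f"ECE reference window: {expected_calibration_error(probs, labels_ref):.3f}")
print(f"ECE recent window:    {expected_calibration_error(probs, labels_now):.3f}")
```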
Relationship to Distribution Shift
Distribution shift is the primary driver of calibration drift. Even small shifts in feature distributions or class priors can significantly distort predicted probabilities.
Calibration is fragile under shift.
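A sketch of one such shift, a pure change in class prevalence: probabilities calibrated under the training prior can be re-weighted with the standard Bayes prior-correction ratio. The priors below are assumptions chosen for illustration.

```python
import numpy as np

def prior_shift_adjust(p, train_prior, deploy_prior):
    """Re-weight probabilities calibrated at train_prior for a new class prior."""
    r = (deploy_prior * (1 - train_prior)) / (train_prior * (1 - deploy_prior))
    return (p * r) / (p * r + (1 - p))

p = np.array([0.2, 0.5, 0.8])  # probabilities calibrated at the training prior
# If positives become rarer (20% -> 5%), the unadjusted scores are overconfident:
print(prior_shift_adjust(p, train_prior=0.20, deploy_prior=0.05))
# -> roughly [0.05, 0.17, 0.46]
```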
Impact on Decision Thresholding
Thresholds optimized under prior calibration assumptions become suboptimal when calibration drifts. This leads to systematic over- or under-triggering of actions.
Thresholds inherit calibration errors.
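To see why, consider the standard cost-minimizing threshold for a binary act/don't-act decision: acting costs (1 - p) * c_fp in expectation, not acting costs p * c_fn, so acting is optimal when p > c_fp / (c_fp + c_fn). That threshold is only correct if p is calibrated. The costs and drift magnitude below are assumptions for illustration.

```python
# Bayes-optimal threshold under calibrated probabilities.
c_fp, c_fn = 1.0, 4.0                 # illustrative costs of false pos / false neg
threshold = c_fp / (c_fp + c_fn)      # 0.2

# If drift makes the model overconfident by ~0.15, a reported score of 0.30
# may correspond to a true frequency of ~0.15, below the threshold:
reported, true_rate = 0.30, 0.15
print(f"threshold={threshold:.2f}, reported={reported}, true rate={true_rate}")
print("acts on reported score:", reported > threshold)    # True  (over-triggers)
print("should act on true rate:", true_rate > threshold)  # False
```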
Interaction with Feedback Loops
Feedback loops can accelerate calibration drift by censoring outcomes or reinforcing selective exposure, making confidence estimates increasingly biased.
Confidence becomes self-referential.
Mitigation Strategies
Common mitigation approaches include:
- periodic post-deployment recalibration (see the sketch below)
- monitoring calibration metrics separately from accuracy
- recalibrating after retraining or policy changes
- evaluating calibration under OOD scenarios
- incorporating uncertainty-aware decision policies
Calibration requires maintenance.
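A minimal recalibration sketch for the first bullet above, fitting isotonic regression from scikit-learn on a held-out post-deployment window; the data, window, and drift magnitude are illustrative, and Platt or temperature scaling are common alternatives.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(2)

# Held-out post-deployment window: the frozen model's raw scores and outcomes.
raw_scores = rng.uniform(0.05, 0.95, 5000)
outcomes = rng.random(5000) < np.clip(raw_scores - 0.15, 0.0, 1.0)  # drifted

# Fit a monotone map from raw scores to empirical outcome frequencies.
recalibrator = IsotonicRegression(out_of_bounds="clip")
recalibrator.fit(raw_scores, outcomes)

# Apply at serving time on top of the frozen model's scores.
new_scores = np.array([0.3, 0.6, 0.9])
print(recalibrator.predict(new_scores))  # roughly [0.15, 0.45, 0.75] here
```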
Role in Evaluation Governance
Evaluation governance should:
- mandate calibration monitoring
- define recalibration triggers
- restrict confidence-based decisions without validation
- document calibration assumptions
Calibration is a governance responsibility.
Common Pitfalls
- assuming softmax probabilities are calibrated
- recalibrating only in-distribution
- ignoring subgroup-level calibration drift
- optimizing accuracy at calibration’s expense
- failing to update thresholds after recalibration
Confidence must be earned continuously.
Summary Characteristics
| Aspect | Calibration Drift |
|---|---|
| What degrades | Confidence reliability |
| Visibility | Low without monitoring |
| Dependency | Distribution and policy |
| Impact | High on decisions |
| Mitigation | Recalibration and governance |
Related Concepts
- Generalization & Evaluation
- Calibration
- Expected Calibration Error (ECE)
- Reliability Diagrams
- Metric Drift
- Distribution Shift
- Decision Thresholding
- Outcome-Aware Evaluation