Short Definition
Accuracy measures how often predictions are correct, while calibration measures how well predicted probabilities reflect true outcome frequencies.
Definition
Accuracy quantifies the proportion of correct predictions made by a model, typically at a chosen decision threshold.
Calibration assesses whether a model’s predicted probabilities correspond to empirical correctness—for example, whether predictions made with 80% confidence are correct about 80% of the time.
Accuracy measures correctness; calibration measures confidence reliability.
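As a minimal check of the definitions above (all numbers here are synthetic, not from any real model):

```python
# Synthetic example: ten predictions, all made with 80% confidence.
probs  = [0.8] * 10
labels = [1] * 8 + [0] * 2        # 8 of 10 outcomes are positive

# Accuracy: correctness of thresholded decisions (threshold 0.5).
preds = [int(p >= 0.5) for p in probs]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Calibration check: among 80%-confident predictions, the hit rate
# should be about 80%. Here it is exactly 0.8, so this slice of
# predictions is well calibrated.
print(accuracy)  # 0.8
```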
Why This Distinction Matters
A model can be highly accurate yet poorly calibrated, producing overconfident or underconfident predictions. In real-world systems where probabilities drive decisions, costs, or risk controls, poor calibration can be more harmful than modest accuracy loss.
Correct answers with wrong confidence are dangerous.
Accuracy
Accuracy focuses on:
- discrete correctness
- thresholded decisions
- average-case performance
- benchmark comparison
Strengths of Accuracy
- simple and intuitive
- widely reported and comparable
- useful for baseline evaluation
- efficient for early model selection
Limitations of Accuracy
- ignores confidence information
- sensitive to class imbalance
- threshold-dependent
- misaligned with decision costs
- blind to uncertainty failures
Accuracy compresses prediction quality into a single bit.
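The class-imbalance limitation in particular is easy to demonstrate; the dataset below is hypothetical:

```python
# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5

# A degenerate classifier that always predicts the majority class.
preds = [0] * 100

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(accuracy)  # 0.95: high accuracy, yet the model never detects a positive
```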
Calibration
Calibration focuses on:
- probability correctness
- confidence–outcome alignment
- reliability across confidence levels
- decision support validity
Common Calibration Measures
- reliability diagrams
- Expected Calibration Error (ECE)
- Brier score
- negative log-likelihood (NLL)
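The three numeric measures above can be sketched for the binary case as follows. The binning scheme (10 equal-width confidence bins) is one common convention for ECE, not the only one:

```python
import math

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted gap between confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p, 1 - p)                 # confidence of the argmax class
        hit = int((p >= 0.5) == (y == 1))    # was the thresholded prediction right?
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            avg_acc = sum(h for _, h in b) / len(b)
            ece += (len(b) / len(probs)) * abs(avg_conf - avg_acc)
    return ece

def brier_score(probs, labels):
    """Mean squared error between probabilities and binary outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def negative_log_likelihood(probs, labels, eps=1e-12):
    """Average NLL of the true labels; eps guards against log(0)."""
    return -sum(
        math.log(max(p if y == 1 else 1 - p, eps))
        for p, y in zip(probs, labels)
    ) / len(labels)

# A perfectly calibrated slice: 80% confidence, 80% correct.
probs, labels = [0.8] * 10, [1] * 8 + [0] * 2
print(expected_calibration_error(probs, labels))  # ~0.0
print(brier_score(probs, labels))                 # ~0.16
```

Note that Brier score and NLL are *proper scoring rules*: they reward both accuracy and calibration at once, whereas ECE isolates the calibration component.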
Strengths of Calibration
- enables risk-aware decisions
- supports threshold tuning
- improves trust and interpretability
- critical for safety-sensitive systems
Limitations of Calibration
- distribution-dependent
- can degrade under shift
- may mask accuracy deficiencies
- harder to optimize directly
Calibration evaluates belief quality.
Minimal Conceptual Illustration
High Accuracy, Poor Calibration: Right answers, wrong confidence
Lower Accuracy, Good Calibration: More errors, honest confidence
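The contrast above, sketched with made-up numbers:

```python
# Model A: 90% accurate but always 99% confident (right answers, wrong confidence).
probs_a, labels_a = [0.99] * 10, [1] * 9 + [0]
# Model B: 80% accurate and always 80% confident (more errors, honest confidence).
probs_b, labels_b = [0.80] * 10, [1] * 8 + [0] * 2

def accuracy(probs, labels):
    return sum((p >= 0.5) == (y == 1) for p, y in zip(probs, labels)) / len(labels)

def confidence_gap(probs, labels):
    """Absolute gap between average confidence and accuracy."""
    avg_conf = sum(probs) / len(probs)
    return abs(avg_conf - accuracy(probs, labels))

print(accuracy(probs_a, labels_a), round(confidence_gap(probs_a, labels_a), 2))  # 0.9 0.09
print(accuracy(probs_b, labels_b), round(confidence_gap(probs_b, labels_b), 2))  # 0.8 0.0
```

Model A wins on accuracy, but its stated probabilities overstate its reliability; Model B's probabilities can be taken at face value.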
Relationship Between Calibration and Accuracy
Improving accuracy does not guarantee improved calibration, and vice versa. Optimization objectives often prioritize accuracy or loss minimization, leaving calibration as a secondary concern.
Calibration and accuracy are partially independent axes.
Impact on Decision Thresholding
Accurate but miscalibrated models produce unstable thresholds and inconsistent operating points. Well-calibrated probabilities enable principled threshold selection based on costs and risk tolerance.
Thresholds rely on calibration.
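One way to see the dependence: with calibrated probabilities and known misclassification costs, the optimal threshold has a closed form. The cost values below are hypothetical:

```python
# Hypothetical asymmetric costs.
cost_fp = 1.0   # cost of acting when the true label is 0 (false positive)
cost_fn = 9.0   # cost of not acting when the true label is 1 (false negative)

# If p = P(y = 1 | x) is calibrated, acting is cheaper in expectation when
#   (1 - p) * cost_fp < p * cost_fn,
# which rearranges to the threshold below. With miscalibrated p, the same
# formula systematically over- or under-triggers.
threshold = cost_fp / (cost_fp + cost_fn)
print(threshold)  # 0.1: costly misses push the operating point down
```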
Relationship to Distribution Shift
Under distribution shift:
- accuracy may degrade gradually
- calibration often degrades rapidly
- confidence may become misleading before accuracy drops
Calibration failure is an early warning signal.
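A synthetic sketch of that failure mode; the numbers are fabricated purely for illustration:

```python
def confidence_accuracy_gap(probs, labels):
    """Absolute gap between average confidence and thresholded accuracy."""
    acc = sum((p >= 0.5) == (y == 1) for p, y in zip(probs, labels)) / len(labels)
    conf = sum(max(p, 1 - p) for p in probs) / len(probs)
    return abs(conf - acc)

# In-distribution: 90% confident, 90% correct.
print(round(confidence_accuracy_gap([0.9] * 10, [1] * 9 + [0]), 2))      # 0.0
# After shift: the model stays 90% confident but is only 60% correct.
print(round(confidence_accuracy_gap([0.9] * 10, [1] * 6 + [0] * 4), 2))  # 0.3
```

Monitoring a gap like this on incoming data (where labels eventually arrive) is one simple way to detect shift before accuracy dashboards catch up.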
Evaluation Implications
Robust evaluation should:
- report accuracy and calibration jointly
- analyze confidence–error relationships
- evaluate calibration under shift
- avoid reporting accuracy alone
Accuracy without calibration is incomplete.
Common Pitfalls
- equating high accuracy with trustworthy predictions
- reporting accuracy without calibration analysis
- calibrating models only in-distribution
- assuming softmax probabilities are calibrated
- ignoring calibration after retraining
Confidence must be validated.
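Because raw softmax or sigmoid outputs are not calibrated by default, post-hoc recalibration is common. Below is a sketch of temperature scaling for a binary model; in practice the temperature is fit on a held-out validation set by minimizing NLL, and the value used here is purely illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def temperature_scale(logit, temperature):
    """Divide the logit by T > 1 to soften overconfident probabilities."""
    return sigmoid(logit / temperature)

logit = 4.0                                      # raw pre-sigmoid score
print(round(temperature_scale(logit, 1.0), 3))   # 0.982: raw, overconfident
print(round(temperature_scale(logit, 2.5), 3))   # 0.832: softened
```

Dividing by a positive temperature is monotone, so decisions at the 0.5 threshold (and hence accuracy) are unchanged; only the stated confidence moves. This is why temperature scaling can fix calibration without fixing, or breaking, accuracy.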
Summary Comparison
| Aspect | Accuracy | Calibration |
|---|---|---|
| Measures | Correctness | Confidence reliability |
| Output type | Discrete | Probabilistic |
| Threshold dependence | High | Low |
| Decision relevance | Partial | High |
| Robustness to shift | Moderate | Low |
| Safety relevance | Limited | Critical |
Related Concepts
- Generalization & Evaluation
- Calibration
- Expected Calibration Error (ECE)
- Reliability Diagrams
- Decision Thresholding
- Offline Metrics vs Business Metrics
- Uncertainty Estimation
- Confidence Collapse