Short Definition
Calibration vs Raw Accuracy contrasts two evaluation dimensions of machine learning models: Raw Accuracy measures how often predictions are correct, while Calibration measures how well predicted probabilities reflect true outcome frequencies.
A model can be accurate but poorly calibrated.
Definition
In classification tasks, models output:
- A predicted class
- A confidence score (probability)
Two distinct properties matter:
Raw Accuracy
[
\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}
]
Accuracy evaluates correctness of final decisions.
It does not assess whether confidence scores are trustworthy.
Calibration
A model is calibrated if:
[
\mathbb{P}(Y = \hat{Y} \mid \hat{p} = p) = p
]
Meaning:
If the model predicts 70% confidence,
it should be correct about 70% of the time.
Calibration evaluates probability reliability.
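This definition can be checked empirically on simulated predictions. A minimal sketch (all numbers are synthetic, for illustration only): among predictions whose confidence is near 0.70, roughly 70% should be correct.

```python
import numpy as np

# Synthetic predictions: confidence scores and correctness indicators.
rng = np.random.default_rng(0)
confidence = rng.uniform(0.5, 1.0, size=10_000)
# Simulate a well-calibrated model: correct with probability equal to its confidence.
correct = rng.uniform(size=confidence.size) < confidence

# Check the definition empirically: among predictions near 70% confidence,
# the empirical accuracy should be close to 0.70.
mask = np.abs(confidence - 0.70) < 0.05
print(round(correct[mask].mean(), 2))  # should be close to 0.70
```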
Core Difference
| Aspect | Raw Accuracy | Calibration |
|---|---|---|
| Measures | Correctness | Confidence reliability |
| Concern | Classification outcome | Probability trustworthiness |
| Sensitive to | Decision threshold | Confidence distribution |
| Risk if poor | Wrong answers | Misleading confidence |
Accuracy answers: “Was it right?”
Calibration answers: “Can I trust its confidence?”
Minimal Conceptual Illustration
```text
Model A:
  Accuracy = 90%
  Confidence always = 99%
  → Overconfident.

Model B:
  Accuracy = 90%
  Confidence ≈ true correctness rate
  → Well calibrated.

Both equally accurate.
Only one trustworthy.
```
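The difference is visible in a proper scoring rule such as the Brier score, which penalizes confident errors. A minimal sketch using the hypothetical 90%-accuracy numbers from the illustration above:

```python
import numpy as np

# 1,000 predictions, 90% correct for both models (hypothetical numbers).
y = np.array([1] * 900 + [0] * 100)   # 1 = prediction was correct
p_a = np.full(1000, 0.99)             # Model A: always 99% confident
p_b = np.full(1000, 0.90)             # Model B: confidence matches accuracy

def brier(p, y):
    """Mean squared error between confidence and correctness."""
    return np.mean((p - y) ** 2)

print(brier(p_a, y))  # ≈ 0.0981 — penalized for overconfident errors
print(brier(p_b, y))  # ≈ 0.0900 — confidence matches the true error rate
```

Both models are equally accurate, but Model B's lower Brier score reflects its more trustworthy probabilities.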
Why Accuracy Is Not Enough
High accuracy does not guarantee:
- Reliable uncertainty estimates
- Safe deployment decisions
- Appropriate thresholding
In safety-critical systems, confidence calibration matters as much as correctness.
Common Miscalibration Patterns
- Overconfidence
  - Predicted probabilities are too high.
  - Common in deep neural networks.
- Underconfidence
  - Predictions are overly cautious.
- Confidence collapse
  - Overconfidence that persists under distribution shift.
Modern neural networks tend to be overconfident.
Measuring Calibration
Common metrics:
- Expected Calibration Error (ECE)
- Maximum Calibration Error (MCE)
- Brier Score
- Reliability Diagrams
Calibration is typically evaluated via probability binning.
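A minimal sketch of binned ECE, assuming top-label confidence scores and correctness indicators as inputs (variable names are illustrative, not a standard API):

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: bin-weighted average gap between mean confidence and accuracy."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# The overconfident model from earlier: 99% confident, 90% correct.
conf = np.full(1000, 0.99)
corr = np.array([1] * 900 + [0] * 100)
print(round(expected_calibration_error(conf, corr), 3))  # ≈ 0.09
```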
Trade-Off Dynamics
Improving accuracy does not necessarily improve calibration.
In fact:
- Larger models often increase confidence.
- Scaling can worsen miscalibration.
- Regularization affects calibration behavior.
Calibration requires explicit evaluation.
Thresholding Implications
Many systems use a confidence threshold: accept a prediction only if p > τ.
If poorly calibrated:
- High confidence may be misleading.
- Risk-sensitive decisions become unsafe.
Calibration affects operating point selection.
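A small illustration of how thresholding fails under overconfidence, assuming a synthetic model that reports 0.99 confidence but is right only 90% of the time:

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 0.95  # hypothetical acceptance threshold

# Overconfident model: reports 0.99 but is correct only 90% of the time.
confidence = np.full(10_000, 0.99)
correct = rng.uniform(size=10_000) < 0.90

accepted = confidence > tau
# Every prediction clears the threshold, yet ~10% of accepted ones are wrong,
# far above the ~5% error rate the threshold was meant to guarantee.
print(accepted.mean(), round(correct[accepted].mean(), 2))
```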
Distribution Shift Effects
Under distribution shift:
- Accuracy often drops.
- Confidence may remain high.
- Miscalibration increases.
Robust systems must monitor calibration drift.
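One simple drift signal is the gap between average confidence and accuracy on a monitoring window. A sketch on synthetic data (the numbers and function name are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def conf_acc_gap(confidence, correct):
    """Average confidence minus accuracy; a crude miscalibration signal."""
    return confidence.mean() - correct.mean()

# In-distribution window: confidence matches accuracy (both ~0.90).
conf_id = np.full(5000, 0.90)
corr_id = rng.uniform(size=5000) < 0.90
# Post-shift window: confidence stays high while accuracy drops to ~0.70.
conf_ood = np.full(5000, 0.90)
corr_ood = rng.uniform(size=5000) < 0.70

gap_id = conf_acc_gap(conf_id, corr_id)
gap_ood = conf_acc_gap(conf_ood, corr_ood)
print(round(gap_id, 2))   # near 0.0
print(round(gap_ood, 2))  # ~0.2 — flag calibration drift
```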
Alignment Perspective
Poor calibration can lead to:
- Overconfident hallucinations.
- Misleading outputs.
- Unsafe automation decisions.
Alignment requires not only correctness,
but truthful confidence estimation.
Calibration is central to trustworthy AI.
Governance Perspective
In regulated domains:
- Medical diagnosis
- Autonomous systems
- Financial decision systems
Confidence must reflect risk.
Policies often require:
- Probability reliability auditing
- Ongoing calibration monitoring
- Threshold governance
Calibration is part of risk management.
Improving Calibration
Common techniques:
- Temperature Scaling
- Label Smoothing
- Mixup
- Bayesian methods
- Ensemble averaging
Post-hoc methods such as temperature scaling are applied after training; label smoothing and mixup act during training.
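Temperature scaling in a nutshell: divide the logits by a scalar T > 1 before the softmax to soften overconfident probabilities. In practice T is fit on held-out validation data by minimizing negative log-likelihood; the logits below are made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits for a single 3-class prediction.
logits = np.array([[4.0, 1.0, 0.5]])

for T in (1.0, 2.0):  # T > 1 softens the distribution; ranking is unchanged
    p_max = softmax(logits / T).max()
    print(f"T={T}: top-class probability = {p_max:.2f}")
```

Note that scaling all logits by the same T never changes the predicted class, so accuracy is untouched while confidence is adjusted.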
Summary
Raw Accuracy:
- Measures correctness.
- Ignores probability reliability.
Calibration:
- Measures trustworthiness of confidence.
- Critical for risk-sensitive decisions.
High accuracy without calibration can be dangerous.
Reliable AI systems require both.
Related Concepts
- Calibration
- Reliability Diagrams
- Expected Calibration Error (ECE)
- Decision Thresholding
- Operating Point Selection
- Confidence Collapse
- Distribution Shift
- Uncertainty Estimation