Calibration vs Raw Accuracy

Short Definition

Calibration vs Raw Accuracy contrasts two evaluation dimensions of machine learning models: Raw Accuracy measures how often predictions are correct, while Calibration measures how well predicted probabilities reflect true outcome frequencies.

A model can be accurate but poorly calibrated.

Definition

In classification tasks, models output:

  • A predicted class
  • A confidence score (probability)

Two distinct properties matter:

Raw Accuracy

$$
\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}
$$

Accuracy evaluates correctness of final decisions.

It does not assess whether confidence scores are trustworthy.
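
As a minimal numeric sketch (with hypothetical prediction arrays):

```python
import numpy as np

# Hypothetical predicted labels and ground-truth labels.
y_pred = np.array([1, 0, 1, 1, 0, 1])
y_true = np.array([1, 0, 0, 1, 0, 1])

# Accuracy = correct predictions / total predictions.
accuracy = np.mean(y_pred == y_true)
print(f"Accuracy: {accuracy:.2f}")  # 0.83
```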

Calibration

A model is calibrated if:

$$
\mathbb{P}(Y = \hat{Y} \mid \hat{p} = p) = p
$$

Meaning:

If the model predicts with 70% confidence,
it should be correct about 70% of the time.

Calibration evaluates probability reliability.
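
A minimal sketch of this condition, using hypothetical arrays of per-prediction confidence and correctness: among predictions made with roughly 70% confidence, a calibrated model is correct roughly 70% of the time.

```python
import numpy as np

# Hypothetical confidences (probability of the predicted class) and correctness flags.
confidence = np.array([0.71, 0.69, 0.72, 0.70, 0.68, 0.73, 0.70, 0.71, 0.69, 0.70])
correct    = np.array([1,    1,    0,    1,    1,    1,    0,    1,    0,    1])

# Restrict to predictions made with ~70% confidence and compare to observed accuracy.
mask = (confidence >= 0.65) & (confidence < 0.75)
print(f"Mean confidence: {confidence[mask].mean():.2f}")  # ~0.70
print(f"Empirical accuracy: {correct[mask].mean():.2f}")  # ~0.70 → calibrated on this band
```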

Core Difference

| Aspect | Raw Accuracy | Calibration |
|---|---|---|
| Measures | Correctness | Confidence reliability |
| Concern | Classification outcome | Probability trustworthiness |
| Sensitive to | Decision threshold | Confidence distribution |
| Risk if poor | Wrong answers | Misleading confidence |

Accuracy answers: “Was it right?”
Calibration answers: “Can I trust its confidence?”

Minimal Conceptual Illustration

```text
Model A:
Accuracy = 90%
Confidence always = 99%
→ Overconfident.

Model B:
Accuracy = 90%
Confidence ≈ true correctness rate
→ Well calibrated.
```

Both equally accurate.
Only one trustworthy.

Why Accuracy Is Not Enough

High accuracy does not guarantee:

  • Reliable uncertainty estimates
  • Safe deployment decisions
  • Appropriate thresholding

In safety-critical systems, confidence calibration matters as much as correctness.

Common Miscalibration Patterns

  1. Overconfidence
    • Predicted probabilities too high.
    • Common in deep neural networks.
  2. Underconfidence
    • Predictions overly cautious.
  3. Confidence collapse
    • Overconfidence under distribution shift.

Modern neural networks tend to be overconfident.

Measuring Calibration

Common metrics:

  • Expected Calibration Error (ECE)
  • Maximum Calibration Error (MCE)
  • Brier Score
  • Reliability Diagrams

Calibration is typically evaluated via probability binning.
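
As an illustration, here is a minimal binned-ECE sketch (hypothetical function and data, not a reference implementation):

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Binned ECE: weighted gap between mean confidence and accuracy per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(confidence[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Overconfident model: always 99% confident but only ~90% correct.
rng = np.random.default_rng(0)
confidence = np.full(1000, 0.99)
correct = (rng.random(1000) < 0.90).astype(float)
print(f"ECE ≈ {expected_calibration_error(confidence, correct):.3f}")  # roughly 0.09
```

Reliability diagrams plot the same per-bin quantities (mean confidence vs. observed accuracy) instead of aggregating them into a single number.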

Trade-Off Dynamics

Improving accuracy does not necessarily improve calibration.

In fact:

  • Larger models often increase confidence.
  • Scaling can worsen miscalibration.
  • Regularization affects calibration behavior.

Calibration requires explicit evaluation.

Thresholding Implications

Many systems use confidence thresholds:

$$
\text{Accept prediction if } p > \tau
$$

If poorly calibrated:

  • High confidence may be misleading.
  • Risk-sensitive decisions become unsafe.

Calibration affects operating point selection.
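
A small sketch of why, using hypothetical data in which an overconfident model and a well-calibrated model have the same accuracy but behave very differently under a threshold τ:

```python
import numpy as np

def accepted_error_rate(confidence, correct, tau):
    """Error rate among the predictions the system accepts (confidence > tau)."""
    accepted = confidence > tau
    return float("nan") if not accepted.any() else 1.0 - correct[accepted].mean()

rng = np.random.default_rng(1)
correct = (rng.random(5000) < 0.90).astype(float)   # both models are ~90% accurate

overconfident = np.full(5000, 0.99)                  # confidence unrelated to correctness
calibrated = np.where(correct == 1, 0.95, 0.50)      # idealized: confidence tracks correctness

tau = 0.9
print(accepted_error_rate(overconfident, correct, tau))  # ~0.10: the threshold filters nothing
print(accepted_error_rate(calibrated, correct, tau))     # ~0.00: the threshold screens out risky cases
```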

Distribution Shift Effects

Under distribution shift:

  • Accuracy often drops.
  • Confidence may remain high.
  • Miscalibration increases.

Robust systems must monitor calibration drift.
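
One way to watch for this, sketched here with hypothetical monitoring windows, is to track the gap between mean confidence and observed accuracy over time:

```python
import numpy as np

def calibration_gap(confidence, correct):
    """Coarse drift signal: mean confidence minus observed accuracy.
    A growing positive gap suggests increasing overconfidence."""
    return confidence.mean() - correct.mean()

rng = np.random.default_rng(2)
# Reference window: ~90% accurate; production window after shift: ~75% accurate.
ref_conf, ref_corr = np.full(1000, 0.92), (rng.random(1000) < 0.90).astype(float)
new_conf, new_corr = np.full(1000, 0.92), (rng.random(1000) < 0.75).astype(float)

print(f"reference gap:  {calibration_gap(ref_conf, ref_corr):+.2f}")   # ≈ +0.02
print(f"production gap: {calibration_gap(new_conf, new_corr):+.2f}")   # ≈ +0.17
```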

Alignment Perspective

Poor calibration can lead to:

  • Overconfident hallucinations.
  • Misleading outputs.
  • Unsafe automation decisions.

Alignment requires not only correctness,
but truthful confidence estimation.

Calibration is central to trustworthy AI.


Governance Perspective

In regulated domains:

  • Medical diagnosis
  • Autonomous systems
  • Financial decision systems

Confidence must reflect risk.

Policies often require:

  • Probability reliability auditing
  • Ongoing calibration monitoring
  • Threshold governance

Calibration is part of risk management.

Improving Calibration

Common techniques:

  • Temperature Scaling
  • Label Smoothing
  • Mixup
  • Bayesian methods
  • Ensemble averaging

Post-hoc methods such as temperature scaling are applied after training on held-out data; others, such as label smoothing and mixup, act during training.
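
As a sketch of the post-hoc approach: temperature scaling fits a single scalar T on held-out logits and divides logits by T at inference. A numpy-only grid search is shown below for illustration; practical implementations usually optimize T with a gradient-based method.

```python
import numpy as np

def scaled_nll(logits, labels, T):
    """Negative log-likelihood of temperature-scaled softmax probabilities."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                     # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(logits, labels, grid=np.linspace(0.1, 5.0, 50)):
    """Pick the T that minimizes held-out NLL; T > 1 softens overconfident predictions."""
    return min(grid, key=lambda T: scaled_nll(logits, labels, T))
```

Because dividing logits by a positive scalar does not change the argmax, temperature scaling changes confidence but leaves accuracy unchanged.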

Summary

Raw Accuracy:

  • Measures correctness.
  • Ignores probability reliability.

Calibration:

  • Measures trustworthiness of confidence.
  • Critical for risk-sensitive decisions.

High accuracy without calibration can be dangerous.

Reliable AI systems require both.

Related Concepts

  • Calibration
  • Reliability Diagrams
  • Expected Calibration Error (ECE)
  • Decision Thresholding
  • Operating Point Selection
  • Confidence Collapse
  • Distribution Shift
  • Uncertainty Estimation