Calibration vs Accuracy

Short Definition

Accuracy measures how often predictions are correct, while calibration measures how well predicted probabilities reflect true outcome frequencies.

Definition

Accuracy quantifies the proportion of correct predictions made by a model, typically at a chosen decision threshold.
Calibration assesses whether a model’s predicted probabilities correspond to empirical correctness—for example, whether predictions made with 80% confidence are correct about 80% of the time.

Accuracy measures correctness; calibration measures confidence reliability.
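The distinction can be made concrete with a small sketch. The numbers below are invented for illustration, not drawn from any real model: the same set of predictions is scored once for accuracy (thresholded correctness) and once for calibration (observed frequency at one confidence level).

```python
# Toy binary-classification example; all numbers are illustrative.
probs  = [0.9, 0.8, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.9, 0.1]  # predicted P(y=1)
labels = [1,   1,   0,   1,   0,   0,   0,   0,   1,   0]     # true outcomes

# Accuracy: correctness of decisions at a 0.5 threshold.
preds = [1 if p >= 0.5 else 0 for p in probs]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Calibration check at one confidence level: among predictions made
# with 0.8 confidence, how often does the positive class actually occur?
bucket = [y for p, y in zip(probs, labels) if abs(p - 0.8) < 1e-9]
empirical = sum(bucket) / len(bucket)

print(accuracy)   # 0.8: eight of ten thresholded decisions are correct
print(empirical)  # 0.5: the 0.8-confidence predictions are right half the time
```

Here accuracy looks healthy (0.8), yet the 0.8-confidence bucket is only 50% correct, so the probabilities themselves are unreliable.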

Why This Distinction Matters

A model can be highly accurate yet poorly calibrated, producing overconfident or underconfident predictions. In real-world systems where probabilities drive decisions, costs, or risk controls, poor calibration can be more harmful than modest accuracy loss.

Correct answers with wrong confidence are dangerous.

Accuracy

Accuracy focuses on:

  • discrete correctness
  • thresholded decisions
  • average-case performance
  • benchmark comparison

Strengths of Accuracy

  • simple and intuitive
  • widely reported and comparable
  • useful for baseline evaluation
  • efficient for early model selection

Limitations of Accuracy

  • ignores confidence information
  • sensitive to class imbalance
  • threshold-dependent
  • misaligned with decision costs
  • blind to uncertainty failures

Accuracy compresses prediction quality into a single bit.

Calibration

Calibration focuses on:

  • probability correctness
  • confidence–outcome alignment
  • reliability across confidence levels
  • decision support validity

Common Calibration Measures

  • reliability diagrams
  • Expected Calibration Error (ECE)
  • Brier score
  • negative log-likelihood (NLL)
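Two of the measures above can be sketched in a few lines. This is a minimal version using equal-width bins and the positive-class probability P(y=1); real implementations vary in binning scheme and in whether they bin on the predicted-class confidence instead.

```python
def brier_score(probs, labels):
    """Brier score: mean squared error between predicted
    probability and the 0/1 outcome. Lower is better."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def ece(probs, labels, n_bins=10):
    """Expected Calibration Error: the average gap between mean
    confidence and observed frequency per bin, weighted by bin size."""
    total, n = 0.0, len(probs)
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [(p, y) for p, y in zip(probs, labels)
                  if lo < p <= hi or (b == 0 and p == 0.0)]
        if not in_bin:
            continue
        conf = sum(p for p, _ in in_bin) / len(in_bin)
        freq = sum(y for _, y in in_bin) / len(in_bin)
        total += (len(in_bin) / n) * abs(conf - freq)
    return total
```

A perfectly calibrated set of predictions (e.g. ten predictions at 0.7 confidence of which exactly seven are positive) yields an ECE of zero, while its Brier score stays nonzero because the Brier score also penalizes uncertainty itself.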

Strengths of Calibration

  • enables risk-aware decisions
  • supports threshold tuning
  • improves trust and interpretability
  • critical for safety-sensitive systems

Limitations of Calibration

  • distribution-dependent
  • can degrade under shift
  • may mask accuracy deficiencies
  • harder to optimize directly

Calibration evaluates belief quality.

Minimal Conceptual Illustration

High Accuracy, Poor Calibration: right answers, wrong confidence
Lower Accuracy, Good Calibration: more errors, honest confidence
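The two regimes above can be given invented numbers. Model A below is correct 90% of the time but always claims 99% confidence; model B is correct only 70% of the time and says exactly that. The confidence-minus-frequency gap captures the difference.

```python
# Illustrative numbers only: neither model is real.
probs_a  = [0.99] * 10          # model A: always 99% confident
labels_a = [1] * 9 + [0]        # ...but correct 9 times out of 10
probs_b  = [0.7] * 10           # model B: 70% confident
labels_b = [1] * 7 + [0] * 3    # ...and correct exactly 7 times out of 10

# Gap between average stated confidence and observed frequency.
gap_a = abs(sum(probs_a) / 10 - sum(labels_a) / 10)
gap_b = abs(sum(probs_b) / 10 - sum(labels_b) / 10)

print(gap_a)  # ~0.09: overconfident despite higher accuracy
print(gap_b)  # ~0.0:  honest confidence despite more errors
```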

Relationship Between Calibration and Accuracy

Improving accuracy does not guarantee improved calibration, and vice versa. Optimization objectives often prioritize accuracy or loss minimization, leaving calibration as a secondary concern.

Calibration and accuracy are partially independent axes.

Impact on Decision Thresholding

Accurate but miscalibrated models produce unstable thresholds and inconsistent operating points. Well-calibrated probabilities enable principled threshold selection based on costs and risk tolerance.

Thresholds rely on calibration.
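When probabilities are calibrated, the cost-minimizing threshold follows directly from the costs of the two error types: predicting positive has expected cost (1 − p)·C_FP, predicting negative has expected cost p·C_FN, so acting at p > C_FP / (C_FP + C_FN) minimizes expected cost. The cost values below are illustrative assumptions.

```python
def optimal_threshold(cost_fp, cost_fn):
    """Expected-cost-minimizing decision threshold for a calibrated
    probability p: predict positive when p > cost_fp / (cost_fp + cost_fn).
    Derivation: (1 - p) * cost_fp < p * cost_fn  =>  p > cost_fp / (cost_fp + cost_fn)."""
    return cost_fp / (cost_fp + cost_fn)

# Example: a missed positive (FN) is 9x as costly as a false alarm (FP),
# so a calibrated model should act at any probability above 0.1.
t = optimal_threshold(cost_fp=1.0, cost_fn=9.0)
print(t)  # 0.1
```

If the probabilities are miscalibrated, this formula still returns a threshold, but the resulting operating point no longer minimizes real cost, which is why miscalibrated models yield unstable thresholds.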

Relationship to Distribution Shift

Under distribution shift:

  • accuracy may degrade gradually
  • calibration often degrades rapidly
  • confidence may become misleading before accuracy drops

Calibration failure is an early warning signal.

Evaluation Implications

Robust evaluation should:

  • report accuracy and calibration jointly
  • analyze confidence–error relationships
  • evaluate calibration under shift
  • avoid reporting accuracy alone

Accuracy without calibration is incomplete.

Common Pitfalls

  • equating high accuracy with trustworthy predictions
  • reporting accuracy without calibration analysis
  • calibrating models only in-distribution
  • assuming softmax probabilities are calibrated
  • ignoring calibration after retraining

Confidence must be validated.

Summary Comparison

Aspect                 Accuracy      Calibration
Measures               Correctness   Confidence reliability
Output type            Discrete      Probabilistic
Threshold dependence   High          Low
Decision relevance     Partial       High
Robustness to shift    Moderate      Low
Safety relevance       Limited       Critical

Related Concepts

  • Generalization & Evaluation
  • Calibration
  • Expected Calibration Error (ECE)
  • Reliability Diagrams
  • Decision Thresholding
  • Offline Metrics vs Business Metrics
  • Uncertainty Estimation
  • Confidence Collapse