Compute-Aware Evaluation

Short Definition

Compute-aware evaluation assesses model performance while explicitly accounting for computational cost, latency, or resource usage.

Definition

Compute-aware evaluation extends traditional accuracy-focused evaluation by measuring how model performance changes under different compute budgets. It treats computation as a constrained resource and evaluates models across accuracy–cost trade-offs rather than at a single operating point.

Performance is meaningful only relative to cost.

Why It Matters

In real deployments, models operate under strict constraints:

  • latency and tail latency budgets
  • throughput requirements
  • energy and infrastructure costs
  • device limitations

An accurate model that violates budgets is unusable.

Core Principle

Evaluation shifts from:

“How accurate is the model?”

to:

“How much accuracy do we get for a given compute budget?”

Efficiency becomes part of correctness.
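The shift from "how accurate" to "how much accuracy per budget" can be sketched as a selection problem. The model names and accuracy/cost numbers below are illustrative assumptions, not benchmark results:

```python
# Illustrative sketch: pick the most accurate model whose cost fits the budget.
# Names and numbers are hypothetical, not measured results.
models = {
    # name: (accuracy, cost in GFLOPs per example)
    "small":  (0.71, 0.5),
    "medium": (0.78, 2.0),
    "large":  (0.82, 8.0),
}

def best_under_budget(models, budget):
    """Return the highest-accuracy model whose cost stays within the budget."""
    feasible = {k: v for k, v in models.items() if v[1] <= budget}
    if not feasible:
        return None  # no model satisfies the budget
    return max(feasible, key=lambda k: feasible[k][0])

print(best_under_budget(models, budget=2.0))  # prints "medium": large exceeds the budget
```

Note that the answer changes with the budget: under a looser budget the large model wins, and under a very tight one no model qualifies at all.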

Minimal Conceptual Illustration

Accuracy
  ↑
  │                    ●
  │              ●
  │        ●
  │    ●
  │ ●
  └─────────────────────→ Compute Budget

What Is Measured

Compute-aware evaluation may include:

  • accuracy vs FLOPs
  • accuracy vs latency
  • accuracy vs energy
  • accuracy vs depth or activated experts
  • accuracy vs throughput

Cost axes must reflect deployment reality.
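Of the axes above, wall-clock latency is the simplest to measure directly. A minimal sketch, using a placeholder function in place of real model inference:

```python
import time

def timed_call(fn, *args):
    """Measure wall-clock latency of a single call; returns (result, seconds)."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start

# Hypothetical stand-in for model inference.
def fake_model(x):
    return sum(i * i for i in range(x))

result, latency = timed_call(fake_model, 10_000)
print(f"latency={latency * 1e3:.3f} ms")
```

In practice, latency should be measured over many calls on the target hardware, since single-call timings are noisy.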

Pareto Frontiers

Models are compared using Pareto frontiers:

  • points where no other model is both more accurate and cheaper
  • dominated models are discarded
  • trade-offs are made explicit

Pareto dominance replaces single scores.
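The dominance rule above translates directly into code. A minimal sketch, with illustrative (cost, accuracy) points:

```python
def pareto_frontier(points):
    """Keep points where no other point is both cheaper and more accurate.

    points: list of (cost, accuracy) tuples.
    """
    frontier = []
    for cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for c, a in points
        )
        if not dominated:
            frontier.append((cost, acc))
    return sorted(frontier)

pts = [(0.5, 0.71), (2.0, 0.78), (8.0, 0.82), (4.0, 0.77)]
print(pareto_frontier(pts))
# → [(0.5, 0.71), (2.0, 0.78), (8.0, 0.82)]
# (4.0, 0.77) is discarded: (2.0, 0.78) is both cheaper and more accurate.
```

Reporting the whole frontier, rather than a single winner, is what makes the trade-offs explicit.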

Relationship to Adaptive Computation

For adaptive models:

  • compute varies per input
  • average cost is insufficient
  • tail cost (p95 / p99) matters

Evaluation must capture variability.
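The gap between average and tail cost can be illustrated with a simulated adaptive model where most inputs exit early and a minority trigger the full network. The cost values and early-exit rate below are assumptions for illustration:

```python
import random

random.seed(0)

# Simulated per-input compute for an adaptive model: ~90% of inputs exit
# early (cost 1.0), the rest run the full network (cost 6.0). Illustrative.
costs = [1.0 if random.random() < 0.9 else 6.0 for _ in range(10_000)]

def percentile(values, q):
    """Nearest-rank percentile, q in [0, 100]."""
    s = sorted(values)
    idx = min(len(s) - 1, int(round(q / 100 * (len(s) - 1))))
    return s[idx]

mean_cost = sum(costs) / len(costs)
print(f"mean={mean_cost:.2f}  p95={percentile(costs, 95):.1f}  "
      f"p99={percentile(costs, 99):.1f}")
```

The mean lands near the cheap path, while p95/p99 sit at the expensive path: a system provisioned for average cost would miss its tail budget.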

Relationship to Compute-Aware Loss Functions

Compute-aware losses shape training behavior; compute-aware evaluation validates whether the learned trade-offs hold under real inference conditions.

Training intent must match evaluation reality.

Metrics Commonly Used

  • accuracy @ budget
  • expected compute
  • worst-case compute
  • latency percentiles
  • energy per prediction
  • cost per correct prediction

One metric is never enough.
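Most of these metrics are simple ratios once per-example costs and correctness are logged. A sketch of the last one, cost per correct prediction, with hypothetical inputs:

```python
def cost_per_correct(correct_flags, costs):
    """Total compute spent divided by the number of correct predictions."""
    n_correct = sum(correct_flags)
    if n_correct == 0:
        return float("inf")  # all compute wasted: no correct predictions
    return sum(costs) / n_correct

# Illustrative log: four predictions, one wrong (and expensive).
flags = [True, True, False, True]
costs = [1.0, 1.0, 4.0, 1.0]
print(cost_per_correct(flags, costs))  # 7.0 / 3 ≈ 2.33
```

This metric penalizes models that spend heavily on inputs they get wrong, which accuracy and average cost each miss on their own.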

Inference-Time Alignment

Evaluation must reflect:

  • real routing or halting behavior
  • production batch sizes
  • hardware and runtime constraints
  • concurrency and load patterns

Offline benchmarks often lie.

Robustness Under Budget

Compute-aware evaluation should test:

  • behavior when budgets tighten
  • degradation under load
  • performance under distribution shift at fixed cost

Efficiency failures often emerge under stress.
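Budget-tightening behavior can be probed with a sweep: under a hard per-example cap, predictions whose compute exceeds the cap count as failures. The example log below is a hypothetical sketch, not measured data:

```python
# Hypothetical per-example log: (compute cost, was the prediction correct?).
examples = [
    (1.0, True), (1.0, True), (2.0, True), (4.0, True),
    (1.0, False), (6.0, True),
]

def accuracy_at_cap(examples, cap):
    """Accuracy when predictions exceeding the compute cap count as wrong."""
    return sum(ok and cost <= cap for cost, ok in examples) / len(examples)

# Sweep from a loose budget to a tight one.
for cap in (8.0, 4.0, 2.0, 1.0):
    print(f"cap={cap}: accuracy={accuracy_at_cap(examples, cap):.2f}")
```

A model whose accuracy collapses as the cap tightens is fragile under load, even if its unconstrained accuracy looks strong.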

Failure Modes

Ignoring compute-aware evaluation leads to:

  • models that exceed latency budgets
  • misleading benchmark wins
  • poor real-world performance
  • unanticipated cost overruns

Accuracy alone is a false victory.

Practical Evaluation Guidelines

  • define budgets before training
  • evaluate across a range of budgets
  • include tail-latency metrics
  • benchmark on target hardware
  • report Pareto curves, not single numbers

Evaluation is a design tool.

Common Pitfalls

  • optimizing average latency only
  • using FLOPs as a proxy for real cost
  • ignoring variance in adaptive models
  • comparing models at different budgets
  • reporting only best-case performance

Budgets define success.

Summary Characteristics

Aspect                  Compute-Aware Evaluation
Primary focus           Accuracy–cost trade-offs
Key outputs             Pareto frontiers
Deployment alignment    High
Complexity              Moderate
Necessity               Critical for adaptive models

Related Concepts

  • Generalization & Evaluation
  • Compute-Aware Loss Functions
  • Adaptive Computation Depth
  • Early Exit Networks
  • Conditional Computation
  • Compute–Data Trade-offs
  • Budget-Constrained Inference