Compute-Aware Evaluation

Short Definition

Compute-aware evaluation assesses model performance while explicitly accounting for computational cost, latency, or resource usage.

Definition

Compute-aware evaluation extends traditional accuracy-focused evaluation by measuring how model performance changes under different compute budgets. It treats computation as a constrained resource and evaluates models across accuracy–cost trade-offs rather than at a single operating point.

Performance is meaningful only relative to cost.

Why It Matters

In real deployments, models operate under strict constraints:

  • latency and tail latency budgets
  • throughput requirements
  • energy and infrastructure costs
  • device limitations

An accurate model that violates budgets is unusable.

Core Principle

Evaluation shifts from:

“How accurate is the model?”

to:

“How much accuracy do we get for a given compute budget?”

Efficiency becomes part of correctness.
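The shift from "how accurate" to "how much accuracy per budget" can be sketched as a selection problem. The model names and accuracy/cost numbers below are illustrative assumptions, not benchmark results:

```python
# Illustrative sketch: pick the most accurate model whose cost fits the budget.
# Names and numbers are hypothetical, not measured results.
models = {
    # name: (accuracy, cost in GFLOPs per example)
    "small":  (0.71, 0.5),
    "medium": (0.78, 2.0),
    "large":  (0.82, 8.0),
}

def best_under_budget(models, budget):
    """Return the highest-accuracy model whose cost stays within the budget."""
    feasible = {k: v for k, v in models.items() if v[1] <= budget}
    if not feasible:
        return None  # no model satisfies the budget
    return max(feasible, key=lambda k: feasible[k][0])

print(best_under_budget(models, budget=2.0))  # prints "medium": large exceeds the budget
```

Note that the answer changes with the budget: under a looser budget the large model wins, and under a very tight one no model qualifies at all.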

Minimal Conceptual Illustration

Accuracy
  ↑
  │                    ●
  │              ●
  │        ●
  │    ●
  │ ●
  └─────────────────────→ Compute Budget

What Is Measured

Compute-aware evaluation may include:

  • accuracy vs FLOPs
  • accuracy vs latency
  • accuracy vs energy
  • accuracy vs depth or activated experts
  • accuracy vs throughput

Cost axes must reflect deployment reality.
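Of the axes above, wall-clock latency is the simplest to measure directly. A minimal sketch, using a placeholder function in place of real model inference:

```python
import time

def timed_call(fn, *args):
    """Measure wall-clock latency of a single call; returns (result, seconds)."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start

# Hypothetical stand-in for model inference.
def fake_model(x):
    return sum(i * i for i in range(x))

result, latency = timed_call(fake_model, 10_000)
print(f"latency={latency * 1e3:.3f} ms")
```

In practice, latency should be measured over many calls on the target hardware, since single-call timings are noisy.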

Pareto Frontiers

Models are compared using Pareto frontiers:

  • points where no other model is both more accurate and cheaper
  • dominated models are discarded
  • trade-offs are made explicit

Pareto dominance replaces single scores.
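The dominance rule above translates directly into code. A minimal sketch, with illustrative (cost, accuracy) points:

```python
def pareto_frontier(points):
    """Keep points where no other point is both cheaper and more accurate.

    points: list of (cost, accuracy) tuples.
    """
    frontier = []
    for cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for c, a in points
        )
        if not dominated:
            frontier.append((cost, acc))
    return sorted(frontier)

pts = [(0.5, 0.71), (2.0, 0.78), (8.0, 0.82), (4.0, 0.77)]
print(pareto_frontier(pts))
# → [(0.5, 0.71), (2.0, 0.78), (8.0, 0.82)]
# (4.0, 0.77) is discarded: (2.0, 0.78) is both cheaper and more accurate.
```

Reporting the whole frontier, rather than a single winner, is what makes the trade-offs explicit.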

Relationship to Adaptive Computation

For adaptive models:

  • compute varies per input
  • average cost is insufficient
  • tail cost (p95 / p99) matters

Evaluation must capture variability.
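The gap between average and tail cost can be illustrated with a simulated adaptive model where most inputs exit early and a minority trigger the full network. The cost values and early-exit rate below are assumptions for illustration:

```python
import random

random.seed(0)

# Simulated per-input compute for an adaptive model: ~90% of inputs exit
# early (cost 1.0), the rest run the full network (cost 6.0). Illustrative.
costs = [1.0 if random.random() < 0.9 else 6.0 for _ in range(10_000)]

def percentile(values, q):
    """Nearest-rank percentile, q in [0, 100]."""
    s = sorted(values)
    idx = min(len(s) - 1, int(round(q / 100 * (len(s) - 1))))
    return s[idx]

mean_cost = sum(costs) / len(costs)
print(f"mean={mean_cost:.2f}  p95={percentile(costs, 95):.1f}  "
      f"p99={percentile(costs, 99):.1f}")
```

The mean lands near the cheap path, while p95/p99 sit at the expensive path: a system provisioned for average cost would miss its tail budget.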

Relationship to Compute-Aware Loss Functions

Compute-aware losses shape training behavior; compute-aware evaluation validates whether the learned trade-offs hold under real inference conditions.

Training intent must match evaluation reality.

Metrics Commonly Used

  • accuracy @ budget
  • expected compute
  • worst-case compute
  • latency percentiles
  • energy per prediction
  • cost per correct prediction

One metric is never enough.
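Most of these metrics are simple ratios once per-example costs and correctness are logged. A sketch of the last one, cost per correct prediction, with hypothetical inputs:

```python
def cost_per_correct(correct_flags, costs):
    """Total compute spent divided by the number of correct predictions."""
    n_correct = sum(correct_flags)
    if n_correct == 0:
        return float("inf")  # all compute wasted: no correct predictions
    return sum(costs) / n_correct

# Illustrative log: four predictions, one wrong (and expensive).
flags = [True, True, False, True]
costs = [1.0, 1.0, 4.0, 1.0]
print(cost_per_correct(flags, costs))  # 7.0 / 3 ≈ 2.33
```

This metric penalizes models that spend heavily on inputs they get wrong, which accuracy and average cost each miss on their own.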

Inference-Time Alignment

Evaluation must reflect:

  • real routing or halting behavior
  • production batch sizes
  • hardware and runtime constraints
  • concurrency and load patterns

Offline benchmarks often lie.

Robustness Under Budget

Compute-aware evaluation should test:

  • behavior when budgets tighten
  • degradation under load
  • performance under distribution shift at fixed cost

Efficiency failures often emerge under stress.
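Budget-tightening behavior can be probed with a sweep: under a hard per-example cap, predictions whose compute exceeds the cap count as failures. The example log below is a hypothetical sketch, not measured data:

```python
# Hypothetical per-example log: (compute cost, was the prediction correct?).
examples = [
    (1.0, True), (1.0, True), (2.0, True), (4.0, True),
    (1.0, False), (6.0, True),
]

def accuracy_at_cap(examples, cap):
    """Accuracy when predictions exceeding the compute cap count as wrong."""
    return sum(ok and cost <= cap for cost, ok in examples) / len(examples)

# Sweep from a loose budget to a tight one.
for cap in (8.0, 4.0, 2.0, 1.0):
    print(f"cap={cap}: accuracy={accuracy_at_cap(examples, cap):.2f}")
```

A model whose accuracy collapses as the cap tightens is fragile under load, even if its unconstrained accuracy looks strong.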

Failure Modes

Ignoring compute-aware evaluation leads to:

  • models that exceed latency budgets
  • misleading benchmark wins
  • poor real-world performance
  • unanticipated cost overruns

Accuracy alone is a false victory.

Practical Evaluation Guidelines

  • define budgets before training
  • evaluate across a range of budgets
  • include tail-latency metrics
  • benchmark on target hardware
  • report Pareto curves, not single numbers

Evaluation is a design tool.

Common Pitfalls

  • optimizing average latency only
  • using FLOPs as a proxy for real cost
  • ignoring variance in adaptive models
  • comparing models at different budgets
  • reporting only best-case performance

Budgets define success.

Summary Characteristics

Aspect                  Compute-Aware Evaluation
Primary focus           Accuracy–cost trade-offs
Key outputs             Pareto frontiers
Deployment alignment    High
Complexity              Moderate
Necessity               Critical for adaptive models

Related Concepts

  • Generalization & Evaluation
  • Compute-Aware Loss Functions
  • Adaptive Computation Depth
  • Early Exit Networks
  • Conditional Computation
  • Compute–Data Trade-offs
  • Budget-Constrained Inference