## Short Definition
Compute-aware evaluation assesses model performance while explicitly accounting for computational cost, latency, or resource usage.
## Definition
Compute-aware evaluation extends traditional accuracy-focused evaluation by measuring how model performance changes under different compute budgets. It treats computation as a constrained resource and evaluates models across accuracy–cost trade-offs rather than at a single operating point.
Performance is meaningful only relative to cost.
## Why It Matters
In real deployments, models operate under strict constraints:
- latency and tail latency budgets
- throughput requirements
- energy and infrastructure costs
- device limitations
An accurate model that violates budgets is unusable.
## Core Principle
Evaluation shifts from:
“How accurate is the model?”
to:
“How much accuracy do we get for a given compute budget?”
Efficiency becomes part of correctness.
## Minimal Conceptual Illustration

```
Accuracy
   ↑
   │          ●
   │      ●
   │    ●
   │  ●
   └──────────────→ Compute Budget
```
## What Is Measured
Compute-aware evaluation may include:
- accuracy vs FLOPs
- accuracy vs latency
- accuracy vs energy
- accuracy vs depth or activated experts
- accuracy vs throughput
Cost axes must reflect deployment reality.
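Concretely, each axis pairs a quality measurement with a cost measurement per model. The harness below is a minimal sketch of one such pairing, accuracy vs wall-clock latency; the model functions and data are illustrative stand-ins, not real checkpoints:

```python
import time

def measure_operating_point(predict, inputs, labels):
    """Run one model over a dataset; return (accuracy, mean latency in ms)."""
    correct, total_time = 0, 0.0
    for x, y in zip(inputs, labels):
        t0 = time.perf_counter()
        pred = predict(x)
        total_time += time.perf_counter() - t0
        correct += int(pred == y)
    return correct / len(inputs), 1000 * total_time / len(inputs)

# Illustrative stand-ins for models at two operating points.
cheap = lambda x: 0      # trivial predictor: fast but often wrong
big = lambda x: x % 2    # pretend larger model: correct, more work per call

inputs = [1, 2, 3, 4]
labels = [1, 0, 1, 0]
for name, fn in [("cheap", cheap), ("big", big)]:
    acc, ms = measure_operating_point(fn, inputs, labels)
    print(f"{name}: accuracy={acc:.2f}, mean latency={ms:.4f} ms")
```

In practice the same harness sweeps real checkpoints on target hardware; FLOP counters or energy meters slot in where the timer sits.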
## Pareto Frontiers
Models are compared using Pareto frontiers:
- points where no other model is both more accurate and cheaper
- dominated models are discarded
- trade-offs are made explicit
Pareto dominance replaces single scores.
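Given (cost, accuracy) pairs, the frontier is simple to extract. A minimal sketch, with illustrative model names and numbers:

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    A point (cost, acc) is dominated if another point has
    cost <= its cost and acc >= its acc, with at least one strict.
    """
    frontier = []
    for cost, acc, name in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for c, a, _ in points
        )
        if not dominated:
            frontier.append((cost, acc, name))
    # Sort by cost so the frontier reads left to right.
    return sorted(frontier)

# Illustrative candidates: (cost in GFLOPs, accuracy, name).
models = [
    (1.0, 0.70, "tiny"),
    (4.0, 0.78, "small"),
    (4.5, 0.76, "small-v2"),   # dominated by "small": costs more, scores less
    (16.0, 0.83, "base"),
    (64.0, 0.84, "large"),
]
print(pareto_frontier(models))
```

Here "small-v2" drops out because "small" is both cheaper and more accurate; every surviving point represents a genuine trade-off.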
## Relationship to Adaptive Computation
For adaptive models:
- compute varies per input
- average cost is insufficient
- tail cost (p95 / p99) matters
Evaluation must capture variability.
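Percentiles of per-input cost can be read straight off a trace. The synthetic numbers below mimic an early-exit model where most inputs stop at the first exit and a few run the full network:

```python
import statistics

# Synthetic per-input compute (GFLOPs) for a hypothetical early-exit model:
# 90% exit early, 8% go deeper, 2% run the whole network.
per_input_flops = [1.0] * 90 + [4.0] * 8 + [16.0] * 2

mean_cost = statistics.mean(per_input_flops)
# quantiles(..., n=100) yields the 1st..99th percentiles.
q = statistics.quantiles(per_input_flops, n=100)
p95, p99 = q[94], q[98]

print(f"mean={mean_cost:.2f}  p95={p95:.2f}  p99={p99:.2f}")
```

The mean looks cheap, but the tail is an order of magnitude more expensive; a deployment provisioned for the average would miss its latency budget on exactly the inputs that matter.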
## Relationship to Compute-Aware Loss Functions
Compute-aware losses shape training behavior; compute-aware evaluation validates whether the learned trade-offs hold under real inference conditions.
Training intent must match evaluation reality.
## Metrics Commonly Used
- accuracy @ budget
- expected compute
- worst-case compute
- latency percentiles
- energy per prediction
- cost per correct prediction
One metric is never enough.
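Two of these metrics, sketched over hypothetical per-example records of (cost, correct); here cost is read as latency in ms and the numbers are illustrative:

```python
def accuracy_at_budget(records, budget):
    """Accuracy when predictions whose cost exceeds the budget
    are counted as failures (they would miss the deadline)."""
    hits = sum(1 for cost, correct in records if correct and cost <= budget)
    return hits / len(records)

def cost_per_correct(records):
    """Total compute spent divided by the number of correct predictions."""
    total = sum(cost for cost, _ in records)
    correct = sum(1 for _, ok in records if ok)
    return total / correct

# Hypothetical per-example (latency_ms, correct) records.
records = [(8, True), (9, True), (12, True), (30, True), (10, False)]

print(accuracy_at_budget(records, budget=10))  # slow-but-correct examples fail
print(cost_per_correct(records))
```

Note how the two metrics disagree: the slow correct examples help cost-per-correct yet hurt accuracy at a 10 ms budget, which is why one metric is never enough.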
## Inference-Time Alignment
Evaluation must reflect:
- real routing or halting behavior
- production batch sizes
- hardware and runtime constraints
- concurrency and load patterns
Offline benchmarks often lie.
## Robustness Under Budget
Compute-aware evaluation should test:
- behavior when budgets tighten
- degradation under load
- performance under distribution shift at fixed cost
Efficiency failures often emerge under stress.
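One simple stress test treats any prediction that exceeds the budget as a failure, then sweeps the budget downward and checks that accuracy degrades gracefully rather than falling off a cliff. The per-example (cost, correct) records below are hypothetical:

```python
def accuracy_under_budget(records, budget):
    # Over-budget predictions count as failures.
    return sum(1 for cost, ok in records if ok and cost <= budget) / len(records)

# Hypothetical (cost, correct) records for an adaptive model.
records = [(1, True), (1, True), (2, True), (4, True), (8, True), (8, False)]

# Tighten the budget and watch how accuracy degrades.
for budget in (8, 4, 2, 1):
    print(budget, accuracy_under_budget(records, budget))
```

A model whose curve collapses the moment the budget tightens is fragile in exactly the way offline single-budget benchmarks fail to reveal.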
## Failure Modes
Ignoring compute-aware evaluation leads to:
- models that exceed latency budgets
- misleading benchmark wins
- poor real-world performance
- unanticipated cost overruns
Accuracy alone is a false victory.
## Practical Evaluation Guidelines
- define budgets before training
- evaluate across a range of budgets
- include tail-latency metrics
- benchmark on target hardware
- report Pareto curves, not single numbers
Evaluation is a design tool.
## Common Pitfalls
- optimizing average latency only
- using FLOPs as a proxy for real cost
- ignoring variance in adaptive models
- comparing models at different budgets
- reporting only best-case performance
Budgets define success.
## Summary Characteristics
| Aspect | Compute-Aware Evaluation |
|---|---|
| Primary focus | Accuracy–cost trade-offs |
| Key outputs | Pareto frontiers |
| Deployment alignment | High |
| Complexity | Moderate |
| Necessity | Critical for adaptive models |
## Related Concepts
- Generalization & Evaluation
- Compute-Aware Loss Functions
- Adaptive Computation Depth
- Early Exit Networks
- Conditional Computation
- Compute–Data Trade-offs
- Budget-Constrained Inference