Short Definition
Throughput vs latency describes the trade-off between how many requests a system can process per unit time and how long each individual request takes to complete.
Definition
Throughput measures system capacity (e.g., requests per second), while latency measures per-request response time. In ML inference systems, optimizing one often degrades the other due to batching, queueing, and resource contention.
Fast systems are not always responsive systems.
Why It Matters
Production ML systems must satisfy both:
- latency constraints (user experience, SLAs)
- throughput requirements (cost efficiency, scale)
Ignoring the trade-off leads to SLA violations or runaway costs.
Core Trade-off
- Increasing throughput typically requires batching and parallelism
- Batching increases queueing time, raising latency
- Reducing latency often limits batching, reducing throughput
Capacity and responsiveness compete.
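The competition can be sketched with a toy cost model (illustrative assumptions, not measurements): each batch pays a fixed overhead plus a per-item cost, and a request may wait up to a full batch-fill interval before it runs. The parameter values below are hypothetical.

```python
# Toy batching model: fixed per-batch overhead + per-item cost,
# plus average waiting time for the batch to fill at a given arrival rate.

def batch_stats(batch_size, overhead_ms=5.0, per_item_ms=1.0, arrival_rate=2.0):
    """arrival_rate is requests per ms; returns (throughput, avg_latency_ms)."""
    service_ms = overhead_ms + per_item_ms * batch_size
    throughput = batch_size / service_ms            # requests per ms
    fill_wait_ms = batch_size / arrival_rate / 2    # mean wait for batch to fill
    return throughput, fill_wait_ms + service_ms

for b in (1, 8, 64):
    tput, lat = batch_stats(b)
    print(f"batch={b:3d}  throughput={tput:.2f} req/ms  latency={lat:.1f} ms")
```

Even in this simplified model, growing the batch raises throughput and latency together: exactly the trade-off the bullets above describe.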
Minimal Conceptual Illustration
| Latency ↓ | Throughput ↑ |
|---|---|
| Small batches | Large batches |
| Low queueing | High queueing |
| Fast responses | Slow responses |
Key Definitions
- Latency: Time from request arrival to response
- Throughput: Requests processed per unit time
- Concurrency: Number of in-flight requests
- Queueing delay: Time waiting before execution
Queueing mediates the trade-off.
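The definitions compose: end-to-end latency is queueing delay plus service time. A minimal sketch with hypothetical timestamps:

```python
# Latency decomposition for one request (timestamps are illustrative).
arrival_ts = 100.000   # request enters the queue (seconds)
start_ts   = 100.045   # execution begins
finish_ts  = 100.070   # response sent

queueing_delay = start_ts - arrival_ts    # 0.045 s waiting
service_time   = finish_ts - start_ts     # 0.025 s computing
latency        = finish_ts - arrival_ts   # 0.070 s total
assert abs(latency - (queueing_delay + service_time)) < 1e-9
```

Throughput tuning (e.g., batching) mostly changes the queueing term, which is why queueing mediates the trade-off.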
The Role of Queueing
As load increases:
- queues form
- latency grows nonlinearly
- tail latency explodes before throughput saturates
Queueing dominates the tail.
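The nonlinear growth can be made concrete under simple M/M/1 queueing assumptions (a modeling idealization, not a property of any specific serving system): mean latency is W = 1 / (μ − λ), which blows up as utilization ρ = λ/μ approaches 1, long before throughput (bounded by μ) stops increasing.

```python
# Mean latency in an M/M/1 queue: W = 1 / (mu - lam).
# Latency explodes near saturation even though throughput barely improves.

def mean_latency(lam, mu):
    assert lam < mu, "queue is unstable at or beyond saturation"
    return 1.0 / (mu - lam)

mu = 100.0  # service capacity: 100 req/s
for rho in (0.5, 0.9, 0.99):
    lam = rho * mu
    print(f"utilization={rho:.2f}  mean latency={mean_latency(lam, mu) * 1000:.0f} ms")
```

Going from 50% to 99% utilization roughly doubles throughput but multiplies mean latency by fifty in this model.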
Relationship to Tail Latency Metrics
High throughput configurations often worsen:
- p95 / p99 latency
- SLA compliance
- worst-case responsiveness
The tail pays the price for scale.
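A quick way to see why averages hide the tail: compute percentiles on a heavy-tailed latency sample. The distribution below is simulated for illustration; real serving latencies are often even more skewed.

```python
import random
import statistics

# Simulated latency samples with a heavy tail: most requests are fast,
# a few are stuck behind large batches or long queues.
random.seed(0)
samples = [random.expovariate(1 / 20) for _ in range(10_000)]  # mean ~20 ms

mean = statistics.fmean(samples)
cuts = statistics.quantiles(samples, n=100)   # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"mean={mean:.1f}  p50={p50:.1f}  p95={p95:.1f}  p99={p99:.1f} ms")
```

The mean looks healthy while p99 is several times larger, which is why SLAs are typically written against tail percentiles rather than averages.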
Interaction with Adaptive Models
Adaptive inference (early exits, mixture-of-experts routing):
- reduces average compute
- increases variance across requests
- complicates batching strategies
Adaptivity sharpens the trade-off.
Batch Size Effects
- Small batches: lower latency, lower throughput
- Large batches: higher throughput, higher latency
- Dynamic batching: attempts to balance both
Batching is the primary control lever.
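A minimal sketch of the dynamic-batching lever (illustrative, not the API of any real serving framework): collect requests until the batch is full or a latency deadline expires, whichever comes first.

```python
import queue
import time

def collect_batch(request_queue, max_batch=32, max_wait_s=0.010):
    """Block for the first request, then fill the batch until full or deadline."""
    batch = [request_queue.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(5):
    q.put(f"req-{i}")
print(collect_batch(q, max_batch=8, max_wait_s=0.005))
```

`max_batch` bounds throughput gains while `max_wait_s` bounds the latency cost; tuning the pair moves the system along the trade-off curve.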
Hardware Considerations
- GPUs favor throughput via batching
- CPUs favor lower-latency execution
- Memory bandwidth and cache behavior affect both
Hardware shapes the curve.
Deployment Strategies
Common strategies to manage the trade-off include:
- separate latency-critical and batch pipelines
- dynamic batch sizing
- priority queues
- admission control
- fallback models under load
One size does not fit all.
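The priority-queue strategy can be sketched in a few lines (hypothetical request names; real schedulers also handle preemption and starvation): latency-critical traffic is dequeued before bulk traffic, with arrival order breaking ties.

```python
import heapq

# Priority 0 = latency-critical, priority 1 = bulk; seq preserves arrival order.
heap = []
for seq, (priority, name) in enumerate([
    (1, "bulk-embed-job"),
    (0, "user-chat-request"),
    (1, "nightly-backfill"),
    (0, "user-search-request"),
]):
    heapq.heappush(heap, (priority, seq, name))

order = [heapq.heappop(heap)[2] for _ in range(4)]
print(order)  # critical requests drain first, then bulk work
```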
Evaluation Best Practices
Systems should be evaluated on:
- latency–throughput curves
- tail latency under peak load
- SLA violation rates
- cost per request at target latency
Single-point benchmarks mislead.
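SLA violation rate, for example, is a distribution-level metric that a single-point benchmark cannot capture. A sketch with a simulated latency distribution and a hypothetical 100 ms SLA:

```python
import random

# Simulated latency samples (mean ~30 ms) evaluated against a 100 ms SLA.
random.seed(1)
samples_ms = [random.expovariate(1 / 30) for _ in range(10_000)]

sla_ms = 100.0
violations = sum(1 for s in samples_ms if s > sla_ms)
violation_rate = violations / len(samples_ms)
print(f"SLA violation rate at {sla_ms:.0f} ms: {violation_rate:.1%}")
```

A mean of ~30 ms sounds comfortably under the SLA, yet a few percent of requests still breach it; the violation rate is what the SLA actually governs.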
Failure Modes
Ignoring throughput–latency trade-offs can cause:
- cascading latency spikes
- unpredictable user experience
- cost overruns from over-provisioning
- hidden SLA breaches
Throughput wins can hide latency losses.
Practical Design Guidelines
- define latency budgets before scaling throughput
- monitor p99 latency as load increases
- cap queue lengths explicitly
- separate critical from bulk traffic
- reassess trade-offs as traffic evolves
Latency is a first-class constraint.
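"Cap queue lengths explicitly" can be as simple as a bounded queue that sheds load instead of letting queueing delay grow without limit (sketch; the cap of 4 is an illustrative choice):

```python
import queue

requests = queue.Queue(maxsize=4)  # explicit cap on in-flight backlog
accepted, rejected = 0, 0
for i in range(10):
    try:
        requests.put_nowait(f"req-{i}")
        accepted += 1
    except queue.Full:
        rejected += 1  # fast rejection beats unbounded queueing delay

print(f"accepted={accepted} rejected={rejected}")
```

Rejected requests can be retried, downgraded to a fallback model, or routed to the bulk pipeline; what the cap buys is a hard bound on queueing delay for the requests that are admitted.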
Common Pitfalls
- optimizing throughput in isolation
- reporting average latency only
- ignoring queueing effects
- assuming GPUs always improve latency
- batching without tail-latency limits
Scale exposes trade-offs.
Summary Characteristics
| Aspect | Throughput | Latency |
|---|---|---|
| Measures | Capacity | Responsiveness |
| Optimized by | Batching | Immediate execution |
| Affected by load | Saturation | Nonlinear growth |
| SLA relevance | Indirect | Direct |
| Governing factor | Queueing | Tail behavior |
Related Concepts
- Generalization & Evaluation
- Tail Latency Metrics
- Budget-Constrained Inference
- Accuracy–Latency Trade-offs
- Efficiency Governance
- Latency Drift Monitoring
- Queueing Effects in ML Systems