Short Definition
Throughput vs latency describes the trade-off between how many requests a system can process per unit time and how long each individual request takes to complete.
Definition
Throughput measures system capacity (e.g., requests per second), while latency measures per-request response time. In ML inference systems, optimizing one often degrades the other due to batching, queueing, and resource contention.
Fast systems are not always responsive systems.
Why It Matters
Production ML systems must satisfy both:
- latency constraints (user experience, SLAs)
- throughput requirements (cost efficiency, scale)
Ignoring the trade-off leads to SLA violations or runaway costs.
Core Trade-off
- Increasing throughput typically requires batching and parallelism
- Batching increases queueing time, raising latency
- Reducing latency often limits batching, reducing throughput
Capacity and responsiveness compete.
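The competition can be sketched with a toy cost model (illustrative assumptions, not measurements): each batch pays a fixed overhead plus a per-item cost, and a request may wait up to a full batch-fill interval before it runs. The parameter values below are hypothetical.

```python
# Toy batching model: fixed per-batch overhead + per-item cost,
# plus average waiting time for the batch to fill at a given arrival rate.

def batch_stats(batch_size, overhead_ms=5.0, per_item_ms=1.0, arrival_rate=2.0):
    """arrival_rate is requests per ms; returns (throughput, avg_latency_ms)."""
    service_ms = overhead_ms + per_item_ms * batch_size
    throughput = batch_size / service_ms            # requests per ms
    fill_wait_ms = batch_size / arrival_rate / 2    # mean wait for batch to fill
    return throughput, fill_wait_ms + service_ms

for b in (1, 8, 64):
    tput, lat = batch_stats(b)
    print(f"batch={b:3d}  throughput={tput:.2f} req/ms  latency={lat:.1f} ms")
```

Even in this simplified model, growing the batch raises throughput and latency together: exactly the trade-off the bullets above describe.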
Minimal Conceptual Illustration
| Latency ↓ | Throughput ↑ |
|---|---|
| Small batches | Large batches |
| Low queueing | High queueing |
| Fast responses | Slow responses |
Key Definitions
- Latency: Time from request arrival to response
- Throughput: Requests processed per unit time
- Concurrency: Number of in-flight requests
- Queueing delay: Time waiting before execution
Queueing mediates the trade-off.
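The definitions compose: end-to-end latency is queueing delay plus service time. A minimal sketch with hypothetical timestamps:

```python
# Latency decomposition for one request (timestamps are illustrative).
arrival_ts = 100.000   # request enters the queue (seconds)
start_ts   = 100.045   # execution begins
finish_ts  = 100.070   # response sent

queueing_delay = start_ts - arrival_ts    # 0.045 s waiting
service_time   = finish_ts - start_ts     # 0.025 s computing
latency        = finish_ts - arrival_ts   # 0.070 s total
assert abs(latency - (queueing_delay + service_time)) < 1e-9
```

Throughput tuning (e.g., batching) mostly changes the queueing term, which is why queueing mediates the trade-off.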
The Role of Queueing
As load increases:
- queues form
- latency grows nonlinearly
- tail latency explodes before throughput saturates
Queueing dominates the tail.
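The nonlinear growth can be made concrete under simple M/M/1 queueing assumptions (a modeling idealization, not a property of any specific serving system): mean latency is W = 1 / (μ − λ), which blows up as utilization ρ = λ/μ approaches 1, long before throughput (bounded by μ) stops increasing.

```python
# Mean latency in an M/M/1 queue: W = 1 / (mu - lam).
# Latency explodes near saturation even though throughput barely improves.

def mean_latency(lam, mu):
    assert lam < mu, "queue is unstable at or beyond saturation"
    return 1.0 / (mu - lam)

mu = 100.0  # service capacity: 100 req/s
for rho in (0.5, 0.9, 0.99):
    lam = rho * mu
    print(f"utilization={rho:.2f}  mean latency={mean_latency(lam, mu) * 1000:.0f} ms")
```

Going from 50% to 99% utilization roughly doubles throughput but multiplies mean latency by fifty in this model.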
Relationship to Tail Latency Metrics
High throughput configurations often worsen:
- p95 / p99 latency
- SLA compliance
- worst-case responsiveness
The tail pays the price for scale.
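A quick way to see why averages hide the tail: compute percentiles on a heavy-tailed latency sample. The distribution below is simulated for illustration; real serving latencies are often even more skewed.

```python
import random
import statistics

# Simulated latency samples with a heavy tail: most requests are fast,
# a few are stuck behind large batches or long queues.
random.seed(0)
samples = [random.expovariate(1 / 20) for _ in range(10_000)]  # mean ~20 ms

mean = statistics.fmean(samples)
cuts = statistics.quantiles(samples, n=100)   # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"mean={mean:.1f}  p50={p50:.1f}  p95={p95:.1f}  p99={p99:.1f} ms")
```

The mean looks healthy while p99 is several times larger, which is why SLAs are typically written against tail percentiles rather than averages.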
Interaction with Adaptive Models
Adaptive inference (early exits, mixture-of-experts routing):
- reduces average compute
- increases variance across requests
- complicates batching strategies
Adaptivity sharpens the trade-off.
Batch Size Effects
- Small batches: lower latency, lower throughput
- Large batches: higher throughput, higher latency
- Dynamic batching: attempts to balance both
Batching is the primary control lever.
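A minimal sketch of the dynamic-batching lever (illustrative, not the API of any real serving framework): collect requests until the batch is full or a latency deadline expires, whichever comes first.

```python
import queue
import time

def collect_batch(request_queue, max_batch=32, max_wait_s=0.010):
    """Block for the first request, then fill the batch until full or deadline."""
    batch = [request_queue.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(5):
    q.put(f"req-{i}")
print(collect_batch(q, max_batch=8, max_wait_s=0.005))
```

`max_batch` bounds throughput gains while `max_wait_s` bounds the latency cost; tuning the pair moves the system along the trade-off curve.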
Hardware Considerations
- GPUs favor throughput via batching
- CPUs favor lower-latency execution
- Memory bandwidth and cache behavior affect both
Hardware shapes the curve.
Deployment Strategies
Common strategies to manage the trade-off include:
- separate latency-critical and batch pipelines
- dynamic batch sizing
- priority queues
- admission control
- fallback models under load
One size does not fit all.
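The priority-queue strategy can be sketched in a few lines (hypothetical request names; real schedulers also handle preemption and starvation): latency-critical traffic is dequeued before bulk traffic, with arrival order breaking ties.

```python
import heapq

# Priority 0 = latency-critical, priority 1 = bulk; seq preserves arrival order.
heap = []
for seq, (priority, name) in enumerate([
    (1, "bulk-embed-job"),
    (0, "user-chat-request"),
    (1, "nightly-backfill"),
    (0, "user-search-request"),
]):
    heapq.heappush(heap, (priority, seq, name))

order = [heapq.heappop(heap)[2] for _ in range(4)]
print(order)  # critical requests drain first, then bulk work
```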
Evaluation Best Practices
Systems should be evaluated on:
- latency–throughput curves
- tail latency under peak load
- SLA violation rates
- cost per request at target latency
Single-point benchmarks mislead.
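SLA violation rate, for example, is a distribution-level metric that a single-point benchmark cannot capture. A sketch with a simulated latency distribution and a hypothetical 100 ms SLA:

```python
import random

# Simulated latency samples (mean ~30 ms) evaluated against a 100 ms SLA.
random.seed(1)
samples_ms = [random.expovariate(1 / 30) for _ in range(10_000)]

sla_ms = 100.0
violations = sum(1 for s in samples_ms if s > sla_ms)
violation_rate = violations / len(samples_ms)
print(f"SLA violation rate at {sla_ms:.0f} ms: {violation_rate:.1%}")
```

A mean of ~30 ms sounds comfortably under the SLA, yet a few percent of requests still breach it; the violation rate is what the SLA actually governs.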
Failure Modes
Ignoring throughput–latency trade-offs can cause:
- cascading latency spikes
- unpredictable user experience
- cost overruns from over-provisioning
- hidden SLA breaches
Throughput wins can hide latency losses.
Practical Design Guidelines
- define latency budgets before scaling throughput
- monitor p99 latency as load increases
- cap queue lengths explicitly
- separate critical from bulk traffic
- reassess trade-offs as traffic evolves
Latency is a first-class constraint.
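"Cap queue lengths explicitly" can be as simple as a bounded queue that sheds load instead of letting queueing delay grow without limit (sketch; the cap of 4 is an illustrative choice):

```python
import queue

requests = queue.Queue(maxsize=4)  # explicit cap on in-flight backlog
accepted, rejected = 0, 0
for i in range(10):
    try:
        requests.put_nowait(f"req-{i}")
        accepted += 1
    except queue.Full:
        rejected += 1  # fast rejection beats unbounded queueing delay

print(f"accepted={accepted} rejected={rejected}")
```

Rejected requests can be retried, downgraded to a fallback model, or routed to the bulk pipeline; what the cap buys is a hard bound on queueing delay for the requests that are admitted.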
Common Pitfalls
- optimizing throughput in isolation
- reporting average latency only
- ignoring queueing effects
- assuming GPUs always improve latency
- batching without tail-latency limits
Scale exposes trade-offs.
Summary Characteristics
| Aspect | Throughput | Latency |
|---|---|---|
| Measures | Capacity | Responsiveness |
| Optimized by | Batching | Immediate execution |
| Affected by load | Saturation | Nonlinear growth |
| SLA relevance | Indirect | Direct |
| Governing factor | Queueing | Tail behavior |
Related Concepts
- Generalization & Evaluation
- Tail Latency Metrics
- Budget-Constrained Inference
- Accuracy–Latency Trade-offs
- Efficiency Governance
- Latency Drift Monitoring
- Queueing Effects in ML Systems