Throughput vs Latency

Short Definition

Throughput vs latency describes the trade-off between how many requests a system can process per unit time and how long each individual request takes to complete.

Definition

Throughput measures system capacity (e.g., requests per second), while latency measures per-request response time. In ML inference systems, optimizing one often degrades the other due to batching, queueing, and resource contention.

Fast systems are not always responsive systems.

Why It Matters

Production ML systems must satisfy both:

  • latency constraints (user experience, SLAs)
  • throughput requirements (cost efficiency, scale)

Ignoring the trade-off leads to SLA violations or runaway costs.

Core Trade-off

  • Increasing throughput typically requires batching and parallelism
  • Batching increases queueing time, raising latency
  • Reducing latency often limits batching, reducing throughput

Capacity and responsiveness compete.
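The competition can be sketched with a toy cost model (all numbers and names here are illustrative, not from any real serving stack): a fixed per-batch setup cost is amortized by larger batches, raising throughput, while each request also waits for the batch to fill, raising latency.

```python
def batch_stats(batch_size, setup_ms=5.0, per_item_ms=1.0, arrivals_per_ms=0.5):
    """Toy model of one inference server (illustrative numbers only).

    Larger batches amortize the fixed `setup_ms` overhead, so throughput
    rises -- but each request also waits for the batch to fill first.
    """
    service_ms = batch_size * per_item_ms + setup_ms           # run one batch
    throughput_rps = batch_size / service_ms * 1000.0          # requests/second
    fill_wait_ms = (batch_size - 1) / (2.0 * arrivals_per_ms)  # mean fill wait
    latency_ms = fill_wait_ms + service_ms
    return throughput_rps, latency_ms
```

Under these assumptions, moving from batch size 1 to 32 raises throughput several-fold while multiplying per-request latency: capacity and responsiveness move in opposite directions.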

Minimal Conceptual Illustration


  Latency ↓         | Throughput ↑
  Small batches     | Large batches
  Low queueing      | High queueing
  Fast responses    | Slow responses

Key Definitions

  • Latency: Time from request arrival to response
  • Throughput: Requests processed per unit time
  • Concurrency: Number of in-flight requests
  • Queueing delay: Time waiting before execution

Queueing mediates the trade-off.
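These quantities are not independent: Little's law ties concurrency, throughput, and latency together for any stable system. A minimal sketch (function name is ours):

```python
def mean_in_flight(throughput_rps, mean_latency_s):
    """Little's law: L = lambda * W.

    Mean concurrency (in-flight requests) equals throughput times mean
    latency, for any stable system, regardless of queue discipline.
    """
    return throughput_rps * mean_latency_s
```

At 200 requests/s and 50 ms mean latency, roughly 10 requests are in flight at any moment; fix any two of the three quantities and the third is determined.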

The Role of Queueing

As load increases:

  • queues form
  • latency grows nonlinearly
  • tail latency explodes before throughput saturates

Queueing dominates the tail.
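The nonlinearity is visible even in the simplest queueing model. Assuming an M/M/1 queue (Poisson arrivals, one exponential server — a deliberate simplification of real inference servers), mean time in system is W = 1 / (μ − λ):

```python
def mm1_mean_latency(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda).

    As utilization rho = lambda / mu approaches 1, latency blows up even
    though throughput (the arrival rate) is still rising.
    """
    if arrival_rate >= service_rate:
        raise ValueError("unstable: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)
```

At 50% utilization of a 100 req/s server, mean latency is 20 ms; at 99% utilization it is 1 s — a 50x latency increase for a 2x gain in load.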

Relationship to Tail Latency Metrics

High throughput configurations often worsen:

  • p95 / p99 latency
  • SLA compliance
  • worst-case responsiveness

The tail pays the price for scale.
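Percentile metrics such as p95 and p99 are computed over a window of observed latencies. A nearest-rank sketch is below (production systems typically use streaming estimators or a library routine such as numpy.percentile):

```python
import math

def latency_percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    observations at or below it (p in [0, 100])."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]
```

Over samples of 1..100 ms this returns 50 for p50 and 99 for p99; a single very slow request per hundred moves p99 dramatically while barely moving the mean, which is why high-throughput configurations can look fine on averages yet breach SLAs.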

Interaction with Adaptive Models

Adaptive inference (early exits, mixture-of-experts):

  • reduces average compute
  • increases variance across requests
  • complicates batching strategies

Adaptivity sharpens the trade-off.

Batch Size Effects

  • Small batches: lower latency, lower throughput
  • Large batches: higher throughput, higher latency
  • Dynamic batching: attempts to balance both

Batching is the primary control lever.
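A common implementation of dynamic batching bounds both the batch size and the time spent waiting for it to fill. A minimal single-threaded sketch, with names of our choosing rather than any framework's API:

```python
import time
from queue import Empty, Queue

def collect_batch(requests, max_batch=8, max_wait_s=0.005):
    """Take up to `max_batch` requests, but never wait more than
    `max_wait_s` after the first one arrives.

    The wait bound caps the latency cost of batching; the size bound
    caps per-batch compute.
    """
    batch = [requests.get()]            # block until at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break                       # timed out waiting for more requests
    return batch
```

Tuning `max_wait_s` moves the system along the latency–throughput curve: a larger wait fills bigger batches, a smaller one bounds queueing delay.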

Hardware Considerations

  • GPUs favor throughput: massive parallelism rewards large batches
  • CPUs often favor low latency for small or single-request workloads
  • Memory bandwidth and cache behavior affect both

Hardware shapes the curve.

Deployment Strategies

Common strategies to manage the trade-off include:

  • separate latency-critical and batch pipelines
  • dynamic batch sizing
  • priority queues
  • admission control
  • fallback models under load

One size does not fit all.
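Queue caps and admission control can be sketched together: a bounded queue sheds load instead of letting queueing delay grow without bound (class and method names here are illustrative):

```python
from collections import deque

class BoundedQueue:
    """Reject requests once depth hits the cap, converting hidden
    queueing latency into an explicit, fast 'overloaded' signal."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self._items = deque()

    def try_admit(self, request):
        if len(self._items) >= self.max_depth:
            return False   # shed load; caller can retry or use a fallback
        self._items.append(request)
        return True

    def take(self):
        return self._items.popleft()
```

Rejected requests can be routed to a cheaper fallback model rather than waiting behind the backlog, keeping tail latency bounded under overload.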

Evaluation Best Practices

Systems should be evaluated on:

  • latency–throughput curves
  • tail latency under peak load
  • SLA violation rates
  • cost per request at target latency

Single-point benchmarks mislead.
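A latency–throughput curve is produced by sweeping an operating parameter (batch size, concurrency) and recording both metrics at each point. A toy sweep under an assumed linear batch-cost model (numbers illustrative):

```python
def sweep_curve(batch_sizes, setup_ms=5.0, per_item_ms=1.0):
    """For each batch size, record (throughput_rps, service_latency_ms)
    under a toy linear cost model. Report the whole curve, not one point."""
    curve = []
    for b in batch_sizes:
        service_ms = setup_ms + per_item_ms * b
        curve.append((b / service_ms * 1000.0, service_ms))
    return curve
```

Plotting such a curve makes the trade-off explicit: an operating point is chosen where latency stays within budget, not where throughput peaks.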

Failure Modes

Ignoring throughput–latency trade-offs can cause:

  • cascading latency spikes
  • unpredictable user experience
  • cost overruns from over-provisioning
  • hidden SLA breaches

Throughput wins can hide latency losses.

Practical Design Guidelines

  • define latency budgets before scaling throughput
  • monitor p99 latency as load increases
  • cap queue lengths explicitly
  • separate critical from bulk traffic
  • reassess trade-offs as traffic evolves

Latency is a first-class constraint.

Common Pitfalls

  • optimizing throughput in isolation
  • reporting average latency only
  • ignoring queueing effects
  • assuming GPUs always improve latency
  • batching without tail-latency limits

Scale exposes trade-offs.

Summary Characteristics

  Aspect            | Throughput  | Latency
  Measures          | Capacity    | Responsiveness
  Optimized by      | Batching    | Immediate execution
  Affected by load  | Saturation  | Nonlinear growth
  SLA relevance     | Indirect    | Direct
  Governing factor  | Queueing    | Tail behavior

Related Concepts

  • Generalization & Evaluation
  • Tail Latency Metrics
  • Budget-Constrained Inference
  • Accuracy–Latency Trade-offs
  • Efficiency Governance
  • Latency Drift Monitoring
  • Queueing Effects in ML Systems