Compute-Optimal vs Data-Optimal Scaling

Short Definition

Compute-Optimal vs Data-Optimal Scaling contrasts two strategies for training large neural networks: compute-optimal scaling balances model size and dataset size to minimize loss under a fixed compute budget, while data-optimal scaling prioritizes using as much high-quality data as possible to maximize performance regardless of compute efficiency.

It distinguishes efficiency-driven scaling from data-driven scaling.

Definition

Modern scaling research studies how performance improves as we increase:

  • Model parameters (N)
  • Dataset size (D)
  • Training compute (C)

Performance typically follows power-law relationships.

Two different optimization targets emerge:

Compute-Optimal Scaling

Given a fixed compute budget (C), choose model size and dataset size to minimize loss.

Compute roughly scales as:

[
C \propto N \times D
]

Under this constraint, there exists an optimal balance:

  • Too large model + too little data → undertrained
  • Too much data + too small model → capacity bottleneck

Compute-optimal scaling balances both.

This principle underlies Chinchilla-style scaling laws.
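The balance can be made concrete with a small numerical sketch. Assuming (purely for illustration) a power-law loss L(N, D) = aN^(−α) + bD^(−β) and the common approximation C ≈ 6·N·D FLOPs, a grid search over the N/D split at a fixed budget finds an interior optimum. The coefficients and budget below are hypothetical, not fitted values:

```python
# Hypothetical power-law loss surface (coefficients are illustrative,
# not fitted values): L(N, D) = a * N**(-alpha) + b * D**(-beta)
a, b = 406.4, 410.7          # illustrative magnitudes
alpha, beta = 0.34, 0.28     # illustrative exponents

def loss(N, D):
    """Power-law loss for N parameters trained on D tokens."""
    return a * N ** -alpha + b * D ** -beta

# Fixed compute budget; assume the common approximation C ~ 6 * N * D FLOPs.
C = 1e21

# Sweep the parameter count N (log-spaced), let D absorb the rest of the
# budget, and keep the allocation with the lowest loss.
candidates = [10 ** (7 + 0.1 * i) for i in range(50)]
best_loss, best_N, best_D = min(
    (loss(N, C / (6 * N)), N, C / (6 * N)) for N in candidates
)
print(f"N* = {best_N:.2e} params, D* = {best_D:.2e} tokens, loss = {best_loss:.3f}")
```

Shifting the same budget toward a 100x larger model (with correspondingly less data), or the reverse, raises the loss at both extremes, which is exactly the undertrained/capacity-limited trade-off described above.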

Data-Optimal Scaling

Data-optimal scaling asks:

Given unlimited compute, how much data should be used to achieve best performance?

This regime:

  • Prioritizes large, high-quality datasets
  • Often implies larger models
  • Is not constrained by compute budget
  • Optimizes final capability rather than efficiency

It emphasizes maximal performance rather than cost-efficiency.

Core Difference

| Aspect | Compute-Optimal | Data-Optimal |
| --- | --- | --- |
| Constraint | Fixed compute | Fixed data goal |
| Objective | Minimize loss per compute | Maximize final performance |
| Trade-off | Balance N and D | Increase both aggressively |
| Efficiency | High | Secondary |
| Used in | Industrial training planning | Frontier research |

Compute-optimal focuses on efficiency.
Data-optimal focuses on capability.

Minimal Conceptual Illustration


Compute-Optimal:
Fixed budget → best N/D ratio.

Data-Optimal:
Unlimited budget → scale data and model.

Different optimization targets produce different training strategies.

Scaling Law Context

Empirical scaling laws show:

[
\mathcal{L}(N, D) \approx a N^{-\alpha} + b D^{-\beta}
]

Compute-optimal solutions lie along curves satisfying:

[
N \propto D
]

Under fixed compute, optimal models are often smaller than naive scaling would suggest.
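Under a power-law loss of this form, the compute-optimal split has a closed form: minimizing subject to 6·N·D = C gives N* ∝ C^(β/(α+β)) and D* ∝ C^(α/(α+β)). When the exponents are roughly equal, parameters and tokens should both grow as √C, so their ratio stays constant across budgets. A sketch with illustrative (not fitted) coefficients:

```python
# Closed-form compute-optimal allocation for L(N, D) = a*N**-alpha + b*D**-beta
# under the constraint 6*N*D = C. All coefficients are illustrative, not fitted.
a, b = 406.4, 410.7
alpha, beta = 0.34, 0.34     # equal exponents -> N* and D* both scale as sqrt(C)

def optimal_split(C):
    # First-order condition along the constraint D = C / (6 * N):
    #   alpha * a * N**-alpha = beta * b * D**-beta
    # which solves to N**(alpha + beta) = (alpha*a / (beta*b)) * (C/6)**beta.
    N = ((alpha * a / (beta * b)) * (C / 6) ** beta) ** (1 / (alpha + beta))
    return N, C / (6 * N)

for C in (1e19, 1e21, 1e23):
    N, D = optimal_split(C)
    # With equal exponents, the tokens-per-parameter ratio is the same at
    # every budget; its actual value depends on the fitted coefficients.
    print(f"C={C:.0e}: N*={N:.2e}, D*={D:.2e}, D/N={D / N:.2f}")
```

The constant tokens-per-parameter ratio is the quantitative content of the N ∝ D rule above; the specific constant (famously around 20 tokens per parameter in the Chinchilla fits) comes from the fitted coefficients, not from the functional form.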

Chinchilla scaling demonstrated:

Many large models were undertrained: they had far more parameters than their training data could support.

Undertraining vs Overtraining

If the model is too large relative to the data:

  • Undertrained regime
  • Inefficient compute usage

If the dataset is too large relative to the model:

  • Capacity-limited regime
  • Diminishing returns

Compute-optimal balances both to avoid waste.

Scaling Strategy Implications

Compute-optimal scaling:

  • Reduces wasted compute
  • Achieves better loss per FLOP
  • Improves cost efficiency

Data-optimal scaling:

  • Pushes capability frontier
  • Increases emergent behaviors
  • Often required for breakthrough performance

Frontier labs often pursue hybrid strategies.

Alignment Implications

Data-optimal scaling:

  • Increases capability rapidly
  • May widen capability–alignment gap
  • Requires stronger governance

Compute-optimal scaling:

  • Improves efficiency
  • Slows reckless capability expansion
  • More economically sustainable

Scaling strategy influences risk trajectory.

Governance Perspective

Compute-optimal scaling:

  • Enables resource planning
  • Predictable cost-performance curves
  • Easier benchmarking

Data-optimal scaling:

  • May accelerate strategic capability jumps
  • Raises oversight challenges
  • Requires policy discussion on scaling pace

Scaling choices are strategic decisions.

Relation to Emergence vs Smooth Scaling

Data-optimal scaling may:

  • Accelerate apparent emergent behaviors
  • Push models into new capability regimes

Compute-optimal scaling may:

  • Maintain smoother capability growth
  • Delay crossing critical thresholds

Scaling policy influences emergence timing.

Practical Considerations

Compute-Optimal:

  • Used in production planning
  • Important for startups and cloud efficiency

Data-Optimal:

  • Used in frontier model training
  • Requires massive infrastructure

Trade-off is economic as well as technical.

Summary

Compute-Optimal Scaling:

  • Balances model size and data under fixed compute.
  • Maximizes efficiency.

Data-Optimal Scaling:

  • Prioritizes maximum performance.
  • Scales aggressively beyond efficiency constraints.

Scaling strategy affects capability growth, cost, and alignment risk.

Related Concepts

  • Scaling Laws
  • Architecture Scaling Laws
  • Double Descent
  • Overparameterization vs Underparameterization
  • Emergence vs Smooth Scaling
  • Capability–Alignment Gap
  • Compute–Data Trade-offs
  • Alignment Capability Scaling