Compute-Optimal vs Data-Optimal Scaling

Short Definition

Compute-Optimal vs Data-Optimal Scaling contrasts two strategies for training large neural networks: compute-optimal scaling balances model size and dataset size to minimize loss under a fixed compute budget, while data-optimal scaling prioritizes using as much high-quality data as possible to maximize performance regardless of compute efficiency.

It distinguishes efficiency-driven scaling from data-driven scaling.

Definition

Modern scaling research studies how performance improves as we increase:

  • Model parameters (N)
  • Dataset size (D)
  • Training compute (C)

Performance typically follows power-law relationships.

Two different optimization targets emerge:

Compute-Optimal Scaling

Given a fixed compute budget (C), choose model size and dataset size to minimize loss.

Compute roughly scales as:

[
C \propto N \times D
]

Under this constraint, there exists an optimal balance:

  • Too large model + too little data → undertrained
  • Too much data + too small model → capacity bottleneck

Compute-optimal scaling balances both.

This principle underlies Chinchilla-style scaling laws.
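The balance can be made concrete with a small numerical sketch. Assuming (purely for illustration) a power-law loss L(N, D) = aN^(−α) + bD^(−β) and the common approximation C ≈ 6·N·D FLOPs, a grid search over the N/D split at a fixed budget finds an interior optimum. The coefficients and budget below are hypothetical, not fitted values:

```python
# Hypothetical power-law loss surface (coefficients are illustrative,
# not fitted values): L(N, D) = a * N**(-alpha) + b * D**(-beta)
a, b = 406.4, 410.7          # illustrative magnitudes
alpha, beta = 0.34, 0.28     # illustrative exponents

def loss(N, D):
    """Power-law loss for N parameters trained on D tokens."""
    return a * N ** -alpha + b * D ** -beta

# Fixed compute budget; assume the common approximation C ~ 6 * N * D FLOPs.
C = 1e21

# Sweep the parameter count N (log-spaced), let D absorb the rest of the
# budget, and keep the allocation with the lowest loss.
candidates = [10 ** (7 + 0.1 * i) for i in range(50)]
best_loss, best_N, best_D = min(
    (loss(N, C / (6 * N)), N, C / (6 * N)) for N in candidates
)
print(f"N* = {best_N:.2e} params, D* = {best_D:.2e} tokens, loss = {best_loss:.3f}")
```

Shifting the same budget toward a 100x larger model (with correspondingly less data), or the reverse, raises the loss at both extremes, which is exactly the undertrained/capacity-limited trade-off described above.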

Data-Optimal Scaling

Data-optimal scaling asks:

Given unlimited compute, how much data should be used to achieve best performance?

This regime:

  • Prioritizes large, high-quality datasets
  • Often implies larger models
  • Is not constrained by compute budget
  • Optimizes final capability rather than efficiency

It emphasizes maximal performance rather than cost-efficiency.

Core Difference

| Aspect | Compute-Optimal | Data-Optimal |
| --- | --- | --- |
| Constraint | Fixed compute | Fixed data goal |
| Objective | Minimize loss per compute | Maximize final performance |
| Trade-off | Balance N and D | Increase both aggressively |
| Efficiency | High | Secondary |
| Used in | Industrial training planning | Frontier research |

Compute-optimal focuses on efficiency.
Data-optimal focuses on capability.

Minimal Conceptual Illustration


Compute-Optimal:
Fixed budget → best N/D ratio.

Data-Optimal:
Unlimited budget → scale data and model.

Different optimization targets produce different training strategies.

Scaling Law Context

Empirical scaling laws show:

[
\mathcal{L}(N, D) \approx a N^{-\alpha} + b D^{-\beta}
]

Compute-optimal solutions lie along curves satisfying:

[
N \propto D
]

Under fixed compute, optimal models are often smaller than naive scaling would suggest.
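Under a power-law loss of this form, the compute-optimal split has a closed form: minimizing subject to 6·N·D = C gives N* ∝ C^(β/(α+β)) and D* ∝ C^(α/(α+β)). When the exponents are roughly equal, parameters and tokens should both grow as √C, so their ratio stays constant across budgets. A sketch with illustrative (not fitted) coefficients:

```python
# Closed-form compute-optimal allocation for L(N, D) = a*N**-alpha + b*D**-beta
# under the constraint 6*N*D = C. All coefficients are illustrative, not fitted.
a, b = 406.4, 410.7
alpha, beta = 0.34, 0.34     # equal exponents -> N* and D* both scale as sqrt(C)

def optimal_split(C):
    # First-order condition along the constraint D = C / (6 * N):
    #   alpha * a * N**-alpha = beta * b * D**-beta
    # which solves to N**(alpha + beta) = (alpha*a / (beta*b)) * (C/6)**beta.
    N = ((alpha * a / (beta * b)) * (C / 6) ** beta) ** (1 / (alpha + beta))
    return N, C / (6 * N)

for C in (1e19, 1e21, 1e23):
    N, D = optimal_split(C)
    # With equal exponents, the tokens-per-parameter ratio is the same at
    # every budget; its actual value depends on the fitted coefficients.
    print(f"C={C:.0e}: N*={N:.2e}, D*={D:.2e}, D/N={D / N:.2f}")
```

The constant tokens-per-parameter ratio is the quantitative content of the N ∝ D rule above; the specific constant (famously around 20 tokens per parameter in the Chinchilla fits) comes from the fitted coefficients, not from the functional form.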

Chinchilla scaling demonstrated:

Many large models were undertrained: they had far more parameters than their training data could support.

Undertraining vs Overtraining

If the model is too large relative to the data:

  • Undertrained regime
  • Inefficient compute usage

If the dataset is too large relative to the model:

  • Capacity-limited regime
  • Diminishing returns

Compute-optimal balances both to avoid waste.

Scaling Strategy Implications

Compute-optimal scaling:

  • Reduces wasted compute
  • Achieves better loss per FLOP
  • Improves cost efficiency

Data-optimal scaling:

  • Pushes capability frontier
  • Increases emergent behaviors
  • Often required for breakthrough performance

Frontier labs often pursue hybrid strategies.

Alignment Implications

Data-optimal scaling:

  • Increases capability rapidly
  • May widen capability–alignment gap
  • Requires stronger governance

Compute-optimal scaling:

  • Improves efficiency
  • Slows reckless capability expansion
  • More economically sustainable

Scaling strategy influences risk trajectory.

Governance Perspective

Compute-optimal scaling:

  • Enables resource planning
  • Predictable cost-performance curves
  • Easier benchmarking

Data-optimal scaling:

  • May accelerate strategic capability jumps
  • Raises oversight challenges
  • Requires policy discussion on scaling pace

Scaling choices are strategic decisions.

Relation to Emergence vs Smooth Scaling

Data-optimal scaling may:

  • Accelerate apparent emergent behaviors
  • Push models into new capability regimes

Compute-optimal scaling may:

  • Maintain smoother capability growth
  • Delay crossing critical thresholds

Scaling policy influences emergence timing.

Practical Considerations

Compute-Optimal:

  • Used in production planning
  • Important for startups and cloud efficiency

Data-Optimal:

  • Used in frontier model training
  • Requires massive infrastructure

Trade-off is economic as well as technical.

Summary

Compute-Optimal Scaling:

  • Balances model size and data under fixed compute.
  • Maximizes efficiency.

Data-Optimal Scaling:

  • Prioritizes maximum performance.
  • Scales aggressively beyond efficiency constraints.

Scaling strategy affects capability growth, cost, and alignment risk.

Related Concepts

  • Scaling Laws
  • Architecture Scaling Laws
  • Double Descent
  • Overparameterization vs Underparameterization
  • Emergence vs Smooth Scaling
  • Capability–Alignment Gap
  • Compute–Data Trade-offs
  • Alignment Capability Scaling