Short Definition
Compute-Optimal vs Data-Optimal Scaling contrasts two strategies for training large neural networks. Compute-optimal scaling balances model size and dataset size to minimize loss under a fixed compute budget; data-optimal scaling prioritizes using as much high-quality data as possible to maximize performance, regardless of compute efficiency.
It distinguishes efficiency-driven scaling from data-driven scaling.
Definition
Modern scaling research studies how performance improves as we increase:
- Model parameters ( N )
- Dataset size ( D )
- Training compute ( C )
Performance typically follows power-law relationships.
Two different optimization targets emerge:
Compute-Optimal Scaling
Given a fixed compute budget ( C ), choose model size and dataset size to minimize loss.
Compute roughly scales as:
\[
C \propto N \times D
\]
Under this constraint, there exists an optimal balance:
- Too large model + too little data → undertrained
- Too much data + too small model → capacity bottleneck
Compute-optimal scaling balances both.
This principle underlies Chinchilla-style scaling laws.
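As a concrete sketch, the budget split can be computed using two common approximations: C ≈ 6ND for transformer training FLOPs, and the Chinchilla rule of thumb of roughly 20 tokens per parameter. Both constants are empirical estimates, and the function name is illustrative:

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Split a fixed compute budget between parameters (N) and tokens (D).

    Assumes C ~ 6 * N * D (a standard approximation for transformer
    training FLOPs) and D ~ 20 * N (the Chinchilla rule of thumb).
    Both constants are empirical estimates, not exact values.
    """
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e21 FLOP budget yields a model of a few billion parameters
n, d = chinchilla_allocation(1e21)
print(f"params = {n:.3g}, tokens = {d:.3g}")
```

Doubling the compute budget increases the optimal model size by only sqrt(2) under this rule, which is why compute-optimal models are smaller than naive scaling suggests.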
Data-Optimal Scaling
Data-optimal scaling asks:
Given unlimited compute, how much data should be used to achieve the best performance?
This regime:
- Prioritizes large, high-quality datasets
- Often implies larger models
- Is not constrained by compute budget
- Optimizes final capability rather than efficiency
It emphasizes maximal performance rather than cost-efficiency.
Core Difference
| Aspect | Compute-Optimal | Data-Optimal |
|---|---|---|
| Constraint | Fixed compute budget | Data availability and quality |
| Objective | Minimize loss per compute | Maximize final performance |
| Trade-off | Balance N and D | Increase both aggressively |
| Efficiency | High | Secondary |
| Used in | Industrial training planning | Frontier research |
Compute-optimal focuses on efficiency.
Data-optimal focuses on capability.
Minimal Conceptual Illustration
Compute-Optimal:
Fixed budget → best N/D ratio.
Data-Optimal:
Unlimited budget → scale data and model.
Different optimization targets produce different training strategies.
Scaling Law Context
Empirical scaling laws show:
\[
L(N, D) \approx a N^{-\alpha} + b D^{-\beta}
\]
Compute-optimal solutions lie along curves satisfying:
\[
N \propto D
\]
Under fixed compute, optimal models are often smaller than naive scaling would suggest.
Chinchilla-style analysis demonstrated that many large models were trained on far fewer tokens than compute-optimal allocation prescribes; they were undertrained relative to their parameter count.
Undertraining vs Overtraining
If the model is too large relative to the data:
- Undertrained regime
- Inefficient compute usage
If the dataset is too large relative to the model:
- Capacity-limited regime
- Diminishing returns
Compute-optimal balances both to avoid waste.
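These regimes can be sketched as a simple tokens-per-parameter check. The thresholds below are illustrative, not established constants:

```python
def training_regime(n_params, n_tokens, low=5.0, high=100.0):
    """Rough regime check via tokens-per-parameter ratio.

    The cutoffs are illustrative placeholders; the Chinchilla rule of
    thumb puts the balanced point near 20 tokens per parameter.
    """
    ratio = n_tokens / n_params
    if ratio < low:
        return "undertrained"        # model too large for the data
    if ratio > high:
        return "capacity-limited"    # data exceeds what the model can absorb
    return "balanced"

# A GPT-3-scale configuration (175B params, 300B tokens): ~1.7 tokens/param
print(training_regime(175e9, 300e9))   # → undertrained
```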
Scaling Strategy Implications
Compute-optimal scaling:
- Reduces wasted compute
- Achieves better loss per FLOP
- Improves cost efficiency
Data-optimal scaling:
- Pushes capability frontier
- Increases emergent behaviors
- Often required for breakthrough performance
Frontier labs often pursue hybrid strategies.
Alignment Implications
Data-optimal scaling:
- Increases capability rapidly
- May widen capability–alignment gap
- Requires stronger governance
Compute-optimal scaling:
- Improves efficiency
- Slows reckless capability expansion
- More economically sustainable
Scaling strategy influences risk trajectory.
Governance Perspective
Compute-optimal scaling:
- Enables resource planning
- Predictable cost-performance curves
- Easier benchmarking
Data-optimal scaling:
- May accelerate strategic capability jumps
- Raises oversight challenges
- Requires policy discussion on scaling pace
Scaling choices are strategic decisions.
Relation to Emergence vs Smooth Scaling
Data-optimal scaling may:
- Accelerate apparent emergent behaviors
- Push models into new capability regimes
Compute-optimal scaling may:
- Maintain smoother capability growth
- Delay crossing critical thresholds
Scaling policy influences emergence timing.
Practical Considerations
Compute-Optimal:
- Used in production planning
- Important for startups and cloud efficiency
Data-Optimal:
- Used in frontier model training
- Requires massive infrastructure
Trade-off is economic as well as technical.
Summary
Compute-Optimal Scaling:
- Balances model size and data under fixed compute.
- Maximizes efficiency.
Data-Optimal Scaling:
- Prioritizes maximum performance.
- Scales aggressively beyond efficiency constraints.
Scaling strategy affects capability growth, cost, and alignment risk.
Related Concepts
- Scaling Laws
- Architecture Scaling Laws
- Double Descent
- Overparameterization vs Underparameterization
- Emergence vs Smooth Scaling
- Capability–Alignment Gap
- Compute–Data Trade-offs
- Alignment Capability Scaling