Short Definition
SGD vs Adam compares two major optimization algorithms in deep learning: Stochastic Gradient Descent (SGD), which applies a single global learning rate to every parameter, and Adam, which adapts the learning rate per parameter using running moment estimates of the gradient.
It contrasts SGD's simplicity and generalization stability with Adam's adaptive, faster early convergence.
Definition
Training neural networks requires updating parameters using gradient information.
Two widely used optimizers are:
Stochastic Gradient Descent (SGD)
[
\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)
]
Where:
- ( \eta ) = learning rate
- ( \nabla_\theta \mathcal{L} ) = gradient
SGD may include momentum:
[
v_t = \beta v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t)
]
[
\theta_{t+1} = \theta_t - \eta v_t
]
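As a minimal sketch in plain Python (scalar parameter; the learning rate and momentum values are illustrative, not recommendations), one momentum update per step looks like:

```python
def sgd_momentum_step(theta, grad, velocity, lr, beta):
    """One SGD-with-momentum update:
    v_t = beta * v_{t-1} + grad, then theta_{t+1} = theta_t - lr * v_t."""
    velocity = beta * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity

# Minimize f(theta) = theta**2 (gradient 2 * theta) starting from theta = 5.0.
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = sgd_momentum_step(theta, 2 * theta, v, lr=0.05, beta=0.9)
print(theta)  # near 0 after 200 steps
```

The velocity buffer accumulates past gradients, which smooths the update direction across steps.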
Adam (Adaptive Moment Estimation)
Adam maintains moving averages of:
- First moment (mean of gradients)
- Second moment (variance of gradients)
[
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t
]
[
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
]
The moment estimates are bias-corrected before use (this counteracts their initialization at zero):
[
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}
]
The per-parameter update is then:
[
\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
]
Adam adapts learning rates per parameter.
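A minimal per-step sketch in plain Python, including the bias correction (the lr, beta1, beta2, and eps values shown are the commonly cited defaults, used here for illustration):

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; each parameter gets its own
    adaptive step via its second-moment estimate."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]   # bias-corrected first moment
    v_hat = [vi / (1 - beta2 ** t) for vi in v]   # bias-corrected second moment
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Two parameters with wildly different gradient scales receive
# similar-sized first steps, because each is normalized by sqrt(v_hat).
theta, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
theta, m, v = adam_step(theta, [100.0, 0.01], m, v, t=1)
print(theta)  # both parameters moved by roughly lr = 0.001
```

Note how the gradient magnitudes (100.0 vs 0.01) differ by four orders of magnitude, yet the resulting steps are nearly identical.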
Core Difference
| Aspect | SGD | Adam |
|---|---|---|
| Learning rate | Global | Per-parameter |
| Convergence speed | Slower initially | Faster initially |
| Generalization | Often better | Sometimes worse |
| Hyperparameter sensitivity | Higher | Lower |
| Memory cost | Low | Higher |
SGD is simple and stable.
Adam is adaptive and faster early in training.
Minimal Conceptual Illustration
SGD:
All parameters updated with same step size.
Adam:
Each parameter has its own adaptive step size.
Large gradients → smaller steps.
Small gradients → larger steps.
Adam normalizes gradient magnitudes automatically.
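This normalization can be shown numerically. Once Adam's second-moment estimate has adapted to a steady gradient, the step magnitude approaches ( \eta ) regardless of gradient scale (a sign-like update), while SGD's step stays proportional to the gradient:

```python
import math

lr = 0.1
grads = [10.0, 0.1]

# SGD: step size is proportional to the gradient magnitude.
sgd_steps = [lr * g for g in grads]

# Adam in the adapted limit (v_hat ~ g^2, eps negligible):
# step magnitude approaches lr for every parameter.
adam_steps = [lr * math.copysign(1.0, g) for g in grads]

print(sgd_steps)   # spans two orders of magnitude
print(adam_steps)  # uniform magnitude, both equal to lr
```

This is an idealized limit, not Adam's exact behavior early in training, but it captures why adaptive steps equalize progress across parameters.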
Convergence Behavior
Adam often:
- Converges faster in early epochs.
- Requires less manual learning rate tuning.
- Handles sparse gradients well.
SGD often:
- Converges more slowly.
- Achieves better final generalization in vision tasks.
- Benefits from carefully designed learning rate schedules.
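One widely used schedule for SGD is cosine annealing. A minimal sketch (the base and minimum learning rates are illustrative values):

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Cosine-annealed learning rate: decays smoothly from base_lr to min_lr."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0, 100))    # 0.1 at the start
print(cosine_lr(50, 100))   # ~0.05 midway
print(cosine_lr(100, 100))  # decays to ~0.0 at the end
```

The large early learning rate injects gradient noise that aids exploration; the decayed late learning rate lets SGD settle into a minimum.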
Generalization Trade-Off
Empirical findings show:
- SGD frequently generalizes better in large-scale vision models.
- Adam may overfit or converge to sharper minima.
- Switching from Adam to SGD late in training can improve performance.
The difference is linked to optimization geometry and implicit regularization.
Loss Landscape Perspective
Adam’s adaptive updates may:
- Follow sharp curvature directions aggressively.
- Converge to narrow minima.
SGD’s uniform noise may:
- Encourage flatter minima.
- Improve generalization robustness.
Flat minima are often associated with better out-of-distribution stability.
Scaling Context
In large Transformer models:
- Adam (or AdamW) is dominant.
- Adaptive updates help stabilize large parameter spaces.
- LayerNorm interacts well with Adam.
In convolutional vision models:
- SGD with momentum remains common.
Optimizer choice depends on architecture.
Computational Considerations
Adam requires:
- Storing first and second moment estimates.
- More memory per parameter.
SGD requires:
- Minimal additional memory.
At massive scale, memory cost matters.
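A back-of-envelope sketch of the extra optimizer state (assuming fp32 buffers; real training systems may use mixed precision or sharded optimizer states, which change these numbers):

```python
def optimizer_state_bytes(num_params, optimizer, dtype_bytes=4):
    """Rough extra optimizer-state memory, beyond the parameters themselves."""
    if optimizer == "sgd":
        return 0                                  # no extra buffers
    if optimizer == "sgd_momentum":
        return num_params * dtype_bytes           # one velocity buffer
    if optimizer == "adam":
        return 2 * num_params * dtype_bytes       # first + second moment buffers
    raise ValueError(f"unknown optimizer: {optimizer}")

# A hypothetical 7e9-parameter model in fp32: Adam adds ~56 GB of state.
print(optimizer_state_bytes(7_000_000_000, "adam") / 1e9)  # 56.0 (GB)
```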
Alignment Perspective
Optimizer choice influences:
- Training stability
- Convergence behavior
- Sensitivity to reward shaping
- Optimization strength
Stronger adaptive optimization may:
- Accelerate reward exploitation.
- Increase proxy objective overfitting.
- Amplify Goodhart effects.
Optimization dynamics affect alignment robustness.
Governance Perspective
Large-scale model training decisions include:
- Optimizer selection
- Compute efficiency trade-offs
- Stability guarantees
Optimization strategy affects:
- Reproducibility
- Resource usage
- Risk of instability
Optimizer design is part of system-level governance.
When to Use Each
SGD:
- Vision models
- Large datasets
- When generalization is primary goal
Adam:
- Transformers
- Sparse gradients
- Rapid experimentation
- Large-scale LLM training
Hybrid strategies, such as training with Adam early and switching to SGD late, are also used.
Summary
SGD:
- Simple, stable, strong generalization.
- Requires careful learning rate scheduling.
Adam:
- Adaptive, faster early convergence.
- More memory usage.
- Dominant in Transformer training.
Optimizer choice affects convergence dynamics, generalization behavior, and alignment stability.
Related Concepts
- Optimization
- Optimizers
- Learning Rate Schedules
- Convergence
- Gradient Noise
- Loss Landscape Geometry
- Weight Decay
- AdamW
- Optimization Stability