SGD vs Adam (Comparison)

Short Definition

SGD vs Adam compares two major optimization algorithms in deep learning: Stochastic Gradient Descent (SGD), which applies a single global learning rate to all parameters, and Adam, which adapts the learning rate per parameter using moment estimates of the gradient.

The comparison contrasts SGD's simplicity and generalization stability with Adam's adaptive, faster early convergence.

Definition

Training neural networks requires updating parameters using gradient information.

Two widely used optimizers are:

Stochastic Gradient Descent (SGD)

[
\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)
]

Where:

  • ( \eta ) = learning rate
  • ( \nabla_\theta \mathcal{L} ) = gradient

SGD may include momentum:

[
v_t = \beta v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t)
]

[
\theta_{t+1} = \theta_t - \eta v_t
]
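The two momentum equations above can be sketched as a toy update rule. This is an illustrative pure-Python implementation over a list of floats, not a real optimizer API; the quadratic objective and hyperparameter values are assumptions chosen for the demo.

```python
# Minimal sketch of SGD with momentum: v = beta*v + g, then theta -= lr * v.
# Illustrative only; real code would use a library optimizer.

def sgd_momentum_step(params, grads, velocity, lr=0.1, beta=0.9):
    """Apply one SGD-with-momentum update in place and return the state."""
    for i in range(len(params)):
        velocity[i] = beta * velocity[i] + grads[i]
        params[i] -= lr * velocity[i]
    return params, velocity

# Example: minimize f(x) = x^2 (gradient 2x), starting at x = 1.0
params, vel = [1.0], [0.0]
for _ in range(100):
    grads = [2.0 * params[0]]
    params, vel = sgd_momentum_step(params, grads, vel)
print(params[0])  # close to the minimum at x = 0
```

Note that this is the "heavy-ball" formulation matching the equations above; some libraries instead scale the gradient term or dampen the velocity.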

Adam (Adaptive Moment Estimation)

Adam maintains moving averages of:

  • First moment (mean of gradients)
  • Second moment (variance of gradients)

[
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t
]

[
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
]

where ( g_t = \nabla_\theta \mathcal{L}(\theta_t) ). The moments are bias-corrected before use:

[
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
]

[
\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
]

Adam adapts learning rates per parameter.
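The Adam equations can likewise be sketched in a few lines of pure Python. This is a toy, self-contained version with bias correction; the quadratic test problem and learning rate are illustrative assumptions, and real training code would use a library optimizer such as `torch.optim.Adam`.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected moment estimates (t starts at 1)."""
    for i in range(len(params)):
        m[i] = b1 * m[i] + (1 - b1) * grads[i]       # first moment (mean)
        v[i] = b2 * v[i] + (1 - b2) * grads[i] ** 2  # second moment (uncentered variance)
        m_hat = m[i] / (1 - b1 ** t)                 # bias correction
        v_hat = v[i] / (1 - b2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

# Example: minimize f(x) = x^2 starting from x = 1.0
params, m, v = [1.0], [0.0], [0.0]
for t in range(1, 201):
    grads = [2.0 * params[0]]
    params = adam_step(params, grads, m, v, t)
print(params[0])  # settles near the minimum at x = 0
```

The bias correction matters early in training: without it, `m` and `v` are initialized at zero and underestimate the true moments for small `t`.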

Core Difference

| Aspect | SGD | Adam |
|---|---|---|
| Learning rate | Global | Per-parameter |
| Convergence speed | Slower initially | Faster initially |
| Generalization | Often better | Sometimes worse |
| Hyperparameter sensitivity | Higher | Lower |
| Memory cost | Low | Higher |

SGD is simple and stable.
Adam is adaptive and faster early in training.

Minimal Conceptual Illustration


SGD:
All parameters updated with same step size.

Adam:
Each parameter has its own adaptive step size.
Large gradients → smaller steps.
Small gradients → larger steps.

Adam normalizes gradient magnitudes automatically.
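This normalization can be checked numerically. The sketch below (an illustrative helper, not a library function) computes the size of Adam's very first bias-corrected update: it comes out near ( \eta ) regardless of gradient magnitude, whereas the plain SGD step scales linearly with the gradient.

```python
import math

def first_adam_step_size(g, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Size of Adam's first update (t = 1) for a scalar gradient g."""
    m = (1 - b1) * g
    v = (1 - b2) * g * g
    m_hat = m / (1 - b1)          # bias correction at t = 1
    v_hat = v / (1 - b2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

for g in (1e-4, 1.0, 1e4):
    # Adam's step stays near lr = 0.001; SGD's step is lr * g.
    print(g, first_adam_step_size(g), 0.001 * g)
```

After many steps the moving averages make the picture less clean, but the first-step case shows the per-parameter rescaling directly.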

Convergence Behavior

Adam often:

  • Converges faster in early epochs.
  • Requires less manual learning rate tuning.
  • Handles sparse gradients well.

SGD often:

  • Converges more slowly.
  • Achieves better final generalization in vision tasks.
  • Benefits from carefully designed learning rate schedules.
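Two common schedule shapes for SGD can be sketched as small helpers. The function names and default values here are illustrative assumptions, not a standard API; libraries provide equivalents such as step and cosine-annealing schedulers.

```python
import math

def step_decay(base_lr, epoch, drop=0.1, every=30):
    """Multiply the learning rate by `drop` every `every` epochs."""
    return base_lr * (drop ** (epoch // every))

def cosine_decay(base_lr, epoch, total_epochs):
    """Cosine annealing from base_lr at epoch 0 down to 0 at total_epochs."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))

print(step_decay(0.1, 0), step_decay(0.1, 30), step_decay(0.1, 60))
print(cosine_decay(0.1, 0, 90), cosine_decay(0.1, 45, 90))
```

Step decay is the classic recipe for convolutional vision models; cosine annealing decays more smoothly and avoids abrupt jumps in the loss.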

Generalization Trade-Off

Empirical findings show:

  • SGD frequently generalizes better in large-scale vision models.
  • Adam may overfit or converge to sharper minima.
  • Switching from Adam to SGD late in training can improve performance.

The difference is linked to optimization geometry and implicit regularization.


Loss Landscape Perspective

Adam’s adaptive updates may:

  • Follow sharp curvature directions aggressively.
  • Converge to narrow minima.

SGD’s uniform noise may:

  • Encourage flatter minima.
  • Improve generalization robustness.

Flat minima are often associated with better out-of-distribution stability.

Scaling Context

In large Transformer models:

  • Adam (or AdamW) is dominant.
  • Adaptive updates help stabilize large parameter spaces.
  • Layer normalization interacts well with adaptive per-parameter updates.

In convolutional vision models:

  • SGD with momentum remains common.

Optimizer choice depends on architecture.

Computational Considerations

Adam requires:

  • Storing first and second moment estimates.
  • More memory per parameter.

SGD requires:

  • Minimal additional memory.

At massive scale, memory cost matters.
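A back-of-envelope calculation makes the difference concrete. The model size and precision below are illustrative assumptions (fp32 optimizer state, no sharding): Adam keeps two extra values per parameter (m and v), SGD with momentum keeps one.

```python
def optimizer_state_gb(num_params, floats_per_param, bytes_per_float=4):
    """Optimizer-state memory in GB, assuming fp32 state per parameter."""
    return num_params * floats_per_param * bytes_per_float / 1e9

n = 7_000_000_000  # hypothetical 7B-parameter model
print("Adam state:", optimizer_state_gb(n, 2), "GB")          # m and v
print("SGD+momentum state:", optimizer_state_gb(n, 1), "GB")  # velocity only
```

In practice this overhead motivates techniques such as optimizer-state sharding and lower-precision state.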

Alignment Perspective

Optimizer choice influences:

  • Training stability
  • Convergence behavior
  • Sensitivity to reward shaping
  • Optimization strength

Stronger adaptive optimization may:

  • Accelerate reward exploitation.
  • Increase proxy objective overfitting.
  • Amplify Goodhart effects.

Optimization dynamics affect alignment robustness.

Governance Perspective

Large-scale model training decisions include:

  • Optimizer selection
  • Compute efficiency trade-offs
  • Stability guarantees

Optimization strategy affects:

  • Reproducibility
  • Resource usage
  • Risk of instability

Optimizer design is part of system-level governance.

When to Use Each

SGD:

  • Vision models
  • Large datasets
  • When generalization is primary goal

Adam:

  • Transformers
  • Sparse gradients
  • Rapid experimentation
  • Large-scale LLM training

Hybrid strategies are common.

Summary

SGD:

  • Simple, stable, strong generalization.
  • Requires careful learning rate scheduling.

Adam:

  • Adaptive, faster early convergence.
  • More memory usage.
  • Dominant in Transformer training.

Optimizer choice affects convergence dynamics, generalization behavior, and alignment stability.

Related Concepts

  • Optimization
  • Optimizers
  • Learning Rate Schedules
  • Convergence
  • Gradient Noise
  • Loss Landscape Geometry
  • Weight Decay
  • AdamW
  • Optimization Stability