Short Definition
Entropy-SGD is an optimization algorithm that modifies standard stochastic gradient descent by explicitly favoring flat minima through local entropy maximization.
It encourages convergence to wide, stable regions of the loss landscape.
Definition
Standard SGD minimizes a loss function:
[
F(\theta) = -\frac{1}{\beta} \log \int \exp(-\beta \mathcal{L}(\theta')) \, \mathcal{N}(\theta'; \theta, \gamma^{-1} I) \, d\theta'
]
Where:
- ( \theta ) = model parameters
- ( \beta ) = inverse-temperature parameter controlling sensitivity to sharpness
- ( \gamma ) = scale of the local neighborhood (the Gaussian has variance ( \gamma^{-1} ))
This objective prefers parameter regions with many nearby low-loss solutions.
In simple terms:
Instead of minimizing the loss at a single point, Entropy-SGD minimizes a smoothed version of the loss, averaged over a local neighborhood of the current parameters.
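The smoothed objective can be checked numerically. The sketch below uses a hypothetical 1-D landscape (all names and constants are illustrative, not from the paper) and Monte Carlo-estimates ( F(\theta) ) at a sharp and a flat minimum of equal depth: the flat one scores lower.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    """Hypothetical 1-D landscape: a sharp well at 0 and a flat well at 5,
    both reaching loss 0 at their minima."""
    sharp = 1.0 - np.exp(-theta**2 / 0.001)        # narrow well, width ~0.03
    flat = 1.0 - np.exp(-(theta - 5.0)**2 / 1.0)   # wide well, width ~1
    return np.minimum(sharp, flat)

def local_entropy_objective(theta, beta=1.0, gamma=4.0, n_samples=100_000):
    """Monte Carlo estimate of F(theta) = -(1/beta) log E[exp(-beta L(theta'))]
    with theta' ~ N(theta, gamma^{-1})."""
    samples = rng.normal(theta, np.sqrt(1.0 / gamma), size=n_samples)
    return -np.log(np.mean(np.exp(-beta * loss(samples)))) / beta

F_sharp = local_entropy_objective(0.0)  # evaluated at the sharp minimum
F_flat = local_entropy_objective(5.0)   # evaluated at the flat minimum
print(f"F at sharp minimum: {F_sharp:.3f}")
print(f"F at flat minimum:  {F_flat:.3f}")
```

Both minima have identical pointwise loss, so plain SGD cannot distinguish them; the local entropy objective prefers the flat one.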
Core Idea
Sharp minima:
- Narrow valleys
- Small parameter perturbations cause large loss increase
Flat minima:
- Wide valleys
- Stable under parameter perturbations
Entropy-SGD explicitly favors flat regions.
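The contrast can be made concrete with two hypothetical 1-D quadratics that share a minimum but differ in curvature:

```python
# Hypothetical quadratics with equal minimum loss but different curvature.
def sharp_loss(theta):
    return 100.0 * theta**2   # narrow valley: high curvature

def flat_loss(theta):
    return 1.0 * theta**2     # wide valley: low curvature

eps = 0.1  # a small parameter perturbation away from the minimum at 0
print(sharp_loss(eps))  # ~1.0  -- large loss increase
print(flat_loss(eps))   # ~0.01 -- barely changes
```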
Minimal Conceptual Illustration
Standard SGD:
Descends to the nearest minimum, sharp or flat.
Entropy-SGD:
Prefers wide valleys over sharp ones.
Flatness becomes part of the objective.
Algorithm Mechanism
Entropy-SGD introduces:
- An inner loop performing local Langevin dynamics
- Noise-based sampling within parameter neighborhood
- Averaging over nearby solutions
Update structure:
- Sample local parameter perturbations via Langevin dynamics.
- Estimate the local-entropy gradient from the samples.
- Update parameters toward flat regions.
It combines optimization with controlled noise injection.
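The update structure above can be sketched as follows. This is an illustrative reading of the mechanism on a hypothetical 1-D loss with a single wide valley; the constants, landscape, and step sizes are assumptions, not the exact published recipe.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_loss(theta):
    # gradient of the hypothetical loss L(theta) = 1 - exp(-(theta - 5)^2)
    return 2.0 * (theta - 5.0) * np.exp(-(theta - 5.0) ** 2)

def entropy_sgd_step(theta, gamma=1.0, eta_outer=0.5, eta_inner=0.1,
                     inner_steps=20, noise=0.01):
    """One outer update of Entropy-SGD (a sketch).

    Inner loop: Langevin dynamics on L(theta') + (gamma/2)(theta' - theta)^2,
    keeping a running average mu of the iterates.
    Outer loop: move theta toward mu, i.e. along the estimated
    local-entropy gradient gamma * (mu - theta).
    """
    theta_prime = theta
    mu = theta
    for k in range(1, inner_steps + 1):
        g = grad_loss(theta_prime) + gamma * (theta_prime - theta)
        theta_prime -= eta_inner * g
        theta_prime += noise * np.sqrt(eta_inner) * rng.normal()  # Langevin noise
        mu += (theta_prime - mu) / (k + 1)  # running average of inner iterates
    return theta - eta_outer * gamma * (theta - mu)

theta = 3.5
for _ in range(100):
    theta = entropy_sgd_step(theta)
print(f"final theta: {theta:.2f}")  # drifts into the wide valley near 5
```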
Connection to Stochastic Gradient Flow
Entropy-SGD is closely related to:
- Stochastic gradient flow
- Langevin dynamics
- Boltzmann distributions
It approximates sampling from ( p(\theta) \propto \exp(-\beta \mathcal{L}(\theta)) ), but constrained to local neighborhoods of the current parameters.
It introduces thermodynamic intuition into optimization.
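The thermodynamic intuition can be checked directly: at finite temperature, a Boltzmann density assigns more probability mass to a flat well than to a sharp well of equal depth, because the flat well contains more low-loss volume. A minimal numerical sketch on a hypothetical 1-D double-well landscape:

```python
import numpy as np

# Hypothetical landscape: sharp well at 0, flat well at 5, equal depth 0.
theta = np.linspace(-2.0, 8.0, 200_001)
sharp = 1.0 - np.exp(-theta**2 / 0.001)
flat = 1.0 - np.exp(-(theta - 5.0)**2 / 1.0)
L = np.minimum(sharp, flat)

beta = 1.0
p_unnorm = np.exp(-beta * L)  # unnormalised Boltzmann density
dtheta = theta[1] - theta[0]

# Probability mass in equal-width windows around each minimum.
near_sharp = np.abs(theta - 0.0) < 1.5
near_flat = np.abs(theta - 5.0) < 1.5
mass_sharp = p_unnorm[near_sharp].sum() * dtheta
mass_flat = p_unnorm[near_flat].sum() * dtheta
print(mass_flat / mass_sharp)  # > 1: the flat well holds more mass
```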
Sharp vs Flat Minima
Empirical findings suggest:
- Flat minima correlate with better generalization.
- Entropy-SGD tends to find flatter regions.
- Generalization performance can improve.
It explicitly optimizes for flatness rather than relying on implicit noise.
Relationship to SAM (Sharpness-Aware Minimization)
Entropy-SGD and SAM both:
- Penalize sharp minima
- Encourage stability
Difference:
- SAM perturbs parameters adversarially.
- Entropy-SGD samples stochastically.
- SAM is computationally cheaper.
SAM is more widely adopted in practice.
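The shared penalty on sharp minima can be illustrated in 1-D, where SAM's inner maximization can be computed exactly by grid search. The landscape below is hypothetical; real SAM approximates the inner max with a single gradient-ascent step ( \epsilon = \rho \, \nabla \mathcal{L} / \lVert \nabla \mathcal{L} \rVert ).

```python
import numpy as np

def loss(theta):
    # hypothetical landscape: sharp well at 0, flat well at 5, equal depth
    sharp = 1.0 - np.exp(-theta**2 / 0.001)
    flat = 1.0 - np.exp(-(theta - 5.0)**2 / 1.0)
    return np.minimum(sharp, flat)

def sam_objective(theta, rho=0.1, grid=201):
    """SAM's objective max_{|eps| <= rho} L(theta + eps),
    evaluated exactly by grid search over the 1-D neighborhood."""
    eps = np.linspace(-rho, rho, grid)
    return np.max(loss(theta + eps))

print(sam_objective(0.0))  # ~1.0:  the sharp minimum is heavily penalised
print(sam_objective(5.0))  # ~0.01: the flat minimum barely changes
```

Both methods thus assign a high effective loss to the sharp minimum; they differ in whether the neighborhood is probed adversarially (SAM) or averaged stochastically (Entropy-SGD).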
Computational Trade-Off
Entropy-SGD:
- Requires inner-loop sampling.
- Higher computational cost.
- More complex tuning.
Due to its cost, it is less common in large-scale LLM training, but it remains conceptually influential.
Generalization Implications
Entropy-SGD:
- Can reduce overfitting.
- Can improve robustness to perturbations.
- Encourages more stable representations.
It operationalizes the flat-minima hypothesis.
Alignment Perspective
Flat minima may:
- Improve robustness under distribution shift.
- Reduce overconfident misbehavior.
- Stabilize optimization of proxy objectives.
Explicit flatness optimization may reduce alignment fragility.
However:
Stronger optimization may also increase capability.
Trade-offs remain.
Governance Perspective
Entropy-aware optimization affects:
- Reliability
- Stability
- Risk under perturbations
- Generalization safety
Flatness-oriented methods may improve robustness in safety-critical systems.
Optimization strategy is part of risk management.
Summary
Entropy-SGD:
- Optimizes local entropy rather than pointwise loss.
- Prefers flat minima.
- Uses stochastic inner-loop sampling.
- Connects optimization to statistical physics.
- Influenced later flatness-based methods like SAM.
It makes flatness an explicit objective.
Related Concepts
- Stochastic Gradient Flow
- Sharp vs Flat Minima
- Implicit Regularization
- Large Batch vs Small Batch Training
- Sharpness-Aware Minimization (SAM)
- Loss Landscape Geometry
- Optimization Stability
- Langevin Dynamics