Short Definition
Entropy-SGD is an optimization algorithm that modifies standard stochastic gradient descent by explicitly favoring flat minima through local entropy maximization.
It encourages convergence to wide, stable regions of the loss landscape.
Definition
Standard SGD minimizes a loss function:
[
F(\theta) = -\frac{1}{\beta} \log \int \exp(-\beta \mathcal{L}(\theta')) \, \mathcal{N}(\theta'; \theta, \gamma^{-1} I) \, d\theta'
]
Where:
- ( \theta ) = model parameters
- ( \beta ) = inverse-temperature parameter controlling sensitivity to sharpness
- ( \gamma ) = scale of the local neighborhood (the Gaussian has variance ( \gamma^{-1} ))
This objective prefers parameter regions with many nearby low-loss solutions.
In simple terms:
Instead of minimizing the loss at a single point, Entropy-SGD minimizes a smoothed version of the loss, averaged over a local neighborhood of the current parameters.
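The smoothed objective can be checked numerically. The sketch below uses a hypothetical 1-D landscape (all names and constants are illustrative, not from the paper) and Monte Carlo-estimates ( F(\theta) ) at a sharp and a flat minimum of equal depth: the flat one scores lower.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    """Hypothetical 1-D landscape: a sharp well at 0 and a flat well at 5,
    both reaching loss 0 at their minima."""
    sharp = 1.0 - np.exp(-theta**2 / 0.001)        # narrow well, width ~0.03
    flat = 1.0 - np.exp(-(theta - 5.0)**2 / 1.0)   # wide well, width ~1
    return np.minimum(sharp, flat)

def local_entropy_objective(theta, beta=1.0, gamma=4.0, n_samples=100_000):
    """Monte Carlo estimate of F(theta) = -(1/beta) log E[exp(-beta L(theta'))]
    with theta' ~ N(theta, gamma^{-1})."""
    samples = rng.normal(theta, np.sqrt(1.0 / gamma), size=n_samples)
    return -np.log(np.mean(np.exp(-beta * loss(samples)))) / beta

F_sharp = local_entropy_objective(0.0)  # evaluated at the sharp minimum
F_flat = local_entropy_objective(5.0)   # evaluated at the flat minimum
print(f"F at sharp minimum: {F_sharp:.3f}")
print(f"F at flat minimum:  {F_flat:.3f}")
```

Both minima have identical pointwise loss, so plain SGD cannot distinguish them; the local entropy objective prefers the flat one.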
Core Idea
Sharp minima:
- Narrow valleys
- Small parameter perturbations cause large loss increase
Flat minima:
- Wide valleys
- Stable under parameter perturbations
Entropy-SGD explicitly favors flat regions.
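The contrast can be made concrete with two hypothetical 1-D quadratics that share a minimum but differ in curvature:

```python
# Hypothetical quadratics with equal minimum loss but different curvature.
def sharp_loss(theta):
    return 100.0 * theta**2   # narrow valley: high curvature

def flat_loss(theta):
    return 1.0 * theta**2     # wide valley: low curvature

eps = 0.1  # a small parameter perturbation away from the minimum at 0
print(sharp_loss(eps))  # ~1.0  -- large loss increase
print(flat_loss(eps))   # ~0.01 -- barely changes
```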
Minimal Conceptual Illustration
Standard SGD:
Descends to the nearest minimum, sharp or flat.
Entropy-SGD:
Prefers wide valleys over sharp ones.
Flatness becomes part of the objective.
Algorithm Mechanism
Entropy-SGD introduces:
- An inner loop performing local Langevin dynamics
- Noise-based sampling within parameter neighborhood
- Averaging over nearby solutions
Update structure:
- Sample local parameter perturbations via Langevin dynamics.
- Estimate the local-entropy gradient from the samples.
- Update parameters toward flat regions.
It combines optimization with controlled noise injection.
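The update structure above can be sketched as follows. This is an illustrative reading of the mechanism on a hypothetical 1-D loss with a single wide valley; the constants, landscape, and step sizes are assumptions, not the exact published recipe.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_loss(theta):
    # gradient of the hypothetical loss L(theta) = 1 - exp(-(theta - 5)^2)
    return 2.0 * (theta - 5.0) * np.exp(-(theta - 5.0) ** 2)

def entropy_sgd_step(theta, gamma=1.0, eta_outer=0.5, eta_inner=0.1,
                     inner_steps=20, noise=0.01):
    """One outer update of Entropy-SGD (a sketch).

    Inner loop: Langevin dynamics on L(theta') + (gamma/2)(theta' - theta)^2,
    keeping a running average mu of the iterates.
    Outer loop: move theta toward mu, i.e. along the estimated
    local-entropy gradient gamma * (mu - theta).
    """
    theta_prime = theta
    mu = theta
    for k in range(1, inner_steps + 1):
        g = grad_loss(theta_prime) + gamma * (theta_prime - theta)
        theta_prime -= eta_inner * g
        theta_prime += noise * np.sqrt(eta_inner) * rng.normal()  # Langevin noise
        mu += (theta_prime - mu) / (k + 1)  # running average of inner iterates
    return theta - eta_outer * gamma * (theta - mu)

theta = 3.5
for _ in range(100):
    theta = entropy_sgd_step(theta)
print(f"final theta: {theta:.2f}")  # drifts into the wide valley near 5
```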
Connection to Stochastic Gradient Flow
Entropy-SGD is closely related to:
- Stochastic gradient flow
- Langevin dynamics
- Boltzmann distributions
It approximates sampling from ( p(\theta) \propto \exp(-\beta \mathcal{L}(\theta)) ), but constrained to local neighborhoods of the current parameters.
It introduces thermodynamic intuition into optimization.
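The thermodynamic intuition can be checked directly: at finite temperature, a Boltzmann density assigns more probability mass to a flat well than to a sharp well of equal depth, because the flat well contains more low-loss volume. A minimal numerical sketch on a hypothetical 1-D double-well landscape:

```python
import numpy as np

# Hypothetical landscape: sharp well at 0, flat well at 5, equal depth 0.
theta = np.linspace(-2.0, 8.0, 200_001)
sharp = 1.0 - np.exp(-theta**2 / 0.001)
flat = 1.0 - np.exp(-(theta - 5.0)**2 / 1.0)
L = np.minimum(sharp, flat)

beta = 1.0
p_unnorm = np.exp(-beta * L)  # unnormalised Boltzmann density
dtheta = theta[1] - theta[0]

# Probability mass in equal-width windows around each minimum.
near_sharp = np.abs(theta - 0.0) < 1.5
near_flat = np.abs(theta - 5.0) < 1.5
mass_sharp = p_unnorm[near_sharp].sum() * dtheta
mass_flat = p_unnorm[near_flat].sum() * dtheta
print(mass_flat / mass_sharp)  # > 1: the flat well holds more mass
```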
Sharp vs Flat Minima
Empirical findings suggest:
- Flat minima correlate with better generalization.
- Entropy-SGD tends to find flatter regions.
- Generalization performance can improve.
It explicitly optimizes for flatness rather than relying on implicit noise.
Relationship to SAM (Sharpness-Aware Minimization)
Entropy-SGD and SAM both:
- Penalize sharp minima
- Encourage stability
Difference:
- SAM perturbs parameters adversarially.
- Entropy-SGD samples stochastically.
- SAM is computationally cheaper.
SAM is more widely adopted in practice.
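The shared penalty on sharp minima can be illustrated in 1-D, where SAM's inner maximization can be computed exactly by grid search. The landscape below is hypothetical; real SAM approximates the inner max with a single gradient-ascent step ( \epsilon = \rho \, \nabla \mathcal{L} / \lVert \nabla \mathcal{L} \rVert ).

```python
import numpy as np

def loss(theta):
    # hypothetical landscape: sharp well at 0, flat well at 5, equal depth
    sharp = 1.0 - np.exp(-theta**2 / 0.001)
    flat = 1.0 - np.exp(-(theta - 5.0)**2 / 1.0)
    return np.minimum(sharp, flat)

def sam_objective(theta, rho=0.1, grid=201):
    """SAM's objective max_{|eps| <= rho} L(theta + eps),
    evaluated exactly by grid search over the 1-D neighborhood."""
    eps = np.linspace(-rho, rho, grid)
    return np.max(loss(theta + eps))

print(sam_objective(0.0))  # ~1.0:  the sharp minimum is heavily penalised
print(sam_objective(5.0))  # ~0.01: the flat minimum barely changes
```

Both methods thus assign a high effective loss to the sharp minimum; they differ in whether the neighborhood is probed adversarially (SAM) or averaged stochastically (Entropy-SGD).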
Computational Trade-Off
Entropy-SGD:
- Requires inner-loop sampling.
- Higher computational cost.
- More complex tuning.
Due to its cost, it is less common in large-scale LLM training, but it remains conceptually influential.
Generalization Implications
Entropy-SGD:
- Can reduce overfitting.
- Can improve robustness to perturbations.
- Encourages more stable representations.
It operationalizes the flat-minima hypothesis.
Alignment Perspective
Flat minima may:
- Improve robustness under distribution shift.
- Reduce overconfident misbehavior.
- Stabilize optimization of proxy objectives.
Explicit flatness optimization may reduce alignment fragility.
However:
Stronger optimization may also increase capability.
Trade-offs remain.
Governance Perspective
Entropy-aware optimization affects:
- Reliability
- Stability
- Risk under perturbations
- Generalization safety
Flatness-oriented methods may improve robustness in safety-critical systems.
Optimization strategy is part of risk management.
Summary
Entropy-SGD:
- Optimizes local entropy rather than pointwise loss.
- Prefers flat minima.
- Uses stochastic inner-loop sampling.
- Connects optimization to statistical physics.
- Influenced later flatness-based methods like SAM.
It makes flatness an explicit objective.
Related Concepts
- Stochastic Gradient Flow
- Sharp vs Flat Minima
- Implicit Regularization
- Large Batch vs Small Batch Training
- Sharpness-Aware Minimization (SAM)
- Loss Landscape Geometry
- Optimization Stability
- Langevin Dynamics