Short Definition
L1 vs L2 Regularization compares two penalty-based techniques for reducing overfitting in neural networks: L1 regularization encourages sparsity by penalizing the absolute values of the weights, while L2 regularization penalizes squared weight magnitudes, shrinking all weights smoothly toward zero.
They differ in their sparsity behavior and geometric effect.
Definition
Regularization adds a penalty term to the loss function to prevent overfitting.
Given a base loss:
[
\mathcal{L}_{data}(\theta)
]
The regularized loss then takes one of two forms:
L1 Regularization
[
\mathcal{L}(\theta) = \mathcal{L}_{data}(\theta) + \lambda \|\theta\|_1
]
Where:
[
\|\theta\|_1 = \sum_i |\theta_i|
]
L2 Regularization
[
\mathcal{L}(\theta) = \mathcal{L}_{data}(\theta) + \lambda \|\theta\|_2^2
]
Where:
[
\|\theta\|_2^2 = \sum_i \theta_i^2
]
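The two penalty terms can be sketched directly in code. This is a minimal illustration; the names `theta` and `lam` are hypothetical stand-ins for the parameter vector and regularization strength:

```python
def l1_penalty(theta, lam):
    """lam * sum_i |theta_i| — the L1 penalty term."""
    return lam * sum(abs(t) for t in theta)

def l2_penalty(theta, lam):
    """lam * sum_i theta_i**2 — the L2 penalty term."""
    return lam * sum(t * t for t in theta)

theta = [0.5, -1.0, 0.0, 2.0]
print(l1_penalty(theta, lam=0.1))  # 0.1 * (0.5 + 1.0 + 0.0 + 2.0)
print(l2_penalty(theta, lam=0.1))  # 0.1 * (0.25 + 1.0 + 0.0 + 4.0)
```

Either penalty is simply added to the data loss before backpropagation.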
Both constrain weight magnitude, but in different ways.
Core Difference
| Aspect | L1 | L2 |
|---|---|---|
| Penalty type | Absolute value | Squared magnitude |
| Weight behavior | Drives some weights to zero | Shrinks weights smoothly |
| Sparsity | Encourages sparse models | Rarely produces exact zeros |
| Geometry | Diamond constraint region | Circular constraint region |
| Feature selection | Yes | No |
L1 promotes sparsity.
L2 promotes smooth shrinkage.
Minimal Conceptual Illustration
L1:
Some weights → exactly 0
Model becomes sparse.
L2:
All weights → smaller
Model remains dense.
L1 performs implicit feature selection.
L2 distributes shrinkage across parameters.
Geometric Interpretation
In parameter space:
- L1 constraint region is diamond-shaped.
- L2 constraint region is circular (or spherical in higher dimensions).
Optimization tends to hit corners of L1 region → zeros emerge.
L2’s smooth boundary rarely produces exact zero weights.
Geometry explains sparsity behavior.
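The same picture can be stated as an equivalent constrained problem (a standard Lagrangian correspondence; the budget c is a hypothetical constant determined by λ):
[
\min_\theta \; \mathcal{L}_{data}(\theta) \quad \text{s.t.} \quad \|\theta\|_1 \le c
\qquad \text{vs.} \qquad
\min_\theta \; \mathcal{L}_{data}(\theta) \quad \text{s.t.} \quad \|\theta\|_2^2 \le c
]
The feasible regions are exactly the diamond and the ball described above, and the loss contours tend to first touch the diamond at a corner, where some coordinates are zero.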
Optimization Behavior
L1 gradient (for \theta_i \neq 0, since |\theta_i| is not differentiable at zero):
[
\frac{\partial}{\partial \theta_i} |\theta_i| = \operatorname{sign}(\theta_i)
]
L2 gradient:
[
\frac{\partial}{\partial \theta_i} \theta_i^2 = 2\theta_i
]
L1 applies constant shrinkage regardless of magnitude.
L2 shrinkage increases with weight size.
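This difference in shrinkage can be sketched as single update steps. The constants `lam` and `lr` are illustrative; `soft_threshold` is the proximal operator used by L1 solvers to avoid overshooting zero:

```python
import math

def l1_shrink_step(w, lam, lr):
    # Subgradient step on lam * |w|: a constant pull of size lr*lam toward 0,
    # regardless of how large or small w is.
    if w == 0.0:
        return 0.0
    return w - lr * lam * math.copysign(1.0, w)

def l2_shrink_step(w, lam, lr):
    # Gradient step on lam * w**2: a pull proportional to w (gradient 2*lam*w).
    return w - lr * lam * 2.0 * w

def soft_threshold(w, t):
    # Proximal operator of t * |w|: shrinks toward 0 and clamps exactly at 0.
    # This is how L1 produces exact zeros rather than tiny values.
    return math.copysign(max(abs(w) - t, 0.0), w)

for w in (2.0, 0.1):
    print(w, l1_shrink_step(w, lam=1.0, lr=0.05), l2_shrink_step(w, lam=1.0, lr=0.05))
# L1 subtracts the same 0.05 from both weights; L2 removes 10% of each.
print(soft_threshold(0.03, 0.05))  # small weight is clamped exactly to 0.0
```

The constant pull is why L1 eventually zeroes out small weights, while L2's proportional pull only makes them smaller.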
Bias–Variance Trade-Off
Both methods reduce variance and control model complexity.
L1:
- Stronger bias
- Feature selection
- Useful for high-dimensional sparse problems
L2:
- Lower variance
- Better when all features are useful
- More stable gradients
Choice depends on structure of true signal.
Relationship to Weight Decay
Under plain SGD, L2 regularization is equivalent to weight decay (up to a rescaling of λ by the learning rate).
In Adam:
L2 regularization behaves differently due to adaptive scaling.
AdamW decouples weight decay to restore proper L2 behavior.
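A minimal single-parameter sketch of why the two differ under Adam. All names and constants (`lam`, `lr`, the gradient sequence) are illustrative, not a production implementation:

```python
import math

def adam_steps(w, grads, lam, decoupled, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Run Adam on one scalar, with L2 either folded into the gradient
    (coupled) or applied as a separate decay step (AdamW-style)."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        if not decoupled:
            g = g + lam * w              # L2 penalty enters the gradient...
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)   # ...and gets rescaled here
        if decoupled:
            w = w - lr * lam * w         # AdamW: decay bypasses adaptive scaling
    return w

grads = [1.0, -0.5, 0.25, -0.1]
print(adam_steps(1.0, grads, lam=0.1, decoupled=False))  # coupled (L2-in-loss)
print(adam_steps(1.0, grads, lam=0.1, decoupled=True))   # decoupled (AdamW-style)
```

In the coupled version the decay term is divided by √v̂ along with the data gradient, so parameters with large gradient variance are decayed less; decoupling restores a uniform decay rate.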
Deep Learning Context
In large neural networks:
- L2 regularization (weight decay) is dominant.
- L1 is less common due to optimization challenges.
- Structured sparsity is often achieved via pruning rather than L1 alone.
Transformers commonly use L2-based weight decay.
Sparsity and Efficiency
L1 can lead to:
- Model sparsity
- Feature pruning
- Compressed representations
However:
- Training instability may increase.
- Optimization may become less smooth.
Modern sparse models often use structured pruning instead.
Alignment Perspective
Regularization affects:
- Model capacity
- Overfitting to proxy metrics
- Robustness under distribution shift
Excess capacity can amplify:
- Reward hacking
- Specification gaming
- Metric exploitation
Regularization moderates optimization power.
Governance Perspective
Regularization strength influences:
- Generalization reliability
- Robustness
- Overfitting risk
- Resource efficiency
Under-regularized models may behave unpredictably under shift.
Regularization is part of risk management in model training.
When to Use Each
L1:
- Feature selection needed
- High-dimensional sparse inputs
- Interpretability priority
L2:
- Deep neural networks
- Large-scale models
- Stable optimization desired
Most modern deep learning uses L2 (weight decay).
Summary
L1 Regularization:
- Penalizes absolute weights.
- Encourages sparsity.
- Performs feature selection.
L2 Regularization:
- Penalizes squared weights.
- Smoothly shrinks parameters.
- Improves stability and generalization.
Both reduce overfitting, but differ in geometric and sparsity behavior.
Related Concepts
- Weight Decay
- Adam vs AdamW
- Optimization Stability
- Bias–Variance Trade-Off
- Structured vs Unstructured Pruning
- Sparse vs Dense Models
- Implicit Regularization
- Overfitting