L1 vs L2 Regularization

Short Definition

L1 vs L2 Regularization compares two penalty-based techniques used to reduce overfitting in neural networks: L1 regularization encourages sparsity by penalizing absolute weight values, while L2 regularization penalizes squared weight magnitudes to promote smooth, small weights.

They differ in sparsity behavior and geometric effect.

Definition

Regularization adds a penalty term to the loss function to prevent overfitting.

Given a base loss:

[
\mathcal{L}_{data}(\theta)
]

Regularized loss becomes:

L1 Regularization

[
\mathcal{L}(\theta) = \mathcal{L}_{data}(\theta) + \lambda \|\theta\|_1
]

Where:

[
\|\theta\|_1 = \sum_i |\theta_i|
]

L2 Regularization

[
\mathcal{L}(\theta) = \mathcal{L}_{data}(\theta) + \lambda \|\theta\|_2^2
]

Where:

[
\|\theta\|_2^2 = \sum_i \theta_i^2
]

Both constrain weight magnitude, but in different ways.
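The two penalty terms can be sketched directly from the formulas above (a minimal pure-Python illustration; the function names are ours, not from any library):

```python
def l1_penalty(weights, lam):
    """lambda * sum_i |w_i| -- the L1 term added to the data loss."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lambda * sum_i w_i^2 -- the L2 term added to the data loss."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -1.5, 0.0, 2.0]
print(l1_penalty(weights, 0.1))  # 0.1 * (0.5 + 1.5 + 0.0 + 2.0) = 0.4
print(l2_penalty(weights, 0.1))  # 0.1 * (0.25 + 2.25 + 0.0 + 4.0) = 0.65
```

In practice the penalty is summed over all trainable parameters and added to the data loss before backpropagation.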

Core Difference

| Aspect | L1 | L2 |
| --- | --- | --- |
| Penalty type | Absolute value | Squared magnitude |
| Weight behavior | Drives some weights to zero | Shrinks weights smoothly |
| Sparsity | Encourages sparse models | Rarely produces exact zeros |
| Geometry | Diamond constraint region | Circular constraint region |
| Feature selection | Yes | No |

L1 promotes sparsity.
L2 promotes smooth shrinkage.

Minimal Conceptual Illustration


L1:
Some weights → exactly 0
Model becomes sparse.

L2:
All weights → smaller
Model remains dense.

L1 performs implicit feature selection.
L2 distributes shrinkage across parameters.
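This contrast shows up in the closed-form shrinkage each penalty induces. The sketch below uses soft-thresholding (the proximal operator of the L1 norm) and the analogous multiplicative L2 shrink; it is illustrative, not tied to any particular library:

```python
def prox_l1(w, lam):
    """Soft-threshold: pull each weight toward zero by lam,
    clipping small weights to exactly 0."""
    return [max(abs(x) - lam, 0.0) * (1 if x > 0 else -1) for x in w]

def prox_l2(w, lam):
    """Closed-form L2 shrink: rescale every weight by 1 / (1 + 2*lam);
    nonzero weights stay nonzero."""
    return [x / (1.0 + 2.0 * lam) for x in w]

w = [0.05, -0.2, 1.0, -0.8]
print(prox_l1(w, 0.1))  # weights below the threshold become exactly 0.0
print(prox_l2(w, 0.1))  # every weight shrinks, none reach zero
```

The hard cutoff in `prox_l1` is exactly the mechanism behind L1's implicit feature selection.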

Geometric Interpretation

In parameter space:

  • L1 constraint region is diamond-shaped.
  • L2 constraint region is circular (or spherical in higher dimensions).

Optimization tends to hit corners of L1 region → zeros emerge.

L2’s smooth boundary rarely produces exact zero weights.

Geometry explains sparsity behavior.

Optimization Behavior

L1 gradient (a subgradient, since |θ_i| is not differentiable at zero):

[
\frac{\partial}{\partial \theta_i} |\theta_i| = \operatorname{sign}(\theta_i)
]

L2 gradient:

[
\frac{\partial}{\partial \theta_i} \theta_i^2 = 2\theta_i
]

L1 applies constant shrinkage regardless of magnitude.
L2 shrinkage increases with weight size.
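The difference in shrinkage can be sketched by evaluating each penalty's gradient on a tiny and a large weight (illustrative helper names):

```python
def l1_grad(w, lam):
    """lam * sign(w_i): constant-size pull, independent of |w_i|."""
    return [lam * (1.0 if x > 0 else -1.0 if x < 0 else 0.0) for x in w]

def l2_grad(w, lam):
    """2 * lam * w_i: pull proportional to the weight itself."""
    return [2.0 * lam * x for x in w]

w = [0.01, 10.0]
print(l1_grad(w, 0.1))  # same pull on the tiny and the huge weight
print(l2_grad(w, 0.1))  # pull scales with magnitude: 0.002 vs 2.0
```

Because the L1 pull does not fade as weights approach zero, small weights are driven all the way to zero rather than asymptotically shrunk.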


Bias–Variance Trade-Off

Both methods reduce variance and control model complexity.

L1:

  • Stronger bias
  • Feature selection
  • Useful for high-dimensional sparse problems

L2:

  • Lower variance
  • Better when all features are useful
  • More stable gradients

The choice depends on the structure of the true signal.

Relationship to Weight Decay

In SGD:

L2 regularization is equivalent to weight decay (up to scaling by the learning rate).

In Adam:

L2 regularization behaves differently due to adaptive scaling.

AdamW decouples weight decay to restore proper L2 behavior.
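A deliberately simplified single-step sketch of the difference (no momentum, no bias correction; all names and constants are illustrative):

```python
import math

def adam_l2_step(w, grad, lam, lr=0.01, eps=1e-8):
    """L2 coupled into the gradient: the penalty term passes through the
    adaptive rescaling, so its effective strength varies per parameter."""
    g = grad + 2.0 * lam * w           # penalty folded into the gradient
    v = g * g                          # second-moment estimate (one step)
    return w - lr * g / (math.sqrt(v) + eps)

def adamw_step(w, grad, lam, lr=0.01, eps=1e-8):
    """Decoupled weight decay (AdamW): decay is applied directly to the
    weight, outside the adaptive rescaling."""
    v = grad * grad
    return w - lr * grad / (math.sqrt(v) + eps) - lr * lam * w
```

In `adam_l2_step` the second-moment normalization largely cancels the magnitude of the penalty gradient; in `adamw_step` the decay term stays proportional to the weight, which is the behavior AdamW restores.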

Deep Learning Context

In large neural networks:

  • L2 regularization (weight decay) is dominant.
  • L1 is less common due to optimization challenges.
  • Structured sparsity is often achieved via pruning rather than L1 alone.

Transformers commonly use L2-based weight decay.

Sparsity and Efficiency

L1 can lead to:

  • Model sparsity
  • Feature pruning
  • Compressed representations

However:

  • Training instability may increase.
  • Optimization may become less smooth.

Modern sparse models often use structured pruning instead.

Alignment Perspective

Regularization affects:

  • Model capacity
  • Overfitting to proxy metrics
  • Robustness under distribution shift

Excess capacity can amplify:

  • Reward hacking
  • Specification gaming
  • Metric exploitation

Regularization moderates optimization power.

Governance Perspective

Regularization strength influences:

  • Generalization reliability
  • Robustness
  • Overfitting risk
  • Resource efficiency

Under-regularized models may behave unpredictably under shift.

Regularization is part of risk management in model training.

When to Use Each

L1:

  • Feature selection needed
  • High-dimensional sparse inputs
  • Interpretability priority

L2:

  • Deep neural networks
  • Large-scale models
  • Stable optimization desired

Most modern deep learning uses L2 (weight decay).

Summary

L1 Regularization:

  • Penalizes absolute weights.
  • Encourages sparsity.
  • Performs feature selection.

L2 Regularization:

  • Penalizes squared weights.
  • Smoothly shrinks parameters.
  • Improves stability and generalization.

Both reduce overfitting, but differ in geometric and sparsity behavior.

Related Concepts

  • Weight Decay
  • Adam vs AdamW
  • Optimization Stability
  • Bias–Variance Trade-Off
  • Structured vs Unstructured Pruning
  • Sparse vs Dense Models
  • Implicit Regularization
  • Overfitting