Short Definition
L1 vs L2 Regularization compares two penalty-based techniques for reducing overfitting in neural networks: L1 regularization encourages sparsity by penalizing the absolute values of the weights, while L2 regularization penalizes squared weight magnitudes, shrinking all weights smoothly toward zero.
They differ in their sparsity behavior and geometric effect.
Definition
Regularization adds a penalty term to the loss function to prevent overfitting.
Given a base loss:
[
\mathcal{L}_{data}(\theta)
]
The regularized loss then takes one of two forms:
L1 Regularization
[
\mathcal{L}(\theta) = \mathcal{L}_{data}(\theta) + \lambda \|\theta\|_1
]
Where:
[
\|\theta\|_1 = \sum_i |\theta_i|
]
L2 Regularization
[
\mathcal{L}(\theta) = \mathcal{L}_{data}(\theta) + \lambda \|\theta\|_2^2
]
Where:
[
\|\theta\|_2^2 = \sum_i \theta_i^2
]
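The two penalty terms can be sketched directly in code. This is a minimal illustration; the names `theta` and `lam` are hypothetical stand-ins for the parameter vector and regularization strength:

```python
def l1_penalty(theta, lam):
    """lam * sum_i |theta_i| — the L1 penalty term."""
    return lam * sum(abs(t) for t in theta)

def l2_penalty(theta, lam):
    """lam * sum_i theta_i**2 — the L2 penalty term."""
    return lam * sum(t * t for t in theta)

theta = [0.5, -1.0, 0.0, 2.0]
print(l1_penalty(theta, lam=0.1))  # 0.1 * (0.5 + 1.0 + 0.0 + 2.0)
print(l2_penalty(theta, lam=0.1))  # 0.1 * (0.25 + 1.0 + 0.0 + 4.0)
```

Either penalty is simply added to the data loss before backpropagation.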
Both constrain weight magnitude, but in different ways.
Core Difference
| Aspect | L1 | L2 |
|---|---|---|
| Penalty type | Absolute value | Squared magnitude |
| Weight behavior | Drives some weights to zero | Shrinks weights smoothly |
| Sparsity | Encourages sparse models | Rarely produces exact zeros |
| Geometry | Diamond constraint region | Circular constraint region |
| Feature selection | Yes | No |
L1 promotes sparsity.
L2 promotes smooth shrinkage.
Minimal Conceptual Illustration
L1:
Some weights → exactly 0
Model becomes sparse.
L2:
All weights → smaller
Model remains dense.
L1 performs implicit feature selection.
L2 distributes shrinkage across parameters.
Geometric Interpretation
In parameter space:
- L1 constraint region is diamond-shaped.
- L2 constraint region is circular (or spherical in higher dimensions).
Optimization tends to hit corners of L1 region → zeros emerge.
L2’s smooth boundary rarely produces exact zero weights.
Geometry explains sparsity behavior.
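The same picture can be stated as an equivalent constrained problem (a standard Lagrangian correspondence; the budget c is a hypothetical constant determined by λ):
[
\min_\theta \; \mathcal{L}_{data}(\theta) \quad \text{s.t.} \quad \|\theta\|_1 \le c
\qquad \text{vs.} \qquad
\min_\theta \; \mathcal{L}_{data}(\theta) \quad \text{s.t.} \quad \|\theta\|_2^2 \le c
]
The feasible regions are exactly the diamond and the ball described above, and the loss contours tend to first touch the diamond at a corner, where some coordinates are zero.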
Optimization Behavior
L1 gradient (for \theta_i \neq 0, since |\theta_i| is not differentiable at zero):
[
\frac{\partial}{\partial \theta_i} |\theta_i| = \operatorname{sign}(\theta_i)
]
L2 gradient:
[
\frac{\partial}{\partial \theta_i} \theta_i^2 = 2\theta_i
]
L1 applies constant shrinkage regardless of magnitude.
L2 shrinkage increases with weight size.
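This difference in shrinkage can be sketched as single update steps. The constants `lam` and `lr` are illustrative; `soft_threshold` is the proximal operator used by L1 solvers to avoid overshooting zero:

```python
import math

def l1_shrink_step(w, lam, lr):
    # Subgradient step on lam * |w|: a constant pull of size lr*lam toward 0,
    # regardless of how large or small w is.
    if w == 0.0:
        return 0.0
    return w - lr * lam * math.copysign(1.0, w)

def l2_shrink_step(w, lam, lr):
    # Gradient step on lam * w**2: a pull proportional to w (gradient 2*lam*w).
    return w - lr * lam * 2.0 * w

def soft_threshold(w, t):
    # Proximal operator of t * |w|: shrinks toward 0 and clamps exactly at 0.
    # This is how L1 produces exact zeros rather than tiny values.
    return math.copysign(max(abs(w) - t, 0.0), w)

for w in (2.0, 0.1):
    print(w, l1_shrink_step(w, lam=1.0, lr=0.05), l2_shrink_step(w, lam=1.0, lr=0.05))
# L1 subtracts the same 0.05 from both weights; L2 removes 10% of each.
print(soft_threshold(0.03, 0.05))  # small weight is clamped exactly to 0.0
```

The constant pull is why L1 eventually zeroes out small weights, while L2's proportional pull only makes them smaller.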
Bias–Variance Trade-Off
Both methods reduce variance and control model complexity.
L1:
- Stronger bias
- Feature selection
- Useful for high-dimensional sparse problems
L2:
- Lower variance
- Better when all features are useful
- More stable gradients
Choice depends on structure of true signal.
Relationship to Weight Decay
Under plain SGD, L2 regularization is equivalent to weight decay (up to a rescaling of λ by the learning rate).
In Adam:
L2 regularization behaves differently due to adaptive scaling.
AdamW decouples weight decay to restore proper L2 behavior.
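A minimal single-parameter sketch of why the two differ under Adam. All names and constants (`lam`, `lr`, the gradient sequence) are illustrative, not a production implementation:

```python
import math

def adam_steps(w, grads, lam, decoupled, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Run Adam on one scalar, with L2 either folded into the gradient
    (coupled) or applied as a separate decay step (AdamW-style)."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        if not decoupled:
            g = g + lam * w              # L2 penalty enters the gradient...
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)   # ...and gets rescaled here
        if decoupled:
            w = w - lr * lam * w         # AdamW: decay bypasses adaptive scaling
    return w

grads = [1.0, -0.5, 0.25, -0.1]
print(adam_steps(1.0, grads, lam=0.1, decoupled=False))  # coupled (L2-in-loss)
print(adam_steps(1.0, grads, lam=0.1, decoupled=True))   # decoupled (AdamW-style)
```

In the coupled version the decay term is divided by √v̂ along with the data gradient, so parameters with large gradient variance are decayed less; decoupling restores a uniform decay rate.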
Deep Learning Context
In large neural networks:
- L2 regularization (weight decay) is dominant.
- L1 is less common due to optimization challenges.
- Structured sparsity is often achieved via pruning rather than L1 alone.
Transformers commonly use L2-based weight decay.
Sparsity and Efficiency
L1 can lead to:
- Model sparsity
- Feature pruning
- Compressed representations
However:
- Training instability may increase.
- Optimization may become less smooth.
Modern sparse models often use structured pruning instead.
Alignment Perspective
Regularization affects:
- Model capacity
- Overfitting to proxy metrics
- Robustness under distribution shift
Excess capacity can amplify:
- Reward hacking
- Specification gaming
- Metric exploitation
Regularization moderates optimization power.
Governance Perspective
Regularization strength influences:
- Generalization reliability
- Robustness
- Overfitting risk
- Resource efficiency
Under-regularized models may behave unpredictably under shift.
Regularization is part of risk management in model training.
When to Use Each
L1:
- Feature selection needed
- High-dimensional sparse inputs
- Interpretability priority
L2:
- Deep neural networks
- Large-scale models
- Stable optimization desired
Most modern deep learning uses L2 (weight decay).
Summary
L1 Regularization:
- Penalizes absolute weights.
- Encourages sparsity.
- Performs feature selection.
L2 Regularization:
- Penalizes squared weights.
- Smoothly shrinks parameters.
- Improves stability and generalization.
Both reduce overfitting, but differ in geometric and sparsity behavior.
Related Concepts
- Weight Decay
- Adam vs AdamW
- Optimization Stability
- Bias–Variance Trade-Off
- Structured vs Unstructured Pruning
- Sparse vs Dense Models
- Implicit Regularization
- Overfitting