Short Definition
The Fisher Information Matrix (FIM) measures the sensitivity of a model’s predicted probability distribution to changes in its parameters. It captures curvature in parameter space from a probabilistic perspective and plays a central role in natural gradient methods and second-order optimization.
It links optimization geometry with information theory.
Definition
Let a model define a probability distribution:
[
F(\theta) =
\mathbb{E}_{x,y \sim p} \left[
\nabla_\theta \log p(y \mid x; \theta)
\;
\nabla_\theta \log p(y \mid x; \theta)^T
\right]
]
Equivalent form:
[
F(\theta) =
\mathbb{E}
\left[
(\nabla_\theta \log p)
(\nabla_\theta \log p)^T
\right]
]
Because the expected score is zero, the FIM is the covariance of the score function.
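The definition above can be sketched numerically. The following is a minimal example, assuming a toy logistic-regression model (all names and data here are illustrative): the empirical Fisher is the average outer product of per-example score gradients, with labels sampled from the model itself so the expectation is taken under p.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(theta, x, y):
    # Gradient of log p(y | x; theta) w.r.t. theta for logistic regression
    p = sigmoid(x @ theta)
    return (y - p) * x

theta = np.array([0.5, -0.3])
X = rng.normal(size=(1000, 2))

# Sample labels from the model itself: the "x, y ~ p" in the definition
Y = (rng.random(1000) < sigmoid(X @ theta)).astype(float)

# Empirical Fisher: average outer product of the score
scores = np.array([score(theta, x, y) for x, y in zip(X, Y)])
F = scores.T @ scores / len(X)

print(F.shape)              # (2, 2)
print(np.allclose(F, F.T))  # True: symmetric by construction
```

As a sum of outer products, the resulting matrix is symmetric positive semi-definite by construction.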
Intuition
The FIM measures how much the output distribution changes when the parameters change.
Large values:
- Small parameter changes significantly alter predictions.
- High sensitivity.
Small values:
- Predictions stable under perturbations.
- Low sensitivity.
It encodes the geometry of the model in probability space.
Minimal Conceptual Illustration
Flat parameter direction:
Small output change → Low Fisher value
Sensitive parameter direction:
Large output change → High Fisher value
Fisher captures probabilistic curvature.
Relation to the Hessian
For log-likelihood losses:
[
L(\theta) = -\log p(y \mid x; \theta)
]
Under certain regularity conditions:
[
F(\theta) \approx \mathbb{E}[H(\theta)]
]
where H is the Hessian of the loss.
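This correspondence can be checked by hand in the smallest case. The sketch below assumes a one-parameter Bernoulli model with a logit parameter (a standard worked example, not from the original text): the Fisher information and the expected Hessian of the negative log-likelihood both come out to p(1 - p).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

theta = 0.7
p = sigmoid(theta)

# Score of the Bernoulli-logit model: d/dtheta log p(y; theta) = y - sigmoid(theta)
# Fisher = E_y[(y - p)^2] under y ~ Bernoulli(p)
fisher = p * (1 - p) ** 2 + (1 - p) * (0 - p) ** 2   # simplifies to p * (1 - p)

# Hessian of the negative log-likelihood: d/dtheta (sigmoid(theta) - y) = p * (1 - p),
# which does not depend on y, so it equals its own expectation
hessian = p * (1 - p)

print(abs(fisher - hessian) < 1e-12)  # True: they coincide exactly here
```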
However:
- Hessian measures curvature of the loss.
- Fisher measures curvature of the distribution.
They coincide when the model is well-specified and expectations are taken under the model's own distribution, but differ in practice.
Geometric Interpretation
The Fisher Information Matrix defines a Riemannian metric on parameter space.
Local distance between parameter configurations:
[
D(\theta_1, \theta_2) = (\theta_1 - \theta_2)^T F(\theta) (\theta_1 - \theta_2)
]
This measures how distinguishable the two models are in distribution space; to second order, it is proportional to the KL divergence between them.
It defines information geometry.
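The link between the Fisher quadratic form and distributional distance can be made concrete. The sketch below assumes a Gaussian model with unknown mean and fixed standard deviation (an illustrative choice): the Fisher information for the mean is 1/sigma^2, and half the quadratic form reproduces the exact KL divergence between the two models.

```python
import math

# Gaussian model N(theta, sigma^2) with fixed sigma;
# Fisher information for the mean parameter is 1 / sigma^2
sigma = 2.0
fisher = 1.0 / sigma**2

theta1, theta2 = 0.3, 0.8
delta = theta1 - theta2

# Exact KL divergence between two Gaussians with equal variance
kl = delta**2 / (2 * sigma**2)

# Second-order expansion of KL: (1/2) * delta * F * delta
quad = 0.5 * delta * fisher * delta

print(abs(kl - quad) < 1e-15)  # True: exact here because KL is quadratic in the mean
```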
Natural Gradient
Standard gradient descent:
[
\theta_{t+1} = \theta_t - \eta \nabla_\theta L
]
Natural gradient modifies the update using the FIM:
[
\theta_{t+1} = \theta_t - \eta F^{-1} \nabla_\theta L
]
This rescales updates by local curvature in probability space.
Natural gradient steps are, to first order, invariant to how the model is parameterized.
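The effect of the rescaling shows up clearly in one dimension. This sketch assumes the Bernoulli-logit toy model again (an illustrative choice): far out in the saturated region of the sigmoid, the raw gradient is small, but the Fisher information is tiny there too, so the natural-gradient step is much larger.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

theta = -4.0                  # model starts very confident that y = 0
p = sigmoid(theta)
eta = 0.1

# Loss L = -log p(y=1; theta); its gradient w.r.t. theta is (p - 1)
grad = p - 1.0
fisher = p * (1.0 - p)        # Fisher information of the Bernoulli-logit model

vanilla = theta - eta * grad
natural = theta - eta * grad / fisher   # precondition by F^{-1}

# In the flat region (p near 0), the Fisher rescaling takes a far larger step
print(vanilla - theta)   # small vanilla step
print(natural - theta)   # much larger natural-gradient step
```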
Optimization Implications
Using FIM:
- Corrects poorly scaled directions.
- Stabilizes updates.
- Accounts for model sensitivity.
However:
- Computing full FIM is expensive.
- Requires approximations (e.g., K-FAC).
In large models, exact Fisher is infeasible.
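One common cheap approximation, used for example in elastic weight consolidation and related methods, keeps only the diagonal of the empirical Fisher. The sketch below assumes the same toy logistic-regression setup as earlier (names and data are illustrative), and adds a small damping term so near-zero diagonal entries do not blow up the preconditioned step.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -0.3, 1.0])
X = rng.normal(size=(500, 3))
Y = (rng.random(500) < sigmoid(X @ theta)).astype(float)

# Per-example score gradients of log p(y | x; theta)
scores = (Y - sigmoid(X @ theta))[:, None] * X

# Diagonal Fisher approximation: keep only the mean squared per-parameter gradient
F_diag = np.mean(scores**2, axis=0)

# Preconditioned gradient step using the diagonal, with damping for stability
grad = -scores.mean(axis=0)          # gradient of the average negative log-likelihood
step = grad / (F_diag + 1e-8)

print(F_diag.shape)  # (3,)
```

Storing a vector instead of a full matrix reduces the cost from quadratic to linear in the parameter count, which is what makes such approximations viable at scale.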
Relation to Curvature
Hessian:
- Measures curvature of loss landscape.
Fisher:
- Measures curvature of predictive distribution.
Fisher is always positive semi-definite.
Hessian may have negative eigenvalues (saddles).
Fisher provides more stable curvature information.
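The sign contrast above has a structural reason, which the following sketch illustrates with synthetic numbers (the gradient vectors are random stand-ins, not from the original text): any average of outer products g g^T is positive semi-definite, whereas a Hessian at a saddle point is indefinite.

```python
import numpy as np

rng = np.random.default_rng(2)

# Any average of outer products g g^T is positive semi-definite: this is the
# structural reason the Fisher matrix never has negative eigenvalues.
gs = rng.normal(size=(50, 4))              # stand-ins for per-example score gradients
F = sum(np.outer(g, g) for g in gs) / len(gs)
print(np.all(np.linalg.eigvalsh(F) >= -1e-10))  # True

# A Hessian, by contrast, can be indefinite at a saddle point:
H = np.array([[2.0, 0.0], [0.0, -2.0]])    # Hessian of f(x, y) = x^2 - y^2
print(np.linalg.eigvalsh(H))               # one negative, one positive eigenvalue
```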
Connection to Bayesian Learning
In Bayesian inference, via the Laplace approximation:
[
\text{posterior covariance} \approx F^{-1}
]
The Fisher matrix thus approximates uncertainty in parameters.
It relates to:
- Laplace approximations
- Variational inference
- Uncertainty estimation
It connects optimization to statistics.
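The covariance relation can be verified exactly in the simplest conjugate case. The sketch below assumes estimating a Gaussian mean with known sigma under a flat prior (a textbook setting, chosen for illustration): the inverse of the total Fisher information from n samples matches the exact posterior variance.

```python
# Laplace-style uncertainty for the mean of a Gaussian with known sigma:
# total Fisher information from n i.i.d. samples is n / sigma^2, and its
# inverse matches the exact posterior variance under a flat prior.
sigma = 1.5
n = 40

fisher_total = n / sigma**2
laplace_var = 1.0 / fisher_total

exact_posterior_var = sigma**2 / n   # classical conjugate result for a flat prior

print(abs(laplace_var - exact_posterior_var) < 1e-15)  # True
```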
Scaling Context
In overparameterized models:
- Fisher spectrum becomes highly anisotropic.
- Many near-zero eigenvalues.
- Effective parameter dimension smaller than nominal count.
Understanding Fisher structure helps explain generalization.
Robustness Implications
Large Fisher eigenvalues:
- High sensitivity to perturbations.
- Potential adversarial vulnerability.
Small Fisher eigenvalues:
- Stable directions.
- Robust behavior.
Fisher geometry informs robustness analysis.
Alignment Perspective
Fisher Information reflects:
- Optimization strength.
- Sensitivity to objective shifts.
- Potential for proxy exploitation.
High sensitivity directions may amplify:
- Reward mis-specification.
- Alignment fragility.
Controlling curvature can influence optimization behavior.
Governance Perspective
Monitoring Fisher-related metrics may inform:
- Training stability
- Capability growth
- Sensitivity to objective changes
- Risk assessment under scaling
Information geometry affects system reliability.
Summary
Fisher Information Matrix:
- Measures sensitivity of model predictions to parameters.
- Defines curvature in probability space.
- Enables natural gradient methods.
- Connects optimization, statistics, and geometry.
- Central to advanced training theory.
It is curvature from an information-theoretic perspective.
Related Concepts
- Loss Landscape Curvature
- Hessian Spectrum
- Natural Gradient Descent
- Second-Order Optimization
- Implicit Regularization
- Uncertainty Estimation
- Overparameterization
- Optimization Stability