Fisher Information Matrix (FIM)

Short Definition

The Fisher Information Matrix (FIM) measures the sensitivity of a model’s predicted probability distribution to changes in its parameters. It captures curvature in parameter space from a probabilistic perspective and plays a central role in natural gradient methods and second-order optimization.

It links optimization geometry with information theory.

Definition

Let a model define a probability distribution:

[
p(y \mid x; \theta)
]

The Fisher Information Matrix is defined as:

[
F(\theta) = \mathbb{E}_{x, y \sim p} \left[ \nabla_\theta \log p(y \mid x; \theta) \; \nabla_\theta \log p(y \mid x; \theta)^T \right]
]

Equivalent form:

[
F(\theta) = \mathbb{E} \left[ (\nabla_\theta \log p) (\nabla_\theta \log p)^T \right]
]

Since the score has zero mean under the model's own distribution, this is the covariance of the score function.
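The score-covariance definition can be checked numerically on a toy model. The sketch below (an assumed Bernoulli setup, not from the text above) estimates E[score²] by Monte Carlo and compares it against the known closed form 1/(θ(1−θ)):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(y, theta):
    # d/dtheta log p(y; theta) for the Bernoulli model
    # p(y; theta) = theta^y (1 - theta)^(1 - y)
    return y / theta - (1 - y) / (1 - theta)

theta = 0.3
ys = rng.binomial(1, theta, size=200_000)    # samples drawn from the model itself
fisher_mc = np.mean(score(ys, theta) ** 2)   # Monte Carlo estimate of E[score^2]
fisher_exact = 1.0 / (theta * (1 - theta))   # known closed form for Bernoulli

print(fisher_mc, fisher_exact)
```

With enough samples the Monte Carlo estimate converges to the closed form, illustrating that the Fisher information is exactly the second moment of the score.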

Intuition

The FIM measures:

How much the output distribution changes when parameters change.

Large values:

  • Small parameter changes significantly alter predictions.
  • High sensitivity.

Small values:

  • Predictions stable under perturbations.
  • Low sensitivity.

It encodes the geometry of the model in probability space.

Minimal Conceptual Illustration


Flat parameter direction:
Small output change → Low Fisher value

Sensitive parameter direction:
Large output change → High Fisher value

Fisher captures probabilistic curvature.

Relation to the Hessian

For the negative log-likelihood loss:

[
\mathcal{L}(\theta) = -\log p(y \mid x; \theta)
]

Under certain regularity conditions:

[
F(\theta) \approx \mathbb{E}[H(\theta)]
]

where H is the Hessian of the loss.

However:

  • Hessian measures curvature of the loss.
  • Fisher measures curvature of the distribution.

They coincide when the model is well specified and evaluated at the true (or maximum-likelihood) parameters, but differ in practice.
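For the same Bernoulli toy model, the Fisher–Hessian identity can be verified in closed form: the expected Hessian of the negative log-likelihood equals the score-covariance expression exactly.

```python
theta = 0.3

# Hessian of the negative log-likelihood for one Bernoulli observation:
#   d^2/dtheta^2 [-log p(y; theta)] = y / theta^2 + (1 - y) / (1 - theta)^2
# Taking the expectation over y ~ Bernoulli(theta) replaces y with theta:
expected_hessian = theta / theta**2 + (1 - theta) / (1 - theta)**2

# Score-covariance (Fisher) form: 1 / (theta * (1 - theta))
fisher = 1.0 / (theta * (1 - theta))

print(expected_hessian, fisher)
```

Both expressions simplify to 1/θ + 1/(1−θ), so they agree to machine precision here; in a misspecified or finite-sample setting they would not.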


Geometric Interpretation

The Fisher Information Matrix defines a Riemannian metric on parameter space.

Local distance between nearby parameter configurations:

[
D(\theta_1, \theta_2) = (\theta_1 - \theta_2)^T F(\theta) (\theta_1 - \theta_2)
]

This measures how distinguishable two models are in distribution space.

It defines information geometry.
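The quadratic form above weights each parameter direction by its sensitivity. A minimal sketch, with an anisotropic Fisher matrix whose entries are illustrative assumptions rather than values from a real model:

```python
import numpy as np

# Hypothetical 2-parameter model with an anisotropic Fisher matrix
F = np.array([[10.0, 0.0],    # sensitive direction
              [ 0.0, 0.1]])   # flat direction
delta = np.array([0.1, 0.1])  # same-sized Euclidean step on each axis

dist = delta @ F @ delta            # (theta1 - theta2)^T F (theta1 - theta2)
per_axis = F.diagonal() * delta**2  # contribution of each direction

print(dist, per_axis)
```

The same Euclidean step contributes 100x more Fisher distance along the sensitive axis: equal moves in parameter space are not equal moves in distribution space.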

Natural Gradient

Standard gradient descent:

[
\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}
]

The natural gradient modifies the update using the FIM:

[
\theta_{t+1} = \theta_t - \eta F^{-1} \nabla_\theta \mathcal{L}
]

This rescales updates by local curvature in probability space.

Natural gradient steps are invariant to smooth reparameterizations of the model.
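The rescaling effect is easiest to see on a badly scaled quadratic. The sketch below assumes a constant, known Fisher matrix (a simplification made to keep the example tiny) and compares one plain gradient step with one natural gradient step:

```python
import numpy as np

# Illustrative quadratic "loss" 0.5 * theta^T F theta with a badly scaled,
# constant Fisher matrix (an assumption; real F varies with theta).
F = np.diag([100.0, 1.0])
theta = np.array([1.0, 1.0])
grad = F @ theta                  # gradient of the quadratic

eta = 0.9
sgd_step = theta - eta * grad                       # plain gradient descent
nat_step = theta - eta * np.linalg.solve(F, grad)   # F^{-1} grad (natural)

print(sgd_step)   # overshoots wildly along the stiff direction
print(nat_step)   # shrinks both coordinates at the same rate
```

The plain step diverges along the high-curvature axis while barely moving the flat one; the preconditioned step makes uniform progress in both directions, which is exactly the curvature correction the update rule above describes.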

Optimization Implications

Using FIM:

  • Corrects poorly scaled directions.
  • Stabilizes updates.
  • Accounts for model sensitivity.

However:

  • Computing full FIM is expensive.
  • Requires approximations (e.g., K-FAC).

In large models, exact Fisher is infeasible.
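One cheap approximation is to keep only the diagonal of the empirical Fisher, built from per-example gradients. A sketch under an assumed logistic-regression setup (the score formula `(y - sigmoid(x @ w)) * x` is standard for that model, but the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy model: logistic regression with 2 weights
w = np.array([0.5, -0.25])
X = rng.normal(size=(5000, 2))
p = sigmoid(X @ w)
y = rng.binomial(1, p)                     # labels drawn from the model

scores = (y - p)[:, None] * X              # per-example gradients of log p
diag_fisher = np.mean(scores**2, axis=0)   # keep only the diagonal of E[s s^T]

print(diag_fisher)
```

Storing the diagonal costs O(d) instead of O(d²), which is why diagonal and block-structured (K-FAC) approximations dominate in practice.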

Relation to Curvature

Hessian:

  • Measures curvature of loss landscape.

Fisher:

  • Measures curvature of predictive distribution.

Fisher is always positive semi-definite.
Hessian may have negative eigenvalues (saddles).

Fisher provides more stable curvature information.

Connection to Bayesian Learning

In Bayesian inference:

[
\text{Posterior variance} \propto F^{-1}
]

Fisher approximates uncertainty in parameters.
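A Laplace-style sketch of this relationship, again on an assumed Bernoulli toy model: the approximate posterior variance of the parameter is the inverse of the total (n-sample) Fisher information at the maximum-likelihood estimate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy Bernoulli model: posterior variance ~ inverse total Fisher information
n, theta_true = 2000, 0.3
ys = rng.binomial(1, theta_true, size=n)
theta_hat = ys.mean()                          # maximum-likelihood estimate

fisher_per_sample = 1.0 / (theta_hat * (1 - theta_hat))
laplace_var = 1.0 / (n * fisher_per_sample)    # ~ theta_hat * (1 - theta_hat) / n

print(theta_hat, laplace_var)
```

The variance shrinks as 1/n: more data means more accumulated Fisher information and a tighter approximate posterior.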

It relates to:

  • Laplace approximations
  • Variational inference
  • Uncertainty estimation

It connects optimization to statistics.

Scaling Context

In overparameterized models:

  • Fisher spectrum becomes highly anisotropic.
  • Many near-zero eigenvalues.
  • Effective parameter dimension smaller than nominal count.

Understanding Fisher structure helps explain generalization.
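The low effective dimension is easy to demonstrate: an empirical Fisher built from fewer score samples than parameters has rank at most the sample count, so most of its eigenvalues are numerically zero. A sketch with stand-in Gaussian score vectors (an assumption; real scores come from backprop):

```python
import numpy as np

rng = np.random.default_rng(3)

# Overparameterized toy: d parameters, only n < d score samples
d, n = 50, 10
scores = rng.normal(size=(n, d))   # stand-in per-example score vectors
F_emp = scores.T @ scores / n      # empirical E[s s^T], rank at most n

eigs = np.linalg.eigvalsh(F_emp)
nonzero = int((eigs > 1e-10).sum())
print(nonzero, d)                  # far fewer active directions than parameters
```

Only 10 of the 50 eigenvalues are nonzero here; real networks show a similar picture with a few large eigenvalues and a long near-zero tail.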

Robustness Implications

Large Fisher eigenvalues:

  • High sensitivity to perturbations.
  • Potential adversarial vulnerability.

Small Fisher eigenvalues:

  • Stable directions.
  • Robust behavior.

Fisher geometry informs robustness analysis.

Alignment Perspective

Fisher Information reflects:

  • Optimization strength.
  • Sensitivity to objective shifts.
  • Potential for proxy exploitation.

High sensitivity directions may amplify:

  • Reward mis-specification.
  • Alignment fragility.

Controlling curvature can influence optimization behavior.

Governance Perspective

Monitoring Fisher-related metrics may inform:

  • Training stability
  • Capability growth
  • Sensitivity to objective changes
  • Risk assessment under scaling

Information geometry affects system reliability.

Summary

Fisher Information Matrix:

  • Measures sensitivity of model predictions to parameters.
  • Defines curvature in probability space.
  • Enables natural gradient methods.
  • Connects optimization, statistics, and geometry.
  • Central to advanced training theory.

It is curvature from an information-theoretic perspective.

Related Concepts

  • Loss Landscape Curvature
  • Hessian Spectrum
  • Natural Gradient Descent
  • Second-Order Optimization
  • Implicit Regularization
  • Uncertainty Estimation
  • Overparameterization
  • Optimization Stability