NTK Regime vs Finite-Width Regime

Short Definition

NTK Regime vs Finite-Width Regime contrasts two training dynamics in neural networks: the NTK regime describes behavior near infinite width, where learning is approximately linear and representations stay essentially fixed, while the finite-width regime reflects realistic networks whose representations evolve significantly during training.

One is analytically tractable; the other enables rich feature learning.

Definition

Neural network behavior depends strongly on model width and training dynamics.

Two important regimes are:

NTK Regime (Lazy Training)

In the infinite-width limit:

  • Parameter updates remain small.
  • Network function evolves linearly around initialization.
  • The Neural Tangent Kernel (NTK) remains nearly constant.
  • Training resembles kernel regression.

Formally, the model is approximated by:

[
f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0)
]

This linearization governs dynamics.

Representations do not meaningfully change.
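The linearization above is easy to check numerically. A minimal sketch, assuming a toy one-hidden-layer tanh network with scalar output (the model, sizes, and names are illustrative, not a canonical construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 64
x = rng.normal(size=3)

# Initial parameters theta_0 = (W0, v0).
W0 = rng.normal(size=(n_hidden, 3)) / np.sqrt(3)
v0 = rng.normal(size=n_hidden) / np.sqrt(n_hidden)

def f(W, v):
    """Scalar network output f(x; theta) = v . tanh(W x)."""
    return v @ np.tanh(W @ x)

def grad_f(W, v):
    """Analytic gradient of f with respect to (W, v)."""
    h = np.tanh(W @ x)
    dW = np.outer(v * (1.0 - h**2), x)  # chain rule through tanh
    return dW, h                        # df/dW, df/dv

# Small parameter displacement theta - theta_0.
dW = 1e-3 * rng.normal(size=W0.shape)
dv = 1e-3 * rng.normal(size=v0.shape)

gW, gv = grad_f(W0, v0)
exact = f(W0 + dW, v0 + dv)
linear = f(W0, v0) + np.sum(gW * dW) + gv @ dv

# The gap is second order in the displacement, so it is tiny here.
print(abs(exact - linear))
```

For small displacements the exact output and its linearization agree to second order, which is exactly the approximation the NTK regime exploits.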

Finite-Width Regime (Feature Learning)

In realistic finite networks:

  • Parameters move significantly from initialization.
  • Hidden representations evolve.
  • Feature geometry changes.
  • Nonlinear dynamics dominate.

The kernel is not fixed during training.

Learning is strongly representation-dependent.

Core Difference

| Aspect | NTK Regime | Finite-Width Regime |
| --- | --- | --- |
| Width | Infinite (theoretical) | Finite (practical) |
| Representation change | Minimal | Significant |
| Training dynamics | Linearized | Nonlinear |
| Analytical tractability | High | Low |
| Feature learning | Limited | Strong |

NTK preserves initial feature space.
Finite width reshapes it.

Minimal Conceptual Illustration


NTK:
Initialization → small parameter drift
Feature space ≈ constant.

Finite width:
Initialization → major restructuring
Feature space transforms.

The key difference is representation evolution.
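Representation evolution can be measured directly. A minimal sketch, assuming a toy tanh network trained by full-batch gradient descent on a synthetic regression task (all settings are illustrative): compare hidden activations before and after training via cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(32, 2))          # synthetic inputs
y = np.sin(2.0 * X[:, 0])             # synthetic regression target

width = 16
W = rng.normal(size=(width, 2)) / np.sqrt(2)
v = rng.normal(size=width) / np.sqrt(width)

H0 = np.tanh(X @ W.T)                 # hidden features at initialization

lr = 0.2
for _ in range(1000):                 # full-batch GD on mean squared error
    H = np.tanh(X @ W.T)
    err = H @ v - y
    gv = (err @ H) / len(X)
    gW = ((err[:, None] * (1.0 - H**2) * v).T @ X) / len(X)
    W -= lr * gW
    v -= lr * gv

H1 = np.tanh(X @ W.T)                 # hidden features after training
cos = np.sum(H0 * H1) / (np.linalg.norm(H0) * np.linalg.norm(H1))
print(f"feature similarity (init vs trained): {cos:.3f}")
```

A similarity near 1 would indicate an almost frozen feature space (NTK-like); at small widths like this one, the similarity typically drops below 1 as features reorganize.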

Kernel Stability

In the NTK regime, the kernel

[
K(x, x') = \nabla_\theta f(x; \theta)^\top \nabla_\theta f(x'; \theta)
]

remains approximately constant during training.

In the finite-width regime:

  • The kernel evolves.
  • Learning reshapes similarity structure.
  • New features emerge.

This difference fundamentally changes generalization behavior.
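Kernel evolution can be observed on a small model by recomputing the empirical NTK Gram matrix before and after training. A minimal sketch (toy network, synthetic data, all settings assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_hidden, dim = 32, 2
X = rng.normal(size=(8, dim))          # small batch of inputs
y = np.sin(X[:, 0])                    # synthetic regression target

W = rng.normal(size=(n_hidden, dim)) / np.sqrt(dim)
v = rng.normal(size=n_hidden) / np.sqrt(n_hidden)

def grads(W, v, x):
    """Flattened gradient of f(x; theta) = v . tanh(W x) w.r.t. theta."""
    h = np.tanh(W @ x)
    dW = np.outer(v * (1.0 - h**2), x)
    return np.concatenate([dW.ravel(), h])

def empirical_ntk(W, v):
    """Gram matrix K[i, j] = grad f(x_i) . grad f(x_j)."""
    G = np.stack([grads(W, v, x) for x in X])
    return G @ G.T

K0 = empirical_ntk(W, v)
lr = 0.1
for _ in range(200):                   # full-batch gradient descent
    preds = np.tanh(X @ W.T) @ v
    err = preds - y
    G = np.stack([grads(W, v, x) for x in X])
    step = (err @ G) / len(X)          # gradient of 0.5 * mean squared error
    W -= lr * step[: W.size].reshape(W.shape)
    v -= lr * step[W.size :]

K1 = empirical_ntk(W, v)
drift = np.linalg.norm(K1 - K0) / np.linalg.norm(K0)
print(f"relative kernel change: {drift:.3f}")
```

At finite width the relative change is nonzero; in the infinite-width NTK limit it would vanish.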

Relation to Feature Learning vs Lazy Training

NTK regime ≈ Lazy Training.

Finite-width regime ≈ Feature Learning.

Lazy training:

  • Linear approximation holds.

Feature learning:

  • Network escapes linear regime.
  • Representation hierarchy forms.

Modern deep learning relies heavily on finite-width effects.

Width and Learning Rate Effects

NTK regime is encouraged by:

  • Extremely wide networks.
  • Small learning rates.
  • Large initialization scales.

Finite-width behavior emerges with:

  • Moderate width.
  • Larger learning rates.
  • Strong representation bottlenecks.

Practical models sit between regimes.
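These width effects can be illustrated with a toy sweep. A minimal sketch, assuming NTK-style 1/sqrt(n) output scaling (the experimental design is illustrative): measure how far parameters travel from initialization, relative to their initial norm, at two widths.

```python
import numpy as np

def relative_movement(width, steps=300, lr=0.1, seed=0):
    """Train a toy tanh network; return ||theta_T - theta_0|| / ||theta_0||."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(16, 2))
    y = np.sin(X[:, 0])
    W = rng.normal(size=(width, 2)) / np.sqrt(2)
    v = rng.normal(size=width)
    W0, v0 = W.copy(), v.copy()
    s = 1.0 / np.sqrt(width)           # NTK-style output scaling
    for _ in range(steps):
        H = np.tanh(X @ W.T)           # (batch, width) hidden activations
        err = s * (H @ v) - y
        gv = s * (err @ H) / len(X)
        gW = s * ((err[:, None] * (1.0 - H**2) * v).T @ X) / len(X)
        W -= lr * gW
        v -= lr * gv
    moved = np.sqrt(np.sum((W - W0) ** 2) + np.sum((v - v0) ** 2))
    start = np.sqrt(np.sum(W0 ** 2) + np.sum(v0 ** 2))
    return moved / start

narrow = relative_movement(16)
wide = relative_movement(1024)
print(f"width 16: {narrow:.4f}, width 1024: {wide:.4f}")
```

Under this scaling the wider network stays relatively closer to its initialization, consistent with the lazy-training picture.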

Generalization Implications

NTK regime:

  • Generalization governed by kernel properties.
  • More predictable behavior.
  • Limited expressive adaptation.

Finite-width regime:

  • Learns task-specific features.
  • Often achieves superior performance.
  • Harder to analyze theoretically.

Generalization theory differs across regimes.

Scaling Perspective

As width increases:

  • Behavior moves toward NTK regime.
  • Feature learning becomes weaker relative to kernel effects.

However:

  • Modern Transformers still exhibit strong feature learning.
  • Infinite-width assumptions do not fully capture LLM dynamics.

Real systems operate in mixed regimes.

Alignment Perspective

Finite-width regime enables:

  • Emergent behaviors.
  • Strategic reasoning.
  • Internal representation restructuring.

NTK regime is more stable and predictable.

Understanding which regime a system operates in helps forecast:

  • Capability scaling
  • Representation drift
  • Alignment complexity

Alignment challenges are stronger in finite-width regime.

Governance Perspective

NTK regime:

  • Easier to analyze formally.
  • More predictable training dynamics.

Finite-width regime:

  • Harder to model.
  • Greater unpredictability.
  • Richer emergent capability.

Policy decisions must assume finite-width dynamics dominate frontier systems.

Summary

NTK Regime:

  • Infinite-width limit.
  • Linearized training.
  • Fixed feature representation.

Finite-Width Regime:

  • Practical deep networks.
  • Strong representation learning.
  • Nonlinear training dynamics.

Modern AI systems operate primarily in the finite-width regime.

Related Concepts

  • Neural Tangent Kernel (NTK)
  • Feature Learning vs Lazy Training
  • Overparameterization vs Underparameterization
  • Double Descent
  • Implicit Regularization
  • Gradient Flow
  • Scaling Laws
  • Representation Learning