Short Definition
NTK Regime vs Finite-Width Regime contrasts two training dynamics in neural networks: the NTK regime describes behavior near infinite width where learning is approximately linear and representations remain fixed, while the finite-width regime reflects realistic networks where representations evolve significantly during training.
One is analytically tractable; the other enables rich feature learning.
Definition
Neural network behavior depends strongly on model width and training dynamics.
Two important regimes are:
NTK Regime (Lazy Training)
In the infinite-width limit:
- Parameter updates remain small.
- Network function evolves linearly around initialization.
- The Neural Tangent Kernel (NTK) remains nearly constant.
- Training resembles kernel regression.
Formally, the model is approximated by:
[
f(x; \theta) \approx f(x; \theta_0)
+ \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0)
]
This linearization governs dynamics.
Representations do not meaningfully change.
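The linearization above can be checked numerically. Below is a minimal NumPy sketch using a hypothetical one-hidden-layer tanh network (the width, input dimension, and perturbation scale are illustrative assumptions, not values from the text): for a small parameter perturbation, the first-order Taylor expansion closely tracks the exact network output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer model f(x; W) = v . tanh(W x).
# Sizes are arbitrary illustrative choices.
width, dim = 512, 4
W0 = rng.normal(size=(width, dim)) / np.sqrt(dim)
v = rng.normal(size=width) / np.sqrt(width)

def f(x, W):
    return v @ np.tanh(W @ x)

def grad_W(x, W):
    # d f / d W_{jk} = v_j * sech^2(w_j . x) * x_k
    pre = W @ x
    return np.outer(v / np.cosh(pre) ** 2, x)

x = rng.normal(size=dim)
dW = 1e-3 * rng.normal(size=(width, dim))        # small parameter perturbation

exact = f(x, W0 + dW)
linear = f(x, W0) + np.sum(grad_W(x, W0) * dW)   # first-order Taylor term

# For small parameter movement the linearized model matches the exact network.
print(abs(exact - linear))
```

The discrepancy is second order in the perturbation, which is why small parameter drift keeps the network in the linear regime.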
Finite-Width Regime (Feature Learning)
In realistic finite networks:
- Parameters move significantly from initialization.
- Hidden representations evolve.
- Feature geometry changes.
- Nonlinear dynamics dominate.
The kernel is not fixed during training.
Learning is strongly representation-dependent.
Core Difference
| Aspect | NTK Regime | Finite-Width Regime |
|---|---|---|
| Width | Infinite (theoretical) | Finite (practical) |
| Representation change | Minimal | Significant |
| Training dynamics | Linearized | Nonlinear |
| Analytical tractability | High | Low |
| Feature learning | Limited | Strong |
NTK preserves initial feature space.
Finite width reshapes it.
Minimal Conceptual Illustration
NTK:
Initialization → small parameter drift
Feature space ≈ constant.
Finite width:
Initialization → major restructuring
Feature space transforms.
The key difference is representation evolution.
Kernel Stability
In the NTK regime, the kernel
[
K(x, x') = \nabla_\theta f(x; \theta)^\top \nabla_\theta f(x'; \theta)
]
remains approximately constant during training.
In the finite-width regime:
- The kernel evolves.
- Learning reshapes similarity structure.
- New features emerge.
This difference fundamentally changes generalization behavior.
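This contrast can be illustrated empirically: computing the kernel from the network Jacobian before and after gradient descent shows it drifting in a deliberately narrow network. A minimal NumPy sketch follows; the architecture, data, and hyperparameters are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# A narrow toy network, so the kernel visibly moves during training.
width, dim, n = 8, 3, 5
X = rng.normal(size=(n, dim))
y = rng.normal(size=n)
W = rng.normal(size=(width, dim)) / np.sqrt(dim)
v = rng.normal(size=width) / np.sqrt(width)

def preds(W, v):
    return np.tanh(X @ W.T) @ v

def jacobian(W, v):
    # Row i holds d f(x_i) / d(theta) over all parameters, flattened.
    H = np.tanh(X @ W.T)                       # (n, width) hidden activations
    S = 1.0 - H ** 2                           # tanh'
    JW = (S * v)[:, :, None] * X[:, None, :]   # (n, width, dim): d f / d W
    return np.concatenate([JW.reshape(n, -1), H], axis=1)

def empirical_ntk(W, v):
    J = jacobian(W, v)
    return J @ J.T                             # K[i, j] = <grad f(x_i), grad f(x_j)>

K_init = empirical_ntk(W, v)
lr = 0.1
for _ in range(200):                           # plain gradient descent on MSE
    J = jacobian(W, v)
    g = J.T @ (preds(W, v) - y) / n
    W -= lr * g[: width * dim].reshape(width, dim)
    v -= lr * g[width * dim:]

K_final = empirical_ntk(W, v)
drift = np.linalg.norm(K_final - K_init) / np.linalg.norm(K_init)
print(f"relative kernel drift: {drift:.3f}")
```

Repeating the measurement with a much wider network and a smaller learning rate would show far smaller relative drift, consistent with the NTK regime.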
Relation to Feature Learning vs Lazy Training
NTK regime ≈ Lazy Training.
Finite-width regime ≈ Feature Learning.
Lazy training:
- Linear approximation holds.
Feature learning:
- Network escapes linear regime.
- Representation hierarchy forms.
Modern deep learning relies heavily on finite-width effects.
Width and Learning Rate Effects
NTK regime is encouraged by:
- Extremely wide networks.
- Small learning rates.
- Large initialization scales.
Finite-width behavior emerges with:
- Moderate width.
- Larger learning rates.
- Strong representation bottlenecks.
Practical models sit between regimes.
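The role of scale can be sketched directly with the common output-scaling construction: multiply the network output by a factor alpha and shrink the learning rate by alpha squared, so the function-space step size stays comparable while parameter movement shrinks as alpha grows. The sketch below is illustrative only; every size and hyperparameter is an assumption chosen for the demonstration.

```python
import numpy as np

width, dim, n = 64, 3, 10
data_rng = np.random.default_rng(2)
X = data_rng.normal(size=(n, dim))
y = data_rng.normal(size=n)

def train(alpha, lr=0.5, steps=500):
    rng = np.random.default_rng(3)               # identical init for every alpha
    W = rng.normal(size=(width, dim)) / np.sqrt(dim)
    v = rng.normal(size=width)
    W0 = W.copy()
    for _ in range(steps):
        H = np.tanh(X @ W.T)                     # (n, width) hidden activations
        f = alpha * (H @ v) / np.sqrt(width)     # output scaled by alpha
        r = (f - y) / n                          # residual term of the MSE gradient
        S = 1.0 - H ** 2                         # tanh'
        gW = alpha / np.sqrt(width) * ((S * v * r[:, None]).T @ X)
        gv = alpha / np.sqrt(width) * (H.T @ r)
        # lr / alpha^2 keeps the function-space step comparable across alphas.
        W -= (lr / alpha ** 2) * gW
        v -= (lr / alpha ** 2) * gv
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

drift_rich = train(alpha=1.0)
drift_lazy = train(alpha=10.0)
print(f"relative weight movement: alpha=1 -> {drift_rich:.4f}, "
      f"alpha=10 -> {drift_lazy:.4f}")
```

The larger output scale produces markedly smaller relative parameter movement from the same initialization, which is the lazy-training effect the list above describes.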
Generalization Implications
NTK regime:
- Generalization governed by kernel properties.
- More predictable behavior.
- Limited expressive adaptation.
Finite-width regime:
- Learns task-specific features.
- Often achieves superior performance.
- Harder to analyze theoretically.
Generalization theory differs across regimes.
Scaling Perspective
As width increases:
- Behavior moves toward NTK regime.
- Feature learning becomes weaker relative to kernel effects.
However:
- Modern Transformers still exhibit strong feature learning.
- Infinite-width assumptions do not fully capture LLM dynamics.
Real systems operate in mixed regimes.
Alignment Perspective
Finite-width regime enables:
- Emergent behaviors.
- Strategic reasoning.
- Internal representation restructuring.
NTK regime is more stable and predictable.
Understanding regime helps forecast:
- Capability scaling
- Representation drift
- Alignment complexity
Alignment challenges are stronger in finite-width regime.
Governance Perspective
NTK regime:
- Easier to analyze formally.
- More predictable training dynamics.
Finite-width regime:
- Harder to model.
- Greater unpredictability.
- Richer emergent capability.
Policy decisions must assume finite-width dynamics dominate frontier systems.
Summary
NTK Regime:
- Infinite-width limit.
- Linearized training.
- Fixed feature representation.
Finite-Width Regime:
- Practical deep networks.
- Strong representation learning.
- Nonlinear training dynamics.
Modern AI systems operate primarily in the finite-width regime.
Related Concepts
- Neural Tangent Kernel (NTK)
- Feature Learning vs Lazy Training
- Overparameterization vs Underparameterization
- Double Descent
- Implicit Regularization
- Gradient Flow
- Scaling Laws
- Representation Learning