Feature Learning vs Lazy Training

Short Definition

Feature Learning vs Lazy Training contrasts two regimes of neural network training: one in which internal representations evolve significantly during optimization (feature learning), and one in which the network behaves approximately linearly around its initialization, effectively fitting a fixed set of features determined at initialization (lazy training).

It distinguishes dynamic representation learning from kernel-like behavior.

Definition

Neural networks can operate in two qualitatively different regimes during training:

Feature Learning Regime

  • Internal representations change substantially.
  • Hidden layers evolve to extract task-relevant features.
  • Parameter updates significantly reshape the feature space.
  • Training is strongly nonlinear in parameter space.

This is the regime typically associated with deep learning success.

Lazy Training Regime

  • Network remains close to its random initialization.
  • Feature representations change minimally.
  • Learning behaves approximately linearly.
  • Dynamics resemble Neural Tangent Kernel (NTK) theory.

In this regime, the network acts like a kernel machine.

Core Distinction

| Aspect | Feature Learning | Lazy Training |
|---|---|---|
| Representation evolution | Significant | Minimal |
| Parameter movement | Large | Small |
| Nonlinearity | High | Approx. linear |
| Relation to NTK | Deviates from NTK | Approximates NTK |
| Practical prevalence | Common in modern deep nets | Occurs at extreme width |

Feature learning reshapes internal geometry.
Lazy training preserves initial geometry.

Minimal Conceptual Illustration


Lazy Training:
Initialization → slight parameter change
Feature space ≈ unchanged.

Feature Learning:
Initialization → deep internal restructuring
Feature space transformed.

The difference lies in how much the representation moves.
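How much the representation moves can be made concrete for a toy one-hidden-layer network: train it with gradient descent and compare the hidden activations to their values at initialization. The NumPy sketch below is illustrative only; the sizes, learning rate, and regression task are arbitrary choices, not a canonical setup.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy regression task (illustrative)
X = rng.normal(size=(32, 2))
y = np.sin(X[:, 0])

# Standard-scale one-hidden-layer net: f(x) = v . tanh(W x)
width = 16
W = rng.normal(size=(width, 2))
v = rng.normal(size=width)

def hidden(W):
    """The network's 'feature map': hidden activations for all inputs."""
    return np.tanh(X @ W.T)

H_init = hidden(W)
loss_at_init = np.mean((H_init @ v - y) ** 2)

lr = 0.02
for _ in range(1000):
    H = hidden(W)
    r = H @ v - y                                      # residuals
    grad_v = H.T @ r / len(y)                          # grad of (1/2)*MSE wrt v
    grad_W = ((1 - H**2) * np.outer(r, v)).T @ X / len(y)  # grad wrt W
    v -= lr * grad_v
    W -= lr * grad_W

# Relative change of the feature map from its initial geometry
rel_feature_change = np.linalg.norm(hidden(W) - H_init) / np.linalg.norm(H_init)
print(rel_feature_change)
```

At this modest width the hidden representation moves measurably during training, i.e. the network learns features rather than only re-weighting its initial random ones.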

Mathematical Perspective

Consider a network f(x; θ) with input x and parameters θ.

Under small parameter updates:

f(x; θ) ≈ f(x; θ₀) + ∇θ f(x; θ₀) · (θ − θ₀)

This linearization underlies the NTK regime.
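As a sketch of this claim, the first-order expansion can be checked numerically for a toy one-hidden-layer network. All sizes and the perturbation scale below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer network: f(x; W, v) = v . tanh(W x)
W0 = rng.normal(size=(8, 3))
v0 = rng.normal(size=8)
x = rng.normal(size=3)

def f(W, v):
    return v @ np.tanh(W @ x)

# Analytic gradients of f at (W0, v0)
h = np.tanh(W0 @ x)                      # hidden activations
grad_v = h                               # df/dv_i = h_i
grad_W = np.outer(v0 * (1 - h**2), x)    # df/dW_ij = v_i (1 - h_i^2) x_j

# Small random parameter perturbation
eps = 1e-4
dW = eps * rng.normal(size=W0.shape)
dv = eps * rng.normal(size=v0.shape)

exact = f(W0 + dW, v0 + dv)
linearized = f(W0, v0) + np.sum(grad_W * dW) + grad_v @ dv

# For small steps, the linearization error is second order in eps
print(abs(exact - linearized))
```

For perturbations this small the exact and linearized outputs agree to several decimal places; the lazy regime is the situation where training never leaves this neighborhood.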

Lazy training occurs when ‖θ − θ₀‖ remains small relative to the scale of the initialization.

Feature learning occurs when ‖θ − θ₀‖ grows large enough that this linearization breaks down.
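A rough diagnostic for which regime a toy model is in is to track the relative displacement ‖θ − θ₀‖ / ‖θ₀‖ over training. The NumPy sketch below uses NTK-style 1/√width output scaling; the task, width, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression task
X = rng.normal(size=(32, 2))
y = np.sin(X[:, 0])

# One-hidden-layer net with NTK-style 1/sqrt(width) output scaling
width = 64
W = rng.normal(size=(width, 2)); W0 = W.copy()
v = rng.normal(size=width);      v0 = v.copy()

def predict(W, v):
    return np.tanh(X @ W.T) @ v / np.sqrt(width)

def mse(W, v):
    return np.mean((predict(W, v) - y) ** 2)

loss_at_init = mse(W, v)
lr = 0.1
for _ in range(300):
    H = np.tanh(X @ W.T)                 # hidden activations, (n, width)
    r = predict(W, v) - y                # residuals
    grad_v = H.T @ r / (np.sqrt(width) * len(y))
    grad_W = ((1 - H**2) * np.outer(r, v)).T @ X / (np.sqrt(width) * len(y))
    v -= lr * grad_v
    W -= lr * grad_W

# Small relative displacement => lazy-like; large => feature learning
disp = np.sqrt(np.sum((W - W0)**2) + np.sum((v - v0)**2))
rel_disp = disp / np.sqrt(np.sum(W0**2) + np.sum(v0**2))
print(rel_disp, mse(W, v))
```

The single number `rel_disp` is only a crude proxy: in very wide networks, parameters can move while the feature map barely changes, so kernel-change measures (below, under NTK) are often more informative.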

Width and Learning Rate Effects

Lazy training is more likely when:

  • Network width is extremely large.
  • Learning rate is small.
  • Initialization scale is large.
  • The loss can be driven down before parameters move far from initialization.

Feature learning is encouraged by:

  • Finite width.
  • Larger learning rates.
  • Representation bottlenecks.
  • Architectural inductive bias.
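The width effect can be illustrated (not proven) with a small experiment: train the same toy network at two widths under NTK-style scaling and compare relative parameter movement. Everything below, including the specific widths and seed, is an illustrative sketch.

```python
import numpy as np

def relative_displacement(width, steps=300, lr=0.1, seed=0):
    """Train a toy one-hidden-layer net with 1/sqrt(width) output scaling
    and return ||theta - theta0|| / ||theta0||."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(32, 2))       # same data for every width (same seed)
    y = np.sin(X[:, 0])
    W = rng.normal(size=(width, 2)); W0 = W.copy()
    v = rng.normal(size=width);      v0 = v.copy()
    for _ in range(steps):
        H = np.tanh(X @ W.T)
        r = H @ v / np.sqrt(width) - y
        grad_v = H.T @ r / (np.sqrt(width) * len(y))
        grad_W = ((1 - H**2) * np.outer(r, v)).T @ X / (np.sqrt(width) * len(y))
        v -= lr * grad_v
        W -= lr * grad_W
    disp = np.sqrt(np.sum((W - W0)**2) + np.sum((v - v0)**2))
    return disp / np.sqrt(np.sum(W0**2) + np.sum(v0**2))

narrow = relative_displacement(width=8)
wide = relative_displacement(width=512)
print(narrow, wide)   # the wider net moves relatively less: closer to lazy
```

Under this parameterization, the wider network fits the same data while moving a much smaller relative distance from initialization, which is the sense in which extreme width pushes training toward the lazy regime.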

Relation to NTK

NTK theory describes the lazy training regime.

In infinite-width networks:

  • Kernel remains nearly constant.
  • Representation does not meaningfully evolve.

Practical networks often deviate from strict NTK behavior.
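Kernel constancy can be checked directly for a one-hidden-layer network, where the empirical NTK has a closed form: K(x, x′) = ∇θ f(x) · ∇θ f(x′). The sketch below compares the NTK Gram matrix before and after training; width, seed, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(16, 2))
y = np.sin(X[:, 0])
width = 256
W = rng.normal(size=(width, 2))
v = rng.normal(size=width)

def ntk_gram(W, v):
    """Analytic empirical NTK Gram for f(x) = v . tanh(W x) / sqrt(width):
    K = (H H^T + (D D^T) * (X X^T)) / width, with D_ni = v_i (1 - h_ni^2)."""
    H = np.tanh(X @ W.T)          # hidden activations, (n, width)
    D = (1 - H**2) * v            # gradient factors for the W block
    return (H @ H.T + (D @ D.T) * (X @ X.T)) / width

K_init = ntk_gram(W, v)

lr = 0.1
for _ in range(200):
    H = np.tanh(X @ W.T)
    r = H @ v / np.sqrt(width) - y
    grad_v = H.T @ r / (np.sqrt(width) * len(y))
    grad_W = ((1 - H**2) * np.outer(r, v)).T @ X / (np.sqrt(width) * len(y))
    v -= lr * grad_v
    W -= lr * grad_W

K_final = ntk_gram(W, v)
rel_change = np.linalg.norm(K_final - K_init) / np.linalg.norm(K_init)
print(rel_change)   # small at this width: the kernel is nearly constant
```

Repeating this at smaller widths shows the kernel drifting more during training, which is exactly the deviation from strict NTK behavior seen in practical networks.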

Why Feature Learning Matters

Feature learning enables:

  • Hierarchical representations.
  • Abstraction formation.
  • Task-specific embedding spaces.
  • Emergent capabilities.

Pure lazy training limits representational power.

Deep learning’s success largely depends on feature learning.

Generalization Implications

Lazy training:

  • Generalization governed by kernel properties.
  • Less expressive representational adaptation.

Feature learning:

  • Enables adaptive representation shaping.
  • Can improve robustness and transfer.

However:

  • Strong feature learning may increase overfitting risk.

Balance is task-dependent.

Scaling Perspective

As model width increases:

  • Behavior shifts toward lazy regime.
  • Feature learning becomes weaker relative to linearized dynamics.

However:

  • Real-world large models often remain in mixed regimes.
  • Transformers still exhibit substantial feature learning.

Modern scaling does not eliminate representation evolution.

Alignment Perspective

Feature learning enables:

  • Emergent reasoning abilities.
  • Representation abstraction.
  • Complex internal strategies.

Lazy training may:

  • Limit behavioral flexibility.
  • Reduce representation drift.

Alignment-relevant behaviors depend strongly on feature learning dynamics.

Understanding which regime a model operates in informs capability forecasting.

Governance Perspective

The distinction affects:

  • Predictability of scaling.
  • Theoretical guarantees.
  • Stability under distribution shift.
  • Risk modeling.

Lazy regimes are easier to analyze theoretically.
Feature learning regimes are more powerful but less predictable.

Empirical Observations

Modern deep networks:

  • Operate between pure lazy and full feature-learning regimes.
  • Display partial NTK-like behavior early in training.
  • Transition to stronger feature learning later.

Training dynamics evolve over time.

Summary

Feature Learning:

  • Significant representation evolution.
  • Strong nonlinear dynamics.
  • Enables deep abstraction.

Lazy Training:

  • Minimal representation change.
  • Approximates kernel regression.
  • Linearized behavior around initialization.

The tension between these regimes explains much of modern deep learning theory.

Related Concepts

  • Neural Tangent Kernel (NTK)
  • Double Descent
  • Implicit Regularization
  • Overparameterization
  • Scaling Laws
  • Interpolation Regime
  • Representation Learning
  • Gradient Flow