Routing Entropy

Short Definition

Routing entropy measures the uncertainty or diversity of routing decisions in conditional or sparse neural network architectures.

Definition

Routing entropy quantifies how evenly a routing mechanism distributes inputs across available computation paths (e.g., experts in a Mixture-of-Experts (MoE) model). High routing entropy indicates diverse, exploratory routing, while low routing entropy indicates concentrated, deterministic routing.

Entropy reveals how much the router explores.

Why It Matters

Routing entropy is a critical diagnostic for sparse models. It helps detect:

  • expert collapse
  • premature specialization
  • routing rigidity
  • underutilized capacity

Low entropy often precedes failure.

Core Concept

Given routing probabilities p_1, …, p_n over experts, routing entropy is typically computed as:

H = -Σ_i p_i log(p_i)

Higher values indicate more balanced routing.

Entropy measures decision spread.
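The formula above can be sketched directly. A minimal implementation, assuming the router's probabilities are given as a plain list that sums to 1:

```python
import math

def routing_entropy(probs):
    """Shannon entropy of a routing distribution over experts.

    probs: routing probabilities p_1..p_n (assumed to sum to 1).
    Returns H = -sum(p_i * log(p_i)) in nats; 0*log(0) is treated as 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

# Uniform routing over 4 experts hits the maximum, log(4)
print(routing_entropy([0.25, 0.25, 0.25, 0.25]))  # ≈ 1.386

# Concentrated routing is close to zero
print(routing_entropy([0.97, 0.01, 0.01, 0.01]))  # ≈ 0.168
```

The uniform distribution over n experts gives the maximum value log(n), which is a useful normalization constant when comparing layers with different expert counts.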

Minimal Conceptual Illustration

High entropy: A ███ B ███ C ███ D ███
Low entropy: A ████████ B █ C █ D █

Role in Conditional Computation

Routing entropy governs:

  • how much of the model is explored
  • how quickly experts specialize
  • the trade-off between exploration and exploitation

Entropy tunes learning exposure.

Training Dynamics

During training:

  • Early phase: higher entropy encourages exploration
  • Mid phase: entropy gradually decreases as specialization emerges
  • Late phase: overly low entropy risks collapse

Entropy should decay, not vanish.

Relationship to Expert Collapse

Expert collapse is often preceded by:

  • sharp entropy drops
  • persistent skew in routing distributions
  • reduced variance in routing decisions

Entropy is an early warning signal.
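A sharp entropy drop can be detected mechanically. One simple sketch (the window size and drop ratio are illustrative thresholds, not established defaults) compares the mean entropy of the most recent steps against the window just before it:

```python
def entropy_drop_alert(entropy_history, window=5, drop_ratio=0.5):
    """Flag a potential collapse when mean routing entropy over the most
    recent `window` steps falls below `drop_ratio` times the mean of the
    preceding window. Returns False until enough history accumulates."""
    if len(entropy_history) < 2 * window:
        return False
    prev = sum(entropy_history[-2 * window:-window]) / window
    recent = sum(entropy_history[-window:]) / window
    return recent < drop_ratio * prev
```

For example, a history of five steps at 1.3 nats followed by five steps at 0.4 nats triggers the alert, while a flat history does not.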

Interaction with Load Balancing

Load balancing mechanisms aim to:

  • maintain sufficient routing entropy
  • prevent dominant experts
  • stabilize sparse training dynamics

Balancing sustains entropy.
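One common load-balancing device is an auxiliary loss in the style of Switch Transformers: n · Σ_i f_i · P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. A minimal sketch, assuming hard top-1 assignments are available as a list of expert indices:

```python
def load_balance_loss(route_probs, assignments, n_experts):
    """Auxiliary load-balancing loss: n * sum_i f_i * P_i.

    route_probs: per-token routing probability vectors.
    assignments: per-token chosen expert index (top-1 routing).
    The loss is minimized at 1.0 when both f and P are uniform,
    so any routing skew pushes it above 1.0.
    """
    n_tokens = len(assignments)
    f = [assignments.count(e) / n_tokens for e in range(n_experts)]
    P = [sum(p[e] for p in route_probs) / n_tokens for e in range(n_experts)]
    return n_experts * sum(fi * Pi for fi, Pi in zip(f, P))
```

Because the loss grows when routing concentrates, minimizing it alongside the task loss indirectly keeps routing entropy from collapsing.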

Deterministic vs Stochastic Routing

  • Deterministic routing: low entropy, stable, collapse-prone
  • Stochastic routing: higher entropy, exploratory, noisier

Entropy reflects routing strategy.
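The two strategies can be contrasted in a few lines. A sketch, assuming the router produces raw gate logits per token:

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def route_deterministic(logits):
    """Top-1 routing: always pick the highest-scoring expert (low entropy)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def route_stochastic(logits, rng=random):
    """Sample an expert from the softmax distribution (higher entropy)."""
    probs = softmax(logits)
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

Deterministic routing maps identical logits to the same expert every time, while stochastic routing occasionally visits lower-scoring experts, which is exactly the exploration that keeps entropy up.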

Entropy Control Mechanisms

Common methods to regulate routing entropy include:

  • temperature scaling of gate logits
  • routing noise injection
  • entropy regularization terms
  • delayed sparsification schedules

Entropy can be engineered.
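Temperature scaling, the first mechanism above, is the simplest to illustrate: dividing the gate logits by a temperature before the softmax flattens or sharpens the distribution, raising or lowering entropy. A minimal sketch:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over gate logits scaled by a temperature.
    temperature > 1 flattens the distribution (raises entropy);
    temperature < 1 sharpens it (lowers entropy)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Annealing the temperature downward over training realizes the "decay, not vanish" schedule described earlier, as long as it is floored above zero.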

Evaluation and Monitoring

Routing entropy should be tracked:

  • per layer
  • per training phase
  • across datasets
  • under distribution shift

Static entropy metrics are insufficient.
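Per-layer tracking can be as simple as accumulating entropies keyed by layer name. A sketch (the layer labels are illustrative):

```python
import math
from collections import defaultdict

class RoutingEntropyMonitor:
    """Accumulate routing entropy per layer so trends, not single
    values, can be inspected across training phases or datasets."""

    def __init__(self):
        self.records = defaultdict(list)

    def log(self, layer, probs):
        h = -sum(p * math.log(p) for p in probs if p > 0)
        self.records[layer].append(h)

    def mean_entropy(self, layer):
        vals = self.records[layer]
        return sum(vals) / len(vals) if vals else float("nan")
```

Running the same monitor over an out-of-distribution evaluation set and comparing per-layer means against the training baseline surfaces the distribution-shift effects mentioned above.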

Failure Modes

Mismanaged routing entropy can lead to:

  • wasted capacity (too high)
  • expert collapse (too low)
  • unstable convergence
  • brittle generalization

Entropy extremes are dangerous.

Practical Guidelines

  • initialize routing with higher entropy
  • monitor entropy trends, not single values
  • avoid hard routing early in training
  • correlate entropy with expert utilization
  • reassess entropy under OOD inputs

Entropy requires governance.

Common Pitfalls

  • treating entropy as a tuning afterthought
  • optimizing entropy without considering task performance
  • assuming high entropy always improves generalization
  • ignoring entropy at inference time
  • failing to log routing statistics

Entropy without context misleads.

Summary Characteristics

Aspect              Routing Entropy
Measures            Routing diversity
Role                Stability & utilization
Diagnostic value    High
Control difficulty  Moderate
Relevance           Critical for sparse models

Related Concepts

  • Architecture & Representation
  • Expert Routing
  • Conditional Computation
  • Sparse Training Dynamics
  • Load Balancing in MoE
  • Expert Collapse
  • Gating Mechanisms