Short Definition
Routing entropy measures the uncertainty or diversity of routing decisions in conditional or sparse neural network architectures.
Definition
Routing entropy quantifies how evenly a routing mechanism distributes inputs across available computation paths (e.g., experts in an MoE). High routing entropy indicates diverse, exploratory routing, while low routing entropy indicates concentrated, deterministic routing.
Entropy reveals how much the router explores.
Why It Matters
Routing entropy is a critical diagnostic for sparse models. It helps detect:
- expert collapse
- premature specialization
- routing rigidity
- underutilized capacity
Low entropy often precedes failure.
Core Concept
Given routing probabilities over experts ( p_1, \dots, p_n ), routing entropy is typically computed as:
H = -\sum_{i=1}^{n} p_i \log p_i
Higher values indicate more balanced routing.
Entropy measures decision spread.
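The formula above can be computed directly from a per-token routing distribution. A minimal sketch (the `routing_entropy` helper and the example distributions are illustrative, not from any particular library):

```python
import math

def routing_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a routing distribution over experts.
    eps guards against log(0) for experts receiving zero probability."""
    return -sum(p * math.log(p + eps) for p in probs)

# Uniform routing over 4 experts: maximum entropy, log(4) ≈ 1.386
uniform = [0.25, 0.25, 0.25, 0.25]
# Concentrated routing: nearly all mass on one expert, entropy near 0
peaked = [0.97, 0.01, 0.01, 0.01]

print(routing_entropy(uniform))  # ≈ 1.386
print(routing_entropy(peaked))   # ≈ 0.168
```

In practice this quantity is averaged over tokens in a batch; the maximum possible value is log(n) for n experts, attained only by perfectly uniform routing.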
Minimal Conceptual Illustration
High entropy: A ███ B ███ C ███ D ███
Low entropy:  A ████████ B █ C █ D █
Role in Conditional Computation
Routing entropy governs:
- how much of the model is explored
- how quickly experts specialize
- the trade-off between exploration and exploitation
Entropy tunes learning exposure.
Training Dynamics
During training:
- Early phase: higher entropy encourages exploration
- Mid phase: entropy gradually decreases as specialization emerges
- Late phase: overly low entropy risks collapse
Entropy should decay, not vanish.
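One way to operationalize "decay, not vanish" is a penalty that stays silent while entropy decays naturally but activates once entropy falls below a floor. This is an illustrative sketch, not a standard named technique; the `entropy_floor_penalty` helper and the threshold values are hypothetical:

```python
def entropy_floor_penalty(h, floor):
    """Quadratic penalty that is zero while routing entropy h stays above
    the floor, letting entropy decay naturally but not collapse to zero."""
    return max(0.0, floor - h) ** 2

# Mid-training entropy above the floor: no penalty applied.
print(entropy_floor_penalty(1.2, 0.5))  # 0.0
# Late-training entropy below the floor: penalty pushes entropy back up.
print(entropy_floor_penalty(0.2, 0.5))  # ≈ 0.09
```

Such a term would be added to the training loss with a small coefficient, so it only intervenes near the collapse regime.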
Relationship to Expert Collapse
Expert collapse is often preceded by:
- sharp entropy drops
- persistent skew in routing distributions
- reduced variance in routing decisions
Entropy is an early warning signal.
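The "sharp entropy drop" signal above can be monitored mechanically by comparing windowed averages of the entropy trajectory. A minimal sketch, assuming entropy is logged once per step (the `entropy_drop_alert` helper and its thresholds are hypothetical choices):

```python
def entropy_drop_alert(entropy_history, window=3, drop_threshold=0.5):
    """Flag a sharp relative drop in mean routing entropy between two
    consecutive windows -- a common precursor to expert collapse."""
    if len(entropy_history) < 2 * window:
        return False
    prev = sum(entropy_history[-2 * window:-window]) / window
    curr = sum(entropy_history[-window:]) / window
    return prev > 0 and (prev - curr) / prev > drop_threshold

# Simulated trajectory: healthy gradual decay, then an abrupt collapse.
history = [1.38, 1.35, 1.30, 1.28, 0.40, 0.15, 0.05]
print(entropy_drop_alert(history))  # True
```

A relative (rather than absolute) threshold keeps the alert meaningful across layers whose baseline entropy differs.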
Interaction with Load Balancing
Load balancing mechanisms aim to:
- maintain sufficient routing entropy
- prevent dominant experts
- stabilize sparse training dynamics
Balancing sustains entropy.
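A concrete instance of such a mechanism is the Switch-Transformer-style auxiliary loss, n · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens dispatched to expert i and Pᵢ is the mean router probability for expert i. A pure-Python sketch (real frameworks compute this over batched tensors):

```python
def load_balancing_loss(router_probs, expert_assignments, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i.
    Reaches its minimum of 1.0 when routing is perfectly balanced;
    grows as tokens concentrate on few experts."""
    num_tokens = len(expert_assignments)
    f = [0.0] * num_experts                      # dispatch fractions
    for e in expert_assignments:
        f[e] += 1.0 / num_tokens
    P = [sum(p[i] for p in router_probs) / num_tokens
         for i in range(num_experts)]            # mean gate probabilities
    return num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Perfectly balanced: 4 tokens spread uniformly over 4 experts.
probs = [[0.25] * 4 for _ in range(4)]
print(load_balancing_loss(probs, [0, 1, 2, 3], 4))  # 1.0
```

Because the loss penalizes correlated skew in both dispatch counts and gate probabilities, minimizing it pushes the router back toward higher-entropy, balanced routing.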
Deterministic vs Stochastic Routing
- Deterministic routing: low entropy, stable, collapse-prone
- Stochastic routing: higher entropy, exploratory, noisier
Entropy reflects routing strategy.
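The contrast between the two strategies shows up directly in the entropy of expert usage. A small simulation, assuming a hypothetical fixed gate distribution over 4 experts:

```python
import math
import random

def entropy(probs, eps=1e-12):
    return -sum(p * math.log(p + eps) for p in probs)

def usage_distribution(choices, num_experts):
    """Empirical fraction of tokens routed to each expert."""
    counts = [0] * num_experts
    for c in choices:
        counts[c] += 1
    return [c / len(choices) for c in counts]

random.seed(0)
gate = [0.4, 0.3, 0.2, 0.1]  # hypothetical per-token gate distribution

# Deterministic top-1: every token goes to the argmax expert.
det = [max(range(4), key=lambda i: gate[i]) for _ in range(1000)]
# Stochastic: sample the expert from the gate distribution.
sto = [random.choices(range(4), weights=gate)[0] for _ in range(1000)]

print(entropy(usage_distribution(det, 4)))  # ≈ 0: all tokens on one expert
print(entropy(usage_distribution(sto, 4)))  # ≈ 1.28, near entropy(gate)
```

Deterministic routing drives usage entropy to zero even when the gate itself is uncertain; stochastic routing preserves the gate's entropy in the usage statistics, at the cost of noisier gradients.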
Entropy Control Mechanisms
Common methods to regulate routing entropy include:
- temperature scaling of gate logits
- routing noise injection
- entropy regularization terms
- delayed sparsification schedules
Entropy can be engineered.
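Temperature scaling, the first mechanism above, gives the most direct handle: dividing the gate logits by a temperature before the softmax flattens or sharpens the routing distribution. A minimal sketch with hypothetical logit values:

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs, eps=1e-12):
    return -sum(p * math.log(p + eps) for p in probs)

gate_logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical gate logits, 4 experts

# Higher temperature flattens the gate distribution (more entropy);
# lower temperature sharpens it (less entropy).
for t in [4.0, 1.0, 0.25]:
    print(t, round(entropy(softmax(gate_logits, t)), 3))
```

Annealing the temperature from high to low over training realizes the "decay, not vanish" schedule: early exploration, later specialization, without hard routing from step one.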
Evaluation and Monitoring
Routing entropy should be tracked:
- per layer
- per training phase
- across datasets
- under distribution shift
Static entropy metrics are insufficient.
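For per-layer tracking, raw entropy values are hard to compare when layers have different expert counts, since the maximum is log(n). Normalizing by log(n) puts every layer on a common 0-to-1 scale. A sketch with hypothetical layer snapshots:

```python
import math

def normalized_entropy(probs, eps=1e-12):
    """Entropy divided by log(num_experts): 1.0 means perfectly uniform
    routing, 0.0 means fully concentrated. Comparable across layers
    with different numbers of experts."""
    n = len(probs)
    h = -sum(p * math.log(p + eps) for p in probs)
    return h / math.log(n)

# Hypothetical mean gate distributions logged per layer.
layers = {
    "layer_1": [0.25, 0.25, 0.25, 0.25],   # 4 experts, uniform
    "layer_2": [0.125] * 8,                # 8 experts, uniform
    "layer_3": [0.85, 0.05, 0.05, 0.05],   # 4 experts, heavily skewed
}
for name, probs in layers.items():
    print(name, round(normalized_entropy(probs), 3))
```

Here layers 1 and 2 both report 1.0 despite different expert counts, while layer 3's low score flags it for inspection.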
Failure Modes
Mismanaged routing entropy can lead to:
- wasted capacity (too high)
- expert collapse (too low)
- unstable convergence
- brittle generalization
Entropy extremes are dangerous.
Practical Guidelines
- initialize routing with higher entropy
- monitor entropy trends, not single values
- avoid hard routing early in training
- correlate entropy with expert utilization
- reassess entropy under OOD inputs
Entropy requires governance.
Common Pitfalls
- treating entropy as a tuning afterthought
- optimizing entropy without considering task performance
- assuming high entropy always improves generalization
- ignoring entropy at inference time
- failing to log routing statistics
Entropy without context misleads.
Summary Characteristics
| Aspect | Routing Entropy |
|---|---|
| Measures | Routing diversity |
| Role | Stability & utilization |
| Diagnostic value | High |
| Control difficulty | Moderate |
| Relevance | Critical for sparse models |
Related Concepts
- Architecture & Representation
- Expert Routing
- Conditional Computation
- Sparse Training Dynamics
- Load Balancing in MoE
- Expert Collapse
- Gating Mechanisms