Short Definition
Multi-Head Attention is an extension of the attention mechanism that allows a model to attend to information from multiple representation subspaces simultaneously.
Definition
Multi-Head Attention is a neural architecture component that applies multiple parallel attention operations (called heads) to the same input. Each head learns different projection matrices for queries, keys, and values, enabling the model to capture diverse patterns and relationships within the data.
Instead of one focus, the model uses many.
Why It Matters
A single attention head:
- may focus on limited patterns
- captures one type of relationship at a time
Multi-head attention:
- models multiple dependency types simultaneously
- captures syntactic and semantic patterns in parallel
- improves representational richness
- stabilizes learning
Parallel attention increases expressiveness.
Core Mechanism
For each head i:
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
All heads are concatenated:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
Where:
- h = number of heads
- W_i^Q, W_i^K, W_i^V = learned projections
- W^O = output projection
Each head learns a different view of the data.
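The formulas above can be sketched as a minimal NumPy implementation. This is an illustrative sketch, not a production implementation: the helper names (`attention`, `multi_head_attention`) and the single fused projection matrices per role are assumptions, and masking, dropout, and batching are omitted for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # X: (n, d_model); each W: (d_model, d_model); h: number of heads
    n, d_model = X.shape
    d_head = d_model // h

    def project_and_split(W):
        # Project, then split the last dimension into h heads: (h, n, d_head)
        return (X @ W).reshape(n, h, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = map(project_and_split, (W_q, W_k, W_v))
    heads = attention(Qh, Kh, Vh)                    # (h, n, d_head), in parallel
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # Concat(head_1..head_h)
    return concat @ W_o                              # output projection W^O

rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (4, 8)
```

Note that splitting one (d_model, d_model) projection into h slices is equivalent to learning h separate W_i matrices; frameworks implement it this way for efficiency.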
Minimal Conceptual Illustration
        ┌→ Head 1 ─┐
Input ──┼→ Head 2 ─┼→ Concat → Linear → Output
        └→ Head 3 ─┘
Multiple perspectives combine into one representation.
Intuition
Imagine reading a sentence:
- One head tracks subject–verb agreement
- Another tracks long-range references
- Another captures positional structure
Each head specializes.
Role in Transformers
Multi-head attention is a core building block of the Transformer architecture:
- Used in encoder self-attention
- Used in decoder self-attention
- Used in encoder–decoder attention
It enables rich contextual modeling.
Why Not Just One Large Head?
Instead of increasing dimensionality in a single head:
- Multiple smaller heads encourage specialization
- Parallel subspaces improve generalization
- Empirically outperforms a single large head of the same total dimensionality
Diversity outperforms monolithic focus.
Computational Properties
- Complexity remains O(n²) for sequence length n
- Heads operate in parallel
- With the standard choice d_head = d_model / h, total projection cost is independent of head count
Expressiveness increases without a corresponding growth in parameters or compute.
Multi-Head vs Single-Head Attention
| Aspect | Single-Head | Multi-Head |
|---|---|---|
| Representational diversity | Low | High |
| Dependency types captured | Limited | Multiple |
| Empirical performance | Lower | Higher |
| Standard in Transformers | No | Yes |
Multi-head attention is standard in modern architectures.
Practical Considerations
When configuring multi-head attention:
- Ensure the embedding dimension is divisible by the number of heads
- Monitor memory usage for long sequences
- Tune head count based on model size
Too many heads can reduce per-head capacity.
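The divisibility constraint is easy to validate up front. A minimal sketch (the function name `check_heads` is a hypothetical helper, not a library API):

```python
def check_heads(d_model: int, num_heads: int) -> int:
    # Each head receives d_model // num_heads dimensions; require an even split
    if d_model % num_heads != 0:
        raise ValueError(
            f"d_model={d_model} is not divisible by num_heads={num_heads}"
        )
    return d_model // num_heads

print(check_heads(768, 12))  # per-head dimension: 64
```

The returned per-head dimension is also a useful sanity check: very small values (e.g. under 32) signal that head count may be too high for the model size.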
Common Pitfalls
- Assuming more heads always improve performance
- Ignoring dimension constraints
- Misinterpreting attention weights as explanations
- Underestimating computational cost
Attention heads are not interpretable “experts” by default.
Summary Characteristics
| Aspect | Multi-Head Attention |
|---|---|
| Mechanism type | Parallel attention |
| Core advantage | Representation diversity |
| Primary use | Transformers |
| Complexity | O(n²) per layer |
| Key requirement | Dimensional divisibility |