Multi-Head Attention

Short Definition

Multi-Head Attention is an extension of the attention mechanism that allows a model to attend to information from multiple representation subspaces simultaneously.

Definition

Multi-Head Attention is a neural architecture component that applies multiple parallel attention operations (called heads) to the same input. Each head learns different projection matrices for queries, keys, and values, enabling the model to capture diverse patterns and relationships within the data.

Instead of one focus, the model uses many.

Why It Matters

A single attention head:

  • may focus on limited patterns
  • captures one type of relationship at a time

Multi-head attention:

  • models multiple dependency types simultaneously
  • captures syntactic and semantic patterns in parallel
  • improves representational richness
  • stabilizes learning

Parallel attention increases expressiveness.

Core Mechanism

For each head i:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

All heads are concatenated:

MultiHead(Q, K, V) = Concat(head₁, ..., head_h) W^O

Where:

  • h = number of heads
  • W_i^Q, W_i^K, W_i^V = learned projections
  • W^O = output projection

Each head learns a different view of the data.
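The mechanism above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the function names (`multi_head_attention`, `softmax`) and the per-head weight lists are choices made for clarity, and batching, masking, and dropout are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, weights_q, weights_k, weights_v, W_O):
    # Q, K, V: (seq_len, d_model). weights_q/k/v: lists of h projection
    # matrices, each (d_model, d_k). W_O: (h * d_k, d_model).
    heads = []
    for W_q, W_k, W_v in zip(weights_q, weights_k, weights_v):
        q, k, v = Q @ W_q, K @ W_k, V @ W_v
        d_k = q.shape[-1]
        scores = softmax(q @ k.T / np.sqrt(d_k))  # (seq_len, seq_len) per head
        heads.append(scores @ v)                  # (seq_len, d_k)
    # Concat(head_1, ..., head_h) W^O, as in the formula above.
    return np.concatenate(heads, axis=-1) @ W_O   # (seq_len, d_model)

# Tiny example: d_model = 8, h = 2 heads, d_k = 4, sequence length 3.
rng = np.random.default_rng(0)
d_model, h, d_k, n = 8, 2, 4, 3
X = rng.normal(size=(n, d_model))
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, X, X, Wq, Wk, Wv, W_O)
print(out.shape)  # (3, 8)
```

Note that each head sees the same input but projects it through its own learned matrices, which is what allows the heads to specialize.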

Minimal Conceptual Illustration

Input → Head 1 ↘
Input → Head 2 → Concat → Linear → Output
Input → Head 3 ↗

Multiple perspectives combine into one representation.

Intuition

Imagine reading a sentence:

  • One head tracks subject–verb agreement
  • Another tracks long-range references
  • Another captures positional structure

Each head specializes.

Role in Transformers

Multi-head attention is a core building block of the Transformer architecture:

  • Used in encoder self-attention
  • Used in decoder self-attention
  • Used in encoder–decoder attention

It enables rich contextual modeling.

Why Not Just One Large Head?

Instead of increasing dimensionality in a single head:

  • Multiple smaller heads encourage specialization
  • Parallel subspaces improve generalization
  • Empirically performs better

Diversity outperforms monolithic focus.

Computational Properties

  • Complexity remains O(n²) for sequence length n
  • Heads operate in parallel
  • Projection cost scales with number of heads

For a fixed model dimension, expressiveness increases without additional asymptotic cost: adding heads shrinks the per-head dimension rather than growing the total computation.
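A rough operation count makes the property above concrete. The sketch below (the function name and the constant factors are illustrative assumptions, not a standard formula) shows that with d_k = d_model / h, the total cost is independent of the head count h:

```python
def attention_cost(n, d_model, h):
    # Rough per-layer operation count for multi-head self-attention,
    # ignoring constant factors. Per-head dimension d_k = d_model // h.
    d_k = d_model // h
    projections = 4 * n * d_model * d_model  # Q, K, V, and output projections
    scores = h * n * n * d_k                 # QK^T per head: h * n^2 * d_k = n^2 * d_model
    weighted_sum = h * n * n * d_k           # attention-weighted sum of values
    return projections + scores + weighted_sum

# Doubling the head count (model dimension fixed) leaves the total unchanged:
print(attention_cost(512, 768, 8) == attention_cost(512, 768, 16))  # True
```

The score and value terms simplify to n² · d_model regardless of h, which is why head count affects specialization, not asymptotic cost.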

Multi-Head vs Single-Head Attention

Aspect                        Single-Head   Multi-Head
Representational diversity    Low           High
Dependency types captured     Limited       Multiple
Empirical performance         Lower         Higher
Used in Transformers          No            Yes

Multi-head attention is standard in modern architectures.

Practical Considerations

When configuring multi-head attention:

  • Ensure embedding dimension divisible by number of heads
  • Monitor memory usage for long sequences
  • Tune head count based on model size

Too many heads can reduce per-head capacity.
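The divisibility constraint above is easy to check up front. A minimal sketch (the helper name `head_dim` is a hypothetical choice for illustration):

```python
def head_dim(d_model, num_heads):
    # The embedding dimension must split evenly across heads,
    # so that each head gets an equal-sized subspace.
    if d_model % num_heads != 0:
        raise ValueError(
            f"d_model={d_model} is not divisible by num_heads={num_heads}"
        )
    return d_model // num_heads

print(head_dim(768, 12))  # 64: each of 12 heads attends in a 64-dim subspace
```

Frameworks typically raise a similar error at model construction time; checking early keeps the failure close to the misconfiguration.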

Common Pitfalls

  • Assuming more heads always improve performance
  • Ignoring dimension constraints
  • Misinterpreting attention weights as explanations
  • Underestimating computational cost

Attention heads are not interpretable “experts” by default.

Summary Characteristics

Aspect             Multi-Head Attention
Mechanism type     Parallel attention
Core advantage     Representation diversity
Primary use        Transformers
Complexity         O(n²) per layer
Key requirement    Dimensional divisibility

Related Concepts