Cross-Attention

Short Definition

Cross-Attention is an attention mechanism where the query vectors come from one sequence while the key and value vectors come from another sequence. It allows a model to incorporate information from an external context when generating representations or predictions.

Cross-attention is a core component of encoder–decoder architectures such as the original Transformer.

Definition

In standard self-attention, the queries, keys, and values are derived from the same sequence.

In cross-attention, they originate from different sequences.

Let:

  • (Q) = query vectors from the decoder
  • (K) = key vectors from the encoder
  • (V) = value vectors from the encoder

The attention computation becomes:

[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
]

Here the decoder queries the encoded input sequence to retrieve relevant information.
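The formula above can be sketched directly in NumPy. This is a minimal illustration assuming small toy shapes and omitting the learned projection matrices that a real model would apply to produce Q, K, and V:

```python
import numpy as np

def cross_attention(Q, K, V):
    """Scaled dot-product attention where Q comes from one sequence
    (decoder) and K, V come from another (encoder)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, m) decoder-to-encoder scores
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over encoder positions
    return weights @ V, weights                      # context vectors, attention map

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 decoder positions, d_k = 8
K = rng.normal(size=(5, 8))   # 5 encoder positions
V = rng.normal(size=(5, 8))
Z, A = cross_attention(Q, K, V)
print(Z.shape, A.shape)  # (3, 8) (3, 5)
```

Each of the 3 decoder positions receives one context vector; each row of the attention map is a probability distribution over the 5 encoder positions.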

Core Idea

Cross-attention enables interaction between two different sequences.

Typical scenario:

Input sentence → Encoder → Encoded representations
        ↓
Decoder queries the encoded representations via cross-attention
        ↓
Generated output

The decoder dynamically attends to different parts of the input while generating tokens.

Minimal Conceptual Illustration

Example: machine translation.

Input sentence:

The cat sat on the mat

During generation of the French translation:

Le chat s’est assis sur le tapis

When generating assis, the decoder may attend strongly to:

sat

Cross-attention allows the model to retrieve the most relevant encoded token.

Self-Attention vs Cross-Attention

Property       Self-Attention          Cross-Attention
Query source   same sequence           different sequence
Key source     same sequence           external sequence
Value source   same sequence           external sequence
Purpose        contextualize tokens    connect input and output

Self-attention models relationships within a sequence, while cross-attention models relationships between sequences.
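The contrast in the table comes down to which sequences supply Q, K, and V; the attention computation itself is identical. A small sketch, assuming toy shapes and omitting learned projections:

```python
import numpy as np

def attention(Q, K, V):
    # generic scaled dot-product attention
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))   # one sequence (e.g. decoder states)
y = rng.normal(size=(6, 8))   # another sequence (e.g. encoder states)

self_out = attention(x, x, x)    # self-attention: Q, K, V from the same sequence
cross_out = attention(x, y, y)   # cross-attention: Q from x; K, V from y
print(self_out.shape, cross_out.shape)  # (4, 8) (4, 8)
```

Both calls return one output vector per query position; only the source of the keys and values changes.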

Role in Transformer Architecture

In encoder–decoder Transformers, the decoder layer typically contains:

  • Masked Self-Attention
  • Cross-Attention
  • Feedforward Network

Flow:

Masked Self-Attention → Cross-Attention → Feedforward Network

Sitting between the self-attention and feedforward sublayers, cross-attention allows the decoder to access encoded input information.
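The three sublayers can be sketched as one simplified decoder layer. This is an illustrative assumption-laden version: layer normalization, multi-head splitting, and learned attention projections are omitted, and the weight names are hypothetical:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # hide future positions
    return softmax(scores) @ V

def decoder_layer(dec, enc, W1, W2):
    n = dec.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))       # lower-triangular mask
    dec = dec + attention(dec, dec, dec, mask=causal)   # 1. masked self-attention
    dec = dec + attention(dec, enc, enc)                # 2. cross-attention
    dec = dec + np.maximum(dec @ W1, 0) @ W2            # 3. feedforward (ReLU)
    return dec

rng = np.random.default_rng(2)
d = 8
dec = rng.normal(size=(4, d))          # 4 decoder positions
enc = rng.normal(size=(6, d))          # 6 encoder positions
W1 = rng.normal(size=(d, 16)) * 0.1    # toy feedforward weights
W2 = rng.normal(size=(16, d)) * 0.1
out = decoder_layer(dec, enc, W1, W2)
print(out.shape)  # (4, 8)
```

Note that only step 2 touches the encoder states; steps 1 and 3 operate purely on the decoder sequence.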

Mathematical Interpretation

For each decoder token representation (q_i), attention weights are computed against all encoder keys (k_j):

[
\alpha_{ij} = \frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{j'} \exp\left(q_i \cdot k_{j'} / \sqrt{d_k}\right)}
]

where the softmax normalizes over all encoder positions (j).

The resulting representation is:

[
z_i = \sum_j \alpha_{ij} v_j
]

This produces a context vector conditioned on the input sequence.
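The per-token computation above can be traced for a single decoder query. A minimal sketch with assumed toy shapes:

```python
import numpy as np

rng = np.random.default_rng(3)
d_k = 8
q_i = rng.normal(size=d_k)        # one decoder query
K = rng.normal(size=(5, d_k))     # encoder keys k_j
V = rng.normal(size=(5, d_k))     # encoder values v_j

logits = K @ q_i / np.sqrt(d_k)                  # q_i . k_j / sqrt(d_k) for each j
alpha = np.exp(logits) / np.exp(logits).sum()    # softmax over encoder positions j
z_i = alpha @ V                                  # z_i = sum_j alpha_ij * v_j
print(round(alpha.sum(), 6), z_i.shape)  # 1.0 (8,)
```

The weights alpha sum to 1, so z_i is a convex combination of the encoder value vectors.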

Applications

Cross-attention is widely used in tasks that involve conditional generation.

Examples include:

  • machine translation
  • text summarization
  • image captioning
  • speech recognition
  • multimodal models

In multimodal systems, cross-attention often connects text tokens to visual features.

Advantages

Cross-attention provides several benefits:

  • direct conditioning on input representations
  • flexible alignment between input and output tokens
  • improved performance on sequence-to-sequence tasks

It allows the model to dynamically retrieve relevant information from another representation space.

Limitations

Cross-attention introduces additional computational cost: every decoder position must be scored against every encoder position.

The size of the attention score matrix therefore grows as:

[
O(n \times m)
]

where:

  • (n) = decoder sequence length
  • (m) = encoder sequence length

For very long inputs, this can become expensive.
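The O(n × m) growth is easy to make concrete by counting the memory one score matrix occupies. The sequence lengths below are hypothetical, and a real model multiplies this further by the number of heads and layers:

```python
import numpy as np

def score_matrix_mb(n, m, dtype=np.float32):
    """Memory for a single n x m cross-attention score matrix, one head."""
    return n * m * np.dtype(dtype).itemsize / 1e6

for n, m in [(128, 128), (128, 4096), (1024, 65536)]:
    print(f"n={n}, m={m}: {score_matrix_mb(n, m):.2f} MB")
```

A 512x longer encoder input makes the matrix 512x larger, which is why long-input settings often resort to sparse or chunked attention variants.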

Summary

Cross-attention is an attention mechanism that allows a model to attend to representations from a different sequence. It plays a critical role in encoder–decoder architectures by enabling the decoder to access information from the encoded input sequence during generation.

Related Concepts

  • Self-Attention
  • Transformer Architecture
  • Encoder–Decoder Models
  • Autoregressive Models
  • Attention Mechanism
  • Causal Masking