Short Definition
Cross-Attention is an attention mechanism where the query vectors come from one sequence while the key and value vectors come from another sequence. It allows a model to incorporate information from an external context when generating representations or predictions.
Cross-attention is a core component of encoder–decoder architectures such as the original Transformer.
Definition
In standard self-attention, the queries, keys, and values are derived from the same sequence.
In cross-attention, they originate from different sequences.
Let:
- (Q) = query vectors from the decoder
- (K) = key vectors from the encoder
- (V) = value vectors from the encoder
The attention computation becomes:
[
Attention(Q,K,V) =
softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V
]
Here the decoder queries the encoded input sequence to retrieve relevant information.
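The formula above can be sketched directly in NumPy. This is an illustrative toy implementation, not a production layer: the function name `cross_attention` and all array shapes are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    # Q: (n, d_k) decoder queries; K: (m, d_k), V: (m, d_v) from the encoder.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, m): each decoder token vs each encoder token
    weights = softmax(scores, axis=-1)  # rows sum to 1 over encoder positions
    return weights @ V                  # (n, d_v): one context vector per decoder token

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 decoder tokens (hypothetical sizes)
K = rng.normal(size=(6, 8))  # 6 encoder tokens
V = rng.normal(size=(6, 8))
out = cross_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note that the output has one row per decoder token, even though the keys and values came from a sequence of a different length.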
Core Idea
Cross-attention enables interaction between two different sequences.
Typical scenario:
Input sentence → Encoder → Encoded representations
↓
Decoder queries encoder
↓
Generated output
The decoder dynamically attends to different parts of the input while generating tokens.

Minimal Conceptual Illustration
Example: machine translation.
Input sentence:
The cat sat on the mat
While generating the French translation:
Le chat s’est assis sur le tapis
When generating assis, the decoder may attend strongly to:
sat
Cross-attention allows the model to retrieve the most relevant encoded token.
Self-Attention vs Cross-Attention
| Property | Self-Attention | Cross-Attention |
|---|---|---|
| Query source | same sequence | different sequence |
| Key source | same sequence | external sequence |
| Value source | same sequence | external sequence |
| Purpose | contextualize tokens | connect input and output |
Self-attention models relationships within a sequence, while cross-attention models relationships between sequences.
Role in Transformer Architecture
In encoder–decoder Transformers, the decoder layer typically contains:
- Masked Self-Attention
- Cross-Attention
- Feedforward Network
Flow: the target tokens first pass through masked self-attention, the result then queries the encoder output via cross-attention (which supplies the keys and values), and the feedforward network transforms the outcome.
Cross-attention is thus the only point where the decoder accesses encoded input information.
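The three sub-layers above can be sketched as one simplified decoder layer in NumPy. This is a hedged illustration: layer normalization, multiple heads, and projection matrices for Q/K/V are omitted, and the names `decoder_layer`, `W1`, `W2` are assumptions for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block attention to masked positions
    return softmax(scores) @ V

def decoder_layer(dec, enc, W1, W2):
    n = dec.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))      # lower-triangular causal mask
    dec = dec + attention(dec, dec, dec, mask=causal)  # 1. masked self-attention
    dec = dec + attention(dec, enc, enc)               # 2. cross-attention over encoder output
    return dec + np.maximum(dec @ W1, 0) @ W2          # 3. feedforward network (ReLU)

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))  # encoder output: 6 input tokens
dec = rng.normal(size=(4, 8))  # decoder states: 4 target tokens
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 8))
print(decoder_layer(dec, enc, W1, W2).shape)  # (4, 8)
```

Only step 2 touches `enc`; steps 1 and 3 operate purely on the decoder's own states.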
Mathematical Interpretation
For each decoder token representation (q_i), attention weights are computed by a softmax over all encoder keys (k_j):
[
\alpha_{ij} =
\frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{j'} \exp\left(q_i \cdot k_{j'} / \sqrt{d_k}\right)}
]
The resulting representation is:
[
z_i = \sum_j \alpha_{ij} v_j
]
This produces a context vector conditioned on the input sequence.
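The per-token view can be checked numerically against the batched matrix form: computing z_i for a single query q_i gives the same result as the corresponding row of the full softmax(QKᵀ/√d_k)V product. A small sketch with assumed dimensions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 8
K = rng.normal(size=(6, d_k))  # encoder keys k_j
V = rng.normal(size=(6, d_k))  # encoder values v_j
q = rng.normal(size=(d_k,))    # one decoder query q_i

# alpha_ij: softmax over all encoder positions j
alpha = softmax(q @ K.T / np.sqrt(d_k))  # shape (6,), sums to 1
z = alpha @ V                            # z_i = sum_j alpha_ij * v_j

# Same row computed via the batched matrix form with a one-row Q
z_matrix = softmax(q[None, :] @ K.T / np.sqrt(d_k)) @ V
print(np.allclose(z, z_matrix[0]))  # True
```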
Applications
Cross-attention is widely used in tasks that involve conditional generation.
Examples include:
- machine translation
- text summarization
- image captioning
- speech recognition
- multimodal models
In multimodal systems, cross-attention often connects text tokens to visual features.
Advantages
Cross-attention provides several benefits:
- direct conditioning on input representations
- flexible alignment between input and output tokens
- improved performance on sequence-to-sequence tasks
It allows the model to dynamically retrieve relevant information from another representation space.
Limitations
Cross-attention introduces additional computational cost.
The attention complexity becomes:
[
O(n \times m)
]
where:
- (n) = decoder sequence length
- (m) = encoder sequence length
For very long inputs, this can become expensive.
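The product n × m grows quickly: the score matrix alone has one entry per decoder–encoder token pair. A quick illustration with hypothetical sequence lengths:

```python
# Number of attention scores (and weights) materialized per layer per head.
for n, m in [(100, 100), (1000, 1000), (1000, 10000)]:
    print(f"decoder len {n}, encoder len {m}: {n * m:,} scores")
# decoder len 100, encoder len 100: 10,000 scores
# decoder len 1000, encoder len 1000: 1,000,000 scores
# decoder len 1000, encoder len 10000: 10,000,000 scores
```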
Summary
Cross-attention is an attention mechanism that allows a model to attend to representations from a different sequence. It plays a critical role in encoder–decoder architectures by enabling the decoder to access information from the encoded input sequence during generation.
Related Concepts
- Self-Attention
- Transformer Architecture
- Encoder–Decoder Models
- Autoregressive Models
- Attention Mechanism
- Causal Masking