Short Definition
Cross-Attention is an attention mechanism where the query vectors come from one sequence while the key and value vectors come from another sequence. It allows a model to incorporate information from an external context when generating representations or predictions.
Cross-attention is a core component of encoder–decoder architectures such as the original Transformer.
Definition
In standard self-attention, the queries, keys, and values are derived from the same sequence.
In cross-attention, they originate from different sequences.
Let:
- (Q) = query vectors from the decoder
- (K) = key vectors from the encoder
- (V) = value vectors from the encoder
The attention computation becomes:
[
Attention(Q,K,V) =
softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V
]
Here the decoder queries the encoded input sequence to retrieve relevant information.
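The formula above can be sketched directly in NumPy. This is an illustrative toy implementation, not a production layer: the function name `cross_attention` and all array shapes are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    # Q: (n, d_k) decoder queries; K: (m, d_k), V: (m, d_v) from the encoder.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, m): each decoder token vs each encoder token
    weights = softmax(scores, axis=-1)  # rows sum to 1 over encoder positions
    return weights @ V                  # (n, d_v): one context vector per decoder token

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 decoder tokens (hypothetical sizes)
K = rng.normal(size=(6, 8))  # 6 encoder tokens
V = rng.normal(size=(6, 8))
out = cross_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note that the output has one row per decoder token, even though the keys and values came from a sequence of a different length.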
Core Idea
Cross-attention enables interaction between two different sequences.
Typical scenario:
Input sentence → Encoder → Encoded representations
↓
Decoder queries encoder
↓
Generated output
The decoder dynamically attends to different parts of the input while generating tokens.

Minimal Conceptual Illustration
Example: machine translation.
Input sentence:
The cat sat on the mat
While generating the French translation:
Le chat s’est assis sur le tapis
When generating assis, the decoder may attend strongly to:
sat
Cross-attention allows the model to retrieve the most relevant encoded token.
Self-Attention vs Cross-Attention
| Property | Self-Attention | Cross-Attention |
|---|---|---|
| Query source | same sequence | different sequence |
| Key source | same sequence | external sequence |
| Value source | same sequence | external sequence |
| Purpose | contextualize tokens | connect input and output |
Self-attention models relationships within a sequence, while cross-attention models relationships between sequences.
Role in Transformer Architecture
In encoder–decoder Transformers, the decoder layer typically contains:
- Masked Self-Attention
- Cross-Attention
- Feedforward Network
Flow: the target tokens first pass through masked self-attention, the result then queries the encoder output via cross-attention (which supplies the keys and values), and the feedforward network transforms the outcome.
Cross-attention is thus the only point where the decoder accesses encoded input information.
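The three sub-layers above can be sketched as one simplified decoder layer in NumPy. This is a hedged illustration: layer normalization, multiple heads, and projection matrices for Q/K/V are omitted, and the names `decoder_layer`, `W1`, `W2` are assumptions for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block attention to masked positions
    return softmax(scores) @ V

def decoder_layer(dec, enc, W1, W2):
    n = dec.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))      # lower-triangular causal mask
    dec = dec + attention(dec, dec, dec, mask=causal)  # 1. masked self-attention
    dec = dec + attention(dec, enc, enc)               # 2. cross-attention over encoder output
    return dec + np.maximum(dec @ W1, 0) @ W2          # 3. feedforward network (ReLU)

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))  # encoder output: 6 input tokens
dec = rng.normal(size=(4, 8))  # decoder states: 4 target tokens
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 8))
print(decoder_layer(dec, enc, W1, W2).shape)  # (4, 8)
```

Only step 2 touches `enc`; steps 1 and 3 operate purely on the decoder's own states.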
Mathematical Interpretation
For each decoder token representation (q_i), attention weights are computed by a softmax over all encoder keys (k_j):
[
\alpha_{ij} =
\frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{j'} \exp\left(q_i \cdot k_{j'} / \sqrt{d_k}\right)}
]
The resulting representation is:
[
z_i = \sum_j \alpha_{ij} v_j
]
This produces a context vector conditioned on the input sequence.
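The per-token view can be checked numerically against the batched matrix form: computing z_i for a single query q_i gives the same result as the corresponding row of the full softmax(QKᵀ/√d_k)V product. A small sketch with assumed dimensions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 8
K = rng.normal(size=(6, d_k))  # encoder keys k_j
V = rng.normal(size=(6, d_k))  # encoder values v_j
q = rng.normal(size=(d_k,))    # one decoder query q_i

# alpha_ij: softmax over all encoder positions j
alpha = softmax(q @ K.T / np.sqrt(d_k))  # shape (6,), sums to 1
z = alpha @ V                            # z_i = sum_j alpha_ij * v_j

# Same row computed via the batched matrix form with a one-row Q
z_matrix = softmax(q[None, :] @ K.T / np.sqrt(d_k)) @ V
print(np.allclose(z, z_matrix[0]))  # True
```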
Applications
Cross-attention is widely used in tasks that involve conditional generation.
Examples include:
- machine translation
- text summarization
- image captioning
- speech recognition
- multimodal models
In multimodal systems, cross-attention often connects text tokens to visual features.
Advantages
Cross-attention provides several benefits:
- direct conditioning on input representations
- flexible alignment between input and output tokens
- improved performance on sequence-to-sequence tasks
It allows the model to dynamically retrieve relevant information from another representation space.
Limitations
Cross-attention introduces additional computational cost.
The attention complexity becomes:
[
O(n \times m)
]
where:
- (n) = decoder sequence length
- (m) = encoder sequence length
For very long inputs, this can become expensive.
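The product n × m grows quickly: the score matrix alone has one entry per decoder–encoder token pair. A quick illustration with hypothetical sequence lengths:

```python
# Number of attention scores (and weights) materialized per layer per head.
for n, m in [(100, 100), (1000, 1000), (1000, 10000)]:
    print(f"decoder len {n}, encoder len {m}: {n * m:,} scores")
# decoder len 100, encoder len 100: 10,000 scores
# decoder len 1000, encoder len 1000: 1,000,000 scores
# decoder len 1000, encoder len 10000: 10,000,000 scores
```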
Summary
Cross-attention is an attention mechanism that allows a model to attend to representations from a different sequence. It plays a critical role in encoder–decoder architectures by enabling the decoder to access information from the encoded input sequence during generation.
Related Concepts
- Self-Attention
- Transformer Architecture
- Encoder–Decoder Models
- Autoregressive Models
- Attention Mechanism
- Causal Masking