## Short Definition

Self-attention is a mechanism in neural networks that allows each element of a sequence to dynamically focus on other elements in the same sequence when computing its representation.
It enables models to capture relationships between tokens regardless of their distance in the sequence.
## Definition
In sequence modeling, each token often depends on the context provided by other tokens.
Self-attention allows a model to compute contextualized representations by comparing each token with all others in the sequence.
Given a sequence

$$
x_1, x_2, \ldots, x_n
$$

the model computes attention scores between every pair of tokens.
These scores determine how strongly one token should influence another during representation learning.
## Core Idea
Instead of processing tokens sequentially like recurrent models, self-attention allows every token to interact with every other token simultaneously.
x1 ↔ x2 ↔ x3 ↔ x4 ↔ x5
Each token gathers information from the entire sequence to build its representation.
This enables the model to capture long-range dependencies efficiently.
## Query–Key–Value Mechanism
Self-attention uses three learned projections of each token representation:
- Query (Q)
- Key (K)
- Value (V)
For each token embedding $x$:

$$
Q = xW_Q, \qquad K = xW_K, \qquad V = xW_V
$$
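The projections above can be sketched in NumPy. The sizes (`n`, `d_model`, `d_k`) and the randomly initialized weight matrices are illustrative assumptions, not values from the text; in a trained model the `W` matrices are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k = 5, 16, 8               # sequence length, embedding dim, key dim (illustrative)
X = rng.normal(size=(n, d_model))        # token embeddings, one row per token

# Learned projection matrices (randomly initialized here for illustration)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # queries: (n, d_k)
K = X @ W_K   # keys:    (n, d_k)
V = X @ W_V   # values:  (n, d_k)
```

Each token thus gets its own query, key, and value vector, all derived from the same embedding.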
The attention output is computed using the scaled dot product:

$$
\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Where:

- $QK^T$ measures pairwise similarity between tokens
- $d_k$ is the dimensionality of the keys; dividing by $\sqrt{d_k}$ keeps the scores in a stable range
- softmax converts the scaled similarities into attention weights
These weights determine how much information flows between tokens.
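The formula above can be written as a small NumPy function. This is a minimal sketch with random inputs of assumed sizes; a real implementation would also handle batching and masking.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) pairwise similarities
    # Numerically stable softmax over each row
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                   # contextualized outputs, attention map

rng = np.random.default_rng(0)
n, d_k = 4, 8                                     # illustrative sizes
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(Q, K, V)
# Each row of `weights` is a probability distribution over the n tokens.
```

Row $i$ of `weights` tells us how strongly token $i$ attends to every token in the sequence, and `out[i]` is the corresponding weighted mix of value vectors.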
## Minimal Conceptual Illustration
Sentence:

> The cat sat on the mat

When computing the representation for **sat**, the model may attend strongly to:

- **cat**
- **sat** (itself)
- **mat**
This helps the model understand the grammatical and semantic relationships in the sentence.
## Self-Attention vs Other Mechanisms
| Mechanism | Context Modeling |
|---|---|
| RNN | sequential recurrence |
| CNN | local receptive fields |
| Self-Attention | global pairwise interactions |
Self-attention directly connects all tokens in a sequence.
## Multi-Head Self-Attention
Transformers extend self-attention using multiple attention heads.
Each head learns different relationships between tokens.
$$
\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O
$$
Different heads may capture:
- syntactic relationships
- semantic dependencies
- positional interactions
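A minimal multi-head sketch in NumPy follows. The head count, dimensions, and random weights are illustrative assumptions; the function splits the projections into `h` heads, applies scaled dot-product attention per head, concatenates the results, and applies the output projection $W^O$.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """MultiHead(Q,K,V) = Concat(head_1, ..., head_h) W_O, sketched for one sequence."""
    n, d_model = X.shape
    d_k = d_model // h                             # per-head width (assumes h divides d_model)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # (n, d_model) each
    # Split into h heads: (h, n, d_k)
    Qh = Q.reshape(n, h, d_k).transpose(1, 0, 2)
    Kh = K.reshape(n, h, d_k).transpose(1, 0, 2)
    Vh = V.reshape(n, h, d_k).transpose(1, 0, 2)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)    # (h, n, n) per-head scores
    heads = softmax(scores) @ Vh                          # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model) # Concat(head_1, ..., head_h)
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
# out has shape (n, d_model): one contextualized vector per token
```

Because each head operates on its own slice of the projections, the heads can specialize in different relationships while the total cost stays close to that of single-head attention.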
## Advantages
Self-attention provides several benefits:
- captures long-range dependencies
- parallel computation across tokens
- flexible representation learning
- strong scalability
These properties have made it central to modern large language models.
## Computational Complexity
Self-attention computes interactions between all token pairs.
For a sequence of length $n$, the cost of computing all pairwise scores is

$$
O(n^2)
$$
This quadratic complexity makes long sequences expensive to process.
Much research therefore focuses on more efficient attention variants, such as sparse, low-rank, or linearized approximations.
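The quadratic growth is easy to see from the score matrix itself: it has one entry per token pair, so doubling the sequence length quadruples its size. A small demonstration with assumed dimensions:

```python
import numpy as np

d_k = 64                         # illustrative key dimension
sizes = {}
for n in (128, 256, 512):
    Q = np.zeros((n, d_k))
    K = np.zeros((n, d_k))
    scores = Q @ K.T             # (n, n): one score per token pair
    sizes[n] = scores.size
    print(n, scores.size)        # doubling n quadruples the entry count
```

At `n = 512` the score matrix already holds 262,144 entries per attention head, which is why long-context processing is dominated by this term.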
## Applications
Self-attention is used in many modern architectures:
- Transformers
- Large Language Models
- Vision Transformers
- Multimodal models
It is a core component of modern deep learning systems.
## Summary
Self-attention is a mechanism that allows each token in a sequence to dynamically attend to all other tokens when computing its representation.
By enabling global contextual interactions, it provides powerful modeling capabilities and forms the foundation of the Transformer architecture.
## Related Concepts
- Transformer Architecture
- Multi-Head Attention
- Positional Encoding
- Encoder–Decoder Models
- Attention Mechanism
- Recurrent Neural Networks (RNN)
- State-Space Models