Short Definition
Scaled Dot-Product Attention is the core attention mechanism used in Transformer models. It computes attention weights by taking the dot product between query and key vectors, scaling the result, and applying a softmax to determine how strongly each input element should influence the output.
This mechanism allows neural networks to dynamically focus on relevant parts of the input sequence.
Definition
In Transformer architectures, attention operates on three sets of vectors:
- Queries (Q)
- Keys (K)
- Values (V)
The attention output is computed using the following formula:
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
Where:
- Q = matrix of query vectors
- K = matrix of key vectors
- V = matrix of value vectors
- d_k = dimensionality of the key vectors
The scaling factor \sqrt{d_k} keeps the dot products at a consistent magnitude regardless of the key dimensionality, preventing the softmax inputs from becoming excessively large.
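The formula above can be sketched directly in NumPy; the function name and toy shapes below are illustrative, not part of any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D inputs."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n_q, n_k) similarity matrix
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
# Each row of w sums to 1; out contains one output vector per query.
```

Each row of the weight matrix is a probability distribution over the keys, so every query's output is a convex combination of the value vectors.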
Core Idea
Attention determines how much each token should attend to other tokens in the sequence.
Conceptually:
Query token
↓
Compare with keys of all tokens
↓
Compute attention weights
↓
Weighted combination of value vectors
This allows the model to capture dependencies between elements of the input.
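The steps above can be traced for a single query token with toy vectors (all values here are made up for illustration):

```python
import numpy as np

# One query token compared against the keys of all tokens (toy 2-D vectors)
query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.7, 0.7]])
values = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

scores = keys @ query / np.sqrt(2)                 # compare query with every key
weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights
output = weights @ values                          # weighted combination of values
```

The three lines of computation correspond one-to-one to the comparison, weighting, and combination steps in the diagram.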
Minimal Conceptual Illustration
Example sequence:
“The cat sat on the mat”
When processing the word “sat”, the model may assign high attention to:
“cat”
because it helps determine the meaning of the sentence.
Attention weights might look like:
cat → 0.45
sat → 0.25
mat → 0.15
others → small values
The final representation is a weighted combination of the value vectors.
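Given hypothetical weights like those above and some toy value vectors, the weighted combination is a single matrix-vector product:

```python
import numpy as np

# Hypothetical attention weights for the query token "sat"
weights = np.array([0.45, 0.25, 0.15, 0.15])   # cat, sat, mat, others
V = np.array([[1.0, 0.0],                      # toy 2-D value vectors
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, 0.5]])
output = weights @ V
# output = 0.45*V[0] + 0.25*V[1] + 0.15*V[2] + 0.15*V[3]
# → array([0.675, 0.475])
```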
Why Scaling is Needed
The dot product QK^T grows with the dimensionality of the vectors: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k.
Large values push the softmax toward extremely peaked distributions, where gradients become vanishingly small and training unstable.
Dividing by \sqrt{d_k} restores unit variance in the softmax inputs and stabilizes training.
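This variance growth is easy to check empirically with random vectors (the sample sizes and dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 10_000
for d_k in (4, 64, 1024):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = (q * k).sum(axis=1)          # raw dot products q . k
    scaled = dots / np.sqrt(d_k)
    # Raw variance grows roughly like d_k; scaled variance stays near 1.
    print(f"d_k={d_k:5d}  var(raw)={dots.var():8.1f}  var(scaled)={scaled.var():.2f}")
```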
Role in Transformer Architecture
Scaled dot-product attention is the fundamental computation used inside Transformer blocks.
Typical Transformer layer structure:
Input
↓
Multi-Head Attention
↓
Feedforward Network
↓
Output
Each attention head independently applies scaled dot-product attention.
Multi-Head Attention
Transformers extend scaled dot-product attention by running several attention mechanisms in parallel.
Each head learns different relationships within the sequence.
Example:
Head 1 → syntactic relationships
Head 2 → long-range dependencies
Head 3 → positional structure
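A minimal sketch of multi-head self-attention, assuming num_heads divides the model dimension; the random weight matrices here are stand-ins for learned parameters:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Illustrative multi-head self-attention over token matrix X."""
    n, d_model = X.shape
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Per-head projections (random stand-ins for learned weights)
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(weights @ V)                # scaled dot-product attention
    concat = np.concatenate(heads, axis=-1)      # (n, d_model)
    Wo = rng.standard_normal((d_model, d_model))
    return concat @ Wo                           # final linear transform

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                  # 5 tokens, d_model = 8
out = multi_head_attention(X, num_heads=2, rng=rng)
# out has the same shape as X: one d_model-dimensional vector per token.
```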
The outputs of all heads are concatenated and passed through a final linear transformation.
Advantages
Parallel Computation
Unlike recurrent networks, attention can be computed in parallel across all tokens.
Long-Range Dependencies
Attention allows direct connections between distant elements in a sequence.
Flexible Representation
The model dynamically decides which parts of the input are most relevant.
Applications
Scaled dot-product attention is used in many modern architectures.
Examples include:
- Transformers
- Large Language Models (LLMs)
- Vision Transformers (ViT)
- Multimodal models
It has become a foundational component of modern deep learning.
Summary
Scaled Dot-Product Attention is a mechanism that computes relationships between elements of an input sequence using query, key, and value vectors. By scaling the dot product and applying softmax normalization, the method produces stable attention weights that allow models to dynamically focus on relevant information. This mechanism forms the computational backbone of Transformer architectures.
Related Concepts
- Attention Mechanism
- Self-Attention
- Multi-Head Attention
- Transformer Architecture
- Query-Key-Value Representation
- Cross-Attention