Short Definition
Scaled Dot-Product Attention is the core attention mechanism used in Transformer models. It computes attention weights by taking the dot product between query and key vectors, scaling the result, and applying a softmax to determine how strongly each input element should influence the output.
This mechanism allows neural networks to dynamically focus on relevant parts of the input sequence.
Definition
In Transformer architectures, attention operates on three sets of vectors:
- Queries (Q)
- Keys (K)
- Values (V)
The attention output is computed using the following formula:
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
Where:
- Q = matrix of query vectors
- K = matrix of key vectors
- V = matrix of value vectors
- d_k = dimensionality of the key vectors
The scaling factor \sqrt{d_k} keeps the dot products at a consistent magnitude regardless of the key dimensionality, preventing the softmax inputs from becoming excessively large.
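The formula above can be sketched directly in NumPy; the function name and toy shapes below are illustrative, not part of any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D inputs."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n_q, n_k) similarity matrix
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
# Each row of w sums to 1; out contains one output vector per query.
```

Each row of the weight matrix is a probability distribution over the keys, so every query's output is a convex combination of the value vectors.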
Core Idea
Attention determines how much each token should attend to other tokens in the sequence.
Conceptually:
Query token
↓
Compare with keys of all tokens
↓
Compute attention weights
↓
Weighted combination of value vectors
This allows the model to capture dependencies between elements of the input.
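The steps above can be traced for a single query token with toy vectors (all values here are made up for illustration):

```python
import numpy as np

# One query token compared against the keys of all tokens (toy 2-D vectors)
query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.7, 0.7]])
values = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

scores = keys @ query / np.sqrt(2)                 # compare query with every key
weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights
output = weights @ values                          # weighted combination of values
```

The three lines of computation correspond one-to-one to the comparison, weighting, and combination steps in the diagram.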
Minimal Conceptual Illustration
Example sequence:
“The cat sat on the mat”
When processing the word “sat”, the model may assign high attention to:
“cat”
because it helps determine the meaning of the sentence.
Attention weights might look like:
cat → 0.45
sat → 0.25
mat → 0.15
others → small values
The final representation is a weighted combination of the value vectors.
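Given hypothetical weights like those above and some toy value vectors, the weighted combination is a single matrix-vector product:

```python
import numpy as np

# Hypothetical attention weights for the query token "sat"
weights = np.array([0.45, 0.25, 0.15, 0.15])   # cat, sat, mat, others
V = np.array([[1.0, 0.0],                      # toy 2-D value vectors
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, 0.5]])
output = weights @ V
# output = 0.45*V[0] + 0.25*V[1] + 0.15*V[2] + 0.15*V[3]
# → array([0.675, 0.475])
```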
Why Scaling is Needed
The dot product QK^T grows with the dimensionality of the vectors: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k.
Large values push the softmax toward extremely peaked distributions, where gradients become vanishingly small and training unstable.
Dividing by \sqrt{d_k} restores unit variance in the softmax inputs and stabilizes training.
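This variance growth is easy to check empirically with random vectors (the sample sizes and dimensions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 10_000
for d_k in (4, 64, 1024):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = (q * k).sum(axis=1)          # raw dot products q . k
    scaled = dots / np.sqrt(d_k)
    # Raw variance grows roughly like d_k; scaled variance stays near 1.
    print(f"d_k={d_k:5d}  var(raw)={dots.var():8.1f}  var(scaled)={scaled.var():.2f}")
```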
Role in Transformer Architecture
Scaled dot-product attention is the fundamental computation used inside Transformer blocks.
Typical Transformer layer structure:
Input
↓
Multi-Head Attention
↓
Feedforward Network
↓
Output
Each attention head independently applies scaled dot-product attention.
Multi-Head Attention
Transformers extend scaled dot-product attention by running several attention mechanisms in parallel.
Each head learns different relationships within the sequence.
Example:
Head 1 → syntactic relationships
Head 2 → long-range dependencies
Head 3 → positional structure
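A minimal sketch of multi-head self-attention, assuming num_heads divides the model dimension; the random weight matrices here are stand-ins for learned parameters:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Illustrative multi-head self-attention over token matrix X."""
    n, d_model = X.shape
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Per-head projections (random stand-ins for learned weights)
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(weights @ V)                # scaled dot-product attention
    concat = np.concatenate(heads, axis=-1)      # (n, d_model)
    Wo = rng.standard_normal((d_model, d_model))
    return concat @ Wo                           # final linear transform

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                  # 5 tokens, d_model = 8
out = multi_head_attention(X, num_heads=2, rng=rng)
# out has the same shape as X: one d_model-dimensional vector per token.
```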
The outputs of all heads are concatenated and passed through a final linear transformation.
Advantages
Parallel Computation
Unlike recurrent networks, attention can be computed in parallel across all tokens.
Long-Range Dependencies
Attention allows direct connections between distant elements in a sequence.
Flexible Representation
The model dynamically decides which parts of the input are most relevant.
Applications
Scaled dot-product attention is used in many modern architectures.
Examples include:
- Transformers
- Large Language Models (LLMs)
- Vision Transformers (ViT)
- Multimodal models
It has become a foundational component of modern deep learning.
Summary
Scaled Dot-Product Attention is a mechanism that computes relationships between elements of an input sequence using query, key, and value vectors. By scaling the dot product and applying softmax normalization, the method produces stable attention weights that allow models to dynamically focus on relevant information. This mechanism forms the computational backbone of Transformer architectures.
Related Concepts
- Attention Mechanism
- Self-Attention
- Multi-Head Attention
- Transformer Architecture
- Query-Key-Value Representation
- Cross-Attention