Decoder-Only Transformers

Short Definition

Decoder-Only Transformers are Transformer architectures that generate outputs autoregressively using a single stack of masked self-attention layers. Each token is predicted based only on previously generated tokens.

This architecture is the foundation of modern large language models such as GPT-style systems.

Definition

A decoder-only Transformer models the probability of a sequence by factorizing it autoregressively:

[
P(x_1, x_2, …, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, …, x_{t-1})
]

The model predicts the next token given the previous context.
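The chain-rule factorization can be sketched numerically. The sketch below is a toy illustration, not a real model: `next_token_probs` is a hypothetical stand-in that returns a fixed distribution over a three-word vocabulary regardless of context.

```python
import math

def next_token_probs(context):
    # Hypothetical stand-in for a trained model: returns a probability
    # distribution over a toy vocabulary given the context so far.
    # A real decoder-only Transformer would compute this from the context.
    return {"the": 0.5, "cat": 0.3, "sat": 0.2}

def sequence_log_prob(tokens):
    # Chain rule: log P(x_1, ..., x_n) = sum_t log P(x_t | x_1, ..., x_{t-1})
    total = 0.0
    for t in range(len(tokens)):
        probs = next_token_probs(tokens[:t])
        total += math.log(probs[tokens[t]])
    return total

lp = sequence_log_prob(["the", "cat", "sat"])
```

Exponentiating the summed log-probabilities recovers the product of the per-step conditionals, which is exactly the factorized sequence probability.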

Unlike encoder–decoder architectures, decoder-only models use a single Transformer stack that processes both prompts and generated tokens.

Future tokens are masked during attention using causal masking, ensuring that predictions depend only on past tokens.

Core Idea

Decoder-only Transformers generate sequences step-by-step.

Conceptually:

Prompt → next token → next token → next token

The model repeatedly predicts the next token until the sequence is complete.

Minimal Conceptual Illustration

Example text generation:

Prompt: The cat sat

Prediction sequence:

The cat sat → on
The cat sat on → the
The cat sat on the → mat
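The prediction sequence above can be sketched as a greedy decoding loop. The `toy_model` lookup table is an assumption made for illustration; a real model would score the full vocabulary at each step.

```python
def generate(model, prompt_tokens, max_new_tokens, eos=None):
    # Greedy autoregressive decoding: repeatedly append the most
    # probable next token until a stop token or the length limit.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)                 # P(x_t | previous tokens)
        next_tok = max(probs, key=probs.get)  # greedy choice
        tokens.append(next_tok)
        if next_tok == eos:
            break
    return tokens

def toy_model(tokens):
    # Hypothetical lookup-table "model" that continues the example prompt.
    table = {
        ("the", "cat", "sat"): {"on": 0.9, "the": 0.1},
        ("the", "cat", "sat", "on"): {"the": 0.8, "mat": 0.2},
        ("the", "cat", "sat", "on", "the"): {"mat": 0.95, "<eos>": 0.05},
        ("the", "cat", "sat", "on", "the", "mat"): {"<eos>": 1.0},
    }
    return table[tuple(tokens)]

out = generate(toy_model, ["the", "cat", "sat"], max_new_tokens=4, eos="<eos>")
```

Greedy decoding is the simplest strategy; real systems often sample from the distribution or use beam search instead.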

Core Components

Each decoder layer combines:

  • Masked Self-Attention
  • Feedforward Network
  • Residual Connections
  • Layer Normalization

Key property: tokens attend only to previous tokens.

This preserves the autoregressive generation process.

Attention Masking

Decoder-only models rely on causal masking to prevent access to future tokens.

Example attention matrix:

Token1 → Token1
Token2 → Token1, Token2
Token3 → Token1, Token2, Token3
Token4 → Token1, Token2, Token3, Token4

Future tokens remain hidden.
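The attention pattern above corresponds to a lower-triangular mask. A minimal sketch in plain Python:

```python
def causal_mask(n):
    # mask[t][s] is True when token t may attend to token s,
    # i.e. exactly when s <= t (no access to future tokens).
    return [[s <= t for s in range(n)] for t in range(n)]

mask = causal_mask(4)
# Row t allows attention to tokens 1..t+1, matching the pattern above:
# row 0 -> Token1; row 3 -> Token1, Token2, Token3, Token4.
```

In practice the mask is applied by setting disallowed attention scores to negative infinity before the softmax, so masked positions receive zero attention weight.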

Training Objective

Decoder-only models are typically trained using next-token prediction.

Loss function:

[
\mathcal{L}(\theta) =
-\sum_{t=1}^{n} \log P(x_t \mid x_1,…,x_{t-1})
]

This objective teaches the model to generate coherent sequences.
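The loss can be sketched directly from per-step predictions. The distributions below are invented for illustration; in training they would come from the model's softmax output at each position.

```python
import math

def next_token_nll(probs_per_step, targets):
    # probs_per_step[t] is the predicted distribution at step t;
    # targets[t] is the token actually observed there.
    # Returns the summed negative log-likelihood (the loss above).
    return -sum(math.log(p[x]) for p, x in zip(probs_per_step, targets))

steps = [{"on": 0.9, "the": 0.1},
         {"the": 0.8, "mat": 0.2},
         {"mat": 0.95, "<eos>": 0.05}]
targets = ["on", "the", "mat"]
loss = next_token_nll(steps, targets)
```

Minimizing this loss pushes probability mass toward the observed next token at every position, which is all that next-token prediction requires.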

Examples of Decoder-Only Models

Many modern AI systems use this architecture.

Examples include:

  • GPT models
  • LLaMA
  • Claude-style architectures
  • PaLM (decoder-style language modeling)

These models power most modern generative AI systems.

Advantages

Decoder-only architectures offer several benefits:

  • simple architecture
  • scalable training
  • flexible prompt-based interaction
  • strong generative capabilities

They can perform many tasks through prompt conditioning.

Limitations

Decoder-only models also have limitations.

Sequential Generation

Tokens must be generated one at a time, which limits inference speed.

Quadratic Attention Cost

Self-attention scales with sequence length:

[
O(n^2)
]

This can become expensive for long contexts.
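The quadratic growth is easy to see concretely: full self-attention computes one score per (query, key) pair, so doubling the context length quadruples the number of score entries.

```python
def attention_entries(n):
    # One attention score per (query, key) pair: an n x n matrix.
    return n * n

small = attention_entries(1024)   # 1,048,576 scores
large = attention_entries(4096)   # 16,777,216 scores: 16x for 4x the length
```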

Role in Modern AI

Decoder-only Transformers dominate modern language modeling because they:

  • support general-purpose generation
  • adapt easily through prompts
  • scale effectively with data and compute

They are the foundation of most large language models.

Summary

Decoder-only Transformers are autoregressive architectures that generate sequences using masked self-attention. By predicting each token based on previous context, they provide a flexible and scalable framework for language modeling and generative AI systems.

Related Concepts

  • Transformer Architecture
  • Autoregressive Models
  • Causal Masking
  • Self-Attention
  • Encoder-Only vs Decoder-Only Transformers
  • Prompt Conditioning