Autoregressive Models

Short Definition

Autoregressive Models are predictive models that generate outputs sequentially by conditioning each prediction on previously generated outputs.

They model the probability of a sequence as a product of conditional probabilities.

Definition

In an autoregressive model, the probability of a sequence (x_1, x_2, …, x_n) is factorized as:

[
P(x_1, x_2, …, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, …, x_{t-1})
]

This means each element of the sequence is predicted based on all previous elements.

For example:

[
P(x_3 \mid x_1, x_2)
]

depends only on the tokens that precede x_3, never on later ones.

This sequential dependency allows autoregressive models to capture complex patterns in sequences.
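The factorization above can be made concrete with a tiny numeric sketch. The probabilities below are made up for illustration; any real model would learn them from data.

```python
# Toy illustration of the chain-rule factorization (hypothetical numbers):
# P(x1, x2, x3) = P(x1) * P(x2 | x1) * P(x3 | x1, x2)
p_x1 = 0.5              # P(x1)
p_x2_given_x1 = 0.4     # P(x2 | x1)
p_x3_given_x1_x2 = 0.9  # P(x3 | x1, x2)

joint = p_x1 * p_x2_given_x1 * p_x3_given_x1_x2
print(joint)  # 0.18
```

Each factor conditions only on the elements to its left, which is exactly the autoregressive property.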

Core Idea

Autoregressive models generate sequences step-by-step.

x1 → predict x2 → predict x3 → predict x4

Each prediction becomes part of the context for the next step.

This makes them especially suitable for tasks involving structured sequences.
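The step-by-step loop can be sketched as follows. The "model" here is a hypothetical bigram table that conditions only on the last token; a real autoregressive model would condition on the entire prefix.

```python
# Stand-in "model": a hypothetical table mapping the last token to a
# distribution over next tokens (illustrative values only).
NEXT = {
    "The": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.8, "ran": 0.2},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "the": {"mat": 1.0},
}

def generate(prompt, steps):
    tokens = prompt.split()
    for _ in range(steps):
        dist = NEXT.get(tokens[-1], {})
        if not dist:
            break
        # Greedy decoding: append the highest-probability next token.
        # The appended token becomes part of the context for the next step.
        tokens.append(max(dist, key=dist.get))
    return " ".join(tokens)

print(generate("The cat", 4))  # The cat sat on the mat
```

Note how each generated token is fed back in as input for the following prediction, which is why the loop cannot be parallelized.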

Minimal Conceptual Illustration

Example sentence generation:

Input prompt: “The cat”

Step 1: predict “sat”
Step 2: predict “on”
Step 3: predict “the”
Step 4: predict “mat”

Each new token is conditioned on all previously generated tokens.

Training Objective

Autoregressive models are typically trained using maximum likelihood estimation (MLE).

The training objective is:

[
\mathcal{L}(\theta) = -\sum_{t=1}^{n} \log P(x_t \mid x_1, …, x_{t-1})
]

This encourages the model to assign high probability to the correct next token.
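A minimal sketch of this objective, assuming we already have the probability the model assigns to each correct next token (the values below are hypothetical):

```python
import math

# Hypothetical probabilities the model assigns to the correct next token
# at each position t of a length-4 sequence.
probs = [0.9, 0.8, 0.95, 0.7]

# Negative log-likelihood: L = -sum_t log P(x_t | x_1, ..., x_{t-1})
nll = -sum(math.log(p) for p in probs)
print(nll)
```

Minimizing this sum pushes every per-step probability toward 1, i.e. toward confidently predicting the correct next token.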

Examples of Autoregressive Architectures

Many modern models use autoregressive generation.

Examples include:

  • GPT family of language models
  • Transformer decoder models
  • PixelCNN for image generation
  • WaveNet for audio generation

These models generate outputs token by token.

Autoregressive vs Non-Autoregressive Models

Model Type            Generation Style
Autoregressive        sequential prediction
Non-autoregressive    parallel prediction

Autoregressive models typically produce higher-quality outputs but are slower because generation must proceed sequentially.

Causal Masking

In Transformer-based autoregressive models, causal masking ensures that a token cannot attend to future tokens.

The attention matrix is masked so that:

token_t → attends only to tokens ≤ t

This preserves the autoregressive property during training.
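The masking pattern is simply a lower-triangular matrix. A minimal sketch for a four-token sequence, using 1 for "may attend" and 0 for "masked out":

```python
# Causal mask for a sequence of length 4: entry [t][s] is 1 if token t
# may attend to token s (i.e. s <= t), else 0.
n = 4
mask = [[1 if s <= t else 0 for s in range(n)] for t in range(n)]
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

In practice the masked positions are set to negative infinity before the softmax, so future tokens receive zero attention weight.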

Advantages

Autoregressive models offer several benefits:

  • strong sequence modeling capability
  • flexible generation of variable-length outputs
  • powerful representation of conditional distributions

They have become the dominant approach for text generation tasks.

Limitations

However, they also have drawbacks.

Because tokens must be generated sequentially:

  • inference can be slow
  • generation cannot be fully parallelized

This can limit throughput in large-scale systems.

Applications

Autoregressive models are widely used in:

  • language generation
  • machine translation
  • image generation
  • speech synthesis
  • code generation

They are the foundation of modern large language models.

Summary

Autoregressive models generate sequences by predicting each element conditioned on previous elements.

By factorizing sequence probability into conditional distributions, they provide a powerful framework for modeling complex sequential data.

Related Concepts

  • Transformer Architecture
  • Causal Masking
  • Self-Attention
  • Sequence-to-Sequence Models
  • Language Modeling
  • Decoder-Only Transformers