Masked Language Modeling (MLM)

Short Definition

Masked Language Modeling (MLM) is a self-supervised training objective used in natural language processing where some tokens in a text sequence are hidden, and the model is trained to predict the missing tokens using surrounding context.

This method enables models to learn deep contextual representations of language.


Definition

In Masked Language Modeling, a portion of the input tokens in a sequence is replaced with a special [MASK] token, and the model must predict the original tokens at the masked positions.

Formally, given a sequence:

x = (x_1, x_2, …, x_n)

some tokens are replaced with masks:

x' = (x_1, …, [MASK], …, x_n)

The model learns the probability:

p(x_i | x_1, …, x_{i-1}, x_{i+1}, …, x_n)

meaning it predicts the masked token using both left and right context.
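During pretraining, these per-position probabilities are typically aggregated into a cross-entropy loss over the masked positions. A common formulation (the symbol M for the set of masked indices is notation introduced here, not from the text above) is:

```latex
\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log p\left(x_i \mid x'\right)
```

where x' denotes the corrupted input sequence and the sum runs only over the positions that were selected for masking.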


Core Idea

Masked Language Modeling allows a model to learn bidirectional contextual representations.

Conceptually:

Original Sentence
“The cat sat on the mat”

Mask one token:

“The cat sat on the [MASK]”

Model prediction:

mat

By repeatedly solving these prediction tasks, the model learns relationships between words.


Minimal Conceptual Illustration

Training step:

Input:
“The quick brown [MASK] jumps”

Target:
fox

The model processes the full sentence and predicts the masked token using contextual clues.
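The training step above can be sketched as a small helper that hides one token and records it as the prediction target. This is an illustrative sketch, not the API of any particular library:

```python
import random

def make_mlm_example(tokens, mask_token="[MASK]", rng=None):
    """Hide one token at a random position; return the masked
    input, the masked position, and the original token as target."""
    rng = rng or random.Random()
    i = rng.randrange(len(tokens))
    masked = tokens[:i] + [mask_token] + tokens[i + 1:]
    return masked, i, tokens[i]

tokens = "The quick brown fox jumps".split()
masked, pos, target = make_mlm_example(tokens)
# e.g. masked = ['The', 'quick', 'brown', '[MASK]', 'jumps'], target = 'fox'
```

Real training pipelines generate many such examples per sequence and score the model's prediction at each masked position.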


Why Masking Works

Masking forces the model to develop semantic and syntactic understanding of language.

Instead of simply predicting the next token, the model must analyze the full context.

Example:

Sentence:
“Paris is the capital of [MASK]”

Possible predictions:

France

The model learns geographic knowledge from text patterns.


Masking Strategy

Typical MLM training selects around 15% of the tokens in the input sequence as prediction targets.

The masking procedure usually follows this rule:

Case | Action
80%  | Replace the token with [MASK]
10%  | Replace with a random token
10%  | Keep the token unchanged

This prevents the model from relying too heavily on the [MASK] token during training.
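The 80/10/10 rule can be sketched in a few lines. The function name and signature are illustrative (real implementations operate on token IDs and return label tensors, with non-target positions excluded from the loss):

```python
import random

def apply_mlm_masking(tokens, vocab, mask_prob=0.15,
                      mask_token="[MASK]", rng=None):
    """Select ~mask_prob of positions as prediction targets, then
    apply the 80/10/10 rule: 80% -> [MASK], 10% -> random token,
    10% -> left unchanged."""
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)  # None = excluded from the loss
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_token         # 80%: mask
            elif roll < 0.9:
                inputs[i] = rng.choice(vocab)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels
```

Note that a target position can keep its original token: the model still has to predict it, which forces useful representations even for unmasked inputs.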


Models Using MLM

Masked Language Modeling is widely used in encoder-based language models.

Examples include:

  • BERT
  • RoBERTa
  • DeBERTa
  • ELECTRA (its generator is trained with MLM; the discriminator itself uses replaced-token detection)

These models learn deep contextual embeddings useful for many downstream tasks.


MLM vs Autoregressive Language Modeling

Property       | Masked Language Modeling | Autoregressive Modeling
Context        | Bidirectional            | Left-to-right
Prediction     | Missing tokens           | Next token
Example models | BERT                     | GPT

Autoregressive models generate text sequentially, while MLM models focus on representation learning.
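The difference in available context can be made concrete. The sentence and the position index below are illustrative:

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]
i = 3  # position of the token being predicted ("on")

# Autoregressive: only tokens to the left of position i are visible.
ar_context = tokens[:i]

# MLM: tokens on both sides are visible; the target itself is masked.
mlm_context = tokens[:i] + ["[MASK]"] + tokens[i + 1:]

print(ar_context)   # ['The', 'cat', 'sat']
print(mlm_context)  # ['The', 'cat', 'sat', '[MASK]', 'the', 'mat']
```

The right-hand tokens ("the mat") are visible only to the MLM model, which is what makes its learned representations bidirectional.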


Applications

Models trained with MLM are widely used for language understanding tasks.

Examples include:

  • text classification
  • question answering
  • named entity recognition
  • semantic similarity
  • document retrieval

MLM-based models often serve as pretrained encoders for downstream applications.


Limitations

MLM has several limitations.

Training–Inference Mismatch

The [MASK] token appears during training but not during real text usage.

Inefficient Prediction

Only masked tokens (typically around 15% of positions) contribute to the training objective, so each pass over the data yields a relatively sparse learning signal.

Not Naturally Generative

MLM models are not designed for text generation, since they do not factorize the sequence probability in a way that supports left-to-right sampling.


Importance in Modern NLP

Masked Language Modeling was a key innovation that enabled powerful pretrained language models like BERT, which significantly improved performance across many NLP benchmarks.


Summary

Masked Language Modeling is a self-supervised learning method where models learn language representations by predicting hidden tokens within a sentence. By using both left and right context, MLM enables models to learn rich semantic and syntactic structures from large text corpora.


Related Concepts