Short Definition
Masked Language Modeling (MLM) is a self-supervised training objective used in natural language processing where some tokens in a text sequence are hidden, and the model is trained to predict the missing tokens using surrounding context.
This method enables models to learn deep contextual representations of language.
Definition
In Masked Language Modeling, a portion of the input tokens in a sequence is replaced with a special [MASK] token, and the model must predict the original tokens at those positions.
Formally, given a sequence:
$$
x = (x_1, x_2, \dots, x_n)
$$
some tokens are replaced with masks:
$$
x' = (x_1, \dots, \texttt{[MASK]}, \dots, x_n)
$$
The model learns the probability:
$$
p(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)
$$
meaning it predicts the masked token using both left and right context.
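As a toy numeric sketch of this objective (the vocabulary and logit values below are hypothetical, not from any real model), the loss at a masked position is the negative log-probability assigned to the original token:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and model scores for one masked position.
vocab = ["mat", "dog", "moon", "table"]
logits = [4.0, 1.0, 0.5, 2.0]    # pretend these came from the model

probs = softmax(logits)
target = vocab.index("mat")      # the original (masked) token
loss = -math.log(probs[target])  # cross-entropy at this position

print(f"p(mat | context) = {probs[target]:.3f}, loss = {loss:.3f}")
```

Training lowers this loss by pushing probability mass toward the original token at every masked position.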
Core Idea
Masked Language Modeling allows a model to learn bidirectional contextual representations.
Conceptually:
Original Sentence
“The cat sat on the mat”
Mask one token:
“The cat sat on the [MASK]”
Model prediction:
mat
By repeatedly solving these prediction tasks, the model learns relationships between words.
Minimal Conceptual Illustration
Training step:
Input:
“The quick brown [MASK] jumps”
Target:
fox
The model processes the full sentence and predicts the masked token using contextual clues.
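A single (input, target) training pair can be sketched as follows. This is a minimal illustration: real tokenizers operate on subword units, not whitespace-split words.

```python
import random

def make_mlm_example(tokens, rng):
    """Mask one random token; return (masked input, position, target)."""
    i = rng.randrange(len(tokens))
    target = tokens[i]
    masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
    return masked, i, target

rng = random.Random(0)
tokens = "The quick brown fox jumps".split()
masked, pos, target = make_mlm_example(tokens, rng)
print(" ".join(masked), "->", target)
```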
Why Masking Works
Masking forces the model to develop semantic and syntactic understanding of language.
Instead of simply predicting the next token, the model must analyze the full context.
Example:
Sentence:
“Paris is the capital of [MASK]”
Most likely prediction:
France
The model learns geographic knowledge from text patterns.
Masking Strategy
Typical MLM training masks around 15% of tokens in the input sequence.
For each token selected for prediction, the standard BERT recipe then applies the following rule:
| Fraction of selected tokens | Action |
|---|---|
| 80% | Replace with [MASK] |
| 10% | Replace with a random token |
| 10% | Keep the token unchanged |
This prevents the model from relying too heavily on the [MASK] token during training.
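The 80/10/10 rule can be sketched in a few lines. This is a simplified version of the BERT recipe: the vocabulary is illustrative, and real implementations work with token IDs and special ignore labels rather than word strings.

```python
import random

MASK_RATE = 0.15  # fraction of tokens selected for prediction

def bert_mask(tokens, vocab, rng):
    """Apply the 80/10/10 masking rule; return (inputs, labels)."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= MASK_RATE:
            continue                       # token not selected
        labels[i] = tok                    # always predict the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: random token
        # else: 10% keep the token unchanged
    return inputs, labels

rng = random.Random(42)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
inputs, labels = bert_mask("the cat sat on the mat".split(), vocab, rng)
print(inputs, labels)
```

Note that the label is recorded even when the token is kept unchanged or replaced randomly, which is exactly what keeps the model from treating [MASK] as the only signal that a prediction is required.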
Models Using MLM
Masked Language Modeling is widely used in encoder-based language models.
Examples include:
- BERT
- RoBERTa
- DeBERTa
- ELECTRA (a related variant: its small generator is trained with MLM, while the main discriminator learns replaced-token detection)
These models learn deep contextual embeddings useful for many downstream tasks.
MLM vs Autoregressive Language Modeling
| Property | Masked Language Modeling | Autoregressive Modeling |
|---|---|---|
| Context | Bidirectional | Left-to-right |
| Prediction | Missing tokens | Next token |
| Example Models | BERT | GPT |
Autoregressive models generate text sequentially, while MLM models focus on representation learning.
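The difference in available context can be made concrete with a toy sketch (not tied to any specific model): MLM conditions on both sides of the target position, while an autoregressive model sees only the left prefix.

```python
def mlm_context(tokens, i):
    """MLM: predict token i from everything except position i."""
    return tokens[:i] + tokens[i + 1:]

def autoregressive_context(tokens, i):
    """Autoregressive: predict token i from the left prefix only."""
    return tokens[:i]

tokens = "Paris is the capital of France".split()
print(mlm_context(tokens, 1))             # both sides are visible
print(autoregressive_context(tokens, 1))  # only "Paris" is visible
```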
Applications
Models trained with MLM are widely used for language understanding tasks.
Examples include:
- text classification
- question answering
- named entity recognition
- semantic similarity
- document retrieval
MLM-based models often serve as pretrained encoders for downstream applications.
Limitations
MLM has several limitations.
Training–Inference Mismatch
The [MASK] token appears during pretraining but never in real downstream text, creating a distribution shift; the 80/10/10 rule only partially mitigates this.
Inefficient Prediction
Only the masked tokens (typically about 15% of the sequence) contribute to the training objective, so each sequence provides a comparatively sparse learning signal.
Not Naturally Generative
MLM predicts masked tokens independently given the rest of the sequence, so these models are not designed for left-to-right text generation.
Importance in Modern NLP
Masked Language Modeling was a key innovation that enabled powerful pretrained language models like BERT, which significantly improved performance across many NLP benchmarks.
Summary
Masked Language Modeling is a self-supervised learning method where models learn language representations by predicting hidden tokens within a sentence. By using both left and right context, MLM enables models to learn rich semantic and syntactic structures from large text corpora.