Short Definition
Stratified sampling is a sampling method that preserves key subgroup proportions in each data split.
Definition
Stratified sampling is a technique in which data is divided into distinct subgroups (strata) based on a specified attribute, most commonly the target label, and samples are drawn from each subgroup in proportion to its share of the dataset. This ensures that each split (e.g., training, validation, test) reflects the overall distribution of the stratification variable.
Stratified sampling is a data-splitting strategy, not a modeling method.
Why It Matters
A purely random split can distort label or subgroup distributions by chance, especially in small or imbalanced datasets. Stratified sampling reduces the split-to-split variance of evaluation metrics and guards against misleading results caused by uneven representation across splits.
It matters most when the minority classes are the ones of greatest interest.
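The size of this effect can be worked out for a simple case. The following is an illustrative calculation, not a result from this article: the symbols N (dataset size), K (minority-class count), n (test-split size), and the example numbers are all assumptions chosen for illustration. Under a purely random split, the number X of minority samples that land in the test split follows a hypergeometric distribution:

```latex
\[
\mathbb{E}[X] = n\,\frac{K}{N},
\qquad
\operatorname{Var}(X) = n\,\frac{K}{N}\left(1 - \frac{K}{N}\right)\frac{N - n}{N - 1}
\]

% Illustrative numbers (assumed): N = 1000, K = 50, n = 100
\[
\mathbb{E}[X] = 100 \cdot 0.05 = 5,
\qquad
\operatorname{Var}(X) = 100 \cdot 0.05 \cdot 0.95 \cdot \frac{900}{999} \approx 4.28,
\qquad
\sqrt{\operatorname{Var}(X)} \approx 2.07
\]
```

With these numbers, a random test split is expected to hold 5 minority samples, with a standard deviation of about 2, so splits with 3 or 7 are unremarkable. A stratified split fixes the count at 5 (up to rounding), removing this source of variance.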
Common Use Cases
Stratified sampling is commonly used for:
- train/validation/test splits in classification
- stratified k-fold cross-validation
- maintaining class proportions under imbalance
- fair subgroup evaluation
It is most applicable to supervised learning tasks.
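As one way to picture the stratified k-fold use case, here is a minimal standard-library sketch. The function name `stratified_kfold` and the round-robin assignment are assumptions made for illustration, not a reference implementation; libraries such as scikit-learn provide a maintained equivalent (`StratifiedKFold`).

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k, seed=0):
    """Return k lists of indices whose label mix approximates the full dataset.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        # The running offset keeps leftover samples from always landing in fold 0.
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)
    return folds


labels = [0] * 90 + [1] * 10  # 10% minority class (assumed toy data)
folds = stratified_kfold(labels, k=5)
for fold in folds:
    print(len(fold), sum(labels[i] for i in fold))
```

In this toy setup each fold holds 20 samples, 2 of them from the minority class. Each fold serves once as the validation set while the remaining folds form the training set.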
How Stratified Sampling Works
A typical process:
- Partition data into strata based on a chosen attribute (e.g., class label)
- Sample from each stratum proportionally
- Combine samples to form each split
The distribution of the stratification variable is preserved by design, up to rounding within each stratum.
Minimal Conceptual Example
# conceptual illustration
split = stratified_split(data, by="label", ratios=(0.7, 0.15, 0.15))
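The `stratified_split` call above is conceptual and does not refer to a real library function. One possible way to implement it with the standard library is sketched below, under the assumptions that `data` is a list of dict records, that `by` names a key in each record, and that `ratios` sums to 1; the `seed` parameter is an addition for reproducibility.

```python
import random
from collections import defaultdict


def stratified_split(data, by, ratios, seed=0):
    """Split records into len(ratios) parts, stratified on record[by].

    Hypothetical implementation of the conceptual call; not a library API.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    rng = random.Random(seed)

    # Step 1: partition records into strata.
    strata = defaultdict(list)
    for record in data:
        strata[record[by]].append(record)

    # Step 2: cut each shuffled stratum at the cumulative ratio boundaries.
    splits = [[] for _ in ratios]
    for records in strata.values():
        rng.shuffle(records)
        start, cumulative = 0, 0.0
        for i, ratio in enumerate(ratios):
            cumulative += ratio
            is_last = i == len(ratios) - 1
            end = len(records) if is_last else round(cumulative * len(records))
            splits[i].extend(records[start:end])
            start = end

    # Step 3: shuffle each combined split so strata are interleaved.
    for part in splits:
        rng.shuffle(part)
    return splits


# Assumed toy data: 100 positive and 900 negative records.
data = [{"id": i, "label": "pos" if i < 100 else "neg"} for i in range(1000)]
train, val, test = stratified_split(data, by="label", ratios=(0.7, 0.15, 0.15))
print(len(train), len(val), len(test))
```

Rounding happens per stratum, so very small strata may end up with zero records in some split; a production implementation would need a policy for that case.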
Stratified Sampling vs Random Sampling
- Stratified sampling: preserves subgroup proportions
- Random sampling: may distort distributions, especially with small samples
Stratification trades simplicity for reliability.
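The contrast can be made concrete with a small simulation. This is a sketch on assumed toy data (1,000 labels, 5% positive, test splits of 100); the function names are hypothetical. It counts how many minority samples land in the test split across repeated random draws, and compares that with the count fixed by proportional allocation.

```python
import random


def minority_counts_random(labels, test_size, trials, seed=0):
    """Minority count in the test split for repeated purely random splits."""
    rng = random.Random(seed)
    # random.sample draws positions without replacement, like a random split.
    return [sum(rng.sample(labels, test_size)) for _ in range(trials)]


def minority_count_stratified(labels, test_size):
    """Minority count under proportional allocation (fixed by construction)."""
    return round(test_size * sum(labels) / len(labels))


labels = [1] * 50 + [0] * 950  # 1 marks the minority class
random_counts = minority_counts_random(labels, test_size=100, trials=200)
stratified_count = minority_count_stratified(labels, test_size=100)

print("random splits: min", min(random_counts), "max", max(random_counts))
print("stratified split:", stratified_count)
```

The random counts vary from run to run of the split, while the stratified count is always 5 here. Any metric computed on the minority class inherits that extra variability under random splitting.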
Limitations and Considerations
- requires a known stratification variable
- does not scale well to stratifying on several variables at once, because the number of strata multiplies and some become too small to split
- does not address bias in data collection
- not directly suitable for all data types (e.g., time series, where shuffling within strata breaks temporal order)
Stratification controls proportions, not representativeness.
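The scaling limitation is easy to check before splitting. The sketch below, using assumed toy records and hypothetical helper names, counts the size of each stratum when stratifying on one attribute versus two, and flags strata that are smaller than a chosen minimum (for example, the number of folds).

```python
from collections import Counter


def stratum_sizes(records, keys):
    """Count records per stratum, where a stratum is a tuple of key values."""
    return Counter(tuple(record[key] for key in keys) for record in records)


def too_small(sizes, min_size):
    """List strata with fewer than min_size records."""
    return sorted(stratum for stratum, n in sizes.items() if n < min_size)


# Assumed toy data: the positive class is rare in the "south" region.
records = (
    [{"label": "pos", "region": "north"}] * 40
    + [{"label": "pos", "region": "south"}] * 2
    + [{"label": "neg", "region": "north"}] * 500
    + [{"label": "neg", "region": "south"}] * 458
)

by_label = stratum_sizes(records, ["label"])
by_both = stratum_sizes(records, ["label", "region"])

print(too_small(by_label, min_size=5))  # []
print(too_small(by_both, min_size=5))   # [('pos', 'south')]
```

Stratifying on the label alone is fine in this example, but adding a second attribute produces a stratum with only two records, which cannot be spread across five folds.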
Common Pitfalls
- stratifying on inappropriate or leaky variables
- assuming stratification fixes dataset bias
- ignoring dependencies between samples within strata (e.g., several records from the same patient or user landing in different splits)
- applying stratification to non-IID or temporal data without adjustments
Stratification must respect data structure.
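For the dependency pitfall, one common adjustment is to split whole groups rather than individual records, stratifying on a group-level label. The sketch below assumes each group (here a hypothetical `patient` key) carries a single label; the function name and parameters are illustrative. scikit-learn offers a related, maintained tool in `StratifiedGroupKFold`.

```python
import random
from collections import defaultdict


def grouped_stratified_split(records, group_key, label_key, test_ratio, seed=0):
    """Hold out whole groups, stratified by group label.

    Illustrative sketch; assumes every record in a group shares one label.
    """
    rng = random.Random(seed)

    # Take the first-seen label as the group's label (the stated assumption).
    group_label = {}
    for record in records:
        group_label.setdefault(record[group_key], record[label_key])

    # Strata are sets of groups, not sets of records.
    groups_by_label = defaultdict(list)
    for group, label in group_label.items():
        groups_by_label[label].append(group)

    test_groups = set()
    for groups in groups_by_label.values():
        rng.shuffle(groups)
        n_test = round(test_ratio * len(groups))
        test_groups.update(groups[:n_test])

    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test


# Assumed toy data: 20 patients with 3 visits each; patients 0-3 are positive.
records = [
    {"patient": p, "visit": v, "label": int(p < 4)}
    for p in range(20)
    for v in range(3)
]
train, test = grouped_stratified_split(records, "patient", "label", test_ratio=0.25)
print(len(train), len(test))
```

No patient appears in both splits, and the share of positive patients is the same in each. Temporal data needs a different adjustment, such as splitting by time, which this sketch does not cover.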
Relationship to Class Imbalance
Stratified sampling is a common mitigation for class imbalance during evaluation. Provided each class has at least as many samples as there are splits, it ensures that minority classes are present in every split, enabling more stable metric estimation without altering the underlying data distribution.
Relationship to Generalization
Stratified sampling improves the reliability of in-distribution generalization estimates by reducing split-induced variance. It does not protect against distribution shift or out-of-distribution data.
Related Concepts
- Data & Distribution
- Class Imbalance
- Label Distribution
- Cross-Validation Strategies
- Holdout Sets
- Sampling Bias
- Generalization