Short Definition
Blocked sampling groups data into contiguous blocks and samples at the block level rather than the individual example level.
Definition
Blocked sampling is a data sampling strategy in which observations are grouped into blocks—such as time intervals, users, sessions, documents, or spatial regions—and sampling or splitting is performed on entire blocks instead of individual data points. This preserves dependencies within blocks and prevents leakage across related samples.
Blocked sampling enforces independence at the block level when sample-level independence does not hold.
Why It Matters
Many real-world datasets violate the IID (independent and identically distributed) assumption. Observations are often correlated through time, users, devices, locations, or sessions. Random sampling at the individual level can leak information across splits and produce overly optimistic evaluation results.
Blocked sampling prevents hidden dependence from contaminating training and evaluation.
Common Use Cases
Blocked sampling is commonly used when data contains:
- temporal dependence (logs, time series)
- user or entity-level clustering
- session-based interactions
- spatial correlation
- repeated measurements
If samples share context, they should often be blocked.
How Blocked Sampling Works
A typical blocked sampling workflow:
- Define blocks based on a dependency source (e.g., user ID, time window)
- Assign entire blocks to training, validation, or test sets
- Ensure no block appears in more than one split
- Train and evaluate models using block-respecting splits
Blocks become the unit of independence.
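The steps above can be sketched in plain Python. The helper names (`week_block`, `assign_split`) are hypothetical; blocks here are weekly time windows, and deterministic hashing of the block ID guarantees that an entire block always lands in exactly one split:

```python
import hashlib
from datetime import datetime

def week_block(ts: datetime) -> str:
    # step 1: derive a block id from a dependency source (ISO week)
    year, week, _ = ts.isocalendar()
    return f"{year}-W{week:02d}"

def assign_split(block_id: str, test_frac: float = 0.2) -> str:
    # steps 2-3: hash the block id so every row in the block gets the
    # same assignment -- no block can straddle train and test
    h = int(hashlib.md5(block_id.encode()).hexdigest(), 16) % 100
    return "test" if h < test_frac * 100 else "train"

# every observation from the same week shares one block and one split
split = assign_split(week_block(datetime(2024, 1, 10)))
```

Because the assignment is a pure function of the block ID, re-running the pipeline on new data keeps previously seen blocks in their original split.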
Minimal Conceptual Example
```python
# conceptual blocked split
train_blocks, test_blocks = split(block_ids)
train = data[data.block_id.isin(train_blocks)]
test = data[data.block_id.isin(test_blocks)]
```
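A runnable version of the same idea, using only the standard library and a toy dataset (the `block_id` field stands in for any dependency key, such as a user ID):

```python
import random

# toy dataset: four users ("blocks"), two correlated rows each
data = [{"block_id": f"u{i // 2}", "value": i} for i in range(8)]

random.seed(0)
block_ids = sorted({row["block_id"] for row in data})
test_blocks = set(random.sample(block_ids, k=1))
train_blocks = set(block_ids) - test_blocks

# rows follow their block: whole blocks go to one side or the other
train = [r for r in data if r["block_id"] in train_blocks]
test = [r for r in data if r["block_id"] in test_blocks]
```

Sampling happens over `block_ids`, never over `data` directly, so the two rows of any given user can never be separated.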
Blocked Sampling vs Random Sampling
- Random sampling: assumes sample-level independence
- Blocked sampling: enforces block-level independence
Random sampling can silently leak correlated information.
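The difference can be made concrete with a small simulation: under a row-level random split, every held-out user here necessarily also appears in training (each user has three rows but only two rows are held out), while a block-level split keeps each user entirely on one side. Names and data are illustrative:

```python
import random

# 4 users, 3 correlated rows each
rows = [(f"u{i // 3}", i) for i in range(12)]
random.seed(1)

# row-level random split: each test user still has rows in training
rand_test = random.sample(rows, k=2)
rand_train = [r for r in rows if r not in rand_test]
leaked = {u for u, _ in rand_train} & {u for u, _ in rand_test}

# block-level split: hold out one entire user -- overlap is impossible
users = sorted({u for u, _ in rows})
test_users = set(random.sample(users, k=1))
blk_test = [r for r in rows if r[0] in test_users]
blk_train = [r for r in rows if r[0] not in test_users]
blocked_leaked = {u for u, _ in blk_train} & {u for u, _ in blk_test}
```

Here `leaked` is never empty, while `blocked_leaked` is empty by construction.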
Blocked Sampling vs Time-Aware Sampling
- Blocked sampling: preserves arbitrary dependency structure
- Time-aware sampling: preserves temporal order specifically
Blocked sampling is the more general mechanism; time-aware sampling can be viewed as the special case in which blocks are ordered time intervals.
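In practice, block-aware splitters rarely need to be written by hand. For example, scikit-learn's `GroupKFold` treats each group as a block, while its `TimeSeriesSplit` covers the time-aware special case. A sketch, assuming scikit-learn and NumPy are installed:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.zeros(12)
groups = np.repeat(["u0", "u1", "u2", "u3"], 3)

# each fold holds out whole groups, never individual rows
gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

With four groups and four splits, this reduces to leave-one-group-out cross-validation.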
Choosing Blocks
Effective blocking requires:
- identifying true sources of dependence
- avoiding overly large blocks that reduce data efficiency
- avoiding overly small blocks that fail to break dependence
- aligning blocks with deployment conditions
Poor block definition undermines the method.
Common Pitfalls
- blocking on irrelevant attributes
- mixing blocks across splits
- creating blocks that overlap in time or identity
- ignoring within-block label leakage
- reporting metrics without accounting for block structure (e.g., block-level variance)
Blocks must reflect real dependencies.
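Several of these pitfalls can be caught mechanically. A minimal guard against mixing blocks across splits (the function name `check_block_integrity` is hypothetical):

```python
def check_block_integrity(train_blocks, test_blocks):
    """Raise if any block was assigned to more than one split."""
    overlap = set(train_blocks) & set(test_blocks)
    if overlap:
        raise ValueError(f"blocks present in both splits: {sorted(overlap)}")

check_block_integrity(["u1", "u2"], ["u3"])  # ok: splits are disjoint
```

Running such a check before every training run turns split contamination into a loud failure instead of silent metric inflation.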
Relationship to Generalization
Blocked sampling provides more realistic generalization estimates when deployment involves new users, sessions, time periods, or regions unseen during training.
Relationship to Evaluation Protocols
Blocked sampling is often a foundational component of robust evaluation protocols, especially in settings where IID assumptions are invalid.
Related Concepts
- Data & Distribution
- Sampling Strategies
- Time-Aware Sampling
- Forward-Chaining Splits
- Train/Test Contamination
- Data Leakage
- Evaluation Protocols