Missing Data

Short Definition

Missing data refers to the absence of values for one or more features or labels in a dataset.

Definition

Missing data occurs when expected observations or feature values are not recorded, unavailable, or lost in a dataset. This absence can arise during data collection, transmission, storage, or preprocessing and can affect both inputs and labels.

Missing data is a data property, not a modeling choice.

Why It Matters

Most machine learning models assume complete inputs. Missing values can distort learned representations, bias parameter estimates, and degrade evaluation metrics if not handled properly.

Improper handling of missing data can silently reduce model reliability.

Common Causes of Missing Data

  • sensor or system failures
  • manual data entry errors
  • optional or skipped fields
  • data corruption or truncation
  • privacy-driven data removal

Missingness often reflects real-world constraints.

Types of Missing Data

Missing data is commonly categorized as:

  • Missing Completely at Random (MCAR): missingness is unrelated to any values in the data, observed or unobserved
  • Missing at Random (MAR): missingness depends only on observed variables
  • Missing Not at Random (MNAR): missingness depends on unobserved values, including the missing value itself

The type determines appropriate handling strategies.
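
The three mechanisms can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not a diagnostic procedure: the variables `age` and `income`, the linear relationship between them, and the missingness probabilities are all hypothetical and chosen only to make the effect visible.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: income rises with age, plus noise.
n = 10_000
age = [random.uniform(20, 70) for _ in range(n)]
income = [1000 * a + random.gauss(0, 5000) for a in age]

def observed_mean(values, is_missing):
    """Mean of the values that remain after masking."""
    return statistics.fmean(v for v, m in zip(values, is_missing) if not m)

# MCAR: every income value has the same 30% chance of being missing.
mcar = [random.random() < 0.3 for _ in range(n)]

# MAR: missingness of income depends only on the observed variable age.
mar = [random.random() < (0.5 if a > 45 else 0.1) for a in age]

# MNAR: missingness depends on the income value that goes missing.
mnar = [random.random() < (0.5 if inc > 45_000 else 0.1) for inc in income]

true_mean = statistics.fmean(income)
mcar_mean = observed_mean(income, mcar)
mar_mean = observed_mean(income, mar)
mnar_mean = observed_mean(income, mnar)

print(f"true mean:          {true_mean:.0f}")
print(f"MCAR observed mean: {mcar_mean:.0f}")
print(f"MAR observed mean:  {mar_mean:.0f}")
print(f"MNAR observed mean: {mnar_mean:.0f}")
```

In this setup the MCAR mean stays close to the true mean, while the MAR and MNAR means are pulled downward. The difference between the last two is that under MAR the cause of the gap (`age`) is observed and can be conditioned on, whereas under MNAR no observed variable explains it.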

How Missing Data Affects Models

  • reduced effective sample size
  • biased feature distributions
  • unstable training dynamics
  • misleading evaluation results
  • increased uncertainty in predictions

Models may learn artifacts of missingness rather than meaningful patterns.
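
The first effect, reduced effective sample size, can be estimated with simple arithmetic. The sketch below assumes each feature is missing independently with the same probability (an MCAR simplification chosen for illustration); the function name `complete_row_fraction` is hypothetical.

```python
def complete_row_fraction(p_missing: float, n_features: int) -> float:
    """Expected fraction of fully observed rows when each feature is
    missing independently with probability p_missing."""
    return (1.0 - p_missing) ** n_features

# How quickly complete rows disappear as the number of features grows.
for k in (1, 5, 10, 20):
    frac = complete_row_fraction(0.10, k)
    print(f"{k:>2} features at 10% missing each -> {frac:.1%} complete rows")
```

Under these assumptions, ten features with 10% missingness each leave roughly 35% of rows complete, which is why dropping incomplete rows can discard most of a dataset even when each individual feature looks mostly observed.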

Minimal Conceptual Example

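# conceptual illustration
x_feature = None  # the value was never recorded

if x_feature is None:
    # without explicit handling, downstream model behavior is undefined
    print("x_feature is missing: impute, flag, or drop before prediction")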

Common Strategies for Handling Missing Data

  • removal of incomplete samples (with caution)
  • simple imputation (mean, median, mode)
  • model-based or learned imputation
  • adding missingness indicators
  • using models that handle missing values natively

No single approach works universally.
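
Two of the strategies above, simple mean imputation and a missingness indicator, can be combined as in the following minimal sketch. It assumes missing values are encoded as NaN; the helper names `fit_mean` and `impute_with_indicator` and the `temperature` data are hypothetical.

```python
import math
import statistics

def fit_mean(column):
    """Mean of the observed (non-NaN) entries of a column."""
    observed = [v for v in column if not math.isnan(v)]
    return statistics.fmean(observed)

def impute_with_indicator(column, fill_value):
    """Return (imputed column, 0/1 missingness indicator)."""
    imputed = [fill_value if math.isnan(v) else v for v in column]
    indicator = [1 if math.isnan(v) else 0 for v in column]
    return imputed, indicator

# Hypothetical feature with two missing entries encoded as NaN.
temperature = [21.0, float("nan"), 23.0, 25.0, float("nan")]

fill = fit_mean(temperature)
imputed, was_missing = impute_with_indicator(temperature, fill)

print(imputed)      # NaN entries replaced by the observed mean
print(was_missing)  # 1 where the original value was missing
```

Keeping the indicator as an extra feature lets a model distinguish imputed values from recorded ones, which may matter when the fact that a value is missing is itself informative.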

Common Pitfalls

  • blindly dropping rows with missing values
  • computing imputation statistics on the full dataset (including validation or test data), which leaks information into training
  • ignoring patterns in missingness
  • assuming missingness is completely random (MCAR) without checking

Handling missing data requires domain awareness.
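
The pitfall of imputing with full-dataset statistics can be made concrete with a small sketch. The train and test values below are hypothetical and deliberately drawn from different ranges so that the leak is easy to see; the helper names are illustrative only.

```python
import math
import statistics

def observed_mean(values):
    """Mean over non-NaN entries."""
    return statistics.fmean(v for v in values if not math.isnan(v))

def fill_nan(values, fill_value):
    """Replace NaN entries with fill_value."""
    return [fill_value if math.isnan(v) else v for v in values]

nan = float("nan")
# Hypothetical split: the test set comes from a shifted distribution.
train = [1.0, 2.0, nan, 3.0]
test = [10.0, nan, 12.0]

# Pitfall: the statistic is computed on train and test together,
# so information about the test set leaks into training data.
leaky_fill = observed_mean(train + test)

# Safer: fit on the training split only, then reuse it for the test split.
train_fill = observed_mean(train)
train_imputed = fill_nan(train, train_fill)
test_imputed = fill_nan(test, train_fill)

print(f"leaky fill value:      {leaky_fill}")
print(f"train-only fill value: {train_fill}")
```

The same principle applies to any fitted preprocessing step: whatever is estimated from data should be estimated from the training split and then applied unchanged to held-out data.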

Relationship to Data Quality and Bias

Missing data is a key dimension of data quality. Systematic missingness can introduce sampling bias and invalidate generalization claims if it correlates with target variables or subgroups.
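
One simple way to check for systematic missingness is to compare missing rates across subgroups. The sketch below assumes records are dictionaries with None marking an unrecorded value; the `region` and `income` fields and the function `missing_rate_by_group` are hypothetical.

```python
from collections import defaultdict

def missing_rate_by_group(rows, group_key, feature_key):
    """Fraction of rows per group where the feature is None."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        total[group] += 1
        if row[feature_key] is None:
            missing[group] += 1
    return {g: missing[g] / total[g] for g in total}

# Hypothetical records; None marks a value that was not recorded.
records = [
    {"region": "north", "income": 52_000},
    {"region": "north", "income": 48_000},
    {"region": "north", "income": None},
    {"region": "north", "income": 61_000},
    {"region": "south", "income": None},
    {"region": "south", "income": None},
    {"region": "south", "income": 39_000},
    {"region": "south", "income": None},
]

rates = missing_rate_by_group(records, "region", "income")
print(rates)
```

A large gap between groups, as in this made-up data, suggests the feature is not missing completely at random and that a model trained on the observed values may underrepresent one subgroup.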

Related Concepts