Data Quality

Short Definition

Data quality describes how suitable data is for learning, evaluation, and deployment.

Definition

Data quality refers to the degree to which a dataset accurately, consistently, and reliably represents the real-world phenomena it is intended to model. High-quality data supports effective learning and trustworthy evaluation, while poor-quality data introduces noise, bias, and misleading signals.

Data quality is a property of the data itself, independent of model architecture.

Why It Matters

Machine learning models can only learn from the information present in data. Poor data quality limits performance, distorts evaluation metrics, and undermines generalization—regardless of model complexity.

Many apparent modeling problems are, in fact, data quality problems.

Key Dimensions of Data Quality

Data quality is multidimensional and commonly evaluated along these axes:

  • Accuracy: correctness of values and labels
  • Completeness: absence of missing or truncated data
  • Consistency: internal coherence across sources and fields
  • Timeliness: relevance to current conditions
  • Validity: conformity to expected formats and ranges
  • Representativeness: alignment with the target population

Deficiencies in any dimension can impair learning.
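
Several of these dimensions can be checked programmatically. The following is a minimal sketch in Python, assuming pandas is available; the dataset, column names, and value ranges are illustrative, not prescriptive:

# minimal sketch: basic checks for a few quality dimensions
# (the dataset, columns, and ranges below are illustrative)
import pandas as pd

df = pd.DataFrame({
    "age":   [34, 29, None, 151],                       # one missing, one implausible value
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x"],  # one duplicate key
})

# completeness: fraction of missing values per column
print(df.isna().mean())

# validity: ages should fall within a plausible range
ages = df["age"].dropna()
print((~ages.between(0, 120)).sum(), "out-of-range ages")

# consistency: duplicate keys may indicate collection problems
print(df.duplicated(subset=["email"]).sum(), "duplicate emails")

Checks like these catch mechanical problems; accuracy and representativeness usually require auditing against an external reference.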

How Data Quality Affects Models

  • noisy labels degrade learning signals
  • missing or corrupted values distort feature representations
  • biased samples reduce generalization
  • stale data increases sensitivity to drift

Left unchecked, models tend to amplify data issues rather than correct them.
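
The first of these effects is easy to demonstrate. The sketch below, assuming scikit-learn and NumPy are available, trains the same classifier on clean labels and on artificially flipped labels; on synthetic data of this kind the noisy-label model generally scores lower, though the size of the gap varies:

# minimal sketch: label noise degrading a learned model
# (synthetic data; scores vary with the random seed)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# flip 40% of the training labels to simulate noisy annotation
rng = np.random.default_rng(0)
flip = rng.random(len(y_tr)) < 0.4
y_noisy = np.where(flip, 1 - y_tr, y_tr)

clean = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
noisy = LogisticRegression().fit(X_tr, y_noisy).score(X_te, y_te)
print(f"test accuracy with clean labels: {clean:.3f}")
print(f"test accuracy with noisy labels: {noisy:.3f}")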

Data Quality vs Data Quantity

More data does not guarantee better performance. Low-quality data can overwhelm useful signals, while smaller, high-quality datasets often produce more reliable models.

Quality and quantity must be balanced.
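
A small numerical illustration of this tradeoff: when estimating a population mean, a large sample with a systematic bias does not converge to the truth, while a much smaller unbiased sample does. The numbers below are synthetic and chosen purely for illustration:

# minimal sketch: sample size cannot compensate for systematic bias
# (all values are synthetic and illustrative)
import numpy as np

rng = np.random.default_rng(0)
true_mean = 10.0

small_clean = rng.normal(true_mean, 1.0, size=100)             # small, unbiased
large_biased = rng.normal(true_mean + 2.0, 1.0, size=100_000)  # huge, but offset by +2

print("true mean:            ", true_mean)
print("small clean estimate: ", round(small_clean.mean(), 2))   # close to 10
print("large biased estimate:", round(large_biased.mean(), 2))  # close to 12, regardless of size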

Minimal Conceptual Example

# conceptual illustration (pseudocode, not runnable)
# strong benchmark performance and sound underlying data are distinct properties
high_accuracy_model != high_quality_data

This highlights that performance metrics alone do not establish data quality: a model can score well on a flawed or unrepresentative benchmark while the underlying data remains poor.

Improving Data Quality

Common strategies include:

  • careful data collection and labeling
  • validation and cleaning pipelines
  • auditing data sources
  • monitoring for drift and anomalies
  • documenting known limitations

Data improvement is an ongoing process.
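
As one concrete example of drift monitoring, a two-sample test can flag when incoming feature values stop matching the training distribution. The sketch below assumes SciPy is available; the data and alert threshold are illustrative:

# minimal sketch: flag distribution drift in one numeric feature
# with a two-sample Kolmogorov-Smirnov test (synthetic data)
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5_000)  # values seen at training time
live_feature = rng.normal(0.5, 1.0, size=5_000)   # shifted values, simulating drift

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # illustrative alert threshold
    print(f"possible drift: KS statistic {stat:.3f}, p-value {p_value:.2e}")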

Common Pitfalls

  • assuming model complexity compensates for poor data
  • relying solely on automated cleaning
  • ignoring labeling consistency
  • treating data quality as a one-time task

Data quality degrades over time without maintenance.

Relationship to Generalization and Robustness

High data quality improves generalization under normal conditions but does not guarantee robustness to adversarial or worst-case inputs. Reliable systems require both good data and robust modeling assumptions.

Related Concepts

  • Data & Distribution
  • Training Data
  • Sampling Bias
  • Label Noise
  • Distribution Shift
  • Generalization
  • Model Robustness