A systematic approach to data quality is the foundation of trustworthy analytics. This article covers six quality dimensions, automated checks, and processes for continuous improvement.
Six Dimensions of Data Quality
- Completeness: required values are present (% non-null)
- Uniqueness: no duplicate records (% unique keys)
- Validity: values fall within the allowed range or format
- Accuracy: values are correct against reality
- Consistency: the same data agrees across systems
- Timeliness: data is sufficiently current
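Four of these dimensions can be measured directly on a dataset. The following is a minimal sketch using only the Python standard library; the sample rows, column names, the 0 to 120 age rule, and the one-day freshness window are all illustrative assumptions. Accuracy and consistency are left out because they need an external reference (a trusted source or a second system) to compare against.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sample records; field names and values are illustrative only.
NOW = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    {"id": 1, "email": "a@example.com", "age": 34, "updated_at": NOW - timedelta(hours=1)},
    {"id": 2, "email": None, "age": 29, "updated_at": NOW - timedelta(hours=3)},
    {"id": 2, "email": "c@example.com", "age": -5, "updated_at": NOW - timedelta(days=3)},
    {"id": 4, "email": "d@example.com", "age": 130, "updated_at": NOW - timedelta(hours=2)},
]


def pct(part, whole):
    """Percentage helper; an empty dataset scores 0.0 (an assumption)."""
    return 100.0 * part / whole if whole else 0.0


def completeness(rows, column):
    # Share of rows where the column is not null
    return pct(sum(r[column] is not None for r in rows), len(rows))


def uniqueness(rows, key):
    # Share of rows that carry a distinct key value
    return pct(len({r[key] for r in rows}), len(rows))


def validity(rows, column, is_valid):
    # Share of rows whose value passes the supplied rule
    return pct(sum(is_valid(r[column]) for r in rows), len(rows))


def timeliness(rows, column, now, max_age):
    # Share of rows updated within the allowed freshness window
    return pct(sum(now - r[column] <= max_age for r in rows), len(rows))


scores = {
    "completeness": completeness(rows, "email"),
    "uniqueness": uniqueness(rows, "id"),
    "validity": validity(rows, "age", lambda v: 0 <= v <= 120),
    "timeliness": timeliness(rows, "updated_at", NOW, timedelta(days=1)),
}
print(scores)
```

Each function returns a percentage, so the results can be reported per dimension or combined into a single score.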
Data Quality Score
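# DQ score calculation
def calculate_dq_score(checks_results):
    """Return the percentage of checks that passed (0-100)."""
    total = len(checks_results)
    if total == 0:
        # Avoid a bare ZeroDivisionError when no checks ran
        raise ValueError("no check results to score")
    passed = sum(1 for c in checks_results if c.passed)
    return (passed / total) * 100

# Example output (the overall figure equals the mean of the four dimension scores, rounded):
# Completeness: 99.8%
# Uniqueness: 100%
# Validity: 98.5%
# Timeliness: 100%
# Overall DQ Score: 99.6%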
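The scoring function only needs objects with a `passed` attribute. As a rough usage sketch, the hypothetical `CheckResult` dataclass below (its `name` and `dimension` fields are assumptions, not part of the original code) shows how the same pass-rate logic could yield both an overall score and a score per dimension.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CheckResult:
    # Hypothetical record; only `passed` is implied by the scoring function.
    name: str
    dimension: str
    passed: bool


def pass_rate(results):
    """Percentage of passed checks; 0.0 when there are no results (an assumption)."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.passed) / len(results) * 100


def score_by_dimension(results):
    """Group results by dimension and compute a pass rate for each group."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r.dimension].append(r)
    return {dim: pass_rate(items) for dim, items in grouped.items()}


results = [
    CheckResult("order_id not null", "completeness", True),
    CheckResult("email not null", "completeness", False),
    CheckResult("order_id unique", "uniqueness", True),
    CheckResult("amount >= 0", "validity", True),
]

overall = pass_rate(results)
per_dimension = score_by_dimension(results)
print(overall, per_dimension)
```

A plain pass rate weights every check equally. If some checks matter more than others, a weighted variant may be a better fit.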
Automation
- Prevention: schema enforcement and validation during ingestion
- Detection: automated checks with Great Expectations, Soda, or dbt tests
- Alerting: Slack or email notifications when checks fail
- Remediation: automatic fixes or quarantine of bad records
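The four stages above can be sketched in plain Python without committing to any particular tool. Everything here is hypothetical: the `SCHEMA` layout, the `ingest` function, the 5% failure threshold, and the `alert` callback, which stands in for a real Slack or email integration. In practice, detection would more likely be delegated to one of the tools listed above.

```python
from dataclasses import dataclass, field

# Hypothetical schema: column name -> (expected type, required?)
SCHEMA = {"id": (int, True), "amount": (float, True), "note": (str, False)}


def validate(row, schema=SCHEMA):
    """Prevention/detection: return a list of problems; empty means the row is valid."""
    problems = []
    for column, (expected_type, required) in schema.items():
        value = row.get(column)
        if value is None:
            if required:
                problems.append(f"{column}: missing required value")
        elif not isinstance(value, expected_type):
            problems.append(f"{column}: expected {expected_type.__name__}")
    return problems


@dataclass
class IngestReport:
    accepted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)  # (row, problems) pairs


def ingest(rows, alert, max_failure_rate=0.05):
    """Route invalid rows to quarantine and alert when too many rows fail."""
    report = IngestReport()
    for row in rows:
        problems = validate(row)
        if problems:
            report.quarantined.append((row, problems))  # remediation: quarantine
        else:
            report.accepted.append(row)
    failure_rate = len(report.quarantined) / len(rows) if rows else 0.0
    if failure_rate > max_failure_rate:
        alert(f"DQ alert: {failure_rate:.0%} of rows quarantined")  # alerting
    return report


messages = []  # stand-in for a Slack or email sender
report = ingest(
    [
        {"id": 1, "amount": 9.5},
        {"id": "2", "amount": 3.0},   # wrong type for id
        {"id": 3, "amount": None},    # missing required amount
        {"id": 4, "amount": 1.0, "note": "ok"},
    ],
    alert=messages.append,
)
print(len(report.accepted), len(report.quarantined), messages)
```

Quarantining keeps bad rows available for inspection and replay instead of dropping them silently, which makes later remediation easier.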
Summary
A DQ framework that combines six dimensions, automated checks, and a DQ score makes data quality management systematic and measurable.