Data Quality & Governance
Data without quality is noise. Governance without automation is bureaucracy.
We implement data quality frameworks, governance models, data catalogs, and lineage tracking. Know where your data originated, how it was transformed, who owns it — and whether you can trust it.
Why data quality is critical
A dashboard nobody trusts is more expensive than no dashboard at all. People ignore it and make decisions based on intuition — or create their own Excel files. We’ve seen this dozens of times:
- Revenue differs by 5% between financial and sales reports
- Duplicate customers — one customer in 3 systems under 3 different IDs
- Missing data — 15% of orders lack category information, segmentation is unusable
- Stale data — pipeline crashed a week ago, nobody noticed
Data quality isn’t nice-to-have. It’s a prerequisite for any data initiative — BI, analytics, AI/ML.
Data Quality Framework
6 dimensions of quality
For every dataset, we measure and monitor:
- Completeness — What percentage of values are missing? Threshold per column (e.g., email: max 2% null)
- Consistency — Does the data match across sources? Customer in CRM = customer in ERP?
- Accuracy — Are values correct? Does the postal code exist? Is the date in the past, not in the year 2087?
- Timeliness — How fresh is the data? SLA: orders within 5 minutes, financial data within 1 hour
- Uniqueness — Are there duplicates? Fuzzy duplicate detection (Smith John vs. John Smith)
- Validity — Do values match defined format and range? Email has @, age is 0-150
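The uniqueness and validity rules above can be sketched in a few lines of Python. This is a minimal illustration (all names are ours), not a production-grade matcher:

```python
import re

def normalize_name(name: str) -> str:
    """Token-sort normalization so word order does not hide duplicates."""
    return " ".join(sorted(name.lower().split()))

def is_fuzzy_duplicate(a: str, b: str) -> bool:
    """Uniqueness: 'Smith John' and 'John Smith' are the same person."""
    return normalize_name(a) == normalize_name(b)

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid(email: str, age: int) -> bool:
    """Validity: email has an @ and a domain, age is in the range 0-150."""
    return bool(EMAIL_RE.match(email)) and 0 <= age <= 150
```

Real deployments typically use a dedicated matcher (phonetic encoding, edit distance) instead of exact token-sort equality.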
Automated quality checks
Quality checks run automatically as part of every pipeline:
- dbt tests: Schema validation (unique, not_null, accepted_values, relationships)
- Great Expectations: Comprehensive data tests with human-readable documentation
- Custom validators: Business-specific rules (order sum > 0, delivery date > order date)
- Anomaly detection: Statistical anomalies in volume, distribution, trends
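One simple way to flag a volume anomaly is a z-score against recent history, as in this sketch (function name ours; real setups usually account for seasonality as well):

```python
import statistics

def volume_anomaly(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """Flag today's row count if it deviates more than `threshold`
    standard deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > threshold
```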
When a quality check fails:
- Pipeline stops (better no data than bad data)
- Alert goes to Slack/Teams with problem details
- Failed records go to quarantine for review
- Quality incident is logged with root cause and resolution
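The fail-fast-with-quarantine behavior can be sketched like this (types and names are illustrative; alerting and incident logging are left as comments):

```python
class QualityCheckFailed(Exception):
    """Raised to stop the pipeline; carries the quarantined records."""
    def __init__(self, quarantined):
        super().__init__(f"{len(quarantined)} records quarantined")
        self.quarantined = quarantined

def run_checks(rows, check):
    """Fail-fast gate: good rows pass through, bad rows go to quarantine
    and the pipeline stops (better no data than bad data)."""
    bad = [r for r in rows if not check(r)]
    if bad:
        # In production: send the alert to Slack/Teams and open
        # a quality incident with root cause here.
        raise QualityCheckFailed(bad)
    return rows
```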
Quality dashboard
Central overview of quality across all datasets:
- Quality score per dataset (aggregation of the 6 dimensions)
- Trend over time — is quality improving or deteriorating?
- Top issues — which problems have the biggest impact?
- SLA compliance — how many datasets meet their defined SLA?
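The per-dataset score can be as simple as an average of the six dimension scores, as in this sketch (equal weights are our assumption; real deployments often weight dimensions by business impact):

```python
DIMENSIONS = ("completeness", "consistency", "accuracy",
              "timeliness", "uniqueness", "validity")

def quality_score(scores: dict[str, float]) -> float:
    """Aggregate per-dimension scores (0-100) into one dataset score."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```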
Data Governance
Ownership model
Every dataset has clearly defined roles:
- Data Owner — business responsibility (defines what the data means)
- Data Steward — operational responsibility (resolves quality issues)
- Technical Owner — technical responsibility (manages the pipeline)
Data Contracts
Formal agreement between producer and consumer:
```yaml
contract:
  name: orders-v2
  owner: team-ecommerce
  schema:
    - name: order_id
      type: string
      constraints: [not_null, unique]
    - name: total_amount
      type: decimal(10,2)
      constraints: [not_null, positive]
  quality:
    completeness: ">99%"
    freshness: "<5 minutes"
  sla:
    availability: "99.9%"
    support: "business-hours"
```
Breaking change = new contract version + notification to all consumers + migration period.
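One way a producer-side CI job might enforce the not_null/unique/positive constraints from the contract above; a minimal sketch (function name and error format are ours):

```python
def check_contract(rows: list[dict]) -> list[str]:
    """Validate rows against the orders-v2 constraints:
    order_id not_null + unique, total_amount not_null + positive."""
    errors = []
    seen = set()
    for i, row in enumerate(rows):
        oid = row.get("order_id")
        amount = row.get("total_amount")
        if oid is None:
            errors.append(f"row {i}: order_id is null")
        elif oid in seen:
            errors.append(f"row {i}: duplicate order_id {oid}")
        else:
            seen.add(oid)
        if amount is None:
            errors.append(f"row {i}: total_amount is null")
        elif amount <= 0:
            errors.append(f"row {i}: total_amount not positive")
    return errors
```

In practice the checks would be generated from the contract file itself rather than hard-coded, so contract and enforcement cannot drift apart.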
Data Lineage
We automatically track data journey from source to consumer:
- Where data came from — source system, table, API endpoint
- How it was transformed — which pipeline, what transformations, what filters
- Where it goes — which dashboards, models, reports consume the data
- Impact analysis — change in source → which downstream systems are affected?
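Impact analysis boils down to a downstream traversal of the lineage graph. A minimal sketch with an illustrative edge list (producer mapped to its consumers):

```python
from collections import deque

def downstream(lineage: dict[str, list[str]], source: str) -> set[str]:
    """All assets affected by a change in `source`: breadth-first
    search over producer -> consumer edges."""
    affected, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected
```

Lineage tools expose the same idea through their APIs; the asset names here are invented for the example.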
Tools: dbt lineage, DataHub, Apache Atlas, OpenLineage.
Data Catalog
Central place for data discovery and documentation:
- Search & discovery — analyst searches for “monthly revenue” → finds definition, owner, quality score
- Business glossary — unified definitions of business terms
- Data dictionary — technical description of tables and columns
- Usage analytics — which datasets are used, which aren’t
- Collaboration — comments, questions, ratings
GDPR and compliance
Personal Data Management
- PII detection: Automatic classification of columns containing personal data
- Data masking: PII pseudonymization in development and testing environments
- Encryption: At-rest and in-transit encryption for sensitive data
- Access control: RBAC — PII access only for authorized roles
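Pseudonymization for dev/test environments can use a keyed hash, so the same input always maps to the same token (joins across tables keep working) while the original value is unrecoverable without the key. A sketch, assuming the key lives in a secrets manager (key handling omitted, names ours):

```python
import hashlib
import hmac

def pseudonymize(value: str, key: bytes) -> str:
    """Deterministic HMAC-SHA256 token: stable across tables,
    irreversible without the key."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:16]
```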
Right to be Forgotten
Automated pipeline for personal data deletion:
1. Request comes in via API/form
2. Identification of all occurrences of the person across the platform (via lineage)
3. Anonymization/deletion in all systems
4. Audit log as compliance proof
5. Confirmation to the requester
Retention Policies
- Automatic data deletion/archiving after retention period expires
- Per-dataset configuration (financial data: 10 years, logs: 90 days, marketing data: 2 years)
- Audit trail of retention operations
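Per-dataset retention can be expressed as plain configuration plus a check, as in this sketch (dataset keys and periods taken from the bullet above; names ours):

```python
from datetime import date, timedelta

# Retention periods per dataset, as defined by the policy.
RETENTION_DAYS = {
    "financial": 10 * 365,
    "logs": 90,
    "marketing": 2 * 365,
}

def is_expired(dataset: str, created: date, today: date) -> bool:
    """True when a record has outlived its dataset's retention period
    and should be deleted or archived."""
    return today - created > timedelta(days=RETENTION_DAYS[dataset])
```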
Implementation approach
- Assessment (1-2 weeks): Audit current state — where are the biggest quality problems? Does governance exist? Who owns data?
- Framework setup (2-3 weeks): Quality checks, monitoring, alerting. Ownership model. First 5-10 datasets under governance.
- Catalog and lineage (2-4 weeks): Data catalog deployment, automatic lineage, key dataset documentation.
- Scaling (ongoing): Gradual expansion to all datasets. Data steward training. Continuous improvement.
Frequently asked questions
How do you measure data quality?
6 dimensions: completeness (missing values), consistency (agreement between sources), accuracy (correctness), timeliness (freshness), uniqueness (duplicates), validity (format and range). Automated checks at the input and output of every pipeline. Quality score per dataset, trend over time.
What is a data contract?
A formal agreement between data producer and consumer. It defines schema, quality expectations, SLA, and ownership. A breaking change requires versioning, notification, and a migration period. The contract lives in code (protobuf, JSON Schema), not in documents.
Do we need a data catalog?
If you have more than 3 data sources and more than 5 consumers — yes. A catalog dramatically reduces time spent searching for data ('who should I ask'), increases trust (quality score, owner), and enables impact analysis during changes.
How do you handle GDPR?
PII detection and classification, data masking/pseudonymization, retention policies, a right-to-be-forgotten pipeline, an audit trail of all access, and consent management integration. Everything automated and auditable.