Data Quality & Governance
Data without quality is noise. Governance without automation is bureaucracy.
We implement data quality frameworks, governance models, data catalogs, and lineage tracking. Know where your data originated, how it was transformed, who owns it — and whether you can trust it.
Why data quality is critical
A dashboard nobody trusts is more expensive than no dashboard at all. People ignore it and make decisions based on intuition — or create their own Excel files. We’ve seen this dozens of times:
- Revenue differs by 5% between financial and sales reports
- Duplicate customers — one customer in 3 systems under 3 different IDs
- Missing data — 15% of orders lack category information, segmentation is unusable
- Stale data — pipeline crashed a week ago, nobody noticed
Data quality isn’t nice-to-have. It’s a prerequisite for any data initiative — BI, analytics, AI/ML.
Data Quality Framework
6 dimensions of quality
For every dataset, we measure and monitor:
- Completeness — What percentage of values are missing? Threshold per column (e.g., email: max 2% null)
- Consistency — Does the data match across sources? Customer in CRM = customer in ERP?
- Accuracy — Are values correct? Does the postal code exist? Is the date in the past, not in the year 2087?
- Timeliness — How fresh is the data? SLA: orders within 5 minutes, financial data within 1 hour
- Uniqueness — Are there duplicates? Fuzzy duplicate detection (Smith John vs. John Smith)
- Validity — Do values match defined format and range? Email has @, age is 0-150
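The uniqueness and validity rules above can be sketched in a few lines of Python. This is a minimal illustration (all names are ours), not a production-grade matcher:

```python
import re

def normalize_name(name: str) -> str:
    """Token-sort normalization so word order does not hide duplicates."""
    return " ".join(sorted(name.lower().split()))

def is_fuzzy_duplicate(a: str, b: str) -> bool:
    """Uniqueness: 'Smith John' and 'John Smith' are the same person."""
    return normalize_name(a) == normalize_name(b)

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid(email: str, age: int) -> bool:
    """Validity: email has an @ and a domain, age is in the range 0-150."""
    return bool(EMAIL_RE.match(email)) and 0 <= age <= 150
```

Real deployments typically use a dedicated matcher (phonetic encoding, edit distance) instead of exact token-sort equality.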
Automated quality checks
Quality checks run automatically as part of every pipeline:
- dbt tests: Schema validation (unique, not_null, accepted_values, relationships)
- Great Expectations: Comprehensive data tests with human-readable documentation
- Custom validators: Business-specific rules (order sum > 0, delivery date > order date)
- Anomaly detection: Statistical anomalies in volume, distribution, trends
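One simple way to flag a volume anomaly is a z-score against recent history, as in this sketch (function name ours; real setups usually account for seasonality as well):

```python
import statistics

def volume_anomaly(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """Flag today's row count if it deviates more than `threshold`
    standard deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > threshold
```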
When a quality check fails:
- Pipeline stops (better no data than bad data)
- Alert goes to Slack/Teams with problem details
- Failed records go to quarantine for review
- Quality incident is logged with root cause and resolution
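The fail-fast-with-quarantine behavior can be sketched like this (types and names are illustrative; alerting and incident logging are left as comments):

```python
class QualityCheckFailed(Exception):
    """Raised to stop the pipeline; carries the quarantined records."""
    def __init__(self, quarantined):
        super().__init__(f"{len(quarantined)} records quarantined")
        self.quarantined = quarantined

def run_checks(rows, check):
    """Fail-fast gate: good rows pass through, bad rows go to quarantine
    and the pipeline stops (better no data than bad data)."""
    bad = [r for r in rows if not check(r)]
    if bad:
        # In production: send the alert to Slack/Teams and open
        # a quality incident with root cause here.
        raise QualityCheckFailed(bad)
    return rows
```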
Quality dashboard
Central overview of quality across all datasets:
- Quality score per dataset (aggregation of the 6 dimensions)
- Trend over time — is quality improving or deteriorating?
- Top issues — which problems have the biggest impact?
- SLA compliance — how many datasets meet their defined SLA?
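The per-dataset score can be as simple as an average of the six dimension scores, as in this sketch (equal weights are our assumption; real deployments often weight dimensions by business impact):

```python
DIMENSIONS = ("completeness", "consistency", "accuracy",
              "timeliness", "uniqueness", "validity")

def quality_score(scores: dict[str, float]) -> float:
    """Aggregate per-dimension scores (0-100) into one dataset score."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```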
Data Governance
Ownership model
Every dataset has clearly defined roles:
- Data Owner — business responsibility (defines what the data means)
- Data Steward — operational responsibility (resolves quality issues)
- Technical Owner — technical responsibility (manages the pipeline)
Data Contracts
Formal agreement between producer and consumer:
```yaml
contract:
  name: orders-v2
  owner: team-ecommerce
  schema:
    - name: order_id
      type: string
      constraints: [not_null, unique]
    - name: total_amount
      type: decimal(10,2)
      constraints: [not_null, positive]
  quality:
    completeness: ">99%"
    freshness: "<5 minutes"
  sla:
    availability: "99.9%"
    support: "business-hours"
```
Breaking change = new contract version + notification to all consumers + migration period.
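One way a producer-side CI job might enforce the not_null/unique/positive constraints from the contract above; a minimal sketch (function name and error format are ours):

```python
def check_contract(rows: list[dict]) -> list[str]:
    """Validate rows against the orders-v2 constraints:
    order_id not_null + unique, total_amount not_null + positive."""
    errors = []
    seen = set()
    for i, row in enumerate(rows):
        oid = row.get("order_id")
        amount = row.get("total_amount")
        if oid is None:
            errors.append(f"row {i}: order_id is null")
        elif oid in seen:
            errors.append(f"row {i}: duplicate order_id {oid}")
        else:
            seen.add(oid)
        if amount is None:
            errors.append(f"row {i}: total_amount is null")
        elif amount <= 0:
            errors.append(f"row {i}: total_amount not positive")
    return errors
```

In practice the checks would be generated from the contract file itself rather than hard-coded, so contract and enforcement cannot drift apart.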
Data Lineage
We automatically track data journey from source to consumer:
- Where data came from — source system, table, API endpoint
- How it was transformed — which pipeline, what transformations, what filters
- Where it goes — which dashboards, models, reports consume the data
- Impact analysis — change in source → which downstream systems are affected?
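Impact analysis boils down to a downstream traversal of the lineage graph. A minimal sketch with an illustrative edge list (producer mapped to its consumers):

```python
from collections import deque

def downstream(lineage: dict[str, list[str]], source: str) -> set[str]:
    """All assets affected by a change in `source`: breadth-first
    search over producer -> consumer edges."""
    affected, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected
```

Lineage tools expose the same idea through their APIs; the asset names here are invented for the example.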
Tools: dbt lineage, DataHub, Apache Atlas, OpenLineage.
Data Catalog
Central place for data discovery and documentation:
- Search & discovery — analyst searches for “monthly revenue” → finds definition, owner, quality score
- Business glossary — unified definitions of business terms
- Data dictionary — technical description of tables and columns
- Usage analytics — which datasets are used, which aren’t
- Collaboration — comments, questions, ratings
GDPR and compliance
Personal Data Management
- PII detection: Automatic classification of columns containing personal data
- Data masking: PII pseudonymization in development and testing environments
- Encryption: At-rest and in-transit encryption for sensitive data
- Access control: RBAC — PII access only for authorized roles
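Pseudonymization for dev/test environments can use a keyed hash, so the same input always maps to the same token (joins across tables keep working) while the original value is unrecoverable without the key. A sketch, assuming the key lives in a secrets manager (key handling omitted, names ours):

```python
import hashlib
import hmac

def pseudonymize(value: str, key: bytes) -> str:
    """Deterministic HMAC-SHA256 token: stable across tables,
    irreversible without the key."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:16]
```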
Right to be Forgotten
Automated pipeline for personal data deletion:
1. Request comes in via API/form
2. Identification of all occurrences of the person across the platform (via lineage)
3. Anonymization/deletion in all systems
4. Audit log as compliance proof
5. Confirmation to the requester
Retention Policies
- Automatic data deletion/archiving after retention period expires
- Per-dataset configuration (financial data: 10 years, logs: 90 days, marketing data: 2 years)
- Audit trail of retention operations
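Per-dataset retention can be expressed as plain configuration plus a check, as in this sketch (dataset keys and periods taken from the bullet above; names ours):

```python
from datetime import date, timedelta

# Retention periods per dataset, as defined by the policy.
RETENTION_DAYS = {
    "financial": 10 * 365,
    "logs": 90,
    "marketing": 2 * 365,
}

def is_expired(dataset: str, created: date, today: date) -> bool:
    """True when a record has outlived its dataset's retention period
    and should be deleted or archived."""
    return today - created > timedelta(days=RETENTION_DAYS[dataset])
```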
Implementation approach
- Assessment (1-2 weeks): Audit current state — where are the biggest quality problems? Does governance exist? Who owns data?
- Framework setup (2-3 weeks): Quality checks, monitoring, alerting. Ownership model. First 5-10 datasets under governance.
- Catalog and lineage (2-4 weeks): Data catalog deployment, automatic lineage, key dataset documentation.
- Scaling (ongoing): Gradual expansion to all datasets. Data steward training. Continuous improvement.
Frequently asked questions
How do you measure data quality?
6 dimensions: completeness (missing values), consistency (agreement between sources), accuracy (correctness), timeliness (freshness), uniqueness (duplicates), validity (format and range). Automated checks at the input and output of every pipeline. Quality score per dataset, trend over time.
What is a data contract?
A formal agreement between data producer and consumer. It defines schema, quality expectations, SLA, and ownership. A breaking change requires versioning, notification, and a migration period. The contract lives in code (protobuf, JSON Schema), not in documents.
Do we need a data catalog?
If you have more than 3 data sources and more than 5 consumers — yes. A catalog dramatically reduces time spent searching for data ('who should I ask'), increases trust (quality score, owner), and enables impact analysis during changes.
How do you handle GDPR?
PII detection and classification, data masking/pseudonymization, retention policies, a right-to-be-forgotten pipeline, an audit trail of all access, and consent management integration. Everything automated and auditable.