
Data Quality & Governance

Data without quality is noise. Governance without automation is bureaucracy.

We implement data quality frameworks, governance models, data catalogs, and lineage tracking. Know where your data originated, how it was transformed, who owns it — and whether you can trust it.

>95%
Data quality score
100%
Lineage coverage
<5 min
Issue detection
Auditable
GDPR compliance

Why data quality is critical

A dashboard nobody trusts is more expensive than no dashboard at all. People ignore it and make decisions based on intuition — or create their own Excel files. We’ve seen this dozens of times:

  • Revenue differs by 5% between financial and sales reports
  • Duplicate customers — one customer in 3 systems under 3 different IDs
  • Missing data — 15% of orders lack category information, segmentation is unusable
  • Stale data — pipeline crashed a week ago, nobody noticed

Data quality isn’t nice-to-have. It’s a prerequisite for any data initiative — BI, analytics, AI/ML.

Data Quality Framework

6 dimensions of quality

For every dataset, we measure and monitor:

  1. Completeness — What percentage of values are missing? Threshold per column (e.g., email: max 2% null)
  2. Consistency — Do data match between sources? Customer in CRM = customer in ERP?
  3. Accuracy — Are values correct? Does postal code exist? Is date in the past, not in year 2087?
  4. Timeliness — How fresh is the data? SLA: orders within 5 minutes, financial data within 1 hour
  5. Uniqueness — Are there duplicates? Fuzzy duplicate detection (Smith John vs. John Smith)
  6. Validity — Do values match defined format and range? Email has @, age is 0-150
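Several of these dimensions reduce to simple per-column ratios. A minimal Python sketch of completeness, validity, and uniqueness checks (field names, sample records, and the 0-150 age range are illustrative):

```python
def completeness(records, field):
    """Share of records where the field is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def validity_age(records, field="age", lo=0, hi=150):
    """Share of records whose age is an integer in the allowed range."""
    ok = sum(1 for r in records
             if isinstance(r.get(field), int) and lo <= r[field] <= hi)
    return ok / len(records)

def uniqueness(records, field):
    """Share of distinct values among the non-null values of the field."""
    values = [r[field] for r in records if r.get(field) is not None]
    return len(set(values)) / len(values) if values else 1.0

records = [
    {"email": "a@example.com", "age": 34},
    {"email": "b@example.com", "age": 200},   # invalid age
    {"email": None,            "age": 28},    # missing email
    {"email": "a@example.com", "age": 41},    # duplicate email
]

completeness(records, "email")   # 3 of 4 emails filled -> 0.75
validity_age(records)            # 3 of 4 ages in range -> 0.75
uniqueness(records, "email")     # 2 distinct of 3 non-null -> 0.667
```

Fuzzy duplicate detection (Smith John vs. John Smith) needs more than exact-match comparison, but the exact-match ratio above is the baseline every dataset should report.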

Automated quality checks

Quality checks run automatically as part of every pipeline:

  • dbt tests: Schema validation (unique, not_null, accepted_values, relationships)
  • Great Expectations: Comprehensive data tests with human-readable documentation
  • Custom validators: Business-specific rules (order sum > 0, delivery date > order date)
  • Anomaly detection: Statistical anomalies in volume, distribution, trends
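A custom validator for the business rules mentioned above can be a plain function returning the rules a record violates; failing records are then routed to quarantine. A minimal sketch (record layout and rule names are assumptions):

```python
from datetime import date

def validate_order(order):
    """Return the names of the business rules the order violates."""
    failures = []
    if order["total_amount"] <= 0:
        failures.append("order_sum_positive")
    if order["delivery_date"] <= order["order_date"]:
        failures.append("delivery_after_order")
    return failures

orders = [
    {"id": 1, "total_amount": 99.0,
     "order_date": date(2024, 5, 1), "delivery_date": date(2024, 5, 3)},
    {"id": 2, "total_amount": -5.0,
     "order_date": date(2024, 5, 2), "delivery_date": date(2024, 5, 1)},
]

quarantine = []
for o in orders:
    failures = validate_order(o)
    if failures:
        quarantine.append((o["id"], failures))
# order 2 violates both rules and ends up in quarantine
```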

When a quality check fails:

  • The pipeline stops (better no data than bad data)
  • An alert goes to Slack/Teams with problem details
  • Failed records go to quarantine for review
  • A quality incident is logged with root cause and resolution

Quality dashboard

A central overview of quality across all datasets:

  • Quality score per dataset — aggregation of the 6 dimensions
  • Trend over time — is quality improving or deteriorating?
  • Top issues — which problems have the biggest impact?
  • SLA compliance — how many datasets meet their defined SLA?
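The per-dataset score can be as simple as a weighted average of the six dimension scores. A sketch (the weights here are illustrative, not a fixed recommendation):

```python
# illustrative weights; each deployment tunes these per dataset
WEIGHTS = {
    "completeness": 0.25, "consistency": 0.15, "accuracy": 0.20,
    "timeliness": 0.15, "uniqueness": 0.15, "validity": 0.10,
}

def quality_score(dimensions):
    """Weighted average of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[d] * s for d, s in dimensions.items())

score = quality_score({
    "completeness": 0.99, "consistency": 0.97, "accuracy": 0.98,
    "timeliness": 1.0, "uniqueness": 0.95, "validity": 0.99,
})
# -> 0.98, i.e. above the >95% target on the dashboard
```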

Data Governance

Ownership model

Every dataset has a defined:

  • Data Owner — business responsibility (who defines what the data means)
  • Data Steward — operational responsibility (who resolves quality issues)
  • Technical Owner — technical responsibility (who manages the pipeline)

Data Contracts

Formal agreement between producer and consumer:

contract:
  name: orders-v2
  owner: team-ecommerce
  schema:
    - name: order_id
      type: string
      constraints: [not_null, unique]
    - name: total_amount
      type: decimal(10,2)
      constraints: [not_null, positive]
  quality:
    completeness: ">99%"
    freshness: "<5 minutes"
  sla:
    availability: "99.9%"
    support: "business-hours"

Breaking change = new contract version + notification to all consumers + migration period.
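Enforcing such a contract inside a pipeline can be sketched in a few lines; the dict below mirrors the YAML contract above, with constraint names simplified for illustration:

```python
from decimal import Decimal

# mirrors the orders-v2 contract above; constraints simplified for the sketch
CONTRACT = {
    "order_id": {"type": str, "not_null": True, "unique": True},
    "total_amount": {"type": Decimal, "not_null": True, "positive": True},
}

def check_batch(records):
    """Validate a batch against the contract; return violation messages."""
    errors = []
    seen = {f: set() for f, r in CONTRACT.items() if r.get("unique")}
    for i, rec in enumerate(records):
        for field, rules in CONTRACT.items():
            value = rec.get(field)
            if value is None:
                if rules.get("not_null"):
                    errors.append(f"row {i}: {field} is null")
                continue
            if not isinstance(value, rules["type"]):
                errors.append(f"row {i}: {field} has wrong type")
                continue
            if rules.get("positive") and value <= 0:
                errors.append(f"row {i}: {field} must be positive")
            if rules.get("unique"):
                if value in seen[field]:
                    errors.append(f"row {i}: duplicate {field}")
                seen[field].add(value)
    return errors
```

A batch violating the contract is rejected before it reaches any consumer, which is exactly what makes the contract enforceable rather than documentation.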

Data Lineage

We automatically track data journey from source to consumer:

  • Where data came from — source system, table, API endpoint
  • How it was transformed — which pipeline, what transformations, what filters
  • Where it goes — which dashboards, models, reports consume the data
  • Impact analysis — change in source → which downstream systems are affected?

Tools: dbt lineage, DataHub, Apache Atlas, OpenLineage.
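Impact analysis over a lineage graph is a downstream traversal. A minimal stdlib sketch (node names are made up; real lineage comes from the tools above):

```python
from collections import deque

# node -> direct downstream consumers (illustrative names)
LINEAGE = {
    "crm.customers": ["staging.customers"],
    "staging.customers": ["marts.dim_customer"],
    "marts.dim_customer": ["dashboard.revenue", "ml.churn_model"],
}

def downstream(node):
    """All nodes affected by a change in `node` (BFS over the graph)."""
    affected, queue = set(), deque([node])
    while queue:
        for nxt in LINEAGE.get(queue.popleft(), []):
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return affected

downstream("crm.customers")
# -> staging.customers, marts.dim_customer, dashboard.revenue, ml.churn_model
```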

Data Catalog

Central place for data discovery and documentation:

  • Search & discovery — analyst searches for “monthly revenue” → finds definition, owner, quality score
  • Business glossary — unified definitions of business terms
  • Data dictionary — technical description of tables and columns
  • Usage analytics — which datasets are used, which aren’t
  • Collaboration — comments, questions, ratings

GDPR and compliance

Personal Data Management

  • PII detection: Automatic classification of columns containing personal data
  • Data masking: PII pseudonymization in development and testing environments
  • Encryption: At-rest and in-transit encryption for sensitive data
  • Access control: RBAC — PII access only for authorized roles
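Pseudonymization for non-production environments is often a keyed hash: the same input always maps to the same stable token, so joins keep working, but the original value is not recoverable. A stdlib sketch (key handling simplified; in practice the key lives in a secrets manager):

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # illustrative; load from a secrets manager

def pseudonymize(value):
    """Deterministic, non-reversible token for a PII value (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

row = {"customer_id": 42, "email": "jan.novak@example.com"}
masked = {**row, "email": pseudonymize(row["email"])}
# the same email always yields the same token, so test joins still work
```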

Right to be Forgotten

Automated pipeline for personal data deletion:

  1. The request arrives via API or form
  2. All occurrences of the person across the platform are identified (via lineage)
  3. Data is anonymized or deleted in all systems
  4. An audit log entry is written as compliance proof
  5. Confirmation is sent to the requester
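The deletion flow can be orchestrated as a small pipeline. A hypothetical sketch (system names and actions are illustrative; in practice the occurrence list is derived from lineage metadata and each action is a real call into the target system):

```python
from datetime import datetime, timezone

# where the person's data lives and what to do there (illustrative;
# in practice derived from lineage metadata)
OCCURRENCES = [
    ("crm", "anonymize"),
    ("warehouse", "delete"),
    ("marketing", "delete"),
]

def forget(person_id):
    """Apply the per-system action and return an audit entry as proof."""
    performed = [{"system": system, "action": action}
                 for system, action in OCCURRENCES]  # real calls go here
    return {
        "request": "right_to_be_forgotten",
        "person_id": person_id,
        "performed": performed,
        "completed_at": datetime.now(timezone.utc).isoformat(),
    }

audit_entry = forget("cust-42")  # persisted, then confirmation is sent
```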

Retention Policies

  • Automatic data deletion/archiving after retention period expires
  • Per-dataset configuration (financial data: 10 years, logs: 90 days, marketing data: 2 years)
  • Audit trail of retention operations
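A retention sweep reduces to comparing record age against the per-dataset policy. A minimal sketch using the periods mentioned above:

```python
from datetime import date, timedelta

# retention periods per dataset, as in the examples above
RETENTION = {
    "financial": timedelta(days=365 * 10),
    "logs": timedelta(days=90),
    "marketing": timedelta(days=365 * 2),
}

def expired(dataset, created, today):
    """True if a record created on `created` is past its retention period."""
    return today - created > RETENTION[dataset]

today = date(2025, 1, 1)
expired("logs", date(2024, 9, 1), today)       # older than 90 days -> True
expired("marketing", date(2024, 9, 1), today)  # within 2 years -> False
```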

Implementation approach

  1. Assessment (1-2 weeks): Audit current state — where are the biggest quality problems? Does governance exist? Who owns data?
  2. Framework setup (2-3 weeks): Quality checks, monitoring, alerting. Ownership model. First 5-10 datasets under governance.
  3. Catalog and lineage (2-4 weeks): Data catalog deployment, automatic lineage, key dataset documentation.
  4. Scaling (ongoing): Gradual expansion to all datasets. Data steward training. Continuous improvement.

Frequently asked questions

How do you measure data quality?

6 dimensions: completeness (missing values), consistency (agreement between sources), accuracy (correctness), timeliness (freshness), uniqueness (duplicates), validity (format and range). Automated checks at the input and output of every pipeline. Quality score per dataset, trend over time.

What is a data contract?

A formal agreement between data producer and consumer. It defines schema, quality expectations, SLA, and ownership. A breaking change requires versioning, notification, and a migration period. The contract lives in code (protobuf, JSON Schema), not in documents.

Do we need a data catalog?

If you have more than 3 data sources and more than 5 consumers, yes. A catalog dramatically reduces time spent searching for data ('who should I ask'), increases trust (quality score, owner), and enables impact analysis during changes.

How do you handle GDPR?

PII detection and classification, data masking/pseudonymization, retention policies, a right-to-be-forgotten pipeline, an audit trail of all access, consent management integration. Everything automated and auditable.

Do you have a project?

Let's talk about it.

Schedule a meeting