
Synthetic Data for Enterprise Testing — A Complete Guide 2026

05. 02. 2026 · 12 min read · CORE SYSTEMS · data

Testing enterprise applications with production data is not only risky in 2026 — in many cases it is directly illegal. GDPR, NIS2 and growing regulation are forcing companies to seek alternatives. Synthetic data — artificially generated datasets that statistically match production data without containing any personal information — is the answer. In this guide we cover everything from theory through tools to concrete implementation patterns.

Why Production Data in Test Environments Is Not the Answer

A surprisingly large number of companies still copy production databases into test environments. The problems are numerous:

  • GDPR violations: Customer personal data in a test environment is processing beyond the original purpose, without a legal basis. Fines reach up to 4% of global annual turnover.
  • NIS2 regulation: From 2025, NIS2 applies to ICT service providers. Insufficient protection of test data is an audit finding rated “high”.
  • Data breaches: Test environments typically have weaker security — broader access, less monitoring, weaker encryption. 67% of data breaches in 2025 originated from non-production environments.
  • Masking is not enough: Anonymising and pseudonymising production data is fragile. Re-identification remains possible by combining quasi-identifiers — in Latanya Sweeney's classic US study, date of birth + ZIP code + gender uniquely identified 87% of the population.
  • Operational costs: Copying terabyte databases, managing access, audit logging — all of this costs time and money.

Synthetic data solves these problems at a fundamental level: there is no real person who can be identified, because the data never represented a real person.

What Synthetic Data Is and How It Works

Synthetic data is an artificially generated dataset produced by algorithms that preserve the statistical properties, distributions and correlations of the original data — without any link to specific individuals or records.

Key Properties of Quality Synthetic Data

Statistical fidelity: Value distributions, averages, variances and correlations between columns match the original — a property that must be verified with metrics, not assumed.

Privacy guarantees: No synthetic record should be too similar to a real one. Measured using metrics like Distance to Closest Record (DCR) or membership inference resistance.

Utility: ML models trained on synthetic data achieve comparable accuracy to those trained on original data. Benchmarked by metric drop (accuracy drop, F1 drop).
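
The utility check described above is commonly run as "train on synthetic, test on real" (TSTR). A minimal sketch with scikit-learn; the datasets here are generated stand-ins purely for illustration — in practice you would pass in your real and synthetic DataFrames:

```python
# TSTR sketch: a model fitted on synthetic data should score close to
# one fitted on real data, evaluated on a held-out REAL test set.
# The data below is a toy stand-in, not a real dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_split(n, shift):
    # Toy stand-in for a (features, label) dataset
    X = rng.normal(shift, 1.0, size=(n, 4))
    y = (X[:, 0] + X[:, 1] > 2 * shift).astype(int)
    return X, y

X_real, y_real = make_split(2000, 1.0)    # "production" data
X_synth, y_synth = make_split(2000, 1.0)  # "synthetic" data
X_test, y_test = make_split(500, 1.0)     # held-out real test set

acc_real = accuracy_score(
    y_test, RandomForestClassifier(random_state=0).fit(X_real, y_real).predict(X_test))
acc_synth = accuracy_score(
    y_test, RandomForestClassifier(random_state=0).fit(X_synth, y_synth).predict(X_test))

# Utility metric: the accuracy drop from swapping in synthetic data
print(f"accuracy drop: {acc_real - acc_synth:.3f}")
```

A small drop means the synthetic data preserved the signal the model needs; a large drop is a red flag even when per-column distributions look fine.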

Consistency: Referential integrity between tables is maintained. An order references an existing customer who has a valid address in the correct format.

Generative Approaches

Three main categories of generators are used in practice:

1. Statistical models (rule-based)

The simplest approach. You define distributions for each column and the generator produces data according to rules. Suitable for simple datasets without complex dependencies.

# Example: Faker + custom distributions
from faker import Faker
import numpy as np

fake = Faker('en_GB')

def generate_customer():
    age = int(np.random.normal(38, 12))
    age = max(18, min(99, age))
    return {
        'name': fake.name(),
        'email': fake.email(),
        'age': age,
        'city': np.random.choice(
            ['London', 'Manchester', 'Birmingham', 'Leeds'],
            p=[0.45, 0.20, 0.15, 0.20]
        ),
        'monthly_spend': round(np.random.lognormal(7.5, 1.2), 2)  # lognormal is always positive
    }

Advantages: fast, deterministic, easily explainable. Disadvantages: don’t capture complex correlations between columns.

2. GAN-based generators (CTGAN, TableGAN)

Generative adversarial networks trained on tabular data. The generator produces synthetic records, the discriminator learns to distinguish real from synthetic. After convergence, the generator produces statistically faithful data.

from sdv.single_table import CTGANSynthesizer
from sdv.metadata import SingleTableMetadata

# real_data is a pandas DataFrame holding the production sample
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

synthesizer = CTGANSynthesizer(metadata, epochs=500)
synthesizer.fit(real_data)

synthetic_data = synthesizer.sample(num_rows=100_000)

Advantages: automatically capture complex correlations. Disadvantages: require sufficient training data (thousands of records), training takes hours.

3. LLM-based generators

The latest approach — using large language models to generate contextually rich synthetic data. Particularly effective for unstructured and semi-structured data (order notes, medical reports, customer communications).

# Example with Claude API for generating realistic customer tickets
import anthropic

client = anthropic.Anthropic()

prompt = """Generate 5 realistic customer support tickets 
for an electronics e-commerce store. Each ticket must contain:
- subject, problem description, category, priority, sentiment
Format: JSON array. Tickets should be diverse — complaints, 
queries, grievances, compliments."""

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2000,
    messages=[{"role": "user", "content": prompt}]
)

print(response.content[0].text)  # the generated tickets as a JSON array

Advantages: generate contextually rich data with realistic relationships, no training dataset needed. Disadvantages: more expensive for large volumes, poorer reproducibility.

Enterprise Synthetic Data System Architecture

In an enterprise environment you don’t just need a generator — you need an entire pipeline. Here is the reference architecture we deploy with clients:

1. Metadata Layer — Understanding Your Data

Before generating anything, you need to understand the schema, distributions and relationships:

# schema-profile.yaml
tables:
  customers:
    columns:
      - name: customer_id
        type: integer
        role: primary_key
        generator: sequential
      - name: email
        type: string
        role: pii
        pattern: "{first_name}.{last_name}@{domain}"
      - name: birth_date
        type: date
        distribution: normal
        mean: "1988-06-15"
        std_days: 4380  # ~12 years
      - name: segment
        type: categorical
        values: [premium, standard, basic]
        weights: [0.15, 0.55, 0.30]

    constraints:
      - type: unique
        columns: [email]
      - type: range
        column: birth_date
        min: "1940-01-01"
        max: "2008-12-31"

  orders:
    columns:
      - name: order_id
        type: integer
        role: primary_key
      - name: customer_id
        type: integer
        role: foreign_key
        references: customers.customer_id
      - name: total_amount
        type: decimal
        distribution: lognormal
        mean: 150.0
        std: 280.0

    relationships:
      - type: one_to_many
        parent: customers
        child: orders
        distribution: poisson
        mean: 4.2  # average orders per customer

2. Generation Engine

An orchestrator that:

  • Respects referential integrity (generates parents before children)
  • Applies constraints (unique, range, null rates)
  • Preserves temporal consistency (order after registration)
  • Supports incremental generation (new records alongside existing ones)
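
The "parents before children" ordering can be sketched as a topological sort over foreign-key dependencies. The table names follow the schema profile above; the generator logic is a deliberately simplified placeholder, not a real library API:

```python
# Sketch: generate parent tables before children so every foreign key
# can be sampled from already-generated primary keys.
from graphlib import TopologicalSorter
import random

# child table -> set of parent tables it references
dependencies = {"customers": set(), "orders": {"customers"}}

generated = {}  # table name -> list of generated rows

def generate_table(table, n):
    rows = []
    for i in range(n):
        row = {f"{table[:-1]}_id": i + 1}  # sequential primary key
        if table == "orders":
            # FK sampled from parent rows that already exist
            parent = random.choice(generated["customers"])
            row["customer_id"] = parent["customer_id"]
        rows.append(row)
    return rows

# static_order() yields each table only after all its parents
for table in TopologicalSorter(dependencies).static_order():
    generated[table] = generate_table(table, 100)

# Every order now references an existing customer
customer_ids = {c["customer_id"] for c in generated["customers"]}
assert all(o["customer_id"] in customer_ids for o in generated["orders"])
```

The same sort order also works in reverse for deletion and for incremental regeneration of a single child table.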

3. Validation Layer — Quality Assurance

Every generated dataset passes through automatic validation:

from sdmetrics.reports.single_table import QualityReport
from sdmetrics.single_table import NewRowSynthesis

# Statistical quality
report = QualityReport()
report.generate(real_data, synthetic_data, metadata)
print(f"Overall quality score: {report.get_score()}")
# Target: > 0.85

# Privacy validation
privacy_score = NewRowSynthesis.compute(
    real_data, synthetic_data, metadata
)
print(f"New row synthesis: {privacy_score}")
# Target: > 0.95 (95%+ of records are unique)

4. Distribution Layer — Delivering Data

Synthetic data must be easily accessible to developers and CI/CD pipelines:

  • Self-service portal: Developer selects schema, row count, seed → generates on demand
  • CI/CD integration: Every build automatically generates a fresh dataset
  • Snapshot management: Versioned datasets for reproducible tests
  • Format flexibility: SQL dump, CSV, Parquet, API endpoint

Practical Patterns for Enterprise

Pattern 1: Banking Transactions

Banks must test AML (Anti-Money Laundering) systems with realistic transactions. Synthetic data must include:

  • Normal transaction patterns (salary → rent → purchases)
  • Anomalies for detection (structuring, layering, round-tripping)
  • Locale-specific formats (IBAN, sort codes, transaction references)

# Assumes the earlier setup: from faker import Faker; import numpy as np
# fake = Faker('en_GB')
def generate_bank_transaction():
    """Generates a bank transaction with realistic patterns."""
    tx_type = np.random.choice(
        ['incoming', 'outgoing', 'internal'],
        p=[0.35, 0.55, 0.10]
    )

    # Realistic amounts by type
    if tx_type == 'incoming' and np.random.random() < 0.3:
        # Salary — normal distribution around median
        amount = max(1500, np.random.normal(3800, 1200))
    else:
        amount = np.random.lognormal(5.5, 1.8)

    return {
        'iban': fake.iban(),
        'amount': round(amount, 2),
        'currency': 'GBP',
        'reference': fake.bothify(text='??######'),
        'type': tx_type,
        'timestamp': fake.date_time_between(
            start_date='-90d', end_date='now'
        )
    }

Pattern 2: E-commerce Orders

For testing logistics systems, you need consistent orders:

  • Customer → basket → order → payment → dispatch → delivery
  • Each step has realistic time intervals
  • Valid addresses matching real postal codes
  • Seasonal patterns (Christmas, Black Friday, summer)

# Assumes the earlier setup (Faker, numpy) plus: from datetime import datetime
class OrderGenerator:
    """E-commerce order generator with seasonal patterns."""

    SEASONAL_MULTIPLIERS = {
        1: 0.7, 2: 0.65, 3: 0.8, 4: 0.85,
        5: 0.9, 6: 0.85, 7: 0.75, 8: 0.8,
        9: 0.95, 10: 1.0, 11: 1.4, 12: 1.8
    }

    def generate_order(self, date: datetime) -> dict:
        month = date.month
        base_items = np.random.poisson(2.3)
        items = max(1, int(base_items * self.SEASONAL_MULTIPLIERS[month]))

        return {
            'order_date': date,
            'items_count': items,
            'shipping_address': {
                'city': fake.city(),
                'postcode': fake.postcode(),
                'street': fake.street_address(),
            },
            'payment_method': np.random.choice(
                ['card', 'bank_transfer', 'paypal', 'buy_now_pay_later'],
                p=[0.45, 0.25, 0.20, 0.10]
            ),
            'delivery_method': np.random.choice(
                ['standard', 'express', 'click_collect', 'locker'],
                p=[0.40, 0.25, 0.20, 0.15]
            )
        }

Pattern 3: Healthcare Data

Hospitals and insurers need to test systems with patient data, where GDPR is especially strict:

  • Diagnoses matching ICD-10 classification
  • Realistic hospitalisation patterns
  • Demographic correlations (age ↔ diagnosis)
  • No real national IDs — generated with valid format but non-existent
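
A minimal sketch of the age ↔ diagnosis correlation. The ICD-10 subset and the probability weights are invented for illustration, and the identifier is format-valid but random — it follows no real national ID scheme:

```python
# Sketch: age-correlated diagnoses from a tiny hypothetical ICD-10
# subset. Weights and the ID format are illustrative only.
import numpy as np

rng = np.random.default_rng(7)

# (ICD-10 code, weight for under-50, weight for 50+) -- invented weights
DIAGNOSES = [
    ("J06.9", 0.50, 0.20),  # acute upper respiratory infection
    ("I10",   0.15, 0.45),  # essential hypertension
    ("E11.9", 0.10, 0.25),  # type 2 diabetes
    ("S93.4", 0.25, 0.10),  # ankle sprain
]

def generate_patient():
    age = int(np.clip(rng.normal(52, 20), 0, 99))
    col = 2 if age >= 50 else 1          # pick the weight column by age band
    weights = np.array([d[col] for d in DIAGNOSES])
    code = DIAGNOSES[rng.choice(len(DIAGNOSES), p=weights / weights.sum())][0]
    # Format-valid but random identifier -- NOT a real national ID
    patient_id = f"{rng.integers(0, 10**6):06d}/{rng.integers(0, 10**4):04d}"
    return {"patient_id": patient_id, "age": age, "diagnosis": code}
```

In a real deployment the weights would be profiled from production aggregates (never row-level data), and the ID generator would be checked against the national scheme's validity rules to guarantee non-existence.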

Synthetic Data Tools in 2026

Open-Source Tools

| Tool | Approach | Best for | Licence |
| --- | --- | --- | --- |
| SDV (Synthetic Data Vault) | GAN/statistical | Tabular data, multi-table | MIT |
| Faker | Rule-based | PII replacement, simple datasets | MIT |
| Gretel.ai SDK | GAN + LLM | Complex enterprise data | Freemium |
| DataSynthesizer | Bayesian network | Academic / simpler use cases | MIT |
| Synthcity | GAN/VAE/diffusion | Healthcare data | Apache 2.0 |
| Mimesis | Rule-based | High performance, multi-locale | MIT |

Enterprise Platforms

Mostly AI: Leader in Gartner Magic Quadrant for synthetic data. Strong support for tabular data, automatic privacy validation, self-hosted and cloud. Price: from €50K/year.

Tonic.ai: Focused on database subsetting + synthesis. Direct integration with PostgreSQL, MySQL, Oracle. Strong in CI/CD pipelines. Price: from $30K/year.

Gretel.ai: Cloud-native platform with the best LLM integration. Generates unstructured data too (text, JSON, logs). Free tier for smaller volumes.

Is Synthetic Data Personal Data?

The key question. The answer depends on the generation method:

Fully synthetic data (de novo): Generated purely from statistical distributions without direct mapping to specific individuals. According to the EDPB (European Data Protection Board) opinion from 2025, this is not personal data if:

  • No record can be traced back to a specific person
  • The generator does not perform a 1:1 transformation
  • Distance to Closest Record (DCR) is above the safety threshold

Pseudonymised data: Transformations of production data (masking, hashing). Still personal data under GDPR — pseudonymisation is not anonymisation.

Differentially private data: Adding calibrated noise that mathematically guarantees that the presence or absence of one record does not affect the output. Strongest legal position — demonstrable anonymisation.
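
The canonical building block here is the Laplace mechanism: a count query has sensitivity 1 (adding or removing one record changes the count by at most 1), so adding Laplace noise with scale 1/ε makes that query ε-differentially private. A minimal sketch:

```python
# Sketch: Laplace mechanism for a differentially private count.
# Count queries have sensitivity 1, so Laplace noise with scale
# 1/epsilon gives epsilon-differential privacy for the query.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(true_count: int, epsilon: float) -> float:
    """Return a noisy count satisfying epsilon-DP for counting queries."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Lower epsilon = stronger privacy = more noise
print(dp_count(1_000, epsilon=0.1))   # noisy
print(dp_count(1_000, epsilon=10.0))  # close to the true count
```

Full DP synthetic data generators (e.g. DP-trained models) compose many such noisy measurements under a total privacy budget; this sketch only shows the single-query primitive.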

Practical Recommendations

  1. Document the generation method — an auditor must see that synthetic data cannot be reverse-linked
  2. Conduct re-identification tests — regularly verify that synthetic records cannot be linked to real individuals
  3. Keep metadata — which model, which configuration, which seed generated the data
  4. Separate environments — the generator that has access to production data runs in a protected environment
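
Recommendation 3 can be as simple as writing a small manifest next to every generated dataset. The field names below are illustrative, not a standard:

```python
# Sketch: a generation manifest stored alongside each dataset so an
# auditor can trace exactly how it was produced. Field names are
# illustrative.
import json
import hashlib
from datetime import datetime, timezone

def write_manifest(path: str, *, generator: str, config: dict, seed: int,
                   row_counts: dict) -> dict:
    manifest = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "generator": generator,  # e.g. "sdv.CTGANSynthesizer"
        # Hash of the full config so drift is detectable without
        # storing the config itself next to the data
        "config_sha256": hashlib.sha256(
            json.dumps(config, sort_keys=True).encode()).hexdigest(),
        "seed": seed,
        "row_counts": row_counts,
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

m = write_manifest("/tmp/manifest.json",
                   generator="sdv.CTGANSynthesizer",
                   config={"epochs": 500}, seed=42,
                   row_counts={"customers": 100_000})
```

With the seed and config hash recorded, any dataset version can be regenerated bit-for-bit or shown to an auditor on request.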

CI/CD Pipeline Integration

Synthetic data has the greatest value when automated in CI/CD:

# .github/workflows/integration-tests.yml
name: Integration Tests with Synthetic Data

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_PASSWORD: test
        ports: ['5432:5432']
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4

      - name: Generate synthetic data
        run: |
          pip install sdv faker
          python scripts/generate_test_data.py \
            --schema config/data-schema.yaml \
            --rows 50000 \
            --seed ${{ github.run_number }} \
            --output /tmp/synthetic_data/

      - name: Load data into test DB
        run: |
          psql -h localhost -U postgres -d testdb \
            -f /tmp/synthetic_data/load.sql
        env:
          PGPASSWORD: test

      - name: Run integration tests
        run: pytest tests/integration/ -v --tb=short

      - name: Validate data quality
        run: |
          python scripts/validate_synthetic_data.py \
            --data /tmp/synthetic_data/ \
            --min-quality 0.85 \
            --min-privacy 0.95

Deterministic vs Stochastic Generation

For CI/CD, reproducible output is critical:

import numpy as np
import pandas as pd
from faker import Faker

class DeterministicGenerator:
    """Seed-based generator for reproducible CI/CD tests."""

    def __init__(self, seed: int = 42):
        self.rng = np.random.RandomState(seed)
        self.fake = Faker('en_GB')
        Faker.seed(seed)

    def generate_dataset(self, n_rows: int) -> pd.DataFrame:
        """Same seed = same data every time."""
        records = [self._generate_record() for _ in range(n_rows)]
        return pd.DataFrame(records)

    def _generate_record(self) -> dict:
        return {
            'id': self.fake.uuid4(),
            'name': self.fake.name(),
            'email': self.fake.email(),
            'amount': round(self.rng.lognormal(5.5, 1.5), 2),
            'created_at': self.fake.date_time_this_year()
        }

Measuring Synthetic Data Quality

Generating is not enough — you must measure. Key metrics:

Fidelity Metrics

  • Column Shape: Distribution of each column vs. original (KS test, chi-squared)
  • Column Pair Trends: Correlations between pairs of columns
  • Parent-Child Relationships: Referential integrity and distributional consistency
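
These per-column checks can be run directly with scipy. A sketch with generated stand-in columns — in practice you would pass the real and synthetic columns of your tables:

```python
# Sketch: per-column fidelity checks with scipy. The arrays here are
# stand-ins; substitute real vs synthetic columns.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
real_amounts = rng.lognormal(5.5, 1.2, size=5000)
synth_amounts = rng.lognormal(5.5, 1.2, size=5000)

# Continuous column: two-sample Kolmogorov-Smirnov test
ks = stats.ks_2samp(real_amounts, synth_amounts)
print(f"KS statistic={ks.statistic:.3f}, p={ks.pvalue:.3f}")

# Categorical column: chi-squared on observed frequencies
# (totals must match, e.g. premium/standard/basic counts per 1000 rows)
real_counts = np.array([450, 350, 200])
synth_counts = np.array([440, 365, 195])
chi2, p = stats.chisquare(synth_counts, f_exp=real_counts)
print(f"chi2={chi2:.2f}, p={p:.3f}")
# A large p-value means the test cannot distinguish the distributions
```

A small KS statistic and non-significant chi-squared are necessary but not sufficient — they say nothing about cross-column correlations, which is why pair trend metrics are checked separately.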

Privacy Metrics

  • DCR (Distance to Closest Record): Minimum distance of each synthetic record from the nearest real one. The median DCR should exceed the 5th percentile of nearest-neighbour distances within the real data itself.
  • Membership Inference: Can an ML model tell whether a specific record was in the training dataset? Target: accuracy ≤ 52% (close to random).
  • Attribute Inference: Can an attacker infer a sensitive attribute from synthetic data better than from public statistics? Target: minimal advantage.
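
DCR can be approximated with a nearest-neighbour search. A sketch with scikit-learn over numeric features (scaled first, since DCR is distance-based); the arrays are stand-ins for illustration:

```python
# Sketch: Distance to Closest Record (DCR) via nearest-neighbour
# search. For each synthetic record, find the closest real record;
# if the median distance is very small, records may be near-copies.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
real = rng.normal(0, 1, size=(2000, 5))       # stand-in numeric features
synthetic = rng.normal(0, 1, size=(2000, 5))

scaler = StandardScaler().fit(real)
nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(real))

dcr, _ = nn.kneighbors(scaler.transform(synthetic))
median_dcr = float(np.median(dcr))

# Baseline: distances within the real data itself (2nd neighbour,
# because the 1st neighbour of a real point is itself)
self_d, _ = NearestNeighbors(n_neighbors=2).fit(
    scaler.transform(real)).kneighbors(scaler.transform(real))
baseline = float(np.percentile(self_d[:, 1], 5))

print(f"median DCR={median_dcr:.3f}, real 5th-percentile={baseline:.3f}")
```

If the median DCR falls below the baseline, synthetic records sit closer to real ones than real records sit to each other — a sign of memorisation, and the dataset should be regenerated or rejected.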

Utility Metrics

  • ML Efficacy: Train the same model on real and synthetic data, compare performance on a real test set
  • Query Accuracy: Analytical queries (aggregations, filters) on synthetic data should give results within ±5% of real
  • Statistical Tests: Kolmogorov-Smirnov test for continuous variables, chi-squared for categorical

from sdmetrics.reports.single_table import QualityReport, DiagnosticReport

# Quality
quality = QualityReport()
quality.generate(real_data, synthetic_data, metadata)

print("Column Shapes:", quality.get_details('Column Shapes'))
print("Column Pair Trends:", quality.get_details('Column Pair Trends'))

# Diagnostics
diag = DiagnosticReport()
diag.generate(real_data, synthetic_data, metadata)
print("Coverage:", diag.get_details('Coverage'))
print("Boundaries:", diag.get_details('Boundary'))

Common Mistakes and How to Avoid Them

  1. Generating without profiling — generating data without understanding original distributions. Result: uniform synthetic data that doesn’t match reality.

  2. Ignoring temporal dependencies — orders with a date before customer registration. Payments before invoice issuance. Absurd but common.

  3. Data too uniform — synthetic data without outliers and edge cases. Tests pass, but production fails on unexpected values.

  4. One-time generation — generating a dataset once and using it for months. Data becomes stale, new features have no test data.

  5. Missing privacy validation — assuming data is safe without measurement. Without metrics, you can’t know.
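
The usual fix for mistake 3 (data too uniform) is deliberate edge-case injection: mutate a small fraction of generated rows into boundary values. The categories and rates below are illustrative:

```python
# Sketch: inject a small fraction of edge cases so synthetic data also
# exercises boundary handling. Categories and rates are illustrative.
import random

random.seed(42)

EDGE_CASES = [
    lambda r: {**r, "name": "O'Brien-Smith \u00c5\u017e"},  # quoting/Unicode
    lambda r: {**r, "amount": 0.0},                         # zero boundary
    lambda r: {**r, "amount": 999_999_999.99},              # overflow-ish value
    lambda r: {**r, "email": None},                         # missing optional field
]

def inject_edge_cases(rows, rate=0.02):
    out = []
    for row in rows:
        if random.random() < rate:
            row = random.choice(EDGE_CASES)(row)
        out.append(row)
    return out

rows = [{"name": f"user{i}", "email": f"u{i}@example.com", "amount": 10.0}
        for i in range(10_000)]
mutated = inject_edge_cases(rows)
n_edges = sum(1 for r in mutated
              if r["amount"] != 10.0 or r["email"] is None
              or not r["name"].startswith("user"))
print(f"injected {n_edges} edge cases")
```

Keeping the rate low (1–2%) preserves overall statistical fidelity while still forcing tests through the paths that break in production.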

Case Study: Implementation for a Retailer

One of our clients — a retailer with 2M+ customers and 50M+ transactions per year — needed a test environment for a new loyalty system.

Initial State

  • Copy of production DB (PostgreSQL 14, 800 GB) in test environment
  • Masking via custom SQL scripts (unreliable, 3 audit findings)
  • Refresh once a month (manual, 6h test downtime)
  • 12 developers with access to production data

Solution

  • SDV + Faker for tabular data
  • Custom generator for loyalty points and campaigns
  • CI/CD integration — fresh data on every PR
  • Validation pipeline — automatic quality and privacy checks

Results after 3 months

  • 0 developers with access to production data (down from 12)
  • Audit compliance: All GDPR findings closed
  • Generation: 50M synthetic transactions in 45 minutes (vs. 6h copy)
  • Bug detection: +23% more bugs found thanks to edge case injection
  • Costs: -40% on test infrastructure (smaller DB, no security overlays)

Where Synthetic Data Is Heading in 2026

  • Foundation models for tabular data: Models like TabPFN and TabDDPM enable zero-shot generation
  • Federated synthetic data: Organisations share synthetic data representations without sharing data itself
  • Regulatory standards: EU AI Act positions synthetic data as a key AI compliance tool
  • Real-time synthetic data: Streaming generation for testing real-time systems (Kafka, event sourcing, CDC)

Conclusion

Synthetic data is not a luxury — it is a necessity. Companies that still copy production databases into test environments risk GDPR fines, data breaches and audit findings. The technology in 2026 is mature enough for enterprise deployment.

Start simply: Faker for reference data, SDV for complex datasets, automatic validation in CI/CD. The key is a systematic approach — not a one-off script, but an integrated pipeline with measurable quality and privacy.


Need help implementing a complete synthetic data platform? Contact us — from data model analysis through generator deployment to CI/CD integration and GDPR compliance documentation.

synthetic-data · testing · gdpr · ai · data-engineering · privacy