Testing enterprise applications with production data is not just risky in 2026 — in many cases it is outright unlawful. GDPR, NIS2 and growing regulation are forcing companies to seek alternatives. Synthetic data — artificially generated datasets that statistically match production data without containing any personal information — is the answer. In this guide we cover everything from theory through tools to concrete implementation patterns.
Why Production Data in Test Environments Is Not the Answer¶
A surprisingly large number of companies still copy production databases into test environments. The problems are numerous:
- GDPR violations: Customer personal data in a test environment constitutes a change of processing purpose without a legal basis. Fines reach up to 4% of global annual turnover.
- NIS2 regulation: From 2025, NIS2 applies to ICT service providers. Insufficient protection of test data is an audit finding rated “high”.
- Data breaches: Test environments typically have weaker security — broader access, less monitoring, weaker encryption. 67% of data breaches in 2025 originated from non-production environments.
- Masking is not enough: Anonymising and pseudonymising production data is fragile. Re-identification is possible by combining quasi-identifiers (in Sweeney's well-known study, date of birth + postcode + gender uniquely identified 87% of the US population).
- Operational costs: Copying terabyte databases, managing access, audit logging — all of this costs time and money.
Synthetic data solves these problems at a fundamental level: there is no real person who can be identified, because the data never represented a real person.
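The quasi-identifier risk is easy to check empirically on your own data. A minimal sketch with pandas (column names illustrative): count how many rows are uniquely identified by a given combination of columns.

```python
# Illustrative sketch: what fraction of records is uniquely identified
# by a quasi-identifier combination such as (age, postcode, gender)?
import pandas as pd

def uniqueness_rate(df: pd.DataFrame, quasi_identifiers: list[str]) -> float:
    """Fraction of rows whose quasi-identifier combination is unique."""
    group_sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return float((group_sizes == 1).mean())

# Toy dataset: two of four records are uniquely identifiable.
df = pd.DataFrame({
    "age":      [34, 34, 52, 67],
    "postcode": ["SW1A", "SW1A", "M1", "LS1"],
    "gender":   ["F", "F", "M", "F"],
})
rate = uniqueness_rate(df, ["age", "postcode", "gender"])
```

A rate close to 1.0 on a masked production extract is a strong signal that the masking does not anonymise.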
What Synthetic Data Is and How It Works¶
Synthetic data is an artificially generated dataset produced by algorithms that preserve the statistical properties, distributions and correlations of the original data — without any link to specific individuals or records.
Key Properties of Quality Synthetic Data¶
Statistical fidelity: Value distributions, averages, variances and correlations between columns match the original. A good generator preserves not only the marginal distribution of each column but also the joint relationships between columns.
Privacy guarantees: No synthetic record should be too similar to a real one. Measured using metrics like Distance to Closest Record (DCR) or membership inference resistance.
Utility: ML models trained on synthetic data achieve comparable accuracy to those trained on original data. Benchmarked by metric drop (accuracy drop, F1 drop).
Consistency: Referential integrity between tables is maintained. An order references an existing customer who has a valid address in the correct format.
Generative Approaches¶
Three main categories of generators are used in practice:
**1. Statistical models (rule-based)**

The simplest approach. You define distributions for each column and the generator produces data according to rules. Suitable for simple datasets without complex dependencies.
```python
# Example: Faker + custom distributions
from faker import Faker
import numpy as np

fake = Faker('en_GB')

def generate_customer():
    age = int(np.random.normal(38, 12))
    age = max(18, min(99, age))
    return {
        'name': fake.name(),
        'email': fake.email(),
        'age': age,
        'city': np.random.choice(
            ['London', 'Manchester', 'Birmingham', 'Leeds'],
            p=[0.45, 0.20, 0.15, 0.20]
        ),
        'monthly_spend': max(0, np.random.lognormal(7.5, 1.2))
    }
```
Advantages: fast, deterministic, easily explainable. Disadvantages: don’t capture complex correlations between columns.
**2. GAN-based generators (CTGAN, TableGAN)**

Generative adversarial networks trained on tabular data. The generator produces synthetic records, the discriminator learns to distinguish real from synthetic. After convergence, the generator produces statistically faithful data.
```python
from sdv.single_table import CTGANSynthesizer
from sdv.metadata import SingleTableMetadata

# real_data: a pandas DataFrame sampled from production
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

synthesizer = CTGANSynthesizer(metadata, epochs=500)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=100_000)
```
Advantages: automatically capture complex correlations. Disadvantages: require sufficient training data (thousands of records), training takes hours.
**3. LLM-based generators**

The latest approach — using large language models to generate contextually rich synthetic data. Particularly effective for unstructured and semi-structured data (order notes, medical reports, customer communications).
```python
# Example with Claude API for generating realistic customer tickets
import anthropic

client = anthropic.Anthropic()

prompt = """Generate 5 realistic customer support tickets
for an electronics e-commerce store. Each ticket must contain:
- subject, problem description, category, priority, sentiment
Format: JSON array. Tickets should be diverse — complaints,
queries, grievances, compliments."""

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2000,
    messages=[{"role": "user", "content": prompt}]
)
```
Advantages: generate contextually rich data with realistic relationships, no training dataset needed. Disadvantages: more expensive for large volumes, poorer reproducibility.
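Whatever model generates the text, the output should be validated before it reaches a test database. A minimal hedged sketch, assuming the JSON-array format requested in the prompt above (field names are illustrative):

```python
# Hedged sketch: parse and validate LLM output before loading it as
# test data. The required field names below are assumptions matching
# the example prompt, not a fixed schema.
import json

REQUIRED_FIELDS = {"subject", "description", "category", "priority", "sentiment"}

def parse_tickets(raw: str) -> list[dict]:
    """Parse an LLM response and keep only well-formed tickets."""
    tickets = json.loads(raw)
    if not isinstance(tickets, list):
        raise ValueError("expected a JSON array of tickets")
    return [t for t in tickets if REQUIRED_FIELDS <= set(t)]

raw = '''[
  {"subject": "Broken screen", "description": "Arrived cracked",
   "category": "complaint", "priority": "high", "sentiment": "negative"},
  {"subject": "Thanks!", "description": "Fast delivery"}
]'''
valid = parse_tickets(raw)
```

Dropping malformed records at this stage keeps downstream loaders simple and deterministic.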
Enterprise Synthetic Data System Architecture¶
In an enterprise environment you don’t just need a generator — you need an entire pipeline. Here is the reference architecture we deploy with clients:
1. Metadata Layer — Understanding Your Data¶
Before generating anything, you need to understand the schema, distributions and relationships:
```yaml
# schema-profile.yaml
tables:
  customers:
    columns:
      - name: customer_id
        type: integer
        role: primary_key
        generator: sequential
      - name: email
        type: string
        role: pii
        pattern: "{first_name}.{last_name}@{domain}"
      - name: birth_date
        type: date
        distribution: normal
        mean: "1988-06-15"
        std_days: 4380   # ~12 years
      - name: segment
        type: categorical
        values: [premium, standard, basic]
        weights: [0.15, 0.55, 0.30]
    constraints:
      - type: unique
        columns: [email]
      - type: range
        column: birth_date
        min: "1940-01-01"
        max: "2008-12-31"
  orders:
    columns:
      - name: order_id
        type: integer
        role: primary_key
      - name: customer_id
        type: integer
        role: foreign_key
        references: customers.customer_id
      - name: total_amount
        type: decimal
        distribution: lognormal
        mean: 150.0
        std: 280.0
relationships:
  - type: one_to_many
    parent: customers
    child: orders
    distribution: poisson
    mean: 4.2   # average orders per customer
```
2. Generation Engine¶
An orchestrator that:

- Respects referential integrity (generates parents before children)
- Applies constraints (unique, range, null rates)
- Preserves temporal consistency (order after registration)
- Supports incremental generation (new records alongside existing ones)
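Generating parents before children reduces to a topological sort of the foreign-key graph. A minimal sketch using the standard library's graphlib (table names and the dependency format are illustrative):

```python
# Minimal sketch: order tables so that every table is generated after
# the tables it references via foreign keys.
from graphlib import TopologicalSorter

def generation_order(fk_deps: dict[str, set[str]]) -> list[str]:
    """fk_deps maps each table to the set of parent tables it references."""
    return list(TopologicalSorter(fk_deps).static_order())

deps = {
    "customers": set(),
    "orders": {"customers"},
    "products": set(),
    "order_items": {"orders", "products"},
}
order = generation_order(deps)
```

TopologicalSorter also raises CycleError on circular references, which is itself a useful schema sanity check.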
3. Validation Layer — Quality Assurance¶
Every generated dataset passes through automatic validation:
```python
from sdmetrics.reports.single_table import QualityReport
from sdmetrics.single_table import NewRowSynthesis

# Statistical quality
report = QualityReport()
report.generate(real_data, synthetic_data, metadata)
print(f"Overall quality score: {report.get_score()}")
# Target: > 0.85

# Privacy validation
privacy_score = NewRowSynthesis.compute(
    real_data, synthetic_data, metadata
)
print(f"New row synthesis: {privacy_score}")
# Target: > 0.95 (95%+ of records are unique)
```
4. Distribution Layer — Delivering Data¶
Synthetic data must be easily accessible to developers and CI/CD pipelines:
- Self-service portal: Developer selects schema, row count, seed → generates on demand
- CI/CD integration: Every build automatically generates a fresh dataset
- Snapshot management: Versioned datasets for reproducible tests
- Format flexibility: SQL dump, CSV, Parquet, API endpoint
Practical Patterns for Enterprise¶
Pattern 1: Banking Transactions¶
Banks must test AML (Anti-Money Laundering) systems with realistic transactions. Synthetic data must include:
- Normal transaction patterns (salary → rent → purchases)
- Anomalies for detection (structuring, layering, round-tripping)
- Locale-specific formats (IBAN, sort codes, transaction references)
```python
def generate_bank_transaction():
    """Generates a bank transaction with realistic patterns."""
    tx_type = np.random.choice(
        ['incoming', 'outgoing', 'internal'],
        p=[0.35, 0.55, 0.10]
    )

    # Realistic amounts by type
    if tx_type == 'incoming' and np.random.random() < 0.3:
        # Salary — normal distribution around median
        amount = max(1500, np.random.normal(3800, 1200))
    else:
        amount = np.random.lognormal(5.5, 1.8)

    return {
        'iban': fake.iban(),
        'amount': round(amount, 2),
        'currency': 'GBP',
        'reference': fake.bothify(text='??######'),
        'type': tx_type,
        'timestamp': fake.date_time_between(
            start_date='-90d', end_date='now'
        )
    }
```
Pattern 2: E-commerce Orders¶
For testing logistics systems, you need consistent orders:
- Customer → basket → order → payment → dispatch → delivery
- Each step has realistic time intervals
- Valid addresses matching real postal codes
- Seasonal patterns (Christmas, Black Friday, summer)
```python
from datetime import datetime

class OrderGenerator:
    """E-commerce order generator with seasonal patterns."""

    SEASONAL_MULTIPLIERS = {
        1: 0.7, 2: 0.65, 3: 0.8, 4: 0.85,
        5: 0.9, 6: 0.85, 7: 0.75, 8: 0.8,
        9: 0.95, 10: 1.0, 11: 1.4, 12: 1.8
    }

    def generate_order(self, date: datetime) -> dict:
        month = date.month
        base_items = np.random.poisson(2.3)
        items = max(1, int(base_items * self.SEASONAL_MULTIPLIERS[month]))
        return {
            'order_date': date,
            'items_count': items,
            'shipping_address': {
                'city': fake.city(),
                'postcode': fake.postcode(),
                'street': fake.street_address(),
            },
            'payment_method': np.random.choice(
                ['card', 'bank_transfer', 'paypal', 'buy_now_pay_later'],
                p=[0.45, 0.25, 0.20, 0.10]
            ),
            'delivery_method': np.random.choice(
                ['standard', 'express', 'click_collect', 'locker'],
                p=[0.40, 0.25, 0.20, 0.15]
            )
        }
```
Pattern 3: Healthcare Data¶
Hospitals and insurers need to test systems with patient data, where GDPR is especially strict:
- Diagnoses matching ICD-10 classification
- Realistic hospitalisation patterns
- Demographic correlations (age ↔ diagnosis)
- No real national IDs — generated with valid format but non-existent
Synthetic Data Tools in 2026¶
Open-Source Tools¶
| Tool | Approach | Best for | Licence |
|---|---|---|---|
| SDV (Synthetic Data Vault) | GAN/statistical | Tabular data, multi-table | MIT |
| Faker | Rule-based | PII replacement, simple datasets | MIT |
| Gretel.ai SDK | GAN + LLM | Complex enterprise data | Freemium |
| DataSynthesizer | Bayesian network | Academic / simpler use cases | MIT |
| Synthcity | GAN/VAE/diffusion | Healthcare data | Apache 2.0 |
| Mimesis | Rule-based | High performance, multi-locale | MIT |
Enterprise Platforms¶
Mostly AI: Leader in Gartner Magic Quadrant for synthetic data. Strong support for tabular data, automatic privacy validation, self-hosted and cloud. Price: from €50K/year.
Tonic.ai: Focused on database subsetting + synthesis. Direct integration with PostgreSQL, MySQL, Oracle. Strong in CI/CD pipelines. Price: from $30K/year.
Gretel.ai: Cloud-native platform with the best LLM integration. Generates unstructured data too (text, JSON, logs). Free tier for smaller volumes.
GDPR and Legal Considerations¶
Is Synthetic Data Personal Data?¶
The key question. The answer depends on the generation method:
Fully synthetic data (de novo): Generated purely from statistical distributions without direct mapping to specific individuals. According to the EDPB (European Data Protection Board) opinion from 2025, this is not personal data if:

- No record can be traced back to a specific person
- The generator does not perform a 1:1 transformation
- Distance to Closest Record (DCR) is above the safety threshold
Pseudonymised data: Transformations of production data (masking, hashing). Still personal data under GDPR — pseudonymisation is not anonymisation.
Differentially private data: Adding calibrated noise that mathematically guarantees that the presence or absence of one record does not affect the output. Strongest legal position — demonstrable anonymisation.
Practical Recommendations¶
- Document the generation method — an auditor must see that synthetic data cannot be reverse-linked
- Conduct re-identification tests — regularly verify that synthetic records cannot be linked to real individuals
- Keep metadata — which model, which configuration, which seed generated the data
- Separate environments — the generator that has access to production data runs in a protected environment
CI/CD Pipeline Integration¶
Synthetic data has the greatest value when automated in CI/CD:
```yaml
# .github/workflows/integration-tests.yml
name: Integration Tests with Synthetic Data

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_PASSWORD: test
        ports: ['5432:5432']
    steps:
      - uses: actions/checkout@v4

      - name: Generate synthetic data
        run: |
          pip install sdv faker
          python scripts/generate_test_data.py \
            --schema config/data-schema.yaml \
            --rows 50000 \
            --seed ${{ github.run_number }} \
            --output /tmp/synthetic_data/

      - name: Load data into test DB
        run: |
          psql -h localhost -U postgres -d testdb \
            -f /tmp/synthetic_data/load.sql
        env:
          PGPASSWORD: test

      - name: Run integration tests
        run: pytest tests/integration/ -v --tb=short

      - name: Validate data quality
        run: |
          python scripts/validate_synthetic_data.py \
            --data /tmp/synthetic_data/ \
            --min-quality 0.85 \
            --min-privacy 0.95
```
Deterministic vs Stochastic Generation¶
For CI/CD, reproducible output is critical:
```python
import numpy as np
import pandas as pd
from faker import Faker

class DeterministicGenerator:
    """Seed-based generator for reproducible CI/CD tests."""

    def __init__(self, seed: int = 42):
        self.rng = np.random.RandomState(seed)
        self.fake = Faker('en_GB')
        Faker.seed(seed)

    def generate_dataset(self, n_rows: int) -> pd.DataFrame:
        """Same seed = same data every time."""
        records = [self._generate_record() for _ in range(n_rows)]
        return pd.DataFrame(records)

    def _generate_record(self) -> dict:
        return {
            'id': self.fake.uuid4(),
            'name': self.fake.name(),
            'email': self.fake.email(),
            'amount': round(self.rng.lognormal(5.5, 1.5), 2),
            'created_at': self.fake.date_time_this_year()
        }
```
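The reproducibility property is worth asserting in CI itself. A simplified, self-contained sketch of the same pattern (numpy only, without Faker, so the check stays fast):

```python
# Sketch: two generators built with the same seed must produce
# identical frames; a different seed must not.
import numpy as np
import pandas as pd

def make_dataset(seed: int, n_rows: int) -> pd.DataFrame:
    rng = np.random.RandomState(seed)
    return pd.DataFrame({
        "amount": np.round(rng.lognormal(5.5, 1.5, size=n_rows), 2),
        "segment": rng.choice(["premium", "standard", "basic"],
                              size=n_rows, p=[0.15, 0.55, 0.30]),
    })

a = make_dataset(seed=42, n_rows=1000)
b = make_dataset(seed=42, n_rows=1000)
c = make_dataset(seed=43, n_rows=1000)
```

A failing equality check here usually means some code path consumes randomness from an unseeded global source.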
Measuring Synthetic Data Quality¶
Generating is not enough — you must measure. Key metrics:
Fidelity Metrics¶
- Column Shape: Distribution of each column vs. original (KS test, chi-squared)
- Column Pair Trends: Correlations between pairs of columns
- Parent-Child Relationships: Referential integrity and distributional consistency
Privacy Metrics¶
- DCR (Distance to Closest Record): Minimum distance from each synthetic record to the nearest real record. A common rule of thumb: the median DCR of synthetic data should exceed the 5th percentile of real-to-real nearest-neighbour distances.
- Membership Inference: Can an ML model tell whether a specific record was in the training dataset? Target: accuracy ≤ 52% (close to random).
- Attribute Inference: Can an attacker infer a sensitive attribute from synthetic data better than from public statistics? Target: minimal advantage.
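The mechanics of DCR are straightforward to sketch; thresholds remain project-specific. A minimal numpy version over numeric columns, min-max scaled by the real data:

```python
# Hedged sketch of the DCR metric: Euclidean distance from each
# synthetic row to its nearest real row, after min-max scaling.
# A DCR of 0 means an exact copy of a real record — a privacy red flag.
import numpy as np

def distance_to_closest_record(real: np.ndarray,
                               synthetic: np.ndarray) -> np.ndarray:
    """Per-synthetic-row distance to the nearest real row."""
    lo, hi = real.min(axis=0), real.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    r = (real - lo) / span
    s = (synthetic - lo) / span
    # Pairwise distance matrix of shape (n_synthetic, n_real)
    d = np.linalg.norm(s[:, None, :] - r[None, :, :], axis=2)
    return d.min(axis=1)

real = np.array([[30.0, 1000.0], [40.0, 2000.0], [50.0, 3000.0]])
synthetic = np.array([[30.0, 1000.0], [45.0, 2500.0]])
dcr = distance_to_closest_record(real, synthetic)
```

The broadcast-based distance matrix is O(n·m) in memory, so production implementations typically use a KD-tree or batched computation instead.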
Utility Metrics¶
- ML Efficacy: Train the same model on real and synthetic data, compare performance on a real test set
- Query Accuracy: Analytical queries (aggregations, filters) on synthetic data should give results within ±5% of real
- Statistical Tests: Kolmogorov-Smirnov test for continuous variables, chi-squared for categorical
```python
from sdmetrics.reports.single_table import QualityReport, DiagnosticReport

# Quality
quality = QualityReport()
quality.generate(real_data, synthetic_data, metadata)
print("Column Shapes:", quality.get_details('Column Shapes'))
print("Column Pair Trends:", quality.get_details('Column Pair Trends'))

# Diagnostics
diag = DiagnosticReport()
diag.generate(real_data, synthetic_data, metadata)
print("Coverage:", diag.get_details('Coverage'))
print("Boundaries:", diag.get_details('Boundary'))
```
Common Mistakes and How to Avoid Them¶
1. Generating without profiling — generating data without understanding original distributions. Result: uniform synthetic data that doesn’t match reality.
2. Ignoring temporal dependencies — orders with a date before customer registration. Payments before invoice issuance. Absurd but common.
3. Data too uniform — synthetic data without outliers and edge cases. Tests pass, but production fails on unexpected values.
4. One-time generation — generating a dataset once and using it for months. Data becomes stale, new features have no test data.
5. Missing privacy validation — assuming data is safe without measurement. Without metrics, you can’t know.
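Mistakes like the temporal one are cheap to catch in the validation pipeline. A minimal pandas sketch (column names illustrative) that flags orders dated before the customer's registration:

```python
# Sketch of a temporal-consistency check: every order date must fall
# on or after the owning customer's registration date.
import pandas as pd

def temporal_violations(customers: pd.DataFrame,
                        orders: pd.DataFrame) -> pd.DataFrame:
    """Return orders placed before the customer registered."""
    merged = orders.merge(customers, on="customer_id", how="left")
    return merged[merged["order_date"] < merged["registered_at"]]

customers = pd.DataFrame({
    "customer_id": [1, 2],
    "registered_at": pd.to_datetime(["2025-03-01", "2025-06-15"]),
})
orders = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "order_date": pd.to_datetime(["2025-04-10", "2025-06-20", "2025-05-01"]),
})
bad = temporal_violations(customers, orders)
```

Run as a hard gate: a non-empty result fails the build before the dataset reaches any test environment.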
Case Study: Implementation for a Retailer¶
One of our clients — a retailer with 2M+ customers and 50M+ transactions per year — needed a test environment for a new loyalty system.
Initial State¶
- Copy of production DB (PostgreSQL 14, 800 GB) in test environment
- Masking via custom SQL scripts (unreliable, 3 audit findings)
- Refresh once a month (manual, 6h test downtime)
- 12 developers with access to production data
Solution¶
- SDV + Faker for tabular data
- Custom generator for loyalty points and campaigns
- CI/CD integration — fresh data on every PR
- Validation pipeline — automatic quality and privacy checks
Results after 3 months¶
- 0 developers with access to production data (down from 12)
- Audit compliance: All GDPR findings closed
- Generation: 50M synthetic transactions in 45 minutes (vs. 6h copy)
- Bug detection: +23% more bugs found thanks to edge case injection
- Costs: -40% on test infrastructure (smaller DB, no security overlays)
Where Synthetic Data Is Heading in 2026¶
- Foundation models for tabular data: Models like TabPFN and TabDDPM enable zero-shot generation
- Federated synthetic data: Organisations share synthetic data representations without sharing data itself
- Regulatory standards: EU AI Act positions synthetic data as a key AI compliance tool
- Real-time synthetic data: Streaming generation for testing real-time systems (Kafka, event sourcing, CDC)
Conclusion¶
Synthetic data is not a luxury — it is a necessity. Companies that still copy production databases into test environments risk GDPR fines, data breaches and audit findings. The technology in 2026 is mature enough for enterprise deployment.
Start simply: Faker for reference data, SDV for complex datasets, automatic validation in CI/CD. The key is a systematic approach — not a one-off script, but an integrated pipeline with measurable quality and privacy.
Need help implementing a complete synthetic data platform? Contact us — from data model analysis through generator deployment to CI/CD integration and GDPR compliance documentation.