DataHub — Open Data Catalog for Modern Data Stack

DataHub centralizes metadata from the entire data stack: automatic lineage, search, tagging, and governance.

DataHub: Central Hub for Metadata

DataHub addresses two questions: where to find data and whether it can be trusted.

Features

Automatic ingestion — 50+ connectors
Lineage — automatic dependency mapping
Search — full-text search
Ownership — assign owners

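# Ingestion recipe: read metadata from PostgreSQL and send it to DataHub
source:
  type: postgres
  config:
    host_port: "warehouse:5432"    # PostgreSQL host and port
    database: analytics
    profiling:
      enabled: true                # also collect table and column statistics
sink:
  type: datahub-rest
  config:
    server: "http://datahub:8080"  # DataHub REST endpoint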

Practical Deployment

DataHub is typically deployed as a Docker Compose stack or on Kubernetes using a Helm chart. After startup, you configure ingestion recipes for individual data sources — PostgreSQL, Snowflake, Airflow, dbt, and dozens more. Ingestion runs periodically (cron) or as part of a CI/CD pipeline.
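One way to run a recipe from cron or a CI job is through DataHub's Python ingestion API. The sketch below assumes the `acryl-datahub` package with its PostgreSQL plugin is installed; `build_recipe` and `run_ingestion` are hypothetical helper names, and the connection values in the usage note are copied from the recipe above. Credentials are left out, as they are in that recipe.

```python
def build_recipe(host_port: str, database: str, server: str) -> dict:
    """Return the ingestion recipe as a plain dict (same shape as the YAML recipe)."""
    return {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": host_port,
                "database": database,
                "profiling": {"enabled": True},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": server},
        },
    }


def run_ingestion(recipe: dict) -> None:
    """Run one ingestion pass and fail loudly if it reports errors."""
    # Imported lazily so this module still loads where acryl-datahub is absent.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    # Raises if the run reported failures, so cron or CI sees a failed job.
    pipeline.raise_from_status()
```

A scheduled job could call `run_ingestion(build_recipe("warehouse:5432", "analytics", "http://datahub:8080"))`. When the recipe lives in a YAML file, the `datahub ingest -c <recipe file>` command does the same job from the shell.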

DataHub’s greatest value lies in automatic column-level lineage — you can see where data comes from and where it flows, down to individual columns. This dramatically simplifies debugging data issues and impact analysis when schema changes occur. For teams managing dozens of databases and hundreds of tables, a data catalog is an essential tool for ensuring data governance and reducing time spent searching for the right data.
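To make the impact-analysis idea concrete, here is a toy model of column-level lineage as a graph. It does not use DataHub's API: the `LINEAGE` mapping, its column names, and the `downstream_columns` helper are all invented for illustration. The point is only that once lineage is recorded as edges between columns, "what breaks if this column changes?" becomes a graph traversal.

```python
from collections import deque

# Hypothetical lineage edges: each column maps to the columns derived from it.
LINEAGE = {
    "raw.orders.amount": ["staging.orders.amount_usd"],
    "staging.orders.amount_usd": ["marts.revenue.total", "marts.finance.gross"],
    "raw.orders.customer_id": ["staging.orders.customer_id"],
    "staging.orders.customer_id": ["marts.revenue.customer_id"],
}


def downstream_columns(lineage: dict, start: str) -> list:
    """Return every column transitively derived from `start`, sorted by name."""
    seen = set()
    queue = deque([start])
    while queue:  # breadth-first walk over the lineage edges
        column = queue.popleft()
        for child in lineage.get(column, []):
            if child not in seen:  # guards against revisiting and cycles
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Columns affected by a change to raw.orders.amount
affected = downstream_columns(LINEAGE, "raw.orders.amount")
```

Here `affected` contains the staging column and both mart columns that depend on `raw.orders.amount`, while the `customer_id` branch is untouched.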

Summary

DataHub is a leading open-source data catalog with automatic lineage and rich integrations.

datahub, data catalog, metadata, lineage

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.
