DataHub centralizes metadata from the entire data stack: automatic lineage, search, tagging, and governance.
DataHub: Central Hub for Metadata
DataHub addresses two questions: where to find data, and whether it can be trusted.
Features
- Automatic ingestion — 50+ connectors
- Lineage — automatic dependency mapping
- Search — full-text search
- Ownership — assign owners
# DataHub: Open Data Catalog for the Modern Data Stack
source:
  type: postgres
  config:
    host_port: "warehouse:5432"
    database: analytics
    profiling:
      enabled: true
sink:
  type: datahub-rest
  config:
    server: "http://datahub:8080"
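The same recipe can also be built and run from Python rather than from a YAML file. The following is a minimal sketch, not a definitive implementation: it assumes the `acryl-datahub` package (with its PostgreSQL plugin) is installed and that `Pipeline.create`, `run`, and `raise_from_status` behave as in DataHub's programmatic ingestion interface. The helper names `build_recipe` and `run_ingestion` are hypothetical and introduced only for illustration.

```python
def build_recipe(host_port: str, database: str, server: str) -> dict:
    """Return an ingestion recipe that mirrors the YAML recipe above."""
    return {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": host_port,
                "database": database,
                "profiling": {"enabled": True},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": server},
        },
    }


def run_ingestion(recipe: dict) -> None:
    """Run the recipe. Assumption: the `acryl-datahub` package is installed."""
    # Imported lazily so the recipe can be built and inspected without DataHub.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()  # raise if the run reported failures


# Values taken from the recipe above.
RECIPE = build_recipe("warehouse:5432", "analytics", "http://datahub:8080")

# Uncomment once the warehouse and the DataHub server are reachable:
# run_ingestion(RECIPE)
```

Building the recipe as a dictionary makes it straightforward to generate one recipe per database from a single template, which helps once the number of sources grows.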
Practical Deployment
DataHub is typically deployed as a Docker Compose stack or on Kubernetes using a Helm chart. After startup, you configure ingestion recipes for individual data sources — PostgreSQL, Snowflake, Airflow, dbt, and dozens more. Ingestion runs periodically (cron) or as part of a CI/CD pipeline.
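For periodic or CI/CD-driven ingestion, one possible approach is a small wrapper script that runs every recipe in a directory through the `datahub ingest -c` command and reports which ones failed. This is a sketch under stated assumptions: the directory layout, the `.yml` extension, and the function names `ingest_command` and `run_all` are hypothetical, and it assumes the `datahub` CLI is on the `PATH` and exits with a non-zero status when a run fails.

```python
import subprocess
from pathlib import Path


def ingest_command(recipe_path: Path) -> list:
    """Build the CLI invocation for a single recipe file."""
    return ["datahub", "ingest", "-c", str(recipe_path)]


def run_all(recipe_dir: Path, runner=subprocess.run) -> list:
    """Run every *.yml recipe in recipe_dir; return the names of those that failed.

    `runner` is injectable so the loop can be exercised without the CLI installed.
    """
    failed = []
    for recipe in sorted(recipe_dir.glob("*.yml")):
        result = runner(ingest_command(recipe), check=False)
        if result.returncode != 0:
            failed.append(recipe.name)
    return failed


# Example use in a cron job or CI step (hypothetical directory name):
# failed = run_all(Path("recipes"))
# if failed:
#     raise SystemExit("Ingestion failed for: " + ", ".join(failed))
```

Collecting failures instead of stopping at the first one means a single broken source does not block metadata refreshes for all the others.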
DataHub’s greatest value lies in automatic column-level lineage: you can see where data comes from and where it flows, down to individual columns. This makes it much easier to debug data issues and to assess the impact of schema changes. For teams managing dozens of databases and hundreds of tables, a data catalog is an essential tool for enforcing data governance and reducing the time spent searching for the right data.
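To make impact analysis concrete, the sketch below walks a small column-level lineage graph and lists every column downstream of a changed one. It does not use DataHub's API: the lineage map, the column names, and the `downstream` function are all hypothetical, and the example only illustrates the kind of traversal that column-level lineage makes possible.

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the columns in its list.
LINEAGE = {
    "raw.orders.amount": ["staging.orders.amount_usd"],
    "staging.orders.amount_usd": [
        "marts.revenue.total",
        "marts.customers.lifetime_value",
    ],
    "raw.orders.customer_id": ["marts.customers.customer_id"],
    "marts.revenue.total": ["dashboards.finance.revenue"],
}


def downstream(column: str, lineage: dict) -> list:
    """Return every column reachable downstream of `column`, breadth-first."""
    visited = {column}
    impacted = []
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in visited:  # guard against cycles and repeats
                visited.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# Columns affected if raw.orders.amount changes type or is dropped.
impacted = downstream("raw.orders.amount", LINEAGE)
```

Here a change to `raw.orders.amount` reaches four downstream columns, including a finance dashboard, while `marts.customers.customer_id` is unaffected because it derives from a different source column.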
Summary
DataHub is a leading open-source data catalog with automatic lineage and rich integrations.