Great Expectations lets you define, test and document expectations for your data. It automatically generates documentation and integrates with Airflow, Spark and pandas.
Why Validate Data Quality
With Great Expectations you define the rules ("expectations") once; the library then checks them automatically on every pipeline run.
```python
import great_expectations as gx

context = gx.get_context()
# batch_request identifies the data to validate; it is assumed to be defined earlier
validator = context.get_validator(batch_request=batch_request)

validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between(
    "total_czk", min_value=0, max_value=10_000_000
)

validator.save_expectation_suite()
```
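Conceptually, each expectation is a declarative rule evaluated against a column and reduced to a pass/fail result with the offending values attached. A minimal pure-Python sketch of that idea (illustrative names, not the Great Expectations API):

```python
def expect_values_between(column, min_value, max_value):
    """Return a GX-style result: a success flag plus the failing values."""
    unexpected = [v for v in column if not (min_value <= v <= max_value)]
    return {"success": not unexpected, "unexpected_values": unexpected}

totals = [120, 4500, 99_000, -5]
result = expect_values_between(totals, min_value=0, max_value=10_000_000)
print(result["success"])            # False: -5 is below the minimum
print(result["unexpected_values"])  # [-5]
```

The real library adds batching, result metadata, and persistence on top, but the pass/fail contract is the same.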
Airflow Integration
```python
import great_expectations as gx

def validate_data():
    context = gx.get_context()
    # Run a pre-configured checkpoint for the daily orders batch
    result = context.run_checkpoint("daily_orders")
    if not result.success:
        raise ValueError("Data quality check failed!")

# extract, validate_task, and transform are Airflow tasks defined elsewhere;
# the validation sits between extraction and transformation in the DAG
extract >> validate_task >> transform
```
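The failure gate inside validate_data can be isolated and unit-tested without Airflow or a configured checkpoint. A sketch with a simplified result object (real checkpoint results carry far more detail than a single flag):

```python
class ValidationResult:
    """Simplified stand-in for a checkpoint result (illustrative only)."""
    def __init__(self, success):
        self.success = success

def gate(result):
    """Raise if validation failed, so downstream tasks never run."""
    if not result.success:
        raise ValueError("Data quality check failed!")
    return True

gate(ValidationResult(success=True))  # passes silently
try:
    gate(ValidationResult(success=False))
except ValueError as exc:
    print(exc)  # Data quality check failed!
```

Raising an exception is the idiomatic way to fail an Airflow task, which is what stops bad data from flowing downstream.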
Expectation Types and Data Docs
Great Expectations offers hundreds of built-in expectations — from basic (not null, unique, between) to advanced (distribution tests, regex patterns, referential integrity between tables). You can create custom expectations as Python classes.
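In Great Expectations itself, custom expectations subclass the library's base classes; the core shape, a class bundling a rule with its parameters, can be sketched in plain Python (illustrative names, not the GX API):

```python
import re

class ExpectColumnValuesToMatchRegex:
    """Tiny stand-in for a custom expectation class (illustrative)."""
    def __init__(self, column, regex):
        self.column = column
        self.regex = re.compile(regex)

    def validate(self, rows):
        # Collect every value in the target column that fails the pattern
        unexpected = [
            row[self.column] for row in rows
            if not self.regex.fullmatch(str(row[self.column]))
        ]
        return {"success": not unexpected, "unexpected_values": unexpected}

rule = ExpectColumnValuesToMatchRegex("order_id", r"ORD-\d{6}")
rows = [{"order_id": "ORD-000123"}, {"order_id": "bad-id"}]
print(rule.validate(rows))  # success: False, unexpected: ['bad-id']
```

A real custom expectation also declares metadata and rendering hints so it shows up properly in Data Docs.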
Data Docs is automatically generated HTML documentation that visualizes validation results: red/green indicators for each expectation, column statistics, and historical trends.

In a CI/CD pipeline, run validation as a gate: if the data does not meet expectations, the pipeline fails and bad data never reaches the production layer.

The Profiler analyzes existing data and suggests expectations automatically, which significantly speeds up initial setup.
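The Profiler's core idea, inspecting existing data and proposing expectations from what it observes, can be illustrated with a small stdlib sketch (illustrative names, not the GX Profiler API):

```python
def suggest_expectations(rows, column):
    """Propose simple expectations from observed data (illustrative)."""
    values = [row[column] for row in rows if row[column] is not None]
    suggestions = []
    if len(values) == len(rows):
        # No nulls observed, so suggest a not-null expectation
        suggestions.append(("expect_column_values_to_not_be_null", {}))
    if len(set(values)) == len(values):
        # All values distinct, so suggest uniqueness
        suggestions.append(("expect_column_values_to_be_unique", {}))
    if values and all(isinstance(v, (int, float)) for v in values):
        # Numeric column: suggest bounds from the observed range
        suggestions.append((
            "expect_column_values_to_be_between",
            {"min_value": min(values), "max_value": max(values)},
        ))
    return suggestions

rows = [{"total_czk": 120}, {"total_czk": 4500}, {"total_czk": 99}]
print(suggest_expectations(rows, "total_czk"))
```

Suggested bounds taken straight from a sample are usually too tight for production; the Profiler's output is meant as a starting point to review and loosen, not to accept blindly.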
Summary
Great Expectations has become a de facto standard for automated data validation in Python pipelines.