Data Quality
Great Expectations: automated data quality validation
Great Expectations lets you define, test, and document expectations about your data. It generates documentation automatically and integrates with Airflow, Spark, and pandas.
Why validate data quality
With Great Expectations you define rules once, and they are checked automatically on every pipeline run.
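import great_expectations as gx

# Load the project's Data Context.
context = gx.get_context()
# batch_request is assumed to be defined elsewhere; it selects the data to validate.
validator = context.get_validator(batch_request=batch_request)

# Declare the expectations for the orders data.
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between(
    "total_czk", min_value=0, max_value=10_000_000
)
# Persist the expectations so later runs can reuse them.
validator.save_expectation_suite()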
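To make the three rules concrete, here is a minimal pandas-only sketch of what they check. It is not how Great Expectations implements them: the sample data and the `check_orders` helper are hypothetical, and the sketch treats missing values as out of range, which may differ from the library's null handling.

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the expectations above.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "customer_id": ["c1", "c2", None],
        "total_czk": [199.0, 0.0, 12_500_000.0],
    }
)


def check_orders(df: pd.DataFrame) -> dict[str, bool]:
    """Return pass/fail for each of the three rules (illustrative helper)."""
    return {
        # Every order_id appears at most once.
        "order_id_unique": bool(df["order_id"].is_unique),
        # No customer_id is missing.
        "customer_id_not_null": bool(df["customer_id"].notna().all()),
        # Every total lies in the closed interval [0, 10_000_000].
        "total_czk_in_range": bool(df["total_czk"].between(0, 10_000_000).all()),
    }


results = check_orders(orders)
print(results)
```

In this sample the uniqueness rule passes, while the missing `customer_id` and the 12.5 million total make the other two rules fail.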
Integration with Airflow
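import great_expectations as gx

def validate_data():
    # Run the "daily_orders" checkpoint and fail the task if validation fails.
    context = gx.get_context()
    result = context.run_checkpoint(checkpoint_name="daily_orders")
    if not result.success:
        raise ValueError("Data quality check failed!")

# validate_task is assumed to wrap validate_data (for example in a PythonOperator);
# extract and transform are tasks defined elsewhere in the DAG.
extract >> validate_task >> transform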
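The point of raising an exception in the validation step is that a failed task stops everything downstream of it (in Airflow, under the default trigger rule). The plain-Python sketch below imitates that gating without Airflow or Great Expectations: `FakeCheckpointResult` and `run_pipeline` are hypothetical stand-ins, and only the `success` flag of the real result object is modeled.

```python
from dataclasses import dataclass


@dataclass
class FakeCheckpointResult:
    """Stand-in for a checkpoint result; only the `success` flag is modeled."""

    success: bool


def validate_data(result: FakeCheckpointResult) -> None:
    # Same gate as in the task above: fail loudly when validation fails.
    if not result.success:
        raise ValueError("Data quality check failed!")


def run_pipeline(result: FakeCheckpointResult) -> list[str]:
    """Run extract -> validate -> transform in order, stopping at the first error."""
    completed: list[str] = []
    steps = [
        ("extract", lambda: None),
        ("validate", lambda: validate_data(result)),
        ("transform", lambda: None),
    ]
    for name, step in steps:
        try:
            step()
        except ValueError:
            break  # later steps are skipped once an upstream step fails
        completed.append(name)
    return completed


ok_run = run_pipeline(FakeCheckpointResult(success=True))
failed_run = run_pipeline(FakeCheckpointResult(success=False))
print(ok_run, failed_run)
```

When validation fails, only the extract step completes, so bad data never reaches the transform step.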
Summary
Great Expectations is a widely used standard for automated data validation in Python pipelines.