Getting started
Great Expectations (GX) is a framework for describing data using expressive tests and then validating that the data meets test criteria. Astronomer maintains the Great Expectations Airflow Provider to give users a convenient method for running validations directly from their DAGs. The Great Expectations Airflow Provider has three Operators to choose from, which vary in the amount of configuration they require and the flexibility they provide.
- GXValidateDataFrameOperator
- GXValidateBatchOperator
- GXValidateCheckpointOperator
Operator use cases
When deciding which Operator best fits your use case, consider the location of the data you are validating, whether or not you need external alerts or actions to be triggered by the Operator, and what Data Context you will use. When picking a Data Context, consider whether or not you need to view how results change over time.
- If your data is in memory as a Spark or Pandas DataFrame, we recommend using the GXValidateDataFrameOperator. This option requires only a DataFrame and your Expectations to create a validation result.
- If your data is not in memory, we recommend configuring GX to connect to it by defining a BatchDefinition with the GXValidateBatchOperator. This option requires a BatchDefinition and your Expectations to create a validation result.
- If you want to trigger actions based on validation results, use the GXValidateCheckpointOperator. This option supports all features of GX Core, so it requires the most configuration: you have to define a Checkpoint, BatchDefinition, ExpectationSuite, and ValidationDefinition to get validation results.
The Operators vary in which Data Contexts they support. All 3 Operators support Ephemeral and GX Cloud Data Contexts. Only the GXValidateCheckpointOperator supports the File Data Context.
- If the results are used only within the Airflow DAG by other tasks, we recommend using an Ephemeral Data Context. The serialized Validation Result will be available within the DAG as the task result, but will not persist externally for viewing the results across multiple runs. All 3 Operators support the Ephemeral Data Context.
- To persist and view results outside of Airflow, we recommend using a Cloud Data Context. Validation Results are automatically visible in the GX Cloud UI when using a Cloud Data Context, and the task result contains a link to the stored validation result. All 3 Operators support the Cloud Data Context.
- If you want to manage Validation Results yourself, use a File Data Context. With this option, Validation Results can be viewed in Data Docs. Only the GXValidateCheckpointOperator supports the File Data Context.
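The selection rules above can be condensed into a small helper. This is only a sketch of the decision logic as described in this guide; the function name and its parameters are hypothetical and are not part of the provider.

```python
def recommend_operator(
    data_in_memory: bool,
    needs_actions: bool,
    context_type: str = "ephemeral",
) -> str:
    """Return the name of the Operator this guide recommends for a use case."""
    if context_type not in {"ephemeral", "cloud", "file"}:
        raise ValueError(f"unknown context type: {context_type!r}")
    # Actions and the File Data Context are only available with a Checkpoint.
    if needs_actions or context_type == "file":
        return "GXValidateCheckpointOperator"
    # Data already loaded as a Spark or Pandas DataFrame.
    if data_in_memory:
        return "GXValidateDataFrameOperator"
    # Otherwise GX connects to the data through a BatchDefinition.
    return "GXValidateBatchOperator"
```

For example, `recommend_operator(data_in_memory=True, needs_actions=False)` returns the DataFrame Operator, while any use case that needs a File Data Context falls through to the Checkpoint Operator.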
Prerequisites
- Python version 3.9 to 3.12
- Great Expectations version 1.3.9+
- Apache Airflow® version 2.1.0+
Assumed knowledge
To get the most out of this getting started guide, make sure you have an understanding of:
- The basics of Great Expectations. See Try GX Core.
- Airflow fundamentals, such as writing DAGs and defining tasks. See Get started with Apache Airflow.
- Airflow Operators. See Operators 101.
- Airflow connections. See Managing your Connections in Apache Airflow.
Install the provider and dependencies
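- Install the provider.
- Optionally, install the extra dependencies for the data sources you plan to use. The following backends are available as optional dependencies:
  - athena
  - azure
  - bigquery
  - gcp
  - mssql
  - postgresql
  - s3
  - snowflake
  - spark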
Configure an Operator
After deciding which Operator best fits your use case, follow the Operator-specific instructions below to configure it.
Data Frame Operator
- Import the Operator.
- Instantiate the Operator with required and optional parameters.

    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        from pandas import DataFrame


    def my_data_frame_configuration() -> "DataFrame":
        # Airflow best practice: avoid importing heavy dependencies at the top level.
        import pandas as pd

        # my_data_file is the path to your data, defined elsewhere.
        return pd.read_csv(my_data_file)


    my_data_frame_operator = GXValidateDataFrameOperator(
        task_id="my_data_frame_operator",
        configure_dataframe=my_data_frame_configuration,
        expect=my_expectation_suite,
    )

  - task_id: alphanumeric name used in the Airflow UI and GX Cloud.
  - configure_dataframe: function that returns a DataFrame to pass data to the Operator.
  - expect: either a single Expectation or an Expectation Suite to validate against your data.
  - result_format (optional): accepts BOOLEAN_ONLY, BASIC, SUMMARY, or COMPLETE to set the verbosity of returned Validation Results. Defaults to SUMMARY.
  - context_type (optional): accepts ephemeral or cloud to set the Data Context used by the Operator. Defaults to ephemeral, which does not persist results between runs. To save and view Validation Results in GX Cloud, use cloud and complete the additional Cloud Data Context configuration below.
For more details, explore this end-to-end code sample.
- If you use a Cloud Data Context, create a free GX Cloud account to get your Cloud credentials and then set the following Airflow variables.
  - GX_CLOUD_ACCESS_TOKEN
  - GX_CLOUD_ORGANIZATION_ID
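As a fuller illustration of the Data Frame Operator steps above, the sketch below fills in the pieces that the snippet leaves undefined. The sample data, the column name and bounds, the task_id, and the build_expectation and build_task helpers are all hypothetical. The import path for the Operator is an assumption based on the provider's package layout, so check it against your installed version. Library imports are kept inside the functions so that parsing the DAG file stays cheap.

```python
import io
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from pandas import DataFrame

# Hypothetical inline sample standing in for a real file or query result.
SAMPLE_CSV = "order_id,amount\n1,19.99\n2,5.00\n3,42.50\n"


def configure_dataframe() -> "DataFrame":
    """Return the DataFrame to validate."""
    import pandas as pd  # lazy import keeps DAG parsing light

    return pd.read_csv(io.StringIO(SAMPLE_CSV))


def build_expectation():
    """Build a single Expectation; the column and bounds are illustrative."""
    import great_expectations.expectations as gxe

    return gxe.ExpectColumnValuesToBeBetween(
        column="amount", min_value=0, max_value=1000
    )


def build_task():
    """Create the validation task; call this inside your DAG definition."""
    # Assumed import path; verify against the installed provider.
    from great_expectations_provider.operators.validate_dataframe import (
        GXValidateDataFrameOperator,
    )

    return GXValidateDataFrameOperator(
        task_id="validate_orders",
        configure_dataframe=configure_dataframe,
        expect=build_expectation(),
        result_format="SUMMARY",
        context_type="ephemeral",
    )
```

Passing the function itself (not its return value) as configure_dataframe means the data is loaded only when the task runs, not when the DAG file is parsed.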
Batch Operator
- Import the Operator.
- Instantiate the Operator with required and optional parameters.

    my_batch_operator = GXValidateBatchOperator(
        task_id="my_batch_operator",
        configure_batch_definition=my_batch_definition_function,
        expect=my_expectation_suite,
    )

  - task_id: alphanumeric name used in the Airflow UI and GX Cloud.
  - configure_batch_definition: function that returns a BatchDefinition to configure GX to read your data.
  - expect: either a single Expectation or an Expectation Suite to validate against your data.
  - batch_parameters (optional): dictionary that specifies a time-based Batch of data to validate your Expectations against. Defaults to the first valid Batch found, which is the most recent Batch (with default sort ascending) or the oldest Batch if the Batch Definition has been configured to sort descending.
  - result_format (optional): accepts BOOLEAN_ONLY, BASIC, SUMMARY, or COMPLETE to set the verbosity of returned Validation Results. Defaults to SUMMARY.
  - context_type (optional): accepts ephemeral or cloud to set the Data Context used by the Operator. Defaults to ephemeral, which does not persist results between runs. To save and view Validation Results in GX Cloud, use cloud and complete the additional Cloud Data Context configuration below.
For more details, explore this end-to-end code sample.
- If you use a Cloud Data Context, create a free GX Cloud account to get your Cloud credentials and then set the following Airflow variables.
  - GX_CLOUD_ACCESS_TOKEN
  - GX_CLOUD_ORGANIZATION_ID
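To make the Batch Operator parameters more concrete, here is a minimal sketch of a configure_batch_definition function for monthly CSV files, together with matching batch_parameters. It assumes that the Operator calls the function with the Data Context as its only argument and that the Operator lives at the import path shown; both are assumptions to verify against your installed provider. The directory, file pattern, names, and the build_task helper are hypothetical.

```python
import re
from pathlib import Path

# Hypothetical file naming scheme: orders_2024-01.csv, orders_2024-02.csv, ...
FILE_PATTERN = r"orders_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

# Selects one monthly Batch; the keys match the named groups in FILE_PATTERN.
BATCH_PARAMETERS = {"year": "2024", "month": "01"}


def configure_batch_definition(context):
    """Register a data source, a CSV asset, and a monthly Batch Definition."""
    return (
        context.data_sources.add_pandas_filesystem(
            name="orders_files", base_directory=Path("data/orders")
        )
        .add_csv_asset(name="orders_csv")
        .add_batch_definition_monthly(
            name="orders_monthly", regex=re.compile(FILE_PATTERN)
        )
    )


def build_task():
    """Create the validation task; call this inside your DAG definition."""
    import great_expectations.expectations as gxe

    # Assumed import path; verify against the installed provider.
    from great_expectations_provider.operators.validate_batch import (
        GXValidateBatchOperator,
    )

    return GXValidateBatchOperator(
        task_id="validate_orders_batch",
        configure_batch_definition=configure_batch_definition,
        expect=gxe.ExpectColumnValuesToNotBeNull(column="order_id"),
        batch_parameters=BATCH_PARAMETERS,
    )
```

Omitting batch_parameters would leave the choice of Batch to the default behavior described in the parameter list above.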
Checkpoint Operator
- Import the Operator.
- Instantiate the Operator with required and optional parameters.

    my_checkpoint_operator = GXValidateCheckpointOperator(
        task_id="my_checkpoint_operator",
        configure_checkpoint=my_checkpoint_function,
    )

  - task_id: alphanumeric name used in the Airflow UI and GX Cloud.
  - configure_checkpoint: function that returns a Checkpoint, which orchestrates a ValidationDefinition, BatchDefinition, and ExpectationSuite. The Checkpoint can also specify a Result Format and trigger actions based on Validation Results.
  - batch_parameters (optional): dictionary that specifies a time-based Batch of data to validate your Expectations against. Defaults to the first valid Batch found, which is the most recent Batch (with default sort ascending) or the oldest Batch if the Batch Definition has been configured to sort descending.
  - context_type (optional): accepts ephemeral, cloud, or file to set the Data Context used by the Operator. Defaults to ephemeral, which does not persist results between runs. To save and view Validation Results in GX Cloud, use cloud and complete the additional Cloud Data Context configuration below. To manage Validation Results yourself, use file and complete the additional File Data Context configuration below.
  - configure_file_data_context (optional): function that returns a FileDataContext. Applicable only when using a File Data Context. See the additional File Data Context configuration below for more information.
For more details, explore this end-to-end code sample.
- If you use a Cloud Data Context, create a free GX Cloud account to get your Cloud credentials and then set the following Airflow variables.
  - GX_CLOUD_ACCESS_TOKEN
  - GX_CLOUD_ORGANIZATION_ID
- If you use a File Data Context, pass the configure_file_data_context parameter. This takes a function that returns a FileDataContext. By default, GX will write results in the configuration directory. If you are retrieving your FileDataContext from a remote location, you can yield the FileDataContext in the configure_file_data_context function and write the directory back to the remote after control is returned to the generator.
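The generator pattern for a remotely stored File Data Context can be sketched as follows. This is an illustration under several assumptions, not the provider's reference implementation: a local path stands in for the remote location, the pull_project, push_project, and build_task helpers are hypothetical, the Checkpoint name is made up and presumed to exist in the stored project already, and the Operator import path should be verified against your installed version.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical location standing in for remote storage, such as a mounted bucket.
REMOTE_PROJECT_DIR = Path("/mnt/shared/gx_project")


def pull_project(remote: Path, local: Path) -> None:
    """Copy the stored GX project to a local working directory, if it exists."""
    if remote.exists():
        shutil.copytree(remote, local, dirs_exist_ok=True)
    else:
        local.mkdir(parents=True, exist_ok=True)


def push_project(local: Path, remote: Path) -> None:
    """Write the local GX project, including new results, back to storage."""
    shutil.copytree(local, remote, dirs_exist_ok=True)


def configure_file_data_context():
    """Yield a FileDataContext, then persist its directory afterwards."""
    import great_expectations as gx  # lazy import keeps DAG parsing light

    with tempfile.TemporaryDirectory() as tmp:
        local_dir = Path(tmp) / "gx_project"
        pull_project(REMOTE_PROJECT_DIR, local_dir)
        yield gx.get_context(mode="file", project_root_dir=local_dir)
        # Execution resumes here once control returns to the generator.
        push_project(local_dir, REMOTE_PROJECT_DIR)


def configure_checkpoint(context):
    """Fetch a Checkpoint that is already stored in the project."""
    return context.checkpoints.get("orders_checkpoint")  # hypothetical name


def build_task():
    """Create the validation task; call this inside your DAG definition."""
    # Assumed import path; verify against the installed provider.
    from great_expectations_provider.operators.validate_checkpoint import (
        GXValidateCheckpointOperator,
    )

    return GXValidateCheckpointOperator(
        task_id="validate_orders_checkpoint",
        configure_checkpoint=configure_checkpoint,
        context_type="file",
        configure_file_data_context=configure_file_data_context,
    )
```

Because the write-back happens after the yield, it runs only once validation has finished, so the directory that is copied back includes the results of the current run.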
Add the configured Operator to a DAG
After configuring an Operator, add it to a DAG. Explore our example DAGs, which have sample tasks that demonstrate Operator functionality.
Note that the shape of the Validation Results depends on both the Operator type and whether or not you set the optional result_format parameter.
- GXValidateDataFrameOperator and GXValidateBatchOperator return a serialized ExpectationSuiteValidationResult.
- GXValidateCheckpointOperator returns a CheckpointResult.
- The included fields depend on the Result Format verbosity.
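A downstream task can inspect the serialized result as a plain dictionary. The helper below is a hypothetical sketch for the result returned by the DataFrame and Batch Operators. It assumes that the serialized ExpectationSuiteValidationResult exposes a top-level success flag and a statistics mapping with evaluated_expectations and successful_expectations, and it falls back to a shorter message when those fields are absent.

```python
def summarize_validation(result: dict) -> str:
    """Build a one-line summary from a serialized validation result."""
    status = "passed" if result.get("success") else "failed"
    # Assumed field names; they may be missing depending on the result shape.
    stats = result.get("statistics") or {}
    evaluated = stats.get("evaluated_expectations")
    successful = stats.get("successful_expectations")
    if evaluated is None or successful is None:
        return f"validation {status}"
    return f"validation {status}: {successful}/{evaluated} expectations met"
```

Reading fields with `.get` keeps the helper tolerant of differences between result shapes and Result Format settings.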
Run the DAG
Trigger the DAG manually or run it on a schedule to start validating your data against your Expectations.