Getting started

Great Expectations (GX) is a framework for describing data using expressive tests and then validating that the data meets test criteria. Astronomer maintains the Great Expectations Airflow Provider to give users a convenient method for running validations directly from their DAGs. The Great Expectations Airflow Provider has three Operators to choose from, which vary in the amount of configuration they require and the flexibility they provide.

  • GXValidateDataFrameOperator
  • GXValidateBatchOperator
  • GXValidateCheckpointOperator

Operator use cases

When deciding which Operator best fits your use case, consider the location of the data you are validating, whether or not you need external alerts or actions to be triggered by the Operator, and what Data Context you will use. When picking a Data Context, consider whether or not you need to view how results change over time.

  • If your data is in memory as a Spark or Pandas DataFrame, we recommend using the GXValidateDataFrameOperator. This option requires only a DataFrame and your Expectations to create a validation result.
  • If your data is not in memory, we recommend configuring GX to connect to it by defining a BatchDefinition with the GXValidateBatchOperator. This option requires a BatchDefinition and your Expectations to create a validation result.
  • If you want to trigger actions based on validation results, use the GXValidateCheckpointOperator. This option supports all features of GX Core, so it requires the most configuration - you have to define a Checkpoint, BatchDefinition, ExpectationSuite, and ValidationDefinition to get validation results.

The Operators vary in which Data Contexts they support. All three Operators support Ephemeral and GX Cloud Data Contexts. Only the GXValidateCheckpointOperator supports the File Data Context.

  • If the results are used only within the Airflow DAG by other tasks, we recommend using an Ephemeral Data Context. The serialized Validation Result will be available within the DAG as the task result, but will not persist externally for viewing results across multiple runs. All three Operators support the Ephemeral Data Context.
  • To persist and view results outside of Airflow, we recommend using a Cloud Data Context. Validation Results are automatically visible in the GX Cloud UI when using a Cloud Data Context, and the task result contains a link to the stored Validation Result. All three Operators support the Cloud Data Context.
  • If you want to manage Validation Results yourself, use a File Data Context. With this option, Validation Results can be viewed in Data Docs. Only the GXValidateCheckpointOperator supports the File Data Context.

Prerequisites

Assumed knowledge

To get the most out of this getting started guide, make sure you have an understanding of the basics of Great Expectations, as well as Airflow fundamentals such as DAGs and Operators.

Install the provider and dependencies

  1. Install the provider.

    pip install airflow-provider-great-expectations

  2. (Optional) Install additional dependencies for the data sources you’ll use. For example, to install the optional Snowflake dependency, use the following command.

    pip install "airflow-provider-great-expectations[snowflake]"

    The following backends are supported as optional dependencies: athena, azure, bigquery, gcp, mssql, postgresql, s3, snowflake, and spark.

Configure an Operator

After deciding which Operator best fits your use case, follow the Operator-specific instructions below to configure it.

Data Frame Operator

  1. Import the Operator.

    from great_expectations_provider.operators.validate_dataframe import (
        GXValidateDataFrameOperator,
    )
    
  2. Instantiate the Operator with required and optional parameters.

    from typing import TYPE_CHECKING
    
    if TYPE_CHECKING:
        from pandas import DataFrame
    
    def my_data_frame_configuration() -> DataFrame:
        import pandas as pd  # Airflow best practice: don't import heavy dependencies at the top level
        return pd.read_csv(my_data_file)
    
    my_data_frame_operator = GXValidateDataFrameOperator(
        task_id="my_data_frame_operator",
        configure_dataframe=my_data_frame_configuration,
        expect=my_expectation_suite,
    )
    
    • task_id: alphanumeric name used in the Airflow UI and GX Cloud.
    • configure_dataframe: function that returns a DataFrame to pass data to the Operator.
    • expect: either a single Expectation or an Expectation Suite to validate against your data.
    • result_format (optional): accepts BOOLEAN_ONLY, BASIC, SUMMARY, or COMPLETE to set the verbosity of returned Validation Results. Defaults to SUMMARY.
    • context_type (optional): accepts ephemeral or cloud to set the Data Context used by the Operator. Defaults to ephemeral, which does not persist results between runs. To save and view Validation Results in GX Cloud, use cloud and complete the additional Cloud Data Context configuration below.

    For more details, explore this end-to-end code sample.

  3. If you use a Cloud Data Context, create a free GX Cloud account to get your Cloud credentials and then set the following Airflow variables.

    • GX_CLOUD_ACCESS_TOKEN
    • GX_CLOUD_ORGANIZATION_ID
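
    For example, you can set both variables with the Airflow CLI (the values shown are placeholders):

        airflow variables set GX_CLOUD_ACCESS_TOKEN <your-access-token>
        airflow variables set GX_CLOUD_ORGANIZATION_ID <your-organization-id>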

Batch Operator

  1. Import the Operator.

    from great_expectations_provider.operators.validate_batch import (
        GXValidateBatchOperator,
    )
    
  2. Instantiate the Operator with required and optional parameters.

    my_batch_operator = GXValidateBatchOperator(
        task_id="my_batch_operator",
        configure_batch_definition=my_batch_definition_function,
        expect=my_expectation_suite,
    )
    
    • task_id: alphanumeric name used in the Airflow UI and GX Cloud.
    • configure_batch_definition: function that returns a BatchDefinition to configure GX to read your data, as shown in the sketch after these steps.
    • expect: either a single Expectation or an Expectation Suite to validate against your data.
    • batch_parameters (optional): dictionary that specifies a time-based Batch of data to validate your Expectations against. Defaults to the first valid Batch found: with the default ascending sort, this is the most recent Batch; if the Batch Definition is configured to sort descending, it is the oldest Batch.
    • result_format (optional): accepts BOOLEAN_ONLY, BASIC, SUMMARY, or COMPLETE to set the verbosity of returned Validation Results. Defaults to SUMMARY.
    • context_type (optional): accepts ephemeral or cloud to set the Data Context used by the Operator. Defaults to ephemeral, which does not persist results between runs. To save and view Validation Results in GX Cloud, use cloud and complete the additional Cloud Data Context configuration below.

    For more details, explore this end-to-end code sample.

  3. If you use a Cloud Data Context, create a free GX Cloud account to get your Cloud credentials and then set the following Airflow variables.

    • GX_CLOUD_ACCESS_TOKEN
    • GX_CLOUD_ORGANIZATION_ID
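
A minimal sketch of a Batch Definition configuration function for step 2, assuming your data lives in a Postgres table: the data source, asset, and environment variable names are hypothetical, and the signature assumes the Operator passes the active Data Context to your function.

    from great_expectations.core.batch_definition import BatchDefinition
    from great_expectations.data_context import AbstractDataContext

    def my_batch_definition_function(context: AbstractDataContext) -> BatchDefinition:
        # Register a hypothetical Postgres data source and describe one table.
        data_source = context.data_sources.add_postgres(
            name="my_postgres",
            connection_string="${POSTGRES_CONNECTION_STRING}",  # substituted from config/env
        )
        table_asset = data_source.add_table_asset(name="taxi_data", table_name="taxi_data")
        # Treat the whole table as a single Batch to validate.
        return table_asset.add_batch_definition_whole_table(name="whole_table")

If you instead create a time-based Batch Definition (for example, with add_batch_definition_daily on a datetime column), you can pass batch_parameters such as {"year": 2024, "month": 12, "day": 31} to select a single day's Batch.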

Checkpoint Operator

  1. Import the Operator.

    from great_expectations_provider.operators.validate_checkpoint import (
        GXValidateCheckpointOperator,
    )
    
  2. Instantiate the Operator with required and optional parameters.

    my_checkpoint_operator = GXValidateCheckpointOperator(
        task_id="my_checkpoint_operator",
        configure_checkpoint=my_checkpoint_function,
    )
    
    • task_id: alphanumeric name used in the Airflow UI and GX Cloud.
    • configure_checkpoint: function that returns a Checkpoint, which orchestrates a ValidationDefinition, BatchDefinition, and ExpectationSuite. The Checkpoint can also specify a Result Format and trigger actions based on Validation Results. A sketch follows these steps.
    • batch_parameters (optional): dictionary that specifies a time-based Batch of data to validate your Expectations against. Defaults to the first valid Batch found: with the default ascending sort, this is the most recent Batch; if the Batch Definition is configured to sort descending, it is the oldest Batch.
    • context_type (optional): accepts ephemeral, cloud, or file to set the Data Context used by the Operator. Defaults to ephemeral, which does not persist results between runs. To save and view Validation Results in GX Cloud, use cloud and complete the additional Cloud Data Context configuration below. To manage Validation Results yourself, use file and complete the additional File Data Context configuration below.
    • configure_file_data_context (optional): function that returns a FileDataContext. Applicable only when using a File Data Context. See the additional File Data Context configuration below for more information.

    For more details, explore this end-to-end code sample.

  3. If you use a Cloud Data Context, create a free GX Cloud account to get your Cloud credentials and then set the following Airflow variables.

    • GX_CLOUD_ACCESS_TOKEN
    • GX_CLOUD_ORGANIZATION_ID
  4. If you use a File Data Context, pass the configure_file_data_context parameter. This takes a function that returns a FileDataContext. By default, GX writes results to the configuration directory. If you retrieve your FileDataContext from a remote location, you can yield the FileDataContext in the configure_file_data_context function and write the directory back to the remote location after control is returned to the generator, as in the sketch below.
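
A minimal sketch of a Checkpoint configuration function for step 2, assuming a Postgres table and a Slack alert on failure: the data source, suite, and webhook names are hypothetical, and the signature assumes the Operator passes the active Data Context to your function.

    import great_expectations.expectations as gxe
    from great_expectations import Checkpoint, ExpectationSuite, ValidationDefinition
    from great_expectations.checkpoint import SlackNotificationAction
    from great_expectations.data_context import AbstractDataContext

    def my_checkpoint_function(context: AbstractDataContext) -> Checkpoint:
        # Data: a hypothetical Postgres table validated as one whole-table Batch.
        batch_definition = (
            context.data_sources.add_postgres(
                name="my_postgres",
                connection_string="${POSTGRES_CONNECTION_STRING}",
            )
            .add_table_asset(name="taxi_data", table_name="taxi_data")
            .add_batch_definition_whole_table(name="whole_table")
        )
        suite = context.suites.add(
            ExpectationSuite(
                name="my_suite",
                expectations=[gxe.ExpectColumnValuesToNotBeNull(column="vendor_id")],
            )
        )
        validation_definition = context.validation_definitions.add(
            ValidationDefinition(name="my_validation", data=batch_definition, suite=suite)
        )
        # The Checkpoint ties the pieces together and triggers actions on results.
        return context.checkpoints.add(
            Checkpoint(
                name="my_checkpoint",
                validation_definitions=[validation_definition],
                actions=[
                    SlackNotificationAction(
                        name="slack_alert",
                        slack_webhook="${SLACK_WEBHOOK}",  # hypothetical secret
                        notify_on="failure",
                    )
                ],
            )
        )

And for step 4, a sketch of the generator pattern for a FileDataContext stored remotely; both helpers are hypothetical stand-ins for your own storage logic.

    import great_expectations as gx

    def my_file_data_context_function():
        # Fetch the GX project directory from remote storage (hypothetical helper).
        project_dir = download_gx_project_from_remote()
        yield gx.get_context(mode="file", project_root_dir=project_dir)
        # Control returns here after validation completes; persist updated
        # configuration and results back to remote storage (hypothetical helper).
        upload_gx_project_to_remote(project_dir)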

Add the configured Operator to a DAG

After configuring an Operator, add it to a DAG. Explore our example DAGs, which have sample tasks that demonstrate Operator functionality.
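
For example, here is a minimal sketch of a DAG that runs the Data Frame Operator configured above; the DAG id, schedule, and inline data are hypothetical stand-ins.

    import pendulum
    from airflow import DAG
    from great_expectations import expectations as gxe
    from great_expectations_provider.operators.validate_dataframe import (
        GXValidateDataFrameOperator,
    )

    def my_data_frame_configuration():
        import pandas as pd  # keep heavy imports out of the top level
        return pd.DataFrame({"passenger_count": [1, 2, 3]})

    with DAG(
        dag_id="gx_validation_example",
        start_date=pendulum.datetime(2025, 1, 1),
        schedule=None,
        catchup=False,
    ):
        GXValidateDataFrameOperator(
            task_id="validate_dataframe",
            configure_dataframe=my_data_frame_configuration,
            expect=gxe.ExpectColumnValuesToBeBetween(
                column="passenger_count", min_value=0, max_value=6
            ),
        )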

Note that the shape of the Validation Results depends on both the Operator type and whether or not you set the optional result_format parameter.

  • GXValidateDataFrameOperator and GXValidateBatchOperator return a serialized ExpectationSuiteValidationResult.
  • GXValidateCheckpointOperator returns a CheckpointResult.
  • In both cases, the fields included depend on the Result Format verbosity.
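
Downstream tasks can read the serialized result from XCom. A sketch, assuming the DAG above and that the serialized result exposes a top-level success field:

    from airflow.decorators import task

    @task
    def report_validation(ti=None):
        # Pull the serialized Validation Result pushed by the GX task.
        result = ti.xcom_pull(task_ids="validate_dataframe")
        if not result["success"]:
            raise ValueError("Data validation failed")

Instantiate report_validation() inside the DAG body after the GX task and order the two with >>.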

Run the DAG

Trigger the DAG manually or run it on a schedule to start validating your data against your Expectations.