Caching

This page explains the caching strategies used by Astronomer Cosmos.

All Cosmos caching mechanisms can be enabled or turned off in the airflow.cfg file or using environment variables.

Note

For more information, see configuring a Cosmos project.

Depending on its version, Cosmos creates a cache for the following types of data:

  • The dbt ls output

  • The dbt partial_parse.msgpack file

To turn off all caching in Cosmos, export the environment variable AIRFLOW__COSMOS__ENABLE_CACHE=0. Individual cache types can also be disabled, as explained below.
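Airflow maps an environment variable named AIRFLOW__{SECTION}__{KEY} to the option {key} in the [{section}] block of airflow.cfg, which is why the Cosmos switches all start with AIRFLOW__COSMOS__. As a minimal sketch, the hypothetical helper `airflow_env_var` below (not part of Cosmos or Airflow) builds such a name and sets the global switch for the current process. In practice these variables are usually exported in the deployment environment rather than set from Python.

```python
import os


def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name that Airflow maps to [section] key."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Hypothetical usage: disable every Cosmos cache for this process.
# Assumed equivalent airflow.cfg entry:
#   [cosmos]
#   enable_cache = False
global_switch = airflow_env_var("cosmos", "enable_cache")
os.environ[global_switch] = "0"

print(global_switch)  # AIRFLOW__COSMOS__ENABLE_CACHE
```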

Caching the dbt ls output

(Introduced in Cosmos 1.5)

While parsing a dbt project using LoadMode.DBT_LS, Cosmos runs dbt ls in a subprocess. This operation can be very costly: it increases DAG parsing time, which affects both the scheduler's DAG processing and task queueing time.

Cosmos 1.5 introduced a feature that mitigates the performance cost of LoadMode.DBT_LS by caching the output of this command as an Airflow Variable. In an initial analysis, enabling this setting reduced task queueing time for some DAGs from 30s to 0s. Some users also reported an 84% improvement in DAG run time.

This feature is on by default. To turn it off, export the following environment variable: AIRFLOW__COSMOS__ENABLE_CACHE_DBT_LS=0.

(Introduced in Cosmos 1.6 - Experimental feature)

Starting with Cosmos 1.6.0, users can also store this cache in a remote directory instead of Airflow Variables. To do so, configure a remote cache directory; see the remote_cache_dir and remote_cache_dir_conn_id settings for more information. This experimental feature was introduced to gather user feedback, and remote_cache_dir will eventually be merged into the cache_dir setting in an upcoming release.

How the cache is refreshed

With the default Variables-based cache, users can purge the cache through the Airflow UI by finding and deleting the corresponding cache key. With the remote_cache_dir approach introduced in Cosmos 1.6.0, the cache can be deleted by removing the relevant files under the configured path in the remote store.

Cosmos will refresh the cache in a few circumstances:

  • if any files of the dbt project change

  • if one of the arguments that affect the dbt ls command execution changes

To determine whether the dbt project changed, Cosmos compares MD5 hashes computed over the files in the project directory.

Additionally, if any of the following DAG configurations change, Cosmos automatically purges the cache of the DAGs that use that configuration:
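The idea of detecting project changes through file hashes can be pictured with a short sketch. It is not the Cosmos implementation: the function name `hash_project_dir` and the choice to fold each file's relative path and contents into a single MD5 digest are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_project_dir(project_dir: Path) -> str:
    """Return one MD5 hex digest covering every file under project_dir."""
    digest = hashlib.md5()
    # Sort the paths so the digest does not depend on filesystem ordering.
    for path in sorted(p for p in project_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(project_dir).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "models").mkdir()
    model = project / "models" / "customers.sql"
    model.write_text("select 1 as id")

    before = hash_project_dir(project)
    unchanged = hash_project_dir(project)

    model.write_text("select 2 as id")  # simulate an edit to the dbt project
    after = hash_project_dir(project)

print(before == unchanged)  # True: same files, same digest
print(before != after)      # True: a changed file yields a new digest
```

A stored digest that no longer matches the freshly computed one is the signal that a cached result should be rebuilt.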

  • ProjectConfig.dbt_vars

  • ProjectConfig.env_vars

  • ProjectConfig.partial_parse

  • RenderConfig.env_vars

  • RenderConfig.exclude

  • RenderConfig.select

  • RenderConfig.selector

Finally, users can name specific Airflow Variables whose changes should trigger recreation of the cache by using:

  • RenderConfig.airflow_vars_to_purge_cache

Example:

RenderConfig(airflow_vars_to_purge_cache=["refresh_cache"])

Cleaning up stale cache

Cosmos DbtDag and DbtTaskGroup instances are often renamed or deleted. In those cases, the delete_unused_dbt_ls_cache method can be used to clean up the Airflow metadata database.

The method deletes the Cosmos cache stored in Airflow Variables based on the last execution of their associated DAGs.
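The selection rule can be sketched without Airflow. The snippet below is illustrative only: `select_stale_keys` and the in-memory `last_run_by_key` mapping are hypothetical stand-ins for whatever lookups the real method performs against the metadata database.

```python
from datetime import datetime, timedelta, timezone


def select_stale_keys(last_run_by_key, max_age_last_usage, now):
    """Return cache keys whose associated DAG last ran before the cutoff."""
    cutoff = now - max_age_last_usage
    return sorted(
        key for key, last_run in last_run_by_key.items() if last_run < cutoff
    )


# Made-up data: one DAG that stopped running a month ago, one still active.
now = datetime(2024, 6, 10, tzinfo=timezone.utc)
last_run_by_key = {
    "cosmos_cache__renamed_dag": now - timedelta(days=30),
    "cosmos_cache__active_dag": now - timedelta(hours=6),
}

stale = select_stale_keys(last_run_by_key, timedelta(days=5), now)
print(stale)  # ['cosmos_cache__renamed_dag']
```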

As an example, the following clean-up DAG will delete any cache associated with Cosmos that has not been used for the last five days:

from datetime import datetime, timedelta

from airflow.decorators import dag, task

from cosmos.cache import delete_unused_dbt_ls_cache, delete_unused_dbt_ls_remote_cache_files


@dag(
    schedule_interval="0 0 * * 0",  # Runs every Sunday
    start_date=datetime(2023, 1, 1),
    catchup=False,
    tags=["example"],
)
def example_cosmos_cleanup_dag():

    @task()
    def clear_db_ls_cache(session=None):
        """
        Delete the dbt ls cache that has not been used for the last five days.
        """
        delete_unused_dbt_ls_cache(max_age_last_usage=timedelta(days=5))

    clear_db_ls_cache()

    @task()
    def clear_db_ls_remote_cache(session=None):
        """
        Delete the dbt ls remote cache files that have not been used for the last five days.
        """
        delete_unused_dbt_ls_remote_cache_files(max_age_last_usage=timedelta(days=5))

    clear_db_ls_remote_cache()


# Instantiate the DAG so that Airflow registers it.
example_cosmos_cleanup_dag()


Cache key

The Airflow Variables that hold the dbt ls cache are prefixed with cosmos_cache. When using DbtDag, the key contains the DAG name. When using DbtTaskGroup, the key contains the DAG name, any parent TaskGroups, and the TaskGroup itself.

Examples:

  • The DbtDag “basic_cosmos_dag” will have the cache key “cosmos_cache__basic_cosmos_dag”.

  • The DbtTaskGroup “customers” declared inside the DAG “basic_cosmos_task_group” will have the cache key “cosmos_cache__basic_cosmos_task_group__customers”.
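The naming pattern in these examples can be reproduced with a small helper. This is a sketch inferred from the sample keys, not the Cosmos implementation: `build_cache_key` is a hypothetical function, and the double-underscore separator is taken from the examples.

```python
def build_cache_key(dag_id, task_group_ids=()):
    """Join the prefix, DAG name, and any nested TaskGroup names with '__'."""
    parts = ["cosmos_cache", dag_id, *task_group_ids]
    return "__".join(parts)


print(build_cache_key("basic_cosmos_dag"))
# cosmos_cache__basic_cosmos_dag
print(build_cache_key("basic_cosmos_task_group", ["customers"]))
# cosmos_cache__basic_cosmos_task_group__customers
```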

Cache value

The cache values contain a few properties:

  • last_modified is a timestamp in ISO 8601 format.

  • version is a hash that represents the version of the dbt project and the arguments used to run dbt ls at the time Cosmos created the cache.

  • dbt_ls_compressed is the dbt ls output compressed using zlib and encoded to base64, so Cosmos can record the value as a compressed string in the Airflow metadata database.

  • dag_id is the DAG associated with this cache.

  • task_group_id is the TaskGroup associated with this cache.

  • cosmos_type is either DbtDag or DbtTaskGroup.
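The dbt_ls_compressed encoding can be reproduced with the standard library. The sketch below assumes the output is UTF-8 text and uses hypothetical helper names and a made-up sample line; the exact compression settings Cosmos uses are not shown here.

```python
import base64
import zlib


def compress_dbt_ls(output: str) -> str:
    """Compress text with zlib and encode it as a base64 ASCII string."""
    return base64.b64encode(zlib.compress(output.encode("utf-8"))).decode("ascii")


def decompress_dbt_ls(encoded: str) -> str:
    """Reverse compress_dbt_ls: base64-decode, then zlib-decompress."""
    return zlib.decompress(base64.b64decode(encoded)).decode("utf-8")


# Made-up, repetitive sample standing in for real dbt ls output.
sample = '{"name": "customers", "resource_type": "model"}\n' * 50

encoded = compress_dbt_ls(sample)
print(decompress_dbt_ls(encoded) == sample)  # True: the round trip is lossless
print(len(encoded) < len(sample))            # True for this repetitive sample
```

Base64 keeps the compressed bytes storable as plain text, at the cost of roughly a third more characters than the raw compressed payload.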

Caching the partial parse file

(Introduced in Cosmos 1.4)

After parsing the dbt project, dbt stores an internal project manifest in a file called partial_parse.msgpack (official docs). This file contributes significantly to the performance of running dbt commands when the dbt project did not change.

Cosmos 1.4 introduced support for partial parse files in two ways: it uses files provided by the user, and it stores the file generated after running dbt commands in a temporary folder on the disk of the Airflow scheduler and worker nodes.

Users can customize where to store the cache using the setting AIRFLOW__COSMOS__CACHE_DIR.

It is possible to switch off this feature by exporting the environment variable AIRFLOW__COSMOS__ENABLE_CACHE_PARTIAL_PARSE=0.

For more information, read the Cosmos partial parsing documentation.

Caching the profiles

(Introduced in Cosmos 1.5)

Cosmos 1.5 introduced support for profile caching: the profile mapping is cached in the path specified by the settings AIRFLOW__COSMOS__CACHE_DIR and AIRFLOW__COSMOS__PROFILE_CACHE_DIR_NAME. This feature facilitates the reuse of Airflow connections and profiles.yml.

Users can customize the cache storage location by changing these two settings.
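One way to picture how the two settings combine is the sketch below. It assumes the profile cache directory name is a subdirectory of the cache directory, which is an inference from the setting names rather than something stated here. The helper `resolve_profile_cache_dir` is hypothetical, and the fallback values are placeholders, not documented Cosmos defaults.

```python
import tempfile
from pathlib import Path


def resolve_profile_cache_dir(env):
    """Combine the two settings into one directory path (assumed layout)."""
    # Placeholder fallbacks, used only when a setting is missing from env.
    fallback_cache_dir = str(Path(tempfile.gettempdir()) / "cosmos")
    cache_dir = Path(env.get("AIRFLOW__COSMOS__CACHE_DIR", fallback_cache_dir))
    dir_name = env.get("AIRFLOW__COSMOS__PROFILE_CACHE_DIR_NAME", "profile")
    return cache_dir / dir_name


# Made-up values for illustration.
env = {
    "AIRFLOW__COSMOS__CACHE_DIR": "/opt/airflow/cache",
    "AIRFLOW__COSMOS__PROFILE_CACHE_DIR_NAME": "profile_cache",
}
print(resolve_profile_cache_dir(env).as_posix())
# /opt/airflow/cache/profile_cache
```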

To disable this feature, set the environment variable AIRFLOW__COSMOS__ENABLE_CACHE_PROFILE=False.