Caching#
This page explains the caching strategies used in astronomer-cosmos.
All Cosmos caching mechanisms can be enabled or turned off in the airflow.cfg
file or using environment variables.
Note
For more information, see configuring a Cosmos project.
Depending on the Cosmos version, it creates a cache for two types of data:
The dbt ls output
The dbt partial_parse.msgpack file
It is possible to turn off all caches in Cosmos by exporting the environment variable AIRFLOW__COSMOS__ENABLE_CACHE=0.
Disabling individual types of cache in Cosmos is also possible, as explained below.
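Airflow maps environment variables named AIRFLOW__{SECTION}__{KEY} onto configuration options, so AIRFLOW__COSMOS__ENABLE_CACHE corresponds to the enable_cache option in the [cosmos] section of airflow.cfg. The following is a minimal sketch of that naming convention only; the helper airflow_env_to_config is hypothetical and is not part of Airflow or Cosmos.

```python
def airflow_env_to_config(env_name: str) -> tuple[str, str]:
    """Split an AIRFLOW__{SECTION}__{KEY} variable name into (section, key)."""
    prefix = "AIRFLOW__"
    if not env_name.startswith(prefix):
        raise ValueError(f"not an Airflow configuration variable: {env_name}")
    # The first double underscore after the prefix separates section from key.
    section, separator, key = env_name[len(prefix):].partition("__")
    if not separator or not section or not key:
        raise ValueError(f"expected AIRFLOW__SECTION__KEY, got: {env_name}")
    return section.lower(), key.lower()


# Environment variables mentioned on this page.
for name in (
    "AIRFLOW__COSMOS__ENABLE_CACHE",
    "AIRFLOW__COSMOS__ENABLE_CACHE_DBT_LS",
    "AIRFLOW__COSMOS__ENABLE_CACHE_PARTIAL_PARSE",
):
    section, key = airflow_env_to_config(name)
    print(f"{name} -> [{section}] {key}")
```

An environment variable takes precedence over the same option set in airflow.cfg, which makes it a convenient way to toggle a cache per deployment.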
Caching the dbt ls output#
(Introduced in Cosmos 1.5)
While parsing a dbt project using LoadMode.DBT_LS, Cosmos uses subprocess to run dbt ls.
This operation can be very costly: it can increase DAG parsing times and affect not only the scheduler's DAG processing but also task queueing time.
Cosmos 1.5 introduced a feature to mitigate the performance issue associated with LoadMode.DBT_LS by caching the output of this command as an Airflow Variable.
Based on an initial analysis, enabling this setting reduced task queueing time for some DAGs from 30s to 0s. Additionally, some users reported improvements of 84% in DAG run time.
This feature is on by default. To turn it off, export the following environment variable: AIRFLOW__COSMOS__ENABLE_CACHE_DBT_LS=0.
(Introduced in Cosmos 1.6 - Experimental feature)
Starting with Cosmos 1.6.0, users can also store this cache in a remote directory instead of in Airflow Variables. To do so, configure a remote cache directory; see the remote_cache_dir and remote_cache_dir_conn_id settings for more information. This is an experimental feature introduced in 1.6.0 to gather user feedback. The remote_cache_dir setting will eventually be merged into the cache_dir setting in upcoming releases.
How the cache is refreshed
If using the default Variables cache approach, users can purge the cache via the Airflow UI by identifying and deleting the cache key. If you are using the alternative remote_cache_dir approach introduced in Cosmos 1.6.0, you can delete the cache by locating and removing the corresponding files under your configured path in the remote store.
Cosmos will refresh the cache in a few circumstances:
if any files of the dbt project change
if one of the arguments that affect the dbt ls command execution changes
To determine whether the dbt project changed, Cosmos computes a hash based on the MD5 of all the files in the project directory.
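To illustrate the idea of detecting project changes from file hashes, here is a minimal sketch that folds the MD5 of every file in a directory into one digest. It is not Cosmos's actual implementation: the function name directory_md5, the choice to include relative paths, and the sample project layout are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_md5(root: Path) -> str:
    """Return one MD5 hex digest covering every file under root."""
    digest = hashlib.md5()
    # Sort the paths so the result does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renaming a file also changes the digest.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(hashlib.md5(path.read_bytes()).digest())
    return digest.hexdigest()


# Hypothetical dbt project with a single model, hashed before and after an edit.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "models").mkdir()
    model = project / "models" / "customers.sql"
    model.write_text("select 1 as id")
    before = directory_md5(project)
    model.write_text("select 2 as id")
    after = directory_md5(project)

# A differing digest is the signal that a cache built from the project is stale.
print("project changed:", before != after)
```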
Additionally, if any of the following DAG configurations change, Cosmos automatically purges the cache of the DAGs that use that specific configuration:
ProjectConfig.dbt_vars
ProjectConfig.env_vars
ProjectConfig.partial_parse
RenderConfig.env_vars
RenderConfig.exclude
RenderConfig.select
RenderConfig.selector
Finally, if users would like to define specific Airflow variables that, if changed, will cause the recreation of the cache, they can specify those by using:
RenderConfig.airflow_vars_to_purge_cache
Example:
RenderConfig(airflow_vars_to_purge_cache=["refresh_cache"])
Cleaning up stale cache
Cosmos DbtDags and DbtTaskGroups are often renamed or deleted. In those cases, the method delete_unused_dbt_ls_cache can be used to clean up the Airflow metadata database.
The method deletes the Cosmos cache stored in Airflow Variables based on the last execution of their associated DAGs.
As an example, the following clean-up DAG will delete any cache associated with Cosmos that has not been used for the last five days:
from datetime import datetime, timedelta

from airflow.decorators import dag, task

from cosmos.cache import delete_unused_dbt_ls_cache, delete_unused_dbt_ls_remote_cache_files


@dag(
    schedule_interval="0 0 * * 0",  # Runs every Sunday
    start_date=datetime(2023, 1, 1),
    catchup=False,
    tags=["example"],
)
def example_cosmos_cleanup_dag():
    @task()
    def clear_db_ls_cache(session=None):
        """
        Delete the dbt ls cache that has not been used for the last five days.
        """
        delete_unused_dbt_ls_cache(max_age_last_usage=timedelta(days=5))

    clear_db_ls_cache()

    @task()
    def clear_db_ls_remote_cache(session=None):
        """
        Delete the dbt ls remote cache files that have not been used for the last five days.
        """
        delete_unused_dbt_ls_remote_cache_files(max_age_last_usage=timedelta(days=5))

    clear_db_ls_remote_cache()


# Call the decorated function so that Airflow registers the DAG.
example_cosmos_cleanup_dag()
Cache key
The Airflow variables that represent the dbt ls cache are prefixed by cosmos_cache.
When using DbtDag, the keys use the DAG name. When using DbtTaskGroup, they contain the TaskGroup, its parent task groups, and the DAG.
Examples:
The DbtDag “basic_cosmos_dag” will have the cache represented by “cosmos_cache__basic_cosmos_dag”.
The DbtTaskGroup “customers” declared inside the DAG “basic_cosmos_task_group” will have the cache key “cosmos_cache__basic_cosmos_task_group__customers”.
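The naming rule above can be sketched as a small helper. This is an illustration that reproduces the two example keys, not the function Cosmos uses internally; the name build_cache_key and the ordering of nested task groups (DAG first, then outermost to innermost group) are assumptions.

```python
from typing import Sequence

CACHE_PREFIX = "cosmos_cache"


def build_cache_key(dag_id: str, task_group_ids: Sequence[str] = ()) -> str:
    """Join the prefix, the DAG id, and any task group ids with double underscores."""
    # Assumed ordering: the DAG comes first, followed by the task groups.
    parts = [CACHE_PREFIX, dag_id, *task_group_ids]
    return "__".join(parts)


# A DbtDag uses only the DAG name.
print(build_cache_key("basic_cosmos_dag"))
# A DbtTaskGroup appends the task group after the DAG name.
print(build_cache_key("basic_cosmos_task_group", ["customers"]))
```

Knowing how the key is composed makes it easier to find the right Airflow Variable when purging a cache by hand.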
Cache value
The cache values contain a few properties:
last_modified is a timestamp, represented using the ISO 8601 format.
version is a hash that represents the version of the dbt project and the arguments used to run dbt ls at the time Cosmos created the cache.
dbt_ls_compressed represents the dbt ls output compressed using zlib and encoded to base64 so Cosmos can record the value as a compressed string in the Airflow metadata database.
dag_id is the DAG associated with this cache.
task_group_id is the TaskGroup associated with this cache.
cosmos_type is either DbtDag or DbtTaskGroup.
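The compress-then-encode step described for dbt_ls_compressed can be sketched with the standard library. This is a minimal round-trip example under stated assumptions: the helper names, the sample dbt ls output, and the reduced set of fields in the dictionary are illustrative and do not reproduce Cosmos's internal code.

```python
import base64
import zlib
from datetime import datetime, timezone


def compress_dbt_ls_output(output: str) -> str:
    """Compress text with zlib and return it as a base64-encoded string."""
    compressed = zlib.compress(output.encode("utf-8"))
    return base64.b64encode(compressed).decode("utf-8")


def decompress_dbt_ls_output(encoded: str) -> str:
    """Reverse compress_dbt_ls_output: base64-decode, then zlib-decompress."""
    return zlib.decompress(base64.b64decode(encoded)).decode("utf-8")


# Hypothetical dbt ls output: one JSON line per node, repeated to mimic a larger project.
sample_output = '{"name": "customers", "resource_type": "model"}\n' * 50

# A reduced cache value using two of the property names listed above.
cache_value = {
    "last_modified": datetime.now(timezone.utc).isoformat(),  # ISO 8601 timestamp
    "dbt_ls_compressed": compress_dbt_ls_output(sample_output),
}

restored = decompress_dbt_ls_output(cache_value["dbt_ls_compressed"])
print(len(sample_output), len(cache_value["dbt_ls_compressed"]))
```

Base64 keeps the compressed bytes representable as plain text, which is what allows the value to be stored as a string.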
Caching the partial parse file#
(Introduced in Cosmos 1.4)
After parsing the dbt project, dbt stores an internal project manifest in a file called partial_parse.msgpack (official docs).
This file contributes significantly to the performance of running dbt commands when the dbt project did not change.
Cosmos 1.4 introduced support for partial parse files, both those provided by the user and those generated after running dbt commands. Cosmos stores the generated files in the temporary folder on disk of the Airflow scheduler and worker nodes.
Users can customize where to store the cache using the setting AIRFLOW__COSMOS__CACHE_DIR.
It is possible to switch off this feature by exporting the environment variable AIRFLOW__COSMOS__ENABLE_CACHE_PARTIAL_PARSE=0.
For more information, read the Cosmos partial parsing documentation.
Caching the profiles#
(Introduced in Cosmos 1.5)
Cosmos 1.5 introduced support for profile caching, which caches the profile mapping in the path specified by the environment variables AIRFLOW__COSMOS__CACHE_DIR and AIRFLOW__COSMOS__PROFILE_CACHE_DIR_NAME.
This feature facilitates the reuse of Airflow connections and profiles.yml.
Users can customize the cache storage location using the settings AIRFLOW__COSMOS__CACHE_DIR and AIRFLOW__COSMOS__PROFILE_CACHE_DIR_NAME.
To disable this feature, users can set the environment variable AIRFLOW__COSMOS__ENABLE_CACHE_PROFILE=False.