Watcher retry behavior history
While ExecutionMode.WATCHER can significantly improve DAG run times, it is based on
non-idempotent Apache Airflow® tasks and relies on a complex retry
mechanism in which one task’s status can affect another task’s status. This is why
ExecutionMode.WATCHER has remained marked as experimental for several months, and why it stays
that way until retries behave correctly. This document presents how each aspect of retries has
evolved within the Cosmos watcher implementation across Cosmos releases.
Goals

- The Airflow DbtDag/DbtTaskGroup state should match the dbt pipeline status, whether successful or failed
- Users should be able to retry individual tasks via Airflow retry
- Users should be able to retry the whole DAG via Airflow automatic retry, so humans do not need to intervene when the DAG fails
- Users should be able to retry the whole DAG via Airflow clear
- Avoid duplicate or concurrent runs of the same dbt transformation in the same DAG run
Does the Airflow state match dbt’s?

| Version | Outcome |
|---|---|
| 1.11.0 | Yes. |
| 1.11.1 | Yes. Same as 1.11.0. |
| 1.11.2 | Yes. Same as 1.11.0. |
| 1.11.3 | Yes. Same as 1.11.0. |
| 1.12.0 | Yes. Same as 1.11.0. |
| 1.12.1 | Yes. Same as 1.11.0. |
| 1.13.0 | Maybe. Yes if successful in the first run. No if retries happen, unless users manually clear the producer task. |
| 1.13.1 | Maybe. Same as 1.13.0. |
| 1.14.0 | No. On producer retry, dbt model failures from the first attempt are silently dropped. The consumer tasks for those models are marked successful instead of running their fallback retry, so the DAG appears successful even though dbt failed. |
| 1.14.1 | Yes. |
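
The first goal can be read as a consistency check between the two systems. The sketch below is illustrative only: the helper name `states_match` and the `"success"`/`"failed"` status strings are assumptions made for this example, not part of Cosmos or Airflow, and each run is modelled as a plain dictionary keyed by dbt model name.

```python
def states_match(dbt_model_status: dict[str, str], airflow_task_state: dict[str, str]) -> bool:
    """Return True when every dbt model's outcome agrees with its Airflow task state.

    Hypothetical helper: both sides are assumed to use "success" / "failed" strings.
    """
    return all(
        airflow_task_state.get(model) == status
        for model, status in dbt_model_status.items()
    )


# Healthy run: dbt and Airflow agree on every model.
healthy_dbt = {"stg_orders": "success", "orders": "success"}
healthy_airflow = {"stg_orders": "success", "orders": "success"}
print(states_match(healthy_dbt, healthy_airflow))  # True

# The mismatch described for 1.14.0: dbt failed a model, Airflow shows success.
broken_dbt = {"stg_orders": "success", "orders": "failed"}
broken_airflow = {"stg_orders": "success", "orders": "success"}
print(states_match(broken_dbt, broken_airflow))  # False
```

A "Yes" row in the table corresponds to this check holding for every run, including runs that went through retries.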
Task-level retry: consumer

| Version | Behavior |
|---|---|
| 1.11.0 | Fallback to |
| 1.11.1 | Same as 1.11.0. |
| 1.11.2 | Same as 1.11.0. |
| 1.11.3 | Same as 1.11.0. |
| 1.12.0 | Similar to 1.11.0. Fixes rendering of dbt compiled SQL as a templated field (#2209); consumers run asynchronously when they behave as sensors (#2084), letting them detect producer failure faster and freeing worker slots sooner. |
| 1.12.1 | Same as 1.12.0. |
| 1.13.0 | Same as 1.12.0. |
| 1.13.1 | Same as 1.12.0. |
| 1.14.0 | Similar to 1.12.0. Affected by an Airflow limitation (#2554): because the producer returns success on retry and Airflow does not preserve XCom across retries, consumers lose the model statuses from the first attempt and may silently mark failed models as successful. |
| 1.14.1 | Similar to 1.12.0. Consumers always read correct model statuses thanks to the producer’s XCom backup mechanism (see the producer task-level retry section). |
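
The difference between the 1.14.0 and 1.14.1 rows can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the Cosmos implementation: `FakeXCom`, `run_producer`, `consumer_outcome`, and `simulate` are hypothetical names, a plain dictionary stands in for wherever the backup is actually kept, and treating a missing status as success is an assumption chosen to reproduce the symptom described above.

```python
from typing import Optional


class FakeXCom:
    """Toy stand-in for XCom: a task's values are wiped when that task retries."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def push(self, key: str, value: str) -> None:
        self._store[key] = value

    def pull(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def clear(self) -> None:
        self._store.clear()


def run_producer(xcom: FakeXCom, backup: dict[str, str], try_number: int,
                 dbt_results: dict[str, str], use_backup: bool) -> None:
    """Hypothetical producer: runs dbt on the first attempt only."""
    if try_number == 1:
        for model, status in dbt_results.items():
            xcom.push(model, status)
        if use_backup:
            backup.update(dbt_results)  # assumed backup location: a plain dict
        return
    # Retry: statuses from the previous attempt are gone, and dbt is not re-run.
    xcom.clear()
    if use_backup:
        for model, status in backup.items():
            xcom.push(model, status)


def consumer_outcome(xcom: FakeXCom, model: str) -> str:
    """Hypothetical consumer: only a known failure triggers the fallback retry."""
    return "fallback retry" if xcom.pull(model) == "failed" else "success"


def simulate(use_backup: bool) -> str:
    xcom, backup = FakeXCom(), {}
    results = {"orders": "failed"}
    run_producer(xcom, backup, 1, results, use_backup)  # first attempt
    run_producer(xcom, backup, 2, results, use_backup)  # producer retry
    return consumer_outcome(xcom, "orders")


print(simulate(use_backup=False))  # success: the failure was silently dropped
print(simulate(use_backup=True))   # fallback retry: the status survived the retry
```

Without a backup, the consumer has no way to tell "model succeeded" apart from "status was lost", which is why the failed model ends up reported as successful.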
Task-level retry: producer

| Version | Behavior |
|---|---|
| 1.11.0 | Relaunches the entire |
| 1.11.1 | Same as 1.11.0. |
| 1.11.2 | Manual clear still relaunches |
| 1.11.3 | Same as 1.11.2. |
| 1.12.0 | Same as 1.11.2. |
| 1.12.1 | Same as 1.11.2. |
| 1.13.0 | Producer returns success on |
| 1.13.1 | Same as 1.13.0. |
| 1.14.0 | Producer returns success on retry without re-running |
| 1.14.1 | Producer raises Known issues with the XCom backup mechanism: |
Automatic retries

| Version | Behavior |
|---|---|
| 1.11.0 | Unsafe. No safeguard; producer auto-retries would relaunch |
| 1.11.1 | Unsafe. Same as 1.11.0. |
| 1.11.2 | Failure. Producer |
| 1.11.3 | Failure. Same as 1.11.2. |
| 1.12.0 | Failure. Same as 1.11.2. |
| 1.12.1 | Failure. Same as 1.11.2. |
| 1.13.0 | Failure. Same as 1.11.2. |
| 1.13.1 | Failure. Same as 1.11.2. |
| 1.14.0 | Incorrect status. Forced |
| 1.14.1 | Works. Producer auto-retries raise |
Full DAG / TaskGroup clear

| Version | Behavior |
|---|---|
| 1.11.0 | Unsafe. Relaunches the entire |
| 1.11.1 | Unsafe. Same as 1.11.0. |
| 1.11.2 | Unsafe. Same as 1.11.0. |
| 1.11.3 | Unsafe. Same as 1.11.0. |
| 1.12.0 | Unsafe. Same as 1.11.0. |
| 1.12.1 | Unsafe. Same as 1.11.0. |
| 1.13.0 | Works. Producer returns success on retry without re-running |
| 1.13.1 | Works. Same as 1.13.0. |
| 1.14.0 | Incorrect status. Same as 1.13.0, but Airflow does not preserve XCom across retries (#2554), so failed dbt models can be silently marked successful. |
| 1.14.1 | Works. Producer raises |
Avoid duplicate or concurrent runs of the same dbt transformation in the same DAG run

| Version | Behavior |
|---|---|
| 1.11.0 | Not met. Producer auto-retry, manual clear, or full DAG/TaskGroup clear relaunches |
| 1.11.1 | Not met. Same as 1.11.0. |
| 1.11.2 | Not met. Producer |
| 1.11.3 | Not met. Same as 1.11.2. |
| 1.12.0 | Not met. Same as 1.11.2. |
| 1.12.1 | Not met. Same as 1.11.2. |
| 1.13.0 | Not met. Producer returns success on retry without re-running |
| 1.13.1 | Not met. Same as 1.13.0. |
| 1.14.0 | Not met. Same as 1.13.0; forced |
| 1.14.1 | Met. Producer raises |