Watcher benchmark
This page captures the detailed cluster setup, per-configuration tables, and analysis behind the ExecutionMode.LOCAL versus ExecutionMode.WATCHER benchmark dated 2026-05-15. For a concise summary, see the “Performance gains” section in Watcher execution mode (experimental).
We ran the comparison with the astronomer/cosmos-benchmark project on the official Apache Airflow Helm chart, with a dedicated worker pool for the heavy producer dbt build step and a separate pool for the per-model sensor tasks.
The full setup script, raw per-rep data, and a reproducible sweep recipe are published in the cosmos-benchmark readme.
Cluster setup
- Apache Airflow Helm chart 1.21.0 (Airflow 3.2.0)
- astronomer-cosmos==1.14.1, dbt-bigquery==1.9
- dbt project: google/fhir-dbt-analytics (2 seeds, 52 sources, 185 models)
- Producer pool: 1 replica, cpu=1 / memory=2Gi
- Consumer pool: 9 replicas, cpu=1 / memory=2Gi with worker_concurrency=2 (18 task slots total)
- Airflow parallelism=16
- Cosmos watcher_dbt_execution_queue=producer routing, so the producer dbt build always lands on the dedicated pool
- 5 repetitions per configuration; wall time below is mean ± sample-stdev (n=5)
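As a minimal sketch of how a "mean ± sample-stdev (n=5)" figure is derived, the snippet below summarises five repetitions of one configuration. The wall times are hypothetical values chosen for illustration, not the published raw data, and the `summarize` helper is an assumed name rather than part of the benchmark tooling.

```python
from statistics import mean, stdev


def summarize(wall_times_min):
    """Return (mean, sample standard deviation) for one configuration's repetitions."""
    # statistics.stdev divides by n - 1 (sample stdev); pstdev would divide by n.
    return mean(wall_times_min), stdev(wall_times_min)


# Hypothetical wall times in minutes for five repetitions (illustrative only).
reps = [8.7, 9.1, 8.9, 8.8, 9.0]
avg, sd = summarize(reps)
print(f"{avg:.1f} ± {sd:.1f} (n={len(reps)})")  # prints "8.9 ± 0.2 (n=5)"
```

With only five repetitions the sample standard deviation is itself a rough estimate, so small differences between configurations should be read with that in mind.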
Timing and CPU

| Mode | Threads | Wall time (minutes) | Producer time (minutes) | Producer max CPU (cores) | Total consumers max CPU (cores) |
|---|---|---|---|---|---|
| LOCAL | N/A | 8.9 ± 0.2 | N/A | N/A | 4.30 |
| WATCHER | 4 | 7.5 ± 0.2 | 7.3 | 0.28 | 7.96 |
| WATCHER | 8 | 5.2 ± 0.2 | 4.2 | 0.54 | 8.05 |
| WATCHER | 12 | 5.6 ± 1.2 | 4.3 | 0.70 | 8.15 |
| WATCHER | 16 | 5.2 ± 0.3 | 3.1 | 0.83 | 8.30 |
Producer max CPU is peak cores observed for the single producer pod (capacity: 1 core). Total consumers max CPU is peak cores summed across the 9 consumer pods (combined capacity: 9 cores); each individual consumer pod is bounded by its 1-core limit.
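To relate these peaks to pool capacity, the following sketch divides each peak by the capacity stated above (1 core for the producer pod, 9 cores across the consumer pods). The values are copied from the timing table; the `utilization` helper and the row layout are illustrative assumptions.

```python
PRODUCER_CAPACITY_CORES = 1.0  # 1 producer pod x 1 core
CONSUMER_CAPACITY_CORES = 9.0  # 9 consumer pods x 1 core

# (mode, threads, producer peak cores, total consumers peak cores), from the table.
rows = [
    ("LOCAL", None, None, 4.30),
    ("WATCHER", 4, 0.28, 7.96),
    ("WATCHER", 8, 0.54, 8.05),
    ("WATCHER", 12, 0.70, 8.15),
    ("WATCHER", 16, 0.83, 8.30),
]


def utilization(peak_cores, capacity_cores):
    """Fraction of pool capacity used at peak, or None when not measured."""
    if peak_cores is None:
        return None
    return peak_cores / capacity_cores


for mode, threads, producer_peak, consumers_peak in rows:
    producer = utilization(producer_peak, PRODUCER_CAPACITY_CORES)
    consumers = utilization(consumers_peak, CONSUMER_CAPACITY_CORES)
    producer_text = "N/A" if producer is None else f"{producer:.0%}"
    print(f"{mode} threads={threads}: producer {producer_text}, consumers {consumers:.0%}")
```

Under these figures the LOCAL consumer pool peaks at about 48% of its 9 cores, while the WATCHER runs peak between roughly 88% and 92%.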
Peak memory by pool

| Mode | Threads | Producer peak memory (GiB) | Total consumers peak memory (GiB) |
|---|---|---|---|
| LOCAL | N/A | N/A | 10.0 |
| WATCHER | 4 | 0.8 | 8.1 |
| WATCHER | 8 | 0.8 | 8.5 |
| WATCHER | 12 | 0.8 | 8.7 |
| WATCHER | 16 | 0.8 | 8.6 |
Producer peak memory is for the single producer pod (capacity: 2 GiB). Total consumers peak memory is summed across the 9 consumer pods (combined capacity: 18 GiB); each individual consumer pod is bounded by its 2 GiB limit.
Analysis
- WATCHER beat LOCAL at every thread count we tested. Even at dbt’s default of 4 threads, WATCHER cut wall-clock runtime by about 15% (7.5 vs 8.9 minutes). With 8 threads or more, WATCHER ran the DAG roughly 41% faster than LOCAL.
- threads=8 is a strong default. Past 8 threads the producer dbt build itself kept getting faster (4.2 minutes at threads=8 versus 3.1 minutes at threads=16), but the consumer sensor tasks took correspondingly longer to wake up and finalise, so total wall time plateaued (we are tracking the investigation of this consumer-side behaviour in astronomer/astronomer-cosmos#2657 and will update the findings as they become available). Start at 8 and only push higher if your producer task is the bottleneck.
- Producer CPU rises with threads, but sub-linearly. Going from 4 to 16 threads pushed producer peak CPU from 0.28 to 0.83 cores. dbt threads spend most of their time waiting on the warehouse, so most extra threads do not consume an extra core. Sizing the producer pod for roughly 1 core covers threads=16 comfortably.
- LOCAL is bound by your data warehouse, not Airflow. Under LOCAL the consumer pool peaked at 4.30 of 9 available cores (about 48%), because each dbt task spent most of its time waiting on BigQuery. Adding more Airflow workers will not move that ceiling. WATCHER instead saturates the consumer pool to roughly 8 cores via lightweight sensor work, and runs the heavy dbt build on the dedicated producer pod.
- WATCHER cuts consumer memory pressure. Total consumers peak memory drops from 10.0 GiB under LOCAL to roughly 8.5 GiB under WATCHER, because the heavy dbt build runs in the 0.8 GiB producer pod rather than across the consumer pool’s 18 concurrent dbt processes.
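The percentages and the sub-linear CPU scaling discussed in this analysis can be re-derived from the table means, roughly as sketched below. Because the inputs are the rounded values from the tables, results may differ by about a percentage point from figures computed on the raw per-rep data; `reduction_pct` is an illustrative helper name.

```python
LOCAL_WALL_MIN = 8.9

# threads -> (WATCHER wall time in minutes, producer peak CPU in cores), from the tables.
watcher = {
    4: (7.5, 0.28),
    8: (5.2, 0.54),
    12: (5.6, 0.70),
    16: (5.2, 0.83),
}


def reduction_pct(baseline, value):
    """Percentage reduction of value relative to baseline."""
    return (baseline - value) / baseline * 100


for threads, (wall_min, producer_cores) in watcher.items():
    saved = reduction_pct(LOCAL_WALL_MIN, wall_min)
    per_thread = producer_cores / threads
    print(f"threads={threads}: wall time -{saved:.1f}%, {per_thread:.3f} peak cores per thread")

# Quadrupling threads (4 -> 16) multiplies producer peak CPU by about 3, not 4.
cpu_growth = watcher[16][1] / watcher[4][1]
print(f"producer peak CPU growth from 4 to 16 threads: {cpu_growth:.2f}x")
```

The falling cores-per-thread figure is what "sub-linear" means here: each additional thread adds less peak CPU than the one before it.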
These numbers reflect one specific dbt project, warehouse, and worker shape. Treat them as a directional baseline and re-run the benchmark against your own pipeline before settling on a thread count.
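When re-running the sweep on your own pipeline, one possible way to settle on a thread count is to take the smallest value whose mean wall time is within a tolerance of the best observed. The sketch below applies that rule to the wall times from this page; the `pick_threads` helper and the 5% tolerance are assumptions for illustration, not part of Cosmos or the benchmark project.

```python
def pick_threads(wall_by_threads, tolerance=0.05):
    """Return the smallest thread count whose wall time is within `tolerance` of the best."""
    best = min(wall_by_threads.values())
    threshold = best * (1 + tolerance)
    candidates = [t for t, wall in wall_by_threads.items() if wall <= threshold]
    return min(candidates)


# Mean WATCHER wall times in minutes from the timing table on this page.
sweep = {4: 7.5, 8: 5.2, 12: 5.6, 16: 5.2}
print(pick_threads(sweep))  # prints 8
```

Preferring the smallest qualifying thread count keeps load on the warehouse and producer pod no higher than needed; a stricter rule could also compare the spread across repetitions before treating two configurations as equivalent.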