Operators

class DeleteRayCluster(*args: Any, **kwargs: Any)

Bases: BaseOperator

Operator to delete a Ray cluster from Kubernetes.

Parameters:
  • conn_id – The connection ID for the Ray cluster.

  • ray_cluster_yaml – Path to the YAML file defining the Ray cluster.

  • gpu_device_plugin_yaml – URL or path to the GPU device plugin YAML. Defaults to NVIDIA’s plugin.

execute(context: airflow.utils.context.Context) → None

Execute the deletion of the Ray cluster.

Parameters:

context – The context in which the operator is being executed.

property hook: airflow.providers.cncf.kubernetes.utils.pod_manager.PodOperatorHookProtocol

Lazily initialize and return the RayHook.

class SetupRayCluster(*args: Any, **kwargs: Any)

Bases: BaseOperator

Operator to set up a Ray cluster on Kubernetes.

Parameters:
  • conn_id – The connection ID for the Ray cluster.

  • ray_cluster_yaml – Path to the YAML file defining the Ray cluster.

  • kuberay_version – Version of KubeRay to install. Defaults to “1.0.0”.

  • gpu_device_plugin_yaml – URL or path to the GPU device plugin YAML. Defaults to NVIDIA’s plugin.

  • update_if_exists – Whether to update the cluster if it already exists. Defaults to False.

execute(context: airflow.utils.context.Context) → None

Execute the setup of the Ray cluster.

Parameters:

context – The context in which the operator is being executed.

property hook: RayHook

Lazily initialize and return the RayHook.
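As a rough illustration of how SetupRayCluster and DeleteRayCluster pair up, the sketch below builds keyword arguments for both from the parameters documented above. The helper name cluster_task_kwargs, the connection ID, and the YAML path are hypothetical, and the import path shown in the comments is an assumption that may differ between provider versions.

```python
from typing import Any


def cluster_task_kwargs(
    conn_id: str,
    ray_cluster_yaml: str,
    kuberay_version: str = "1.0.0",
    update_if_exists: bool = False,
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (setup_kwargs, delete_kwargs) that target the same cluster."""
    # Both operators take the same connection ID and cluster spec.
    shared = {"conn_id": conn_id, "ray_cluster_yaml": ray_cluster_yaml}
    setup_kwargs = {
        **shared,
        "kuberay_version": kuberay_version,
        "update_if_exists": update_if_exists,
    }
    delete_kwargs = dict(shared)
    return setup_kwargs, delete_kwargs


setup_kwargs, delete_kwargs = cluster_task_kwargs(
    conn_id="ray_conn",           # hypothetical connection ID
    ray_cluster_yaml="ray.yaml",  # hypothetical path
)

# Inside a DAG file (import path is an assumption):
# from ray_provider.operators.ray import SetupRayCluster, DeleteRayCluster
# setup = SetupRayCluster(task_id="setup_ray_cluster", **setup_kwargs)
# delete = DeleteRayCluster(task_id="delete_ray_cluster", **delete_kwargs)
# setup >> delete
```

Sharing one dictionary for conn_id and ray_cluster_yaml keeps the two tasks pointed at the same cluster definition.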

class SubmitRayJob(*args: Any, **kwargs: Any)

Bases: BaseOperator

Operator to submit and monitor a Ray job.

This operator handles the submission of a Ray job to a Ray cluster and monitors its status until completion. It supports deferring execution and resuming based on job status changes, making it suitable for long-running jobs.

Parameters:
  • conn_id – The connection ID for the Ray cluster.

  • entrypoint – The command or script to execute as the Ray job.

  • runtime_env – The runtime environment configuration for the Ray job.

  • num_cpus – Number of CPUs required for the job. Defaults to 0.

  • num_gpus – Number of GPUs required for the job. Defaults to 0.

  • memory – Amount of memory required for the job in bytes. Defaults to 0.

  • resources – Additional custom resources required for the job. Defaults to None.

  • ray_cluster_yaml – Path to the Ray cluster YAML configuration file. If provided, the operator will set up and tear down the cluster.

  • kuberay_version – Version of KubeRay to use when setting up the Ray cluster. Defaults to “1.0.0”.

  • update_if_exists – Whether to update the Ray cluster if it already exists. Defaults to True.

  • gpu_device_plugin_yaml – URL or path to the GPU device plugin YAML file. Defaults to NVIDIA’s plugin.

  • fetch_logs – Whether to fetch logs from the Ray job. Defaults to True.

  • wait_for_completion – Whether to wait for the job to complete before marking the task as finished. Defaults to True.

  • job_timeout_seconds – Maximum time to wait for job completion, in seconds. Defaults to 600. Set to 0 to let the job run indefinitely with no timeout.

  • poll_interval – Interval between job status checks in seconds. Defaults to 60 seconds.

  • xcom_task_key – XCom key to retrieve the dashboard URL. Defaults to None.
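The parameters above could be assembled as in the following sketch. All values are placeholders chosen for illustration. The runtime_env keys working_dir and pip come from Ray's runtime environment format. The name DOCUMENTED_PARAMS is hypothetical and simply mirrors the parameter list above to catch misspelled keys before they reach the operator.

```python
from typing import Any

# Runtime environment in Ray's format; the values are placeholders.
runtime_env: dict[str, Any] = {
    "working_dir": "./ray_scripts",
    "pip": ["numpy"],
}

# Keyword arguments for SubmitRayJob, using only documented parameters.
submit_kwargs: dict[str, Any] = {
    "conn_id": "ray_conn",             # hypothetical connection ID
    "entrypoint": "python script.py",  # hypothetical script
    "runtime_env": runtime_env,
    "num_cpus": 1,
    "num_gpus": 0,
    "memory": 0,
    "fetch_logs": True,
    "wait_for_completion": True,
    "job_timeout_seconds": 600,
    "poll_interval": 60,
}

# Parameter names copied from the list above.
DOCUMENTED_PARAMS = {
    "conn_id", "entrypoint", "runtime_env", "num_cpus", "num_gpus",
    "memory", "resources", "ray_cluster_yaml", "kuberay_version",
    "update_if_exists", "gpu_device_plugin_yaml", "fetch_logs",
    "wait_for_completion", "job_timeout_seconds", "poll_interval",
    "xcom_task_key",
}

# Fail early on any key that is not a documented parameter.
unknown = set(submit_kwargs) - DOCUMENTED_PARAMS
if unknown:
    raise ValueError(f"Unknown SubmitRayJob parameters: {sorted(unknown)}")

# Inside a DAG file the dictionary would be unpacked into the operator, e.g.
# SubmitRayJob(task_id="submit_ray_job", **submit_kwargs)
```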

execute(context: airflow.utils.context.Context) → str

Execute the Ray job submission and monitoring.

This method submits the Ray job to the cluster and, if configured to wait for completion, monitors the job status until it reaches a terminal state or times out.

Parameters:

context – The context in which the task is being executed.

Returns:

The job ID of the submitted Ray job.

Raises:

AirflowException – If the job fails, is cancelled, or reaches an unexpected state.
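The interaction between poll_interval and job_timeout_seconds can be pictured with a simple polling loop. This is a minimal sketch and not the provider's implementation, which may defer to a trigger instead of blocking. The function wait_for_job, its get_status callback, and the use of TimeoutError are hypothetical. The status strings follow Ray's job statuses, with STOPPED assumed to correspond to a cancelled job.

```python
import time
from typing import Callable

# Ray job statuses treated as terminal in this sketch.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_interval: float = 60,
    job_timeout_seconds: float = 600,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status until a terminal state is seen or the timeout expires.

    A job_timeout_seconds of 0 disables the timeout.
    """
    start = clock()
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        # A timeout of 0 is falsy, so the check is skipped entirely.
        if job_timeout_seconds and clock() - start >= job_timeout_seconds:
            raise TimeoutError(
                f"Job still {status} after {job_timeout_seconds} seconds"
            )
        sleep(poll_interval)


# Demo with a scripted sequence of statuses and no real sleeping.
_statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
final_status = wait_for_job(
    lambda: next(_statuses), poll_interval=0, sleep=lambda _s: None
)
```

Injecting sleep and clock keeps the loop testable without waiting in real time.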

execute_complete(context: airflow.utils.context.Context, event: dict[str, Any]) → None

Handle the completion of a deferred Ray job execution.

This method is called when the deferred job execution completes. It processes the final job status and raises an exception for failed or cancelled jobs. Finally, it deletes the cluster if a Ray cluster spec was provided.

Parameters:
  • context – The context in which the task is being executed.

  • event – The event containing the job execution result.

Raises:

AirflowException – If the job execution fails, is cancelled, or reaches an unexpected state.
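The completion handling described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the provider's code: the event keys "status" and "message", the function handle_job_event, and the cleanup callback are hypothetical, and RuntimeError stands in for AirflowException so the example runs without Airflow installed.

```python
from typing import Any, Callable, Optional


def handle_job_event(
    event: dict[str, Any],
    cleanup: Optional[Callable[[], None]] = None,
) -> None:
    """Raise for failed, cancelled, or unexpected results, then clean up."""
    try:
        status = event.get("status")
        if status == "SUCCEEDED":
            return
        if status in ("FAILED", "STOPPED"):
            message = event.get("message", "")
            raise RuntimeError(f"Ray job ended with status {status}: {message}")
        raise RuntimeError(f"Unexpected event status: {status!r}")
    finally:
        # Runs on success and on failure, mirroring the cluster deletion
        # that happens when a cluster spec was provided.
        if cleanup is not None:
            cleanup()


# Demo: record cleanup calls instead of deleting a real cluster.
cleanup_calls: list[str] = []
handle_job_event({"status": "SUCCEEDED"}, cleanup=lambda: cleanup_calls.append("ok"))
```

Placing the cleanup in a finally block means the cluster is torn down whether the job succeeded or an exception was raised.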

property hook: airflow.providers.cncf.kubernetes.utils.pod_manager.PodOperatorHookProtocol

Lazily initialize and return the RayHook.

on_kill() → None

Delete the Ray job if the task is killed.

This method is called when the task is externally killed. It ensures that the associated Ray job is also terminated to avoid orphaned jobs.

template_fields = ('conn_id', 'entrypoint', 'runtime_env', 'num_cpus', 'num_gpus', 'memory', 'xcom_task_key', 'ray_cluster_yaml', 'job_timeout_seconds')
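Airflow renders Jinja templates only in the attributes named in template_fields, so it can help to see which SubmitRayJob parameters fall outside that tuple. The sketch below copies both lists from this page; the names TEMPLATE_FIELDS, DOCUMENTED_PARAMS, and not_templated are hypothetical.

```python
# Copied from the template_fields attribute above.
TEMPLATE_FIELDS = (
    "conn_id", "entrypoint", "runtime_env", "num_cpus", "num_gpus",
    "memory", "xcom_task_key", "ray_cluster_yaml", "job_timeout_seconds",
)

# SubmitRayJob parameters in the order they are documented.
DOCUMENTED_PARAMS = (
    "conn_id", "entrypoint", "runtime_env", "num_cpus", "num_gpus",
    "memory", "resources", "ray_cluster_yaml", "kuberay_version",
    "update_if_exists", "gpu_device_plugin_yaml", "fetch_logs",
    "wait_for_completion", "job_timeout_seconds", "poll_interval",
    "xcom_task_key",
)

# Parameters that would be passed through without template rendering.
not_templated = [p for p in DOCUMENTED_PARAMS if p not in TEMPLATE_FIELDS]
```

Under these lists, values such as poll_interval or resources would need to be set to literal values rather than Jinja expressions.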