Welcome to Ray provider documentation!

This repository provides tools for integrating Apache Airflow® with Ray, enabling the orchestration of Ray jobs within Airflow DAGs. It includes a decorator, three operators, a hook, and a trigger designed to efficiently manage and monitor Ray jobs and services.

Benefits of using this provider include:

  • Integration: Incorporate Ray jobs into Airflow DAGs for unified workflow management.

  • Distributed computing: Use Ray’s distributed capabilities within Airflow pipelines for scalable workloads such as ETL and LLM fine-tuning.

  • Monitoring: Track Ray job progress through Airflow’s user interface.

  • Dependency management: Define and manage dependencies between Ray jobs and other tasks in DAGs.

  • Resource allocation: Run Ray jobs alongside other task types within a single pipeline.

Quickstart

See the Getting Started page for detailed instructions on how to begin using the provider.
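
For a quick taste before you head there, here is a minimal sketch of a DAG that submits a script to an existing Ray cluster. The connection ID, script, and working directory are placeholders, and the exact import path and operator signature may differ between provider versions, so treat the Getting Started page as authoritative:

    from datetime import datetime

    from airflow import DAG
    from ray_provider.operators.ray import SubmitRayJob  # import path may vary by version

    with DAG(dag_id="ray_quickstart", start_date=datetime(2024, 1, 1), schedule=None):
        SubmitRayJob(
            task_id="submit_ray_job",
            conn_id="ray_conn",  # placeholder: an Airflow connection of type "Ray"
            entrypoint="python script.py",  # placeholder script to run on the cluster
            runtime_env={"working_dir": "/usr/local/airflow/dags/ray_scripts"},
            wait_for_completion=True,  # block until the Ray job finishes
        )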

What is the Ray provider?

Enterprise data value extraction involves two crucial components:

  • Data Engineering

  • Data Science/ML/AI

While Airflow excels at data engineering tasks through its extensive plugin ecosystem, it generally relies on external systems for large-scale ETL (hundreds of GB to PB scale) and for AI tasks such as fine-tuning and deploying LLMs.

Ray is a powerful platform for large-scale computation, and this provider makes it straightforward to orchestrate Ray jobs from Airflow.

[Architecture diagram: Airflow and Ray deployed on a Kubernetes cluster]

The architecture diagram above shows how both Airflow and Ray can be deployed on a Kubernetes cluster for elastic compute.

Use Cases

  • Scalable ETL: Orchestrate and monitor Ray jobs on on-demand compute clusters using the Ray Data library. These operations could be custom Python code or ML model inference.

  • Model Training: Schedule model training or fine-tuning jobs on flexible cadences (daily/weekly/monthly). Benefits include:

    • Optimize resource utilization by scheduling Ray jobs during cost-effective periods

    • Trigger model refresh based on changing business conditions or data trends

  • Model Inference: Inference of trained or fine-tuned models can be handled in two ways:

    • Batch Inference jobs can be incorporated into Airflow DAGs for unified workflow management and monitoring

    • Real-time (online) models that use Ray Serve can be deployed using the same operators with wait_for_completion=False (see the sketch after this list)

  • Model Ops: Leverage Airflow tasks following Ray job tasks for model management:

    • Deploy models to a model registry

    • Perform champion-challenger analysis

    • Execute other model lifecycle management tasks
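
For the online-serving case above, a hedged sketch: submit a script that starts a Ray Serve application, and let the Airflow task return as soon as the job is submitted. The serve_app.py script and connection ID are hypothetical placeholders, and the import path may vary by provider version:

    from ray_provider.operators.ray import SubmitRayJob  # import path may vary by version

    deploy_online_model = SubmitRayJob(
        task_id="deploy_online_model",
        conn_id="ray_conn",  # placeholder Ray connection
        entrypoint="python serve_app.py",  # hypothetical script that calls serve.run(...)
        runtime_env={
            "working_dir": "/usr/local/airflow/dags/ray_scripts",
            "pip": ["ray[serve]"],
        },
        wait_for_completion=False,  # return once submitted; the Serve app keeps running
    )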

Components

Decorators

  • @ray.task(): Simplifies integration by decorating task functions to work seamlessly with Ray.
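
A minimal sketch of the decorator in use; the config keys shown (conn_id, runtime_env, num_cpus) reflect common usage and may not match your provider version exactly:

    from ray_provider.decorators.ray import ray  # import path may vary by version

    @ray.task(
        config={
            "conn_id": "ray_conn",  # placeholder Airflow connection of type "Ray"
            "runtime_env": {"working_dir": "/usr/local/airflow/dags/ray_scripts"},
            "num_cpus": 1,
        }
    )
    def transform_data() -> None:
        # The function body is shipped to the Ray cluster and executed as a Ray job.
        import ray

        dataset = ray.data.range(1000)
        print(f"Row count: {dataset.count()}")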

Operators

  • SetupRayCluster: Sets up or updates a Ray cluster on Kubernetes using a kubeconfig provided through the Ray connection

  • DeleteRayCluster: Deletes an existing Ray cluster on Kubernetes using the same Ray connection

  • SubmitRayJob: Submits a job to a Ray cluster at a specified host and monitors its execution
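
Together, the three operators form a natural setup → submit → teardown pipeline. A hedged sketch, assuming parameter names such as ray_cluster_yaml from recent provider versions; the cluster spec file, connection ID, and script are placeholders:

    from datetime import datetime

    from airflow import DAG
    from ray_provider.operators.ray import (  # import path may vary by version
        DeleteRayCluster,
        SetupRayCluster,
        SubmitRayJob,
    )

    with DAG(dag_id="ray_cluster_lifecycle", start_date=datetime(2024, 1, 1), schedule=None):
        setup_cluster = SetupRayCluster(
            task_id="setup_ray_cluster",
            conn_id="ray_k8s_conn",       # Ray connection carrying the kubeconfig
            ray_cluster_yaml="ray.yaml",  # placeholder RayCluster spec applied to Kubernetes
        )
        submit_job = SubmitRayJob(
            task_id="submit_ray_job",
            conn_id="ray_k8s_conn",
            entrypoint="python train.py",  # placeholder training script
            runtime_env={"working_dir": "/usr/local/airflow/dags/ray_scripts"},
        )
        delete_cluster = DeleteRayCluster(
            task_id="delete_ray_cluster",
            conn_id="ray_k8s_conn",
            ray_cluster_yaml="ray.yaml",
            trigger_rule="all_done",  # tear the cluster down even if the job fails
        )
        setup_cluster >> submit_job >> delete_cluster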

Hooks

  • RayHook: Sets up methods needed to run operators and decorators, working with the ‘Ray’ connection type to manage Ray clusters and submit jobs.
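
The ‘Ray’ connection can be supplied like any other Airflow connection, for example through the standard AIRFLOW_CONN_* environment-variable mechanism. A hedged sketch; the conn_type string and the extra keys (such as a kubeconfig path) are assumptions, so confirm the exact field names against the provider’s connection documentation:

    import os

    # Typically set in the deployment environment rather than in DAG code.
    os.environ["AIRFLOW_CONN_RAY_CONN"] = (
        '{"conn_type": "ray", '
        '"host": "http://ray-head.example.com:8265", '  # assumed Ray dashboard / Job API URL
        '"extra": {"kube_config_path": "/path/to/kubeconfig"}}'  # assumed extra key
    )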

Triggers

  • RayJobTrigger: Asynchronously monitors jobs submitted via the SubmitRayJob operator or the @ray.task() decorator and streams real-time logs to the Airflow UI

Getting Involved

Platform          | Purpose                                            | Estimated Response Time
------------------|----------------------------------------------------|------------------------
Discussion Forum  | General inquiries and discussions                  | < 3 days
GitHub Issues     | Bug reports and feature requests                   | < 1-2 days
Slack             | Quick questions and real-time chat (#airflow-ray)  | < 12 hrs

Changelog

We follow Semantic Versioning for releases. Check CHANGELOG.rst for the latest changes.

Contributing Guide

All contributions, bug reports, bug fixes, documentation improvements, and enhancements are welcome.

A detailed overview of how to contribute can be found in the Contributing Guide.