
Simplify Airflow DAG Creation and Maintenance with Hamilton | by Stefan Krawczyk | Jul 2023


How Hamilton can help you write more maintainable Airflow DAGs

An abstract illustration of how Airflow & Hamilton relate. Airflow helps bring it all together, while Hamilton helps make the innards manageable. Image from Pixabay.

This post was written in collaboration with Thierry Jean and originally appeared here.

This post walks you through the benefits of having two open source projects, Hamilton and Airflow, and their directed acyclic graphs (DAGs), work in tandem. At a high level, Airflow is responsible for orchestration (think macro) and Hamilton helps you author clean and maintainable data transformations (think micro).

For those who are unfamiliar with Hamilton, we point you to an interactive overview on tryhamilton.dev, or our other posts, e.g. this one. Otherwise, we will talk about Hamilton at a high level and point to reference documentation for more details. For reference, I’m one of the co-creators of Hamilton.

For those still mentally trying to grasp how the two can run together: the reason you can run Hamilton with Airflow is that Hamilton is just a library with a small dependency footprint, so you can add Hamilton to your Airflow setup in no time!

Just to recap, Airflow is the industry standard for orchestrating data pipelines. It powers all sorts of data initiatives, including ETL, ML pipelines, and BI. Since its inception in 2014, Airflow users have faced certain rough edges when it comes to authoring and maintaining data pipelines:

  1. Maintainably managing the evolution of workflows; what starts simple can invariably get complex.
  2. Writing modular, reusable, and testable code that runs within an Airflow task.
  3. Tracking lineage of code and data artifacts that an Airflow DAG produces.

This is where we believe Hamilton can help! Hamilton is a Python micro-framework for writing data transformations. In short, one writes Python functions in a “declarative” style, which Hamilton parses and connects into a graph based on their names, arguments, and type annotations. Specific outputs can be requested, and Hamilton will execute the required function path to produce them. Because it doesn’t provide macro orchestration capabilities, it pairs well with Airflow by helping data professionals write cleaner and more reusable code for Airflow DAGs.

The Hamilton paradigm in a picture. This example shows how one would map procedural pandas code to Hamilton functions that define a DAG. Note: Hamilton can be used for any Python object types, not just pandas. Image by author.
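To make the paradigm concrete, here is a minimal sketch of a Hamilton module and the driver code that runs it. This is our illustration, not code from the original post; the function names and toy inputs are made up.

```python
# my_functions.py -- a module of Hamilton transform functions.
# A function's name defines an output; its parameter names declare its inputs.
import pandas as pd


def avg_3wk_spend(spend: pd.Series) -> pd.Series:
    """Rolling three-week average of marketing spend."""
    return spend.rolling(3).mean()


def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Marketing spend per signup."""
    return spend / signups
```

```python
# run.py -- build the DAG from the module and request specific outputs.
import pandas as pd

from hamilton import driver

import my_functions

dr = driver.Driver({}, my_functions)  # parses the module into a function graph
df = dr.execute(  # only the functions needed for the requested outputs are run
    ["avg_3wk_spend", "spend_per_signup"],
    inputs={
        "spend": pd.Series([10.0, 20.0, 30.0, 40.0]),
        "signups": pd.Series([1, 2, 4, 8]),
    },
)
print(df)
```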

A common use of Airflow is to help with machine learning/data science. Running such workloads in production often requires complex workflows. A necessary design decision with Airflow is determining how to break up the workflow into Airflow tasks. Create too many and you increase scheduling and execution overhead (e.g. moving lots of data); create too few and you get monolithic tasks that can take a while to run, but are probably more efficient to execute. The trade-off here is Airflow DAG complexity versus code complexity within each of the tasks. This makes debugging and reasoning about the workflow harder, especially if you didn’t author the initial Airflow DAG. More often than not, the initial task structure of the Airflow DAG becomes fixed, because refactoring the task code becomes prohibitive!

While simpler DAGs such as A->B->C are desirable, there is an inherent tension between the structure’s simplicity and the amount of code per task. The more code per task, the harder it is to identify points of failure, at the trade-off of potential computational efficiencies; and in the case of failures, retries grow in cost with the “size” of the task.

Airflow DAG structure choices: how many tasks? how much code per task? Image by author.

Instead, what if you could simultaneously wrangle the complexity within an Airflow task, no matter the size of the code within it, and gain the flexibility to easily change the Airflow DAG shape with minimal effort? This is where Hamilton comes in.

With Hamilton, you can replace the code within each Airflow task with a Hamilton DAG, where Hamilton handles the “micro” orchestration of the code within the task. Note: Hamilton actually enables you to logically model everything that you’d want an Airflow DAG to do. More on that below.

To use Hamilton, you load a Python module that contains your Hamilton functions, instantiate a Hamilton Driver, and execute a Hamilton DAG within an Airflow task, all in a few lines of code. By using Hamilton, you can write your data transformations at an arbitrary granularity, allowing you to inspect in greater detail what each Airflow task is doing.

Specifically, the mechanics of the code are:

  1. Import your function modules.
  2. Pass them to the Hamilton Driver to build the DAG.
  3. Then, call Driver.execute() with the outputs you want computed from the DAG you’ve defined.

Let’s look at some code that creates a single-node Airflow DAG but uses Hamilton to train and evaluate an ML model:
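(The original post embeds this code as a gist; below is a minimal reconstruction under stated assumptions: the train_evaluate module and its trained_model and evaluation_metrics outputs are hypothetical names.)

```python
# absenteeism_dag.py -- a sketch of a single-task Airflow DAG that delegates
# the "micro" orchestration within the task to Hamilton.
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 7, 1), schedule=None, catchup=False)
def absenteeism_prediction():
    @task
    def train_and_evaluate_model() -> dict:
        from hamilton import base, driver

        import train_evaluate  # hypothetical module of Hamilton functions

        dr = driver.Driver(
            {},  # Hamilton DAG configuration
            train_evaluate,
            # return a dict of outputs instead of the default DataFrame
            adapter=base.SimplePythonGraphAdapter(base.DictResult()),
        )
        # Hamilton resolves which functions to run for these outputs.
        results = dr.execute(["trained_model", "evaluation_metrics"])
        return results["evaluation_metrics"]

    train_and_evaluate_model()


absenteeism_prediction()
```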

Now, we didn’t show the Hamilton code here, but the benefits of this approach are:

  1. Unit & integration testing. Hamilton, through its naming and type annotation requirements, pushes developers to write modular Python code. This results in Python modules well-suited for unit testing. Once your Python code is unit tested, you can develop integration tests to ensure it behaves properly in your Airflow tasks. In contrast, testing code contained in an Airflow task is less trivial, especially in CI/CD settings, since it requires having access to an Airflow environment.
  2. Reuse data transformations. This approach keeps the data transformation code in Python modules, separated from the Airflow DAG file. This means the code is also runnable outside of Airflow! If you come from the analytics world, it should feel similar to developing and testing SQL queries in an external .sql file, then loading it into your Airflow Postgres operators.
  3. Reorganize your Airflow DAG easily. The lift required to change your Airflow DAG is now much lower. If you logically model everything in Hamilton, e.g. an end-to-end machine learning pipeline, it’s only a matter of determining how much of this Hamilton DAG needs to be computed in each Airflow task. For example, you can go from one monolithic Airflow task to a few tasks, or to many; all that needs to change is what you request from Hamilton, since your Hamilton DAG needn’t change at all! See the sketch after this list.
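As a sketch of point 3, re-cutting one monolithic task into two only changes the execute calls. We reuse the hypothetical train_evaluate module from above; "feature_df" is a hypothetical node name:

```python
from hamilton import base, driver

import train_evaluate  # hypothetical module, as in the sketch above

dr = driver.Driver(
    {}, train_evaluate, adapter=base.SimplePythonGraphAdapter(base.DictResult())
)

# Before: one monolithic Airflow task computes the whole pipeline.
results = dr.execute(["trained_model", "evaluation_metrics"])

# After: Airflow task 1 materializes the features...
features = dr.execute(["feature_df"])
# ...and Airflow task 2 resumes from them, short-circuiting that node.
results = dr.execute(
    ["trained_model", "evaluation_metrics"],
    overrides={"feature_df": features["feature_df"]},
)
```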

In most data science projects, it would be impossible to write the DAG of the final system from day one, since requirements will change. For example, the data science team might want to try different feature sets for their model. Until the list is set and finalized, it is probably undesirable to have the feature set in your source code and under version control; configuration files would be preferable.

Airflow supports default and runtime DAG configurations and will log these settings to make every DAG run reproducible. However, adding configurable behaviors requires adding conditional statements and complexity to your Airflow task code. This code might become obsolete during the project or only be useful in particular scenarios, ultimately decreasing your DAG’s readability.

In contrast, Hamilton can use Airflow’s runtime configuration to execute different data transformations from the function graph on the fly, as sketched below. This layered approach can greatly increase the expressivity of Airflow DAGs while maintaining structural simplicity. Alternatively, Airflow can dynamically generate new DAGs from configurations, but this can decrease observability, and some of these features remain experimental.
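For instance, an Airflow task can read the run’s configuration and hand it to the Hamilton Driver. A sketch, where the config keys and the features_module are our assumptions:

```python
from airflow.decorators import task
from airflow.operators.python import get_current_context


@task
def create_features() -> dict:
    from hamilton import base, driver

    import features_module  # hypothetical module of Hamilton functions

    context = get_current_context()
    run_config = context["dag_run"].conf or {}  # set via the Airflow UI or CLI

    # The config can switch between alternative implementations in the
    # function graph; the requested outputs select the feature set to compute.
    dr = driver.Driver(
        {"imputation": run_config.get("imputation", "mean")},
        features_module,
        adapter=base.SimplePythonGraphAdapter(base.DictResult()),
    )
    feature_set = run_config.get("feature_set", ["age", "absence_hours"])
    return dr.execute(feature_set)
```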

Airflow UI for DAG run configuration. Image by author.

If you work in a hand-off model, this approach promotes a separation of concerns between the data engineers responsible for the Airflow production system and the data scientists responsible for developing business solutions by writing Hamilton code. Having this separation can also improve data consistency and reduce code duplication. For example, a single Airflow DAG can be reused with different Hamilton modules to create different models. Similarly, the same Hamilton data transformations can be reused across different Airflow DAGs to power dashboards, APIs, applications, etc.

Below are two images. The first illustrates the high-level Airflow DAG containing two nodes. The second displays the low-level Hamilton DAG of the Python module evaluate_model imported in the Airflow task train_and_evaluate_model.

1. Airflow UI: Absenteeism Airflow DAG
2. Hamilton driver visualization: function graph for evaluate_model.py

Data science projects produce a lot of data artifacts: datasets, performance evaluations, figures, trained models, etc. The artifacts needed will change over the course of the project life cycle (data exploration, model optimization, production debugging, etc.). This is a problem for Airflow, since removing a task from a DAG will delete its metadata history and break the artifact lineage. In certain scenarios, producing unnecessary or redundant data artifacts can incur significant computation and storage costs.

Hamilton can provide the needed flexibility for data artifact generation through its data saver API. Functions decorated with @save_to.* gain the ability to store their output; one need only request this functionality via Driver.execute(). In the code below, calling validation_predictions_table will return the table, while calling the output_name_ value of save_validation_predictions will return the table and save it to .csv.
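(The original code is embedded as a gist; here is a minimal sketch of the pattern, in which the function body, the model type, and the validation_predictions_path input are our assumptions.)

```python
import pandas as pd
from sklearn.base import BaseEstimator

from hamilton.function_modifiers import save_to, source


@save_to.csv(
    path=source("validation_predictions_path"),  # file path, provided at execution
    output_name_="save_validation_predictions",  # name of the extra "saver" node
)
def validation_predictions_table(
    fit_model: BaseEstimator, validation_df: pd.DataFrame
) -> pd.DataFrame:
    """Validation-set predictions of the trained model."""
    return pd.DataFrame({"prediction": fit_model.predict(validation_df)})


# dr.execute(["validation_predictions_table"], ...)  -> returns the table only
# dr.execute(["save_validation_predictions"], ...)   -> returns the table and
#   writes it to the .csv at "validation_predictions_path"
```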

This added flexibility allows users to easily toggle which artifacts are generated, and it can be done directly through the Airflow runtime configuration, without editing the Airflow DAG or Hamilton modules.

Additionally, the fine-grained Hamilton function graph allows for precise data lineage & provenance. Utility functions what_is_downstream_of() and what_is_upstream_of() help visualize and programmatically explore data dependencies, as in the sketch below. We point readers here for more detail.
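For example, on a Hamilton Driver dr as constructed in the earlier sketches (the node names are ours):

```python
# Programmatically explore the function graph. Both calls return the
# Hamilton variables (nodes) reachable from the given node(s).
downstream = dr.what_is_downstream_of("validation_df")
upstream = dr.what_is_upstream_of("evaluation_metrics")
print(sorted(var.name for var in downstream))

# Render just the subgraph downstream of a node (requires graphviz installed).
dr.display_downstream_of("validation_df", output_file_path="./downstream.png")
```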

Hopefully by now we’ve impressed on you that combining Hamilton with Airflow will help you with Airflow’s DAG creation & maintainability challenges. Since this is a short post, to wrap things up, let’s move on to the code we have in the Hamilton repository for you.

To help you get up and running, we have an example of how to use Hamilton with Airflow. It should cover all the basics you need to get started. The README includes how to set up Airflow with Docker, so you don’t need to worry about installing dependencies just to play with the example.

As for the code in the example, it contains two Airflow DAGs: one showcasing a basic Hamilton “how-to” to create “features” for training a model, and the other a more complete machine learning project example that runs a full end-to-end pipeline of creating features and then fitting and evaluating a model. For both examples, you’ll find the Hamilton code under the plugins folder.

What you should expect to see in the Airflow example. Image by author.

If you have questions or need help, please join our Slack. Otherwise, to learn more about Hamilton’s features and functionality, we refer you to Hamilton’s documentation.

Thanks for taking a look at this post. If you want to dive deeper, or want to learn more about Hamilton, we have the following links for you to browse!

