You probably don’t need a DAG
This post was first published on Prefect’s blog.
For most data engineers, “workflow” is virtually synonymous with “DAG.” Nearly every orchestrator, most notably Airflow, requires workflows to be DAGs. There’s no shortage of attempts to explain DAGs, but everything you need to know is right there in the name: directed acyclic graph. Let’s break that down:
- Graph: Specifically, it’s a control flow graph — a set of nodes representing tasks in a workflow and edges representing dependency relationships between them.
- Acyclic: The edges cannot create a circular dependency. Once a task in the workflow has run successfully, it cannot run again in the same workflow run.
- Directed: The edges can have only one direction. There cannot be mutually dependent tasks in the workflow.
So a DAG is a model: a graph with two constraints, acyclicity and directedness. Easy enough, but knowing what the acronym stands for doesn’t tell you much about how DAGs are used in practice.
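For example, here is a minimal sketch, in plain Python rather than any orchestrator’s API, of a hypothetical four-task workflow modeled as a directed graph, with each task mapped to the set of upstream tasks it depends on:

```python
# A hypothetical four-task workflow as a directed graph: each key is a task
# (node) and each value is the set of upstream tasks it depends on (edges).
workflow = {
    "extract": set(),           # no dependencies
    "transform": {"extract"},   # runs after extract
    "validate": {"transform"},  # runs after transform
    "load": {"transform"},      # runs after transform
}
```

Every edge points one way and no path loops back on itself, so this graph satisfies both constraints.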
Most data engineers think of a DAG as a visualization of their workflow graph. Workflow graphs are great! If a picture is worth a thousand words, a workflow graph is worth a thousand lines of code. You don’t need to define a DAG, though, to create a workflow graph. Just as you can trace code as it executes, you can also visualize that execution as a workflow graph during and after runtime. Orchestrators like Airflow make you define a DAG so that they can validate that the workflow is a stable DAG and visualize it before running the code.
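To make that concrete, here is a hedged sketch of the idea: a small decorator, not any real orchestrator’s API, that records dependency edges as tasks actually run, so the graph falls out of execution rather than being declared up front.

```python
# Sketch: build a workflow graph by observing execution instead of declaring it.
from functools import wraps

edges = set()  # (upstream_task, downstream_task) pairs observed at runtime


class Result:
    """Wraps a task's return value and remembers which task produced it."""
    def __init__(self, value, producer):
        self.value = value
        self.producer = producer


def traced(fn):
    """Record an edge whenever this task consumes another task's result."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        for arg in list(args) + list(kwargs.values()):
            if isinstance(arg, Result):
                edges.add((arg.producer, fn.__name__))
        args = [a.value if isinstance(a, Result) else a for a in args]
        kwargs = {k: v.value if isinstance(v, Result) else v for k, v in kwargs.items()}
        return Result(fn(*args, **kwargs), producer=fn.__name__)
    return wrapper


@traced
def extract():
    return [1, 2, 3]


@traced
def transform(rows):
    return [r * 10 for r in rows]


@traced
def load(rows):
    return len(rows)


load(transform(extract()))
print(edges)  # e.g. {('extract', 'transform'), ('transform', 'load')}
```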
DAGs offer assurances…
Validating a DAG definition is a highly restrictive form of static code analysis. It ensures that a workflow satisfies the acyclicity and directedness constraints without running its code. That assurance may be useful in certain contexts. DAG validation ensures that if each individual task completes, the entire workflow will complete — which is useful when an unsuccessful run is costly, either in terms of time or dollars. It also ensures that the same lines of code run in the same order, the same number of times, every time — useful when guaranteeing zero variance in your workflow is paramount.
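As an illustration, the acyclicity check amounts to a topological sort of the dependency graph. Here is a sketch using Python’s standard-library graphlib against the hypothetical workflow mapping from earlier; none of the task code has to run for the check to pass or fail.

```python
# Static DAG validation sketch: topologically sort the dependency graph
# without running any task code. Requires Python 3.9+ for graphlib.
from graphlib import TopologicalSorter, CycleError

workflow = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"transform"},
}

try:
    order = list(TopologicalSorter(workflow).static_order())
    print("valid DAG; one possible run order:", order)
except CycleError as exc:
    print("not a DAG, cycle detected:", exc)
```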
…but with substantial costs
When you hand control of code execution to an orchestrator via a DAG, you lose the ability to use standard “control flow” statements in your code. The DAG, and only the DAG, defines when your code runs. That means no `for` and `while` loops. It also means no conditional or reactive behavior at runtime. Even something as simple as an `if` statement typically requires a dedicated “operator” in a DAG.
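For contrast, here is a hedged sketch of a plain-Python workflow that uses ordinary control flow. The helper names (`fetch_batch`, `push_to_destination`, `is_weekend`) are hypothetical stand-ins; the point is the loop, branch, and retry constructs that a static DAG definition can’t express directly.

```python
# A plain-Python workflow sketch using ordinary control flow. The helpers are
# hypothetical stand-ins; the point is the `for`, `while`, and `if` statements.
import datetime


def is_weekend() -> bool:
    return datetime.date.today().weekday() >= 5


def fetch_batch(day: int) -> list[int]:
    return list(range(day))  # stand-in for a real extract


def push_to_destination(rows: list[int], attempt: int) -> bool:
    # Hypothetical flaky destination: the first attempt always times out.
    return attempt > 1


def workflow() -> None:
    batches = []
    for day in range(0, 4):              # a plain for loop over dynamic work
        batch = fetch_batch(day)
        if not batch:                    # a plain if, no branch operator needed
            continue
        batches.append(batch)
    if is_weekend():                     # conditional behavior decided at runtime
        batches = [[row for b in batches for row in b]]  # "full refresh" path
    for batch in batches:
        attempt = 1
        while not push_to_destination(batch, attempt):   # a plain while loop for retries
            attempt += 1
            if attempt > 3:
                raise RuntimeError("load failed after 3 attempts")
        print(f"loaded {len(batch)} rows on attempt {attempt}")


workflow()
```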
DAG validation forces you to write unnatural code, resulting in unnecessary duplication & cruft, and pushing critical control logic into config files. DAG definitions can be painful to write and, even worse, difficult to read. Code is read many more times than it is written. DAGs make code harder to maintain and easier to break.
With DAG validation, you lose the benefits of an entire ecosystem of tools that help you avoid and troubleshoot mistakes. Your IDE’s autocomplete, your code’s supporting libraries, your repository’s linter and your test harness all expect pure code. They may break entirely or provide incorrect information when applied to a DAG.
You probably shouldn’t impose a DAG on your code if there isn’t a compelling reason. It is hard enough to write code that reflects your intended business logic. Don’t make it harder with a DAG if you don’t have to. Pure code is usually the best way to model a workflow.
Path dependency, not first principles
So if the benefits of DAGs aren’t usually worth the costs, why did they become ubiquitous?
In the “big data” era of the late 2000s, people got excited about batch data processing with MapReduce. MapReduce enabled data processing to be distributed across many machines, but required that code be divided into individual tasks that could be executed independently. The early orchestrators managed this execution by expressing workflows, usually called jobs, in YAML. Some still do today. YAML, a static configuration language, can’t express loops or mutual dependencies, so every workflow was a DAG, whether the author intended it or not.
Later, when Airflow came along, it dropped YAML for Python-based workflows, but kept the DAG definition requirement. It was only natural. After all, at that time, orchestration was still mainly for MapReduce jobs, which had always been DAGs. Python, a Turing-complete programming language, is much more expressive than YAML. Workflows got smarter, but the guardrails stayed the same.
Workflows don’t have to be DAGs
Airflow made an entire generation of data engineers think that they had to define a DAG to create a workflow graph, but it just ain’t so. When you absolutely must guard against any runtime mistakes or deviations, DAGs can be worth it. However, DAGs come at a cost to code capability, expressiveness, and readability. Ask yourself if it’s worth it for your workflows.