What is Apache Airflow? An Overview


Apache Airflow is a “data orchestration” platform that data engineers use to schedule and coordinate the steps in a data pipeline or data workflow. It is one of the most popular open-source orchestration platforms.

Airflow lets you define workflows in Python code and provides a rich UI to manage and monitor them. It also integrates easily with a plethora of popular external interfaces such as databases (SQL and MongoDB), SSH, FTP, cloud providers, etc.

This post describes the fundamentals of Airflow.

DAG

Airflow represents the flow and dependencies of tasks in a data pipeline as a directed acyclic graph (DAG). The graph is directed, meaning every dependency points one way: the workflow starts with one or more tasks and flows toward one or more final tasks. The graph must also be acyclic, meaning a task cannot point back to a previously completed task. In other words, the workflow can never cycle back on itself.

A DAG is defined in a Python (.py) file that describes the steps in our workflow. Each DAG has some common configuration plus the details of which task needs to be executed at each step.
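
As a rough illustration (not taken from any official example), a minimal DAG file could look something like the following; the dag_id, schedule, and start date are placeholder values.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator on Airflow < 2.3

with DAG(
    dag_id="example_pipeline",        # placeholder name
    start_date=datetime(2023, 1, 1),  # placeholder date
    schedule="@daily",                # `schedule_interval` on Airflow < 2.4
    catchup=False,
) as dag:
    # A single placeholder task; real pipelines replace this with
    # operators that do actual work.
    start = EmptyOperator(task_id="start")
```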

Tasks

Each step of an Airflow DAG is called a Task. You can define relationships/dependencies between tasks. For example: Task4 should run after Task2, and Task5 should run only after both Task3 and Task4 have completed successfully.
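
To make that concrete, here is a sketch of how such dependencies could be wired up inside the DAG file sketched above, using placeholder EmptyOperator tasks; the task names simply mirror the example.

```python
# Inside the `with DAG(...)` block from the sketch above:
task2 = EmptyOperator(task_id="task2")
task3 = EmptyOperator(task_id="task3")
task4 = EmptyOperator(task_id="task4")
task5 = EmptyOperator(task_id="task5")

# Task4 runs after Task2; Task5 runs only after both Task3 and Task4.
task2 >> task4
[task3, task4] >> task5
```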

Operators

Tasks are executed using Airflow operators. Operators are what actually execute the scripts, commands, and other operations when a Task runs. A number of operators ship with Airflow, and the Airflow community has created countless custom ones. Here are some of the most common operators, followed by a short usage sketch.

  • BashOperator
  • PythonOperator
  • SimpleHttpOperator
  • EmailOperator
  • MySqlOperator, PostgresOperator
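
For illustration, here is a minimal sketch using two of these operators; the dag_id, task ids, and the callable are made up for the example.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def _print_greeting():
    # Hypothetical callable used only for this example.
    print("Hello from Airflow!")


with DAG(dag_id="operator_examples", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    # BashOperator runs a shell command on the worker.
    say_date = BashOperator(task_id="say_date", bash_command="date")

    # PythonOperator calls a Python function.
    greet = PythonOperator(task_id="greet", python_callable=_print_greeting)

    say_date >> greet
```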

The Airflow Database

The metadata database is a core component of Airflow. It stores critical information such as the roles and permissions configured for your Airflow environment, as well as all metadata for past and present DAG and task executions. By default, Airflow uses a SQLite database, but you can point it to MySQL, Postgres, or a number of other databases.

Losing data stored in the metadata database can both interfere with running DAGs and prevent you from accessing data for older DAG executions.

Connections & Hooks

Hooks are interfaces to services external to Airflow. They provide a uniform way to access external services such as S3, MySQL, HDFS, Postgres, etc., and they are the building blocks operators use to interact with those services.
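
As a hedged sketch, a hook can be used directly inside a task's Python callable. This assumes the Postgres provider package is installed and that a connection named "my_postgres" (a made-up name) has been configured; the table and query are also hypothetical.

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook


def fetch_row_count():
    # Look up the "my_postgres" connection from Airflow's connection store.
    hook = PostgresHook(postgres_conn_id="my_postgres")
    # Run a simple query through the hook; table name is illustrative only.
    records = hook.get_records("SELECT COUNT(*) FROM my_table")
    print(records[0][0])
```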

The Airflow Scheduler

The Airflow Scheduler constantly monitors DAGs and, every few seconds, triggers any tasks whose dependencies have been met. The scheduler also works with an internal component called the Executor, which is responsible for spinning up workers and running each task to completion.

Airflow Executors

Executors are what Airflow uses to run the tasks that the Scheduler determines are ready to run. By default, Airflow uses the SequentialExecutor. To run workloads at scale, Airflow also provides executors such as the CeleryExecutor, DaskExecutor, and KubernetesExecutor.
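
As a small illustration, you can check which executor an environment is configured with through Airflow's configuration API; the value printed is simply whatever your airflow.cfg (or the AIRFLOW__CORE__EXECUTOR environment variable) sets.

```python
from airflow.configuration import conf

# Reads the [core] executor option for the current environment,
# e.g. "SequentialExecutor" on a default installation.
print(conf.get("core", "executor"))
```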

XComs

XCom (short for cross-communication) is the mechanism that lets tasks pass small pieces of data between one another.
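
Here is a minimal sketch of XCom in action: the producing task's return value is pushed to XCom automatically, and the consuming task pulls it. The dag_id, task ids, and the value 42 are made up for the example.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _produce():
    # The return value of a PythonOperator callable is pushed to XCom
    # automatically under the key "return_value".
    return 42


def _consume(ti):
    # Pull the value pushed by the "produce" task.
    value = ti.xcom_pull(task_ids="produce")
    print(f"Received {value} via XCom")


with DAG(dag_id="xcom_example", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    produce = PythonOperator(task_id="produce", python_callable=_produce)
    consume = PythonOperator(task_id="consume", python_callable=_consume)
    produce >> consume
```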

Where to Learn More

Go through the official Airflow Tutorial.

Learn more about scheduling DAGs.