P.S. I'm not trolling. I'm genuinely trying to get a sense of why and when I would use Airflow. Is it a matter of scalability, of productivity, etc.?
For example, the positioning of Spark is simple: scalability. Celery is also very clear: simplicity, with good enough robustness if you use the RabbitMQ backend.
In a modern data team, Spark is just one of the types of jobs you may want to orchestrate. Typically, as your company gets more tangled in data processing, you'll have many storage and compute engines that you'll have to orchestrate: Hive, MySQL, Presto, HBase, map/reduce, Cascading/Scalding, scripts, external integrations, R, Druid, Redshift, microservices, ...
Airflow allows you to orchestrate all of this and keep most of the code and high-level operations in one place.
Of course, Spark has its own internal DAG and can to some extent act like Airflow and trigger some of these other things, but that typically breaks down once you have a growing array of Spark jobs and want to keep a holistic view.
Airflow can use Celery (through its Celery executor) to scale its execution horizontally. The Airflow scheduler takes care of which tasks to run and in what order, but also of what to do when they fail, need to be retried, don't need to run at all, or need to be backfilled for past dates.
To Airflow, Spark is just one of the engines where a transformation of data can happen.
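To make the scheduler's role a bit more concrete, here is a toy sketch in plain Python of "run tasks in dependency order, retry failures, and skip anything downstream of a failure". It is not Airflow's API or implementation: run_dag, the task names, and the state strings are all made up for illustration, and the Spark step is just a stand-in function.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps, max_retries=2):
    """Toy orchestrator (hypothetical, not Airflow code).

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> set of upstream task names
    Returns a dict mapping task name -> "success", "failed" or "upstream_failed".
    """
    state = {}
    graph = {name: deps.get(name, set()) for name in tasks}
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(graph).static_order():
        if any(state[up] != "success" for up in graph[name]):
            state[name] = "upstream_failed"  # never attempted
            continue
        for _attempt in range(1 + max_retries):
            try:
                tasks[name]()
                state[name] = "success"
                break
            except Exception:
                state[name] = "failed"  # stays "failed" if retries run out
    return state


# Example: a Spark step is just one node among several engines.
calls = []
attempts = {"n": 0}


def flaky_spark_job():
    # Stand-in for submitting a Spark job; fails once, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise RuntimeError("transient failure")
    calls.append("spark_aggregate")


tasks = {
    "extract_mysql": lambda: calls.append("extract_mysql"),
    "spark_aggregate": flaky_spark_job,
    "load_redshift": lambda: calls.append("load_redshift"),
}
deps = {
    "spark_aggregate": {"extract_mysql"},
    "load_redshift": {"spark_aggregate"},
}
result = run_dag(tasks, deps)
```

The point of the sketch is only that ordering, retries and failure propagation live outside the individual jobs. A real orchestrator adds schedules, backfills, distributed execution and a UI on top of this idea.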
I definitely get where you're coming from. At Astronomer, we use both Airflow and Spark, though Spark is very new to me.
For us, Airflow manages workflows and task dependencies, but all of the actual work is done externally. Each task (operator) runs an arbitrary dockerized command, with inputs and outputs passed over XCom. Note that we use a custom Mesos executor instead of the Celery executor. An Airflow DAG might kick off a different Spark job depending on the results of upstream tasks.
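As a rough illustration of "a different Spark job based on upstream tasks", the decision itself can be a small function over whatever the upstream task reported. This is a plain-Python sketch, not our actual code: choose_spark_job, the job names and the keys row_count and schema_changed are all hypothetical. In Airflow, this kind of choice is typically made by a branching operator such as BranchPythonOperator, with the upstream value read from XCom.

```python
def choose_spark_job(upstream_output):
    """Pick the name of the downstream job from an upstream task's output.

    upstream_output is a dict such as an upstream task might publish;
    the keys and job names used here are illustrative assumptions.
    """
    if upstream_output.get("row_count", 0) == 0:
        return "skip_spark"  # nothing new to process
    if upstream_output.get("schema_changed", False):
        return "spark_full_rebuild"  # schema drift: recompute everything
    return "spark_incremental"  # normal case: process only new rows


# Example decisions for three different upstream results.
decisions = [
    choose_spark_job({"row_count": 0}),
    choose_spark_job({"row_count": 1200, "schema_changed": True}),
    choose_spark_job({"row_count": 1200}),
]
```

Keeping the decision in the DAG rather than inside a Spark job means the choice shows up in the same place as the rest of the workflow.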
What does Airflow do differently?