In the dynamic landscape of data engineering and analytics, building scalable and automated pipelines is paramount.
Spark enthusiasts who have been working with Airflow for a while may be wondering:
How do you execute a Spark job on a remote cluster using Airflow?
How do you automate Spark pipelines with AWS EMR and Airflow?
In this tutorial we are going to integrate these two technologies by showing how to:
- Configure and fetch essential parameters from the Airflow UI.
- Create helper functions to automatically generate the desired `spark-submit` command.
- Use Airflow's `EmrAddStepsOperator()` to build a task that submits and executes a PySpark job on EMR.
- Use Airflow's `EmrStepSensor()` to monitor the script execution.
The code used in this tutorial is available on GitHub.
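As a preview of what we will build, here is a minimal sketch of how `EmrAddStepsOperator()` and `EmrStepSensor()` are typically wired together, assuming Airflow 2.x with the `apache-airflow-providers-amazon` package installed. The cluster ID, script path and connection ID are placeholders; the real values and the full DAG are developed step by step in the rest of the tutorial.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Illustrative step definition: one spark-submit step running a PySpark
# script already uploaded to S3 (bucket and script path are placeholders).
SPARK_STEPS = [
    {
        "Name": "run_pyspark_job",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-tutorial-bucket/scripts/job.py",
            ],
        },
    }
]

with DAG(
    dag_id="emr_spark_preview",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Submit the step to the existing EMR cluster (placeholder job flow ID).
    add_step = EmrAddStepsOperator(
        task_id="add_step",
        job_flow_id="j-XXXXXXXXXXXXX",
        aws_conn_id="aws_default",
        steps=SPARK_STEPS,
    )

    # Poll the submitted step until it reaches a terminal state.
    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id="j-XXXXXXXXXXXXX",
        step_id="{{ task_instance.xcom_pull(task_ids='add_step', key='return_value')[0] }}",
        aws_conn_id="aws_default",
    )

    add_step >> watch_step
```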
To follow along, you will need:

- An AWS account with an S3 bucket and an EMR cluster configured in the same region (in this case `eu-north-1`). The EMR cluster should be accessible and in the `WAITING` state; in our case it has been named `emr-cluster-tutorial`.
- Some mock `balances` data already available in the `S3` bucket under the `src/balances` folder. The data can be generated and written to that location using the data producer script (a minimal sketch follows this list).
- The required `JARs` should already be downloaded from Maven and available in the `S3` bucket.
- Docker installed and running on the local machine with 4-6 GB of allocated memory.
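For reference, a data producer could look like the minimal sketch below: it writes a small `balances` dataset in Parquet format to the `src/balances` prefix of the bucket. The bucket name, columns and values are illustrative assumptions, not necessarily the exact schema produced by the tutorial's producer script.

```python
import pandas as pd  # also requires pyarrow and s3fs installed locally

# Illustrative mock balances data (schema and values are assumptions).
balances = pd.DataFrame(
    {
        "account_id": ["A-001", "A-002", "A-003"],
        "balance": [1250.50, 830.00, 99.99],
        "currency": ["EUR", "EUR", "SEK"],
        "as_of_date": pd.to_datetime(["2023-01-01"] * 3),
    }
)

# Write the mock data as Parquet under the src/balances prefix of the
# S3 bucket (bucket name is a placeholder; AWS credentials are taken
# from the standard environment/profile configuration).
balances.to_parquet(
    "s3://my-tutorial-bucket/src/balances/balances.parquet",
    index=False,
)
```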
The goal is to write some mock data in `parquet` format to an `S3` bucket and then build a `DAG` that:

- Fetches the required configuration from the Airflow UI;
- Uploads a `pyspark` script to the same `S3` bucket;