MWAA vs. AWS Glue for Automating ETL Workflows
ETL, short for Extract, Transform, Load, is the process of extracting data from one or more sources, cleaning and transforming it, and finally loading it into a centralized data warehouse. An organization may store its data in many different sources, and the data coming from those sources is rarely uniform in format or evenly distributed. ETL solves this problem by integrating the various sources into one uniform data warehouse. The data becomes more accessible and easier to use, and analysis and data mapping become faster and more robust.
Various ETL tools on the market make this process hassle-free and resource-efficient, and many require no coding experience. Amazon Web Services (AWS) is one platform that offers ETL tools that need no infrastructure setup on your side and are billed on a pay-as-you-go basis. The two AWS tools compared here are:
- AWS Glue
- Amazon Managed Workflows for Apache Airflow (MWAA)
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
Amazon Web Services, Inc. (AWS), a subsidiary of Amazon, offers governments, businesses, and individuals access to on-demand cloud computing platforms and APIs on a metered, pay-per-use basis. Through AWS server farms, these cloud computing web services offer software tools and processing capabilities for distributed computing.
Amazon Managed Workflows for Apache Airflow
Amazon Managed Workflows for Apache Airflow (MWAA) is a managed orchestration service for Apache Airflow that makes it easier to set up and operate end-to-end data pipelines in the cloud at scale.
- Server vs. Serverless
MWAA is server-based, while AWS Glue is serverless. MWAA therefore needs a complete server environment to be set up before it can run Airflow DAGs. AWS provides this environment as part of MWAA, but the infrastructure comes at a significant price, as discussed below.
Cloud platform providers offer serverless services, and people choose them for cost efficiency, among other reasons. Apache Airflow was designed to run on servers, which means that even when there is no job to run, your Airflow resources stay active and incur costs during idle hours. MWAA is still server-based, but it gives you a way to reduce costs through auto-scaling. Even so, you cannot avoid running servers during idle time. AWS Glue, by contrast, is serverless and event-driven: it allocates resources only when a job is triggered, and during idle time it uses no resources and incurs no cost.
In AWS Glue, you are charged per DPU (Data Processing Unit) multiplied by the hours of usage. The number of DPUs is determined differently depending on the type of job you run.
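The "DPUs multiplied by usage hours" rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `glue_job_cost` is hypothetical, and the default rate of 0.44 USD per DPU-hour is an assumed example value, so check the current AWS Glue pricing for your region and job type before relying on it.

```python
def glue_job_cost(dpus: float, runtime_seconds: float,
                  price_per_dpu_hour: float = 0.44) -> float:
    """Estimate the cost of one Glue job run as DPUs x hours x hourly rate.

    The default rate is an assumed example value, not an authoritative price.
    """
    if dpus < 0 or runtime_seconds < 0 or price_per_dpu_hour < 0:
        raise ValueError("inputs must be non-negative")
    hours = runtime_seconds / 3600  # convert the runtime to hours
    return dpus * hours * price_per_dpu_hour


if __name__ == "__main__":
    # A job using 10 DPUs for 30 minutes at the assumed rate.
    cost = glue_job_cost(dpus=10, runtime_seconds=30 * 60)
    print(f"{cost:.2f}")  # prints 2.20
```

Because the job is billed only while it runs, the estimate drops to zero when nothing is triggered, which is the idle-time advantage described above.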
In MWAA, the environment is run by a set of workers; if multiple DAGs are scheduled to run simultaneously, server resources such as workers are divided among them. This becomes a drawback when several compute-intensive workloads are expected to run at the same time. AWS Glue handles this scenario far better: each ETL flow has its own isolated environment, so heavy workflows do not affect other jobs running simultaneously.
- Support for external libraries
AWS Glue provides an environment in which PySpark or Python code runs your workflow. This environment supports only a limited set of libraries out of the box, so to use any other external library you must first install the package, zip it, and store it in S3.
MWAA, by contrast, lets you list any library you need in the requirements file, and you are good to go from there.
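The Glue packaging step mentioned above (zip the package, then store it in S3) could look roughly like the sketch below. It covers only the zipping part, using the Python standard library. The helper name `zip_package` and the package name `mylib` are hypothetical, and the sketch assumes a pure-Python package. Uploading the archive to S3 and pointing the job at it (for example through Glue's `--extra-py-files` job parameter) is not shown.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir: str, zip_path: str) -> list[str]:
    """Zip the .py files of a package, keeping the package folder at the
    top level of the archive so it can be imported by name.

    Returns the archive member names that were written.
    """
    package = Path(package_dir)
    root = package.parent  # paths inside the zip are relative to this folder
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package.rglob("*.py")):
            arcname = path.relative_to(root).as_posix()
            archive.write(path, arcname)
            names.append(arcname)
    return names


# Hypothetical usage: zip_package("build/mylib", "build/mylib.zip")
# The resulting mylib.zip is the file you would then store in S3.
```

With MWAA, the equivalent step is a single line in the requirements file naming the library.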
- Time Delays
As stated above, AWS Glue requires us to package external libraries. When a job uses such a library, the Glue job takes longer to start, because the library has to be installed on every run. Glue jobs also take some extra time to start even in ordinary cases, while MWAA is free from such delays.
- Language and syntax
MWAA lets you write your code in Python, but to run as a workflow the code must be encapsulated in a DAG. Each DAG is made up of one or more tasks, and each task contains a piece of functionality that makes up the workflow. These tasks are implemented with building blocks from the Airflow Python library, such as operators and hooks.
AWS Glue keeps you away from these extra code details. You can work in either of the two environments that Glue supports, Spark and Python, and you write the workflow code in the chosen language without any new or additional syntax.
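To make the DAG, task, and operator terminology from the MWAA side concrete, here is a minimal sketch of an Airflow DAG file. It assumes Airflow 2.4 or later (for the `schedule` argument). The DAG id `example_etl` and the three placeholder functions are hypothetical. The import guard is there only so the plain task functions can be tried on a machine without Airflow installed; a DAG file deployed to MWAA would normally import Airflow directly.

```python
from datetime import datetime


# Plain Python callables; each one becomes a task in the DAG below.
def extract():
    """Stand-in for reading rows from a source system."""
    return "extracted"


def transform():
    """Stand-in for cleaning and reshaping the rows."""
    return "transformed"


def load():
    """Stand-in for writing the rows to the warehouse."""
    return "loaded"


try:
    from airflow import DAG
    from airflow.operators.python import PythonOperator
except ImportError:
    # Airflow is not available here; skip the DAG wiring.
    DAG = None

if DAG is not None:
    with DAG(
        dag_id="example_etl",            # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule=None,                   # run only when triggered
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)

        # Dependencies: extract runs first, then transform, then load.
        extract_task >> transform_task >> load_task
```

The `with DAG(...)` block and the `>>` dependency syntax are the Airflow-specific parts you write on top of your own Python code. A Glue job script needs no such wrapper.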
Each service has its pros and cons, and the right choice depends on your architecture and problem statement. The points above are some of the essential ones to evaluate when gathering requirements for your project.