Customers who use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) often need Python dependencies that are hosted in private code repositories. Many customers opt for public network access mode for its ease of use and ability to make outbound internet requests, all while maintaining secure access. However, private code repositories may not be accessible via the internet. It's also a best practice to install Python dependencies only where they're needed. You can use Amazon MWAA startup scripts to selectively install the Python dependencies required for running code on workers, while avoiding issues due to web server restrictions.
This post demonstrates a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository that is only accessible from your virtual private cloud (VPC).
Solution overview
This solution focuses on using a private Git repository to selectively install Python dependencies, although you can use the same pattern demonstrated in this post with private Python package indexes such as AWS CodeArtifact. For more information, refer to Amazon MWAA with AWS CodeArtifact for Python dependencies.
The Amazon MWAA architecture allows you to choose a web server access mode to control whether the web server is accessible from the internet or only from your VPC. You can also control whether your workers, schedulers, and web servers have access to the internet through your customer VPC configuration. In this post, we demonstrate an environment such as the one shown in the following diagram, where the environment uses public network access mode for the web servers, and the Apache Airflow workers and schedulers don't have a path to the internet from your VPC.
There are up to four possible networking configurations for an Amazon MWAA environment:
- Public routing and public web server access mode
- Private routing and public web server access mode (pictured in the preceding diagram)
- Public routing and private web server access mode
- Private routing and private web server access mode
We focus on one networking configuration for this post, but the fundamental concepts are applicable to any networking configuration.
The solution we walk through relies on the fact that Amazon MWAA runs a startup script (startup.sh) during startup on each individual Apache Airflow component (worker, scheduler, and web server) before installing requirements (requirements.txt) and initializing the Apache Airflow process. The startup script is used to set an environment variable, which is then referenced in the requirements.txt file to selectively install libraries.
The following steps allow us to accomplish this:

- Create and install the startup script (startup.sh) in the Amazon MWAA environment. This script sets the environment variable for selectively installing dependencies.
- Create and install global Python dependencies (requirements.txt) in the Amazon MWAA environment. This file contains the global dependencies required by all Amazon MWAA components.
- Create and install component-specific Python dependencies in the Amazon MWAA environment. This step involves creating separate requirements files for each component type (worker, scheduler, web server) to selectively install the required dependencies.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account
- An Amazon MWAA environment deployed with public access mode for the web server
- Versioning enabled for your Amazon MWAA environment's Amazon Simple Storage Service (Amazon S3) bucket
- Amazon CloudWatch logging enabled at the INFO level for the worker and web server
- A Git repository accessible from within your VPC
Additionally, we add a sample Python package to the Git repository:
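The original commands aren't reproduced here, so the following is a minimal sketch of mirroring a public package (scrapy, used later in this post) into a private repository. The private repository URL is a placeholder; substitute your own.

```shell
# Clone the public scrapy repository, then push a mirror of it to a
# private Git repository that is reachable from inside the VPC.
# https://git.example.internal/... is a placeholder URL.
git clone --bare https://github.com/scrapy/scrapy.git
cd scrapy.git
git push --mirror https://git.example.internal/mirrors/scrapy.git
```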
Create and install the startup script in the Amazon MWAA environment
Create the startup.sh file using the following example code:
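The post's original script isn't included here, so this is a minimal sketch. It relies on the MWAA-provided MWAA_AIRFLOW_COMPONENT variable, which is set to worker, scheduler, or webserver on each component; the variable name ENVIRONMENT_REQS and the component_reqs.txt naming convention are assumptions used throughout this walkthrough.

```shell
#!/bin/sh
# MWAA sets MWAA_AIRFLOW_COMPONENT to worker, scheduler, or webserver
# on each Apache Airflow component before this script runs.
echo "Airflow component: ${MWAA_AIRFLOW_COMPONENT}"

# Point ENVIRONMENT_REQS (an assumed variable name) at the requirements
# file for this component, stored in the environment's DAGs folder.
export ENVIRONMENT_REQS="/usr/local/airflow/dags/${MWAA_AIRFLOW_COMPONENT}_reqs.txt"
echo "Component requirements file: ${ENVIRONMENT_REQS}"
```

Because the variable is exported before requirements installation, pip can expand it when it processes requirements.txt.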
Upload startup.sh to the S3 bucket for your Amazon MWAA environment:
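A sketch of the upload and environment update, assuming a placeholder bucket name (my-mwaa-bucket) and environment name (my-mwaa-env); substitute your own:

```shell
# Copy the startup script to the environment's S3 bucket.
aws s3 cp startup.sh s3://my-mwaa-bucket/startup.sh

# Versioning is enabled on the bucket, so retrieve the latest version ID
# and point the environment at that specific version of the script.
version=$(aws s3api list-object-versions --bucket my-mwaa-bucket \
  --prefix startup.sh --query 'Versions[?IsLatest].VersionId' --output text)
aws mwaa update-environment --name my-mwaa-env \
  --startup-script-s3-path startup.sh \
  --startup-script-s3-object-version "$version"
```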
Browse the CloudWatch log streams for your workers and view the worker_console log. Notice the startup script is now running and setting the environment variable.
Create and install global Python dependencies in the Amazon MWAA environment
Your requirements file must include a --constraint statement to make sure the packages listed in your requirements are compatible with the version of Apache Airflow you are using. The statement beginning with -r references the environment variable you set in your startup.sh script based on the component type. The following code is an example of the requirements.txt file:
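A sketch of the file, assuming the startup script exports a variable named ENVIRONMENT_REQS that points at the component-specific requirements file; pip expands ${VAR} references in requirements files. The constraint URL depends on your Apache Airflow and Python versions, so adjust it to match your environment.

```
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.6.3/constraints-3.10.txt"
-r ${ENVIRONMENT_REQS}
```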
Upload the requirements.txt file to the Amazon MWAA environment's S3 bucket:
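A sketch of the upload, with a placeholder bucket name:

```shell
# Copy the global requirements file to the environment's S3 bucket.
aws s3 cp requirements.txt s3://my-mwaa-bucket/requirements.txt
```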
Create and install component-specific Python dependencies in the Amazon MWAA environment
For this example, we want to install the Python package scrapy on workers and schedulers from our private Git repository. We also want to install pprintpp on the web server from the public Python package index. To accomplish that, we need to create the following files (we provide example code):
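A sketch of the three files; the private Git URL is a placeholder, and the packages are left unpinned for brevity:

```
# worker_reqs.txt and scheduler_reqs.txt:
# install scrapy from the private Git repository (placeholder URL)
git+https://git.example.internal/mirrors/scrapy.git@master

# webserver_reqs.txt:
# install pprintpp from the public Python package index
pprintpp
```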
Upload webserver_reqs.txt, scheduler_reqs.txt, and worker_reqs.txt to the DAGs folder for the Amazon MWAA environment:
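A sketch of the uploads, assuming the DAGs folder is the dags/ prefix of a placeholder bucket:

```shell
# Copy each component-specific requirements file to the DAGs folder,
# which is mounted at /usr/local/airflow/dags on each component.
aws s3 cp webserver_reqs.txt s3://my-mwaa-bucket/dags/webserver_reqs.txt
aws s3 cp scheduler_reqs.txt s3://my-mwaa-bucket/dags/scheduler_reqs.txt
aws s3 cp worker_reqs.txt s3://my-mwaa-bucket/dags/worker_reqs.txt
```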
Update the environment with the new requirements file and observe the results
Get the latest object version for the requirements file:
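A sketch using a placeholder bucket name:

```shell
# List object versions for requirements.txt and print the latest version ID.
aws s3api list-object-versions --bucket my-mwaa-bucket \
  --prefix requirements.txt \
  --query 'Versions[?IsLatest].VersionId' --output text
```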
Update the Amazon MWAA environment to use the new requirements.txt file:
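A sketch of the update, with placeholder environment name and version ID (use the version ID returned in the previous step):

```shell
# Point the environment at the specific version of requirements.txt.
# Updating the environment triggers a restart of its components.
aws mwaa update-environment --name my-mwaa-env \
  --requirements-s3-path requirements.txt \
  --requirements-s3-object-version EXAMPLEVERSIONID123
```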
Browse the CloudWatch log streams for your workers and view the requirements_install log. Notice the requirements are now installed based on the environment variable set by the startup script.
Conclusion
In this post, we demonstrated a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your VPC.
We hope this post provided you with a better understanding of how startup scripts and Python dependency management work in an Amazon MWAA environment. You can implement other variations and configurations using the concepts outlined in this post, depending on your specific network setup and requirements.
About the Author
Tim Wilhoit is a Sr. Solutions Architect for the Department of Defense at AWS. Tim has over 20 years of enterprise IT experience. His areas of interest are serverless computing and ML/AI. In his spare time, Tim enjoys spending time at the lake and rooting for the Oklahoma State Cowboys. Go Pokes!