Access private code repositories for installing Python dependencies on Amazon MWAA

Customers who use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) often need Python dependencies that are hosted in private code repositories. Many customers opt for public network access mode for its ease of use and ability to make outbound internet requests, all while maintaining secure access. However, private code repositories may not be accessible via the internet. It's also a best practice to only install Python dependencies where they are needed. You can use Amazon MWAA startup scripts to selectively install the Python dependencies required for running code on workers, while avoiding issues due to web server restrictions.

This post demonstrates a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your virtual private cloud (VPC).

Solution overview

This solution focuses on using a private Git repository to selectively install Python dependencies, although you can use the same pattern demonstrated in this post with private Python package indexes such as AWS CodeArtifact. For more information, refer to Amazon MWAA with AWS CodeArtifact for Python dependencies.

The Amazon MWAA architecture allows you to choose a web server access mode to control whether the web server is accessible from the internet or only from your VPC. You can also control whether your workers, schedulers, and web servers have access to the internet through your customer VPC configuration. In this post, we demonstrate an environment such as the one shown in the following diagram, where the environment is using public network access mode for the web servers, and the Apache Airflow workers and schedulers don't have a path to the internet from your VPC.

[Diagram: mwaa-architecture]

There are up to four possible networking configurations for an Amazon MWAA environment:

  • Public routing and public web server access mode
  • Private routing and public web server access mode (pictured in the preceding diagram)
  • Public routing and private web server access mode
  • Private routing and private web server access mode

We focus on one networking configuration for this post, but the fundamental concepts are applicable to any networking configuration.

The solution we walk through relies on the fact that Amazon MWAA runs a startup script (startup.sh) during startup on every individual Apache Airflow component (worker, scheduler, and web server) before installing requirements (requirements.txt) and initializing the Apache Airflow process. This startup script is used to set an environment variable, which is then referenced in the requirements.txt file to selectively install libraries.

The following steps allow us to accomplish this:

  1. Create and install the startup script (startup.sh) in the Amazon MWAA environment. This script sets the environment variable for selectively installing dependencies.
  2. Create and install global Python dependencies (requirements.txt) in the Amazon MWAA environment. This file contains the global dependencies required by all Amazon MWAA components.
  3. Create and install component-specific Python dependencies in the Amazon MWAA environment. This step involves creating separate requirements files for each component type (worker, scheduler, web server) to selectively install the required dependencies.
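The mechanism behind these steps can be sketched locally: the startup script exports a variable, and pip expands `${EXTENDED_REQUIREMENTS}` when it processes the `-r` line in the requirements file. The following is a minimal simulation, under the assumption that the variable is exported before pip runs (as it is on MWAA):

```shell
# Local sketch of the mechanism (not run on MWAA): the startup script
# exports EXTENDED_REQUIREMENTS, and pip expands ${VAR} references when
# it reads the -r line in requirements.txt.
export EXTENDED_REQUIREMENTS="worker_reqs.txt"

# This is the line pip effectively resolves on a worker:
printf -- '-r /usr/local/airflow/dags/%s\n' "${EXTENDED_REQUIREMENTS}"
```

On a worker this resolves to `-r /usr/local/airflow/dags/worker_reqs.txt`, so each component pulls in only its own extra requirements file.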

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • An Amazon MWAA environment deployed with public access mode for the web server
  • Versioning enabled for your Amazon MWAA environment's Amazon Simple Storage Service (Amazon S3) bucket
  • Amazon CloudWatch logging enabled at the INFO level for worker and web server
  • A Git repository accessible from within your VPC

Additionally, we add a sample Python package to the Git repository:

# Clone the public scrapy package and mirror it into the private repository
git clone https://github.com/scrapy/scrapy
git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy scrapylocal
rm -rf ./scrapy/.git*
cp -r ./scrapy/* ./scrapylocal
cd scrapylocal
git add --all
git commit -m "first commit"
git push

Create and install the startup script in the Amazon MWAA environment

Create the startup.sh file using the following example code:

#!/bin/sh

echo "Printing Apache Airflow component"
echo $MWAA_AIRFLOW_COMPONENT

if [[ "${MWAA_AIRFLOW_COMPONENT}" != "webserver" ]]
then
    sudo yum -y install libaio
fi

if [[ "${MWAA_AIRFLOW_COMPONENT}" == "webserver" ]]
then
    echo "Setting extended Python requirements for web servers"
    export EXTENDED_REQUIREMENTS="webserver_reqs.txt"
fi

if [[ "${MWAA_AIRFLOW_COMPONENT}" == "worker" ]]
then
    echo "Setting extended Python requirements for workers"
    export EXTENDED_REQUIREMENTS="worker_reqs.txt"
fi

if [[ "${MWAA_AIRFLOW_COMPONENT}" == "scheduler" ]]
then
    echo "Setting extended Python requirements for schedulers"
    export EXTENDED_REQUIREMENTS="scheduler_reqs.txt"
fi
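Before uploading, you can sanity-check the script's branching logic on your own machine by setting `MWAA_AIRFLOW_COMPONENT` yourself. This is a local simulation only: on MWAA the variable is set by the service, and the `yum` step is omitted here.

```shell
# Local simulation of the startup script's branching (assumption: run
# outside MWAA, so MWAA_AIRFLOW_COMPONENT is set manually and yum is skipped).
export MWAA_AIRFLOW_COMPONENT="scheduler"

if [ "${MWAA_AIRFLOW_COMPONENT}" = "webserver" ]; then
    export EXTENDED_REQUIREMENTS="webserver_reqs.txt"
elif [ "${MWAA_AIRFLOW_COMPONENT}" = "worker" ]; then
    export EXTENDED_REQUIREMENTS="worker_reqs.txt"
elif [ "${MWAA_AIRFLOW_COMPONENT}" = "scheduler" ]; then
    export EXTENDED_REQUIREMENTS="scheduler_reqs.txt"
fi

echo "${EXTENDED_REQUIREMENTS}"
```

Changing the component value to `worker` or `webserver` should switch the file name accordingly.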

Upload startup.sh to the S3 bucket for your Amazon MWAA environment:

aws s3 cp startup.sh s3://[mwaa-environment-bucket]
aws mwaa update-environment --name [mwaa-environment-name] --startup-script-s3-path s3://[mwaa-environment-bucket]/startup.sh

Browse the CloudWatch log streams for your workers and view the worker_console log. Notice the startup script is now running and setting the environment variable.

[Screenshot: log-startup-script]

Create and install global Python dependencies in the Amazon MWAA environment

Your requirements file must include a --constraint statement to ensure the packages listed in your requirements are compatible with the version of Apache Airflow you are using. The statement beginning with -r references the environment variable you set in your startup.sh script based on the component type.

The following code is an example of the requirements.txt file:

--constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt
-r /usr/local/airflow/dags/${EXTENDED_REQUIREMENTS}

Upload the requirements.txt file to the Amazon MWAA environment S3 bucket:

aws s3 cp requirements.txt s3://[mwaa-environment-bucket]

Create and install component-specific Python dependencies in the Amazon MWAA environment

For this example, we want to install the Python package scrapy on workers and schedulers from our private Git repository. We also want to install pprintpp on the web server from the public Python package indexes. To accomplish that, we need to create the following files (we provide example code):

webserver_reqs.txt:

pprintpp

scheduler_reqs.txt:

git+https://[user]:[password]@git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy#egg=scrapy

worker_reqs.txt:

git+https://[user]:[password]@git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy#egg=scrapy
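The component-specific files can be generated with a short script before uploading. This is a sketch; the `[user]:[password]` credentials and the CodeCommit URL are the placeholders from this post's example and would be replaced with your own values:

```shell
# Sketch: write the three component-specific requirements files locally.
# [user]:[password] and the repository URL are placeholders from this post.
cat > webserver_reqs.txt <<'EOF'
pprintpp
EOF

cat > worker_reqs.txt <<'EOF'
git+https://[user]:[password]@git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy#egg=scrapy
EOF

# In this example, schedulers install the same package as workers
cp worker_reqs.txt scheduler_reqs.txt

ls webserver_reqs.txt scheduler_reqs.txt worker_reqs.txt
```

Using heredocs keeps the credentials placeholder literal, so nothing is expanded by the shell before you substitute real values.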

Upload webserver_reqs.txt, scheduler_reqs.txt, and worker_reqs.txt to the DAGs folder for the Amazon MWAA environment:

aws s3 cp webserver_reqs.txt s3://[mwaa-environment-bucket]/dags
aws s3 cp scheduler_reqs.txt s3://[mwaa-environment-bucket]/dags
aws s3 cp worker_reqs.txt s3://[mwaa-environment-bucket]/dags

Update the environment for the new requirements file and observe the results

Get the latest object version for the requirements file:

aws s3api list-object-versions --bucket [mwaa-environment-bucket]
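To pull the latest requirements.txt version ID out of that response without scanning it by hand, you can filter the JSON with a short script. The snippet below runs against a trimmed sample of the ListObjectVersions response shape (the `Versions`, `Key`, `VersionId`, and `IsLatest` fields match the S3 API; the IDs themselves are made up for illustration):

```shell
# Sketch: extract the IsLatest version ID for requirements.txt from
# list-object-versions output (versions.json is a trimmed sample here;
# in practice, redirect the aws s3api output into it).
cat > versions.json <<'EOF'
{"Versions": [
  {"Key": "requirements.txt", "VersionId": "abc123", "IsLatest": true},
  {"Key": "requirements.txt", "VersionId": "def456", "IsLatest": false}
]}
EOF

python3 - <<'EOF'
import json

with open("versions.json") as f:
    data = json.load(f)

# Keep only the current (IsLatest) version of requirements.txt
latest = [v["VersionId"] for v in data["Versions"]
          if v["Key"] == "requirements.txt" and v["IsLatest"]]
print(latest[0])
EOF
```

The printed version ID is the value to pass to `--requirements-s3-object-version` in the next step.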

Update the Amazon MWAA environment to use the new requirements.txt file:

aws mwaa update-environment --name [mwaa-environment-name] --requirements-s3-object-version [s3-object-version]

Browse the CloudWatch log streams for your workers and view the requirements_install log. Notice the requirements installation now references the component-specific file and installs the package from the private Git repository.

[Screenshot: log-requirements]

[Screenshot: log-git]

Conclusion

In this post, we demonstrated a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your VPC.

We hope this post provided you with a better understanding of how startup scripts and Python dependency management work in an Amazon MWAA environment. You can implement other variations and configurations using the concepts outlined in this post, depending on your specific network setup and requirements.


About the Author

Tim Wilhoit is a Sr. Solutions Architect for the Department of Defense at AWS. Tim has over 20 years of enterprise IT experience. His areas of interest are serverless computing and ML/AI. In his spare time, Tim enjoys spending time at the lake and rooting for the Oklahoma State Cowboys. Go Pokes!
