Big Data

Streamlining AWS Glue Studio visual jobs: Building an integrated CI/CD pipeline for seamless environment synchronization

12 November 2024

Many Amazon Web Services (AWS) customers have integrated their data across multiple sources using AWS Glue, a serverless data integration service. By providing seamless integration throughout the development lifecycle, AWS Glue enables organizations to make data-driven business decisions.

AWS Glue Studio visual jobs provide a graphical interface called the visual editor that you can use to author extract, transform, and load (ETL) jobs in AWS Glue visually. The visual editor maintains a visual representation that links a variety of data sources, transformations, and data sinks. With its intuitive interface, you can create large-scale data integration jobs without coding expertise, simplifying workflows and eliminating the need for manual ETL script programming.

As data engineers increasingly rely on the AWS Glue Studio visual editor to create data integration jobs, the need for a streamlined development lifecycle and seamless synchronization between environments has become paramount. Additionally, managing versions of visual directed acyclic graphs (DAGs) is crucial for tracking changes, collaboration, and maintaining consistency across environments.

This post introduces an end-to-end solution that addresses these needs by combining the AWS Glue Visual Job API, a custom AWS Glue Resource Sync Utility, and an AWS CDK based continuous integration and continuous deployment (CI/CD) pipeline.

A few common questions from our customers include:

What are the best practices for moving our workloads from a pre-production environment to production?
What are the recommended best practices for provisioning data integration components?
How can I build AWS Glue visual jobs in the development environment and automatically propagate them to the production account using the CI/CD pipeline?
How can I version control and track changes to my AWS Glue Studio visual jobs?

End-to-end development lifecycle for data integration pipeline

The software development lifecycle on AWS has six phases: plan, design, implement, test, deploy, and maintain, as shown in the following diagram.

For more information regarding each component, check out End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue.

AWS Glue Resource Sync Utility

As part of synchronizing AWS Glue visual jobs across different environments, requirements include:

Manage version control of visual DAGs by tracking changes to AWS Glue Studio visual jobs using version control systems such as Git
Promote AWS Glue visual jobs from a pre-production environment to a production environment
Transfer ownership of AWS Glue visual jobs between different AWS accounts
Replicate AWS Glue visual jobs from one AWS Region to another as part of a disaster recovery strategy

The AWS Glue Resource Sync Utility is a Python application developed on top of the AWS Glue Visual Job API, designed to synchronize AWS Glue Studio visual jobs across different accounts without losing the visual representation. It operates by using source and target AWS environment profiles. Optionally, a list of jobs for synchronization can be provided along with a mapping file to replace environment-specific resources.
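
To make the role of the mapping file more concrete, the following is a minimal sketch of how such a mapping could be applied to a job definition. It is not the utility's actual implementation: the function name apply_mapping, the substring-replacement strategy, and the sample job fields and account IDs are all assumptions made for illustration.

```python
from typing import Any, Dict


def apply_mapping(value: Any, mapping: Dict[str, str]) -> Any:
    """Recursively replace source-environment strings with target-environment ones.

    Hypothetical helper: walks dicts and lists and substitutes every occurrence
    of a mapping key found inside a string value.
    """
    if isinstance(value, dict):
        return {key: apply_mapping(item, mapping) for key, item in value.items()}
    if isinstance(value, list):
        return [apply_mapping(item, mapping) for item in value]
    if isinstance(value, str):
        for source, target in mapping.items():
            value = value.replace(source, target)
        return value
    return value  # numbers, booleans, and None are left untouched


# Illustrative job fragment with made-up account IDs and bucket names.
dev_job = {
    "Name": "orders-etl",
    "Role": "arn:aws:iam::111111111111:role/service-role/AWSGlueServiceRole",
    "Command": {
        "ScriptLocation": "s3://aws-glue-assets-111111111111-us-east-1/scripts/orders-etl.py"
    },
    "MaxRetries": 0,
}
mapping = {
    "s3://aws-glue-assets-111111111111-us-east-1": "s3://aws-glue-assets-222222222222-us-east-1",
    "arn:aws:iam::111111111111:role": "arn:aws:iam::222222222222:role",
}

prod_job = apply_mapping(dev_job, mapping)
print(prod_job["Role"])
print(prod_job["Command"]["ScriptLocation"])
```

The sketch builds a new structure instead of mutating the input, so the development definition stays available for comparison.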

For more information on the AWS Glue Resource Sync Utility, refer to Synchronize your AWS Glue Studio Visual Jobs to different environments.

Solution overview

As shown in the following diagram, this solution uses three separate AWS accounts. One account is designated for the development environment, another for the production environment, and a third to host the CI/CD infrastructure and pipeline.

The solution emphasizes version controlling AWS Glue Studio visual jobs by serializing them into JSON files and storing them in a Git repository. As a result, you can:

Track changes to your visual DAGs over time.
Collaborate with team members.
Restore and deploy visual DAGs in different environments seamlessly.
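
One practical detail when storing serialized DAGs in Git is keeping the JSON output stable, so that diffs show only real changes. The following is a minimal sketch using only the Python standard library. The helper name to_stable_json and the node structure are invented for illustration and are not the format produced by the AWS Glue Resource Sync Utility.

```python
import json


def to_stable_json(document: dict) -> str:
    """Serialize with sorted keys and fixed indentation so Git diffs stay small."""
    return json.dumps(document, indent=2, sort_keys=True) + "\n"


# Two logically identical DAG descriptions whose keys were inserted in a different order.
dag_a = {
    "name": "orders-etl",
    "nodes": [{"id": "source", "type": "S3"}, {"id": "sink", "type": "S3"}],
}
dag_b = {
    "nodes": [{"type": "S3", "id": "source"}, {"type": "S3", "id": "sink"}],
    "name": "orders-etl",
}

# Because key order no longer matters, both serialize to byte-identical text.
same_output = to_stable_json(dag_a) == to_stable_json(dag_b)
print(same_output)
```

Note that list order is preserved, so reordering nodes still shows up as a change in the diff.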

The AWS account responsible for hosting the CI/CD pipeline consists of three key components:

Managing AWS Glue job updates – Provides smooth updates and maintenance of AWS Glue jobs.
Cross-account access management – Enables secure promotion of updates from the development environment to the production environment.
Version control integration – Incorporates serialized visual DAGs into the CI/CD pipeline for deployment to target environments.

You can create AWS Glue Studio visual jobs using the intuitive visual editor in your development account. After these jobs are configured, you can serialize the visual DAGs into JSON files and commit them to a Git repository. The CI/CD pipeline detects changes to the repository and automatically triggers the deployment process.

The pipeline includes a step where the AWS Glue Resource Sync Utility deserializes the visual DAGs from the committed JSON files and deploys them to the production environment. This approach promotes consistent deployment of jobs while maintaining their visual representation.

The solution uses the AWS Glue Visual Job API, AWS Glue Resource Sync Utility, and AWS CDK to streamline deployment across environments. It enables seamless synchronization and consistent versioning of AWS Glue jobs between development and production, preserving visual workflows and reducing manual tasks. The solution consists of two main parts:

Initial steps (one-time setup) – Setting up the development environment, bootstrapping AWS environments, deploying the CI/CD pipeline, and integrating the AWS Glue Resource Sync Utility
Day-to-day development (repeated) – Ongoing activities such as creating visual jobs, serializing them, committing changes to the repository, deploying to production through the pipeline, and verifying the jobs

The solution follows these high-level steps for the initial setup:

Set up the development environment
Bootstrap your AWS environments
Deploy the CI/CD pipeline
Configure AWS developer tools connection to GitHub
Integrate the CI/CD pipeline with the AWS Glue Resource Sync Utility

The solution follows these high-level steps for the day-to-day development:

Create visual jobs in the development account
Serialize visual jobs
Commit changes to Git repository
Deploy visual jobs to production
Verify visual jobs in production

Prerequisites

Before you begin, make sure you have the following:

GitHub account
Git (git command)
Python 3.9 or later
Package installer for Python (pip command)
AWS CDK Toolkit (cdk command) 2.155.0 or later
AWS CLI configured with appropriate credentials for your accounts
Three AWS accounts:
- Development account
- Production account
- Pipeline account (for hosting the CI/CD pipeline)

Technical solution walkthrough

This section provides a detailed guide to setting up and using an automated CI/CD pipeline for AWS Glue Studio visual jobs.

Initial steps (one-time setup)

In this section, we walk through the foundational steps required to establish the CI/CD pipeline for AWS Glue Studio visual jobs. These initial steps set up the necessary infrastructure and configurations, providing a smooth and automated deployment process across your development and production environments.

Set up the development environment

To set up the development environment, follow these steps:

Fork the aws-glue-cdk-baseline repository
Clone the forked repository:

git clone https://github.com/<YOUR-GITHUB-USERNAME>/aws-glue-cdk-baseline.git

cd aws-glue-cdk-baseline

Create and activate a Python virtual environment:

python3 -m venv .venv

# On Windows, use .venv\Scripts\activate.bat
source .venv/bin/activate

Install required dependencies:

pip install -r requirements.txt

pip install -r requirements-dev.txt

To configure the default settings, edit the default-config.yaml file and replace the placeholders with your AWS account details:
Pipeline account: awsAccountId and awsRegion.
Development account: awsAccountId and awsRegion.
Production account: awsAccountId and awsRegion.
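
Before bootstrapping, it can help to confirm that no placeholder was left behind in the configuration. The following is a minimal sketch of such a check. It assumes the parsed file is a dictionary with pipelineAccount, devAccount, and prodAccount sections (the keys read by the scripts later in this post); the function name validate_config and the Region pattern are hypothetical.

```python
import re
from typing import Dict, List

REQUIRED_ACCOUNTS = ("pipelineAccount", "devAccount", "prodAccount")


def validate_config(config: Dict) -> List[str]:
    """Return a list of problems found in the parsed configuration (empty if none)."""
    problems = []
    for account in REQUIRED_ACCOUNTS:
        section = config.get(account)
        if not isinstance(section, dict):
            problems.append(f"{account}: section is missing")
            continue
        # Account IDs may be parsed as integers, so compare their string form.
        account_id = str(section.get("awsAccountId", ""))
        if not re.fullmatch(r"\d{12}", account_id):
            problems.append(f"{account}.awsAccountId: expected 12 digits, got {account_id!r}")
        # Assumed Region shape, for example us-east-1 or ap-southeast-2.
        region = str(section.get("awsRegion", ""))
        if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d", region):
            problems.append(f"{account}.awsRegion: unexpected value {region!r}")
    return problems


# Illustrative input: the production account ID is still a placeholder.
sample = {
    "pipelineAccount": {"awsAccountId": 111111111111, "awsRegion": "us-east-1"},
    "devAccount": {"awsAccountId": "222222222222", "awsRegion": "us-east-1"},
    "prodAccount": {"awsAccountId": "<PROD-ACCOUNT-ID>", "awsRegion": "us-east-1"},
}
for problem in validate_config(sample):
    print(problem)
```

In practice the dictionary would come from yaml.safe_load on default-config.yaml, the same call the mapping script in this post uses.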

Bootstrap your AWS environments

Bootstrapping prepares your AWS accounts for AWS CDK deployments. To bootstrap your AWS environments, run the following commands, replacing placeholders with your account numbers, Regions, and AWS CLI profiles:

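# Bootstrap the pipeline account
cdk bootstrap aws://<PIPELINE-ACCOUNT-NUMBER>/<REGION> --profile <PIPELINE-PROFILE>

# Bootstrap the development account, trusting the pipeline account
cdk bootstrap aws://<DEV-ACCOUNT-NUMBER>/<REGION> --profile <DEV-PROFILE> --trust <PIPELINE-ACCOUNT-NUMBER>

# Bootstrap the production account, trusting the pipeline account
cdk bootstrap aws://<PROD-ACCOUNT-NUMBER>/<REGION> --profile <PROD-PROFILE> --trust <PIPELINE-ACCOUNT-NUMBER>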

Deploy the CI/CD pipeline

Deploy the pipeline stack to your pipeline account:

This command creates:

The pipeline stack in the pipeline account
The AWS Glue app stack in the development account

Configure AWS developer tools connection to GitHub

To establish a connection between AWS CodePipeline and your GitHub repository, follow these steps:

Create a GitHub connection
- In the AWS Management Console for your pipeline account, navigate to AWS CodePipeline
- In the navigation pane, choose Connections
- Choose Create connection
- Select GitHub as the source provider
Authorize the connection
- Provide a connection name (such as MyGitHubConnection)
- Choose Connect to GitHub
- Follow the prompts to authorize AWS CodePipeline to access your GitHub account
- Make sure that the connection has access to your forked aws-glue-cdk-baseline repository
Note the connection Amazon Resource Name (ARN)
- After the connection is established, note the connection ARN because you'll need it when configuring the pipeline
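
Because the connection ARN is later pasted into the pipeline code by hand, a quick format check can catch copy-and-paste mistakes. The following sketch is an assumption-based sanity check, not an official validator: the pattern reflects the commonly documented shape of connection ARNs (a codestar-connections or codeconnections service prefix followed by a UUID-style identifier), and the function name is hypothetical.

```python
import re

# Assumed ARN shape: arn:<partition>:<service>:<region>:<account-id>:connection/<uuid>
CONNECTION_ARN_PATTERN = re.compile(
    r"arn:aws[a-z-]*:(codestar-connections|codeconnections):"
    r"[a-z0-9-]+:\d{12}:connection/[0-9a-fA-F-]{36}"
)


def looks_like_connection_arn(value: str) -> bool:
    """Return True if the value matches the assumed connection ARN shape."""
    return CONNECTION_ARN_PATTERN.fullmatch(value) is not None


# Made-up example values for illustration.
print(looks_like_connection_arn(
    "arn:aws:codestar-connections:us-east-1:111111111111:connection/39e4c34d-e13a-4e94-a886-ea67651bf042"
))
print(looks_like_connection_arn("YOUR-GITHUB-CONNECTION-ARN"))
```

A check like this only confirms the shape of the string; it does not confirm that the connection exists or is authorized.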

Integrate the CI/CD pipeline with the AWS Glue Resource Sync Utility

To integrate the AWS Glue Resource Sync Utility into the pipeline to automate the synchronization of AWS Glue visual jobs, follow these steps:

Download the sync.py script from the AWS Glue Samples repository:

wget https://raw.githubusercontent.com/aws-samples/aws-glue-samples/master/utilities/resource_sync/sync.py \
  -O aws_glue_cdk_baseline/job_scripts/sync.py

Create a new file aws_glue_cdk_baseline/job_scripts/generate_mapping.py with the following content:

import json

import yaml


def generate_mapping():
    """Write mapping.json, which maps development resources to their production equivalents."""
    with open('default-config.yaml', 'r') as config_file:
        config = yaml.safe_load(config_file)
    dev = config['devAccount']
    prod = config['prodAccount']
    mapping = {
        # AWS Glue assets bucket
        f"s3://aws-glue-assets-{dev['awsAccountId']}-{dev['awsRegion']}": f"s3://aws-glue-assets-{prod['awsAccountId']}-{prod['awsRegion']}",
        # IAM role used by the AWS Glue jobs
        f"arn:aws:iam::{dev['awsAccountId']}:role/service-role/AWSGlueServiceRole": f"arn:aws:iam::{prod['awsAccountId']}:role/service-role/AWSGlueServiceRole",
        # Data buckets (the development bucket uses the development Region)
        f"s3://dev-glue-data-{dev['awsAccountId']}-{dev['awsRegion']}": f"s3://prod-glue-data-{prod['awsAccountId']}-{prod['awsRegion']}"
    }
    with open('mapping.json', 'w') as mapping_file:
        json.dump(mapping, mapping_file, indent=2)


if __name__ == "__main__":
    generate_mapping()

This script generates a mapping.json file that the sync.py script uses to synchronize the jobs between the development and production environments. The mapping.json file maps development environment resources to their production environment counterparts:

The s3://aws-glue-assets-* Amazon Simple Storage Service (Amazon S3) bucket contains the AWS Glue Studio visual job definitions
The arn:aws:iam::*:role/service-role/AWSGlueServiceRole AWS Identity and Access Management (IAM) role is used by the AWS Glue Studio jobs to access AWS resources
The s3://dev-glue-data-* and s3://prod-glue-data-* S3 buckets contain scripts and data used by the AWS Glue Studio jobs

Update the aws_glue_cdk_baseline/pipeline_stack.py file to include a step that deserializes the JSON file and deploys the AWS Glue jobs to the production environment:

from typing import Dict
import aws_cdk as cdk
from aws_cdk import (
    Stack,
    aws_iam as iam
)
from constructs import Construct
from aws_cdk.pipelines import CodePipeline, CodePipelineSource, CodeBuildStep
from aws_glue_cdk_baseline.glue_app_stage import GlueAppStage

GITHUB_REPO = "YOUR-GITHUB-USERNAME/aws-glue-cdk-baseline"
GITHUB_BRANCH = "main"
GITHUB_CONNECTION_ARN = "YOUR-GITHUB-CONNECTION-ARN"

class PipelineStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, config: Dict, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Source: the forked GitHub repository, accessed through the connection
        source = CodePipelineSource.connection(
            GITHUB_REPO,
            GITHUB_BRANCH,
            connection_arn=GITHUB_CONNECTION_ARN
        )

        pipeline = CodePipeline(self, "GluePipeline",
            pipeline_name="GluePipeline",
            cross_account_keys=True,
            docker_enabled_for_synth=True,
            synth=CodeBuildStep("CdkSynth",
                input=source,
                install_commands=[
                    "pip install -r requirements.txt",
                    "pip install -r requirements-dev.txt",
                    "npm install -g aws-cdk",
                ],
                commands=[
                    "cdk synth",
                ]
            )
        )

        # Add development stage
        dev_stage = GlueAppStage(self, "DevStage", config=config, stage="dev",
            env=cdk.Environment(
                account=str(config['devAccount']['awsAccountId']),
                region=config['devAccount']['awsRegion']
            ))
        pipeline.add_stage(dev_stage)

        # Add production stage
        prod_stage = GlueAppStage(self, "ProdStage", config=config, stage="prod",
            env=cdk.Environment(
                account=str(config['prodAccount']['awsAccountId']),
                region=config['prodAccount']['awsRegion']
            ))
        pipeline.add_stage(prod_stage)

        # Glue Resource Sync as a separate step in the pipeline
        pipeline.add_wave("GlueJobSync").add_post(CodeBuildStep("GlueJobSync",
            input=source,
            commands=[
                # Generate mapping.json, then deserialize the jobs into production
                "python $(pwd)/aws_glue_cdk_baseline/job_scripts/generate_mapping.py",
                "python aws_glue_cdk_baseline/job_scripts/sync.py "
                   "--dst-role-arn arn:aws:iam::{0}:role/GlueCrossAccountRole-prod "
                   "--dst-region {1} "
                   "--deserialize-from-file aws_glue_cdk_baseline/resources/resources.json "
                   "--config-path mapping.json "
                   "--targets job,catalog "
                   "--skip-prompt".format(
                       config['prodAccount']['awsAccountId'],
                       config['prodAccount']['awsRegion']
                   ),
            ],
            # Allow the build step to assume the cross-account role
            role_policy_statements=[
                iam.PolicyStatement(
                    actions=[
                        "sts:AssumeRole",
                    ],
                    resources=["*"]
                )
            ]
        ))

Replace the placeholders in the pipeline_stack.py file with your values:

GITHUB_REPO with the name of your GitHub repository
GITHUB_BRANCH with the name of the branch you want to use for the pipeline
GITHUB_CONNECTION_ARN with the ARN of the GitHub connection you created in Step 4
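
The sync command in the pipeline step is assembled from adjacent string literals followed by a format call, which is easy to break with a single missing trailing space. One possible alternative, shown here only as a sketch, is to build the command from a list of arguments. The helper name build_sync_command is hypothetical; the flags and paths are copied from the pipeline step in this post.

```python
from typing import List


def build_sync_command(account_id: str, region: str,
                       role_name: str = "GlueCrossAccountRole-prod") -> str:
    """Assemble the sync.py invocation used to deserialize jobs into the target account."""
    args: List[str] = [
        "python aws_glue_cdk_baseline/job_scripts/sync.py",
        f"--dst-role-arn arn:aws:iam::{account_id}:role/{role_name}",
        f"--dst-region {region}",
        "--deserialize-from-file aws_glue_cdk_baseline/resources/resources.json",
        "--config-path mapping.json",
        "--targets job,catalog",
        "--skip-prompt",
    ]
    # Joining with a single space removes the need for trailing spaces in each literal.
    return " ".join(args)


# Made-up account ID for illustration.
command = build_sync_command("222222222222", "us-east-1")
print(command)
```

Inside the stack, the returned string could be passed as one entry of the CodeBuildStep commands list, with the account ID and Region taken from the prodAccount section of the configuration.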

Update the aws_glue_cdk_baseline/glue_app_stack.py file to create a cross-account role with the necessary permissions to access the development environment resources:

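    # In the stack's __init__: create a role that the pipeline account can assume
    self.cross_account_role = self.create_cross_account_role(
        f"GlueCrossAccountRole-{stage}",
        str(config['pipelineAccount']['awsAccountId'])
    )

    # Stack method: create an IAM role trusted by the given account
    def create_cross_account_role(self, role_name: str, trusted_account_id: str):
        return iam.Role(self, f"{role_name}CrossAccountRole",
            role_name=role_name,
            assumed_by=iam.AccountPrincipal(trusted_account_id),
            managed_policies=[iam.ManagedPolicy.from_aws_managed_policy_name("AdministratorAccess")]
        )

    # Stack property: expose the ARN of the cross-account role
    @property
    def cross_account_role_arn(self):
        return self.cross_account_role.role_arn

    # Wrapper property that forwards to the stack; the reference to
    # self.glue_app_stack suggests it belongs to the stage class (GlueAppStage)
    @property
    def cross_account_role_arn(self):
        return self.glue_app_stack.cross_account_role_arn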

Check the andreimaksimov/aws-glue-cdk-baseline repository for a complete diff.

Commit your changes to the repository:

git add aws_glue_cdk_baseline/job_scripts/sync.py
git add aws_glue_cdk_baseline/job_scripts/generate_mapping.py
git add aws_glue_cdk_baseline/pipeline_stack.py
git add aws_glue_cdk_baseline/glue_app_stack.py

git commit -m "Integrate Glue Resource Sync Utility into the pipeline"

git push

Day-to-day development (repeated)

With the initial setup complete, you can now proceed with your regular development activities. This section outlines the steps you'll repeat during your day-to-day work to develop, version control, and deploy AWS Glue visual jobs.

Create visual jobs in the development account

In this step, you'll use AWS Glue Studio to create and configure your visual jobs within the development environment.

In your development account, open the AWS Glue Studio console
To create a new visual job, choose Create job
Choose Visual with a blank canvas and use the visual editor to design your ETL job
Configure the job settings:
Job name: Provide a meaningful name
IAM role: Select an IAM role with the necessary permissions
Other configurations: Adjust as needed
To save the job, choose Save

Repeat these steps to create more jobs as required.

Serialize visual jobs

To serialize your visual jobs to enable version control and prepare them for deployment, follow these steps:

Run the AWS Glue Resource Sync Utility:

python aws_glue_cdk_baseline/job_scripts/sync.py \
  --src-role-arn arn:aws:iam::<DEV-ACCOUNT-NUMBER>:role/GlueCrossAccountRole-dev \
  --src-region us-east-1 \
  --serialize-to-file aws_glue_cdk_baseline/resources/resources.json \
  --targets job,catalog \
  --skip-prompt

Replace <DEV-ACCOUNT-NUMBER> with your development account number
Replace us-east-1 with your development Region if it differs
Verify the serialized file:
Locate the JSON file in aws_glue_cdk_baseline/resources/
Make sure that it contains the definitions of your visual jobs
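
A quick scripted check can complement the manual inspection. The following sketch only confirms that the file parses as JSON and reports how many entries sit under each top-level key. It makes no assumption about the internal layout of the serialized file: the function name is hypothetical, and the sample content is invented so that the example can run on its own.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def summarize_serialized_file(path: str) -> Dict[str, int]:
    """Parse the serialized file and count the entries under each top-level key."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, list):
        return {"<root>": len(data)}
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object or array at the top level")
    # Containers report their size; scalar values count as one entry.
    return {
        key: len(value) if isinstance(value, (list, dict)) else 1
        for key, value in data.items()
    }


# Demonstration with an invented file; point the function at your own
# aws_glue_cdk_baseline/resources/resources.json in practice.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "resources.json"
    sample_path.write_text(
        json.dumps({"jobs": [{"Name": "orders-etl"}], "catalog": {}}),
        encoding="utf-8",
    )
    summary = summarize_serialized_file(str(sample_path))

print(summary)
```

An empty or near-empty summary is a hint that the serialization step ran against the wrong account or Region.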

Commit changes to Git repository

To commit changes to the Git repository, follow these steps:

Add the serialized resources to Git:

git add aws_glue_cdk_baseline/resources/resources.json

Commit your changes:

git commit -m "Add serialized Glue Visual Jobs"

Push to GitHub:

git push

This action triggers the CI/CD pipeline.

Deploy visual jobs to production

The CI/CD pipeline automatically runs the following steps:

Synthesize the AWS CDK application
Deploy to the development environment
Deploy to the production environment
Run the AWS Glue Resource Sync Utility

The following screenshot shows the CI/CD pipeline.

Verify visual jobs in production

After the pipeline has completed the deployment, it's important to verify that the visual jobs are correctly reflected in the production environment. To do so, follow these steps:

In the production account, open the AWS Glue Studio console
Verify the deployed jobs:
Make sure that the visual jobs are present
Open each job to confirm that the visual DAGs are preserved
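
The presence check can also be scripted. The following is a minimal sketch that compares a list of expected job names against the jobs returned by the AWS Glue get_jobs API. The function name find_missing_jobs and the job and profile names in the comments are hypothetical. It accepts any client object that offers get_paginator, so it can be exercised with a stub as well as with a boto3 Glue client.

```python
from typing import Iterable, List, Set


def find_missing_jobs(glue_client, expected_job_names: Iterable[str]) -> List[str]:
    """Return the expected job names that do not exist in the target account."""
    deployed: Set[str] = set()
    # get_jobs is paginated, so walk every page to collect all job names.
    paginator = glue_client.get_paginator("get_jobs")
    for page in paginator.paginate():
        deployed.update(job["Name"] for job in page["Jobs"])
    return sorted(set(expected_job_names) - deployed)


# Example usage against the production account (requires boto3 and credentials;
# the profile and job names below are placeholders):
#
#   import boto3
#   glue = boto3.Session(profile_name="prod", region_name="us-east-1").client("glue")
#   print(find_missing_jobs(glue, ["orders-etl", "customers-etl"]))
```

This confirms only that the jobs exist. Opening each job in the visual editor remains the way to confirm that the visual DAG was preserved.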

By following these steps in your day-to-day workflow, you make sure that your AWS Glue visual jobs are version-controlled and consistent across environments, and that your production environment reflects the latest tested changes.

Version control for AWS Glue visual jobs

By serializing AWS Glue Studio visual jobs to JSON files and committing them to a Git repository, you enable version control for your data integration workflows. By following this approach you can:

Track changes – Monitor changes to your AWS Glue jobs over time
Collaborate – Work with team members on developing and refining jobs
Restore and deploy – Easily restore jobs in other accounts or environments

The serialization and deserialization steps are integral to your development and deployment process, making sure that all changes are captured and seamlessly propagated.

Conclusion

By combining the AWS Glue Visual Job API, AWS Glue Resource Sync Utility, and an AWS CDK based CI/CD pipeline, we've crafted a comprehensive solution for managing AWS Glue Studio visual jobs across different environments. This integrated approach offers several benefits:

Version control integration – Manage and track changes to your AWS Glue visual jobs using Git, enabling collaboration and change tracking
Streamlined development – Easily develop and test AWS Glue jobs using the visual editor in the development environment
Automated deployment – Use a CI/CD pipeline to automatically deploy serialized visual DAGs to the production environment
Environment consistency – Promote consistency across development and production environments by using the same job definitions
Visual representation preservation – Maintain the visual DAG representation when synchronizing jobs between environments

This solution empowers data engineers to focus on building robust data integration pipelines while automating the complexities of managing and deploying AWS Glue Studio visual jobs across multiple environments.

We encourage you to try this solution and adapt it to your needs. As always, we welcome your feedback and suggestions for further improvements.

About the Authors

Andrei Maksimov is an AWS Senior Cloud Infrastructure Architect specializing in cloud infrastructure, software development, and DevOps. He designs and implements scalable, secure, and efficient cloud solutions and helps customers optimize their cloud environments. Outside of work, Andrei enjoys participating in hackathons, contributing to open source projects, and exploring the latest developments in AI. You can connect with him on LinkedIn.

David Zhang is an AWS Data Architect specializing in designing and implementing analytics infrastructure, data management, ETL, and data-intensive systems. He helps customers modernize their AWS data platforms. David is also an active speaker and contributor to AWS conferences, technical content, and open source projects. He enjoys playing volleyball, tennis, and weightlifting in his free time. Feel free to connect with him on LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He's responsible for designing AWS solutions, implementing software artifacts, and helping with customer architectures. In his spare time, he enjoys watching anime on Prime Video. You can connect with him on LinkedIn.