Organizations are increasingly using a multi-cloud strategy to run their production workloads. We often see requests from customers who started their data journey by building data lakes on Microsoft Azure and now want to extend access to that data to AWS services. Customers want to use a variety of AWS analytics, data, AI, and machine learning (ML) services, such as AWS Glue, Amazon Redshift, and Amazon SageMaker, to build more cost-efficient, performant data solutions that harness the strengths of individual cloud service providers for their business use cases.
In such scenarios, data engineers face challenges in connecting to and extracting data from storage containers on Microsoft Azure. Customers typically use Azure Data Lake Storage Gen2 (ADLS Gen2) as their data lake storage medium, store the data in open table formats like Delta tables, and want to use AWS analytics services like AWS Glue to read the Delta tables. AWS Glue, with its ability to process data using Apache Spark and connect to various data sources, is a suitable solution for addressing the challenges of accessing data across multiple cloud environments.
AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, ML, and application development. AWS Glue custom connectors allow you to discover and integrate additional data sources, such as software as a service (SaaS) applications and your custom data sources. With just a few clicks, you can search for and subscribe to connectors from AWS Marketplace and begin your data preparation workflow in minutes.
In this post, we explain how you can extract data from ADLS Gen2 using the Azure Data Lake Storage Connector for AWS Glue. We specifically demonstrate how to import data stored in Delta tables in ADLS Gen2. We provide step-by-step guidance on how to configure the connector, author an AWS Glue ETL (extract, transform, and load) script, and load the extracted data into Amazon Simple Storage Service (Amazon S3).
Azure Data Lake Storage Connector for AWS Glue
The Azure Data Lake Storage Connector for AWS Glue simplifies the process of connecting AWS Glue jobs to extract data from ADLS Gen2. It uses Hadoop’s FileSystem interface and the ADLS Gen2 connector for Hadoop. The Azure Data Lake Storage Connector for AWS Glue also includes the hadoop-azure module, which lets you run Apache Hadoop or Apache Spark jobs directly against data in ADLS. When the connector is added to the AWS Glue environment, AWS Glue loads the library from the Amazon Elastic Container Registry (Amazon ECR) repository during initialization (as a connector). When AWS Glue has internet access, the Spark job in AWS Glue can read from and write to ADLS.
With the availability of the Azure Data Lake Storage Connector for AWS Glue in AWS Marketplace, an AWS Glue connection makes sure you have the required packages to use in your AWS Glue job.
For this post, we use the Shared Key authentication method.
Solution overview
In this post, our objective is to migrate a product table named sample_delta_table, which currently resides in ADLS Gen2, to Amazon S3. To accomplish this, we use AWS Glue, the Azure Data Lake Storage Connector for AWS Glue, and AWS Secrets Manager to securely store the Azure shared key. We use an AWS Glue serverless ETL job, configured with the connector, to establish a connection to ADLS using shared key authentication over the public internet. After the table is migrated to Amazon S3, we use Amazon Athena to query Delta Lake tables.
The following architecture diagram illustrates how AWS Glue facilitates data ingestion from ADLS.
Prerequisites
You need the following prerequisites:
Configure your ADLS Gen2 account in Secrets Manager
Complete the following steps to create a secret in Secrets Manager to store the ADLS credentials:
- On the Secrets Manager console, choose Store a new secret.
- For Secret type, select Other type of secret.
- Enter the key accountName for the ADLS Gen2 storage account name.
- Enter the key accountKey for the ADLS Gen2 storage account key.
- Enter the key container for the ADLS Gen2 container.
- Leave the rest of the options as default and choose Next.
- Enter a name for the secret (for example, adlstorage_credentials).
- Choose Next.
- Complete the rest of the steps to store the secret.
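The console steps above can also be scripted. The following is a minimal sketch, not the authors' method: it assumes the AWS SDK for Python (boto3) is installed and that your credentials allow secretsmanager:CreateSecret. The helper names build_adls_secret and store_adls_secret are hypothetical; only the key names accountName, accountKey, and container and the example secret name come from this post.

```python
import json


def build_adls_secret(account_name: str, account_key: str, container: str) -> str:
    """Return the JSON string to store as the secret value.

    The three key names match the keys entered on the Secrets Manager console.
    """
    return json.dumps(
        {
            "accountName": account_name,
            "accountKey": account_key,
            "container": container,
        }
    )


def store_adls_secret(secret_name: str, secret_string: str, region_name: str) -> str:
    """Create the secret in Secrets Manager and return its ARN.

    Assumption: boto3 is installed and AWS credentials are configured.
    """
    import boto3  # imported lazily so build_adls_secret works without the AWS SDK

    client = boto3.client("secretsmanager", region_name=region_name)
    response = client.create_secret(Name=secret_name, SecretString=secret_string)
    return response["ARN"]


# Example call (placeholders, not real values):
# payload = build_adls_secret("<storage-account-name>", "<storage-account-key>", "<container-name>")
# store_adls_secret("adlstorage_credentials", payload, "<aws-region>")
```

Keeping the payload builder separate from the API call makes it possible to inspect the secret structure before anything is written to Secrets Manager.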
Subscribe to the Azure Data Lake Storage Connector for AWS Glue
The Azure Data Lake Storage Connector for AWS Glue simplifies the process of connecting AWS Glue jobs to extract data from ADLS Gen2. The connector is available as an AWS Marketplace offering.
Complete the following steps to subscribe to the connector:
- Log in to your AWS account with the necessary permissions.
- Navigate to the AWS Marketplace page for the Azure Data Lake Storage Connector for AWS Glue.
- Choose Continue to Subscribe.
- Choose Continue to Configuration after reading the EULA.
- For Fulfillment option, choose Glue 4.0.
- For Software version, choose the latest software version.
- Choose Continue to Launch.
Create a custom connection in AWS Glue
After you’re subscribed to the connector, complete the following steps to create an AWS Glue connection based on it. This connection is added to the AWS Glue job to make sure the connector is available and the data store connection information is accessible to establish a network pathway.
To create the AWS Glue connection, you need to activate the Azure Data Lake Storage Connector for AWS Glue on the AWS Glue Studio console. After you choose Continue to Launch in the previous steps, you’re redirected to the connector landing page.
- In the Configuration details section, choose Usage instructions.
- Choose Activate the Glue connector from AWS Glue Studio.
The AWS Glue Studio console gives you the option to either activate the connector or activate it and create the connection in one step. For this post, we choose the second option.
- For Connector, confirm Azure ADLS Connector for AWS Glue 4.0 is selected.
- For Name, enter a name for the connection (for example, AzureADLSStorageGen2Connection).
- Enter an optional description.
- Choose Create connection and activate connector.
The connection is now ready for use. The connector and connection information is visible on the Data connections page of the AWS Glue console.

Read Delta tables from ADLS Gen2 using the connector in an AWS Glue ETL job
Complete the following steps to create an AWS Glue job and configure the AWS Glue connection and job parameter options:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- Choose Author code with a script editor and choose Script editor.
- Choose Create script and go to the Job details section.
- Update the settings for Name and IAM role.
- Under Advanced properties, add the AWS Glue connection AzureADLSStorageGen2Connection created in the earlier steps.
- For Job parameters, add the key --datalake-formats with the value delta.
- Author a script that reads the Delta table from ADLS and writes it to Amazon S3. Provide the path to your Delta table files in your Azure storage account container and the S3 bucket for writing the Delta files to the output S3 location.
- Choose Run to start the job.
- On the Runs tab, confirm the job ran successfully.
- On the Amazon S3 console, verify the Delta files in the S3 bucket (Delta table path).
- Create a database and table in Athena to query the migrated Delta table in Amazon S3.
You can accomplish this step using an AWS Glue crawler. The crawler can automatically crawl your Delta table stored in Amazon S3 and create the necessary metadata in the AWS Glue Data Catalog. Athena can then use this metadata to query and analyze the Delta table seamlessly. For more information, see Crawl Delta Lake tables using AWS Glue crawlers.
- Query the Delta table in Athena.
By following the steps outlined in this post, you have successfully migrated a Delta table from ADLS Gen2 to Amazon S3 using an AWS Glue ETL job.
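The job script described in the steps above might look like the following minimal sketch. It is an illustration under stated assumptions, not the authors' script: it assumes the job runs with the AzureADLSStorageGen2Connection connection and --datalake-formats set to delta, that the secret holds the accountName, accountKey, and container keys, and that Shared Key authentication is set through the hadoop-azure property fs.azure.account.key.<account>.dfs.core.windows.net. The function names and the source and target paths are hypothetical placeholders.

```python
import json


def abfss_uri(container: str, account_name: str, path: str) -> str:
    """Build the abfss:// URI for a path inside an ADLS Gen2 container."""
    return f"abfss://{container}@{account_name}.dfs.core.windows.net/{path.lstrip('/')}"


def shared_key_conf(account_name: str) -> str:
    """Return the hadoop-azure property name that holds the storage account Shared Key."""
    return f"fs.azure.account.key.{account_name}.dfs.core.windows.net"


def load_adls_secret(secret_id: str) -> dict:
    """Fetch and parse the ADLS credentials stored in Secrets Manager."""
    import boto3  # assumed available, as it is in the AWS Glue runtime

    client = boto3.client("secretsmanager")
    return json.loads(client.get_secret_value(SecretId=secret_id)["SecretString"])


def copy_delta_table(spark, secret: dict, source_path: str, target_s3_path: str) -> None:
    """Read a Delta table from ADLS Gen2 and write it to Amazon S3 in Delta format."""
    account = secret["accountName"]
    # Register the Shared Key so the ABFS driver can authenticate to the storage account.
    spark.conf.set(shared_key_conf(account), secret["accountKey"])
    source = abfss_uri(secret["container"], account, source_path)
    df = spark.read.format("delta").load(source)
    df.write.format("delta").mode("overwrite").save(target_s3_path)


# In an AWS Glue job, the pieces could be wired together like this (placeholders):
# from awsglue.context import GlueContext
# from pyspark.context import SparkContext
# spark = GlueContext(SparkContext.getOrCreate()).spark_session
# secret = load_adls_secret("adlstorage_credentials")
# copy_delta_table(spark, secret, "<path-to-delta-table>", "s3://<your-bucket>/<delta-table-path>/")
```

The overwrite mode is a choice made for this sketch so that reruns replace the target table; use a different save mode if you need to preserve existing data in the S3 location.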
Read the Delta table in an AWS Glue notebook
The following are optional steps if you want to read the Delta table from ADLS Gen2 in an AWS Glue notebook:
- Create a notebook and, in the first notebook cell, configure the AWS Glue connection and --datalake-formats for the interactive session.
- In a new cell, read the Delta table stored in ADLS Gen2. Provide the path to your Delta files in the Azure storage account container and the S3 bucket for writing the Delta files to Amazon S3.
Clean up
To clean up your resources, complete the following steps:
- Remove the AWS Glue job, database, table, and connection:
  - On the AWS Glue console, choose Tables in the navigation pane, select sample_delta_table, and choose Delete.
  - Choose Databases in the navigation pane, select deltadb, and choose Delete.
  - Choose Connections in the navigation pane, select AzureADLSStorageGen2Connection, and on the Actions menu, choose Delete.
- On the Secrets Manager console, choose Secrets in the navigation pane, select adlstorage_credentials, and on the Actions menu, choose Delete secret.
- If you are no longer going to use this connector, you can cancel the subscription to the connector:
  - On the AWS Marketplace console, choose Manage subscriptions.
  - Select the subscription for the product that you want to cancel, and on the Actions menu, choose Cancel subscription.
  - Read the information provided and select the acknowledgement check box.
  - Choose Yes, cancel subscription.
- On the Amazon S3 console, delete the data in the S3 bucket that you used in the previous steps.
You can also use the AWS Command Line Interface (AWS CLI) to remove the AWS Glue job, database, table, and connection, and the Secrets Manager secret.
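The same cleanup can be scripted with the AWS SDK for Python (boto3) instead of the AWS CLI. The following is a sketch under stated assumptions: boto3 is installed, your credentials allow the delete operations, and the function names cleanup_plan and run_cleanup are hypothetical. The job name is a placeholder because this post does not name the job; the other resource names are the examples used earlier.

```python
def cleanup_plan(job_name, database, table, connection, secret_id):
    """Return (service, operation, kwargs) tuples in the order they should run.

    The table is deleted before its database so each call targets an existing resource.
    """
    return [
        ("glue", "delete_job", {"JobName": job_name}),
        ("glue", "delete_table", {"DatabaseName": database, "Name": table}),
        ("glue", "delete_database", {"Name": database}),
        ("glue", "delete_connection", {"ConnectionName": connection}),
        # Without ForceDeleteWithoutRecovery, Secrets Manager schedules the
        # deletion and keeps the secret recoverable during the recovery window.
        ("secretsmanager", "delete_secret", {"SecretId": secret_id}),
    ]


def run_cleanup(plan):
    """Execute each planned call with boto3, reusing one client per service."""
    import boto3  # imported lazily so the plan can be built and reviewed without the SDK

    clients = {}
    for service, operation, kwargs in plan:
        if service not in clients:
            clients[service] = boto3.client(service)
        getattr(clients[service], operation)(**kwargs)


# Example call (the job name is a placeholder):
# run_cleanup(cleanup_plan("<your-glue-job-name>", "deltadb", "sample_delta_table",
#                          "AzureADLSStorageGen2Connection", "adlstorage_credentials"))
```

Building the plan as data first lets you print and review exactly what will be deleted before running anything destructive.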
Conclusion
In this post, we demonstrated a real-world example of migrating a Delta table from Azure Data Lake Storage Gen2 to Amazon S3 using AWS Glue. We used an AWS Glue serverless ETL job, configured with an AWS Marketplace connector, to establish a connection to ADLS using shared key authentication over the public internet. Additionally, we used Secrets Manager to securely store the shared key and integrate it within the AWS Glue ETL job, providing a secure and efficient migration process. Finally, we provided guidance on querying the Delta Lake table from Athena.
Try out the solution for your own use case, and let us know your feedback and questions in the comments.
About the Authors
Nitin Kumar is a Cloud Engineer (ETL) at Amazon Web Services, specialized in AWS Glue. With a decade of experience, he excels in helping customers with their big data workloads, focusing on data processing and analytics. He is committed to helping customers overcome ETL challenges and develop scalable data processing and analytics pipelines on AWS. In his free time, he likes to watch movies and spend time with his family.
Shubham Purwar is a Cloud Engineer (ETL) at AWS Bengaluru, specialized in AWS Glue and Amazon Athena. He is passionate about helping customers solve issues related to their ETL workload and implement scalable data processing and analytics pipelines on AWS. In his free time, Shubham likes to spend time with his family and travel around the world.
Pramod Kumar P is a Solutions Architect at Amazon Web Services. With 19 years of technology experience and close to a decade of designing and architecting connectivity solutions (IoT) on AWS, he guides customers to build solutions with the right architectural tenets to meet their business outcomes.
Madhavi Watve is a Senior Solutions Architect at Amazon Web Services, providing help and guidance to a broad range of customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. She brings over 20 years of technology experience in software development and architecture and is a data analytics specialist.
Swathi S is a Technical Account Manager with the Enterprise Support team in Amazon Web Services. She has over 6 years of experience with AWS on big data technologies and specializes in analytics frameworks. She is passionate about helping AWS customers navigate the cloud space and enjoys assisting with design and optimization of analytics workloads on AWS.