Big Data

Unlock scalable analytics with a safe connectivity sample in AWS Glue to learn from or write to Snowflake

20 August 2024

In right this moment’s data-driven world, the flexibility to seamlessly combine and make the most of various knowledge sources is important for gaining actionable insights and driving innovation. As organizations more and more depend on knowledge saved throughout numerous platforms, comparable to Snowflake, Amazon Easy Storage Service (Amazon S3), and numerous software program as a service (SaaS) functions, the problem of bringing these disparate knowledge sources collectively has by no means been extra urgent.

AWS Glue is a strong knowledge integration service that facilitates the consolidation of knowledge from totally different origins, empowering companies to make use of the complete potential of their knowledge belongings. Through the use of AWS Glue to combine knowledge from Snowflake, Amazon S3, and SaaS functions, organizations can unlock new alternatives in generative synthetic intelligence (AI), machine studying (ML), enterprise intelligence (BI), and self-service analytics or feed knowledge to underlying functions.

On this publish, we discover how AWS Glue can function the info integration service to carry the info from Snowflake to your knowledge integration technique, enabling you to harness the ability of your knowledge ecosystem and drive significant outcomes throughout numerous use circumstances.

Use case

Think about a big ecommerce firm that depends closely on data-driven insights to optimize its operations, advertising methods, and buyer experiences. The corporate shops huge quantities of transactional knowledge, buyer info, and product catalogs in Snowflake. Nevertheless, in addition they generate and accumulate knowledge from numerous different sources, comparable to internet logs saved in Amazon S3, social media platforms, and third-party knowledge suppliers. To realize a complete understanding of their enterprise and make knowledgeable selections, the corporate must combine and analyze knowledge from all these sources seamlessly.

One essential enterprise requirement for the ecommerce firm is to generate a Pricing Abstract Report that gives an in depth evaluation of pricing and discounting methods. This report is crucial for understanding income streams, figuring out alternatives for optimization, and making data-driven selections relating to pricing and promotions. After the Pricing Abstract Report is generated and saved in Amazon S3, the corporate can use AWS analytics providers to generate interactive BI dashboards and run one-time queries on the report. This enables enterprise analysts and decision-makers to realize precious insights, visualize key metrics, and discover the info in depth, enabling knowledgeable decision-making and strategic planning for pricing and promotional methods.

Resolution overview

The next structure diagram illustrates a safe and environment friendly answer of integrating Snowflake knowledge with Amazon S3, utilizing the native Snowflake connector in AWS Glue. This setup makes use of AWS PrivateLink to supply safe connectivity between AWS providers throughout totally different digital personal clouds (VPCs), eliminating the necessity to expose knowledge to the general public web, which is a important want for organizations.

The next are the important thing parts and steps within the integration course of:

Set up a safe, personal connection between your AWS account and your Snowflake account utilizing PrivateLink. This entails creating VPC endpoints in each the AWS and Snowflake VPCs, ensuring knowledge switch stays inside the AWS community.
Use Amazon Route 53 to create a non-public hosted zone that resolves the Snowflake endpoint inside your VPC. This enables AWS Glue jobs to connect with Snowflake utilizing a non-public DNS identify, sustaining the safety and integrity of the info switch.
Create an AWS Glue job to deal with the extract, remodel, and cargo (ETL) course of on knowledge from Snowflake to Amazon S3. The AWS Glue job makes use of the safe connection established by the VPC endpoints to entry Snowflake knowledge. Snowflake credentials are securely saved in AWS Secrets and techniques Supervisor. The AWS Glue job retrieves these credentials at runtime to authenticate and connect with Snowflake, offering safe entry administration. A VPC endpoint lets you securely talk with this service with out traversing the general public web, enhancing safety and efficiency.
Retailer the extracted and reworked knowledge in Amazon S3. Manage the info into acceptable constructions, comparable to partitioned folders, to optimize question efficiency and knowledge administration. We use a VPC endpoint enabled to securely talk with this service with out traversing the general public web, enhancing safety and efficiency. We additionally use Amazon S3 to retailer AWS Glue scripts, logs, and momentary knowledge generated in the course of the ETL course of.

This method affords the next advantages:

Enhanced safety – Through the use of PrivateLink and VPC endpoints, knowledge switch between Snowflake and Amazon S3 is secured inside the AWS community, decreasing publicity to potential safety threats.
Environment friendly knowledge integration – AWS Glue simplifies the ETL course of, offering a scalable and versatile answer for knowledge integration between Snowflake and Amazon S3.
Price-effectiveness – Utilizing Amazon S3 for knowledge storage, mixed with the AWS Glue pay-as-you-go pricing mannequin, helps optimize prices related to knowledge administration and integration.
Scalability and suppleness – The structure helps scalable knowledge transfers and may be prolonged to combine extra knowledge sources and locations as wanted.

By following this structure and profiting from the capabilities of AWS Glue, PrivateLink, and related AWS providers, organizations can obtain a sturdy, safe, and environment friendly knowledge integration answer, enabling them to harness the complete potential of their Snowflake and Amazon S3 knowledge for superior analytics and BI.

Conditions

Full the next conditions earlier than establishing the answer:

Confirm that you’ve got entry to AWS account with the required permissions to provision sources in providers comparable to Route 53, Amazon S3, AWS Glue, Secrets and techniques Supervisor, and Amazon Digital Non-public Cloud (Amazon VPC) utilizing AWS CloudFormation, which helps you to mannequin, provision, and handle AWS and third-party sources by treating infrastructure as code.
Affirm that you’ve got entry to Snowflake hosted in AWS with required permissions to run the steps to configure PrivateLink. Check with Enabling AWS PrivateLink within the Snowflake documentation to confirm the steps, required entry stage, and repair stage to set the configurations. After you allow PrivateLink, save the worth of the next parameters supplied by Snowflake to make use of within the subsequent step on this publish:
1. privatelink-vpce-id
2. privatelink-account-url
3. privatelink_ocsp-url
4. regionless-snowsight-privatelink-url
Be sure to have a Snowflake person snowflakeUser and password snowflakePassword with required permissions to learn from and write to Snowflake. The person and password are used within the AWS Glue connection to authenticate inside Snowflake.
In case your Snowflake person doesn’t have a default warehouse set, you’ll need a warehouse identify. We use snowflakeWarehouse as a placeholder for the warehouse identify; exchange it along with your precise warehouse identify.
Should you’re new to Snowflake, take into account finishing the Snowflake in 20 Minutes By the tip of the tutorial, it is best to know the way to create required Snowflake objects, together with warehouses, databases, and tables for storing and querying knowledge.

Create sources with AWS CloudFormation

This publish features a CloudFormation template for a fast setup of the bottom sources. You possibly can evaluate and customise it to fit your wants if wanted. The CloudFormation template generates the next sources:

To create your sources, full the next steps:

Sign up to the AWS CloudFormation console.
Select Launch Stack to launch the CloudFormation stack.
Present the CloudFormation stack parameters:
1. For PrivateLinkAccountURL, enter the worth of the parameter privatelink-account-url obtained within the conditions.
2. For PrivateLinkOcspURL, enter the worth of the parameter privatelink_ocsp-url obtained within the conditions.
3. For PrivateLinkVpceId, enter the worth of the parameter privatelink-vpce-id obtained within the conditions.
4. For PrivateSubnet1CIDR, enter the IP addresses to your personal subnet 1.
5. For PrivateSubnet2CIDR, enter the IP addresses to your personal subnet 2.
6. For PrivateSubnet3CIDR, enter the IP addresses to your personal subnet 3.
7. For PublicSubnet1CIDR, enter the IP addresses to your public subnet 1.
8. For RegionlessSnowsightPrivateLinkURL, enter the worth of the parameter regionless-snowsight-privatelink-url obtained within the conditions.
9. For VpcCIDR, enter the IP addresses to your VPC.
Select Subsequent.
Choose I acknowledge that AWS CloudFormation may create IAM sources.
Select Submit and anticipate the stack creation step to finish.

After the CloudFormation stack is efficiently created, you possibly can see all of the sources created on the Assets tab.

Navigate to the Outputs tab to see the outputs supplied by CloudFormation stack. Save the worth of the outputs GlueSecurityGroupId, VpcId, and PrivateSubnet1Id to make use of within the subsequent step on this publish.

Replace the Secrets and techniques Supervisor secret with Snowflake credentials for the AWS Glue connection

To replace the Secrets and techniques Supervisor secret with person snowflakeUser, password snowflakePassword, and warehouse snowflakeWarehouse that you’ll use within the AWS Glue connection to determine a connection to Snowflake, full the next steps:

On the Secrets and techniques Supervisor console, select Secrets and techniques within the navigation pane.
Open the key blog-glue-snowflake-credentials.
Underneath Secret worth, select Retrieve secret worth.

Select Edit.
Enter the person snowflakeUser, password snowflakePassword, and warehouse snowflakeWarehouse for the keys sfUser, sfPassword, and sfWarehouse, respectively.
Select Save.

Create the AWS Glue connection for Snowflake

An AWS Glue connection is an AWS Glue Information Catalog object that shops login credentials, URI strings, VPC info, and extra for a specific knowledge retailer. AWS Glue crawlers, jobs, and growth endpoints use connections in an effort to entry sure sorts of knowledge shops. To create an AWS Glue connection to Snowflake, full the next steps:

On the AWS Glue console, within the navigation pane, underneath Information catalog, select Connections.
Select Create connection.
For Information sources, seek for and choose Snowflake.
Select Subsequent.

For Snowflake URL, enter https://.

To acquire the Snowflake PrivateLink account URL, confer with parameters obtained within the conditions.

For AWS Secret, select the key blog-glue-snowflake-credentials.
For VPC, select the VpcId worth obtained from the CloudFormation stack output.
For Subnet, select the PrivateSubnet1Id worth obtained from the CloudFormation stack output.
For Safety teams, select the GlueSecurityGroupId worth obtained from the CloudFormation stack output.
Select Subsequent.

Within the Connection Properties part, for Identify, enter glue-snowflake-connection.
Select Subsequent.

Select Create connection.

Create an AWS Glue job

You’re now able to outline the AWS Glue job utilizing the Snowflake connection. To create an AWS Glue job to learn from Snowflake, full the next steps:

On the AWS Glue console, underneath ETL jobs within the navigation pane, select Visible ETL.

Select the Job particulars tab.
For Identify, enter a reputation, for instance, Pricing Abstract Report Job.
For Description, enter a significant description for the job.
For IAM Function, select the function that has entry to the goal S3 location the place the job is writing to and the supply location from the place it’s loading the Snowflake knowledge and likewise to run the AWS Glue job. You’ll find this function in your CloudFormation stack output, named blog-glue-snowflake-GlueServiceRole-*.
Use the default choices for Sort, Glue model, Language, Employee kind, Variety of employees, Variety of retries, and Job timeout.
For Job bookmark, select Disable.
Select Save to avoid wasting the job.

On the Visible tab, select Add nodes.

For Sources, select Snowflake.

Select Information supply – Snowflake within the AWS Glue Studio canvas.
For Identify, enter Snowflake_Pricing_Summary.
For Snowflake connection, select glue-snowflake-connection.
For Snowflake supply, choose Enter a customized question.
For Database, enter snowflake_sample_data.
For Snowflake question, add the next Snowflake question:

SELECT l_returnflag
    , l_linestatus
    , Sum(l_quantity) AS sum_qty
    , Sum(l_extendedprice) AS sum_base_price
    , Sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price
    , Sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge
    , Avg(l_quantity) AS avg_qty
    , Avg(l_extendedprice) AS avg_price
    , Avg(l_discount) AS avg_disc
    , Rely(*) AS count_order
FROM tpch_sf1.lineitem
WHERE l_shipdate <= Dateadd(day, - 90, To_date('1998-12-01'))
GROUP BY l_returnflag
    , l_linestatus
ORDER BY l_returnflag
    , l_linestatus;

The Pricing Abstract Report supplies a abstract pricing report for all line objects shipped as of a given date. The date is inside 60–120 days of the best ship date contained within the database. The question lists totals for prolonged worth, discounted prolonged worth, discounted prolonged worth plus tax, common amount, common prolonged worth, and common low cost. These aggregates are grouped by RETURNFLAG and LINESTATUS, and listed in ascending order of RETURNFLAG and LINESTATUS. A depend of the variety of line objects in every group is included.

For Customized Snowflake properties, specify Key as sfSchema and Worth as tpch_sf1.
Select Save.

Subsequent, you add the vacation spot as an S3 bucket.

On the Visible tab, select Add nodes.
For Targets, select Amazon S3.

Select Information goal – S3 bucket within the AWS Glue Studio canvas.
For Identify, enter S3_Pricing_Summary.
For Node mother and father, choose Snowflake_Pricing_Summary.
For Format, choose Parquet.
For S3 Goal Location, enter s3:///pricing_summary_report/ (use the identify of your bucket).
For Information Catalog replace choices, choose Create a desk within the Information Catalog and on subsequent runs, replace the schema and add new partitions.
For Database, select db_blog_glue_snowflake.
For Desk identify, enter tb_pricing_summary.
Select Save.
Select Run to run the job, and monitor its standing on the Runs tab.

You efficiently accomplished the steps to create an AWS Glue job that reads knowledge from Snowflake and hundreds the outcomes into an S3 bucket utilizing a safe connectivity sample. Ultimately, if you wish to remodel the info earlier than loading it into Amazon S3, you should utilize AWS Glue transformations obtainable in AWS Glue Studio. Utilizing AWS Glue transformations is essential when creating an AWS Glue job as a result of they allow environment friendly knowledge cleaning, enrichment, and restructuring, ensuring the info is within the desired format and high quality for downstream processes. Check with Modifying AWS Glue managed knowledge remodel nodes for extra info.

Validate the outcomes

After the job is full, you possibly can validate the output of the ETL job run in Athena, a serverless interactive analytics service. To validate the output, full the next steps:

On the Athena console, select Launch Question Editor.
For Workgroup, select blog-workgroup.
If the message “All queries run within the Workgroup, blog-workgroup, will use the next settings:” is displayed, select Acknowledge.
For Database, select db_blog_glue_snowflake.
For Question, enter the next assertion:

SELECT l_returnflag
    , l_linestatus
    , sum_qty
    , sum_base_price
FROM db_blog_glue_snowflake.tb_pricing_summary

Select Run.

You’ve got efficiently validated your knowledge for the AWS Glue job Pricing Abstract Report Job.

Clear up

To scrub up your sources, full the next duties:

Delete the AWS Glue job Pricing Abstract Report Job.
Delete the AWS Glue connection glue-snowflake-connection.
Cease any AWS Glue interactive periods.
Delete content material from the S3 bucket blog-glue-snowflake-*.
Delete the CloudFormation stack blog-glue-snowflake.

Conclusion

Utilizing the native Snowflake connector in AWS Glue supplies an environment friendly and safe approach to combine knowledge from Snowflake into your knowledge pipelines on AWS. By following the steps outlined on this publish, you possibly can set up a non-public connectivity channel between AWS Glue and your Snowflake utilizing PrivateLink, Amazon VPC, safety teams, and Secrets and techniques Supervisor.

This structure lets you learn knowledge from and write knowledge to Snowflake tables straight from AWS Glue jobs operating on Spark. The safe connectivity sample prevents knowledge transfers over the general public web, enhancing knowledge privateness and safety.

Combining AWS knowledge integration providers like AWS Glue with knowledge platforms like Snowflake lets you construct scalable, safe knowledge lakes and pipelines to energy analytics, BI, knowledge science, and ML use circumstances.

In abstract, the native Snowflake connector and personal connectivity mannequin outlined right here present a performant, safe approach to embody Snowflake knowledge in AWS massive knowledge workflows. This unlocks scalable analytics whereas sustaining knowledge governance, compliance, and entry management. For extra info on AWS Glue, go to AWS Glue.

In regards to the Authors

Caio Sgaraboto Montovani is a Sr. Specialist Options Architect, Information Lake and AI/ML inside AWS Skilled Providers, creating scalable options in accordance buyer wants. His huge expertise has helped prospects in numerous industries comparable to life sciences and healthcare, retail, banking, and aviation construct options in knowledge analytics, machine studying, and generative AI. He’s captivated with rock and roll and cooking, and likes to spend time together with his household.

Kartikay Khator is a Options Architect inside World Life Sciences at AWS, the place he dedicates his efforts to creating progressive and scalable options that cater to the evolving wants of consumers. His experience lies in harnessing the capabilities of AWS analytics providers. Extending past his skilled pursuits, he finds pleasure and success on the earth of operating and climbing. Having already accomplished two marathons, he’s presently getting ready for his subsequent marathon problem.

Navnit Shukla, an AWS Specialist Resolution Architect specializing in Analytics, is captivated with serving to shoppers uncover precious insights from their knowledge. Leveraging his experience, he develops creative options that empower companies to make knowledgeable, data-driven selections. Notably, Navnit is the completed creator of the e-book “Information Wrangling on AWS,” showcasing his experience within the area.

Kamen Sharlandjiev is a Sr. Massive Information and ETL Options Architect, Amazon MWAA and AWS Glue ETL knowledgeable. He’s on a mission to make life simpler for patrons who’re going through advanced knowledge integration and orchestration challenges. His secret weapon? Absolutely managed AWS providers that may get the job accomplished with minimal effort. Observe Kamen on LinkedIn to maintain updated with the most recent Amazon MWAA and AWS Glue options and information!

Bosco Albuquerque is a Sr. Associate Options Architect at AWS and has over 20 years of expertise working with database and analytics merchandise from enterprise database distributors and cloud suppliers. He has helped expertise firms design and implement knowledge analytics options and merchandise.