Big Data

Author visual ETL flows on Amazon SageMaker Unified Studio (preview)

6 December 2024

Amazon SageMaker Unified Studio (preview) provides an integrated data and AI development environment within Amazon SageMaker. From the Unified Studio, you can collaborate and build faster using familiar AWS tools for model development, generative AI, data processing, and SQL analytics. This experience includes visual ETL, a new visual interface that makes it simple for data engineers to author, run, and monitor extract, transform, and load (ETL) data integration flows. You can use a simple visual interface to compose flows that move and transform data and run them on serverless compute. Additionally, you can choose to author your visual flows in English using generative AI prompts powered by Amazon Q. Visual ETL also automatically converts your visual flow directed acyclic graph (DAG) into Spark native scripts so you can continue authoring in a notebook, enabling a quick-start experience for developers who prefer to author using code.

This post shows how you can build a low-code and no-code (LCNC) visual ETL flow that enables seamless data ingestion and transformation across multiple data sources. We demonstrate how to build a visual ETL flow, run it, and query the output using Amazon Athena.

Additionally, we explore how generative AI can enhance your LCNC visual ETL development process, creating an intuitive and powerful workflow that streamlines the entire development experience.

Use case walkthrough

In this example, we use Amazon SageMaker Unified Studio to develop a visual ETL flow. This pipeline reads data from an Amazon S3 based file location, performs transformations on the data, and subsequently writes the transformed data back into an Amazon S3 based AWS Glue Data Catalog table. We use the allevents_pipe and venue_pipe files from the TICKIT dataset to demonstrate this capability.

The TICKIT dataset records sales activities on the fictional TICKIT website, where users can buy and sell tickets online for different types of events such as sports games, shows, and concerts. Analysts can use this dataset to track how ticket sales change over time, evaluate the performance of sellers, and determine the most successful events, venues, and seasons in terms of ticket sales.

The process involves merging the allevents_pipe and venue_pipe files from the TICKIT dataset. Next, the merged data is filtered to include only a specific geographic region. The data is then aggregated to calculate the number of events by venue name. Finally, the transformed output data is saved to Amazon S3, and a new AWS Glue Data Catalog table is created.
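The transformation logic in this use case can be sketched in plain Python. This is only an illustration under stated assumptions, not the Spark script that visual ETL generates: the sample rows are made up, the function name count_events_by_venue is hypothetical, and the column names (venueid, venuename, venuestate, eventid, e_venueid) follow the names assigned later in this walkthrough.

```python
from collections import defaultdict

# Hypothetical sample rows; the real TICKIT files contain more columns and rows.
venues = [
    {"venueid": 1, "venuename": "Arena A", "venuestate": "DC"},
    {"venueid": 2, "venuename": "Hall B", "venuestate": "NY"},
]
events = [
    {"eventid": 10, "e_venueid": 1},
    {"eventid": 11, "e_venueid": 1},
    {"eventid": 12, "e_venueid": 2},
    {"eventid": 13, "e_venueid": 99},  # no matching venue; dropped by the inner join
]


def count_events_by_venue(venues, events, state):
    """Inner-join events to venues, keep one state, count distinct events per venue."""
    venue_by_id = {v["venueid"]: v for v in venues}
    event_ids = defaultdict(set)
    for event in events:
        venue = venue_by_id.get(event["e_venueid"])
        if venue is None:
            continue  # inner join: skip events without a matching venue
        if venue["venuestate"] != state:
            continue  # filter: keep only the requested state
        event_ids[venue["venuename"]].add(event["eventid"])
    return {name: len(ids) for name, ids in event_ids.items()}


result = count_events_by_venue(venues, events, "DC")
print(result)
```

Because the join is an inner join, filtering before or after the join gives the same result; the visual flow built below filters the venue data first, which reduces the rows that reach the join.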

The following diagram illustrates the architecture:

Prerequisites

To follow the instructions, you must have the following prerequisites:

An AWS account
A SageMaker Unified Studio domain
A SageMaker Unified Studio project with the Data analytics and machine learning project profile

Build a visual ETL flow

Complete the following steps to build a new visual ETL flow with the sample dataset:

On the SageMaker Unified Studio console, on the top menu, choose Build.
Under DATA ANALYSIS & INTEGRATION, choose Visual ETL flows, as shown in the following screenshot.

Select your project and choose Continue.

Choose Create visual ETL flow.

This time, manually define the ETL flow.

On the top left, choose the + icon in the circle. Under Data sources, choose Amazon S3, as shown in the following screenshot. Locate the icon on the canvas.

Choose the Amazon S3 source node and enter the following values:

- S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv
- Format: CSV
- Delimiter: ,
- Multiline: Enabled
- Header: Disabled

Leave the rest as default.

Wait for the data preview to be available at the bottom of the screen.

Choose the + icon in the circle to the right of the Amazon S3 node. Under Transforms, choose Rename Columns.

Choose the Rename Columns node and choose Add new rename pair. For Current name and New name, enter the following pairs:
- _c0: venueid
- _c1: venuename
- _c2: venuecity
- _c3: venuestate
- _c4: venueseats
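
With Header disabled, the file is read without column names, so the columns arrive with positional names (_c0, _c1, and so on), which is why the rename pairs start from those names. The following sketch mimics that behavior with Python's csv module. The sample rows and the helper names read_headerless_csv and rename_columns are assumptions made for illustration; they are not part of SageMaker Unified Studio.

```python
import csv
import io

# Hypothetical two-row sample in the same shape as the venue file (no header row).
raw = "1,Arena A,Washington,DC,0\n2,Hall B,New York City,NY,0\n"

# New names for the positional columns, matching the rename pairs above.
rename_pairs = {
    "_c0": "venueid",
    "_c1": "venuename",
    "_c2": "venuecity",
    "_c3": "venuestate",
    "_c4": "venueseats",
}


def read_headerless_csv(text, delimiter=","):
    """Parse CSV text without a header, naming columns _c0, _c1, ... by position."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        rows.append({f"_c{i}": value for i, value in enumerate(fields)})
    return rows


def rename_columns(rows, pairs):
    """Return new rows with each key replaced by its new name, if one is given."""
    return [{pairs.get(key, key): value for key, value in row.items()} for row in rows]


venue_rows = rename_columns(read_headerless_csv(raw), rename_pairs)
print(venue_rows[0])
```

Note that every value stays a string in this sketch; type handling in the actual flow depends on the source node settings.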

Choose the + icon to the right of the Rename Columns node. Under Transforms, choose Filter.
Choose Add new filter condition.
For Key, choose venuestate. For Operation, choose ==. For Value, enter DC, as shown in the following screenshot.

Repeat the steps you used to add and configure the first Amazon S3 source node to add a second Amazon S3 source node for the events table, using the following values:

- S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/events.csv
- Format: CSV
- Sep: ,
- Multiline: Enabled
- Header: Disabled

Leave the rest as default.

Repeat the Rename Columns steps for this second Amazon S3 source node. On the Rename Columns node, choose Add new rename pair. For Current name and New name, enter the following pairs:
- _c0: eventid
- _c1: e_venueid
- _c2: catid
- _c3: dateid
- _c4: eventname
- _c5: starttime

Choose the + icon to the right of the Rename Columns node. Under Transforms, choose Join.
Drag the + icon on the right of the Filter node and drop it on the left of the Join node.
For Join type, choose Inner. For Left data source, choose e_venueid. For Right data source, choose venueid.

Choose the + icon to the right of the Join node. Under Transforms, choose SQL Query.
Enter the following query statement:

select
  venuename,
  count(distinct eventid) as eventid_count
from {myDataSource}
group by venuename
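
To see what this aggregation returns, a query of the same shape can be tried locally with Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for Spark SQL, the {myDataSource} placeholder is replaced by a hypothetical table named joined, the rows are made up, and an order by clause is added so the output order is predictable.

```python
import sqlite3

# In-memory database with a hypothetical table standing in for the joined data.
conn = sqlite3.connect(":memory:")
conn.execute("create table joined (venuename text, eventid integer)")
conn.executemany(
    "insert into joined values (?, ?)",
    [
        ("Arena A", 10),
        ("Arena A", 11),
        ("Arena A", 11),  # duplicate event id; counted once by count(distinct ...)
        ("Hall B", 12),
    ],
)

# Same shape as the query in the SQL Query node, plus "order by" for stable output.
query = """
select
  venuename,
  count(distinct eventid) as eventid_count
from joined
group by venuename
order by venuename
"""
counts = conn.execute(query).fetchall()
print(counts)
conn.close()
```

Using count(distinct eventid) rather than count(*) means that an event appearing more than once in the joined data is still counted a single time for its venue.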

Choose the + icon to the right of the SQL Query node. Under Data target, choose Amazon S3.
Choose the Amazon S3 target node and enter the following values:
- S3 URI: <your project's S3 location>/output/venue_event/ (for example, s3://<bucket-name>/dzd_bd693kieeb65yf/52d3z1nutb42w7/dev/output/venue_event/)
- Format: Parquet
- Compression: Snappy
- Mode: Overwrite
- Update catalog: True
- Database: Choose your database
- Table: venue_event_agg

At this point, you should see the end-to-end visual flow. Now you can publish it.

On the top right, choose Save to project to save the draft flow. You can optionally change the name and add a description. Choose Save to project, as shown in the following screenshot.

The visual ETL flow has been successfully saved.

Run flow

This section shows you how to run the visual ETL flow you authored.

On the top right, choose Run.

At the bottom of the screen, the run status is shown. The run status transitions from Starting to Running, and then from Running to Finished.

Wait for the run to be Finished.

Query using Amazon Athena

The output data has been written to the target S3 bucket. This section shows you how to query the output table.

On the top left menu, under DATA ANALYSIS & INTEGRATION, choose Query Editor.

On the data explorer, under Lakehouse, choose AwsDataCatalog. Navigate to the table venue_event_agg.
From the three dots icon, choose Query with Athena.

Four records will be returned, as shown in the following screenshot. This indicates that you succeeded in querying the output table written by the visual ETL flow.

Use generative AI to generate a visual ETL flow

The preceding walkthrough is performed step by step on the visual console. Alternatively, SageMaker Unified Studio can automate the flow authoring steps by using generative AI powered by Amazon Q.

On the top left menu, choose Visual ETL flows.
Choose Create visual ETL flow.
Enter the following text and choose Submit.

Create a flow to connect 2 Glue catalog tables venue and event in database glue_db, join on event id, filter on venue state with condition as venuestate=='DC' and write output to a S3 location

This creates the following boilerplate flow that you can edit to quickly author the visual ETL flow.

The generated flow retains the context of the prompt at the node level.

Clean up

To avoid incurring future charges, clean up the resources you created during this walkthrough:

From the SQL querybook, enter the following SQL to drop the table:

drop table venue_event_agg

To delete the flow, under Actions, choose Delete flow.

Conclusion

This post demonstrated how you can use Amazon SageMaker Unified Studio to build a low-code no-code (LCNC) visual ETL flow. This allows for seamless data ingestion and transformation across multiple data sources.

To learn more, refer to our documentation and the AWS News Blog.

About the Authors

Praveen Kumar is an Analytics Solutions Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-based services. His areas of interest are serverless technology, data governance, and data-driven AI applications.

Noritaka Sekiyama is a Principal Big Data Architect with AWS Analytics services. He's responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Alexandra Tello is a Senior Front End Engineer with the AWS Analytics services in New York City. She is a passionate advocate for usability and accessibility. In her free time, she's an espresso enthusiast and enjoys building mechanical keyboards.

Ranu Shah is a Software Development Manager with AWS Analytics services. She loves building data analytics solutions for customers. Outside work, she enjoys reading books or listening to music.

Gal Heyne is a Technical Product Manager for AWS Analytics services with a strong focus on AI/ML and data engineering. She is passionate about developing a deep understanding of customers' business needs and collaborating with engineers to design simple-to-use data products.