Introducing AWS Glue 5.0 for Apache Spark

AWS Glue is a serverless, scalable data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources. Today, we're launching AWS Glue 5.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue 5.0 upgrades the Spark engines to Apache Spark 3.5.2 and Python 3.11, giving you newer Spark and Python releases so you can develop, run, and scale your data integration workloads and get insights faster.

This post describes what's new in AWS Glue 5.0, its performance improvements, key highlights in Spark and related libraries, and how to get started on AWS Glue 5.0.

What’s new in AWS Glue 5.0

AWS Glue 5.0 upgrades the runtimes to Spark 3.5.2, Python 3.11, and Java 17 with new performance and security improvements from open source. AWS Glue 5.0 also updates support for open table format libraries to Apache Hudi 0.15.0, Apache Iceberg 1.6.1, and Delta Lake 3.2.1 so you can solve advanced use cases around performance, cost, governance, and privacy in your data lakes. AWS Glue 5.0 adds support for Spark-native fine-grained access control with AWS Lake Formation so you can apply table- and column-level permissions on an Amazon Simple Storage Service (Amazon S3) data lake for write operations (such as INSERT INTO and INSERT OVERWRITE) with Spark jobs.

Key features include:

  • Amazon SageMaker Unified Studio support
  • Amazon SageMaker Lakehouse support
  • Frameworks updated to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17
  • Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1
  • Spark-native fine-grained access control using Lake Formation
  • Amazon S3 Access Grants support
  • requirements.txt support to install additional Python libraries
  • Data lineage support in Amazon DataZone

Amazon SageMaker Unified Studio support

Amazon SageMaker Unified Studio supports AWS Glue 5.0 as the compute runtime for its unified notebooks and visual ETL flow editor.

Amazon SageMaker Lakehouse support

Glue 5.0 supports native integration with Amazon SageMaker Lakehouse to enable unified access across Amazon Redshift data warehouses and S3 data lakes.

Frameworks updated to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17

AWS Glue 5.0 upgrades the runtimes to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17. Glue 5.0 uses the AWS performance-optimized Spark runtime, which is 3.9 times faster than open source Spark. Glue 5.0 is also 32% faster than AWS Glue 4.0 and reduces costs by 22%.

For more details about updated library dependencies, see the Dependent library upgrades section.

Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1

AWS Glue 5.0 upgrades the open table format libraries to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1.

Spark-native fine-grained access control using Lake Formation

AWS Glue 5.0 supports AWS Lake Formation fine-grained access control (FGAC) through native Spark DataFrames and Spark SQL.
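For example, the following PySpark sketch shows a write that Lake Formation would check against table- and column-level permissions; the database and table names are hypothetical, and the job is assumed to already be configured for Lake Formation FGAC:

# Minimal sketch (hypothetical names): with Lake Formation FGAC enabled for
# the job, native Spark SQL writes such as INSERT INTO are authorized at the
# table and column level.
spark.sql("""
    INSERT INTO sales_db.orders
    SELECT order_id, customer_id, amount
    FROM staging_db.orders_raw
""")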

S3 Access Grants support

S3 Access Grants provides a simplified model for defining access permissions to data in Amazon S3 by prefix, bucket, or object. AWS Glue 5.0 supports S3 Access Grants through the EMR File System (EMRFS) using additional Spark configurations:

  • Key: --conf
  • Value: spark.hadoop.fs.s3.s3AccessGrants.enabled=true --conf spark.hadoop.fs.s3.s3AccessGrants.fallbackToIAM=false

To learn more, refer to the documentation.
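With these settings in place, reads and writes from your job go through EMRFS and are authorized by S3 Access Grants rather than falling back to the job role's IAM permissions. The following is a minimal PySpark sketch, assuming the job parameters above are set; the bucket and prefix are hypothetical:

# Minimal sketch (hypothetical bucket and prefix): with the job parameters
# above in place, ordinary S3 reads are authorized through S3 Access Grants.
df = spark.read.parquet("s3://amzn-s3-demo-bucket/granted-prefix/")
df.show(5)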

requirements.txt support to install additional Python libraries

In AWS Glue 5.0, you can provide a standard requirements.txt file to manage Python library dependencies. To do that, provide the following job parameters:

  • Parameter 1:
    • Key: --python-modules-installer-option
    • Value: -r
  • Parameter 2:
    • Key: --additional-python-modules
    • Value: s3://path_to_requirements.txt

AWS Glue 5.0 nodes initially load the Python libraries specified in requirements.txt. The following code illustrates a sample requirements.txt:

awswrangler==3.9.1 
elasticsearch==8.15.1
PyAthena==3.9.0
PyMySQL==1.1.1
PyYAML==6.0.2
pyodbc==5.2.0
pyorc==0.9.0 
redshift-connector==2.1.3
scipy==1.14.1
scikit-learn==1.5.2
SQLAlchemy==2.0.36
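As a hedged illustration, the following boto3 sketch shows one way to pass these two job parameters when creating a job; the job name, role, and script location are placeholders:

import boto3

glue = boto3.client("glue")

# Sketch only: the name, role, and script location are placeholders.
glue.create_job(
    Name="my-glue5-job",
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    GlueVersion="5.0",
    DefaultArguments={
        "--python-modules-installer-option": "-r",
        "--additional-python-modules": "s3://path_to_requirements.txt",
    },
)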

Data lineage support in Amazon DataZone (preview)

AWS Glue 5.0 supports data lineage in Amazon DataZone in preview. You can configure AWS Glue to automatically collect lineage information during Spark job runs and send the lineage events to be visualized in Amazon DataZone.

To configure this on the AWS Glue console, enable Generate lineage events and enter your Amazon DataZone domain ID on the Job details tab.

Alternatively, you can provide the following job parameter (substitute your DataZone domain ID):

  • Key: --conf
  • Value: spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=amazon_datazone_api --conf spark.openlineage.transport.domainId=<your-domain-ID>

Learn more in Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview.

Improved performance

AWS Glue 5.0 improves the price-performance of your AWS Glue jobs. AWS Glue 5.0 is 32% faster than AWS Glue 4.0 and reduces costs by 22%. The following chart shows the total job runtime for all queries (in seconds) of the 3 TB TPC-DS query dataset on AWS Glue 4.0 and AWS Glue 5.0. The TPC-DS dataset is located in an S3 bucket in Parquet format, and we used 30 G.2X workers in AWS Glue. We observed that our AWS Glue 5.0 TPC-DS tests on Amazon S3 were 58% faster than those on AWS Glue 4.0 while reducing cost by 36%.

Metric                       AWS Glue 4.0   AWS Glue 5.0
Total Query Time (seconds)   1896.1904      1197.78755
Geometric Mean (seconds)     10.09472       6.82208
Estimated Cost ($)           45.85533       29.20133

The following graphs illustrate the performance and cost comparisons.

Dependent library upgrades

The following table lists dependency upgrades.

Dependency                     Version in AWS Glue 4.0   Version in AWS Glue 5.0
Spark                          3.3.0                     3.5.2
Hadoop                         3.3.3                     3.4.0
Scala                          2.12                      2.12.18
Hive                           2.3.9                     2.3.9
EMRFS                          2.54.0                    2.66.0
Arrow                          7.0.0                     12.0.1
Iceberg                        1.0.0                     1.6.1
Hudi                           0.12.1                    0.15.0
Delta Lake                     2.1.0                     3.2.1
Java                           8                         17
Python                         3.10                      3.11
boto3                          1.26                      1.34.131
AWS SDK for Java               1.12                      2.28.8
AWS Glue Data Catalog Client   3.7.0                     4.2.0
EMR DynamoDB Connector         4.16.0                    5.6.0

The following table lists database connector (JDBC driver) upgrades.

Driver                 Connector Version in AWS Glue 4.0   Connector Version in AWS Glue 5.0
MySQL                  8.0.23                              8.0.33
Microsoft SQL Server   9.4.0                               10.2.0
Oracle Database        21.7                                23.3.0.23.09
PostgreSQL             42.3.6                              42.7.3
Amazon Redshift        redshift-jdbc42-2.1.0.16            redshift-jdbc42-2.1.0.29

The following are Spark connector upgrades:

Connector         Version in AWS Glue 4.0   Version in AWS Glue 5.0
Amazon Redshift   6.1.3                     6.3.0
OpenSearch        1.0.1                     1.2.0
MongoDB           10.0.4                    10.3.0
Snowflake         2.12.0                    3.0.0
BigQuery          0.32.2                    0.32.2

Apache Spark highlights

Spark 3.5.2 in AWS Glue 5.0 brings many valuable features, which we highlight in this section. To learn more about the highlights and improvements of Spark 3.4 and 3.5, refer to Spark Release 3.4.0 and Spark Release 3.5.0.

Apache Arrow-optimized Python UDF

Python user-defined functions (UDFs) enable users to build custom code for data processing needs, providing flexibility and accessibility. However, performance suffers because UDFs require serialization between the Python and JVM processes. Spark 3.5's Apache Arrow-optimized UDFs solve this by keeping data in shared memory using Arrow's high-performance columnar format, eliminating serialization overhead and making UDFs efficient for large-scale processing.

To use Arrow-optimized Python UDFs, set spark.sql.execution.pythonUDF.arrow.enabled to True.
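The following is a minimal PySpark sketch of an Arrow-optimized UDF; the data and the function itself are illustrative:

from pyspark.sql.functions import udf

# Enable Arrow-optimized Python UDFs for the session.
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")

# In Spark 3.5, a UDF can also opt in individually with useArrow=True.
@udf(returnType="int", useArrow=True)
def add_one(x):
    return x + 1

df = spark.createDataFrame([(1,), (2,)], ["value"])
df.select(add_one("value")).show()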

Python user-defined desk capabilities

A user-defined table function (UDTF) is a function that returns an entire output table instead of a single value. PySpark users can now write custom UDTFs with Python logic and use them in PySpark and SQL queries. Called in the FROM clause, UDTFs can accept zero or more arguments, either as scalar expressions or table arguments. The UDTF's return type, defined as either a StructType (for example, StructType().add("c1", StringType())) or a DDL string (for example, c1: string), determines the output table's schema.
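The following is a minimal sketch of a Python UDTF, callable from both the DataFrame API and SQL; the function and its logic are illustrative:

from pyspark.sql.functions import lit, udtf

# A UDTF returns an entire table; its schema is declared as a DDL string.
@udtf(returnType="num: int, squared: int")
class SquareNumbers:
    def eval(self, start: int, end: int):
        for num in range(start, end + 1):
            yield (num, num * num)

# DataFrame API usage.
SquareNumbers(lit(1), lit(3)).show()

# Register the UDTF to call it from SQL in the FROM clause.
spark.udtf.register("square_numbers", SquareNumbers)
spark.sql("SELECT * FROM square_numbers(1, 3)").show()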

RocksDB state store enhancements

In Spark 3.2, the RocksDB state store provider was added as a built-in state store implementation.

Changelog checkpointing

A new checkpoint mechanism for the RocksDB state store provider, called changelog checkpointing, persists the changelog (updates) of the state. This reduces the commit latency, thereby significantly reducing end-to-end latency.

You can enable this by setting spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled to True.

You can also enable this feature with existing checkpoints.
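The following is a minimal sketch of a stateful streaming query that uses the RocksDB state store provider with changelog checkpointing; the rate source, console sink, and checkpoint location are illustrative:

# Use the built-in RocksDB state store provider.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)
# Persist state changelogs at commit time instead of full snapshots.
spark.conf.set(
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled",
    "true",
)

# Illustrative stateful query: a streaming aggregation over a rate source.
counts = spark.readStream.format("rate").load().groupBy("value").count()
query = (
    counts.writeStream.format("console")
    .outputMode("complete")
    .option("checkpointLocation", "s3://amzn-s3-demo-bucket/checkpoints/counts/")
    .start()
)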

Memory management improvements

Although the RocksDB state store provider is well known to be useful for addressing memory issues in state, there was previously no fine-grained memory management. Spark 3.5 introduces more fine-grained memory management, which lets users cap the total memory usage across RocksDB instances in the same executor process, so you can configure the memory usage per executor process.
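For example, the following sketch caps total RocksDB memory per executor using the bounded memory settings documented for Spark 3.5; the 500 MB limit is illustrative:

# Bound total RocksDB memory usage across all state store instances
# in an executor (Spark 3.5).
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.boundedMemoryUsage", "true")
# Illustrative cap; tune this to your executor memory.
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.maxMemoryUsageMB", "500")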

Enhanced Structured Streaming

Spark 3.4 and 3.5 have many enhancements related to Spark Structured Streaming.

The new dropDuplicatesWithinWatermark() API deduplicates rows based on certain events. Watermark-based processing allows for more precise control over late data handling (see the sketch after this list):

  • Deduplicate the same rows: dropDuplicatesWithinWatermark()
  • Deduplicate values on the 'value' column: dropDuplicatesWithinWatermark(['value'])
  • Deduplicate using the guid column with a watermark based on the eventTime column: withWatermark("eventTime", "10 hours").dropDuplicatesWithinWatermark(["guid"])
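Putting these together, the following is a minimal runnable sketch of watermark-based deduplication; the rate source and the guid and eventTime columns are illustrative:

# Illustrative stream with guid and eventTime columns derived from a rate source.
events = (
    spark.readStream.format("rate").load()
    .selectExpr("CAST(value AS STRING) AS guid", "timestamp AS eventTime")
)

# Drop duplicate guids that arrive within the 10-hour watermark window.
deduped = (
    events
    .withWatermark("eventTime", "10 hours")
    .dropDuplicatesWithinWatermark(["guid"])
)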

Get started with AWS Glue 5.0

You can start using AWS Glue 5.0 through AWS Glue Studio, the AWS Glue console, the latest AWS SDK, and the AWS Command Line Interface (AWS CLI).

To start using AWS Glue 5.0 jobs in AWS Glue Studio, open your AWS Glue job, and on the Job details tab, choose the version Glue 5.0 – Supports Spark 3.5, Scala 2, Python 3.

To start using AWS Glue 5.0 in an AWS Glue Studio notebook or an interactive session through a Jupyter notebook, set 5.0 in the %glue_version magic:
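%glue_version 5.0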

The following output shows that the session is set to use AWS Glue 5.0:

Setting Glue model to: 5.0

Conclusion

In this post, we discussed the key features and benefits of AWS Glue 5.0. You can create new AWS Glue jobs on AWS Glue 5.0 to benefit from the improvements, or migrate your existing AWS Glue jobs.

We would like to thank the many engineers and leaders who helped build Glue 5.0, which provides customers with a performance-optimized Spark runtime and several new capabilities.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Stuti Deshpande is a Big Data Specialist Solutions Architect at AWS. She works with customers around the globe, providing them strategic and architectural guidance on implementing analytics solutions using AWS. She has extensive experience in big data, ETL, and analytics. In her free time, Stuti likes to travel, learn new dance forms, and enjoy quality time with family and friends.

Martin Ma is a Software Development Engineer on the AWS Glue team. He is passionate about improving the customer experience by applying problem-solving skills to invent new software solutions, as well as constantly searching for ways to simplify existing ones. In his spare time, he enjoys singing and playing the guitar.

Anshul Sharma is a Software Development Engineer on the AWS Glue team.

Rajendra Gujja is a Software Development Engineer on the AWS Glue team. He is passionate about distributed computing and everything and anything about data.

Maheedhar Reddy Chappidi is a Sr. Software Development Engineer on the AWS Glue team. He is passionate about building fault-tolerant and reliable distributed systems at scale. Outside of work, Maheedhar is passionate about listening to podcasts and playing with his two-year-old kid.

Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys skiing and gardening.

Savio Dsouza is a Software Development Manager on the AWS Glue team. His team works on generative AI applications for the data integration domain and on distributed systems for efficiently managing data lakes on AWS and optimizing Apache Spark for performance and reliability.

Kartik Panjabi is a Software Development Manager on the AWS Glue team. His team builds generative AI solutions for data integration and distributed systems for data integration.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue and Amazon EMR team. His team focuses on building distributed systems to enable customers with simple-to-use interfaces and AI-driven capabilities to efficiently transform petabytes of data across data lakes on Amazon S3, and databases and data warehouses in the cloud.
