Big Data

Introducing generative AI upgrades for Apache Spark in AWS Glue (preview)

23 November 2024

Organizations run tens of millions of Apache Spark functions every month on AWS, shifting, processing, and getting ready information for analytics and machine studying. As these functions age, preserving them safe and environment friendly turns into more and more difficult. Knowledge practitioners have to improve to the newest Spark releases to profit from efficiency enhancements, new options, bug fixes, and safety enhancements. Nonetheless, these upgrades are sometimes complicated, pricey, and time-consuming.

At this time, we’re excited to announce the preview of generative AI upgrades for Spark, a brand new functionality that permits information practitioners to shortly improve and modernize their Spark functions working on AWS. Beginning with Spark jobs in AWS Glue, this function permits you to improve from an older AWS Glue model to AWS Glue model 4.0. This new functionality reduces the time information engineers spend on modernizing their Spark functions, permitting them to give attention to constructing new information pipelines and getting beneficial analytics sooner.

Understanding the Spark improve problem

The standard technique of upgrading Spark functions requires vital guide effort and experience. Knowledge practitioners should fastidiously assessment incremental Spark launch notes to know the intricacies and nuances of breaking modifications, a few of which can be undocumented. They then want to change their Spark scripts and configurations, updating options, connectors, and library dependencies as wanted.

Testing these upgrades entails working the applying and addressing points as they come up. Every check run might reveal new issues, leading to a number of iterations of modifications. After the upgraded utility runs efficiently, practitioners should validate the brand new output towards the anticipated leads to manufacturing. This course of typically turns into year-long initiatives that value tens of millions of {dollars} and devour tens of hundreds of engineering hours.

How generative AI upgrades for Spark works

The Spark upgrades function makes use of AI to automate each the identification and validation of required modifications to your AWS Glue Spark functions. Let’s discover how these capabilities work collectively to simplify your improve course of.

AI-driven improve plan technology

If you provoke an improve, the service analyzes your utility utilizing AI to determine vital modifications throughout each PySpark code and Spark configurations. Throughout preview, Spark Upgrades helps upgrading from Glue 2.0 (Spark 2.4.3, Python 3.7) to Glue 4.0 (Spark 3.3.0, Python 3.10), routinely dealing with modifications that will sometimes require intensive guide assessment of public Spark, Python and Glue model migration guides, adopted by growth, testing, and verification. Spark Upgrades addresses 4 key areas of modifications:

Spark SQL API strategies and features
Spark DataFrame API strategies and operations
Python language updates (together with module deprecations and syntax modifications)
Spark SQL and Core configuration settings

The complexity of those upgrades turns into evident when you think about migrating from Spark 2.4.3 to Spark 3.3.0 entails over 100 version-specific modifications. A number of components contribute to the challenges of performing guide upgrades:

Extremely expressive language with a mixture of crucial and declarative programming types, permits customers to simply develop Spark functions. Nonetheless, this will increase the complexity of figuring out impacted code throughout upgrades.
Lazy execution of transformations in a distributed Spark utility improves efficiency however makes runtime verification of utility upgrades difficult for customers.
Spark configurations modifications in default values or the introduction of latest configurations throughout variations can impression utility conduct in several methods, making it troublesome for customers to determine points throughout upgrades.

For instance, in Spark 3.2, Spark SQL TRANSFORM operator can’t help alias in inputs. In Spark 3.1 and earlier, you possibly can write a script remodel like SELECT TRANSFORM(a AS c1, b AS c2) USING 'cat' FROM TBL.

# Authentic code (Glue 2.0)
question = """
SELECT TRANSFORM(merchandise as product_name, worth as product_price, quantity as product_number)
   USING 'cat'
FROM items
WHERE items.worth > 5
"""
spark.sql(question)

# Up to date code (Glue 4.0)
question = """
SELECT TRANSFORM(merchandise, worth, quantity)
   USING 'cat' AS (product_name, product_price, product_number)
FROM items
WHERE items.worth > 5
"""
spark.sql(question)

In Spark 3.1, loading and saving timestamps earlier than 1900-01-01 00:00:00Z as INT96 in Parquet information causes errors. In Spark 3.0, this wouldn’t fail however may lead to timestamp shifts as a consequence of calendar rebasing. To revive the outdated conduct in Spark 3.1, you would want to configure the Spark SQL configurations for spark.sql.legacy.parquet.int96RebaseModeInRead and spark.sql.legacy.parquet.int96RebaseModeInWrite to LEGACY.

# Authentic code (Glue 2.0)
information = [(1, "1899-12-31 23:59:59"), (2, "1900-01-01 00:00:00")]
schema = StructType([ StructField("id", IntegerType(), True), StructField("timestamp", TimestampType(), True) ])
df = spark.createDataFrame(information, schema=schema)
df.write.mode("overwrite").parquet("path/to/parquet_file") 

# Up to date code (Glue 4.0)
qspark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY") 
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")

information = [(1, "1899-12-31 23:59:59"), (2, "1900-01-01 00:00:00")]
schema = StructType([ StructField("id", IntegerType(), True), StructField("timestamp", TimestampType(), True) ])
df = spark.createDataFrame(information, schema=schema)
df.write.mode("overwrite").parquet("path/to/parquet_file")

Automated validation in your surroundings

After figuring out the required modifications, Spark Upgrades validates the upgraded utility by working it as an AWS Glue job in your AWS account. The service iterates via a number of validation runs, as much as 10, reviewing any errors encountered in every iteration and refining the improve plan till it achieves a profitable run. You’ll be able to run a Spark Improve Evaluation in your growth account utilizing mock datasets equipped via Glue job parameters used for validation runs.

After Spark Upgrades has efficiently validated the modifications, it presents an improve plan so that you can assessment. You’ll be able to then settle for and apply the modifications to your job within the growth account, earlier than replicating them to your job within the manufacturing account. The Spark Improve plan contains the next:

An improve abstract with a proof of code updates made through the course of
The ultimate script that you need to use rather than your present script
Logs from validation runs exhibiting how points have been recognized and resolved

You’ll be able to assessment all elements of the improve, together with intermediate validation makes an attempt and any error resolutions, earlier than deciding to use the modifications to your manufacturing job. This method ensures you may have full visibility into and management over the improve course of whereas benefiting from AI-driven automation.

Get began with generative AI Spark upgrades

Let’s stroll via the method of upgrading an AWS Glue 2.0 job to AWS Glue 4.0. Full the next steps:

On the AWS Glue console, select ETL jobs within the navigation pane.
Choose your AWS Glue 2.0 job, and select Run improve evaluation with AI.
For Outcome path, enter s3://aws-glue-assets--/scripts/upgraded/ (present your individual account ID and AWS Area).
Select Run.
On the Improve evaluation tab, look forward to the evaluation to be accomplished.

Whereas an evaluation is working, you possibly can view the intermediate job evaluation makes an attempt (as much as 10) for validation below the Runs tab. Moreover, the Upgraded abstract in S3 paperwork the upgrades made by the Spark Improve service to date, refining the improve plan with every try. Every try will show a distinct failure purpose, which the service tries to deal with within the subsequent try via code or configuration updates.
After a profitable evaluation, the upgraded script and a abstract of modifications shall be uploaded to Amazon Easy Storage Service (Amazon S3).
Overview the modifications to verify they meet your necessities, then select Apply upgraded script.

Your job has now been efficiently upgraded to AWS Glue model 4.0. You’ll be able to verify the Script tab to confirm the up to date script and the Job particulars tab to assessment the modified configuration.

Understanding the improve course of via an instance

We now present a manufacturing Glue 2.0 job that we want to improve to Glue 4.0 utilizing the Spark Improve function. This Glue 2.0 job reads a dataset, up to date each day in an S3 bucket below totally different partitions, containing new guide opinions from a web based market and runs SparkSQL to assemble insights into the consumer votes for the guide opinions.

Authentic code (Glue 2.0) – earlier than improve

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
from collections import Sequence
from pyspark.sql.varieties import DecimalType
from pyspark.sql.features import lit, to_timestamp, col

def is_data_type_sequence(coming_dict):
    return True if isinstance(coming_dict, Sequence) else False

def dataframe_to_dict_list(df):
    return [row.asDict() for row in df.collect()]

books_input_path = (
    "s3://aws-bigdata-blog/generated_synthetic_reviews/information/product_category=Books/"
)
view_name = "books_temp_view"
static_date = "2010-01-01"
books_source_df = (
    spark.learn.choice("header", "true")
    .choice("recursiveFileLookup", "true")
    .choice("path", books_input_path)
    .parquet(books_input_path)
)
books_source_df.createOrReplaceTempView(view_name)
books_with_new_review_dates_df = spark.sql(
    f"""
        SELECT 
        {view_name}.*,
            DATE_ADD(to_date(review_date), "180.8") AS next_review_date,
            CASE 
                WHEN DATE_ADD(to_date(review_date), "365") < to_date('{static_date}') THEN 'Sure' 
                ELSE 'No' 
            END AS Actionable
        FROM {view_name}
    """
)
books_with_new_review_dates_df.createOrReplaceTempView(view_name)
aggregate_books_by_marketplace_df = spark.sql(
    f"SELECT market, rely({view_name}.*) as total_count, avg(star_rating) as average_star_ratings, avg(helpful_votes) as average_helpful_votes, avg(total_votes) as average_total_votes  FROM {view_name} group by market"
)
aggregate_books_by_marketplace_df.present()
information = dataframe_to_dict_list(aggregate_books_by_marketplace_df)
if is_data_type_sequence(information):
    print("information is legitimate")
else:
    increase ValueError("Knowledge is invalid")

aggregated_target_books_df = aggregate_books_by_marketplace_df.withColumn(
    "average_total_votes_decimal", col("average_total_votes").forged(DecimalType(3, -2))
)
aggregated_target_books_df.present()

New code (Glue 4.0) – after improve

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from collections.abc import Sequence
from pyspark.sql.varieties import DecimalType
from pyspark.sql.features import lit, to_timestamp, col

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.legacy.allowStarWithSingleTableIdentifierInCount", "true")
spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", "true")
job = Job(glueContext)

def is_data_type_sequence(coming_dict):
    return True if isinstance(coming_dict, Sequence) else False

def dataframe_to_dict_list(df):
    return [row.asDict() for row in df.collect()]

books_input_path = (
    "s3://aws-bigdata-blog/generated_synthetic_reviews/information/product_category=Books/"
)
view_name = "books_temp_view"
static_date = "2010-01-01"
books_source_df = (
    spark.learn.choice("header", "true")
    .choice("recursiveFileLookup", "true")
    .load(books_input_path)
)
books_source_df.createOrReplaceTempView(view_name)
books_with_new_review_dates_df = spark.sql(
    f"""
        SELECT 
        {view_name}.*,
            DATE_ADD(to_date(review_date), 180) AS next_review_date,
            CASE 
                WHEN DATE_ADD(to_date(review_date), 365) < to_date('{static_date}') THEN 'Sure' 
                ELSE 'No' 
            END AS Actionable
        FROM {view_name}
    """
)
books_with_new_review_dates_df.createOrReplaceTempView(view_name)
aggregate_books_by_marketplace_df = spark.sql(
    f"SELECT market, rely({view_name}.*) as total_count, avg(star_rating) as average_star_ratings, avg(helpful_votes) as average_helpful_votes, avg(total_votes) as average_total_votes  FROM {view_name} group by market"
)
aggregate_books_by_marketplace_df.present()
information = dataframe_to_dict_list(aggregate_books_by_marketplace_df)
if is_data_type_sequence(information):
    print("information is legitimate")
else:
    increase ValueError("Knowledge is invalid")

aggregated_target_books_df = aggregate_books_by_marketplace_df.withColumn(
    "average_total_votes_decimal", col("average_total_votes").forged(DecimalType(3, -2))
)
aggregated_target_books_df.present()

Improve abstract

In Spark 3.2, spark.sql.adaptive.enabled is enabled by default. To revive the conduct earlier than Spark 3.2, 
you possibly can set spark.sql.adaptive.enabled to false.

No appropriate migration rule was discovered within the supplied context for this particular error. The change was made primarily based on the error message, which indicated that Sequence couldn't be imported from collections module. In Python 3.10, Sequence has been moved to the collections.abc module.

In Spark 3.1, path choice can't coexist when the next strategies are referred to as with path parameter(s): DataFrameReader.load(), DataFrameWriter.save(), DataStreamReader.load(), or DataStreamWriter.begin(). As well as, paths choice can't coexist for DataFrameReader.load(). For instance, spark.learn.format(csv).choice(path, /tmp).load(/tmp2) or spark.learn.choice(path, /tmp).csv(/tmp2) will throw org.apache.spark.sql.AnalysisException. In Spark model 3.0 and under, path choice is overwritten if one path parameter is handed to above strategies; path choice is added to the general paths if a number of path parameters are handed to DataFrameReader.load(). To revive the conduct earlier than Spark 3.1, you possibly can set spark.sql.legacy.pathOptionBehavior.enabled to true.

In Spark 3.0, the `date_add` and `date_sub` features accepts solely int, smallint, tinyint because the 2nd argument; fractional and non-literal strings will not be legitimate anymore, for instance: `date_add(forged('1964-05-23' as date), '12.34')` causes `AnalysisException`. Notice that, string literals are nonetheless allowed, however Spark will throw `AnalysisException` if the string content material will not be a sound integer. In Spark model 2.4 and under, if the 2nd argument is fractional or string worth, it's coerced to int worth, and the result's a date worth of `1964-06-04`.

In Spark 3.2, the utilization of rely(tblName.*) is blocked to keep away from producing ambiguous outcomes. As a result of rely(*) and rely(tblName.*) will output in a different way if there's any null values. To revive the conduct earlier than Spark 3.2, you possibly can set spark.sql.legacy.allowStarWithSingleTableIdentifierInCount to true.

In Spark 3.0, adverse scale of decimal will not be allowed by default, for instance, information kind of literal like 1E10BD is DecimalType(11, 0). In Spark model 2.4 and under, it was DecimalType(2, -9). To revive the conduct earlier than Spark 3.0, you possibly can set spark.sql.legacy.allowNegativeScaleOfDecimal to true.

As seen within the up to date Glue 4.0 (Spark 3.3.0) script diff in comparison with the Glue 2.0 (Spark 2.4.3) script and the ensuing improve abstract, a complete of six totally different code and configuration updates have been utilized throughout the six makes an attempt of the Spark Improve Evaluation.

Try #1 included a Spark SQL configuration (spark.sql.adaptive.enabled) to revive the applying conduct as a brand new function for Spark SQL adaptive question execution is launched beginning Spark 3.2. Customers can examine this configuration change and might additional allow or disable it as per their choice.
Try #2 resolved a Python language change between Python 3.7 and three.10 with the introduction of a brand new summary base class (abc) below the Python collections module for importing Sequence.
Try #3 resolved an error encountered as a consequence of a change in conduct of DataFrame API beginning Spark 3.1 the place path choice can’t exist with different DataFrameReader operations.
Try #4 resolved an error attributable to a change within the Spark SQL operate API signature for DATE_ADD which now solely accepts integers because the second argument ranging from Spark 3.0.
Try #5 resolved an error encountered because of the change in conduct Spark SQL operate API for rely(tblName.*) beginning Spark 3.2. The conduct was restored with the introduction of a brand new Spark SQL configuration spark.sql.legacy.allowStarWithSingleTableIdentifierInCount
Try #6 efficiently accomplished the evaluation and ran the brand new script on Glue 4.0 with none new errors. The ultimate try resolved an error encountered because of the prohibited use of adverse scale for forged(DecimalType(3, -6) in Spark DataFrame API beginning Spark 3.0. The difficulty was addressed by enabling the brand new Spark SQL configuration spark.sql.legacy.allowNegativeScaleOfDecimal.

Necessary issues for preview

As you start utilizing automated Spark upgrades through the preview interval, there are a number of vital elements to contemplate for optimum utilization of the service:

Service scope and limitations – The preview launch focuses on PySpark code upgrades from AWS Glue variations 2.0 to model 4.0. On the time of writing, the service handles PySpark code that doesn’t depend on further library dependencies. You’ll be able to run automated upgrades for as much as 10 jobs concurrently in an AWS account, permitting you to effectively modernize a number of jobs whereas sustaining system stability.
Optimizing prices through the improve course of – As a result of the service makes use of generative AI to validate the improve plan via a number of iterations, with every iteration working as an AWS Glue job in your account, it’s important to optimize the validation job run configurations for cost-efficiency. To attain this, we suggest specifying a run configuration when beginning an improve evaluation as follows:
- Utilizing non-production developer accounts and deciding on pattern mock datasets that characterize your manufacturing information however are smaller in measurement for validation with Spark Upgrades.
- Utilizing right-sized compute assets, resembling G.1X employees, and deciding on an acceptable variety of employees for processing your pattern information.
- Enabling Glue auto scaling when relevant to routinely alter assets primarily based on workload.
For instance, in case your manufacturing job processes terabytes of information with 20 G.2X employees, you would possibly configure the improve job to course of a number of gigabytes of consultant information with 2 G.2X employees and auto scaling enabled for validation.
Preview greatest practices – Through the preview interval, we strongly suggest beginning your improve journey with non-production jobs. This method permits you to familiarize your self with the improve workflow, and perceive how the service handles various kinds of Spark code patterns.

Your expertise and suggestions are essential in serving to us improve and enhance this function. We encourage you to share your insights, strategies, and any challenges you encounter via AWS Assist or your account staff. This suggestions will assist us enhance the service and add capabilities that matter most to you throughout preview.

Conclusion

This publish demonstrates how automated Spark upgrades can help with migrating your Spark functions in AWS Glue. It simplifies the migration course of through the use of generative AI to routinely determine the required script modifications throughout totally different Spark variations.

To be taught extra about this function in AWS Glue, see Generative AI upgrades for Apache Spark in AWS Glue.

A particular due to everybody who contributed to the launch of generative AI upgrades for Apache Spark in AWS Glue: Shuai Zhang, Mukul Prasad, Liyuan Lin, Rishabh Nair, Raghavendhar Thiruvoipadi Vidyasagar, Tina Shao, Chris Kha, Neha Poonia, Xiaoxi Liu, Japson Jeyasekaran, Suthan Phillips, Raja Jaya Chandra Mannem, Yu-Ting Su, Neil Jonkers, Boyko Radulov, Sujatha Rudra, Mohammad Sabeel, Mingmei Yang, Matt Su, Daniel Greenberg, Charlie Sim, McCall Petier, Adam Rohrscheib, Andrew King, Ranu Shah, Aleksei Ivanov, Bernie Wang, Karthik Seshadri, Sriram Ramarathnam, Asterios Katsifodimos, Brody Bowman, Sunny Konoplev, Bijay Bisht, Saroj Yadav, Carlos Orozco, Nitin Bahadur, Kinshuk Pahare, Santosh Chandrachood, and William Vambenepe.

Concerning the Authors

Noritaka Sekiyama is a Principal Huge Knowledge Architect on the AWS Glue staff. He’s liable for constructing software program artifacts to assist prospects. In his spare time, he enjoys biking along with his new highway bike.

Keerthi Chadalavada is a Senior Software program Improvement Engineer at AWS Glue, specializing in combining generative AI and information integration applied sciences to design and construct complete options for purchasers’ information and analytics wants.

Shubham Mehta is a Senior Product Supervisor at AWS Analytics. He leads generative AI function growth throughout providers resembling AWS Glue, Amazon EMR, and Amazon MWAA, utilizing AI/ML to simplify and improve the expertise of information practitioners constructing information functions on AWS.

Pradeep Patel is a Software program Improvement Supervisor on the AWS Glue staff. He’s captivated with serving to prospects resolve their issues through the use of the facility of the AWS Cloud to ship extremely scalable and sturdy options. In his spare time, he likes to hike and play with net functions.

Chuhan Liu is a Software program Engineer at AWS Glue. He’s captivated with constructing scalable distributed programs for large information processing, analytics, and administration. He’s additionally eager on utilizing generative AI applied sciences to offer brand-new expertise to prospects. In his spare time, he likes sports activities and enjoys enjoying tennis.

Vaibhav Naik is a software program engineer at AWS Glue, captivated with constructing sturdy, scalable options to deal with complicated buyer issues. With a eager curiosity in generative AI, he likes to discover revolutionary methods to develop enterprise-level options that harness the facility of cutting-edge AI applied sciences.

Mohit Saxena is a Senior Software program Improvement Supervisor on the AWS Glue and Amazon EMR staff. His staff focuses on constructing distributed programs to allow prospects with simple-to-use interfaces and AI-driven capabilities to effectively remodel petabytes of information throughout information lakes on Amazon S3, and databases and information warehouses on the cloud.