Big Data

Introducing generative AI troubleshooting for Apache Spark in AWS Glue (preview)

23 November 2024

Organizations run hundreds of thousands of Apache Spark applications every month to organize, move, and process their data for analytics and machine learning (ML). Building and maintaining these Spark applications is an iterative process, where developers spend significant time testing and troubleshooting their code. During development, data engineers often spend hours sifting through log files, analyzing execution plans, and making configuration changes to resolve issues. This process becomes even more challenging in production environments because of the distributed nature of Spark, its in-memory processing model, and the multitude of configuration options available. Troubleshooting these production issues requires extensive analysis of logs and metrics, often leading to extended downtimes and delayed insights from critical data pipelines.

Today, we’re excited to announce the preview of generative AI troubleshooting for Spark in AWS Glue. This is a new capability that enables data engineers and scientists to quickly identify and resolve issues in their Spark applications. This feature uses ML and generative AI technologies to provide automated root cause analysis for failed Spark applications, along with actionable recommendations and remediation steps. This post demonstrates how you can debug your Spark applications with generative AI troubleshooting.

How generative AI troubleshooting for Spark works

For Spark jobs, the troubleshooting feature analyzes job metadata, metrics, and logs associated with the error signature of your job to generate a comprehensive root cause analysis. You can initiate the troubleshooting and optimization process with a single click on the AWS Glue console. With this feature, you can reduce your mean time to resolution from days to minutes, optimize your Spark applications for cost and performance, and focus more on deriving value from your data.

Manually debugging Spark applications can be challenging for data engineers and ETL developers for a few different reasons:

Spark’s extensive connectivity and configuration options for a variety of resources make it a popular data processing platform, but they often make it challenging to find the root cause of issues when configurations are not correct, especially those related to resource setup (S3 bucket, databases, partitions, resolved columns) and access permissions (roles and keys).
Spark’s in-memory processing model and distributed partitioning of datasets across its workers are good for parallelism, but they often make it difficult for users to identify the root cause of failures resulting from resource exhaustion issues like out of memory and disk exceptions.
Lazy evaluation of Spark transformations is good for performance, but it makes it challenging to accurately and quickly identify the application code and logic that caused the failure from the distributed logs and metrics emitted from different executors.
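
To make the last point concrete, here is a minimal sketch of the same effect using plain Python generators rather than Spark. It is only an analogy, and the names `parse_rows` and `total` are made up for illustration: the faulty step is defined on one line, but the error surfaces later, when the result is actually consumed.

```python
def parse_rows(rows):
    # "Transformation": calling this only builds a generator; no row is parsed yet.
    for row in rows:
        yield int(row)  # fails on bad input, but only once the generator is consumed


def total(values):
    # "Action": consuming the generator triggers all the deferred work.
    return sum(values)


parsed = parse_rows(["1", "2", "oops"])  # no error is raised on this line

try:
    total(parsed)
except ValueError as err:
    failure = str(err)
    print("failed during the action:", failure)
```

In Spark the gap is wider still, because the deferred work runs on remote executors and the error is reported through their logs.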

Let’s look at a few common and complex Spark troubleshooting scenarios where Generative AI Troubleshooting for Spark can save the hours of manual debugging otherwise required to deep dive and arrive at the exact root cause.

Resource setup or access errors

Spark applications can integrate data from a variety of resources, such as datasets with several partitions and columns in S3 buckets and Data Catalog tables. They use the job’s associated IAM roles and KMS keys for the permissions needed to access these resources, and they require the resources to exist and be available in the Regions and locations referenced by their identifiers. Users can misconfigure their applications, which results in errors that require a deep dive into the logs to establish that the root cause is a resource setup or permission issue.

Manual RCA: Failure reason and Spark application Logs

The following example shows the failure reason for a common S3 bucket setup issue in a production job run. The failure reason coming from Spark does not help you understand the root cause or the line of code that needs to be inspected to fix it.

Exception in User Class: org.apache.spark.SparkException : Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (172.36.245.14 executor 1): com.amazonaws.services.glue.util.NonFatalException: Error opening file:

After deep diving into the logs of one of the many distributed Spark executors, it becomes clear that the error was caused by an S3 bucket that does not exist. However, the error stack trace is usually too long, and often truncated, to reveal the precise root cause and the location within the Spark application where the fix is needed.

Caused by: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: 80MTEVF2RM7ZYAN9; S3 Extended Request ID: AzRz5f/Amtcs/QatfTvDqU0vgSu5+v7zNIZwcjUn4um5iX3JzExd3a3BkAXGwn/5oYl7hOXRBeo=; Proxy: null), S3 Extended Request ID: AzRz5f/Amtcs/QatfTvDqU0vgSu5+v7zNIZwcjUn4um5iX3JzExd3a3BkAXGwn/5oYl7hOXRBeo=
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:423)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.isFolderUsingFolderObject(Jets3tNativeFileSystemStore.java:249)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.isFolder(Jets3tNativeFileSystemStore.java:212)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:518)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:935)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:927)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:983)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:197)
at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReaderSplittable.initialize(TapeHadoopRecordReaderSplittable.scala:168)
... 29 more
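
As a rough illustration of what this manual triage involves, the following Python sketch pulls the service, HTTP status, and error code out of a `Caused by` line like the one above. The function name `parse_aws_error` and the regular expression are assumptions made for this example, and the sample line is abbreviated; this is not how the AWS Glue troubleshooting feature works internally.

```python
import re

# Matches the "(Service: ...; Status Code: ...; Error Code: ...; ...)" detail
# block that appears in the exception message shown above.
_DETAIL = re.compile(
    r"\(Service: (?P<service>[^;]+); "
    r"Status Code: (?P<status>\d+); "
    r"Error Code: (?P<code>[^;]+);"
)


def parse_aws_error(line):
    """Return service, HTTP status, and error code from a log line, or None."""
    match = _DETAIL.search(line)
    if match is None:
        return None
    return {
        "service": match.group("service"),
        "status": int(match.group("status")),
        "code": match.group("code"),
    }


# Abbreviated copy of the "Caused by" line from the executor log.
sample = (
    "Caused by: java.io.IOException: ...AmazonS3Exception: "
    "The specified bucket does not exist (Service: Amazon S3; "
    "Status Code: 404; Error Code: NoSuchBucket; Request ID: 80MTEVF2RM7ZYAN9; ...)"
)
print(parse_aws_error(sample))
```

Even with the error code extracted, the log line does not say which line of the Spark script referenced the missing bucket, which is the gap the troubleshooting analysis is meant to close.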

With Generative AI Spark Troubleshooting: RCA and Recommendations

With Spark Troubleshooting, you simply click the Troubleshooting analysis button on your failed job run, and the service analyzes the debug artifacts of your failed job to identify the root cause, along with the line number in your Spark application that you can inspect to resolve the issue.

Spark Out of Memory Errors

Let’s take a common but relatively complex error that requires significant manual analysis before you can conclude it was caused by a Spark job running out of memory on the Spark driver (master node) or on one of the distributed Spark executors. Usually, troubleshooting requires an experienced data engineer to manually go through the following steps to identify the root cause.

Search through Spark driver logs to find the exact error message
Navigate to the Spark UI to analyze memory usage patterns
Review executor metrics to understand memory pressure
Analyze the code to identify memory-intensive operations

This process often takes hours because the failure reason reported by Spark rarely makes it clear that the job ran out of memory on the Spark driver, or what the remedy is.

Manual RCA: Failure reason and Spark application Logs

The following example shows the failure reason for the error.

Py4JJavaError: An error occurred while calling o4138.collectToPython. java.lang.StackOverflowError

Spark driver logs require an extensive search to find the exact error message. In this case, the error stack trace consisted of more than a hundred function calls, and it is challenging to understand the precise root cause because the Spark application terminated abruptly.

py4j.protocol.Py4JJavaError: An error occurred while calling o4138.collectToPython.
: java.lang.StackOverflowError
 at org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$1942/131413145.get$Lambda(Unknown Source)
 at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:798)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:459)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:781)
 at org.apache.spark.sql.catalyst.trees.TreeNode.clone(TreeNode.scala:881)
 at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$clone(LogicalPlan.scala:30)
 at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.clone(AnalysisHelper.scala:295)
 at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.clone$(AnalysisHelper.scala:294)
 at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.clone(LogicalPlan.scala:30)
 at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.clone(LogicalPlan.scala:30)
 at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$clone$1(TreeNode.scala:881)
 at org.apache.spark.sql.catalyst.trees.TreeNode.applyFunctionIfChanged$1(TreeNode.scala:747)
 at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:783)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:459)
 ... repeated several times with hundreds of function calls
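
When a trace runs to hundreds of frames, one quick way to see its shape is to count how often each method appears. The sketch below is a minimal helper written for this purpose (`summarize_frames` is a hypothetical name, and the sample is a shortened, rearranged excerpt of the trace above). A handful of methods repeating many times, as with the `clone` and `mapChildren` frames here, suggests deep recursion over a large or deeply nested query plan rather than a one-off failure.

```python
import re
from collections import Counter

# A JVM stack frame looks like: "at package.Class.method(File.scala:123)"
_FRAME = re.compile(r"^\s*at\s+([\w.$/]+)\(")


def summarize_frames(trace, top=3):
    """Count how often each fully qualified method appears in a stack trace."""
    counts = Counter()
    for line in trace.splitlines():
        match = _FRAME.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(top)


# Shortened excerpt for illustration only; the real trace is far longer.
trace = """\
py4j.protocol.Py4JJavaError: An error occurred while calling o4138.collectToPython.
: java.lang.StackOverflowError
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:781)
 at org.apache.spark.sql.catalyst.trees.TreeNode.clone(TreeNode.scala:881)
 at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.clone(LogicalPlan.scala:30)
 at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.clone(LogicalPlan.scala:30)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:781)
"""

for method, count in summarize_frames(trace):
    print(count, method)
```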

With Generative AI Spark Troubleshooting: RCA and Recommendations

With Spark Troubleshooting, you can click the Troubleshooting analysis button on your failed job run and get a detailed root cause analysis with the line of code that you can inspect, along with recommendations on best practices for optimizing your Spark application to fix the problem.

Spark Out of Disk Errors

Another complex error pattern with Spark occurs when it runs out of disk storage on one of the many Spark executors in the Spark application. As with Spark OOM exceptions, manual troubleshooting requires an extensive deep dive into distributed executor logs and metrics to understand the root cause and, because Spark evaluates its transformations lazily, to identify the application logic or code causing the error.

Manual RCA: Failure Reason and Spark application Logs

The associated failure reason and error stack trace in the application logs are again quite long, requiring the user to gather more insights from the Spark UI and Spark metrics to identify the root cause and the resolution.

An error occurred while calling o115.parquet. No space left on device

py4j.protocol.Py4JJavaError: An error occurred while calling o115.parquet.
: org.apache.spark.SparkException: Job aborted.
 at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279)
 at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:193)
 at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
 at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
 at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
 ....
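
The three scenarios in this post each leave a recognizable substring somewhere in the failure reason or logs. As a minimal sketch of a first-pass manual triage, the snippet below maps such substrings to a scenario. The table `SIGNATURES` and the function `classify_failure` are hypothetical, `OutOfMemoryError` is added as an assumption because it does not appear in the logs shown here, and real triage needs far more context than a substring match.

```python
# Hypothetical signature table: substrings mapped to the scenario they point to.
SIGNATURES = [
    ("NoSuchBucket", "resource setup or access"),
    ("StackOverflowError", "out of memory"),
    ("OutOfMemoryError", "out of memory"),  # assumption: not shown in this post
    ("No space left on device", "out of disk"),
]


def classify_failure(message):
    """Return the first matching scenario, or 'unsupported' if none match."""
    for needle, category in SIGNATURES:
        if needle in message:
            return category
    return "unsupported"


print(classify_failure(
    "An error occurred while calling o115.parquet. No space left on device"
))
```

A match only names the symptom. Locating the transformation that triggered it still means correlating executor logs, the Spark UI, and the script.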

With Generative AI Spark Troubleshooting: RCA and Recommendations

Spark Troubleshooting provides the RCA and the line number in the script where the data shuffle operation was lazily evaluated by Spark. It also points to the best practices guide for optimizing shuffles or wide transforms, or for using the S3 shuffle plugin on AWS Glue.

Debug AWS Glue for Spark jobs

To use this troubleshooting feature on your failed job runs, complete the following steps:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Choose your job.
On the Runs tab, choose your failed job run.
Choose Troubleshoot with AI to start the analysis.
You will be redirected to the Troubleshooting analysis tab with the generated analysis.
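
If you have many jobs, it can help to find the failed runs programmatically before opening them in the console. The following is a minimal sketch, assuming a boto3 Glue client and its `get_job_runs` operation; the helper name `failed_runs` and the job name `my-etl-job` are placeholders. It only lists failed runs. The troubleshooting analysis itself is started from the console as described in the steps above.

```python
def failed_runs(glue_client, job_name):
    """Return (run id, error message) pairs for FAILED runs of a Glue job."""
    failures = []
    kwargs = {"JobName": job_name}
    while True:
        page = glue_client.get_job_runs(**kwargs)
        for run in page.get("JobRuns", []):
            if run.get("JobRunState") == "FAILED":
                failures.append((run["Id"], run.get("ErrorMessage", "")))
        token = page.get("NextToken")
        if not token:
            return failures
        # Follow pagination until the service stops returning a token.
        kwargs["NextToken"] = token


# Usage (requires boto3 and AWS credentials; "my-etl-job" is a placeholder):
# import boto3
# for run_id, message in failed_runs(boto3.client("glue"), "my-etl-job"):
#     print(run_id, message)
```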

You will see Root Cause Analysis and Recommendations sections.

The service analyzes your job’s debug artifacts and provides the results. Let’s look at a real example of how this works in practice.

In an end-to-end run, Spark Troubleshooting helps a user identify the root cause of a resource setup issue and fix the job to resolve the error.

Considerations

During preview, the service focuses on common Spark errors like resource setup and access issues, out of memory exceptions on the Spark driver and executors, and out of disk exceptions on Spark executors, and it will clearly indicate when an error type is not yet supported. Your jobs must run on AWS Glue version 4.0.

The preview is available at no additional charge in all AWS commercial Regions where AWS Glue is available. When you use this capability, any validation runs you trigger to test proposed solutions will be charged according to standard AWS Glue pricing.

Conclusion

This post demonstrated how generative AI troubleshooting for Spark in AWS Glue helps with your day-to-day Spark application debugging. It simplifies the debugging process for your Spark applications by using generative AI to automatically identify the root cause of failures and provide actionable recommendations to resolve the issues.

To learn more about this new troubleshooting feature for Spark, please visit Troubleshooting Spark jobs with AI.

A special thanks to everyone who contributed to the launch of generative AI troubleshooting for Apache Spark in AWS Glue: Japson Jeyasekaran, Rahul Sharma, Mukul Prasad, Weijing Cai, Jeremy Samuel, Hirva Patel, Martin Ma, Layth Yassin, Kartik Panjabi, Maya Patwardhan, Anshi Shrivastava, Henry Caballero Corzo, Rohit Das, Peter Tsai, Daniel Greenberg, McCall Peltier, Takashi Onikura, Tomohiro Tanaka, Sotaro Hikita, Chiho Sugimoto, Yukiko Iwazumi, Gyan Radhakrishnan, Victor Pleikis, Sriram Ramarathnam, Matt Sampson, Brian Ross, Alexandra Tello, Andrew King, Joseph Barlan, Daiyan Alamgir, Ranu Shah, Adam Rohrscheib, Nitin Bahadur, Santosh Chandrachood, Matt Su, Kinshuk Pahare, and William Vambenepe.

About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Vishal Kajjam is a Software Development Engineer on the AWS Glue team. He is passionate about distributed computing and using ML/AI for designing and building end-to-end solutions to address customers’ data integration needs. In his spare time, he enjoys spending time with family and friends.

Shubham Mehta is a Senior Product Manager at AWS Analytics. He leads generative AI feature development across services such as AWS Glue, Amazon EMR, and Amazon MWAA, using AI/ML to simplify and enhance the experience of data practitioners building data applications on AWS.

Wei Tang is a Software Development Engineer on the AWS Glue team. She is a strong developer with deep interests in solving recurring customer problems with distributed systems and AI/ML.

XiaoRun Yu is a Software Development Engineer on the AWS Glue team. He is working on building new features for AWS Glue to help customers. Outside of work, Xiaorun enjoys exploring new places in the Bay Area.

Jake Zych is a Software Development Engineer on the AWS Glue team. He has deep interest in distributed systems and machine learning. In his spare time, Jake likes to create video content and play board games.

Savio Dsouza is a Software Development Manager on the AWS Glue team. His team works on distributed systems & new interfaces for data integration and efficiently managing data lakes on AWS.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue and Amazon EMR team. His team focuses on building distributed systems to enable customers with simple-to-use interfaces and AI-driven capabilities to efficiently transform petabytes of data across data lakes on Amazon S3, and databases and data warehouses on the cloud.
