Poor data quality can lead to a variety of issues, including pipeline failures, incorrect reporting, and poor business decisions. For example, if data ingested from one of the systems contains a high number of duplicates, it can result in skewed data in the reporting system. To prevent such issues, data quality checks are integrated into data pipelines, which assess the accuracy and reliability of the data. These checks in the data pipelines send alerts if the data quality standards are not met, enabling data engineers and data stewards to take appropriate actions. Examples of these checks include counting records, detecting duplicate data, and checking for null values.
To address these issues, Amazon built an open source framework called Deequ, which performs data quality checks at scale. In 2023, AWS launched AWS Glue Data Quality, which provides a complete solution to measure and monitor data quality. AWS Glue uses the power of Deequ to run data quality checks, identify records that are bad, provide a data quality score, and detect anomalies using machine learning (ML). However, you may have very small datasets and require faster startup times. In such instances, an effective solution is running Deequ on AWS Lambda.
In this post, we show how to run Deequ on Lambda. Using a sample application as reference, we demonstrate how to build a data pipeline to check and improve the quality of data using AWS Step Functions. The pipeline uses PyDeequ, a Python API for Deequ and a library built on top of Apache Spark to perform data quality checks. We show how to implement data quality checks using the PyDeequ library, deploy an example that showcases how to run PyDeequ in Lambda, and discuss the considerations when using Lambda for running PyDeequ.
To help you get started, we have set up a GitHub repository with a sample application that you can use to practice running and deploying the application.
Solution overview
In this use case, the data pipeline checks the quality of Airbnb accommodation data, which includes ratings, reviews, and prices, by neighborhood. Your objective is to perform the data quality check of the input file. If the data quality check passes, then you aggregate the price and reviews by neighborhood. If the data quality check fails, then you fail the pipeline and send a notification to the user. The pipeline is built using Step Functions and includes three primary steps:
- Data quality check – This step uses a Lambda function to verify the accuracy and reliability of the data. The Lambda function uses PyDeequ, a library for data quality checks. Because PyDeequ runs on Spark, the example employs the Spark on AWS Lambda (SoAL) framework, which makes it straightforward to run a standalone installation of Spark in Lambda. The Lambda function performs data quality checks and stores the results in an Amazon Simple Storage Service (Amazon S3) bucket.
- Data aggregation – If the data quality check passes, the pipeline moves to the data aggregation step. This step performs some calculations on the data using a Lambda function that uses Polars, a DataFrames library (see the sketch after this list). The aggregated results are stored in Amazon S3 for further processing.
- Notification – After the data quality check or data aggregation, the pipeline sends a notification to the user using Amazon Simple Notification Service (Amazon SNS). The notification includes a link to the data quality validation results or the aggregated data.
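To give a sense of what the aggregation step does, the following is a minimal Polars sketch; the local file path, output file name, and the specific aggregations are illustrative assumptions rather than the exact code used by the Lambda function.

```python
import polars as pl

# Illustrative local path; the deployed Lambda function reads the file from Amazon S3.
df = pl.read_csv("accommodations.csv")

# Aggregate the average price and total number of reviews by neighbourhood.
aggregated = (
    df.group_by("neighbourhood")
    .agg(
        pl.col("price").mean().alias("avg_price"),
        pl.col("number_of_reviews").sum().alias("total_reviews"),
    )
    .sort("neighbourhood")
)

# The pipeline stores results like these in Amazon S3 for further processing.
aggregated.write_csv("aggregated_by_neighbourhood.csv")
```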
The following diagram illustrates the solution architecture.

Implement quality checks
The following is an example of data from the sample accommodations CSV file.
| id | name | host_name | neighbourhood_group | neighbourhood | room_type | price | minimum_nights | number_of_reviews |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7071 | BrightRoom with sunny greenview! | Bright | Pankow | Helmholtzplatz | Private room | 42 | 2 | 197 |
| 28268 | Cozy Berlin Friedrichshain for1/6 p | Elena | Friedrichshain-Kreuzberg | Frankfurter Allee Sued FK | Entire home/apt | 90 | 5 | 30 |
| 42742 | Spacious 35m2 in Central House | Desiree | Friedrichshain-Kreuzberg | suedliche Luisenstadt | Private room | 36 | 1 | 25 |
| 57792 | Bungalow mit Garten in Berlin Zehlendorf | Jo | Steglitz – Zehlendorf | Ostpreußendamm | Entire home/apt | 49 | 2 | 3 |
| 81081 | Beautiful Prenzlauer Berg Apt | Bernd+Katja 🙂 | Pankow | Prenzlauer Berg Nord | Entire home/apt | 66 | 3 | 238 |
| 114763 | In the heart of Berlin! | Julia | Tempelhof – Schoeneberg | Schoeneberg-Sued | Entire home/apt | 130 | 3 | 53 |
| 153015 | Central Artist Appartement Prenzlauer Berg | Marc | Pankow | Helmholtzplatz | Private room | 52 | 3 | 127 |
In a semi-structured data format such as CSV, there is no inherent data validation or integrity checking. You need to verify the data against accuracy, completeness, consistency, uniqueness, timeliness, and validity, which are commonly referred to as the six data quality dimensions. For instance, if you want to display the name of the host for a particular property on a dashboard, but the host's name is missing in the CSV file, this would be an issue of incomplete data. Completeness checks can include looking for missing records, missing attributes, or truncated data, among other things.
As part of the GitHub repository sample application, we provide a PyDeequ script that performs the quality validation checks on the input file.
The following code is an example of performing the completeness check from the validation script:
# A Check object groups the constraints to evaluate at the given severity level.
check = Check(spark, CheckLevel.Error, "Accomodations")
checkCompleteness = VerificationSuite(spark) \
    .onData(dataset) \
    .addCheck(check.isComplete("host_name"))
The following is an example of checking for uniqueness of records:
checkUniqueness = VerificationSuite(spark) \
    .onData(dataset) \
    .addCheck(check.isUnique("id"))
You can also chain multiple validation checks as follows:
checkResult = VerificationSuite(spark) \
    .onData(dataset) \
    .addCheck(check.isComplete("name")
              .isUnique("id")
              .isComplete("host_name")
              .isComplete("neighbourhood")
              .isComplete("price")
              .isNonNegative("price")) \
    .run()
The following is an example of making sure that 99% or more of the records in the file include host_name:
checkCompleteness = VerificationSuite(spark) \
    .onData(dataset) \
    .addCheck(check.hasCompleteness("host_name", lambda x: x >= 0.99))
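Putting the pieces together, the following is a minimal, self-contained sketch of a PyDeequ validation script; the Spark session setup, the local input path, and the result handling are illustrative assumptions and may differ from the script provided in the repository. Depending on the PyDeequ version, you may also need to set the SPARK_VERSION environment variable.

```python
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Spark session with the Deequ jar on the classpath.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Illustrative local path; in the sample application the file is read from Amazon S3.
dataset = spark.read.csv("accommodations.csv", header=True, inferSchema=True)

# Define the checks and run the verification suite.
check = Check(spark, CheckLevel.Error, "Accomodations")
checkResult = (VerificationSuite(spark)
               .onData(dataset)
               .addCheck(check.isComplete("name")
                         .isUnique("id")
                         .hasCompleteness("host_name", lambda x: x >= 0.99)
                         .isComplete("neighbourhood")
                         .isNonNegative("price"))
               .run())

# Convert the verification results to a DataFrame so they can be written to Amazon S3.
resultDf = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
resultDf.show(truncate=False)
```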
Prerequisites
Before you get started, make sure you complete the following prerequisites:
- You should have an AWS account.
- Install and configure the AWS Command Line Interface (AWS CLI).
- Install the AWS SAM CLI.
- Install Docker Community Edition.
- You should have Python 3.
Run Deequ on Lambda
To deploy the sample application, complete the following steps:
- Clone the GitHub repository.
- Use the provided AWS CloudFormation template to create the Amazon Elastic Container Registry (Amazon ECR) image that will be used to run Deequ on Lambda.
- Use the AWS SAM CLI to build and deploy the rest of the data pipeline to your AWS account.
For detailed deployment steps, refer to the GitHub repository Readme.md.
When you deploy the sample application, you will notice that the DataQuality function uses a container packaging format. This is because the SoAL library required for this function is larger than the 250 MB limit for zip archive packaging. During the AWS Serverless Application Model (AWS SAM) deployment process, a Step Functions workflow is also created, along with the necessary data required to run the pipeline.
Run the workflow
After the application has been successfully deployed to your AWS account, complete the following steps to run the workflow:
- Go to the S3 bucket that was created earlier.
You will find a new bucket with your stack name as the prefix.
- Follow the instructions in the GitHub repository to upload the Spark script to this S3 bucket. This script is used to perform data quality checks.
- Subscribe to the SNS topic that was created to receive success or failure email notifications, as explained in the GitHub repository.
- Open the Step Functions console and run the workflow prefixed DataQualityUsingLambdaStateMachine with default inputs, or start it programmatically as sketched after this list.
- You can test both success and failure scenarios as explained in the instructions in the GitHub repository.
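If you prefer to start the workflow from code rather than the console, the following is a minimal boto3 sketch; looking up the state machine by its name prefix and passing an empty input are assumptions for illustration, so supply the default input documented in the repository if the workflow expects one.

```python
import boto3

sfn = boto3.client("stepfunctions")

# Find the state machine created by the SAM deployment using its name prefix.
state_machines = sfn.list_state_machines()["stateMachines"]
arn = next(
    sm["stateMachineArn"]
    for sm in state_machines
    if sm["name"].startswith("DataQualityUsingLambdaStateMachine")
)

# Start an execution; an empty JSON object stands in for the default input.
execution = sfn.start_execution(stateMachineArn=arn, input="{}")
print(execution["executionArn"])
```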
The following figure illustrates the workflow of the Step Functions state machine.

Review the quality check results and metrics
To review the quality check results, you can navigate to the same S3 bucket. Navigate to the OUTPUT/verification-results folder to see the quality check verification results. Open the file whose name starts with the prefix part. The following table is a snapshot of the file.
| check | check_level | check_status | constraint | constraint_status |
| --- | --- | --- | --- | --- |
| Accomodations | Error | Success | SizeConstraint(Size(None)) | Success |
| Accomodations | Error | Success | CompletenessConstraint(Completeness(name,None)) | Success |
| Accomodations | Error | Success | UniquenessConstraint(Uniqueness(List(id),None)) | Success |
| Accomodations | Error | Success | CompletenessConstraint(Completeness(host_name,None)) | Success |
| Accomodations | Error | Success | CompletenessConstraint(Completeness(neighbourhood,None)) | Success |
| Accomodations | Error | Success | CompletenessConstraint(Completeness(price,None)) | Success |
The check_status column indicates whether the quality check was a success or a failure. The constraint column indicates the different quality checks that were performed by the Deequ engine. The constraint_status column indicates the success or failure for each constraint.
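If you prefer to inspect these results programmatically instead of downloading the part file from the console, the following is a minimal sketch; the bucket name is a placeholder, and the use of boto3 with Polars here is an illustrative assumption rather than part of the sample application.

```python
import boto3
import polars as pl

# Placeholder bucket name; substitute the bucket created by the SAM deployment.
bucket = "your-stack-bucket"
prefix = "OUTPUT/verification-results/"

# Locate and download the Spark part file that holds the verification results.
s3 = boto3.client("s3")
objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
part_key = next(o["Key"] for o in objects if o["Key"].split("/")[-1].startswith("part-"))
s3.download_file(bucket, part_key, "/tmp/verification-results.csv")

# Load the results and list any constraints that did not pass.
results = pl.read_csv("/tmp/verification-results.csv")
failed = results.filter(pl.col("constraint_status") != "Success")
print(failed)
```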
You can also review the quality check metrics generated by Deequ by navigating to the OUTPUT/verification-results-metrics folder. Open the file whose name starts with the prefix part. The following table is a snapshot of the file.
| entity | instance | name | value |
| --- | --- | --- | --- |
| Column | price is non-negative | Compliance | 1 |
| Column | neighbourhood | Completeness | 1 |
| Column | price | Completeness | 1 |
| Column | id | Uniqueness | 1 |
| Column | host_name | Completeness | 0.998831356 |
| Column | name | Completeness | 0.997348076 |
For the columns with a value of 1, all the records of the input file satisfy the specific constraint. For the columns with a value of 0.99, 99% of the records satisfy the specific constraint.
Considerations for running PyDeequ in Lambda
Consider the following when deploying this solution:
- Running SoAL on Lambda is a single-node deployment, but it is not restricted to a single core; a node can have multiple cores in Lambda, which allows for distributed data processing. Adding more memory in Lambda proportionally increases the amount of CPU, increasing the overall computational power available. Multiple CPUs with a single-node deployment and the fast startup time of Lambda result in faster job processing for Spark jobs. Additionally, the consolidation of cores within a single node enables faster shuffle operations, enhanced communication between cores, and improved I/O performance.
- For Spark jobs that run longer than 15 minutes, larger files (more than 1 GB), or complex joins that require more memory and compute resources, we recommend AWS Glue Data Quality. SoAL can also be deployed in Amazon ECS.
- Choosing the right memory setting for Lambda functions can help balance speed and cost. You can automate the process of selecting different memory allocations and measuring the time taken using Lambda power tuning.
- Workloads using multi-threading and multi-processing can benefit from Lambda functions powered by an AWS Graviton processor, which offers better price-performance. You can use Lambda power tuning to run with both x86 and Arm architectures and compare results to choose the optimal architecture for your workload.
Clean up
Complete the following steps to clean up the solution resources:
- On the Amazon S3 console, empty the contents of your S3 bucket.
Because this S3 bucket was created as part of the AWS SAM deployment, the next step will delete the S3 bucket.
- To delete the sample application that you created, use the AWS SAM CLI. Assuming you used your project name for the stack name, you can run the following command:
sam delete --stack-name ""
- To delete the ECR image you created using CloudFormation, delete the stack from the AWS CloudFormation console.
For detailed instructions, refer to the GitHub repository Readme.md file.
Conclusion
Data is crucial for modern enterprises, influencing decision-making, demand forecasting, delivery scheduling, and overall business processes. Poor quality data can negatively affect business decisions and the efficiency of the organization.
In this post, we demonstrated how to implement data quality checks and incorporate them in the data pipeline. In the process, we discussed how to use the PyDeequ library, how to deploy it in Lambda, and considerations when running it in Lambda.
You can refer to the Data quality prescriptive guidance to learn about best practices for implementing data quality checks. Refer to the Spark on AWS Lambda blog post to learn about running analytics workloads using AWS Lambda.
About the Authors
Vivek Mittal is a Solution Architect at Amazon Web Services. He is passionate about serverless and machine learning technologies. Vivek takes great pleasure in assisting customers with building innovative solutions on the AWS Cloud.
John Cherian is a Senior Solutions Architect at Amazon Web Services who helps customers with strategy and architecture for building solutions on AWS.
Uma Ramadoss is a Principal Solutions Architect at Amazon Web Services, focused on serverless and integration services. She is responsible for helping customers design and operate event-driven cloud-native applications using services like Lambda, API Gateway, EventBridge, Step Functions, and SQS. Uma has hands-on experience leading enterprise-scale serverless delivery projects and possesses strong working knowledge of event-driven, microservices, and cloud architecture.