Implement data parity at scale for data modernization programs using AWS Glue Data Quality


Today, customers are embarking on data modernization programs by migrating on-premises data warehouses and data lakes to the AWS Cloud to take advantage of the scale and advanced analytical capabilities of the cloud. Customers are migrating their on-premises data warehouse solutions built on databases like Netezza, PostgreSQL, Greenplum, and Teradata to AWS-based modern data platforms using services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift. AWS-based modern data platforms help you break down data silos and enable analytics and machine learning (ML) use cases at scale.

During migration, you might want to establish data parity checks between on-premises databases and AWS data platform services. Data parity is a process to validate that data was migrated successfully from source to target without any errors or failures. A successful data parity check means that data in the target platform has the equivalent content, values, and completeness as the source platform.

Data parity can help build confidence and trust with business users in the quality of migrated data. Additionally, it can help you identify errors in the new cloud-based extract, transform, and load (ETL) process.

Some customers build custom in-house data parity frameworks to validate data during migration. Others use open source data quality products for data parity use cases. These options involve a lot of custom code, configuration, and installation, and have scalability challenges. This diverts valuable person-hours from the actual migration effort into building and maintaining a data parity framework.

In this post, we show you how to use AWS Glue Data Quality, a feature of AWS Glue, to establish data parity during data modernization and migration programs with minimal configuration and infrastructure setup. AWS Glue Data Quality enables you to automatically measure and monitor the quality of your data in data repositories and AWS Glue ETL pipelines.

Overview of solution

In large data modernization projects that migrate from an on-premises database to an Amazon S3 based data lake, it's common to have the following requirements for data parity:

  • Compare one-time historical data from the source on-premises database to the target S3 data lake.
  • Compare ongoing data that is replicated from the source on-premises database to the target S3 data lake.
  • Compare the output of the new cloud-based ETL process with the existing on-premises ETL process. You can plan a period of parallel runs, where the legacy and new systems run in parallel and the data is compared daily.
  • Use functional queries to compare high-level aggregated business metrics between the source on-premises database and the target data lake.

In this post, we use the example of a PostgreSQL migration from an on-premises database to an S3 data lake using AWS Glue Data Quality.

The following diagram illustrates this use case's historical data migration architecture.

The architecture shows a typical pattern for migrating on-premises databases (like PostgreSQL) to an Amazon S3 based data lake. The workflow includes the following steps:

  1. Schemas and tables are stored in an on-premises database (PostgreSQL), and you want to migrate to Amazon S3 for storage and AWS Glue for compute.
  2. Use AWS Database Migration Service (AWS DMS) to migrate historical data from the source database to an S3 staging bucket.
  3. Use AWS Glue ETL to curate data from the S3 staging bucket to an S3 curated bucket. In the curated bucket, AWS Glue tables are created using AWS Glue crawlers or an AWS Glue ETL job.
  4. Use an AWS Glue connection to connect AWS Glue with the on-premises PostgreSQL database.
  5. Use AWS Glue Data Quality to compare historical data from the source database to the target S3 bucket and write results to a separate S3 bucket.

The following diagram illustrates the incremental data migration architecture.

After historical data is migrated and validated, the workflow proceeds to the following steps:

  1. Ingest incremental data from the source systems to the S3 staging bucket. This is done using an ETL ingestion tool like AWS Glue.
  2. Curate incremental data from the S3 staging bucket to the S3 curated bucket using AWS Glue ETL.
  3. Compare the incremental data using AWS Glue Data Quality.

In the following sections, we demonstrate how to use AWS Glue Data Quality to establish data parity between the source (PostgreSQL) and target (Amazon S3). We cover the following scenarios:

  • Establish data parity for historical data migration – Historical data migration is defined as a one-time bulk migration of historical data from legacy on-premises databases to the AWS Cloud. The data parity process validates that the migrated historical data is correct.
  • Establish data parity for incremental data – After the historical data migration, incremental data is loaded to Amazon S3 using the new cloud-based ETL process. The incremental data is compared between the legacy on-premises database and the AWS Cloud.
  • Establish data parity using functional queries – We perform business- and functional-level checks using SQL queries on the migrated data.

Prerequisites

It’s essential arrange the next prerequisite sources:

Establish data parity for historical data migration

For historical data migration and parity, we assume that setting up a PostgreSQL database, migrating data to Amazon S3 using AWS DMS, and curating the data have been completed as prerequisites to performing data parity with AWS Glue Data Quality. For this use case, we use an on-premises PostgreSQL database with historical data loaded on Amazon S3 and AWS Glue. Our goal is to compare historical data between the on-premises database and the AWS Cloud.

We use the following tables in the on-premises PostgreSQL database. These have been migrated to the AWS Cloud using AWS DMS. As part of data curation, the following three additional columns were added to the test_schema.sample_data table in the curated layer: id, etl_create_ts, and etl_user_id.

  1. Create sample_data with the following code:
create table test_schema.sample_data
(
    job              text,
    company          text,
    ssn              text,
    residence        text,
    current_location text,
    website          text,
    username         text,
    name             text,
    gender_id        integer,
    blood_group_id   integer,
    address          text,
    mail             text,
    birthdate        date,
    insert_ts        timestamp with time zone,
    update_ts        timestamp with time zone
);

  2. Create gender with the following code (contains gender details for lookup):
create table test_schema.gender
(
    gender_id integer,
    value     varchar(1)
);

  3. Create blood_group with the following code (contains blood group information for lookup):
create table test_schema.blood_group
(
    blood_group_id integer,
    value          varchar(10)
);

We’re assuming that the previous tables have been migrated to the S3 staging bucket utilizing AWS DMS and curated utilizing AWS Glue. For detailed directions on methods to arrange AWS DMS to copy knowledge, discuss with the appendix on the finish of this publish.

In the following sections, we show how to configure an AWS Glue Data Quality job for the comparison.

Create an AWS Glue connection

AWS Glue Data Quality uses an AWS Glue connection to connect to the source PostgreSQL database. Complete the following steps to create the connection:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Connections.
  2. Choose Create connection.
  3. Set Connector type to JDBC.
  4. Add connection details like the connection URL, credentials, and networking details.

Refer to AWS Glue connection properties for additional details.

AWS Glue 4.0 uses PostgreSQL JDBC driver 42.3.6. If your PostgreSQL database requires a different JDBC driver version, download the JDBC driver corresponding to your PostgreSQL version.
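If you prefer to create the connection programmatically instead of through the console steps above, the following is a minimal sketch using the AWS SDK for Python (Boto3). The connection name, JDBC URL, credentials, and networking values are illustrative placeholders, not values from this post's environment.

import boto3

# Minimal sketch: create the JDBC connection for the on-premises PostgreSQL database.
# Every identifier below is an illustrative placeholder.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_connection(
    ConnectionInput={
        "Name": "postgres-onprem-connection",          # hypothetical connection name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://onprem-host:5432/testdb",
            "USERNAME": "glue_user",
            "PASSWORD": "replace-with-secret",         # prefer AWS Secrets Manager in practice
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",    # subnet with connectivity to the database
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)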

Create an AWS Glue data parity job for historical data comparison

In the preceding steps, you used AWS DMS to pull historical data from PostgreSQL to the S3 staging bucket. You then used an AWS Glue notebook to curate data from the staging bucket to the curated bucket and created AWS Glue tables. In this step, you use AWS Glue Data Quality to compare data between PostgreSQL and Amazon S3 to confirm the data is valid. Complete the following steps to create an AWS Glue job using the AWS Glue visual editor that compares data between PostgreSQL and Amazon S3:

  1. Set the source as the PostgreSQL table sample_data.
  2. Set the target as the AWS Glue table sample_data.
  3. In the curated layer, we added a few additional columns: id, etl_create_ts, and etl_user_id. Because these columns are newly created, we use a transformation to drop them for the comparison.
  4. Additionally, the birthdate column is a timestamp in AWS Glue, so we change it to date format prior to comparison.
  5. Choose Evaluate Data Quality under Transforms.
  6. Specify the AWS Glue Data Quality rule as DatasetMatch, which checks whether the data in the primary dataset matches the data in a reference dataset.
  7. Provide the unique key (primary key) information for source and target. In this example, the primary key is a combination of the columns job and username.
  8. For Data quality transform output, specify the data to output:
    1. Original data – This output includes all rows and columns of the original data. In addition, you can select Add new columns to indicate data quality errors. This option adds metadata columns to each row that can be used to identify valid and invalid rows and the rules that failed validation. You can further customize the row-level output to select only valid rows or convert the table format based on the use case.
    2. Data quality results – This is a summary output grouped by rule. For our data parity example, this output will have one row with a summary of the match percentage.
  9. Configure the Amazon S3 targets for ruleOutcomes and rowLevelOutcomes to write the AWS Glue Data Quality output to the Amazon S3 location in Parquet format.
  10. Save and run the AWS Glue job. A simplified sketch of the script this visual job generates follows these steps.
  11. When the AWS Glue job is complete, you can run an AWS Glue crawler to automatically create the rulesummary and row_level_output tables and view the output in Amazon Athena.
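The visual job above produces a PySpark script behind the scenes. The following is a simplified, hedged sketch of roughly what that script looks like for this comparison. The Data Catalog database, connection name, and bucket paths are assumptions for illustration, and the parameter names mirror the pattern of the script AWS Glue Studio generates for the Evaluate Data Quality transform; verify the details against your own generated code.

import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.transforms import SelectFromCollection
from awsglue.utils import getResolvedOptions
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Reference dataset: the on-premises PostgreSQL table, read through the AWS Glue connection.
source_pg = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "postgres-onprem-connection",  # hypothetical connection name
        "dbtable": "test_schema.sample_data",
    },
    transformation_ctx="source_pg",
)

# Primary dataset: the curated Data Catalog table, with the curation-only columns
# dropped and birthdate cast back to a date for a like-for-like comparison.
target_curated = glueContext.create_dynamic_frame.from_catalog(
    database="gluedqblog",          # hypothetical Data Catalog database
    table_name="sample_data",
    transformation_ctx="target_curated",
).drop_fields(["id", "etl_create_ts", "etl_user_id"])
target_df = target_curated.toDF().withColumn("birthdate", F.col("birthdate").cast("date"))
target_dyf = DynamicFrame.fromDF(target_df, glueContext, "target_dyf")

# DQDL: every row must match the reference dataset on the composite key job + username.
ruleset = 'Rules = [ DatasetMatch "reference" "job,username" = 1.0 ]'

dq_results = EvaluateDataQuality().process_rows(
    frame=target_dyf,
    additional_data_sources={"reference": source_pg},
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "HistoricalParity",
        "enableDataQualityResultsPublishing": True,
    },
)

# Split the result collection into the two outputs described above and land both in S3 as Parquet.
rule_outcomes = SelectFromCollection.apply(dfc=dq_results, key="ruleOutcomes")
row_level_outcomes = SelectFromCollection.apply(dfc=dq_results, key="rowLevelOutcomes")

glueContext.write_dynamic_frame.from_options(
    frame=rule_outcomes,
    connection_type="s3",
    connection_options={"path": "s3://gluedqblog/rulesummary/"},        # hypothetical bucket
    format="parquet",
)
glueContext.write_dynamic_frame.from_options(
    frame=row_level_outcomes,
    connection_type="s3",
    connection_options={"path": "s3://gluedqblog/row_level_output/"},   # hypothetical bucket
    format="parquet",
)

job.commit()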

The following is an example of the rule-level output. The screenshot shows the DatasetMatch value as 1.0, which means all rows between the source PostgreSQL database and the target data lake matched.

The following is an example of the row-level output. The screenshot shows all source rows along with additional columns that confirm whether a row passed or failed validation.

Let’s replace a couple of data in PostgreSQL to simulate an information subject in the course of the knowledge migration course of:

update test_schema.sample_data set residence = null where blood_group_id = 8;

Records updated: 1,272

You can rerun the AWS Glue job and observe the output in Athena. In the following screenshot, the new match percentage is 87.27%. With this example, you were able to successfully capture the simulated data issue with AWS Glue Data Quality.

If you run the following query, the output will match the record count in the preceding screenshot:

SELECT count(*) FROM "gluedqblog"."rowleveloutput" WHERE dataqualityevaluationresult = 'Failed'

Establish data parity for incremental data

After the initial historical migration, the next step is to implement a process to validate incremental data between the legacy on-premises database and the AWS Cloud. For incremental data validation, the data output from the existing ETL process and the new cloud-based ETL process is compared daily. You can add a filter to the preceding AWS Glue data parity job to select only the data that changed on a given day, using a timestamp column, as shown in the sketch that follows.
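As a rough illustration, such a filter could be applied to both frames before the Evaluate Data Quality transform runs. The helper below is a hypothetical sketch that assumes update_ts (from the sample_data table) is the change-tracking column and that the comparison date arrives as a job parameter.

from pyspark.sql import DataFrame, functions as F


def filter_to_run_date(df: DataFrame, run_date: str, ts_column: str = "update_ts") -> DataFrame:
    """Keep only rows whose change-tracking timestamp falls on the comparison date.

    Apply the same filter to both the source (PostgreSQL) and target (S3 curated)
    DataFrames before the DatasetMatch evaluation so only the day's increment is compared.
    """
    return df.filter(F.to_date(F.col(ts_column)) == F.to_date(F.lit(run_date)))


# Example usage inside the parity job (frames as in the historical job sketch above):
# source_daily = filter_to_run_date(source_pg.toDF(), run_date="2024-10-16")
# target_daily = filter_to_run_date(target_dyf.toDF(), run_date="2024-10-16")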

Establish data parity using functional queries

Functional queries are SQL statements that business analysts can run in the legacy system (for this post, an on-premises database) and in the new AWS Cloud based data lake to compare data metrics and output. To make sure the consumer applications work correctly with the migrated data, it's critical to validate the data functionally. The previous examples are primarily technical validation to make sure there is no data loss in the target data lake after data ingestion, in both the historical migration and the change data capture (CDC) context. In a typical data warehouse migration use case, the historical migration pipeline often pulls data from a data warehouse, and the incremental or CDC pipeline integrates the actual source systems that feed the data warehouse.

Functional data parity is the third step in the overall data validation framework, where you have the flexibility to continue similar business metrics validation driven by an aggregated SQL query. You can construct your own business metrics validation query, ideally working with subject matter experts (SMEs) on the business side. We have seen that agility and accuracy matter for a successful data warehouse migration, so reusing time-tested and business SME-approved aggregated SQL queries from the legacy data warehouse system with minimal changes can fast-track the implementation as well as maintain business confidence. In this section, we demonstrate how to implement a sample functional parity check for a given dataset.

In this example, we use a set of source PostgreSQL tables and target S3 data lake tables for comparison. We use an AWS Glue crawler to create Data Catalog tables for the source tables, as described in the first example.

The sample functional validation compares the distribution count of gender and blood group for each company. This could be any functional query that joins fact and dimension tables and performs aggregations.

You can use a SQL transformation to generate an aggregated dataset for both the source and target queries. In this example, the source query uses multiple tables. Apply SQL functions on the columns and the required filter criteria; an illustrative aggregate is sketched below.
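The exact SQL from the screenshots isn't reproduced here; the following is an illustrative query string for the SQL transform that computes the gender and blood group distribution per company from the sample schema. The view names are assumptions about how the transform's input nodes are aliased.

# Illustrative aggregate for the Source/Target Functional Query SQL transforms:
# distribution count of gender and blood group per company, using the sample schema.
functional_query = """
    SELECT s.company,
           g.value  AS gender,
           b.value  AS blood_group,
           COUNT(*) AS record_count
    FROM sample_data s
    JOIN gender g      ON s.gender_id = g.gender_id
    JOIN blood_group b ON s.blood_group_id = b.blood_group_id
    GROUP BY s.company, g.value, b.value
"""
# Within the SQL transform this runs through Spark SQL, roughly:
# aggregated_df = spark.sql(functional_query)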

The following screenshot illustrates the Source Functional Query transform.

The following screenshot illustrates the Target Functional Query transform.

The following screenshot illustrates the Evaluate Data Quality transform. You can apply the DatasetMatch rule to require a 100% match.

After the job runs, you can find the job run status on the AWS Glue console.

The Data quality tab displays the data quality results.

AWS Glue Data Quality provides row- and rule-level outputs, as described in the earlier examples.

Check the rule-level output in the Athena table. The outcome of the DatasetMatch rule shows a 100% match between the PostgreSQL source dataset and the target data lake.

Check the row-level output in the Athena table. The following screenshot displays the row-level output with data quality evaluation results and rule status.

Let’s change the corporate worth for Spencer LLC to Spencer LLC – New to simulate the affect on the info high quality rule and general outcomes. This creates a spot within the rely of data for the given firm title whereas evaluating supply and goal.

By rerunning the job and checking the AWS Glue Data Quality results, you will see that the data quality rule has failed. This is due to the difference in company name between the source and target datasets, because the data quality rule evaluation requires a 100% match. You can reduce the match percentage in the data quality expression based on the required threshold.
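For example, a hypothetical relaxed ruleset that tolerates up to a 10% mismatch instead of requiring an exact match might look like the following; the key columns here are the aggregate's grouping columns and are an assumption for this sketch.

# Relaxed DQDL rule: pass when at least 90% of rows in the primary dataset
# match the reference dataset on the composite key.
relaxed_ruleset = 'Rules = [ DatasetMatch "reference" "company,gender,blood_group" >= 0.9 ]'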

Next, revert the changes made for the data quality rule failure simulation.


If you rerun the job and validate the AWS Glue Data Quality results, you will find the data quality score is back to 100%.


Clean up

If you no longer want to keep the resources you created as part of this post in your AWS account, complete the following steps:

  1. Delete the AWS Glue notebook and visual ETL jobs.
  2. Remove all data and delete the staging and curated S3 buckets.
  3. Delete the AWS Glue connection to the PostgreSQL database.
  4. Delete the AWS DMS replication task and instance.
  5. Delete the Data Catalog databases and tables.

Conclusion

In this post, we discussed how you can use AWS Glue Data Quality to build a scalable data parity pipeline for data modernization programs. AWS Glue Data Quality enables you to maintain the quality of your data by automating many of the manual tasks involved in data quality monitoring and management. This helps prevent bad data from entering your data lakes and data warehouses. The examples in this post provided an overview of how to set up historical, incremental, and functional data parity jobs using AWS Glue Data Quality.

To learn more about AWS Glue Data Quality, refer to Evaluating data quality with AWS Glue Studio and AWS Glue Data Quality. To dive into the AWS Glue Data Quality APIs, see Data Quality API.

Appendix

In this section, we demonstrate how to set up AWS DMS and replicate data. You can use AWS DMS to copy one-time historical data from the PostgreSQL database to the S3 staging bucket. Complete the following steps:

  1. On the AWS DMS console, under Migrate data in the navigation pane, choose Replication instances.
  2. Choose Create replication instance.
  3. Choose a VPC that has connectivity to the PostgreSQL instance.

After the instance is created, it should appear with the status Available on the AWS DMS console.

  4. Based on our solution architecture, you now create an S3 staging bucket for AWS DMS to write the replicated output to. For this post, the staging bucket name is gluedq-blog-dms-staging-bucket.
  5. Under Migrate data in the navigation pane, choose Endpoints.
  6. Create a source endpoint for the PostgreSQL connection.

  7. After you create the source endpoint, choose Test endpoint to make sure it connects successfully to the PostgreSQL instance.
  8. Similarly, create a target endpoint with the S3 staging bucket as the target and test the target endpoint.

  1. We’ll be writing replicated output from PostgreSQL in CSV format. the addColumnName=true; property within the AWS DMS configuration to ensure the schema data is written as headers in CSV output.

Now you’re able to create the migration process.

  10. Under Migrate data in the navigation pane, choose Database migration tasks.
  11. Create a new database migration task.
  12. Specify the source and target endpoints you created and choose the tables that need to be replicated.

After the migration task is created, it starts replicating data automatically.

When the status shows as Load complete, data should appear in the following S3 locations (the bucket name in this example is a placeholder):

  • s3:///staging_layer/test_schema/sample_data/
  • s3:///staging_layer/test_schema/gender/
  • s3:///staging_layer/test_schema/blood_group/

About the Authors

Himanshu Sahni is a Senior Data Architect in AWS Professional Services. Himanshu specializes in building data and analytics solutions for enterprise customers using AWS tools and services. He is an expert in AI/ML and big data tools like Spark, AWS Glue, and Amazon EMR. Outside of work, Himanshu likes playing chess and tennis.

Arunabha Datta is a Senior Data Architect at AWS Professional Services. He collaborates with customers and partners to design and implement modern data architectures using AWS analytics services. Arunabha's passion lies in assisting customers with digital transformation, particularly in the areas of data lakes, databases, and AI/ML technologies. Besides work, his hobbies include photography, and he likes to spend quality time with his family.

Charishma Ravoori is an Associate Data & ML Engineer at AWS Professional Services. She focuses on developing solutions for customers that include building data pipelines, developing predictive models, and generating AI chatbots using AWS/Amazon tools. Outside of work, Charishma likes to experiment with new recipes and play the guitar.
