Friday, October 18, 2024

Get started with Amazon DynamoDB zero-ETL integration with Amazon Redshift


We’re excited to announce the general availability (GA) of Amazon DynamoDB zero-ETL integration with Amazon Redshift, which allows you to run high-performance analytics on your DynamoDB data in Amazon Redshift with little to no impact on production workloads running on DynamoDB. As data is written into a DynamoDB table, it’s seamlessly made available in Amazon Redshift, eliminating the need to build and maintain complex data pipelines.

Zero-ETL integrations facilitate point-to-point data movement without the need to create and manage data pipelines. You can create a zero-ETL integration on an Amazon Redshift Serverless workgroup or an Amazon Redshift provisioned cluster using RA3 instance types. You can then run enhanced analysis on this DynamoDB data with the rich capabilities of Amazon Redshift, such as high-performance SQL, built-in machine learning (ML) and Spark integrations, materialized views (MV) with automatic and incremental refresh, data sharing, and the ability to join data across multiple data stores and data lakes.

The DynamoDB zero-ETL integration with Amazon Redshift has helped our customers simplify their extract, transform, and load (ETL) pipelines. The following is a testimonial from Keith McDuffee, Director of DevOps at Verisk Analytics, a customer who used zero-ETL integration with DynamoDB in place of their homegrown solution and benefited from the seamless replication that it provided:

“We have dashboards built on top of our transactional data in Amazon Redshift. Previously, we used our homegrown solution to move data from DynamoDB to Amazon Redshift, but these jobs would often time out and lead to a lot of operational burden and missed insights on Amazon Redshift. Using the DynamoDB zero-ETL integration with Amazon Redshift, we no longer run into such issues, and the integration seamlessly and continuously replicates data to Amazon Redshift.”

In this post, we showcase how an ecommerce application can use this zero-ETL integration to analyze the distribution of customers by attributes such as location and customer signup date. You can also use the integration for retention and churn analysis by calculating retention rates through comparing the number of active profiles over different time periods.

Solution overview

The zero-ETL integration provides an end-to-end fully managed process that allows data to be seamlessly moved from DynamoDB tables to Amazon Redshift without the need for manual ETL processes, ensuring efficient and incremental updates in the Amazon Redshift environment. It uses DynamoDB exports to incrementally replicate data changes from DynamoDB to Amazon Redshift every 15–30 minutes. The initial data load is a full load, which may take longer depending on the data volume. This integration also allows replicating data from multiple DynamoDB tables into a single Amazon Redshift provisioned cluster or serverless workgroup, providing a holistic view of data across various applications.

This replication is done with little to no performance or availability impact on your DynamoDB tables and without consuming DynamoDB read capacity units (RCUs). Your applications will continue to use DynamoDB while data from these tables is seamlessly replicated to Amazon Redshift for analytics workloads such as reporting and dashboards.

The following diagram illustrates this architecture.

In the following sections, we show how to get started with DynamoDB zero-ETL integration with Amazon Redshift. This general availability launch supports creating and managing zero-ETL integrations using the AWS Command Line Interface (AWS CLI), AWS SDKs, API, and AWS Management Console. In this post, we demonstrate using the console.

Prerequisites

Complete the following prerequisite steps:

  1. Enable point-in-time recovery (PITR) on the DynamoDB table.
  2. Enable case sensitivity for the target Redshift data warehouse.
  3. Attach the resource-based policies to both DynamoDB and Amazon Redshift as described in the documentation.
  4. Make sure the AWS Identity and Access Management (IAM) user or role creating the integration has an identity-based policy that authorizes the required actions.
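The first two prerequisites can also be completed from the AWS CLI. The following is a minimal sketch, assuming a source table named customerprofiles and a Redshift Serverless workgroup named my-workgroup (substitute your own resource names):

```shell
# Enable point-in-time recovery (PITR) on the source DynamoDB table
aws dynamodb update-continuous-backups \
    --table-name customerprofiles \
    --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

# Enable case sensitivity on the target Redshift Serverless workgroup
aws redshift-serverless update-workgroup \
    --workgroup-name my-workgroup \
    --config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=true
```

For a provisioned cluster, set enable_case_sensitive_identifier in the cluster’s parameter group instead.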

Create the DynamoDB zero-ETL integration

You can create the integration on either the DynamoDB console or the Amazon Redshift console. The following steps use the Amazon Redshift console.

  1. On the Amazon Redshift console, choose Zero-ETL integrations in the navigation pane.
  2. Choose Create DynamoDB integration.

If you choose to create the integration on the DynamoDB console, choose Integrations in the navigation pane, then choose Create integration and Amazon Redshift.

  1. For Integration name, enter a name (for example, ddb-rs-customerprofiles-zetl-integration).
  2. Choose Next.
  1. Choose Browse DynamoDB tables and select the table that will be the source for this integration.
  2. Choose Next.

You can only choose one table. If you need data from multiple tables in a single Redshift cluster, you need to create a separate integration for each table.

If you don’t have PITR enabled on the source DynamoDB table, an error will pop up when you choose the source. In this case, you can select Fix it for me for DynamoDB to enable PITR on your source table. Review the changes and choose Continue.

  1. Choose Next.
  1. Choose your target Redshift data warehouse. If it’s in the same account, you can browse and choose the target. If the target resides in a different account, you can provide the Amazon Resource Name (ARN) of the target Redshift cluster.

If you get an error about the resource policy, select Fix it for me for Amazon Redshift to fix the policies as part of this creation process. Alternatively, you can add resource policies for Amazon Redshift manually prior to creating the zero-ETL integration. Review the changes and choose Reboot and continue.

  1. Choose Next and complete your integration.

The zero-ETL integration creation should show the status Creating. Wait for the status to change to Active.
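If you prefer the AWS CLI over the console, the integration can be created and checked with commands along these lines. This is a sketch with placeholder ARNs and a hypothetical account ID; verify the exact parameters against the current CLI reference for your region:

```shell
# Create the zero-ETL integration (source: DynamoDB table, target: Redshift)
aws redshift create-integration \
    --integration-name ddb-rs-customerprofiles-zetl-integration \
    --source-arn arn:aws:dynamodb:us-east-1:111122223333:table/customerprofiles \
    --target-arn arn:aws:redshift-serverless:us-east-1:111122223333:namespace/<namespace-id>

# Check the integration status (wait for Creating to become Active)
aws redshift describe-integrations
```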

Create a Redshift database from the integration

Complete the following steps to create a Redshift database:

  1. On the Amazon Redshift console, navigate to the recently created zero-ETL integration.
  2. Choose Create database from integration.

  1. For Destination database name, enter a name (for example, ddb_rs_customerprofiles_zetl_db).
  2. Choose Create database.

After you create the database, the database state should change from Creating to Active. This will start the replication of data in the source DynamoDB tables to the target Redshift tables, which will be created under the public schema of the destination database (ddb_rs_customerprofiles_zetl_db).
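The same step can also be performed with SQL in the query editor. The following is a sketch; the integration ID placeholder must be replaced with the ID shown on the integration’s detail page:

```sql
-- Create the destination database from the zero-ETL integration
CREATE DATABASE ddb_rs_customerprofiles_zetl_db
FROM INTEGRATION '<integration-id>';
```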

Now you can query your data in Amazon Redshift using the integration with DynamoDB.

Understanding your data

Data exported from DynamoDB to Amazon Redshift is stored in the Redshift database that you created from your zero-ETL integration (ddb_rs_customerprofiles_zetl_db). A single table with the same name as the DynamoDB source table is created under the default (public) Redshift schema. DynamoDB only enforces schemas for the primary key attributes (partition key and, optionally, sort key). Because of this, your DynamoDB table structure is replicated to Amazon Redshift in three columns: partition key, sort key, and a SUPER data type column named value that contains all the attributes. The data in this value column is in DynamoDB JSON format. For information about the data format, see DynamoDB table export output format.

The DynamoDB partition key is used as the Redshift table distribution key, and the combination of the DynamoDB partition and sort keys is used as the Redshift table sort keys. Amazon Redshift also allows changing the sort keys on the zero-ETL replicated tables using the ALTER SORT KEY command.
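For example, to set the sort key of the replicated table to just the partition key column, a statement along these lines can be used (assuming the table and column names from this post; verify the syntax against the ALTER TABLE documentation):

```sql
-- Replace the default sort key on the zero-ETL replicated table
ALTER TABLE "ddb_rs_customerprofiles_zetl_db".public."customerprofiles"
ALTER SORTKEY (email);
```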

The DynamoDB data in Amazon Redshift is read-only. After the data is available in the Amazon Redshift table, you can query the value column as a SUPER data type using PartiQL SQL, or create and query materialized views on the table, which are automatically and incrementally refreshed.

For more information about the SUPER data type, see Semistructured data in Amazon Redshift.

Query the data

To validate the ingested data, you can use the Amazon Redshift Query Editor to query the target table in Amazon Redshift using PartiQL SQL. For example, you can use the following query to select email and unnest the data in the value column to retrieve the customer name and address:

select email,
       value.custname."S"::text custname,
       value.address."S"::text custaddress,
       value
from "ddb_rs_customerprofiles_zetl_db".public."customerprofiles"

To demonstrate the replication of incremental changes in action, we make the following updates to the source DynamoDB table:

  1. Add two new items to the DynamoDB table:
    ## Incremental changes
    ## add 2 items
    
    aws dynamodb put-item --table-name customerprofiles --item '{ "email": { "S": "sarah.wilson@example.com" }, "custname": { "S": "Sarah Wilson" }, "username": { "S": "swilson789" }, "phone": { "S": "555-012-3456" }, "address": { "S": "789 Oak St, Chicago, IL 60601" }, "custcreatedt": { "S": "2023-04-01T09:00:00Z" }, "custupddt": { "S": "2023-04-01T09:00:00Z" }, "status": { "S": "active" } }'
    
    aws dynamodb put-item --table-name customerprofiles --item '{ "email": { "S": "michael.taylor@example.com" }, "custname": { "S": "Michael Taylor" }, "username": { "S": "mtaylor123" }, "phone": { "S": "555-246-8024" }, "address": { "S": "246 Maple Ave, Los Angeles, CA 90001" }, "custcreatedt": { "S": "2022-11-01T08:00:00Z" }, "custupddt": { "S": "2022-11-01T08:00:00Z" }, "status": { "S": "active" } }'

  2. Update the address for one of the items in the DynamoDB table:
    ## update an item
    aws dynamodb update-item --table-name customerprofiles --key '{"email": {"S": "sarahjones@example.com"}}' --update-expression "SET address = :a" --expression-attribute-values '{":a":{"S":"124 Main St, Somewhereville USA"}}'

  3. Delete the item where email is michaelwilson@example.com:
    ## delete an item
    
    aws dynamodb delete-item --table-name customerprofiles --key '{"email": {"S": "michaelwilson@example.com"}}'

With these changes, the DynamoDB table customerprofiles has four items (three existing, two new, and one deleted), as shown in the following screenshot.

Next, you can go to the query editor to validate these changes. At this point, you can expect the incremental changes to be reflected in the Redshift table (four records in the table).
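A simple count like the following can serve as a spot-check once the next incremental sync (every 15–30 minutes) has been applied, assuming the database and table names used earlier in this post:

```sql
-- Expect 4 records after the incremental changes are replicated
select count(*) as record_count
from "ddb_rs_customerprofiles_zetl_db".public."customerprofiles";
```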

Create materialized views on zero-ETL replicated tables

Common analytics use cases often involve aggregating data across multiple source tables using complex queries to generate reports and dashboards for downstream applications. Customers usually create late binding views to serve such use cases, which aren’t always optimized to meet stringent query SLAs because of long underlying query runtimes. Another option is to create a table that stores the data across multiple source tables, which brings the challenge of incrementally updating and refreshing data based on the changes in the source tables.

To serve such use cases and get around the challenges associated with traditional options, you can create materialized views on top of zero-ETL replicated tables in Amazon Redshift, which can be automatically and incrementally refreshed as the underlying data changes. Materialized views are also convenient for storing frequently accessed data, by unnesting and shredding the data stored in the SUPER column value by the zero-ETL integration.

For example, we can use the following query to create a materialized view on the customerprofiles table to analyze customer data:

CREATE MATERIALIZED VIEW dev.public.customer_mv
AUTO REFRESH YES
AS
SELECT value."custname"."S"::varchar(30) as cust_name,
       value."username"."S"::varchar(100) as user_name,
       value."email"."S"::varchar(60) as cust_email,
       value."address"."S"::varchar(100) as cust_address,
       value."phone"."S"::varchar(100) as cust_phone_nbr,
       value."status"."S"::varchar(10) as cust_status,
       value."custcreatedt"."S"::varchar(10) as cust_create_dt,
       value."custupddt"."S"::varchar(10) as cust_update_dt
FROM "ddb_rs_customerprofiles_zetl_db"."public"."customerprofiles"
group by 1,2,3,4,5,6,7,8;

This view is set to AUTO REFRESH, which means it will be automatically and incrementally refreshed when new data arrives in the underlying source table customerprofiles.

Now let’s say you want to understand the distribution of customers across different status categories. You can query the materialized view customer_mv created from the zero-ETL DynamoDB table as follows:

-- Customer count by status
select cust_status, count(distinct user_name) cust_status_count
from dev.public.customer_mv
group by 1;

Next, let’s say you want to check the number of active customer profiles over different time periods. You can run the following query on customer_mv to get that data:

-- Active customer count by date
select cust_create_dt, count(distinct user_name) cust_count
from dev.public.customer_mv
where cust_status = 'active'
group by 1;

Let’s make a few more incremental changes, involving two new items and one delete on the source DynamoDB table, using the following AWS CLI commands:

aws dynamodb put-item --table-name customerprofiles --item  '{ "electronic mail": { "S": "robert.davis@instance.com" }, "custname": { "S": "Robert Davis" }, "username": { "S": "rdavis789" }, "telephone": { "S": "555-012-3456" }, "tackle": { "S": "789 Pine St, Seattle, WA 98101" }, "custcreatedt": { "S": "2022-07-01T14:00:00Z" }, "custupddt": { "S": "2023-04-01T11:30:00Z" }, "standing": { "S": "inactive" } }'

aws dynamodb put-item --table-name customerprofiles --item '{ "electronic mail": { "S": "william.jones@instance.com" }, "custname": { "S": "William Jones" }, "username": { "S": "wjones456" }, "telephone": { "S": "555-789-0123" }, "tackle": { "S": "456 Elm St, Atlanta, GA 30301" }, "custcreatedt": { "S": "2022-09-15T12:30:00Z" }, "custupddt": { "S": "2022-09-15T12:30:00Z" }, "standing": { "S": "lively" } }'

aws dynamodb delete-item --table-name customerprofiles --key '{"email": {"S": "emily.brown@example.com"}}'

Validate the incremental refresh of the materialized view

To monitor the history of materialized view refreshes, you can use the SYS_MV_REFRESH_HISTORY system view. As you can see in the following output, the materialized view customer_mv was incrementally refreshed.
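A query along these lines can be used to inspect that history (the column names are taken from the SYS_MV_REFRESH_HISTORY documentation; adjust as needed for your Redshift version):

```sql
-- Recent refreshes of customer_mv; refresh_type distinguishes incremental from full
select mv_name, refresh_type, status, start_time, end_time
from SYS_MV_REFRESH_HISTORY
where mv_name = 'customer_mv'
order by start_time desc;
```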

Now let’s query the materialized view created from the zero-ETL table. You can see the two new records. The changes were propagated into the materialized view with an incremental refresh.

Monitor the zero-ETL integration

There are several options to obtain metrics on the performance and status of the DynamoDB zero-ETL integration with Amazon Redshift.

On the Amazon Redshift console, choose Zero-ETL integrations in the navigation pane. You can choose the zero-ETL integration you want and display Amazon CloudWatch metrics related to the integration. These metrics are also directly available in CloudWatch.

For each integration, there are two tabs with information available:

  • Integration metrics – Shows metrics such as the lag (in minutes) and data transferred (in KBps)
  • Table statistics – Shows details about the tables replicated from DynamoDB to Amazon Redshift, such as status, last updated time, table row count, and table size

After inserting, deleting, and updating rows in the source DynamoDB table, the Table statistics section displays the details, as shown in the following screenshot.

In addition to the CloudWatch metrics, you can query the following system views, which provide information about the integrations:
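For example, SVV_INTEGRATION reports the state of each integration, and SYS_INTEGRATION_ACTIVITY reports per-sync details. A sketch of such a query follows (verify the column names against the system view documentation):

```sql
-- List integrations and their current state
select integration_id, target_database, state
from SVV_INTEGRATION;
```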

Pricing

AWS doesn’t charge an additional fee for the zero-ETL integration. You pay for existing DynamoDB and Amazon Redshift resources used to create and process the change data created as part of a zero-ETL integration. These include DynamoDB PITR, DynamoDB exports for the initial and ongoing data changes to your DynamoDB data, additional Amazon Redshift storage for storing replicated data, and Amazon Redshift compute on the target. For pricing on DynamoDB PITR and DynamoDB exports, see Amazon DynamoDB pricing. For pricing on Redshift clusters, see Amazon Redshift pricing.

Clean up

When you delete a zero-ETL integration, your data isn’t deleted from the DynamoDB table or Redshift, but data changes occurring after that point in time aren’t sent to Amazon Redshift.

To delete a zero-ETL integration, complete the following steps:

  1. On the Amazon Redshift console, choose Zero-ETL integrations in the navigation pane.
  2. Select the zero-ETL integration that you want to delete, and on the Actions menu, choose Delete.
  3. To confirm the deletion, enter confirm and choose Delete.

Conclusion

In this post, we explained how you can set up the zero-ETL integration from DynamoDB to Amazon Redshift to derive holistic insights across many applications, break data silos in your organization, and gain significant cost savings and operational efficiencies.

To learn more about zero-ETL integration, refer to the documentation.


About the authors

Ekta Ahuja is an Amazon Redshift Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys landscape photography, traveling, and board games.

Raghu Kuppala is an Analytics Specialist Solutions Architect experienced in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.

Veerendra Nayak is a Principal Database Solutions Architect based in the Bay Area, California. He works with customers to share best practices on database migrations, resiliency, and integrating operational data with analytics and AI services.
