20x Faster Ingestion with Rockset's New DynamoDB Connector



Since its introduction in 2012, Amazon DynamoDB has been one of the most popular NoSQL databases in the cloud. Unlike a traditional RDBMS, DynamoDB scales horizontally, eliminating the need for careful capacity planning, resharding, and database maintenance. As a result, DynamoDB is the database of choice for companies building event-driven architectures and user-friendly, performant applications at scale. As such, DynamoDB is central to many modern applications in ad tech, gaming, IoT, and financial services.

However, while DynamoDB is great for real-time transactions, it doesn't do as well for analytics workloads. Analytics workloads are where Rockset shines. To enable these workloads, Rockset provides a fully managed sync to DynamoDB tables with its built-in connector. The data from DynamoDB is automatically indexed in an inverted index, a column index, and a row index, which can then be queried quickly and efficiently.

As such, the DynamoDB connector is one of our most widely used data connectors. We see users move massive amounts of data, often TBs' worth, using the DynamoDB connector. Given that scale of use, we soon uncovered shortcomings with our connector.

How the DynamoDB Connector Currently Works with the Scan API

At a high level, the current connector ingests data into Rockset in two phases:


[Figure: dynamodb-rockset-connector-v1]

  1. Initial dump: This phase uses DynamoDB's Scan API for a one-time scan of the entire table.
  2. Streaming: This phase uses DynamoDB's Streams API to consume continuous updates made to a DynamoDB table in a streaming fashion.

Roughly, the initial dump gives us a snapshot of the data, on top of which the updates from the streaming phase are applied. While the initial dump using the Scan API works well for small tables, it doesn't always fare well for large data dumps.
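To make the two-phase model concrete, here is a minimal sketch (not Rockset's actual implementation) of how streamed updates reconcile on top of an initial snapshot. The record shapes below loosely mirror DynamoDB Streams entries, with an event name plus the item's key and new image:

```python
# Hypothetical sketch of reconciling a streaming phase on top of a
# Scan-phase snapshot. Record shapes loosely mirror DynamoDB Streams
# entries (eventName, key, new image); names here are illustrative.

def apply_stream_to_snapshot(snapshot, stream_records):
    """Apply INSERT/MODIFY/REMOVE records to a dict keyed by primary key."""
    table = dict(snapshot)  # start from the initial-dump snapshot
    for record in stream_records:
        event = record["eventName"]
        key = record["key"]
        if event in ("INSERT", "MODIFY"):
            table[key] = record["newImage"]  # upsert the latest image
        elif event == "REMOVE":
            table.pop(key, None)             # delete if still present
    return table

snapshot = {"user#1": {"name": "Ada"}, "user#2": {"name": "Grace"}}
updates = [
    {"eventName": "MODIFY", "key": "user#1", "newImage": {"name": "Ada L."}},
    {"eventName": "REMOVE", "key": "user#2", "newImage": None},
    {"eventName": "INSERT", "key": "user#3", "newImage": {"name": "Edsger"}},
]
result = apply_stream_to_snapshot(snapshot, updates)
print(result)  # {'user#1': {'name': 'Ada L.'}, 'user#3': {'name': 'Edsger'}}
```

This also shows why the snapshot must complete before the checkpointed stream records expire: the merge only produces the correct table state if no update between the snapshot and the stream checkpoint is lost.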

There are two main issues with the initial dump as it stands today:

  • Unconfigurable segment sizes: DynamoDB doesn't always balance segments uniformly, sometimes producing a straggler segment that is inordinately larger than the others. Because parallelism is at segment granularity, we have seen straggler segments increase the total ingestion time for several users in production.
  • Fixed DynamoDB Streams retention: DynamoDB Streams capture change records in a log for up to 24 hours. This means that if the initial dump takes longer than 24 hours, the shards that were checkpointed at the start of the initial dump may have expired by then, leading to data loss.
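The straggler problem is easy to see with a toy model. Since parallelism is per segment, end-to-end scan time is set by the slowest segment, not the average (the throughput number below is an illustrative assumption, not a measured figure):

```python
# Toy model of why one imbalanced segment dominates end-to-end scan time.
# With one worker per segment, wall-clock time is the slowest segment's time.

def scan_time_hours(segment_sizes_gb, throughput_gb_per_hour):
    """End-to-end time when every segment is scanned in parallel."""
    return max(segment_sizes_gb) / throughput_gb_per_hour

balanced = [100] * 10          # 1 TB split evenly across 10 segments
imbalanced = [40] * 9 + [640]  # same 1 TB, but one straggler segment

print(scan_time_hours(balanced, 50))    # 2.0 hours
print(scan_time_hours(imbalanced, 50))  # 12.8 hours: the straggler dominates
```

The same total data volume takes over six times longer when one segment holds most of it, which is exactly the behavior we observed in production.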

Improving the DynamoDB Connector with Export to S3

When AWS announced new functionality that allows you to export DynamoDB table data to Amazon S3, we began evaluating this approach to see if it could overcome the shortcomings of the older one.

At a high level, instead of using the Scan API to get a snapshot of the data, we use the new export-table-to-S3 functionality. Since it is not a drop-in replacement for the Scan API, we tweaked the streaming phase, which, together with the export to S3, forms the basis of our new connector.
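One reason the export is not a drop-in replacement is the data format: export files contain type-tagged DynamoDB JSON (e.g. `{"S": "..."}` for strings, `{"N": "..."}` for numbers) rather than plain items. A minimal deserializer for a few common attribute types might look like this (a full reader would also cover binary and set types):

```python
import json

# Minimal deserializer for the type-tagged DynamoDB JSON found in S3
# export files, where each line holds one {"Item": {...}} object.
# Illustrative sketch: handles only a few common attribute types.

def from_dynamodb_json(av):
    """Convert one type-tagged attribute value to a plain Python value."""
    tag, value = next(iter(av.items()))
    if tag == "S":
        return value
    if tag == "N":
        # DynamoDB serializes numbers as strings
        return float(value) if "." in value else int(value)
    if tag == "BOOL":
        return value
    if tag == "NULL":
        return None
    if tag == "L":
        return [from_dynamodb_json(v) for v in value]
    if tag == "M":
        return {k: from_dynamodb_json(v) for k, v in value.items()}
    raise ValueError(f"unhandled type tag: {tag}")

line = '{"Item": {"pk": {"S": "user#1"}, "age": {"N": "42"}, "tags": {"L": [{"S": "beta"}]}}}'
item = {k: from_dynamodb_json(v) for k, v in json.loads(line)["Item"].items()}
print(item)  # {'pk': 'user#1', 'age': 42, 'tags': ['beta']}
```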

[Figure: dynamodb-rockset-connector-v2]

While the old connector took almost 20 hours to ingest 1 TB end to end with a production workload running on the DynamoDB table, the new connector takes only about 1 hour. What's more, ingesting 20 TB from DynamoDB takes only 3.5 hours, end to end! All you need to provide is an S3 bucket!

Benefits of the new approach:

  • Doesn't affect the provisioned read capacity, and thus any production workload, running on the DynamoDB table
  • The export process is much faster than custom table-scan solutions
  • S3 tasks can be configured to spread the load evenly, so we don't have to deal with a heavily imbalanced segment as with DynamoDB
  • Checkpointing with S3 comes for free (we recently built support for this)

We're opening up access for a public beta and can't wait for you to take this for a spin! Sign up here.

Happy ingesting and happy querying!


