
Use the AWS CDK with the Data Solutions Framework to provision and manage Amazon Redshift Serverless


In February 2024, we announced the release of the Data Solutions Framework (DSF), an opinionated open source framework for building data solutions on AWS. DSF is built using the AWS Cloud Development Kit (AWS CDK) to package infrastructure components into L3 AWS CDK constructs on top of AWS services. L3 constructs are implementations of common technical patterns and create multiple resources that are configured to work with each other.

In this post, we demonstrate how to use the AWS CDK and DSF to create a multi-data warehouse platform based on Amazon Redshift Serverless. DSF simplifies the provisioning of Redshift Serverless, initialization and cataloging of data, and data sharing between different data warehouse deployments. Using a programmatic approach with the AWS CDK and DSF allows you to apply GitOps principles to your analytics workloads and realize the following benefits:

  • You can deploy using continuous integration and delivery (CI/CD) pipelines, including the definitions of Redshift objects (databases, tables, shares, and so on)
  • You can roll out changes consistently across multiple environments
  • You can bootstrap data warehouses (table creation, ingestion of data, and so on) using code and use version control to simplify the setup of testing environments
  • You can test changes before deployment using the AWS CDK built-in testing capabilities (see the sketch following this list)
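The last point can be covered with the AWS CDK assertions module. The following is a minimal sketch of a unit test (pytest style) under stated assumptions: the stack class and module names match the project created later in this post, and the stack synthesizes at least three data lake buckets.

import aws_cdk as cdk
from aws_cdk.assertions import Template

from dsf_redshift_blog.dsf_redshift_blog_stack import DsfRedshiftBlogStack


def test_stack_synthesizes_data_lake_buckets():
    # Pass the DSF context flag explicitly so the test matches the cdk.context.json setup
    app = cdk.App(context={"@data-solutions-framework-on-aws/removeDataOnDestroy": True})
    stack = DsfRedshiftBlogStack(app, "TestStack")
    template = Template.from_stack(stack)

    # The DataLakeStorage construct should synthesize S3 buckets for the data layers
    buckets = template.find_resources("AWS::S3::Bucket")
    assert len(buckets) >= 3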

In addition, DSF's Redshift Serverless L3 constructs provide a number of built-in capabilities that can accelerate development while helping you follow best practices. For example:

  • Running extract, transform, and load (ETL) jobs to and from Amazon Redshift is more straightforward because an AWS Glue connection resource is automatically created and configured. This means data engineers don’t have to configure this resource and can use it right away with their AWS Glue ETL jobs.
  • Similarly, for discovery of data within Amazon Redshift, DSF provides a convenient way to configure an AWS Glue crawler to populate the AWS Glue Data Catalog for ease of discovery as well as ease of referencing tables when creating ETL jobs. The configured AWS Glue crawler uses an AWS Identity and Access Management (IAM) role that follows least privilege.
  • Sharing data between Redshift data warehouses is a common approach to improve collaboration between lines of business without duplicating data. DSF provides convenient methods for the end-to-end flow for both data producer and consumer.

Solution overview

The solution demonstrates a common pattern where a data warehouse is used as a serving layer for business intelligence (BI) workloads on top of data lake data. The source data is stored in Amazon Simple Storage Service (Amazon S3) buckets, then ingested into a Redshift producer data warehouse to create materialized views and aggregate data, and finally shared with a Redshift consumer running BI queries from the end-users. The following diagram illustrates the high-level architecture.

Architecture diagram: using the AWS CDK with the Data Solutions Framework to provision and manage Amazon Redshift Serverless

In this post, we use Python for the example code. DSF also supports TypeScript.

Prerequisites

Because we’re using the AWS CDK, complete the steps in Getting Started with the AWS CDK before you implement the solution.

Initialize the project and provision a Redshift Serverless namespace and workgroup

Let’s start by initializing the project and including DSF as a dependency. You can run this code in your local terminal, or you can use AWS Cloud9:

mkdir dsf-redshift-blog && cd dsf-redshift-blog
cdk init --language python
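DSF is distributed as a regular Python package. Assuming it is published on PyPI as cdklabs.aws-data-solutions-framework (check the DSF documentation for the exact package name and version), you can add it to the requirements file created by cdk init and install it in the project’s virtual environment:

# Assumed PyPI package name for DSF; verify against the DSF documentation
echo "cdklabs.aws-data-solutions-framework" >> requirements.txt
source .venv/bin/activate
python -m pip install -r requirements.txt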

Open the project folder in your IDE and complete the following steps:

  1. Open the app.py file.
  2. In this file, make sure to uncomment the first env statement. This configures the AWS CDK environment depending on the AWS profile used during the deployment.
  3. Add a configuration flag in the cdk.context.json file at the root of the project (if it doesn’t exist, create the file):
    {  
        "@data-solutions-framework-on-aws/removeDataOnDestroy": true 
    }

Setting the @data-solutions-framework-on-aws/removeDataOnDestroy configuration flag to true makes sure resources that have the removal_policy parameter set to RemovalPolicy.DESTROY are destroyed when the AWS CDK stack is deleted. This is a guardrail DSF uses to prevent accidentally deleting data.

Now that the project is configured, you can start adding resources to the stack.

  4. Navigate to the dsf_redshift_blog folder and open the dsf_redshift_blog_stack.py file.

This is where we configure the resources to be deployed.

  5. To get started building the end-to-end demo, add the following import statements at the top of the file, which allow you to start defining the resources from both the AWS CDK core library as well as DSF:
    from aws_cdk import (
        RemovalPolicy,
        Stack
    )
    
    from aws_cdk.aws_s3 import Bucket
    from aws_cdk.aws_iam import Role, ServicePrincipal
    from constructs import Construct
    from cdklabs import aws_data_solutions_framework as dsf

We use several DSF-specific constructs to build the demo:

  • DataLakeStorage – This creates three S3 buckets, named Bronze, Silver, and Gold, to represent the different data layers.
  • S3DataCopy – This manages the copying of data from one bucket to another bucket.
  • RedshiftServerlessNamespace – This creates a Redshift Serverless namespace where database objects and users are stored.
  • RedshiftServerlessWorkgroup – This creates a Redshift Serverless workgroup that contains compute- and network-related configurations for the data warehouse. This is also the entry point for several convenient functionalities that DSF provides, such as cataloging of Redshift tables, running SQL statements as part of the AWS CDK (such as creating tables, data ingestion, merging of tables, and more), and sharing datasets across different Redshift clusters without moving data.
  6. Now that you have imported the libraries, create a set of S3 buckets following the medallion architecture best practices with bronze, silver, and gold data layers.

The high-level definitions of each layer are as follows:

  • Bronze represents raw data; this is where data from various source systems lands. No schema is required.
  • Silver is cleaned and potentially augmented data. The schema is enforced in this layer.
  • Gold is data that’s further refined and aggregated to serve a specific business need.

Using the DataLakeStorage construct, you can create these three S3 buckets with the following best practices:

  • Encryption at rest through AWS Key Management Service (AWS KMS) is turned on
  • SSL is enforced
  • The use of S3 bucket keys is turned on
  • There is a default S3 lifecycle rule defined to delete incomplete multipart uploads after 1 day
    data_lake = dsf.storage.DataLakeStorage(self,
        'DataLake',
        removal_policy=RemovalPolicy.DESTROY)

  7. After you create the S3 buckets, copy over the data using the S3DataCopy construct. For this demo, we land the data in the Silver bucket because it’s already cleaned:
    source_bucket = Bucket.from_bucket_name(self, 
        'SourceBucket', 
        bucket_name="redshift-immersionday-labs")
    
    data_copy = dsf.utils.S3DataCopy(self,
        'SourceData', 
        source_bucket=source_bucket, 
        source_bucket_prefix='data/amazon-reviews/', 
        source_bucket_region='us-west-2', 
        target_bucket=data_lake.silver_bucket, 
        target_bucket_prefix='silver/amazon-reviews/')

  8. In order for Amazon Redshift to ingest the data in Amazon S3, it needs an IAM role with the right permissions. This role will be associated with the Redshift Serverless namespace that you create next.
    lake_role = Role(self, 
        'LakeRole', 
        assumed_by=ServicePrincipal('redshift.amazonaws.com'))
    
    data_lake.silver_bucket.grant_read(lake_role)

  9. To provision Redshift Serverless, configure two resources: a namespace and a workgroup. DSF provides L3 constructs for both:
    1. RedshiftServerlessNamespace
    2. RedshiftServerlessWorkgroup

    Both constructs follow security best practices, including:

    • The default virtual private cloud (VPC) uses private subnets (with public access disabled).
    • Data is encrypted at rest through AWS KMS with automatic key rotation.
    • Admin credentials are stored in AWS Secrets Manager with automatic rotation managed by Amazon Redshift.
    • A default AWS Glue connection is automatically created using private connectivity. This can be used by AWS Glue crawlers as well as AWS Glue ETL jobs to connect to Amazon Redshift.

    The RedshiftServerlessWorkgroup construct is the main entry point for other capabilities, such as integration with the AWS Glue Data Catalog, Redshift Data API, and Data Sharing API.

  10. In the following example, use the defaults provided by the construct and associate the IAM role that you created earlier to give Amazon Redshift access to the data lake for data ingestion:
      namespace = dsf.consumption.RedshiftServerlessNamespace(self, 
          'Namespace', 
          db_name="defaultdb", 
          identify="producer", 
          removal_policy=RemovalPolicy.DESTROY, 
          default_iam_role=lake_role)
      
      workgroup = dsf.consumption.RedshiftServerlessWorkgroup(self, 
          'Workgroup', 
          identify="producer", 
          namespace=namespace, 
          removal_policy=RemovalPolicy.DESTROY)

Create tables and ingest data

To create a table, you can use the runCustomSQL method in the RedshiftServerlessWorkgroup construct. This method allows you to run arbitrary SQL statements when the resource is being created (such as create table or create materialized view) and when it’s being deleted (such as drop table or drop materialized view).

Add the following code after the RedshiftServerlessWorkgroup instantiation:

create_amazon_reviews_table = workgroup.run_custom_sql('CreateAmazonReviewsTable', 
    database_name="defaultdb", 
    sql="CREATE TABLE amazon_reviews (market character various(16383) ENCODE lzo, customer_id character various(16383) ENCODE lzo, review_id character various(16383) ENCODE lzo, product_id character various(16383) ENCODE lzo, product_parent character various(16383) ENCODE lzo, product_title character various(16383) ENCODE lzo, star_rating integer ENCODE az64, helpful_votes integer ENCODE az64, total_votes integer ENCODE az64, vine character various(16383) ENCODE lzo, verified_purchase character various(16383) ENCODE lzo, review_headline character various(max) ENCODE lzo, review_body character various(max) ENCODE lzo, review_date date ENCODE az64, yr integer ENCODE az64) DISTSTYLE AUTO;", 
    delete_sql="drop desk amazon_reviews")

load_amazon_reviews_data = workgroup.ingest_data('amazon_reviews_ingest_data', 
    'defaultdb', 
    'amazon_reviews', 
    data_lake.silver_bucket, 
    'silver/amazon-reviews/', 
    'FORMAT parquet')

load_amazon_reviews_data.node.add_dependency(create_amazon_reviews_table)
load_amazon_reviews_data.node.add_dependency(data_copy)

Given the asynchronous nature of some of the resource creation, we also enforce dependencies between certain resources; otherwise, the AWS CDK would try to create them in parallel to accelerate the deployment. The preceding dependency statements establish the following:

  • Before you load the data, the S3 data copy is complete, so the data exists in the source bucket of the ingestion
  • Before you load the data, the target table has been created in the Redshift namespace

Bootstrapping example (materialized views)

The workgroup.run_custom_sql() method provides flexibility in how you can bootstrap your Redshift data warehouse using the AWS CDK. For example, you can create a materialized view to improve the queries’ performance by pre-aggregating data from the Amazon reviews:

materialized_view = workgroup.run_custom_sql('MvProductAnalysis',
    database_name="defaultdb",
    sql=f'''CREATE MATERIALIZED VIEW mv_product_analysis AS SELECT review_date, product_title, COUNT(1) AS review_total, SUM(star_rating) AS rating FROM amazon_reviews WHERE marketplace='US' GROUP BY 1,2;''',
    delete_sql="drop materialized view mv_product_analysis")

materialized_view.node.add_dependency(load_amazon_reviews_data)

Catalog tables in Amazon Redshift

The deployment of RedshiftServerlessWorkgroup automatically creates an AWS Glue connection resource that can be used by AWS Glue crawlers and AWS Glue ETL jobs. This is directly exposed from the workgroup construct through the glue_connection property. Using this connection, the workgroup construct exposes a convenient method to catalog the tables inside the associated Redshift Serverless namespace. The following is an example of the code:

workgroup.catalog_tables('DefaultDBCatalog', 'mv_product_analysis')

This single line of code creates a database in the Data Catalog named mv_product_analysis and the associated crawler with the IAM role and network configuration already configured. By default, it crawls all the tables inside the public schema in the default database indicated when the Redshift Serverless namespace was created. To override this, the third parameter in the catalogTables method allows you to define a pattern on what to crawl (see the JDBC data store in the include path).
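For illustration, assuming the pattern follows the AWS Glue JDBC include-path convention (database/schema/table), a call such as the following could limit the crawler to objects whose names start with mv_ in the public schema (the pattern value here is hypothetical, so verify it against the DSF documentation):

# Hypothetical include-path pattern restricting the crawl scope
workgroup.catalog_tables('DefaultDBCatalog', 'mv_product_analysis', 'defaultdb/public/mv_%')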

You can run the crawler using the AWS Glue console or invoke it using the SDK, AWS Command Line Interface (AWS CLI), or the AWS CDK using AwsCustomResource.
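The following is a sketch of the AWS CDK option, using AwsCustomResource to start the crawler once at deployment time; the crawler name used here is hypothetical and should be replaced with the name of the crawler created by catalog_tables:

from aws_cdk import custom_resources as cr

start_crawler = cr.AwsCustomResource(self, 'StartRedshiftCrawler',
    on_create=cr.AwsSdkCall(
        service='Glue',
        action='startCrawler',
        # Hypothetical crawler name; replace with the crawler created by catalog_tables
        parameters={'Name': 'dsf-redshift-crawler'},
        physical_resource_id=cr.PhysicalResourceId.of('StartRedshiftCrawler')),
    policy=cr.AwsCustomResourcePolicy.from_sdk_calls(
        resources=cr.AwsCustomResourcePolicy.ANY_RESOURCE))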

Data sharing

DSF supports Redshift data sharing for both sides (producers and consumers) as well as same account and cross-account scenarios. Let’s create another Redshift Serverless namespace and workgroup to demonstrate the interaction:

namespace2 = dsf.consumption.RedshiftServerlessNamespace(self, 
    "Namespace2", 
    db_name="defaultdb", 
    identify="client", 
    default_iam_role=lake_role, 
    removal_policy=RemovalPolicy.DESTROY)

workgroup2 = dsf.consumption.RedshiftServerlessWorkgroup(self, 
    "Workgroup2", 
    identify="client", 
    namespace=namespace2, 
    removal_policy=RemovalPolicy.DESTROY)

For producers

For producers, complete the following steps:

  1. Create the new share and populate the share with the schema or tables:
    data_share = workgroup.create_share('DataSharing', 
        'defaultdb', 
        'defaultdbshare', 
        'public', ['mv_product_analysis'])
    
    data_share.new_share_custom_resource.node.add_dependency(materialized_view)
  2. Create access grants:
    • To grant to a cluster in the same account:
      share_grant = workgroup.grant_access_to_share("GrantToSameAccount", 
          data_share, 
          namespace2.namespace_id)
      
      share_grant.resource.node.add_dependency(data_share.new_share_custom_resource)
      share_grant.resource.node.add_dependency(namespace2)
    • To grant to a different account (the consumer account ID in the following snippet is a placeholder):
      workgroup.grant_access_to_share('GrantToDifferentAccount', 
          tpcdsShare, 
          None, 
          '<CONSUMER_ACCOUNT_ID>', 
          True)

The last parameter in the grant_access_to_share method allows you to automatically authorize the cross-account access on the data share. Omitting this parameter would default to no authorization, which means a Redshift administrator needs to authorize the cross-account share using the AWS CLI, SDK, or Amazon Redshift console.
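As a reference, the manual authorization on the producer account would look similar to the following AWS CLI call (the data share ARN and consumer account ID are placeholders):

aws redshift authorize-data-share \
    --data-share-arn <DATA_SHARE_ARN> \
    --consumer-identifier <CONSUMER_ACCOUNT_ID>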

For consumers

For the same account share, to create the database from the share, use the following code:

create_db_from_share = workgroup2.create_database_from_share("CreateDatabaseFromShare", 
    "advertising", 
    data_share.data_share_name, 
    data_share.producer_namespace)

create_db_from_share.resource.node.add_dependency(share_grant.resource)
create_db_from_share.resource.node.add_dependency(workgroup2)

For cross-account grants, the syntax is similar, but you need to indicate the producer account ID (the share name, producer namespace ID, and producer account ID in the following snippet are placeholders):

consumerWorkgroup.create_database_from_share('CreateCrossAccountDatabaseFromShare', 
    'tpcds', 
    '<PRODUCER_SHARE_NAME>', 
    '<PRODUCER_NAMESPACE_ID>', 
    '<PRODUCER_ACCOUNT_ID>')

To see the full working example, follow the instructions in the accompanying GitHub repository.

Deploy the resources using the AWS CDK

To deploy the resources, run the following code:
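Assuming the standard AWS CDK workflow, this is the usual deployment command, run from the project root:

cdk deploy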

You can review the resources created, as shown in the following screenshot.

Confirm the changes for the deployment to start. Wait a few minutes for the project to be deployed; you can keep track of the deployment using the AWS CLI or the AWS CloudFormation console.

When the deployment is complete, you should see two Redshift workgroups (one producer and one consumer).

Using Amazon Redshift Query Editor v2, you can log in to the producer Redshift workgroup using Secrets Manager, as shown in the following screenshot.

Producer QEV2 Login

After you log in, you can see the tables and views that you created using DSF in the defaultdb database.

QEv2 Tables

Log in to the consumer Redshift workgroup to see the shared dataset from the producer Redshift workgroup under the marketing database.

Clean up

You can run cdk destroy in your local terminal to delete the stack. Because you marked the constructs with RemovalPolicy.DESTROY and configured DSF to remove data on destroy, running cdk destroy or deleting the stack from the AWS CloudFormation console will clean up the provisioned resources.

Conclusion

In this post, we demonstrated how to use the AWS CDK along with DSF to manage Redshift Serverless as code. Codifying the deployment of resources helps provide consistency across multiple environments. Apart from infrastructure, DSF also provides capabilities to bootstrap (table creation, ingestion of data, and more) Amazon Redshift and manage objects, all from the AWS CDK. This means that changes can be version controlled, reviewed, and even unit tested.

In addition to Redshift Serverless, DSF supports other AWS services, such as Amazon Athena, Amazon EMR, and many more. Our roadmap is publicly available, and we look forward to your feature requests, contributions, and feedback.

You can get started using DSF by following our quick start guide.


About the authors


Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.
Vincent Gromakowski is an Analytics Specialist Solutions Architect at AWS where he enjoys solving customers’ analytics, NoSQL, and streaming challenges. He has strong expertise in distributed data processing engines and resource orchestration platforms.
