Evaluating a sample Amazon Redshift data sharing architecture using Redshift Test Drive and advanced SQL analysis


With the launch of Amazon Redshift Serverless and the various provisioned instance deployment options, customers are looking for tools that help them determine the most optimal data warehouse configuration to support their Amazon Redshift workloads.

Amazon Redshift is a widely used, fully managed, petabyte-scale data warehouse service. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads.

Redshift Test Drive is a tool hosted on GitHub that lets customers evaluate which data warehouse configuration options are best suited for their workload. The Test Drive Workload Replicator utility consists of scripts that can be used to extract the workload queries from your source warehouse audit logs and replay them on a target warehouse you launched. The Test Drive Configuration Comparison utility automates this process by deploying target Amazon Redshift warehouses and orchestrating the replay of the source workload through a combination of AWS CloudFormation and AWS Step Functions.

Both utilities unload the performance metrics from the replay of the source workload on the target configuration(s) to Amazon Simple Storage Service (Amazon S3). Although the Replay Analysis UI and the Configuration Comparison utility can provide a preliminary performance comparison, many customers want to dig deeper by analyzing the raw data themselves.

This walkthrough illustrates an example workload replayed on a single Amazon Redshift data warehouse and on a data sharing architecture using the Workload Replicator utility, the output of which is used to evaluate the performance of the workload.

Use case overview

For the sample use case, we assume we have an existing 2 x ra3.4xlarge provisioned data warehouse that currently runs extract, transform, and load (ETL), ad hoc, and business intelligence (BI) queries. We are interested in breaking these workloads apart using data sharing into a 32 base Redshift Processing Unit (RPU) Serverless producer running ETL and a 64 base RPU Serverless consumer running the BI workload. We used Workload Replicator to replay the workload on a replica baseline of the source and on the target data sharing configuration as specified in the tutorial. The following image shows the process flow.

Generating and accessing Test Drive metrics

The results of Amazon Redshift Test Drive can be accessed using an external schema for analysis of a replay. Refer to the Workload Replicator README and the Configuration Comparison README for more detailed instructions to execute a replay using the respective tool.

The external schema for analysis is created automatically with the Configuration Comparison utility, in which case you can proceed directly to the SQL analysis in the Deploy the QEv2 SQL Notebook and analyze workload section. If you use Workload Replicator, however, the external schema is not created automatically and therefore needs to be configured as a prerequisite to the SQL analysis. In the following walkthrough, we demonstrate how the external schema can be set up, using a sample analysis of the data sharing use case.

Executing Test Drive Workload Replicator for data sharing

To execute Workload Replicator, use Amazon Elastic Compute Cloud (Amazon EC2) to run the automation scripts used to extract the workload from the source.

Configure the Amazon Redshift data warehouse

  1. Create a snapshot following the guidance in the Amazon Redshift Management Guide.
  2. Enable audit logging following the guidance in the Amazon Redshift Management Guide.
  3. Enable user activity logging on the source cluster following the guidance in the Amazon Redshift Management Guide.

Enabling logging requires a change to the parameter group. Audit logging needs to be enabled prior to the workload that will be replayed, because this is where the connections and SQL queries of the workload will be extracted from.

  4. Launch the baseline replica by restoring a 2-node ra3.4xlarge provisioned cluster from the snapshot.
  5. Launch the producer warehouse by restoring the snapshot to a 32 RPU serverless namespace.
  6. The consumer should not contain the schemas and tables that will be shared from the producer. You can launch the 64 RPU Serverless consumer from the snapshot and then drop the relevant objects, or you can create a new 64 RPU Serverless consumer warehouse and recreate the consumer users.
  7. Create a datashare from the producer to the consumer and add the relevant objects.
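The datashare creation above can be sketched as follows; the datashare name, schema name, and namespace IDs are hypothetical placeholders for your own values:

```sql
-- On the producer: create the datashare and add the objects to share
CREATE DATASHARE etl_share;
ALTER DATASHARE etl_share ADD SCHEMA sales;
ALTER DATASHARE etl_share ADD ALL TABLES IN SCHEMA sales;

-- Grant the consumer namespace access (replace with your consumer namespace ID)
GRANT USAGE ON DATASHARE etl_share TO NAMESPACE '<consumer-namespace-id>';

-- On the consumer: create a local database from the datashare
-- (replace with your producer namespace ID)
CREATE DATABASE datashare_database_name FROM DATASHARE etl_share
OF NAMESPACE '<producer-namespace-id>';
```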

Datashare objects can be read using two mechanisms: using three-part notation (database.schema.table), or by creating an external schema pointing to a shared schema and querying it using two-part notation (external_schema.table). Because we want to seamlessly run the source workload, which uses two-part notation on the local objects, this post demonstrates the latter approach. For each schema shared from the producer, run the following command on the consumer:

CREATE EXTERNAL SCHEMA schema_name
FROM REDSHIFT DATABASE 'datashare_database_name' SCHEMA 'schema_name';

Make sure to use the same schema name as the source for the external schema. Also, if any queries run against the public schema, drop the local public schema first before creating the external equivalent.

  8. Grant usage on the schema to any relevant users.
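For example, assuming a hypothetical shared schema sales and a hypothetical BI user bi_user:

```sql
GRANT USAGE ON SCHEMA sales TO bi_user;
```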

Configure Redshift Test Drive Workload Replicator

  1. Create an S3 bucket to store the artifacts required by the utility (such as the metrics, extracted workload, and output files from running UNLOAD commands).
  2. Launch the following three types of EC2 instances using the recommended configuration of m5.8xlarge, 32 GB of SSD storage, and an Amazon Linux AMI:
    1. Baseline instance
    2. Target-producer instance
    3. Target-consumer instance

Make sure you can connect to the EC2 instances to run the utility.

  3. For each instance, install the required libraries by completing the following steps from the GitHub repository:
    a. 2.i
    b. 2.ii (if an ODBC driver is to be used; the default is the Amazon Redshift Python driver)
    c. 2.iii
    d. 2.iv
    e. 2.v
  4. Create an AWS Identity and Access Management (IAM) role for the EC2 instances to access the Amazon Redshift warehouses, to read from the S3 audit logging bucket, and with both read and write access to the new S3 bucket created for storing replay artifacts.
  5. If you will run COPY and UNLOAD commands, create an IAM role with access to the required S3 buckets, and attach it to the Amazon Redshift warehouses that will execute the load and unload.

In this example, the IAM role is attached to the baseline replica and producer warehouses because these will be executing the ETL processes. The utility will update UNLOAD commands to unload data to a bucket you define, which as a best practice should be the bucket created for S3 artifacts. Write permissions must be granted to the Amazon Redshift warehouse for this location.

Run Redshift Test Drive Workload Replicator

  1. Run aws configure on the EC2 instances and populate the default Region with the Region the utility is being executed in.
  2. The extract only needs to be run once, so connect to the baseline EC2 instance and run vi config/extract.yaml to open the extract.yaml file and configure the extraction details (press i to begin editing, use Escape to leave edit mode, and :wq! to save and exit vi). For more details on the parameters, see Configure parameters.

The following code is an example of a configured extract that unloads the logs for a half-hour window to the Test Drive artifacts bucket and updates COPY commands to run with the POC Amazon Redshift role.

Configuration Extract File
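The screenshot itself is not reproduced here. As a hedged sketch only, an extract configuration along these lines could be expected; the endpoint, user, bucket, and time window below are hypothetical, and the config/extract.yaml template in the Redshift Test Drive repository is the authoritative list of parameter names:

```yaml
# Hypothetical values; consult the extract.yaml template in the repository.
source_cluster_endpoint: "source-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev"
master_username: "awsuser"
region: "us-east-1"
start_time: "2024-03-05T20:30:00+00:00"
end_time: "2024-03-05T21:00:00+00:00"
workload_location: "s3://testdriveartifacts/myworkload"
```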

  3. Run make extract to extract the workload. When complete, make a note of the folder created at the path specified in the workload_location parameter of the extract (s3://testdriveartifacts/myworkload/Extraction_xxxx-xx-xxTxx:xx:xx.xxxxxx+00:00).
  4. On the same baseline EC2 instance that will run the full workload on the source replica, run vi config/replay.yaml and configure the details with the workload location copied in step 3 and the baseline warehouse endpoint. (For more details on the parameters, see Configure parameters. The values after the analysis_iam_role parameter can be left as the default.)

The following code is an example of the beginning of a replay configuration for the source replica.

Config Replay File
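As with the extract, the screenshot is not reproduced; a hedged sketch of the beginning of a replay configuration might look like the following, with hypothetical endpoint, user, and role values (the config/replay.yaml template in the repository is authoritative):

```yaml
# Hypothetical values; consult the replay.yaml template in the repository.
workload_location: "s3://testdriveartifacts/myworkload/Extraction_2024-03-05T21:00:00.000000+00:00"
target_cluster_endpoint: "baseline-ra3-4xlarge-2.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev"
master_username: "awsuser"
analysis_iam_role: "arn:aws:iam::111122223333:role/RedshiftTestDriveRole"
```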

  5. On the EC2 instance that will run the target-producer workload, run vi config/replay.yaml. Configure the details with the workload location copied in step 3, the producer warehouse endpoint, and the other configuration as in step 4. To replay only the producer workload, add the appropriate users to include or exclude in the filters parameter.

The following code is an example of the filters used to exclude the BI workload from the producer.

Producer Configuration
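In place of the screenshot, the following is a hedged sketch of such a filter; the structure follows the include/exclude pattern of the Workload Replicator configuration, and the BI user names are hypothetical:

```yaml
# Hypothetical BI users excluded so only the ETL workload replays on the producer
filters:
    include:
        database_name: ['*']
        username: ['*']
        pid: ['*']
    exclude:
        database_name: []
        username: ['bi_user1', 'bi_user2']
        pid: []
```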

  6. On the EC2 instance that will run the target-consumer workload, run vi config/replay.yaml and configure the details with the workload location copied in step 3, the consumer warehouse endpoint, and the appropriate filters as in step 5. The same users that were excluded in the producer workload replay should be included in the consumer workload replay.

The following is an example of the filters used to run only the BI workload on the consumer.

Consumer Configuration
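Again as a hedged sketch in place of the screenshot, the same hypothetical BI user names are now included rather than excluded:

```yaml
# Hypothetical BI users included so only the BI workload replays on the consumer
filters:
    include:
        database_name: ['*']
        username: ['bi_user1', 'bi_user2']
        pid: ['*']
    exclude:
        database_name: []
        username: []
        pid: []
```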

  7. Run make replay on the baseline instance, target-producer instance, and target-consumer instance concurrently to run the workload on the target warehouses.

Analyze the Workload Replicator output

  1. Create the folder structure in the S3 bucket that was created earlier.
'{comparison_stats_s3_path}/{what_if_timestamp}/{cluster_identifier}/'

For comparison_stats_s3_path, enter the S3 bucket and path name. For what_if_timestamp, enter the replay start time. For cluster_identifier, enter the target cluster name for easy identification.

The following screenshot shows an example of the resulting S3 folder configuration.
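Because S3 has no real folders, the structure can be created by writing zero-byte placeholder objects with trailing-slash keys. The bucket and prefixes below reuse the example values that appear in the partition statements later in this post:

```shell
# Create one prefix per target cluster under the replay timestamp
aws s3api put-object --bucket workload-replicator-redshift \
    --key 2024-03-05-21-00-00/baseline-ra3-4xlarge-2/
aws s3api put-object --bucket workload-replicator-redshift \
    --key 2024-03-05-21-00-00/producer-serverless32RPU/
aws s3api put-object --bucket workload-replicator-redshift \
    --key 2024-03-05-21-00-00/consumer-serverless64RPU/
```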

  2. Use the following script to unload the system table data for each target cluster to the corresponding Amazon S3 target path that was created previously, using QEv2.
UNLOAD ($$
SELECT a.*, TRIM(u.usename) AS username
FROM sys_query_history a, pg_user u
WHERE a.user_id = u.usesysid
AND a.start_time > TO_TIMESTAMP('{what_if_timestamp}','YYYY-MM-DD-HH24-MI-SS')
$$) TO '{comparison_stats_s3_path}/{what_if_timestamp}/{cluster_identifier}/'
FORMAT AS PARQUET PARALLEL OFF ALLOWOVERWRITE
IAM_ROLE '{redshift_iam_role}';

For what_if_timestamp, enter the replay start time. For comparison_stats_s3_path, enter the S3 bucket and path name. For cluster_identifier, enter the target cluster name for easy identification. For redshift_iam_role, enter the Amazon Resource Name (ARN) of the Redshift IAM role for the target cluster.


  3. Create an external schema in Amazon Redshift with the name comparison_stats.
CREATE EXTERNAL SCHEMA comparison_stats FROM DATA CATALOG
DATABASE 'redshift_config_comparison'
IAM_ROLE '{redshift_iam_role}'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

  4. Create an external table in Amazon Redshift with the name redshift_config_comparison_aggregate based on the Amazon S3 file location.
CREATE EXTERNAL TABLE comparison_stats.redshift_config_comparison_aggregate
(user_id int,
query_id bigint,
query_label VARCHAR,
transaction_id bigint,
session_id int,
database_name VARCHAR,
query_type VARCHAR,
status VARCHAR,
result_cache_hit boolean,
start_time timestamp,
end_time timestamp,
elapsed_time bigint,
queue_time bigint,
execution_time bigint,
error_message VARCHAR,
returned_rows bigint,
returned_bytes bigint,
query_text VARCHAR,
redshift_version VARCHAR,
usage_limit VARCHAR,
compute_type VARCHAR,
compile_time bigint,
planning_time bigint,
lock_wait_time bigint,
username VARCHAR)
PARTITIONED BY (cluster_identifier VARCHAR)
STORED AS PARQUET
LOCATION '{comparison_stats_s3_path}/{what_if_timestamp}';

  5. After creating the partitioned table, alter the table using the following statement to register the partitions in the external catalog.

When you add a partition, you define the location of the subfolder on Amazon S3 that contains the partition data. Run the statement once for each cluster identifier.

ALTER TABLE comparison_stats.redshift_config_comparison_aggregate
ADD PARTITION (cluster_identifier='{cluster_identifier}')
LOCATION '{comparison_stats_s3_path}/{what_if_timestamp}/{cluster_identifier}/';

Example:

ALTER TABLE comparison_stats.redshift_config_comparison_aggregate
ADD PARTITION (cluster_identifier='baseline-ra3-4xlarge-2')
LOCATION 's3://workload-replicator-redshift/2024-03-05-21-00-00/baseline-ra3-4xlarge-2/';

ALTER TABLE comparison_stats.redshift_config_comparison_aggregate
ADD PARTITION (cluster_identifier='producer-serverless32RPU')
LOCATION 's3://workload-replicator-redshift/2024-03-05-21-00-00/producer-serverless32RPU/';

ALTER TABLE comparison_stats.redshift_config_comparison_aggregate
ADD PARTITION (cluster_identifier='consumer-serverless64RPU')
LOCATION 's3://workload-replicator-redshift/2024-03-05-21-00-00/consumer-serverless64RPU/';

Deploy the QEv2 SQL Notebook and analyze the workload

In this section, we analyze the queries that were replayed in both the baseline and target clusters. We analyze the workload based on the common queries that were executed in the baseline and target clusters.

  1. Download the analysis notebook from Amazon S3.
  2. Import the notebook into the baseline Redshift cluster using QEv2. For guidance, refer to Authoring and running notebooks.
  3. Create the stored procedure common_queries_sp in the same database that was used to create the external schema.
  4. The stored procedure creates a view called common_queries by querying the external table redshift_config_comparison_aggregate that was created in the previous steps.

The view identifies the queries common to both the baseline and target clusters, as described in the notebook.

  5. Execute the stored procedure by passing the cluster identifiers for the baseline and target clusters as parameters.

For this post, we passed the baseline and producer cluster identifiers as the parameters. Passing the cluster identifiers as parameters retrieves the data only for those specific clusters.
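Assuming the procedure takes the baseline and target cluster identifiers in that order (the notebook defines the actual signature), the call might look like this, using the partition values from the earlier ALTER TABLE example:

```sql
CALL common_queries_sp('baseline-ra3-4xlarge-2', 'producer-serverless32RPU');
```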

Once the common_queries view is created, you can perform further analysis using the subsequent queries that are available in the notebook. If you have more than one target cluster, you can follow the same analysis process for each. For this post, we have two target clusters: producer and consumer. We first performed the analysis between the baseline and producer clusters, then repeated the same process to analyze the data for the baseline versus consumer clusters.

To analyze our workload, we will use the sys_query_history view. We commonly use several columns from this view, including the following:

  • elapsed_time: The end-to-end time of the query run
  • execution_time: The time the query spent running. In the case of a SELECT query, this also includes the return time.
  • compile_time: The time the query spent compiling

For more information on sys_query_history, refer to SYS_QUERY_HISTORY in the Amazon Redshift Database Developer Guide. The following analysis queries are provided in the notebook.

  1. Overall workload by user: count of common queries between the baseline and target clusters, by user.
  2. Overall workload by query type: count of common queries between the baseline and target clusters, by query type.
  3. Overall workload comparison (in seconds): compares the overall workload between the baseline and target clusters by analyzing the execution time, compile time, and elapsed time.
  4. Percentile workload comparison: the percentage of queries that perform at or below a given runtime (for example, a p50_s value of 5 seconds means 50% of queries in that workload ran in 5 seconds or less).
  5. Number of improved/degraded/unchanged queries: the number of queries that degraded, stayed the same, or improved when comparing the elapsed time between the baseline and target clusters.
  6. Degree of query-level performance change (percentage): the degree of change of each query from baseline to target, relative to the baseline performance.
  7. Comparison by query type (in seconds): compares the elapsed time of different query types, such as SELECT, INSERT, and COPY commands, between the baseline and target clusters.
  8. Top 10 slowest running queries (in seconds): the 10 slowest queries between the baseline and target clusters, comparing the elapsed time on both.
  9. Top 10 improved queries (in seconds): the 10 queries with the most improved elapsed time when comparing the baseline cluster to the target cluster.
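As one illustration of the kind of analysis these queries perform, an overall workload comparison along the lines of query 3 could be written directly against the external table. This is a sketch, not the notebook's actual query; it assumes the timing columns are in microseconds, as in sys_query_history:

```sql
-- Total workload timings per cluster, converted from microseconds to seconds
-- (table name as created earlier in this post)
SELECT cluster_identifier,
       SUM(elapsed_time)   / 1000000.0 AS total_elapsed_s,
       SUM(execution_time) / 1000000.0 AS total_execution_s,
       SUM(compile_time)   / 1000000.0 AS total_compile_s
FROM comparison_stats.redshift_config_comparison_aggregate
WHERE cluster_identifier IN ('baseline-ra3-4xlarge-2', 'producer-serverless32RPU')
GROUP BY cluster_identifier;
```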

Sample results analysis

In our example, the overall workload improvement for the workload isolation architecture using data sharing for the ETL workload between the baseline and producer is 858 seconds (baseline_elapsed_time - target_elapsed_time) for the sample TPC data, as shown in the following screenshots.

Baseline vs Producer Analysis

Baseline vs Producer Analysis Graph

The overall workload improvement for the workload isolation architecture using data sharing for the BI workload between the baseline and consumer is 1148 seconds (baseline_elapsed_time - target_elapsed_time) for the sample TPC data, as shown in the following screenshots.

Baseline vs Consumer Analysis

Baseline vs Consumer Analysis Graph

Cleanup

Complete the following steps to clean up your resources:

  1. Delete the Redshift provisioned replica cluster and the two Redshift Serverless endpoints (32 RPU and 64 RPU).
  2. Delete the S3 bucket used to store the artifacts.
  3. Delete the baseline, target-producer, and target-consumer EC2 instances.
  4. Delete the IAM role created for the EC2 instances to access Redshift clusters and S3 buckets.
  5. Delete the IAM roles created for the Amazon Redshift warehouses to access S3 buckets for COPY and UNLOAD commands.

Conclusion

In this post, we walked you through the process of testing a workload isolation architecture using Amazon Redshift data sharing and the Test Drive utility. We demonstrated how you can use SQL for advanced price-performance analysis and compare different workloads on different target Redshift configurations. We encourage you to evaluate your Amazon Redshift data sharing architecture using the Redshift Test Drive tool, and to use the provided SQL scripts to analyze the price-performance of your Amazon Redshift cluster.


About the Authors

Ayan Majumder is an Analytics Specialist Solutions Architect at AWS. His expertise lies in designing robust, scalable, and efficient cloud solutions for customers. Beyond his professional life, he enjoys traveling, photography, and outdoor activities.

Ekta Ahuja is an Amazon Redshift Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys landscape photography, traveling, and board games.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She is passionate about supporting customers in validating and optimizing analytics solutions by architecting proof of concept workloads designed to meet their specific needs.
