With the launch of Amazon Redshift Serverless and the various provisioned instance deployment options, customers are looking for tools that help them determine the most optimal data warehouse configuration to support their Amazon Redshift workloads.

Amazon Redshift is a widely used, fully managed, petabyte-scale data warehouse service. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads.
Redshift Test Drive is a tool hosted in a GitHub repository that lets customers evaluate which data warehouse configuration options are best suited for their workload. The Test Drive Workload Replicator utility consists of scripts that can be used to extract the workload queries from your source warehouse audit logs and replay them on a target warehouse you launched. The Test Drive Configuration Comparison utility automates this process by deploying target Amazon Redshift warehouses and orchestrating the replay of the source workload through a combination of AWS CloudFormation and AWS Step Functions.

Both utilities unload the performance metrics from the replay of the source workload on the target configuration(s) to Amazon Simple Storage Service (Amazon S3). Although the Replay Analysis UI and the Configuration Comparison utility can provide a preliminary performance comparison, many customers want to dig deeper by analyzing the raw data themselves.

This walkthrough illustrates an example workload replayed on a single Amazon Redshift data warehouse and on a data sharing architecture using the Workload Replicator utility, the output of which is used to evaluate the performance of the workload.
Use case overview
For the sample use case, we assume we have an existing 2 x ra3.4xlarge provisioned data warehouse that currently runs extract, transform, and load (ETL), ad hoc, and business intelligence (BI) queries. We're interested in breaking these workloads apart using data sharing into a 32 base Redshift Processing Unit (RPU) Serverless producer running the ETL workload and a 64 base RPU Serverless consumer running the BI workload. We used Workload Replicator to replay the workload on a replica baseline of the source and on the target data sharing configuration, as specified in the tutorial. The following image shows the process flow.
Generating and accessing Test Drive metrics
The results of Amazon Redshift Test Drive can be accessed using an external schema for analysis of a replay. Refer to the Workload Replicator README and the Configuration Comparison README for more detailed instructions on executing a replay with the respective tool.

The external schema for analysis is automatically created with the Configuration Comparison utility, in which case you can proceed directly to the SQL analysis in the Deploy the QEv2 SQL Notebook and analyze workload section. If you use Workload Replicator, however, the external schema is not created automatically and therefore needs to be configured as a prerequisite to the SQL analysis. In the following walkthrough, we demonstrate how the external schema can be set up, using a sample analysis of the data sharing use case.
Executing Test Drive Workload Replicator for data sharing
To execute Workload Replicator, use Amazon Elastic Compute Cloud (Amazon EC2) to run the automation scripts that extract the workload from the source.
Configure the Amazon Redshift data warehouse
- Create a snapshot following the guidance in the Amazon Redshift Management Guide.
- Enable audit logging following the guidance in the Amazon Redshift Management Guide.
- Enable user activity logging on the source cluster following the guidance in the Amazon Redshift Management Guide.

Enabling logging requires a change to the parameter group. Audit logging needs to be enabled prior to running the workload that will be replayed, because this is where the connections and SQL queries of the workload are extracted from.
- Launch the baseline replica by restoring the snapshot to a 2-node ra3.4xlarge provisioned cluster.
- Launch the producer warehouse by restoring the snapshot to a 32 RPU Serverless namespace.
- The consumer should not contain the schema and tables that will be shared from the producer. You can launch the 64 RPU Serverless consumer from the snapshot and then drop the relevant objects, or you can create a new 64 RPU Serverless consumer warehouse and recreate the consumer users.
- Create a datashare from the producer to the consumer and add the relevant objects.
Datashare objects can be read using two mechanisms: three-part notation (`database.schema.table`), or an external schema pointing to a shared schema, queried using two-part notation (`external_schema.table`). Because we want to seamlessly run the source workload, which uses two-part notation on the local objects, this post demonstrates the latter approach. For each schema shared from the producer, run the following command on the consumer:
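The exact command is not reproduced in this excerpt; a minimal sketch, assuming a datashare database named `producer_db` and a shared schema named `sales` (both hypothetical names), would be:

```sql
-- Run on the consumer warehouse. The local external schema keeps the same
-- name as the source schema, so two-part notation (sales.table_name)
-- resolves unchanged during the replay.
CREATE EXTERNAL SCHEMA sales
FROM REDSHIFT DATABASE 'producer_db' SCHEMA 'sales';
```

Here `producer_db` stands for the database previously created on the consumer from the datashare (for example, with `CREATE DATABASE producer_db FROM DATASHARE ...`).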
Make sure to use the same schema name as the source for the external schema. Also, if any queries run against the public schema, drop the local public schema first before creating its external equivalent.

- Grant usage on the schema to any relevant users.
Configure Redshift Test Drive Workload Replicator
- Create an S3 bucket to store the artifacts required by the utility (such as the metrics, the extracted workload, and the output data from running `UNLOAD` commands).
- Launch the following three types of EC2 instances using the recommended configuration of m5.8xlarge, 32 GB of SSD storage, and the Amazon Linux AMI:
  - Baseline instance
  - Target-producer instance
  - Target-consumer instance

Make sure you can connect to the EC2 instances to run the utility.
- For each instance, install the required libraries by completing the following steps from the GitHub repository:
  a. 2.i
  b. 2.ii (if an ODBC driver is to be used; the default is the Amazon Redshift Python driver)
  c. 2.iii
  d. 2.iv
  e. 2.v
- Create an AWS Identity and Access Management (IAM) role for the EC2 instances with access to the Amazon Redshift warehouses, read access to the S3 audit logging bucket, and both read and write access to the new S3 bucket created for storing replay artifacts.
- If you will run COPY and UNLOAD commands, create an IAM role with access to the required S3 buckets, and attach it to the Amazon Redshift warehouses that will execute the load and unload.

In this example, the IAM role is attached to the baseline replica and the producer warehouse because these will be executing the ETL processes. The utility updates `UNLOAD` commands to unload data to a bucket you define, which as a best practice should be the bucket created for S3 artifacts. Write permissions must be granted to the Amazon Redshift warehouse for this location.
Run Redshift Test Drive Workload Replicator
- Run `aws configure` on the EC2 instances and populate the default Region with the Region the utility is being executed in.
- The extract only needs to be run once, so connect to the baseline EC2 instance and run `vi config/extract.yaml` to open the `extract.yaml` file and configure the extraction details (press `i` to begin editing, then press `Escape` to leave edit mode and type `:wq!` to save and exit vi). For more details on the parameters, see Configure parameters.
The following code is an example of a configured extract that unloads the logs for a half-hour window to the Test Drive artifacts bucket and updates `COPY` commands to run with the POC Amazon Redshift role.
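The original example is not included in this excerpt. A minimal sketch under stated assumptions might look like the following; the endpoint, bucket, time window, and role ARN are placeholder values, and the parameter names should be verified against the Configure parameters documentation:

```yaml
# Hypothetical extract.yaml sketch -- placeholder values throughout;
# confirm key names against the Redshift Test Drive documentation.
source_cluster_endpoint: "source-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev"
master_username: "awsuser"
start_time: "2024-01-15T10:00:00+00:00"   # half-hour window start
end_time: "2024-01-15T10:30:00+00:00"     # half-hour window end
workload_location: "s3://testdriveartifacts/myworkload"
# Role substituted into replayed COPY commands (key name illustrative):
copy_replacement_iam_role: "arn:aws:iam::111122223333:role/poc-redshift-role"
```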
- Run `make extract` to extract the workload. When complete, note the folder created at the path specified in the `workload_location` parameter of the extract (s3://testdriveartifacts/myworkload/Extraction_xxxx-xx-xxTxx:xx:xx.xxxxxx+00:00).
- On the same baseline EC2 instance that will run the full workload on the source replica, run `vi config/replay.yaml` and configure the details with the workload location copied in the previous step 3 and the baseline warehouse endpoint. (See more details on the parameters in Configure parameters to run an extract job. The values after the `analysis_iam_role` parameter can be left at their defaults.)
The following code is an example of the beginning of a replay configuration for the source replica.
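The original snippet is not reproduced here; a sketch under the same caveats (placeholder endpoint, illustrative key names to be checked against the documentation) might begin:

```yaml
# Hypothetical replay.yaml sketch for the baseline replica -- placeholder values.
workload_location: "s3://testdriveartifacts/myworkload/Extraction_xxxx-xx-xxTxx:xx:xx.xxxxxx+00:00"
target_cluster_endpoint: "baseline-replica.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev"
master_username: "awsuser"
```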
- On the EC2 instance that will run the target-producer workload, run `vi config/replay.yaml`. Configure the details with the workload location copied in the previous step 3, the producer warehouse endpoint, and the other configuration as in step 4. To replay only the producer workload, add the appropriate users to include or exclude in the filters parameter.
The following code is an example of the filters used to exclude the BI workload from the producer.
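The filter block itself is not shown in this excerpt. Assuming the BI queries run as a user named `bi_user` (a hypothetical name), and with field names that should be verified against the Configure parameters documentation, it might look like:

```yaml
# Hypothetical filters sketch: replay everything except the BI user's queries.
filters:
  include:
    username: ['*']
  exclude:
    username: ['bi_user']
```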
- On the EC2 instance that will run the target-consumer workload, run `vi config/replay.yaml` and configure the details with the workload location copied in the previous step 3, the consumer warehouse endpoint, and the appropriate filters as in step 5. The same users that were excluded in the producer workload replay should be included in the consumer workload replay.

The following is an example of the filters used to run only the BI workload on the consumer.
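Under the same assumptions (hypothetical `bi_user`, illustrative field names), the mirror-image filter block might be:

```yaml
# Hypothetical filters sketch: replay only the BI user's queries.
filters:
  include:
    username: ['bi_user']
  exclude:
    username: []
```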
- Run `make replay` on the baseline instance, target-producer instance, and target-consumer instance concurrently to run the workload on the target warehouses.
Analyze the Workload Replicator output
- Create the folder structure in the S3 bucket that was created in the previous step.

For `comparison_stats_s3_path`, enter the S3 bucket and path name. For `what_if_timestamp`, enter the replay start time. For `cluster_identifier`, enter the target cluster name for easy identification.

The following screenshot shows an example of the folder structure.
- Use the following script, run on the baseline Redshift cluster using QEv2, to unload system table data for each target cluster to the corresponding Amazon S3 target path that was created previously.

For `what_if_timestamp`, enter the replay start time. For `comparison_stats_s3_path`, enter the S3 bucket and path name. For `cluster_identifier`, enter the target cluster name for easy identification. For `redshift_iam_role`, enter the Amazon Resource Name (ARN) of the Redshift IAM role for the target cluster.
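The script itself is not reproduced in this excerpt; a minimal sketch of such an unload (placeholder bucket, timestamp, and role ARN), filtering `sys_query_history` to queries that started after the replay began, might be:

```sql
-- Hypothetical sketch: unload replay-window query history for one cluster.
-- Run once per target cluster, changing the cluster_identifier subfolder.
UNLOAD ('SELECT * FROM sys_query_history
         WHERE start_time >= ''2024-01-15 10:00:00''')  -- what_if_timestamp
TO 's3://testdriveartifacts/comparison_stats/baseline/'  -- path/cluster_identifier
IAM_ROLE 'arn:aws:iam::111122223333:role/poc-redshift-role'
FORMAT AS PARQUET
ALLOWOVERWRITE;
```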
- Create an external schema in Amazon Redshift with the name `comparison_stats`.
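A minimal sketch of this step, assuming an AWS Glue Data Catalog database named `comparison_stats` (hypothetical) and a placeholder role ARN:

```sql
-- Hypothetical sketch: external schema over a Glue Data Catalog database.
CREATE EXTERNAL SCHEMA comparison_stats
FROM DATA CATALOG
DATABASE 'comparison_stats'
IAM_ROLE 'arn:aws:iam::111122223333:role/poc-redshift-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```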
- Create an external table in Amazon Redshift with the name `redshift_config_comparison_aggregate`, based on the Amazon S3 file location.
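The DDL is not shown in this excerpt; a sketch partitioned by `cluster_identifier`, with the column list abbreviated to a few illustrative `sys_query_history` fields and the placeholder location matching the earlier unload path:

```sql
-- Hypothetical sketch: partitioned external table over the unloaded Parquet files.
CREATE EXTERNAL TABLE comparison_stats.redshift_config_comparison_aggregate (
  query_id       BIGINT,
  user_id        INTEGER,
  query_type     VARCHAR(32),
  elapsed_time   BIGINT,
  execution_time BIGINT,
  compile_time   BIGINT
)
PARTITIONED BY (cluster_identifier VARCHAR(64))
STORED AS PARQUET
LOCATION 's3://testdriveartifacts/comparison_stats/';
```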
- After creating the partitioned table, alter the table using the following statement to register the partitions in the external catalog.

When you add a partition, you define the location of the subfolder on Amazon S3 that contains the partition data. Run the statement once for each cluster identifier.

Example:
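The example statement is not reproduced here; a sketch for a cluster identifier of `baseline` (placeholder bucket):

```sql
-- Hypothetical sketch: register one partition per cluster identifier.
ALTER TABLE comparison_stats.redshift_config_comparison_aggregate
ADD IF NOT EXISTS PARTITION (cluster_identifier = 'baseline')
LOCATION 's3://testdriveartifacts/comparison_stats/baseline/';
```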
Deploy the QEv2 SQL Notebook and analyze workload
In this section, we analyze the queries that were replayed on both the baseline and target clusters, focusing on the queries common to both.
- Download the analysis notebook from Amazon S3.
- Import the notebook into the baseline Redshift cluster using QEv2. For guidance, refer to Authoring and running notebooks.
- Create the stored procedure `common_queries_sp` in the same database that was used to create the external schema. The stored procedure creates a view called `common_queries` by querying the external table `redshift_config_comparison_aggregate` that was created in the previous steps.

The view identifies the queries common to both the baseline and target clusters, as described in the notebook.

- Execute the stored procedure, passing the cluster identifiers of the baseline and target clusters as parameters.

For this post, we passed the baseline and producer cluster identifiers as the parameters. Passing the cluster identifiers as parameters retrieves the data for only those specific clusters.
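A sketch of the call, assuming the identifiers used in this walkthrough are `baseline` and `producer` (placeholder values):

```sql
-- Hypothetical sketch: build the common_queries view for one cluster pair.
CALL common_queries_sp('baseline', 'producer');
```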
Once the `common_queries` view is created, you can perform further analysis using the subsequent queries available in the notebook. If you have more than one target cluster, you can follow the same analysis process for each. For this post, we have two target clusters: producer and consumer. We first performed the analysis between the baseline and producer clusters, then repeated the same process to analyze the data for the baseline versus the consumer cluster.
To analyze our workload, we use the `sys_query_history` view. We frequently use several columns from this view, including the following:

- `elapsed_time`: The end-to-end time of the query run
- `execution_time`: The time the query spent running. In the case of a SELECT query, this also includes the return time.
- `compile_time`: The time the query spent compiling
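As an illustration of how these columns are used, the following sketch compares the overall workload per cluster; the `common_queries` column names are assumptions based on the view described above, and `sys_query_history` reports times in microseconds:

```sql
-- Hypothetical sketch: total elapsed/execution/compile time per cluster
-- for the queries common to both clusters (microseconds -> seconds).
SELECT cluster_identifier,
       SUM(elapsed_time)   / 1000000.0 AS total_elapsed_s,
       SUM(execution_time) / 1000000.0 AS total_execution_s,
       SUM(compile_time)   / 1000000.0 AS total_compile_s
FROM common_queries
GROUP BY cluster_identifier
ORDER BY cluster_identifier;
```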
For extra info on sys_query_history
, seek advice from SYS_QUERY_HISTORY within the Amazon Redshift Database Developer Information. The next desk reveals the descriptions of the evaluation queries.
| # | Name of the query | Description |
| --- | --- | --- |
| 1 | Overall workload by user | Count of common queries between the baseline and target clusters, grouped by user |
| 2 | Overall workload by query type | Count of common queries between the baseline and target clusters, grouped by query type |
| 3 | Overall workload comparison (in seconds) | Compares the overall workload between the baseline and target clusters by analyzing the execution time, compile time, and elapsed time |
| 4 | Percentile workload comparison | The percentage of queries that perform at or below a given runtime (for example, a p50_s value of 5 seconds means 50% of queries in that workload ran in 5 seconds or less) |
| 5 | Number of improved/degraded/unchanged queries | The number of queries that degraded, stayed the same, or improved when comparing the elapsed time between the baseline and target clusters |
| 6 | Degree of query-level performance change (percentage) | The degree of change of each query from the baseline to the target, relative to the baseline performance |
| 7 | Comparison by query type (in seconds) | Compares the elapsed time of different query types, such as SELECT, INSERT, and COPY commands, between the baseline and target clusters |
| 8 | Top 10 slowest running queries (in seconds) | The 10 slowest queries across the baseline and target clusters, by comparing the elapsed time on both |
| 9 | Top 10 improved queries (in seconds) | The top 10 queries with the most improved elapsed time when comparing the baseline cluster to the target cluster |
Sample results analysis

In our example, the overall workload improvement of the workload isolation architecture using data sharing for the ETL workload, between the baseline and the producer, is 858 seconds (`baseline_elapsed_time` - `target_elapsed_time`) for the sample TPC data, as shown in the following screenshots.

The overall workload improvement of the workload isolation architecture using data sharing for the BI workload, between the baseline and the consumer, is 1,148 seconds (`baseline_elapsed_time` - `target_elapsed_time`) for the sample TPC data, as shown in the following screenshots.
Cleanup

Complete the following steps to clean up your resources:

- Delete the Redshift provisioned replica cluster and the two Redshift Serverless endpoints (32 RPU and 64 RPU)
- Delete the S3 bucket used to store the artifacts
- Delete the baseline, target-producer, and target-consumer EC2 instances
- Delete the IAM role created for the EC2 instances to access the Redshift clusters and S3 buckets
- Delete the IAM roles created for the Amazon Redshift warehouses to access the S3 buckets for `COPY` and `UNLOAD` commands
Conclusion

In this post, we walked you through the process of testing a workload isolation architecture using Amazon Redshift data sharing and the Test Drive utility. We demonstrated how you can use SQL for advanced price-performance analysis and compare different workloads on different target Redshift cluster configurations. We encourage you to evaluate your Amazon Redshift data sharing architecture using the Redshift Test Drive tool, and to use the provided SQL script to analyze the price-performance of your Amazon Redshift cluster.
About the Authors

Ayan Majumder is an Analytics Specialist Solutions Architect at AWS. His expertise lies in designing robust, scalable, and efficient cloud solutions for customers. Beyond his professional life, he enjoys traveling, photography, and outdoor activities.

Ekta Ahuja is an Amazon Redshift Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys landscape photography, traveling, and board games.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She is passionate about supporting customers in validating and optimizing analytics solutions by architecting proof of concept workloads designed to meet their specific needs.