Big Data

Speed up Amazon Redshift Knowledge Lake queries with AWS Glue Knowledge Catalog Column Statistics

2 October 2024

Amazon Redshift lets you effectively question and retrieve structured and semi-structured information from open format information in Amazon S3 information lake with out having to load the info into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your information lake, enabling you to run analytical queries. Amazon Redshift helps all kinds of tabular information codecs like CSV, JSON, Parquet, ORC and open tabular codecs like Apache Hudi, Linux basis Delta Lake and Apache Iceberg.

You create Redshift exterior tables by defining the construction to your information, S3 location of the information and registering them as tables in an exterior information catalog. The exterior information catalog will be AWS Glue Knowledge Catalog, the info catalog that comes with Amazon Athena, or your personal Apache Hive metastore.

During the last 12 months, Amazon Redshift added a number of efficiency optimizations for information lake queries throughout a number of areas of question engine equivalent to rewrite, planning, scan execution and consuming AWS Glue Knowledge Catalog column statistics. To get the perfect efficiency on information lake queries with Redshift, you should utilize AWS Glue Knowledge Catalog’s column statistics characteristic to gather statistics on Knowledge Lake tables. For Amazon Redshift Serverless situations, you will notice improved scan efficiency by way of elevated parallel processing of S3 information and this occurs mechanically based mostly on RPUs used.

On this put up, we spotlight the efficiency enhancements we noticed utilizing {industry} normal TPC-DS benchmarks. Total execution time of TPC-DS 3 TB benchmark improved by 3x. A number of the queries in our benchmark skilled as much as 12x velocity up.

Efficiency Enhancements

A number of efficiency optimizations had been achieved during the last 12 months to enhance efficiency of knowledge lake queries together with the next.

Devour AWS Glue Knowledge Catalog column statistics and tuning of Redshift optimizer to enhance high quality of question plans
Make the most of bloom filters for partition columns
Improved scan effectivity for Amazon Redshift Serverless situations by way of elevated parallel processing of information
Novel question rewrite guidelines to merge related scans
Sooner retrieval of metadata from AWS Glue Knowledge Catalog

To know the efficiency positive aspects, we examined the efficiency on the industry-standard TPC-DS benchmark utilizing 3 TB information units and queries which represents completely different buyer use circumstances. Efficiency was examined on a Redshift serverless information warehouse with 128 RPU. In our testing, the dataset was saved in Amazon S3 in Parquet format and AWS Glue Knowledge Catalog was used to handle exterior databases and tables. Truth tables had been partitioned on the date column, and every truth desk consisted of roughly 2,000 partitions. All the tables had their row depend desk property, numRows, set as per the spectrum question efficiency pointers.

We did a baseline run on Redshift patch model (patch 172) from final 12 months. Later, we ran all TPC-DS queries on newest patch model (patch 180) that features all efficiency optimizations added over final 12 months. Then we used AWS Glue Knowledge Catalog’s column statistics characteristic to compute statistics for all of the tables and measured enhancements with the presence of AWS Glue Knowledge Catalog column statistics.

Our evaluation revealed that the TPC-DS 3TB Parquet benchmark noticed substantial efficiency positive aspects with these optimizations. Particularly, partitioned Parquet with our newest optimizations achieved 2x sooner runtimes in comparison with the earlier implementation. Enabling AWS Glue Knowledge Catalog column statistics additional improved efficiency by 3x versus final 12 months. The next graph illustrates these runtime enhancements for the total benchmark (all TPC-DS queries) over the previous 12 months, together with the extra increase from utilizing AWS Glue Knowledge Catalog column statistics.

Improvement in total runtime of TPC-DS 3T workload

Determine 1: Enchancment in whole runtime of TPC-DS 3T workload

The next graph presents the highest queries from the TPC-DS benchmark with the best efficiency enchancment during the last 12 months with and with out AWS Glue Knowledge Catalog column statistics. You may see that efficiency improves so much when statistics exist on AWS Glue Knowledge Catalog (for particulars on how you can get statistics to your Knowledge Lake tables, please seek advice from optimizing question efficiency utilizing AWS Glue Knowledge Catalog column statistics). Particularly, multi-join queries will profit essentially the most from AWS Glue Knowledge Catalog column statistics as a result of the optimizer makes use of statistics to decide on the proper be a part of order and distribution technique.

Determine 2: Velocity-up in TPC-DS queries

Let’s focus on among the optimizations that contributed to improved question efficiency.

Optimizing with table-level statistics

Amazon Redshift’s design permits it to deal with large-scale information challenges with superior velocity and cost-efficiency. Its massively parallel processing (MPP) question engine, AI-powered question optimizer, auto-scaling capabilities, and different superior options permit Redshift to excel at looking, aggregating, and reworking petabytes of knowledge.

Nonetheless, even essentially the most highly effective methods can expertise efficiency degradation in the event that they encounter anti-patterns like grossly inaccurate desk statistics, such because the row depend metadata.

With out this significant metadata, Redshift’s question optimizer could also be restricted within the variety of attainable optimizations, particularly these associated to information distribution throughout question execution. This may have a big impression on general question efficiency.

For example this, think about the next easy question involving an interior be a part of between a big desk with billions of rows and a small desk with just a few hundred thousand rows.

choose small_table.sellerid, sum(large_table.qtysold)
from large_table, small_table
the place large_table.salesid = small_table.listid
 and small_table.listtime > '2023-12-01'
 and large_table.saletime > '2023-12-01'
group by 1 order by 1

If executed as-is, with the big desk on the right-hand facet of the be a part of, the question will result in sub-optimal efficiency. It’s because the big desk will have to be distributed (broadcast) to all Redshift compute nodes to carry out the interior be a part of with the small desk, as proven within the following diagram.

Inaccurate table statistics lead to limited optimizations and large amounts of data broadcast among compute nodes for a simple inner join

Determine 3: Inaccurate desk statistics result in restricted optimizations and huge quantities of knowledge broadcast amongst compute nodes for a easy interior be a part of

Now, think about a state of affairs the place the desk statistics, such because the row depend, are correct. This permits the Amazon Redshift question optimizer to make extra knowledgeable selections, equivalent to figuring out the optimum be a part of order. On this case, the optimizer would instantly rewrite the question to have the big desk on the left-hand facet of the interior be a part of, so that it’s the small desk that’s broadcast throughout the Redshift compute nodes, as illustrated within the following diagram.

Accurate table statistics lead to high degree of optimizations and very little data broadcast among compute nodes for a simple inner join

Determine 4: Correct desk statistics result in excessive diploma of optimizations and little or no information broadcast amongst compute nodes for a easy interior be a part of

Happily, Amazon Redshift mechanically maintains correct desk statistics for native tables by working the ANALYZE command within the background. For exterior tables (information lake tables), nonetheless, AWS Glue Knowledge Catalog column statistics are advisable to be used with Amazon Redshift as we are going to focus on within the subsequent part. For extra basic data on optimizing queries in Amazon Redshift, please seek advice from the documentation on elements affecting question efficiency, information redistribution, and Amazon Redshift finest practices for designing queries.

Enhancements with AWS Glue Knowledge Catalog column statistics

AWS Glue Knowledge Catalog has a characteristic to compute column degree statistics for Amazon S3 backed exterior tables. AWS Glue Knowledge Catalog can compute column degree statistics equivalent to NDV, Variety of Nulls, Min/Max and Avg. column width for the columns with out the necessity for added information pipelines. Amazon Redshift cost-based optimizer makes use of these statistics to give you higher high quality question plans. Along with consuming statistics, we additionally made a number of enhancements in cardinality estimations and value tuning to get top quality question plans thereby enhancing question efficiency.

TPC-DS 3TB dataset confirmed 40% enchancment in whole question execution time when these AWS Glue Knowledge Catalog column statistics had been supplied. Particular person TPC-DS queries confirmed as much as 5x enhancements in question execution time. A number of the queries that had higher impression in execution time are Q85, Q64, Q75, Q78, Q94, Q16, Q04, Q24 and Q11.

We are going to undergo an instance the place cost-based optimizer generated a greater question plan with statistics and the way it improved the execution time.

Let’s think about following less complicated model of TPC-DS Q64 to showcase the question plan variations with statistics.

choose i_product_name product_name
,i_item_sk item_sk
,ad1.ca_street_number b_street_number
,ad1.ca_street_name b_street_name
,ad1.ca_city b_city
,ad1.ca_zip b_zip
,d1.d_year as syear
,depend(*) cnt
,sum(ss_wholesale_cost) s1
,sum(ss_list_price) s2
,sum(ss_coupon_amt) s3
FROM   tpcds_3t_alls3_pp_ext.store_sales
,tpcds_3t_alls3_pp_ext.store_returns
,tpcds_3t_alls3_pp_ext.date_dim d1
,tpcds_3t_alls3_pp_ext.buyer
,tpcds_3t_alls3_pp_ext.customer_address ad1
,tpcds_3t_alls3_pp_ext.merchandise
WHERE
ss_sold_date_sk = d1.d_date_sk AND
ss_customer_sk = c_customer_sk AND

ss_addr_sk = ad1.ca_address_sk and
ss_item_sk = i_item_sk and
ss_item_sk = sr_item_sk and
ss_ticket_number = sr_ticket_number and
i_color in ('firebrick','papaya','orange','cream','turquoise','deep') and
i_current_price between 42 and 42 + 10 and
i_current_price between 42 + 1 and 42 + 15
group by i_product_name
,i_item_sk
,ad1.ca_street_number
,ad1.ca_street_name
,ad1.ca_city
,ad1.ca_zip
,d1.d_year

With out Statistics

Following determine represents the logical question plan of Q64. You may observe that cardinality estimation of joins is just not correct. With inaccurate cardinalities, optimizer produces a sub-optimal question plan resulting in greater execution time.

With Statistics

Following determine represents the logical question plan after consuming AWS Glue Knowledge Catalog column statistics. Primarily based on the highlighted modifications, you possibly can observe that the cardinality estimations of JOIN improved by many magnitudes serving to the optimizer to decide on a greater be a part of order and be a part of technique (broadcast DS_BCAST_INNER vs. distribute DS_DIST_BOTH). Switching the customer_address and buyer desk from interior to outer desk and making be a part of methods as distribute has main impression as a result of this reduces the info motion between the nodes and avoids spilling from hash desk.

Logical query plan of Q64 without statistics

Determine 5: Logical question plan of Q64 with out statistics

Logical query plan of Q64 after consuming column-level statistics

Determine 6: Logical question plan of Q64 after consuming AWS Glue Knowledge Catalog column statistics

This transformation in question plan improved the question execution time of Q64 from 383s to 81s.

Given the higher advantages with AWS Glue Knowledge Catalog column statistics for the optimizer, you need to think about accumulating stats to your information lake utilizing AWS Glue. In case your workload is a JOIN heavy workload, then accumulating stats will present higher enchancment in your workload. Consult with producing AWS Glue Knowledge Catalog column statistics for directions on how you can accumulate statistics in AWS Glue Knowledge Catalog.

Question rewrite optimization

We launched a brand new question rewrite rule which mixes scalar aggregates over the identical frequent expression utilizing barely completely different predicates. This rewrite resulted in efficiency enhancements on TPC-DS queries Q09, Q28, and Q88. Let’s concentrate on Q09 as a consultant of those queries, given by the next fragment:

SELECT CASE
WHEN (SELECT COUNT(*)
FROM store_sales
WHERE ss_quantity BETWEEN 1 AND 20) > 48409437
THEN (SELECT AVG(ss_ext_discount_amt)
FROM store_sales
WHERE ss_quantity BETWEEN 1 AND 20)
ELSE (SELECT AVG(ss_net_profit)
FROM store_sales
WHERE ss_quantity BETWEEN 1 AND 20) END
AS bucket1,
<<4 extra variations of the CASE expression above>>
FROM purpose
WHERE r_reason_sk = 1

In whole, there are 15 scans of the very fact desk store_sales, each returning varied aggregates over completely different subsets of knowledge. The engine first performs subquery removing and transforms the assorted expressions within the CASE statements into relational subtrees linked through cross merchandise, after which they’re fused into one subquery dealing with all scalar aggregates. The ensuing plan for Q09, described under utilizing SQL for readability, is given by:

SELECT CASE WHEN v1 > 48409437 THEN t1 ELSE e1 END,
<4 extra variations>
FROM (SELECT COUNT(CASE WHEN b1 THEN 1 END) AS v1,
AVG(CASE WHEN b1 THEN ss_ext_discount_amt END) AS t1,
AVG(CASE WHEN b1 THEN ss_net_profit END) AS e1,
<4 extra variations>
FROM purpose,
(SELECT *,
ss_quantity BETWEEN 1 AND 20 AS b1,
<4 extra variations>
FROM store_sales
WHERE ss_quantity BETWEEN 1 AND 20 OR
<4 extra variations>))
WHERE r_reason_sk = 1)

Generally, this rewrite rule leads to the most important enhancements each in latency (from 3x to 8x enhancements) and bytes learn from Amazon S3 (from 6x to 8x discount in scanned bytes and, consequently, value).

Bloom filter for partition columns

Amazon Redshift already makes use of Bloom filters on information columns of exterior tables in Amazon S3 to allow early and efficient information filtering. Final 12 months, we prolonged this help for partition columns as nicely. A Bloom filter is a probabilistic, memory-efficient information construction that accelerates be a part of queries at scale by filtering rows that don’t match the be a part of relation, considerably lowering the quantity of knowledge transferred over the community. Amazon Redshift mechanically determines what queries are appropriate for leveraging Bloom filters at question runtime.

This optimization resulted in efficiency enhancements on TPC-DS queries Q05, Q17 and Q54. This optimization resulted in giant enhancements in each latency (from 2x to 3x enchancment) and bytes learn from S3 (from 9x to 15x discount in scanned bytes and, consequently value).

Following is the subquery of Q05 which showcased enhancements with runtime filter.

choose s_store_id,
sum(sales_price) as gross sales,
sum(revenue) as revenue,
sum(return_amt) as returns,
sum(net_loss) as profit_loss
from
( choose  ss_store_sk as store_sk,
ss_sold_date_sk  as date_sk,
ss_ext_sales_price as sales_price,
ss_net_profit as revenue,
forged(0 as decimal(7,2)) as return_amt,
forged(0 as decimal(7,2)) as net_loss
from tpcds_3t_alls3_pp_ext.store_sales
union all
choose sr_store_sk as store_sk,
sr_returned_date_sk as date_sk,
forged(0 as decimal(7,2)) as sales_price,
forged(0 as decimal(7,2)) as revenue,
sr_return_amt as return_amt,
sr_net_loss as net_loss
from tpcds_3t_alls3_pp_ext.store_returns
) salesreturnss,
tpcds_3t_alls3_pp_ext.date_dim,
tpcds_3t_alls3_pp_ext.retailer
the place date_sk = d_date_sk
and d_date between forged('1998-08-13' as date)
and (forged('1998-08-13' as date) +  14)
and store_sk = s_store_sk
group by s_store_id

With out bloom filter help on partition columns

Following determine is the logical question plan for sub-query of Q05. This appends two giant truth tables store_sales (8B rows) and store_returns (863M rows) after which joins with very selective dimension tables date_dim after which with dimension desk retailer. You may observe that be a part of with date_dim desk reduces the variety of rows from 9B to 93M rows.

With bloom filter help on partition columns

With help of bloom filter on partition columns, we now create bloom filter for d_date_sk column of date_dim desk and push down the bloom filters to store_sales and store_returns desk. These bloom filters assist to filter out the partitions in each store_sales and store_returns desk as a result of be a part of occurs on partition column (variety of partitions processed reduces by 10x).

Logical query plan for sub-query of Q05 without bloom filter support on partition columns

Determine 7: Logical question plan for sub-query of Q05 with out bloom filter help on partition columns

Logical query plan for sub-query of Q05 with bloom filter support on partition columns

Determine 8: Logical question plan for sub-query of Q05 with bloom filter help on partition columns

Total, bloom filter on partition column will scale back the variety of partitions processed leading to lowered S3 itemizing calls and lesser variety of information information to be learn (discount in scanned bytes). You may see that we solely scan 89M rows from store_sales and 4M rows from store_returns due to the bloom filter. This lowered variety of rows to course of at JOIN degree and helped in enhancing the general question efficiency by 2x and scanned bytes by 9x.

Conclusion

On this put up, we coated new efficiency optimizations in Amazon Redshift information lake question processing and the way AWS Glue Knowledge Catalog statistics helps to boost high quality of question plans for information lake queries in Amazon Redshift. These optimizations collectively improved TPC-DS 3 TB benchmark by 3x. A number of the queries in our benchmark benefited as much as 12x velocity up.

In abstract, Amazon Redshift now presents enhanced question efficiency with optimizations equivalent to AWS Glue Knowledge Catalog column statistics, bloom filters on partition columns, new question rewrite guidelines and sooner retrieval of metadata. These optimizations are enabled by default and Amazon Redshift customers will profit with higher question response instances for his or her workloads. For extra data, please attain out to your AWS technical account supervisor or AWS account options architect. They are going to be glad to offer further steering and help.

Concerning the authors

Kalaiselvi Kamaraj is a Sr. Software program Improvement Engineer with Amazon. She has labored on a number of initiatives inside Redshift Question processing crew and at present specializing in efficiency associated initiatives for Redshift Knowledge Lake.

Mark Lyons is a Principal Product Supervisor on the Amazon Redshift crew. He works on the intersection of knowledge lakes and information warehouses. Previous to becoming a member of AWS, Mark held product management roles with Dremio and Vertica. He’s enthusiastic about information analytics and empowering prospects to vary the world with their information.

Asser Moustafa is a Principal Worldwide Specialist Options Architect at AWS, based mostly in Dallas, Texas, USA. He companions with prospects worldwide, advising them on all facets of their information architectures, migrations, and strategic information visions to assist organizations undertake cloud-based options, maximize the worth of their information belongings, modernize legacy infrastructures, and implement cutting-edge capabilities like machine studying and superior analytics. Previous to becoming a member of AWS, Asser held varied information and analytics management roles, finishing an MBA from New York College and an MS in Laptop Science from Columbia College in New York. He’s enthusiastic about empowering organizations to grow to be really data-driven and unlock the transformative potential of their information.