Amazon Redshift has established itself as a highly scalable, fully managed cloud data warehouse trusted by tens of thousands of customers for its superior price-performance and advanced data analytics capabilities. Driven primarily by customer feedback, the product roadmap for Amazon Redshift is designed to make sure the service continuously evolves to meet the ever-changing needs of its users.
Over the years, this customer-centric approach has led to the introduction of groundbreaking features such as zero-ETL, data sharing, streaming ingestion, data lake integration, Amazon Redshift ML, Amazon Q generative SQL, and transactional data lake capabilities. The latest innovation in Amazon Redshift data sharing capabilities further enhances the service's flexibility and collaboration potential.
Amazon Redshift now enables the secure sharing of data lake tables (also known as external tables or Amazon Redshift Spectrum tables) that are managed in the AWS Glue Data Catalog, as well as Redshift views referencing those data lake tables. This empowers data analytics to span the full breadth of shareable data, allowing you to seamlessly share local tables and data lake tables across warehouses, accounts, and AWS Regions, without the overhead of physical data movement or recreating security policies for data lake tables and Redshift views on each warehouse.
By using granular access controls, data sharing in Amazon Redshift helps data owners maintain tight governance over who can access the shared information. In this post, we explore powerful use cases that demonstrate how you can enhance cross-team and cross-organizational collaboration, reduce overhead, and unlock new insights by using this data sharing functionality.
Overview of Amazon Redshift data sharing
Amazon Redshift data sharing lets you securely share your data with other Redshift warehouses, without having to copy or move the data.
Data shared between warehouses doesn't need to be physically copied or moved; instead, the data stays in the original Redshift warehouse, and access is granted to other authorized users as part of a one-time setup. Data sharing provides granular access control, allowing you to manage which specific tables or views are shared, and which users or services can access the shared data.
Because consumers access the shared data in place, they always see the latest state of the shared data. Data sharing even allows for the automatic sharing of new tables created after the datashare was established.
You can share data across different Redshift warehouses within or across AWS accounts, and you can also share data across AWS Regions. This lets you share data with partners, subsidiaries, or other parts of your organization, and enables the powerful workload isolation use case, as shown in the following diagram. With the seamless integration of Amazon Redshift with AWS Data Exchange, data can also be monetized and shared publicly, and public datasets such as census data can be added to a Redshift warehouse in just a few steps.

Figure 1: Amazon Redshift data sharing between producer and consumer warehouses
The data sharing capabilities in Amazon Redshift also enable the implementation of a data mesh architecture, as shown in the following diagram. This helps democratize data across the organization by lowering the barriers to accessing and using data across different business units and teams. For datasets with multiple authors, Amazon Redshift data sharing supports both read and write use cases (write in preview at the time of writing). This enables the creation of 360-degree datasets, such as a customer dataset that receives contributions from multiple Redshift warehouses across different business units in the organization.

Figure 2: Data mesh architecture using Amazon Redshift data sharing
Overview of Redshift Spectrum and data lake tables
In the modern data organization, the data lake has emerged as a centralized repository: a single source of truth where all data across the organization eventually resides at some point in its lifecycle. Redshift Spectrum enables seamless integration between the Redshift data warehouse and customers' data lakes, as shown in the following diagram. With Redshift Spectrum, you can run SQL queries directly against data stored in Amazon Simple Storage Service (Amazon S3), without the need to first load that data into a Redshift warehouse. This lets you maintain a comprehensive view of your data while optimizing for cost-efficiency.

Figure 3: Amazon Redshift bridges the data warehouse and data lake by enabling querying of data lake tables in place
Redshift Spectrum supports a variety of open file formats, including Parquet, ORC, JSON, and CSV, as well as open table formats such as Apache Iceberg, all stored in Amazon S3. It runs these queries using a dedicated fleet of high-performance servers with low-latency connections to the S3 data lake. Data lake tables can be added to a Redshift warehouse either automatically through the Data Catalog, in the Amazon Redshift Query Editor, or manually using SQL commands.
From a user experience standpoint, there is little difference between querying a local Redshift table and querying a data lake table. SQL queries can be reused verbatim to perform the same aggregations and transformations on data residing in the data lake, as shown in the following examples. Additionally, by using columnar file formats like Parquet and pushing down query predicates, you can achieve further performance improvements.
The following SQL is a sample query against local Redshift tables:
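The original code sample is not reproduced here; the following is a minimal sketch of such a query, assuming hypothetical local tables public.orders and public.customers:

```sql
-- Aggregate query against local Redshift tables (hypothetical schema)
SELECT c.customer_name,
       SUM(o.order_total) AS total_spend
FROM public.orders o
JOIN public.customers c
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY total_spend DESC
LIMIT 10;
```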
The following SQL is the same query, but against data lake tables:
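A sketch of the equivalent query against data lake tables; only the schema qualifier changes (here, a hypothetical external schema named sales_spectrum):

```sql
-- Same aggregation, now against external (data lake) tables
SELECT c.customer_name,
       SUM(o.order_total) AS total_spend
FROM sales_spectrum.orders o
JOIN sales_spectrum.customers c
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY total_spend DESC
LIMIT 10;
```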
To maintain robust data governance, Redshift Spectrum integrates with AWS Lake Formation, enabling the consistent application of security policies and access controls across both the Redshift data warehouse and the S3 data lake. When Lake Formation is used, Redshift producer warehouses first share their data with Lake Formation rather than directly with other Redshift consumer warehouses, and the data lake administrator grants fine-grained permissions for Redshift consumer warehouses to access the shared data. For more information, see Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation.
Previously, however, sharing data lake tables across Redshift warehouses presented challenges. It wasn't possible to do so without mounting the data lake tables on each individual Redshift warehouse and then recreating the related security policies.
This barrier has now been addressed with the introduction of data sharing support for data lake tables. You can now share data lake tables just like any other table, using the built-in data sharing capabilities of Amazon Redshift. By combining the power of Redshift Spectrum data lake integration with the flexibility of Amazon Redshift data sharing, organizations can unlock new levels of cross-team collaboration and insight while maintaining robust data governance and security controls.
For more information about Redshift Spectrum, see Getting started with Amazon Redshift Spectrum.
Solution overview
In this post, we describe how to add data lake tables or views to a Redshift datashare, covering two key use cases:
- Adding a late-binding view or materialized view that references a data lake table to a producer datashare
- Adding a data lake table directly to a producer datashare
The first use case provides greater flexibility and convenience. Consumers can query the shared view without having to configure fine-grained permissions. The configuration, such as defining permissions on data stored in Amazon S3 with Lake Formation, is already handled on the producer side. You only need to add the view to the producer datashare one time, making it a convenient option for both the producer and the consumer.
An additional benefit of this approach is that you can add views to a datashare that join data lake tables with local Redshift tables. When these views are shared, you can keep the trusted business logic solely on the producer side.
Alternatively, you can add data lake tables directly to a datashare. In this case, consumers can query the data lake tables directly or join them with their own local tables, allowing them to add their own conditional logic as needed.
Add a view that references a data lake table to a Redshift datashare
When you create data lake tables that you intend to add to a datashare, the recommended and most common way to do so is to add a view to the datashare that references one or more data lake tables. There are three high-level steps involved:
- Add the Redshift view's schema (the local schema) to the Redshift datashare.
- Add the Redshift view (the local view) to the Redshift datashare.
- Add the Redshift external schemas (for the tables referenced by the Redshift view) to the Redshift datashare.
The following diagram illustrates the full workflow.

Figure 4: Sharing data lake tables through Amazon Redshift views
The workflow consists of the following steps:
- Create a data lake table on the datashare producer. For more information on creating Redshift Spectrum objects, see External schemas for Amazon Redshift Spectrum. Data lake tables to be shared can include Lake Formation registered tables and Data Catalog tables, and if you're using the Redshift Query Editor, these tables are automatically mounted.
- Create a view on the producer that references the data lake table you created.
- Create a datashare, if one doesn't already exist, and add objects to your datashare, including the view you created that references the data lake table. For more information, see Creating datashares and adding objects (preview).
- Add the external schema of the base Redshift table to the datashare (this applies to both local base tables and data lake tables). You don't have to add the data lake table itself to the datashare.
- On the consumer, the administrator makes the view available to consumer database users.
- Consumer database users can write queries to retrieve data from the shared view and join it with other tables and views on the consumer.
After these steps are complete, consumer database users with access to the datashare views can reference them in their SQL queries. The following SQL queries are examples for achieving the preceding steps.
Create a data lake table on the producer warehouse:
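The original SQL is not reproduced here; the following sketch shows the general shape of this step, assuming a hypothetical external schema sales_spectrum, Glue database sales_db, IAM role, and S3 location:

```sql
-- Map a Glue Data Catalog database into Redshift as an external schema
CREATE EXTERNAL SCHEMA sales_spectrum
FROM DATA CATALOG
DATABASE 'sales_db'
IAM_ROLE 'arn:aws:iam::111122223333:role/MySpectrumRole';

-- Define a data lake table over Parquet files in Amazon S3
CREATE EXTERNAL TABLE sales_spectrum.daily_sales (
    sale_id   INT,
    sale_date DATE,
    amount    DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://amzn-s3-demo-bucket/daily-sales/';
```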
Create a view on the producer warehouse:
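A sketch of this step, using the hypothetical external table from the previous example; note that views referencing external tables must be created as late-binding views, using WITH NO SCHEMA BINDING:

```sql
-- Late-binding view over the data lake table
CREATE VIEW public.daily_sales_summary AS
SELECT sale_date,
       SUM(amount) AS total_amount
FROM sales_spectrum.daily_sales
GROUP BY sale_date
WITH NO SCHEMA BINDING;
```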
Add the view to the datashare on the producer warehouse:
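A sketch of the datashare setup, using hypothetical object names and a placeholder consumer namespace GUID:

```sql
CREATE DATASHARE sales_share;

-- Share the local schema and the view itself
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.daily_sales_summary;

-- Also share the external schema of the referenced data lake table
ALTER DATASHARE sales_share ADD SCHEMA sales_spectrum;

-- Grant the datashare to the consumer namespace
GRANT USAGE ON DATASHARE sales_share
TO NAMESPACE '<consumer-namespace-guid>';
```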
Create a database from the datashare and grant permissions for the view in the consumer warehouse:
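A sketch of the consumer-side setup, assuming the producer namespace GUID and a hypothetical consumer user analyst_user:

```sql
-- Create a local database from the shared datashare
CREATE DATABASE sales_share_db
FROM DATASHARE sales_share OF NAMESPACE '<producer-namespace-guid>';

-- Allow a consumer database user to query the shared view
GRANT USAGE ON DATABASE sales_share_db TO analyst_user;

-- The consumer user can now reference the shared view
SELECT * FROM sales_share_db.public.daily_sales_summary;
```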
Add a data lake table directly to a Redshift datashare
Adding a data lake table to a datashare is similar to adding a view. This approach works well when consumers want the raw data from the data lake table and intend to write queries and join it with tables in their own data warehouse. There are two high-level steps involved:
- Add the Redshift external schemas (of the data lake tables to be shared) to the Redshift datashare.
- Add the data lake table (the Redshift external table) to the Redshift datashare.
The following diagram illustrates the full workflow.

Figure 5: Sharing data lake tables directly in an Amazon Redshift datashare
The workflow consists of the following steps:
- Create a data lake table on the datashare producer.
- Add objects to your datashare, including the data lake table you created. In this case, there is no abstraction over the table.
- On the consumer, the administrator makes the table available.
- Consumer database users can write queries to retrieve data from the shared table and join it with other tables and views on the consumer.
The following SQL queries are examples for achieving the preceding producer steps.
Create a data lake table on the producer warehouse:
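As in the previous use case, a sketch with hypothetical names (an external schema sales_spectrum backed by a Glue database and an S3 location):

```sql
CREATE EXTERNAL SCHEMA sales_spectrum
FROM DATA CATALOG
DATABASE 'sales_db'
IAM_ROLE 'arn:aws:iam::111122223333:role/MySpectrumRole';

CREATE EXTERNAL TABLE sales_spectrum.daily_sales (
    sale_id   INT,
    sale_date DATE,
    amount    DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://amzn-s3-demo-bucket/daily-sales/';
```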
Add the data lake schema and table directly to the datashare on the producer warehouse:
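A sketch of sharing the external schema and the table itself, using the hypothetical names above and a placeholder consumer namespace GUID:

```sql
CREATE DATASHARE lake_share;

-- Add the external schema and the data lake table directly
ALTER DATASHARE lake_share ADD SCHEMA sales_spectrum;
ALTER DATASHARE lake_share ADD TABLE sales_spectrum.daily_sales;

GRANT USAGE ON DATASHARE lake_share
TO NAMESPACE '<consumer-namespace-guid>';
```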
Create a database from the datashare and grant permissions for the table in the consumer warehouse:
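A consumer-side sketch, assuming the producer namespace GUID and a hypothetical consumer user analyst_user:

```sql
CREATE DATABASE lake_share_db
FROM DATASHARE lake_share OF NAMESPACE '<producer-namespace-guid>';

GRANT USAGE ON DATABASE lake_share_db TO analyst_user;

-- Query the shared data lake table directly on the consumer
SELECT sale_date, SUM(amount) AS total_amount
FROM lake_share_db.sales_spectrum.daily_sales
GROUP BY sale_date;
```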
Security considerations for sharing data lake tables and views
Data lake tables are stored outside of Amazon Redshift, in the data lake, and may not be owned by the Redshift warehouse, but they are still referenced within Amazon Redshift. This setup requires special security considerations. Data lake tables operate under the security and governance of both Amazon Redshift and the data lake. For Lake Formation registered tables specifically, the Amazon S3 resources are secured by Lake Formation and made available to consumers using the provided credentials.
The owner of the data in the data lake tables may want to impose restrictions on which external objects can be added to a datashare. To give data owners more control over whether warehouse users can share data lake tables, you can use session tags in AWS Identity and Access Management (IAM). These tags provide additional context about the user running the queries. For more details on tagging resources, refer to Tags for AWS Identity and Access Management resources.
Audit considerations for sharing data lake tables and views
When sharing data lake objects through a datashare, there are specific logging considerations to keep in mind:
- Access controls – You can use CloudTrail log data in conjunction with IAM policies to control access to shared tables, covering both Redshift datashare producers and consumers. The CloudTrail logs record details about who accesses shared tables. The identifiers in the log data are available in the ExternalId field under the AssumeRole CloudTrail logs. The data owner can configure additional limitations on data access in an IAM policy by way of actions. For more information about defining data access through policies, see Access to AWS accounts owned by third parties.
- Centralized access – Amazon S3 resources such as data lake tables can be registered and centrally managed with Lake Formation. After they're registered with Lake Formation, Amazon S3 resources are secured and governed by the associated Lake Formation policies and made available using the credentials provided by Lake Formation.
Billing considerations for sharing data lake tables and views
The billing model for Redshift Spectrum differs between Amazon Redshift provisioned and serverless warehouses. For provisioned warehouses, Redshift Spectrum queries (queries involving data lake tables) are billed based on the amount of data scanned during query execution. For serverless warehouses, data lake queries are billed the same as non-data-lake queries. Storage for data lake tables is always billed to the AWS account associated with the Amazon S3 data.
In the case of datashares involving data lake tables, the costs of storing and scanning data lake objects in a datashare are attributed as follows:
- When a consumer queries shared objects from a data lake, the cost of scanning is billed to the consumer:
- When the consumer is a provisioned warehouse, Amazon Redshift uses Redshift Spectrum to scan the Amazon S3 data. Therefore, the Redshift Spectrum cost is billed to the consumer account.
- When the consumer is an Amazon Redshift Serverless workgroup, there is no separate charge for data lake queries.
- Amazon S3 costs for storage and operations, such as listing buckets, are billed to the account that owns each S3 bucket.
For detailed information on Redshift Spectrum billing, refer to Amazon Redshift pricing and Billing for storage.
Conclusion
In this post, we explored how the enhanced data sharing capabilities of Amazon Redshift, including support for sharing data lake tables and Redshift views that reference those data lake tables, empower organizations to unlock the full potential of their data by bringing the full breadth of their data assets into scope for advanced analytics. Organizations can now seamlessly share local tables and data lake tables across warehouses, accounts, and Regions.
We outlined the steps to securely share data lake tables, and views that reference those data lake tables, across Redshift warehouses, even those in separate AWS accounts or Regions. Additionally, we covered some considerations and best practices to keep in mind when using this feature.
Sharing data lake tables and views through Amazon Redshift data sharing supports the modern, data-driven organization's goal of democratizing data access in a secure, scalable, and efficient manner. By eliminating the need for physical data movement or duplication, this capability reduces overhead and enables seamless cross-team and cross-organizational collaboration. Extending your data analytics to span the full breadth of your local tables and data lake tables is just a few steps away.
For more information on Amazon Redshift data sharing and how it can benefit your organization, refer to the following resources:
You can also reach out to your AWS technical account manager or AWS account Solutions Architect. They will be happy to provide additional guidance and support.
About the Authors
Mohammed Alkateb is an Engineering Manager at Amazon Redshift. Prior to joining Amazon, Mohammed had 12 years of industry experience in query optimization and database internals as an individual contributor and engineering manager. Mohammed has 18 US patents, and he has publications in the research and industrial tracks of premier database conferences including EDBT, ICDE, SIGMOD, and VLDB. Mohammed holds a PhD in Computer Science from the University of Vermont, and MSc and BSc degrees in Information Systems from Cairo University.
Ramchandra Anil Kulkarni is a software development engineer who has been with Amazon Redshift for over 4 years. He is driven to develop database innovations that serve AWS customers globally. Kulkarni's long-standing tenure and dedication to the Amazon Redshift service demonstrate his deep expertise and commitment to delivering cutting-edge database solutions that empower AWS customers worldwide.
Mark Lyons is a Principal Product Manager on the Amazon Redshift team. He works at the intersection of data lakes and data warehouses. Prior to joining AWS, Mark held product management roles with Dremio and Vertica. He is passionate about data analytics and empowering customers to change the world with their data.
Asser Moustafa is a Principal Worldwide Specialist Solutions Architect at AWS, based in Dallas, Texas. He partners with customers worldwide, advising them on all aspects of their data architectures, migrations, and strategic data visions to help organizations adopt cloud-based solutions, maximize the value of their data assets, modernize legacy infrastructures, and implement cutting-edge capabilities like machine learning and advanced analytics. Prior to joining AWS, Asser held various data and analytics leadership roles, completing an MBA from New York University and an MS in Computer Science from Columbia University in New York. He is passionate about empowering organizations to become truly data-driven and unlock the transformative potential of their data.