Amazon Redshift, a data warehousing service, offers a wide range of options for ingesting data from various sources into its high-performance, scalable environment. Whether your data resides in operational databases, data lakes, on-premises systems, Amazon Elastic Compute Cloud (Amazon EC2), or other AWS services, Amazon Redshift provides multiple ingestion methods to meet your specific needs. The currently available choices include:
- The Amazon Redshift COPY command can load data from Amazon Simple Storage Service (Amazon S3), Amazon EMR, Amazon DynamoDB, or remote hosts over SSH. This native feature of Amazon Redshift uses massively parallel processing (MPP) to load objects directly from data sources into Redshift tables. In addition, the auto-copy feature simplifies and automates data loading from Amazon S3 into Amazon Redshift.
- Amazon Redshift federated queries run queries using the source database's compute, with the results returned to Amazon Redshift.
- Amazon Redshift zero-ETL integrations can load data from Amazon Aurora MySQL-Compatible Edition, Amazon Relational Database Service (Amazon RDS) for MySQL, Amazon RDS for PostgreSQL, and DynamoDB, with the added ability to perform transformations after loading.
- The Amazon Redshift integration for Apache Spark, combined with AWS Glue or Amazon EMR, performs transformations before loading data into Amazon Redshift.
- Amazon Redshift streaming ingestion supports streaming sources, including Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon Data Firehose.
- Finally, data can be loaded into Amazon Redshift with popular ETL tools such as Informatica, Matillion, and dbt Labs.
This post explores each option (as illustrated in the following figure), examines which options are best suited for different use cases, and discusses how and why to select a particular Amazon Redshift tool or feature for data ingestion.
Amazon Redshift COPY command
The Redshift COPY command, a simple low-code data ingestion tool, loads data into Amazon Redshift from Amazon S3, DynamoDB, Amazon EMR, and remote hosts over SSH. It's a fast and efficient way to load large datasets into Amazon Redshift. It uses the massively parallel processing (MPP) architecture in Amazon Redshift to read and load large amounts of data in parallel from files or data from supported data sources. You can take full advantage of this parallelism by splitting data into multiple files, especially when the files are compressed.
Recommended use cases for the COPY command include loading large datasets and data from supported data sources. COPY automatically splits large uncompressed delimited text files into smaller scan ranges to utilize the parallelism of Amazon Redshift provisioned clusters and serverless workgroups. With auto-copy, automation enhances the COPY command by adding jobs for automatic ingestion of data.
COPY command advantages:
- Performance – Efficiently loads large datasets from Amazon S3 or other sources in parallel with optimized throughput
- Simplicity – Straightforward and user-friendly, requiring minimal setup
- Cost-optimized – Uses Amazon Redshift MPP at a lower cost by reducing data transfer time
- Flexibility – Supports file formats such as CSV, JSON, Parquet, ORC, and AVRO
Amazon Redshift federated queries
Amazon Redshift federated queries allow you to incorporate live data from Amazon RDS or Aurora operational databases as part of business intelligence (BI) and reporting applications.
Federated queries are useful for use cases where organizations want to combine data from their operational systems with data stored in Amazon Redshift. Federated queries allow querying data across Amazon RDS for MySQL and PostgreSQL data sources without the need for extract, transform, and load (ETL) pipelines. If storing operational data in a data warehouse is a requirement, synchronization of tables between operational data stores and Amazon Redshift tables is supported. In scenarios where data transformation is required, you can use Redshift stored procedures to modify data in Redshift tables.
Federated queries key features:
- Real-time access – Enables querying of live data across discrete sources, such as Amazon RDS and Aurora, without the need to move the data
- Unified data view – Provides a single view of data across multiple databases, simplifying data analysis and reporting
- Cost savings – Eliminates the need for ETL processes to move data into Amazon Redshift, saving on storage and compute costs
- Flexibility – Supports Amazon RDS and Aurora data sources, offering flexibility in accessing and analyzing distributed data
Amazon Redshift Zero-ETL integration
Aurora zero-ETL integration with Amazon Redshift enables access to operational data from Amazon Aurora MySQL-Compatible Edition (and Amazon Aurora PostgreSQL-Compatible Edition and Amazon RDS for MySQL in preview) and DynamoDB from Amazon Redshift in near real time, without the need for ETL. You can use zero-ETL to simplify ingestion pipelines for performing change data capture (CDC) from an Aurora database to Amazon Redshift. Built on the integration of the Amazon Redshift and Aurora storage layers, zero-ETL boasts simple setup, data filtering, automated observability, auto-recovery, and integration with either Amazon Redshift provisioned clusters or Amazon Redshift Serverless workgroups.
Zero-ETL integration benefits:
- Seamless integration – Automatically integrates and synchronizes data between operational databases and Amazon Redshift without the need for custom ETL processes
- Near real-time insights – Provides near real-time data updates, so the most current data is available for analysis
- Ease of use – Simplifies data architecture by eliminating the need for separate ETL tools and processes
- Efficiency – Minimizes data latency and provides data consistency across systems, enhancing overall data accuracy and reliability
Amazon Redshift integration for Apache Spark
The Amazon Redshift integration for Apache Spark, automatically included through Amazon EMR or AWS Glue, provides performance and security optimizations compared to the community-provided connector. The integration enhances and simplifies security with AWS Identity and Access Management (IAM) authentication support. AWS Glue 4.0 provides a visual ETL tool for authoring jobs to read from and write to Amazon Redshift, using the Redshift Spark connector for connectivity. This simplifies the process of building ETL pipelines to Amazon Redshift. The Spark connector allows Spark applications to process and transform data before loading it into Amazon Redshift. The integration minimizes the manual process of setting up a Spark connector and shortens the time needed to prepare for analytics and machine learning (ML) tasks. It allows you to specify the connection to a data warehouse and start working with Amazon Redshift data from your Apache Spark-based applications within minutes.
The integration provides pushdown capabilities for sort, aggregate, limit, join, and scalar function operations to optimize performance by moving only the relevant data from Amazon Redshift to the consuming Apache Spark application. Spark jobs are suitable for data processing pipelines and when you need to use Spark's advanced data transformation capabilities.
With the Amazon Redshift integration for Apache Spark, you can simplify the building of ETL pipelines with data transformation requirements. It offers the following benefits:
- High performance – Uses the distributed computing power of Apache Spark for large-scale data processing and analysis
- Scalability – Effortlessly scales to handle massive datasets by distributing computation across multiple nodes
- Flexibility – Supports a wide range of data sources and formats, providing versatility in data processing tasks
- Interoperability – Seamlessly integrates with Amazon Redshift for efficient data transfer and queries
Amazon Redshift streaming ingestion
The key benefit of Amazon Redshift streaming ingestion is the ability to ingest hundreds of megabytes of data per second directly from streaming sources into Amazon Redshift with very low latency, supporting real-time analytics and insights. Supporting streams from Kinesis Data Streams, Amazon MSK, and Data Firehose, streaming ingestion requires no data staging, supports flexible schemas, and is configured with SQL. Streaming ingestion powers real-time dashboards and operational analytics by directly ingesting data into Amazon Redshift materialized views.
Amazon Redshift streaming ingestion unlocks near real-time streaming analytics with:
- Low latency – Ingests streaming data in near real time, making streaming ingestion ideal for time-sensitive applications such as Internet of Things (IoT), financial transactions, and clickstream analysis
- Scalability – Manages high throughput and large volumes of streaming data from sources such as Kinesis Data Streams, Amazon MSK, and Data Firehose
- Integration – Integrates with other AWS services to build end-to-end streaming data pipelines
- Continuous updates – Keeps data in Amazon Redshift continuously updated with the latest information from the data streams
Amazon Redshift ingestion use cases and examples
In this section, we discuss the details of different Amazon Redshift ingestion use cases and provide examples.
Redshift COPY use case: Application log data ingestion and analysis
Ingesting application log data stored in Amazon S3 is a common use case for the Redshift COPY command. Data engineers in an organization need to analyze application log data to gain insights into user behavior, identify potential issues, and optimize a platform's performance. To achieve this, data engineers ingest log data in parallel from multiple files stored in S3 buckets into Redshift tables. This parallelization uses the Amazon Redshift MPP architecture, allowing for faster data ingestion compared to other ingestion methods.
The following code is an example of the COPY command loading data from a set of CSV files in an S3 bucket into a Redshift table:
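A minimal sketch of such a COPY statement, using placeholder values for the target table, S3 path, and IAM role:
```sql
-- Load CSV files from Amazon S3 into a Redshift table in parallel
-- (table name, bucket path, and role ARN are placeholders)
COPY mytable
FROM 's3://my-bucket/data/files/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;
```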
This code uses the following parameters:
- `mytable` is the target Redshift table for the data load
- `'s3://my-bucket/data/files/'` is the S3 path where the CSV files are located
- `IAM_ROLE` specifies the IAM role required to access the S3 bucket
- `FORMAT AS CSV` specifies that the data files are in CSV format
In addition to Amazon S3, the COPY command loads data from other sources, such as DynamoDB, Amazon EMR, remote hosts over SSH, or other Redshift databases. The COPY command provides options to specify data formats, delimiters, compression, and other parameters to handle different data sources and formats, as shown in the sketch that follows.
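For example, here is a sketch of COPY statements reading from DynamoDB and from compressed JSON files in Amazon S3; the table, bucket, DynamoDB table, and role names are placeholders:
```sql
-- Load directly from a DynamoDB table; READRATIO limits the read capacity consumed
COPY mytable
FROM 'dynamodb://my-dynamodb-table'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
READRATIO 50;

-- Load gzip-compressed JSON files from Amazon S3 using automatic field mapping
COPY mytable
FROM 's3://my-bucket/data/json/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS JSON 'auto'
GZIP;
```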
To get started with the COPY command, see Using the COPY command to load from Amazon S3.
Federated queries use case: Integrated reporting and analytics for a retail company
For this use case, a retail company has an operational database running on Amazon RDS for PostgreSQL, which stores real-time sales transactions, inventory levels, and customer information. Additionally, a data warehouse runs on Amazon Redshift, storing historical data for reporting and analytics purposes. To create an integrated reporting solution that combines real-time operational data with historical data in the data warehouse, without the need for multi-step ETL processes, complete the following steps:
- Set up network connectivity. Make sure your Redshift cluster and RDS for PostgreSQL instance are in the same virtual private cloud (VPC) or have network connectivity established through VPC peering, AWS PrivateLink, or AWS Transit Gateway.
- Create a secret and IAM role for federated queries:
  - In AWS Secrets Manager, create a new secret to store the credentials (user name and password) for your Amazon RDS for PostgreSQL instance.
  - Create an IAM role with permissions to access the Secrets Manager secret and the Amazon RDS for PostgreSQL instance.
  - Associate the IAM role with your Amazon Redshift cluster.
- Create an external schema in Amazon Redshift:
  - Connect to your Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
  - Create an external schema that references your Amazon RDS for PostgreSQL instance.
- Query tables in your Amazon RDS for PostgreSQL instance directly from Amazon Redshift using federated queries.
- Create views or materialized views in Amazon Redshift that combine the operational data from federated queries with the historical data in Amazon Redshift for reporting purposes. A combined SQL sketch of the external schema, query, and view steps follows this list.
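The following sketch illustrates these steps, assuming placeholder values for the RDS endpoint, database, IAM role, secret ARN, and table names:
```sql
-- Create an external schema that references the RDS for PostgreSQL database
CREATE EXTERNAL SCHEMA postgres_ops
FROM POSTGRES
DATABASE 'salesdb' SCHEMA 'public'
URI 'my-rds-instance.abcdefg123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:my-rds-secret';

-- Query live operational tables in Amazon RDS for PostgreSQL directly from Amazon Redshift
SELECT order_id, customer_id, order_total
FROM postgres_ops.orders
WHERE order_date = CURRENT_DATE;

-- Combine live operational data with historical data stored in Amazon Redshift
-- (defined as a late-binding view because it references a federated table)
CREATE VIEW public.integrated_sales_report AS
SELECT o.order_id, o.customer_id, o.order_total, h.lifetime_value
FROM postgres_ops.orders o
JOIN public.customer_history h ON o.customer_id = h.customer_id
WITH NO SCHEMA BINDING;
```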
With this implementation, federated queries in Amazon Redshift integrate real-time operational data from Amazon RDS for PostgreSQL instances with historical data in a Redshift data warehouse. This approach eliminates the need for multi-step ETL processes and lets you create comprehensive reports and analytics that combine data from multiple sources.
To get started with Amazon Redshift federated query ingestion, see Querying data with federated queries in Amazon Redshift.
Zero-ETL integration use case: Near real-time analytics for an ecommerce application
Suppose an ecommerce application built on Aurora MySQL-Compatible manages online orders, customer data, and product catalogs. To perform near real-time analytics with data filtering on transactional data, and to gain insights into customer behavior, sales trends, and inventory management without the overhead of building and maintaining multi-step ETL pipelines, you can use zero-ETL integrations for Amazon Redshift. Complete the following steps:
- Set up an Aurora MySQL cluster (it must be running Aurora MySQL version 3.05, compatible with MySQL 8.0.32, or higher):
  - Create an Aurora MySQL cluster in your desired AWS Region.
  - Configure the cluster settings, such as the instance type, storage, and backup options.
- Create a zero-ETL integration with Amazon Redshift:
  - On the Amazon RDS console, navigate to the Zero-ETL integrations page.
  - Choose Create integration and select your Aurora MySQL cluster as the source.
  - Choose an existing Redshift cluster or create a new cluster as the target.
  - Provide a name for the integration and review the settings.
  - Choose Create integration to initiate the zero-ETL integration process.
- Verify the integration status:
  - After the integration is created, monitor the status on the Amazon RDS console or by querying the SVV_INTEGRATION and SYS_INTEGRATION_ACTIVITY system views in Amazon Redshift.
  - Wait for the integration to reach the Active state, indicating that data is being replicated from Aurora to Amazon Redshift.
- Create analytics views:
  - Connect to your Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
  - Create views or materialized views that combine and transform the replicated data from Aurora for your analytics use cases.
- Query the views or materialized views in Amazon Redshift to perform near real-time analytics on the transactional data from your Aurora MySQL cluster. A SQL sketch of these analytics steps follows this list.
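A sketch of the kind of near real-time analytics query you can run over the replicated data, using three-part notation; the destination database name (zeroetl_db) and the replicated schema and tables (ecomm.orders, ecomm.customers) are placeholders, and a view or materialized view can be layered over the same SELECT:
```sql
-- Top customers by spend, computed directly over the tables replicated by the
-- zero-ETL integration (referenced with database.schema.table notation)
SELECT c.customer_id,
       c.customer_name,
       COUNT(o.order_id)  AS order_count,
       SUM(o.order_total) AS total_spend
FROM zeroetl_db.ecomm.orders o
JOIN zeroetl_db.ecomm.customers c ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name
ORDER BY total_spend DESC
LIMIT 10;
```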
This implementation achieves near real-time analytics for an ecommerce application's transactional data using the zero-ETL integration between Aurora MySQL-Compatible and Amazon Redshift. The data automatically replicates from Aurora to Amazon Redshift, eliminating the need for multi-step ETL pipelines and supporting insights from the latest data quickly.
To get started with Amazon Redshift zero-ETL integrations, see Working with zero-ETL integrations. To learn more about Aurora zero-ETL integrations with Amazon Redshift, see Amazon Aurora zero-ETL integrations with Amazon Redshift.
Integration for Apache Spark use case: Gaming player events written to Amazon S3
Consider a large volume of gaming player events stored in Amazon S3. The events require data transformation, cleansing, and preprocessing to extract insights, generate reports, or build ML models. In this case, you can use the scalability and processing power of Amazon EMR to perform the required transformations using Apache Spark. After it's processed, the transformed data must be loaded into Amazon Redshift for further analysis, reporting, and integration with BI tools.
In this scenario, you can use the Amazon Redshift integration for Apache Spark to perform the necessary data transformations and load the processed data into Amazon Redshift. The following implementation example assumes gaming player events in Parquet format are stored in Amazon S3 (under an s3:// path).
- Launch an Amazon EMR (emr-6.9.0) cluster with Apache Spark (Spark 3.3.0) that includes Amazon Redshift integration for Apache Spark support.
- Configure the necessary IAM role for accessing Amazon S3 and Amazon Redshift.
- Add security group rules to Amazon Redshift to allow access to the provisioned cluster or serverless workgroup.
- Create a Spark job that sets up a connection to Amazon Redshift, reads data from Amazon S3, performs transformations, and writes the resulting data to Amazon Redshift. See the following code:
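A minimal PySpark sketch of such a job, using placeholder values for the S3 paths, Redshift endpoint, table name, and IAM role ARN:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Create a SparkSession (on Amazon EMR, a session is typically already available)
spark = SparkSession.builder.appName("GamingEventsToRedshift").getOrCreate()

# Amazon Redshift connection properties -- all values below are placeholders
redshift_jdbc_url = "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev"
redshift_table = "public.player_events"
temp_s3_dir = "s3://my-temp-bucket/redshift-temp/"
iam_role_arn = "arn:aws:iam::123456789012:role/MyRedshiftSparkRole"

# Read the gaming player events stored in Parquet format in Amazon S3
events_df = spark.read.format("parquet").load("s3://my-bucket/player-events/")

# Example transformation: add a new column with a constant value
transformed_df = events_df.withColumn("transformed_column", lit("constant_value"))

# Write the transformed data to Amazon Redshift using the Spark-Redshift connector
(transformed_df.write
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", redshift_jdbc_url)
    .option("dbtable", redshift_table)
    .option("tempdir", temp_s3_dir)
    .option("aws_iam_role", iam_role_arn)
    .mode("overwrite")
    .save())
```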
In this example, you first import the necessary modules and create a SparkSession. Set the connection properties for Amazon Redshift, including the endpoint, port, database, schema, table name, temporary S3 bucket path, and the IAM role ARN for authentication. Read data from Amazon S3 in Parquet format using the spark.read.format("parquet").load() method. Perform a transformation on the Amazon S3 data by adding a new column transformed_column with a constant value, using the withColumn method and the lit function. Write the transformed data to Amazon Redshift using the write method and the io.github.spark_redshift_community.spark.redshift format. Set the necessary options for the Redshift connection URL, table name, temporary S3 bucket path, and IAM role ARN. Use the mode("overwrite") option to overwrite the existing data in the Amazon Redshift table with the transformed data.
To get started with the Amazon Redshift integration for Apache Spark, see Amazon Redshift integration for Apache Spark. For more examples of using the Amazon Redshift for Apache Spark connector, see New – Amazon Redshift Integration with Apache Spark.
Streaming ingestion use case: IoT telemetry near real-time analysis
Imagine a fleet of IoT devices (sensors and industrial equipment) that generate a continuous stream of telemetry data such as temperature readings, pressure measurements, or operational metrics. Ingesting this data in real time to perform analytics for monitoring the devices, detecting anomalies, and making data-driven decisions requires a streaming solution integrated with a Redshift data warehouse.
In this example, we use Amazon MSK as the streaming source for IoT telemetry data.
- Create an external schema in Amazon Redshift:
  - Connect to the Amazon Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
  - Create an external schema that references the MSK cluster.
- Create a materialized view in Amazon Redshift:
  - Define a materialized view that maps the Kafka topic data to Amazon Redshift table columns.
  - CAST the streaming message payload data type to the Amazon Redshift SUPER type.
  - Set the materialized view to auto refresh.
- Query the iot_telemetry_view materialized view to access the real-time IoT telemetry data ingested from the Kafka topic. The materialized view will automatically refresh as new data arrives in the Kafka topic. A SQL sketch of these steps follows this list.
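A sketch of these steps, assuming placeholder values for the MSK cluster ARN, IAM role, and Kafka topic name (iot-telemetry), and assuming the messages are JSON:
```sql
-- Create an external schema that references the MSK cluster
CREATE EXTERNAL SCHEMA msk_iot
FROM MSK
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftStreamingRole'
AUTHENTICATION iam
CLUSTER_ARN 'arn:aws:kafka:us-east-1:123456789012:cluster/my-msk-cluster/abc-123';

-- Map the Kafka topic to columns, casting the payload to SUPER, with auto refresh
CREATE MATERIALIZED VIEW iot_telemetry_view AUTO REFRESH YES AS
SELECT kafka_partition,
       kafka_offset,
       kafka_timestamp,
       kafka_key,
       JSON_PARSE(kafka_value) AS payload
FROM msk_iot."iot-telemetry"
WHERE CAN_JSON_PARSE(kafka_value);

-- Query the near real-time telemetry data, navigating the SUPER payload
SELECT payload.device_id,
       payload.temperature,
       kafka_timestamp
FROM iot_telemetry_view
WHERE (payload.temperature)::FLOAT > 80;
```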
With this implementation, you can achieve near real-time analytics on IoT device telemetry data using Amazon Redshift streaming ingestion. As telemetry data is received by the MSK topic, Amazon Redshift automatically ingests and reflects the data in the materialized view, supporting query and analysis of the data in near real time.
To get started with Amazon Redshift streaming ingestion, see Streaming ingestion to a materialized view. To learn more about streaming and customer use cases, see Amazon Redshift Streaming Ingestion.
Conclusion
This post detailed the options available for Amazon Redshift data ingestion. The choice of data ingestion method depends on factors such as the size and structure of the data, the need for real-time access or transformations, the data sources, existing infrastructure, ease of use, and user skill sets. Zero-ETL integrations and federated queries are suitable for simple data ingestion tasks or for joining data between operational databases and Amazon Redshift analytics data. Large-scale data ingestion with transformation and orchestration benefits from the Amazon Redshift integration for Apache Spark with Amazon EMR and AWS Glue. Bulk loading of data into Amazon Redshift, regardless of dataset size, fits perfectly with the capabilities of the Redshift COPY command. Streaming sources such as Kinesis Data Streams, Amazon MSK, and Data Firehose are ideal scenarios for using the AWS streaming service integrations for data ingestion.
Evaluate the features and guidance provided for your data ingestion workloads, and let us know your feedback in the comments.
About the Authors
Steve Phillips is a senior technical account manager at AWS in the North America region. Steve has worked with games customers for eight years and currently focuses on data warehouse architectural design, data lakes, data ingestion pipelines, and cloud distributed architectures.
Sudipta Bagchi is a Sr. Specialist Solutions Architect at Amazon Web Services. He has over 14 years of experience in data and analytics, and helps customers design and build scalable and high-performance analytics solutions. Outside of work, he loves running, traveling, and playing cricket.