Big Data

Implement historic report lookup and Slowly Altering Dimensions Kind-2 utilizing Apache Iceberg

10 December 2024

In right this moment’s data-driven world, monitoring and analyzing modifications over time has turn out to be important. As organizations course of huge quantities of information, sustaining an correct historic report is essential. Historical past administration in information programs is prime for compliance, enterprise intelligence, information high quality, and time-based evaluation. It permits organizations to keep up audit trails, carry out development evaluation, determine information high quality points, and conduct point-in-time reporting. When mixed with Change Knowledge Seize (CDC), which identifies and captures database modifications, historical past administration turns into much more potent.

Frequent use circumstances for historic report administration in CDC eventualities span varied domains. In buyer relationship administration, it tracks modifications in buyer info over time. Monetary programs use it for sustaining correct transaction and stability histories. Stock administration advantages from historic information for analyzing gross sales patterns and optimizing inventory ranges. HR programs use it to trace worker info modifications. In fraud detection, historic information helps determine anomalous patterns in transactions or consumer behaviors.

This put up will discover the best way to implement these functionalities utilizing Apache Iceberg, specializing in Slowly Altering Dimensions (SCD) Kind-2. This methodology creates new information for every information change whereas preserving outdated ones, thus sustaining a full historical past. By the tip, you’ll perceive the best way to use Apache Iceberg to handle historic information successfully on a typical CDC structure.

Historic report lookup

How can we retrieve the historical past of given information? This can be a elementary query in information administration, particularly when coping with programs that want to trace modifications over time. Let’s discover this idea with a sensible instance.

Take into account a product (Heater) in an ecommerce database:

product_id	product_name	worth
00001	Heater	250

Now, let’s say we replace the worth of this product from 250 to 500. After a while, we wish to retrieve the worth historical past of this heater. In a standard database setup, this activity might be difficult, particularly if we haven’t explicitly designed our system to trace historic modifications.

That is the place the idea of historic report lookup turns into essential. We want a system that not solely shops the present state of our information but in addition maintains a log of all modifications made to every report over time. This permits us to reply questions like:

What was the worth of the heater at a selected cut-off date?
What number of instances has the worth modified, and when did these modifications happen?
What was the worth development of the heater over the previous 12 months?

Implementing such a system might be complicated, requiring cautious consideration of information storage, retrieval mechanisms, and question optimization. That is the place Apache Iceberg comes into play, providing a characteristic referred to as the change log view.

The change log view in Apache Iceberg offers a view of all modifications made to a desk over time, making it simple to question and analyze the historical past of any report. With change log view, we are able to simply observe insertions, updates, and deletions, giving us a whole image of how our information has developed.

For our heater instance, Iceberg’s change log view would enable us to effortlessly retrieve a timeline of all worth modifications, full with timestamps and different related metadata, as proven within the following desk.

product_id	product_name	worth	_change_type
00001	Heater	250	INSERT
00001	Heater	250	UPDATE_BEFORE
00001	Heater	500	UPDATE_AFTER

This functionality not solely simplifies historic evaluation but in addition opens prospects for superior time-based analytics, auditing, and information governance.

Historic desk lookup with SCD Kind-2

SCD Kind-2 is a key idea in information warehousing and historic information administration and is especially related to Change Knowledge Seize (CDC) eventualities. SCD Kind-2 creates new rows for modified information as a substitute of overwriting present information, permitting for complete monitoring of modifications over time.

SCD Kind-2 requires further fields reminiscent of effective_start_date, effective_end_date, and current_flag to handle historic information. This method has been extensively utilized in information warehouses to trace modifications in varied dimensions reminiscent of buyer info, product particulars, and worker information. Within the instance of the earlier part, right here’s what the SCD Kind-2 seems to be like assuming the replace operation is carried out on December 11, 2024.

product_id	product_name	worth	effective_start_date	effective_end_date	current_flag
00001	Heater	250	2024-12-10	2024-12-11	FALSE
00001	Heater	500	2024-12-11	NULL	TRUE

SCD Kind-2 is especially useful in CDC use circumstances, the place capturing all information modifications over time is essential. It permits point-in-time evaluation, offers detailed audit trails, aids in information high quality administration, and helps meet compliance necessities by preserving historic information.

In conventional implementations on information warehouses, SCD Kind-2 requires its particular dealing with in all INSERT, UPDATE, and DELETE operations that have an effect on these further columns. For instance, to replace the worth of the product, it’s essential to run the next question.

UPDATE product SET effective_end_date="2024-12-11", current_flag = false
WHERE product_id = '00001' AND current_flag = true;

INSERT INTO product (product_id, product_name, worth, effective_start_date, effective_end_date, current_flag)
VALUES ('00001', 'Heater', 500, '2024-12-11', NULL, true);

For contemporary information lakes, we suggest a brand new method to implement SCD Kind-2. With Iceberg, you may create a devoted view of SCD Kind-2 on high of the change log view, eliminating the necessity to implement particular dealing with to make modifications on SCD Kind-2 tables. With this method, you may preserve managing Iceberg tables with out complexity contemplating SCD Kind-2 specification. Anytime if you want SCD Kind-2 snapshot of your Iceberg desk, you may create the corresponding illustration. This method combines the facility of Iceberg’s environment friendly information administration with the historic monitoring capabilities of SCD Kind-2. Through the use of the change log view, Iceberg can dynamically generate the SCD Kind-2 construction with out the overhead of sustaining further tables or manually managing efficient dates and flags.

This streamlined methodology not solely makes the implementation of SCD Kind-2 extra simple, but in addition presents improved efficiency and scalability for dealing with massive volumes of historic information in CDC eventualities. It represents a major development in historic information administration, merging conventional information warehousing ideas with trendy huge information capabilities.

As we delve deeper into Iceberg’s options, we’ll discover how this method might be applied, showcasing the effectivity and adaptability it brings to historic information evaluation and CDC processes.

Conditions

The next stipulations are required for the use circumstances:

Arrange assets with AWS CloudFormation

Use a supplied AWS CloudFormation template to arrange assets to construct Iceberg environments. The template creates the next assets:

Full the next steps to deploy the assets.

Select Launch stack

For the parameters, IcebergDatabaseName is ready by default. You may change the default worth. Then, select Subsequent.
Select Subsequent
Select I acknowledge that AWS CloudFormation may create IAM assets with customized names.
Select Submit.
After the stack creation is full, examine the Outputs tab and make an observation of the useful resource values, that are used within the following sections.

Subsequent, configure the Iceberg JAR information to the session to make use of the Iceberg change log view characteristic. Full the next steps.

Choose the next JAR information from the Iceberg releases web page and obtain these JAR information in your native machine:
1. 1.6.1 Spark 3.3_with Scala 2.12 runtime Jar.
2. 1.6.1 aws-bundle Jar.
Open the Amazon S3 console and choose the S3 bucket you created utilizing the CloudFormation stack. The S3 bucket title might be discovered on the CloudFormation Outputs tab.
Select Create folder and create the jars path within the S3 bucket.
Add the 2 downloaded JAR information on s3:///jars/ from the S3 console.

Add a Jupyter Pocket book on AWS Glue Studio

After launching the CloudFormation stack, create an AWS Glue Studio pocket book to make use of Iceberg with AWS Glue.

Obtain historical past.ipynb.
Open AWS Glue Studio console.
Beneath Create job, choose Pocket book.
Choose Add Pocket book, select Select file and add the Pocket book you downloaded.
Choose the IAM function title reminiscent of IcebergHistoryGlueJobRole that you just created utilizing the CloudFormation template. Then, select Create pocket book.

For Job title on the left high of the web page, enter iceberg_history.
Select Save.

Create an Iceberg desk

To create an Iceberg desk utilizing a product dataset, full the next steps.

On the Jupyter Pocket book that you just created in Add a Jupyter Pocket book on AWS Glue Studio, run the next cell to make use of Iceberg with AWS Glue. Earlier than operating the cell, change with the S3 bucket title the place you uploaded the Iceberg JAR information.

Initialize the SparkSession with Iceberg settings.

Configure database and desk names for an Iceberg desk (DB_TBL) and information warehouse path (ICEBERG_LOC). Change with the S3 bucket from the CloudFormation Outputs tab.
Run the next code to create the Iceberg desk utilizing the Spark DataFrame based mostly on the product dataset.

from pyspark.sql import Row
import time
ut = time.time()
product = [
    {'product_id': '00001', 'product_name': 'Heater', 'price': 250, 'category': 'Electronics', 'updated_at': ut},
    {'product_id': '00002', 'product_name': 'Thermostat', 'price': 400, 'category': 'Electronics', 'updated_at': ut},
    {'product_id': '00003', 'product_name': 'Television', 'price': 600, 'category': 'Electronics', 'updated_at': ut},
    {'product_id': '00004', 'product_name': 'Blender', 'price': 100, 'category': 'Electronics', 'updated_at': ut},
    {'product_id': '00005', 'product_name': 'USB charger', 'price': 50, 'category': 'Electronics', 'updated_at': ut}
]
df_products = spark.createDataFrame(Row(**x) for x in product)
df_products.createOrReplaceTempView('tmp')

spark.sql(f"""
CREATE TABLE {DB_TBL} USING iceberg LOCATION '{ICEBERG_LOC}'
AS SELECT * FROM tmp
""")

After creating the Iceberg desk, run SELECT * FROM iceberg_history_db.merchandise ORDER BY product_id to indicate the product information within the Iceberg desk. At the moment the next 5 merchandise are saved within the Iceberg desk.

+----------+------------+-----+-----------+--------------------+
|product_id|product_name|worth|   class|          updated_at|
+----------+------------+-----+-----------+--------------------+
|     00001|      Heater|  250|Electronics|1.7297845122056053E9|
|     00002|  Thermostat|  400|Electronics|1.7297845122056053E9|
|     00003|  Tv|  600|Electronics|1.7297845122056053E9|
|     00004|     Blender|  100|Electronics|1.7297845122056053E9|
|     00005| USB charger|   50|Electronics|1.7297845122056053E9|
+----------+------------+-----+-----------+--------------------+

Subsequent, search for the historic modifications for a product utilizing Iceberg’s change log view characteristic.

Implement historic report lookup with Iceberg’s change log view

Suppose that there’s a supply desk whose desk information are replicated to the Iceberg desk by a Change Knowledge Seize (CDC) course of. When the information within the supply desk are up to date, these modifications are then mirrored within the Iceberg desk. On this part, you search for the historical past of a given report for such a system to seize the historical past of product updates. For instance, the next updates happen within the supply desk. Via the CDC course of, these modifications are utilized to the Iceberg desk.

Upsert (replace and insert) the 2 information:
- The value of Heater (product_id: 00001) is up to date from 250 to 500.
- A brand new product Chair (product_id: 00006) is added.
Tv (product_id: 00003) is deleted.

To simulate the CDC workflow, you manually apply these modifications to the Iceberg desk within the pocket book.

Use the MERGE INTO question to upsert information. If an enter report within the Spark DataFrame has the identical product_id as an present report, the present report is up to date. If no matching product_id is discovered, the enter report is inserted into the Iceberg desk.

Delete Tv from the Iceberg desk by operating the DELETE question.

DELETE FROM iceberg_history_db.merchandise WHERE product_id = '00003'

Then, run SELECT * FROM iceberg_history_db.merchandise ORDER BY product_id to indicate the product information within the Iceberg desk. You may affirm that the worth of Heater is up to date to 500, Chair is added and Tv is deleted.

+----------+------------+-----+-----------+--------------------+
|product_id|product_name|worth|   class|          updated_at|
+----------+------------+-----+-----------+--------------------+
|     00001|      Heater|  500|Electronics|    1.729790106579E9|
|     00002|  Thermostat|  400|Electronics|1.7297845122056053E9|
|     00004|     Blender|  100|Electronics|1.7297845122056053E9|
|     00005| USB charger|   50|Electronics|1.7297845122056053E9|
|     00006|       Chair|   50|  Furnishings|    1.729790106579E9|
+----------+------------+-----+-----------+--------------------+

For the Iceberg desk, the place modifications from the supply desk are replicated, you may observe the report modifications utilizing Iceberg’s change log view. To begin, you first create a change log view from the Iceberg desk.

Run the create_changelog_view Iceberg process to create a change log view.

Run the next question to retrieve the historic modifications for Heater.

SELECT product_id, product_name, worth, class, updated_at, _change_type
FROM products_clv WHERE product_id = '00001'
ORDER BY _change_ordinal, _change_type DESC

The question end result reveals the historic modifications to Heater. You may affirm that the worth of Heater was up to date from 250 to 500 from the output.

+----------+------------+-----+-----------+--------------------+-------------+
|product_id|product_name|worth|   class|          updated_at| _change_type|
+----------+------------+-----+-----------+--------------------+-------------+
|     00001|      Heater|  250|Electronics|1.7297902833360643E9|       INSERT|
|     00001|      Heater|  250|Electronics|1.7297902833360643E9|UPDATE_BEFORE|
|     00001|      Heater|  500|Electronics|1.7297903836233025E9| UPDATE_AFTER|
+----------+------------+-----+-----------+--------------------+-------------+

Utilizing Iceberg’s change log view, you may receive the historical past of a given report instantly from the Iceberg desk’s historical past, while not having to create a separate desk for managing report historical past. Subsequent, you implement Slowly Altering Dimension (SCD) Kind-2 utilizing the change log view.

Implement SCD Kind-2 with Iceberg’s change log view

The SCD Kind-2 based mostly desk retains the complete historical past of report modifications and it may be utilized in a number of circumstances reminiscent of historic monitoring, point-in-time evaluation, regulatory compliance, and so forth. On this part, you implement SCD Kind-2 utilizing the change log view (products_clv) that was created within the earlier part. The change log view has a schema that’s just like the schema outlined within the SCD Kind-2 specs. For this modification log view, you add effective_start, effective_end, and is_current columns. So as to add these columns after which implement SCD Kind-2, full the next steps.

Run the next question to implement SCD Kind-2. Within the WITH AS (...) part of the question, the change log view is merged with the Iceberg desk snapshots utilizing the snapshot_id key to incorporate the commit time for every report change. You may receive the desk snapshots by querying for db.desk.snapshots. The opposite half within the question identifies each present and non-current entries by evaluating the commit instances for every product. It then units the efficient time for every product, and marks whether or not a product is present or not based mostly on the efficient time and the change sort from the change log view.

WITH clv_snapshots AS (
    SELECT
        clv.*,
        s.snapshot_id,
        s.committed_at,
        s.committed_at as effective_start
    FROM products_clv clv
    JOIN iceberg_history_db.merchandise.snapshots s
    ON clv._commit_snapshot_id = s.snapshot_id
) 
SELECT
    product_id, 
    product_name, 
    worth, 
    class, 
    updated_at,
    effective_start,
    CASE
        WHEN effective_start != l_part_committed_at 
            OR _change_type="UPDATE_BEFORE" THEN l_part_committed_at
        ELSE CAST(null as timestamp)
    END as effective_end,
    CASE
        WHEN effective_start != l_part_committed_at
            OR _change_type="UPDATE_BEFORE" 
            OR _change_type="DELETE" THEN CAST(false as boolean)
        ELSE CAST(true as boolean)
    END as is_current
FROM (SELECT *, MAX(committed_at) OVER (PARTITION BY product_id, updated_at) as l_part_committed_at FROM clv_snapshots)
WHERE _change_type != 'UPDATE_BEFORE'
ORDER BY product_id,  _change_ordinal

The question end result reveals the SCD Kind-2 based mostly schema and information.

After the question result’s displayed, this SCD Kind-2 based mostly desk is saved as scdt2 to permit entry for additional evaluation.

SCD Kind-2 is beneficial for a lot of use circumstances. To discover how this SCD Kind-2 implementation can be utilized to trace the historical past of desk information, run the next instance queries.

Run the next question to retrieve deleted or up to date information in a selected interval. This question captures which information have been modified throughout that timeframe, permitting you to audit modifications for additional use-cases reminiscent of development evaluation, regulatory compliance checks, and so forth. Earlier than operating the question, change and with particular time ranges reminiscent of 2024-10-24 17:18:00 and 2024-10-24 17:20:00.

SELECT product_id, product_name, worth, class, updated_at, effective_start, effective_end, is_current 
FROM scdt2 WHERE product_id IN ( SELECT product_id FROM scdt2 
WHERE (_change_type="DELETE" or _change_type="UPDATE_AFTER") 
AND effective_start BETWEEN '' AND '') 
ORDER BY product_id, effective_start

The question end result reveals the deleted and up to date information within the specified interval. You may affirm that the worth of Heater was up to date and Tv was deleted from the desk.

+----------+------------+-----+-----------+--------------------+--------------------+--------------------+----------+
|product_id|product_name|worth|   class|          updated_at|     effective_start|       effective_end|is_current|
+----------+------------+-----+-----------+--------------------+--------------------+--------------------+----------+
|     00001|      Heater|  250|Electronics|1.7297902833360643E9|2024-10-24 17:18:...|2024-10-24 17:19:...|     false|
|     00001|      Heater|  500|Electronics|1.7297903836233025E9|2024-10-24 17:19:...|                null|      true|
|     00003|  Tv|  600|Electronics|1.7297902833360643E9|2024-10-24 17:18:...|2024-10-24 17:19:...|     false|
|     00003|  Tv|  600|Electronics|1.7297902833360643E9|2024-10-24 17:19:...|                null|     false|
+----------+------------+-----+-----------+--------------------+--------------------+--------------------+----------+

As one other instance, run the next question to retrieve the most recent information at a selected cut-off date from the SCD Kind-2 desk by filtering with is_current = true for present information reporting.

SELECT product_id, product_name, worth, class, updated_at
FROM scdt2 WHERE is_current = true ORDER BY product_id

The question end result reveals the present desk information, reflecting the up to date worth of Heater, the deletion of Tv, and the addition of Chair after the preliminary information.

+----------+------------+-----+-----------+--------------------+
|product_id|product_name|worth|   class|          updated_at|
+----------+------------+-----+-----------+--------------------+
|     00001|      Heater|  500|Electronics|1.7297903836233025E9|
|     00002|  Thermostat|  400|Electronics|1.7297902833360643E9|
|     00004|     Blender|  100|Electronics|1.7297902833360643E9|
|     00005| USB charger|   50|Electronics|1.7297902833360643E9|
|     00006|       Chair|   50|  Furnishings|1.7297903836233025E9|
+----------+------------+-----+-----------+--------------------+

You’ve got now efficiently applied SCD Kind-2 utilizing the change log view. This SCD Kind-2 implementation lets you observe the historical past of desk information. For instance, you should utilize it to seek for deleted or up to date merchandise reminiscent of Heater and Chair in a selected interval. Moreover, you may retrieve the present desk information by querying the SCD Kind-2 desk with is_current = true. Utilizing Iceberg’s change log view allows you to implement SCD Kind-2 with out making any modifications to the Iceberg desk itself. It additionally eliminates the necessity for creating or managing an extra desk for SCD Kind-2.

Clear up

To scrub up the assets used on this put up, full the next steps:

Open the Amazon S3 console
Choose the S3 bucket aws-glue-assets-- the place the Pocket book file (iceberg_history.ipynb) is saved. Delete the Pocket book file that’s within the pocket book path.
Choose the S3 bucket you created utilizing the CloudFormation template. You may receive the bucket title from IcebergS3Bucket key on the CloudFormation Outputs tab. After choosing the bucket, select Empty to delete all objects
After you affirm the bucket is empty, delete the CloudFormation stack iceberg-history-baseline-resources.

Issues

Listed below are necessary issues:

The change log view doesn’t lose any historic report modifications even when following operations are carried out:
The change log view loses historic report modifications corresponded to snapshots deleted with expire_snapshots and Glue Knowledge Catalog computerized snapshot deletion.
The change log view isn’t supported in MoR tables.

Conclusion

On this put up, we’ve explored the best way to search for the historical past of information and tables utilizing Apache Iceberg. The instruction demonstrated the best way to use change log view to search for the historical past of the information, and likewise the historical past of the tables with SCD Kind-2. With this methodology, you may handle the historical past of information and tables with out additional effort.

Concerning the Authors

Tomohiro Tanaka is a Senior Cloud Help Engineer at Amazon Net Providers. He’s keen about serving to clients use Apache Iceberg for his or her information lakes on AWS. In his free time, he enjoys a espresso break together with his colleagues and making espresso at dwelling.

Noritaka Sekiyama is a Principal Huge Knowledge Architect on the AWS Glue group. He works based mostly in Tokyo, Japan. He’s chargeable for constructing software program artifacts to assist clients. In his spare time, he enjoys biking together with his highway bike.