In today’s data-driven world, tracking and analyzing changes over time has become essential. As organizations process vast amounts of data, maintaining an accurate historical record is crucial. History management in data systems is fundamental for compliance, business intelligence, data quality, and time-based analysis. It enables organizations to maintain audit trails, perform trend analysis, identify data quality issues, and conduct point-in-time reporting. When combined with Change Data Capture (CDC), which identifies and captures database changes, history management becomes even more potent.
Common use cases for historical record management in CDC scenarios span various domains. In customer relationship management, it tracks changes in customer information over time. Financial systems use it for maintaining accurate transaction and balance histories. Inventory management benefits from historical data for analyzing sales patterns and optimizing stock levels. HR systems use it to track employee information changes. In fraud detection, historical data helps identify anomalous patterns in transactions or user behaviors.
This post will explore how to implement these functionalities using Apache Iceberg, focusing on Slowly Changing Dimensions (SCD) Type-2. This method creates new records for each data change while preserving old ones, thus maintaining a full history. By the end, you’ll understand how to use Apache Iceberg to manage historical records effectively on a typical CDC architecture.
Historical record lookup
How can we retrieve the history of given records? This is a fundamental question in data management, especially when dealing with systems that need to track changes over time. Let’s explore this concept with a practical example.
Consider a product (`Heater`) in an ecommerce database:

| product_id | product_name | price |
| --- | --- | --- |
| 00001 | Heater | 250 |
Now, let’s say we update the price of this product from `250` to `500`. After some time, we want to retrieve the price history of this heater. In a traditional database setup, this task could be challenging, especially if we haven’t explicitly designed our system to track historical changes.
This is where the concept of historical record lookup becomes crucial. We need a system that not only stores the current state of our data but also maintains a log of all changes made to each record over time. This allows us to answer questions like:
- What was the price of the heater at a specific point in time?
- How many times has the price changed, and when did these changes occur?
- What was the price trend of the heater over the past year?
Implementing such a system can be complex, requiring careful consideration of data storage, retrieval mechanisms, and query optimization. This is where Apache Iceberg comes into play, offering a feature known as the change log view.
The change log view in Apache Iceberg provides a view of all changes made to a table over time, making it straightforward to query and analyze the history of any record. With the change log view, we can easily track insertions, updates, and deletions, giving us a complete picture of how our data has evolved.
For our heater example, Iceberg’s change log view would allow us to effortlessly retrieve a timeline of all price changes, complete with timestamps and other relevant metadata, as shown in the following table.
| product_id | product_name | price | _change_type |
| --- | --- | --- | --- |
| 00001 | Heater | 250 | INSERT |
| 00001 | Heater | 250 | UPDATE_BEFORE |
| 00001 | Heater | 500 | UPDATE_AFTER |
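To make the mechanics concrete, here is a minimal Python sketch, independent of Iceberg, that collapses change-log rows like the ones above into the sequence of prices a record has held:

```python
# Change log rows for product 00001, mirroring the table above.
# Each row: (product_id, product_name, price, _change_type)
changelog = [
    ("00001", "Heater", 250, "INSERT"),
    ("00001", "Heater", 250, "UPDATE_BEFORE"),
    ("00001", "Heater", 500, "UPDATE_AFTER"),
]

def price_timeline(rows, product_id):
    """Collapse change-log rows into the sequence of prices a record held.

    INSERT and UPDATE_AFTER rows introduce a new state; UPDATE_BEFORE and
    DELETE rows describe the state being replaced or removed, so they are
    skipped here.
    """
    return [
        price
        for pid, _name, price, change in rows
        if pid == product_id and change in ("INSERT", "UPDATE_AFTER")
    ]

print(price_timeline(changelog, "00001"))  # [250, 500]
```

The same pair-wise structure (a before row and an after row sharing one logical change) is what the queries later in this post filter on.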
This capability not only simplifies historical analysis but also opens possibilities for advanced time-based analytics, auditing, and data governance.
Historical table lookup with SCD Type-2
SCD Type-2 is a key concept in data warehousing and historical data management and is particularly relevant to Change Data Capture (CDC) scenarios. SCD Type-2 creates new rows for changed data instead of overwriting existing records, allowing for comprehensive tracking of changes over time.
SCD Type-2 requires additional fields such as `effective_start_date`, `effective_end_date`, and `current_flag` to manage historical records. This approach has been widely used in data warehouses to track changes in various dimensions such as customer information, product details, and employee data. For the example in the previous section, here’s what the SCD Type-2 table looks like, assuming the update operation is performed on December 11, 2024.
| product_id | product_name | price | effective_start_date | effective_end_date | current_flag |
| --- | --- | --- | --- | --- | --- |
| 00001 | Heater | 250 | 2024-12-10 | 2024-12-11 | FALSE |
| 00001 | Heater | 500 | 2024-12-11 | NULL | TRUE |
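The way successive versions of a record map onto SCD Type-2 rows can be sketched in a few lines of Python; the dates are the illustrative ones from the table above:

```python
from datetime import date

def scd2_rows(versions):
    """Turn an ordered list of (record, effective_date) versions into
    SCD Type-2 rows with effective_start_date, effective_end_date, and
    current_flag. Each new version closes the previous one."""
    rows = []
    for i, (record, start) in enumerate(versions):
        is_last = i == len(versions) - 1
        end = None if is_last else versions[i + 1][1]
        rows.append({**record,
                     "effective_start_date": start,
                     "effective_end_date": end,
                     "current_flag": is_last})
    return rows

versions = [
    ({"product_id": "00001", "product_name": "Heater", "price": 250}, date(2024, 12, 10)),
    ({"product_id": "00001", "product_name": "Heater", "price": 500}, date(2024, 12, 11)),
]
for row in scd2_rows(versions):
    print(row)
```

Note the invariant: a row's `effective_end_date` equals the next version's `effective_start_date`, and exactly one row per key carries `current_flag = True`.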
SCD Type-2 is particularly valuable in CDC use cases, where capturing all data changes over time is crucial. It enables point-in-time analysis, provides detailed audit trails, aids in data quality management, and helps meet compliance requirements by preserving historical data.
In traditional implementations on data warehouses, SCD Type-2 requires specific handling in all `INSERT`, `UPDATE`, and `DELETE` operations that affect those additional columns. For example, to update the price of a product, you need to run the following query.
For modern data lakes, we propose a new approach to implementing SCD Type-2. With Iceberg, you can create a dedicated SCD Type-2 view on top of the change log view, eliminating the need to implement special handling for changes to SCD Type-2 tables. With this approach, you can keep managing Iceberg tables without the complexity of the SCD Type-2 specification, and whenever you need an SCD Type-2 snapshot of your Iceberg table, you can generate the corresponding representation on demand. This approach combines the power of Iceberg’s efficient data management with the historical tracking capabilities of SCD Type-2. By using the change log view, Iceberg can dynamically generate the SCD Type-2 structure without the overhead of maintaining additional tables or manually managing effective dates and flags.
This streamlined method not only makes the implementation of SCD Type-2 more straightforward, but also offers improved performance and scalability for handling large volumes of historical data in CDC scenarios. It represents a significant advancement in historical data management, merging traditional data warehousing concepts with modern big data capabilities.
As we delve deeper into Iceberg’s features, we’ll explore how this approach can be implemented, showcasing the efficiency and flexibility it brings to historical data analysis and CDC processes.
Prerequisites
The following prerequisites are required for the use cases:
Set up resources with AWS CloudFormation
Use a provided AWS CloudFormation template to set up resources to build Iceberg environments. The template creates the following resources:
Complete the following steps to deploy the resources.
- Choose Launch stack.
- For the parameters, IcebergDatabaseName is set by default. You can change the default value. Then, choose Next.
- Choose Next.
- Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
- Choose Submit.
- After the stack creation is complete, check the Outputs tab and make a note of the resource values, which are used in the following sections.
Next, configure the Iceberg JAR files to the session to use the Iceberg change log view feature. Complete the following steps.
- Download the following JAR files from the Iceberg releases page to your local machine:
  - The 1.6.1 Spark 3.3 with Scala 2.12 runtime JAR.
  - The 1.6.1 aws-bundle JAR.
- Open the Amazon S3 console and select the S3 bucket you created using the CloudFormation stack. The S3 bucket name can be found on the CloudFormation Outputs tab.
- Choose Create folder and create the `jars` path in the S3 bucket.
- Upload the two downloaded JAR files to `s3://<IcebergS3Bucket>/jars/` from the S3 console.
Upload a Jupyter Notebook on AWS Glue Studio
After launching the CloudFormation stack, create an AWS Glue Studio notebook to use Iceberg with AWS Glue.
- Download history.ipynb.
- Open AWS Glue Studio console.
- Under Create job, select Notebook.
- Select Upload Notebook, choose Choose file and upload the Notebook you downloaded.
- Select the IAM role name such as IcebergHistoryGlueJobRole that you created using the CloudFormation template. Then, choose Create notebook.
- For Job name at the top left of the page, enter `iceberg_history`.
- Choose Save.
Create an Iceberg table
To create an Iceberg table using a product dataset, complete the following steps.
- On the Jupyter Notebook that you created in Upload a Jupyter Notebook on AWS Glue Studio, run the following cell to use Iceberg with AWS Glue. Before running the cell, replace `<IcebergS3Bucket>` with the S3 bucket name where you uploaded the Iceberg JAR files.
- Initialize the SparkSession with Iceberg settings.
- Configure database and table names for an Iceberg table (`DB_TBL`) and the data warehouse path (`ICEBERG_LOC`). Replace `<IcebergS3Bucket>` with the S3 bucket from the CloudFormation Outputs tab.
- Run the following code to create the Iceberg table using the Spark DataFrame based on the product dataset.
- After creating the Iceberg table, run `SELECT * FROM iceberg_history_db.products ORDER BY product_id` to show the product data in the Iceberg table. Currently, the following five products are stored in the Iceberg table.
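The notebook cells referenced in these steps aren't reproduced here. As a rough sketch only (the catalog name `glue_catalog`, the warehouse path, and the exact settings are assumptions, not the notebook's cells), a SparkSession configured for Iceberg with the AWS Glue Data Catalog typically looks like the following:

```python
from pyspark.sql import SparkSession

# Assumed names; take the real bucket from the CloudFormation Outputs tab.
ICEBERG_LOC = "s3://<IcebergS3Bucket>/warehouse/"

spark = (
    SparkSession.builder
    # Enable Iceberg SQL extensions (MERGE INTO, CALL procedures, and so on).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register an Iceberg catalog backed by the AWS Glue Data Catalog.
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", ICEBERG_LOC)
    .getOrCreate()
)
```

With a session like this in place, tables are addressed as `glue_catalog.iceberg_history_db.products` in Spark SQL.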
Next, look up the historical changes for a product using Iceberg’s change log view feature.
Implement historical record lookup with Iceberg’s change log view
Suppose that there’s a source table whose table records are replicated to the Iceberg table through a Change Data Capture (CDC) process. When the records in the source table are updated, these changes are then mirrored in the Iceberg table. In this section, you look up the history of a given record for such a system to capture the history of product updates. For example, the following updates occur in the source table. Through the CDC process, these changes are applied to the Iceberg table.
- Upsert (update and insert) two records:
  - The price of `Heater` (product_id: `00001`) is updated from `250` to `500`.
  - A new product, `Chair` (product_id: `00006`), is added.
- `Television` (product_id: `00003`) is deleted.
To simulate the CDC workflow, you manually apply these changes to the Iceberg table in the notebook.
- Use the `MERGE INTO` query to upsert records. If an input record in the Spark DataFrame has the same `product_id` as an existing record, the existing record is updated. If no matching `product_id` is found, the input record is inserted into the Iceberg table.
- Delete `Television` from the Iceberg table by running the `DELETE` query.
- Then, run `SELECT * FROM iceberg_history_db.products ORDER BY product_id` to show the product data in the Iceberg table. You can confirm that the price of `Heater` is updated to `500`, `Chair` is added, and `Television` is deleted.
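`MERGE INTO` and `DELETE` here are Spark SQL statements against the Iceberg table, which aren't reproduced in this excerpt. Their effect on the data can be sketched with SQLite's upsert as a stand-in for the semantics (this is not the Spark syntax, and the `Television` and `Chair` prices below are placeholders, since the dataset's values aren't shown here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id TEXT PRIMARY KEY, product_name TEXT, price INTEGER)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    ("00001", "Heater", 250),
    ("00003", "Television", 600),  # placeholder price
])

# Upsert, mirroring MERGE INTO's WHEN MATCHED / WHEN NOT MATCHED behavior:
# update Heater's price and insert the new Chair record.
changes = [("00001", "Heater", 500), ("00006", "Chair", 150)]  # Chair price is a placeholder
conn.executemany("""
    INSERT INTO products VALUES (?, ?, ?)
    ON CONFLICT(product_id) DO UPDATE SET
        product_name = excluded.product_name, price = excluded.price
""", changes)

# Delete Television, as the DELETE query does on the Iceberg table.
conn.execute("DELETE FROM products WHERE product_name = 'Television'")

print(list(conn.execute("SELECT * FROM products ORDER BY product_id")))
# [('00001', 'Heater', 500), ('00006', 'Chair', 150)]
```

The key point is that the writer only expresses the business change; the history bookkeeping is left entirely to Iceberg's snapshots.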
For the Iceberg table, where changes from the source table are replicated, you can track the record changes using Iceberg’s change log view. To start, you first create a change log view from the Iceberg table.
- Run the `create_changelog_view` Iceberg procedure to create a change log view.
- Run the following query to retrieve the historical changes for `Heater`.
- The query result shows the historical changes to `Heater`. You can confirm from the output that the price of `Heater` was updated from `250` to `500`.
Using Iceberg’s change log view, you can obtain the history of a given record directly from the Iceberg table’s history, without needing to create a separate table for managing record history. Next, you implement Slowly Changing Dimension (SCD) Type-2 using the change log view.
Implement SCD Type-2 with Iceberg’s change log view
The SCD Type-2 based table retains the full history of record changes, and it can be used in multiple cases such as historical tracking, point-in-time analysis, regulatory compliance, and so on. In this section, you implement SCD Type-2 using the change log view (`products_clv`) that was created in the previous section. The change log view has a schema that’s similar to the schema defined in the SCD Type-2 specification. For this change log view, you add `effective_start`, `effective_end`, and `is_current` columns. To add these columns and then implement SCD Type-2, complete the following steps.
- Run the following query to implement SCD Type-2. In the `WITH ... AS (...)` section of the query, the change log view is joined with the Iceberg table snapshots using the `snapshot_id` key to include the commit time for each record change. You can obtain the table snapshots by querying `db.table.snapshots`. The rest of the query identifies both current and non-current entries by comparing the commit times for each product. It then sets the effective time for each product and marks whether a product is current based on the effective time and the change type from the change log view.
- The query result shows the SCD Type-2 based schema and records.
After the query result is displayed, this SCD Type-2 based table is stored as `scdt2` to allow access for further analysis.
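The logic that query implements can be sketched in plain Python: join change log rows to snapshot commit times on `snapshot_id`, then derive `effective_start`, `effective_end`, and `is_current` per product. This is an illustration of the described approach, not the query itself; the snapshot IDs and timestamps are made up, and `DELETE` rows (which close a record without a successor) are omitted for brevity:

```python
# Snapshot commit times, as exposed by db.table.snapshots (snapshot_id -> committed_at).
snapshots = {101: "2024-12-10 09:00:00", 102: "2024-12-11 09:00:00"}

# Change log rows: (product_id, product_name, price, _change_type, snapshot_id)
changelog = [
    ("00001", "Heater", 250, "INSERT", 101),
    ("00001", "Heater", 250, "UPDATE_BEFORE", 102),
    ("00001", "Heater", 500, "UPDATE_AFTER", 102),
]

def scd2_view(changelog, snapshots):
    """Derive SCD Type-2 rows from a change log joined to snapshot commit times."""
    # Keep only rows that introduce a state; UPDATE_BEFORE describes the replaced state.
    states = [r for r in changelog if r[3] in ("INSERT", "UPDATE_AFTER")]
    states.sort(key=lambda r: (r[0], snapshots[r[4]]))
    rows = []
    for i, (pid, name, price, _change, snap) in enumerate(states):
        nxt = states[i + 1] if i + 1 < len(states) else None
        is_current = nxt is None or nxt[0] != pid  # last state for this product_id
        rows.append({
            "product_id": pid, "product_name": name, "price": price,
            "effective_start": snapshots[snap],
            "effective_end": None if is_current else snapshots[nxt[4]],
            "is_current": is_current,
        })
    return rows

for row in scd2_view(changelog, snapshots):
    print(row)
```

Because the view is derived on demand, no writer ever has to maintain the effective dates or current flags by hand.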
SCD Type-2 is useful for many use cases. To explore how this SCD Type-2 implementation can be used to track the history of table records, run the following example queries.
- Run the following query to retrieve deleted or updated records in a specific period. This query captures which records were changed during that timeframe, allowing you to audit changes for further use cases such as trend analysis, regulatory compliance checks, and so on. Before running the query, replace `<START_DATETIME>` and `<END_DATETIME>` with a specific time range such as `2024-10-24 17:18:00` and `2024-10-24 17:20:00`.
- The query result shows the deleted and updated records in the specified period. You can confirm that the price of `Heater` was updated and `Television` was deleted from the table.
- As another example, run the following query to retrieve the latest records from the SCD Type-2 table by filtering with `is_current = true` for current data reporting.
- The query result shows the current table records, reflecting the updated price of `Heater`, the deletion of `Television`, and the addition of `Chair` after the initial records.
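Both example queries are, in essence, simple filters over the SCD Type-2 columns. A plain-Python sketch (the rows and timestamps below are illustrative, reusing the column names from this section; `Television` and `Chair` prices are placeholders):

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"

# Illustrative SCD Type-2 rows. Television has no current row: it was deleted.
scd2 = [
    {"product_name": "Heater", "price": 250,
     "effective_end": "2024-10-24 17:19:00", "is_current": False},
    {"product_name": "Heater", "price": 500,
     "effective_end": None, "is_current": True},
    {"product_name": "Television", "price": 600,
     "effective_end": "2024-10-24 17:19:00", "is_current": False},
    {"product_name": "Chair", "price": 150,
     "effective_end": None, "is_current": True},
]

def changed_between(rows, start, end):
    """Records whose version was closed (deleted or superseded) in [start, end)."""
    s, e = datetime.strptime(start, fmt), datetime.strptime(end, fmt)
    return [r for r in rows
            if r["effective_end"] is not None
            and s <= datetime.strptime(r["effective_end"], fmt) < e]

def current(rows):
    """The latest records: the is_current = true filter."""
    return [r for r in rows if r["is_current"]]

print([r["product_name"] for r in changed_between(scd2, "2024-10-24 17:18:00", "2024-10-24 17:20:00")])
print(sorted(r["product_name"] for r in current(scd2)))
```

The first filter answers the audit question ("what changed in this window?"), the second the reporting question ("what is true now?"), without touching the base table.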
You have now successfully implemented SCD Type-2 using the change log view. This SCD Type-2 implementation allows you to track the history of table records. For example, you can use it to search for deleted or updated products such as `Heater` and `Chair` in a specific period. Additionally, you can retrieve the current table records by querying the SCD Type-2 table with `is_current = true`. Using Iceberg’s change log view enables you to implement SCD Type-2 without making any changes to the Iceberg table itself. It also eliminates the need for creating or managing an additional table for SCD Type-2.
Clean up
To clean up the resources used in this post, complete the following steps:
- Open the Amazon S3 console.
- Select the S3 bucket `aws-glue-assets-<ACCOUNT_ID>-<REGION>` where the notebook file (`iceberg_history.ipynb`) is stored, and delete the notebook file in the notebook path.
- Select the S3 bucket you created using the CloudFormation template. You can obtain the bucket name from the IcebergS3Bucket key on the CloudFormation Outputs tab. After selecting the bucket, choose Empty to delete all objects.
- After you confirm the bucket is empty, delete the CloudFormation stack `iceberg-history-baseline-resources`.
Considerations
Here are important considerations:
- The change log view does not lose any historical record changes even when the following operations are performed:
- The change log view loses the historical record changes that correspond to snapshots deleted with `expire_snapshots` or through Glue Data Catalog automatic snapshot deletion.
- The change log view is not supported on merge-on-read (MoR) tables.
Conclusion
In this post, we explored how to look up the history of records and tables using Apache Iceberg. The walkthrough demonstrated how to use the change log view to look up the history of records, as well as the history of tables with SCD Type-2. With this method, you can manage the history of records and tables without extra effort.
About the Authors
Tomohiro Tanaka is a Senior Cloud Support Engineer at Amazon Web Services. He’s passionate about helping customers use Apache Iceberg for their data lakes on AWS. In his free time, he enjoys a coffee break with his colleagues and making coffee at home.
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team, based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.