Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. While traditional extract, transform, and load (ETL) processes have long been a staple of data integration due to its flexibility, for common use cases such as replication and ingestion, they often prove time-consuming, complex, and less adaptable to the fast-changing demands of modern data architectures.

In addition, organizations rely on an increasingly diverse array of digital systems, data fragmentation has become a significant challenge. Valuable information is often scattered across multiple repositories, including databases, applications, and other platforms. To harness the full potential of their data, businesses must enable seamless access and consolidation from these varied sources. However, this task is complicated by the unique characteristics of modern systems, such as differing API protocols, implementations, and rate limits. To address these challenges and accelerate innovation, AWS Glue has recently expanded its third-party application support by introducing native connectors for 19 applications.

To utilize these new application connectors for well-defined use cases such as replication and ingestion, AWS Glue is also launching zero-ETL integration support from external applications. With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift.

Amazon SageMaker Lakehouse unifies all your data across Amazon S3 data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines. By directly integrating with Lakehouse, all the data is automatically cataloged and can be secured through fine-grained permissions in Lake Formation.

What is zero-ETL?

Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines. It makes data available in Amazon SageMaker Lakehouse and Amazon Redshift from multiple operational, transactional, and enterprise sources. Extract, transform, and load (ETL) is the process of combining, cleaning, and normalizing data from different sources to prepare it for analytics, artificial intelligence (AI), and machine learning (ML) workloads. You don’t need to maintain complex ETL pipelines. We take care of the ETL for you by automating the creation and management of data replication.

What’s the difference between zero-ETL and Glue ETL?

AWS Glue now offers multiple ways for you to build data integration pipelines, depending on your integration needs.

Zero-ETL provides service-managed replication. It’s designed for scenarios where customers need a fully managed, efficient way to replicate data from one source to AWS with minimal configuration. Zero-ETL handles the entire replication process, including schema discovery and evolution, without requiring customers to write or manage any custom logic. This approach is ideal for creating up-to-date replicas of source data in near-real-time, with AWS managing the underlying infrastructure and replication process.
Glue ETL offers customer-managed data ingestion. It’s the preferred choice when customers need more control and customization over the data integration process or require complex transformations. With Glue ETL, customers can write custom transformation logic, combine data from multiple sources, apply data quality rules, add calculated fields, and perform advanced data cleansing or aggregation. This flexibility makes Glue ETL suitable for scenarios where data must be transformed or enriched before analysis.

It’s worth mentioning that the source connections are reusable between Glue ETL and Glue zero-ETL so that can easily support both patterns. After you create a connection once, you can choose to use the same connection across various AWS Glue components including Glue ETL, Glue Visual ETL and zero-ETL. For example, you might start by creating a connection and a zero-ETL integration, but decide later to use the same connection to create a custom GlueETL pipeline.

This blog post will explore how zero-ETL capabilities combined with its new application connectors are transforming the way businesses integrate and analyze their data from popular platforms such as ServiceNow, Salesforce, Zendesk, SAP and others.

Use case

Consider a large company that relies heavily on data-driven insights to optimize its customer support processes. The company stores vast amounts of transactional data in ServiceNow. To gain a comprehensive understanding of their business and make informed decisions, the company needs to integrate and analyze data from ServiceNow seamlessly, identifying and addressing problems and root causes, managing service level agreements and compliance, and proactively planning for incident prevention.

The company is looking for an efficient, scalable, and cost-effective solution to collecting and ingesting data from ServiceNow, ensuring continuous near real-time replication, automated availability of new data attributes, robust monitoring capabilities to track data load statistics, and reliable data lake foundation supporting data versioning. This allows data analysts, data engineers, and data scientists to quickly explore ingested data and develop data products that meet the needs of business teams.

Solution overview

The following architecture diagram illustrates an efficient and scalable solution for collecting and ingesting replicated data from ServiceNow with zero-ETL integration. In this example we use ServiceNow as a source, but this can be done with any supported source such as Salesforce, Zendesk, SAP, or others. The AWS Glue managed connectors act as a bridge between ServiceNow and the target Amazon SageMaker Lakehouse, enabling seamless, near real-time data flow without the need for custom ETL and scheduling.

The following are the key components and steps in the integration process:

Zero-ETL extracts and loads the data into Amazon S3, a highly scalable object storage service. The data is also registered in the Glue Data Catalog, a metadata repository. Additionally, it keeps the information synchronized by capturing changes that occur in ServiceNow and maintains data consistency by automatically performing schema evolution.
Amazon CloudWatch, a monitoring and observability service, collects logs and metrics from the data integration process.
Amazon EventBridge, a serverless event bus service, triggers a downstream process that allows you to build event-driven architecture as soon as your new data arrives in your target. Through EventBridge, customers can build on top of zero-ETL for a diverse set of use cases such as:

Prerequisites

Complete the following prerequisites before setting up the solution:

Create a bucket in Amazon S3 called zero-etl-demo-<your AWS Account Number>-<AWS Region> (for example, zero-etl-demo-012345678901-us-east-1). The bucket will be used to store the data ingested by zero-ETL in Apache Iceberg which is an open table format (OTF) supporting ACID transactions (atomicity, consistency, isolation, and durability), seamless schema evolution, and data versioning using time travel.
Create an AWS Glue database <your database name>, such as zero_etl_demo_db and associate the S3 bucket zero-etl-demo-<your AWS Account Number>-<AWS Region> as a location of the database. The database will be used to store the metadata related to the data integrations performed by zero-ETL.
Update AWS Glue Data Catalog settings using the following IAM policy for fine-grained access control of the data catalog for zero-ETL.
Create an AWS Identity and Access Management (IAM) role named zero_etl_demo_role. The IAM role will be used by zero-ETL to access the Glue Connector to read from the Service Now and write the data into the target. Optionally, you can create two separate IAM roles (one associated with your source data and another associated with your target).
Make sure you have a ServiceNow instance named ServiceNowInstance, a user named ServiceNowUser, and a password passwordServiceNowPassword with the required permissions to read from ServiceNow. The instance name, user, and password are used in the AWS Glue connection to authenticate within ServiceNow using the BASIC authentication type. Optionally, you can choose OAUTH2 if your ServiceNow supports it.
Create the secret zero_etl_demo_secret in AWS Secrets Manager to store ServiceNow credentials.

Build and verify the zero-ETL integration

Complete the following steps to create and validate zero-ETL integration:

Step 1: Set up a connector

Zero-ETL integration, when used with AWS Glue natively supported applications connectors, provides a straightforward way to bring third-party data into an Amazon S3 transactional data lake or Amazon Redshift. Use the following steps to create a ServiceNow data connection:

Open the AWS Glue console.
In the navigation pane, under Data catalog, choose Connections.
Choose Create Connection.
In the Create Connection pane, enter ServiceNow in Data Sources.
Choose ServiceNow.
Choose Next.
For Instance Name, enter ServiceNowInstance (created as part of the prerequisites).
For IAM service role, choose the zero_etl_demo_role (created as part of the prerequisites).
For Authentication Type, choose the authentication type that you’re using for ServiceNow. In this example. we have chosen OAUTH2, which requires the set up of Application Registries in ServiceNow.
For AWS Secret, choose the secret zero_etl_demo_secret (created as part of the prerequisites).
Choose Next.
In the Connection Properties section, for Name, enter zero_etl_demo_conn.
Choose Next.
Choose Create connection.
There will be a popup from ServiceNow after you choose Create connection. Choose Allow.

Step 2: Set up Zero-ETL integration

After creating the data connection to ServiceNow, use the following steps to create the zero-ETL integration:

Open the AWS Glue console.
In the navigation pane, under Data catalog, choose Zero-ETL integrations.
Choose Create zero-ETL integration.
In the Create integration pane, enter ServiceNow in Data Sources.
Choose ServiceNow.
Choose Next.
For ServiceNow connection, choose the data connection created on Step 1—zero_etl_demo_conn.
For Source IAM role, choose the zero_etl_demo_role (from the prerequisites).
For ServiceNow objects, choose the objects you want to perform the ingestion managed by zero-ETL integration. For this post, choose problem and incident objects.
For Namespace or Database, choose <your database name>. In this example, we use the zero_etl_demo_db (from the prerequisites).
For Target IAM role, choose the zero_etl_demo_role (from the prerequisites).
Choose Next.
For Security and data encryption, you can choose either AWS Managed KMS Key or choose a customer KMS key managed by AWS Key Management Service. For this post, choose Use AWS managed KMS key.
In the Integration details section, for Name, enter zero-etl-demo-integration.
Choose Next.
Review the details and choose Create and launch integration.
The newly created integration will show as Active in about a minute.

Step 3: Verify the initial SEED load

The SEED load refers to the initial loading of the tables that you want to ingest into an Amazon SageMaker Lakehouse using zero-ETL integration. The status and statistics of the SEED load are published into CloudWatch and the data ingested by zero-ETL integration can be accessed in AWS using a set of services such Amazon Sagemaker Unified Studio, Amazon QuickSight, and others. Use the following steps to access zero-ETL integration logs and query the data:

Open the AWS Glue console.
In the navigation pane, choose Zero-ETL integrations.
In the Zero-ETL integrations section, choose zero-etl-demo-integration.
In the Activity summary (all time) section, choose CloudWatch logs.
Check CloudWatch log events for the SEED Load. For each table ingested by the zero-ETL integration, two groups of logs are created: status and statistics. Highlighted in the following screenshot in IngestionTableStatistics are the statistics. The insertCount represents how many rows were extracted and loaded by zero-ETL integration. For the SEED load, you will always see only insertCount because it’s the initial load. In addition, in IngestionCompleted you will find information about the Zero-ETL integration such as status, load type, and message.

To validate the SEED load, query the data using Amazon Sagemaker Unified Studio.

Access Amazon Sagemaker Unified Studio for your specific domain through your AWS Console.
Open the Amazon SageMaker Unified Studio URL.
Sign in with SSO or AWS IAM user.
Select your project.
Go to Data from the left menu, expand the Lakehouse AWSDataCatalog, expand your database, and select the incident table. Click the ⋮ icon and select Query with Athena.

For Query, enter the following statement:

SELECT count(*) AS incidents_count
FROM "zero_etl_demo_db"."incident"

Choose Run.
Let’s check an existing incident in ServiceNow. This is the incident that you will update the description of in ServiceNow to validate change data capture (CDC). In the query editor, pane, for Query, enter the following statement:
```
SELECT number
, short_description
, description
FROM "zero_etl_demo_db"."incident"
WHERE number="INC0000003" -- update to your Incident number
```
Choose Run.

Step 4: Validate CDC

The CDC load is a technique used to identify and process only the data that has changed in a source system since the last extraction. Instead of reloading an entire dataset, CDC captures and transfers only the new, updated, or deleted records into the target system, making data processing more efficient and reducing load times. The status and statistics of the CDC load are published into CloudWatch. For this post, you will use Amazon SageMaker unified studio to query the data ingested. Use the following steps to access zero-ETL integration logs and query the data ingested. For the next step in this example, you will select an incident and perform an update in ServiceNow, changing the short_description and description of the incident.

To demonstrate CDC event, in this blog we are going to edit 1 incident and delete 1 incident in ServiceNow.
Open the AWS Glue console.
In the navigation pane, under Data catalog, choose Zero-ETL integrations.
In the Zero-ETL integrations section, choose zero-etl-demo-integration.
In the Activity summary (all time) section, choose CloudWatch logs.
Zero-ETL integration replicates the changes to the Amazon S3 transactional data lake every 60 minutes by default. Check CloudWatch log events for the CDC load. Shown in the following figure in IngestionTableStatistics, review updateCount and deleteCount for each specific object managed by zero-ETL integration. It’s applying the updates and deletes that happened in ServiceNow to the transactional data lake.

To validate the CDC load, query the data using Amazon SageMaker Unified Studio.

You can go back to Amazon SageMaker Unified Studio.

For Query, enter the following statement:

SELECT count(*) AS incidents_count
FROM "zero_etl_demo_db"."incident"

For Query, enter the following statement to record initial snapshot results before CDC:

SELECT number
    , short_description
    , description
FROM "zero_etl_demo_db"."incident"
WHERE number="INC0000003" -- update to your Incident number

Choose Run and confirm that one record was updated in short_description and description attributes.

By following these steps, you can effectively set up, build, and verify a zero-ETL job using the new AWS Glue application connector for ServiceNow. This process demonstrates the simplicity and efficiency of the zero-ETL approach in integrating applications data into your AWS environment.

Apache Iceberg Time Travel: Enhancing data versioning in zero-ETL

One of the benefits of using Apache Iceberg in zero-ETL integration is the ability to perform Time Travel. This feature allows you to access and query historical versions of your data effortlessly. With Iceberg Time Travel, you can easily roll back to previous data states, compare data across different points in time, or recover from accidental data changes. In the context of zero-ETL integrations, this capability becomes particularly valuable when dealing with rapidly changing applications data.

To demonstrate this feature, let’s consider a scenario where you’re analyzing ServiceNow incident data ingested through zero-ETL integration using Amazon SageMaker Unified Studio. Here’s an example query that showcases Iceberg time travel:

-- Query incident data as of particular timestamp before CDC
SELECT number,
    short_description,
    description
FROM "zero_etl_demo_db"."incident" 
FOR TIMESTAMP AS OF TIMESTAMP '2024-11-06 05:10:00 UTC' 
-- update this timestamp value to before your CDC update
WHERE number="INC0000003" -- update to your Incident number
-- Compare with current data
SELECT number,
    short_description,
    description
FROM "zero_etl_demo_db"."incident"
WHERE number="INC0000003" -- update to your Incident number

In this example:

The first query uses the FOR TIMESTAMP AS OF clause for time travel queries on Iceberg tables. It retrieves incident data as it existed before CDC update for the specific incident number INC0000003.
The second query fetches the current state of the data for the same incident number.

This capability allows you to track the evolution of incidents, identify trends in resolution times, or recover information that may have been inadvertently altered.

Clean up

To avoid incurring future charges, remove up the resources used in this post from your AWS account by completing the following steps:

Delete zero-ETL integration zero-etl-demo-integration.
Delete content from the S3 bucket zeroetl-etl-demo-<your AWS Account Number>-<AWS Region>.
Delete the Data Catalog database zero_etl_demo_db.
Delete the Data Catalog connection zero_etl_demo_conn.
Delete the AWS Secrets manager Secret.

Conclusion

As the pace of business continues to accelerate, the ability to quickly and efficiently integrate data from various applications and enterprise platforms has become a critical competitive advantage. By adopting a zero-ETL integration powered by AWS Glue and its new set of managed connectors, you organization can unlock the full potential of its data across multiple platforms faster and stay ahead of the curve.

To learn more about how AWS Amazon SageMaker Lakehouse can help your organization streamline its data integration efforts, visit Amazon SageMaker Lakehouse.

Get started with zero-ETL on AWS by creating a free account today!

About the authors

Shovan Kanjilal is a Senior Analytics and Machine Learning Architect with Amazon Web Services. He is passionate about helping customers build scalable, secure and high-performance data solutions in the cloud.

Vivek Pinyani is a Data Architect at AWS Professional Services with expertise in Big Data technologies. He focuses on helping customers build robust and performant Data Analytics solutions and Data Lake migrations. In his free time, he loves to spend time with his family and enjoys playing cricket and running.

Kartikay Khator is a Solutions Architect within Global Life Sciences at AWS, where he dedicates his efforts to developing innovative and scalable solutions that cater to the evolving needs of customers. His expertise lies in harnessing the capabilities of AWS analytics services. Extending beyond his professional pursuits, he finds joy and fulfillment in the world of running and hiking. Having already completed multiple marathons, he is currently preparing for his next marathon challenge.

Caio Sgaraboto Montovani is a Sr. Specialist Solutions Architect, Data Lake and AI/ML within AWS Professional Services, developing scalable solutions according customer needs. His vast experience has helped customers in different industries such as life sciences and healthcare, retail, banking, and aviation build solutions in data analytics, machine learning, and generative AI. He is passionate about rock and roll and cooking and loves to spend time with his family.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!