The need to integrate diverse data sources has grown exponentially, but there are several common challenges when integrating and analyzing data from multiple sources, services, and applications. First, you need to create and maintain independent connections to the same data source for different services. Second, the data connectivity experience is inconsistent across different services. For each service, you need to learn the supported authorization and authentication methods, data access APIs, and framework to onboard and test data sources. Third, some services require you to set up and manage compute resources used for federated connectivity, and capabilities like connection testing and data preview aren’t available in all services. This fragmented, repetitive, and error-prone experience for data connectivity is a significant obstacle to data integration, analysis, and machine learning (ML) initiatives.
To solve for these challenges, we launched Amazon SageMaker Lakehouse unified data connectivity. This feature offers the following capabilities and benefits:
- With SageMaker Lakehouse unified data connectivity, you can set up a connection to a data source using a connection configuration template that is standardized for multiple services. Amazon SageMaker Unified Studio, AWS Glue, and Amazon Athena can share and reuse the same connection with proper permission configuration.
- SageMaker Lakehouse unified data connectivity supports standard methods for data source connection authorization and authentications, such as basic authorization and OAuth2. This approach simplifies your data journey and helps you meet your security requirements.
- The SageMaker Lakehouse data connection testing capability boosts your confidence in established connections. With the ability to browse metadata, you can understand the structure and schema of the data source, identify relevant tables and fields, and discover useful data assets you may not be aware of.
- SageMaker Lakehouse unified data connectivity’s data preview capability helps you map source fields to target schemas, identify needed data transformation, and plan data standardization and normalization steps.
- SageMaker Lakehouse unified data connectivity provides a set of APIs for you to use without the need to learn different APIs for various data sources, promoting coding efficiency and productivity.
With SageMaker Lakehouse unified data connectivity, you can confidently connect, explore, and unlock the full value of your data across AWS services and achieve your business objectives with agility.
This post demonstrates how SageMaker Lakehouse unified data connectivity helps your data integration workload by streamlining the establishment and management of connections for various data sources.
Solution overview
In this scenario, an e-commerce company sells products on their online platform. The product data is stored on Amazon Aurora PostgreSQL-Compatible Edition. Their existing business intelligence (BI) tool runs queries on Athena. Furthermore, they have a data pipeline to perform extract, transform, and load (ETL) jobs when moving data from the Aurora PostgreSQL database cluster to other data stores.
Now they have a new requirement to allow ad-hoc queries through SageMaker Unified Studio to enable data engineers, data analysts, sales representatives, and others to take advantage of its unified experience.
In the following sections, we demonstrate how to set up this connection and run queries using different AWS services.
Prerequisites
Before you begin, make sure you have the followings:
- An AWS account.
- A SageMaker Unified Studio domain.
- An Aurora PostgreSQL database cluster.
- A virtual private cloud (VPC) and private subnets required for SageMaker Unified Studio.
- An Amazon Simple Storage Service (Amazon S3) bucket to store output from the AWS Glue ETL jobs. In the following steps, replace
amzn-s3-demo-destination-bucket
with the name of the S3 bucket. - An AWS Glue Data Catalog database. In the following steps, replace
<your_database>
with the name of your database.
Create an IAM role for the AWS Glue job
You can either create a new AWS Identity and Access Management (IAM) role or use an existing role that has permission to access the AWS Glue output bucket and AWS Secrets Manager.
If you want to create a new one, complete the following steps:
- On the IAM console, in the navigation pane, choose Roles.
- Choose Create role.
- For Trusted entity type, choose AWS service.
- For Service or use case, choose Glue.
- Choose Next.
- For Add permissions, choose
AWSGlueServiceRole
, then choose Next. - For Role name, enter a role name (for this post,
GlueJobRole-demo
). - Choose Create role.
- Choose the created IAM role.
- Under Permissions policies, choose Add permission and Create inline policy.
- For Policy editor, choose JSON, and enter the following policy:
- Choose Next.
- For Policy name, enter a name for your policy.
- Choose Create policy.
Create a SageMaker Lakehouse data connection
Let’s get started with the unified data connection experience. The first step is to create a SageMaker Lakehouse data connection. Complete the following steps:
- Sign in to your SageMaker Unified Studio.
- Open your project.
- On your project, in the navigation pane, choose Data.
- Choose the plus sign.
- For Add data source, choose Add connection. Choose Next.
- Select PostgreSQL, and choose Next.
- For Name, enter
postgresql_source
. - For Host, enter your host name of your Aurora PostgreSQL database cluster.
- For Port, enter your port number of your Aurora PostgreSQL database cluster (by default, it’s 5432).
- For Database, enter your database name.
- For Authentication, select Username and password.
- Enter your username and password.
- Choose Add data.
After the completion, it will create a new AWS Secrets Manager secret with a name like SageMakerUnifiedStudio-Glue-postgresql_source
to securely store the specified username and password. It also creates a Glue connection with the same name postgresql_source
.
Now you have a unified connection for Aurora PostgreSQL-Compatible.
Load data into the PostgreSQL database through the notebook
You will use a JupyterLab notebook on SageMaker Unified Studio to load sample data from an S3 bucket into a PostgreSQL database using Apache Spark.
- On the top left menu, choose Build, and under IDE & APPLICATIONS, choose JupyterLab.
- Choose Python 3 under Notebook.
- For the first cell, choose Local Python, python, enter following code, and run the cell:
- For the second cell, choose PySpark, spark, enter following code, and run the cell:
The code snippet reads the sample data Parquet files from the specified S3 bucket location and stores the data in a Spark DataFrame named df
. The df.show()
command displays the first 20 rows of the DataFrame, allowing you to preview the sample data in a tabular format. Next, you will load this sample data into a PostgreSQL database.
- For the third cell, choose PySpark, spark, enter following code, and run the cell (replace
<account-id>
with your AWS account ID): - For the fourth cell, choose PySpark, spark, enter following code, and run the cell:
Let’s see if you could successfully create the new table unified_connection_test
. You can navigate to the project’s Data page to visually verify the existence of the newly created table.
- On the top left menu, choose your project name, and under CURRENT PROJECT, choose Data.
Within the Lakehouse section, expand the postgresql_source
, then the public schema, and you should find the newly created unified_connection_test
table listed there. Next, you will query the data in this table using SageMaker Unified Studio’s SQL query book feature.
Run queries on the connection through the query book using Athena
Now you can run queries using the connection you created. In this section, we demonstrate how to use the query book using Athena. Complete the following steps:
- In your project on SageMaker Unified Studio, choose the Lakehouse section, expand the
postgresql_source
, then the public - On the options menu (three vertical dots) of the table
unified_connection_test
, choose Query with Athena.
This step will open a new SQL query book. The query statement select * from "postgresql_source"."public"."unified_connection_test" limit 10;
is automatically filled.
- On the Actions menu, choose Save to Project.
- For Querybook title, enter the name of your SQL query book.
- Choose Save changes.
This will save the current SQL query book, and the status of the notebook will change from Draft to Saved. If you want to revert a draft notebook to its last published state, choose Revert to published version to roll back to the most recently published version. Now, let’s start running queries on your notebook.
- Choose Run all.
When a query finishes, results can be viewed in a few formats. The table view displays query results in a tabular format. You can download the results as JSON or CSV files using the download icon at the bottom of the output cell. Additionally, the notebook provides a chart view to visualize query results as graphs.
The sample data includes a column star_rating
representing a 5-star rating for products. Let’s try a quick visualization to analyze the rating distribution.
- Choose Add SQL to add a new cell.
- Enter the following statement:
- Choose the run icon of the cell, or you can press Ctrl+Enter or Cmd+Enter to run the query.
This will display the results in the output panel. Now you have learned how the connection works on SageMaker Unified Studio. Next, we show how you can use the connection on AWS Glue consoles.
Run Glue ETL jobs on the connection on the AWS Glue console
Next, we create an AWS Glue ETL job that reads table data from the PostgreSQL connection, converts data types, transforms the data into Parquet files, and outputs them to Amazon S3. It also creates a table in the Glue Data Catalog and add partitions so downstream data engineers can immediately use the table data. Complete the following steps:
- On the AWS Glue console, choose Visual ETL in the navigation pane.
- Under Create job, choose Visual ETL.
- At the top of the job, replace “Untitled job” with a name of your choice.
- On the Job Details tab, under Basic properties, specify the IAM role that the job will use (
GlueJobRole-demo
). - For Glue version, choose Glue version 4.0
- Choose Save.
- On the Visual tab, choose the plus sign to open the Add nodes
- Search for postgresql and add PostgreSQL as Source.
- For JDBC source, choose JDBC connection details.
- For PostgreSQL connection, choose
postgresql_source
. - For Table name, enter
unified_connection_test
- As a child of this source, search in the Add nodes menu for timestamp and choose To Timestamp.
- For Column to convert, choose
review_date
. - For Column type, choose iso.
- On the Visual tab, search in the Add nodes menu for s3 and add Amazon S3 as Target.
- For Format, choose Parquet.
- For Compression Type, choose Snappy.
- For S3 Target Location, enter your S3 output location (
s3://amzn-s3-demo-destination-bucket
). - For Data Catalog update options, choose Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
- For Database, enter your Data Catalog database (
<your_database>
). - For Table name, enter
connection_demo_tbl
. - Under Partition keys, choose Add a partition key, and choose
review_year
. - Choose Save, then choose Run to run the job.
When the job is complete, it will output Parquet files to Amazon S3 and create a table named connection_demo_tbl
in the Data Catalog. You have now learned that you can use the SageMaker Lakehouse data connection not only in SageMaker Unified Studio, but also directly in AWS Glue console without needing to create separate individual connections.
Clean up
Now to the final step, cleaning up the resources. Complete the following steps:
- Delete the connection.
- Delete the Glue job.
- Delete the AWS Glue output S3 buckets.
- Delete the IAM role
AWSGlueServiceRole
. - Delete the Aurora PostgreSQL cluster.
Conclusion
This post demonstrated how the SageMaker Lakehouse unified data connectivity works end to end, and how you can use the unified connection across different services such as AWS Glue and Athena. This new capability can simplify your data journey.
To learn more, refer to Amazon SageMaker Unified Studio.
About the Authors
Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team. She is passionate about helping customers build data lakes using ETL workloads. She loves planetary science and enjoys studying the asteroid Ryugu on weekends.
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.
Shubham Agrawal is a Software Development Engineer on the AWS Glue team. He has expertise in designing scalable, high-performance systems for handling large-scale, real-time data processing. Driven by a passion for solving complex engineering problems, he focuses on building seamless integration solutions that enable organizations to maximize the value of their data.
Joju Eruppanal is a Software Development Manager on the AWS Glue team. He strives to delight customers by helping his team build software. He loves exploring different cultures and cuisines.
Julie Zhao is a Senior Product Manager at AWS Glue. She joined AWS in 2021 and brings three years of startup experience leading products in IoT data platforms. Prior to startups, she spent over 10 years in networking with Cisco and Juniper across engineering and product. She is passionate about building products to solve customer problems.