Amazon SageMaker Lakehouse enables a unified, open, and secure lakehouse platform on your existing data lakes and warehouses. Its unified data architecture supports data analysis, business intelligence, machine learning, and generative AI applications, which can now take advantage of a single authoritative copy of data. With SageMaker Lakehouse, you get the best of both worlds: the flexibility to use cost-effective Amazon Simple Storage Service (Amazon S3) storage with the scalable compute of a data lake, along with the performance, reliability, and SQL capabilities typically associated with a data warehouse.
SageMaker Lakehouse enables interoperability by providing open source Apache Iceberg REST APIs to access data in the lakehouse. Customers can now use their choice of tools and a wide range of AWS services such as Amazon Redshift, Amazon EMR, Amazon Athena and Amazon SageMaker, in addition to third-party analytics engines that are compatible with Apache Iceberg REST specifications to query their data in-place.
Finally, SageMaker Lakehouse now provides secure and fine-grained access controls on data in both data warehouses and data lakes. With resource permission controls from AWS Lake Formation integrated into the AWS Glue Data Catalog, SageMaker Lakehouse lets customers securely define and share access to a single authoritative copy of data across their entire organization.
Organizations managing workloads in AWS analytics and Databricks can now use this open and secure lakehouse capability to unify policy administration and oversight of their data lake in Amazon S3. In this post, we will show you how Databricks on AWS general purpose compute can integrate with the AWS Glue Iceberg REST Catalog for metadata access and use Lake Formation for data access. To keep the setup in this post straightforward, the Glue Iceberg REST Catalog and Databricks cluster share the same AWS account.
Solution overview
In this post, we show how tables cataloged in Data Catalog and stored on Amazon S3 can be consumed from Databricks compute using Glue Iceberg REST Catalog with data access secured using Lake Formation. We will show you how the cluster can be configured to interact with Glue Iceberg REST Catalog, use a notebook to access the data using Lake Formation temporary vended credentials, and run analysis to derive insights.
The following figure shows the architecture described in the preceding paragraph.
Prerequisites
To follow along with the solution presented in this post, you need the following AWS prerequisites:
- Access to a Lake Formation data lake administrator in your AWS account. A Lake Formation data lake administrator is an IAM principal that can register Amazon S3 locations, access the Data Catalog, grant Lake Formation permissions to other users, and view AWS CloudTrail logs. See Create a data lake administrator for more information.
- Enable full table access for external engines to access data in Lake Formation.
- Sign in to the Lake Formation console as an IAM administrator and choose Administration in the navigation pane.
- Choose Application integration settings and select Allow external engines to access data in Amazon S3 locations with full table access.
- Choose Save.
- An existing AWS Glue database and tables. For this post, we will use an AWS Glue database named icebergdemodb, which contains an Iceberg table named person, with data stored in an S3 general purpose bucket named icebergdemodatalake.
- A user-defined IAM role that Lake Formation assumes when accessing the data in the preceding S3 location to vend scoped credentials. Follow the instructions provided in Requirements for roles used to register locations. For this post, we will use the IAM role LakeFormationRegistrationRole.
In addition to the AWS prerequisites, you need access to a Databricks workspace on AWS and the ability to create a cluster with No isolation shared access mode.
Set up an instance profile role. For instructions on how to create and set up the role, see Manage instance profiles in Databricks. Create a customer managed policy named dataplane-glue-lf-policy with the following statements and attach it to the instance profile role:
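The exact policy from the original setup isn't reproduced here; the following is a minimal sketch of the permissions the instance profile role needs: read and write access to the Data Catalog through the Glue Iceberg REST endpoint, plus the Lake Formation credential vending call. The statement names are descriptive only, and you should scope the Resource entries down for your environment.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueIcebergRestCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:GetCatalog",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables",
        "glue:UpdateTable"
      ],
      "Resource": "*"
    },
    {
      "Sid": "LakeFormationCredentialVending",
      "Effect": "Allow",
      "Action": "lakeformation:GetDataAccess",
      "Resource": "*"
    }
  ]
}
```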
For this post, we will use an instance profile role named databricks-dataplane-instance-profile-role, which will be attached to the cluster created later in this post.
Register the Amazon S3 location as the data lake location
Registering an Amazon S3 location with Lake Formation associates an IAM role that has read/write permissions to that S3 location. In this case, you register the icebergdemodatalake bucket location using the LakeFormationRegistrationRole IAM role.
After the location is registered, Lake Formation assumes the LakeFormationRegistrationRole role when it grants temporary credentials to the integrated AWS services and compatible third-party analytics engines (see the prerequisites) that access data in that S3 bucket location.
To register the Amazon S3 location as the data lake location, complete the following steps:
- Sign in to the AWS Management Console for Lake Formation as the data lake administrator.
- In the navigation pane, choose Data lake locations under Administration.
- Choose Register location.
- For Amazon S3 path, enter s3://icebergdemodatalake.
- For IAM role, select LakeFormationRegistrationRole.
- For Permission mode, select Lake Formation.
- Choose Register location.
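If you prefer to script this step, the console actions above correspond to the Lake Formation RegisterResource API. A minimal boto3 sketch follows; the account ID 111122223333 is a placeholder for your own.

```python
import boto3

lf = boto3.client("lakeformation")

# Register the bucket as a data lake location. Lake Formation assumes
# this role when vending temporary credentials for data in the location.
lf.register_resource(
    ResourceArn="arn:aws:s3:::icebergdemodatalake",
    RoleArn="arn:aws:iam::111122223333:role/LakeFormationRegistrationRole",
)
```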
Grant database and table permissions to the IAM role used within Databricks
Grant DESCRIBE permission on the icebergdemodb database to the Databricks instance profile role.
- Sign in to the Lake Formation console as the data lake administrator.
- In the navigation pane, choose Data lake permissions and choose Grant.
- In the Principals section, select IAM users and roles and choose databricks-dataplane-instance-profile-role.
- In the LF-Tags or catalog resources section, select Named Data Catalog resources. Choose <accountid> for Catalogs and icebergdemodb for Databases.
- Select DESCRIBE for Database permissions.
- Choose Grant.
Grant SUPER permission on the person table in the icebergdemodb database to the Databricks instance profile role. SUPER includes the SELECT and DESCRIBE access needed for queries and also allows the writes performed later in this post.
- In the navigation pane, choose Data lake permissions and choose Grant.
- In the Principals section, select IAM users and roles and choose databricks-dataplane-instance-profile-role.
- In the LF-Tags or catalog resources section, select Named Data Catalog resources. Choose <accountid> for Catalogs, icebergdemodb for Databases, and person for Tables.
- Select SUPER for Table permissions.
- Choose Grant.
Grant data location permissions on the bucket to the Databricks instance profile role (a scripted equivalent of all three grants follows these steps).
- In the Lake Formation console navigation pane, choose Data Locations, and then choose Grant.
- For IAM users and roles, choose databricks-dataplane-instance-profile-role.
- For Storage locations, select s3://icebergdemodatalake.
- Choose Grant.
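All three grants can also be scripted through the GrantPermissions API. A boto3 sketch under the same assumptions follows (111122223333 is a placeholder account ID; the console's SUPER permission is expressed as ALL in the API):

```python
import boto3

lf = boto3.client("lakeformation")
principal = {
    "DataLakePrincipalIdentifier": (
        "arn:aws:iam::111122223333:role/databricks-dataplane-instance-profile-role"
    )
}

# DESCRIBE on the icebergdemodb database
lf.grant_permissions(
    Principal=principal,
    Resource={"Database": {"Name": "icebergdemodb"}},
    Permissions=["DESCRIBE"],
)

# SUPER (ALL in the API) on the person table
lf.grant_permissions(
    Principal=principal,
    Resource={"Table": {"DatabaseName": "icebergdemodb", "Name": "person"}},
    Permissions=["ALL"],
)

# Data location access on the registered bucket
lf.grant_permissions(
    Principal=principal,
    Resource={"DataLocation": {"ResourceArn": "arn:aws:s3:::icebergdemodatalake"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)
```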
Databricks workspace
Create a cluster and configure it to connect with a Glue Iceberg REST Catalog endpoint. For this post, we will use a Databricks cluster with runtime version 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12).
- In Databricks console, choose Compute in the navigation pane.
- Create a cluster with runtime version 15.4 LTS and access mode No isolation shared, and choose databricks-dataplane-instance-profile-role as the instance profile role in the Configuration section.
- Expand the Advanced options section. In the Spark section, for Spark config, include the following details:
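The exact configuration from the post isn't reproduced here; the following is a sketch of the Spark properties typically required to point an Iceberg SparkCatalog at the Glue Iceberg REST endpoint with SigV4 request signing and vended credentials. The catalog name glue_rest and Region us-east-1 are assumptions; replace <accountid> with your AWS account ID.

```
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.glue_rest org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue_rest.type rest
spark.sql.catalog.glue_rest.uri https://glue.us-east-1.amazonaws.com/iceberg
spark.sql.catalog.glue_rest.warehouse <accountid>
spark.sql.catalog.glue_rest.rest.sigv4-enabled true
spark.sql.catalog.glue_rest.rest.signing-name glue
spark.sql.catalog.glue_rest.rest.signing-region us-east-1
spark.sql.catalog.glue_rest.header.X-Iceberg-Access-Delegation vended-credentials
```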
- In the Cluster section, for Libraries include the following jars:
org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1
software.amazon.awssdk:bundle:2.29.5
Create a notebook for analyzing data managed in Data Catalog:
- In the workspace browser, create a new notebook and attach it to the cluster created above.
- Run commands in a notebook cell to query the data through the catalog.
- Further modify the data in the S3 data lake using the AWS Glue Iceberg REST Catalog (a sketch of both steps follows this list).
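The original notebook cells aren't reproduced here; the following PySpark sketch shows the shape of both steps, assuming the catalog was registered as glue_rest in the Spark config above:

```python
# Query the person table through the Glue Iceberg REST Catalog.
# Lake Formation vends scoped S3 credentials behind the scenes.
spark.sql("SHOW NAMESPACES IN glue_rest").show()
spark.sql("SELECT * FROM glue_rest.icebergdemodb.person LIMIT 10").show()

# Modify the data in place through the same catalog to confirm write
# access. Duplicating an existing row keeps the example schema-agnostic.
spark.sql("""
    INSERT INTO glue_rest.icebergdemodb.person
    SELECT * FROM glue_rest.icebergdemodb.person LIMIT 1
""")
spark.sql("SELECT COUNT(*) AS row_count FROM glue_rest.icebergdemodb.person").show()
```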
This shows that you can now analyze data in a Databricks cluster using an AWS Glue Iceberg REST Catalog endpoint with Lake Formation managing the data access.
Clean up
To clean up the resources used in this post and avoid possible charges:
- Delete the cluster created in Databricks.
- Delete the IAM roles created for this post.
- Delete the resources created in Data Catalog.
- Empty and then delete the S3 bucket.
Conclusion
In this post, we showed you how to manage a dataset centrally in the AWS Glue Data Catalog and make it accessible to Databricks compute using the Iceberg REST Catalog API. The solution also lets Databricks use your existing Lake Formation access controls, with Lake Formation managing metadata access and enabling access to the underlying Amazon S3 storage through credential vending.
Try the feature and share your feedback in the comments.
About the authors
Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.
Venkatavaradhan (Venkat) Viswanathan is a Global Partner Solutions Architect at Amazon Web Services. Venkat is a Technology Strategy Leader in Data, AI, ML, generative AI, and Advanced Analytics. Venkat is a Global SME for Databricks and helps AWS customers design, build, secure, and optimize Databricks workloads on AWS.
Pratik Das is a Senior Product Manager with AWS Lake Formation. He is passionate about all things data and works with customers to understand their requirements and build delightful experiences. He has a background in building data-driven solutions and machine learning systems.