Fine-grained access control in Amazon EMR Serverless with AWS Lake Formation

In today’s data-driven world , enterprises are increasingly reliant on vast amounts of data to drive decision-making and innovation. With this reliance comes the critical need for robust data security and access control mechanisms. Fine-grained access control restricts access to specific data subsets, protecting sensitive information and maintaining regulatory compliance. It allows organizations to set detailed permissions at various levels, including database, table, column, and row. This precise control mitigates risks of unauthorized access, data leaks, and misuse. In the unfortunate event of a security incident, fine-grained access control helps limit the scope of the breach, minimizing potential damage.
AWS is introducing general availability of fine-grained access control based on AWS Lake Formation for Amazon EMR Serverless on Amazon EMR 7.2. Enterprises can now significantly enhance their data governance and security frameworks. This new integration supports the implementation of modern data lake architectures, such as data mesh, by providing a seamless way to manage and analyze data. You can use EMR Serverless to enforce data access controls using Lake Formation when reading data from Amazon Simple Storage Service (Amazon S3), enabling robust data processing workflows and real-time analytics without the overhead of cluster management.

In this post, we discuss how to implement fine-grained access control in EMR Serverless using Lake Formation. With this integration, organizations can achieve better scalability, flexibility, and cost-efficiency in their data operations, ultimately driving more value from their data assets.

Key use cases for fine-grained access control in analytics

The following are key use cases for fine-grained access control in analytics:

Customer 360 – You can enable different departments to securely access specific customer data relevant to their functions. For example, the sales team can be granted access only to data such as customer purchase history, preferences, and transaction patterns. Meanwhile, the marketing team is limited to viewing campaign interactions, customer demographics, and engagement metrics.
Financial reporting – You can enable financial analysts to access the necessary data for reporting and analysis while restricting sensitive financial details to authorized executives.
Healthcare analytics – You can enable healthcare researchers and data scientists to analyze de-identified patient data for medical advancements and research, while making sure Protected Health Information (PHI) remains confidential and accessible only to authorized healthcare professionals and personnel.
Supply chain optimization – You can grant logistics teams visibility into inventory and shipment data while limiting access to pricing or supplier information to relevant stakeholders.

Solution overview

In this post, we explore how to implement fine-grained access control on Iceberg tables within an EMR Serverless application, using the capabilities of Lake Formation. If you’re interested in learning how to implement fine-grained access control on open table formats in Amazon EMR running on Amazon Elastic Compute Cloud (Amazon EC2) instances using Lake Formation, refer to Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation.
With the data access control features available in Lake Formation, you can enforce granular permissions and govern access to specific columns, rows, or cells within your Iceberg tables. This approach makes sure sensitive data remains secure and accessible only to authorized users or applications, aligning with your organization’s data governance policies and regulatory compliance requirements.

A cross-account modern data platform on AWS involves setting up a centralized data lake in a primary AWS account, while allowing controlled access to this data from secondary AWS accounts. This setup helps organizations maintain a single source of truth for their data, provides consistent data governance, and uses the robust security features of AWS across multiple business units or project teams.

To demonstrate how you can use Lake Formation to implement cross account fine-grained access control within an EMR Serverless environment, we use the TPC-DS dataset to create tables in the AWS Glue Data Catalog in the AWS producer account and provision different user personas to reflect various roles and access levels in the AWS consumer account, forming a secure and governed data lake.

The following diagram illustrates the solution architecture.

The producer account contains the following persona:

Data engineer – Tasks include data preparation, bulk updates, and incremental updates. The data engineer has the following access:
- Table-level access – Full read/write access to all TPC-DS tables.

The consumer account contains the following personas:

Finance analyst – We run a sample query that performs a sales data analysis to guide marketing, inventory, and promotion strategies based on demographic and geographic performance. The finance analyst has the following access:
- Table-level access – Full access to tables store_sales, catalog_sales, web_sales, item, and promotion for comprehensive financial analysis.
- Column-level access – Limited access to cost-related columns in the sales tables to avoid exposure to sensitive pricing strategies. Limited access to sensitive columns like credit_rating in the customer_demographics table.
- Row-level access – Access only to sales data from the current fiscal year or specific promotional periods.
Product analyst – We run a sample query that does a customer behavior analysis to tailor marketing, promotions, and loyalty programs based on purchase patterns and regional insights. The product analyst has the following access:
- Table-level access – Full access to tables item, store_sales, and customer tables to evaluate product and market trends.
- Column-level access – Restricted access to personal identifiers in the customer table, such as customer_address , email_address, and date of birth.

Prerequisites

You should have the following prerequisites:

Set up infrastructure in the producer account

We provide a CloudFormation template to deploy the data lake stack with the following resources:

Two S3 buckets: one for scripts and query results, and one for the data lake storage
An Amazon Athena workgroup
An EMR Serverless application
An AWS Glue database and tables on external public S3 buckets of TPC-DS data
An AWS Glue database for the data lake
An IAM role and polices

Set up Lake Formation for the data engineer in the producer account

Set up Lake Formation cross-account data sharing version settings:

Open the Lake Formation console with the Lake Formation data lake administrator in the producer account.
Under Data Catalog settings, pick Version 4 under Cross-account version settings.

To learn more about the differences between data sharing versions, refer to Updating cross-account data sharing version settings. Make sure Default permissions for newly created databases and tables is unchecked.

Register the Amazon S3 location as the data lake location

When you register an Amazon S3 location with Lake Formation, you specify an IAM role with read/write permissions on that location. After registering, when EMR Serverless requests access to this Amazon S3 location, Lake Formation will supply temporary credentials of the provided role to access the data. We already created the role LakeFormationServiceRole using the CloudFormation template. To register the Amazon S3 location as the data lake location, complete the following steps:

Open the Lake Formation console with the Lake Formation data lake administrator in the producer account.
In the navigation pane, choose Data lake locations under Administration.
Choose Register location.
For Amazon S3 path, enter s3://<DatalakeBucketName>. (Copy the bucket name from the CloudFormation stack’s Outputs tab.)
For IAM role, enter LakeFormationServiceRoleDatalake.
For Permission mode, select Lake Formation.
Choose Register location.

Generate TPC-DS tables in the producer account

In this section, we generate TPC-DS tables in Iceberg format in the producer account.
Grant database permissions to the data engineer
First, let’s grant database permissions to the data engineer IAM role Amazon-EMR-ExecutionRole_DE that we will use with EMR Serverless. Complete the following steps:

Open the Lake Formation console with the Lake Formation data lake administrator in the producer account.
Choose Databases and Create database.
Enter iceberg_db for Name and s3://<DatalakeBucketName> for Location.
Choose Create database.
In the navigation pane, choose Data lake permissions and choose Grant.
In the Principles section, select IAM users and roles and choose Amazon-EMR-ExecutionRole_DE.
In the LF-Tags or catalog resources section, select Named Data Catalog resources and choose tpc-source and iceberg_db for Databases.
Select Super for both Database permissions and Grantable permissions and choose Grant.

Create an EMR Serverless application

Now, let’s log in to EMR Serverless using Amazon EMR Studio and complete the following steps:

On the Amazon EMR console, choose EMR Serverless.
Under Manage applications, choose my-emr-studio. You will be directed to the Create application page on EMR Studio. Let’s create a Lake Formation enabled EMR Serverless application
Under Application settings, provide the following information:
1. For Name, enter a name emr-fgac-application.
2. For Type, choose Spark.
3. For Release version, choose emr-7.2.0.
4. For Architecture, choose x86_64.
Under Application setup options, select Use custom settings.
Under Interactive endpoint, select Enable endpoint for EMR studio
Under Additional configurations, for Metastore configuration, select Use AWS Glue Data Catalog as metastore, then select Use Lake Formation for fine-grained access control.
Under Network connections, choose emrs-vpc for the VPC, enter any two private subnets, and enter emr-serverless-sg for Security groups.
Choose Create and start application.

Create a Workspace

Complete the following steps to create an EMR Workspace:

On the Amazon EMR console, choose Workspaces in the navigation pane and choose Create Workspace.
Enter the Workspace name emr-fgac-workspace.
Leave all other settings as default and choose Create Workspace.
Choose Launch Workspace. Your browser might request to allow pop-up permissions for the first time launching the Workspace.
After the Workspace is launched, in the navigation pane, choose Compute.
For Compute type¸ select EMR Serverless application and enter emr-fgac-application for the application and Amazon-EMR-ExecutionRole_DE as the runtime role.
Make sure the kernel attached to the Workspace is PySpark.
Navigate to the File browser section and choose Upload files.
Upload the file Iceberg-ingest-final_v2.ipynb.
Update the data lake bucket name, AWS account ID, and AWS Region accordingly.
Choose the double arrow icon to restart the kernel and rerun the notebook.

To verify that the data is generated, you can go to the AWS Glue console. Under Data Catalog, Databases, you should see TPC-DS tables ending with _iceberg for the database iceberg_db.

Share the database and TPC-DS tables to the consumer account

We now grant permissions to the consumer account, including grantable permissions. This allows the Lake Formation data lake administrator in the consumer account to control access to the data within the account.

Grant database permissions to the consumer account

Complete the following steps:

Open the Lake Formation console with the Lake Formation data lake administrator in the producer account.
In the navigation pane, choose Databases.
Select the database iceberg_db, and on the Actions menu, under Permissions, choose Grant.
In the Principles section, select External accounts and enter the consumer account.
In the LF-Tags or catalog resources section, select Named Data Catalog resources and choose iceberg_db for Databases.
In the Database permissions section, select Describe for both Database permissions and Grantable permissions.

This allows the data lake administrator in the consumer account to describe the database and grant describe permissions to other principals in the consumer account.

Grant table permissions to the consumer account

Repeat the preceding steps to grant table permissions to the consumer account.

Choose All tables under Tables and provide select and describe permissions for Table permissions and Grantable permissions.

Set up Lake Formation in the consumer account

For the remaining section of the post, we focus on the consumer account. Deploy the following CloudFormation stack to set up resources:

The template will create the Amazon EMR runtime role for both analyst user personas.
Log in to the AWS consumer account and accept the AWS RAM invitation first:

Open the AWS RAM console with the IAM identity that has AWS RAM access.
In the navigation pane, choose Resource shares under Shared with me.
You should see two pending resource shares from the producer account.
Accept both invitations.

You should be able to see the iceberg_db database on the Lake Formation console.

Create a resource link for the shared database

To access the database and table resources that were shared by the producer AWS account, you need to create a resource link in the consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database or table. After you create a resource link to a database or table, you can use the resource link name wherever you would use the database or table name. In this step, you grant permission on the resource links to the job runtime roles for EMR Serverless. The runtime roles will then access the data in shared databases and underlying tables through the resource link.
To create a resource link, complete the following steps:

Open the Lake Formation console with the Lake Formation data lake administrator in the consumer account.
In the navigation pane, choose Databases.
Select the iceberg_db database, verify that the owner account ID is the producer account, and on the Actions menu, choose Create resource links.
For Resource link name, enter the name of the resource link (iceberg_db_shared).
For Shared database’s region, choose the Region of the iceberg_db database.
For Shared database, choose the iceberg_db database.
For Shared database’s owner ID, enter the account ID of the producer account.
Choose Create.

Grant permissions on the resource link to the EMR job runtime roles

Grant permissions on the resource link to Amazon-EMR-ExecutionRole_Finance and Amazon-EMR-ExecutionRole_Product using the following steps:

Open the Lake Formation console with the Lake Formation data lake administrator in the consumer account.
In the navigation pane, choose Databases.
Select the resource link (iceberg_db_shared) and on the Actions menu, choose Grant.
In the Principles section, select IAM users and roles, and choose Amazon-EMR-ExecutionRole_Finance and Amazon-EMR-ExecutionRole_Product.
In the LF-Tags or catalog resources section, select Named Data Catalog resources and for Databases, choose iceberg_db_shared.
In the Resource link permissions section, select Describe for Resource link permissions.

This allows the EMR Serverless job runtime roles to describe the resource link. We don’t make any selections for grantable permissions because runtime roles shouldn’t be able to grant permissions to other principles.
Choose Grant.

Grant table permissions for the finance analyst

Complete the following steps:

Open the Lake Formation console with the Lake Formation data lake administrator in the consumer account.
In the navigation pane, choose Databases.
Select the resource link (iceberg_db_shared) and on the Actions menu, choose Grant on target.
In the Principles section, select IAM users and roles, then choose Amazon-EMR-ExecutionRole_Finance.
In the LF-Tags or catalog resources section, select Named Data Catalog resources and specify the following:
1. For Databases, choose iceberg_db.
2. For Tables¸ choose store_sales_iceberg.
In the Table permissions section, for Table permissions, select Select.
In the Data permissions section, select Column-based access.
Select Exclude columns and choose all cost-related columns (ss_wholesale_cost and ss_ext_wholesale_cost).
Choose Grant.
Similarly, grant access to table customer_demographics_iceberg and exclude the column cd_credit_rating.
Following the same steps, grant All data access for tables store_iceberg and item_iceberg.
For the table date_dim_iceberg, we provide selective row-level access.
Similar to the preceding table permissions, select date_dim_iceberg under Tables and in the Data filters section, choose Create new.
For Data filter name, enter FA_Filter_year.
Select Access to all columns under Column-level access.
Select Filter rows and for Row filter expression, enter d_year=2002 to only provide access to the 2002 year.
Choose Save changes.
Choose Create filter.
Make sure FA_Filter_year is selected under Data filters and grant select permissions on the filter.

Grant table permissions for the product analyst

You can provide permissions for the next set of tables required for the product analyst role using the Lake Formation console. Alternatively, you can use the AWS Command Line Interface (AWS CLI) to grant permissions. We provide grant on target permissions for the resource link iceberg_db_shared to IAM role Amazon-EMR-ExecutionRole_Product.

Similar to steps followed in previous sections, for table store_sales_iceberg, date_dim_iceberg, store_iceberg, and house_hold_demographics_iceberg, provide select permissions for All data access. Make sure the role selected is Amazon-EMR-ExecutionRole_Product.

For table customer_iceberg, we limit access to personally identifiable information (PII) columns.

Under Data permissions, select Column-based access and Exclude columns.
Choose columns c_birth_day, c_birth_month, c_birth_year, c_current_addr_sk, c_customer_id, c_email_address, and c_birth_country.

Verify access using interactive notebooks from EMR Studio

Complete the following steps to test role access:

Log in to the AWS consumer account and open the Amazon EMR console.
Choose EMR Serverless in the navigation pane and choose an existing EMR Studio.
If you don’t have EMR Studio configured, choose Get Started and select Create and launch EMR Studio.
Create a Lake Formation enabled EMR Serverless application as described in previous sections.
Create an EMR Studio Workspace as described in previous sections.
Use emr-studio-service-role for Service role and datalake-resources-<account_id>-<region> for Workspace storage, then launch your Workspace.

Now, let’s verify access for the finance analyst.

Make sure the compute type inside your Workspace is pointing to the EMR Serverless application created in the prior step and Amazon-EMR-ExecutionRole_Finance as the interactive runtime role.
Go to File browser in the navigation pane, choose Upload files, and add Notebook_FA.ipynb to your Workspace.
Run all the cells to verify fine-grained access.

Now let’s test access for the product analyst.

In the same Workspace, detach and attach the same EMR Serverless application with Amazon-EMR-ExecutionRole_Product as the interactive runtime role.
Upload Notebook_PA.ipynb under the File browser section.
Run all the cells to verify fine-grained access for the product analyst.

In a real-world scenario, both analysts will likely have their own Workspace with restricted rights to assume only the authorized interactive runtime role.

Considerations and limitations

EMR Serverless with Lake Formation uses Spark resource profiles to create two profiles and two Spark drivers for access control. Read this white paper to learn about the feature details. The user profile runs the supplied code, and the system profile enforces Lake Formation policies. Therefore, it’s recommended that you have a minimum of two Spark drivers when pre-initialized capacity is used with Lake Formation enabled jobs. No change in executor count is required. Refer to Using EMR Serverless with AWS Lake Formation for fine-grained access control to learn more about the technical implementation of the Lake Formation integration with EMR Serverless.

Clean up

To avoid incurring ongoing costs, complete the following steps to clean up your resources:

In your secondary (consumer) account, log in to the Lake Formation console.
Drop the resource share table.
In your primary (producer) account, log in to the Lake Formation console.
Revoke the access you configured.
Drop the AWS Glue tables and database.
Delete the AWS Glue job.
Delete the S3 buckets and any other resources that you created as part of the prerequisites for this post.

Conclusion

In this post, we showed how to integrate Lake Formation with EMR Serverless to manage access to Iceberg tables. This solution showcases a modern way to enforce fine-grained access control in a multi-account open data lake setup. The approach simplifies data management in the main account while carefully controlling how users access data in other secondary accounts.

Try out the solution for your own use case, and let us know your feedback and questions in the comments section.

About the Authors

Anubhav Awasthi is a Sr. Big Data Specialist Solutions Architect at AWS. He works with customers to provide architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation.

Nishchai JM is an Analytics Specialist Solutions Architect at Amazon Web services. He specializes in building Big-data applications and help customer to modernize their applications on Cloud. He thinks Data is new oil and spends most of his time in deriving insights out of the Data.