AWS Lake Formation makes it straightforward to centrally govern, secure, and globally share data for analytics and machine learning (ML).
With Lake Formation, you can centralize data security and governance using the AWS Glue Data Catalog, letting you manage metadata and data permissions in one place with familiar database-style features. It also delivers fine-grained data access control, so you can make sure users have access to the right data down to the row and column level.
Lake Formation also makes it straightforward to share data internally across your organization and externally, which lets you create a data mesh or meet other data sharing needs with no data movement.
Additionally, because Lake Formation tracks data interactions by role and user, it provides comprehensive data access auditing to verify the right data was accessed by the right users at the right time.
In this two-part series, we show how to integrate custom applications or data processing engines with Lake Formation using the third-party services integration feature.
In this post, we dive deep into the required Lake Formation and AWS Glue APIs. We walk through the steps to enforce Lake Formation policies within custom data applications. As an example, we present a sample Lake Formation integrated application implemented using AWS Lambda.
The second part of the series introduces a sample web application built with AWS Amplify. This web application showcases how to use the custom data processing engine implemented in the first post.
By the end of this series, you will have a comprehensive understanding of how to extend the capabilities of Lake Formation by building and integrating your own custom data processing components.
Integrate an external application
The process of integrating a third-party application with Lake Formation is described in detail in How Lake Formation application integration works.
In this section, we dive deeper into the steps required to establish trust between Lake Formation and an external application, the API operations that are involved, and the AWS Identity and Access Management (IAM) permissions that must be set up to enable the integration.
Lake Formation application integration external data filtering
In Lake Formation, it’s possible to control which third-party engines or applications are allowed to read and filter data in Amazon Simple Storage Service (Amazon S3) locations registered with Lake Formation.
To do so, you can navigate to the Application integration settings page on the Lake Formation console and enable Allow external engines to filter data in Amazon S3 locations registered with Lake Formation, specifying the AWS account IDs from where third-party engines are allowed to access locations registered with Lake Formation. In addition, you have to specify the allowed session tag values to identify trusted requests. We discuss in later sections how these tags are used.
Lake Formation application integration involved AWS APIs
The following is a list of the main AWS APIs needed to integrate an application with Lake Formation:
- sts:AssumeRole – Returns a set of temporary security credentials that you can use to access AWS resources.
- glue:GetUnfilteredTableMetadata – Allows a third-party analytical engine to retrieve unfiltered table metadata from the Data Catalog.
- glue:GetUnfilteredPartitionsMetadata – Retrieves partition metadata from the Data Catalog that contains unfiltered metadata.
- lakeformation:GetTemporaryGlueTableCredentials – Allows a caller in a secure environment to assume a role with permission to access Amazon S3. To vend such credentials, Lake Formation assumes the role associated with a registered location, for example an S3 bucket, with a scope down policy that restricts the access to a single prefix.
- lakeformation:GetTemporaryGluePartitionCredentials – This API is identical to GetTemporaryGlueTableCredentials, except that it’s used when the target Data Catalog resource is of type Partition. Lake Formation restricts the permission of the vended credentials with the same scope down policy that restricts access to a single Amazon S3 prefix.
Later in this post, we present a sample architecture illustrating how you can use these APIs.
External application and IAM roles to access data
For an external application to access resources in a Lake Formation environment, it needs to run under an IAM principal (user or role) with the appropriate credentials. Let’s consider a scenario where the external application runs under the IAM role MyApplicationRole, which is part of the AWS account 123456789012.
In Lake Formation, you have granted access to various tables and databases to two specific IAM roles:
- AccessRole1
- AccessRole2
To enable MyApplicationRole to access the resources that have been granted to AccessRole1 and AccessRole2, you need to configure the trust relationships for these access roles. Specifically, you need to configure the following:
- Allow MyApplicationRole to assume each of the access roles (AccessRole1 and AccessRole2) using the sts:AssumeRole API.
- Allow MyApplicationRole to tag the assumed session with a specific tag, which is required by Lake Formation. The tag key should be LakeFormationAuthorizedCaller, and the value should match one of the session tag values specified in the Application integration settings page on the Lake Formation console (for example, “application1”).
The following code is an example of the trust relationships configuration for an access role (AccessRole1 or AccessRole2):
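The policy document itself is not reproduced here, so the following is a minimal sketch of what such a trust policy could look like, built as a Python dictionary and printed as JSON. The account ID, role name, and tag value come from the scenario above. The statement layout and the use of the `StringLike` condition operator are assumptions derived from the two requirements just listed, so verify them against the Lake Formation documentation before use.

```python
import json


def build_trust_policy(account_id, application_role, session_tag_value):
    """Return a trust policy letting the application role assume and tag a session."""
    principal = {"AWS": f"arn:aws:iam::{account_id}:role/{application_role}"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets the application role assume the access role.
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:AssumeRole",
            },
            {
                # Lets the application role tag the session, but only with the
                # LakeFormationAuthorizedCaller key and the expected value.
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:TagSession",
                "Condition": {
                    "StringLike": {
                        "aws:RequestTag/LakeFormationAuthorizedCaller": session_tag_value
                    }
                },
            },
        ],
    }


trust_policy = build_trust_policy("123456789012", "MyApplicationRole", "application1")
print(json.dumps(trust_policy, indent=2))
```

The same document would be attached to both AccessRole1 and AccessRole2, because both roles are assumed by the same application role.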
Additionally, the data access IAM roles (AccessRole1 and AccessRole2) must have the following IAM permissions assigned in order to read Lake Formation protected tables:
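The permissions list is not reproduced here. As a rough sketch only, a policy of this kind usually combines `lakeformation:GetDataAccess` with read-only AWS Glue Data Catalog actions. The exact action list below is an assumption, not the authors' policy, and should be adjusted to your own requirements.

```python
import json

# Assumed action list: lakeformation:GetDataAccess for credential vending,
# plus read-only Data Catalog actions. This is illustrative, not authoritative.
ASSUMED_READ_ACTIONS = [
    "lakeformation:GetDataAccess",
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetPartition",
    "glue:GetPartitions",
]


def build_access_role_policy(actions=ASSUMED_READ_ACTIONS):
    """Return an identity policy document granting the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": "*"}
        ],
    }


print(json.dumps(build_access_role_policy(), indent=2))
```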
Solution overview
For our solution, Lambda serves as our external trusted engine and application integrated with Lake Formation. The example is meant to show the access flow and the Lake Formation API responses in action. Because it’s based on a single Lambda function, it’s not meant to be used in production settings or with high volumes of data.
Moreover, the Lambda based engine has been configured to support a limited set of data files (CSV, Parquet, and JSON), a limited set of table configurations (no nested data), and a limited set of table operations (SELECT only). Due to these limitations, the application should not be used for arbitrary tests.
In this post, we provide instructions on how to deploy a sample API application integrated with Lake Formation that implements the solution architecture. The core of the API is implemented with a Python Lambda function. We also show how to test the function with Lambda tests. In the second post in this series, we provide instructions on how to deploy a web frontend application that integrates with this Lambda function.
Access flow for unpartitioned tables
The following diagram summarizes the access flow when accessing unpartitioned tables.
The workflow consists of the following steps:
1. User A (authenticated with Amazon Cognito or other equivalent systems) sends a request to the application API endpoint, requesting access to a specific table inside a specific database.
2. The API endpoint, created with AWS AppSync, handles the request, invoking a Lambda function.
3. The function checks which IAM data access role the user is mapped to. For simplicity, the example uses a static hardcoded mapping (mappings={ "user1": "lf-app-access-role-1", "user2": "lf-app-access-role-2"}).
4. The function invokes the sts:AssumeRole API to assume the user-related IAM data access role (for example, lf-app-access-role-1). The AssumeRole operation is performed with the tag LakeFormationAuthorizedCaller, having as its value one of the session tag values specified when configuring the application integration settings in Lake Formation (for example, {'Key': 'LakeFormationAuthorizedCaller','Value': 'application1'}). The API returns a set of temporary credentials, which we refer to as StsCredentials1.
5. Using StsCredentials1, the function invokes the glue:GetUnfilteredTableMetadata API, passing the requested database and table name. The API returns information like table location, a list of authorized columns, and data filters, if defined.
6. Using StsCredentials1, the function invokes the lakeformation:GetTemporaryGlueTableCredentials API, passing the requested database and table name, the type of requested access (SELECT), and CELL_FILTER_PERMISSION as the supported permission types (because the Lambda function implements logic to apply row-level filters). The API returns a set of temporary Amazon S3 credentials, which we refer to as S3Credentials1.
7. Using S3Credentials1, the function lists the S3 files stored in the table location S3 prefix and downloads them.
8. The retrieved Amazon S3 data is filtered to remove those columns and rows that the user is not allowed access to (authorized columns and row filters were retrieved in Step 5) and authorized data is returned to the user.
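Steps 3 through 7 can be sketched in Python as follows. This is not the code of the sample Lambda function. It is a minimal illustration that takes boto3-style clients as arguments so that the credential handoff between steps is explicit. The session name, the `COLUMN_PERMISSION` value passed to AWS Glue, and the helper names are assumptions made for the example.

```python
# Static user-to-role mapping, as described in Step 3.
USER_ROLE_MAPPING = {"user1": "lf-app-access-role-1", "user2": "lf-app-access-role-2"}


def assume_access_role(sts_client, account_id, user, tag_value="application1"):
    """Step 4: assume the user's data access role with the required session tag."""
    role_arn = f"arn:aws:iam::{account_id}:role/{USER_ROLE_MAPPING[user]}"
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"lf-app-{user}",  # hypothetical session name
        Tags=[{"Key": "LakeFormationAuthorizedCaller", "Value": tag_value}],
    )
    return response["Credentials"]  # StsCredentials1 in the text


def get_table_access(glue_client, lf_client, account_id, region, database, table):
    """Steps 5 and 6: read unfiltered metadata, then vend S3 credentials.

    Both clients are expected to be built from StsCredentials1.
    """
    metadata = glue_client.get_unfiltered_table_metadata(
        CatalogId=account_id,
        DatabaseName=database,
        Name=table,
        # Assumed value list for this sketch.
        SupportedPermissionTypes=["COLUMN_PERMISSION", "CELL_FILTER_PERMISSION"],
    )
    table_arn = f"arn:aws:glue:{region}:{account_id}:table/{database}/{table}"
    s3_credentials = lf_client.get_temporary_glue_table_credentials(
        TableArn=table_arn,
        Permissions=["SELECT"],
        SupportedPermissionTypes=["CELL_FILTER_PERMISSION"],
    )
    return {
        "location": metadata["Table"]["StorageDescriptor"]["Location"],
        "authorized_columns": metadata.get("AuthorizedColumns", []),
        "cell_filters": metadata.get("CellFilters", []),
        "s3_credentials": s3_credentials,  # S3Credentials1 in the text
    }


def list_table_files(s3_client, location):
    """Step 7: list object keys under the table location (s3://bucket/prefix)."""
    bucket, _, prefix = location.replace("s3://", "", 1).partition("/")
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return bucket, keys
```

With boto3, each client for the later steps would be created from the previous step's credentials, for example `boto3.client("glue", aws_access_key_id=creds["AccessKeyId"], aws_secret_access_key=creds["SecretAccessKey"], aws_session_token=creds["SessionToken"])`.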
Access flow for partitioned tables
The following diagram summarizes the access flow when accessing partitioned tables.
The steps involved are almost identical to the ones presented for unpartitioned tables, with the following changes:
- After invoking the glue:GetUnfilteredTableMetadata API (Step 5) and identifying the table as partitioned, the Lambda function invokes the glue:GetUnfilteredPartitionsMetadata API using StsCredentials1 (Step 6). The API returns, in addition to other information, the list of partition values and locations.
- For each partition, the function performs the following actions:
  - Invokes the lakeformation:GetTemporaryGluePartitionCredentials API (Step 7), passing the requested database and table name, the partition value, the type of requested access (SELECT), and CELL_FILTER_PERMISSION as the supported permissions type (because the Lambda function implements logic to apply row-level filters). The API returns a set of temporary Amazon S3 credentials, which we refer to as S3CredentialsPartitionX.
  - Uses S3CredentialsPartitionX to list the partition location S3 files and download them (Step 8).
- The function appends the retrieved data.
- Before the Lambda function returns the results to the user (Step 9), the retrieved Amazon S3 data is filtered to remove those columns and rows that the user is not allowed access to (authorized columns and row filters were retrieved in Step 5).
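The per-partition loop described above can be sketched as follows. Again, this is an illustration rather than the sample function's code. The clients are assumed to be boto3-style clients built from StsCredentials1, and `read_location` is a hypothetical callback that downloads and parses the files under one partition location with the credentials it is given.

```python
def list_partitions(glue_client, account_id, database, table):
    """Step 6: return (values, location) for every partition, following pagination."""
    partitions, token = [], None
    while True:
        kwargs = {
            "CatalogId": account_id,
            "DatabaseName": database,
            "TableName": table,
            # Assumed value list for this sketch.
            "SupportedPermissionTypes": ["COLUMN_PERMISSION", "CELL_FILTER_PERMISSION"],
        }
        if token:
            kwargs["NextToken"] = token
        response = glue_client.get_unfiltered_partitions_metadata(**kwargs)
        for item in response.get("UnfilteredPartitions", []):
            partition = item["Partition"]
            partitions.append(
                (partition["Values"], partition["StorageDescriptor"]["Location"])
            )
        token = response.get("NextToken")
        if not token:
            return partitions


def vend_partition_credentials(lf_client, table_arn, partition_values):
    """Step 7: request S3 credentials scoped to a single partition."""
    return lf_client.get_temporary_glue_partition_credentials(
        TableArn=table_arn,
        Partition={"Values": partition_values},
        Permissions=["SELECT"],
        SupportedPermissionTypes=["CELL_FILTER_PERMISSION"],
    )


def read_partitioned_table(glue_client, lf_client, account_id, region,
                           database, table, read_location):
    """Steps 6 to 8: read each partition with its own credentials and append rows."""
    table_arn = f"arn:aws:glue:{region}:{account_id}:table/{database}/{table}"
    rows = []
    for values, location in list_partitions(glue_client, account_id, database, table):
        credentials = vend_partition_credentials(lf_client, table_arn, values)
        rows.extend(read_location(location, credentials))
    return rows
```

Requesting credentials once per partition mirrors the flow in the text; column and row filtering is still applied afterwards on the appended rows.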
Prerequisites
The following prerequisites are needed to deploy and test the solution:
- Lake Formation should be enabled in the AWS Region where the sample application will be deployed
- The steps must be run with an IAM principal with sufficient permissions to create the needed resources, including Lake Formation databases and tables
Deploy solution resources with AWS CloudFormation
We create the solution resources using AWS CloudFormation. The provided CloudFormation template creates the following resources:
- One S3 bucket to store table data (lf-app-data-<account-id>)
- Two IAM roles, which will be mapped to client users and their associated Lake Formation permission policies (lf-app-access-role-1 and lf-app-access-role-2)
- Two IAM roles used for the two created Lambda functions (lf-app-lambda-datalake-population-role and lf-app-lambda-role)
- One AWS Glue database (lf-app-entities) with two AWS Glue tables, one unpartitioned (users_tbl) and one partitioned (users_partitioned_tbl)
- One Lambda function used to populate the data lake data (lf-app-lambda-datalake-population)
- One Lambda function used for the Lake Formation integrated application (lf-app-lambda-engine)
- One IAM role used by Lake Formation to access the table data and perform credentials vending (lf-app-datalake-location-role)
- One Lake Formation data lake location (s3://lf-app-data-<account-id>/datasets) associated with the IAM role created for credentials vending (lf-app-datalake-location-role)
- One Lake Formation data filter (lf-app-filter-1)
- One Lake Formation tag (key: sensitive, values: true or false)
- Tag associations to tag the created unpartitioned AWS Glue table (users_tbl) columns with the created tag
To launch the stack and provision your resources, complete the following steps:
- Download the code zip bundle for the Lambda function used for the Lake Formation integrated application (lf-integrated-app.zip).
- Download the code zip bundle for the Lambda function used to populate the data lake data (datalake-population-function.zip).
- Upload the zip bundles to an existing S3 bucket location (for example, s3://mybucket/myfolder1/myfolder2/lf-integrated-app.zip and s3://mybucket/myfolder1/myfolder2/datalake-population-function.zip).
- Choose Launch Stack.
This automatically launches AWS CloudFormation in your AWS account with a template. Make sure that you create the stack in your intended Region.
- Choose Next to move to the Specify stack details section.
- For Parameters, provide the following parameters:
  - For powertoolsLogLevel, specify how verbose the Lambda function logger should be, from the most verbose to the least verbose (no logs). For this post, we choose DEBUG.
  - For s3DeploymentBucketName, enter the name of the S3 bucket containing the Lambda functions’ code zip bundles. For this post, we use mybucket.
  - For s3KeyLambdaDataPopulationCode, enter the Amazon S3 location containing the code zip bundle for the Lambda function used to populate the data lake data (datalake-population-function.zip). For example, myfolder1/myfolder2/datalake-population-function.zip.
  - For s3KeyLambdaEngineCode, enter the Amazon S3 location containing the code zip bundle for the Lambda function used for the Lake Formation integrated application (lf-integrated-app.zip). For example, myfolder1/myfolder2/lf-integrated-app.zip.
- Choose Next.
- Add additional AWS tags if required.
- Choose Next.
- Acknowledge the final requirements.
- Choose Create stack.
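If you prefer to script the deployment instead of using the console, the same parameters can be passed to CloudFormation programmatically. The following is a minimal sketch, assuming a boto3-style CloudFormation client. The stack name and template URL are placeholders you must supply yourself (the template location is not given in this post), and `CAPABILITY_NAMED_IAM` is an assumption based on the template creating named IAM roles.

```python
def build_stack_parameters(bucket, engine_key, population_key, log_level="DEBUG"):
    """Build the Parameters list for the four template parameters described above."""
    values = {
        "powertoolsLogLevel": log_level,
        "s3DeploymentBucketName": bucket,
        "s3KeyLambdaDataPopulationCode": population_key,
        "s3KeyLambdaEngineCode": engine_key,
    }
    return [{"ParameterKey": key, "ParameterValue": value} for key, value in values.items()]


def create_stack(cfn_client, stack_name, template_url, parameters):
    """Create the stack; stack_name and template_url are caller-supplied placeholders."""
    return cfn_client.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=parameters,
        # Assumed: needed because the template creates IAM roles with fixed names.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )


parameters = build_stack_parameters(
    bucket="mybucket",
    engine_key="myfolder1/myfolder2/lf-integrated-app.zip",
    population_key="myfolder1/myfolder2/datalake-population-function.zip",
)
print(parameters)
```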
Enable the Lake Formation application integration
Complete the following steps to enable the Lake Formation application integration:
- On the Lake Formation console, choose Application integration settings in the navigation pane.
- Enable Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
- For Session tag values, choose application1.
- For AWS account IDs, enter the current AWS account ID.
- Choose Save.
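The same settings can also be applied through the Lake Formation API. The sketch below assumes a boto3-style Lake Formation client and reads the current data lake settings first, so that existing values such as data lake administrators are sent back unchanged. Treat it as an illustration of the three relevant fields rather than a hardened script.

```python
def enable_external_data_filtering(lf_client, account_ids, session_tag_values):
    """Allow external engines from the given accounts and register the session tag values."""
    settings = lf_client.get_data_lake_settings()["DataLakeSettings"]
    # Equivalent of the "Allow external engines to filter data..." check box.
    settings["AllowExternalDataFiltering"] = True
    # AWS account IDs from which third-party engines may access registered locations.
    settings["ExternalDataFilteringAllowList"] = [
        {"DataLakePrincipalIdentifier": account_id} for account_id in account_ids
    ]
    # Values accepted for the LakeFormationAuthorizedCaller session tag.
    settings["AuthorizedSessionTagValueList"] = list(session_tag_values)
    lf_client.put_data_lake_settings(DataLakeSettings=settings)
    return settings
```

For this post, the call would be `enable_external_data_filtering(client, ["<account-id>"], ["application1"])`, where `<account-id>` is the current AWS account ID.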
Enforce Lake Formation permissions
The CloudFormation stack created one database named lf-app-entities with two tables named users_tbl and users_partitioned_tbl.
To be sure you’re using Lake Formation permissions, you should confirm that you don’t have any grants set up on those tables for the principal IAMAllowedPrincipals. The IAMAllowedPrincipals group includes any IAM users and roles that are allowed access to your Data Catalog resources by your IAM policies, and it’s used to maintain backward compatibility with AWS Glue.
To confirm Lake Formation permissions are enforced, navigate to the Lake Formation console and choose Data lake permissions in the navigation pane. Filter permissions by Database=lf-app-entities and remove all the permissions given to the principal IAMAllowedPrincipals.
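As a complement to the console check, the grants can be listed through the API. The following is a minimal sketch, assuming a boto3-style Lake Formation client and assuming that the group appears in API responses under the identifier `IAM_ALLOWED_PRINCIPALS`. It only reports the matching grants on the database and on the two tables; it does not remove them.

```python
IAM_ALLOWED_PRINCIPALS = "IAM_ALLOWED_PRINCIPALS"  # assumed API identifier of the group


def find_iam_allowed_principals_grants(lf_client, database, tables):
    """Return the permission entries granted to IAMAllowedPrincipals."""
    resources = [{"Database": {"Name": database}}]
    resources += [{"Table": {"DatabaseName": database, "Name": name}} for name in tables]
    grants = []
    for resource in resources:
        token = None
        while True:
            kwargs = {"Resource": resource}
            if token:
                kwargs["NextToken"] = token
            response = lf_client.list_permissions(**kwargs)
            for entry in response.get("PrincipalResourcePermissions", []):
                principal = entry["Principal"]["DataLakePrincipalIdentifier"]
                if principal == IAM_ALLOWED_PRINCIPALS:
                    grants.append(entry)
            token = response.get("NextToken")
            if not token:
                break
    return grants
```

An empty result for `find_iam_allowed_principals_grants(client, "lf-app-entities", ["users_tbl", "users_partitioned_tbl"])` suggests that only Lake Formation grants apply. Any entries found could then be revoked on the console as described above, or with the Lake Formation `revoke_permissions` operation.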
For more details on IAMAllowedPrincipals and backward compatibility with AWS Glue, refer to Changing the default security settings for your data lake.
Check the created Lake Formation resources and permissions
The CloudFormation stack created two IAM roles (lf-app-access-role-1 and lf-app-access-role-2) and assigned them different permissions on the users_tbl (unpartitioned) and users_partitioned_tbl (partitioned) tables. The specific Lake Formation grants are summarized in the following table.
IAM Roles | lf-app-entities (Database): users_tbl (Table) | lf-app-entities (Database): users_partitioned_tbl (Table) |
lf-app-access-role-1 | No access | Read access on columns uid, state, and city for all the records. Read access to all columns except for address only on rows with value state=united kingdom. |
lf-app-access-role-2 | Read access on columns with the tag sensitive = false | Read access to all columns and rows. |
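To make the cell-level semantics of the lf-app-access-role-1 grants on users_partitioned_tbl concrete, the following pure-Python sketch applies them to two sample rows. It is an illustration only: the `name` column and the sample values are hypothetical (the table schema is not listed here), and plain Python predicates stand in for the row filter expressions that Lake Formation returns.

```python
def apply_cell_filters(rows, cell_filters):
    """Keep, for each row, only the cells whose column predicate accepts the row.

    Columns missing from cell_filters (such as address) are never returned.
    """
    filtered = []
    for row in rows:
        visible = {
            column: row[column]
            for column, allowed in cell_filters.items()
            if column in row and allowed(row)
        }
        if visible:
            filtered.append(visible)
    return filtered


def always(row):
    return True


def in_united_kingdom(row):
    return row["state"] == "united kingdom"


# uid, state, and city are readable on every row; the remaining columns
# (except address) are readable only on rows where state is united kingdom.
role_1_filters = {
    "uid": always,
    "state": always,
    "city": always,
    "name": in_united_kingdom,  # hypothetical extra column
}

sample_rows = [
    {"uid": 1, "name": "Ann", "state": "united kingdom", "city": "london", "address": "1 High St"},
    {"uid": 2, "name": "Bob", "state": "italy", "city": "milan", "address": "2 Via Roma"},
]

for row in apply_cell_filters(sample_rows, role_1_filters):
    print(row)
# {'uid': 1, 'state': 'united kingdom', 'city': 'london', 'name': 'Ann'}
# {'uid': 2, 'state': 'italy', 'city': 'milan'}
```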
To better understand the full permissions setup, you should review the CloudFormation created Lake Formation resources and permissions. On the Lake Formation console, complete the following steps:
- Review the data filters:
  - Choose Data filters in the navigation pane.
  - Inspect the lf-app-filter-1 filter.
- Review the tags:
  - Choose LF-Tags and permissions in the navigation pane.
  - Inspect the sensitive tag.
- Review the tag associations:
  - Choose Tables in the navigation pane.
  - Choose the users_tbl table.
  - Inspect the LF-Tags associated to the different columns in the Schema section.
- Review the Lake Formation permissions:
  - Choose Data lake permissions in the navigation pane.
  - Filter by Principal = lf-app-access-role-1 and inspect the assigned permissions.
  - Filter by Principal = lf-app-access-role-2 and inspect the assigned permissions.
Test the Lambda function
The Lambda function created by the CloudFormation template accepts JSON objects as input events. The JSON events have the following structure:
Although the identity field is always needed in order to identify the calling identity, depending on the requested operation (fieldName), different arguments should be provided. The following table lists these arguments.
Operation | Description | Needed Arguments | Output |
getDbs | List databases | No arguments needed | List of databases the user has access to |
getTablesByDb | List tables | db: <db_name> | List of tables inside a database the user has access to |
getUnfilteredTableMetadata | Return the table metadata | | Returns the output of the glue:GetUnfilteredTableMetadata API |
getUnfilteredPartitionsMetadata | Return the table partitions metadata | | Returns the output of the glue:GetUnfilteredPartitionsMetadata API |
getTableData | Get table data | | |
To test the Lambda function, you can create some sample Lambda test events. Complete the following steps:
- On the Lambda console, choose Functions on the navigation pane.
- Choose the lf-app-lambda-engine function.
- On the Test tab, select Create new event.
- For Event JSON, enter a valid JSON (we provide some sample JSON events).
- Choose Test.
- Check the test results (JSON response).
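The function can also be invoked programmatically instead of through console test events. The sketch below assumes a boto3-style Lambda client. The event layout is an assumption: the `identity` and `fieldName` fields and the `db` argument are named in this post, but the `arguments` key and the exact shape of the `identity` value are not shown here, so the code treats `identity` as an opaque value supplied by the caller.

```python
import json


def build_event(identity, field_name, arguments=None):
    """Build a test event; the 'arguments' key name is an assumption."""
    return {"identity": identity, "fieldName": field_name, "arguments": arguments or {}}


def invoke_engine(lambda_client, event, function_name="lf-app-lambda-engine"):
    """Invoke the function synchronously and decode its JSON response."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(event).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())


# Example: ask which tables an identity can see in the lf-app-entities database.
event = build_event("user1", "getTablesByDb", {"db": "lf-app-entities"})
print(json.dumps(event, indent=2))
```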
The following are some sample test events you can try to see how different identities can access different sets of information.
user1 | user2 |
As an example, in the following test, we request users_partitioned_tbl table data in the context of user1:
The following is the related API response:
To troubleshoot the Lambda function, you can navigate to the Monitoring tab, choose View CloudWatch logs, and inspect the latest log stream.
Clean up
If you plan to explore Part 2 of this series, you can skip this part, because you will need the resources created here. You can refer to this section at the end of your testing.
Complete the following steps to remove the resources you created following this post and avoid incurring additional costs:
- On the AWS CloudFormation console, choose Stacks in the navigation pane.
- Choose the stack you created and choose Delete.
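The same cleanup can be scripted. This is a minimal sketch, assuming a boto3-style CloudFormation client; the stack name is whatever you chose at creation time.

```python
def delete_stack(cfn_client, stack_name):
    """Delete the stack and block until CloudFormation reports completion."""
    cfn_client.delete_stack(StackName=stack_name)
    waiter = cfn_client.get_waiter("stack_delete_complete")
    waiter.wait(StackName=stack_name)
```

The waiter raises an error if the deletion does not complete, in which case the stack events on the CloudFormation console show which resource blocked it.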
Additional considerations
In the proposed architecture, Lake Formation permissions were granted to specific IAM data access roles that requesting users (for example, the identity field) were mapped to. Another possibility is to assign permissions in Lake Formation to SAML users and groups and then work with the AssumeDecoratedRoleWithSAML API.
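As a rough sketch of that alternative, the call could look as follows with a boto3-style Lake Formation client. The SAML assertion, role ARN, and identity provider ARN are inputs you would obtain from your own federation setup; none of them are defined in this post.

```python
def assume_decorated_role_with_saml(lf_client, saml_assertion, role_arn,
                                    principal_arn, duration_seconds=3600):
    """Exchange a SAML assertion for temporary credentials via Lake Formation."""
    response = lf_client.assume_decorated_role_with_saml(
        SAMLAssertion=saml_assertion,  # base64-encoded assertion from the IdP
        RoleArn=role_arn,              # role the SAML user is allowed to assume
        PrincipalArn=principal_arn,    # ARN of the SAML provider in IAM
        DurationSeconds=duration_seconds,
    )
    return {
        key: response[key]
        for key in ("AccessKeyId", "SecretAccessKey", "SessionToken")
    }
```

The returned credentials would then play the role that StsCredentials1 plays in the access flows described earlier.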
Conclusion
In the first part of this series, we explored how to integrate custom applications and data processing engines with Lake Formation. We delved into the required configuration, APIs, and steps to enforce Lake Formation policies within custom data applications. As an example, we presented a sample Lake Formation integrated application built on Lambda.
The information provided in this post can serve as a foundation for developing your own custom applications or data processing engines that need to operate on a Lake Formation protected data lake.
Refer to the second part of this series to see how to build a sample web application that uses the Lambda based Lake Formation application.
About the Authors
Stefano Sandonà is a Senior Big Data Specialist Solution Architect at AWS. Passionate about data, distributed systems, and security, he helps customers worldwide architect high-performance, efficient, and secure data platforms.
Francesco Marelli is a Principal Solutions Architect at AWS. He specializes in the design, implementation, and optimization of large-scale data platforms. Francesco leads the AWS Solution Architect (SA) analytics team in Italy. He loves sharing his professional knowledge and is a frequent speaker at AWS events. Francesco is also passionate about music.