AWS Lake Formation makes it straightforward to centrally govern, secure, and globally share data for analytics and machine learning (ML).
With Lake Formation, you can centralize data security and governance using the AWS Glue Data Catalog, letting you manage metadata and data permissions in one place with familiar database-style features. It also delivers fine-grained data access control, so you can make sure users have access to the right data down to the row and column level.
Lake Formation also makes it straightforward to share data internally across your organization and externally, which lets you create a data mesh or meet other data sharing needs with no data movement.
Additionally, because Lake Formation tracks data interactions by role and user, it provides comprehensive data access auditing to verify that the right data was accessed by the right users at the right time.
In this two-part series, we show how to integrate custom applications or data processing engines with Lake Formation using the third-party services integration feature.
In this post, we dive deep into the required Lake Formation and AWS Glue APIs. We walk through the steps to enforce Lake Formation policies within custom data applications. As an example, we present a sample Lake Formation integrated application implemented using AWS Lambda.
The second part of the series introduces a sample web application built with AWS Amplify. This web application showcases how to use the custom data processing engine implemented in the first post.
By the end of this series, you will have a comprehensive understanding of how to extend the capabilities of Lake Formation by building and integrating your own custom data processing components.
Integrate an external application
The process of integrating a third-party application with Lake Formation is described in detail in How Lake Formation application integration works.
In this section, we dive deeper into the steps required to establish trust between Lake Formation and an external application, the API operations that are involved, and the AWS Identity and Access Management (IAM) permissions that need to be set up to enable the integration.
Lake Formation application integration external data filtering
In Lake Formation, it's possible to control which third-party engines or applications are allowed to read and filter data in Amazon Simple Storage Service (Amazon S3) locations registered with Lake Formation.
To do so, you can navigate to the Application integration settings page on the Lake Formation console and enable Allow external engines to filter data in Amazon S3 locations registered with Lake Formation, specifying the AWS account IDs from which third-party engines are allowed to access locations registered with Lake Formation. In addition, you have to specify the allowed session tag values that identify trusted requests. We discuss in later sections how these tags are used.
AWS APIs involved in the Lake Formation application integration
The following is a list of the main AWS APIs needed to integrate an application with Lake Formation:
- sts:AssumeRole – Returns a set of temporary security credentials that you can use to access AWS resources.
- glue:GetUnfilteredTableMetadata – Allows a third-party analytical engine to retrieve unfiltered table metadata from the Data Catalog.
- glue:GetUnfilteredPartitionsMetadata – Retrieves partition metadata from the Data Catalog that contains unfiltered metadata.
- lakeformation:GetTemporaryGlueTableCredentials – Allows a caller in a secure environment to assume a role with permission to access Amazon S3. To vend such credentials, Lake Formation assumes the role associated with a registered location, for example an S3 bucket, with a scope-down policy that restricts the access to a single prefix.
- lakeformation:GetTemporaryGluePartitionCredentials – This API is identical to GetTemporaryTableCredentials except that it's used when the target Data Catalog resource is of type Partition. Lake Formation restricts the permission of the vended credentials with the same scope-down policy that restricts access to a single Amazon S3 prefix.
Later in this post, we present a sample architecture illustrating how you can use these APIs.
External application and IAM roles to access data
For an external application to access resources in a Lake Formation environment, it needs to run under an IAM principal (user or role) with the appropriate credentials. Let's consider a scenario where the external application runs under the IAM role MyApplicationRole that is part of the AWS account 123456789012.
In Lake Formation, you have granted access to various tables and databases to two specific IAM roles: AccessRole1 and AccessRole2.
To enable MyApplicationRole to access the resources that have been granted to AccessRole1 and AccessRole2, you must configure the trust relationships for these access roles. Specifically, you must configure the following:
- Allow MyApplicationRole to assume each of the access roles (AccessRole1 and AccessRole2) using the sts:AssumeRole API
- Allow MyApplicationRole to tag the assumed session with a specific tag, which is required by Lake Formation. The tag key should be LakeFormationAuthorizedCaller, and the value should match one of the session tag values specified on the Application integration settings page on the Lake Formation console (for example, application1).
The following code is an example of the trust relationships configuration for an access role (AccessRole1 or AccessRole2):
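Your policy will differ depending on your account ID and role names; the following is a minimal sketch, assuming MyApplicationRole belongs to account 123456789012 and application1 is one of the allowed session tag values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/MyApplicationRole"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/MyApplicationRole"
      },
      "Action": "sts:TagSession",
      "Condition": {
        "StringLike": {
          "aws:RequestTag/LakeFormationAuthorizedCaller": "application1"
        }
      }
    }
  ]
}
```

The second statement is what lets the application attach the LakeFormationAuthorizedCaller session tag when it assumes the role.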
Additionally, the data access IAM roles (AccessRole1 and AccessRole2) must have the following IAM permissions assigned in order to read Lake Formation protected tables:
Solution overview
For our solution, Lambda serves as our external trusted engine and application integrated with Lake Formation. This example is provided to help you understand and see in action the access flow and the Lake Formation API responses. Because it's based on a single Lambda function, it's not meant to be used in production settings or with high volumes of data.
Moreover, the Lambda-based engine has been configured to support a limited set of data files (CSV, Parquet, and JSON), a limited set of table configurations (no nested data), and a limited set of table operations (SELECT only). Due to these limitations, the application shouldn't be used for arbitrary tests.
In this post, we provide instructions on how to deploy a sample API application integrated with Lake Formation that implements the solution architecture. The core of the API is implemented with a Python Lambda function. We also show how to test the function with Lambda tests. In the second post in this series, we provide instructions on how to deploy a web frontend application that integrates with this Lambda function.
Access flow for unpartitioned tables
The following diagram summarizes the access flow when accessing unpartitioned tables.
The workflow consists of the following steps (a minimal code sketch of the main API calls follows the list):
1. User A (authenticated with Amazon Cognito or other equivalent systems) sends a request to the application API endpoint, requesting access to a specific table within a specific database.
2. The API endpoint, created with AWS AppSync, handles the request, invoking a Lambda function.
3. The function checks which IAM data access role the user is mapped to. For simplicity, the example uses a static hardcoded mapping (mappings={ "user1": "lf-app-access-role-1", "user2": "lf-app-access-role-2"}).
4. The function invokes the sts:AssumeRole API to assume the user-related IAM data access role (lf-app-access-role-1). The AssumeRole operation is performed with the tag LakeFormationAuthorizedCaller, having as its value one of the session tag values specified when configuring the application integration settings in Lake Formation (for example, {'Key': 'LakeFormationAuthorizedCaller','Value': 'application1'}). The API returns a set of temporary credentials, which we refer to as StsCredentials1.
5. Using StsCredentials1, the function invokes the glue:GetUnfilteredTableMetadata API, passing the requested database and table name. The API returns information like the table location, a list of authorized columns, and data filters, if defined.
6. Using StsCredentials1, the function invokes the lakeformation:GetTemporaryGlueTableCredentials API, passing the requested database and table name, the type of requested access (SELECT), and CELL_FILTER_PERMISSION as the supported permission types (because the Lambda function implements logic to apply row-level filters). The API returns a set of temporary Amazon S3 credentials, which we refer to as S3Credentials1.
7. Using S3Credentials1, the function lists the S3 files stored under the table location S3 prefix and downloads them.
8. The retrieved Amazon S3 data is filtered to remove the columns and rows that the user isn't allowed to access (authorized columns and row filters were retrieved in Step 5), and the authorized data is returned to the user.
Access flow for partitioned tables
The following diagram summarizes the access flow when accessing partitioned tables.
The steps involved are almost identical to the ones presented for unpartitioned tables, with the following differences (a continuation of the earlier code sketch follows the list):
- After invoking the glue:GetUnfilteredTableMetadata API (Step 5) and identifying the table as partitioned, the Lambda function invokes the glue:GetUnfilteredPartitionsMetadata API using StsCredentials1 (Step 6). The API returns, among other information, the list of partition values and locations.
- For each partition, the function performs the following actions:
  - Invokes the lakeformation:GetTemporaryGluePartitionCredentials API (Step 7), passing the requested database and table name, the partition value, the type of requested access (SELECT), and CELL_FILTER_PERMISSION as the supported permission type (because the Lambda function implements logic to apply row-level filters). The API returns a set of temporary Amazon S3 credentials, which we refer to as S3CredentialsPartitionX.
  - Uses S3CredentialsPartitionX to list the partition location S3 files and download them (Step 8).
- The function appends the retrieved data.
- Before the Lambda function returns the results to the user (Step 9), the retrieved Amazon S3 data is filtered to remove the columns and rows that the user isn't allowed to access (authorized columns and row filters were retrieved in Step 5).
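Continuing the previous sketch for the partitioned table (it reuses the glue, lakeformation, ACCOUNT_ID, and REGION values defined there), the per-partition calls could look like the following:

```python
# Step 6: retrieve the partitions (values and locations) visible to the caller
partitions = glue.get_unfiltered_partitions_metadata(
    CatalogId=ACCOUNT_ID,
    DatabaseName="lf-app-entities",
    TableName="users_partitioned_tbl",
    SupportedPermissionTypes=["CELL_FILTER_PERMISSION"],
)

table_arn = f"arn:aws:glue:{REGION}:{ACCOUNT_ID}:table/lf-app-entities/users_partitioned_tbl"
for unfiltered_partition in partitions["UnfilteredPartitions"]:
    values = unfiltered_partition["Partition"]["Values"]
    location = unfiltered_partition["Partition"]["StorageDescriptor"]["Location"]

    # Step 7: vend temporary S3 credentials scoped to this partition's prefix
    partition_creds = lakeformation.get_temporary_glue_partition_credentials(
        TableArn=table_arn,
        Partition={"Values": values},
        Permissions=["SELECT"],
        SupportedPermissionTypes=["CELL_FILTER_PERMISSION"],
    )  # S3CredentialsPartitionX

    # Step 8: list (and then download) this partition's files with the vended credentials
    s3_partition = boto3.client(
        "s3",
        aws_access_key_id=partition_creds["AccessKeyId"],
        aws_secret_access_key=partition_creds["SecretAccessKey"],
        aws_session_token=partition_creds["SessionToken"],
    )
    bucket, _, prefix = location.removeprefix("s3://").partition("/")
    partition_objects = s3_partition.list_objects_v2(Bucket=bucket, Prefix=prefix)
```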
Prerequisites
The following prerequisites are needed to deploy and test the solution:
- Lake Formation should be enabled in the AWS Region where the sample application will be deployed
- The steps must be run with an IAM principal with sufficient permissions to create the needed resources, including Lake Formation databases and tables
Deploy solution resources with AWS CloudFormation
We create the solution resources using AWS CloudFormation. The provided CloudFormation template creates the following resources:
- One S3 bucket to store table data (lf-app-data-)
- Two IAM roles, which will be mapped to users and their associated Lake Formation permission policies (lf-app-access-role-1 and lf-app-access-role-2)
- Two IAM roles used for the two created Lambda functions (lf-app-lambda-datalake-population-role and lf-app-lambda-role)
- One AWS Glue database (lf-app-entities) with two AWS Glue tables, one unpartitioned (users_tbl) and one partitioned (users_partitioned_tbl)
- One Lambda function used to populate the data lake data (lf-app-lambda-datalake-population)
- One Lambda function used for the Lake Formation integrated application (lf-app-lambda-engine)
- One IAM role used by Lake Formation to access the table data and perform credentials vending (lf-app-datalake-location-role)
- One Lake Formation data lake location (s3://lf-app-data-/datasets) associated with the IAM role created for credentials vending (lf-app-datalake-location-role)
- One Lake Formation data filter (lf-app-filter-1)
- One Lake Formation tag (key: sensitive, values: true or false)
- Tag associations to tag the created unpartitioned AWS Glue table (users_tbl) columns with the created tag
To launch the stack and provision your resources, complete the following steps:
- Download the code zip package for the Lambda function used for the Lake Formation integrated application (lf-integrated-app.zip).
- Download the code zip package for the Lambda function used to populate the data lake data (datalake-population-function.zip).
- Upload the zip packages to an existing S3 bucket location (for example, s3://mybucket/myfolder1/myfolder2/lf-integrated-app.zip and s3://mybucket/myfolder1/myfolder2/datalake-population-function.zip).
- Choose Launch Stack.
This automatically launches AWS CloudFormation in your AWS account with a template. Make sure that you create the stack in your intended Region.
- Choose Next to move to the Specify stack details section.
- For Parameters, provide the following parameters:
  - For powertoolsLogLevel, specify how verbose the Lambda function logger should be, from the most verbose to the least verbose (no logs). For this post, we choose DEBUG.
  - For s3DeploymentBucketName, enter the name of the S3 bucket containing the Lambda functions' code zip packages. For this post, we use mybucket.
  - For s3KeyLambdaDataPopulationCode, enter the Amazon S3 location containing the code zip package for the Lambda function used to populate the data lake data (datalake-population-function.zip). For example, myfolder1/myfolder2/datalake-population-function.zip.
  - For s3KeyLambdaEngineCode, enter the Amazon S3 location containing the code zip package for the Lambda function used for the Lake Formation integrated application (lf-integrated-app.zip). For example, myfolder1/myfolder2/lf-integrated-app.zip.
- Choose Next.
- Add additional AWS tags if required.
- Choose Next.
- Acknowledge the final requirements.
- Choose Create stack.
Enable the Lake Formation application integration
Complete the following steps to enable the Lake Formation application integration:
- On the Lake Formation console, choose Application integration settings in the navigation pane.
- Enable Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
- For Session tag values, choose application1.
- For AWS account IDs, enter the current AWS account ID.
- Choose Save.
Enforce Lake Formation permissions
The CloudFormation stack created one database named lf-app-entities with two tables named users_tbl and users_partitioned_tbl.
To make sure you're using Lake Formation permissions, you should verify that you don't have any grants set up on these tables for the principal IAMAllowedPrincipals. The IAMAllowedPrincipals group includes any IAM users and roles that are allowed access to your Data Catalog resources by your IAM policies, and it's used to maintain backward compatibility with AWS Glue.
To make sure Lake Formation permissions are enforced, navigate to the Lake Formation console and choose Data lake permissions in the navigation pane. Filter permissions by Database = lf-app-entities and remove all the permissions given to the principal IAMAllowedPrincipals.
For more details on IAMAllowedPrincipals and backward compatibility with AWS Glue, refer to Changing the default security settings for your data lake.
Inspect the created Lake Formation resources and permissions
The CloudFormation stack created two IAM roles, lf-app-access-role-1 and lf-app-access-role-2, and assigned them different permissions on the users_tbl (unpartitioned) and users_partitioned_tbl (partitioned) tables. The specific Lake Formation grants are summarized in the following table.
| IAM Role | users_tbl (table in lf-app-entities) | users_partitioned_tbl (table in lf-app-entities) |
| --- | --- | --- |
| lf-app-access-role-1 | No access | Read access on columns uid, state, and city for all the records. Read access to all columns apart from address only on rows with value state=uk. |
| lf-app-access-role-2 | Read access on columns with the tag sensitive = false | Read access to all columns and rows. |
To better understand the full permissions setup, you should review the Lake Formation resources and permissions created by CloudFormation. On the Lake Formation console, complete the following steps:
- Review the data filters:
  - Choose Data filters in the navigation pane.
  - Inspect the lf-app-filter-1 data filter.
- Review the tags:
  - Choose LF-Tags and permissions in the navigation pane.
  - Inspect the sensitive tag.
- Review the tag associations:
  - Choose Tables in the navigation pane.
  - Choose the users_tbl table.
  - Inspect the LF-Tags associated with the different columns in the Schema section.
- Review the Lake Formation permissions:
  - Choose Data lake permissions in the navigation pane.
  - Filter by Principal = lf-app-access-role-1 and inspect the assigned permissions.
  - Filter by Principal = lf-app-access-role-2 and inspect the assigned permissions.
Test the Lambda function
The Lambda function created by the CloudFormation template accepts JSON objects as input events. The JSON events have the following structure:
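The exact event schema is defined by the Lambda function's code; based on the identity, fieldName, and argument fields described next, a request event looks roughly like this (the field values and the nesting of the arguments are illustrative assumptions):

```json
{
  "identity": "user1",
  "fieldName": "getTableData",
  "arguments": {
    "db": "lf-app-entities",
    "table": "users_partitioned_tbl"
  }
}
```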
Although the identity field is always needed in order to identify the calling identity, different arguments should be provided depending on the requested operation (fieldName). The following table lists these arguments.
| Operation | Description | Needed Arguments | Output |
| --- | --- | --- | --- |
| getDbs | List databases | No arguments needed | List of databases the user has access to |
| getTablesByDb | List tables | db | List of tables within a database the user has access to |
| getUnfilteredTableMetadata | Return the table metadata | db, table | Returns the output of the glue:GetUnfilteredTableMetadata API |
| getUnfilteredPartitionsMetadata | Return the table partitions metadata | db, table | Returns the output of the glue:GetUnfilteredPartitionsMetadata API |
| getTableData | Get table data | db, table | Returns the table data (authorized columns and rows) the user has access to |
To test the Lambda function, you can create some sample Lambda test events. Complete the following steps:
- On the Lambda console, choose Functions in the navigation pane.
- Choose the lf-app-lambda-engine function.
- On the Test tab, select Create new event.
- For Event JSON, enter a valid JSON (we provide some sample JSON events).
- Choose Test.
- Inspect the test results (JSON response).
The following are some sample test events you can try to see how different identities can access different sets of data.
user1 | user2 |
For instance, in the following test, we request users_partitioned_tbl table data in the context of user1:
The following is the related API response:
To troubleshoot the Lambda function, you can navigate to the Monitoring tab, choose View CloudWatch logs, and inspect the latest log stream.
Clean up
If you plan to explore Part 2 of this series, you can skip this part, because you will need the resources created here. You can come back to this section at the end of your testing.
Complete the following steps to remove the resources you created following this post and avoid incurring additional costs:
- On the AWS CloudFormation console, choose Stacks in the navigation pane.
- Choose the stack you created and choose Delete.
Additional considerations
In the proposed architecture, Lake Formation permissions were granted to specific IAM data access roles that requesting users (for example, the identity field) were mapped to. Another possibility is to assign permissions in Lake Formation to SAML users and groups and then work with the AssumeDecoratedRoleWithSAML API.
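As a hedged illustration of that alternative (the role ARN, SAML provider ARN, and assertion are placeholders), the call with boto3 could look like the following:

```python
import boto3


def credentials_from_saml(saml_assertion_base64: str) -> dict:
    """Exchange a SAML assertion for temporary Lake Formation data access credentials.

    The role and identity provider ARNs below are placeholders for illustration only.
    """
    lakeformation = boto3.client("lakeformation")
    response = lakeformation.assume_decorated_role_with_saml(
        SAMLAssertion=saml_assertion_base64,
        RoleArn="arn:aws:iam::123456789012:role/lf-saml-access-role",
        PrincipalArn="arn:aws:iam::123456789012:saml-provider/my-idp",
        DurationSeconds=3600,
    )
    # The returned keys can be used to create Glue, Lake Formation, and S3 clients,
    # as shown in the earlier sketches
    return {
        "AccessKeyId": response["AccessKeyId"],
        "SecretAccessKey": response["SecretAccessKey"],
        "SessionToken": response["SessionToken"],
    }
```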
Conclusion
In the first part of this series, we explored how to integrate custom applications and data processing engines with Lake Formation. We delved into the required configuration, APIs, and steps to enforce Lake Formation policies within custom data applications. As an example, we presented a sample Lake Formation integrated application built on Lambda.
The information provided in this post can serve as a foundation for creating your own custom applications or data processing engines that need to operate on a Lake Formation protected data lake.
Refer to the second part of this series to see how to build a sample web application that uses the Lambda-based Lake Formation application.
About the Authors
Stefano Sandonà is a Senior Big Data Specialist Solution Architect at AWS. Passionate about data, distributed systems, and security, he helps customers worldwide architect high-performance, efficient, and secure data platforms.
Francesco Marelli is a Principal Solutions Architect at AWS. He specializes in the design, implementation, and optimization of large-scale data platforms. Francesco leads the AWS Solution Architect (SA) analytics team in Italy. He loves sharing his expert knowledge and is a frequent speaker at AWS events. Francesco is also passionate about music.