Databricks pioneered the open data lakehouse architecture and has been at the forefront of format interoperability. We’re excited to see more platforms adopt the lakehouse architecture and begin to embrace interoperable formats and standards. Interoperability lets customers reduce expensive data duplication by using a single copy of data with their choice of analytics and AI tools for their workloads. In particular, a common pattern for our customers is to use Databricks’ best-in-class ETL price/performance for upstream data, then access that data from BI and analytics tools such as Snowflake.
Unity Catalog is a unified and open governance solution for data and AI assets. A key feature of Unity Catalog is its implementation of the Iceberg REST Catalog APIs, which makes it simple to use any Iceberg-compliant reader without having to manually refresh your metadata location.
In this blog post, we’ll cover why the Iceberg REST Catalog is useful and walk through an example of how to read Unity Catalog tables in Snowflake.
Note: This functionality is available across cloud providers. The following instructions are specific to AWS S3, but it is possible to use other object storage platforms such as Azure Data Lake Storage (ADLS) or Google Cloud Storage (GCS).
Iceberg REST API Catalog Integration
Apache Iceberg™ maintains atomicity and consistency by creating a new metadata file for each table change. This ensures that incomplete writes don’t corrupt an existing metadata file. The Iceberg catalog tracks the new metadata file for each write. However, not every engine can connect to every Iceberg catalog, forcing customers to manually keep track of the latest metadata file location.
Iceberg solves interoperability across engines and catalogs with the Iceberg REST Catalog API. The Iceberg REST Catalog is a standardized, open API specification that provides a unified interface to Iceberg catalogs, decoupling catalog implementations from clients.
Unity Catalog has implemented the Iceberg REST Catalog APIs since the launch of Universal Format (UniForm) in 2023. Unity Catalog exposes the latest table metadata, guaranteeing interoperability with any Iceberg client compatible with the Iceberg REST Catalog, such as Apache Spark™, Trino, and Snowflake. Unity Catalog’s Iceberg REST Catalog endpoints extend governance as well as Delta Lake table features like Change Data Feed.
Snowflake’s REST API catalog integration lets you connect to Unity Catalog’s Iceberg REST APIs to retrieve the latest metadata file location. This means that with Unity Catalog, you can read tables directly in Snowflake as if they were Iceberg tables.
Note: As of this writing, Snowflake’s support for the Iceberg REST Catalog is in Public Preview. Unity Catalog’s Iceberg REST APIs, however, are Generally Available.
There are four steps to creating a REST catalog integration in Snowflake:
- Enable UniForm on a Delta Lake table in Databricks to generate Iceberg metadata
- Register Unity Catalog in Snowflake as your catalog
- Register an S3 bucket in Snowflake so it recognizes the source data
- Create an Iceberg table in Snowflake so you can query your data
Getting Started
We’ll start in Databricks with our Unity Catalog-managed table and make sure it can be read as Iceberg. Then we’ll move to Snowflake to complete the remaining steps.
Before we start, you’ll need a few things:
- A Databricks account with Unity Catalog (enabled by default for new workspaces)
- An AWS S3 bucket and IAM privileges
- A Snowflake account that can access your Databricks instance and S3
Unity Catalog namespaces follow a catalog_name.schema_name.table_name format. In the example below, we’ll use uc_catalog_name.uc_schema_name.uc_table_name for our Databricks table.
Step 1: Enable UniForm on a Delta table in Databricks
In Databricks, you can enable UniForm on a Delta Lake table. By default, new tables are managed by Unity Catalog. Full instructions are available in the UniForm documentation, but the commands are also included below.
For a new table, you can enable UniForm during table creation in your workspace:
CREATE TABLE uc_table_name(c1 INT) TBLPROPERTIES(
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
If you have an existing table, you can do this via an ALTER TABLE command:
ALTER TABLE uc_table_name SET TBLPROPERTIES(
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
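Either way, it helps to have a row of data to read later from Snowflake. A minimal sketch for the single-column table used here (adapt the values to your own schema):
INSERT INTO uc_table_name VALUES (1);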
You can confirm that a Delta table has UniForm enabled in the Catalog Explorer under the Details tab, which shows the Iceberg metadata location. It should look something like this:
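You can also check from SQL by inspecting the table properties; the UniForm settings set above should appear in the output:
SHOW TBLPROPERTIES uc_table_name;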
Step 2: Register Unity Catalog in Snowflake
While still in Databricks, create a service principal from the workspace admin settings and generate the accompanying secret and client ID. Instead of a service principal, you can also authenticate with personal access tokens for debugging and testing purposes, but we recommend a service principal for development and production workloads. From this step, you will need your <deployment-name> and the values for your OAuth <client-id> and <secret> so you can authenticate the integration in Snowflake.
Now switch over to your Snowflake account.
Note: There are a few naming differences between Databricks and Snowflake that may be confusing:
- A “catalog” in Databricks is a “warehouse” in the Snowflake Iceberg catalog integration configuration.
- A “schema” in Databricks is a “catalog_namespace” in the Snowflake Iceberg catalog integration.
You’ll see in the example below that the CATALOG_NAMESPACE value is uc_schema_name from our Unity Catalog table.
In Snowflake, create a catalog integration for Iceberg REST catalogs. Following that process, you’ll create a catalog integration like the one below:
CREATE OR REPLACE CATALOG INTEGRATION unity_catalog_int_oauth
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'uc_schema_name'
  REST_CONFIG = (
    CATALOG_URI = 'https://<deployment-name>.cloud.databricks.com/api/2.1/unity-catalog/iceberg'
    WAREHOUSE = 'uc_catalog_name'
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_TOKEN_URI = 'https://<deployment-name>.cloud.databricks.com/oidc/v1/token'
    OAUTH_CLIENT_ID = '<client-id>'
    OAUTH_CLIENT_SECRET = '<secret>'
    OAUTH_ALLOWED_SCOPES = ('all-apis', 'sql')
  )
  ENABLED = TRUE
  REFRESH_INTERVAL_SECONDS = <refresh-interval-seconds>;
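As a quick sanity check, you can inspect the integration’s configuration, and, if your Snowflake account supports it, list the tables the integration can see in Unity Catalog:
-- Show the catalog integration's configuration
DESCRIBE CATALOG INTEGRATION unity_catalog_int_oauth;
-- List tables visible through the integration (availability may vary)
SELECT SYSTEM$LIST_ICEBERG_TABLES_FROM_CATALOG('unity_catalog_int_oauth');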
The REST API catalog integration also unlocks time-based automated refresh. With automated refresh, Snowflake polls Unity Catalog for the latest metadata location at the time interval defined on the catalog integration. However, automated refresh is incompatible with manual refresh, so after a table update users may wait up to the full interval before changes appear. The REFRESH_INTERVAL_SECONDS parameter configured on the catalog integration applies to all Snowflake Iceberg tables created with that integration; it is not customizable per table.
Step 3: Register your S3 Bucket in Snowflake
In Snowflake, configure an external volume for Amazon S3. This involves creating an IAM role in AWS, configuring the role’s trust policy, and then creating an external volume in Snowflake using the role’s ARN.
For this step, use the same S3 bucket that Unity Catalog points to.
CREATE OR REPLACE EXTERNAL VOLUME iceberg_external_volume
  STORAGE_LOCATIONS =
  (
    (
      NAME = 'my-s3-us-west-2'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://<bucket-name>/'
      STORAGE_AWS_ROLE_ARN = '<iam-role-arn>'
      STORAGE_AWS_EXTERNAL_ID = '<external-id>'
    )
  );
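To confirm that Snowflake can reach the bucket with the IAM role you configured, you can verify the external volume:
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('iceberg_external_volume');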
Step 4: Create an Apache Iceberg™ table in Snowflake
In Snowflake, create an Iceberg table using the previously created catalog integration and external volume to connect to the Delta Lake table. You can choose the name for your Iceberg table in Snowflake; it does not have to match the Delta Lake table name in Databricks.
Note: The correct mapping for CATALOG_TABLE_NAME in Snowflake is the Databricks table name. In our example, that is uc_table_name. You do not need to specify the catalog or schema at this step, because they were already specified in the catalog integration.
CREATE OR REPLACE ICEBERG TABLE <snowflake_table_name>
  EXTERNAL_VOLUME = 'iceberg_external_volume'
  CATALOG = 'unity_catalog_int_oauth'
  CATALOG_TABLE_NAME = 'uc_table_name'
  AUTO_REFRESH = TRUE;
The AUTO_REFRESH = TRUE clause above is optional; it enables auto-refresh at the time interval configured on the catalog integration. Note that if auto-refresh is enabled, manual refresh is disabled.
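To confirm everything is wired up, query the table (using whatever table name you chose above):
SELECT * FROM <snowflake_table_name>;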
You have now successfully read the Delta Lake table in Snowflake.
Finishing Up: Test the Connection
In Databricks, update the Delta table data by inserting a new row.
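For example, continuing with the single-column table from Step 1 (a minimal sketch; adapt the value to your own schema):
INSERT INTO uc_catalog_name.uc_schema_name.uc_table_name VALUES (2);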
If you previously enabled auto-refresh, the table will update automatically at the specified time interval. If you didn’t, you can manually refresh it in Snowflake by running ALTER ICEBERG TABLE … REFRESH, as shown below.
Note: If you previously enabled auto-refresh, you cannot run the manual refresh command and will need to wait for the auto-refresh interval to elapse before the table updates.
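On the Snowflake side, the check looks like this (assuming <snowflake_table_name> from Step 4; skip the REFRESH if auto-refresh is enabled):
-- Manually pull the latest metadata location from Unity Catalog
ALTER ICEBERG TABLE <snowflake_table_name> REFRESH;
-- The newly inserted row should now be visible
SELECT * FROM <snowflake_table_name>;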
Video Demo
If you would like a video tutorial, this video demonstrates how to bring all of these steps together to read Delta tables with UniForm in Snowflake.
We’re thrilled by the continued support for the lakehouse architecture. Customers no longer need to duplicate data, reducing cost and complexity. This architecture also allows customers to choose the right tool for the right workload.
The key to an open lakehouse is storing your data in an open format such as Delta Lake or Iceberg. Proprietary formats lock customers into an engine, while open formats give you flexibility and portability. Whatever the platform, we encourage customers to always own their own data as the first step toward interoperability. In the coming months, we’ll continue to build features that make it even simpler to manage an open data lakehouse with Unity Catalog.