IBM watsonx.data Integration with Unity Catalog Simplified

By Anurag Dwivedi

Seamless Data Management and Analytics with MDS and Unity Catalog Spark Integration

The Unity Catalog API has marked a major step forward in the data ecosystem, and Databricks’ decision to open source it has been widely applauded for encouraging transparency, adaptability, and seamless tool integration.

By supporting Unity Catalog through its Metadata Service (MDS), IBM watsonx.data is bridging the gap between platforms, enabling seamless, unified data interactions and integration across different data management and analytics tools.

Let’s dive into what this integration offers—and how you can start using it in your workflows.

Unlocking new potential with Open-Source Unity REST APIs

Support for Unity Catalog and for multiple cloud storage options gives organizations the flexibility to work across different cloud platforms, in keeping with Unity Catalog’s goal of openness and interoperability.

When users connect their storage, such as Azure or GCP, to watsonx.data and configure a catalog, they can integrate with the Unity Spark JAR. During setup, watsonx.data securely generates an access token from the provided storage credentials. This token is passed to the Unity Spark client, which uses it to read and write directly in the connected storage. The result is a secure, seamless process for efficient data workflows.


Image 1: Connecting to Azure or GCP through Unity Catalog

Metadata Service (MDS), together with the compatible Unity Catalog Spark JAR (version 0.2.0), allows customers to perform various data management activities, such as creating, viewing, and deleting schemas and tables.

Integration Points

MDS supports key operations in Unity Spark for namespaces and tables:

  • Namespace Management: Create, retrieve, and delete namespaces.
  • Table Management: Create, retrieve, and delete tables.

APIs and Endpoints

MDS exposes RESTful APIs to manage schemas, tables, and temporary credentials. The following endpoints are available:

Schemas:
POST /schemas: Create a schema.
GET /schemas: Retrieve all schemas.
GET /schemas/{full_name}: Retrieve schema details by full name.
PATCH /schemas/{full_name}: Update schema details.
DELETE /schemas/{full_name}: Delete a schema.

Tables:
POST /tables: Create a table.
GET /tables: Retrieve all tables.
GET /tables/{full_name}: Retrieve table details by full name.
DELETE /tables/{full_name}: Delete a table.

Temporary Credentials:
POST /temporary-path-credential: Generate temporary credentials for path access.
POST /temporary-table-credential: Generate temporary credentials for table access.
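
To illustrate, here is a minimal sketch of calling these endpoints with curl. It assumes the request and response shapes follow the open-source Unity Catalog REST specification; the schema name, catalog name, and locationUri value below are illustrative placeholders.

# Create a schema (payload shape assumed from the open-source Unity Catalog REST spec)
curl -X POST "<mds-rest-base-uri>/schemas" \
    -H "Authorization: Bearer <token>" \
    -H "Content-Type: application/json" \
    -d '{"name": "sales", "catalog_name": "unity_delta_catalog", "properties": {"locationUri": "gs://<bucket_name>/sales"}}'

# List the schemas in the catalog to confirm
curl -s "<mds-rest-base-uri>/schemas?catalog_name=unity_delta_catalog" \
    -H "Authorization: Bearer <token>"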

Setting up Unity Spark integration

This section describes how to set up the Unity Spark integration.

Required JARs

For the integration, the following JARs must be available in the local Spark jars directory (or resolved through --packages, as in the configurations below):

delta-spark_2.12-3.2.1.jar
unitycatalog-spark-0.2.0.jar
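
If you place the JARs manually rather than resolving them with --packages, a minimal sketch of the placement looks like this, assuming SPARK_HOME points to your local Spark installation:

# Copy the Delta and Unity Catalog JARs into the local Spark jars directory
cp delta-spark_2.12-3.2.1.jar unitycatalog-spark-0.2.0.jar "$SPARK_HOME/jars/"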

Spark Configuration

Use the following Spark configuration for a smooth integration experience. Update the parameters with your CPD/SaaS environment details:

• <mds-rest-base-uri>
• <token>
• <access_key>
• <secret_key>
• <bucket_name>

Following is an example MDS REST base URL:

mds-rest-base-uri = https://80e7cce9-c14f-4aa8-8a3b-c52adc25efac.cdc406pd09pasng7elgg.lakehouse.dev.appdomain.cloud:32230

Configuration to use Azure storage

spark-sql --name mds-uc-azure \
    --master "local[*]" \
    --packages io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0,org.apache.hadoop:hadoop-azure:3.3.6 \
    --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
    --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
    --conf spark.sql.catalog.unity_delta_catalog=io.unitycatalog.spark.UCSingleCatalog \
    --conf spark.sql.catalog.unity_delta_catalog.uri=<mds-rest-base-uri> \
    --conf spark.sql.catalog.unity_delta_catalog.token=<token> \
    --conf spark.sql.defaultCatalog=unity_delta_catalog

Configuration to use GCP storage

spark-sql --name mds-uc-gcp \
    --master "local[*]" \
    --packages io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0 \
    --jars https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/3.0.2/gcs-connector-3.0.2-shaded.jar \
    --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
    --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
    --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
    --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
    --conf spark.sql.catalog.unity_hive_catalog=io.unitycatalog.spark.UCSingleCatalog \
    --conf spark.sql.catalog.unity_hive_catalog.uri=<mds-rest-base-uri> \
    --conf spark.sql.catalog.unity_hive_catalog.token=<token> \
    --conf spark.sql.defaultCatalog=unity_hive_catalog
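
Once either session starts, a quick sanity check against the catalog configured above confirms that Spark can reach MDS:

-- Verify the Unity catalog configured for Azure is reachable
SHOW NAMESPACES IN unity_delta_catalog;

-- Or, for the GCP session
SHOW NAMESPACES IN unity_hive_catalog;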

Supported Spark versions

MDS supports the following Spark versions:

• Spark 3.5.1
• Spark 3.4.3
• Spark 3.3.4

Sample Spark commands

Following are sample Spark commands to work with Unity Catalog.

Schemas:

CREATE SCHEMA <catalog_name>.<schema_name> WITH DBPROPERTIES (locationUri = 'abfss://<container_name>@<storage_name>.dfs.core.windows.net/<schema_name>');

CREATE SCHEMA <catalog_name>.<schema_name> WITH DBPROPERTIES (locationUri = 'gs://<bucket_name>/<schema_name>');

SHOW NAMESPACES IN <catalog_name>;

USE <catalog_name>.<schema_name>;

DROP SCHEMA <catalog_name>.<schema_name>;


Tables:

CREATE TABLE <catalog_name>.<schema_name>.<table_name> (id INT, name STRING) USING DELTA LOCATION 'abfss://<container_name>@<storage_name>.dfs.core.windows.net/<schema_name>/<table_name>';

CREATE TABLE <catalog_name>.<schema_name>.<table_name> (name STRING, age INT) USING ORC LOCATION 'gs://<bucket_name>/<schema_name>/<table_name>';

SHOW TABLES;

DESCRIBE <catalog_name>.<schema_name>.<table_name>;

INSERT INTO <catalog_name>.<schema_name>.<table_name> VALUES (value1, value2, ...);

SELECT * FROM <catalog_name>.<schema_name>.<table_name>;

DROP TABLE <catalog_name>.<schema_name>.<table_name>;

NOTE: Currently, creating a table without specifying a location in the command is not supported. Ideally, the table would inherit the schema’s location, but as of the publishing of this blog, that functionality is not available in the MDS with Unity integration.

All standard Spark SQL queries supported by the Thrift endpoint are also compatible with the REST endpoint. Additionally, Unity Spark–specific syntax, such as SHOW NAMESPACES IN mds_delta_catalog, is supported, extending usability beyond standard Apache Spark SQL.

Behavior and limitations

Following are some of the behaviors and limitations that must be considered when implementing an integration with Unity Catalog:

• Location URI: MDS requires a Location URI to create a schema. For example:

CREATE NAMESPACE unity_spark_db WITH DBPROPERTIES ( locationUri = 's3a://bucket_name/schema_name' );

• Catalog mapping: In MDS, spark_catalog is mapped to Hive as the default catalog, which may differ slightly from Unity’s default behavior.

• Supported storage: In the current implementation, the Unity Spark integration works only with Azure and GCP cloud storage.

Conclusion

In summary, the integration of watsonx.data MDS with Unity Spark marks a major advancement in building unified, efficient, and seamless data management systems. This integration empowers users to streamline workflows, harness advanced analytics, and boost overall productivity.

