Seamless Data Management and Analytics with MDS and Unity Catalog Spark Integration
The Unity Catalog API marks a major step forward in the data ecosystem, and Databricks’ decision to open source it has been widely applauded for encouraging transparency, adaptability, and seamless tool integration.
By supporting Unity Catalog through its Metadata Service (MDS), IBM watsonx.data bridges the gap, enabling seamless, unified data interactions and integration across different data management and analytics platforms.
Let’s dive into what this integration offers—and how you can start using it in your workflows.
Unlocking new potential with open-source Unity REST APIs
Support for Unity Catalog and various cloud storage options gives organizations more flexibility to work across different cloud platforms, furthering Unity Catalog’s goal of openness and interoperability.
When users connect their storage, such as Azure or GCP, to watsonx.data and configure a catalog, they can integrate with the Unity Spark JAR. During setup, watsonx.data securely generates an access token from the provided storage credentials. This token is passed to the Unity Spark client, enabling it to perform read and write operations directly on the connected storage, keeping the process secure and seamless for efficient data workflows.
Image 1: Connecting to Azure or GCP through Unity Catalog
MDS, used with the compatible Unity Catalog Spark JAR (version 0.2.0), allows customers to perform various data management activities, such as creating, viewing, and deleting schemas and tables.
Integration Points
MDS supports key operations in Unity Spark for namespaces and tables:
- Namespace Management: Create, retrieve, and delete namespaces.
- Table Management: Create, retrieve, and delete tables.
APIs and Endpoints
MDS exposes robust RESTful APIs to manage schemas, tables, and temporary credentials effectively. Following are the available endpoints:
Schemas -
POST /schemas: Create a schema.
GET /schemas: Retrieve all schemas.
GET /schemas/{full_name}: Retrieve schema details by full name.
PATCH /schemas/{full_name}: Update schema details.
DELETE /schemas/{full_name}: Delete a schema.
Tables -
POST /tables: Create a table.
GET /tables: Retrieve all tables.
GET /tables/{full_name}: Retrieve table details by full name.
DELETE /tables/{full_name}: Delete a table.
Temporary Credentials -
POST /temporary-path-credential: Generate temporary credentials for path access.
POST /temporary-table-credential: Generate temporary credentials for table access.
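These endpoints can also be exercised directly over HTTP. The sketch below is illustrative only: it assumes the endpoints live under the standard open-source Unity Catalog path prefix (/api/2.1/unity-catalog), bearer-token authentication, and the open-source request payload shapes; all names in angle brackets are placeholders.
# Create a schema (payload shape follows the open-source Unity Catalog spec)
curl -X POST "<mds-rest-base-uri>/api/2.1/unity-catalog/schemas" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"name": "<schema_name>", "catalog_name": "<catalog_name>", "properties": {"locationUri": "gs://<bucket_name>/<schema_name>"}}'
# Retrieve all schemas in a catalog
curl "<mds-rest-base-uri>/api/2.1/unity-catalog/schemas?catalog_name=<catalog_name>" \
  -H "Authorization: Bearer <token>"
# Generate temporary credentials for table access
curl -X POST "<mds-rest-base-uri>/api/2.1/unity-catalog/temporary-table-credential" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"table_id": "<table_id>", "operation": "READ"}'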
Setting up Unity Spark integration
This section walks through setting up the Unity Spark integration.
Required JARs
For the integration, the following JARs must be placed in the local Spark JARs directory:
· delta-spark_2.12-3.2.0.jar
· unitycatalog-spark_2.12-0.2.0.jar
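If the JARs are not already available locally, they can be downloaded from Maven Central, for example as below. The coordinates mirror the --packages entries used in the commands that follow; adjust the versions to your environment, and note that SPARK_HOME is assumed to point at your Spark installation.
# Download the required JARs into the local Spark jars directory
cd $SPARK_HOME/jars
# Delta Lake Spark connector
curl -O https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.2.0/delta-spark_2.12-3.2.0.jar
# Unity Catalog Spark client
curl -O https://repo1.maven.org/maven2/io/unitycatalog/unitycatalog-spark_2.12/0.2.0/unitycatalog-spark_2.12-0.2.0.jar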
Spark Configuration
Use the following Spark configuration for a smooth integration experience. Update these parameters with your CPD/SaaS environment details:
· <mds-rest-base-uri>: base URL of the MDS REST endpoint
· <token>: bearer token used to authenticate with MDS
· <access_key>: access key for the connected storage
· <secret_key>: secret key for the connected storage
· <bucket_name>: name of the target bucket or container
Following is an example MDS REST base URL:
mds-rest-base-uri = https://80e7cce9-c14f-4aa8-8a3b-c52adc25efac.cdc406pd09pasng7elgg.lakehouse.dev.appdomain.cloud:32230
Configuration to use Azure storage
spark-sql --name mds-uc-azure \
--master "local[*]" \
--packages io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0,org.apache.hadoop:hadoop-azure:3.3.6 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.unity_delta_catalog=io.unitycatalog.spark.UCSingleCatalog \
--conf spark.sql.catalog.unity_delta_catalog.uri=<mds-rest-base-uri> \
--conf spark.sql.catalog.unity_delta_catalog.token=<token> \
--conf spark.sql.defaultCatalog=unity_delta_catalog
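Once the shell starts, a quick smoke test confirms the session is talking to MDS (a minimal sketch; the namespace list may be empty on a freshly created catalog):
-- Verify the default catalog and list namespaces served by MDS
select current_catalog();
show namespaces in unity_delta_catalog;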
Configuration to use GCP storage
spark-sql --name mds-uc-gcp \
--master "local[*]" \
--packages io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0 \
--jars https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/3.0.2/gcs-connector-3.0.2-shaded.jar \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
--conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
--conf spark.sql.catalog.unity_hive_catalog=io.unitycatalog.spark.UCSingleCatalog \
--conf spark.sql.catalog.unity_hive_catalog.uri=<mds-rest-base-uri> \
--conf spark.sql.catalog.unity_hive_catalog.token=<token> \
--conf spark.sql.defaultCatalog=unity_hive_catalog
Supported Spark versions
MDS supports the following Spark versions:
· Spark 3.5.1
· Spark 3.4.3
· Spark 3.3.4
Sample Spark commands
Following are sample Spark commands to work with Unity Catalog.
Schemas:
create schema <catalog_name>.<schema_name> with dbproperties (locationUri = 'abfss://<container_name>@<storage_name>.dfs.core.windows.net/<schema_name>');
create schema <catalog_name>.<schema_name> with dbproperties (locationUri = 'gs://<bucket_name>/<schema_name>');
show namespaces in <catalog_name>;
use <catalog_name>.<schema_name>;
drop schema <catalog_name>.<schema_name>;
Tables:
create table <catalog_name>.<schema_name>.<table_name> (id int, name string) using DELTA location 'abfss://<container_name>@<storage_name>.dfs.core.windows.net/<schema_name>/<table_name>';
create table <catalog_name>.<schema_name>.<table_name> (name string, age int) using ORC location 'gs://<bucket_name>/<schema_name>/<table_name>';
show tables;
describe <catalog_name>.<schema_name>.<table_name>;
insert into <catalog_name>.<schema_name>.<table_name> values (1, 'Alice');
select * from <catalog_name>.<schema_name>.<table_name>;
drop table <catalog_name>.<schema_name>.<table_name>;
NOTE: Currently, creating a table without specifying a location in the command is not supported. Ideally, the table should inherit the schema’s location, but as of publishing this blog, this functionality is not available in the MDS with Unity integration, as illustrated below.
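For example, with placeholder names, the first statement below is not yet supported, while the second works because it specifies a location explicitly:
-- Not supported yet: no location, so the table cannot inherit the schema's locationUri
create table <catalog_name>.<schema_name>.<table_name> (id int, name string) using DELTA;
-- Supported: explicit location
create table <catalog_name>.<schema_name>.<table_name> (id int, name string) using DELTA location 'gs://<bucket_name>/<schema_name>/<table_name>';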
All standard Spark SQL queries supported by the Thrift endpoint are also compatible with the REST endpoint. Additionally, Unity Spark–specific syntax, such as SHOW NAMESPACES IN mds_delta_catalog, is supported, extending usability beyond standard Apache Spark SQL.
Behavior and limitations
Following are some of the behaviors and limitations that must be considered when implementing an integration with Unity Catalog:
· Location URI: MDS requires a Location URI to create a schema. For example,
CREATE NAMESPACE unity_spark_db WITH DBPROPERTIES ( locationUri = 's3a://bucket_name/schema_name' );
· Catalog mapping: In MDS, spark_catalog is mapped to Hive as the default catalog, which may differ slightly from Unity's default behavior (see the example after this list).
· Supported storage: In the current implementation, Unity Spark integration works only with Azure and GCP cloud storage.
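As an illustration of the catalog mapping (a minimal sketch, assuming the configuration shown earlier and a placeholder schema name):
-- Resolves against the Hive metastore, per the MDS default mapping
show tables in spark_catalog.default;
-- Resolves through MDS via the Unity Catalog REST API
show tables in unity_delta_catalog.<schema_name>;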
Conclusion
In summary, the integration of watsonx.data MDS with Unity Spark marks a major advancement in building unified, efficient, and seamless data management systems. This integration empowers users to streamline workflows, harness advanced analytics, and boost overall productivity.
#watsonx.data