The interoperability between Databricks and watsonx.data, powered by the Spark engine, enables seamless Spark-based data access, metadata synchronization, and the application of Databricks governance policies. With this integration, organizations using Databricks can extend their governance framework to data stored in watsonx.data, ensuring consistent policy enforcement across platforms.
For external data access, Unity Catalog must be enabled. In addition, to allow external engines to access data in a metastore, a metastore admin must enable external data access for that metastore. This option is disabled by default to prevent unauthorized external access.
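As a rough illustration, the metastore-level switch can also be flipped through the Unity Catalog REST API instead of the Catalog Explorer UI. The sketch below is a minimal example, assuming the external_access_enabled field of the metastore update endpoint and placeholder values for the workspace URL, PAT, and metastore ID; confirm the field name against the current Databricks API documentation.

import requests

# Minimal sketch (assumption): enable external data access on a metastore via the
# Unity Catalog metastores endpoint. All <...> values are placeholders.
resp = requests.patch(
    "https://<Databricks Workspace URL>/api/2.1/unity-catalog/metastores/<metastore-id>",
    headers={"Authorization": "Bearer <Databricks Workspace PAT>"},
    json={"external_access_enabled": True},
)
resp.raise_for_status()
print(resp.json())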
Access from watsonx.data
The watsonx.data Spark engine retrieves data from a Databricks Unity Catalog (UC) metastore using a Databricks personal access token. UC provides temporary credentials and URLs that enable data retrieval and query execution.

Image 1: Access between watsonx.data and Unity Catalog
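As an illustration of the temporary-credential flow described above, the sketch below requests short-lived credentials for a single table directly from the UC REST API. It is a minimal, hedged example: the temporary-table-credentials endpoint, the READ operation, and the <table-uuid> placeholder are assumptions to be verified against the Databricks documentation; the watsonx.data Spark engine performs an equivalent exchange internally.

import requests

# Illustrative sketch: ask Unity Catalog for temporary credentials for one table.
# <Databricks Workspace URL>, <Databricks Workspace PAT>, and <table-uuid> are placeholders.
response = requests.post(
    "https://<Databricks Workspace URL>/api/2.1/unity-catalog/temporary-table-credentials",
    headers={"Authorization": "Bearer <Databricks Workspace PAT>"},
    json={"table_id": "<table-uuid>", "operation": "READ"},
)
response.raise_for_status()
# The response carries short-lived storage credentials and URLs that an external
# engine can use to read the table's files directly.
print(response.json())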
Provisioning watsonx.data native Spark engine
- Log in to watsonx.data console.
- From the navigation menu, select Infrastructure manager.
- To provision an engine, click Add component and select Add engine.
- Specify the storage volume to be used as Engine home, which stores the Spark events and logs generated while running Spark applications.
Setting up watsonx.data Spark lab
- Install a desktop version of Visual Studio Code.
- Install watsonx.data extension from VS Code Marketplace.
- Ensure that you have a public-private SSH key pair to establish an SSH connection with the Spark lab.
- Install the extension Remote - SSH from Visual Studio Code marketplace.
- Create a Spark lab.
- To create a new Spark lab, click the + icon. The Create Spark Lab window opens. Specify your public SSH key and the public SSH keys of the users to whom you want to grant access to the Spark lab. Specify each public SSH key on a new line.
- Click Create. Click Refresh to see the Spark lab in the left window. This is the dedicated Spark cluster for application development.
- Open the Spark lab to access its file system and terminal and start working with it.
- In the Explorer window, you can view the file system, upload files, and view logs.
Accessing Databricks UC from the watsonx.data Spark lab
The following code accesses UC using PySpark. Create a Python file in any of the listed folders and add this code.
Prerequisites to access the catalog from an external engine
- The catalog must be created after the Unity Catalog metastore is enabled.
- Enable External data access for the metastore.
- Enable EXTERNAL USE SCHEMA for the catalog or schema (see the sketch after this list).
- A personal access token (PAT) for the Databricks workspace.
- Access keys for the storage account where the UC is configured.
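The EXTERNAL USE SCHEMA privilege from the list above is granted on the Databricks side, not from watsonx.data, and is separate from the PySpark application that follows. The sketch below shows one hedged way to issue the grant programmatically through the SQL Statement Execution API; the <warehouse-id> and <principal> placeholders are assumptions, and the same GRANT statement can simply be run from a Databricks notebook or the SQL editor instead.

import requests

# Minimal sketch: grant EXTERNAL USE SCHEMA on a schema so an external engine can access it.
# <Databricks Workspace URL>, <Databricks Workspace PAT>, <warehouse-id>, <UC Catalog>,
# <schema>, and <principal> are placeholders.
grant_sql = "GRANT EXTERNAL USE SCHEMA ON SCHEMA <UC Catalog>.<schema> TO `<principal>`"

resp = requests.post(
    "https://<Databricks Workspace URL>/api/2.0/sql/statements",
    headers={"Authorization": "Bearer <Databricks Workspace PAT>"},
    json={"warehouse_id": "<warehouse-id>", "statement": grant_sql, "wait_timeout": "30s"},
)
resp.raise_for_status()
print(resp.json().get("status"))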
from pyspark.sql import SparkSession
import os

def init_spark():
    spark = SparkSession.builder.appName("data-test") \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-common:3.3.4,io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0") \
        .config("spark.sql.catalog.spark_catalog", "io.unitycatalog.spark.UCSingleCatalog") \
        .config("spark.sql.catalog.<UC Catalog>", "io.unitycatalog.spark.UCSingleCatalog") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.sql.catalog.<UC Catalog>.uri", "https://<Databricks Workspace URL>/api/2.1/unity-catalog") \
        .config("spark.sql.catalog.<UC Catalog>.token", "<Databricks Workspace PAT>") \
        .config("spark.sql.defaultCatalog", "<UC Catalog>") \
        .config("fs.azure.account.key.<storage-account>.dfs.core.windows.net", "<UC storage account access key>") \
        .getOrCreate()
    return spark

def create_database(spark, catalog, databasename):
    spark.sql(f"create database if not exists {catalog}.{databasename}")

def list_databases(spark, catalog):
    spark.sql("SHOW SCHEMAS").show()

def basic_iceberg_table_operations(spark, catalog, databasename):
    # Demonstration: create a basic table, insert some data, and then query the table.
    # CREATE TABLE is currently restricted and needs Databricks support to enable it.
    # print("creating table")
    # spark.sql(f"create table if not exists {catalog}.{databasename}.testTable(id INTEGER, name VARCHAR(10), age INTEGER)").show()
    print("table created")
    # ALTER TABLE is currently not supported in UC.
    # spark.sql(f"ALTER TABLE {catalog}.{databasename}.testTable ADD COLUMNS (salary DECIMAL(10, 2))").show()
    print("table altered")
    # The insert assumes testTable already exists with the columns id, name, age, and salary.
    spark.sql(f"insert into {catalog}.{databasename}.testTable values(1,'Alan',23,3400.00),(2,'Ben',30,5500.00),(3,'Chen',35,6500.00)")
    print("data inserted")
    spark.sql(f"select * from {catalog}.{databasename}.testTable").show()

def clean_database(spark, catalog, databasename):
    # Clean up the demo database.
    spark.sql(f'drop table if exists {catalog}.{databasename}.testTable purge')
    spark.sql(f'drop database if exists {catalog}.{databasename} cascade')

def view_data(spark):
    # Read an existing table in the UC catalog (replace with a table available in your environment).
    spark.sql('select * from ams_test.test.employees').show()

def main():
    try:
        spark = init_spark()
        list_databases(spark, "data_poc")
        view_data(spark)
        create_database(spark, "data_poc", "dischema")
        list_databases(spark, "data_poc")
        basic_iceberg_table_operations(spark, "data_poc", "test")
    finally:
        # Clean up the demo database.
        clean_database(spark, "data_poc", "dischema")
        spark.stop()

if __name__ == '__main__':
    main()
- <UC Catalog> — The Databricks catalog to which you have access to run DDL/DML statements.
- <Databricks Workspace URL> — The URL used to access the Databricks UC workspace.
- <Databricks Workspace PAT> — The Databricks workspace personal access token (PAT), which is used to authenticate the user to the Databricks platform.
- <storage-account> — The cloud storage account name.
- <UC storage account access key> — The cloud storage account access key.
The following additional parameters must be added to the configuration to connect to AWS S3 buckets.
.config("spark.hadoop.fs.s3a.bucket.<Databricks Bucket>.endpoint", <AWS Object Store URL>) \
.config("spark.hadoop.fs.s3a.bucket.<Databricks Bucket>.access.key", "<AWS Access Key>") \
.config("spark.hadoop.fs.s3a.bucket.<Databricks Bucket>.secret.key", "<AWS Secret Key>") \
NOTE: Certain SQL statements, such as CREATE TABLE, may not function due to the limitations and restrictions of the Unity Catalog Spark JAR. Additionally, when the EXTERNAL_USE_SCHEMA permission is granted for access from an external engine, other permissions, such as SELECT, are not validated.
Reference
Unity Catalog API & Iceberg REST Catalog API in watsonx.data