Connect to Snowflake Open Catalog from watsonx.data Spark

By Hemant Marve posted Tue May 27, 2025 02:42 AM

  

In this blog, we will see how to connect to Snowflake Open Catalog from watsonx.data using Spark. For detailed instructions on setting up Snowflake Open Catalog, see https://other-docs.snowflake.com/en/opencatalog/overview

Snowflake Open Catalog offers an implementation of the Iceberg REST Catalog, which is distinct from the traditional Snowflake Catalog. Unlike Snowflake’s native catalog, Snowflake Open Catalog is designed specifically for managing Iceberg tables. It supports two types of catalogs: external and internal.

  • External Catalog: This is a read-only catalog that is not managed by Snowflake Open Catalog itself. Instead, it syncs data from an external catalog, enabling Snowflake to access and query Iceberg tables stored outside of its environment.
  • Internal Catalog: This catalog is fully managed by Snowflake Open Catalog. It allows users to perform full CRUD (Create, Read, Update, Delete) operations on Iceberg tables directly within Snowflake.

With Snowflake Open Catalog, users can seamlessly integrate and manage Iceberg tables, giving them flexibility in how they work with both externally and internally managed catalogs. External engines connect through service connections created in Snowflake Open Catalog.
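As a quick illustration of what the distinction means in practice, the following sketch assumes a Spark session with two hypothetical catalogs already configured as Iceberg REST catalogs, here named internal_cat and external_cat (these names are placeholders, not from the product documentation):

from pyspark.sql import SparkSession

# Hypothetical catalog names for illustration; both are assumed to be
# registered as Iceberg REST catalogs when the application is submitted.
spark = SparkSession.builder.appName("catalog-types-demo").getOrCreate()

# Internal catalog: fully managed by Snowflake Open Catalog, so full CRUD works.
spark.sql("CREATE TABLE IF NOT EXISTS internal_cat.demo.events (id INT) USING ICEBERG")
spark.sql("INSERT INTO internal_cat.demo.events VALUES (1)")

# External catalog: synced from a catalog outside Snowflake Open Catalog,
# so it can be queried but not written to from here.
spark.sql("SELECT * FROM external_cat.synced_schema.events").show()
# An INSERT INTO external_cat... would be rejected, because external
# catalogs are read-only.

spark.stop()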

Submitting a Spark application from watsonx.data native Spark to Snowflake Open Catalog

We will use the following sample Python application (sample_open_catalog.py) to perform CRUD operations on Iceberg tables in Snowflake Open Catalog from watsonx.data Spark.

from pyspark.sql import SparkSession

def init_spark():
    # Catalog connection details are supplied at submit time through the
    # Spark configuration (see the request body below).
    spark = SparkSession.builder.appName("snowflake-open-catalog-test").getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def main():
    spark, sc = init_spark()
    spark.sql("USE <catalog_name>")
    spark.sql("SHOW NAMESPACES").show()
    spark.sql("CREATE NAMESPACE IF NOT EXISTS <namespace/schema>").show()
    spark.sql("CREATE TABLE IF NOT EXISTS <catalog_name>.<namespace/schema>.<table-name> (ID INTEGER) USING ICEBERG").show()
    spark.sql("SHOW TABLES FROM <catalog_name>.<namespace/schema>").show()
    spark.sql("INSERT INTO <catalog_name>.<namespace/schema>.<table-name> VALUES (1)").show()
    spark.sql("SELECT * FROM <catalog_name>.<namespace/schema>.<table-name>").show()
    spark.stop()

if __name__ == '__main__':
    main()

Complete the following steps to submit the application:

1. Upload the sample_open_catalog.py file to a bucket.
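For example, with an S3-compatible bucket the upload can be scripted with boto3 (a minimal sketch; the endpoint, credentials, and bucket name are placeholders matching the request body below):

import boto3

# Placeholder connection details; use the same bucket that the request
# body below configures via the spark.hadoop.fs.s3a.* properties.
s3 = boto3.client(
    "s3",
    endpoint_url="https://<bucket-endpoint>",
    aws_access_key_id="<access_key>",
    aws_secret_access_key="<secret_key>",
)

# After this upload, <bucket-file-path> is s3a://<bucket-name-1>/sample_open_catalog.py
s3.upload_file("sample_open_catalog.py", "<bucket-name-1>", "sample_open_catalog.py")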

2. Submit the PySpark application in watsonx.data. Use the following request body for Snowflake Open Catalog. For more information, see Submitting Spark application by using native Spark engine.

{
  "application_details": {
    "application": "<bucket-file-path>",
    "conf": {
      "spark.hadoop.fs.s3a.bucket.<bucket-name-1>.endpoint": "<bucket-endpoint>",
      "spark.hadoop.fs.s3a.bucket.<bucket-name-1>.access.key": "<access_key>",
      "spark.hadoop.fs.s3a.bucket.<bucket-name-1>.secret.key": "<secret_key>",
      "spark.sql.catalog.<catalog_name>.uri": "https://<open_catalog_account_identifier>.snowflakecomputing.com/polaris/api/catalog",
      "spark.sql.catalog.<catalog_name>.warehouse": "iceberg_open_spec",
      "spark.sql.catalog.<catalog_name>.scope": "PRINCIPAL_ROLE:<principal_role_name>",
      "spark.sql.catalog.<catalog_name>.type": "rest",
      "spark.sql.catalog.<catalog_name>": "org.apache.iceberg.spark.SparkCatalog",
      "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
      "spark.sql.catalog.<catalog_name>.credential": "<client_id>:<client_secret>",
      "spark.sql.catalog.<catalog_name>.s3.access-key-id": "<access_key>",
      "spark.sql.catalog.<catalog_name>.s3.secret-access-key": "<secret-key>",
      "spark.sql.catalog.<catalog_name>.client.region": "<target_s3_region>",
      "spark.sql.defaultCatalog": "<catalog_name>"
    }
  }
}

In the request:

·      bucket-name-1 is the bucket where the sample_open_catalog.py file is uploaded.

·      bucket-file-path is the path to the bucket where the file is uploaded. For example, s3a://bucket-name/sample_open_catalog.py.

·      catalog-name is the name of the catalog to connect to in Snowflake Open Catalog.

In the following image, catalog-name is iceberg_open_spec.

Image 1: Catalog in Snowflake

·      client_id specifies the client ID for the service principal to use.

·      client_secret specifies the client secret for the service principal to use.

The following image shows the service connection in Snowflake Open Catalog from which the client_id and client_secret values are obtained.


Image 2: Service connection in Snowflake

·      open_catalog_account_identifier specifies the account identifier for your Open Catalog account. Depending on the region and cloud platform for the account, this identifier might be the account locator by itself (for example, xy12345) or include additional segments. For more information, see Using an account locator as an identifier.

·      principal_role_name specifies the principal role that is granted to the service principal. In the following image, principal_role_name is writer.


Image 3: Service connection configuration in Snowflake

You can then assign the principal role to the catalog role.

Image 4: Writer principal role granted to catalog writer role

·      target_s3_region specifies the region code where the S3 bucket containing your Apache Iceberg tables is located. For the region codes, see AWS service endpoints and refer to the Region column in the table.
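To put the pieces together, here is one way to send the request with Python. Note that the endpoint path, header names, and token handling below are assumptions for illustration only; follow the Submitting Spark application by using native Spark engine documentation for the exact URL and authentication in your environment.

import requests

# Assumed endpoint shape for a watsonx.data native Spark engine; verify the
# exact URL, headers, and authentication against the watsonx.data docs.
url = (
    "https://<region>.lakehouse.cloud.ibm.com"
    "/lakehouse/api/v2/spark_engines/<spark_engine_id>/applications"
)

headers = {
    "Authorization": "Bearer <iam_bearer_token>",    # IBM Cloud IAM token (placeholder)
    "AuthInstanceId": "<watsonx_data_instance_id>",  # instance identifier (placeholder)
    "Content-Type": "application/json",
}

payload = {
    "application_details": {
        "application": "s3a://<bucket-name-1>/sample_open_catalog.py",
        "conf": {
            # Same spark.* properties as in the request body shown above.
            "spark.sql.catalog.<catalog_name>": "org.apache.iceberg.spark.SparkCatalog",
            # ...
        },
    }
}

resp = requests.post(url, headers=headers, json=payload)
resp.raise_for_status()
print(resp.json())  # expected to include the application ID and its state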


#watsonx.data