In this blog, we will see how to connect to Snowflake Open Catalog from watsonx.data using Spark. For detailed instructions on setting up Snowflake Open Catalog, see https://other-docs.snowflake.com/en/opencatalog/overview
Snowflake Open Catalog is an implementation of the Apache Iceberg REST catalog specification and is distinct from Snowflake's native catalog: it is designed specifically for managing Iceberg tables. It supports two types of catalogs: external and internal.
- External Catalog: This is a read-only catalog that is not managed by Snowflake Open Catalog itself. Instead, it syncs data from an external catalog, enabling Snowflake to access and query Iceberg tables stored outside of its environment.
- Internal Catalog: This catalog is fully managed by Snowflake Open Catalog. It allows users to perform full CRUD (Create, Read, Update, Delete) operations on Iceberg tables directly within Snowflake.
With Snowflake Open Catalog, users can seamlessly integrate and manage Iceberg tables, giving them flexibility in how they work with both externally and internally managed catalogs. Service connections created in Snowflake Open Catalog allow external engines, such as Spark, to connect to it.
Submitting a Spark application from watsonx.data native Spark to Snowflake Open Catalog
We will use the following sample Python application (sample_open_catalog.py) to perform CRUD operations on Iceberg tables in Snowflake Open Catalog from watsonx.data Spark.
from pyspark.sql import SparkSession

def init_spark():
    spark = SparkSession.builder.appName("snowflake-open-catalog-test").getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def main():
    spark, sc = init_spark()
    spark.sql("USE <catalog_name>")
    spark.sql("SHOW NAMESPACES").show()
    spark.sql("CREATE NAMESPACE IF NOT EXISTS <namespace/schema>").show()
    spark.sql("CREATE TABLE IF NOT EXISTS <catalog_name>.<namespace/schema>.<table-name> (ID INTEGER) USING ICEBERG").show()
    spark.sql("SHOW TABLES FROM <catalog_name>.<namespace/schema>").show()
    spark.sql("INSERT INTO <catalog_name>.<namespace/schema>.<table-name> VALUES (1)").show()
    spark.sql("SELECT * FROM <catalog_name>.<namespace/schema>.<table-name>").show()
    spark.stop()

if __name__ == '__main__':
    main()
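For quick local testing outside watsonx.data, the same catalog settings can be passed as --conf flags to spark-submit. This is a sketch only: the Iceberg runtime package version must match your Spark and Scala build, and all angle-bracket values are illustrative placeholders, not verified defaults.

```shell
# Sketch: local spark-submit with the Iceberg REST catalog settings.
# Package version and all <placeholders> are assumptions -- adjust for your setup.
spark-submit \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.<catalog_name>=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.<catalog_name>.type=rest \
  --conf spark.sql.catalog.<catalog_name>.uri=https://<open_catalog_account_identifier>.snowflakecomputing.com/polaris/api/catalog \
  --conf spark.sql.catalog.<catalog_name>.credential=<client_id>:<client_secret> \
  --conf spark.sql.catalog.<catalog_name>.scope=PRINCIPAL_ROLE:<principal_role_name> \
  --conf spark.sql.catalog.<catalog_name>.warehouse=<warehouse_name> \
  --conf spark.sql.defaultCatalog=<catalog_name> \
  sample_open_catalog.py
```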
Complete the following steps to submit the application:
1. Upload the sample_open_catalog.py file to a bucket.
2. Submit the PySpark application in watsonx.data. Use the following request body for Snowflake Open Catalog. For more information, see Submitting Spark application by using native Spark engine.
{
    "application_details": {
        "application": "<bucket-file-path>",
        "conf": {
            "spark.hadoop.fs.s3a.bucket.<bucket-name-1>.endpoint": "<bucket-endpoint>",
            "spark.hadoop.fs.s3a.bucket.<bucket-name-1>.access.key": "<access_key>",
            "spark.hadoop.fs.s3a.bucket.<bucket-name-1>.secret.key": "<secret_key>",
            "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "spark.sql.catalog.<catalog_name>": "org.apache.iceberg.spark.SparkCatalog",
            "spark.sql.catalog.<catalog_name>.type": "rest",
            "spark.sql.catalog.<catalog_name>.uri": "https://<open_catalog_account_identifier>.snowflakecomputing.com/polaris/api/catalog",
            "spark.sql.catalog.<catalog_name>.warehouse": "iceberg_open_spec",
            "spark.sql.catalog.<catalog_name>.credential": "<client_id>:<client_secret>",
            "spark.sql.catalog.<catalog_name>.scope": "PRINCIPAL_ROLE:<principal_role_name>",
            "spark.sql.catalog.<catalog_name>.s3.access-key-id": "<access_key>",
            "spark.sql.catalog.<catalog_name>.s3.secret-access-key": "<secret-key>",
            "spark.sql.catalog.<catalog_name>.client.region": "<target_s3_region>",
            "spark.sql.defaultCatalog": "<catalog_name>"
        }
    }
}
In the request:
- bucket-name-1 is the bucket where the sample_open_catalog.py file is uploaded.
- bucket-file-path is the path to the file in the bucket. For example, s3a://bucket-name/sample_open_catalog.py.
- catalog_name is the name of the catalog to connect to in Snowflake Open Catalog.
In the following image, the catalog name is iceberg_open_spec.
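The submission in step 2 can also be scripted. The sketch below builds the request body in Python; the commented-out POST call is a hypothetical illustration (the endpoint path, host, and bearer-token header are assumptions), so consult the watsonx.data API reference for the exact URL and authentication for your deployment. All placeholder values are illustrative.

```python
import json

# Build the request body from step 2. All values here are illustrative placeholders.
catalog = "my_open_catalog"
payload = {
    "application_details": {
        "application": "s3a://my-bucket/sample_open_catalog.py",
        "conf": {
            f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
            f"spark.sql.catalog.{catalog}.type": "rest",
            f"spark.sql.catalog.{catalog}.uri": "https://<account>.snowflakecomputing.com/polaris/api/catalog",
            f"spark.sql.catalog.{catalog}.credential": "<client_id>:<client_secret>",
            f"spark.sql.catalog.{catalog}.scope": "PRINCIPAL_ROLE:<principal_role_name>",
            "spark.sql.defaultCatalog": catalog,
        },
    }
}
body = json.dumps(payload, indent=2)
print(body)

# Hypothetical submission call -- URL shape and auth header are assumptions:
# import requests
# requests.post(
#     "https://<watsonx-data-host>/spark_engines/<engine_id>/applications",
#     headers={"Authorization": "Bearer <token>", "Content-Type": "application/json"},
#     data=body,
# )
```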
You can then assign the principal role to the catalog role.
#watsonx.data