In this blog, we will see how Spark-based systems can integrate with the Iceberg REST catalog exposed by the Metadata Service (MDS) in watsonx.data, enabling you to combine data, metadata, and execution engines as needed.
Complete the following steps to connect Apache Spark with watsonx.data:
1. In the watsonx.data admin console, go to the Infrastructure manager and create an Iceberg-type catalog associated with any bucket of type S3a, GCS, or Azure.
2. Navigate to the Catalog details page and note the MDS REST endpoint. The following images show the Catalog details page in watsonx.data on IBM Cloud Pak for Data (CPD) and in watsonx.data SaaS.
NOTE: For CPD, use the Metastore External REST endpoint.
Image 1: Catalog details in watsonx.data on CPD
Image 2: Catalog details in watsonx.data SaaS
NOTE: If you are connecting to watsonx.data on CPD, complete steps 3-5. If you are connecting to watsonx.data SaaS, go to step 6.
3. Use the following command to fetch the MDS certificates.
echo QUIT | openssl s_client -showcerts -connect <mds-rest-endpoint-host>:443 | awk '/-----BEGIN CERTIFICATE-----/ {p=1}; p; /-----END CERTIFICATE-----/ {p=0}' > mds.cert
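Optionally, verify that the certificates from the TLS handshake were captured. A quick sanity check (assumes standard openssl and grep; the count depends on your endpoint's chain):

# Count how many certificates were extracted from the handshake
grep -c 'BEGIN CERTIFICATE' mds.cert

# Print the subject and issuer of each certificate in the bundle
openssl crl2pkcs7 -nocrl -certfile mds.cert | openssl pkcs7 -print_certs -noout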
4. Inspect the concatenated certificates in mds.cert by running: cat mds.cert
Image 3: Concatenated certificate
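Note that keytool imports only the first PEM certificate it finds in a file, so if mds.cert holds a chain, one approach is to split it into one file per certificate before step 5. A minimal sketch using GNU csplit (the mds-*.crt file names are illustrative):

# Split the bundle at each BEGIN CERTIFICATE marker into mds-00.crt, mds-01.crt, ...
csplit -z -f mds- -b '%02d.crt' mds.cert '/-----BEGIN CERTIFICATE-----/' '{*}'
ls mds-*.crt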
5. Import both certificates into the Java truststore that Spark will use at runtime.
keytool -import -alias <alias_name> -file <certificate_file> -keystore <truststore_file> -storepass <truststore_password>
Example:
sudo keytool -import -alias mycert -file mycert.crt -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
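You can confirm the import with keytool -list. If you imported into a custom truststore rather than the default cacerts, point the Spark driver and executor JVMs at it via the standard javax.net.ssl system properties (the path and password below are placeholders):

# Verify the certificate is now present in the truststore
keytool -list -alias mycert -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit

# Only needed for a custom truststore; append these to the spark-sql
# invocation in step 6 (replace the path and password with your own)
--conf spark.driver.extraJavaOptions="-Djavax.net.ssl.trustStore=/path/to/truststore.jks -Djavax.net.ssl.trustStorePassword=<truststore_password>" \
--conf spark.executor.extraJavaOptions="-Djavax.net.ssl.trustStore=/path/to/truststore.jks -Djavax.net.ssl.trustStorePassword=<truststore_password>"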
6. Configure Spark.
Example:
./bin/spark-sql -v \
--packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.2,org.apache.hadoop:hadoop-aws:3.3.6,software.amazon.awssdk:bundle:2.27.14 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.iceberg.vectorization.enabled=false \
--conf spark.sql.catalog.iceberg_data.header.Authorization='Bearer <token>' \
--conf spark.sql.catalog.iceberg_data=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.iceberg_data.type=rest \
--conf spark.sql.catalog.iceberg_data.warehouse=iceberg_data \
--conf spark.sql.catalog.iceberg_data.s3.path-style-access=true \
--conf spark.sql.catalog.iceberg_data.s3.access-key-id=demo_access_key \
--conf spark.sql.catalog.iceberg_data.s3.secret-access-key=demo_secret_key \
--conf spark.sql.catalog.iceberg_data.client.region=us-south \
--conf spark.sql.catalog.iceberg_data.s3.endpoint=https://s3.us-south.cloud-object-storage.appdomain.cloud/ \
--conf spark.sql.catalog.iceberg_data.uri=https://<mds-rest-host>/mds/iceberg
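The <token> value in the Authorization header configuration is a bearer token. For watsonx.data SaaS, one way to obtain it is to exchange an IBM Cloud API key for an IAM access token (a sketch; substitute your own API key, and for CPD use a token issued by your cluster's authentication endpoint instead):

# Request an IAM access token from IBM Cloud (SaaS only)
curl -s -X POST 'https://iam.cloud.ibm.com/identity/token' \
-H 'Content-Type: application/x-www-form-urlencoded' \
-d 'grant_type=urn:ibm:params:oauth:grant-type:apikey' \
-d 'apikey=<your_ibm_cloud_api_key>'

The access_token field of the JSON response is the value to place after Bearer in the configuration above.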
7. After starting the Spark session, you can try the following SQL commands (a non-interactive variant is sketched after the list):
a. USE <catalog-name>;
b. CREATE NAMESPACE order;
c. USE NAMESPACE order;
d. CREATE TABLE food (id integer, name string);
e. INSERT INTO food VALUES (1, 'pizza');
f. SELECT * FROM food;
g. ALTER TABLE food ADD COLUMNS (price double);
h. DROP TABLE food PURGE;
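To run the same checks non-interactively, spark-sql also accepts -e for an inline statement or -f for a script file. For example (reusing the full set of --packages and --conf options from step 6, abbreviated here as <options>):

# One-off statement against the configured catalog
./bin/spark-sql <options> -e 'SELECT * FROM <catalog-name>.order.food;'

# Or run a saved script (smoke_test.sql is an illustrative file containing the statements above)
./bin/spark-sql <options> -f smoke_test.sql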
NOTE: When USE <catalog-name> is run for the first time, Spark calls the REST catalog's configuration endpoint by default, which overrides the request prefix with the catalog name. That prefix is then applied to every subsequent catalog operation. For example, if you configured your catalog name as "warehouse", it is used as the default prefix for future operations.
Here, <catalog-name> is the watsonx.data catalog name; all queries are directed to that catalog.
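The configuration call described in the note is the standard Iceberg REST /v1/config endpoint, which you can also invoke directly to inspect the prefix the server returns (a sketch, reusing the bearer token and warehouse name from step 6):

# Ask the REST catalog for its configuration; per the Iceberg REST spec,
# the "overrides" object in the response should include the prefix
curl -s -H 'Authorization: Bearer <token>' \
'https://<mds-rest-host>/mds/iceberg/v1/config?warehouse=iceberg_data'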
#watsonx.data