In this blog, we will see how Spark-based systems can integrate with the Iceberg REST catalog exposed by the Metadata Service (MDS) in watsonx.data, enabling you to combine data, metadata, and execution engines as needed.
Complete the following steps to connect Apache Spark with watsonx.data:
1. In the watsonx.data admin console, go to the Infrastructure manager and create an Iceberg-type catalog associated with any bucket of type S3a, GCS, or Azure.
2. Navigate to the Catalog details page and note the MDS REST endpoint. The following images show the Catalog details page in watsonx.data on IBM Cloud Pak for Data (CPD) and in watsonx.data SaaS.
NOTE: For CPD, use the Metastore External REST endpoint.
Image 1: Catalog details in watsonx.data on CPD
Image 2: Catalog details in watsonx.data SaaS
NOTE: If you are connecting to watsonx.data on CPD, complete steps 3-5. If you are connecting to watsonx.data SaaS, go to step 6.
3. Use the following command to fetch the MDS certificates.
echo QUIT | openssl s_client -showcerts -connect <mds-rest-endpoint-host>:443 | awk '/-----BEGIN CERTIFICATE-----/ {p=1}; p; /-----END CERTIFICATE-----/ {p=0}' > mds.cert
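Optionally, verify that the certificates from the TLS handshake were captured. A quick sanity check (assumes standard openssl and grep; the count depends on your endpoint's chain):

# Count how many certificates were extracted from the handshake
grep -c 'BEGIN CERTIFICATE' mds.cert

# Print the subject and issuer of each certificate in the bundle
openssl crl2pkcs7 -nocrl -certfile mds.cert | openssl pkcs7 -print_certs -noout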
4. Inspect the concatenated certificates in mds.cert by running: cat mds.cert
Image 3: Concatenated certificate
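Note that keytool imports only the first PEM certificate it finds in a file, so if mds.cert holds a chain, one approach is to split it into one file per certificate before step 5. A minimal sketch using GNU csplit (the mds-*.crt file names are illustrative):

# Split the bundle at each BEGIN CERTIFICATE marker into mds-00.crt, mds-01.crt, ...
csplit -z -f mds- -b '%02d.crt' mds.cert '/-----BEGIN CERTIFICATE-----/' '{*}'
ls mds-*.crt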
5. Import both certificates into the Java truststore that Spark will use at runtime.
keytool -import -alias <alias_name> -file <certificate_file> -keystore <truststore_file> -storepass <truststore_password>
Example:
sudo keytool -import -alias mycert -file mycert.crt -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
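You can confirm the import with keytool -list. If you imported into a custom truststore rather than the default cacerts, point the Spark driver and executor JVMs at it via the standard javax.net.ssl system properties (the path and password below are placeholders):

# Verify the certificate is now present in the truststore
keytool -list -alias mycert -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit

# Only needed for a custom truststore; append these to the spark-sql
# invocation in step 6 (replace the path and password with your own)
--conf spark.driver.extraJavaOptions="-Djavax.net.ssl.trustStore=/path/to/truststore.jks -Djavax.net.ssl.trustStorePassword=<truststore_password>" \
--conf spark.executor.extraJavaOptions="-Djavax.net.ssl.trustStore=/path/to/truststore.jks -Djavax.net.ssl.trustStorePassword=<truststore_password>"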
6. Configure Spark.
Example:
./bin/spark-sql -v \
--packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.2,org.apache.hadoop:hadoop-aws:3.3.6,software.amazon.awssdk:bundle:2.27.14 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.iceberg.vectorization.enabled=false \
--conf spark.sql.catalog.iceberg_data.header.Authorization='Bearer <token>' \
--conf spark.sql.catalog.iceberg_data=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.iceberg_data.type=rest \
--conf spark.sql.catalog.iceberg_data.warehouse=iceberg_data \
--conf spark.sql.catalog.iceberg_data.s3.path-style-access=true \
--conf spark.sql.catalog.iceberg_data.s3.access-key-id=demo_access_key \
--conf spark.sql.catalog.iceberg_data.s3.secret-access-key=demo_secret_key \
--conf spark.sql.catalog.iceberg_data.client.region=us-south \
--conf spark.sql.catalog.iceberg_data.s3.endpoint=https://s3.us-south.cloud-object-storage.appdomain.cloud/ \
--conf spark.sql.catalog.iceberg_data.uri=https://<mds-rest-host>/mds/iceberg
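The <token> value in the Authorization header configuration is a bearer token. For watsonx.data SaaS, one way to obtain it is to exchange an IBM Cloud API key for an IAM access token (a sketch; substitute your own API key, and for CPD use a token issued by your cluster's authentication endpoint instead):

# Request an IAM access token from IBM Cloud (SaaS only)
curl -s -X POST 'https://iam.cloud.ibm.com/identity/token' \
-H 'Content-Type: application/x-www-form-urlencoded' \
-d 'grant_type=urn:ibm:params:oauth:grant-type:apikey' \
-d 'apikey=<your_ibm_cloud_api_key>'

The access_token field of the JSON response is the value to place after Bearer in the configuration above.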
7. After starting the Spark session, you can try the following SQL commands (a non-interactive variant is sketched after the list):
a. USE <catalog-name>;
b. CREATE NAMESPACE order;
c. USE NAMESPACE order;
d. CREATE TABLE food (id integer, name string);
e. INSERT INTO food VALUES (1, 'pizza');
f. SELECT * FROM food;
g. ALTER TABLE food ADD COLUMNS (price double);
h. DROP TABLE food PURGE;
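To run the same checks non-interactively, spark-sql also accepts -e for an inline statement or -f for a script file. For example (reusing the full set of --packages and --conf options from step 6, abbreviated here as <options>):

# One-off statement against the configured catalog
./bin/spark-sql <options> -e 'SELECT * FROM <catalog-name>.order.food;'

# Or run a saved script (smoke_test.sql is an illustrative file containing the statements above)
./bin/spark-sql <options> -f smoke_test.sql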
NOTE: When USE <catalog-name> is run for the first time, Spark calls the REST catalog's configuration endpoint by default, which overrides the request prefix with the catalog name. That prefix is then applied to every subsequent catalog operation. For example, if you configured your catalog name as "warehouse", it is used as the default prefix for future operations.
Here, <catalog-name> is the watsonx.data catalog name; all queries are directed to that catalog.
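The configuration call described in the note is the standard Iceberg REST /v1/config endpoint, which you can also invoke directly to inspect the prefix the server returns (a sketch, reusing the bearer token and warehouse name from step 6):

# Ask the REST catalog for its configuration; per the Iceberg REST spec,
# the "overrides" object in the response should include the prefix
curl -s -H 'Authorization: Bearer <token>' \
'https://<mds-rest-host>/mds/iceberg/v1/config?warehouse=iceberg_data'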
#watsonx.data