How to recover your watsonx.data Iceberg table if you accidentally drop it?

By YUXIANG CAO posted Fri November 03, 2023 02:52 AM

  

In watsonx.data, Iceberg tables play an important role. In this blog, I present a way to recover a watsonx.data Iceberg table if you accidentally drop it. The same method can also be used to migrate an existing Iceberg table into watsonx.data.

1 in my lakehouse_data catalog, which is an Iceberg catalog, I have a few tables from TPC-H. For this demo, I will deliberately drop the table nation.

I accidentally dropped my Iceberg table
2 by design, an Iceberg table is treated as an external table. The biggest difference is that when you drop the table, only the catalog is updated; the data itself is not actually pruned. We can verify this from the Infrastructure manager by browsing the corresponding bucket.
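You can also confirm outside the UI that the files survived the drop by listing the table's prefix in the bucket with an S3 client. This is a minimal sketch; the helper and the sample keys are hypothetical, and a real check would fetch the keys with a client such as boto3:

```python
# Hypothetical helper: given object keys listed under a table's prefix,
# split them into data files and metadata files. After DROP TABLE on an
# Iceberg (external) table, both groups should still be present.
def table_files(keys, table_prefix):
    under = [k for k in keys if k.startswith(table_prefix)]
    data = [k for k in under if "/data/" in k]
    metadata = [k for k in under if "/metadata/" in k]
    return data, metadata

# Sample listing (hypothetical keys) of the lakehouse bucket after the drop.
keys = [
    "iceberg_tpch_sf100/nation/data/part-00000.parquet",
    "iceberg_tpch_sf100/nation/metadata/00000-de1201db.metadata.json",
    "iceberg_tpch_sf100/region/data/part-00000.parquet",
]
data, metadata = table_files(keys, "iceberg_tpch_sf100/nation/")
```

If both lists are non-empty after the drop, the table is recoverable.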
3 to recover nation, we need Spark to run an Iceberg table procedure:
https://iceberg.apache.org/docs/latest/spark-procedures/#register_table
The idea is that since the data and metadata are still on the storage side, we should be able to use this procedure to register the table again. The key input to register_table is the metadata JSON file, which we can find in the table's metadata folder.
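With the Hive catalog, metadata files are typically named `<version>-<uuid>.metadata.json`, where the numeric version prefix increases with every commit, so registering the highest-numbered file restores the table to its last committed state. A minimal sketch for picking it out of a folder listing; the helper and the sample keys are hypothetical:

```python
# Hypothetical helper: pick the most recent Iceberg metadata file from a
# listing of the table's metadata/ folder, by numeric version prefix.
def latest_metadata_file(keys):
    candidates = [k for k in keys if k.endswith(".metadata.json")]
    return max(candidates,
               key=lambda k: int(k.rsplit("/", 1)[-1].split("-", 1)[0]))

keys = [
    "iceberg_tpch_sf100/nation/metadata/00000-de1201db.metadata.json",
    "iceberg_tpch_sf100/nation/metadata/00001-a1b2c3d4.metadata.json",
    "iceberg_tpch_sf100/nation/metadata/snap-123.avro",
]
latest = latest_metadata_file(keys)
```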
4 let's build our Spark code to recover this table. In this case, I run a Jupyter notebook to invoke my Spark job, which makes the demo easy.
5 let's explain the code in a bit more detail
from pyspark.sql import SparkSession

# sc_master and sc_app (the Spark master URL and application name) are
# assumed to be defined earlier in the notebook.
spark = SparkSession.builder.master(sc_master) \
                    .appName(sc_app) \
                    .config("spark.sql.catalogImplementation", "hive") \
                    .config("spark.driver.extraClassPath", "/opt/ibm/connectors/iceberg-lakehouse/iceberg-3.3.2-1.2.1-hms-4.0.0-shaded.jar") \
                    .config("spark.hive.metastore.uris", "thrift://ibm-lh-lakehouse-hive-metastore-svc.cpd-instance.svc.cluster.local:9083") \
                    .config("spark.hive.metastore.use.SSL", "true") \
                    .config("spark.hive.metastore.truststore.path", "file:///project_data/data_asset/truststore.jks") \
                    .config("spark.hive.metastore.truststore.password", "fn9BAntzCffd3zwozmFaztcp") \
                    .config("spark.hive.metastore.truststore.type", "JKS") \
                    .config("spark.hive.metastore.client.auth.mode", "PLAIN") \
                    .config("spark.hive.metastore.client.plain.username", "frank") \
                    .config("spark.hive.metastore.client.plain.password", "frank") \
                    .config("spark.driver.extraJavaOptions", "-Dcom.sun.jndi.ldap.object.disableEndpointIdentification=true -Djdk.tls.trustNameService=true") \
                    .config("spark.executor.extraJavaOptions", "-Dcom.sun.jndi.ldap.object.disableEndpointIdentification=true -Djdk.tls.trustNameService=true") \
                    .config("spark.hadoop.hive.metastore.schema.verification", "false") \
                    .config("spark.hadoop.hive.metastore.schema.verification.record.version", "false") \
                    .config("spark.hadoop.datanucleus.schema.autoCreateTables", "false") \
                    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
                    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog") \
                    .config("spark.sql.catalog.lakehouse.type" ,"hive") \
                    .config("spark.sql.iceberg.vectorization.enabled" ,"false") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.endpoint" ,"https://minio-minio.apps.653dbde1abddc300171b754e.cloud.techzone.ibm.com") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.access.key" ,"DEVELOPER-ACCESS-KEY") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.secret.key" ,"DEVELOPER-SECRET-KEY") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.path.style.access" ,"true") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.impl" ,"org.apache.hadoop.fs.s3a.S3AFileSystem") \
                    .enableHiveSupport() \
                    .getOrCreate()
sc=spark.sparkContext
since I am using IBM Watson Studio, the necessary JAR file to execute Iceberg code is already provided when I start my Jupyter notebook. We just need to add it to the driver classpath:
                    .config("spark.driver.extraClassPath", "/opt/ibm/connectors/iceberg-lakehouse/iceberg-3.3.2-1.2.1-hms-4.0.0-shaded.jar") \
I need to access the watsonx.data Hive metastore. Since my Watson Studio is in the same project as my watsonx.data, I can simply use the Hive metastore service name to reach it.
For external access, and to obtain the truststore and truststore password, we need to follow the official documentation:
https://www.ibm.com/docs/en/watsonxdata/1.0.x?topic=administering-accessing-hive-metastore-hms-using-nodeport
https://www.ibm.com/docs/en/watsonxdata/1.0.x?topic=administering-importing-hms-self-signed-certificates-java-truststore
                    .config("spark.hive.metastore.uris", "thrift://ibm-lh-lakehouse-hive-metastore-svc.cpd-instance.svc.cluster.local:9083") \
                    .config("spark.hive.metastore.use.SSL", "true") \
                    .config("spark.hive.metastore.truststore.path", "file:///project_data/data_asset/truststore.jks") \
                    .config("spark.hive.metastore.truststore.password", "fn9BAntzCffd3zwozmFaztcp") \
                    .config("spark.hive.metastore.truststore.type", "JKS") \
                    .config("spark.hive.metastore.client.auth.mode", "PLAIN") \
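The metastore settings above can be bundled into a small helper so they are not scattered across the builder chain. This is a minimal sketch; the property names are taken from the snippet above, while the function itself and the placeholder password are hypothetical:

```python
# Hypothetical helper that collects the Hive-metastore settings used above
# into a dict that can be applied to a SparkSession builder in a loop.
def hms_configs(uri, truststore_path, truststore_password, username, password):
    return {
        "spark.hive.metastore.uris": uri,
        "spark.hive.metastore.use.SSL": "true",
        "spark.hive.metastore.truststore.path": truststore_path,
        "spark.hive.metastore.truststore.password": truststore_password,
        "spark.hive.metastore.truststore.type": "JKS",
        "spark.hive.metastore.client.auth.mode": "PLAIN",
        "spark.hive.metastore.client.plain.username": username,
        "spark.hive.metastore.client.plain.password": password,
    }

configs = hms_configs(
    "thrift://ibm-lh-lakehouse-hive-metastore-svc.cpd-instance.svc.cluster.local:9083",
    "file:///project_data/data_asset/truststore.jks",
    "***", "frank", "frank")
# builder = SparkSession.builder
# for key, value in configs.items():
#     builder = builder.config(key, value)
```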
in our Spark code, we also need to configure access to the S3 bucket, which in our case is lakehouse-bucket:
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.endpoint" ,"https://minio-minio.apps.653dbde1abddc300171b754e.cloud.techzone.ibm.com") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.access.key" ,"DEVELOPER-ACCESS-KEY") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.secret.key" ,"DEVELOPER-SECRET-KEY") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.path.style.access" ,"true") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.impl" ,"org.apache.hadoop.fs.s3a.S3AFileSystem") \
note there is a trick here: when you define your bucket endpoint, access key, and secret key, the property must follow the per-bucket pattern
spark.hadoop.fs.s3a.bucket.<bucket name>.<property name, such as endpoint, access.key, or secret.key>
At this point, we are ready to connect to the watsonx.data Hive metastore and the same S3 bucket.
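The per-bucket pattern is easy to get wrong by hand, so it can help to generate the property names. A minimal sketch; the helper and the endpoint value are hypothetical, while the property names follow the snippet above:

```python
# Hypothetical helper that builds the per-bucket S3A properties following
# the pattern spark.hadoop.fs.s3a.bucket.<bucket name>.<property name>.
def s3a_bucket_configs(bucket, endpoint, access_key, secret_key):
    prefix = f"spark.hadoop.fs.s3a.bucket.{bucket}."
    return {
        prefix + "endpoint": endpoint,
        prefix + "access.key": access_key,
        prefix + "secret.key": secret_key,
        prefix + "path.style.access": "true",
        prefix + "impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }

configs = s3a_bucket_configs("lakehouse-bucket", "https://minio.example.com",
                             "DEVELOPER-ACCESS-KEY", "DEVELOPER-SECRET-KEY")
```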
remember, in the previous step we already identified the metadata JSON file location:
spark.sql("call lakehouse.system.register_table(table =>'iceberg_tpch_sf100.nation',metadata_file => 's3a://lakehouse-bucket/iceberg_tpch_sf100/nation/metadata/00000-de1201db-b078-4a3f-a7d8-94ef8668a8e2.metadata.json')").show()
the output would look like below
6 finally, we can see that nation is available again in watsonx.data
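To avoid hard-coding the call, the register_table statement can be assembled from its parts. This is a minimal sketch; the helper function is hypothetical, while the catalog, table, and metadata file are the ones used above:

```python
# Hypothetical helper that builds the Iceberg register_table call so the
# catalog, table name, and metadata file are not hard-coded in the SQL.
def register_table_sql(catalog, table, metadata_file):
    return (f"CALL {catalog}.system.register_table("
            f"table => '{table}', "
            f"metadata_file => '{metadata_file}')")

sql = register_table_sql(
    "lakehouse", "iceberg_tpch_sf100.nation",
    "s3a://lakehouse-bucket/iceberg_tpch_sf100/nation/metadata/"
    "00000-de1201db-b078-4a3f-a7d8-94ef8668a8e2.metadata.json")
# spark.sql(sql).show()
# A quick sanity check after registration (requires the live session):
# spark.sql("SELECT count(*) FROM lakehouse.iceberg_tpch_sf100.nation").show()
```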

#watsonx.data
#Catalog
#Bucket
#HiveMetastore