How to recover your watsonx.data Iceberg table if you accidentally drop it?

By YUXIANG CAO posted Fri November 03, 2023 02:52 AM

  

In watsonx.data, Iceberg tables play an important role. In this blog, I present a way to recover a watsonx.data Iceberg table if you accidentally drop it. The same method can also be used to migrate an existing Iceberg table into watsonx.data.

1 in my lakehouse_data catalog, which is an Iceberg catalog, I have a few tables from TPC-H. For this demo, I will deliberately drop the table nation.

I accidentally dropped my Iceberg table
2 by design, an Iceberg table is treated as an external table. The biggest difference is that when you drop the table, only the catalog is updated; the data itself is not actually pruned. We can verify this from the Infrastructure manager by browsing the corresponding bucket.
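You can also confirm outside the UI that the files survived the drop by listing the table's prefix in the bucket with an S3 client. This is a minimal sketch; the helper and the sample keys are hypothetical, and a real check would fetch the keys with a client such as boto3:

```python
# Hypothetical helper: given object keys listed under a table's prefix,
# split them into data files and metadata files. After DROP TABLE on an
# Iceberg (external) table, both groups should still be present.
def table_files(keys, table_prefix):
    under = [k for k in keys if k.startswith(table_prefix)]
    data = [k for k in under if "/data/" in k]
    metadata = [k for k in under if "/metadata/" in k]
    return data, metadata

# Sample listing (hypothetical keys) of the lakehouse bucket after the drop.
keys = [
    "iceberg_tpch_sf100/nation/data/part-00000.parquet",
    "iceberg_tpch_sf100/nation/metadata/00000-de1201db.metadata.json",
    "iceberg_tpch_sf100/region/data/part-00000.parquet",
]
data, metadata = table_files(keys, "iceberg_tpch_sf100/nation/")
```

If both lists are non-empty after the drop, the table is recoverable.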
3 to recover nation, we need Spark to run an Iceberg table procedure:
https://iceberg.apache.org/docs/latest/spark-procedures/#register_table
The idea is that since the data and metadata are still on the storage side, we should be able to use this procedure to register the table again. The key input to register_table is the metadata JSON file, which we can find in the table's metadata folder.
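With the Hive catalog, metadata files are typically named `<version>-<uuid>.metadata.json`, where the numeric version prefix increases with every commit, so registering the highest-numbered file restores the table to its last committed state. A minimal sketch for picking it out of a folder listing; the helper and the sample keys are hypothetical:

```python
# Hypothetical helper: pick the most recent Iceberg metadata file from a
# listing of the table's metadata/ folder, by numeric version prefix.
def latest_metadata_file(keys):
    candidates = [k for k in keys if k.endswith(".metadata.json")]
    return max(candidates,
               key=lambda k: int(k.rsplit("/", 1)[-1].split("-", 1)[0]))

keys = [
    "iceberg_tpch_sf100/nation/metadata/00000-de1201db.metadata.json",
    "iceberg_tpch_sf100/nation/metadata/00001-a1b2c3d4.metadata.json",
    "iceberg_tpch_sf100/nation/metadata/snap-123.avro",
]
latest = latest_metadata_file(keys)
```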
4 let's build our Spark code to recover this table. In this case, I run a Jupyter notebook to invoke my Spark job, which makes the demo easy.
5 let's explain the code in a bit more detail
from pyspark.sql import SparkSession

# sc_master and sc_app (the Spark master URL and application name) are
# assumed to be defined earlier in the notebook.
spark = SparkSession.builder.master(sc_master) \
                    .appName(sc_app) \
                    .config("spark.sql.catalogImplementation", "hive") \
                    .config("spark.driver.extraClassPath", "/opt/ibm/connectors/iceberg-lakehouse/iceberg-3.3.2-1.2.1-hms-4.0.0-shaded.jar") \
                    .config("spark.hive.metastore.uris", "thrift://ibm-lh-lakehouse-hive-metastore-svc.cpd-instance.svc.cluster.local:9083") \
                    .config("spark.hive.metastore.use.SSL", "true") \
                    .config("spark.hive.metastore.truststore.path", "file:///project_data/data_asset/truststore.jks") \
                    .config("spark.hive.metastore.truststore.password", "fn9BAntzCffd3zwozmFaztcp") \
                    .config("spark.hive.metastore.truststore.type", "JKS") \
                    .config("spark.hive.metastore.client.auth.mode", "PLAIN") \
                    .config("spark.hive.metastore.client.plain.username", "frank") \
                    .config("spark.hive.metastore.client.plain.password", "frank") \
                    .config("spark.driver.extraJavaOptions", "-Dcom.sun.jndi.ldap.object.disableEndpointIdentification=true -Djdk.tls.trustNameService=true") \
                    .config("spark.executor.extraJavaOptions", "-Dcom.sun.jndi.ldap.object.disableEndpointIdentification=true -Djdk.tls.trustNameService=true") \
                    .config("spark.hadoop.hive.metastore.schema.verification", "false") \
                    .config("spark.hadoop.hive.metastore.schema.verification.record.version", "false") \
                    .config("spark.hadoop.datanucleus.schema.autoCreateTables", "false") \
                    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
                    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog") \
                    .config("spark.sql.catalog.lakehouse.type" ,"hive") \
                    .config("spark.sql.iceberg.vectorization.enabled" ,"false") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.endpoint" ,"https://minio-minio.apps.653dbde1abddc300171b754e.cloud.techzone.ibm.com") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.access.key" ,"DEVELOPER-ACCESS-KEY") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.secret.key" ,"DEVELOPER-SECRET-KEY") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.path.style.access" ,"true") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.impl" ,"org.apache.hadoop.fs.s3a.S3AFileSystem") \
                    .enableHiveSupport() \
                    .getOrCreate()
sc=spark.sparkContext
since I am using IBM Watson Studio, the necessary JAR file to execute Iceberg code is already provided when I start my Jupyter notebook. We just need to add it to the driver classpath:
                    .config("spark.driver.extraClassPath", "/opt/ibm/connectors/iceberg-lakehouse/iceberg-3.3.2-1.2.1-hms-4.0.0-shaded.jar") \
I need to access the watsonx.data Hive metastore. Since my Watson Studio is in the same project as my watsonx.data, I can simply use the Hive metastore service name to reach it.
For external access, and to obtain the truststore and truststore password, we need to follow the official documentation:
https://www.ibm.com/docs/en/watsonxdata/1.0.x?topic=administering-accessing-hive-metastore-hms-using-nodeport
https://www.ibm.com/docs/en/watsonxdata/1.0.x?topic=administering-importing-hms-self-signed-certificates-java-truststore
                    .config("spark.hive.metastore.uris", "thrift://ibm-lh-lakehouse-hive-metastore-svc.cpd-instance.svc.cluster.local:9083") \
                    .config("spark.hive.metastore.use.SSL", "true") \
                    .config("spark.hive.metastore.truststore.path", "file:///project_data/data_asset/truststore.jks") \
                    .config("spark.hive.metastore.truststore.password", "fn9BAntzCffd3zwozmFaztcp") \
                    .config("spark.hive.metastore.truststore.type", "JKS") \
                    .config("spark.hive.metastore.client.auth.mode", "PLAIN") \
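The metastore settings above can be bundled into a small helper so they are not scattered across the builder chain. This is a minimal sketch; the property names are taken from the snippet above, while the function itself and the placeholder password are hypothetical:

```python
# Hypothetical helper that collects the Hive-metastore settings used above
# into a dict that can be applied to a SparkSession builder in a loop.
def hms_configs(uri, truststore_path, truststore_password, username, password):
    return {
        "spark.hive.metastore.uris": uri,
        "spark.hive.metastore.use.SSL": "true",
        "spark.hive.metastore.truststore.path": truststore_path,
        "spark.hive.metastore.truststore.password": truststore_password,
        "spark.hive.metastore.truststore.type": "JKS",
        "spark.hive.metastore.client.auth.mode": "PLAIN",
        "spark.hive.metastore.client.plain.username": username,
        "spark.hive.metastore.client.plain.password": password,
    }

configs = hms_configs(
    "thrift://ibm-lh-lakehouse-hive-metastore-svc.cpd-instance.svc.cluster.local:9083",
    "file:///project_data/data_asset/truststore.jks",
    "***", "frank", "frank")
# builder = SparkSession.builder
# for key, value in configs.items():
#     builder = builder.config(key, value)
```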
in our Spark code, we also need to configure access to the S3 bucket, which in our case is lakehouse-bucket:
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.endpoint" ,"https://minio-minio.apps.653dbde1abddc300171b754e.cloud.techzone.ibm.com") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.access.key" ,"DEVELOPER-ACCESS-KEY") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.secret.key" ,"DEVELOPER-SECRET-KEY") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.path.style.access" ,"true") \
                    .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.impl" ,"org.apache.hadoop.fs.s3a.S3AFileSystem") \
note there is a trick here: when you define your bucket endpoint, access key, and secret key, the property must follow the per-bucket pattern
spark.hadoop.fs.s3a.bucket.<bucket name>.<property name, such as endpoint, access.key, or secret.key>
At this point, we are ready to connect to the watsonx.data Hive metastore and the same S3 bucket.
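The per-bucket pattern is easy to get wrong by hand, so it can help to generate the property names. A minimal sketch; the helper and the endpoint value are hypothetical, while the property names follow the snippet above:

```python
# Hypothetical helper that builds the per-bucket S3A properties following
# the pattern spark.hadoop.fs.s3a.bucket.<bucket name>.<property name>.
def s3a_bucket_configs(bucket, endpoint, access_key, secret_key):
    prefix = f"spark.hadoop.fs.s3a.bucket.{bucket}."
    return {
        prefix + "endpoint": endpoint,
        prefix + "access.key": access_key,
        prefix + "secret.key": secret_key,
        prefix + "path.style.access": "true",
        prefix + "impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }

configs = s3a_bucket_configs("lakehouse-bucket", "https://minio.example.com",
                             "DEVELOPER-ACCESS-KEY", "DEVELOPER-SECRET-KEY")
```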
remember, in the previous step we already identified the metadata JSON file location:
spark.sql("call lakehouse.system.register_table(table =>'iceberg_tpch_sf100.nation',metadata_file => 's3a://lakehouse-bucket/iceberg_tpch_sf100/nation/metadata/00000-de1201db-b078-4a3f-a7d8-94ef8668a8e2.metadata.json')").show()
the output would look like below
6 finally, we can see that nation is available again in watsonx.data
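To avoid hard-coding the call, the register_table statement can be assembled from its parts. This is a minimal sketch; the helper function is hypothetical, while the catalog, table, and metadata file are the ones used above:

```python
# Hypothetical helper that builds the Iceberg register_table call so the
# catalog, table name, and metadata file are not hard-coded in the SQL.
def register_table_sql(catalog, table, metadata_file):
    return (f"CALL {catalog}.system.register_table("
            f"table => '{table}', "
            f"metadata_file => '{metadata_file}')")

sql = register_table_sql(
    "lakehouse", "iceberg_tpch_sf100.nation",
    "s3a://lakehouse-bucket/iceberg_tpch_sf100/nation/metadata/"
    "00000-de1201db-b078-4a3f-a7d8-94ef8668a8e2.metadata.json")
# spark.sql(sql).show()
# A quick sanity check after registration (requires the live session):
# spark.sql("SELECT count(*) FROM lakehouse.iceberg_tpch_sf100.nation").show()
```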

#watsonx.data
#Catalog
#Bucket
#HiveMetastore