Originally posted by: TingXue
Delta Lake is an open source storage layer that brings reliability to data lakes. It’s not included with IBM Spectrum Conductor 2.4.0; however, you can integrate the two. Here's an example of adding Delta Lake 0.4.0 to an existing IBM Spectrum Conductor 2.4.0 instance group called SIG243.
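If you would rather script the download in the first step below than fetch the file through a browser, here is a minimal sketch. It assumes the package is published on Maven Central under the group ID io.delta and that the standard Maven repository layout applies; the base URL, the group ID, and the helper names maven_jar_url and download_jar are illustrative assumptions, not something taken from the IBM Spectrum Conductor documentation.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: Maven Central base URL and the standard repository layout.
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_jar_url(group_id, artifact_id, version, base=MAVEN_CENTRAL):
    """Build the jar URL from Maven coordinates (hypothetical helper)."""
    group_path = group_id.replace(".", "/")
    return f"{base}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"


def download_jar(url, dest_dir="."):
    """Download the jar into dest_dir and return the local path (hypothetical helper)."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    urlretrieve(url, str(dest))
    return dest


# Assumption: Delta Lake 0.4.0 for Scala 2.11 lives under the io.delta group ID.
DELTA_URL = maven_jar_url("io.delta", "delta-core_2.11", "0.4.0")

# To actually fetch the file (requires network access), you could run:
#   jar_path = download_jar(DELTA_URL)
```

The resulting file name matches the delta-core_2.11-0.4.0.jar package used in the rest of this example, so you can upload it to the instance group as described.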
- Download the Delta Lake package for Scala 2.11 (delta-core_2.11-0.4.0.jar) from the Maven repository:
- Specify the instance group where you will add the Delta Lake package:
- From the IBM Spectrum Conductor cluster management console, click Workload > Instance Groups:
- Select the name of the instance group (SIG243, in this example):
- Stop the instance group: click Manage > Stop.
In this example, we’ll stop SIG243 so that we can later add the Delta Lake package to it as an additional package.
- Deploy the Delta Lake package to your instance group:
- When the instance group status changes to Ready, click Manage > Configure.
Here, SIG243 is ready:
- Add the Delta Lake package to the instance group: on the Packages page, select Create Single-File Packages, browse to the location where you downloaded delta-core_2.11-0.4.0.jar, and open the file.
Then, click Modify Instance Group.
- When the deployment finishes and the instance group state returns to Ready, click Start to start the instance group.
- Test Delta Lake table operations to ensure that your Delta Lake package works properly.
You can quickly test the Delta Lake package using Notebooks within the cluster management console (if you don’t have notebooks created for users, see IBM Knowledge Center to create one).
This example uses a Jupyter 5.4.0 notebook for the user called Admin.
- From the instance group management page, click the Notebooks tab (for example, Workload > Instance Groups > SIG243 > Notebooks):
- Click My Notebooks and then the notebook and user name combination (in this example, Jupyter 5.4.0 - Owned by Admin):
- In the Jupyter console, click New > Spark Python (Spark Cluster Mode):
- Run the following Python code, taken from the Delta Lake Quickstart documentation, to verify that the instance group loaded the Delta Lake package properly. Note that <shared location> is a location accessible by all the hosts in the Spark executor resource group for the instance group (by default, the ComputeHosts resource group):
data = spark.range(0, 5)
data.write.format("delta").save("<shared location>/delta-table")
df = spark.read.format("delta").load("<shared location>/delta-table")
df.show()
This code creates a Delta Lake table, reads from the table, and displays the table. If it runs successfully, it displays the table (five rows, with id values 0 through 4) without any errors.
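As an extra sanity check, you can confirm that the write really produced a Delta Lake table and not just plain Parquet files. Delta Lake records each commit as a JSON file with a 20-digit, zero-padded version number inside a _delta_log directory under the table path. The sketch below lists those versions. It assumes the shared location is mounted as an ordinary file system path on the host where you run it (it does not apply as written to HDFS or object storage), and delta_commit_versions is a hypothetical helper name, not part of the Delta Lake API.

```python
import os
import re

# Delta Lake commit files are named <20-digit version>.json inside _delta_log.
COMMIT_FILE = re.compile(r"^\d{20}\.json$")


def delta_commit_versions(table_path):
    """Return the sorted commit versions found under table_path/_delta_log.

    Hypothetical helper: returns an empty list if the directory has no
    _delta_log, meaning it does not look like a Delta Lake table.
    """
    log_dir = os.path.join(table_path, "_delta_log")
    if not os.path.isdir(log_dir):
        return []
    return sorted(
        int(name[:20])              # the version is the 20-digit file name prefix
        for name in os.listdir(log_dir)
        if COMMIT_FILE.match(name)  # skip checksum and other auxiliary files
    )
```

After the first successful write, calling delta_commit_versions on the delta-table directory under your shared location should return a list that contains version 0. An empty list suggests that the data was not written in Delta format.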
And it's that simple! Enjoy the integration and example.
As always, let us know what you think using our Slack channel!