Originally posted by: WillNguyen
For IBM Spectrum Conductor with Spark 2.2.0, a new notebook sample is available to deploy RStudio servers inside an IBM Spectrum Conductor with Spark cluster. Each RStudio server is configured to connect to a Spark instance group, allowing users to easily run R scripts and notebooks on Spark. The sample uses the following software components:
- RStudio: An open-source, web-based integrated development environment (IDE) for the R programming language.
- R: A programming language used for statistical computing and graphics, which is popular among statisticians and data analysts.
- sparklyr: An R interface for Apache Spark. RStudio connects to Apache Spark through this library.
- Docker: A platform for deploying applications in containers.
The RStudio server is deployed as a Dockerized notebook in IBM Spectrum Conductor with Spark. After following the steps below, a user can run R scripts and notebooks to perform data analysis and build machine learning models using data that is loaded into the Spark instance group.
In addition, the RStudio server that the sample deploys is designed to be compatible with RStudio in IBM Data Science Experience. This means that a user can export an R script or notebook created on IBM Data Science Experience along with its data, import it into IBM Spectrum Conductor with Spark, and execute the R script or notebook without any code changes. Likewise, a user can export scripts from IBM Spectrum Conductor with Spark and run them on IBM Data Science Experience without any code changes. This flexibility gives users the freedom to run their code on premises or in the cloud, depending on their requirements.
In this blog, we highlight the steps for deploying RStudio in IBM Spectrum Conductor with Spark, importing R scripts or notebooks from IBM Data Science Experience, and running them in IBM Spectrum Conductor with Spark. The detailed steps can be found here.
Prerequisite steps
- Install an IBM Spectrum Conductor with Spark cluster.
- Install Docker on a subset of compute hosts that will run the RStudio server.
- Build the Docker image. This Docker image is based on the tidyverse Docker image and contains RStudio, as well as other R packages such as sparklyr and dplyr.
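As a rough illustration of the third step, a Dockerfile for such an image might look like the following sketch. This is not the sample's actual build file; the rocker/tidyverse base tag and the package list are assumptions:

```dockerfile
# Illustrative sketch only -- the sample ships its own build files.
# The rocker/tidyverse base image provides R, RStudio Server, and the tidyverse.
FROM rocker/tidyverse:latest

# Install the R packages used to communicate with Spark (assumed list).
RUN R -e "install.packages(c('sparklyr', 'dplyr'), repos = 'https://cran.r-project.org')"
```

The resulting image would then be distributed to the Docker-enabled compute hosts before the notebook package is deployed.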
Deploy the RStudio server in IBM Spectrum Conductor with Spark
- Create the RStudio server deployment package using the files included in the sample. The package name is RStudio-1.0.44.tar.gz. The complete set of steps for creating the package is covered here.
- Add the RStudio server deployment package:
- Add a new notebook package.

- Fill in the notebook settings, selecting the RStudio-1.0.44.tar.gz package. The complete set of settings to input is covered here.

- The new notebook should now appear in the list.

- Create a Spark instance group with an RStudio server attached:
- Create a new Spark instance group.

- Fill in the Spark instance group details, selecting RStudio 1.0.44 (Dockerized) for the notebook.

- In the Spark configuration, ensure that spark.local.dir points to a directory other than /tmp. This is necessary so that multiple RStudio server instances can run on the same host.

- Click Create and Deploy Instance Group. Wait until the Spark instance group is in the Ready state.
- Start the Spark instance group.
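The spark.local.dir override described in the steps above might look like the following fragment of the Spark instance group's configuration. The exact path is an illustrative assumption; any per-instance scratch directory outside /tmp works:

```
# spark-defaults.conf fragment (illustrative)
# Point Spark's scratch space away from the shared /tmp default so that
# multiple RStudio server instances can coexist on the same host.
spark.local.dir  /var/scratch/spark-local
```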

- Assign users to the RStudio server:
- Click Assign Users to Notebook.

- Select the user that you want to assign to the notebook.

- Click Assign. The notebook is started automatically.

- Launch the RStudio web interface.

- Run the R notebook example:
- Click on the examples directory in the Files tab, and select Sparklyr_NotebookExample.Rmd to open the example.

- Click the arrow in each highlighted code chunk to run it. After the code finishes executing, the result is shown immediately below the chunk.
After you create the Spark context, the Spark tab in the upper right corner displays the connection information for the Spark instance group. Any data that you load into the Spark instance group is also shown in that tab. The buttons in the Spark tab are fully functional: click the SparkUI button to access the Spark driver user interface, click the log button to access the Spark driver logs, or click the disconnect button to disconnect from the Spark instance group.
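The connect-load-query flow that the example notebook walks through can be sketched in sparklyr as follows. The master URL and the data set are illustrative assumptions; in the sample, the connection details come from the Spark instance group:

```r
library(sparklyr)
library(dplyr)

# Connect to the Spark instance group; the master URL below is a placeholder.
sc <- spark_connect(master = "spark://conductor-host:7077")

# Copy a local data frame into Spark; it then appears in RStudio's Spark tab.
# Note: sparklyr replaces dots in column names with underscores.
iris_tbl <- copy_to(sc, iris, name = "iris", overwrite = TRUE)

# Run a dplyr pipeline that executes inside Spark, then pull the result back.
iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal = mean(Petal_Length)) %>%
  collect()

# Equivalent to the disconnect button in the Spark tab.
spark_disconnect(sc)
```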
If you previously created R scripts or notebooks on IBM Data Science Experience, follow the steps in the next section to import and execute them in IBM Spectrum Conductor with Spark.
Import R notebooks from IBM Data Science Experience and run them on IBM Spectrum Conductor with Spark
- Log in to IBM Data Science Experience and export the R notebook or script, as well as any data files from IBM Data Science Experience:
- Log in to http://datascience.ibm.com.

- From the menu, select RStudio.

- In the RStudio web interface, export the R notebook and data files. The files are downloaded to your local machine. If you export an entire directory, its contents are downloaded as a zip file.

- Import the R notebook and data files into an RStudio server running in IBM Spectrum Conductor with Spark:
- Log in to IBM Spectrum Conductor with Spark, and launch the RStudio web interface as you did previously.
- In the Files tab, click the upload button, and select the file that you exported from IBM Data Science Experience. If you select a zip file, the contents are unzipped into a directory.


- Click on the imported file to load it.

- Run it as you normally would in RStudio. In this example, an R script was loaded and can be executed by pressing Ctrl+Alt+R. The R script runs without any code changes.
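Because sparklyr abstracts the connection away from the analysis code, a script written on IBM Data Science Experience typically needs no edits to run here. A portable script might follow this sketch; reading the master URL from the environment is an illustrative assumption, not something the sample mandates:

```r
library(sparklyr)

# Resolve the Spark master from the environment so the same script runs
# unchanged on either platform (illustrative convention; "local" is the
# fallback when the variable is unset).
sc <- spark_connect(master = Sys.getenv("SPARK_MASTER", "local"))

# ... analysis code using sparklyr/dplyr verbs goes here ...

spark_disconnect(sc)
```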

Download IBM Spectrum Conductor with Spark 2.2.0 today!
If you have not yet tried IBM Spectrum Conductor with Spark 2.2.0, you can download an evaluation version here. If you have any questions or require more information about this blog, post in our forum!
#SpectrumComputingGroup