In the era of Big Data, companies increasingly base their decisions on hidden patterns uncovered through data analysis and visualization. More importantly, machine learning and data analysis now accompany many consumer and enterprise products.
Data scientists and data engineers use these features in their day-to-day work and must handle amounts of data too large to fit in a single computer's memory. Even when the data does fit, running scripts often takes a long time. For this reason, most data is stored in databases, which can handle huge volumes. From a programmer's perspective, it is important to have a simple, easy-to-use interface for interacting with the database and extracting data as needed.
We want to show how simple Db2 is to use as a data source for machine learning projects and, more importantly, how data scientists and data engineers can communicate with Db2 and extract data through a straightforward interface. To accomplish this, we created code samples that pair Db2 with popular machine learning libraries. These samples give our target audience a place to learn how to use Db2 in their projects.
About The Code Samples
The three machine learning libraries that we paired Db2 with are:
Scikit-Learn - A machine learning library that has simple and efficient tools for data mining and data analysis.
TensorFlow - Google’s open source library to help you develop and train ML models.
H2O - H2O.ai’s open source machine learning and artificial intelligence platform.
All of the samples are written in Python and run in Jupyter Notebook. With Db2 as the data source, connecting and extracting the data into a pandas DataFrame is exactly the same for each of these machine learning libraries. This consistency lets programmers switch seamlessly between libraries if they choose to.
You simply import IBM’s “ibm_db” and “ibm_db_dbi” libraries and enter the service credentials for your specific Db2 instance. Once that is done, you extract the table where your data is located with SQL statements. With just a couple of lines, your data is ready to be used!
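The connect-and-extract steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from the published notebooks: the helper names build_dsn and load_table are hypothetical, the credential values you pass in are placeholders for your own service credentials, and the DB-API module shipped with the driver is imported as ibm_db_dbi. The driver and pandas imports are deferred into load_table so the connection-string helper can be tried without a database at hand.

```python
def build_dsn(database, hostname, port, uid, pwd, security=None):
    """Assemble a Db2 connection string from service credentials.

    Hypothetical helper for illustration; all values come from the caller.
    """
    parts = [
        ("DATABASE", database),
        ("HOSTNAME", hostname),
        ("PORT", port),
        ("PROTOCOL", "TCPIP"),
        ("UID", uid),
        ("PWD", pwd),
    ]
    if security:
        # Optional keyword, e.g. "SSL", if your instance requires it.
        parts.append(("SECURITY", security))
    return "".join(f"{key}={value};" for key, value in parts)


def load_table(dsn, table):
    """Run SELECT * against `table` and return the rows as a pandas DataFrame.

    `table` is interpolated into the SQL text, so pass only trusted names.
    """
    # Imported here so build_dsn can be used without the driver installed.
    import ibm_db
    import ibm_db_dbi
    import pandas as pd

    handle = ibm_db.connect(dsn, "", "")      # low-level connection handle
    conn = ibm_db_dbi.Connection(handle)      # DB-API wrapper around the handle
    try:
        return pd.read_sql(f"SELECT * FROM {table}", conn)
    finally:
        conn.close()
```

With real credentials, calling load_table(build_dsn(...), "MYSCHEMA.MYTABLE") (a made-up table name) should return a DataFrame that can be handed to any of the three libraries. Recent pandas versions may emit a warning when given a non-SQLAlchemy connection.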
Pre-requisites and Special Notes
The code samples published on GitHub have been tested only with Db2 on Cloud instances. We ran Python 3.7 and Jupyter Notebook 6.0.0, along with the latest versions of the three machine learning libraries available when this blog was published.
When trying to connect to your Db2 instance, ensure that you are importing both “ibm_db” and “ibm_db_dbi.” The library “ibm_db” is a lower-level library that communicates directly with the database, while “ibm_db_dbi” is a higher-level wrapper that implements the Python DB-API on top of ibm_db and gives you an easier interface for getting the data you want.
Running !pip install ibm_db installs both ibm_db and ibm_db_dbi.
In some cases, the command !pip install ibm_db will not work. To work around this, run Jupyter Notebook within a Docker container.
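Before connecting, it can help to confirm that the install actually made both modules importable in the notebook's environment. The following is a small sketch using only the standard library; the function name missing_modules is hypothetical, and the default module names assume the DB-API wrapper is importable as ibm_db_dbi.

```python
import importlib.util


def missing_modules(names=("ibm_db", "ibm_db_dbi")):
    """Return the subset of top-level `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty list from missing_modules() suggests the driver is available to the current kernel; a non-empty list names what still needs installing.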
Links to Db2 ML Samples
Db2 with Scikit-Learn - https://github.com/IBM/db2-samples/blob/master/db2_for_machine_learning_samples/notebooks/Db2%20Sample%20For%20Scikit-Learn.ipynb
Db2 with Tensorflow - https://github.com/IBM/db2-samples/blob/master/db2_for_machine_learning_samples/notebooks/Db2%20Sample%20For%20Tensorflow.ipynb
Db2 with H2o - https://github.com/IBM/db2-samples/blob/master/db2_for_machine_learning_samples/notebooks/Db2%20Sample%20For%20H2o.ipynb