Authors: Aakanksha Joshi and Yushu (Jade) Zhou
Hello reader! This blog accompanies our identically titled session from THINK Digital 2020. First of all, let’s introduce you to IBM’s Data Science and AI Elite Team, Cloud Pak for Data and Industry Accelerators.
The Data Science and AI Elite (DSE) Team is a global and diverse team of data science and AI experts who work side by side with data scientists of client teams to accelerate clients’ data science projects and achieve business impact. You can read more about the team here.
IBM Cloud Pak for Data is a fully integrated data and AI platform that modernizes how businesses collect, organize and analyze data and infuse AI throughout their organizations. You can read more about Cloud Pak for Data here.
The Industry Accelerators are a packaged set of assets that address common business issues. Each accelerator is an end-to-end implementation, from data prep to deployment, of a data science use case with a clear business outcome. These accelerators help clients kick-start their own implementation while expediting time to value.
The goal of the blog is to showcase what it’s like to engage with the DSE Team and co-develop data science solutions using Cloud Pak for Data. We’ll do this by using the example of a (hypothetical) client, called Capital Asset Build Inc. (CAB). The end-to-end data science life cycle resonates with CAB and they want to see how Cloud Pak for Data can help them streamline their workflows with a relevant use case. Enter DSE. The first step in our engagement is what we call a use case deep dive where we try to understand the business challenges the client is trying to address.
CAB is trying to identify customers who are at higher risk of attrition.
The key personas who interact with CAB’s data are Andy, the Data Engineer; Jane, the Data Scientist; and Justin, the Data Science Operations Manager. Each persona has their own set of questions around their data, their processes and how Cloud Pak for Data can help. Let’s see from a high-level what kind of questions this team may have and how Cloud Pak for Data can help address them.
Now that we’ve scoped out the use case and discovered the pain points and key questions, we can start co-creating the solution alongside the client team. For this Customer Attrition use case, we can get started with an Industry Accelerator for Customer Attrition that we already have available.
This picture gives an overview of the process we will follow.
So now, let’s go one step at a time and see how each component of this accelerator connects to the questions that the client wants answers to.
Andy: Collect, Connect and Organize
For these steps of the process Andy will use Data Virtualization and Watson Knowledge Catalog services.
His day starts by looking at new data requests that may have come his way. He sees a new request from Jane.
He needs three data sets to complete Jane’s request: Customer, Customer Summary, and Account. Customer resides in a MySQL database, whereas Customer Summary and Account reside in Db2 Warehouse. So, he’ll need to create a joined view of these three data sets before Jane can use them.
He chooses the option to Virtualize data, which takes him to the page within Data Virtualization where he can see his connections to databases.
He sees these connections have been established so he can directly go and start virtualizing. If he didn’t see them, he would’ve had to add them by going to “Add new data source.” There he would’ve had to provide information like this:
Once he has the connections he needs, he can go and virtualize data.
He can filter data by schema names. Then he can select the data sets he needs, and add them to his cart.
He can review dataset details and then hit virtualize. Notice how, at this point, they have all become part of the same schema.
Once he virtualizes the data, he will see the files under “My virtualized data.” Now, he needs to create that joined view. He has two options to do this:
- If he had to do a simple join between two data sets, he could use the visual method. He could select the two data sets he wants to join and then click on “Join”. He’d see the following screen:
He could select the Primary Key, “Customer ID” in this case, and hit “Preview” or “Next” based on what he wants to do. If he hit “Next”, he’d be asked to give this joined view a name. Once he completed the steps, this new data set would show up under “My virtualized data.” But he needs to join three data sets, so he needs something a little more complex.
- He can use the SQL Editor. Within this he can write more complex queries, even create nested joins and build new features. Once he has the query he wants, he can hit “Run all.” His new view, called “Customer_History”, will show under “My virtualized data” once his query finishes running.
Now he needs to share this file with Jane. Since Jane isn’t a Data Engineer, she doesn’t have direct access to Data Virtualization. CAB uses Watson Knowledge Catalog as a central repository for all data sets that data engineers and data scientists use to share data with each other. So, Andy will add this new virtualized data to the Enterprise catalog, which Jane has “Viewer” access to.
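To make the SQL Editor step concrete, here is a minimal, runnable sketch of the kind of three-way join Andy might write to build the Customer_History view. It uses an in-memory SQLite database as a stand-in for the virtualized sources, and all table and column names besides the three data set names are hypothetical, not taken from the accelerator.

```python
import sqlite3

# In-memory stand-ins for the three virtualized tables. The column
# names (NAME, AGE, TOTAL_SPEND, ACCOUNT_TYPE) are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER (CUSTOMER_ID INTEGER PRIMARY KEY, NAME TEXT, AGE INTEGER);
CREATE TABLE CUSTOMER_SUMMARY (CUSTOMER_ID INTEGER, TOTAL_SPEND REAL);
CREATE TABLE ACCOUNT (CUSTOMER_ID INTEGER, ACCOUNT_TYPE TEXT);
INSERT INTO CUSTOMER VALUES (1, 'Ann', 34), (2, 'Bob', 51);
INSERT INTO CUSTOMER_SUMMARY VALUES (1, 1200.0), (2, 300.0);
INSERT INTO ACCOUNT VALUES (1, 'PREMIUM'), (2, 'BASIC');
""")

# A three-way join on the shared Customer ID key, analogous to the
# view Andy would define in the Data Virtualization SQL Editor.
conn.execute("""
CREATE VIEW CUSTOMER_HISTORY AS
SELECT c.CUSTOMER_ID, c.NAME, c.AGE, s.TOTAL_SPEND, a.ACCOUNT_TYPE
FROM CUSTOMER c
JOIN CUSTOMER_SUMMARY s ON s.CUSTOMER_ID = c.CUSTOMER_ID
JOIN ACCOUNT a ON a.CUSTOMER_ID = c.CUSTOMER_ID
""")

rows = conn.execute(
    "SELECT * FROM CUSTOMER_HISTORY ORDER BY CUSTOMER_ID"
).fetchall()
print(rows)
```

The same query body, minus the SQLite setup, is what would go into the editor; nested joins or derived feature columns can be layered onto the SELECT in the same way.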
First, he’ll need to add the Data Virtualization database to the Catalog. He can do that by clicking on Add to Catalog -> Connection. There he’ll need to add the Data Virtualization database details, which he can find under My Instances -> data-virtualization in the Main Menu. Once he has added the connection, he can add connected assets by clicking on Add to Catalog -> Connected Asset -> Select Source. There he will be able to find the data he just virtualized.
With that, the data set is in the Catalog, from where Jane can add it to her project. He can go back to the Data Request and hit Deliver.
Jane: Analyze and Build
For these steps of the process, Jane will use Watson Studio and Watson Machine Learning services.
Her day starts by checking whether her requested data is ready. Once Andy has delivered the data set, Jane can use the search bar in Cloud Pak for Data to find the “Customer_History” file.
Jane can add this data file to her analysis project by clicking on the data, which will take her to Watson Knowledge Catalog. Then she can add this data to her analysis project using the “Add to Project +” button.
An analysis project is, in other words, a Watson Studio project. This is where data scientists develop, enhance and deploy their models using open-source or IBM-proprietary tools. Apart from common IDE choices like JupyterLab and RStudio, Cloud Pak for Data provides SPSS Modeler and Cognos as Watson Studio add-ons for citizen data scientists. More importantly, instead of coding independently or locally, the data science team can collaborate easily, checking teammates’ updates and sharing ideas live in Watson Studio. In our case, Jane likes using Jupyter notebooks, so she will create a notebook and build her own model. Then she will work with her colleagues to see which model performs best and how they can improve the best ones.
Jane already has a project she created by uploading the project she downloaded from the Customer Attrition Prediction Accelerator page. She can open the “1-model_training” notebook and replace the data used in the notebook with the data she just added to her project from the Catalog. She can then run the notebook and either use the models already defined in it or build her own custom models. These notebooks are a way to quickly prototype the process from model development to model deployment. In practice, data scientists won’t complete all these steps in a single notebook, but we use this one as a high-level example of their workflow.
Once the final version of the model is built, she will deploy the best model as a REST API using Watson Machine Learning, IBM’s deployment service. The deployments are hosted in another section of Cloud Pak for Data called Deployment Spaces, which can be found under Main Menu -> Analyze -> Analytics deployments. Deployment Spaces provide a clear demarcation between development and production environments. To hand off the work to the Data Science Operations Manager, Jane will add Justin to her deployment space so that he can see the final version of the model. Alternatively, she can simply add her best model to the Deployment Space and let Justin handle the model deployment as well.
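Once deployed, the model is reachable over HTTP. As a rough illustration, the sketch below assembles the kind of JSON scoring payload Watson Machine Learning deployments typically accept; the endpoint URL, token, and field names in the commented-out request are placeholders we made up, not values from the accelerator.

```python
import json

def build_scoring_payload(fields, rows):
    """Assemble a WML-style scoring request body.

    The {"input_data": [{"fields": ..., "values": ...}]} shape reflects
    the payload format used by Watson Machine Learning deployments;
    the field names passed in below are illustrative only.
    """
    return {"input_data": [{"fields": fields, "values": rows}]}

payload = build_scoring_payload(
    ["AGE", "TOTAL_SPEND", "ACCOUNT_TYPE"],
    [[34, 1200.0, "PREMIUM"]],
)
print(json.dumps(payload))

# Sending it would look roughly like this (placeholders, not executed):
# import requests
# resp = requests.post(
#     "https://<cluster>/ml/v4/deployments/<deployment_id>/predictions",
#     headers={"Authorization": "Bearer <token>"},
#     json=payload,
# )
```

Anyone with access to the Deployment Space, Justin included, can score against the same endpoint with a payload of this shape.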
Justin: Infuse and Monitor
Training and improving a model is often the part of the work data scientists enjoy most. However, to realize a model’s value in business, a pickle file or a REST API is not enough. Based on what we have seen from various clients, once the model is finished, there are two keys to its success in business. The first is how downstream applications interact with the model, and the second is how we continuously monitor model performance to confirm it still holds up on new data.
Let's look at the first key. The request to build a customer attrition model came from CAB's Customer Management team. They have a dashboard where customer managers can review basic information about their clients. They want to know how likely it is that certain clients will quit the service, so that they can come up with corresponding strategies to retain them. Now, since Jane and her team have already built the model and exposed it as a REST API, Justin and his team need to help the Customer Management team embed the model in their dashboard. Justin has found that the expected input of Jane’s API differs from what the dashboard can provide: Jane’s API expects processed input, but the dashboard provides only raw input. Therefore, Justin needs to build a predictive service that takes raw customer information as input, transforms it into engineered features, passes those to Jane’s API and returns predictions. You can find all these steps in the “2-model_scoring” notebook. After Justin has exposed the predictive service as an API, he will work with the dashboard developers to embed it.
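The shape of Justin's predictive service can be sketched in a few lines: raw dashboard fields go in, engineered features come out, and those features are forwarded to the model. Everything here is hypothetical - the raw field names, the feature transformations, and the stub standing in for the HTTP call to Jane's deployed model.

```python
def engineer_features(raw):
    """Turn raw dashboard fields into the engineered inputs the model
    expects. These particular transformations are made-up examples."""
    return {
        "AGE": raw["age"],
        "SPEND_PER_MONTH": raw["total_spend"] / max(raw["tenure_months"], 1),
        "IS_PREMIUM": 1 if raw["account_type"] == "PREMIUM" else 0,
    }

def predictive_service(raw, score_fn):
    """Wrap the model API: raw input -> engineered features -> prediction.
    `score_fn` stands in for the HTTP call to the deployed model."""
    features = engineer_features(raw)
    return score_fn(features)

# Stub model so the sketch runs end to end without a live endpoint.
def stub_model(features):
    return {"attrition_risk": 0.8 if features["SPEND_PER_MONTH"] < 50 else 0.2}

result = predictive_service(
    {"age": 34, "total_spend": 300.0, "tenure_months": 24, "account_type": "BASIC"},
    stub_model,
)
print(result)
```

In production, `score_fn` would be replaced by the real call to Jane's REST API, and this wrapper itself would be exposed as the API the dashboard consumes.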
Then, once the predictive service is live, Jane’s model will predict attrition for customers. To make sure the model handles new data well, Justin needs to continuously monitor the performance of the predictive service. To simplify this work, Justin registers the predictive service with Watson OpenScale, another add-on provided on Cloud Pak for Data, by running the “3-OpenScale_and WML_Configuration” notebook. These notebooks are not part of the Customer Attrition Prediction accelerator by default but can be recreated using these references.
Once the model is configured, Justin can monitor the data received by the model in production. Watson OpenScale provides different perspectives on model performance for live-data monitoring, including fairness and quality monitoring and drift detection. It also provides an explanation for each prediction, helping business stakeholders gain more trust in black-box models. In the customer attrition scenario, Justin can rely on its Drift Detection capability. This functionality estimates the accuracy of the model (accuracy drift) and checks whether incoming data differs substantially from the model’s training data (data drift). Justin can find out which records contribute to accuracy drift and which contribute to data drift in the OpenScale dashboard. He can keep collecting these problematic records for retraining once model performance drops below the threshold. The image below shows an example.
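OpenScale's drift monitor is a managed capability, but the intuition behind data drift is easy to show in miniature: compare a live feature's statistics against the training data's statistics and alert when they diverge. The toy score below - mean shift measured in training standard deviations - and its threshold are our own simplification, not how OpenScale actually computes drift.

```python
from statistics import mean, stdev

def data_drift_score(train_values, live_values):
    """Toy drift signal: shift of the live mean from the training mean,
    measured in training standard deviations. OpenScale's real drift
    detection is more sophisticated; this only illustrates the idea."""
    mu, sigma = mean(train_values), stdev(train_values)
    return abs(mean(live_values) - mu) / sigma

# Hypothetical monthly-spend values seen at training time vs. in production.
train_spend = [40.0, 55.0, 60.0, 45.0, 50.0]
live_stable = [48.0, 52.0, 57.0]    # similar to training data
live_shifted = [110.0, 130.0, 120.0]  # clearly different population

print(round(data_drift_score(train_spend, live_stable), 2))
print(round(data_drift_score(train_spend, live_shifted), 2))

THRESHOLD = 2.0  # hypothetical alerting threshold
assert data_drift_score(train_spend, live_shifted) > THRESHOLD
```

Records that trip a check like this are exactly the "problematic records" Justin would set aside as candidates for the retraining set.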
Since Justin still doesn’t have ground truth for new customers - whether a customer predicted to attrite actually did or not - he will send a new data request to Andy, his data engineer contact, asking him to find the ground truth and merge it with the newly collected data. This helps with the collection of live data.
And that wraps up our storyline about how a Data Engineer, a Data Scientist and a Data Science Operations Manager can collaboratively address a business challenge with a data-driven solution. Through this story we saw how Cloud Pak for Data and the Industry Accelerators can help enterprises make their people's lives easier, and how the Data Science and AI Elite team can help them adopt both Cloud Pak for Data and more data-driven solutions. So in case you were wondering, Andy, Jane and Justin were not trying to figure all of this out alone - we were there with them each step of the way.
Interested in learning how to kick-start your data science project with the right expertise, tools and resources? The DSE team can plan, co-create and prove the project with you based on our proven Agile AI methodology.
Request a free consultation: ibm.co/DSE-Consultation
Visit ibm.co/DSE-Community to connect with us, explore our resources and learn more about Data Science and AI Elite.