How Cloud Pak for Data & Cloud Pak for Business Automation Can Work Together to Drive Better Credit Decisions
Sepideh Seifzadeh, PhD, Senior Data Scientist, IBM Data Science Elite Team (DSE)
Pierre Berlandier, Customer Success Practice Lead - Business Automation
Kai Niu, Data Scientist, IBM Data Science Elite Team (DSE)
Ravikumar Govindan, Data Science Engineer, IBM Data Science Elite Team (DSE)
Dheeraj Arremsetty, Data Scientist, IBM Data Science Elite Team (DSE)
Erin Hwang, Customer Success Practice – Data & AI
Use case definition
A major North American bank was looking to implement a new, future-proof credit decisioning platform that could support client-level and account-level decisioning of credit strategies and scoring. The platform had to give the bank the flexibility to experiment with a variety of modern predictive models, leverage multiple data sources both at rest and in motion to refine and optimize those models, and operationalize them along with the bank's business policies for both on-line transaction processing and batch processing.
The solution required a complete platform that could ingest large volumes of data to preprocess and analyze credit information, and that would allow building machine learning models either with open source analytics tools and engines such as R, Python, and Spark or with out-of-the-box solutions. These models would then be exposed as versioned endpoints in a containerized environment so they could be operationalized through a rule-based decisioning platform for online or large-scale batch credit decisioning.
Some of the key requirements for the platform were to have:
- An intuitive, business-friendly decision management platform
- Agility to quickly experiment and integrate new data elements in the decision model without requiring heavy IT intervention
- Ability to perform decision testing and simulation at scale on millions of scenarios
- Ability to run decisions both in batches and through on-line transactions
- Real-time monitoring of decisions with KPI dashboards
- Ability to capture decision traces in a data lake to support scoring model refinement
- Ability to execute ML-based scoring decisions from the rule-based credit decisions
To support these requirements, the solution architecture uses the following combination of capabilities from the Cloud Pak for Business Automation (CP4BA) and the Cloud Pak for Data (CP4D).
From Cloud Pak for Data 3.0.1:
- Watson Studio (WS) and subcomponents:
  - Analytics Engine Powered by Apache Spark
  - Data Refinery
  - Hadoop Execution Engine
  - Jupyter Notebook Server with Python 3.6 + Jupyter Lab (with Spark)
  - RStudio Server with R 3.6.0
  - SPSS Modeler
- Watson Machine Learning (WML)
- Watson Knowledge Catalog (WKC)
- Watson OpenScale (WOS), including its prerequisite, Db2 Warehouse
From Cloud Pak for Business Automation 20.0.1:
- Operational Decision Manager (ODM)
- Business Automation Insights (BAI)
About Cloud Pak for Data
IBM Cloud Pak for Data is a comprehensive data and AI platform that unifies and modernizes data and AI capabilities. Built on Red Hat OpenShift, it delivers an integrated architecture with capabilities from IBM and IBM partners, including open source.
Using Cloud Pak for Data, it is possible to deploy services on any cloud or on premises, fully-managed as-a-service or with an optimized system. With flexibility at its core, IBM Cloud Pak for Data enables you to modernize at your own pace.
Using the different components of Cloud Pak for Data gives us the following capabilities in support of the requirements:
- Ability to create and maintain a pipeline from start to finish, running preprocessing (ETL) operations on large datasets in addition to training a model.
- Ability to use different tools to analyze the data, both open source and proprietary.
ODM provides a platform to model, test, manage, govern, and operationalize rules-based business decisions. It allows business users to take ownership of operational business policies and actively maintain their implementation as they evolve. To make this concrete, an (over)simplified example of rules involved in credit decisioning could look like this:
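As a rough illustration only, the kind of policy rules described above can be sketched in Python. This is not ODM's rule language, and every field name and threshold here (`credit_score`, the 600 cutoff, the 50% income ratio) is a hypothetical assumption chosen for the example, not a value from the engagement.

```python
from dataclasses import dataclass


@dataclass
class Application:
    """Hypothetical credit application; all fields are illustrative."""
    credit_score: int        # e.g. a score produced by an ML model
    annual_income: float
    requested_amount: float
    bankruptcies: int


def decide(app: Application) -> str:
    """Apply hypothetical policy rules in priority order."""
    # Rule 1: any bankruptcy on file leads to rejection.
    if app.bankruptcies > 0:
        return "REJECT"
    # Rule 2: a score below the assumed cutoff leads to rejection.
    if app.credit_score < 600:
        return "REJECT"
    # Rule 3: a large request relative to income goes to manual review.
    if app.requested_amount > 0.5 * app.annual_income:
        return "REFER"
    # Default: no rule objected, so the application is approved.
    return "APPROVE"


example = Application(credit_score=720, annual_income=80_000,
                      requested_amount=20_000, bankruptcies=0)
print(decide(example))
```

In a rules platform these conditions would be authored and maintained by business users rather than coded, but the structure (conditions on the application, evaluated in order, producing an outcome) is the same idea.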
Using ODM and BAI gives us the following capabilities in support of the requirements:
- A dynamic object model can be used to support the quick integration of new features or data elements associated with the decision.
- Test and simulation data can be supplied by custom scenario providers, allowing flexible access to large numbers of credit decision scenarios from external data sources.
- The decision engine can be embedded and deployed to a Spark grid, providing massively parallel testing, simulation, and batch processing.
- The decision traces can be directed to a data lake so they can be used as additional data points to improve the scoring models.
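To make the last point more tangible, a decision trace can be thought of as a structured record of what went into a decision and what came out. The sketch below serializes such a record as one JSON line, a common format for landing records in a data lake. The record layout (`decision_id`, `inputs`, `rules_fired`, `outcome`) is an assumption made for illustration and is not the trace format that ODM or BAI actually emits.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_trace(decision_id: str, inputs: dict, rules_fired: list,
                outcome: str, timestamp: Optional[str] = None) -> dict:
    """Assemble a hypothetical decision trace record."""
    return {
        "decision_id": decision_id,
        # Default to the current UTC time in ISO 8601 format.
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
        "inputs": inputs,
        "rules_fired": rules_fired,
        "outcome": outcome,
    }


def to_json_line(trace: dict) -> str:
    """Serialize a trace as a single JSON line with stable key order."""
    return json.dumps(trace, sort_keys=True)


trace = build_trace(
    decision_id="d-0001",
    inputs={"credit_score": 720, "requested_amount": 20000},
    rules_fired=["score_above_cutoff"],
    outcome="APPROVE",
    timestamp="2020-06-01T12:00:00+00:00",
)
print(to_json_line(trace))
```

Records in this shape can later be joined with observed repayment outcomes, so that the inputs and decisions become training data for the next iteration of the scoring model.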
Conceptual view of the integration
Whether data arrives in real time, in batch mode from a data store, or from a data lake, the data management component helps ingest, curate, integrate, and orchestrate those data sources.
In the next phase of the pipeline, open source components as well as IBM data science tools can be used in the same integrated platform to analyze the data and build predictive or optimization models for the use case, using R, Python, Spark, or out-of-the-box tools. The Data Science component is a collaborative platform that allows users to build, test, and run their data science workloads. After models are deployed and operationalized with Watson Machine Learning, a monitoring and reporting component of Cloud Pak for Data, Watson OpenScale, helps ensure that they are not biased and remain in line with rules and regulations.
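One widely used way to quantify the kind of bias mentioned above is the disparate impact ratio: the rate of favorable outcomes for a monitored group divided by the rate for a reference group. The sketch below computes it on made-up data. The 0.8 threshold follows the commonly cited "four-fifths" rule of thumb and is an assumption here; the metrics and thresholds used in an actual OpenScale configuration may differ.

```python
from typing import Sequence


def favorable_rate(outcomes: Sequence[int]) -> float:
    """Fraction of favorable outcomes, where 1 = approved, 0 = not."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def disparate_impact(monitored: Sequence[int],
                     reference: Sequence[int]) -> float:
    """Ratio of the monitored group's favorable rate to the reference's."""
    reference_rate = favorable_rate(reference)
    if reference_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return favorable_rate(monitored) / reference_rate


# Made-up decisions for two groups of applicants.
monitored_group = [1, 0, 0, 1, 0]   # 2 of 5 approved
reference_group = [1, 1, 0, 1, 1]   # 4 of 5 approved

ratio = disparate_impact(monitored_group, reference_group)
is_flagged = ratio < 0.8            # assumed threshold, see lead-in
print(f"disparate impact = {ratio:.2f}, flagged = {is_flagged}")
```

A ratio close to 1 indicates that both groups receive favorable outcomes at similar rates; a ratio well below 1 is a signal to investigate the model and its training data.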
Component view of the integration
The figure below shows a component architecture of the solution for on-line transactions, along with the different personas involved in the solution. For more information on reference architectures for Cloud Pak for Data and Cloud Pak for Business Automation, you can visit the IBM Cloud Architecture Center at https://www.ibm.com/cloud/architecture/
Performing simulations at scale
To enable our client to perform ODM simulations at scale, we first needed to automate the ODM simulation process and then run the simulation in a distributed manner. IBM ODM allows a simulation to run as an independent application, which let us automate the simulation process using the IBM Cloud Pak for Data job scheduler. To scale the simulation, we configured the IBM Cloud Pak for Data Spark cluster to execute the IBM ODM model locally so that the simulation runs in parallel. The solution is illustrated as follows:
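The underlying pattern (split the scenarios into partitions, evaluate each partition with a locally embedded decision function, then merge the per-partition KPIs) can be sketched with the Python standard library. This is not the project's Spark or ODM code: a thread pool stands in for the Spark executors, and `decide_scenario` with its 650 cutoff is a hypothetical placeholder for the embedded decision engine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def decide_scenario(scenario: dict) -> str:
    """Hypothetical stand-in for the embedded decision engine."""
    return "APPROVE" if scenario["score"] >= 650 else "REJECT"


def partition(items: list, n: int) -> list:
    """Split items into n roughly equal partitions (round-robin)."""
    return [items[i::n] for i in range(n)]


def simulate_partition(scenarios: list) -> Counter:
    """Evaluate one partition and return its outcome counts."""
    return Counter(decide_scenario(s) for s in scenarios)


def run_simulation(scenarios: list, workers: int = 4) -> Counter:
    """Evaluate all partitions in parallel and merge the KPI counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(simulate_partition,
                                 partition(scenarios, workers)))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# 300 synthetic scenarios with scores 500..799.
scenarios = [{"score": s} for s in range(500, 800)]
kpis = run_simulation(scenarios)
print(dict(kpis))
```

Because each partition is evaluated independently and only small count summaries are merged at the end, the same shape maps naturally onto a Spark job, where each executor evaluates its partition of scenarios and the driver aggregates the results.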
Analytics Engine SDK: ibmaemagic
IBM Analytics Engine provides an architecture for Hadoop clusters that decouples the compute and storage tiers. Instead of a permanent cluster formed of dual-purpose nodes, Analytics Engine lets users store data in a separate storage layer such as IBM Cloud and spins up clusters of compute nodes on demand using Kubernetes. This makes Analytics Engine (AE) scalable, easy to configure, easy to use, and low maintenance.
Separating compute from storage helps to transform the flexibility, scalability and maintainability of big data analytics platforms.
Analytics Engine, which runs Spark underneath, can be used to process datasets stored in different locations: file storage, Hadoop HDFS, databases, or the cloud. It can process datasets ranging in size from gigabytes to exabytes, depending on the resources available.
Analytics Engine can be used for both interactive and batch processing:
Interactive mode launches a Spark Jupyter notebook from Cloud Pak for Data, with the Spark context connected to the Spark engine running inside Analytics Engine.
Batch mode, by contrast, submits Spark jobs that run offline; it relies on the API that Analytics Engine provides for accessing Spark.
As an outcome of this engagement, an Analytics Engine SDK was created and contributed to the open source community as a library called ibmaemagic, so that anyone can easily use Analytics Engine as a processing engine on the Cloud Pak for Data platform. Please contact us for more details.