Part 1: Create a simple data exploration dashboard
As a data scientist, I’ve learned two things in the past few years. When it comes to making decisions about machine learning (ML) models, you need to gather the right evidence about your results, and you need to communicate it to your project team and stakeholders. In other words, you can’t put models in production if you don’t look at the right things, and if you don’t do it collectively.
In this blog post series, you’ll learn how to do that by building and monitoring ML models in Watson Studio (IBM Cloud Pak for Data’s set of Data Science services) and by visualizing and interacting with these models from Streamlit:
- How to simulate a sample use case
- How to share early EDA results in an interactive way
- How to make your first Streamlit app
- How your Streamlit app gets the data from Watson Studio
- How to visualize the data
Can’t wait to get started? Explore the full code used for this post here.
A quick note on using IBM Cloud Pak and Streamlit
I’ll be using the IBM Cloud Pak for Data “as a Service” version. You can get it for free here. And building your Streamlit app is free too, as it’s open-source under the Apache 2.0 license. You can build it on your computer, then host it on a service such as IBM Cloud Code Engine (IBM’s serverless platform to host apps and containers) or deploy it directly to Streamlit Cloud.
On one hand, we’ll be using Watson Studio to gather and share data, build models using code, low-code and no-code tools, deploy these models, and monitor them. Rather than an extensive list of everything you can do on this platform, we’ll see most of these capabilities in action in this series. On the other hand, we’ll be building a progressively complex Streamlit app to explore the data, test the models interactively, and analyze their performance beyond simple performance metrics.
For both components to communicate, I’ll introduce you to a handful of easy-to-use APIs to connect to Watson Studio from Streamlit along the way.
How to simulate a sample use case
The use case we’ll be simulating here is one where trust in ML models is paramount: credit scoring.
We’ll be using data from a FICO challenge around Home Equity Line of Credit loans (HELOC). You can learn more about it and request access to data on the FICO website. The goal is to use the information from loan applicants to predict whether they’ll repay their loan within two years. The features are anonymized credit bureau data, mostly quantitative.
The two things that make this type of use case interesting are:
- The importance of interpretability (because you need to let applicants know why they’re accepted or rejected)
- The importance of monotonicity constraints (I’ll describe this in more detail in the next post of this series)
Typically, you’d take this type of use case further by incorporating what’s called “alternative data.” In other words, you’d judge applicants not only on their prior credit behaviors but also on whether or not they avoided overdrafts on their checking accounts, paid their bills on time, etc.
To learn more about credit scoring using alternative data, check out one of my past projects, IBM’s Loan Default Industry Accelerator.
Once FICO grants you access to the dataset, they’ll share a data dictionary in Excel format and a CSV file with 24 columns, including the good/bad credit target called RiskPerformance.
To start working with it, create a Watson Studio project and upload the two files:
Instead of uploading files, we could also work from a data connection to one of the many supported data source types. Check out the Adding data to a project section of the Cloud Pak for Data as a Service documentation.
Once the data is added to a project, data scientists can start inspecting it using the Data profiling capabilities. You can also do it in Jupyter notebooks, which can be started on demand, shared with collaborators, and run with a Python, R, or Scala backend.
How to share early EDA results in an interactive way
Let’s skip forward and pretend that a data science team has been exploring this data for a couple of days or even weeks, trying to find good indicators of risk, thinking of modeling approaches and good features to build. Before moving forward, they’d want to show some of their findings to subject matter experts on the project, to check if the insights they found are relevant or are due to data quality issues or misinterpretations.
Notebooks are great to document work, but I’ve found that sharing them with the stakeholders in this format can be overwhelming. Therefore, the goal here is to take the most interesting findings from the dataset and put them in a Streamlit app instead. Once you have an app for your data discussions, your team can answer questions such as “Have you thought about looking at the relationship between these two variables instead of the ones you’re showing?” right on the spot.
How to make your first Streamlit app
Streamlit apps are written as Python scripts, alternating components that appear in the UI with data wrangling logic. All it takes is installing the streamlit package with pip (check out the Get started page) and importing streamlit in a Python script. Write something like st.write("Hello world"), save the script, run streamlit run <your-script>.py, and bingo! You have your first Streamlit app running!
The one thing I love about it is that you can set your app to auto re-run every time you make changes to the code. This means you only use the streamlit run command once and then improve your code iteratively. The app itself uses only a couple of basic components:
First, we gather input from the user in different ways: text boxes, dropdown lists, number inputs, radio buttons. We also display information to the end-user through markdown and by displaying a pandas dataframe. If you look at the code behind this page (in app.py), Streamlit code (for example, st.write, the most basic “magic” command to make anything appear in the UI) is interwoven with Python logic to load and prepare data. As you start making your first app in Streamlit, you’ll find that this approach makes app development very easy.
Another thing I love about Streamlit is that, even though I quickly start modularizing my code, the structure of a Streamlit app starts off really simple. For example, it’s even easier to write than RShiny, where things typically start with separate ui and server functions (and scratching your head over how to pass data from one to the other). None of that happens in the Streamlit world!
How your Streamlit app gets the data from Watson Studio
As I mentioned above, there are several public REST APIs available to interact with Watson Studio projects. One of them is the Watson Data API, available both on Cloud and on premise, which lets you interact with a variety of project assets. All the code to interact with this API is abstracted away in cpd_helpers.py. Once the end-user provides their personal API key, we use it to authenticate to the API. This then lets us access the endpoints to list projects and data assets this user has access to.
After the user has selected the dataset they wish to explore in our app’s interface, we use two endpoints to load the data, as described in the API reference. At a high level, we get details on where the associated file lives. Then we get a signed URL to load that file into our app. If you’re not familiar with signed URLs, they’re basically secure URLs that let you access data for a limited amount of time, after you’ve authenticated and the platform has verified your permissions.
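A rough sketch of that flow using the requests library might look like the following. The IAM token exchange is standard IBM Cloud; the asset and attachment endpoint paths are my reading of the Watson Data API reference, so treat them as assumptions and verify them against the docs (in the real app, this logic lives in cpd_helpers.py):

```python
import pandas as pd
import requests

IAM_URL = "https://iam.cloud.ibm.com/identity/token"
BASE_URL = "https://api.dataplatform.cloud.ibm.com"  # assumed region endpoint


def get_token(api_key: str) -> str:
    """Exchange the user's personal API key for a bearer token via IBM Cloud IAM."""
    resp = requests.post(
        IAM_URL,
        data={"grant_type": "urn:ibm:params:oauth:grant-type:apikey",
              "apikey": api_key},
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


def load_asset(token: str, project_id: str, asset_id: str) -> pd.DataFrame:
    """Fetch asset details, request a signed URL, and read the CSV into memory."""
    headers = {"Authorization": f"Bearer {token}"}
    params = {"project_id": project_id}
    # 1) Asset details tell us where the associated file lives.
    asset = requests.get(f"{BASE_URL}/v2/assets/{asset_id}",
                         headers=headers, params=params).json()
    attachment_id = asset["attachments"][0]["id"]
    # 2) The attachment endpoint returns a time-limited signed URL.
    att = requests.get(
        f"{BASE_URL}/v2/assets/{asset_id}/attachments/{attachment_id}",
        headers=headers, params=params).json()
    # 3) pandas reads straight from the signed URL into a dataframe.
    return pd.read_csv(att["url"])
```

The nice part is step 3: because the signed URL is just an HTTPS link, pd.read_csv() needs no special handling at all.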
And that’s it!
We pass the signed URL directly to the famous pd.read_csv() method, and it downloads the data and loads it into a dataframe in memory.
How to visualize the data
Once you have the data loaded, you can look at a couple of rows in your app. For this example, I created two types of visualizations.
1. The univariate distributions of each feature conditioned on values of the target:
Here I look at the distribution of each feature for the “good” credit applicants and the “bad” credit applicants separately. Because I don’t want to overwhelm the subject matter experts with 24 univariate plots, I rely on Streamlit’s interactivity by adding a dropdown to go through the different features.
2. The percentage of good vs bad applicants per bin of the numerical feature:
Here I want to answer this question: among applicants with a high value of feature X, is there a higher proportion of bad loans? I used a slider to set the number of bins, which are then computed as quantiles using pandas’s qcut function. Check out the code in app.py again to see how I built these two charts.
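Stripped of the plotting and widget code, the data wrangling behind these two charts could be sketched as follows (the function names are mine; in the app, the feature would come from a st.selectbox and the bin count from a st.slider):

```python
import pandas as pd


def feature_by_class(df, feature, target="RiskPerformance"):
    """Chart 1: one feature's values split by target class ("Good" vs
    "Bad"), ready to plot as two conditional distributions."""
    return {label: grp[feature].reset_index(drop=True)
            for label, grp in df.groupby(target)}


def bad_rate_per_bin(df, feature, n_bins, target="RiskPerformance"):
    """Chart 2: share of "Bad" outcomes within each quantile bin of a
    numerical feature. The bin count is user-controlled in the app."""
    bins = pd.qcut(df[feature], q=n_bins, duplicates="drop")
    return (df.groupby(bins, observed=True)[target]
              .apply(lambda s: (s == "Bad").mean()))
```

Each returned object can then be handed to whichever charting library you prefer.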
This is a great example of one of the strengths of Streamlit. There are lots of plotting libraries available in Python. Everyone has their favorite. In my case, I sometimes switch between them in the same project, and Streamlit supports all of them. I could’ve gathered visualization code from one teammate written in Altair, another in Bokeh, and easily put both in my app!
To learn more about charts and Streamlit, check out the Charts elements section of the API reference.
In this post, I’ve shown you how I’d typically start a data science project — first in a Watson Studio project, then pulling some interesting insights in a Streamlit app. In the next post of this series, I’ll start training models, saving and deploying them, and then accessing their predictions APIs from the Streamlit app. This will make us add pages to our app, explore new Streamlit components, and start optimizing the app’s code.
Thank you for reading! If you have any questions, please leave them in the comments below or add me on LinkedIn.
Until the next post!
Update 2/2: Post #2 is now up! Check it out.