How Do I Manage My Mammoth Chatbot??

By Daniel Toczala posted Wed March 03, 2021 04:11 PM


Sometimes in our rush to apply technology, we lose sight of the problems that we were originally solving. Chatbots are no exception. Organizations and individuals often become SO FOCUSED on their chatbots that they forget the bot was built and deployed for a reason. Usually, that reason is some combination of improved customer experience, call center deflection, and answering a set of common customer questions (not every question under the sun).

Teams also often fall into the trap of constantly adjusting their chatbots without knowing what the adjustments are fixing, and with no way to measure the results of their changes. This is one of the most common “traps” that I see our customers fall into. Without some kind of objective and automated chatbot testing framework in place, you are just mindlessly twisting knobs and hoping that results improve. So spend some time and PUT AN AUTOMATED CHATBOT TESTING FRAMEWORK IN PLACE. You can begin with some simple k-fold testing: hold out one slice of your training examples, train on the rest, measure how many held-out examples are classified correctly, and repeat for each slice. Use some of the IPython/Jupyter notebooks that are available to you, and review my earlier blog posts on Conversational Assistants and Quality with Watson Assistant. Do all of this, and you too can create a chatbot that is awesome.

So please — set up a testing environment so you can evaluate your chatbot with objective measures.
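
To make the k-fold idea concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the notebooks mentioned above: the function names are hypothetical, and the word-overlap classifier is only a stand-in so the example runs on its own. In practice you would swap in a `train_fn` that trains and queries your real assistant.

```python
import random
from collections import Counter, defaultdict

def k_fold_splits(items, k, seed=0):
    """Shuffle once, then yield (train, test) pairs for each of k folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def train_overlap_classifier(examples):
    """Stand-in classifier (NOT Watson Assistant): scores intents by shared words."""
    vocab = defaultdict(Counter)
    for text, intent in examples:
        vocab[intent].update(text.lower().split())

    def predict(text):
        words = text.lower().split()
        # sorted() makes tie-breaking deterministic.
        return max(sorted(vocab), key=lambda intent: sum(vocab[intent][w] for w in words))

    return predict

def k_fold_accuracy(examples, k=5, train_fn=train_overlap_classifier):
    """Average held-out accuracy over k folds; needs at least k examples."""
    scores = []
    for train, test in k_fold_splits(examples, k):
        predict = train_fn(train)
        correct = sum(predict(text) == intent for text, intent in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

Run it before and after each change to your training data. If the number does not move, or moves the wrong way, you know the change was not the improvement you hoped for.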

What are you doing? Why?

Lately, I have been getting a lot of BIG chatbot questions. Customers ask me how to manage chatbots with hundreds of intents. How do you provide enough good training data, and what rules of thumb do you use with training data? Can we just add keywords to the training data? Our users talk to our chatbot like a Google search, so how should I structure my intents?

These are all good questions, but there is something lurking here beneath the surface. The chatbot is getting larger, and harder to manage. When working with your chatbot, remember that YOU train the intents. With Watson Assistant, as a rule of thumb, you should have between 10 and 40 different training examples for each intent if you want the classification to work well. Whenever possible, it is best to use actual user inputs as intent training examples. This will give your chatbot some “real world” experience. Let me say that again. Whenever possible, it is best to use actual user inputs as intent training examples. This is VERY important, since our chatbot users often do things that we do not think of when we are building out and designing our chatbots.
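
A quick way to check the 10-to-40 rule of thumb is to count the examples per intent in an exported skill. The sketch below assumes the export is JSON with an `intents` list whose entries carry an `intent` name and an `examples` list; that layout is my assumption here, so adjust the keys to match your own file. The function name is hypothetical.

```python
import json

def flag_intents_by_example_count(skill, low=10, high=40):
    """Return {intent_name: example_count} for intents outside [low, high]."""
    flagged = {}
    for intent in skill.get("intents", []):
        count = len(intent.get("examples", []))
        if count < low or count > high:
            flagged[intent["intent"]] = count
    return flagged

# Small inline example; with a real export you would load the file instead,
# e.g. skill = json.load(open("skill.json"))  # hypothetical file name
skill = json.loads("""
{"intents": [
  {"intent": "check_balance", "examples": [{"text": "what is my balance"}]},
  {"intent": "open_account",
   "examples": [{"text": "e1"}, {"text": "e2"}, {"text": "e3"}, {"text": "e4"},
                {"text": "e5"}, {"text": "e6"}, {"text": "e7"}, {"text": "e8"},
                {"text": "e9"}, {"text": "e10"}]}
]}
""")
print(flag_intents_by_example_count(skill))
```

Intents that come back with very few examples are good candidates for your weekly log review.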

Doing this usually requires a weekly session of going through the chatbot logs and finding user utterances that serve as good examples for your intents. Have your chatbot development team do this each week; it keeps them in touch with HOW people are seeking to use the chatbot. The best chatbots grow, mature, and evolve over time, moving closer to the interaction patterns that their users desire. If your business thinks that it can just deploy a chatbot and put no effort into evolving and maintaining it, then you need to explain that the business will not get the significant value from the chatbot that is possible.

One place where this was evident was with the various Covid-19 chatbots that I have seen deployed in the past year. All of the organizations that deployed Covid-19 chatbots as part of our Covid chatbot program last year saw an initial spike in usage when they first deployed their chatbots. Organizations that updated and evolved their chatbots, changing them to answer new areas of citizen concern, saw continued engagement from their users, and the typical usage curve tended to follow the ebbs and flows of Covid infection rates in those communities (higher rates of infection led to more people asking the chatbot questions).

Those organizations that did not maintain and evolve their chatbots saw a slow trailing off of engagement from their user communities. As citizens saw the answers to their questions become “outdated”, and saw that there was no updated guidance on new areas of concern, they stopped interacting with the chatbot.

When Is BIG too big?

All of the above focuses on measuring and evolving your chatbot and improving your intents. What about chatbots with a large number of intents? What about my chatbot that has 100 or more intents? These kinds of questions lead to a discussion about some of the realities of a large chatbot.

When you have 100 or more intents, it is very hard not to have some overlap between intents. You have so many intents that separating the training data between two very similar intents often becomes a non-trivial task. At some point, are the intents REALLY different? Or do we have the same intent, but with some different qualifiers or metadata?

Watson Assistant has a feature called intent recommendations which can help you detect situations in your user logs where you may have user utterances that can be used to better define existing intents, or to define new intents. The other automated feature of Watson Assistant that comes in handy here is the ability to handle intent conflicts. In these situations, Watson Assistant will alert you when it detects what appears to be conflicting training data. You should resolve these conflicts as soon as possible.
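
If you want a rough feel for overlap before you open the tooling, you can scan your own training data for near-duplicate examples that sit under different intents. The sketch below is NOT how Watson Assistant detects conflicts; it is a simple stand-in that uses word-set (Jaccard) similarity, and the function names and the 0.6 threshold are assumptions chosen for illustration.

```python
from itertools import combinations

def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Shared words divided by total distinct words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_overlapping_examples(examples_by_intent, threshold=0.6):
    """List cross-intent example pairs whose similarity is at or above threshold.

    Returns (intent_a, example_a, intent_b, example_b, score) tuples,
    most similar first.
    """
    hits = []
    for (ia, exs_a), (ib, exs_b) in combinations(sorted(examples_by_intent.items()), 2):
        for ea in exs_a:
            for eb in exs_b:
                score = jaccard(tokens(ea), tokens(eb))
                if score >= threshold:
                    hits.append((ia, ea, ib, eb, score))
    return sorted(hits, key=lambda h: -h[4])

training = {
    "refinance": ["i want to refinance my house"],
    "mortgage": ["i want to refinance my home"],
    "balance": ["what is my balance"],
}
for hit in find_overlapping_examples(training):
    print(hit)
```

Every pair this surfaces is a prompt to ask the question above: are these two intents really different, or is one of them just a qualifier on the other?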

The other thing to keep in mind is that intents that are close to each other can be allowed to disambiguate. Using the disambiguation feature of Watson Assistant will let your users decide exactly which intent they intend when you have user statements that resolve to more than one possible intent. If you also use the Autolearning feature of Watson Assistant, you can have your chatbot learn from these disambiguation events.

Another issue that you will see if you analyze chat logs from a chatbot with a large number of intents is that most of your user interactions will occur in a handful of your “most popular” intents. Conversely, in a chatbot with 100 or more intents, you will note that over half of your intents are either never hit, or hit extremely rarely.

At the end of the day, how useful is a chatbot with 100 or 200 intents? How many times are the “back 50” (the 50 least exercised intents) being hit by your chatbot users? You are expending a lot of effort, maintenance, and complexity, to address things that happen 0.02% of the time (some refer to this as intent starvation). Wouldn’t things like this be better handled with a “long tail search” using a Watson Discovery based search skill?
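
You can measure your own “back 50” from the logs. The sketch below assumes you have already pulled the winning intent name for each logged user turn into a plain list; how you extract that from your logs is up to you, and the function name and the 80% coverage figure are hypothetical choices for illustration.

```python
from collections import Counter

def intent_usage_report(logged_intents, defined_intents, coverage=0.8):
    """Summarize how traffic is spread across intents.

    head:      the most-hit intents that together reach `coverage` of traffic
    never_hit: defined intents that do not appear in the logs at all
    shares:    fraction of logged turns per intent
    """
    counts = Counter(logged_intents)
    total = sum(counts.values())
    never_hit = sorted(set(defined_intents) - set(counts))
    running, head = 0, []
    for intent, n in counts.most_common():
        if running / total >= coverage:
            break
        head.append(intent)
        running += n
    shares = {intent: n / total for intent, n in counts.items()}
    return {"head": head, "never_hit": never_hit, "shares": shares}

logged = ["balance"] * 8 + ["loan"] + ["wire"]
defined = ["balance", "loan", "wire", "close_account"]
print(intent_usage_report(logged, defined))
```

If the `head` list is short and the `never_hit` list is long, that is a strong hint that the tail belongs in search rather than in more intents.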

But We’re Different…..

You’ve read everything above, and you still think that you need a chatbot that has 100 or more intents. It could be worthwhile to look into a cascading classifier approach. In this approach, you have an initial classifier that does a “rough sort” of the intent, and then passes the user utterance to a “detail classifier” that does the more nuanced intent detection. For example, a banking or financial chatbot might have an initial classifier that determines what you are trying to do (get a loan, check account balances, open an account, close an account, do a wire transfer, etc.), while detail classifiers would handle the intents associated with each large group. So the initial classifier might take “I need to do a refinance of my house in Texas”, classify it as “Loans”, and then send the utterance to a Loans classifier, which would determine the type of loan and loan activity being asked for: in this case, a homeowner refinance loan (and not a new mortgage, not a car loan, not a student loan, etc.). It’s a “divide and conquer” method that aims to keep each classifier manageable. It also makes targeted improvements easier, since changes to the loan model only impact the loan model, not the investment model.
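
The routing logic of a cascade is small, and a sketch may help. Everything below is illustrative: the keyword-matching classifiers are stand-ins I made up so the example runs on its own, and the group and intent names are hypothetical. In a real deployment each classifier would be its own trained model or skill.

```python
def make_keyword_classifier(keywords_by_label, default):
    """Stand-in classifier: returns the label with the most keyword hits."""
    def classify(utterance):
        words = set(utterance.lower().split())
        best_label, best_hits = default, 0
        for label, keywords in keywords_by_label.items():
            hits = len(words & keywords)
            if hits > best_hits:
                best_label, best_hits = label, hits
        return best_label
    return classify

# Stage 1: the "rough sort" into large groups.
rough_sort = make_keyword_classifier(
    {"Loans": {"loan", "refinance", "mortgage"},
     "Accounts": {"balance", "open", "close", "account"}},
    default="Unknown",
)

# Stage 2: one detail classifier per group.
detail_classifiers = {
    "Loans": make_keyword_classifier(
        {"home_refinance": {"refinance", "house", "home"},
         "car_loan": {"car", "auto"},
         "student_loan": {"student", "tuition"}},
        default="loans_other",
    ),
    "Accounts": make_keyword_classifier(
        {"check_balance": {"balance"},
         "open_account": {"open"},
         "close_account": {"close"}},
        default="accounts_other",
    ),
}

def cascade(utterance):
    """Route the utterance through the rough sort, then the matching detail classifier."""
    group = rough_sort(utterance)
    detail = detail_classifiers.get(group)
    intent = detail(utterance) if detail else None
    return group, intent

print(cascade("I need to do a refinance of my house in Texas"))
```

Note that retraining or replacing the entry under `"Loans"` touches nothing else in the dictionary, which is the maintenance benefit described above.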

We’re Still Learning

After reading all of this, you may still be left a little bit disappointed. Using classifiers to build out chatbots is a relatively young practice, so new techniques and approaches are being tried all over the world. Did one of the above approaches work well for you? Did something else not mentioned here work for you? Share your experiences in the comments for this blog.