IBM Watson Conversation is one of the main services available for building smart agents. Its easy, intuitive interface lets you train a chatbot in a specific domain. Creating a chatbot with Watson is simple and doesn’t require advanced programming skills, but building a robust, enterprise-grade bot that actually changes the user experience requires an efficient building and validation method, based on accurate metrics.
The first step in developing a chatbot is collecting user intents and example utterances. That topic is already widely covered in posts and forums, so it will not be repeated here. In this post we will focus on the quality and testing process for IBM Watson Conversation implementations. We hope to share a bit of knowledge that will help you deliver a higher quality, more accurate chatbot.
Stage 1: Development test
Before beginning any development or testing effort, you should adopt a strategy of separating the test and training bases. As a rule of thumb, you should have at least 10 examples per intent. Watson Conversation works reasonably well starting at 5 examples per intent, but with 10 you can keep the minimum needed for training and still reserve a good number for test cases.
It is important to split the data into two groups: training and testing. In common business scenarios we are used to discussing and querying known information. But with predictive models there is no set of trusted information, and the focus is on validating a hypothesis: “if the user says X, he or she is trying to talk about intent Y”. That is why it is important to set aside part of the known data to validate the results of our training. A common setup uses a 70/30 or 80/20 split, where the smaller portion is the test base. This way we can use, for example, 70% of the original extracted data to train and submit the remaining 30% of examples to the new bot.
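As a minimal sketch of that split, the snippet below divides labeled examples into training and test sets per intent, so every intent keeps roughly the same 70/30 proportion. It assumes a simple two-column CSV (utterance, intent) with no header; the file name and column layout are assumptions to adapt to your own export.

```python
import csv
import random
from collections import defaultdict

def split_examples(csv_path, test_ratio=0.3, seed=42):
    """Split labeled examples into training and test sets, per intent."""
    by_intent = defaultdict(list)
    with open(csv_path, newline='', encoding='utf-8') as f:
        for utterance, intent in csv.reader(f):  # assumed columns: utterance, intent
            by_intent[intent].append(utterance)

    random.seed(seed)
    train, test = [], []
    for intent, utterances in by_intent.items():
        random.shuffle(utterances)
        cut = int(len(utterances) * test_ratio)
        test += [(u, intent) for u in utterances[:cut]]
        train += [(u, intent) for u in utterances[cut:]]
    return train, test

train, test = split_examples('intents.csv')
print(len(train), 'training examples /', len(test), 'test examples')
```

Splitting per intent (instead of over the whole file at once) keeps intents with few examples represented in both bases.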
Back to the testing process: this cycle normally validates the bot in expected scenarios with known examples. Here we engage the development/training team of the bot.
- To validate the conversational flow we can use a spreadsheet with intents and training examples. It’s a simple way of checking whether the conversation flows naturally and whether there are any errors in the dialog. One way to automate this is to create a script that sends a sequence of phrases to the chatbot and verifies that the answers remain the same.
- To validate training efficiency we use examples from the test base. In simple terms, we are checking that the bot returns the correct intent, and therefore the correct answer, for each test utterance (see the sketch after this list).
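The sketch below sends each test utterance to the workspace and compares the top-ranked intent with the expected one, reusing the test set produced by the earlier split. It assumes the watson-developer-cloud Python SDK (ConversationV1); credentials, the workspace ID, and the exact parameter names may differ depending on your SDK version.

```python
from watson_developer_cloud import ConversationV1

# Assumed credentials and workspace id -- replace with your own.
conversation = ConversationV1(
    username='YOUR_USERNAME',
    password='YOUR_PASSWORD',
    version='2017-05-26')
WORKSPACE_ID = 'YOUR_WORKSPACE_ID'

def run_test_set(test_examples):
    """test_examples: list of (utterance, expected_intent) pairs."""
    hits, misses = 0, []
    for utterance, expected in test_examples:
        # Older SDK versions return a plain dict; newer ones may need .get_result().
        response = conversation.message(
            workspace_id=WORKSPACE_ID,
            input={'text': utterance})  # parameter name may vary by SDK version
        intents = response.get('intents', [])
        top = intents[0]['intent'] if intents else None
        if top == expected:
            hits += 1
        else:
            misses.append((utterance, expected, top))
    return hits / len(test_examples), misses

accuracy, misses = run_test_set(test)
print('accuracy: {:.1%}'.format(accuracy))
for utterance, expected, got in misses:
    print('MISS:', utterance, '| expected', expected, '| got', got)
```

Running this after every training round gives you a repeatable accuracy number to compare versions of the workspace against each other.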
Stage 2: Internal team test
In this second stage of testing we still focus on expected scenarios, but with more diverse examples. To achieve this diversity it is generally necessary to involve people who did not participate in the training. They, too, may use a spreadsheet to organize and guide the work, but ideally it will contain only the intents, the subjects the bot can talk about, without directly providing the examples. Involving people who have not yet had contact with the bot should generate positive and rich feedback.
During this step, try to investigate situations where certain examples return two or more intents with high confidence. You may need to change, add, or remove words or examples to differentiate the intents. The aim here is to have only one intent returned with high confidence.
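To spot those ambiguous cases automatically, you can request alternate intents and flag utterances where the top two confidences are close. The sketch below reuses the `conversation` client and test set from the previous snippet; the 0.15 gap is an assumption to tune for your own workspace.

```python
AMBIGUITY_GAP = 0.15  # assumed gap between the top two intents; tune as needed

def find_ambiguous(test_examples):
    """Flag utterances where two intents come back with similar confidence."""
    ambiguous = []
    for utterance, _expected in test_examples:
        response = conversation.message(
            workspace_id=WORKSPACE_ID,
            input={'text': utterance},
            alternate_intents=True)  # ask Watson for more than one intent
        intents = response.get('intents', [])
        if len(intents) >= 2:
            gap = intents[0]['confidence'] - intents[1]['confidence']
            if gap < AMBIGUITY_GAP:
                ambiguous.append((utterance, intents[0]['intent'], intents[1]['intent']))
    return ambiguous

for utterance, first, second in find_ambiguous(test):
    print('AMBIGUOUS:', utterance, '->', first, 'vs', second)
```

Each flagged pair of intents is a candidate for reviewing and rebalancing the training examples.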
Stage 3: Customer test
The third stage of the process validates scenarios that are possible but not exactly expected. We map this "unknown" by involving business users as beta testers. This step is usually the first contact with the bot for the users who helped define the intents and select the examples used in training. Expectations are usually high, and it should be explained that Watson is still in training and that not every question is expected to be answered correctly. In addition to recording the answers they found to be wrong, also ask them to record the expected response for each case. This new set of information will be fed back into your Conversation instance.
Watson Conversation allows you to define a threshold that limits the confidence level at which it will respond; below that, the input falls into the general "everything else" intent. With a lower confidence threshold your chatbot will make more “guesses”, which can lead to more errors and frustrated users, but it will also generate more feedback and improvement opportunities. Evaluate what users expect and how aggressive the team wants to be at this stage.
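The threshold is typically enforced in the dialog itself (for example, a node condition such as `intents[0].confidence < 0.5` routing to the fallback answer), but the same guard can live in your client application. The sketch below assumes a `response` dict as returned by the SDK's message call; the 0.5 value and the fallback text are assumptions.

```python
CONFIDENCE_THRESHOLD = 0.5  # assumed value; lower it for a more "aggressive" bot

def answer_or_fallback(response):
    """Return the bot's answer only when the top intent clears the threshold."""
    intents = response.get('intents', [])
    if not intents or intents[0]['confidence'] < CONFIDENCE_THRESHOLD:
        # Falls back instead of guessing; also a good place to log the utterance.
        return "Sorry, I did not understand that. Could you rephrase it?"
    return ' '.join(response['output']['text'])
```

Logging every utterance that hits the fallback branch gives you a ready-made backlog of improvement opportunities.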
Step 4: Early production test
Once the bot has been validated by the development team and by alpha and beta testers, it is time to engage end users. At this point Watson Conversation will be exposed to all the previous scenarios and, especially, to unexpected and/or impossible ones. The go-live, like any other software deployment, should be gradual: choose a smaller set of users, involve only one department, or start with a percentage of your site traffic. This partial entry into production allows you to learn and adapt, fix problems, and make adjustments before engaging a larger audience.
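One simple way to route only a percentage of traffic to the chatbot is a deterministic hash on the user identifier, so the same user always lands in the same bucket. The 10% figure and the user-id field are assumptions; any feature-flag mechanism you already have works just as well.

```python
import hashlib

ROLLOUT_PERCENTAGE = 10  # assumed starting slice of traffic

def in_rollout(user_id, percentage=ROLLOUT_PERCENTAGE):
    """Deterministically assign a user to the chatbot rollout bucket."""
    digest = hashlib.md5(user_id.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage

# Example: only users in the first 10% of buckets see the chatbot.
print(in_rollout('user-42'))
```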
Step 5: Continuous training
The process of training a bot never ends. You should keep improving it in an agile, iterative way. Every 5 or 10 days, collect a certain amount of conversations from your production environment and evaluate the questions and answers you get. A first possible evaluation uses a generic sample: take 200~300 random conversations from the base and classify each one according to the answer: (1) correct answer, (2) acceptable answer, (3) not acceptable answer. This evaluation gives you an idea of the general behavior of your chatbot. Repeat it often and compare the evolution over time.
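A minimal sketch of that periodic review is shown below. It assumes you have exported the conversation logs as a list of dicts with 'question' and 'answer' fields (the field names are assumptions), and it lets a reviewer assign one of the three labels by hand.

```python
import random
from collections import Counter

LABELS = {'1': 'correct', '2': 'acceptable', '3': 'not acceptable'}

def sample_and_review(conversations, sample_size=250):
    """Draw a random sample of logged conversations and label them manually."""
    sample = random.sample(conversations, min(sample_size, len(conversations)))
    tally = Counter()
    for conv in sample:
        print('\nUser:', conv['question'])  # assumed log fields
        print('Bot :', conv['answer'])
        choice = input('1=correct, 2=acceptable, 3=not acceptable: ').strip()
        tally[LABELS.get(choice, 'not acceptable')] += 1
    total = sum(tally.values())
    for label, count in tally.items():
        print('{}: {:.1%}'.format(label, count / total))
    return tally
```

Storing the resulting percentages for each review cycle makes it easy to plot the bot's evolution over time.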
The second possible evaluation focuses on finding the most problematic points. Pick cases the bot was not able to answer and those that escalated to a human agent. If you have some sort of feedback filter (like/dislike), those responses are also potential targets for inspection.
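Assuming each logged conversation also records whether it hit the fallback node, was escalated, or received a dislike (the field names below are assumptions), a simple filter surfaces the candidates for retraining.

```python
def problematic_cases(conversations):
    """Select logged conversations worth inspecting for retraining."""
    return [
        conv for conv in conversations
        if conv.get('fallback')           # hit the "everything else" answer
        or conv.get('escalated')          # handed off to a human agent
        or conv.get('feedback') == 'dislike'
    ]
```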
Resources and tools
There are a number of resources you can use to support the evaluation and testing of your Watson Conversation implementation. See which suits you best.
- Chatbottest is a compilation of questions that guide developers through the testing process of a chatbot. It is very useful and can be used as a product quality checklist.
- WCS Quality Analysis: Renato Leal is an IBMer from Brazil. On his GitHub you can find a notebook describing an approach to evaluating a chatbot's classification quality when using the Watson Conversation Service.
- Sites to test chatbots: There is a range of sites for automatically testing a chatbot.
- Bot Testing
- Dimon
- Qmetry Bot Tester
We hope this post has helped those who are venturing into the Watson Conversation Service. In the next post we will talk about the main machine learning metrics to apply when training Watson. See you!