Maintainable Accuracy

One of the main hurdles in building chatbots is the dialog understanding (DU) module, which turns a user utterance into a structured event; the dialog engine then processes that event to move the conversation forward and eventually connect users with their desired service.
There are two closely related understanding problems here: figuring out what the user wants (their intent), and figuring out the details of how they want it (the slot values). For example, when the user input is "I like to fly to Shanghai tomorrow", we would like to convert it into a frame event of the form book_flight(destination=Shanghai, date=…), where book_flight is the intent, Shanghai is the value of the destination slot, and so on. Today these natural language understanding problems are generally solved with deep learning methods. This means we first prepare a reasonably sized labeled dataset consisting of pairs of potential user utterances and their corresponding frame events. We then pass this dataset to natural language understanding (NLU) experts, who iteratively pick algorithms and hyperparameters to turn it into a production-worthy model that converts user utterances into structured events.
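To make the target of dialog understanding concrete, here is a minimal sketch of a frame event as a plain data structure. The names (FrameEvent, intent, slots) are illustrative, not a standard API:

```python
from dataclasses import dataclass, field

@dataclass
class FrameEvent:
    """A structured event produced by dialog understanding.
    The field names here are illustrative, not a standard schema."""
    intent: str
    slots: dict = field(default_factory=dict)

# "I like to fly to Shanghai tomorrow" ->
#   book_flight(destination=Shanghai, date=...)
event = FrameEvent(
    intent="book_flight",
    slots={"destination": "Shanghai", "date": "tomorrow"},
)
print(event.intent)                 # book_flight
print(event.slots["destination"])   # Shanghai
```

The dialog engine only ever sees such structured events, never raw text; that separation is what makes the DU module swappable.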
These problems are typically formulated as intent classification and sequence tagging. While both are well studied in the NLU research community, researchers mostly care about stationary accuracy: finding the model with the most accurate predictions given a fixed set of labeled examples. In the real world, we do not yet have 100%-accurate language understanding technology, so launching a chatbot is not the end but the beginning of iterative dialog understanding improvement. What we really need is maintainable accuracy: a dialog understanding model that can be fixed easily.
However, escalating every understanding issue to a dedicated NLU/ML team for a fix is not a good option: the latency is too high to maintain a reasonable user experience. For an existing GUI application development team, there is still a big learning curve before they can practice effective statistical-learning-based natural language understanding. For one, they are used to event extraction from user interactions being taken care of for them by libraries like reactjs; for another, the statistical way of thinking involved in ML is a sharp departure from their deterministic modeling of the application. This greatly limits the number of teams that can create an effective conversational user interface, and is partly responsible for the scarcity of good conversational experiences, despite their widely claimed benefits for end users.
Luckily, with recent advances in NLU research, we can create a tool that hides all the statistical learning concepts and implementation details, yet can still be used to hotfix any understanding issue simply by providing a labeled example. Let's use the intent prediction problem to explain why the typical machine learning approach, an intent classification model, fails to deliver maintainable accuracy, and how nonparametric few-shot methods can get us there.
The starting point of the intent prediction problem is a set of labeled examples {(x[i], y[i]), …}, where x[i] is the i-th exemplar user utterance drawn from X, the set of all possible user utterances, and y[i] is the corresponding intent, drawn from Y, the set of all intents the chatbot needs to understand.
The goal of intent classification training is to produce a model M, a function that maps any user utterance x to an intent y (for now, assume there is a catch-all Other intent for ease of discussion). While this problem has been well studied, the solution is not production-friendly: every time we add a new intent to the chatbot, or just want to fix an understanding bug, we need both expertise and sizeable labeled data to retrain the model, which is slow and costly.
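The shape of the classification formulation can be sketched with a toy stand-in for the learned model M. The word-overlap scoring below is a deliberately crude assumption, not a real classifier; the point is the interface, M: x -> y, and the fact that changing Y means rebuilding the model:

```python
from collections import Counter

# Toy labeled dataset of (utterance x, intent y) pairs.
DATA = [
    ("book a flight to paris", "book_flight"),
    ("i want to fly home tomorrow", "book_flight"),
    ("cancel my reservation", "cancel_booking"),
    ("please cancel the flight i booked", "cancel_booking"),
]

def train(data):
    """Build per-intent word counts: a toy stand-in for training M."""
    counts = {}
    for x, y in data:
        counts.setdefault(y, Counter()).update(x.split())
    return counts

def predict(model, x):
    """M: map utterance x to the intent with the most word overlap,
    falling back to a catch-all 'other' intent when nothing matches."""
    words = set(x.split())
    best, best_score = "other", 0
    for intent, counter in model.items():
        score = sum(counter[w] for w in words)
        if score > best_score:
            best, best_score = intent, score
    return best

model = train(DATA)
print(predict(model, "fly to shanghai tomorrow"))  # book_flight
```

Note that adding a new intent means editing DATA and calling train() again; with a real deep learning model, that retraining step is exactly the slow, expert-dependent part.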
However, by coupling recent pretraining-based, sample-efficient learning algorithms with a nonparametric approach, we can solve the problem as follows. Given a new user utterance x, we first use text retrieval to get a small set of exemplar utterances that closely resemble x; we then use a text equivalence model E to find the best-matching exemplar for x, and use its corresponding y as the structured understanding. The BERT-based text equivalence model E tells whether two sentences mean the same thing; it can be pretrained and then fine-tuned to work for any intent, even one not seen in the training dataset. This approach makes it very easy to jump-start and hotfix the understanding model, allowing one to reach a great user experience without lengthy ML training.
With tools and processes that can deliver maintainable accuracy by a regular GUI application development team, we can build effective CUI applications without NLU/MU experts, internal or external. This can overnight increase the capacity of building service chatbot, and potentially make the conversational experience a lot more available.
Reference:
- Xing Shi, Scot Fang, Kevin Knight, A BERT-based Unified Span Detection Framework for Schema-Guided Dialogue State Tracking
- Vahid Noroozi, Yang Zhang, Evelina Bakhturina, Tomasz Kornuta, A Fast and Robust BERT-based Dialogue State Tracker for Schema-Guided Dialogue Dataset