User-Friendly Spoken Dialog System
A spoken dialog system that connects “words” in a conversation with real-world “objects” to understand and learn new words
If asked to bring a snack, humans will carry out this task based on knowledge that the snack is on the shelf, etc. How can we develop a spoken dialog system which is equipped with such capacity and is truly able to support humans?
A speech-based UI that communicates based on its environment
This project involves research and development on spoken dialog systems capable of supporting people. For instance, when service robots are introduced into homes in the future, spoken dialog will be the first UI to be considered. This UI may also be useful in the mobility sector, for example in self-driving taxis.
Some speech-based UIs have already been realized in products such as smart speakers. Being able to give instructions using only your voice, such as asking for a weather forecast or turning on indoor lighting, is simple and convenient. However, for service robots and similar systems that handle physical objects, the AI needs to understand not only speech but also the environment.
Consider, for example, a user asking, “Bring me a snack.” A human would be able to carry out this task without difficulty because they know that the snack is on the shelf and that food should be placed on the table, but the same task is difficult for present-day spoken dialog systems. Then again, giving detailed instructions every time, such as “Bring the snack that’s on the shelf and place it on that table,” would also be inconvenient.
Similarly, when you ask someone to bring your medicine, you want them to bring water along with it, but they can only do this if they know that medicine is taken with water. Having to explain such tasks in every detail would be stressful and tedious.
With the advent of deep learning, AI has advanced by leaps and bounds. Dialog systems such as chatbots are now capable of fairly natural exchanges. However, current AI does not “understand” the content of a conversation the way humans do. For such systems, there is no connection between “words” in a conversation and “objects” in the real world. This is called the symbol grounding problem. HRI is working to resolve this problem in the development of spoken dialog systems.
The hardest part is “knowing what you don’t know”
Currently under consideration is a modular dialog system composed of multiple modules. First, Automatic Speech Recognition (ASR) converts speech into text. The text is then processed in sequence by Natural Language Understanding (NLU), the Dialog Manager, and the Generator (sentence generation). Finally, Text-to-Speech (TTS) converts the generated text into speech, which is output.
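The module chain described above can be illustrated with a minimal Python sketch. It only shows the order in which data passes through the five stages. Every name in it (DialogPipeline, the stub_* functions, the dictionary keys) is hypothetical and chosen for illustration, and the stubs are simple stand-ins rather than real speech or language components; the actual HRI system is not described at this level of detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DialogPipeline:
    """Hypothetical container that chains the five stages in order."""
    asr: Callable[[bytes], str]                 # speech -> text
    nlu: Callable[[str], Dict[str, str]]        # text -> meaning frame
    manager: Callable[[Dict[str, str]], Dict[str, str]]  # frame -> system action
    generator: Callable[[Dict[str, str]], str]  # action -> reply sentence
    tts: Callable[[str], bytes]                 # reply text -> speech

    def respond(self, audio: bytes) -> bytes:
        text = self.asr(audio)
        frame = self.nlu(text)
        action = self.manager(frame)
        reply = self.generator(action)
        return self.tts(reply)


# Stand-in stages (assumptions): audio is faked as UTF-8 encoded text.
def stub_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_nlu(text: str) -> Dict[str, str]:
    words = text.lower().rstrip(".!?").split()
    if words and words[0] == "bring":
        # Treat the last word as the requested object.
        return {"intent": "fetch", "object": words[-1]}
    return {"intent": "unknown"}


def stub_manager(frame: Dict[str, str]) -> Dict[str, str]:
    if frame["intent"] == "fetch":
        return {"act": "confirm_fetch", "object": frame["object"]}
    return {"act": "ask_repeat"}


def stub_generator(action: Dict[str, str]) -> str:
    if action["act"] == "confirm_fetch":
        return f"I will bring the {action['object']}."
    return "Sorry, could you say that again?"


def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = DialogPipeline(stub_asr, stub_nlu, stub_manager, stub_generator, stub_tts)
print(pipeline.respond(b"Bring me a snack."))
```

Because each stage is an independent callable, one module (for example NLU) can be replaced without touching the others, which is the main appeal of a modular design.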
Of these modules, our research group is focused on NLU, which analyzes user speech. In our research, we have adopted a hybrid system that combines a novel machine-learning-based NLU with a traditional rule-based NLU.
One of the difficulties in realizing a dialog system using AI is understanding new words. As times change, new colloquialisms and abbreviations emerge every day. For a robot to live side by side with people, it will need a mechanism for automatically handling words that were not registered in advance.
However, to recognize a word as new, you first need to realize that it is a word you do not know. Humans can usually do this, but for machine learning it is, in fact, difficult. A characteristic of machine learning is that even when the system does not know a word, it will still produce some kind of answer, often an incoherent one. Left as it is, the system will never realize that it has encountered a word it does not know.
While machine learning alone might one day be able to solve this problem, it will probably still require a number of breakthroughs. The purpose of using rule-based NLU in this project in combination with ML-based NLU is precisely to detect these “unknowns.” By adopting a hybrid method, we were able to obtain test results indicating an improvement in recognition accuracy.
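The idea of using rules to catch what a learned model cannot flag on its own can be sketched as follows. This is a toy illustration under stated assumptions, not HRI's method: the source does not say how the two NLU components are combined. The lexicon KNOWN_OBJECTS, the request pattern REQUEST, and the functions ml_nlu, rule_slot, and hybrid_nlu are all hypothetical. The "ML" stand-in is a simple character-overlap heuristic that mimics one property of a trained classifier: it always returns one of its known labels, even for a word it has never seen.

```python
import re
from typing import Dict, Optional

# Hypothetical lexicon of objects the system already knows.
KNOWN_OBJECTS = ["medicine", "snack", "water"]

# Hypothetical rule: "bring (me) (a|an|the|my|some) <object>".
REQUEST = re.compile(r"^bring (?:me )?(?:a |an |the |my |some )?(\w+)$")


def normalize(text: str) -> str:
    return text.lower().strip().rstrip(".!?")


def ml_nlu(text: str) -> Dict[str, str]:
    """Stand-in for a trained model: always answers with a known label."""
    words = normalize(text).split()
    last = words[-1] if words else ""
    # Pick the known object sharing the most characters with the last word.
    best = max(KNOWN_OBJECTS, key=lambda obj: len(set(obj) & set(last)))
    return {"intent": "fetch", "object": best}


def rule_slot(text: str) -> Optional[str]:
    """Rule-based extraction of the literal word in the object position."""
    match = REQUEST.match(normalize(text))
    return match.group(1) if match else None


def hybrid_nlu(text: str) -> Dict[str, Optional[str]]:
    guess = ml_nlu(text)
    slot = rule_slot(text)
    if slot is not None and slot not in KNOWN_OBJECTS:
        # The rule sees a word outside the lexicon: flag it instead of
        # trusting the model's forced guess.
        return {"intent": guess["intent"], "object": None, "unknown_word": slot}
    return {"intent": guess["intent"], "object": guess["object"], "unknown_word": None}


print(hybrid_nlu("Bring me a snack."))
print(hybrid_nlu("Bring me a boba."))
```

In this sketch the stand-in model on its own maps "boba" to some known object, whereas the hybrid result reports it as an unknown word. A dialog manager could then ask the user what the word refers to and register it.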
We still do not know whether AI can really achieve this kind of comprehension. And while there are considerable challenges in realizing a spoken dialog system that is truly user-friendly, we will continue our research and development with the aim of practical application.
Voice
TAKEUCHI, Johane
It is probably a traditional characteristic of Honda, but HRI has a culture where it is easy to talk and work with others no matter who they are. Although I work in a research position that is oriented toward engineering, I always find it rewarding when something functional that I created is put to use.