What if we could speak with our devices, cars, and homes just as easily as we do with our friends?
Conversation is the bedrock of human communication, a transformative tool that reveals what’s inside our heads and hearts. Voice is our primary means of connecting with others—and, increasingly, it’s how we want to engage with the machines around us, too.
Thanks to advances in speech recognition, artificial intelligence, neural networks, and processing power, we can tap into the capabilities of our machines simply by speaking. The virtual assistants that enable these interactions live inside all sorts of products, from wristwatches and smart speakers to microwave ovens and eyeglasses.
We’re now entering a new age of voice computing, and it’s enabling opportunities and experiences that once seemed possible only in sci-fi movies.
But there’s a hitch: Conversation is an incredibly hard problem for algorithms to master.
A look at the current landscape
Today’s AI technology allows machines to drive cars, predict stock prices, manage global-scale logistics, search for a cure for cancer, create sophisticated playlists, and defeat grandmasters at chess. However, most AI struggles to carry on a simple chat, like the one you’d have with a neighbor over a backyard fence.
Some of the latest efforts, such as Google Duplex, are closing the gap. But the art of human conversation, a task so simple that toddlers can do it, can be maddeningly difficult for even very sophisticated machines. This is one reason why many of us are still saddled at home with a dozen confusing remote controls that only one person, usually a teen, knows how to operate.
But that is changing fast. Virtual assistants are already on their way to becoming ubiquitous. According to eMarketer, nearly 40% of internet users use voice assistants such as Google Assistant, Alexa, Siri, Bixby, and Cortana. And with each passing month, conversational AI is becoming smarter and less constrained by specific use cases—which means not only that most households may soon trade in those myriad controllers for a voice assistant but also that those voice assistants will grow more human-like, intuitive, and useful.
From chatbots for customer service to voice assistants that let consumers access a business’s services through a smart speaker, voice-based AI is already part of the daily routine for millions of people. And demand for this kind of technology in every sector and walk of life is only going to increase: The global conversational AI market is expected to grow at a compound annual growth rate of just over 30% between 2019 and 2024, according to MarketsandMarkets.
The explosion of conversational AI use cases has been astounding—and I’m optimistic that the industry will continue to deliver as we make progress on the fundamental challenges of conversation. Gartner predicts: “By 2021, nearly one in six customer service interactions globally will be handled by AI. We expect that 40% of chatbot/virtual assistant applications launched in 2018 will have been abandoned by 2020.”
The likelihood that so many applications will be quickly abandoned points to the improvements that still need to be made—which raises two obvious questions: How do these conversational technologies work? And how are they being made better?
Listening is easy. Understanding is tough.
Conversation is second nature for humans, but conversational AI is extraordinarily difficult for machines. Why is that? Let’s deconstruct what happens when you participate in a conversation.
1. First, you listen. You process sound waves, filter out background noise and other voices, compensate for the speaker’s accent (and the head cold they might have), and turn that audio signal into a sequence of words. For machines, this phase is called speech recognition, or speech-to-text conversion; together with step 6, it’s sketched in code just after this list.
2. Next, you understand. This involves correcting misheard words as well as sorting through homonyms, homophones, and unknown words. For example, the word “bank” has different meanings when you’re fishing, driving, shooting pool, or handling money. This is called semantic understanding, and for machines it’s where things start to get tricky (see the word-sense sketch after this list).
3. You consider the context. This includes conversational context (what was previously said in this conversation), personal context (your relationship with the speaker), situational context (what’s happening where you’re talking), and world context (what’s happening on a more global scale). For example, someone may ask you how the weather looks on a cloudy day, but what they really want to know is whether they should bring an umbrella when they go outside or cancel the ski trip they were planning for the weekend. This is called natural language understanding, or NLU. Machines in general still struggle with this, so they often fail to grasp the underlying meaning, intent, and purpose of an utterance (a toy example follows this list).
4. After the speech is understood, you need to determine the kind of message that will satisfy the question. This is called response generation, and it’s influenced by the same contextual nuances we’ve seen before, such as the identity of the speaker and the situation in which the exchange occurs. A request like “Do you take American Express?” might be answered with a flat “no.” That’s informative, but it’s not very useful. “No, but we take Visa” is a more natural human response. To do this, you need to understand the speaker’s underlying intent (in this case, to pay with a credit card) and respond in a manner that satisfies the intent, not just the literal interpretation of the words. Because NLU is difficult, generating an appropriate response that’s relevant to the speaker’s intent and purpose is also difficult (this example is sketched in code after the list).
5. You now need to determine the specific expression of the message, or the actual words you’ll use to respond. This is called natural language generation, or NLG. Machines typically do this using templated responses, which sound unnatural to humans (the sketch after this list shows why). Choosing the right words is a difficult process, even for humans, as it’s dependent on context (see step 3) and subject to misinterpretation (see step 2).
6. And finally, you need to say those words out loud. This is called speech synthesis, or text-to-speech conversion. Like the first step, this one is relatively simple for machines to complete (both bookends appear in the first code sketch below); it’s everything in between that remains challenging.
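To ground the bookends, here’s a minimal sketch of steps 1 and 6 using Google Cloud’s client libraries. It assumes you’ve installed the google-cloud-speech and google-cloud-texttospeech packages and configured application-default credentials; the audio format and language settings are illustrative choices, not requirements.

```python
# A minimal sketch of the two bookend steps: speech-to-text (step 1) and
# text-to-speech (step 6), via the Google Cloud client libraries.
# Assumes `pip install google-cloud-speech google-cloud-texttospeech` and
# application-default credentials; 16 kHz LINEAR16 is an illustrative choice.
from google.cloud import speech, texttospeech


def transcribe(audio_bytes: bytes) -> str:
    """Step 1: turn raw audio into a sequence of words."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    audio = speech.RecognitionAudio(content=audio_bytes)
    response = client.recognize(config=config, audio=audio)
    # Each result carries ranked alternatives; take the top hypothesis.
    return " ".join(r.alternatives[0].transcript for r in response.results)


def speak(text: str) -> bytes:
    """Step 6: turn response text back into audio (MP3 bytes)."""
    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    return response.audio_content
```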
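Step 2’s “bank” problem has a classic heuristic answer in the Lesk algorithm, which picks the dictionary sense whose definition best overlaps the surrounding words. Here’s a sketch using NLTK’s implementation (assuming `pip install nltk`); the fact that Lesk still guesses wrong on plenty of sentences is itself a measure of how hard semantic understanding is.

```python
# Word-sense disambiguation for step 2 using NLTK's Lesk implementation.
# The WordNet corpora are fetched once on first run.
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

for sentence in (
    "I sat on the bank of the river and cast my fishing line",
    "I deposited my paycheck at the bank on Main Street",
):
    # lesk() returns the WordNet synset whose gloss overlaps the context most.
    sense = lesk(sentence.lower().split(), "bank")
    if sense:
        print(f"{sentence!r} -> {sense.name()}: {sense.definition()}")
```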
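There’s no tidy API for step 3, so the following toy sketch (every name and rule in it is hypothetical) simply illustrates the point: the same words deserve different answers depending on what the speaker is actually planning to do.

```python
# A toy illustration of step 3: identical words, different answers, depending
# on context. Real systems model this with dialogue state tracking and
# learned models, not hand-written rules like these.
def answer_weather_question(forecast: str, context: dict) -> str:
    """Answer "How does the weather look?" in light of the speaker's plans,
    not just the literal forecast."""
    if context.get("upcoming_plans") == "ski trip" and "rain" in forecast:
        return "Rain all weekend, so you may want to rethink the ski trip."
    if context.get("about_to_go_outside") and "rain" in forecast:
        return "Looks like rain; take an umbrella."
    return f"The forecast says: {forecast}."


print(answer_weather_question("rain showers", {"upcoming_plans": "ski trip"}))
print(answer_weather_question("rain showers", {"about_to_go_outside": True}))
```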
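Finally, here are steps 4 and 5 in miniature, using the American Express example from above. This, too, is a hypothetical sketch, with keyword lookup standing in for a trained NLU model, but it shows both the intent-aware response and the slot-filled template that produces it.

```python
# A toy sketch of steps 4 and 5. All names are hypothetical; the point is
# the difference between answering the literal words and answering the
# underlying intent (to pay by credit card).
ACCEPTED_CARDS = {"visa", "mastercard"}  # hypothetical merchant config


def generate_response(card: str) -> str:
    """Step 4: decide *what* to say, addressing the payment intent."""
    if card.lower() in ACCEPTED_CARDS:
        return realize("accept", card=card)
    return realize("decline_with_alternatives", alternatives=sorted(ACCEPTED_CARDS))


def realize(template: str, **slots) -> str:
    """Step 5: decide *how* to say it. Slot-filled templates like these are
    one reason today's assistants can sound canned."""
    if template == "accept":
        return f"Yes, we take {slots['card'].title()}."
    options = " or ".join(c.title() for c in slots["alternatives"])
    return f"No, but we take {options}."


print(generate_response("American Express"))  # -> No, but we take Mastercard or Visa.
```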
Closing the gap
What’s fascinating is that we humans do all of this in a fraction of a second—seemingly without thinking and often even before the other person has completed their turn—while simultaneously engaged in other complex activities like washing dishes, changing diapers, or dancing in a noisy disco.
All of this is very hard to model in a computer, and most machines aren’t anywhere close to doing it reliably and consistently except in highly constrained situations—though researchers, such as the Google Brain team that recently introduced Meena, a neural conversational model built for open-ended, human-like chat, are closing the gap all the time.
Such progress is great news, adding to the numerous ways in which conversational AI is already helping businesses to:
- Build interfaces that enable near-human conversation
- Offer more personalized and intuitive customer care
- Understand how people feel about their products and brand
- Use machine learning models to score the perceived impact a comment might have on a conversation
- Add multilingual support to serve more people around the world
These solutions are available as APIs, meaning the technology has been packaged up so that you can simply use it without having to understand how the underlying algorithms work.
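As a concrete example, here’s roughly what a sentiment-analysis call looks like with the Cloud Natural Language client library (assuming `pip install google-cloud-language` and application-default credentials):

```python
# A minimal example of consuming one of these packaged APIs: sentiment
# analysis with the Cloud Natural Language client library.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The new voice assistant is surprisingly helpful!",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
# score runs from -1.0 (negative) to 1.0 (positive); magnitude reflects strength.
print(f"score={sentiment.score:.2f}, magnitude={sentiment.magnitude:.2f}")
```

The client hides tokenization, models, and serving entirely; you send text and get scores back.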
The building blocks of conversational AI can improve many tasks outside of the office as well. Voice recognition and real-time conversation across an assortment of devices will be an essential part of the smart home, used in everything from temperature control to calendar creation to online purchasing; the same capabilities will extend to smart cars, retail shopping, and many other scenarios.
We’re on a path to creating solutions that are much closer to what we need. The AI community has already made enormous strides, and I’m optimistic we’ll make even greater progress in the next few years. When it comes to human-level conversational AI, let’s keep the discussion going—there are a lot of exciting things to talk about!
Visit cloud.google.com to learn more about AI solutions for speech-to-text, text-to-speech, natural language, sentiment analysis, conversational interfaces, perceived impact, and translation.