I bet that the speech-driven reality presented in the movie Her is a realistic view of the foreseeable future. Here are some arguments in favor of this.
We have mature hardware technology that can already provide a reasonable user experience with a speech-controlled UI in the form of wireless devices with an integrated assistant (e.g., Google Assistant, Siri, Alexa).
Also, the relevant software technology evolves quickly in terms of accuracy and coverage of speech processing (e.g., background noise and strong accents do not currently present substantial obstacles for speech understanding).
Speech is a convenient interaction medium: because they don’t require hands, speech interfaces can run in parallel with other activities such as cooking, driving, washing dishes, or painting. So far, convenience has been a major factor in the adoption of initially imperfect systems, and this use drives their improvement. All we need is a sufficient number of users to reach the tipping point for large-scale adoption. These users will provide data proving that this is a continuously growing market and that there is added value in investing in implementing various types of interactions in the assistants.
In the meantime, usage is also going to drive ML research innovation in finding ways to optimize and generalize the process of creating new interactions, rather than implementing each of them from scratch. Moreover, the more people start using speech interfaces, the greater the incentive for researchers to advance new areas of research, confronting barriers in both hardware and software (e.g., blocking sound from other people talking in the same space).
Google broke into the crowded search market of its time through simplicity, focusing on a single utility: search. It was easy, unambiguous, and convenient to use. This allowed for steady growth in users that eventually reached the point at which their usage data made the results better.
We have all the paving stones for the road to simple and convenient speech-driven interaction. And yes, it is already quite noisy and annoying when people talk on their phones in public spaces. But there is no way of stopping it, so stop complaining. People walk down the street talking on their phones or headsets, and do the same on buses, trains, and in other public places. We do it because it is convenient: you can do it while walking or while sitting during your commute, and you don’t have to take anything out of your bag or hold anything in your hand; you just talk. Many people already wear noise-canceling headsets while walking down the street or working at their desks, which makes it even easier to adopt speech interaction with assistants in public spaces.
There are some counterexamples; for example, it would be quite annoying if everybody at home talked to their devices and didn’t communicate with each other. But just as texting and social media gave us the ability to communicate instantly at any time, and we developed new social norms for when and how to use them (or at least some of us did), I believe we will in the same way develop new social norms for where and how to use speech interaction.
In the movie Her, there is a scene in which the main character is coming home from work and walking across a large outdoor plaza. He is talking to his AI assistant (and lover) through some kind of wireless headset device in his ear. Across the plaza, there are many people who also appear to be leaving work and are also talking on their headsets, presumably to their computers. My reaction to this scene was that it was an improbable future. Initially, my primary reason was simply that I thought the social pressure would prevent this from being widespread, just as the social pressure not to talk to someone on your mobile phone in public prevents most of us from doing it and leaves us feeling annoyed when others do.
I expressed my skepticism to Dr. Aroyo, who, on the contrary, found this a nearly certain future. We started an informal bet at that time, for and against the prediction that assistant voice interfaces would never become mainstream: that most people would continue to use keyboards, mice, and touch screens to interact with machines. At the instigation of the AI Bookies column, we set out to formalize the bet.
Like many bets and predictions, the need to avoid an open-ended condition drove us to specify a time limit: the year 2025.
The first obstacle we encountered was turning “mainstream” into something decidable. What objective criteria would we use? As I examined my motivations and recast the problem in more concrete, provable/disprovable terms, I found more rigorous and serious reasons to believe speech will never be a mainstream interface.
While speech is convenient for humans interacting with each other, we use it because we don’t have anything better, such as buttons. If there were a button to press to get a kid to clean their room, no parent would waste time with words. Speech is indeed mostly a waste of time.
It’s a waste of time because a lot of speech is negotiation. If I want someone to do something, I have to do more than just tell them; I have to negotiate with them using sticks or carrots until they agree to do it or I move on to find someone else. Star Trek’s Picard’s “make it so” works only in special circumstances and organizational structures. With an assistant, this negotiation is not necessary, but there is a very important second part to the negotiation: the meaning. And therein lies the rub.
Speech is a lousy way to communicate meaning. When designing applications that actually do the things we want our assistants to do, a lot of work goes into the user interface. Arguably it was Google’s user interface, not the quality of the search results, that won the day during the search engine wars; Google’s interface was simple—just a text box. But an assistant is much more than a search box, and each bit of functionality needs to be implemented and the interface designed. Take a simple example: ordering a taxi using Uber or Lyft or equivalent. I could say to my assistant, “Hey Google, order me a taxi.” “Where are you going?” it would ask. “To Matsui’s Sushi.” “Do you mean Matsui’s hair salon or Matsui’s Japanese Restaurant?” “The restaurant, obviously, you idiot!” “Great. Your taxi will arrive in five minutes.” My obnoxiousness notwithstanding, Matsui’s Japanese Restaurant is 140 miles away. I don’t realize this because I’ve mistaken the name “Matsui” for “Musashi,” and I’ve just ordered a 140-mile taxi ride. Now maybe, hypothetically, the assistant here is really smart, and it realizes something is unusual, and instead of ordering the taxi it says, “Matsui’s Japanese Restaurant is 140 miles away; are you sure you want me to order a taxi?” At which point I say, “No!” and perhaps I continue negotiating with my assistant about it, or perhaps it knows my history and enough about speech similarity and American confusions about Japanese names to figure out what I meant. But none of this intelligence would be necessary if I was using the right interface for ordering a taxi: a map! At some point pretty early on—depending, of course, on my mobility and access to my hands and so forth—the negotiation of the meaning becomes so inefficient that I give up and switch to the actual app that provides a well-honed interface to the thing I want done.
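The hypothetical “smarter” assistant in the taxi anecdote can be sketched as a simple plausibility check before acting on an ambiguous request. Everything here—the `Destination` type, the candidate list, and the 50-mile threshold—is invented for illustration; no real assistant API works this way.

```python
from dataclasses import dataclass

@dataclass
class Destination:
    name: str
    miles_away: float

# Candidate matches for the spoken query "Matsui's" (invented data).
CANDIDATES = [
    Destination("Matsui's hair salon", 3.2),
    Destination("Matsui's Japanese Restaurant", 140.0),
]

PLAUSIBLE_TAXI_MILES = 50.0  # assumed threshold for an unremarkable ride

def needs_confirmation(dest: Destination) -> bool:
    """Return True when the assistant should ask 'are you sure?'."""
    return dest.miles_away > PLAUSIBLE_TAXI_MILES

def respond(dest: Destination) -> str:
    """Either confirm the booking or push the decision back to the user."""
    if needs_confirmation(dest):
        return (f"{dest.name} is {dest.miles_away:.0f} miles away; "
                "are you sure you want me to order a taxi?")
    return f"Great. Your taxi to {dest.name} will arrive in five minutes."
```

The point of the sketch is how much of this logic a map interface gets for free: showing the destination's location makes the 140-mile mistake visually obvious, with no negotiation at all.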
Negotiating meaning is certainly an interesting and very hard problem, one studied and emphasized in the past but less so today, and one reason for that decline in emphasis is economic: cheaper alternative solutions exist, which makes it unlikely that we will get an effective solution that does not annoy us too much.
There are clearly some things for which a speech interface is effective (e.g., question answering), and in certain conditions the best possible option (e.g., while driving). Furthermore, there has been impressive progress in speech understanding and generation recently. However, speech will not become the primary interface to our assistants and devices simply because there are far more examples of cases for which it is not the best, and many for which it is simply terrible.
We discussed many aspects of this disagreement with the other bookies, who will act as adjudicators, to help us hone down the adjudicatable portion of the bet. Ultimately, the bettors agreed that the metric should be some measure of the number of minutes spent talking to devices in a year, normalized by the number of devices available. Dr. Welty argues that a graph of this usage ratio over the years until 2025 will never be more than linear. Dr. Aroyo argues that it will reach a tipping point, defined by a superlinear bend in the usage curve before that time.
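The bet's metric and the linear-versus-superlinear criterion can be made concrete with a short sketch. This assumes we can obtain yearly totals of minutes spoken to assistants and counts of assistant-capable devices; the numbers in the usage examples are placeholders, not real market figures.

```python
def usage_ratio(minutes_per_year, devices_per_year):
    """Minutes spent talking to devices, normalized by devices available."""
    return [m / d for m, d in zip(minutes_per_year, devices_per_year)]

def is_superlinear(series, tolerance=1e-9):
    """Test for a superlinear bend in a yearly series.

    A linear series has constant first differences, so all of its second
    differences are (approximately) zero; an upward bend shows up as a
    strictly positive second difference somewhere in the series.
    """
    first = [b - a for a, b in zip(series, series[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    return any(d > tolerance for d in second)
```

Under this reading, Dr. Welty wins if `is_superlinear` stays `False` on the ratio series through 2025 (e.g., a steady `[10, 20, 30, 40]` minutes per device), while a doubling curve like `[10, 20, 40, 80]` would hand the bet to Dr. Aroyo.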
Several sources might provide an approximation of this metric. TechCrunch and SearchEngineLand report on the assistant device market, giving us a normalization factor. Voicebot.ai and alphametic.com report on aspects of the speech understanding industry and perhaps can be persuaded to gather data about the amount of time spent speaking to devices. We could not find precisely the data we want being gathered today, so we will report back in the next issue.