Alexa, Please Save the World
Speech recognition is going to change the way we compute and how we think
KIDS TODAY will grow up thinking a keyboard is some antediluvian tool like an abacus or butter churn, which they might encounter only because it’s nailed to a wall of a TGI Fridays.
Voice is taking over as the way we interact with technology and input words. Actually, it was supposed to have taken over a long time ago. Back in 1998, I wrote a column for USA Today saying that “speech-recognition technology looks ready to change the world,” though I also noted that when I tried to say “two turntables and a microphone” into the latest and greatest speech-recognition software, it thought I said something like “two torn labels and an ice cream cone.” Turns out that was about 20 years too soon.
But the technology works now. Microsoft, Google, Amazon, IBM, China’s Baidu and a handful of startups have been driving hard to build artificial intelligence software that can understand nuanced speech and reply coherently. Late last year, Microsoft said its speech-recognition technology had caught up to human understanding. Its “word error rate” got down to 5.9 percent, about the same as people who had transcribed the same conversation, and much better than the word error rate in any conversation between a parent and his or her teenage son.
Google’s speech-recognition technology is learning human languages at a rapid clip. In August, it added 30 new ones, including Azerbaijani and Javanese, bringing the total to 119. IBM’s Watson technology has become well known for interacting with humans; you’ve probably seen the commercial showing Watson talking with Bob Dylan. OK, it’s an ad. But even implying that a machine can comprehend what Dylan is saying is groundbreaking.
Companies are lining up to get ready for a flood of speech-driven commerce. The main reason Amazon wants to get Alexa into your home is so you’ll get used to shopping by just speaking to the thing. In August, Google and Walmart announced a partnership that will allow users of the Google Home gadget to use speech to buy directly from the world’s biggest retailer. “We are trying to help customers shop in ways that they may have never imagined,” said Marc Lore, CEO of Walmart ecommerce U.S. (Lore joined Walmart when it bought the online retailer he founded, Jet.com.) All around retail, chatbot shopping through apps from the likes of WeChat, Kik and Hipmunk is the new hot thing. Most shopping bots today are text-based but are moving toward speech. According to comScore, half of all searches will be voice searches by 2020, and search is most consumers’ first step toward buying.
Ever since Apple introduced Siri in 2011, we’ve come to expect our phones and apps to comprehend spoken queries, which is an underappreciated, monumental achievement after so many decades of trying. It’s like the turning point in the 1910s, when people started to expect that airplanes would actually fly.
IBM demonstrated the first voice-recognition machine, called Shoebox, at the 1962 World’s Fair in Seattle. The device could understand all of 16 words: the numbers zero to nine and instructions like “plus” and “minus.” To let you know it understood you, Shoebox would do simple math and print the result.
In the 1970s, the U.S. military’s research arm, the Defense Advanced Research Projects Agency, or DARPA, funded a massive speech-recognition program that got the total of words understood by a machine up to about 1,000, still far from practical yet roughly equivalent to our current president’s vocabulary. In the 1980s, James Baker, a professor at Carnegie Mellon University, co-founded Dragon Systems, based on his speech-recognition research. In 1990, Dragon’s first consumer dictation-taking product cost $9,000 and mostly just frustrated users. In 1998, when I stopped in at IBM Research to check on progress in the field, speech recognition was still not good enough for everyday use.
Why has the technology suddenly gotten so good? The onslaught since 2007 of mobile devices and cloud computing has allowed massive data centers operated by giants such as Google and Amazon to learn language from hundreds of billions of conversations around the world. Every time you ask something of an Alexa or a Watson, the system learns a little more about how people say stuff. Because the software can learn, no one has to punch in data about every slang word or accent. The software will keep improving, and soon it will understand our speech better than the typical human does.
And that could radically change the world. Shopping may be an early application, but the technology can even alter the way we think. A couple of generations learned to think with a keyboard and mouse, a tactile experience. “The creative process is changed,” a Dragon executive named Joel Gould told me back in 1998, anticipating changes. “You’ll have to learn to think with your mouth.” In a way, it’s taking us back to the way our brains were meant to work: the way people thought and created for thousands of years before pens and typewriters and word processors. Homer didn’t need to type to conjure up The Iliad.
In a speech-processing world, illiteracy no longer has to be a barrier to a decent life. Google is aggressively adding languages from developing nations because it sees a path to consumers it could never before touch: the 781 million adults who can’t read or write. By just speaking into a cheap phone, this swath of the population could do basic things like sign up for social services, get a bank account or at least watch cat videos.
The technology will affect things in odd, small ways too. One example: At a conference not long ago, I listened to the head of Amazon Music, Steve Boom, talk about the impact Alexa will have on the industry. New bands are starting to realize they must have a name people can pronounce, unlike MGMT or Chvrches. When I walked over to my Alexa and asked it to play “Chu-ver-ches,” it gave up and played “Pulling Muscles From the Shell” by Squeeze.
In fact, as good as the technology is today, it still has a lot to learn about context. I asked Alexa, “What is ‘two turntables and a microphone’?” Instead of replying with anything about Beck, she just said, “Hmm, I’m not sure.” But at least she didn’t point me to the nearest ice cream cone.
NO RAGE AGAINST THE MACHINE: Tech executives at the CES trade show in Las Vegas in 2017. Many companies are building artificial intelligence software that can understand nuanced speech and reply coherently.