Boston Sunday Globe

How Amazon taught Alexa to speak in an Irish brogue

Data scientists tweak digital assistant

By Bernhard Warner

DUBLIN — Like Henry Higgins, the phonetician from George Bernard Shaw’s play “Pygmalion,” Marius Cotescu and Georgi Tinchev recently demonstrated how their student was trying to overcome pronunciation difficulties.

The two data scientists, who work for Amazon in Europe, were teaching Alexa, the company’s digital assistant. Their task: to help Alexa master an Irish-accented English with the aid of artificial intelligence and recordings from native speakers.

During the demonstration, Alexa spoke about a memorable night out. “The party last night was great craic,” Alexa said with a lilt, using the Irish word for fun. “We got ice cream on the way home, and we were happy out.”

Tinchev shook his head. Alexa had dropped the R in “party,” making the word sound flat, like pah-tee. Too British, he concluded.

The technologists are part of a team at Amazon working on a challenging area of data science known as voice disentanglement.

It’s a tricky issue that has gained new relevance amid a wave of AI developments. Researchers believe that cracking this speech-and-technology puzzle can help make AI-powered devices, bots, and speech synthesizers more conversational: that is, capable of pulling off a multitude of regional accents.

Tackling voice disentanglement involves far more than grasping vocabulary and syntax. A speaker’s pitch, timbre, and accent often give words nuanced meaning and emotional weight.

Linguists call this language feature “prosody,” something machines have had a hard time mastering.

Only in recent years, thanks to advances in AI, computer chips, and other hardware, have researchers made strides in cracking the voice disentanglement issue, transforming computer-generated speech into something more pleasing to the ear.

Such work may eventually converge with an explosion of “generative AI,” a technology that enables chatbots to generate their own responses, researchers said. Chatbots like ChatGPT and Bard may someday fully act on users’ voice commands and respond verbally. At the same time, voice assistants like Alexa and Apple’s Siri will become more conversational, potentially rekindling consumer interest in a tech segment that had seemingly stalled, analysts said.

Getting voice assistants such as Alexa, Siri, and Google Assistant to speak multiple languages has been an expensive and protracted process. Tech companies have hired voice actors to record hundreds of hours of speech, which helped create synthetic voices for digital assistants. Advanced AI systems known as “text-to-speech models” — because they convert text to natural-sounding synthetic speech — are just beginning to streamline this process.
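For readers who want to hear the basic text-in, audio-out contract in action, a few lines of Python with the open-source pyttsx3 library are enough. This drives the operating system’s classic synthesis voices, not the neural models described in this article, but the interface is the same shape:

```python
# Minimal text-to-speech using pyttsx3 (OS synthesis voices, not the
# neural text-to-speech models discussed in the article): text goes in,
# spoken audio comes out, and the voice is swappable.
import pyttsx3

engine = pyttsx3.init()
for voice in engine.getProperty("voices"):  # list the installed voices
    print(voice.id, voice.name)

engine.say("The party last night was great craic.")
engine.runAndWait()                         # block until playback finishes
```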

The technology “is now able to create a human’s voice and synthetic audio based on a text input, in different languages, accents, and dialects,” said Marion Laboure, a senior strategist at Deutsche Bank Research.

Amazon has been under pressure to catch up to rivals like Microsoft and Google in the AI race. In April, Andy Jassy, Amazon’s CEO, told Wall Street analysts that the company planned to make Alexa “even more proactive and conversational” with the help of sophisticated generative AI. And Rohit Prasad, Amazon’s head scientist for Alexa, told CNBC in May that he saw the voice assistant as a voice-enabled, “instantly available, personal AI.”

Irish Alexa made its commercial debut in November, after nine months of training in comprehending an Irish accent and then speaking it.

“Accent is different from language,” Prasad said in an interview. AI technologies must learn to extricate the accent from other parts of speech, such as tone and frequency, before they can replicate the peculiarities of local dialects; for instance, maybe the A is flatter and T’s are pronounced more forcefully.

These systems must figure out these patterns “so you can synthesize a whole new accent,” he said. “That’s hard.”
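One way to picture what Prasad describes: in many research text-to-speech systems, the speaker’s voice and the accent are learned as separate embeddings that can be recombined. The sketch below is a hypothetical, minimal illustration of that conditioning idea, not Amazon’s actual architecture:

```python
# A minimal sketch of disentangled conditioning in a text-to-speech model
# (hypothetical illustration, not Amazon's architecture): phonemes say WHAT
# to speak, a speaker embedding says WHOSE voice, and a separate accent
# embedding says HOW it is pronounced, so voice and accent recombine freely.
import torch
import torch.nn as nn

class DisentangledTTS(nn.Module):
    def __init__(self, n_phonemes, n_speakers, n_accents, dim=128):
        super().__init__()
        self.phonemes = nn.Embedding(n_phonemes, dim)
        self.speakers = nn.Embedding(n_speakers, dim)
        self.accents = nn.Embedding(n_accents, dim)
        self.decoder = nn.GRU(dim * 3, dim, batch_first=True)
        self.to_acoustic = nn.Linear(dim, 80)  # e.g., mel-spectrogram frames

    def forward(self, phoneme_ids, speaker_id, accent_id):
        p = self.phonemes(phoneme_ids)                         # (batch, seq, dim)
        s = self.speakers(speaker_id).unsqueeze(1).expand_as(p)
        a = self.accents(accent_id).unsqueeze(1).expand_as(p)
        h, _ = self.decoder(torch.cat([p, s, a], dim=-1))
        return self.to_acoustic(h)                             # acoustic features

# The same voice with two different accents: only the accent index changes.
model = DisentangledTTS(n_phonemes=50, n_speakers=4, n_accents=3)
ids = torch.randint(0, 50, (1, 12))
british = model(ids, torch.tensor([0]), torch.tensor([0]))
irish = model(ids, torch.tensor([0]), torch.tensor([1]))
```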

Harder still was trying to get the technology to learn a new accent largely on its own, from a different-sounding speech model.

That’s what Cotescu’s team tried in building Irish Alexa. They relied heavily on an existing speech model of primarily British-English accents — with a far smaller range of American, Canadian, and Australian accents — to train it to speak Irish English.

The team contended with various linguistic challenges of Irish English. The Irish tend to drop the H in “th,” for example, pronouncing the letters as a hard T or a D, making “bath” sound like “bat” or even “bad.” Irish English is also rhotic, meaning the R is overpronounced. That means the R in “party” will be more distinct than what you might hear out of a Londoner’s mouth. Alexa had to learn these speech features and master them.
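Features like th-stopping can be written down as simple phoneme rewrite rules. The toy code below illustrates the rules just described, on ARPAbet-style phoneme symbols; it is not Amazon’s pipeline. Rhoticity, by contrast, is less a rewrite than a rule against deletion: the R must simply survive.

```python
# Toy illustration (not Amazon's system) of Irish th-stopping as a phoneme
# rewrite rule: voiceless "th" (TH) hardens to T and voiced "th" (DH) to D,
# so "bath" comes out like "bat".
TH_STOPPING = {"TH": "T", "DH": "D"}

def apply_th_stopping(phonemes):
    """Rewrite a phoneme sequence with Irish th-stopping."""
    return [TH_STOPPING.get(ph, ph) for ph in phonemes]

print(apply_th_stopping(["B", "AA", "TH"]))  # ['B', 'AA', 'T']  "bath" ~ "bat"
print(apply_th_stopping(["DH", "IH", "S"]))  # ['D', 'IH', 'S']  "this" ~ "dis"

# Rhoticity works the other way: the R in "party" (P AA R T IY) must be
# kept, where a non-rhotic British voice would drop it (P AA T IY).
```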

Irish English, said Cotescu, who is Romanian and was the lead researcher on the Irish Alexa team, “is a hard one.”

The speech models that power Alexa’s verbal skills have been growing more advanced in recent years. In 2020, Amazon researchers taught Alexa to speak fluent Spanish from an English-language-speaking model.

Cotescu and the team saw accents as the next frontier of Alexa’s speech capabilities. They designed Irish Alexa to rely more on AI than on actors to build up its speech model.

As a result, Irish Alexa was trained on a relatively small corpus — about 24 hours of recordings by voice actors who recited 2,000 utterances in Irish-accented English.
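In machine-learning terms, this is transfer learning: start from a model pretrained on a large English corpus and adapt it with a small amount of new data. The sketch below is a hypothetical, minimal version of that setup, with placeholder tensors standing in for the Irish recordings; it is not Amazon’s training code:

```python
# Sketch of a transfer-learning setup (hypothetical, not Amazon's code):
# freeze a model pretrained on a large corpus and fine-tune a small new
# layer on the small Irish-accented corpus the article describes.
import torch
import torch.nn as nn

pretrained = nn.GRU(128, 128, num_layers=4, batch_first=True)  # stand-in for the big model
accent_head = nn.Linear(128, 80)                               # small new layer to adapt

for param in pretrained.parameters():
    param.requires_grad = False        # keep the pretrained weights fixed

optimizer = torch.optim.Adam(accent_head.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

irish_batch = torch.randn(8, 100, 128)  # placeholder for encoded Irish utterances
target = torch.randn(8, 100, 80)        # placeholder acoustic targets

optimizer.zero_grad()
features, _ = pretrained(irish_batch)
loss = loss_fn(accent_head(features), target)
loss.backward()                         # gradients reach only the new layer
optimizer.step()
```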

At the outset, when Amazon’s researchers fed the Irish recordings to the still-learning Irish Alexa, some weird things happened.

Letters and syllables occasionally dropped out of the response. S’s sometimes stuck together. A word or two, sometimes crucial ones, were inexplicably mumbled and incomprehensible. In at least one case, Alexa’s female voice dropped a few octaves, sounding more masculine. Worse, the masculine voice sounded distinctly British, the kind of goof that might raise eyebrows in some Irish homes.

“They are big black boxes,” Tinchev, a Bulgarian national who is Amazon’s lead scientist on the project, said of the speech models. “You have to have a lot of experimentation to tune them.”

That’s what the technologists did to correct Alexa’s “party” gaffe. They disentangled the speech, word by word, phoneme (the smallest audible sliver of a word) by phoneme, to pinpoint where Alexa was slipping and fine-tune it. Then they fed Irish Alexa’s speech model more recorded voice data to correct the mispronunciation.

The result: the R in “party” returned. But then the P disappeared.

So the data scientists went through the same process again. They eventually zeroed in on the phoneme that contained the missing P. Then they fine-tuned the model further so the P sound returned and the R didn’t disappear. Alexa was finally learning to speak like a Dubliner.
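That hunt for the missing R, and then the missing P, amounts to aligning the phonemes the model should have produced against the ones it did. A rough sketch of such a check, using Python’s standard difflib (hypothetical tooling, not Amazon’s):

```python
# Align the target phonemes against the synthesized ones and flag what
# was dropped: first the R in "party," then, after the fix, the P.
import difflib

def missing_phonemes(expected, produced):
    """Return (position, phoneme) pairs present in the target but absent from the output."""
    matcher = difflib.SequenceMatcher(a=expected, b=produced)
    dropped = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            dropped.extend((i, expected[i]) for i in range(i1, i2))
    return dropped

expected = ["P", "AA", "R", "T", "IY"]                     # "party," rhotic
print(missing_phonemes(expected, ["P", "AA", "T", "IY"]))  # [(2, 'R')]  the flat "pah-tee"
print(missing_phonemes(expected, ["AA", "R", "T", "IY"]))  # [(0, 'P')]  the later regression
```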

Two Irish linguists — Elaine Vaughan, who teaches at the University of Limerick, and Kate Tallon, a doctoral student who works in the Phonetics and Speech Laboratory at Trinity College Dublin — have since given Irish Alexa’s accent high marks. The way Irish Alexa emphasized R’s and softened T’s stuck out, they said, and Amazon got the accent as a whole right.

“It sounds authentic to me,” Tallon said.

Amazon’s researchers said they were gratified by the largely positive feedback. That their speech models disentangled the Irish accent so quickly gave them hope they could replicate accents elsewhere.

“We also plan to extend our methodology to accents of languages other than English,” they wrote in a January research paper about the Irish Alexa project.

Amazon’s Alexa has mastered an Irish-accented English with the aid of artificial intelligence. JONATHAN BARAN/WASHINGTON POST/FILE
