DICK POUNTAIN

Bad news for voice actors: if the tech firms have their way, this decade’s documentaries and films will be voiced by AI

2021-11-01 - dick@dickpountain.co.uk

Bad news for voice actors: if the tech firms have their way, this decade’s documentaries and films will be voiced by artificial intelligence.

While casually watching a Hawaiian volcano erupt on YouTube, as you do, I detected something creepy about the narrator that I couldn’t quite identify. It sounded like an adult American male, but with something subtly wrong about his rhythm. I started noticing the same in other videos, and posted on Facebook to ask whether anyone else thought synthetic digital voices were being used: the consensus was “probably not”.

Then last month the MIT Technology Review published an article about AI voice actors (pcpro.link/325mit) that said “deepfake voices had something of a lousy reputation for their use in scam calls and internet trickery. But their improving quality has since piqued the interest of a growing number of companies. Recent breakthroughs in deep learning have made it possible to replicate many of the subtleties of human speech.” You can now sample the voice of a human actor, or someone in your firm, then have a company rent you a synthesiser that speaks your PR materials so well as to be undetectable.

I’ve always had an interest in voice synthesis. Most people nowadays regard computers as visual devices, but to me making them talk is just as interesting as drawing pictures on them. The first halfway decent text-to-speech (TTS) program I used was back in the Windows 3.1 days; called Monologue, it came bundled with my first Sound Blaster card. Monologue had a raw, Stephen Hawking-like delivery, but it did support a simple syntax for marking up texts to add some degree of expression, and I amused myself by having it read poetry, including a poem that Felix Dennis dedicated to me. You can listen to it at pcpro.link/325primal.

Over the next few years I kept in touch with the state of the text-to-speech and voice-recognition art, particularly the ground-breaking work of the Belgian researchers Lernout and Hauspie, whom I mentioned in this column back in 1999. During the 13 years I spent living part-time in Italy I keenly followed the progress being made by Google with its voice and translation engines, and by the 2000-teens I could use an Android phone like Star Trek’s universal translator. When I needed to extend my local vocabulary, I would type what I wanted to say into Google Translate, have it spoken back to me in Italian and practise it before going into, say, a police station or hardware store. I didn’t quite have the nerve to hold up the phone to speak for me.

By this time not only were cloud-based voice services getting good, but the ubiquity of powerful smartphones enabled people to write small “edge-based” apps that performed TTS on the phone by calling on a cloud service. Several such apps became available for free, and one called Vocality caught my ear.

An interface to Google Speech services, Vocality offered control over speed and pitch of speech in a large selection of national voices. For example, it let me create comical action-movie-villain dialogues by choosing, say, a Russian voice and setting pitch ridiculously low. Politically incorrect perhaps, but fun. I also discovered that by typing in strings of random characters and setting the speed to high, I could generate something resembling “mouth music”, as in the little ditty at pcpro.link/325ditty.

Before writing this column, I checked out the current state of local TTS apps and found dozens of free ones that are massively improved: Balabolka, Natural Reader, Panopreter, TTSReader and Wordtalk all offer good quality speech and even customisable voices.

But it’s in the cloud-based arena that things get scary. Nuance is a company that offers “a human-like, engaging, and personalised user experience. Enhance any customer self-service application with high-quality audio tailored to your brand.” Or there’s Amazon’s Polly API, which lets developers add Alexa-like abilities to their products. For movie professionals, Lucasfilm offers Respeecher to “create speech that’s indistinguishable from the original speaker. Perfect for dubbing an actor’s voice in post production, bringing back the voice of an actor who passed away, and other content creators’ problems.”

However, it’s Amai (amai.io) that really spells it out: “Sorry, voice actors, we will replace you soon […] this text is painted with the Love emotion. You can highlight any text, choose any emotion and listen to how it sounds, for example this phrase is pronounced with the Happiness.” Go to Amai’s site to hear the perky result.


Dick Pountain is editorial fellow of PC Pro and has a special message for you all at pcpro.link/325dick
