The Star Early Edition

Artificial intelligence creates natural-sounding speech and music

- LOUIS FOURIE Professor Louis CH Fourie is an Extraordinary Professor at the University of the Western Cape.

A FEW WEEKS ago I discussed the powerful advancement in artificial intelligence (AI) that enables it to create intricate works of art when a user provides a description in plain English. These instructions can also specify the style of a well-known artist, and the AI will then design the required artwork in that artist's typical style.

Understandably, artists all over the world were concerned, since the creation of remarkable art was now within reach of anyone with a computer and access to the necessary software.

Now a group of Google researchers has developed a new AI system that can create natural-sounding speech and music after being prompted with a few seconds of audio.

The framework for high-quality audio generation with long-term consistency, called AudioLM, generates true-to-life sounds without the need for any human intervention.

What makes AudioLM so remarkable is that it generates very realistic audio that fits the style of the relatively short audio prompt, including complex sounds such as piano music or a person speaking. What is more, the AI does this in a way that is almost indistinguishable from the original recording.

The technique seems promising as a way to expedite the tedious process of training AI to generate audio.

AI-generated audio is, however, nothing new and is widely used in home assistants such as Alexa, whose voices rely on natural language processing. Similarly, AI music systems such as OpenAI's Jukebox, which uses a neural net, have produced impressive results, including rudimentary singing, as raw audio in a variety of genres and artist styles. But most existing techniques need people to prepare transcriptions and label text-based training data, which takes considerable time and human labour. Jukebox, for example, uses text-based data to generate song lyrics.

AudioLM is very different and does not require transcription or labelling. Sound databases are fed into the program, and machine learning is used to compress the audio files into sound snippets, called semantic and acoustic “tokens”, without losing too much information.

This tokenised training data is then fed into a machine-learning model that maps the input audio to a sequence of discrete tokens and uses natural language processing to learn the patterns in the sound.
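To make the idea concrete, the sketch below shows one crude way to turn a waveform into discrete tokens by clustering short frames. It only illustrates the general tokenisation idea: AudioLM itself uses learned neural tokenisers, and every name and parameter here is an assumption rather than Google's code.

```python
# A minimal, illustrative tokeniser: cluster fixed-length audio frames with
# k-means and use each frame's cluster index as its "token". AudioLM itself
# uses learned neural codecs, not k-means; this is only a toy stand-in.
import numpy as np
from sklearn.cluster import KMeans

def audio_to_tokens(waveform: np.ndarray, frame_len: int = 320,
                    vocab_size: int = 64) -> np.ndarray:
    """Split a 1-D waveform into frames and map each frame to the index
    of its nearest cluster centre, yielding a discrete token sequence."""
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    codebook = KMeans(n_clusters=vocab_size, n_init=10).fit(frames)
    return codebook.predict(frames)  # one token per frame

# Example: tokenise ten seconds of placeholder 16 kHz audio
tokens = audio_to_tokens(np.random.randn(160000))
print(tokens[:10])  # a sequence a language model can now learn from
```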

To generate reliable audio, only a few seconds of sound need to be fed into AudioLM, which then predicts what comes next. This process is very similar to the way autoregressive language models such as Generative Pre-trained Transformer 3 (GPT-3), which use deep learning to produce human-like text, predict which words and sentences typically follow one another.
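The same predict-what-comes-next loop can be sketched in a few lines. The toy below simply counts which token tends to follow which in a training sequence and samples continuations from those counts; a real system such as AudioLM or GPT-3 replaces the counting with a deep neural network, and all names here are hypothetical.

```python
# Toy autoregressive continuation over token sequences: count which token
# tends to follow which in the training data, then repeatedly sample the
# next token from those counts. GPT-3 and AudioLM replace the counting
# with a deep neural network; every name here is a hypothetical example.
from collections import defaultdict, Counter
import numpy as np

def continue_tokens(history: list[int], training: list[int],
                    n_new: int = 50, seed: int = 0) -> list[int]:
    """Extend `history` with `n_new` tokens sampled from bigram statistics."""
    rng = np.random.default_rng(seed)
    follows = defaultdict(Counter)             # token -> counts of successors
    for prev, nxt in zip(training, training[1:]):
        follows[prev][nxt] += 1
    out = list(history)
    for _ in range(n_new):
        counts = follows.get(out[-1])
        if not counts:                         # unseen token: pick any
            out.append(int(rng.choice(training)))
            continue
        toks, freqs = zip(*counts.items())
        probs = np.array(freqs, dtype=float)
        out.append(int(rng.choice(toks, p=probs / probs.sum())))
    return out

# Example: the short `history` plays the role of the few-second audio prompt
corpus = list(np.random.randint(0, 64, size=5000))
print(continue_tokens(history=corpus[:150], training=corpus, n_new=20)[-20:])
```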

The result is that audio produced by AudioLM sounds very natural. What is particularly remarkable is that the piano music generated by AudioLM sounds much more realistic and fluid than music usually generated with AI techniques, which often sounds chaotic.

There is no doubt that AudioLM already offers much better sound quality than previous music-generation programs. In particular, AudioLM is surprisingly good at recreating some of the repeating patterns inherent in human-made music. It generates convincing continuations that are coherent with the short prompt in terms of melody, harmony, tone and rhythm.

AudioLM learns the inherent structure of audio at multiple levels and can create realistic piano music by capturing the subtle vibrations in each note when the piano keys are played, as well as the rhythms and harmonies.

AudioLM was able to generate coherent piano music continuations despite being trained without any symbolic representation of music.

But AudioLM is not limited to music. Since it was trained on a library of recordings of people speaking sentences, the system can also generate speech that continues in the accent and cadence of the original speaker.

Without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody.

AudioLM is trained to pick up the types of sound bits that frequently occur together, and it uses the process in reverse to produce sentences. Even more impressively, it learns the pauses and exclamations that are inherent in spoken language but not easily translated into text.

When conditioned on a prefix (or prompt) of only three seconds of speech from a speaker not seen during training, AudioLM produces consistent continuations that maintain the original speaker's identity, voice, prosody, accent and the recording conditions of the prompt (eg, level of reverberation and background noise), while also demonstrating syntactically correct and semantically coherent content.

The difference between AudioLM and previous AI systems is that it learns these various nuances from the input data automatically, whereas previous systems could capture the nuances only if they were explicitly annotated in the training data.

It is this unique characteristic that adds to the realism of the generated speech, since important linguistic information lies not in the words that are pronounced but in the way things are expressed.

One contribution of this breakthrough in synthesising high-quality audio with long-term coherent structure is that it could help people with speech impediments.

Speech-generation technology that sounds more natural could also help to improve internet accessibility tools and bots that, for instance, work in healthcare settings.

AI-generated music could be used to compose more natural-sounding background soundtracks for videos and slideshows without infringing copyright or incurring royalties.

However, this technology is not without far-reaching ethical implications. It is important to determine whether the musicians who produced the clips used as training data will receive attribution or royalties for the end product.

Similarly, AI-generated speech that is indistinguishable from the real thing could become so convincing that it makes deepfakes and misinformation easier to spread. The ability to continue short speech segments while maintaining speaker identity and prosody could potentially be used to spoof biometric identification or to impersonate a specific speaker.

One way of mitigating this risk is to train a classifier that can distinguish natural sounds from AI-generated sounds with very high accuracy.
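As a hedged illustration of what such a detector might look like, the sketch below trains a simple logistic-regression classifier on spectral features of labelled clips. The features, model and placeholder data are all illustrative assumptions, not the researchers' actual classifier.

```python
# A toy detector: logistic regression over crude spectral features of
# labelled clips (0 = natural, 1 = AI-generated). Features, model and
# placeholder data are assumptions chosen only to illustrate the approach.
import numpy as np
from sklearn.linear_model import LogisticRegression

def spectral_features(waveform: np.ndarray) -> np.ndarray:
    """Crude features: log-magnitude of the first 64 FFT bins."""
    return np.log1p(np.abs(np.fft.rfft(waveform))[:64])

# Placeholder corpus standing in for real labelled recordings
natural = [np.random.randn(16000) for _ in range(20)]
synthetic = [np.random.randn(16000) * 0.5 for _ in range(20)]
X = np.stack([spectral_features(w) for w in natural + synthetic])
y = np.array([0] * 20 + [1] * 20)

detector = LogisticRegression(max_iter=1000).fit(X, y)
print(detector.predict(spectral_features(np.random.randn(16000))[None, :]))
```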

What is certain is that artificial intelligence will dramatically impact our future, creating not only amazing art but also realistic speech and music.
