‘AI’ WARS
Everything’s gone ‘AI’. Derek Powell goes looking for real rather than faux AI, and considers how it could affect the future of audio technology.
Everything seems to be going ‘AI’, but is it really intelligent?
Artificial Intelligence (AI) is everywhere at the moment, and frankly the claims for products said to incorporate “AI technology” are often pretty dubious. Everyone, it seems, now has an artificially intelligent product, ranging from AI lawnmowers (from Roomba, Honda and even LG) to AI lawyers. Yes, lawyers, and in Darwin, of all places: Ailira (short for Artificially Intelligent Legal Information Resource Assistant) can write a properly certified will for you based on your responses to some standard questions.
There are even AI fridges, like the Samsung Smart InstaView refrigerator, which lets you check from your smartphone, while at the supermarket, whether you need milk.
The AI industry has even appropriated the ‘dot ai’ domain. While .ai should (like .au for Australia) properly refer to a website located in the Caribbean island nation Anguilla, shameless proponents of supposedly smart products are now beating a path to the domain registries of the most northerly of the Leeward Islands in the Lesser Antilles to claim their “dot ai” websites, like minimally intelligent moths to the proverbial flame.
Intelligence vs automation
But are products that claim AI actually intelligent — or just automated? It can be hard to tell the difference, as there is no strictly applied standard for AI. For now, let’s take the commonly applied criterion that AI describes systems that can perform two specific functions: problem solving and learning from experience.
So can AI be usefully applied to audio systems? Can we use artificial intelligence (AI) to create IA — Intelligent Audio?
Many of the common ‘AI’ systems, like the digital assistants Alexa, Cortana and Siri, appear to reside in audio products, of course. However, these rely on remote processing power for their answers; the audio products themselves are really just conventional microphones and speakers acting as the interface to the real AI, which lives in the cloud.
Intelligent Audio
But there is an expanding list of tasks in sound reproduction where AI takes the lead in improving the audio experience. Some of these show real promise, and it is worth spending time studying these early steps toward Intelligent Audio.
I’d classify the current research into Intelligent Audio into two broad categories: first, using AI to analyse sound (and make useful suggestions as a result), and second, to enhance sound in ways that conventional audio techniques can’t manage.
The first category is currently the most active, spurred on by some big players who are close to making real money from recognising and recommending music to their users. Music recommendation software forms a vital part of the appeal of streaming services like Spotify. Finding new music based on what you already enjoy is an essential part of the business model of these services. Having the largest catalogue is important, but what really brings subscriptions is the ability to accurately predict that ‘if you enjoy this’ then you should ‘try that’.
A commonly used method is called Collaborative Filtering, which assumes that people who can be shown to be similar to the target user in behaviour or demographics, and who rate other items in a similar way, will have similar preferences in music — and thus their music choices will make good recommendations. The method relies on data mining, harvesting information from sources like social media to assemble a group of like-minded individuals and then applying their preferences to the target user.
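The core of Collaborative Filtering can be sketched in a few lines. The listener names, songs and ratings below are entirely made up for illustration, and real services use far more sophisticated similarity measures, but the shape of the idea is this: find users who rate things the way you do, then borrow their opinions of songs you haven’t heard.

```python
import math

# Toy ratings: user -> {song: rating out of 5}. All names are illustrative.
ratings = {
    "alice": {"song_a": 5, "song_b": 3, "song_c": 4},
    "bob":   {"song_a": 4, "song_b": 3, "song_c": 5, "song_d": 4},
    "carol": {"song_a": 1, "song_b": 5, "song_d": 2},
}

def cosine_similarity(u, v):
    """Similarity between two users over the songs they have both rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[s] * v[s] for s in shared)
    norm_u = math.sqrt(sum(u[s] ** 2 for s in shared))
    norm_v = math.sqrt(sum(v[s] ** 2 for s in shared))
    return dot / (norm_u * norm_v)

def recommend(target, ratings, top_n=1):
    """Score unheard songs by similarity-weighted ratings from other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine_similarity(ratings[target], other_ratings)
        for song, rating in other_ratings.items():
            if song in ratings[target]:
                continue  # the target has already heard this one
            scores[song] = scores.get(song, 0.0) + sim * rating
            weights[song] = weights.get(song, 0.0) + sim
    ranked = sorted(scores, key=lambda s: scores[s] / weights[s], reverse=True)
    return ranked[:top_n]

# Bob rates very much like Alice and loved song_d, so Alice gets song_d.
print(recommend("alice", ratings))  # prints ['song_d']
```

Note that the recommendation is driven almost entirely by Bob, whose ratings closely track Alice’s; Carol’s quite different tastes carry little weight.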
This method works well, but has a bias toward recommending songs that are already popular; it doesn’t analyse and recommend new music. Analysing and classifying new music automatically would be very useful, but this requires a higher level of AI.
Music classification
Plenty of researchers are working on this problem. The appropriately named Yading Song, of Queen Mary University of London, has written a useful paper1 comparing the many approaches. He notes that the content-based approach to music classification attempts to extract and compare the acoustic features of music, such as timbre and rhythm, to recommend songs similar to those the user has listened to in the past. This is more difficult than it sounds (if you’ll pardon the pun) and requires many small steps, including tonal analysis and beat tracking, that are still only imperfectly understood. One small area of study is ‘onset detection’, which pinpoints the start of an audio event (such as an individual note) and is the necessary first step in many of the analysis techniques mentioned above. While trivial for human musicians, a computer algorithm has to take a complex route: first computing a “spectral novelty function”, then finding the peaks in that function, and finally backtracking from each peak to a preceding local minimum.
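Those three steps can be illustrated with a toy example. The sketch below is deliberately simplified (the novelty function is plain spectral flux, and the test signal is two synthetic sine-wave ‘notes’ rather than real music), but it follows exactly the compute-novelty, pick-peaks, backtrack sequence described above.

```python
import numpy as np

SR, FRAME, HOP = 8000, 512, 256  # sample rate, analysis frame, hop size

def spectral_novelty(x):
    """Spectral flux: sum of positive magnitude changes between frames."""
    window = np.hanning(FRAME)
    n = 1 + (len(x) - FRAME) // HOP
    mags = np.array([np.abs(np.fft.rfft(window * x[i*HOP:i*HOP+FRAME]))
                     for i in range(n)])
    flux = np.diff(mags, axis=0)
    flux[flux < 0] = 0.0  # only increases in energy signal a new onset
    return flux.sum(axis=1)

def pick_peaks(novelty, threshold):
    """Local maxima of the novelty curve above a threshold."""
    return [i for i in range(1, len(novelty) - 1)
            if novelty[i] > threshold
            and novelty[i] >= novelty[i-1] and novelty[i] >= novelty[i+1]]

def backtrack(peak, novelty):
    """Walk back from a peak to the preceding local minimum."""
    while peak > 0 and novelty[peak - 1] < novelty[peak]:
        peak -= 1
    return peak

# Synthetic signal: silence, then 'notes' starting at 0.5 s and 1.0 s.
t = np.arange(2 * SR) / SR
x = np.zeros_like(t)
x[SR // 2:] += np.sin(2 * np.pi * 440 * t[SR // 2:])
x[SR:] += np.sin(2 * np.pi * 660 * t[SR:])

novelty = spectral_novelty(x)
peaks = pick_peaks(novelty, threshold=0.3 * novelty.max())
onsets = sorted({backtrack(p, novelty) * HOP / SR for p in peaks})
print([round(o, 2) for o in onsets])  # estimated onset times in seconds
```

Run on this signal, the estimated times land close to the true onsets at 0.5 s and 1.0 s, a little early because backtracking stops at the frame where the energy first begins to rise.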
Going a little further, the analysis of music is now being applied to classifying sounds in general. Can a computer detect that a certain sound is actually a dog barking rather than explosions from fireworks?
New soundtracks
Going a lot further, researchers at the University of North Carolina have demonstrated how they can train a machine learning algorithm to generate realistic sound effects to match with video clips2. Taking a video clip of a dog, or a chainsaw, as input, their algorithm has come up with a matching soundtrack so realistic and well synchronised with the vision that it is difficult to tell the fake sound from the original sounds of the video recording.
There are all sorts of applications for creating sounds with AI, and some of them are downright scary. One application I came across uses a deep learning voice system that can copy and reproduce the voices of thousands of people from around half an hour of sample recordings of their speech. On the one hand, voice cloning technology could be used to allow people who have lost the use of their voice through degenerative disease or injury (like the late Professor Hawking) to speak naturally, rather than with the familiar robot-like intonations. On the other hand, it could equally be used by the unscrupulous to spoof someone’s identity on the phone.
Enhancing sound
But let’s go back a step. If AI can classify sounds, can it go further and actually separate out components in a complicated audio signal? This belongs to the second of the two categories we set out to examine — enhancing sound. Writing in the online blog “Towards Data Science”, software developer Daniel Rothman has rounded up a collection of advancements in audio processing. He describes how AI techniques such as “deep learning” are being used in software by iZotope3 to “separate spoken dialogue from background noise such as crowds, traffic, footsteps, weather, or other noise with highly variable characteristics.”
Humans do this all the time — we can easily follow conversations in noisy environments. But as anyone who has recorded an interview in such conditions will tell
you, separating speech from such variable background noise simply can’t be done by analogue filters or any conventional audio technique. Indeed, this exact task is the Holy Grail for hearing-aid manufacturers, so as you can imagine a great deal of research effort is going into it. There is great promise that deep neural networks (the kind of AI technology used by Google in its image search algorithm, or by Shazam to identify songs from a small sample) may one day allow hearing aids to first recognise, then zero in on, particular components of a complicated audio signal. Such a system could amplify just the speech while ignoring unwanted sounds like passing cars.
Beyond audio
While we’ve mainly looked at AI in audio this time, the benefits of nearly all these methods are also being applied in the video domain.
AI-enhanced ‘smart speakers’ like the Amazon Echo, Apple HomePod and Google Home are rapidly being joined on the market by upmarket TVs with AI smart voice interfaces. As Stephen Dawson reports this issue (p12-3), LG’s latest range comes with a new operating system, “webOS with AI”, pushing back against audio-only smart devices.
Like Spotify, Netflix uses AI analysis techniques in its recommendations to viewers. In a 2015 article4, Netflix revealed that the recommender section of its site is responsible for 80% of users’ viewing hours, while searching for content that viewers already know about accounts for only 20%. Presenting great recommendations is therefore vital, especially since its research shows users will take on average no more than 90 seconds to move on from Netflix to another service if they don’t find something they would like to watch. Netflix calculates that the AI techniques helping users find something to watch are saving up to $1bn a year in potentially lost viewers.
These statistics probably sum up the driving force behind the incorporation of AI into new entertainment products and services. Where AI can meet an important revenue goal (like driving viewing or listening hours by making intelligent recommendations), and if it can do it automatically and quickly, then expect to hear lots more about AI in Sound+Image!

Derek Powell

REFERENCES: avhub.com.au/soundoff
1: https://www.researchgate.net/profile/Yading_Song/publication/277714802_A_Survey_of_Music_Recommendation_Systems_and_Future_Perspectives/links/5571726608aef8e8dc633517.pdf
2: http://bvision11.cs.unc.edu/bigpen/yipin/visual2sound_webpage/visual2sound.html
3: https://www.izotope.com/en/products/repair-and-edit/rx/features/dialogue-isolate.html
4: The Netflix Recommender System: Algorithms, Business Value, and Innovation, ACM Trans. Management Inf. Syst., 2015