Business World

Inside Big Tech’s underground race to acquire AI training data

NEW YORK - At its peak in the early 2000s, Photobucket was the world’s top image-hosting site. The media backbone for once-hot services like Myspace and Friendster, it boasted 70 million users and accounted for nearly half of the US online photo market.

Today only two million people still use Photobucket, according to analytics tracker Similarweb. But the generative artificial intelligence (AI) revolution may give it a new lease of life.

Chief Executive Officer (CEO) Ted Leonard, who runs the 40-strong company out of Edwards, Colorado, told Reuters he is in talks with multiple tech companies to license Photobucket’s 13 billion photos and videos to be used to train generative AI models that can produce new content in response to text prompts.

He has discussed rates of between five cents and $1 per photo and more than $1 per video, he said, with prices varying widely by both the buyer and the type of imagery sought.

“We’ve spoken to companies that have said, ‘we need way more,’” Leonard added, with one buyer telling him they wanted over a billion videos, more than his platform has.

“You scratch your head and say, where do you get that?”

Photobucket declined to identify its prospective buyers, citing commercial confidentiality. The ongoing negotiations, which haven’t been previously reported, suggest the company could be sitting on billions of dollars’ worth of content and give a glimpse into a bustling data market that’s arising in the rush to dominate generative AI technology.

Tech giants like Google, Meta, and Microsoft-backed OpenAI initially used reams of data scraped from the internet for free to train generative AI models like ChatGPT that can mimic human creativity. They have said that doing so is both legal and ethical, though they face lawsuits from a string of copyright holders over the practice.

At the same time, these tech companies are also quietly paying for content locked behind paywalls and login screens, giving rise to a hidden trade in everything from chat logs to long-forgotten personal photos from faded social media apps.

“There is a rush right now to go for copyright holders that have private collections of stuff that is not available to be scraped,” said Edward Klaris from law firm Klaris Law, which says it’s advising content owners on deals worth tens of millions of dollars apiece to license archives of photos, movies, and books for AI training.

Reuters spoke to more than 30 people with knowledge of AI data deals, including current and former executives at companies involved, lawyers and consultants, to provide the first in-depth exploration of this fledgling market, detailing the types of content being bought, the prices materializing, plus emerging concerns about the risk of personal data making its way into AI models without people’s knowledge or explicit consent.

OpenAI, Google, Meta, Microsoft, Apple, and Amazon all declined to comment on specific data deals and discussions for this article, although Microsoft and Google referred Reuters to supplier codes of conduct that include data privacy provisions.

Google added that it would “take immediate action, up to and including termination” of its agreement with a supplier if it discovered a violation.

Many major market research firms say they have not even begun to estimate the size of the opaque AI data market, where companies often don’t disclose agreements. Those researchers who do, such as Business Research Insights, put the market at roughly $2.5 billion now and forecast it could grow close to $30 billion within a decade.

GENERATIVE DATA GOLD RUSH

The data land grab comes as makers of big generative AI “foundation” models face increasing pressure to account for the massive amounts of content they feed into their systems, a process known as “training” that requires intensive computing power and often takes months to complete.

Tech companies say the technology would be cost-prohibitive if they couldn’t use vast archives of free scraped web page data, such as those provided by nonprofit repository Common Crawl, which they describe as “publicly available.”

Their approach has nonetheless drawn a wave of copyright lawsuits and regulatory heat, while prompting publishers to add code to their websites to block scraping.

In response, AI model makers have started hedging risks and securing data-supply chains, both through deals with content owners and via a burgeoning industry of data brokers that has popped up to satisfy demand.

In the months after ChatGPT debuted in late 2022, for instance, companies including Meta, Google, Amazon and Apple all struck agreements with stock image provider Shutterstock to use hundreds of millions of images, videos and music files in its library for training, according to a person familiar with the arrangements.

The deals with Big Tech firms initially ranged from $25 million to $50 million each, though most were later expanded, Shutterstock’s Chief Financial Officer Jarrod Yahes told Reuters. Smaller tech players have followed suit, spurring a fresh “flurry of activity” in the past two months, he added.

Mr. Yahes declined to comment on individual contracts. The Apple agreement, and the size of the other deals, haven’t previously been made public.

A Shutterstock competitor, Freepik, told Reuters it had struck agreements with two large tech companies to license the majority of its archive of 200 million images at two to four cents per image. There are five more similar deals in the pipeline, said CEO Joaquin Cuenca Abela, declining to identify buyers.

OpenAI, an early Shutterstock customer, has also signed licensing agreements with at least four news organizations, including The Associated Press and Axel Springer. Thomson Reuters, the owner of Reuters News, separately said it has struck deals to license news content to help train AI large language models, but didn’t disclose details.
