The Week (US)

Scarcity: The great AI data gap


The internet isn’t big enough for AI, said Deepa Seetharaman in The Wall Street Journal. The need for “high-quality text data” to help companies train their latest generative artificial intelligence models is threatening to “outstrip supply” in two to four years, AI researchers say. A fierce land grab is underway for what’s left. OpenAI’s most advanced generative AI model, GPT-4, “was trained on as many as 12 trillion tokens,” or words and parts of words. The next iteration of the model, GPT-5, could require up to “100 trillion tokens.” That’s more than all of the useful language and images available on the web. The supply is shrinking further because “social media platforms, news publishers, and others have been curbing access to their data for AI training.”

AI companies are now scouring everything “from chat logs to long-forgotten personal photos from faded social media apps,” said Katie Paul and Anna Tong in Reuters. It’s given “a new lease on life” to bygone repositories like Photobucket, once “the world’s top image-hosting site” in the days of MySpace and Friendster. Multiple tech companies are now in talks to license the 13 billion photos and videos languishing on the site. In other cases, however, AI firms “have cut corners, ignored corporate policies, and debated bending the law” to get the data they need, said Cade Metz in The New York Times. Meta has held internal discussions about “buying the publishing house Simon & Schuster to procure long works.” OpenAI developed its own speech-recognition tool, Whisper, which it used to transcribe “more than 1 million hours” of YouTube videos, potentially in violation of YouTube’s terms of service.

Let’s not forget that Facebook and Google started the race to harvest data, said Parmy Olson in Bloomberg. In large part, their business models have been “built on collecting the private data of billions of consumers” and selling it to the highest-bidding advertisers. Google is now grousing about how OpenAI is using YouTube videos, even as it scrapes YouTube videos itself. Tech giants’ complaints that their data is being exploited are about protecting their own turf, not users’ privacy. It’s the “ultimate example of throwing stones in glass houses.”

With supply dwindling, AI companies are experimenting with “synthetic data” to train their models, said Noor Al-Sibai in Futurism. “Synthetic data” is generated by AI; essentially, AI is training itself. That’s a problem, especially if the mentoring AI is biased or inaccurate, which is often the case. Researchers have compared the practice “to the deeply inbred Habsburg dynasty” and worry it will create an “inbred mutant” AI. But tech companies may have no alternative if the quality data pool runs dry.

AI builders are desperate for training data.
