Scarcity: The great AI data gap
The internet isn’t big enough for AI, said Deepa Seetharaman in The Wall Street Journal. The need for “high-quality text data” to help companies train their latest generative artificial intelligence models is threatening to “outstrip supply” in two to four years, AI researchers say. A fierce land grab is underway for what’s left. OpenAI’s most advanced generative AI model, GPT-4, “was trained on as many as 12 trillion tokens,” or words and parts of words. The next iteration of the model, GPT-5, could require up to “100 trillion tokens.” That’s more than all of the useful language and images available on the web. The supply is shrinking further because “social media platforms, news publishers, and others have been curbing access to their data for AI training.”
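Those “tokens” are subword pieces rather than whole words, which is why token counts run higher than word counts. As a rough illustration only, here is a toy greedy longest-match subword splitter; the vocabulary is made up for this sketch, and real tokenizers (such as the learned byte-pair encodings used by large models) are far more sophisticated:

```python
# Toy illustration of subword tokenization: greedily match the longest
# vocabulary piece at each position. The vocabulary here is invented;
# production tokenizers learn theirs from huge corpora.
VOCAB = {"un", "believ", "able", "token", "s"}

def tokenize(word, vocab):
    """Split a word into the longest matching vocabulary pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest substring first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character becomes its own token
            i += 1
    return pieces

print(tokenize("unbelievable", VOCAB))  # → ['un', 'believ', 'able']
print(tokenize("tokens", VOCAB))        # → ['token', 's']
```

One English word can thus yield two or three tokens, so a “12 trillion token” corpus contains considerably fewer than 12 trillion words.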
AI companies are now scouring everything “from chat logs to long-forgotten personal photos from faded social media apps,” said Katie Paul and Anna Tong in Reuters. It’s given “a new lease on life” to bygone repositories like Photobucket, once “the world’s top image-hosting site” in the days of MySpace and Friendster. Multiple tech companies are now in talks to license the 13 billion photos and videos languishing on the site. In other cases, however, AI firms “have cut corners, ignored corporate policies, and debated bending the law” to get the data they need, said Cade Metz in The New York Times. Meta has held internal discussions about “buying the publishing house Simon & Schuster to procure long works.” OpenAI developed its own speech-recognition tool, Whisper, which it used to transcribe “more than 1 million hours” of YouTube videos, potentially in violation of YouTube’s terms of service.
Let’s not forget that Facebook and Google started the race to harvest data, said Parmy Olson in Bloomberg. In large part, their business models have been “built on collecting the private data of billions of consumers” and selling it to the highest-bidding advertisers. Google is now grousing about how OpenAI is using YouTube videos—even as it scrapes YouTube videos itself. Tech giants’ complaints that their data is being exploited are about protecting their own turf, not users’ privacy. It’s the “ultimate example of throwing stones in glass houses.”
With supply dwindling, AI companies are experimenting with “synthetic data” to train their models, said Noor Al-Sibai in Futurism. “Synthetic data” is generated by AI—essentially, AI is training itself. That’s a problem, especially if the mentoring AI is biased or inaccurate, which is often the case. Researchers have compared the practice “to the deeply inbred Habsburg dynasty” and worry it will create an “inbred mutant” AI. But tech companies may have no alternative if the quality data pool runs dry.