Mint Hyderabad

For data-guzzling AI companies, the internet is too small

- Deepa Seetharaman feedback@livemint.com © 2024 DOW JONES & CO. INC.

Companies racing to develop more powerful artificial intelligence are rapidly nearing a new problem: The internet might be too small for their plans.

Ever more powerful systems developed by OpenAI, Google and others require larger oceans of information to learn from. That demand is straining the available pool of quality public data online at the same time that some data owners are blocking access to AI companies.

Some executives and researchers say the industry’s need for high-quality text data could outstrip supply within two years, potentially slowing AI’s development.

AI companies are hunting for untapped information sources, and rethinking how they train these systems. OpenAI, the maker of ChatGPT, has discussed training its next model, GPT-5, on transcriptions of public YouTube videos, people familiar with the matter said.

Companies also are experimenting with using AI-generated, or synthetic, data as training material—an approach many researchers say could actually cause crippling malfunctions.

These efforts are often secret, because executives think solutions could be a competitive advantage.

The data shortage “is a frontier research problem”, said Ari Morcos, an AI researcher who worked at Meta Platforms and Google’s DeepMind unit before founding DatologyAI last year. His company, whose backers include a number of AI pioneers, builds tools to improve data selection, which could help companies train AI models more cheaply. “There is no established way of doing this.”

Data is among several essential AI resources in short supply. The chips needed to run so-called large-language models behind ChatGPT, Google’s Gemini and other AI also are scarce. And industry leaders worry about a dearth of data centers and the electricity needed to power them.

AI language models are built using text vacuumed up from the internet, including scientific research, news articles and Wikipedia entries. That material is broken into tokens—words and parts of words that the models use to learn how to formulate humanlike expressions.
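To make the token idea concrete, here is a minimal sketch of subword tokenization. Real systems such as byte-pair encoding learn their vocabulary from data; the vocabulary and the greedy longest-match rule below are simplifications invented for illustration.

```python
# Toy subword tokenizer: greedily match the longest vocabulary piece.
# The vocabulary here is invented; real tokenizers learn theirs from data.
def tokenize(text, vocab):
    """Split lowercase words into subword tokens by longest match."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest vocabulary entry matching at this position.
            for end in range(len(word), start, -1):
                if word[start:end] in vocab:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                tokens.append(word[start])  # fall back to one character
                start += 1
    return tokens

vocab = {"learn", "ing", "model", "s", "token"}
print(tokenize("Models learning tokens", vocab))
# -> ['model', 's', 'learn', 'ing', 'token', 's']
```

The point is that one word can become several tokens, which is why token counts run well past word counts when measuring training data.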

Generally, AI models become more capable the more data they train on. OpenAI bet big on this approach, helping it become the most prominent AI company in the world.

OpenAI doesn’t disclose details of the training material for its current most-advanced language model, called GPT-4, which has set the standard for advanced generative AI systems.

But Pablo Villalobos, who studies artificial intelligence for research institute Epoch, estimated that GPT-4 was trained on as many as 12 trillion tokens. Based on a computer-science principle called the Chinchilla scaling laws, an AI system like GPT-5 would need 60 trillion to 100 trillion tokens of data if researchers continued to follow the current growth trajectory, Villalobos and other researchers have estimated.
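The Chinchilla result is often summarized as a rule of thumb of roughly 20 training tokens per model parameter. The sketch below only shows the shape of that arithmetic; the parameter counts are hypothetical, and the article's 60 trillion to 100 trillion figure comes from Epoch's own projections, not from this simple formula.

```python
# Back-of-the-envelope use of the Chinchilla rule of thumb
# (~20 training tokens per parameter; a simplification).
TOKENS_PER_PARAM = 20

def chinchilla_tokens(n_params):
    """Rough compute-optimal token budget for a model of n_params."""
    return TOKENS_PER_PARAM * n_params

# Hypothetical model sizes, for illustration only.
for params in (70e9, 500e9, 3e12):
    print(f"{params:.0e} params -> {chinchilla_tokens(params):.0e} tokens")
```

Under this rule, a hypothetical 3-trillion-parameter model would already want about 60 trillion tokens, which is why projections of frontier-model data needs quickly exceed the usable internet.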

Harnessing all the high-quality language and image data available could still leave a shortfall of 10 trillion to 20 trillion tokens or more, Villalobos said. And it isn’t clear how to bridge that gap. Two years ago, Villalobos and his colleagues wrote that there was a 50% chance that the demand for high-quality data would outstrip supply by mid-2024 and a 90% chance that it would happen by 2026. They have since become a bit more optimistic, and plan to update their estimate to 2028.

Most of the data available online is useless for AI training because it contains flaws such as sentence fragments or doesn’t add to a model’s knowledge. Villalobos estimated that only a sliver of the internet is useful for such training—perhaps just one-tenth of the information gathered by the nonprofit Common Crawl, whose web archive is widely used by AI developers.
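Filtering of this kind is typically done with layered heuristics. The sketch below shows the flavor of such a quality filter; the specific rules and thresholds are invented here and are far cruder than what production pipelines apply to crawls like Common Crawl's.

```python
# Minimal sketch of heuristic data-quality filtering.
# Rules and thresholds are illustrative, not any real pipeline's.
def looks_useful(line):
    words = line.split()
    if len(words) < 5:                      # too short to teach anything
        return False
    if not line.rstrip().endswith((".", "!", "?")):
        return False                        # likely a sentence fragment
    if sum(c.isalpha() for c in line) / len(line) < 0.6:
        return False                        # mostly symbols or markup debris
    return True

docs = [
    "Click here",
    "The model was trained on a large corpus of text.",
    "<div class=nav> | home | about |",
]
print([looks_useful(d) for d in docs])
# -> [False, True, False]
```

Even simple rules like these discard most raw web text, which is consistent with the estimate that only a sliver of a crawl survives as training data.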

At the same time, social-media platforms, news publishers and others have been curbing access to their data for AI training over concerns about issues including fair compensation. And there is little public will to hand over private conversational data—such as chats over iMessage—to help train these models.

Mark Zuckerberg recently touted Meta Platforms’ access to data on its platforms as a significant advantage in its AI efforts. He said Meta can mine hundreds of billions of publicly shared images and videos across its networks, including Facebook and Instagram, that are collectively larger than most commonly used data sets. It isn’t clear what percentage of that data would be considered high quality.

One strategy used by DatologyAI, the data-selection-tool startup, is called curriculum learning, in which data is fed to language models in a specific order in hopes that the AI will form smarter connections between concepts. In a 2022 paper, DatologyAI’s Morcos and co-authors estimated that models can achieve the same results with half the data—if it is the right data—potentially lowering the immense cost of training and running large generative AI systems.
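At its simplest, curriculum learning means ordering training examples from easy to hard before they reach the model. In the sketch below, sentence length stands in for difficulty; real systems like the one the article describes would use learned quality or difficulty signals, so this is only an assumption-laden illustration of the ordering step.

```python
# Sketch of curriculum ordering: present "easier" examples first.
# Difficulty here is just text length, a stand-in for a learned score.
def curriculum_order(samples, difficulty=len):
    return sorted(samples, key=difficulty)

batch = [
    "Long sentences with rarer constructions come later in training.",
    "Short text first.",
    "Medium-length examples in between.",
]
for sample in curriculum_order(batch):
    print(sample)
```

A training loop would then consume the reordered batch as usual; only the presentation order changes, not the data itself.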

Other research so far suggests the curriculum-learning method hasn’t been effective, but Morcos says the company is continuing to adapt its approach.

“This is the dirty secret of deep learning: It’s throwing spaghetti against the wall,” said Morcos.

Some tech companies, including OpenAI partner Microsoft, are building smaller language models that are a fraction of the size of GPT-4 but could accomplish specific objectives.

OpenAI Chief Executive Sam Altman has indicated the company is working on new methods to train future models.

“I think we’re at the end of the era where it’s going to be these giant, giant models,” he said at a conference last year. “And we’ll make them better in other ways.”

OpenAI also has discussed creating a data market where it could build a way to attribute how much value each individual data point contributes to the final trained model and pay the provider of that content, people familiar with the matter said.
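The article does not say what attribution method was discussed, but one standard proxy in the research literature is the leave-one-out value: retrain without a data point and measure how much the model's error changes. The sketch below uses a deliberately trivial "model" (predicting the mean) so the idea is visible; for large models this exact recomputation is what makes such systems so hard to build.

```python
# Leave-one-out data valuation, sketched with a trivial model.
# This is one research-style proxy, not any company's actual method.
def loo_values(data, target):
    def error(train):
        pred = sum(train) / len(train)      # the "model": predict the mean
        return abs(pred - target)
    base = error(data)
    # Positive value: removing the point hurts, so it was helpful.
    return [error(data[:i] + data[i + 1:]) - base for i in range(len(data))]

points = [1.0, 2.0, 3.0, 10.0]
print(loo_values(points, target=2.0))
```

Note that the outlier 10.0 gets a strongly negative value: removing it improves the model, so under this scheme its provider would earn little.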

This same idea is being discussed within Google. But researchers have so far struggled to build such a system, and it isn’t clear whether they will ever find a breakthrough.

OpenAI is also working to gather everything useful that is already out there. Executives have discussed transcribing high-quality examples of video and audio on the internet using Whisper, its automatic speech-recognition tool, people familiar with the matter said. Some of that would be through public YouTube videos, a subset of which were already used to train GPT-4, the people said.

“Our data sets are unique, and we curate them to help our models’ understanding of the world,” a spokeswoman for OpenAI said, adding that it draws from publicly available content and gets nonpublic data through partnerships.

Google didn’t respond to a request for comment.

Companies also are experimenting with making their own data.

Feeding a model text that is itself generated by AI is considered the computer-science version of inbreeding. Such a model tends to produce nonsense, which some researcher­s call “model collapse.”

In one experiment, described in a research paper last year, Canadian and British researchers found that later generations of such a model, when asked to discuss 14th-century English architecture, babbled about nonexistent species of jackrabbits.
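The collapse dynamic can be seen even in a toy setting. In the sketch below, each "generation" fits a simple Gaussian model to a handful of samples drawn from the previous generation's model, then samples from its own fit. The numbers (sample size, generation count) are arbitrary, and a Gaussian is a stand-in for a language model, but the shrinking spread mirrors how repeated training on model output loses the diversity of the original data.

```python
import random
import statistics

# Toy "model collapse": each generation fits a Gaussian to a few samples
# from the previous generation's model, then samples from that fit.
# Small samples bias the fitted spread low, so diversity decays.
random.seed(0)

mu, sigma = 0.0, 1.0           # generation 0: the "real data" distribution
initial_sigma = sigma
for generation in range(400):
    samples = [random.gauss(mu, sigma) for _ in range(5)]
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)   # MLE std, biased low for small n

print(f"spread after 400 generations: {sigma:.4f} (started at {initial_sigma})")
```

By the end, the fitted spread is far below the original, meaning later generations produce an ever-narrower slice of outputs, an analogue of the babbling repetition seen in collapsed language models.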

Researcher­s at OpenAI and Anthropic are trying to avoid these problems by creating so-called synthetic data of higher quality.

In a recent interview, Anthropic’s chief scientist, Jared Kaplan, said some types of synthetic data can be helpful. Anthropic said it used “data we generate internally” to inform the latest versions of its Claude models. OpenAI also is exploring synthetic-data generation, the spokeswoman said.

Many who study the data issue are ultimately sanguine that solutions will emerge. Villalobos compares it to “peak oil,” the fear that oil production could top out and start an economically painful collapse. That concern has proven inaccurate thanks to new technology, such as fracking in the early 2000s.

It is possible that the AI world could see a similar development, he says. “The biggest uncertainty is what breakthroughs you’ll see.”


