The Boston Globe

Tech giants cut corners to gather AI data

Increasingly violate policies to train systems

By Cade Metz, Cecilia Kang, and Sheera Frenkel, NEW YORK TIMES

SAN FRANCISCO — In late 2021, OpenAI faced a supply problem.

The artificial intelligence lab had exhausted every reservoir of reputable English-language text on the internet as it developed its latest AI system. It needed more data to train the next version of its technology — lots more.

So OpenAI researchers created a speech recognition tool called Whisper. It could transcribe the audio from YouTube videos, yielding new conversational text that would make an AI system smarter.

Some OpenAI employees discussed how such a move might go against YouTube’s rules, three people with knowledge of the conversations said. YouTube, which is owned by Google, prohibits use of its videos for applications that are “independent” of the video platform.

Ultimately, an OpenAI team transcribed more than 1 million hours of YouTube videos, the people said. The team included Greg Brockman, OpenAI’s president, who personally helped collect the videos, two of the people said. The texts were then fed into a system called GPT-4, which was widely considered one of the world’s most powerful AI models and was the basis of the latest version of the ChatGPT chatbot.

The race to lead AI has become a desperate hunt for the digital data needed to advance the technology. To obtain that data, tech companies, including OpenAI, Google, and Meta, have cut corners, ignored corporate policies, and debated bending the law, according to an examination by The New York Times.

At Meta, which owns Facebook and Instagram, managers, lawyers, and engineers last year discussed buying the publishing house Simon & Schuster to procure long works, according to recordings of internal meetings obtained by the Times. They also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. Negotiating licenses with publishers, artists, musicians, and the news industry would take too long, they said.

Like OpenAI, Google transcribed YouTube videos to harvest text for its AI models, five people with knowledge of the company’s practices said. That potentially violated the copyrights to the videos, which belong to their creators.

Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company’s privacy team and an internal message viewed by the Times, was to allow Google to tap publicly available Google Docs, restaurant reviews on Google Maps, and other online material for more of its AI products.

The companies’ actions illustrate how online information — news stories, fictional works, message board posts, Wikipedia articles, computer programs, photos, podcasts, and movie clips — has increasingly become the lifeblood of the booming AI industry. Creating innovative systems depends on having enough data to teach the technologies to instantly produce text, images, sounds, and videos that resemble what a human creates.

The volume of data is crucial. Leading chatbot systems have learned from pools of digital text spanning as many as 3 trillion words, or roughly twice the number of words stored in Oxford University’s Bodleian Library, which has collected manuscripts since 1602. The most prized data, AI researchers said, is high-quality information, such as published books and articles, which have been carefully written and edited by professionals.

For years, the internet — with sites like Wikipedia and Reddit — was a seemingly endless source of data. But as AI advanced, tech companies sought more repositories. Google and Meta, which have billions of users who produce search queries and social media posts every day, were largely limited by privacy laws and their own policies from drawing on much of that content for AI.

Their situation is urgent. Tech companies could run through data on the internet as soon as 2026, according to Epoch, a research institute.

PAU BARRENA/AFP VIA GETTY IMAGES OpenAI researchers created a speech recognition tool to transcribe internet audio to make an AI system smarter.
