The Guardian (USA)

New York Times, CNN and Australia’s ABC block OpenAI’s GPTBot web crawler from accessing content

- Ariel Bogle

News outlets including the New York Times, CNN, Reuters and the Australian Broadcasting Corporation (ABC) have blocked a tool from OpenAI, limiting the company’s ability to continue accessing their content.

OpenAI is behind one of the best-known artificial intelligence chatbots, ChatGPT. Its web crawler – known as GPTBot – may scan webpages to help improve its AI models.

The Verge was first to report the New York Times had blocked GPTBot on its website. The Guardian subsequently found that other major news websites, including CNN, Reuters, the Chicago Tribune, the ABC and Australian Community Media (ACM) brands such as the Canberra Times and the Newcastle Herald, appear to have also disallowed the web crawler.

So-called large language models such as ChatGPT require vast amounts of information to train their systems and allow them to answer queries from users in ways that resemble human language patterns. But the companies behind them are often tight-lipped about the presence of copyrighted material in their datasets.

The block on GPTBot can be seen in the publishers’ robots.txt files, which tell crawlers from search engines and other entities which pages they are allowed to visit.
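As a rough illustration of how such a rule is read, the sketch below uses Python’s standard-library urllib.robotparser on a made-up robots.txt file. The file contents, the example.com address and the "ExampleSearchBot" name are assumptions for illustration only, not any publisher’s actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (not copied from any real publisher).
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story.html"
    print("GPTBot allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "GPTBot", page))
    print("Other crawler allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "ExampleSearchBot", page))
```

The file is only a request: it works to the extent that a crawler chooses to read and honour it.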

“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI said in a blogpost that included instructions on how to disallow the crawler.

All the outlets examined added the block in August. Some have also disallowed CCBot, the web crawler for an open repository of web data known as Common Crawl that has also been used for AI projects.
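A check of this kind across several sites could be sketched as follows. This is a minimal sketch, not the method the Guardian used: the two outlet names and their robots.txt contents are invented, and the helper blocked_crawlers is hypothetical. It only detects a block that covers the site root, so a file that bars a crawler from selected sections would not be flagged.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens named in the article.
AI_CRAWLERS = ("GPTBot", "CCBot")


def blocked_crawlers(robots_txt: str, crawlers=AI_CRAWLERS) -> list[str]:
    """Return the crawlers that robots_txt bars from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [name for name in crawlers if not parser.can_fetch(name, "/")]


# Hypothetical robots.txt files for two made-up outlets.
SAMPLE_FILES = {
    "outlet-a.example": (
        "User-agent: GPTBot\nDisallow: /\n\n"
        "User-agent: CCBot\nDisallow: /\n"
    ),
    "outlet-b.example": "User-agent: *\nDisallow: /private/\n",
}

if __name__ == "__main__":
    for site, text in SAMPLE_FILES.items():
        print(site, blocked_crawlers(text))
```

To inspect a live site, the same parser can instead be pointed at a URL with set_url() and read(), which downloads the file before can_fetch() is called.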

CNN confirmed to Guardian Australia that it recently blocked GPTBot across its titles, but did not comment on whether the brand plans to take further action about the use of its content in AI systems.

A Reuters spokesperson said it regularly reviews its robots.txt and site terms and conditions. “Because intellectual property is the lifeblood of our business, it is imperative that we protect the copyright of our content,” she said.

The New York Times’ terms of service were recently updated to make the prohibition against “the scraping of our content for AI training and development … even more clear,” according to a spokesperson.

As of 3 August, its website rules explicitly prohibit the use of the publisher’s content for “the development of any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system” without consent.

News outlets globally are faced with decisions about whether to use AI as part of news gathering, and also how to deal with their content potentially being sucked into training pools by companies developing AI systems.

In early August, outlets including Agence France-Presse and Getty Images signed an open letter calling for regulation of AI, including transparency about “the makeup of all training sets used to create AI models” and consent for the use of copyrighted material.

Google has proposed that AI systems should be able to scrape the work of publishers unless they explicitly opt out.

In a submission to the Australian government’s review of the regulatory framework around AI, the company argued for “copyright systems that enable appropriate and fair use of copyrighted content to enable the training of AI models in Australia on a broad and diverse range of data, while supporting workable opt-outs”.

Research shared this week by OriginalityAI, a company that checks for the presence of AI content, found that major websites including Amazon and Shutterstock had also blocked GPTBot.

The Guardian’s robots.txt file does not disallow GPTBot.

The ABC, Australian Community Media, the Chicago Tribune, OpenAI and Common Crawl did not respond by deadline.

OpenAI’s web crawler – known as GPTBot – has been blocked by the New York Times, CNN, Reuters and the Australian Broadcasting Corporation. Photograph: Jonathan Raa/NurPhoto/Shutterstock
