The Morning Journal (Lorain, OH)

ON THE SAME PAGE

9,000 authors rebuke AI companies, saying they exploited books as ‘food’ for chatbots

By Emily St. Martin

More than 9,000 authors are calling out the tech companies behind generative AI in an open letter that states there is an inherent injustice in exploiting copyright-protected works to train chatbots without consent, credit or compensation.

If users prompt GPT-4 to summarize works by Roxane Gay or Margaret Atwood, it can do so in detail, chapter by chapter. If users want ChatGPT to write a story in the style of an acclaimed author such as Maya Angelou, they can ask it to “write a personal essay in the style of Maya Angelou, exploring the theme of self-discovery and personal growth.” And voila.

The generative AI is powered by two software programs known as large language models, which forgo a traditional programming method and instead extract massive amounts of text in order to produce natural and lifelike responses to user prompts.

In Tuesday’s open letter, the Authors Guild writes that “Generative AI technologies built on large language models owe their existence to our writings. These technologies mimic and regurgitate our language, stories, style, and ideas. Millions of copyrighted books, articles, essays, and poetry provide the ‘food’ for AI systems, endless meals for which there has been no bill.”

The letter further states that tech companies including OpenAI, Alphabet, Meta, Stability AI, IBM and Microsoft have spent billions to develop AI technology and that compensating the authors for using their works would be the fair move, because without those books, “AI would be banal and extremely limited.”

Novelist and essayist Jonathan Franzen commended the effort, stating, “The Authors Guild is taking an important step to advance the rights of all Americans whose data and words and images are being exploited, for immense profit, without their consent — in other words, pretty much all Americans over the age of six.”

Dan Brown, James Patterson, Margaret Atwood, Roxane Gay, Celeste Ng, Viet Thanh Nguyen, George Saunders and Rebecca Makkai are among the thousands of authors who are taking AI industry leaders to task, asking that their concerns be addressed and specific actions taken:

• Obtain permission for the use of copyrighted material in generative AI programs.

• Fairly compensate writers for both past and ongoing use of their works in generative AI programs.

• Fairly compensate writers for the use of their works in AI output, regardless of whether the outputs infringe upon current laws.

“We understand that many of the books used to develop AI systems originated from notorious piracy websites,” the letter continues.

“Not only does the recent Supreme Court decision in Warhol v. Goldsmith make clear that the high commerciality of your use argues against fair use, but no court would excuse copying illegally sourced works as fair use.”

The Authors Guild says generative AI threatens writers’ professions by “flooding the market with mediocre, machine-written books, stories, and journalism based on our work.” It adds that for at least the last decade, authors have experienced a 40% decline in income, with many full-time writers in 2022 barely surpassing the federal poverty level.

The letter comes just weeks after bestselling novelists Mona Awad and Paul Tremblay filed a suit against OpenAI in a San Francisco federal court, claiming that ChatGPT was trained in part by “ingesting” their novels without their consent.

When prompted, ChatGPT emitted extremely detailed summaries of Tremblay’s “The Cabin at the End of the World” and Awad’s “Bunny” and “13 Ways of Looking at a Fat Girl.”

Both authors claim this is proof that their novels were used to train the chatbot, and the filing includes ChatGPT’s responses to prompts regarding their novels.

In June 2018, OpenAI revealed that it trained GPT-1 using BookCorpus, which the suit described as a “controversial dataset” assembled by artificial intelligence researchers in 2015, with a collection of “over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance.

“They copied the books from a website called Smashwords.com that hosts unpublished novels that are available to readers at no cost. Those novels, however, are largely under copyright.”

According to the complaint, later iterations of the company’s large language models were trained using significantly larger quantities of copyright-protected books. In a July 2020 paper introducing GPT-3, the company revealed that 15% of the training data set came from “two internet-based books corpora” that OpenAI simply called “Books1” and “Books2.”

The suit estimates that, based on numbers revealed in OpenAI’s paper about GPT-3, Books1 would contain roughly 63,000 titles, and Books2 would include approximately 294,000 titles.

Experts have predicted more suits are sure to follow as AI becomes more adept at using information from the web to generate new content.

ERIK VOAKE — GETTY IMAGES Celeste Ng attends Hulu’s “Little Fires Everywhere” press brunch at Ross House on Feb. 19, 2020, in Los Angeles.
