EuroNews (English)

This French start-up just proved OpenAI wrong. It claims you can train AI on non-copyrighted data

- Pascale Davies

Last year, OpenAI said it was “impossible” to create tools such as ChatGPT without access to copyrighted material. But one French start-up has proved you can.

It comes at a crucial time, as legal battles over copyrighted material multiply, the biggest being the New York Times lawsuit against OpenAI and its investor Microsoft for allegedly using news articles to train ChatGPT.

Now, Common Corpus may have found the solution to legal headwinds as it has unveiled the largest public dataset for training large language models (LLMs).

This international initiative, coordinated by the French start-up Pleias, includes researchers and other open science AI companies such as HuggingFace, Occiglot, Eleuther, and Nomic AI.

It is also supported by Langu:IA, a project run by the French culture ministry’s French language unit which aims, among other things, to "facilitate access to data in French and in the languages of France for LLM training and specialisation".

The Corpus boasts the largest English-language dataset to date with 180 billion words, which includes 21 million digitised newspapers and millions of books. But it is also multilingual and has the largest open datasets in French (110 billion words), German (30 billion words), Spanish, Dutch, and Italian.

“I think [the Corpus is] very important so we can create an incentive for competition [with companies like OpenAI],” Pleias cofounder Pierre-Carl Langlais told Euronews Next.

He said it is good for cooperation because “once you release a corpus you have shared interest to make it better and avoid duplication”.

Some European publishers, such as the French newspaper Le Monde, have entered into agreements with OpenAI to license their content for training.

While specific terms of these agreements remain undisclosed, Langlais said it is “a really big concern because it means that they may have to obey US companies and it’s especially worrying as it’s one of the most important media in France”.

“So it's a big issue that is creating this kind of command system,” he added.

Langlais believes that the Corpus is therefore essential as it can level the playing field by lowering the value of copyrighted data.

Different types of open content

There are limitations when it comes to Common Corpus as it uses non-copyrighted material.

In Europe, a text is generally no longer subject to copyright once 70 years have passed since the death of its author. This means that the dataset does not include newer material.

“Obviously, it comes with a range of issues about having the language be up to date… I think also ethical issues may be different, but for now, it's only one part of the open content we have,” Langlais said.

The other two parts, which he said will make the data more recent, are open administrative data, which he says is “actually big in Europe because we have this big commitment to circumvent this [data],” and the open science movement, which makes scientific research available to everyone.

Langlais said another way to improve the Common Corpus is to use synthetic data, which is artificially generated data that replicates the patterns, relationships, and characteristics found in real-world data.

In 2022, MIT researchers found that synthetically trained models performed even better than models trained on real data for videos that have fewer background objects.

But Langlais believes the purpose of the Common Corpus is collective improvement. “A common idea is to make it better,” he said.

“And so a lot of our initiative is to ensure that it's going to be richer, it's going to be more diverse, it can be changed,” he said, adding that in the future he hopes to include more European languages in the project.

The Common Corpus aims to create a space for open science.
