Aviation Ghana

AI’s Copyright Problem Is Fixable

- By Mike Loukides and Tim O’Reilly

Generative artificial intelligence stretches current copyright law in unforeseen and uncomfortable ways. The US Copyright Office recently issued guidance stating that the output of image-generating AI isn’t copyrightable unless human creativity went into the prompts that generated it. But that leaves many questions: How much creativity is needed, and is it the same kind of creativity that an artist exercises with a paintbrush?

Another group of cases deals with text (typically novels and novelists), where some argue that training a model on copyrighted material is itself copyright infringement, even if the model never reproduces those texts as part of its output. But reading texts has been part of the human learning process for as long as written language has existed. While we pay to buy books, we don’t pay to learn from them.

How do we make sense of this? What should copyright law mean in the age of AI? Technologist Jaron Lanier offers one answer with his idea of data dignity, which implicitly distinguishes between training (or “teaching”) a model and generating output using a model. The former should be a protected activity, Lanier argues, whereas output may indeed infringe on someone’s copyright.

This distinction is attractive for several reasons. First, current copyright law protects “transformative uses … that add something new,” and it is quite obvious that this is what AI models are doing. Moreover, it is not as though large language models (LLMs) like ChatGPT contain the full text of, say, George R. R. Martin’s fantasy novels, from which they are brazenly copying and pasting.

Rather, the model is an enormous set of parameters – based on all the content ingested during training – that represent the probability that one word is likely to follow another. When these probability engines emit a Shakespearean sonnet that Shakespeare never wrote, that’s transformative, even if the new sonnet isn’t remotely good.

Lanier sees the creation of a better model as a public good that serves everyone – even the authors whose works are used to train it. That makes it transformative and worthy of protection. But there is a problem with his concept of data dignity (which he fully acknowledges): it is impossible to distinguish meaningfully between “training” current AI models and “generating output” in the style of, say, novelist Jesmyn Ward.

AI developers train models by giving them small bits of input and asking them to predict the next word billions of times, tweaking the parameters slightly along the way to improve the predictions. But the same process is then used to generate output, and therein lies the problem from a copyright standpoint.

A model prompted to write like Shakespeare may start with the word “To,” which makes it slightly more probable that it will follow that with “be,” which makes it slightly more probable that the next word will be “or” – and so forth. Even so, it remains impossible to connect that output back to the training data.

Where did the word “or” come from? While it happens to be the next word in Hamlet’s famous soliloquy, the model wasn’t copying Hamlet. It simply picked “or” out of the hundreds of thousands of words it could have chosen, all based on statistics. This isn’t what we humans would recognize as creativity. The model is simply maximizing the probability that we humans will find its output intelligible.
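The word-by-word process described above can be sketched with a toy bigram model. This is only an illustration of statistical next-word prediction, not how an LLM actually works: the corpus, function names, and pair-counting scheme here are all invented for the example, whereas real models use billions of learned parameters rather than raw counts.

```python
import random
from collections import Counter, defaultdict

# Invented toy corpus; a real model is trained on vastly more text.
corpus = "to be or not to be that is the question".split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word(prev, rng=random.Random(0)):
    """Sample the next word in proportion to how often it followed `prev`."""
    words, weights = zip(*bigrams[prev].items())
    return rng.choices(words, weights=weights)[0]

# In this tiny corpus, "be" is the only word that ever followed "to",
# so the sampler emits "be" -- driven by probability, not by copying.
print(next_word("to"))  # -> be
```

The point of the sketch is that nothing in `bigrams` records where any particular word came from, which is why output cannot be traced back to individual training documents.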

But how, then, can authors be compensated for their work when appropriate? While it may not be possible to trace provenance with the current generative AI chatbots, that isn’t the end of the story. In the year or so since ChatGPT’s release, developers have been building applications on top of the existing foundation models. Many use retrieval-augmented generation (RAG) to allow an AI to “know about” content that isn’t in its training data. If you need to generate text for a product catalog, you can upload your company’s data and then send it to the AI model with the instructions: “Only use the data included with this prompt in the response.”

Though RAG was conceived as a way to use proprietary information without going through the labor- and computing-intensive process of training, it also incidentally creates a connection between the model’s response and the documents from which the response was created. That means we now have provenance, which brings us much closer to realizing Lanier’s vision of data dignity.
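A minimal sketch of how a RAG application could carry that provenance alongside its answer. Everything here is an assumption for illustration: the `Doc` type, the document IDs, and the `call_model` placeholder standing in for whatever LLM API an application actually uses.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    # A retrieved document plus the identifier used for attribution.
    doc_id: str
    text: str

def build_prompt(question, docs):
    """Instruct the model to answer only from the supplied documents."""
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in docs)
    return (
        "Only use the data included with this prompt in the response.\n\n"
        f"Documents:\n{context}\n\nQuestion: {question}"
    )

def answer_with_provenance(question, docs, call_model):
    # `call_model` is a placeholder for any LLM API. The returned source
    # IDs are the provenance trail: they record which documents the
    # answer was allowed to draw on, enabling attribution and royalties.
    return {
        "answer": call_model(build_prompt(question, docs)),
        "sources": [d.doc_id for d in docs],
    }

# Demo with a stubbed model call, so the sketch runs without any API.
docs = [Doc("ward-2017", "An excerpt licensed from the publisher ...")]
result = answer_with_provenance("Summarize the excerpt.", docs, lambda p: "(stub)")
print(result["sources"])  # -> ['ward-2017']
```

The design choice worth noting is that provenance falls out for free: the application already knows which documents it put into the prompt, so it can report them without any change to the underlying model.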

If we publish a human programmer’s currency-conversion software in a book, and our language model reproduces it in response to a question, we can attribute that to the original source and allocate royalties appropriately. The same would apply to an AI-generated novel written in the style of Ward’s (excellent) Sing, Unburied, Sing.

Google’s “AI-powered overview” feature is a good example of what we can expect with RAG. Since Google already has the world’s best search engine, its summarization engine should be able to respond to a prompt by running a search and feeding the top results into an LLM to generate the overview the user asked for. The model would provide the language and grammar, but it would derive the content from the documents included in the prompt. Again, this would provide the missing provenance.

Now that we know it is possible to produce output that respects copyright and compensates authors, regulators need to step up to hold companies accountable for failing to do so, just as they are held accountable for hate speech and other forms of inappropriate content. We should not accept leading LLM providers’ claim that the task is technically impossible. In fact, it is another of the many business-model and ethical challenges that they can and must overcome.

Moreover, RAG also offers at least a partial solution to the current AI “hallucination” problem. If an application (such as Google search) supplies a model with the data needed to construct a response, the probability of it generating something totally false is much lower than when it is drawing solely on its training data. An AI’s output thus could be made more accurate if it is limited to sources that are known to be reliable.
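Restricting a model to reliable sources can be as simple as an allowlist filter applied before retrieved documents reach the prompt. This is a hypothetical sketch, not any vendor's mechanism; the source names and data layout are invented.

```python
# Invented allowlist of sources the application trusts.
TRUSTED = {"company-handbook", "product-catalog"}

def filter_reliable(docs):
    """Drop any retrieved document whose source is not on the allowlist."""
    return [d for d in docs if d["source"] in TRUSTED]

docs = [
    {"source": "company-handbook", "text": "Returns accepted within 30 days."},
    {"source": "random-forum", "text": "Returns accepted any time!"},
]
print([d["source"] for d in filter_reliable(docs)])  # -> ['company-handbook']
```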

We are only just beginning to see what is possible with this approach. RAG applications will undoubtedly become more layered and complex. But now that we have the tools to trace provenance, tech companies no longer have an excuse for copyright unaccountability.

Mike Loukides, Vice President of Content Strategy for O’Reilly Media, Inc., is the author of System Performance Tuning (O’Reilly Media, Inc., 2002) and a co-author of Unix Power Tools (O’Reilly Media, Inc., 2002) and Ethics and Data Science (O’Reilly Media, Inc., 2018). Tim O’Reilly, Founder and CEO of O’Reilly Media, Inc., is a visiting professor at University College London Institute for Innovation and Public Purpose and the author of WTF? What’s the Future and Why It’s Up to Us (Harper Business, 2017).

Copyright: Project Syndicate, 2023. www.project-syndicate.org

