Copyright and generative AI

To kick off 2024, I’d like to talk about the current copyright situation for generative models.

This is a highly topical issue, since two lawsuits on the subject are currently before courts in the English-speaking world. The first, in Great Britain, pits the image library Getty Images against Stability AI, a company that supplies an image-generating model. The second is being brought in the USA by the New York Times against OpenAI and Microsoft.

In this article, I’ll give a general overview of the situation. I’ll cover the dispute between the New York Times and OpenAI/Microsoft in more detail in the next article, and I’ll also try to mention a few possible avenues of development.

As we shall see, the potential impact for the generative AI sector and its users is great. I’d like to encourage you to read through the following text. It may seem dry and fussy at first, but it’s well worth the effort.

Disclaimer: I’m not a lawyer, so the following is not legal advice.

With that in mind, let’s get started…

1. A few notions of copyright

Copyright confers on the author of a creative work a monopoly on the revenues resulting from its economic exploitation. In practice, these revenues derive from making reproductions of the work and communicating it to the public, neither of which can be done without the author’s prior authorization (usually in return for payment). Copyright also confers moral rights on the author, such as the right to be recognized as the author, but this is beyond the scope of this discussion.

This right of exploitation is limited in time: in Belgium, it lasts until 70 years after the author’s death. The work then passes into the public domain, meaning that it can be exploited economically without constraint.
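As a rough illustration of the arithmetic behind the 70-year term, here is a minimal Python sketch. It assumes the term is counted from 1 January of the year following the author’s death, which is the usual European convention but is not stated above; the function names are hypothetical, and real cases (joint authors, anonymous works, other countries) can differ.

```python
from datetime import date

# Assumption: a "life + 70 years" term, counted from 1 January of the year
# following the author's death. Other jurisdictions may use other rules.
TERM_YEARS = 70


def public_domain_date(death_year: int, term_years: int = TERM_YEARS) -> date:
    """Return the first day on which the work is in the public domain."""
    # Protection runs through 31 December of (death_year + term_years),
    # so the work becomes free to use on the following 1 January.
    return date(death_year + term_years + 1, 1, 1)


def is_public_domain(death_year: int, on: date) -> bool:
    """Check whether the work is in the public domain on a given day."""
    return on >= public_domain_date(death_year)
```

Under these assumptions, the works of an author who died in 1950 would remain protected through the end of 2020 and become freely usable on 1 January 2021.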

The term “creative activity” is fairly broad, covering not only literary, photographic, musical, sound and audiovisual artistic creations, but also computer software and applied art (clothing, furniture, architectural plans, objects, graphics, jewelry, etc.).

On the other hand, a legislative or administrative text, a satellite image or a painting by a monkey are not considered creations of the human mind. Nor can a technical invention be protected by copyright, but it can be protected by patent.

Finally, certain exceptions to copyright are accepted because they do not interfere with the normal exploitation of the work while serving the public interest. For example, presenting excerpts from works for educational purposes in schools, or for information purposes in the media, is authorized, as are uses for academic research.

These exceptions are often loosely grouped together under the American term “fair use” (strictly speaking a doctrine of US law, whereas European law relies on a list of specific exceptions). They are important because they will come into play in the discussion on AI. Is training a generative model on copyrighted data fair use? This is a complex question, at the heart of the dispute between OpenAI and the New York Times.

Having said that, let’s look at the points of friction between generative models and copyright. There are two main problems, the problem of training (upstream) and that of generation (downstream), as well as a third, related problem, that of artificial creation. Let’s look at them in turn.

2. The upstream problem: model training

The training problem is simple to understand: generative models need a prodigious volume of digital data for their training. This data comes from copies of much of the public Internet, made over time by programs that have siphoned off all the publicly accessible data they can find: social networks, search engines, digital libraries, newspapers, statistical databases, blogs, encyclopedias, etc.

This data is consolidated into huge aggregates, the best-known of which is Common Crawl, which is itself publicly accessible.

However, “publicly accessible” on the Internet in no way means that the author confers any rights on the user beyond simple online consultation. And therefore no implicit authorization to train an AI model…

To make matters worse, this problem is almost universal. With the exception of a small minority of texts in the public domain and the few AI-generated texts whose status is currently unclear, virtually everything else automatically falls under copyright.

While the problem is simple to understand, its solution is daunting: the whole of the Internet means millions, even tens of millions, of authors, texts whose authorship is often difficult to attribute, and for each of which the prior agreement of the rights holder would have to be obtained…

This is why the major players in the sector (OpenAI and others) have sought to short-circuit the problem by declaring that model training is a matter of fair use, and therefore does not require the prior agreement of rights holders.

The AI giants’ main argument is that the generation algorithms ingest so much data from different authors and transform it to such an extent that individual authors’ rights are not impacted. They also argue that the wider the access to data, the better the models will be, and that to deny them this access is a death sentence for an industry that is symbolic of progress, and which could make a huge contribution to society in the future.

The authors retort that algorithms abuse their creations for profit, and may infringe their exploitation rights. They point to examples of AI creations that are very similar, if not identical, to their own…

My layman’s intuition is that the technical arguments of the AI sector are valid (transformative character and volume of training data), but the argument of public utility is specious and serves as a screen for the lucrative aims of the players in generative AI…

The issue of rights to training data is crucial for the entire AI industry, which is largely based on data-intensive machine learning algorithms of all types, although generative AI (mainly images and text) crystallizes the problem given the potential competition with authors.

However, even if developers somehow obtain permission to use copyrighted data for model training, this does not necessarily mean that users are free to produce and distribute their generations as they see fit… which brings us to the downstream problem.

3. The downstream problem: generation

The generation problem is this: if a user uses an AI program to produce an image (or text) that is substantially similar to a protected work, who is responsible for the potential infringement (plagiarism)?

Is it the company that produced the AI tool? The user who guided the tool in its generation? The person who distributed the image? The platform used to distribute the image?

It’s useful to know that the companies that make the models available tend to push this responsibility back onto the user in their conditions of use: their position is that the user pilots the tool via the prompt and is responsible for what it generates and the use he or she then makes of it.

And the risk is real. Image and language models sometimes reproduce images or texts similar to what was in their training data.

A major complication is that this is possible not only if the user requests it, but also without the user having explicitly requested it: for example, it is possible to recreate images of Star Wars characters or vehicles without these terms appearing in the prompt. The same applies to the texts generated by the New York Times in its dispute with OpenAI: the newspaper managed to reproduce almost exact copies of some of its articles without the newspaper’s name appearing in the prompt.

In any case, this weakens the Pontius Pilate position of the model developers: it’s hard to put the blame on the user if the model creates counterfeits without the user’s knowledge. The question of respective responsibilities won’t be easy to decide.
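To give an idea of how the “almost exact copy” phenomenon described above might be measured for text, here is a minimal Python sketch using the standard library’s difflib. It is only an illustration: the function names are hypothetical, it is not the method used by the New York Times or by any court, and a word-overlap score is not the legal test for substantial similarity.

```python
from difflib import SequenceMatcher


def longest_shared_run(generated: str, original: str) -> str:
    """Return the longest run of words appearing verbatim in both texts."""
    gen_words = generated.split()
    orig_words = original.split()
    # Compare word by word; autojunk=False keeps frequent words in play.
    matcher = SequenceMatcher(None, gen_words, orig_words, autojunk=False)
    match = matcher.find_longest_match(0, len(gen_words), 0, len(orig_words))
    return " ".join(gen_words[match.a : match.a + match.size])


def shared_fraction(generated: str, original: str) -> float:
    """Fraction of the generated words covered by the longest verbatim run."""
    gen_words = generated.split()
    if not gen_words:
        return 0.0
    run_length = len(longest_shared_run(generated, original).split())
    return run_length / len(gen_words)
```

A long shared run between a model’s output and a protected article would be a warning sign worth a closer look, whereas a low score proves nothing either way, since paraphrase and visual similarity escape this kind of check.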

If you wish to understand this issue in more detail, I refer you to the excellent article by Gary Marcus and Reid Southen published a few days ago in IEEE Spectrum.

In any case, the generation problem depends on solving the training problem. The best outcome would be for the model developers to reach a (pecuniary) agreement with the authors that would allow both training AND unconstrained generation, killing two birds with one stone.

On the other hand, if the training issue is resolved to the detriment of the authors (for example, if the courts rule in favor of fair use), there is a great risk that the authors will turn against user-generated images to assert their rights, shifting the heart of the dispute from training to generation.

4. Artificial creation

As mentioned above, current copyright implies creation by a human being. But for the first time, a non-human creative activity becomes possible. AI generation therefore introduces another legal question: let’s forget for a moment about existing authors’ rights, and imagine a completely original artificial creation. Does this work, in turn, deserve some form of copyright protection?

And if future legislation were to assign copyright, to whom would it belong? The owner of the model or the user, or perhaps one day the AI itself?

Finally, we may need to distinguish between fully autonomous artificial creation and that in which the human continues to play a pilot role, via a prompt for example, assisted by an AI reduced to the role of a generative tool…

The question of artificial creation is important in principle, but its resolution is less urgent than the other two. So it’s likely to remain an open question for some time to come.

5. Thoughts

Copyright has a long history. Throughout its history, it has regularly found itself in conflict with technological progress. Imagine the reaction of 19th-century painters to the first photographs, or that of novelists to the first photocopiers in the 1970s, not to mention audio cassettes and VHS video recorders in the 1980s… Copyright has evolved over time without ceasing to play its protective role for creators. The advent of generative models is just the latest twist in this co-evolution.

A radical, albeit unlikely, outcome would be the outright banning of generative models. A similar scenario took place in 2001 with the shutdown of Napster following lawsuits brought by the recording industry and by artists including the band Metallica. Napster enabled users to download music free of charge, irrespective of applicable copyrights, a more direct transgression than that of generative models! Nevertheless, it serves as a reminder that technology does not always succeed in challenging copyright.

It’s also interesting to note that the European AI Act only deals with copyright indirectly, by requiring generative model developers to specify which copyrighted works have been used to train the model. This is not illogical, as copyright is subject to a separate set of European directives, and the essential clarifications will probably appear in a future iteration of these.

Moreover, it is quite possible that different jurisdictions will adopt different approaches. There is no guarantee that the USA and Europe will follow the same logic, especially as the risk of regulatory capture cannot be ruled out given the financial resources of the private players involved. Japan has already taken an initiative in this field, authorizing the training of generative models on data subject to copyright (subject to certain limitations).

Finally, one last complication: what about open-source generative models? Is it possible to organize remuneration for authors in the absence of financial flows from users to model developers? Will these models have to make do with public domain or even synthetic data for their training? Or will they disappear? As you can see, there’s a lot to think about, and the questions are technical, legal and financial.

PS: In this later blog, we reflect on copyright issues possibly driving model diversity as well.

6. References

AIdoes.eu…

Translated with DeepL and adapted from our partner Arnaud Stevins’ blog (Dec. 25th, 2023).

February 17, 2024