AI training and the global battle over copyright

The Invisible Fuel of AI

When we use tools like ChatGPT, Gemini, or Midjourney, it often seems like magic. A simple text input transforms into complex essays, poems, or photorealistic images. But this “artificial” creativity doesn’t come from nothing. It’s based on a foundation of human creativity: the collective knowledge and output of humanity available on the internet.

Large language and image models are only as good as the data they’re trained on. To build these models, companies like OpenAI, Google, and Stability AI have “read” the internet on a petabyte scale—books, articles, blog posts, artwork, photos, and code.

This very process—the training—has unleashed a global legal and ethical avalanche. The core question: Is it legal to use copyrighted works without permission and without compensation to train a commercial AI that might ultimately replace the copyright holders themselves?


The technical problem: How an AI “learns”

To understand the legal debate, one must first understand the technical process. Training an AI is not a simple copy operation of the kind familiar from everyday computing, such as duplicating a file.

Data Collection (Scraping): First, enormous amounts of data are collected. For text models, this often happens through “web scraping,” where bots automatically crawl the public internet and save texts (e.g., in the “Common Crawl” dataset). For image models, datasets like “LAION” were used, which collected billions of images and their text descriptions from the web.

Training (The “Learning Process”): The AI doesn’t “read” this data like a human. Instead, it analyzes statistical patterns, correlations, stylistic features, and semantic relationships. It learns “which word is most likely to follow another” or “which pixel patterns are associated with the word ‘cat’.”

The Result (The Model): The final product—the AI model—is a gigantic neural network consisting of billions of “parameters” (mathematical values). These parameters represent the learned knowledge. The model does not contain the original works themselves, but rather the patterns it has abstracted from them.
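The idea of learning "which word is most likely to follow another" can be illustrated with a deliberately tiny sketch. It is not how production models are built (those are neural networks with billions of parameters); it is a toy bigram counter, and the three-sentence corpus and the function names are assumptions made up for this illustration. It shows only that what is stored after "training" is a table of statistics rather than the sentences themselves.

```python
from collections import Counter, defaultdict
from typing import Optional


def train_bigram_model(corpus: list[str]) -> dict[str, Counter]:
    """Count, for every word, which words follow it and how often."""
    follow_counts: dict[str, Counter] = defaultdict(Counter)
    for text in corpus:
        words = text.lower().split()
        # Pair each word with its direct successor.
        for current_word, next_word in zip(words, words[1:]):
            follow_counts[current_word][next_word] += 1
    return dict(follow_counts)


def most_likely_next(model: dict[str, Counter], word: str) -> Optional[str]:
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical miniature "training set" (invented for this example).
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]

model = train_bigram_model(corpus)
print(most_likely_next(model, "the"))  # cat
print(most_likely_next(model, "sat"))  # on
```

The counts in `model` play the role of the "parameters" described above. Whether such learned statistics can still reproduce parts of the training data is exactly what the legal dispute turns on.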

AI companies argue: “It’s like a person going to a library, reading thousands of books, and then learning to write themselves. The person isn’t copying the books; they’re learning.”
Authors argue: “No, it’s like exploiting thousands of books to create a new one, without asking or paying the authors.”

The legal fronts: “Fair Use” vs. “Text and Data Mining”

The legal battle is being fought on two main fronts, the US and the EU, and each relies on a different legal instrument.

A) The US Front: The “Fair Use” Doctrine

In the US, the decisive factor is the “fair use” doctrine. It permits the use of copyrighted material under certain circumstances. Whether something qualifies as “fair use” is determined by four factors:

Purpose and character of the use: (Often treated as the most important point) Is the use “transformative”? Does it create something new with a new purpose, or does it simply replace the original?
Nature of the copyrighted work: (Creative works enjoy more protection than factual texts.)
Amount and substantiality of the portion used: (In AI training, the entire work is typically used, not just a quotation.)
Effect on the market: (Does the AI harm the market for the original work? Yes, say the artists, because AI output can replace theirs.)

AI companies say: Yes, it is highly transformative. An AI model is not a book or a collection of images, but a completely new tool that has learned patterns.

Creators say: No, it is not transformative if the result (e.g., an image in the style of artist X) directly competes with the work of artist X.

B) The EU Front: The “Text and Data Mining” Limitation

In the EU, the legal situation is less flexible and more strongly regulated by directives. The relevant one here is the Directive on Copyright in the Digital Single Market (DSM Directive) of 2019. It contains specific exceptions (limitations) for “Text and Data Mining” (TDM).


TDM for Research: TDM (i.e., the automated analysis of data) is generally permitted for scientific research purposes.
Commercial TDM: (This is where it gets complicated) TDM is also permitted for commercial purposes (such as training ChatGPT), but copyright holders can expressly reserve their rights (an “opt-out”).

This “rights reservation” (opt-out) must be machine-readable, e.g., through an entry in a website’s robots.txt file or in the metadata. However, many AI companies have argued that they collected data before this regulation was clear or before the copyright holders knew they had to object.
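What a machine-readable reservation in robots.txt might look like from a crawler's side can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the crawler name `ExampleAIBot`, and the helper `may_crawl` are hypothetical and chosen only for illustration; whether a given robots.txt entry is legally sufficient as an opt-out is a separate question that this sketch does not answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a publisher that blocks one AI crawler
# but leaves the site open to all other bots.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


article_url = "https://example.com/article"
print(may_crawl(ROBOTS_TXT, "ExampleAIBot", article_url))   # False
print(may_crawl(ROBOTS_TXT, "SomeSearchBot", article_url))  # True
```

A crawler that respects the reservation would run a check like this before saving a page. The mechanism is voluntary on the technical level: nothing in robots.txt physically prevents a bot from ignoring it.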

The major lawsuits: Who is fighting whom?

These theoretical conflicts are currently being played out in practice through multi-billion-dollar lawsuits.

Authors vs. OpenAI (e.g., Authors Guild, George R.R. Martin): Authors accuse OpenAI of illegally using their books to train ChatGPT. They argue that the AI can now write summaries of their books or even sequels in their style, directly infringing on their rights.

Artists vs. Stability AI (e.g., Sarah Andersen, Getty Images): Image generators like Stable Diffusion were trained on billions of images. Artists are suing because the AI has “learned” their unique style and can now create works “in their style” at the touch of a button. Getty Images even found remnants of its watermark in AI-generated images, which it cites as evidence that its database was used.

Publishers vs. AI (e.g., The New York Times vs. OpenAI/Microsoft): This is perhaps the strongest claim. The NYT argues not only that its articles were used for training, but also that the AI (ChatGPT/Bing) can now regurgitate its articles almost verbatim. This undermines its subscription model and constitutes direct competition.

The “output problem”: When the AI spits out the original.

Even if the training were considered legal (e.g., “transformative”), there’s a second copyright issue: the output.

What happens if the AI generates a result that is “substantially similar” to an existing work? For example:
Midjourney creates an image that is almost identical to a photograph by a specific photographer.
ChatGPT outputs code that has been copied verbatim from a GitHub repository (including the original programmer’s comments).
An AI generates music that clearly contains the melody of a copyrighted song.

Cases like these would amount to classic copyright infringement. The problem is proving it: How can an artist show that the AI produced a similar image because it was trained on their work, and not by coincidence? The New York Times has a strong case here, as it was able to document this “regurgitation” with concrete examples.
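For text, verbatim "regurgitation" can at least be made visible mechanically. The following is a minimal sketch, assuming a simple word-level comparison with Python's standard `difflib` module; the two sample texts and the function name `longest_shared_passage` are invented for illustration. It finds only literal overlap and says nothing about the legal test of "substantial similarity", which also covers paraphrase, images, and music.

```python
import re
from difflib import SequenceMatcher


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def longest_shared_passage(source: str, output: str) -> str:
    """Return the longest run of consecutive words found in both texts."""
    source_words = tokenize(source)
    output_words = tokenize(output)
    # autojunk=False disables the heuristic that ignores very frequent
    # tokens in long sequences, so common words are still compared.
    matcher = SequenceMatcher(None, source_words, output_words, autojunk=False)
    match = matcher.find_longest_match(
        0, len(source_words), 0, len(output_words)
    )
    return " ".join(source_words[match.a : match.a + match.size])


# Hypothetical original article sentence and hypothetical model output.
original = (
    "The city council voted on Tuesday to approve the new budget "
    "after months of debate"
)
generated = (
    "According to reports, the city council voted on Tuesday to approve "
    "the new budget, which was controversial"
)

passage = longest_shared_passage(original, generated)
print(passage)
print(len(passage.split()), "consecutive words shared")
```

A long shared run of words is a strong hint that the output reproduces the source instead of merely covering the same topic. How long a run must be before it matters legally is for courts to decide, not for the script.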

Solutions and the future of copyright

The status quo is a “Wild West” scenario that is unsustainable. Various solutions are currently being discussed and some are already being implemented:

Licensing models (The “Axel Springer approach”): More and more publishers and rights holders are entering into licensing agreements with AI companies. OpenAI, for example, pays Axel Springer (Bild, Welt) and the Associated Press (AP) for the legal right to use their (current) content for training. This ensures that the AI is trained with high-quality data and that the copyright holders receive compensation.

Strict opt-out systems: The idea of opting out could become the standard. Platforms like DeviantArt have already introduced switches that allow artists to exclude their work from AI training. The problem: Compliance is difficult to verify, and an opt-out does not apply retroactively to models that have already been trained.

Transparency Obligations (The “EU AI Act”): New regulations like the EU AI Act aim for transparency. Providers of general-purpose AI models will be required to publish a sufficiently detailed summary of the content used for training. This gives copyright holders at least the opportunity to assert their rights (e.g., to compensation).

Training with “Clean” Data: Some companies (e.g., Adobe with “Firefly”) are taking a different approach. They train their models exclusively with data that they themselves have licensed (e.g., from their own Adobe Stock database) or that is in the public domain. These models are legally “clean,” but often less powerful than their competitors trained with the “entire internet.”

Conclusion

The conflict between AI developers and creators is more than just a legal battle. It’s a fundamental negotiation about the value of data and creativity in the 21st century.

Courts and legislators face a difficult balancing act: How can they foster innovation without undermining the rights and economic livelihoods of the creators whose work makes that innovation possible in the first place? The rulings in the coming years will forever change the digital economy and the way we create and consume content.
