AI training and the global battle over copyright
The Invisible Fuel of AI
When we use tools like ChatGPT, Gemini, or Midjourney, it often seems like magic. A simple text input transforms into complex essays, poems, or photorealistic images. But this “artificial” creativity doesn’t come from nothing. It’s based on a foundation of human creativity: the collective knowledge and output of humanity available on the internet.
Large language and image models are only as good as the data they’re trained on. To build these models, companies like OpenAI, Google, and Stability AI have “read” the internet on a petabyte scale—books, articles, blog posts, artwork, photos, and code.
This very process—the training—has unleashed a global legal and ethical avalanche. The core question: Is it legal to use copyrighted works without permission and without compensation to train a commercial AI that might ultimately replace the copyright holders themselves?

The technical problem: How an AI “learns”
To understand the legal debate, one must understand the technical process. Training an AI is not a simple copy operation of the kind familiar from everyday computing.
Data Collection (Scraping): First, enormous amounts of data are collected. For text models, this often happens through “web scraping,” where bots automatically crawl the public internet and save texts (e.g., in the “Common Crawl” dataset). For image models, datasets like “LAION” were used, which catalog billions of image links and their text descriptions collected from the web.
Training (The “Learning Process”): The AI doesn’t “read” this data like a human. Instead, it analyzes statistical patterns, correlations, stylistic features, and semantic relationships. It learns “which word is most likely to follow another” or “which pixel patterns are associated with the word ‘cat’.”
The Result (The Model): The final product, the AI model, is a gigantic neural network consisting of billions of “parameters” (mathematical values). These parameters represent the learned knowledge. The model does not store the original works as files, but rather the patterns it has abstracted from them. In practice, however, some passages can be memorized closely enough to be reproduced, which is the “output problem” discussed further below.
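The idea of learning “which word is most likely to follow another” can be illustrated with a deliberately tiny sketch. It is not how ChatGPT or any production model is built: real systems use neural networks with billions of parameters, while this toy version only counts word pairs. The function names and the three-sentence corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        # Pair each word with its immediate successor.
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts


def most_likely_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy training data (an assumption, not a real dataset).
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]

model = train_bigram_model(corpus)
print(most_likely_next(model, "the"))  # "cat" follows "the" most often here
print(most_likely_next(model, "sat"))  # "on" is the only follower of "sat"
```

The resulting “model” is a table of counts rather than a copy of the sentences, which is the intuition behind the AI companies' argument. With enough capacity and repeated training data, however, such statistics can still reproduce long passages.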
- AI companies argue: “It’s like a person going to a library, reading thousands of books, and then learning to write themselves. The person isn’t copying the books; they’re learning.”
- Authors argue: “No, it’s like exploiting thousands of books to produce a new one, without asking or paying the authors.”
The legal fronts: “Fair Use” vs. “Text and Data Mining”
The legal battle is being fought on two main fronts with different types of weapons, primarily in the US and the EU.
A) The US Front: The “Fair Use” Doctrine
In the US, the decisive factor is the “fair use” doctrine. It permits the use of copyrighted material under certain circumstances. Whether something qualifies as “fair use” is determined by four factors:
- Purpose and character of the use: (The central point in the AI debate.) Is the use “transformative”? Does it create something new with a new purpose, or does it simply replace the original?
- Nature of the copyrighted work: (Creative works enjoy more protection than factual texts.)
- Amount and substantiality of the portion used: (In AI training, entire works are typically used, not just quotations.)
- Effect on the market: (Does AI harm the market for the original work? Yes, say the artists, because AI replaces them.)
AI companies say: Yes, it is highly transformative. An AI model is not a book or a collection of images, but a completely new tool that has learned patterns.
Creators say: No, it is not transformative if the result (e.g., an image in the style of artist X) directly competes with the work of artist X.
B) The EU Front: The “Text and Data Mining” Limitation
In the EU, the legal situation is less flexible and more strongly regulated by directives. The relevant instrument here is the Directive on Copyright in the Digital Single Market (DSM Directive, (EU) 2019/790) of 2019. It contains specific exceptions (limitations) for “Text and Data Mining” (TDM).
- TDM for Research: TDM (i.e., the automated analysis of data) is generally permitted for scientific research purposes.
- Commercial TDM: (This is where it gets complicated.) TDM is also permitted for commercial purposes (such as training ChatGPT), but copyright holders can object by reserving their rights (declaring an “opt-out”).
This “rights reservation” (opt-out) must be machine-readable, e.g., through an entry in a website’s robots.txt file or in the metadata. However, many AI companies have argued that they collected data before this regulation was clear or before the copyright holders knew they had to object.
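A machine-readable opt-out via robots.txt can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the `may_crawl` helper, and the example.com URL below are illustrative assumptions. “GPTBot” is used as an example crawler name, and whether a given crawler actually honors such an entry depends on its operator.

```python
import urllib.robotparser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if robots.txt permits `user_agent` to fetch `url`."""
    parser = urllib.robotparser.RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/articles/my-essay"
print(may_crawl(ROBOTS_TXT, "GPTBot", url))        # blocked by the opt-out
print(may_crawl(ROBOTS_TXT, "SomeOtherBot", url))  # falls under the "*" rule
```

A robots.txt entry is a request, not an access control. It only works if the crawler checks it, and it says nothing about copies collected before the entry existed.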
The major lawsuits: Who is fighting whom?
These theoretical conflicts are currently being played out in practice through multi-billion-dollar lawsuits.
Authors vs. OpenAI (e.g., Authors Guild, George R.R. Martin): Authors accuse OpenAI of illegally using their books to train ChatGPT. They argue that the AI can now write summaries of their books or even sequels in their style, directly infringing on their rights.
Artists and image agencies vs. Stability AI (e.g., Sarah Andersen, Getty Images): Image generators like Stable Diffusion were trained on billions of images. Artists are suing because the AI has “learned” their unique style and can now create works “in their style” at the touch of a button. Getty Images, in its own lawsuit, points to remnants of its watermark in AI-generated images as evidence that its database was used.
Publishers vs. AI (e.g., The New York Times vs. OpenAI/Microsoft): This is perhaps the strongest claim. The NYT argues not only that its articles were used for training, but also that the AI (ChatGPT/Bing) can now regurgitate its articles almost verbatim. This undermines its subscription model and constitutes direct competition.
The “output problem”: When the AI spits out the original
Even if the training were considered legal (e.g., “transformative”), there’s a second copyright issue: the output.
What happens if the AI generates a result that is “substantially similar” to an existing work? For example:
- Midjourney creates an image that is almost identical to a photograph by a specific photographer.
- ChatGPT spits out code that has been copied verbatim from a GitHub page (including the original programmer’s comments).
- An AI generates music that clearly contains the melody of a copyrighted song.
Cases like these would amount to classic copyright infringement. The problem is proving it: How can an artist prove that the AI didn’t “accidentally” create a similar image, but did so because it was trained on their work? The New York Times is in a comparatively strong position here, because its complaint documents concrete examples of this “regurgitation.”
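One simple way to make verbatim “regurgitation” measurable is to check how many word sequences of a generated text also appear in a source text. The following is a minimal sketch under stated assumptions: the function names, the window size of five words, and the sample sentences are all invented for illustration. It is not the method used in any of the lawsuits, and it is not a legal test for “substantial similarity.”

```python
def word_ngrams(text, n):
    """Return the set of all runs of n consecutive words in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(generated, source, n=5):
    """Share of the generated text's n-word runs that also occur in the source."""
    generated_ngrams = word_ngrams(generated, n)
    if not generated_ngrams:
        return 0.0  # text shorter than n words: nothing to compare
    source_ngrams = word_ngrams(source, n)
    return len(generated_ngrams & source_ngrams) / len(generated_ngrams)


# Made-up example texts (assumptions, not real articles or model outputs).
source = "the quick brown fox jumps over the lazy dog near the river bank"
copied = "the quick brown fox jumps over the lazy dog"
unrelated = "a slow green turtle walks under the busy bridge today"

print(verbatim_overlap(copied, source))     # every 5-word run is in the source
print(verbatim_overlap(unrelated, source))  # no 5-word run is shared
```

A high score only shows that long word sequences match. It cannot show why they match, and it misses paraphrases, imitated styles, and similar images, which is exactly why proof is so much harder for most artists.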
Solutions and the future of copyright
The status quo is a “Wild West” scenario that is unsustainable. Various solutions are currently being discussed and some are already being implemented:
Licensing models (The “Axel Springer approach”): More and more publishers and rights holders are entering into licensing agreements with AI companies. OpenAI, for example, pays Axel Springer (Bild, Welt) and the Associated Press (AP) for the legal right to use their (current) content for training. This ensures that the AI is trained with high-quality data and that the copyright holders receive compensation.
Strict opt-out systems: The idea of opting out could become the standard. Platforms like DeviantArt have already introduced switches that allow artists to exclude their work from AI training. The problem: It is difficult to control and does not apply retroactively to models that have already been trained.
Transparency Obligations (The “EU AI Act”): New regulations like the EU AI Act aim for transparency. AI providers will be required to disclose which copyrighted data they have used for training. This gives copyright holders at least the opportunity to assert their rights (e.g., to compensation).
Training with “Clean” Data: Some companies (e.g., Adobe with “Firefly”) are taking a different approach. They train their models exclusively with data that they themselves have licensed (e.g., from their own Adobe Stock database) or that is in the public domain. These models are legally “clean,” but often less powerful than their competitors trained with the “entire internet.”
Conclusion
The conflict between AI developers and creators is more than just a legal battle. It’s a fundamental negotiation about the value of data and creativity in the 21st century.
Courts and legislators face a difficult balancing act: How can they foster innovation without undermining the rights and economic livelihoods of the creators whose work makes that innovation possible in the first place? The rulings in the coming years will forever change the digital economy and the way we create and consume content.