Office, Karriere und Technik Blog

Transparency: To keep this blog free of charge, we use affiliate links. If you click one and buy something, we receive a small commission. The price stays the same for you. Win-win!

AI training and the global battle over copyright

The Invisible Fuel of AI

When we use tools like ChatGPT, Gemini, or Midjourney, it often seems like magic. A simple text input transforms into complex essays, poems, or photorealistic images. But this “artificial” creativity doesn’t come from nothing. It’s based on a foundation of human creativity: the collective knowledge and output of humanity available on the internet.

Large language and image models are only as good as the data they’re trained on. To build these models, companies like OpenAI, Google, and Stability AI have “read” the internet on a petabyte scale—books, articles, blog posts, artwork, photos, and code.

This very process—the training—has unleashed a global legal and ethical avalanche. The core question: Is it legal to use copyrighted works without permission and without compensation to train a commercial AI that might ultimately replace the copyright holders themselves?

The technical problem: How an AI “learns”

To understand the legal debate, one must understand the technical process. Training an AI model is not simply a matter of copying files from one place to another.

Data Collection (Scraping): First, enormous amounts of data are collected. For text models, this often happens through “web scraping,” where bots automatically crawl the public internet and save texts (e.g., in the “Common Crawl” dataset). For image models, datasets like “LAION” were used, which collected billions of images and their text descriptions from the web.
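To make the collection step more concrete, here is a minimal sketch of how image and description pairs could be pulled out of a web page. It assumes the "text description" is taken from the `alt` attribute of each `<img>` tag. The class name `ImageCaptionCollector`, the helper `collect_pairs`, and the sample HTML are hypothetical and only illustrate the idea; this is not the actual pipeline behind any real dataset.

```python
from html.parser import HTMLParser


class ImageCaptionCollector(HTMLParser):
    """Collects (image URL, description) pairs from <img> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        # Assumption: the alt text serves as the image description.
        alt = (attributes.get("alt") or "").strip()
        # Images without a usable description are skipped.
        if src and alt:
            self.pairs.append((src, alt))


def collect_pairs(html):
    """Return all (src, alt) pairs found in an HTML string."""
    collector = ImageCaptionCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


# Hypothetical page content, used only for illustration.
SAMPLE_HTML = """
<p>Holiday photos</p>
<img src="https://example.com/cat.jpg" alt="A cat sleeping on a sofa">
<img src="https://example.com/spacer.gif" alt="">
<img src="https://example.com/dog.jpg" alt="A dog on a beach" />
"""

if __name__ == "__main__":
    for src, alt in collect_pairs(SAMPLE_HTML):
        print(src, "->", alt)
```

Note that the sketch stores only a link and a description, not the image itself. Nothing in it checks who owns the picture, which is exactly the gap the legal debate is about.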

Training (The “Learning Process”): The AI doesn’t “read” this data like a human. Instead, it analyzes statistical patterns, correlations, stylistic features, and semantic relationships. It learns “which word is most likely to follow another” or “which pixel patterns are associated with the word ‘cat’.”

The Result (The Model): The final product, the AI model, is a gigantic neural network consisting of billions of “parameters” (mathematical values). These parameters represent the learned knowledge. The model does not contain the original works themselves, but rather the patterns it has abstracted from them.
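The idea of learning “which word is most likely to follow another” can be illustrated with a deliberately tiny sketch. It is a toy word-pair counter, not a neural network, and the function names `train_bigram` and `most_likely_next` as well as the three-sentence corpus are made up for this example. Real language models work with billions of learned numerical parameters rather than a simple count table.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for text in corpus:
        words = text.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    # The counts play the role of the "parameters" in this toy model.
    return dict(counts)


def most_likely_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical mini corpus, used only for illustration.
CORPUS = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "a dog sat on the rug",
]

if __name__ == "__main__":
    model = train_bigram(CORPUS)
    print(most_likely_next(model, "the"))
    print(most_likely_next(model, "sat"))
```

After training, the sketch keeps only the counts and no longer needs the sentences. How far this holds for large models, and how much of the original text can still be recovered from them, is one of the contested points in the lawsuits.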

  • AI companies argue: “It’s like a person going to a library, reading thousands of books, and then learning to write themselves. The person isn’t copying the books; they’re learning.”
  • Authors argue: “No, it’s like exploiting thousands of books to create a new one, without asking or paying the authors.”

The legal fronts: “Fair Use” vs. “Text and Data Mining”

The legal battle is being fought on two main fronts with different types of weapons, primarily in the US and the EU.

A) The US Front: The “Fair Use” Doctrine

In the US, the decisive factor is the “fair use” doctrine. It permits the use of copyrighted material under certain circumstances. Whether something qualifies as “fair use” is determined by four factors:

  • Purpose and character of the use: (Often treated as the most important factor.) Is the use “transformative”? Does it create something new with a new purpose, or does it simply replace the original?
  • Nature of the copyrighted work: (Creative works enjoy more protection than factual texts.)
  • Amount and substantiality of the portion used: (In AI training, the entire work is typically used, not just a quotation.)
  • Effect on the market: (Does AI harm the market for the original work? Yes, say the artists, because AI replaces them.)

AI companies say: Yes, it is highly transformative. An AI model is not a book or a collection of images, but a completely new tool that has learned patterns.

Creators say: No, it is not transformative if the result (e.g., an image in the style of artist X) directly competes with the work of artist X.

B) The EU Front: The “Text and Data Mining” Limitation

In the EU, the legal situation is less flexible and more strongly regulated by directives. The relevant one here is the Directive on Copyright in the Digital Single Market (DSM Directive) of 2019. It contains specific exceptions (limitations) for “Text and Data Mining” (TDM).

  • TDM for Research: TDM (i.e., the automated analysis of data) is generally permitted for scientific research purposes.
  • Commercial TDM: (This is where it gets complicated.) TDM is also permitted for commercial purposes (such as training ChatGPT), BUT copyright holders can reserve their rights (declare an “opt-out”).

This “rights reservation” (opt-out) must be machine-readable, e.g., through an entry in a website’s robots.txt file or in the metadata. However, many AI companies have argued that they collected data before this regulation was clear or before the copyright holders knew they had to object.
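As a rough illustration of what a machine-readable reservation in robots.txt can look like, the following sketch uses Python's standard `urllib.robotparser` module to check whether a given crawler is allowed to fetch a page. The robots.txt content, the helper `may_crawl`, and the example URL are assumptions for this example. `GPTBot` is used as an example crawler name, and whether a robots.txt entry is legally sufficient as an opt-out is a separate question that the code cannot answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is blocked, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/ai-and-copyright"
    for agent in ("GPTBot", "SomeOtherBot"):
        print(agent, "allowed:", may_crawl(ROBOTS_TXT, agent, page))
```

The check is purely voluntary on the crawler's side: robots.txt states a wish, and it is up to the operator of the bot to run a test like this and respect the result.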

The major lawsuits: Who is fighting whom?

These theoretical conflicts are currently being played out in practice through multi-billion-dollar lawsuits.

Authors vs. OpenAI (e.g., Authors Guild, George R.R. Martin): Authors accuse OpenAI of illegally using their books to train ChatGPT. They argue that the AI can now write summaries of their books or even sequels in their style, directly infringing on their rights.

Artists vs. Stability AI (e.g., Sarah Andersen, Getty Images): Image generators like Stable Diffusion were trained on billions of images. Artists are suing because the AI has “learned” their unique style and can now create works “in their style” at the touch of a button. Getty Images even found remnants of its watermark in AI-generated images, which it cites as evidence that its database was used.

Publishers vs. AI (e.g., The New York Times vs. OpenAI/Microsoft): This is perhaps the strongest claim. The NYT argues not only that its articles were used for training, but also that the AI (ChatGPT/Bing) can now regurgitate its articles almost verbatim. This undermines its subscription model and constitutes direct competition.

The “output problem”: When the AI spits out the original

Even if the training were considered legal (e.g., “transformative”), there’s a second copyright issue: the output.

What happens if the AI generates a result that is “substantially similar” to an existing work? For example:

  • Midjourney creates an image that is almost identical to a photograph by a specific photographer.
  • ChatGPT spits out code that has been copied verbatim from a GitHub page (including the original programmer’s comments).
  • An AI generates music that clearly contains the melody of a copyrighted song.

Such cases would amount to classic copyright infringement. The problem is proving it: How can an artist prove that the AI didn’t create a similar image “accidentally,” but because it was trained on their work? The New York Times has a comparatively strong case here, because it was able to document this “regurgitation” with concrete examples.
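For text, one simple way to make “regurgitation” measurable is to check how many word sequences of an AI output appear verbatim in the original. The following is a minimal sketch under that assumption. The functions `word_ngrams` and `verbatim_overlap`, the window size of five words, and the sample sentences are all hypothetical, and this is not the method used in any of the lawsuits mentioned above.

```python
def word_ngrams(text, n):
    """Return the set of all runs of n consecutive words in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(source, output, n=5):
    """Share of the output's n-word runs that also occur in the source."""
    output_runs = word_ngrams(output, n)
    if not output_runs:
        # Output is shorter than n words, so nothing can be compared.
        return 0.0
    shared = output_runs & word_ngrams(source, n)
    return len(shared) / len(output_runs)


# Hypothetical texts, used only for illustration.
ORIGINAL = "the quick brown fox jumps over the lazy dog near the river bank"
COPIED = "the quick brown fox jumps over the lazy dog"
PARAPHRASED = "a fast brown fox leaps over a sleepy dog"

if __name__ == "__main__":
    print("copied:", verbatim_overlap(ORIGINAL, COPIED))
    print("paraphrased:", verbatim_overlap(ORIGINAL, PARAPHRASED))
```

A high score points to verbatim copying, but a low score does not rule out infringement. A close paraphrase or an imitation of style would go unnoticed by this kind of check, which is part of why proof is so much harder for images and music.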

Solutions and the future of copyright

The status quo is a “Wild West” scenario that is unsustainable. Various solutions are currently being discussed and some are already being implemented:

Licensing models (The “Axel Springer approach”): More and more publishers and rights holders are entering into licensing agreements with AI companies. OpenAI, for example, pays Axel Springer (Bild, Welt) and the Associated Press (AP) for the legal right to use their (current) content for training. This ensures that the AI is trained with high-quality data and that the copyright holders receive compensation.

Strict opt-out systems: The idea of opting out could become the standard. Platforms like DeviantArt have already introduced switches that allow artists to exclude their work from AI training. The problem: It is difficult to control and does not apply retroactively to models that have already been trained.

Transparency Obligations (The “EU AI Act”): New regulations like the EU AI Act aim for transparency. AI providers will be required to disclose which copyrighted data they have used for training. This gives copyright holders at least the opportunity to assert their rights (e.g., to compensation).

Training with “Clean” Data: Some companies (e.g., Adobe with “Firefly”) are taking a different approach. They train their models exclusively with data that they themselves have licensed (e.g., from their own Adobe Stock database) or that is in the public domain. These models are legally “clean,” but often less powerful than their competitors trained with the “entire internet.”

Conclusion

The conflict between AI developers and creators is more than just a legal battle. It’s a fundamental negotiation about the value of data and creativity in the 21st century.

Courts and legislators face a difficult balancing act: How can they foster innovation without undermining the rights and economic livelihoods of the creators whose work makes that innovation possible in the first place? The rulings in the coming years will forever change the digital economy and the way we create and consume content.

About the Author:

Michael W. Suhr, Dipl. Betriebswirt | Web design and consulting | Office training
After 20 years in logistics, I turned my hobby, which has accompanied me since the mid-1980s, into a profession. Since the beginning of 2015 I have been working as a freelancer in web design, web consulting, and Microsoft Office. On the side, as far as time allows, I write articles on my blog to promote digital literacy.