Der Blog für digitale Kompetenz

AI training and the global battle over copyright

The Invisible Fuel of AI

When we use tools like ChatGPT, Gemini, or Midjourney, it often seems like magic. A simple text input transforms into complex essays, poems, or photorealistic images. But this “artificial” creativity doesn’t come from nothing. It’s based on a foundation of human creativity: the collective knowledge and output of humanity available on the internet.

Large language and image models are only as good as the data they’re trained on. To build these models, companies like OpenAI, Google, and Stability AI have “read” the internet on a petabyte scale—books, articles, blog posts, artwork, photos, and code.

This very process—the training—has unleashed a global legal and ethical avalanche. The core question: Is it legal to use copyrighted works without permission and without compensation to train a commercial AI that might ultimately replace the copyright holders themselves?

The technical problem: How an AI “learns”

To understand the legal debate, one must first understand the technical process. Training an AI model is not a simple copy operation of the kind familiar from everyday computing.

Data Collection (Scraping): First, enormous amounts of data are collected. For text models, this often happens through “web scraping,” where bots automatically crawl the public internet and save texts (e.g., in the “Common Crawl” dataset). For image models, datasets like “LAION” were used, which catalogued billions of images found on the web together with their text descriptions.

Training (The “Learning Process”): The AI doesn’t “read” this data like a human. Instead, it analyzes statistical patterns, correlations, stylistic features, and semantic relationships. It learns “which word is most likely to follow another” or “which pixel patterns are associated with the word ‘cat’.”
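
To make “which word is most likely to follow another” concrete, here is a deliberately tiny sketch: a word-pair (bigram) counter. It is not how large language models are actually built (they use neural networks with billions of parameters), and the function names and the sample sentence are made up for illustration. It only shows the basic idea that training produces statistics about the text rather than a stored copy of it.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text: str) -> dict[str, Counter]:
    """Count, for every word, which words follow it and how often."""
    words = text.lower().split()
    follow_counts: dict[str, Counter] = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follow_counts[current_word][next_word] += 1
    return follow_counts


def most_likely_next(follow_counts: dict[str, Counter], word: str):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    followers = follow_counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical miniature "training corpus".
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram_counts(corpus)

print(most_likely_next(model, "the"))  # cat ("cat" follows "the" twice, "mat" once)
print(most_likely_next(model, "dog"))  # None (never seen during training)
```

The resulting `model` holds only counts of word pairs. A real model abstracts far more aggressively, but the principle of deriving patterns from the training text is the same.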

The Result (The Model): The final product, the AI model, is a gigantic neural network consisting of billions of “parameters” (mathematical values). These parameters represent the learned knowledge. The model does not contain the original works themselves, but rather the patterns it has abstracted from them.

  • AI companies argue: “It’s like a person going to a library, reading thousands of books, and then learning to write themselves. The person isn’t copying the books; they’re learning.”
  • Authors argue: “No, it’s like exploiting thousands of books to create a new one, without asking or paying the authors.”

The legal fronts: “Fair Use” vs. “Text and Data Mining”

The legal battle is being fought on two main fronts, the US and the EU, each with its own legal instruments.

A) The US Front: The “Fair Use” Doctrine

In the US, the decisive factor is the “fair use” doctrine. It permits the use of copyrighted material under certain circumstances. Whether something qualifies as “fair use” is determined by four factors:

  • Purpose and character of the use: (The most important point) Is the use “transformative”? Does it create something new with a new purpose, or does it simply replace the original?
  • Nature of the copyrighted work: (Creative works enjoy more protection than factual texts.)
  • Amount and substantiality of the portion used: (In AI training, entire works are typically used, not just quotations.)
  • Effect on the market: (Does AI harm the market for the original work? Yes, say the artists, because AI replaces them.)

AI companies say: Yes, it is highly transformative. An AI model is not a book or a collection of images, but a completely new tool that has learned patterns.

Creators say: No, it is not transformative if the result (e.g., an image in the style of artist X) directly competes with the work of artist X.

B) The EU Front: The “Text and Data Mining” Limitation

In the EU, the legal situation is less flexible and more strongly regulated by directives. The relevant instrument here is the Directive on Copyright in the Digital Single Market (DSM Directive) of 2019. In Articles 3 and 4, it contains specific exceptions (limitations) for “Text and Data Mining” (TDM).

  • TDM for Research: TDM (i.e., the automated analysis of data) is generally permitted for scientific research purposes.
  • Commercial TDM: (This is where it gets complicated) TDM is also permitted for commercial purposes (such as training ChatGPT), BUT: copyright holders can expressly reserve their rights (declare an “opt-out”).

This “rights reservation” (opt-out) must be machine-readable, e.g., through an entry in a website’s robots.txt file or in the metadata. However, many AI companies have argued that they collected data before this regulation was clear or before the copyright holders knew they had to object.
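
As an illustration of what “machine-readable” can mean in practice, the following sketch uses Python’s standard `urllib.robotparser` module to check whether a crawler is allowed to fetch a page according to a robots.txt file. The bot name `ExampleAIBot`, the URL, and the rules are invented for this example. Whether a robots.txt entry is legally sufficient as a rights reservation is a question for the courts, not something this code can answer.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: one AI crawler is locked out, everyone else may crawl.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

page = "https://example.com/article"
print(may_crawl(EXAMPLE_ROBOTS_TXT, "ExampleAIBot", page))   # False
print(may_crawl(EXAMPLE_ROBOTS_TXT, "SomeSearchBot", page))  # True
```

A check like this only works if the crawler actually performs it before downloading a page. robots.txt is a convention, not a technical barrier.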

The major lawsuits: Who is fighting whom?

These theoretical conflicts are currently being played out in practice through multi-billion-dollar lawsuits.

Authors vs. OpenAI (e.g., Authors Guild, George R.R. Martin): Authors accuse OpenAI of illegally using their books to train ChatGPT. They argue that the AI can now write summaries of their books or even sequels in their style, directly infringing on their rights.

Artists vs. Stability AI (e.g., Sarah Andersen, Getty Images): Image generators like Stable Diffusion were trained on billions of images. Artists are suing because the AI has “learned” their unique style and can now create works “in their style” at the touch of a button. Getty Images points to remnants of its watermark in AI-generated images as evidence that its database was used.

Publishers vs. AI (e.g., The New York Times vs. OpenAI/Microsoft): This is perhaps the strongest claim. The NYT argues not only that its articles were used for training, but also that the AI (ChatGPT/Bing) can now regurgitate its articles almost verbatim. This undermines its subscription model and constitutes direct competition.

The “output problem”: When the AI spits out the original

Even if the training were considered legal (e.g., “transformative”), there’s a second copyright issue: the output.

What happens if the AI generates a result that is “substantially similar” to an existing work? For example:

  • Midjourney creates an image that is almost identical to a photograph by a specific photographer.
  • ChatGPT outputs code that has been copied verbatim from a GitHub page (including the original programmer’s comments).
  • An AI generates music that clearly contains the melody of a copyrighted song.

In these cases, the output itself may constitute a classic copyright infringement. The difficulty lies in proving it: How can an artist show that the AI did not produce a similar image by coincidence, but because it was trained on their work? The New York Times is in a comparatively strong position here, because it was able to document concrete examples of this “regurgitation.”
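
For text, verbatim overlap can at least be measured. The sketch below is a simple heuristic, not the method used in any of the lawsuits mentioned here: it finds the longest run of consecutive words that an original text and an AI output have in common. The function name and both sample texts are invented for illustration, and punctuation is not normalized.

```python
def longest_shared_word_run(source: str, output: str) -> int:
    """Length (in words) of the longest run of consecutive words found in both texts."""
    source_words = source.lower().split()
    output_words = output.lower().split()
    best = 0
    # Dynamic programming for the longest common substring, applied to words
    # instead of characters. Only the previous row is kept to save memory.
    previous_row = [0] * (len(output_words) + 1)
    for source_word in source_words:
        current_row = [0] * (len(output_words) + 1)
        for j, output_word in enumerate(output_words, start=1):
            if source_word == output_word:
                current_row[j] = previous_row[j - 1] + 1
                best = max(best, current_row[j])
        previous_row = current_row
    return best


# Hypothetical original article and hypothetical AI output.
article = "the council voted on tuesday to approve the new budget after a long debate"
generated = "reports say the council voted on tuesday to approve the plan"

print(longest_shared_word_run(article, generated))  # 8
```

A long shared run in a long text is hard to explain as coincidence, whereas a run of two or three common words says very little. Where exactly the threshold lies is a judgment call, and similarity in style or in images cannot be captured this way at all.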

Solutions and the future of copyright

The status quo is a “Wild West” scenario that is unsustainable. Various solutions are currently being discussed and some are already being implemented:

Licensing models (The “Axel Springer approach”): More and more publishers and rights holders are entering into licensing agreements with AI companies. OpenAI, for example, pays Axel Springer (Bild, Welt) and the Associated Press (AP) for the legal right to use their (current) content for training. This ensures that the AI is trained with high-quality data and that the copyright holders receive compensation.

Strict opt-out systems: The idea of opting out could become the standard. Platforms like DeviantArt have already introduced switches that allow artists to exclude their work from AI training. The problem: It is difficult to control and does not apply retroactively to models that have already been trained.

Transparency Obligations (The “EU AI Act”): New regulations like the EU AI Act aim for transparency. Providers of general-purpose AI models will be required to publish a sufficiently detailed summary of the content they have used for training. This gives copyright holders at least a starting point for asserting their rights (e.g., to compensation).

Training with “Clean” Data: Some companies (e.g., Adobe with “Firefly”) are taking a different approach. They train their models exclusively with data that they themselves have licensed (e.g., from their own Adobe Stock database) or that is in the public domain. These models are legally “clean,” but often less powerful than their competitors trained with the “entire internet.”

Conclusion

The conflict between AI developers and creators is more than just a legal battle. It’s a fundamental negotiation about the value of data and creativity in the 21st century.

Courts and legislators face a difficult balancing act: How can they foster innovation without undermining the rights and economic livelihoods of the creators whose work makes that innovation possible in the first place? The rulings in the coming years will forever change the digital economy and the way we create and consume content.

About the Author:

Michael W. Suhr
Dipl. Betriebswirt | Web Design and Consulting | Office Training
After 20 years in logistics, I turned my hobby, which has accompanied me since the mid-1980s, into a profession, and have been working as a freelancer in web design, web consulting and Microsoft Office since the beginning of 2015. On the side, I write articles for more digital competence in my blog as far as time allows.