IP Law DOI: 10.38023/14cdbd3b-a488-4d47-aa91-4024a032438c

Verbatim Memorization in language models and EU Copyright Law

Stepanka Havlikova
Fields of law:

IP Law

Collection:

IRIS 2025 Conference Proceedings

Suggested citation: Stepanka Havlikova, Verbatim Memorization in language models and EU Copyright Law, in: Jusletter IT 30 April 2025

Empirical studies suggest that language models, although they do not technically store the raw training dataset, are statistical models that assign a probability to a sequence of words, and that hundreds of verbatim text sequences from the training data can be extracted from them. Thus, if language models are trained on publicly available data, such data memorisation might lead to infringement of copyright and database rights. The recently adopted set of two exceptions from copyright and database protection for the purposes of so-called “text and data mining” (TDM), introduced by the CDSM Directive, could prove pivotal in justifying the use of publicly available data to train artificial intelligence. However, the applicability of the TDM exceptions is limited both as to purpose, namely generating new information, and as to the scope of permitted acts, which covers solely the reproduction or extraction of protected content. Although additional measures are adopted for language models to prevent data memorisation and the dissemination of verbatim snippets, such as de-duplication or output filters, these measures might not be bulletproof, especially because of jailbreaking, which may manipulate AI models into bypassing them. The question remains: is there a meaningful solution that prevents copyright infringement without hindering the training of language models on publicly available data?


Table of contents

  • 1. Introduction
  • 2. Verbatim Memorisation
  • 3. Data Memorisation as Copyright Infringement
  • 4. Text and Data Mining Exceptions as Legal Basis
  • 5. Practical Solution: De-duplication or Output Filters?
  • 6. Originality in the Sense of the Author’s Intellectual Creation versus Statistical Probability of AI Outputs
  • 7. Impact on AI Transparency
  • 8. Conclusion
  • 9. References