Constrained Retrieval-Augmented Language Models

Project Goals

The CORAL project researches methods for constructing and using large language models (LLMs) that are subject to legal, technical, and qualitative constraints. We focus on two criteria that are indispensable for the professional use of language models: compliance with legal requirements for LLM training data, and the referential provenance of generated texts. To this end, we are researching new methods for the constrained training of LLMs and for retrieval-augmented generation (RAG) of texts.

Data

CORAL uses data from the partners involved. This includes the digital holdings of the German National Library (DNB); petabyte-scale web crawls from the Internet Archive and the Common Crawl; many years of European-language news crawls from Wortschatz Leipzig; and proprietary data from the financial sector. Apart from the Common Crawl, this data has so far not been usable for training LLMs, as it is not made publicly available in its original form for legal reasons. We are therefore investigating the extent to which this data can legally be used to train LLMs in obfuscated form, and how far the obfuscation may go while still yielding useful large language models.

Research Questions

The central research questions are: Which training methods and model architectures are robust against data constraints? How resource-efficiently can useful large language models be trained? Which methods of obfuscation, unlearning, and negated augmentation effectively prevent the disclosure of protected data? How can the transparency, soundness, originality, and referenceability of the generated texts be ensured? How vulnerable are the methods used to secure the training data of LLMs?

CORAL is thus making important contributions to the future establishment of a German market for large language models.

Publications

tba.

Partners

Institute for Applied Informatics / Institut für Angewandte Informatik e.V.

University of Kassel / Universität Kassel

Anhalt University of Applied Sciences / Hochschule Anhalt

German National Library / Deutsche Nationalbibliothek