
AI developers who source data from online platforms to train general-purpose AI (“<span class="news-text_medium">GPAI</span>”) models must now compile a list of the websites from which they source the most data and publish it on their own websites. The obligation stems from the <span class="news-text_italic-underline">EU AI Act</span>, whose rules for GPAI models will begin to apply on 2 August 2025. It aims to increase transparency in the training of GPAI models, which often involves large amounts of data, including copyrighted content.
Dr Nils Rauer, an expert in AI regulation and intellectual property law, explained that Article 53(1)(d) of the <span class="news-text_italic-underline">EU AI Act</span> obligates providers of GPAI models to prepare and publicly release a detailed summary of the data used in training their models. This summary should include information from both the pre-training and training phases and focus primarily on copyrighted content, although it covers all types of proprietary information used in the training process.
The new template released by the AI Office provides a framework for companies to comply with this obligation. It emphasises the importance of transparency but also takes into account the need to protect trade secrets and confidential business information. While providers are not required to disclose specific details about the data used, they must offer a summary that provides enough detail to ensure meaningful transparency and enable parties with legitimate interests to exercise their rights under EU law.
Providers of GPAI models that use data scraped from online sources must list the top 10% of the domain names they scraped, ranked by the amount of data sourced from each. Smaller providers need only list the top 5% or up to 1,000 domains. Private datasets not commercially licensed by rightsholders need to be disclosed only if they are publicly known or if the provider chooses to disclose them.
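As a rough illustration only, the domain-listing rule described above could be sketched as follows. The function name, the use of bytes as the measure of "amount of data", the rounding up of percentages, and the reading of the smaller-provider rule as "top 5% or 1,000 domains, whichever is lower" are all assumptions made for this sketch, not wording taken from the template.

```python
def domains_to_disclose(bytes_by_domain, is_sme=False):
    """Return the domains to list, largest contributors first.

    Hypothetical helper. Assumptions: size is measured in bytes,
    percentages are rounded up, and smaller providers list the top 5%
    or 1,000 domains, whichever is lower.
    """
    # Rank by scraped volume (descending); break ties alphabetically.
    ranked = sorted(bytes_by_domain.items(), key=lambda kv: (-kv[1], kv[0]))
    total = len(ranked)

    percent = 5 if is_sme else 10
    # Integer ceiling division avoids floating-point rounding surprises.
    count = (total * percent + 99) // 100
    if is_sme:
        count = min(count, 1000)

    return [domain for domain, _ in ranked[:count]]


if __name__ == "__main__":
    # Made-up volumes for 20 placeholder domains.
    sample = {f"site{i:02d}.example": 1000 - i for i in range(20)}
    print(domains_to_disclose(sample))
    print(domains_to_disclose(sample, is_sme=True))
```

A provider would need to check the template itself for how volume is measured and how the thresholds are rounded before relying on any such calculation.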
The summary must cover all stages of model training, from pre-training to post-training, and should be updated to reflect any new data used in the training process. The information must be published on the provider's website and on any other channels used to distribute the GPAI models.
The rules set out in Article 53 of the <span class="news-text_italic-underline">EU AI Act</span> will take effect on 2 August 2025, but providers of GPAI models placed on the market before this date have until 2 August 2027 to comply with the new disclosure requirements. Providers may also be allowed to exclude certain information if they cannot access it or if retrieving it would impose a disproportionate burden.
Failure to comply with the requirements outlined in the new template could result in fines of up to 3% of a provider's total worldwide annual turnover or €15 million, whichever is higher. Enforcement of these rules will begin on 2 August 2026. The publication of the template follows the release of the finalised GPAI code of practice, which providers can voluntarily follow to demonstrate compliance with the <span class="news-text_italic-underline">EU AI Act's</span> provisions.
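To make the "whichever is higher" ceiling on fines concrete, here is a minimal arithmetic sketch. The function name and the use of `Decimal` are illustrative choices, and the result is only the upper bound described above, not a prediction of any actual penalty.

```python
from decimal import Decimal

FIXED_CAP_EUR = Decimal("15000000")   # €15 million
TURNOVER_RATE = Decimal("0.03")       # 3% of annual worldwide turnover


def max_fine_eur(annual_worldwide_turnover_eur):
    """Upper bound on the fine: the higher of 3% of turnover or €15 million.

    Hypothetical helper for illustration only.
    """
    turnover = Decimal(str(annual_worldwide_turnover_eur))
    return max(turnover * TURNOVER_RATE, FIXED_CAP_EUR)


if __name__ == "__main__":
    for turnover in (100_000_000, 500_000_000, 1_000_000_000):
        print(turnover, max_fine_eur(turnover))
```

Under this reading, the €15 million figure is the ceiling for any provider with turnover below €500 million, and the 3% figure takes over above that level.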