EU AI Act · NIST AI 600-1 · GPAI

AI training data documentation must reflect what was true at training time.

That is a different and much harder question than whether documentation exists today. It is the question on which the EU AI Act, courts hearing training-data copyright cases, and US copyright regulators are now deciding AI compliance.

[Diagram: risk-assessment.docx is hashed with SHA-256 on device and stays on device; the hash 0xa3f9…b2c7 is written to the XRPL public ledger and is public.]
The regulations

Every framework asks the same question: when.

The EU AI Act, NIST, and US copyright regulators all converge on the same requirement: prove your training data record existed, in this form, at this date, to a party that does not trust you.

EU AI Act · Article 53(1)(a)
Technical documentation

Providers of general-purpose AI models must draw up and keep up-to-date the technical documentation of the model, including its training and testing process and the results of its evaluation. The documentation must reflect what was true at training time, not what can be reconstructed after a complaint. Obligations applied from 2 August 2025.

EU AI Act · Annex XI, Section 1(2)(c)
Training data provenance

Technical documentation must include type and provenance of data and curation methodologies. Provenance means where the data came from, how it was curated, and in what form. A model card written after training cannot demonstrate that provenance was recorded at training time.

EU AI Act · Article 10(2) + Annex IV item 2
Data governance

Training data for high-risk AI systems must be subject to data governance practices, and technical documentation must cover training methodologies and the provenance of those data sets. Provenance is named twice in the regulation at two different levels of specificity.

EU AI Act · Article 55(1)(a)
Adversarial testing documentation

Providers of GPAI models with systemic risk must perform and document adversarial testing. Documentation of adversarial testing is only useful if it can be shown to predate deployment, not postdate a complaint. The contemporaneity requirement is implicit in the purpose.

EU AI Act · Article 12(1)
Automatic logging

High-risk AI systems shall technically allow for the automatic recording of events (logs) over the lifetime of the system. Logs that can be altered by the provider they are supposed to audit are not logs that satisfy this requirement. The integrity of the log is inseparable from its utility.
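One common way to make a log tamper-evident is to chain its entries by hash, so that changing any past entry invalidates every later link. The following is a minimal Python sketch of that idea, not a description of what Article 12 prescribes or of any named product; `append_entry` and `verify_chain` are hypothetical helper names chosen for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with this event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list, event: dict) -> None:
    """Append an event, linking it to the hash of the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append(
        {"event": event, "prev": prev_hash, "hash": entry_hash(prev_hash, event)}
    )


def verify_chain(log: list) -> bool:
    """Recompute every link; an edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        expected = entry_hash(prev_hash, entry["event"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"step": 1, "action": "dataset_loaded"})
append_entry(log, {"step": 2, "action": "training_started"})
chain_ok_before = verify_chain(log)

log[0]["event"]["action"] = "edited_later"  # simulate a retroactive edit
chain_ok_after = verify_chain(log)
```

A chain like this only helps if its latest hash is also recorded somewhere the log's owner cannot rewrite; otherwise the owner could rebuild the whole chain after an edit.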

NIST AI 600-1 (July 2024)
Content provenance

The US Generative AI Profile identifies content provenance as a cross-cutting requirement throughout the AI risk lifecycle. NIST recommends logging, metadata annotation, and documentation of the source, legal rights, privacy status, generation date, method, and lineage of training data. The profile identifies the gap explicitly: AI developers often fail to vet or adequately document the training data they are using.
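The items NIST lists can be captured as a structured record written when data is ingested. The sketch below assumes hypothetical field names and example values; it is an illustration, not a schema defined by NIST AI 600-1.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """One record per dataset; field names are illustrative assumptions."""

    source: str
    legal_rights: str
    privacy_status: str
    generation_date: str  # ISO 8601 date string
    method: str
    lineage: tuple = ()  # ordered processing steps applied to the data


def to_canonical_json(record: ProvenanceRecord) -> str:
    """Serialise with sorted keys so equal records give identical text."""
    return json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))


record = ProvenanceRecord(
    source="https://example.com/corpus-v1",  # hypothetical URL
    legal_rights="licensed",
    privacy_status="no personal data identified",
    generation_date="2025-01-15",
    method="web crawl",
    lineage=("raw-crawl", "dedup", "filter"),
)
canonical = to_canonical_json(record)
```

Canonical serialisation matters because the same record must always produce the same bytes if it is later hashed or compared.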

US Copyright Office · Part 2 (January 2025)
Human authorship requirement

Human authorship is a bedrock requirement for copyright protection. Works in training datasets that contain human creative expression are protected. An AI provider that cannot demonstrate exactly what human-authored content was in its training corpus, and on what basis its use was authorised, lacks the evidential foundation for a clean copyright defence.

The evidence gap

A model card written after training is forensically indistinguishable from one written at training time.

Documentation in Notion, Confluence, SharePoint, and Hugging Face model cards is editable. File timestamps can be rewritten. Metadata can be changed. No document audit can tell the difference.
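To illustrate the point about timestamps: a file's modification time is ordinary metadata that any process with write access can set. A short sketch using only the Python standard library; the file name is hypothetical.

```python
import os
import tempfile
from datetime import datetime, timezone

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_card.md")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write("Model card written today.\n")

    # Backdate the access and modification times to 1 January 2023 (UTC).
    backdated = datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp()
    os.utime(path, (backdated, backdated))

    # The filesystem now reports the backdated time as the modification time.
    reported = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    reported_year = reported.year
```

Nothing in the file itself records that the timestamp was changed, which is why a filesystem date alone is weak evidence of when a document was written.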

Model cards

Written after training concludes, sometimes months after, sometimes in response to litigation.

Data sheets

Those completed after the fact are forensically indistinguishable from those completed at training time.

Cloud storage

Retention settings do not cover provider-level modification, government orders, or operational bugs.

Cases

The litigation is already here. The question in every case is when.

Providers cannot point to a contemporaneous, tamper-evident record of exactly what was in their training datasets at the moment training ran.

Getty Images v. Stability AI
US District Court, D. Del. · 1:23-cv-00135
Ongoing

The exact composition of the LAION training corpus at the time Stable Diffusion trained, and the CMI status of images at ingestion, cannot be established from retrospective documentation.

New York Times v. OpenAI / Microsoft
US District Court, S.D.N.Y. · 1:23-cv-11195
Ongoing

Scope of the training corpus; what NYT content was ingested; when training ran on which dataset version. Retrospective model cards cannot answer these questions.

Thomson Reuters v. ROSS Intelligence
US District Court, D. Del. · February 2025 ruling
First US ruling rejecting AI training fair use

Which headnotes were in training data and whether use was transformative. ROSS could not demonstrate with contemporaneous records what was in the dataset at training time.

Andersen v. Stability AI
US District Court, N.D. Cal. · 3:23-cv-00201
Ongoing

Composition of LAION training datasets at training time and CMI presence at ingestion. Section 1202 claims were dismissed for insufficient CMI evidence, illustrating the evidentiary gap.

Concord Music v. Anthropic
US District Court, M.D. Tenn. · 3:23-cv-01092
Ongoing

What music content was in Claude's training corpus and whether the acquisition method was lawful. Contemporaneous dataset records did not exist.

EU AI Act enforcement
EU AI Office · from 2026
Enforcement horizon

Whether GPAI technical documentation under Article 53 and Annex XI reflects training-data provenance at training time, not reconstruction prepared after the fact.

Litigation costs in training-data copyright suits are running to tens of millions of dollars before trial. The EU AI Act carries penalties of up to EUR 15 million or 3% of worldwide annual turnover, whichever is higher, for failure to produce compliant technical documentation.

How it works

The dataset stays private. The proof is public.

A SHA-256 hash is a one-way fingerprint. Anyone holding the original can verify that it matches. No one can reconstruct the original from it.
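A small Python illustration of the fingerprint property, using the standard `hashlib` module. The sample byte strings and the `matches` helper are made up for the example.

```python
import hashlib

# Two inputs that differ by a single word.
original = b"caption: a photo of a lighthouse at dusk\n"
altered = b"caption: a photo of a lighthouse at dawn\n"

digest_original = hashlib.sha256(original).hexdigest()
digest_altered = hashlib.sha256(altered).hexdigest()


def matches(data: bytes, expected_hex: str) -> bool:
    """Return True if the data hashes to the expected SHA-256 digest."""
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()


original_verifies = matches(original, digest_original)
altered_verifies = matches(altered, digest_original)
```

The digest is always 64 hexadecimal characters regardless of input size, and even a small change to the input produces a different digest.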

01
Dataset stays on your infrastructure

The SHA-256 hash is computed locally. The training dataset, model checkpoint, or evaluation log is never transmitted to, stored by, or visible to immut.
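How a multi-file dataset is reduced to a single hash is not specified here. One plausible approach, sketched below under that assumption, is to stream each file through SHA-256 and then hash a sorted manifest of relative paths and file digests. The function names are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256_directory(root: str) -> str:
    """Hash a manifest of (relative path, file digest) pairs in sorted order."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append((rel, sha256_file(full)))
    entries.sort()  # fixed order, independent of filesystem traversal order
    manifest = "".join(f"{file_hash}  {rel}\n" for rel, file_hash in entries)
    return hashlib.sha256(manifest.encode("utf-8")).hexdigest()


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "shard"))
    with open(os.path.join(root, "shard", "a.txt"), "wb") as f:
        f.write(b"first record\n")
    with open(os.path.join(root, "b.txt"), "wb") as f:
        f.write(b"second record\n")

    before = sha256_directory(root)
    repeat = sha256_directory(root)

    with open(os.path.join(root, "b.txt"), "ab") as f:
        f.write(b"late addition\n")  # simulate a change after finalisation
    after = sha256_directory(root)
```

Everything runs on the local machine; only the final 64-character digest would ever need to leave it.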

02
Hash anchored to the XRP Ledger

The hash is written to the public XRP Ledger at the moment you finalise the dataset. Once written, no party can alter or delete it.
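The page does not say how the hash is encoded in a ledger transaction. One common pattern on the XRP Ledger is to carry arbitrary data in a transaction's `Memos` field, whose `MemoType` and `MemoData` values must be hexadecimal strings. The sketch below builds such an unsigned transaction as a plain dictionary, assuming that pattern; the account address is a placeholder, and signing and submission (which need a client library and a funded account) are deliberately left out.

```python
import hashlib


def to_hex(text: str) -> str:
    """Hex-encode a UTF-8 string, as XRPL memo fields require."""
    return text.encode("utf-8").hex().upper()


def build_anchor_transaction(account: str, sha256_hex: str) -> dict:
    """Build an unsigned transaction carrying the digest in a memo.

    Assumption: an AccountSet transaction with no flags is used purely as a
    carrier for the memo. Fields such as Fee and Sequence are omitted and
    would be filled in before signing.
    """
    return {
        "TransactionType": "AccountSet",
        "Account": account,
        "Memos": [
            {
                "Memo": {
                    "MemoType": to_hex("sha256"),
                    # The digest is already hexadecimal, so it is stored as is.
                    "MemoData": sha256_hex.upper(),
                }
            }
        ],
    }


digest = hashlib.sha256(b"example dataset bytes").hexdigest()
tx = build_anchor_transaction("rHypotheticalAccountAddress", digest)  # placeholder
```

Only the 32-byte digest appears in the transaction; nothing about the underlying file can be derived from it.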

03
Certificate issued immediately

immut generates a court-ready certificate containing the hash, XRPL transaction ID, ledger sequence number, and UTC timestamp.
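The four fields named above can be modelled as a small record with basic format checks. This is a sketch only: the class, its field names, and the checks are assumptions for illustration, not immut's actual certificate format, and the example values are placeholders.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

HEX64 = re.compile(r"^[0-9A-Fa-f]{64}$")  # 256-bit value as 64 hex characters


@dataclass(frozen=True)
class Certificate:
    sha256_hex: str
    xrpl_tx_id: str
    ledger_sequence: int
    timestamp_utc: datetime

    def problems(self) -> list:
        """Return a list of format problems; an empty list means none found."""
        issues = []
        if not HEX64.match(self.sha256_hex):
            issues.append("sha256_hex is not 64 hex characters")
        if not HEX64.match(self.xrpl_tx_id):
            issues.append("xrpl_tx_id is not 64 hex characters")
        if self.ledger_sequence <= 0:
            issues.append("ledger_sequence must be positive")
        # utcoffset() is None for naive datetimes, so they fail this check too.
        if self.timestamp_utc.utcoffset() != timedelta(0):
            issues.append("timestamp_utc must be timezone-aware UTC")
        return issues


well_formed = Certificate(
    sha256_hex=hashlib.sha256(b"example").hexdigest(),
    xrpl_tx_id="AB" * 32,  # placeholder, not a real transaction ID
    ledger_sequence=1,  # placeholder
    timestamp_utc=datetime(2025, 8, 2, tzinfo=timezone.utc),
)
malformed = Certificate("not-a-hash", "AB" * 32, 0, datetime(2025, 8, 2))
```

Format checks like these catch transcription errors; they do not replace checking the certificate against the ledger itself.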

04
Proof outlives immut

The record lives on a public blockchain and remains verifiable even if immut ceases to exist. No dependency on immut's servers or continued operation.
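Independent verification can be sketched as follows, assuming the digest was stored in the transaction's `Memos` field (an assumption; the storage location is not stated here) and that the transaction has already been fetched from any public XRP Ledger server as a dictionary. The ledger reports time as seconds since 1 January 2000 UTC, so the sketch converts that to a normal UTC datetime. The `tx` dictionary below is hand-built test data, not a real ledger record.

```python
import hashlib
from datetime import datetime, timedelta, timezone

# The XRP Ledger counts seconds from this moment rather than from 1970.
RIPPLE_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)


def ripple_time_to_utc(seconds: int) -> datetime:
    """Convert XRP Ledger time to a timezone-aware UTC datetime."""
    return RIPPLE_EPOCH + timedelta(seconds=seconds)


def memo_digests(tx: dict) -> list:
    """Collect the MemoData values of a transaction, lower-cased."""
    return [m["Memo"].get("MemoData", "").lower() for m in tx.get("Memos", [])]


def verify_against_ledger(data: bytes, tx: dict) -> bool:
    """Return True if the data's SHA-256 digest appears in the transaction."""
    return hashlib.sha256(data).hexdigest() in memo_digests(tx)


dataset = b"example dataset bytes"
tx = {  # hand-built stand-in for a fetched transaction
    "Memos": [{"Memo": {"MemoData": hashlib.sha256(dataset).hexdigest().upper()}}],
    "date": 86400,  # one day after the ledger epoch, for illustration
}

unchanged_ok = verify_against_ledger(dataset, tx)
tampered_ok = verify_against_ledger(dataset + b" edited", tx)
anchored_at = ripple_time_to_utc(tx["date"])
```

The check needs only the original file and public ledger data, so it does not depend on any service run by the party that created the record.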

Integrates with MLflow, Weights & Biases, DVC, Hugging Face, and any data pipeline with an API or webhook.

Legal acceptance

88 countries. 171 jurisdictions. Already accepted.

United States
US v. Sterlingov (2024)
Federal

The US District Court for DC admitted blockchain transaction records as primary evidence, establishing that public blockchain data satisfies US federal evidentiary standards without requiring expert testimony on the underlying technology.

European Union
EU Regulation 2025/2531 (eIDAS-2)
All 27 Member States

The updated eIDAS framework recognises qualified electronic time-stamps as having the legal effect of evidence of the date and time indicated and the integrity of the data, binding across all EU Member States.

France
AZ Factory v. Valeria Moda (2025)
Tribunal Judiciaire de Marseille

A blockchain timestamp was accepted as proof of prior creation in an IP infringement dispute. The court found the blockchain record established both the date and integrity of the original file without requiring production of the file itself.

China
China Supreme People's Court (2018)
1,400+ subsequent cases

The Supreme People's Court ruled that blockchain-stored evidence is presumptively authentic and meets the standard for electronic evidence. Over 1,400 IP cases have since been decided on blockchain-anchored evidence.

Question to ask yourself

If the EU AI Office asked you to prove your training dataset was exactly as your data card describes at the moment training ran, could you?

Prove your first file in minutes.

Takes seconds. Works on any file type. No installation required.

Sign up for free