AI training data documentation must reflect what was true at training time.
That is a different and much harder question than whether documentation exists today. It is the question on which the EU AI Act, courts hearing training-data copyright cases, and US copyright regulators are now deciding AI compliance.
Every framework asks the same question: when.
The EU AI Act, NIST, and US copyright regulators all converge on the same requirement: prove your training data record existed, in this form, at this date, to a party that does not trust you.
Providers of general-purpose AI models must draw up and keep up-to-date the technical documentation of the model, including its training and testing process and the results of its evaluation. The documentation must reflect what was true at training time, not what can be reconstructed after a complaint. Obligations applied from 2 August 2025.
Technical documentation must include type and provenance of data and curation methodologies. Provenance means where the data came from, how it was curated, and in what form. A model card written after training cannot demonstrate that provenance was recorded at training time.
Training data for high-risk AI systems must be subject to data governance practices, and technical documentation must cover training methodologies and the provenance of those data sets. Provenance is named twice in the regulation at two different levels of specificity.
Providers of GPAI models with systemic risk must perform and document adversarial testing. Documentation of adversarial testing is only useful if it can be shown to predate deployment, not postdate a complaint. The contemporaneity requirement is implicit in the purpose.
High-risk AI systems shall technically allow for the automatic recording of events (logs) over the lifetime of the system. Logs that can be altered by the provider they are supposed to audit are not logs that satisfy this requirement. The integrity of the log is inseparable from its utility.
The NIST Generative AI Profile (NIST AI 600-1) identifies content provenance as a cross-cutting requirement throughout the AI risk lifecycle. NIST recommends logging, metadata annotation, and documentation of the source, legal rights, privacy status, generation date, method, and lineage of training data. The profile identifies the gap explicitly: AI developers often fail to vet or adequately document the training data they are using.
Human authorship is a bedrock requirement for copyright protection. Works in training datasets that contain human creative expression are protected. An AI provider that cannot demonstrate exactly what human-authored content was in its training corpus, and on what basis its use was authorised, lacks the evidential foundation for a clean copyright defence.
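The point above about logs that the provider can alter can be made concrete with a hash-chained log. This is a minimal sketch, not a method prescribed by any of the frameworks cited: `append_entry`, `verify_chain`, and the entry layout are hypothetical names chosen for illustration. Each entry commits to the hash of the one before it, so changing an earlier entry breaks every later link.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(event, prev_hash):
    # Serialise deterministically so the same entry always hashes the same way.
    body = json.dumps({"event": event, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log, event):
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS
    log.append({
        "event": event,
        "prev_hash": prev_hash,
        "entry_hash": _entry_hash(event, prev_hash),
    })


def verify_chain(log):
    """Return True only if every entry still matches its recorded hash and link."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if _entry_hash(entry["event"], entry["prev_hash"]) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True


log = []
append_entry(log, "dataset v3 finalised")
append_entry(log, "training run started")
print(verify_chain(log))  # chain is intact

log[0]["event"] = "dataset v4 finalised"  # edit history after the fact
print(verify_chain(log))  # the edit is detected
```

A chain like this only detects edits if the latest hash is held somewhere the provider cannot rewrite. A provider who controls the entire log could recompute every hash, which is why the final hash needs an external anchor.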
A model card written after training is forensically indistinguishable from one written at training time.
Documentation in Notion, Confluence, SharePoint, and Hugging Face model cards is editable. File timestamps can be rewritten. Metadata can be changed. No document audit can tell the difference.
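The claim that file timestamps can be rewritten is easy to demonstrate. The sketch below uses only the Python standard library; the temporary file stands in for a model card, and the backdated date is an arbitrary example.

```python
import os
import tempfile
from datetime import datetime, timezone

# Create a throwaway file to stand in for a model card.
with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as f:
    f.write("# Model card\n(placeholder content)\n")
    card_path = f.name

# Set the access and modification times to 1 January 2024 (UTC).
backdated = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()
os.utime(card_path, (backdated, backdated))

# The filesystem now reports the backdated time as the "last modified" date.
reported = datetime.fromtimestamp(os.path.getmtime(card_path), tz=timezone.utc)
print(reported.isoformat())

os.remove(card_path)  # clean up the throwaway file
```

Nothing in the file itself records that the timestamp was changed, so a reviewer inspecting the file alone sees only the backdated value.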
Written after training concludes, sometimes months after, sometimes in response to litigation.
Completed after the fact are forensically indistinguishable from those completed at training time.
Provider-level modification, government orders, and operational bugs are outside the retention settings.
The litigation is already here. The question in every case is when.
Providers cannot point to a contemporaneous, tamper-evident record of exactly what was in their training datasets at the moment training ran.
The exact composition of the LAION training corpus at the time Stable Diffusion trained, and the CMI status of images at ingestion, cannot be established from retrospective documentation.
Scope of the training corpus; what NYT content was ingested; when training ran on which dataset version. Retrospective model cards cannot answer these questions.
Which headnotes were in training data and whether use was transformative. ROSS could not demonstrate with contemporaneous records what was in the dataset at training time.
Composition of LAION training datasets at training time and CMI presence at ingestion. Section 1202 claims were dismissed for insufficient CMI evidence, illustrating the evidentiary gap.
What music content was in Claude's training corpus and whether the acquisition method was lawful. Contemporaneous dataset records did not exist.
Whether GPAI technical documentation under Article 53 and Annex XI reflects training-data provenance at training time, not reconstruction prepared after the fact.
Litigation costs in training-data copyright suits are running to tens of millions of dollars before trial. The EU AI Act carries penalties of up to EUR 15 million or 3% of worldwide annual turnover, whichever is higher, for failure to produce compliant technical documentation.
The dataset stays private. The proof is public.
A SHA-256 hash is a one-way fingerprint. Anyone holding the original can verify that it matches. No one can reconstruct the original from it.
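As a rough illustration of the fingerprint, a SHA-256 digest can be computed on the machine that holds the file using Python's standard `hashlib` module. This is a generic sketch, not immut's client code; `sha256_file` and the chunk size are illustrative choices.

```python
import hashlib


def sha256_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file.

    The file is read in chunks so that a large dataset archive or model
    checkpoint never has to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is a 64-character hexadecimal string. Changing a single byte of the file produces a different digest, and the digest reveals nothing about the file's contents.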
The SHA-256 hash is computed locally. The training dataset, model checkpoint, or evaluation log is not transmitted, stored, or visible to immut.
The hash is written to the public XRP Ledger at the moment you finalise the dataset. Once written, no party can alter or delete it.
immut generates a court-ready certificate containing the hash, XRPL transaction ID, ledger sequence number, and UTC timestamp.
The record lives on a public blockchain and remains verifiable even if immut ceases to exist. No dependency on immut's servers or continued operation.
Integrates with MLflow, Weights and Biases, DVC, Hugging Face, and any data pipeline with an API or webhook.
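Checking a file against a certificate later comes down to recomputing the hash and comparing. The sketch below assumes a certificate carrying the four items named above; the `Certificate` class and its field names are hypothetical, since the actual certificate format is not specified here. It covers only the local half of verification. Confirming that the hash appears in the stated ledger transaction is a separate lookup against the public XRP Ledger and is not shown.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Certificate:
    # Hypothetical field names, for illustration only.
    sha256: str           # hex digest recorded when the dataset was finalised
    xrpl_tx_id: str       # XRPL transaction ID
    ledger_sequence: int  # ledger sequence number
    timestamp_utc: str    # UTC timestamp, e.g. an ISO 8601 string


def sha256_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_matches_certificate(path, cert):
    """Return True if the file on disk hashes to the value in the certificate."""
    return hmac.compare_digest(sha256_file(path), cert.sha256.lower())
```

A match shows that the file in hand is byte-for-byte the one that was hashed. The date comes from the ledger record the certificate points to, not from the file.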
88 countries. 171 jurisdictions. Already accepted.
The US District Court for DC admitted blockchain transaction records as primary evidence, establishing that public blockchain data satisfies US federal evidentiary standards without requiring expert testimony on the underlying technology.
The updated eIDAS framework recognises qualified electronic time-stamps as having the legal effect of evidence of the date and time indicated and the integrity of the data, binding across all EU Member States.
A blockchain timestamp was accepted as proof of prior creation in an IP infringement dispute. The court found the blockchain record established both the date and integrity of the original file without requiring production of the file itself.
The Supreme People's Court ruled that blockchain-stored evidence is presumptively authentic and meets the standard for electronic evidence. Over 1,400 IP cases have since been decided on blockchain-anchored evidence.
If the EU AI Office asked you to prove your training dataset was exactly as your data card describes at the moment training ran, could you?
Prove your first file in minutes.
Takes seconds. Works on any file type. No installation required.