Datasets with Croissant
EU AI Act Art. 10 requires providers of high-risk systems to establish data governance practices that include a description of the characteristics of the training and evaluation dataset. Annex IV §2 specifically requests information on data provenance and characteristics. Venturalítica materialises this requirement through Croissant manifests: the dataset is not referenced by a file path but by a structured manifest declared in sei.yaml, whose digest is anchored in the signed evidence bundle.
Why Croissant and not read_csv
Loading a dataset directly with pd.read_csv("data/file.csv") produces no provenance artefact: there is no field description, no declared sensitive attributes, no verifiable hash, and the evaluation code mixes loading logic with measurement logic.
A Croissant manifest (specification mlcommons.org/croissant/1.0) decouples the dataset description from its consumption:
- It describes the distribution (fileObject, sha256), schema (recordSet, field), and AI responsibility metadata (rai:dataBiases, rai:sensitiveData).
- The eval materialises it via mlcroissant.Dataset, not read_csv. The field name at load time becomes a reference to the manifest, not a hardcoded path.
- The manifest digest enters the signed evidence bundle; any change to the dataset description breaks the signature and is tracked in git.
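The integrity part of this decoupling can be illustrated with a minimal sketch that recomputes each file's SHA-256 and compares it with the sha256 declared under distribution. The helper names file_sha256 and check_distribution are hypothetical, and the sketch assumes contentUrl is a path relative to the manifest's directory; it is not the engine's own check.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 16) -> str:
    """Return the hex SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_distribution(manifest_path: Path) -> dict[str, bool]:
    """Map each FileObject name to whether its on-disk hash matches the manifest."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    results = {}
    for entry in manifest.get("distribution", []):
        # Assumption: contentUrl is a path relative to the manifest's directory.
        data_file = manifest_path.parent / entry["contentUrl"]
        results[entry["name"]] = file_sha256(data_file) == entry["sha256"]
    return results
```

Calling check_distribution(Path("data/german_credit.croissant.json")) would return a mapping such as german_credit.csv to True when the CSV on disk still matches the declared hash.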
Declaration in sei.yaml
The dataset.croissant field points to the manifest relative to the repository root:

    dataset: { croissant: data/german_credit.croissant.json }

This field is the only dataset reference the engine needs. sei run includes the manifest digest in the bundle; sei status verifies it against sei.lock.
Minimum Croissant manifest structure for sei
The following example is a representative excerpt from the real german_credit.croissant.json manifest of the loan scenario. It shows the three mandatory blocks for Annex IV §2 traceability:

    {
      "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
        "rai": "http://mlcommons.org/croissant/RAI/",
        "dct": "http://purl.org/dc/terms/"
      },
      "@type": "sc:Dataset",
      "conformsTo": "http://mlcommons.org/croissant/1.0",
      "name": "german-credit-eval",
      "description": "UCI German Credit (1000 applications) — eval-set for loan. Protected attributes: gender, age.",
      "citeAs": "UCI Machine Learning Repository, Statlog (German Credit Data), dataset #144.",
      "license": "https://creativecommons.org/licenses/by/4.0/",
      "url": "https://archive.ics.uci.edu/dataset/144",
      "rai:dataBiases": "Gender imbalance in the approval base rate (motivates fairness controls).",
      "rai:sensitiveData": "gender, age (protected attributes for bias evaluation).",
      "distribution": [
        {
          "@type": "cr:FileObject",
          "@id": "german_credit.csv",
          "name": "german_credit.csv",
          "contentUrl": "german_credit.csv",
          "encodingFormat": "text/csv",
          "sha256": "eba521d4cf4573eae070dac58e535d1e730e66e996f76ba0180d9d1b608eb043"
        }
      ],
      "recordSet": [
        {
          "@type": "cr:RecordSet",
          "@id": "applications",
          "name": "applications",
          "field": [
            {
              "@type": "cr:Field",
              "@id": "applications/age",
              "name": "age",
              "dataType": "sc:Integer",
              "source": {
                "fileObject": { "@id": "german_credit.csv" },
                "extract": { "column": "age" }
              }
            },
            {
              "@type": "cr:Field",
              "@id": "applications/gender",
              "name": "gender",
              "dataType": "sc:Text",
              "source": {
                "fileObject": { "@id": "german_credit.csv" },
                "extract": { "column": "gender" }
              }
            },
            {
              "@type": "cr:Field",
              "@id": "applications/target",
              "name": "target",
              "dataType": "sc:Integer",
              "source": {
                "fileObject": { "@id": "german_credit.csv" },
                "extract": { "column": "target" }
              }
            }
          ]
        }
      ]
    }

The three blocks serve distinct functions from a regulatory standpoint:
| Block | Function for Annex IV §2 |
|---|---|
| Root metadata (name, description, license, url, citeAs) | Dataset provenance and citation |
| rai:dataBiases, rai:sensitiveData | Data characteristics (Art. 10(2)(f) and (g)) |
| distribution[].sha256 | Verifiable integrity; the engine anchors this hash |
| recordSet / field | Declared schema; mlcroissant uses it to materialise the DataFrame |
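A quick pre-commit check can confirm that a manifest carries every entry listed in the table. The sketch below is illustrative only: missing_blocks is a hypothetical helper, and treating these particular keys as required is an assumption drawn from the excerpt, not a documented rule of the engine.

```python
# Keys taken from the german_credit excerpt; treating them as "required"
# is an assumption of this sketch, not a documented sei rule.
ROOT_KEYS = ("name", "description", "license", "url", "citeAs")
RAI_KEYS = ("rai:dataBiases", "rai:sensitiveData")


def missing_blocks(manifest: dict) -> list[str]:
    """Return the names of expected entries that are absent or empty."""
    missing = [key for key in ROOT_KEYS + RAI_KEYS if not manifest.get(key)]

    # Every distributed file should carry a sha256.
    distribution = manifest.get("distribution") or []
    if not distribution:
        missing.append("distribution")
    for entry in distribution:
        if not entry.get("sha256"):
            missing.append(f"distribution[{entry.get('name', '?')}].sha256")

    # Every record set should declare at least one field.
    record_sets = manifest.get("recordSet") or []
    if not record_sets:
        missing.append("recordSet")
    for record_set in record_sets:
        if not record_set.get("field"):
            missing.append(f"recordSet[{record_set.get('name', '?')}].field")
    return missing
```

The function takes the parsed manifest (for example the result of json.load on the manifest file) and returns an empty list when nothing is missing.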
How the eval loads the dataset via Croissant
The framework-agnostic eval (compliance_eval.py) loads the dataset via mlcroissant.Dataset, not read_csv. The manifest acts as the loading contract:

    import mlcroissant as mlc
    import pandas as pd

    CROISSANT = "data/german_credit.croissant.json"

    def load_applications(croissant_path: str = CROISSANT) -> pd.DataFrame:
        # Parse the manifest; the CSV location comes from its distribution block.
        ds = mlc.Dataset(jsonld=croissant_path)
        # Materialise every record of the "applications" record set.
        df = pd.DataFrame(list(ds.records(record_set="applications")))
        # Drop the record-set prefix: "applications/age" becomes "age".
        df = df.rename(columns=lambda c: c.split("/", 1)[1] if "/" in c else c)
        return df

mlcroissant reads the manifest, validates the CSV file’s sha256 against the value declared in distribution, and materialises the recordSet as Python records. The evaluation code never references the CSV path directly; it does so through the manifest.
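The column rename in the loader can be shown in isolation on a single record. The sketch below uses a hand-written sample rather than a real mlcroissant call; normalise_record is a hypothetical helper, and the bytes decoding covers the assumption that some mlcroissant versions return sc:Text values as bytes.

```python
def normalise_record(record: dict) -> dict:
    """Strip the record-set prefix from keys and decode bytes values to str."""
    clean = {}
    for key, value in record.items():
        # "applications/age" -> "age"; keys without a prefix are kept as-is.
        name = key.split("/", 1)[1] if "/" in key else key
        # Assumption: text values may arrive as bytes and are UTF-8 encoded.
        clean[name] = value.decode("utf-8") if isinstance(value, bytes) else value
    return clean


# Hand-written sample shaped like one record of the "applications" record set.
sample = {
    "applications/age": 35,
    "applications/gender": b"female",
    "applications/target": 1,
}
print(normalise_record(sample))
```

Applying the helper to each record before building the DataFrame would keep the evaluation code working with the short field names declared in the manifest.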
Digest anchoring in the evidence
When sei run executes the pipeline, the engine computes the Croissant manifest digest and includes it in the signed evidence bundle (.sei/bundle.json). sei.lock stores that digest alongside the treatment code digest and model digest:

    sei status --repo .
    # → checks the Croissant digest against sei.lock; if the manifest changed,
    # the dataset section appears as stale and sei run re-anchors

This mechanism ensures that any modification to the manifest, including a change to rai:dataBiases or the distribution sha256, is recorded in the git history of the bundle. The cloud renders this digest as part of the Annex IV §2 section.
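The staleness check can be pictured with a short sketch. It assumes the digest is a SHA-256 over the manifest file's raw bytes and that the locked value is available as a plain hex string; the engine's actual digest scheme and the layout of sei.lock are not specified here, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def manifest_digest(manifest_path: Path) -> str:
    """Hex SHA-256 of the manifest file's raw bytes (assumed digest scheme)."""
    return hashlib.sha256(manifest_path.read_bytes()).hexdigest()


def dataset_is_stale(manifest_path: Path, locked_digest: str) -> bool:
    """True when the manifest on disk no longer matches the locked digest."""
    return manifest_digest(manifest_path) != locked_digest
```

Under these assumptions, editing any byte of the manifest, whether a rai:dataBiases sentence or the distribution sha256, changes the digest and flips dataset_is_stale to True until the digest is re-anchored.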
References
- sei.yaml reference: dataset.croissant field and manifest syntax
- Concepts: EU AI Act, Art. 10 (data governance) and Annex IV §2