Datasets with Croissant

EU AI Act Art. 10 requires providers of high-risk systems to establish data governance practices that include a description of the characteristics of the training and evaluation dataset. Annex IV §2 specifically requests information on data provenance and characteristics. Venturalítica materialises this requirement through Croissant manifests: the dataset is not referenced by a file path but by a structured manifest declared in sei.yaml, whose digest is anchored in the signed evidence bundle.


Loading a dataset directly with pd.read_csv("data/file.csv") produces no provenance artefact: there is no field description, no declared sensitive attributes, no verifiable hash, and the evaluation code mixes loading logic with measurement logic.

A Croissant manifest (specification mlcommons.org/croissant/1.0) decouples the dataset description from its consumption:

  • It describes the distribution (fileObject, sha256), schema (recordSet, field), and AI responsibility metadata (rai:dataBiases, rai:sensitiveData).
  • The eval materialises it via mlcroissant.Dataset, not read_csv. At load time the code names a record set declared in the manifest, not a hardcoded file path.
  • The manifest digest enters the signed evidence bundle — any change to the dataset description breaks the signature and is tracked in git.

The dataset.croissant field points to the manifest relative to the repository root:

sei.yaml (excerpt)
dataset: { croissant: data/german_credit.croissant.json }

This field is the only dataset reference the engine needs. sei run includes the manifest digest in the bundle; sei status verifies it against sei.lock.


Minimum Croissant manifest structure for sei

The following example is a representative excerpt from the real german_credit.croissant.json manifest of the loan scenario. It shows the three mandatory blocks for Annex IV §2 traceability:

data/german_credit.croissant.json (excerpt)
{
  "@context": { "@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/",
                "rai": "http://mlcommons.org/croissant/RAI/", "dct": "http://purl.org/dc/terms/" },
  "@type": "sc:Dataset",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "name": "german-credit-eval",
  "description": "UCI German Credit (1000 applications) — eval-set for loan. Protected attributes: gender, age.",
  "citeAs": "UCI Machine Learning Repository, Statlog (German Credit Data), dataset #144.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "url": "https://archive.ics.uci.edu/dataset/144",
  "rai:dataBiases": "Gender imbalance in the approval base rate (motivates fairness controls).",
  "rai:sensitiveData": "gender, age (protected attributes for bias evaluation).",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "german_credit.csv",
      "name": "german_credit.csv",
      "contentUrl": "german_credit.csv",
      "encodingFormat": "text/csv",
      "sha256": "eba521d4cf4573eae070dac58e535d1e730e66e996f76ba0180d9d1b608eb043"
    }
  ],
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "applications",
      "name": "applications",
      "field": [
        { "@type": "cr:Field", "@id": "applications/age", "name": "age", "dataType": "sc:Integer",
          "source": { "fileObject": { "@id": "german_credit.csv" }, "extract": { "column": "age" } } },
        { "@type": "cr:Field", "@id": "applications/gender", "name": "gender", "dataType": "sc:Text",
          "source": { "fileObject": { "@id": "german_credit.csv" }, "extract": { "column": "gender" } } },
        { "@type": "cr:Field", "@id": "applications/target", "name": "target", "dataType": "sc:Integer",
          "source": { "fileObject": { "@id": "german_credit.csv" }, "extract": { "column": "target" } } }
      ]
    }
  ]
}

These blocks serve distinct functions from a regulatory standpoint:

Block | Function for Annex IV §2
Root metadata (name, description, license, url, citeAs) | Dataset provenance and citation
rai:dataBiases, rai:sensitiveData | Data characteristics (Art. 10(2)(f) and (g))
distribution[].sha256 | Verifiable integrity; the engine anchors this hash
recordSet / field | Declared schema; mlcroissant uses it to materialise the DataFrame
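
As a rough illustration of the table above, the following sketch checks that a manifest carries each of those parts before it is committed. The helper name missing_manifest_parts and the exact list of required keys are assumptions made for this example; they are not sei's actual validation logic.

```python
# Keys taken from the table above; the exact set sei enforces is an assumption.
REQUIRED_ROOT_KEYS = ("name", "description", "license", "url", "citeAs",
                      "rai:dataBiases", "rai:sensitiveData")


def missing_manifest_parts(manifest: dict) -> list[str]:
    """Return the names of the parts that the parsed manifest lacks."""
    missing = [key for key in REQUIRED_ROOT_KEYS if not manifest.get(key)]

    # Every distributed file should declare a sha256 so it can be verified.
    distribution = manifest.get("distribution") or []
    if not distribution:
        missing.append("distribution")
    for file_object in distribution:
        if not file_object.get("sha256"):
            missing.append(f"distribution[{file_object.get('@id', '?')}].sha256")

    # At least one record set with at least one field describes the schema.
    record_sets = manifest.get("recordSet") or []
    if not any(rs.get("field") for rs in record_sets):
        missing.append("recordSet/field")
    return missing
```

Passing the result of json.load on data/german_credit.croissant.json should return an empty list when every part is present; a non-empty list names what still has to be filled in.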

How the eval loads the dataset via Croissant

The framework-agnostic eval (compliance_eval.py) loads the dataset via mlcroissant.Dataset, not read_csv. The manifest acts as the loading contract:

compliance_eval.py (excerpt — loading via Croissant)
import mlcroissant as mlc
import pandas as pd

CROISSANT = "data/german_credit.croissant.json"

def load_applications(croissant_path: str = CROISSANT) -> pd.DataFrame:
    # Resolve the dataset through the manifest rather than a CSV path.
    ds = mlc.Dataset(jsonld=croissant_path)
    df = pd.DataFrame(list(ds.records(record_set="applications")))
    # Records are keyed "applications/<field>"; keep only the field name.
    df = df.rename(columns=lambda c: c.split("/", 1)[1] if "/" in c else c)
    return df

mlcroissant reads the manifest, validates the CSV file’s sha256 against the value declared in distribution, and materialises the recordSet as Python records. The evaluation code never references the CSV path directly; it reaches the file only through the manifest.
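
The integrity check can also be reproduced independently of mlcroissant with the standard library alone. The sketch below recomputes each file's SHA-256 and compares it with the declared value. verify_distribution and file_sha256 are hypothetical helpers, and the code assumes contentUrl is a local path relative to the manifest's directory, as in the german_credit excerpt.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_distribution(manifest_path: str) -> dict[str, bool]:
    """Map each FileObject @id to whether its declared sha256 matches the file."""
    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text(encoding="utf-8"))
    results = {}
    for file_object in manifest.get("distribution", []):
        # Assumption: contentUrl is a local path relative to the manifest.
        data_file = manifest_file.parent / file_object["contentUrl"]
        results[file_object["@id"]] = (
            data_file.is_file()
            and file_sha256(data_file) == file_object.get("sha256")
        )
    return results
```

A missing file and a mismatched hash both map to False, so a caller can treat any False value as a reason to stop before evaluation.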


When sei run executes the pipeline, the engine computes the Croissant manifest digest and includes it in the signed evidence bundle (.sei/bundle.json). sei.lock stores that digest alongside the treatment code digest and model digest:

Verify manifest integrity
sei status --repo .
# → checks the Croissant digest against sei.lock; if the manifest changed,
# the dataset section appears as stale and sei run re-anchors

This mechanism ensures that any modification to the manifest — including a change to rai:dataBiases or the distribution sha256 — is recorded in the git history of the bundle. The cloud renders this digest as part of the Annex IV §2 section.
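
To make the staleness check concrete, here is a minimal sketch of the comparison. It assumes the digest is a SHA-256 over the manifest's raw bytes and models the locked value as a plain string. The actual digest scheme and the sei.lock format are not described here, so manifest_digest and is_stale are illustrative names only.

```python
import hashlib
from pathlib import Path


def manifest_digest(manifest_path: str) -> str:
    """SHA-256 over the manifest's raw bytes (assumed scheme, not sei's spec)."""
    return hashlib.sha256(Path(manifest_path).read_bytes()).hexdigest()


def is_stale(manifest_path: str, locked_digest: str) -> bool:
    """True when the manifest no longer matches the digest recorded earlier."""
    return manifest_digest(manifest_path) != locked_digest
```

Under this assumption the digest covers every byte of the file, so editing only rai:dataBiases changes it just as a new distribution sha256 would.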