Datasets with Croissant

EU AI Act Art. 10 requires providers of high-risk systems to establish data governance practices that include a description of the characteristics of the training and evaluation dataset. Annex IV §2 specifically requests information on data provenance and characteristics. Venturalítica materialises this requirement through Croissant manifests: the dataset is not referenced by a file path but by a structured manifest declared in froga.yaml, whose digest is anchored in the signed evidence bundle.

Why Croissant and not `read_csv`

Loading a dataset directly with pd.read_csv("data/file.csv") produces no provenance artefact: there is no field description, no declared sensitive attributes, no verifiable hash, and the evaluation code mixes loading logic with measurement logic.

A Croissant manifest (specification mlcommons.org/croissant/1.0) decouples the dataset description from its consumption:

It describes the distribution (fileObject, sha256), schema (recordSet, field), and AI responsibility metadata (rai:dataBiases, rai:sensitiveData).
The eval materialises it via mlcroissant.Dataset, not read_csv. The loading code names a record set declared in the manifest instead of hardcoding a file path.
The manifest digest enters the signed evidence bundle — any change to the dataset description breaks the signature and is tracked in git.
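The integrity property listed above can be illustrated with the standard library alone. The following is a minimal sketch, not part of froga or mlcroissant: `file_sha256` and `check_distribution` are hypothetical helper names, and the sketch assumes each `contentUrl` is a path relative to the manifest's directory.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def check_distribution(manifest_path: Path) -> dict[str, bool]:
    """Compare each declared sha256 with the hash of the file on disk.

    Assumption: contentUrl is a local path relative to the manifest.
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    results = {}
    for entry in manifest.get("distribution", []):
        data_file = manifest_path.parent / entry["contentUrl"]
        results[entry["name"]] = file_sha256(data_file) == entry.get("sha256")
    return results
```

A `False` value for any entry means the file on disk no longer matches what the manifest declares, which is the situation a bare read_csv call cannot detect.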

Declaration in `froga.yaml`

The dataset.croissant field points to the manifest relative to the repository root:

dataset: { croissant: data/german_credit.croissant.json }

This field is the only dataset reference the engine needs. froga run includes the manifest digest in the bundle; froga status verifies it against froga.lock.

Minimum Croissant manifest structure for froga

The following example is a representative excerpt from the real german_credit.croissant.json manifest of the loan scenario. It shows the blocks required for Annex IV §2 traceability:

{
  "@context": { "@vocab": "https://schema.org/", "sc": "https://schema.org/",
                "cr": "http://mlcommons.org/croissant/",
                "rai": "http://mlcommons.org/croissant/RAI/", "dct": "http://purl.org/dc/terms/" },
  "@type": "sc:Dataset",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "name": "german-credit-eval",
  "description": "UCI German Credit (1000 applications) — eval-set for loan. Protected attributes: gender, age.",
  "citeAs": "UCI Machine Learning Repository, Statlog (German Credit Data), dataset #144.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "url": "https://archive.ics.uci.edu/dataset/144",
  "rai:dataBiases": "Gender imbalance in the approval base rate (motivates fairness controls).",
  "rai:sensitiveData": "gender, age (protected attributes for bias evaluation).",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "german_credit.csv",
      "name": "german_credit.csv",
      "contentUrl": "german_credit.csv",
      "encodingFormat": "text/csv",
      "sha256": "eba521d4cf4573eae070dac58e535d1e730e66e996f76ba0180d9d1b608eb043"
    }
  ],
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "applications",
      "name": "applications",
      "field": [
        { "@type": "cr:Field", "@id": "applications/age",    "name": "age",    "dataType": "sc:Integer",
          "source": { "fileObject": { "@id": "german_credit.csv" }, "extract": { "column": "age" } } },
        { "@type": "cr:Field", "@id": "applications/gender", "name": "gender", "dataType": "sc:Text",
          "source": { "fileObject": { "@id": "german_credit.csv" }, "extract": { "column": "gender" } } },
        { "@type": "cr:Field", "@id": "applications/target", "name": "target", "dataType": "sc:Integer",
          "source": { "fileObject": { "@id": "german_credit.csv" }, "extract": { "column": "target" } } }
      ]
    }
  ]
}

Each block serves a distinct function from a regulatory standpoint:

Block	Function for Annex IV §2
Root metadata (`name`, `description`, `license`, `url`, `citeAs`)	Dataset provenance and citation
`rai:dataBiases`, `rai:sensitiveData`	Data characteristics (Art. 10(2)(f) and (g))
`distribution[].sha256`	Verifiable integrity; the engine anchors this hash
`recordSet` / `field`	Declared schema; `mlcroissant` uses it to materialise the DataFrame
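The table above can be turned into a simple pre-flight check. This is a minimal sketch under stated assumptions: `missing_blocks`, `ROOT_KEYS` and `RAI_KEYS` are hypothetical names introduced for illustration, the checklist is derived only from the table, and it is not how froga or mlcroissant validate a manifest.

```python
# Hypothetical checklist derived from the table; not a froga or mlcroissant API.
ROOT_KEYS = ("name", "description", "license", "url", "citeAs")
RAI_KEYS = ("rai:dataBiases", "rai:sensitiveData")


def missing_blocks(manifest: dict) -> list[str]:
    """Return the expected manifest entries that are absent or empty."""
    missing = [key for key in ROOT_KEYS + RAI_KEYS if not manifest.get(key)]

    distribution = manifest.get("distribution") or []
    if not distribution:
        missing.append("distribution")
    for i, entry in enumerate(distribution):
        if not entry.get("sha256"):
            missing.append(f"distribution[{i}].sha256")

    record_sets = manifest.get("recordSet") or []
    if not record_sets:
        missing.append("recordSet")
    for i, record_set in enumerate(record_sets):
        if not record_set.get("field"):
            missing.append(f"recordSet[{i}].field")
    return missing
```

An empty list means every block named in the table is present. The check says nothing about whether the values are accurate, only that they were declared.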

How the eval loads the dataset via Croissant

The framework-agnostic eval (compliance_eval.py) loads the dataset via mlcroissant.Dataset, not read_csv. The manifest acts as the loading contract:

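import mlcroissant as mlc
import pandas as pd

CROISSANT = "data/german_credit.croissant.json"

def load_applications(croissant_path: str = CROISSANT) -> pd.DataFrame:
    # Build the dataset from the manifest rather than from the CSV path.
    ds = mlc.Dataset(jsonld=croissant_path)
    # Each record is a dict keyed by field @id, e.g. "applications/age".
    df = pd.DataFrame(list(ds.records(record_set="applications")))
    # Drop the "applications/" prefix so columns read "age", "gender", "target".
    df = df.rename(columns=lambda c: c.split("/", 1)[1] if "/" in c else c)
    return df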

mlcroissant reads the manifest, validates the CSV file’s sha256 against the value declared in distribution, and materialises the recordSet as Python records. The evaluation code never references the CSV path directly; it reaches the file only through the manifest.
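The rename step in load_applications can be seen in isolation on a single record. This is a minimal sketch, assuming records arrive as plain dicts keyed by field @id; `normalise_record` is a hypothetical helper. The bytes handling is also an assumption: some mlcroissant versions have been observed to return text fields as bytes, so the sketch decodes them defensively.

```python
def normalise_record(record: dict) -> dict:
    """Strip the record-set prefix from keys and decode bytes values as UTF-8."""
    clean = {}
    for key, value in record.items():
        # "applications/age" -> "age"; keys without a prefix are kept as-is.
        short = key.split("/", 1)[1] if "/" in key else key
        # Assumption: text fields may arrive as bytes; other values pass through.
        clean[short] = value.decode("utf-8") if isinstance(value, bytes) else value
    return clean


raw = {"applications/age": 35, "applications/gender": b"female", "applications/target": 1}
print(normalise_record(raw))
```

Applying the same function to every record before building the DataFrame would give the evaluation code short column names and plain strings, independent of how the loader encodes text.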

Digest anchoring in the evidence

When froga run executes the pipeline, the engine computes the Croissant manifest digest and includes it in the signed evidence bundle (.froga/bundle.json). froga.lock stores that digest alongside the treatment code digest and model digest:

froga status --repo .
# → checks the Croissant digest against froga.lock; if the manifest changed,
#   the dataset section appears as stale and froga run re-anchors

This mechanism ensures that any modification to the manifest — including a change to rai:dataBiases or the distribution sha256 — is recorded in the git history of the bundle. The cloud renders this digest as part of the Annex IV §2 section.
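The staleness check can be approximated as follows. This is a sketch under loud assumptions: the digest is taken to be a SHA-256 over the manifest's raw bytes, and the locked value is passed in as a plain string. The actual digest scheme and the format of froga.lock are not described here, so `manifest_digest` and `dataset_is_stale` are hypothetical names, not froga functions.

```python
import hashlib
from pathlib import Path


def manifest_digest(manifest_path: Path) -> str:
    """Hex SHA-256 over the manifest's raw bytes (assumed digest scheme)."""
    return hashlib.sha256(manifest_path.read_bytes()).hexdigest()


def dataset_is_stale(manifest_path: Path, locked_digest: str) -> bool:
    """True when the manifest no longer matches the previously recorded digest."""
    return manifest_digest(manifest_path) != locked_digest
```

Because the digest covers the whole file, editing only a descriptive field such as rai:dataBiases changes it just as much as editing the distribution sha256 does.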

References

froga.yaml reference — dataset.croissant field and manifest syntax
Concepts: EU AI Act — Art. 10 (data governance) and Annex IV §2