Integrate your pipeline (DVC · MLflow · Dagster)

✅ Stable — DVC, MLflow and Dagster tested in CI against the same loan scenario; lakehouse pending.

The froga engine never imports any MLOps tool. Pipeline reproducibility is delegated to the Reproducer seam: one adapter per tool, all implementing the same interface. The pipeline.tool field in froga.yaml determines which adapter is activated when froga run is invoked.
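
The seam can be pictured as a small interface plus a lookup keyed on pipeline.tool. The following is a minimal Python sketch for illustration only: froga's core is written in Rust, and the names Reproducer, reproduce, DvcReproducer, ADAPTERS and select_adapter are hypothetical, not froga's actual API. The adapter bodies return canned values instead of calling any tool.

```python
from typing import Callable, Dict, Protocol


class Reproducer(Protocol):
    """Hypothetical interface: every adapter reruns the pipeline and returns metrics."""

    def reproduce(self) -> Dict[str, float]: ...


class DvcReproducer:
    def reproduce(self) -> Dict[str, float]:
        # A real adapter would shell out to `dvc repro` and read metrics.json.
        return {"unfair-credit-exclusion": 0.046}


class MlflowReproducer:
    def reproduce(self) -> Dict[str, float]:
        # A real adapter would run the MLflow entry point and read the run's metrics.
        return {"unfair-credit-exclusion": 0.046}


# Registry keyed by the value of pipeline.tool in froga.yaml (assumed layout).
ADAPTERS: Dict[str, Callable[[], Reproducer]] = {
    "dvc": DvcReproducer,
    "mlflow": MlflowReproducer,
}


def select_adapter(config: dict) -> Reproducer:
    """Pick the adapter named by pipeline.tool, rejecting unknown tools."""
    tool = config["pipeline"]["tool"]
    try:
        return ADAPTERS[tool]()
    except KeyError:
        raise ValueError(f"unsupported pipeline.tool: {tool!r}") from None
```

Because callers depend only on the interface, supporting another backend in this sketch means adding one class and one registry entry; nothing that consumes the metrics changes.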

The concrete proof of this agnosticism is the loan scenario: the same evaluation code (compliance_eval.py), the same treatment (train.py), and the same risk program (the risk: section of froga.yaml) run unchanged on three different backends, verified in CI.

MLOps categories and supported tools

Cat.	Paradigm	`pipeline.tool`	Status
1	Git-native / versioned files	`dvc`	Stable
2	Experiment → Registry	`mlflow`	Stable
3	Asset graph with lineage	`dagster`	Stable
4	Lakehouse / tables	(pending)	Future

DVC (cat. 1 — git-native)

DVC extends git to version data and models with content-addressable storage, and describes the pipeline as a DAG in dvc.yaml. It is category 1 because the unit of change is a file versioned in git.

Configuration in `froga.yaml`

pipeline: { tool: dvc, metrics: metrics.json }

Multi-stage DAG

The loan scenario defines two stages: featurize prepares the dataset via Croissant (Annex IV §2) and produces a cached data/features.parquet; evaluate trains the model and writes metrics.json and model.pkl.

stages:
  featurize:
    cmd: .venv/bin/python featurize.py
    deps:
      - featurize.py
      - compliance_eval.py
      - data/german_credit.csv
      - data/german_credit.croissant.json
    outs:
      - data/features.parquet
  evaluate:
    cmd: .venv/bin/python evaluate.py
    deps:
      - evaluate.py
      - compliance_eval.py
      - train.py          # the TREATMENT — its change marks evaluate as stale
      - data/features.parquet
      - shared_data/policies/assessment_plan.oscal.yaml
    params:
      - seed
    outs:
      - model.pkl:
          cache: true
    metrics:
      - metrics.json:
          cache: false

Selective staleness and typed drift

dvc repro recomputes only the stages whose dependencies have changed. When the treatment consists of replacing train.py (V1 → V2), only the evaluate stage becomes stale; featurize remains cached because the data has not changed. This behavior is the concrete expression of class B typed drift (model drift) in the engine: the digest of train.py enters the model phase, and featurize (class C, data) is not touched.

froga run calls dvc repro, reads metrics.json, and anchors the digest of dvc.lock in the signed evidence bundle (pipeline_lock_digest). In this way, the pipeline lock is part of the evidence.
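
Anchoring the lock file amounts to hashing its bytes and recording the digest next to the metrics. A minimal sketch, assuming SHA-256 and a plain dictionary for the bundle: the actual digest algorithm, bundle layout, and signing step are not specified in this section, and lock_digest and build_evidence are hypothetical helpers. Only the pipeline_lock_digest field name comes from the text above.

```python
import hashlib
import json
from pathlib import Path


def lock_digest(lock_path: Path) -> str:
    """Return the hex SHA-256 of the lock file's raw bytes (algorithm assumed)."""
    return hashlib.sha256(lock_path.read_bytes()).hexdigest()


def build_evidence(lock_path: Path, metrics_path: Path) -> dict:
    """Assemble an unsigned, illustrative evidence record."""
    return {
        "pipeline_lock_digest": lock_digest(lock_path),
        "metrics": json.loads(metrics_path.read_text()),
    }
```

Any change to dvc.lock yields a different digest, so a record built this way cannot be paired with a pipeline state other than the one that produced the metrics.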

`metrics.json` schema (contract shared by all backends)

metrics.json is a JSON object whose keys are control_ids (the control identifiers from assessment_plan.oscal.yaml). Each value takes one of two forms, and both may appear in the same file:

{
  "unfair-credit-exclusion": {
    "value": 0.046,
    "power": {
      "n": 200,
      "ci_low": 0.003,
      "ci_high": 0.174,
      "ci_level": 0.95,
      "method": "bootstrap",
      "n_boot": 1000,
      "seed": 42
    }
  },
  "global_dice": 0.91
}

Scalar form ("control_id": <number>). The measured value only, with no statistical reliability information. Compatible with SDK < 0.6.11; the engine emits no power-stats warning for that metric.
Power form ("control_id": { "value": <number>, "power": { … } }). The optional power block carries the bootstrap confidence interval that feeds the underpowered warning.

When the power block is present, the following fields are required. A power block missing any of them causes the metrics.json read to fail with an error; it is not silently ignored.

Field	Type	Meaning
`n`	integer	Effective sample size (rows).
`ci_low`	number	Lower bound of the percentile CI of the estimator.
`ci_high`	number	Upper bound of the percentile CI.
`ci_level`	number	Confidence level (e.g. `0.95`).
`method`	string	CI method (`"bootstrap"` or `"cluster_bootstrap"`).
`n_boot`	integer	Number of bootstrap resamples (B).
`seed`	integer	Fixed bootstrap seed (byte-for-byte determinism).

Optional power fields: n_clusters (integer — number of clusters if the control declared input.cluster) and groups (object {"<group>": <integer>} — per-subgroup size when the control slices by group). metrics.json is produced by the venturalitica-sdk; this is the contract the engine consumes, identical for DVC, MLflow, and Dagster.
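
The contract can be checked mechanically. The sketch below is one possible reader, assuming only what this section states: a value is either a bare number or an object with value and an optional power block, and a power block missing a required field is an error. The names normalize_metric and read_metrics are hypothetical; the engine's own reader is not shown here.

```python
import json
from numbers import Real

REQUIRED_POWER_FIELDS = ("n", "ci_low", "ci_high", "ci_level", "method", "n_boot", "seed")


def normalize_metric(control_id: str, raw) -> dict:
    """Return {"value": ..., "power": ... or None} for either accepted form."""
    # Scalar form: a bare number (bool is excluded because it subclasses int).
    if isinstance(raw, Real) and not isinstance(raw, bool):
        return {"value": float(raw), "power": None}
    if not isinstance(raw, dict) or "value" not in raw:
        raise ValueError(f"{control_id}: expected a number or an object with 'value'")
    power = raw.get("power")
    if power is not None:
        missing = [f for f in REQUIRED_POWER_FIELDS if f not in power]
        if missing:
            # Fail loudly instead of ignoring an incomplete power block.
            raise ValueError(f"{control_id}: power block missing {missing}")
    return {"value": float(raw["value"]), "power": power}


def read_metrics(text: str) -> dict:
    """Parse a metrics.json document and normalize every control entry."""
    return {cid: normalize_metric(cid, raw) for cid, raw in json.loads(text).items()}
```

Normalizing both forms to one shape up front means downstream code never has to branch on whether a metric arrived as a scalar or as an object.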

MLflow (cat. 2 — experiment → registry)

MLflow manages the experiment → registration → promotion cycle. The unit of change is a model version in the Model Registry; promotion to @champion is the treatment.

Configuration in `froga.yaml`

pipeline: { tool: mlflow, metrics: metrics.json }

MLflow entry point

The adapter expects an eval/mlflow_entry.py that opens an MLflow run, executes the agnostic eval, registers the real model in the Registry, and, if the blocking control passes, promotes it to the @champion alias:

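import json
import mlflow
from mlflow.tracking import MlflowClient
import compliance_eval
import train

client = MlflowClient()  # REGISTERED_MODEL and _val are assumed to be defined elsewhere in this module
with mlflow.start_run() as run:
    _, model = compliance_eval.run(train.build_model)
    metrics = json.load(open("metrics.json"))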
    info = mlflow.sklearn.log_model(model, name="model",
                                    registered_model_name=REGISTERED_MODEL)
    version = info.registered_model_version
    if _val(metrics.get("unfair-credit-exclusion", 1.0)) < 0.092:
        client.set_registered_model_alias(REGISTERED_MODEL, "champion", version)

The store is local (file:./mlruns) and requires no server. MlflowReproducer reads metrics from the run via mlflow runs describe and delivers them to the engine.

Dagster (cat. 3 — asset graph with lineage)

Dagster materializes assets with explicit lineage (code_version) and allows declaring native quality checks (asset_check). Staleness is detected via code_version, derived from the hash of the treatment file.

Configuration in `froga.yaml`

pipeline: { tool: dagster, metrics: metrics.json }

Graph definition

The scenario defines two assets in a chain and one asset check:

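import hashlib
import json
from pathlib import Path

from dagster import (AssetCheckResult, AssetCheckSeverity, MetadataValue,
                     asset, asset_check)

import compliance_eval
import train

# FEATURES_CODE_VERSION and _val are assumed to be defined elsewhere in dagster_defs.py.

def _treatment_code_version() -> str:
    # Hash of the treatment file: a new train.py yields a new code_version.
    return "train-" + hashlib.sha256(Path("train.py").read_bytes()).hexdigest()[:12]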

@asset(code_version=FEATURES_CODE_VERSION)
def credit_features(context):
    rows = len(compliance_eval.load_applications())
    context.add_output_metadata({"rows": rows})
    return rows

@asset(deps=[credit_features], code_version=_treatment_code_version())
def compliance_evaluation(context):
    compliance_eval.run(train.build_model)
    metrics = json.load(open("metrics.json"))
    context.add_output_metadata({k: MetadataValue.float(float(_val(v)))
                                  for k, v in metrics.items()})
    return metrics

@asset_check(asset=compliance_evaluation,
             description="Fairness gate as a native Dagster check")
def unfair_credit_exclusion_gate(context):
    dp = _val(json.load(open("metrics.json")).get("unfair-credit-exclusion", 1.0))
    return AssetCheckResult(passed=dp < 0.092,
                            severity=AssetCheckSeverity.WARN,
                            metadata={"demographic_parity_diff": dp, "threshold": 0.092})

The code_version of compliance_evaluation is derived from the hash of train.py. When the treatment is applied (V1 → V2), the hash changes and Dagster marks compliance_evaluation as stale. froga run calls dagster asset materialize --select '*', and the asset check expresses the fairness gate natively in Dagster, in parallel with the engine. Its severity is WARN because the authoritative verdict is emitted by the froga engine from the OSCAL assessment plan.
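
The staleness signal reduces to comparing a content hash before and after the treatment. The sketch below is independent of Dagster and mirrors _treatment_code_version, with the file path passed in as a parameter (an adaptation for this example). The file contents are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def treatment_code_version(path: Path) -> str:
    """Same derivation as above: "train-" plus the first 12 hex chars of SHA-256."""
    return "train-" + hashlib.sha256(path.read_bytes()).hexdigest()[:12]


with tempfile.TemporaryDirectory() as tmp:
    train_py = Path(tmp) / "train.py"

    train_py.write_text("def build_model():\n    return 'V1'\n")
    v1 = treatment_code_version(train_py)

    # Rewriting identical bytes keeps the version, so nothing becomes stale.
    train_py.write_text("def build_model():\n    return 'V1'\n")
    unchanged = treatment_code_version(train_py)

    # Applying the treatment (V1 -> V2) changes the bytes and therefore the version.
    train_py.write_text("def build_model():\n    return 'V2'\n")
    v2 = treatment_code_version(train_py)

print(v1 == unchanged, v1 != v2)
```

Deriving the version from file content rather than from a timestamp means that touching or re-saving train.py without changing it does not trigger a recomputation.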

Operational summary

For each backend, the command flow is identical:

uv run froga compile       # risk program (risk: in froga.yaml) → assessment_plan.oscal.yaml
uv run froga run           # Reproducer → dvc repro | mlflow run | dagster asset materialize
uv run froga status        # detects drift without recomputing
uv run froga verify        # verifies the bundle signature
uv run froga reconstruct   # reconstructs the ISO 23894 cycle by git replay

The only thing that changes between scenarios is the value of pipeline.tool in froga.yaml and the pipeline definition files (dvc.yaml, eval/mlflow_entry.py, dagster_defs.py). The Rust core, the controls, the AssuranceProgram, and the agnostic eval remain unchanged.

References

Reference froga.yaml — pipeline field and full manifest syntax
Typed drift A/B/C — how the type of change determines what is recomputed
