Tutorial: medical system (human HOTL oversight)

🟡 Partial — The medical.json scenario runs the full ex-ante evaluation (11 controls, GREEN gate). The worst-subgroup control (worst-subgroup-dice) is AUDIT: it fails on its point value but does NOT block the gate; the risk is treated with declared human HOTL oversight (Art.14). The evaluation does not run in hermetic CI (requires GPU + TCIA cache).

This tutorial walks through Venturalítica’s medical scenario: a high-risk medical device software (SaMD) under the MDR (EU Reg. 2017/745) and EU AI Act Annex III §6. The vertebral segmentation engine is TotalSegmentator operating on the public TCIA Spine-Mets-CT-SEG cohort (≈55 real CT cases). The configuration fragments below are illustrative examples: they show how to write a sei.yaml for a medical-imaging system, not a literal dump of any internal file.

What you will learn:

- How the engine manages a medical imaging system with segmentation metrics (Dice, es_dice) and equity metrics (score_gap, worst_cell_score)
- Why the gate can stay GREEN even with an unmet AUDIT control (the worst subgroup): an audit control records the residual but does not block
- How that residual is treated with human HOTL oversight (Art.14, the human-oversight-worst-case-review control), not with a parameter or code change
- How the MDR framework is applied via frameworks in the risk programme (the risk: section of sei.yaml)
- The contrast with tutoriales/primer-sistema (loan = code change): same engine, different regulatory framework

Project context

medical models a vertebral-segmentation medical device software (SaMD) for spinal surgical planning:

- High-risk under MDR Annex I (medical device software) and EU AI Act Annex III §6
- Task: volumetric vertebral segmentation in CT
- Model: TotalSegmentator v2 (BYO, no proprietary fine-tuning)
- Cohort: TCIA Spine-Mets-CT-SEG, ~55 real CT cases with spinal metastases (effective cohort ≈33 after filtering), multi-scanner, multi-protocol
- Pipeline: DVC, a single evaluate stage (segments in memory, no intermediate CSV; the CSV was removed because it was PHI-shaped and a source of staleness)
- AssuranceProgram (the risk: section of sei.yaml): 4 risks, 14 measures → 11 ex-ante controls

The accuracy measurement is the per-case Dice score (the Sørensen–Dice similarity coefficient), with bootstrap confidence intervals clustered by patient. Equity metrics are es_dice (Equity-Scaled Dice, FairSeg ICLR’24) and score_gap.
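As a rough illustration of the two metrics named above, the sketch below computes Dice over sets of foreground voxel indices and an equity-scaled score in the style of FairSeg (overall score divided by one plus the summed absolute deviations of the group scores). The function names, the empty-mask convention, and the toy data are assumptions made for this example; the registry implementation of es_dice may differ in detail.

```python
from statistics import mean


def dice(pred: set, truth: set) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) over sets of foreground voxel indices."""
    if not pred and not truth:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def equity_scaled(scores_by_group: dict) -> float:
    """Overall mean score divided by (1 + sum of |overall - group mean|)."""
    all_scores = [s for scores in scores_by_group.values() for s in scores]
    overall = mean(all_scores)
    delta = sum(abs(overall - mean(scores)) for scores in scores_by_group.values())
    return overall / (1 + delta)


# Toy voxel sets and toy per-case scores, not cohort data.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)}
truth = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 1, 1)}
case_dice = dice(pred, truth)  # 3 shared voxels: 2*3 / (4+4) = 0.75

per_case = {"F": [0.90, 0.80], "M": [0.92, 0.90]}
es = equity_scaled(per_case)  # 0.88 / (1 + 0.03 + 0.03)
print(case_dice, round(es, 4))
```

The equity-scaled value can only equal the overall mean when every group mean matches it, so any between-group spread pulls the score down.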

Step 1 — Prepare the environment

Verify that sei is compiled:
```
sei --help
```
All 16 subcommands should appear. If not, follow the installation guide.
Create a working directory and copy the scenario files:

This tutorial uses the example resources from the engine repository (contributor access). Once the code is published (open-core strategy), these resources will be available to everyone.
```
mkdir /tmp/medical && cd /tmp/medical
cp -r /path/to/seigarrena/crates/seigarrena-cli/tests/resources/medical/. .
```

Initialise git and DVC, and register the cohort metadata:

```
git init
dvc init
dvc add data/trusted_metadata.csv
git add sei.yaml dvc.yaml \
        data/trusted_metadata.csv.dvc data/cohort.croissant.json \
        eval/ .dvc/ .gitignore
```

Create the Python environment with uv:

```
uv venv .venv
uv pip install \
  "venturalitica[imaging]==0.6.12" \
  "dvc>=3" \
  "TotalSegmentator" \
  "monai" "nibabel" "pydicom" "pydicom-seg" \
  "SimpleITK" "pandas" "numpy" "pyyaml"
```

venturalitica[imaging] includes the native segmentation metrics from the registry (Dice, NSD, es_dice, score_gap), available from version 0.6.12.

Step 2 — Project files

```
medical/
├── sei.yaml                              # System manifest (includes the Art. 9 + MDR risk programme)
├── shared_data/policies/                 # destination of the OSCAL assessment_plan generated by sei compile
├── dvc.yaml                              # Pipeline: 1 stage `evaluate` (in-memory, no CSV)
├── data/
│   ├── trusted_metadata.csv             # Real acquisition metadata (TCIA via NBIA API)
│   └── cohort.croissant.json            # Data governance §2 (Croissant, EU AI Act Art. 10)
└── eval/
    ├── eval.py                           # DVC stage: segments in memory + audits with SDK
    ├── segment.py                        # segment_cohort(): TotalSegmentator/GPU
    └── dicom_utils.py                    # DICOM/SEG utilities
```

sei.yaml — the manifest

Below is an illustrative, abbreviated version of the manifest (the risks: block is summarised; the scenario file ships the full one). It is meant to convey the shape of a sei.yaml:

```
apiVersion: seigarrena.dev/v1alpha1
kind: AISystem
system:
  name: spine-segmentation
  intended_purpose: "Vertebral segmentation in CT on a validation cohort (eval-only)."
task:
  modality: image
  type: segmentation
eval: { script: eval/eval.py }
pipeline: { tool: dvc, metrics: metrics.json }
oscal: { assessment_plan: shared_data/policies/assessment_plan.oscal.yaml }
dataset: { croissant: data/cohort.croissant.json }
artifacts:
  model: { kind: totalsegmentator, seed: 0 }
risk:
  appetite: { individual: HIGH, society: HIGH, organization: HIGH }
  criteria: { scale: "5x5" }
  risks:
    # 4 risks with their measures (each carrying MDR frameworks): see table below
```

The appetite is set to HIGH for all three dimensions: the device operates in a high-risk environment (MDR) where the appetite is higher than in a financial system. The gate controls (global Dice, equity by sex/age, robustness by scanner) must exceed the clinical threshold for the gate to be GREEN; the audit controls (such as the worst subgroup) record their value but do not block it.

dvc.yaml — a single stage

The pipeline has a single evaluate stage that segments the cohort in memory (no intermediate CSV) and audits the results against the OSCAL assessment plan:

```
stages:
  evaluate:
    cmd: .venv/bin/python eval/eval.py
    deps:
      - eval/eval.py
      - eval/segment.py
      - eval/dicom_utils.py
      - data/trusted_metadata.csv
      - shared_data/policies/assessment_plan.oscal.yaml
    metrics:
      - metrics.json:
          cache: false
```

The handoff between segmentation and auditing is in memory: segment_cohort() returns a DataFrame with a Dice column per patient, which the SDK evaluates as registry metrics (mean_score, worst_cell_score, es_dice, etc.) with power-stats clustered by patient.
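To make the in-memory handoff more concrete, here is a minimal sketch of how per-patient Dice rows could be reduced to the aggregate metrics named above. The rows are hypothetical stand-ins for the DataFrame returned by segment_cohort(), and the definitions used (mean_score as the overall mean, worst_cell_score as the lowest group mean, score_gap as the spread between the best and worst group means) are assumptions for illustration, not the SDK's registry code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows standing in for the per-patient DataFrame (not cohort data).
rows = [
    {"PatientID": "p01", "sex": "F", "dice": 0.93},
    {"PatientID": "p02", "sex": "F", "dice": 0.89},
    {"PatientID": "p03", "sex": "M", "dice": 0.91},
    {"PatientID": "p04", "sex": "M", "dice": 0.71},
]


def group_means(rows, by):
    """Mean Dice per cell of the grouping attribute."""
    cells = defaultdict(list)
    for row in rows:
        cells[row[by]].append(row["dice"])
    return {cell: mean(values) for cell, values in cells.items()}


def summarize(rows, by):
    """Assumed definitions of the three aggregate metrics."""
    means = group_means(rows, by)
    return {
        "mean_score": mean(row["dice"] for row in rows),
        "worst_cell_score": min(means.values()),
        "score_gap": max(means.values()) - min(means.values()),
    }


summary = summarize(rows, by="sex")
print({name: round(value, 3) for name, value in summary.items()})
```

In this toy table the overall mean looks healthy while one cell sits well below it, which is the pattern the worst-subgroup control is meant to surface.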

The `risk:` section — 4 risks, dual regulation

The AssuranceProgram (the risk: section of sei.yaml) declares 4 risks with 14 measures. Each measure carries frameworks: [eu/mdr@2017#gspr-<n>] — the GSPR from MDR Annex I that corresponds to that control:

| Risk | Key measures | MDR GSPR |
| --- | --- | --- |
| `risk.subgroup-segmentation-bias` | `worst-subgroup-dice` (audit, critical, does not block), `human-oversight-worst-case-review` (HOTL, Art.14) | GSPR-9 (risk control) |
| `risk.segmentation-accuracy-failure` | `vertebra-segmentation-dice` (gate), `data-leakage-check` (gate) | GSPR-17.1 (software performance) |
| `risk.data-governance-failure` | `fairness-sex-disparate-impact` (`es_dice`, gate), `privacy-k-anonymity` (gate) | GSPR-17.1 (data governance) |
| `risk.scanner-protocol-bias` | `robustness-scanner-bias`, `fairness-scanner-model` (both gate) | GSPR-17.1 (multi-scanner performance) |

worst-subgroup-dice has enforcement: audit (severity: critical but non-blocking): it records the worst subgroup and surfaces the residual honestly, but does not bring down the gate. The gate coverage of prEN 18228 cl. 9.2 comes from vertebra-segmentation-dice and fairness-sex-disparate-impact (both enforcement: gate).

The two post-market controls (postmarket-subgroup-drift, postmarket-dice-drift) and the HOTL oversight (human-oversight-worst-case-review) have lifecycle: [monitoring]: sei compile excludes them from the 11 ex-ante controls of the assessment plan (14 measures − 3 monitoring = 11).
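The exclusion rule can be pictured as a simple filter over the declared measures. The sketch below uses only the ten measures this tutorial names (the scenario file declares 14), and it assumes that a measure without a lifecycle key is ex-ante; both the dictionary layout and that default are illustrative assumptions, not the engine's data model.

```python
MONITORING = "monitoring"

# Only the ten measures named in this tutorial; the scenario file declares 14.
# Assumption: an absent "lifecycle" key means the measure is evaluated ex-ante.
measures = [
    {"id": "worst-subgroup-dice"},
    {"id": "human-oversight-worst-case-review", "lifecycle": ["monitoring"]},
    {"id": "vertebra-segmentation-dice"},
    {"id": "data-leakage-check"},
    {"id": "fairness-sex-disparate-impact"},
    {"id": "privacy-k-anonymity"},
    {"id": "robustness-scanner-bias"},
    {"id": "fairness-scanner-model"},
    {"id": "postmarket-subgroup-drift", "lifecycle": ["monitoring"]},
    {"id": "postmarket-dice-drift", "lifecycle": ["monitoring"]},
]


def ex_ante(measures):
    """Keep the measures that do not list the monitoring lifecycle."""
    return [m["id"] for m in measures if MONITORING not in m.get("lifecycle", [])]


selected = ex_ante(measures)
print(len(measures), "-", len(measures) - len(selected), "=", len(selected))
# prints: 10 - 3 = 7
```

Applied to the full set of 14 measures, the same filter yields the 11 ex-ante controls of the assessment plan.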

See The three treatment modalities for the full regulatory framing of each risk.

Step 3 — `sei compile`: AssuranceProgram → OSCAL (11 ex-ante controls)

```
sei compile
```

sei compile reads the risk: section of sei.yaml, builds the AssuranceProgram in memory, and generates shared_data/policies/assessment_plan.oscal.yaml with the 11 ex-ante controls (excluding the three lifecycle: [monitoring] measures). This file is a dependency of the evaluate DVC stage; its digest enters dvc.lock and the signed bundle.

```
git add shared_data/policies/assessment_plan.oscal.yaml
git commit -m "init medical REAL (TCIA cohort + TotalSegmentator GPU; online, no cohort_results.csv)"
```

See the sei CLI reference for details on the compile subcommand.

Step 4 — `sei run`: ex-ante evaluation with a GREEN gate and an unmet AUDIT control

```
sei run
```

sei run triggers dvc repro, which executes the evaluate stage:

1. segment_cohort() passes all TCIA images through TotalSegmentator/GPU and returns a DataFrame with the real Dice per case
2. The SDK (vl.enforce(data=df, cluster=PatientID, phase=validation)) evaluates the 11 assessment-plan controls with power-stats clustered by patient
3. The engine applies the gate and signs the bundle
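The "clustered by patient" part of the second step can be sketched as a percentile bootstrap that resamples whole patients rather than individual scans, so that several scans from one patient are never treated as independent. The function name, the number of resamples, and the toy scores are assumptions for this example; the SDK's power-stats may use a different estimator.

```python
import random
from statistics import mean


def cluster_bootstrap_ci(scores_by_patient, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean score, resampling whole patients with replacement."""
    rng = random.Random(seed)
    patients = list(scores_by_patient)
    estimates = []
    for _ in range(n_boot):
        # Draw as many patients as the cohort has, keeping each patient's scans together.
        sample = rng.choices(patients, k=len(patients))
        pooled = [s for pid in sample for s in scores_by_patient[pid]]
        estimates.append(mean(pooled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy per-patient Dice scores (some patients have two scans); not cohort data.
scores_by_patient = {
    "p01": [0.93, 0.91],
    "p02": [0.88],
    "p03": [0.72, 0.70],
    "p04": [0.90],
    "p05": [0.86],
}
ci_low, ci_high = cluster_bootstrap_ci(scores_by_patient)
print(round(ci_low, 3), round(ci_high, 3))
```

With only a handful of clusters the interval is wide, which is the same effect the tutorial reports for an effective cohort of about 33 patients.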

Real result documented: sei run returns exit 0 — GREEN gate. The model is globally fit (global Dice ≈0.90). The results are:

| Control | Operator | Threshold | Result | Enforcement | Status |
| --- | --- | --- | --- | --- | --- |
| `vertebra-segmentation-dice` | `>` | 0.85 | ~0.90 (global) | gate | PASSES |
| `data-leakage-check` | `<` | 0.99 | (no Dice≈1.0) | gate | PASSES |
| `fairness-sex-disparate-impact` | `>` | 0.80 | fit (`es_dice`) | gate | PASSES |
| `worst-subgroup-dice` | `>` | 0.75 | below threshold (worst cell) | audit (non-blocking) | UNMET (recorded) |

The gate stays GREEN because all enforcement: gate controls pass (global Dice, equity by sex/age, robustness by scanner, k-anonymity). The worst-subgroup-dice control (MDR GSPR-9, severity: critical) fails on its point value, because the worst composite subgroup falls on atypical cases, but it has enforcement: audit: it records the residual and surfaces it honestly in Annex IV §4 without bringing the gate down. A weak worst subgroup in a small cohort with atypical cases is a limitation of the BYO model, not a defect that a blocking gate would correct, so it is treated with human oversight (HOTL) instead.
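The gate/audit distinction reduces to a small decision rule: only failed gate controls can turn the gate RED, while failed audit controls are listed as recorded residuals. The sketch below applies that rule to the four statuses from the table above; the class and function names are hypothetical and this is not the engine's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlResult:
    control_id: str
    enforcement: str  # "gate" blocks on failure, "audit" only records
    passed: bool


def decide_gate(results):
    """Return the gate colour plus the unmet audit controls that stay visible."""
    blocking = [r.control_id for r in results if r.enforcement == "gate" and not r.passed]
    recorded = [r.control_id for r in results if r.enforcement == "audit" and not r.passed]
    colour = "RED" if blocking else "GREEN"
    return colour, blocking, recorded


# Statuses as reported in the table above.
results = [
    ControlResult("vertebra-segmentation-dice", "gate", True),
    ControlResult("data-leakage-check", "gate", True),
    ControlResult("fairness-sex-disparate-impact", "gate", True),
    ControlResult("worst-subgroup-dice", "audit", False),
]

colour, blocking, recorded = decide_gate(results)
print(colour, blocking, recorded)
# prints: GREEN [] ['worst-subgroup-dice']
```

If worst-subgroup-dice were declared with enforcement: gate instead, the same inputs would yield RED, which is the modelling choice the tutorial argues against for a small cohort.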

Verify the signature and check the status:

```
sei verify      # OK: ECDSA-P256+DSSE+in-toto signature valid
sei status      # exit 0: GREEN gate (prints worst-subgroup-dice as an unmet AUDIT control)
```

sei status returns exit 0 (GREEN gate). It still prints worst-subgroup-dice, because an unmet AUDIT control must keep its residual visible, but that does not turn the gate red.

Step 5 — Treating the subgroup residual: human HOTL oversight (Art.14)

The worst-subgroup-dice control fails on its point value, but it is audit (does not block): the gate is already GREEN. The subgroup residual is not treated with a parameter or code change, since the model is BYO (TotalSegmentator) and is not retrained, but with an organisational control declared ex-ante: human-on-the-loop oversight (HOTL, Art.14).

Under risk.subgroup-segmentation-bias in the risk: section, the treatment is method: REDUCE with controls: [eu/ai-act@2024#art-15, eu/ai-act@2024#art-14], and the confirming control is:

```
- id: human-oversight-worst-case-review
  metric: hotl_worst_case_review_declared
  constraint: "> 0"
  severity: high
  enforcement: audit
  evaluation: manual
  lifecycle: [monitoring]
  article: "14"
  frameworks: [eu/mdr@2017#gspr-17.1]
  standard_clauses: ["eu/pren-18228@2026#14", "iso/42105@2026#intervention"]
```

A clinician validates the segmentation of the worst-subgroup cases before downstream use (surgical planning). The control is declared ex-ante in the manifest and operates post-market (Phase 2, lifecycle: [monitoring]): that is why sei compile excludes it from the 11 ex-ante controls, but it is recorded as a means of conformance for Art.14.
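One way to read the constraint: "> 0" field is as an operator and a threshold applied to the metric value, so a manual declaration recorded as 1 satisfies it and an absent declaration recorded as 0 does not. The parser below is a minimal sketch under that reading; the supported operator set, the regular expression, and the 1/0 encoding of the declaration are assumptions, not the engine's constraint grammar.

```python
import operator
import re

# Longer symbols come first so ">=" is not matched as ">" followed by "=".
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
_PATTERN = re.compile(r"^\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def check(constraint: str, value: float) -> bool:
    """Evaluate a constraint string such as '> 0' against a metric value."""
    match = _PATTERN.match(constraint)
    if match is None:
        raise ValueError(f"unsupported constraint: {constraint!r}")
    symbol, threshold = match.groups()
    return _OPS[symbol](value, float(threshold))


declared = check("> 0", 1)      # oversight declared (assumed encoding: 1)
not_declared = check("> 0", 0)  # nothing declared (assumed encoding: 0)
print(declared, not_declared)
# prints: True False
```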

Declaring that oversight is the ISO 23894 §6.5 treatment: a versioned change in sei.yaml, with authorship and timestamp, documenting the proportionate response to the subgroup residual.

```
git commit -m "treat(medical): declare HOTL oversight (Art.14) for worst-subgroup cases"
```

See The three treatment modalities for the regulatory framing of the human-oversight modality.

Step 6 — Scenario status: GREEN gate, residual treated

```
sei verify      # OK: signed bundle, GREEN gate
sei status      # exit 0: GREEN gate; worst-subgroup-dice listed as an unmet AUDIT control
```

The medical.json scenario ends here: the gate is GREEN, the model is globally fit (global Dice ≈0.90), and the subgroup residual is recorded (by the audit control worst-subgroup-dice) and treated (by the declared HOTL oversight). There is no FAILS→PASSES arc from a parameter adjustment: the gate was never red. The ISO 23894 §6.5 cycle for risk risk.subgroup-segmentation-bias is closed by the HOTL treatment, not by re-measurement.

Step 7 — `sei conformance`: MDR + EU AI Act projection

```
sei conformance --standard eu/pren-18228@2026 --out
```

The same bundle projects onto prEN 18228 (the harmonised risk management standard, Art. 9):

| Clause | Result | Why |
| --- | --- | --- |
| cl. 9.2 (control measures) | Covered | Provided by the gate controls `vertebra-segmentation-dice` and `fairness-sex-disparate-impact` (`enforcement: gate`, aligned with cl. 9.2); `worst-subgroup-dice` is audit (confirming, does not provide the gate coverage) |
| cl. 10 (global residual) | Covered / Gap | The engine aggregates the global residual (`evaluate_overall_residual()`) and `sei conformance` cl.10 emits COVERED or GAP; it is only advisory in the `sei run` gate |

```
sei conformance --standard iso/23894@2023
```

ISO 23894 projection: clause §6.5 is covered when the cycle of risk.subgroup-segmentation-bias is closed (here, by the declared HOTL treatment).

The frameworks: [eu/mdr@2017#gspr-9] field in the risk: section causes the bundle to include MDR traceability in the conformance reports without any additional annotation.
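The references used throughout this tutorial (for example eu/mdr@2017#gspr-9 or iso/23894@2023) appear to follow a namespace/standard@version#fragment shape. The sketch below splits such a string into its parts; the field names and the rule that the fragment is optional are inferred from the examples on this page and are assumptions, not a documented grammar.

```python
from typing import NamedTuple, Optional


class FrameworkRef(NamedTuple):
    namespace: str            # e.g. "eu" or "iso" (field names are hypothetical)
    standard: str             # e.g. "mdr", "ai-act", "pren-18228"
    version: str              # e.g. "2017"
    fragment: Optional[str]   # e.g. "gspr-9"; None when the reference has no '#'


def parse_ref(ref: str) -> FrameworkRef:
    """Split 'namespace/standard@version#fragment' into its components."""
    head, _, fragment = ref.partition("#")
    path, at, version = head.partition("@")
    namespace, slash, standard = path.partition("/")
    if not (at and slash and namespace and standard and version):
        raise ValueError(f"malformed framework reference: {ref!r}")
    return FrameworkRef(namespace, standard, version, fragment or None)


mdr = parse_ref("eu/mdr@2017#gspr-9")
iso = parse_ref("iso/23894@2023")
print(mdr)
print(iso)
```

Grouping controls by (namespace, standard, version) is one plausible way a report could collect every control traced to the same regulation.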

Annex IV (Art. 11) is not emitted by the engine: it is assembled and rendered by the cloud (control plane) from the signed bundle.json, with provenance per field, and delivered as a PDF. There is no sei annex-iv subcommand. The cloud can mark §2/§3/§4/§9 PENDING if their input is missing (data / misuse+residuals+Art.14 / control results / post-market measures); §7 and §8 are never pending. In this scenario, with data and controls present, only §3 and §9 would stay pending. See Guide: Render the Annex IV.
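The PENDING rule described above can be summarised as a lookup from each section to the input it depends on. The sketch below encodes that mapping with hypothetical input keys and reproduces the outcome stated for this scenario; it illustrates the rule only and is not the control plane's code.

```python
# Input each Annex IV section depends on, per the paragraph above.
# The key names are hypothetical labels chosen for this sketch.
SECTION_INPUTS = {
    "§2": "data",
    "§3": "misuse_residuals_art14",
    "§4": "control_results",
    "§9": "post_market_measures",
}
# §7 and §8 are never pending, so they are deliberately absent from the mapping.


def pending_sections(available_inputs):
    """List the sections whose required input is missing, in section order."""
    return [
        section
        for section, needed in SECTION_INPUTS.items()
        if needed not in available_inputs
    ]


# This scenario: data and control results are present.
scenario_pending = pending_sections({"data", "control_results"})
print(scenario_pending)
# prints: ['§3', '§9']
```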

Contrast with `tutoriales/primer-sistema` (loan)

| Dimension | `loan` (credit) | `medical` (medical imaging) |
| --- | --- | --- |
| Regulatory framework | EU AI Act + DORA (financial entity) | EU AI Act + MDR (SaMD) |
| Treatment modality | Code change (`train.py` V1→V2, fairlearn) | Human HOTL oversight (Art.14, declared organisational control) |
| Model | LogReg, trained in-pipeline | TotalSegmentator v2, BYO (no fine-tuning) |
| Primary metric | `demographic_parity_diff` (tabular) | Volumetric Dice (image) |
| Equity | `disparate_impact` (binary, by gender) | `es_dice` (ESSP, equity over continuous Dice) |
| Power-stats | Bootstrap per sample (n=1000) | Bootstrap per patient cluster (n≈33 effective) |
| Gate at T0 | RED: global control fails (fairness) | GREEN: all gate controls pass; the worst subgroup (audit) records the residual |
| Response to residual | Retrain (Drift B: model changes) | Declare HOTL oversight (organisational control, model and data unchanged) |

The engine is the same in both scenarios. The Reproducer seam dispatches dvc repro in both cases. What changes is the treatment modality and the regulatory framework — declared in the AssuranceProgram via frameworks and projected in the conformance reports without modifying the engine.

What this tutorial demonstrates

- An AssuranceProgram with dual regulation (EU AI Act + MDR) declared per measure via frameworks
- 11 ex-ante controls generated by sei compile from 4 risks and 14 measures (excluding 3 with lifecycle: [monitoring])
- A GREEN gate on a real GPU cohort: all gate controls pass (global Dice ≈0.90); the worst subgroup (audit) records the residual without blocking
- The treatment via human HOTL oversight (Art.14): a declared ex-ante organisational control, not a parameter or code change
- Honest UNDERPOWERED reporting: with an effective cohort of ≈33 the bootstrap CIs are wide; the worst subgroup is modelled as audit precisely because at small n it can be noise
- Dual conformance projected over the same bundle: prEN 18228 + ISO 23894

Pending gaps

| Gap | Why it matters |
| --- | --- |
| HOTL declared ex-ante, post-market operation without live data | The `human-oversight-worst-case-review` control documents the plan; the clinical-review evidence is produced at runtime (Phase 2) |
| NSD / HD95 as cross-check | The Dice size-bias caveat (arXiv 2509.19778) requires cross-checking with boundary metrics; future work |
| Post-market controls (`postmarket-subgroup-drift`) | Ex-ante only in v1; Art. 72 not modelled in the engine |
| Annex IV §3/§9 PENDING in this scenario | §2/§3/§4/§9 can be pending if their input is missing; here only §3 and §9 (depend on the deployer’s environment); §7 and §8 are never pending |

See State and incompleteness for the complete gap map.